LOTC Ethernet Bonding

1   Alarm Description

The alarm is raised when one or more Ethernet interfaces belonging to a bonded interface fail.

Table 1    LOTC Ethernet Bonding Alarm Causes

Alarm Cause: Failed Ethernet interface on bond0
  Description: The physical link state of one or both Ethernet interfaces is down.
  Fault Reason: Faulty physical Ethernet interface
    Fault Location: Physical Ethernet interface
  Fault Reason: Faulty external device (that is, Ethernet switch)
    Fault Location: External device (that is, Ethernet switch)
  Impact: If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then internal cluster services such as booting and logging are affected.

Alarm Cause: Failed Ethernet interface on bond1
  Description: The physical link state of one or both Ethernet interfaces is down.
  Fault Reason: Faulty physical Ethernet interface
    Fault Location: Physical Ethernet interface
  Fault Reason: Faulty external device (that is, Ethernet switch)
    Fault Location: External device (that is, Ethernet switch)
  Impact: If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then external network traffic is down.
 
Note:  
This alarm can appear as a result of a maintenance activity.
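
The link state described in Table 1 can usually be inspected from a Linux shell on the affected blade. The following is a minimal sketch, assuming the blade uses the standard Linux bonding driver, which exposes per-bond status in /proc/net/bonding/<bond>; this document does not confirm that LOTC provides this file. The function name bond_down_slaves is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the standard Linux bonding driver status format.
# Prints the slave interfaces of a bond whose MII status is not "up".
# $1: path to a bonding status file, for example /proc/net/bonding/bond0
bond_down_slaves() {
    awk -F': ' '
        # Remember the slave that the following status line belongs to.
        /^Slave Interface:/ { slave = $2 }
        # The bond-level "MII Status" line appears before any slave
        # section, so it is skipped while no slave has been seen.
        /^MII Status:/ && slave != "" {
            if ($2 != "up") print slave
            slave = ""
        }
    ' "$1"
}

# Check both bonds named in Table 1; bonds without a status file are skipped.
for bond in bond0 bond1; do
    status_file="/proc/net/bonding/$bond"
    [ -r "$status_file" ] || continue
    for slave in $(bond_down_slaves "$status_file"); do
        echo "$bond: $slave is down"
    done
done
```

One reported interface per bond corresponds to the loss of resilience in Table 1. Two reported interfaces on the same bond correspond to the service impact listed for that bond.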

2   Procedure

2.1   Handle Alarm LOTC Ethernet Bonding

Prerequisites

Steps

  1. Is the alarm present on multiple blades, that is, are there at least two LOTC Ethernet Bonding alarms with different values of alarm attribute Source?

    Yes: Proceed with Step 8.

    No: Continue with the next step.

  2. Is the alarm severity critical?

    Yes: Continue with the next step.

    No: Proceed with Step 11.

  3. Log on to the BSP to access a Linux shell, for example:

    ssh <user>@<hostname> -p 7022

    The hostname is part of alarm attribute Source.

  4. Collect the kernel messages:

    dmesg > $(hostname)-$(date +%d-%m-%y_%H-%M-%S).dmesg.log

    If there is no network connectivity to the blade, connect through a serial console and save the kernel messages by copy and paste.

    The following is an example output:

    [    0.000000] Initializing cgroup subsys cpuset
    [    0.000000] Initializing cgroup subsys cpu
    [    0.000000] Linux version 3.0.80-0.7.1.5895.4.PTF-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Mon Aug 26 13:53:40 UTC 2013 ()
    [    0.000000] Command line: panic=10 console=tty0 cluster=(type=control,disk_cache=0,clean_rootfs=0)
    [    0.000000] BIOS-provided physical RAM map:
    [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
    [    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
    [    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
    [    0.000000]  BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
    [    0.000000]  BIOS-e820: 000000003fff0000 - 0000000040000000 (ACPI data)
    [    0.000000]  BIOS-e820: 00000000fffc0000 - 0000000100000000 (reserved)
    [...]

  5. Using the ECLI on the BSP, reset the blade:

    >ManagedElement=<hw_ME_name>,Equipment=1,Shelf=<shelf>,Slot=<slot>,Blade=1,reset --resetType HARD --gracefulReset FALSE

    For example, when the issue is on shelf 2 and slot 3:

    >ManagedElement=BSP04ST,Equipment=1,Shelf=2,Slot=3,Blade=1,reset --resetType HARD --gracefulReset FALSE

  6. Wait 5 minutes.

  7. Is the alarm cleared?

    Yes: Proceed with Step 13.

    No: Proceed with Step 11.

  8. Do the alarms from all blades report the same bond, for example, bond0 or bond1?

    Yes: The fault is most likely in the BSP. Continue with the next step.

    No: Proceed with Step 11.

  9. Is the BSP switch alive and functioning properly?

    Yes: Proceed with Step 11.

    No: Troubleshoot the BSP switch, then continue with the next step.

  10. Once the switch problem is solved, is the alarm cleared?

    Yes: Proceed with Step 13.

    No: Continue with the next step.

  11. Perform data collection. Refer to Data Collection Guideline.

    Note:
    The status of the bonded interfaces must be collected.

  12. Consult next level of support. Further actions are outside the scope of this instruction.
  13. Job is completed.
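
The branching in the steps above can be summarized as a small lookup. This is an illustrative sketch of the decision flow only, not a tool shipped with the product; the function name next_step and its argument convention are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: returns the number of the step that follows a given step.
# $1: current step (1-12)
# $2: answer for a yes/no step ("yes" or "no"); ignored for action steps
next_step() {
    case "$1:$2" in
        1:yes)  echo 8 ;;    # alarm on multiple blades
        1:no)   echo 2 ;;
        2:yes)  echo 3 ;;    # severity is critical
        2:no)   echo 11 ;;
        7:yes|10:yes) echo 13 ;;   # alarm cleared
        7:no|10:no)   echo 11 ;;
        8:yes)  echo 9 ;;    # same bond reported from all blades
        8:no)   echo 11 ;;
        9:yes)  echo 11 ;;   # BSP switch is functioning properly
        9:no)   echo 10 ;;   # after troubleshooting the BSP switch
        # Plain action steps continue with the next step in sequence.
        3:*|4:*|5:*|6:*|11:*|12:*) echo $(( $1 + 1 )) ;;
        *) echo "unknown step or answer: $1 $2" >&2; return 1 ;;
    esac
}

# Example: a critical alarm on a single blade that clears after the reset.
next_step 1 no    # prints 2
next_step 2 yes   # prints 3
next_step 7 yes   # prints 13
```

Every "no" answer outside Step 1 and Step 9 leads to Step 11, so data collection is the common exit whenever the alarm cannot be cleared locally.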