1 Introduction
This instruction concerns alarm handling.
1.1 Alarm Description
The alarm is raised when one or more Ethernet interfaces belonging to a bonded interface fail.
The possible alarm causes and fault locations are explained in Table 1.
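| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Failed Ethernet interface on bond0 | The physical link state of one or both Ethernet interfaces is down | Faulty physical Ethernet interface | Physical Ethernet interface | If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then internal cluster services such as booting and logging are affected. |
| Failed Ethernet interface on bond0 | The physical link state of one or both Ethernet interfaces is down | Faulty external device (that is, Ethernet switch) | External device (that is, Ethernet switch) | If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then internal cluster services such as booting and logging are affected. |
| Failed Ethernet interface on bond1 | The physical link state of one or both Ethernet interfaces is down | Faulty physical Ethernet interface | Physical Ethernet interface | If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then external network traffic is down. |
| Failed Ethernet interface on bond1 | The physical link state of one or both Ethernet interfaces is down | Faulty external device (that is, Ethernet switch) | External device (that is, Ethernet switch) | If one Ethernet interface is down, there is a loss in resilience. If both Ethernet interfaces are down, then external network traffic is down. |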
Note: This alarm can appear as a result of a maintenance activity.
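The impact column of Table 1 can be read as a lookup from the bond name and the number of failed links. The following is a minimal sketch of that lookup; the function name `bond_impact` and the short impact strings are illustrative assumptions, not part of the product.

```shell
# bond_impact BOND LINKS_DOWN
# Prints the impact described in Table 1 for the given bond and number of
# failed Ethernet interfaces. Hypothetical helper for illustration only.
bond_impact() {
    case "$1:$2" in
        bond0:1|bond1:1) echo "loss of resilience" ;;
        bond0:2)         echo "internal cluster services affected" ;;
        bond1:2)         echo "external network traffic down" ;;
        *)               echo "not covered by Table 1" >&2; return 1 ;;
    esac
}
```

For example, `bond_impact bond1 2` prints `external network traffic down`.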
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value | |
|---|---|---|
| Major Type | 193 | |
| Minor Type | 3341942786 | |
| Source | One of the following: | |
| Specific Problem | LOTC Ethernet Bonding | |
| Event Type | environmentalAlarm (6) | |
| Probable Cause | x736UnspecifiedReason (418) | |
| Additional Text | Bonding failure on <bond> (links down on <slave> and <slave>) | One of the following: |
| Perceived Severity | critical (3): both of the bonded interfaces have failed | |
| | major (4): one of the bonded interfaces has failed | |
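When alarms are processed in a script, the bond and the failed interfaces can be extracted from the Additional Text attribute. The sketch below handles only the two-link pattern shown in Table 2; the function name `parse_bonding_text` and the interface names `eth0` and `eth1` are assumptions made for illustration, and any other text variant is rejected rather than guessed at.

```shell
# parse_bonding_text TEXT
# Prints "bond=<bond> slaves=<slave>,<slave>" when TEXT matches the pattern
#   Bonding failure on <bond> (links down on <slave> and <slave>)
# and returns 1 for any other text. Hypothetical helper for illustration only.
parse_bonding_text() {
    parsed=$(printf '%s\n' "$1" | sed -n \
        's/^Bonding failure on \([^ ][^ ]*\) (links down on \([^ ][^ ]*\) and \([^ ][^ ]*\))$/bond=\1 slaves=\2,\3/p')
    [ -n "$parsed" ] || return 1
    printf '%s\n' "$parsed"
}
```

For example, `parse_bonding_text 'Bonding failure on bond0 (links down on eth0 and eth1)'` prints `bond=bond0 slaves=eth0,eth1`.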
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
This instruction references the following documents:
1.2.2 Tools
No tools are required.
1.2.3 Conditions
Before starting this procedure, ensure that the following conditions are met:
- A LOTC Ethernet Bonding alarm is raised.
- It is known how to map the HostName (part of alarm attribute Source) to the physical address (slot) in the Blade Server Platform (BSP).
- An Ericsson Command-Line Interface (ECLI) session in Exec mode is in progress.
2 Procedure
Do the following:
1. Check the active alarm list.
   For information on how to check the active alarm list, refer to Check Alarm Status.
2. Is the alarm present on multiple blades, that is, are there at least two LOTC Ethernet Bonding alarms with different values of alarm attribute Source?
   Yes: Proceed with Step 9.
   No: Continue with the next step.
3. Is the alarm severity Critical?
   Yes: Continue with the next step.
   No: Proceed with Step 12.
4. Log on to the BSP to access a Linux shell:
   ssh <user>@<hostname> -p 22
   The hostname is part of alarm attribute Source.
5. Collect the kernel messages:
   dmesg > $(hostname)-$(date +%d-%m-%y_%H-%M-%S).dmesg.log
   If there is no network connectivity to a blade, gather the kernel information over a serial console by copy and paste.
   The following is an example output:
   [ 0.000000] Initializing cgroup subsys cpuset
   [ 0.000000] Initializing cgroup subsys cpu
   [ 0.000000] Linux version 3.0.80-0.7.1.5895.4.PTF-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Mon Aug 26 13:53:40 UTC 2013 ()
   [ 0.000000] Command line: panic=10 console=tty0 cluster=(type=control,disk_cache=0,clean_rootfs=0)
   [ 0.000000] BIOS-provided physical RAM map:
   [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
   [ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
   [ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
   [ 0.000000] BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
   [ 0.000000] BIOS-e820: 000000003fff0000 - 0000000040000000 (ACPI data)
   [ 0.000000] BIOS-e820: 00000000fffc0000 - 0000000100000000 (reserved)
   [...]
6. Using the ECLI on the BSP, reset the blade:
   >ManagedElement=<hw_ME_name>,Equipment=1,Shelf=<shelf>,Slot=<slot>,Blade=1,reset --resetType HARD --gracefulReset FALSE
   For example, when the issue is on shelf 2 and slot 3:
   >ManagedElement=BSP04ST,Equipment=1,Shelf=2,Slot=3,Blade=1,reset --resetType HARD --gracefulReset FALSE
7. Wait 5 minutes.
8. Is the alarm cleared?
   Yes: Proceed with Step 14.
   No: Proceed with Step 12.
9. Do the alarms from all blades report the same bond, for example, bond0 or bond1?
   Yes: With high probability, the fault is in the BSP. Continue with the next step.
   No: Proceed with Step 12.
10. Is the BSP switch alive and functioning properly?
    Yes: Proceed with Step 12.
    No: Troubleshoot the BSP switch.
11. Once the switch problem is solved, is the alarm cleared?
    Yes: Proceed with Step 14.
    No: Continue with the next step.
12. Perform data collection. Refer to Data Collection Guideline.
    Note: The status of the bonded interfaces must be collected.
13. Consult the next level of support. Further actions are outside the scope of this instruction.
14. The job is completed.
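The data collection step requires the status of the bonded interfaces. This instruction does not state how to read that status. The following is a minimal sketch, assuming the blade uses the standard Linux bonding driver, which exposes one status file per bond under `/proc/net/bonding/`. The function name `down_slaves` and the log file naming are illustrative assumptions.

```shell
# down_slaves: reads bonding status text (in the format of
# /proc/net/bonding/<bond>) on stdin and prints the name of every slave
# interface whose "MII Status" is not "up", one per line.
# Hypothetical helper for illustration only.
down_slaves() {
    awk '
        # Remember the slave that the following status lines belong to.
        /^Slave Interface:/ { slave = $3; next }
        # The bond-level "MII Status" line comes before any slave and is skipped.
        /^MII Status:/ && slave != "" {
            if ($3 != "up") print slave
            slave = ""
        }
    '
}

# Possible use on a blade, with the bond names from Table 1:
#   for b in bond0 bond1; do
#       cat "/proc/net/bonding/$b" > "$(hostname)-$b.status.log"
#       down_slaves < "/proc/net/bonding/$b"
#   done
```

Saving the complete status file in addition to the list of failed slaves keeps the link failure counters available for the next level of support.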
