1 Introduction
This instruction concerns alarm handling.
1.1 Alarm Description
The Compute Host Failed alarm is issued by the Managed Object (MO) Node when the periodic supervision algorithm detects that the compute host has failed the availability test three consecutive times, and, after that, remains unavailable for more than five minutes.
The severity of the alarm is MINOR or CLEARED.
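The trigger condition described above can be illustrated with a minimal sketch. It assumes that availability test results arrive one at a time with a timestamp in seconds. The class name `HostSupervisor` and its fields are hypothetical and are not part of the product; only the threshold of three consecutive failures and the five-minute period come from the alarm description.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 3          # consecutive failed availability tests (from the alarm description)
GRACE_PERIOD_SECONDS = 5 * 60  # further unavailability required before the alarm is issued


@dataclass
class HostSupervisor:
    """Tracks availability test results for one compute host (hypothetical helper)."""

    consecutive_failures: int = 0
    threshold_reached_at: Optional[float] = None  # time of the third consecutive failure
    alarm_active: bool = False

    def record(self, available: bool, now: float) -> Optional[str]:
        """Feed one test result; return "MINOR", "CLEARED", or None if nothing changes."""
        if available:
            # A passed test resets the failure tracking and ceases an active alarm.
            self.consecutive_failures = 0
            self.threshold_reached_at = None
            if self.alarm_active:
                self.alarm_active = False
                return "CLEARED"
            return None

        self.consecutive_failures += 1
        if self.consecutive_failures == FAILURE_THRESHOLD:
            self.threshold_reached_at = now

        # Issue the alarm only once the host has stayed unavailable for more
        # than five minutes after the third consecutive failure.
        if (
            not self.alarm_active
            and self.threshold_reached_at is not None
            and now - self.threshold_reached_at > GRACE_PERIOD_SECONDS
        ):
            self.alarm_active = True
            return "MINOR"
        return None
```

In this sketch the alarm state is evaluated only when a test result is recorded. The actual supervision interval and timer handling are not stated in this instruction and may differ.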
The possible alarm causes and fault locations are explained in Table 1.
| Alarm Cause | Description | Fault | Fault Location | Impact |
|---|---|---|---|---|
| Compute node is down | The compute node fails the availability test three consecutive times, and, after that, remains unavailable for more than five minutes. | Compute node malfunction | Compute node | The compute node becomes permanently unavailable. |
The following is the consequence for the node if the alarm is not resolved:
- The compute node remains unavailable.
The alarm attributes are listed in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Major Type | 193 |
| Minor Type | 2031678 |
| Managed Object Class | Node |
| Managed Object Instance | Region=<name_of_the_region>, |
| Specific Problem | Compute Host Failed |
| Event Type | other (1) |
| Probable Cause | m3100ReplaceableUnitProblem(69) |
| Additional Text | ;uuid=<hw_uuid_of_failed_server> |
| Severity | MINOR (5) or CLEARED |
Note: The alarm does not specify which Virtual Machines (VMs) are affected. A separate VM Unavailable alarm is issued for each affected VM.
For more information about the VM Unavailable alarm, refer to VM Unavailable.
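To identify the failed server from a received alarm, the hardware UUID can be read from the Additional Text attribute. The following is a minimal sketch, assuming that Additional Text is a semicolon-separated list of `key=value` fields, as the `;uuid=<hw_uuid_of_failed_server>` form suggests. The function name `parse_additional_text` is hypothetical, and the UUID shown is a placeholder.

```python
def parse_additional_text(text: str) -> dict:
    """Split a semicolon-separated string of key=value fields into a dict.

    Assumption: fields are separated by ';' and each field has the form
    key=value. Segments without '=' are ignored.
    """
    fields = {}
    for part in text.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty segments, such as the one before a leading ';'
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


# Placeholder UUID, for illustration only.
example = ";uuid=00000000-0000-0000-0000-000000000000"
print(parse_additional_text(example)["uuid"])
```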
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Not applicable.
1.2.2 Tools
No tools are required.
1.2.3 Conditions
Before starting this procedure, ensure that the following condition is met:
- Information about how to connect to and use the out-of-band management is available.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Actions
Perform the following:
1. Use the unqualified part of the hostname to find the number of the corresponding shelf and server.
   The unqualified hostname is displayed in the following format:
   compute-<shelf_id>-<blade_id>
   The first value is the number of the shelf, and the second value is the number of the server.
2. Restart the server in question by using the corresponding out-of-band management.
3. The following scenarios are possible:
   - The reboot of the server solved the problem, and the alarm ceases.
     If the alarm ceases, exit this procedure.
   - The reboot of the server did not solve the problem.
     In this case, proceed to Step 4.
4. Replace the server as described in Server Replacement.
5. The following scenarios are possible:
   - The replacement process was successful, the server came online, and the alarm ceases.
     If the alarm ceases, exit this procedure.
   - The replacement process did not solve the problem.
     In this case, proceed to Step 6.
6. Collect troubleshooting data as described in the Data Collection Guideline.
7. Contact the next level of maintenance support.
   Further actions are outside the scope of this instruction.
8. The job is completed.
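The hostname lookup in the first step of the procedure can be sketched as follows. This is an illustration only: the function name `locate_server` and the type `ServerLocation` are hypothetical, the example domain is a placeholder, and the sketch assumes that both identifiers in the `compute-<shelf_id>-<blade_id>` format are decimal numbers.

```python
import re
from typing import NamedTuple


class ServerLocation(NamedTuple):
    shelf_id: int
    blade_id: int


# Pattern for the unqualified hostname format given in the procedure.
_HOSTNAME_RE = re.compile(r"^compute-(\d+)-(\d+)$")


def locate_server(hostname: str) -> ServerLocation:
    """Return the shelf and server numbers encoded in a compute hostname."""
    # Keep only the unqualified part, that is, everything before the first dot.
    unqualified = hostname.split(".", 1)[0]
    match = _HOSTNAME_RE.match(unqualified)
    if match is None:
        raise ValueError(f"unexpected compute hostname format: {hostname!r}")
    return ServerLocation(int(match.group(1)), int(match.group(2)))


# Hypothetical fully qualified hostname, for illustration only.
location = locate_server("compute-0-3.example.net")
print(location.shelf_id, location.blade_id)
```

Rejecting hostnames that do not match the pattern, instead of guessing, avoids restarting the wrong server through out-of-band management.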
3 Additional Information
The alarm ceases for a compute host when the compute host passes the availability test.
CM-HA performs fencing through out-of-band management. Therefore, when fencing is turned on, it is a valid scenario that the compute host is switched off before evacuation and powered on only at a later time.
