Compute Host Failed
Cloud Execution Environment

Contents

1   Introduction
1.1   Alarm Description
1.2   Prerequisites

2   Procedure
2.1   Actions

3   Additional Information

1   Introduction

This instruction describes how to handle the Compute Host Failed alarm.

1.1   Alarm Description

The Compute Host Failed alarm is issued by the Managed Object (MO) Node when the periodic supervision algorithm detects that the compute host has failed the availability test three consecutive times and has then remained unavailable for more than five minutes.
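
The raise condition described above can be illustrated with a minimal sketch. It is not the actual supervision algorithm: the class name, the method name, and the choice to start the five-minute period at the third consecutive failed test are assumptions made for illustration. The interval between availability tests is not stated in this instruction, so the caller supplies the timestamps.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 3               # consecutive failed availability tests (from the text)
UNAVAILABLE_GRACE_SECONDS = 5 * 60  # "more than five minutes" (from the text)


@dataclass
class HostSupervisor:
    """Hypothetical per-host tracker; not the real supervision implementation."""

    consecutive_failures: int = 0
    threshold_reached_at: Optional[float] = None
    alarm_active: bool = False

    def record_test(self, passed: bool, now: float) -> bool:
        """Record one availability test taken at time `now` (in seconds).

        Returns True while the alarm is considered active.
        """
        if passed:
            # A passed test resets the counters and ceases an active alarm.
            self.consecutive_failures = 0
            self.threshold_reached_at = None
            self.alarm_active = False
            return self.alarm_active

        self.consecutive_failures += 1
        if self.consecutive_failures == FAILURE_THRESHOLD:
            # Assumption: the five-minute period starts at the third failure.
            self.threshold_reached_at = now

        if (
            self.threshold_reached_at is not None
            and now - self.threshold_reached_at > UNAVAILABLE_GRACE_SECONDS
        ):
            self.alarm_active = True
        return self.alarm_active
```

In this sketch the alarm is raised only once the host has stayed unavailable for strictly more than 300 seconds after the third failed test, and a single passed test ceases it again.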

The severity of the alarm is MINOR or CLEARED.

The possible alarm causes and fault locations are explained in Table 1.

Table 1    Alarm Causes

Alarm Cause:      Compute node is down
Description:      The compute node fails the availability test three consecutive times, and, after that, remains unavailable for more than five minutes.
Fault Reason:     Compute node malfunction
Fault Location:   Compute node
Impact:           The compute node becomes permanently unavailable.

If the alarm is not resolved, the consequence for the node is the impact stated in Table 1.

The alarm attributes are listed in Table 2.

Table 2    Alarm Attributes

Attribute Name            Attribute Value

Major Type                193
Minor Type                2031678
Managed Object Class      Node
Managed Object Instance   Region=<name_of_the_region>,
                          CeeFunction=1,
                          Node=<hostname_of_the_node>
Specific Problem          Compute Host Failed
Event Type                other (1)
Probable Cause            m3100ReplaceableUnitProblem(69)
Additional Text           ;uuid=<hw_uuid_of_failed_server>
Severity                  MINOR (5) or CLEARED
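
The Managed Object Instance and Additional Text attributes in Table 2 are key/value strings, so the region, the hostname, and the hardware UUID can be extracted programmatically. The following is a minimal sketch only. The function names are hypothetical, and it assumes that the values themselves contain no commas or semicolons.

```python
def parse_managed_object_instance(moi: str) -> dict:
    """Split 'Region=...,CeeFunction=1,Node=...' into a dict of its components."""
    parts = {}
    for item in moi.split(","):
        item = item.strip()  # also removes line breaks between components
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed component: {item!r}")
        parts[key.strip()] = value.strip()
    return parts


def parse_additional_text(text: str) -> dict:
    """Split ';uuid=<...>' style text into key/value pairs."""
    fields = {}
    for item in text.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


# Illustrative values only; the region name, hostname, and UUID are made up.
moi = parse_managed_object_instance(
    "Region=RegionOne,CeeFunction=1,Node=compute-0-3.example.com"
)
extra = parse_additional_text(";uuid=00000000-0000-0000-0000-000000000000")
print(moi["Node"], extra["uuid"])
```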

Note: The alarm does not specify which Virtual Machines (VMs) are affected. A separate VM Unavailable alarm is issued for each affected VM.

For more information about the VM Unavailable alarm, refer to VM Unavailable.


1.2   Prerequisites

This section provides information on the documents, tools, and conditions that apply to the procedure.

1.2.1   Documents

Not applicable.

1.2.2   Tools

No tools are required.

1.2.3   Conditions

Before starting this procedure, ensure that the following condition is met:

2   Procedure

This section describes the procedure to follow when this alarm is received.

2.1   Actions

Perform the following:

  1. Use the unqualified part of the hostname to identify the shelf number and the server number of the failed compute host.

    The unqualified hostname is displayed in the following format:

    compute-<shelf_id>-<blade_id>

    The first value (<shelf_id>) is the number of the shelf, and the second value (<blade_id>) is the number of the server.

  2. Restart the server in question by using the corresponding out-of-band management.
  3. The following scenarios are possible:
    • The reboot of the server solved the problem, and the alarm ceases.

      If the alarm ceases, exit this procedure.

    • The reboot of the server did not solve the problem.

      In this case, proceed to Step 4.

  4. Replace the server as described in Server Replacement.
  5. The following scenarios are possible:
    • The replacement process was successful, the server is online, and the alarm ceases.

      If the alarm ceases, exit this procedure.

    • The replacement process did not solve the problem.

      In this case, proceed to Step 6.

  6. Collect troubleshooting data as described in the Data Collection Guideline.
  7. Contact the next level of maintenance support.

    Further actions are outside the scope of this instruction.

  8. The job is completed.
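
Step 1 of the procedure can be sketched as a small helper that derives the shelf and server numbers from a hostname. This is an illustration, not a supplied tool. The function name is hypothetical, and the sketch assumes that <shelf_id> and <blade_id> are decimal integers and that the unqualified hostname is the part before the first dot.

```python
import re
from typing import Tuple

# Matches the documented format compute-<shelf_id>-<blade_id>.
# Assumption: both identifiers are decimal integers.
_HOSTNAME_RE = re.compile(r"^compute-(\d+)-(\d+)$")


def locate_server(hostname: str) -> Tuple[int, int]:
    """Return (shelf number, server number) for a compute hostname."""
    # Keep only the unqualified part, i.e. drop any domain suffix.
    unqualified = hostname.split(".", 1)[0]
    match = _HOSTNAME_RE.match(unqualified)
    if match is None:
        raise ValueError(f"not a compute hostname: {hostname!r}")
    return int(match.group(1)), int(match.group(2))


# The domain suffix here is a made-up example.
shelf, server = locate_server("compute-0-3.example.com")
print(f"shelf {shelf}, server {server}")
```

Raising an error for hostnames that do not match the format avoids restarting the wrong server when the alarm data is malformed.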

3   Additional Information

The alarm ceases for a compute host when that compute host passes the availability test.

Fencing is performed by CM-HA through out-of-band management. If fencing is turned on, it is therefore a valid scenario that the compute host is switched off before evacuation and is powered on only at a later time.