1 Introduction
This instruction describes how to handle the Storage Engine, DS Cluster Down alarm.
1.1 Alarm Description
This alarm is raised when a problem in the cluster prevents it from providing service.
The alarm is issued in the following situations:
- The local cluster is under maintenance operation.
- All management components of the local cluster are unreachable.
- All data nodes are unreachable.
The alarm itself does not state which cause triggered it.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.

| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The local cluster is under maintenance operation. | The local cluster is under maintenance operation. | By explicit order, the cluster is under maintenance (data restoring, initializing, stopped, or restarting) and therefore cannot provide service. | Cluster Supervisors on the System Controllers (SCs). | The cluster cannot provide service until the operation completes. |
| All management components of the local cluster are unreachable. | All management components of the local cluster are unreachable. | All management components of the local cluster are unable to start, or they have started but none of them can be accessed. | Management components on the SCs. | The cluster cannot provide service. |
| All data nodes are unreachable. | All data nodes are unreachable. | The data nodes either cannot start, or they have started but do not provide service. The fault can have several causes, for example file system consistency errors due to non-graceful shutdown, uncontrolled crash or infrastructure errors. | Data nodes on the payload blades or Virtual Machines (VMs) of the cluster. | The cluster cannot provide service, and data redundancy is decreased. |

Note: An alarm can appear as a result of a maintenance activity.
The alarm attributes are listed and explained in Table 2.

| Attribute Name | Attribute Value |
|---|---|
| Module | STORAGE-ENGINE |
| Error Code | 6 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raise. |
| Resource ID | 1.3.6.1.4.1.193.169.1.2.6.1 |
| Timestamp | Date when the alarm was raised. |
| Model Description | Cluster down, Storage Engine. |
| Active Description | Storage Engine (DS-group #<DG>): Storage Engine is down. |
| Event Type | 4 |
| Probable Cause | 546 |
| Perceived Severity | Critical |
| Originating source IP | Node IP where the alarm was raised. |
| Sequence Number | Number which indicates the order in which the alarms are raised. |

In Table 2, the indicated variables are as follows:
- <DG> is the Data Store Unit Group (DSG) this cluster belongs to.
For further information about attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [1].
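The Active Description in Table 2 embeds the DS-group number, so a monitoring script could recover it from the alarm text. The following is a minimal Python sketch under stated assumptions: the description is available as a plain string in exactly the form shown in Table 2, and <DG> is a decimal number. The function name `extract_ds_group` is hypothetical and is not part of CUDB.

```python
import re
from typing import Optional

# Pattern derived from the Active Description text in Table 2.
# Assumption: <DG> is rendered as one or more decimal digits.
ACTIVE_DESCRIPTION = re.compile(
    r"Storage Engine \(DS-group #(?P<dg>\d+)\): Storage Engine is down\."
)


def extract_ds_group(active_description: str) -> Optional[int]:
    """Return the DS-group number from the alarm text, or None if it does not match."""
    match = ACTIVE_DESCRIPTION.search(active_description)
    return int(match.group("dg")) if match else None
```

For example, `extract_ds_group("Storage Engine (DS-group #3): Storage Engine is down.")` returns `3`, while any text that does not follow the Table 2 form yields `None`.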
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [1], regarding alarm configuration.
- System Safety Information, Reference [4]
- Personal Health and Safety Information, Reference [5]
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Actions for the Local Cluster Is Under Maintenance Operation
If this state is not intentional, contact the next level of maintenance support.
2.2 Actions for All Management Components of the Local Cluster Are Unreachable
Contact the next level of maintenance support.
2.3 Actions for All Data Nodes Are Unreachable
Do the following:
- Restore a previously created backup. For further information about the data backup and restore procedure, refer to CUDB Backup and Restore Procedures, Reference [2].
- If the alarm is not cleared automatically after the restore is completed, contact the next level of maintenance support.
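The actions in Sections 2.1 to 2.3 can be summarized as a small decision table. The sketch below is illustrative only and is not a CUDB interface: the names `AlarmCause` and `recommended_actions` are hypothetical. Because the alarm does not state its cause, the sketch assumes the operator has already determined the cause and passes it in.

```python
from enum import Enum
from typing import List

ESCALATE = "Contact the next level of maintenance support."


class AlarmCause(Enum):
    # The three alarm causes listed in Table 1.
    MAINTENANCE = "The local cluster is under maintenance operation."
    MANAGEMENT_UNREACHABLE = "All management components of the local cluster are unreachable."
    DATA_NODES_UNREACHABLE = "All data nodes are unreachable."


def recommended_actions(cause: AlarmCause, maintenance_intended: bool = False) -> List[str]:
    """Map an operator-determined alarm cause to the actions in Section 2."""
    if cause is AlarmCause.MAINTENANCE:
        # Section 2.1: escalate only if the maintenance state is not intentional.
        # The procedure lists no action for intended maintenance, hence the empty list.
        return [] if maintenance_intended else [ESCALATE]
    if cause is AlarmCause.MANAGEMENT_UNREACHABLE:
        # Section 2.2
        return [ESCALATE]
    # Section 2.3
    return [
        "Restore a previously created backup.",
        "If the alarm is not cleared automatically after the restore is completed, "
        "contact the next level of maintenance support.",
    ]
```

Returning the actions as an ordered list keeps the two steps of Section 2.3 in sequence, since the escalation applies only after the restore has completed.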
Glossary
For the terms, definitions, acronyms and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [3].
Reference List
| CUDB Documents |
|---|
| [1] CUDB Node Fault Management Configuration Guide. |
| [2] CUDB Backup and Restore Procedures. |
| [3] CUDB Glossary of Terms and Acronyms. |

| Other Ericsson Documents |
|---|
| [4] System Safety Information. |
| [5] Personal Health and Safety Information. |
