1 Introduction
This document describes the Storage Engine, Automatic Handling of Network Isolation not Completed for PLDB alarm and the troubleshooting steps to take when it is raised.
1.1 Alarm Description
This alarm is raised when the Automatic Handling of Network Isolation process has failed to repair the Processing Layer Database (PLDB) cluster inconsistency between former and current master replica servers.
The alarm is issued in the following situations:
- Selective Replica Check task was not completed.
- Data Repair task was not completed.
- Triggering Reconciliation task was not completed.
The possible alarm causes and the corresponding fault reasons, fault locations and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Selective Replica Check task was not completed. | Automatic Handling of Network Isolation process was unsuccessful in repairing PLDB cluster inconsistency between former and current master replica servers. | Rescuing non-replicated data from former master has failed. | Slave (Selective Replica Check) replica server. | - |
| Data Repair task was not completed. | Automatic Handling of Network Isolation process was unsuccessful in repairing PLDB cluster inconsistency between former and current master replica servers. | Rescuing non-replicated data from former master has failed. | Master (Data Repair) replica server. | - |
| Triggering Reconciliation task was not completed. | Automatic Handling of Network Isolation process was unsuccessful in adding local DS units that were elected master for their DSG to the Reconciliation Pending Task List. | - | Master replica server. | No data reconciliation process. |
Note: An alarm can appear as a result of a maintenance activity.
The following are the consequences for the node if the alarm is not solved:
- Non-replicated data transactions residing on former master (due to mastership change) are lost.
- Possible unresolved inconsistencies between PLDB and DSG data.
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | No |
| Module | STORAGE-ENGINE |
| Error Code | 29 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raise. |
| Resource ID | .1.3.6.1.4.1.193.169.1.1.29.<Timestamp> |
| Alarm Model Description | Automatic Handling of Network Isolation not Completed, Storage Engine. |
| Alarm Active Description | Storage Engine (PLDB): Automatic Handling of Network Isolation task <add_info> was not completed <add_info2> (task <taskid>, blade <Blade>), uuid: <uuid> |
| ITU Alarm Event Type | processingErrorAlarm (4) |
| ITU Alarm Probable Cause | softwareError (163) |
| ITU Alarm Perceived Severity | major (4) |
| Originating Source IP | Node IP where the alarm was raised. |
| Sequence Number | Number which indicates the order in which the alarms are raised. |
In Table 2, the indicated variables are as follows:
- <Timestamp> is the Unix time representing the time of the incident, that is, the time stamp used to determine which events from the operational logs of the former master were analyzed.
- <add_info> is the description of the affected task: "Selective Replica Check", "Data Repair", or "trigger Reconciliation".
- <add_info2> is an optional additional description field that appears when the Automatic Handling of Network Isolation process has terminated a Selective Replica Check or a Data Repair task because its execution took too long. Its value is: "due to time limit exceeded".
- <taskid> is a Selective Replica Check or a Data Repair task identifier based on the Automatic Handling of Network Isolation activity start time.
- <Blade> is the host name of the blade or Virtual Machine (VM) where repair was executed and the logs are stored.
- <uuid> is the universally unique identifier of the computing resource (blade or VM). It is blank if it is not possible to figure out its value.
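When correlating the alarm with the operational logs, the <Timestamp> part of the Resource ID can be converted from Unix time to a human-readable UTC date. The following is a minimal sketch, assuming GNU coreutils `date` (the `-d "@<seconds>"` form); the timestamp value shown is purely illustrative:

```shell
# Convert an illustrative Unix-time <Timestamp> value to a readable UTC date.
# Assumes GNU coreutils 'date'; on BSD systems, 'date -u -r <seconds>' is the
# equivalent form.
ts=1700000000
date -u -d "@${ts}" +"%Y-%m-%d %H:%M:%S UTC"
# prints "2023-11-14 22:13:20 UTC"
```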
For more information about the attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [1].
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- System Safety Information, Reference [7].
- Personal Health and Safety Information, Reference [8].
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Actions for Selective Replica Check Task Was Not Completed
Do the following:
- If the Storage Engine, Unable to Synchronize Cluster in PLDB, Major alarm is also raised because the Self-Ordered Backup and Restore function is not enabled, or because it fails to restore the replication automatically, follow the procedure described in Storage Engine, Unable To Synchronize Cluster In PLDB, Major, Reference [2].
- Cease the alarm manually.

Note: The procedure fixes data inconsistency between the master and slave replicas, but it cannot guarantee a full repair of the system data.
2.2 Actions for Data Repair Task Was Not Completed
Do the following:
- Cease the alarm manually.

Note: A full repair of the system data cannot be guaranteed in this case.
2.3 Actions for Triggering Reconciliation Task Was Not Completed
Do the following:

1. Run the following command to establish an admin CUDB CLI session towards the CUDB node where the master for PLDB is located:

   ssh <admin_user>@<CUDB_Node_OAM_IP_Address>

   Refer to CUDB System Administrator Guide, Reference [3] on how to list all master DSG replicas.

2. Run the following command to check whether there is any pending or ongoing reconciliation task for the specific DSG(s) from Step 1:

   cudbReconciliationMgr -l

   In the affirmative case, the command returns the DSG(s); exit this procedure. Otherwise, it returns nothing; continue with the next step. Refer to CUDB Node Commands and Parameters, Reference [4] for further information about this command.

3. Schedule reconciliation for the specific DSG(s) from Step 1:

   cudbReconciliationMgr -a <dsId>

   If the task for the DSG identified by <dsId> is added, the command has no output. Otherwise, the output provides the error(s) fetched from the database. Refer to CUDB Data Storage Handling, Reference [5] for further information about the reconciliation process.
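The check-then-schedule logic of Steps 2 and 3 can be sketched as a small POSIX shell helper. This is a hedged illustration, not part of the product: it assumes cudbReconciliationMgr behaves exactly as described above (-l lists DSGs with pending or ongoing tasks, -a <dsId> schedules a task and prints nothing on success), and the function name schedule_reconciliation is hypothetical.

```shell
# Hypothetical helper: schedule reconciliation for one DSG unless a task for it
# is already pending or ongoing. Assumes the cudbReconciliationMgr behavior
# described in Steps 2 and 3 above.
schedule_reconciliation() {
    dsId="$1"
    # Step 2: is a reconciliation task already pending/ongoing for this DSG?
    if cudbReconciliationMgr -l | grep -q -w "$dsId"; then
        echo "reconciliation already pending for DSG $dsId"
        return 0
    fi
    # Step 3: schedule the task; empty output means it was added.
    err=$(cudbReconciliationMgr -a "$dsId")
    if [ -z "$err" ]; then
        echo "reconciliation scheduled for DSG $dsId"
    else
        echo "failed to schedule reconciliation: $err" >&2
        return 1
    fi
}
```

The helper only wraps the two documented commands; any error text printed by cudbReconciliationMgr -a is passed through unchanged so it can be analyzed as in Step 3.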
Glossary
For the terms, definitions, acronyms, and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [6].
Reference List
| Other Ericsson Documents |
|---|
| [7] System Safety Information. |
| [8] Personal Health and Safety Information. |
