1 Introduction
This instruction describes the alarm handling for the Storage Engine, Unable to Synchronize Cluster in PLDB, Major alarm.
1.1 Alarm Description
This alarm is issued when a slave replica is not able to sync data with the Processing Layer Database (PLDB) group master replica.
The alarm is issued in the following situations:
- Replication information in the master replica has been removed or purged.
- There is a mismatch between the local and the remote replication information.
- Slave server has no replication information about the master server.
- Automatic Handling of Network Isolation was not executed.
- Self-Ordered Backup and Restore failed.
If the CUDB system enters a state in which no master replica can be reached from the current node for the PLDB, then this alarm is cleared automatically, and the Storage Engine, No Available Master Replica for PLDB, Reference [2] alarm is raised.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Replication information in the master replica has been removed or purged. | Operational log index or operational log files in the master replica are missing, or the operational log index is inconsistent. | Operational log index or operational log files in the master replica have been removed or purged, or the operational log index table was found inconsistent after starting or restarting the master server. The inconsistency could have been caused by a non-graceful stop of the running process or by a file system error. | Master replica server. | Loss of geographical redundancy. |
| There is a mismatch between the local and the remote replication information. | There is a mismatch between the local operational log time stamp and the remote one. | Mastership change during a non-replicated transaction (such as traffic updates or provisioning), accompanied by a reconciliation process on the new master replica server. The replication process cannot start once the former master rejoins (not necessarily working) because the new master server does not have sufficient time stamp information. | Both replica servers. | Loss of geographical redundancy. |
| Slave server has no replication information about the master server. | The slave replica server has no replication information about the master replica server. | The daemon serving the remote master replica server is missing or has been killed (both instances). | Slave replica server. | Loss of geographical redundancy. |
| Automatic Handling of Network Isolation was not executed. | It was not possible to execute the Selective Replica Check task, or the Selective Replica Check failed to retrieve all applicable entries. | Rescuing non-replicated data from the former master has failed. | Both replica servers. | Loss of geographical redundancy. |
| Self-Ordered Backup and Restore failed. | It was not possible to restore the replication automatically. | The automatic backup and restore task failed during the backup creation, the backup transfer, or the slave replica restore. | Both replica servers. | Loss of geographical redundancy. |
- Note: An alarm can appear as a result of maintenance activity.
The consequence for the node if the alarm is not resolved is partial or complete loss of geographical redundancy from the time the alarm was raised. Note that complete loss of redundancy occurs for Double Geographical Redundancy or Triple Geographical Redundancy if the alarm is raised on both slave replica servers.
The alarm attributes are listed and explained in Table 2:
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | Yes |
| Module | STORAGE-ENGINE |
| Error Code | 1 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number indicating how many times the alarm has been raised. |
| Timestamp Last | Date and time of the most recent alarm occurrence. |
| Resource ID | .1.3.6.1.4.1.193.169.1.1.1 |
| Alarm Model Description | Unable to synchronize cluster, Storage Engine. |
| Alarm Active Description | Storage Engine (PLDB): Synchronization to current master impossible. <add_info> (task <taskid>, time <Timestamp>). |
| ITU Alarm Event Type | qualityOfServiceAlarm (3) |
| ITU Alarm Probable Cause | equipmentMalfunction (514) |
| ITU Alarm Perceived Severity | (4) – Major |
| Originating Source IP | Node ID where the alarm was raised. |
| Sequence Number | Number indicating the order in which alarms were raised. |
In Table 2, the indicated variables are as follows:
- <add_info> is an optional additional description field shown when the Automatic Handling of Network Isolation or the Self-Ordered Backup and Restore process has failed. Its value is either an empty string, "It was not possible to execute Selective Replica Check", or "Self-Ordered Backup and Restore failed".
- <taskid> is a Selective Replica Check task identifier based on the start time of the Automatic Handling of Network Isolation activity.
- <Timestamp> is the Unix time of the incident, that is, the time stamp used to determine which events from the operational logs of the former master were analyzed.
- Note: <taskid> and <Timestamp> are not shown in the case of Self-Ordered Backup and Restore.
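As an illustration only (this is not part of any CUDB tooling), an operator-side script could extract the <add_info>, <taskid>, and Unix <Timestamp> fields from an alarm text that follows the Alarm Active Description format shown in Table 2, and convert the timestamp to a readable UTC date. The alarm string below is a hypothetical example constructed from the documented format:

```python
import re
from datetime import datetime, timezone

# Hypothetical alarm text following the Alarm Active Description format
# from Table 2; real alarm texts are produced by the node itself.
alarm_text = ("Storage Engine (PLDB): Synchronization to current master "
              "impossible. It was not possible to execute Selective Replica "
              "Check (task 20240101T120000, time 1704110400).")

# Extract the optional <add_info> field, the <taskid>, and the Unix <Timestamp>.
match = re.search(r"impossible\.\s*(?P<add_info>.*?)\s*"
                  r"\(task (?P<taskid>\S+), time (?P<ts>\d+)\)", alarm_text)

if match:
    # Convert the Unix time of the incident to a UTC datetime.
    incident = datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc)
    print("add_info :", match.group("add_info"))
    print("taskid   :", match.group("taskid"))
    print("incident :", incident.isoformat())
```

Note that for Self-Ordered Backup and Restore the task and time fields are absent, so a real parser would need to treat that part of the pattern as optional.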
For further information about attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [1].
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [1], regarding alarm configuration.
- System Safety Information, Reference [6].
- Personal Health and Safety Information, Reference [7].
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
If the Storage Engine, Unable to Synchronize Cluster in PLDB, Major alarm is cleared automatically, check if the Storage Engine, No Available Master Replica for PLDB alarm is raised. If yes, follow the procedure in Storage Engine, No Available Master Replica for PLDB, Reference [2].
If the alarm is not cleared automatically within a short period of time, perform the following steps:
- Perform a new backup in the master PLDB of the CUDB node where the master replica is located, and restore it in the faulty CUDB node. To find out where the master replicas are, refer to CUDB System Administrator Guide, Reference [3]. For further information about the data backup and restore procedure, refer to CUDB Backup and Restore Procedures, Reference [4]. Note that replication starts automatically after a successful restore.
- If the alarm does not cease, consult the next level of maintenance support. Further actions are outside the scope of this Operating Instruction.
Glossary
For the terms, definitions, acronyms, and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [5].
Reference List

| Ericsson Documents |
|---|
| [1] CUDB Node Fault Management Configuration Guide. |
| [2] Storage Engine, No Available Master Replica for PLDB. |
| [3] CUDB System Administrator Guide. |
| [4] CUDB Backup and Restore Procedures. |
| [5] CUDB Glossary of Terms and Acronyms. |

| Other Ericsson Documents |
|---|
| [6] System Safety Information. |
| [7] Personal Health and Safety Information. |
