1 Introduction
This document describes the Storage Engine, Replication Stopped Working in PLDB alarm and the troubleshooting steps to take when it is raised.
1.1 Description
This alarm is raised when replication stops working in the Processing Layer Database (PLDB) Storage Engine. The alarm is raised as a result of the periodic execution of the cudbCheckReplication command. For further information, refer to CUDB Node Commands and Parameters, Reference [1].
The alarm is issued in the following situations:
- The reallocation process is ongoing.
- The replication delay exceeds the time limit.
- Mastership change during cudbCheckReplication execution.
- Replication malfunction.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The reallocation process is ongoing. | Reallocation is in progress, and the replication lag exceeds the time limit set for cudbCheckReplication. | Due to the reallocation process, data replication time exceeds the number of seconds defined for cudbCheckReplication. | Temporary replication delay. No fault. | No impact. |
| The replication delay exceeds the time limit. | No reallocation was executed, but the replication delay exceeds the time limit set for cudbCheckReplication. | High write rate or load on PLDB. | Temporary replication delay. No fault. | No impact. |
| | | Slow network link between master and slave. | | |
| Mastership change during cudbCheckReplication execution. | A mastership change occurred while cudbCheckReplication was running, preventing the script from working properly. | A mastership change occurred while cudbCheckReplication was running, preventing the script from working properly. | No fault. | No impact. |
| Replication malfunction. | The active replication channel between the local slave replica and the master replica is not working properly. | The slave replica has problems connecting to the master PLDB. | PLDB cluster. | There might be service impact for the Front Ends (FEs) connecting to the affected slave PLDB cluster. Problems in replication could potentially trigger data inconsistency issues. |
| | | Replication is down because of inconsistencies on both replication channels. | | |
| | | Network issues or an unstable link between master and slave. | | |
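The decision implied by Table 1 can be sketched as a small classifier. This is an illustration under stated assumptions, not part of CUDB. The `Observation` fields, the `classify` function, and the order in which the causes are tested are all hypothetical. In particular, the fields do not describe the output format of cudbCheckReplication. They stand in for facts that the operator gathers manually. The value of each enum member is the number of the procedure section in Chapter 2 that handles that cause.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlarmCause(Enum):
    """Alarm causes from Table 1, valued by the matching procedure section."""
    REALLOCATION_ONGOING = "2.1"
    DELAY_EXCEEDS_LIMIT = "2.2"
    MASTERSHIP_CHANGE = "2.3"
    REPLICATION_MALFUNCTION = "2.4"


@dataclass
class Observation:
    """Hypothetical facts gathered by the operator.

    These fields are illustrative only; they do not describe the
    output format of cudbCheckReplication.
    """
    mastership_changed: bool    # mastership change while the check ran
    channel_working: bool       # active replication channel is healthy
    reallocation_ongoing: bool  # a subscription reallocation is running
    lag_seconds: float          # observed replication delay
    limit_seconds: float        # time limit set for cudbCheckReplication


def classify(obs: Observation) -> Optional[AlarmCause]:
    """Map an observation to an alarm cause, or None if no cause applies."""
    # Assumed ordering: a mastership change makes the check result
    # unreliable, so it is considered first.
    if obs.mastership_changed:
        return AlarmCause.MASTERSHIP_CHANGE
    if not obs.channel_working:
        return AlarmCause.REPLICATION_MALFUNCTION
    if obs.lag_seconds > obs.limit_seconds:
        # Same symptom; the reallocation state decides which cause applies.
        if obs.reallocation_ongoing:
            return AlarmCause.REALLOCATION_ONGOING
        return AlarmCause.DELAY_EXCEEDS_LIMIT
    return None
```

For example, consider a working channel with a 90-second lag against a 60-second limit while a reallocation runs: `classify(Observation(False, True, True, 90.0, 60.0))` returns `AlarmCause.REALLOCATION_ONGOING`, which points to Section 2.1.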
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | NO |
| Application Id | STORAGE-ENGINE |
| Error Code | 18 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raise. |
| Model Description | Replication stopped working, Storage Engine. |
| Active Resource Id | 1.3.6.1.4.1.193.169.1.1.18 |
| Active Description | Storage Engine (PLDB): Replication stopped working. |
| Alarm Event Type | communicationsAlarm (2) |
| Probable Cause | communicationsSubsystemFailure (505) |
| Severity | major (4) |
| Originating source IP | Node IP where the alarm was raised. |
| Sequence Number | Number which indicates the order in which the alarms are raised. |
For further information about attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [2].
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [2], regarding alarm configuration.
- The section on the cudbCheckReplication command in CUDB Node Commands and Parameters, Reference [1].
- CUDB Subscription Reallocation, Reference [3], regarding the reallocation feature.
- Storage Engine, Replication Channels Down in PLDB, Reference [4] for related alarm information.
- Storage Engine, Unable to Synchronize Cluster in PLDB, Major, Reference [5] for related alarm information.
- System Safety Information, Reference [7].
- Personal Health and Safety Information, Reference [8].
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Actions When the Reallocation Process Is Ongoing
Do the following:
- Run the cudbCheckReplication command. For more information, refer to CUDB Node Commands and Parameters, Reference [1].
- If it reports that the replication is working properly in PLDB (DSG 0) on the CUDB node where the alarm was raised, then clear the alarm manually, as described in CUDB Node Fault Management Configuration Guide, Reference [2].
2.2 Actions When the Replication Delay Exceeds the Time Limit
Do the following:
- Check the network connections between the master and slave PLDB replicas.
- Run the cudbCheckReplication command. For more information, refer to CUDB Node Commands and Parameters, Reference [1].
- If it reports that the replication is working properly in PLDB (DSG 0) on the CUDB node where the alarm was raised, then clear the alarm manually, as described in CUDB Node Fault Management Configuration Guide, Reference [2].
2.3 Actions for Mastership Change During cudbCheckReplication Execution
Do the following:
- Run the cudbCheckReplication command. For more information, refer to CUDB Node Commands and Parameters, Reference [1].
- If it reports that the replication is working properly in PLDB (DSG 0) on the CUDB node where the alarm was raised, then clear the alarm manually, as described in CUDB Node Fault Management Configuration Guide, Reference [2].
2.4 Actions for Replication Malfunction
Do the following:
- Check the network connections between the master and slave PLDB replicas.
- Check if the following alarms are raised:
  - Storage Engine, Replication Channels Down in PLDB, Reference [4].
  - Storage Engine, Unable to Synchronize Cluster in PLDB, Major, Reference [5].
  If either alarm is raised, follow the procedure in the corresponding document.
- Run the cudbCheckReplication command. For more information, refer to CUDB Node Commands and Parameters, Reference [1].
- If it reports that the replication is working properly in PLDB (DSG 0) on the CUDB node where the alarm was raised, then clear the alarm manually, as described in CUDB Node Fault Management Configuration Guide, Reference [2].
- If the problem still exists, consult the next level of maintenance support. Further actions are outside the scope of this operating instruction.
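The steps in Sections 2.1 to 2.4 share a common pattern. Run cudbCheckReplication, then clear the alarm manually if it reports that replication is working properly in PLDB (DSG 0). The two remaining causes add a network check first, and replication malfunction adds a related-alarm check. A minimal sketch of that flow follows, assuming the operator supplies two hypothetical boolean inputs, `related_alarm_raised` and `check_reports_ok`. The function name `procedure_steps` and the step strings are illustrative only and do not correspond to any CUDB tool.

```python
from enum import Enum
from typing import List


class AlarmCause(Enum):
    """Alarm causes, valued by the procedure section that handles them."""
    REALLOCATION_ONGOING = "2.1"
    DELAY_EXCEEDS_LIMIT = "2.2"
    MASTERSHIP_CHANGE = "2.3"
    REPLICATION_MALFUNCTION = "2.4"


def procedure_steps(cause: AlarmCause, related_alarm_raised: bool,
                    check_reports_ok: bool) -> List[str]:
    """Return the ordered steps from Chapter 2 for the given cause.

    related_alarm_raised and check_reports_ok are hypothetical inputs that
    the operator supplies after checking the alarm list and reading the
    cudbCheckReplication result.
    """
    steps: List[str] = []
    if cause in (AlarmCause.DELAY_EXCEEDS_LIMIT,
                 AlarmCause.REPLICATION_MALFUNCTION):
        steps.append("check network connections")
    if cause is AlarmCause.REPLICATION_MALFUNCTION and related_alarm_raised:
        steps.append("follow procedure in Reference [4] or [5]")
    steps.append("run cudbCheckReplication")
    if check_reports_ok:
        steps.append("clear alarm manually")
    elif cause is AlarmCause.REPLICATION_MALFUNCTION:
        steps.append("consult next level of maintenance support")
    # For causes 2.1 to 2.3 with a failing check, this document gives no
    # further step, so none is added here.
    return steps
```

The sketch adds no escalation step for Sections 2.1 to 2.3. Those sections do not say what to do when the check still fails, and the sketch does not guess.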
Glossary
For the terms, definitions, acronyms and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [6].
Reference List
| Other Ericsson Documents |
|---|
| [7] System Safety Information. |
| [8] Personal Health and Safety Information. |
