Contents

1 Overview
1.1 Alarm Description
1.2 Prerequisites
2 Procedure
2.1 Failing Node Introduction
2.2 Manual Replica Synchronization Procedure
Glossary
Reference List
1 Overview
This instruction concerns alarm handling for the Control, Potential Split Brain Detected alarm.
1.1 Alarm Description
The alarm is issued when a potential symmetrical split situation is detected in the Ericsson Centralized User Database (CUDB) system.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Potential symmetrical split situation detected. | At least half of the sites of the CUDB system have failed, or are unreachable. | | Affected site(s) or node(s). | The CUDB system is split in two equal halves. |
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | Yes |
| Module | CONTROL |
| Error Code | 2 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raise. |
| Resource ID | .1.3.6.1.4.1.193.169.7.2 |
| Alarm Model Description | CUDB system in potential split brain, Control. |
| Alarm Active Description | Control: Potential split brain detected. |
| ITU Alarm Event Type | communicationsAlarm (2) |
| ITU Alarm Probable Cause | communicationsSubsystemFailure (505) |
| ITU Alarm Perceived Severity | critical (3) |
| Originating Source IP | IP address of the node where the alarm was raised. |
| Sequence Number | Number which indicates the order in which the alarms are raised. |
For further information about attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [1].
1.2 Prerequisites
This section lists the prerequisites required for the procedure described in Section 2.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [1], regarding alarm configuration.
- The "CUDB System Split" section of CUDB High Availability, Reference [2], regarding CUDB system split situations.
- System Safety Information, Reference [9].
- Personal Health and Safety Information, Reference [10].
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
In most cases, the CUDB system recovers from the symmetrical split situation automatically when the unreachable sites become available again, and the alarm is then cleared automatically.
Refer to CUDB High Availability, Reference [2], and CUDB System Split Partial Recovery Procedure, Reference [3], for more information on the behavior of the CUDB system in different split situations and the corresponding recovery paths.
If the alarm does not cease, do the following:
1. Manually verify the status of the system in both partitions, including the master and slave assignments.
2. Determine the cause of the split situation: a network, node, or site failure, or a combination of these. No specific actions can be taken until the nature of the failure is determined.
3. Take the appropriate repair actions.
4. In case of a node or site failure, continue with Step 5. In all other cases, continue with Step 6.
5. Verify the following:
   - The failing elements in the system are fixed, that is, the nodes in the formerly isolated site are operational again.
   - The site infrastructure is fully operational, and the site is connected to the rest of the sites in the system.
   - The failing nodes have not been restarted since they went down, so the symmetrical split situation is still present.
   Then the affected nodes must be reintroduced into the system. Perform the procedure described in Section 2.1.
6. The failing nodes have been started up after the failure, or the network failure has been repaired, so the symmetrical split situation has ended. Verify whether one or more instances of the Storage Engine, Unable to Synchronize Cluster in DS, Major, Reference [4], and Storage Engine, Unable to Synchronize Cluster in PLDB, Major, Reference [5], alarms are raised, either in the surviving sites or in the recovering sites. If any of these alarms are raised, refer to their specific Operating Instruction (OPI).
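The alarm sweep in the last step above can be sketched as a dry-run shell script. Note that `cudbListAlarms` and the site names are hypothetical placeholders (this instruction does not name an alarm-listing command); only the alarm name in the filter comes from the text above.

```shell
# Dry-run sketch: sweep the surviving and recovering sites for the two
# cluster-synchronization alarms. "cudbListAlarms" and the site names are
# HYPOTHETICAL placeholders; check the command reference for your release.
run() { echo "+ $*"; }   # dry-run: print the command instead of executing it

check_sync_alarms() {
    for site in "$@"; do
        # Filter the active-alarm list for the DS and PLDB synchronization alarms
        run ssh "$site" "cudbListAlarms | grep 'Unable to Synchronize Cluster'"
    done
}

check_sync_alarms site_a site_b
```

If either alarm appears on any site, its own OPI (References [4] and [5]) takes precedence over continuing here.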
2.1 Failing Node Introduction
The following procedure is performed once per recovered node. It is assumed that the symmetrical split situation is still present at this point. To reintroduce a failing node, do the following:
1. Check whether the system configuration has changed. If it has changed, the procedure cannot continue.
2. Prevent spontaneous reconnection of the recovered node by disabling the external interfaces: VLANs CUDB_SITE, CUDB_FE, and PROVISIONING.
3. Put the PLDB cluster in maintenance mode for the recovered node.
4. Obtain the list of DSGs that are to be resynchronized manually. For each DSG in this list, perform a manual replica synchronization (see Section 2.2).
   Note: This step is optional. However, if manual resynchronization is expected to be needed, it is recommended to perform it here, as it can avoid the simultaneous raising of multiple alarms due to replica synchronization failures.
5. Activate the external interfaces for Inter-CUDB communication in the recovered node (enable the external interface for VLAN CUDB_SITE).
6. Activate ready mode for the PLDB cluster in the recovered node. The system abandons the symmetrical split situation at this point.
7. Verify whether the recovered node is hosting slave PLDB and DSG replicas. If it is not, the procedure cannot continue.
8. If there were alarms related to master replica synchronization, perform a manual replica synchronization (see Section 2.2 for the exact steps).
9. Verify whether the replication is fully operational by entering the command cudbCheckReplication. For further information on the command, refer to CUDB Node Commands and Parameters, Reference [6]. If replication is not fully operational, the procedure is unsuccessful.
10. Activate the rest of the external interfaces in the recovered node (enable VLANs CUDB_FE and PROVISIONING).
11. It is recommended to perform a software and data backup at this point. Refer to CUDB Backup and Restore Procedures, Reference [7], for information on backup and restore procedures.
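The ordering of the steps above can be summarized in a dry-run sketch. Only cudbCheckReplication is a command named in this instruction (see Reference [6]); every other command name below is a hypothetical placeholder for the real node commands, and nothing is actually executed.

```shell
# Dry-run sketch of the Section 2.1 reintroduction sequence for one node.
# All commands except cudbCheckReplication are HYPOTHETICAL placeholders;
# "run" only prints what would be executed.
run() { echo "+ $*"; }

reintroduce_node() {
    run disable_vlan CUDB_SITE CUDB_FE PROVISIONING  # block spontaneous reconnection
    run pldb_set_mode maintenance                    # PLDB cluster maintenance mode
    run list_dsgs_needing_resync                     # optional manual resync per DSG
    run enable_vlan CUDB_SITE                        # restore Inter-CUDB communication
    run pldb_set_mode ready                          # split situation ends here
    run cudbCheckReplication                         # replication check, Reference [6]
    run enable_vlan CUDB_FE PROVISIONING             # remaining external interfaces
}

reintroduce_node
```

The point of the ordering is that CUDB_FE and PROVISIONING stay down until replication is verified, so no front-end or provisioning traffic reaches the node while it may still hold stale data.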
2.2 Manual Replica Synchronization Procedure
In case manual replica synchronization is needed, do the following:
1. Create a backup of the master DS Unit of the affected DSG that is already running. Refer to the "Unit Data Backup" section of CUDB Backup and Restore Procedures, Reference [7], for more information.
2. Restore the backup of the master DS Unit to the local DS Unit of the affected CUDB node. Refer to the "Unit Data Restore" section of CUDB Backup and Restore Procedures, Reference [7], for more information.
The command cudbUnitDataBackupAndRestore automates the above process. Refer to CUDB Node Commands and Parameters, Reference [6], for more information on the command.
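As a dry-run sketch, the backup-and-restore pair can be collapsed into the single automated command. cudbUnitDataBackupAndRestore is named in Reference [6], but its argument syntax is not given here, so the arguments below are hypothetical placeholders.

```shell
# Dry-run sketch of Section 2.2. cudbUnitDataBackupAndRestore is a real
# command per Reference [6]; its arguments ("master_ds_unit",
# "local_ds_unit") are HYPOTHETICAL -- check the command reference for
# the actual syntax.
run() { echo "+ $*"; }   # dry-run: print instead of execute

sync_replica() {
    # One call replaces the manual backup (step 1) + restore (step 2) pair
    run cudbUnitDataBackupAndRestore "$1" "$2"
}

sync_replica master_ds_unit local_ds_unit
```

Using the automated command avoids the intermediate handling of the backup file between the two manual steps.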
Glossary
For the terms, definitions, acronyms and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [8].
Reference List
| CUDB Documents |
|---|
| [1] CUDB Node Fault Management Configuration Guide. |
| [2] CUDB High Availability. |
| [3] CUDB System Split Partial Recovery Procedure. |
| [4] Storage Engine, Unable to Synchronize Cluster in DS, Major. |
| [5] Storage Engine, Unable to Synchronize Cluster in PLDB, Major. |
| [6] CUDB Node Commands and Parameters. |
| [7] CUDB Backup and Restore Procedures. |
| [8] CUDB Glossary of Terms and Acronyms. |

| Other Ericsson Documents |
|---|
| [9] System Safety Information. |
| [10] Personal Health and Safety Information. |
