Control, Potential Split Brain Detected
Ericsson Centralized User Database

Contents

1      Overview
1.1    Alarm Description
1.2    Prerequisites

2      Procedure
2.1    Failing Node Introduction
2.2    Manual Replica Synchronization Procedure

Glossary

Reference List

1   Overview

This instruction concerns alarm handling for the Control, Potential Split Brain Detected alarm.

1.1   Alarm Description

The alarm is issued when a potential symmetrical split situation is detected in the Ericsson Centralized User Database (CUDB) system.

The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.

Table 1    Alarm Causes

Alarm Cause: Potential symmetrical split situation detected.

Description: Half of the sites of the CUDB system have failed or are unreachable. In deployments with more than two sites, some sites can be considered auto-removed and are not taken into account when determining the number of sites in the system. For more information, refer to the Split Situations section of CUDB High Availability, Reference [2].

Fault Reason:
  • Network or site failure.
  • A CUDB node has been shut down (in a two-node, two-site deployment).

Fault Location: Affected site(s) or node(s).

Impact: The CUDB system is split into two equal halves.

The alarm attributes are listed and explained in Table 2.

Table 2    Alarm Attributes

Auto Cease: Yes
Module: CONTROL
Error Code: 2
Timestamp First: Date and time when the alarm was raised for the first time.
Repeated Counter: Number of times the alarm has been raised.
Timestamp Last: Date and time of the most recent alarm raise.
Resource ID: .1.3.6.1.4.1.193.169.7.2
Alarm Model Description: CUDB system in potential split brain, Control.
Alarm Active Description: Control: Potential split brain detected.
ITU Alarm Event Type: communicationsAlarm (2)
ITU Alarm Probable Cause: communicationsSubsystemFailure (505)
ITU Alarm Perceived Severity: critical (3)
Originating Source IP: Node IP address from which the alarm was raised.
Sequence Number: Number indicating the order in which the alarms were raised.

For further information about attribute descriptions, refer to CUDB Node Fault Management Configuration Guide, Reference [1].

1.2   Prerequisites

This section lists the prerequisites required for the procedure described in Section 2.

1.2.1   Documents

Before starting this procedure, ensure that you have read the applicable documents in the Reference List.

1.2.2   Tools

Not applicable.

1.2.3   Conditions

Not applicable.

2   Procedure

In most cases, the CUDB system recovers from the symmetrical split situation automatically once the unreachable sites become available again, and the alarm is then cleared automatically.

Refer to CUDB High Availability, Reference [2] and CUDB System Split Partial Recovery Procedure, Reference [3] for more information on the behavior of the CUDB system in different split situations and recovery paths.

If the alarm does not cease, do the following:

  1. Manually verify the status of the system in both partitions, including the master and slave assignments.
  2. Determine the cause of the split situation: a network, node, or site failure, or a combination of these. No specific actions can be taken until the nature of the failure is determined.
  3. Take the appropriate repair actions.
  4. In case of a node or site failure, continue with Step 5. Otherwise, continue with Step 6.
  5. Verify the following:
    • The failing elements in the system are fixed, that is, nodes in the formerly isolated site are operational again.
    • Site infrastructure is fully operational, and the site is connected to the rest of the sites in the system.
    • The failing nodes have not been restarted since they went down. The symmetrical split situation is still present.

    Then the affected nodes must be reintroduced into the system. Perform the procedure described in Section 2.1.

  6. The failing nodes have been started up after failure or the network failure has been repaired. The symmetrical split situation has ended.

    Verify whether one or more instances of the Storage Engine, Unable to Synchronize Cluster in DS, Major, Reference [4] or Storage Engine, Unable to Synchronize Cluster in PLDB, Major, Reference [5] alarms are raised, either in the surviving sites or in the recovering sites. If any of these alarms are raised, refer to the specific Operating Instruction (OPI) for that alarm.
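The decision flow in the steps above can be summarized as a small shell helper. This is an illustrative sketch only: the function name and the failure-type labels (node, site, network) are assumptions made for this example and are not CUDB commands.

```shell
#!/bin/sh
# Illustrative decision helper for the recovery steps above.
# "select_recovery_path" and its arguments are hypothetical names
# for this sketch; they are not part of the CUDB command set.

select_recovery_path() {
    failure_type="$1"   # determined manually in Steps 1 and 2
    split_present="$2"  # "yes" while the symmetrical split persists

    case "$failure_type" in
        node|site)
            if [ "$split_present" = "yes" ]; then
                # Step 5: failing nodes not restarted, split still present
                echo "Follow Section 2.1 (Failing Node Introduction)"
            else
                # Step 6: nodes restarted after failure
                echo "Check Storage Engine synchronization alarms (Refs [4], [5])"
            fi
            ;;
        network)
            # Step 6: network failure repaired, split has ended
            echo "Check Storage Engine synchronization alarms (Refs [4], [5])"
            ;;
        *)
            echo "Determine the nature of the failure before acting" >&2
            return 1
            ;;
    esac
}

select_recovery_path node yes
```

The helper only maps an already-diagnosed failure type to the section of this instruction to follow; the diagnosis itself (Steps 1 and 2) remains a manual task.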

2.1   Failing Node Introduction

The following procedure is performed once for each recovered node. It is assumed that the symmetrical split situation is still present at this point. To reintroduce a failing node, do the following:

  1. Check whether the system configuration has changed. If it has, the procedure cannot continue.
  2. Prevent spontaneous reconnection of the recovered node by disabling its external interfaces: the CUDB_SITE, CUDB_FE, and PROVISIONING VLANs.
  3. Put the PLDB cluster in maintenance mode for the recovered node.
  4. Obtain the list of DSGs that are to be resynchronized manually. For each DSG in this list, perform manual replica synchronization (see Section 2.2).
    Note:  
    This step is optional. However, if manual resynchronization is expected to be needed, performing it here is recommended, because it can prevent multiple alarms from being raised simultaneously due to replica synchronization failures.

  5. Activate the external interfaces for Inter-CUDB communication in the recovered node (enable the CUDB_SITE VLAN).
  6. Activate ready mode for the PLDB cluster in the recovered node.

The system abandons the symmetrical split situation at this point.

  7. Verify whether the recovered node is hosting slave PLDB and DSG replicas. If it is not, the procedure cannot continue.
  8. If alarms related to master replica synchronization were raised, perform a manual replica synchronization (see Section 2.2 for the exact steps).
  9. Verify that replication is fully operational by entering the cudbCheckReplication command. For further information on the command, refer to CUDB Node Commands and Parameters, Reference [6]. If replication is not fully operational, the procedure has failed.
  10. Activate the remaining external interfaces in the recovered node (enable the CUDB_FE and PROVISIONING VLANs).
  11. It is recommended to perform a software and data backup at this point. Refer to CUDB Backup and Restore Procedures, Reference [7], for information on backup and restore procedures.
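The per-node sequence above can be kept at hand as an ordered checklist. The following sketch only prints the phases in order for operator reference; it deliberately issues no CUDB or VLAN commands, and the interface and command names are taken verbatim from the steps above.

```shell
#!/bin/sh
# Dry-run checklist for reintroducing a recovered node (Section 2.1).
# Prints the required order of operations only; it does not execute
# any CUDB command or touch any interface.

node_reintroduction_checklist() {
    cat <<'EOF'
1. Verify system configuration is unchanged (abort otherwise)
2. Disable external interfaces: CUDB_SITE, CUDB_FE, PROVISIONING
3. Put the PLDB cluster in maintenance mode on the recovered node
4. (Optional) Manually resynchronize the listed DSGs (Section 2.2)
5. Enable CUDB_SITE for Inter-CUDB communication
6. Activate PLDB ready mode (the split situation ends here)
7. Verify the node hosts slave PLDB and DSG replicas (abort otherwise)
8. Manually resynchronize master replicas if related alarms were raised
9. Verify replication with cudbCheckReplication (Reference [6])
10. Enable the remaining interfaces: CUDB_FE, PROVISIONING
11. Take a software and data backup (Reference [7])
EOF
}

node_reintroduction_checklist
```

Keeping the two interface-activation phases separate (CUDB_SITE first, CUDB_FE and PROVISIONING last) matters: it lets the node rejoin Inter-CUDB replication before it is again exposed to front-end and provisioning traffic.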

2.2   Manual Replica Synchronization Procedure

In case manual replica synchronization is needed, do the following:

  1. Create a backup of the master DS Unit of the affected DSG that is already running. Refer to the "Unit Data Backup" section of CUDB Backup and Restore Procedures, Reference [7] for more information.
  2. Restore the backup of the master DS Unit in the local DS Unit of the affected CUDB node. Refer to the "Unit Data Restore" section of CUDB Backup and Restore Procedures, Reference [7] for more information.

The cudbUnitDataBackupAndRestore command automates the above process. Refer to CUDB Node Commands and Parameters, Reference [6] for more information on the command.
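The manual path consists of the same two steps for every affected DSG, so it lends itself to a simple loop. In this sketch the two echo lines stand in for the actual backup and restore procedures of Reference [7]; they are placeholders, and only cudbUnitDataBackupAndRestore is a documented CUDB command (consult Reference [6] for its real syntax before use).

```shell
#!/bin/sh
# Hypothetical illustration of the manual resynchronization path
# (Section 2.2) applied per DSG. The echo lines are placeholders for
# the "Unit Data Backup" and "Unit Data Restore" procedures in
# Reference [7]; they are not CUDB commands.

manual_resync() {
    dsg="$1"
    # Step 1: back up the running master DS Unit of the affected DSG
    echo "backup: master DS Unit of DSG $dsg"
    # Step 2: restore that backup into the local DS Unit of the node
    echo "restore: local DS Unit of DSG $dsg"
}

# Loop over the DSGs identified in Step 4 of Section 2.1
for dsg in DSG-1 DSG-2; do
    manual_resync "$dsg"
done
```

In practice, cudbUnitDataBackupAndRestore replaces both placeholder steps in one operation.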


Glossary

For the terms, definitions, acronyms and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [8].


Reference List

CUDB Documents
[1] CUDB Node Fault Management Configuration Guide.
[2] CUDB High Availability.
[3] CUDB System Split Partial Recovery Procedure.
[4] Storage Engine, Unable to Synchronize Cluster in DS, Major.
[5] Storage Engine, Unable to Synchronize Cluster in PLDB, Major.
[6] CUDB Node Commands and Parameters.
[7] CUDB Backup and Restore Procedures.
[8] CUDB Glossary of Terms and Acronyms.
Other Ericsson Documents
[9] System Safety Information.
[10] Personal Health and Safety Information.


Copyright

© Ericsson AB 2017. All rights reserved. No part of this document may be reproduced in any form without the written permission of the copyright owner.

Disclaimer

The contents of this document are subject to revision without notice due to continued progress in methodology, design and manufacturing. Ericsson shall have no liability for any error or damage of any kind resulting from the use of this document.

Trademark List
All trademarks mentioned herein are the property of their respective owners. These are shown in the document Trademark Information.
