- Active-Active Geographical Redundancy Introduction
- Active-Active Geographical Redundancy Concepts
- Active-Active Geographical Redundancy Function
- Active-Active Geographical Redundancy Traffic Cases
- Active-Active Geographical Redundancy Capabilities
- Reference List
1 Active-Active Geographical Redundancy Introduction
This document describes the Active-Active Geographical Redundancy function provided by the SAPC.
2 Active-Active Geographical Redundancy Concepts
| Term | Definition |
|---|---|
| Active SAPC | A SAPC that is processing traffic and can handle provisioning operations. |
| Application Channel | Connection between SAPC peers in a geographically redundant configuration, used for geographical redundancy supervision and control functions. |
| Asynchronous replication | The data is committed in an active SAPC and then replicated to the mated peer. |
| Database ownership | When a geographically redundant object is opened, the local SAPC takes a lock on it through a mechanism called "ownership" of the object. A subsequent access to the same data from the mated peer is still possible and consistent, but requires more processing and adds latency. |
| Failover | Mechanism that allows SAPC neighbour peers to switch to a redundant SAPC upon failure of the previously active SAPC. |
| Mated peer | For a SAPC, the mated peer is the other SAPC that is part of the geographical redundancy function. |
| Replication channel | Connection between SAPC peers in a geographically redundant configuration, used for data replication and also for geographical redundancy supervision and control functions. |
| Subscriber and Session Stickiness | Feature provided by Diameter clients to bind all session requests for the same subscriber to a specific SAPC peer instance. This prevents the database ownership latency. |
3 Active-Active Geographical Redundancy Function
3.1 Active-Active Geographical Redundancy Overview
The SAPC, as a network element, provides High Availability as explained in the document Availability and Scalability. However, this level of availability does not help in the case of a complete power failure, natural disasters such as fire or earthquakes, or deliberately destructive human behavior such as bombings or terrorist attacks. Operators may also require the possibility to shut down clusters completely for planned or unplanned maintenance (for example, a hardware or software change). The geographical network redundancy function provides this extra level of redundancy at network level. With this feature, the SAPC offers a system availability target figure of 99.999%. The SAPC provides two geographical redundancy solutions that enhance the In-Service Performance (ISP) for traffic and O&M interfaces:
In Active-Active Geographical Redundancy, if one of the SAPC peers fails, the neighbour nodes fail over to the other SAPC (see Figure 1). The failover is not transparent to the SAPC clients (see Active-Active Geographical Redundancy Requirements).
Both SAPC peers are interconnected through two different network connections, called the Replication Channel and the Application Channel links. The Replication Channel link is used to transfer changes in database information done in one SAPC to the mated peer. The Replication Channel link is also used for geographical redundancy supervision and control, mainly to monitor the redundancy state of the mated peer. The Application Channel is only used for geographical redundancy supervision and control, to detect the availability of the mated peer, even when the Replication Channel fails.
The function consists of the following main parts:
To prevent time discrepancies in session management, for example when each SAPC belongs to different time zones, the SAPC in geographical redundancy uses the UTC time standard regardless of the configured time zone.
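As a minimal illustration of this convention (not SAPC code; the function name is invented), all session timestamps can be taken in UTC regardless of the locally configured time zone:

```python
from datetime import datetime, timezone

def session_timestamp() -> datetime:
    """Return a timezone-aware UTC timestamp, independent of the
    locally configured time zone, so that both peers agree on time."""
    return datetime.now(timezone.utc)

ts = session_timestamp()
```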
3.2 Active-Active Geographical Redundancy Data Mirroring
The SAPC geographical redundancy solution is based on the replication capability provided by the Database Service (DBS) of the Ericsson Component Based Architecture (CBA) platform. This is an asynchronous replication function that guarantees that database changes made in each SAPC are mirrored and applied in the mated peer.
Data mirroring between the SAPC peers keeps data in each SAPC synchronized with the mated peer. The SAPC that commits data forwards data transaction updates to the mated peer over the Replication Channel.
The data mirroring functionality is distributed over all payloads in the SAPC. There is one TCP/IP connection originating from each traffic processor in one SAPC, towards a traffic processor in the mated peer.
When a geographically redundant object is opened, the local SAPC takes a lock on it through a mechanism called "ownership" of the object. A subsequent access to the same data from the mated peer is still possible and consistent, but requires more processing and adds latency.
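The effect of ownership can be pictured with the following sketch (illustrative only; the class and method names are invented and are not SAPC internals):

```python
class GeoRedundantObject:
    """Sketch of object 'ownership': the peer that opened an object holds
    a lock on it; access from the mated peer stays consistent, but must
    first transfer ownership, which costs extra processing and latency."""

    def __init__(self, data):
        self.data = data
        self.owner = None  # peer currently holding the lock

    def read(self, peer: str):
        if self.owner is None or self.owner == peer:
            self.owner = peer            # fast path: local ownership
            return self.data, "local"
        # Slow path: the mated peer takes over ownership first,
        # which in the real system adds processing and latency.
        self.owner = peer
        return self.data, "ownership-transfer"

obj = GeoRedundantObject({"session": "ipcan-1"})
obj.read("SAPC1")            # first access: SAPC1 becomes owner
_, path = obj.read("SAPC2")  # mated peer: consistent but slower path
```

Subscriber and Session Stickiness avoids the slow path by keeping all requests for a subscriber on the same peer.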
| Note: | It is recommended to always have both SAPC peers equally sized in terms of processing capacity and memory capacity. Each SAPC must also be sized to support processing the maximum amount of traffic (the dimensioned traffic for the whole system) when one SAPC is down or unavailable. |
3.2.1 Redundant Data
To allow neighbour peers a transparent failover from one SAPC to the mated peer, the following information is replicated:
The SAPC does not replicate any other data that is not stored in the Database Service (DBS). It is the responsibility of the operator to align the non-replicated data between both SAPC peers. This comprises the following data:
3.2.2 Backlog
Data mirroring is performed asynchronously for performance reasons. Hence, database changes in one SAPC are temporarily stored in memory while they are sent to the mated peer. Next, the received changes are saved and a confirmation is sent back to the sender. The receiving SAPC can then apply the changes in its local database, and the sending peer can release the transmitted data. These memory buffers in both SAPC peers can be regarded as a backlog.
This procedure allows the SAPC to handle failures during data mirroring and ensure database consistency by forcing the same order of changes in both SAPC peers. Thus, having a backlog is a normal condition. However, there are situations when some transactions in the backlog cannot be processed immediately, for example, because of overload in either the sender or the receiver side, or disturbances in the Replication Channel. To handle this case in terms of resource use, it is possible to configure the maximum amount of memory used by the backlog.
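The backlog behaviour described above can be sketched as a bounded buffer: changes queue up until the peer confirms them, and if the configured memory limit would be exceeded, transactions are dropped and a full synchronization becomes necessary (names and sizes below are illustrative, not SAPC configuration):

```python
from collections import deque

class ReplicationBacklog:
    """Sketch of an asynchronous replication backlog with a memory cap."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self.pending = deque()          # changes awaiting peer confirmation
        self.full_sync_needed = False   # set once anything is dropped

    def enqueue(self, txn_id: int, size: int) -> bool:
        """Queue a change for replication; drop it if the cap is hit."""
        if self.used + size > self.max_bytes:
            self.full_sync_needed = True  # dropped data: databases diverge
            return False
        self.pending.append((txn_id, size))
        self.used += size
        return True

    def confirm(self) -> None:
        """Peer confirmed the oldest change; release its memory."""
        _, size = self.pending.popleft()
        self.used -= size

backlog = ReplicationBacklog(max_bytes=100)
backlog.enqueue(1, 60)
backlog.enqueue(2, 30)
dropped = not backlog.enqueue(3, 30)  # would exceed the cap: dropped
backlog.confirm()                     # peer confirms transaction 1
```

Because changes are confirmed in order, applying them in the same order on both peers keeps the databases consistent, as the text above describes.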
3.2.3 SAPC DBS Synchronization
Data mirroring synchronizes database changes between SAPC peers. However, there are a number of cases where the original contents of the databases are different, therefore synchronizing the changes does not make the contents identical. In the following cases, a complete synchronization of the database is needed:
Synchronization itself transmits a consistent view of the database from one SAPC to the mated peer. Any changes made in the transmitting SAPC in the meantime are also transferred as normal database changes, which are applied in the mated peer once the initial database is fully imported. The synchronization process is automatically triggered by the SAPC when required; no manual intervention is needed. The SAPC DBS component sends notifications about the start and the end (successful or unsuccessful) of the synchronization process.
When a SAPC reloads, it starts synchronization from the still-running mated peer. However, in other situations where a complete synchronization is needed, it is not possible to know which of the SAPC peers holds the most up-to-date database. In those cases, for example after a split-brain situation, the database from the preferred SAPC is maintained, and the non-preferred SAPC synchronizes from the preferred one.
The SAPC that starts a complete synchronization of the database with the mated peer data can handle neither traffic nor provisioning operations until the synchronization is finished.
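The choice of synchronization source described above can be condensed into one rule (a sketch; the actual decision is internal to the SAPC DBS):

```python
def sync_source(reason: str, preferred: str, running_peer=None) -> str:
    """Return which SAPC provides the database during a complete sync.

    - After a single-node reload, the still-running peer is the source.
    - Otherwise (for example after a split brain, where the most
      up-to-date database cannot be determined), the preferred SAPC
      keeps its database and acts as the source.
    """
    if reason == "reload" and running_peer is not None:
        return running_peer
    return preferred

src_after_reload = sync_source("reload", "SAPC1", running_peer="SAPC2")
src_after_split = sync_source("split-brain", "SAPC1")
```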
3.3 Active-Active Geographical Redundancy Network traffic handling
3.3.1 Active-Active Geographical Redundancy Requirements
An Active-Active solution means that any of the SAPC mated peers can handle traffic and provisioning requests from the client peers, maintaining SAPC synchronization between them (see SAPC DBS Synchronization). The following is required in the neighbouring traffic and provisioning plane:
The SAPC configuration, operation and maintenance procedures are performed in each SAPC individually.
One of the SAPC peers must be configured as the preferred node in geographical redundancy. This configuration is used when resolving faulty situations where it is not possible to know which of the SAPC peers holds the most up-to-date database. It is also used to prevent loss of data in some Replication Channel failure scenarios. These are the possible situations where the preferred configuration is used:
3.3.2 Active-Active Geographical Redundancy Network connectivity
Clients need to maintain one connection towards each SAPC in a redundant active-active mated pair. Seen from the external network, the two SAPC peers appear as two different standalone deployments, both capable of handling traffic and provisioning. The redundant SAPC solution exposes two VIP addresses for traffic, another two VIP addresses for provisioning, and two Diameter host names.
Each SAPC handles the following mandatory VIP addresses:
| VIP | Description |
|---|---|
| Traffic VIP | The VIP address that SAPC clients use to send Diameter traffic to the SAPC. Each SAPC has its own traffic VIP. The use of a DRA (Diameter Routing Agent) is recommended for traffic separation based on subscriber ranges, providing session stickiness, so that all Diameter sessions established over Gx and Rx for a certain IP-CAN session reach the same SAPC. In network deployments with an Online Charging System, this is also the VIP address where the SAPC handles the Sy interface. This is also the VIP address that the SAPC exposes for the Application Channel to the mated peer. |
| Replication VIP | The VIP address that the SAPC exposes for the Replication Channel to the mated peer. Each SAPC has its own replication VIP. |
| O&M Local VIP | An extra local VIP associated with each SAPC. This is the VIP used to manage the SAPC information model through COM. |
And the following optional VIP addresses:
| VIP | Description |
|---|---|
| Provisioning VIP | An optional VIP address (if not configured, the O&M Local VIP can be used instead for provisioning) that the provisioning SAPC clients use to send provisioning orders. Each SAPC has its own provisioning VIP. It is recommended to perform provisioning on the preferred SAPC. The data is replicated to the mated peer, transparently to the provisioning server. |
| ExtDB VIP | Only required in network deployments with the CUDB or an external database function. This is the VIP address used to provide access to an external database system and to receive SOAP notifications. Each SAPC has its own external database VIP. Both SAPC peers must be configured in the CUDB or external database node to broadcast the SOAP notifications. Only the SAPC with database ownership of the affected subscriber's IP-CAN sessions processes the SOAP notification; the mated peer ignores it. |
The following picture shows all the IP addresses involved in the geographical redundancy scenario with details about which IP addresses are available for each SAPC. As in standalone deployments, additional VIPs can be added if traffic separation between interfaces is required.
3.4 Active-Active Geographical Redundancy States
The Active-Active Geographical Redundancy function is initiated in the SAPC by performing an operational procedure. The SAPC geographical redundancy state reflects the traffic handling ability. The states are the following:
| Note: | When the SAPC does not handle traffic, the ports are blocked, so incoming traffic is not answered. If there are Diameter peers connected, the SAPC informs them of its intention to shut down the transport connection with Disconnect-Peer-Request messages. |
3.4.1 Transition between States
The SAPC executes transitions from one state to another depending on the information provided by the supervision functions explained in Active-Active Geographical Redundancy Supervision and Control Functions.
3.4.1.1 Transitions from Initial State
In this state, the SAPC can only make one transition when ordered by operational procedure:
3.4.1.2 Transitions from Synchronizing State
In this state, the SAPC can make the following transitions:
3.4.1.3 Transitions from Distributed State
In this state, the SAPC can make the following transitions:
3.4.1.4 Transitions from Active State
In this state, the SAPC can make the following transitions:
3.4.1.5 Transitions from Standby State
3.4.1.6 Transitions from Halted State
In this state, the SAPC can make the following transitions:
3.5 Active-Active Geographical Redundancy Supervision and Control Functions
The Geographical Redundancy control function has the following responsibilities:
3.5.1 Mated Peer Supervision
The SAPC stores its own replication state in a specific Managed Object Class (MOC) that can be managed through the NETCONF interface. This MO also stores the previous state and a time stamp of when the transition between states happened.
The SAPC uses two simultaneous heartbeat mechanisms:

| Replication Heartbeat / Application Heartbeat | SAPC1 state | SAPC2 state |
|---|---|---|
| Available / Available | DISTRIBUTED | DISTRIBUTED |
| Available / Unavailable | DISTRIBUTED | DISTRIBUTED |
| Unavailable / Available | ACTIVE (preferred) | STANDBY (non-preferred) |
| Unavailable / Unavailable | ACTIVE | ACTIVE |
The Application Channel is only considered when the Replication Channel is down. There are three different supervision situations:
Steps
- When the Replication Channel heartbeat response is received, the redundancy and availability of the mated peer are considered available. The Application Channel heartbeat responses are not considered. Both SAPC peers are in Distributed state.
- When the Replication Channel heartbeat response is not received (exceeding the maximum time period to consider the channel unavailable), but the Application Channel heartbeat response is received, the SAPC configured as preferred makes the transition to Active state, and the SAPC configured as non-preferred makes the transition to Standby state.
- When neither the Replication Channel heartbeat nor the Application Channel heartbeat responses are received, this is the split-brain situation: both SAPC peers make the transition to Active state.
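The two-heartbeat supervision decision above can be expressed as a small function (a sketch assuming boolean heartbeat availability; the real supervision also involves time-outs and retries):

```python
def redundancy_state(repl_hb_ok: bool, app_hb_ok: bool, preferred: bool) -> str:
    """Map the two heartbeat results to the local redundancy state.

    The Application Channel is only considered when the
    Replication Channel heartbeat has failed.
    """
    if repl_hb_ok:
        return "DISTRIBUTED"            # normal operation, mirroring works
    if app_hb_ok:
        # Peer is alive but replication is broken:
        # the preferred node serves traffic, the other steps aside.
        return "ACTIVE" if preferred else "STANDBY"
    return "ACTIVE"                     # split brain: both go active

assert redundancy_state(True, True, preferred=True) == "DISTRIBUTED"
assert redundancy_state(False, True, preferred=False) == "STANDBY"
assert redundancy_state(False, False, preferred=False) == "ACTIVE"
```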
3.5.2 Fault Detection and Recovery
The basic principle for the Active-Active Geographical Redundancy function, is that both SAPC peers process all incoming traffic and provisioning operations. Each SAPC keeps the state of the mated peer. However, there are situations that might cause one SAPC not to be synchronized with the peer. This chapter describes those scenarios and how they are handled.
3.5.2.1 Split Brain Scenario
This situation happens when the SAPC detects that neither the Replication Channel nor the Application Channel is available. If each SAPC is healthy, that is, only the heartbeats are lost (owing to loss of connectivity between the SAPC peers), it switches from Distributed state to Active state and continues serving traffic and provisioning operations. In this scenario, the end result is two SAPC peers in Active state serving traffic and provisioning, but without data mirroring. This situation is known as a split-brain scenario.
The split brain has the following consequences:
When the connectivity between the SAPC peers is re-established, a complete database synchronization is performed. Then, the non-preferred SAPC is automatically restarted. Once restarted, it recovers the most current database information from the mated peer. Once the complete synchronization is finished, both SAPC peers set the replication state to distributed.
3.5.2.2 Simultaneous SAPC Restart
If both SAPC peers fail, they probably do not reload at the same time. Therefore, the first one that restarts, observes loss of network connectivity to the mated peer and takes the necessary actions, as described in Split Brain Scenario.
If both SAPC peers come up at nearly the same time and each observes that its peer is already running, both of them try to synchronize data from the peer. Resolving this situation is automatic: the preferred SAPC provides data to the non-preferred SAPC. The procedure requires a complete synchronization of the non-preferred SAPC.
3.5.2.3 Temporary Differences between Databases
The data mirroring functionality makes use of backlogs while transferring database changes from each SAPC to the mated peer as described in Backlog. Normally, any specific transaction should be processed soon, and, therefore, be removed from the backlog. If a transaction remains in a queue for more than a minute, an alarm is raised to report that redundancy is compromised. If the database differences turn out to be temporary, the alarm is automatically cleared.
However, it can happen that SAPC overload or network connectivity problems persist, the backlog reaches the configured memory limit, and transactions have to be dropped by either of the SAPC peers. In this case, the only way to reach a state where the database contents are the same is to fully synchronize the databases. The preferred SAPC then provides data to the non-preferred SAPC by using the procedure described in SAPC DBS Synchronization.
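The one-minute alarm behaviour described above can be sketched like this (the threshold value and names are illustrative, not SAPC configuration):

```python
STUCK_THRESHOLD_S = 60  # transactions older than this compromise redundancy

def redundancy_alarm(pending, now: float) -> bool:
    """Return True (alarm raised) if any queued (txn_id, enqueued_at)
    entry has been waiting longer than the threshold; the alarm clears
    automatically once the backlog drains."""
    return any(now - enqueued_at > STUCK_THRESHOLD_S
               for _, enqueued_at in pending)

queue = [(1, 100.0), (2, 130.0)]
assert redundancy_alarm(queue, now=150.0) is False  # oldest waited 50 s
assert redundancy_alarm(queue, now=161.0) is True   # oldest waited 61 s
assert redundancy_alarm([], now=500.0) is False     # backlog drained: cleared
```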
3.5.3 Handling of the SAPC Origin State Id
In a standalone deployment, when the SAPC recovers from a restart, the database information recovered from the backup may not be fully up to date. Hence the SAPC increments its own Origin State Id and includes the new value in every response message alerting the peer diameter nodes about the loss of previous session state.
In an Active-Active Geographical Redundancy deployment, the Origin State Id information is replicated between both SAPC peers. Upon a SAPC restart, the Origin State Id is not incremented; it is obtained from the most up-to-date database information during synchronization with the mated peer. This enables each SAPC to send the same Origin State Id value. Transitions between SAPC redundancy states do not increment the SAPC Origin State Id, as those are transparent to the peer Diameter nodes in the external network.
The SAPC only increments the Origin State Id if both SAPC peers restart. That is, when the SAPC recovers from a restart and cannot replicate the Origin State Id information (cannot synchronize with the mated peer, for example because of loss of network connectivity in the Replication Channel), the SAPC increments its own Origin State Id.
When one SAPC detects a split-brain situation, its own Origin State Id is not incremented, in order to provide service availability, as the session information is available in both SAPC peers. When the Replication Channel is re-established, the preferred SAPC continues providing service, and the Origin State Id is not incremented (the non-preferred SAPC performs a complete synchronization with the mated peer and then replicates the same Origin State Id).
If the SAPC restarts in a split-brain situation, it autonomously increments its own Origin State Id, and this information cannot be replicated to the mated peer. As a result, each SAPC may send a different Origin State Id value as long as the Replication Channel is unavailable. When the Replication Channel recovers, the Origin State Id in the preferred SAPC is maintained (the non-preferred SAPC performs a complete synchronization with the mated peer and then replicates the same Origin State Id).
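The rules above can be condensed into one decision (a sketch; the real behaviour is internal to the SAPC):

```python
def increments_origin_state_id(restarted: bool, peer_reachable: bool) -> bool:
    """An SAPC increments its Origin State Id only when it has restarted
    AND cannot recover the replicated value from its mated peer.

    - Restart with the peer reachable: value is taken from the peer.
    - Split brain without a restart: value is kept for service availability.
    - Restart during a split brain: value must be incremented locally.
    """
    return restarted and not peer_reachable

assert increments_origin_state_id(restarted=True, peer_reachable=True) is False
assert increments_origin_state_id(restarted=False, peer_reachable=False) is False
assert increments_origin_state_id(restarted=True, peer_reachable=False) is True
```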
3.5.4 Handling of the SAPC Origin-Host
When a Diameter client requests a new Diameter session establishment, the SAPC Origin-Host included in the Diameter answer is the one from the SAPC peer that manages the session establishment. Each SAPC peer has a different Origin-Host value. In case of failure of the SAPC that managed the session establishment, the mated peer (which handles all the sessions while the failed SAPC recovers) always includes as Origin-Host the one used at session establishment instead of its own, to avoid any problems in the Diameter client.
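This behaviour can be sketched as follows (function name, session fields, and host names are invented for illustration):

```python
def build_answer(session: dict, local_origin_host: str) -> dict:
    """Build a Diameter answer for an existing session. The Origin-Host is
    always the one recorded at session establishment, even if the mated
    peer (local_origin_host) is now serving the session after a failover."""
    return {"Origin-Host": session["origin_host"]}

# Session established on SAPC1, now served by SAPC2 after a failover:
session = {"id": "gx-1", "origin_host": "sapc1.operator.example"}
answer = build_answer(session, local_origin_host="sapc2.operator.example")
```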
4 Active-Active Geographical Redundancy Traffic Cases
This chapter explains the high level interactions that occur in the most common use cases for the Active-Active Geographical Redundancy functionality:
4.1 SAPC in Distributed state restarts
The following figure shows the high-level flow that takes place when one SAPC in Distributed state restarts, and the main actions taken by the peer SAPC to perform the Active-Active Geographical Redundancy functionality.
A failure in one SAPC makes the mated peer transition from Distributed to Active state. Diameter clients/DRA nodes must be able to fail over to the SAPC that remains active. Once the SAPC recovers from the restart and completes synchronization from the mated peer, Diameter clients/DRA nodes must be able to fail back to the recovered SAPC and resume the homogeneous traffic distribution between both SAPC peers. During the transition, ongoing traffic events may fail. The duration of the traffic switch depends on several factors, see Active-Active Geographical Redundancy Capabilities.
Steps
- This is the initial working condition for Active-Active Geographical Redundancy. Both SAPC peers are in distributed state, processing traffic. Data mirroring is fully operational and keeps data in both SAPC peers synchronized.
- SAPC1 fails. SAPC2 detects that SAPC1 is not available (because of heartbeat time-outs), raises two alarms (Unable to Reach Peer, one for the Replication Channel and one for the Application Channel), and transitions to Active state. Data mirroring is interrupted, and the SAPC DBS component also raises an alarm (Connection Loss). Database transactions that were pending replication are dropped. As SAPC2 starts to handle traffic, database changes are applied but can no longer be replicated to the mated peer. This causes the SAPC DBS component to raise another alarm in SAPC2 (Synchronization Needed).
- SAPC1 completes the software reload, connects to the mated peer (which is in active state) and sets the redundancy state to synchronizing. The SAPC2 clears the corresponding alarms (Connection Loss, Unable to Reach Peer).
- Simultaneously with the previous step, SAPC1 recovers all persistent database data from the latest backup and detects that the local database is out of sync. The SAPC DBS component then raises an alarm (Initial Synchronization Needed) and starts synchronization from the active SAPC. The synchronization process transmits a snapshot of the database from the active SAPC to the recovered one, where it is imported. Any changes made on the active SAPC in the meantime are also transferred as normal database changes, which are applied in the recovered SAPC when the base view is fully imported. The SAPC DBS component in SAPC2 also clears the corresponding alarm (Synchronization Needed).
- SAPC1 successfully completes synchronization from the active SAPC2 and clears the corresponding alarm (Initial Synchronization Needed). In the final state, both SAPC peers are in Distributed state, processing traffic.
4.2 Replication Channel unavailable but Application Channel available
The following figure shows the high-level flow that takes place when the Replication Channel becomes temporarily unavailable, but the Application Channel remains available.
A failure in the Replication Channel makes each SAPC check the availability of the mated peer using the Application Channel heartbeat. Once each SAPC determines that the peer is alive, the preferred SAPC transitions from Distributed to Active state. The non-preferred SAPC transitions from Distributed to Standby, making the traffic and provisioning VIP addresses unavailable. Diameter clients/DRA nodes must be able to fail over to the SAPC that remains active. Once the Replication Channel recovers, the non-preferred SAPC performs a complete synchronization from the mated peer and makes the traffic and provisioning VIP addresses available again. Diameter clients/DRA nodes must be able to fail back to the recovered SAPC and resume the homogeneous traffic distribution between both SAPC peers. During the transition, ongoing traffic events may fail. The duration of the traffic switch depends on several factors, see Active-Active Geographical Redundancy Capabilities.
Steps
- This is the initial working condition for Active-Active Geographical Redundancy. Both SAPC peers are in distributed state, processing traffic. Data mirroring is fully operational and keeps data in both SAPC peers synchronized. The SAPC1 is configured as the preferred node for geographical redundancy.
- The Replication Channel fails, but SAPC1 detects that SAPC2 is available through the Application Channel. SAPC1 raises an alarm (Unable to Reach Peer) for the Replication Channel and transitions to Active state. SAPC2 also detects the same situation, transitions to Standby state, and makes the traffic and provisioning VIP addresses unavailable. Data mirroring is interrupted, and the SAPC DBS component also raises an alarm (Connection Loss) in both SAPC peers. Only SAPC1 handles traffic, and database changes are applied locally. This traffic handling in SAPC1 causes the SAPC DBS component to raise another alarm in SAPC1 (Synchronization Needed).
- The Replication Channel becomes available and both SAPC peers clear the corresponding alarms (Unable to Reach Peer, Connection Loss and Synchronization Needed). Data mirroring restarts and detects that a complete synchronization is needed. The SAPC2 connects to the mated peer (which is in active state) and sets the redundancy state to synchronizing. Then the SAPC2 recovers all persistent database data from the latest backup and starts synchronization from the active SAPC. In the meantime, the SAPC1 continues handling traffic.
- SAPC2 successfully completes synchronization from the active SAPC1, and the system goes back to the initial state.
4.2.1 SAPC in Active state restarts
The following actions are taken if the active SAPC1 fails before the Replication Channel recovers:
Steps
- SAPC2 detects that SAPC1 is down through the Application Channel. It transitions from Standby to Active state, makes the traffic and provisioning VIP addresses available, and raises two alarm instances (Unable to Reach Peer), one for the Application Channel and another for the Replication Channel. Only SAPC2 handles traffic and provisioning operations, applying database changes locally. The SAPC DBS component also raises two alarms (Connection Loss and Synchronization Needed) in SAPC2.
- Once SAPC1 recovers, it detects that the non-preferred SAPC2 is available through the Application Channel, but the Replication Channel remains unavailable, so it transitions to Active state again and makes the traffic and provisioning VIP addresses available. SAPC1 raises an alarm (Unable to Reach Peer) for the Replication Channel. The SAPC DBS component also raises two alarms (Connection Loss and Synchronization Needed) in SAPC1.
- The non-preferred SAPC2 detects that SAPC1 is available again, so it transitions to Standby state and makes the traffic and provisioning VIP addresses unavailable. Only SAPC1 handles traffic and provisioning operations, applying database changes locally. All the session data managed in SAPC2 while SAPC1 was down is lost, because the Replication Channel remains unavailable.
4.3 Both Replication Channel and Application Channel unavailable
The following figure shows the high level flow that takes place when both the Replication Channel and Application Channel become temporarily unavailable, and the main actions taken by the SAPC to perform the Active-Active Geographical Redundancy functionality.
A failure in both channels, results in both SAPC peers taking the role of active SAPC, handling traffic but without data replication. In this situation, the databases become inconsistent. When the connectivity is re-established, the database from the preferred SAPC is maintained, and the database from the non-preferred SAPC is discarded.
Steps
- This is the initial working condition for Active-Active Geographical Redundancy. Both SAPC peers are in distributed state, processing traffic. Data mirroring is fully operational and keeps data in both SAPC peers synchronized. The SAPC1 is configured as the preferred node for geographical redundancy.
- Both the Replication Channel and the Application Channel fail. SAPC1 detects that SAPC2 is not available (because of heartbeat time-outs), raises two alarms (Unable to Reach Peer, one for the Replication Channel and one for the Application Channel), and transitions to Active state. SAPC2 also detects that SAPC1 is not available and takes the same actions. Data mirroring is interrupted, and the SAPC DBS component also raises an alarm (Connection Loss) in both SAPC peers. This is a split-brain scenario. Both SAPC peers start handling traffic; database changes are applied locally but can no longer be replicated, so the databases become inconsistent.
- The Replication Channel becomes available and both SAPC peers clear the corresponding alarms (Unable to Reach Peer, Connection Loss). Data mirroring restarts and detects that a complete synchronization is needed. The SAPC2 connects to the mated peer (which is in active state) and sets the redundancy state to synchronizing. Then the SAPC2 starts synchronization from the preferred SAPC, discarding the local data. In the meantime, the SAPC1 continues handling traffic.
- SAPC2 successfully completes synchronization from the active SAPC1, and the system goes back to the initial state.
5 Active-Active Geographical Redundancy Capabilities
For Active-Active Geographical Redundancy the following capabilities must be considered:
Regarding response times, the following capabilities must be considered for SAPC transitions:
