Availability and Scalability
Ericsson Service-Aware Policy Controller

Contents

1   Availability and Scalability Introduction
2   Availability and Scalability Function
2.1   Availability
2.2   Scalability
3   Availability and Scalability Operational Conditions
3.1   Availability and Scalability External Conditions
3.2   Availability and Scalability Function Administration
3.3   Availability and Scalability Security

Reference List

1   Availability and Scalability Introduction

This document provides a description of the Availability and Scalability function provided by the Ericsson Service-Aware Policy Controller (SAPC).

2   Availability and Scalability Function

2.1   Availability

The SAPC is built on a highly available architecture in which a single failure does not stop the operation of the cluster. It is built over a cluster of nodes of three types:

Figure 1   SAPC Cluster Architecture

The two SCs provide the OAM and provisioning services in active-standby mode: if an SC goes down, all services for which it was the active instance are taken over by the other SC. The other node types work in active-active mode. Incoming traffic is distributed among all the available Traffic Processors in the cluster by a maximum of six TPs (usually, the first six TPs). These TPs are also the ones publishing the traffic virtual IP address to the external network. If one of these TPs goes down, the publishing of the virtual IP address and the traffic distribution functions are moved to another available TP, and the failed TP is excluded from traffic distribution until it is up again.
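The distributor selection and failover behavior can be illustrated with a minimal sketch (not SAPC code; the Tp class and pick_distributors function are hypothetical names introduced for illustration):

    # Minimal sketch of the distributor-selection behavior described above.
    from dataclasses import dataclass

    MAX_DISTRIBUTORS = 6  # at most six TPs publish the VIP and distribute traffic

    @dataclass(frozen=True)
    class Tp:
        name: str
        up: bool = True

    def pick_distributors(tps):
        # The first six available TPs publish the traffic virtual IP address
        # and distribute traffic; a failed TP is replaced by the next
        # available one and stops receiving distributed traffic.
        return [tp for tp in tps if tp.up][:MAX_DISTRIBUTORS]

    cluster = [Tp("TP-1", up=False)] + [Tp(f"TP-{i}") for i in range(2, 9)]
    print([tp.name for tp in pick_distributors(cluster)])  # TP-2 ... TP-7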

The following situations, in which multiple failures occur simultaneously, would affect the SAPC service availability:

To increase the reliability and availability of the system, the SAPC includes several control mechanisms, such as restoration procedures, overload control, session cleanup procedures, and mechanisms to overcome connectivity loss.

2.1.1   Restart and Restore Procedures

The SAPC provides mechanisms to handle restart situations, both for the SAPC itself and for the peer traffic plane nodes, as well as restore procedures.

2.1.1.1   SAPC Restart

Even though the SAPC provides a high level of availability, if both SCs fail simultaneously for more than 15 minutes, the SAPC is restarted. Once the SAPC recovers from a restart, the latest database information is recovered from the stored backups. The recovered information may not be fully up to date and, for this reason, the SAPC performs some actions to consolidate it.

The next sections describe the actions performed by the SAPC, in a PCC deployment scenario, after a cluster restart.

The SAPC increments its own Origin-State-Id and includes the new value in every response message, alerting the peer nodes to the loss of the previous session state.

Note:  
The Origin-State-Id is a monotonically increasing value that is incremented whenever a Diameter entity restarts with loss of previous state.
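As an illustrative sketch only (not SAPC code), a Diameter entity can use the Unix time at which the process started as its Origin-State-Id, since that value is monotonically increasing across restarts:

    # Illustrative sketch of Origin-State-Id handling per RFC 6733.
    import time

    class DiameterEntity:
        def __init__(self):
            # A new, larger value is taken on every restart that loses state.
            self.origin_state_id = int(time.time())

        def answer(self, request):
            # The current Origin-State-Id is included in every response so
            # that peers can detect the loss of previous session state.
            return {"Session-Id": request["Session-Id"],
                    "Origin-State-Id": self.origin_state_id}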

The sessions available before the restart are not recovered from a backup. Therefore, none of the dynamic data related to the sessions is recovered either:

The Gx, Sd, Sy, and Rx sessions are identified by the Diameter Session-Id. A session is considered unknown if the SAPC does not find a session with the same Session-Id in its internal database. After a SAPC restart, requests sent from PCEFs, AFs, TDFs, or Online Charging Systems for an unknown session are answered by the SAPC with the DIAMETER_UNKNOWN_SESSION_ID error code.
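The lookup can be sketched as follows (hypothetical names, not SAPC code; DIAMETER_UNKNOWN_SESSION_ID is result code 5002 in RFC 6733):

    # Sketch of the unknown-session handling described above.
    DIAMETER_SUCCESS = 2001
    DIAMETER_UNKNOWN_SESSION_ID = 5002  # RFC 6733 result codes

    def handle_request(session_db, request):
        session = session_db.get(request["Session-Id"])
        if session is None:
            # After a cluster restart the session database holds no
            # pre-restart sessions, so their requests all end up here.
            return {"Result-Code": DIAMETER_UNKNOWN_SESSION_ID}
        # ... normal processing of the known session would happen here ...
        return {"Result-Code": DIAMETER_SUCCESS}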

All subscriber-related data are recovered from the stored backups.

2.1.1.2   Peer Restart

The SAPC is able to detect Diameter peer node restarts based on the standard mechanism described for Diameter nodes in RFC 6733 (refer to Diameter Base Protocol, IETF RFC 6733 [1]).
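In essence (a minimal sketch under the RFC 6733 mechanism; the names are hypothetical), a peer restart is inferred when the peer advertises a larger Origin-State-Id than the one last seen from it:

    # Sketch of peer-restart detection via Origin-State-Id comparison.
    last_seen_osi = {}  # peer identity -> last Origin-State-Id observed

    def peer_restarted(peer, origin_state_id):
        previous = last_seen_osi.get(peer)
        last_seen_osi[peer] = origin_state_id
        # A larger value than previously observed means the peer restarted
        # and lost its previous state.
        return previous is not None and origin_state_id > previous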

The SAPC provides the following mechanisms to handle restart situations for the peer traffic plane nodes:

2.1.1.3   SAPC Restore

The SAPC provides the System Data type of restore.

A System Data backup is used to perform a system data fallback, recovering the whole system to a former version in a consistent way.
After restoring a System Data backup, the SAPC reestablishes the following information:

The SAPC loses the following data:

2.1.2   Session Cleanup Mechanisms

The following mechanisms are implemented in the SAPC to remove obsolete information.

2.1.2.1   Basic Session Cleanup Mechanism

The following mechanism is related to the removal of specific obsolete sessions:

2.1.2.2   Massive Cleanup Mechanism

Massive Gx Session Cleanup at PCEF Restart

This cleanup mechanism consists of deleting all the obsolete IP-CAN sessions existing in the SAPC for a restarted PCEF, also considering the following:

Massive Gx Session Cleanup at PCEF Peer Removal

When a diameterNode peer is removed from the configuration data, the SAPC removes all the IP-CAN sessions established through that peer, using the same mechanism as for a PCEF restart.

Massive Rx Session Cleanup at AF Restart

This cleanup mechanism consists of deleting all the obsolete AF sessions existing in the SAPC for a restarted AF, also considering the following:

Note:  
The SAPC provides a robust mechanism that allows cleaning the obsolete sessions even in geographical redundancy (GeoRed) switchover or scaling scenarios.

Both the massive Gx and Rx cleanup processes continue scanning and removing sessions until all the obsolete IP-CAN or AF sessions of the restarted peer have been removed.
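A minimal sketch of such a scan-and-remove loop (hypothetical names, not SAPC code; for illustration, obsolete sessions are assumed to be identified by an Origin-State-Id older than the peer's current one):

    # Sketch of the massive cleanup loop described above.
    def massive_cleanup(session_db, peer, current_osi, batch_size=1000):
        while True:
            obsolete = [sid for sid, s in session_db.items()
                        if s["peer"] == peer
                        and s["origin_state_id"] < current_osi]
            if not obsolete:
                break  # all obsolete sessions of the restarted peer removed
            # Remove sessions in batches to bound the work per scan pass.
            for sid in obsolete[:batch_size]:
                del session_db[sid]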


2.1.2.3   Session Inactivity Cleanup Mechanism

This cleanup mechanism consists of deleting all the inactive Gx sessions existing in the SAPC (sessions for which no request has been received or sent within a configurable period of time), also considering the following:

This mechanism runs daily and is enabled or disabled by configuration, together with other parameters, as explained in Configure Session Inactivity Cleanup Mechanism.

If a massive cleanup is running or is detected while a session inactivity cleanup process is in progress, the SAPC stops the session inactivity cleanup process.
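A minimal sketch of the daily scan (hypothetical names, not SAPC code; the inactivity threshold stands in for the configurable period mentioned above):

    # Sketch of the session inactivity cleanup.
    import time

    def inactivity_cleanup(session_db, max_inactive_seconds,
                           massive_cleanup_running):
        if massive_cleanup_running():
            return  # yield to a concurrent massive cleanup, as described above
        now = time.time()
        for sid in list(session_db):
            # Delete sessions with no request sent or received in the period.
            if now - session_db[sid]["last_activity"] > max_inactive_seconds:
                del session_db[sid]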

2.1.3   Virtual Machine Evacuation

Virtual Machine (VM) Evacuation is a feature provided by the Network Function Virtualization Infrastructure (NFVI).

The evacuation of SAPC VMs means that, when the physical host where one of them is allocated goes down, the infrastructure re-creates the VM on a different host, restoring the VNF High Availability state. The evacuated VM is created from scratch, so runtime data is lost. To fully guarantee the HA of the SAPC, the evacuation of the VM must be performed in such a way that the anti-affinity rules recommended for the VNF are applied during the re-creation of the new VM. See the chapter on anti-affinity rules in Virtual Service-Aware Policy Controller.
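The anti-affinity constraint can be expressed as a simple check (an illustrative sketch with hypothetical names, not part of any NFVI API): after re-creation, no two VMs of the same anti-affinity group may share a physical host:

    # Sketch of an anti-affinity placement check.
    def violates_anti_affinity(placements, group):
        # placements maps VM name -> host; group is a set of VM names that
        # must not share a host (for example, the two SC VMs).
        hosts = [placements[vm] for vm in group if vm in placements]
        return len(hosts) != len(set(hosts))

    assert violates_anti_affinity({"SC-1": "host-A", "SC-2": "host-A"},
                                  {"SC-1", "SC-2"})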

No specific configuration is needed in the SAPC to support the feature, except for CEE deployments, where an evacuation policy must be configured during the VNF deployment. See the ha-policy parameter of the Descriptor Generator Tool Configuration File in SAPC VNF Descriptor Generator Tool.

For SAPC deployments on OpenStack-based NFVIs, VM Evacuation is only supported for Traffic Processor and Virtual Router VMs. In those deployments, when a host allocating an SC goes down, the SC VM is started on that same host once the host is up again.

2.1.4   EBM Server Connectivity Loss

The events generated by the SAPC are sent to the Event-Based Monitoring (EBM) server through a set of connections that the SAPC establishes; there is a pool of connections for each EBM server. These connections can be lost or suffer network disturbances. While any of them is not working because of connectivity loss or server unavailability, the events generated by the SAPC are stored in an internal buffer per EBM server.

This buffer is dimensioned to store the number of events generated towards each EBM server during a 10-second period in which the SAPC handles the maximum incoming traffic the node can withstand. When the maximum capacity of the buffer is reached, new events overwrite the oldest ones, following the First In First Out (FIFO) principle. As a result, some events can be lost if the network disturbance lasts more than 10 seconds.
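The FIFO overwrite behavior corresponds to a fixed-size ring buffer, as in the following sketch (not SAPC code; the peak event rate used for dimensioning is an assumed figure, for illustration only):

    # Sketch of the per-EBM-server event buffer.
    from collections import deque

    ASSUMED_PEAK_EVENTS_PER_SECOND = 50_000  # illustrative value only
    BUFFER_SECONDS = 10                      # dimensioning window from the text

    ebm_buffer = deque(maxlen=ASSUMED_PEAK_EVENTS_PER_SECOND * BUFFER_SECONDS)

    def buffer_event(event):
        # When the buffer is full, appending silently discards the oldest
        # event, so events are lost if the disturbance outlasts the window.
        ebm_buffer.append(event)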

When the SAPC detects that a connection is broken, the following actions are performed:

2.1.5   Live Migration with VMware vMotion

Live Migration is a feature provided by the NFVI.

To ease physical host maintenance tasks (for example, HW upgrade), Live Migration allows moving a running VM between different physical hosts without perceived downtime. The memory, storage, and network connectivity of the VM are transferred from the source host to the destination host.

No specific configuration is needed in the SAPC to support the VMware vMotion feature.

The migration may involve only the VM, only the datastore of that VM, or both. The destination host is selected during the procedure.

Shared storage is required for VMware vMotion. Also, the same networking configuration (same port groups) must be guaranteed on the destination host.

2.2   Scalability

The SAPC is built on a scalable architecture that provides the ability, at runtime, to increase the traffic processing capacity by adding processors (Scale-out) or to reduce it by removing existing processors (Scale-in). The SAPC is able to keep its performance levels with only a few seconds of impact on the ongoing traffic when Scale-out or Scale-in operations are performed. The only node type that is scaled is the Traffic Processor (TP).

Figure 2 shows a SAPC cluster initially installed with m TP nodes that has been scaled out to z TP nodes.

Figure 2   SAPC Cluster where New TP Nodes Are Added

When TPs are scaled, the traffic interface and traffic distribution functionalities are also scaled, up to six running instances on six different TPs. Beyond this number, the newly added TPs provide these functions as spares (standing by to become active if any running instance goes down).
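As a rough sketch of this role assignment (hypothetical names, not SAPC code):

    # Sketch of distribution-instance roles when TPs are scaled out.
    def assign_distribution_roles(tps):
        # The first six TPs run active traffic interface and distribution
        # instances; any further TPs hold spare instances, ready to take
        # over if an active instance goes down.
        return {tp: ("active" if i < 6 else "spare")
                for i, tp in enumerate(tps)}

    roles = assign_distribution_roles([f"TP-{i}" for i in range(1, 9)])
    print(roles["TP-7"])  # spare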

2.2.1   Multi-Site Support

The SAPC supports geographically distributed (multi-site) configurations for cases in which a single SAPC does not have enough capacity to handle all the subscribers' traffic. The following scenarios are supported.

2.2.1.1   SAPC with Common Database

In this deployment, the operator has multiple SAPCs deployed in different sites and a common database to store the subscriber data. Hence, any SAPC can serve IP-CAN sessions from any subscriber. Fair Usage Accumulators must be stored in the common database so that any of the SAPCs can access and modify the data at any time.

Figure 3   Multi-Site Deployment with Common Database

The subscribers' static data and Fair Usage Accumulators are centralized in the external database. The rest of the static data, such as Subscriber Groups, Services, and so on, together with operator-defined policies, is provisioned in all the SAPCs.

2.2.1.2   Network Dependencies

The following considerations must be taken into account in deployments with multiple SAPCs:

3   Availability and Scalability Operational Conditions

3.1   Availability and Scalability External Conditions

VIP Gateway routers are not part of the SAPC, but they are needed in all kinds of SAPC deployments.

3.2   Availability and Scalability Function Administration

The following sections list the relevant Operation and Maintenance actions, alarms, logs, notifications, and statistics data related to the function.

3.2.1   Availability and Scalability Alarms

There are no specific SAPC alarms related to its availability, apart from the ones provided by the platform:

3.2.2   Availability and Scalability Logging

The following events are logged:

3.2.3   Availability and Scalability Notifications

There are no specific SAPC notifications related to service availability, apart from the ones provided by the platform.

3.3   Availability and Scalability Security

Not applicable.


Reference List

Standards
[1] Diameter Base Protocol, IETF RFC 6733.