Emergency Recovery Procedure for MTAS

Contents

1   Introduction
1.1   Prerequisite

2   Recovery Procedures

3   Problem Reporting
3.1   Problem Solved
3.2   Consult Next Level of Support

1   Introduction

This document describes the emergency recovery procedure for the virtual MTAS node.

Scope

Only MTAS-related recovery actions are included. The root cause can sometimes lie outside MTAS, but troubleshooting outside MTAS is out of scope for this procedure.

The recovery actions are intended for critical situations with an apparent disturbance in traffic handling. They also apply to situations where an important redundant function is lost, so that the system is no longer protected against a single point of failure. This document is a recovery instruction: the affected systems are assumed to have been in a fully working state before the problems started. Troubleshooting steps related to faulty configuration, or to wrong software or hardware versions, are not explained.

Often, the critical situation is caused by a manual activity ongoing at the site, typically an upgrade or another maintenance task. In such situations, first try the fallback or recovery procedures belonging to that particular activity (often delivered together with the upgrade). If that does not help, this recovery procedure can be tried.

If the system has raised alarms, try the procedures for resolving the cause of those alarms before consulting this recovery procedure. That documentation is part of the Customer Product Information (CPI) stored in the Active Library Explorer. Also consult the MTAS Troubleshooting Guideline before considering this recovery procedure.

This document provides an emergency recovery procedure for conditions where it is required to restore MTAS.

1.1   Prerequisite

This section states the prerequisites for performing the emergency recovery procedure.

1.1.1   Hardware and Software

The following hardware and software is required:

1.1.2   Documents

This instruction references the following documents:

1.1.3   Conditions

Before starting this procedure, ensure that the underlying hardware and environment (for example, Ericsson Cloud Solution, and an OpenStack or equivalent environment) have already been restored.

2   Recovery Procedures

This procedure assumes that MTAS must be reinstalled, for example, as a result of a failure in the underlying cloud infrastructure. Such a failure can, in turn, be caused by a major hardware failure, an extensive power loss, a flood, or a fire.

Note:  
Do not perform the following activities in the event of a system failure, unless otherwise stated:
  • Altering databases.
  • Modifying anything other than configuration. Before modifying the configuration, always make a backup copy of the original configuration file.
  • Introducing any additional changes (deltas).
  • Powering off or rebooting.
  • Altering the network level.
  • Changing any system passwords.
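
The rule above about backing up a configuration file before modifying it can be illustrated with a minimal sketch. The function name `backup_config`, the timestamp format, and the `.orig` suffix are assumptions made for illustration only. They are not part of MTAS or its tooling, and the actual location of any configuration file must be taken from the product documentation.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path: str) -> Path:
    """Copy a configuration file next to the original before it is edited.

    Hypothetical helper: the naming scheme '<name>.<timestamp>.orig' is an
    assumption, not an MTAS convention.
    """
    source = Path(config_path)
    if not source.is_file():
        raise FileNotFoundError(f"No such configuration file: {source}")

    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = source.with_name(f"{source.name}.{stamp}.orig")

    # Refuse to overwrite an existing backup made within the same second.
    if target.exists():
        raise FileExistsError(f"Backup already exists: {target}")

    # copy2 also preserves the modification time and permission bits.
    shutil.copy2(source, target)
    return target
```

The original file is left untouched, so the copy can be used to revert a change if the modification does not help.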

To restore MTAS:

  1. Deploy MTAS; see MTAS SW Installation.
  2. Restore the latest known working backup, if possible; see Restore Backup and View Progress Report.
    Note:  
    This step assumes that a backup of the last working configuration has been created and exported. See Create Backup and Export Backup.

  3. Restore MTAS to its original size; see MTAS Scaling Management.
  4. Perform a node health check; see MTAS Health Check.
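  5. Is the problem solved?
    Yes: Continue with Section 3.1 Problem Solved.
    No: Continue with Section 3.2 Consult Next Level of Support.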
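
Because the steps above must be performed in order and later reported, it can help to keep a simple journal while working through them. The following is a minimal sketch only. The class `RecoveryJournal`, the outcome values `ok`, `failed`, and `skipped`, and the JSON output are assumptions introduced for illustration; they do not correspond to any MTAS tool or interface. The step names are taken from the procedure above.

```python
import json
import time
from dataclasses import asdict, dataclass

# Step names copied from the procedure; the list itself is illustrative.
RECOVERY_STEPS = [
    "Deploy MTAS",
    "Restore the latest known working backup",
    "Restore MTAS to its original size",
    "Perform node health check",
]


@dataclass
class StepRecord:
    step: str
    outcome: str  # hypothetical values: "ok", "failed", "skipped"
    notes: str
    timestamp: str


class RecoveryJournal:
    """Hypothetical operator journal that refuses out-of-order entries."""

    def __init__(self) -> None:
        self.records: list[StepRecord] = []

    def record(self, step: str, outcome: str, notes: str = "") -> None:
        if outcome not in ("ok", "failed", "skipped"):
            raise ValueError(f"Unknown outcome: {outcome}")
        index = RECOVERY_STEPS.index(step)  # ValueError for an unknown step
        # A step counts as handled once it is "ok" or deliberately "skipped"
        # (for example, when no backup is available to restore).
        handled = {r.step for r in self.records if r.outcome in ("ok", "skipped")}
        missing = [s for s in RECOVERY_STEPS[:index] if s not in handled]
        if missing:
            raise RuntimeError(f"Earlier steps not completed: {missing}")
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        self.records.append(StepRecord(step, outcome, notes, stamp))

    def to_json(self) -> str:
        """Serialize the journal so it can be attached to a problem report."""
        return json.dumps([asdict(r) for r in self.records], indent=2)
```

A failed attempt can be recorded and then retried, since only the earlier steps are checked. The resulting JSON text is one possible way to document the recovery steps taken.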

3   Problem Reporting

All recovery situations must be treated as abnormal and must be reported to the next level of support, or according to another documented procedure. This applies even if the recovery has been successful. Often, a Customer Service Request (CSR) is written to the responsible support organization.

If the situation has affected the In-Service Performance (ISP), it must be reported as such according to the documented procedure.

A Root Cause Analysis (RCA) is often required later to determine the source of the problem. It is therefore important to document the problematic situation and all the recovery steps that have been taken. Several log files in the system must be saved or copied so that they are not overwritten with newer information and remain available for any future RCA.
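
Copying log files aside before they are overwritten can be sketched as follows. This is an illustration under stated assumptions, not the documented data collection method: the function `preserve_logs`, the `recovery-<timestamp>` directory name, and the numeric file prefix are hypothetical. The set of log files to collect is not listed here and must be taken from Data Collection Guideline for MTAS.

```python
import shutil
import time
from pathlib import Path


def preserve_logs(log_paths, archive_root):
    """Copy the given log files into a new timestamped directory.

    Hypothetical helper: which files to pass in is defined by the data
    collection documentation, not by this sketch.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive_dir = Path(archive_root) / f"recovery-{stamp}"
    # Fail rather than mix files from two collections in one directory.
    archive_dir.mkdir(parents=True, exist_ok=False)

    copied = []
    for index, raw in enumerate(log_paths):
        source = Path(raw)
        if not source.is_file():
            continue  # a missing log is skipped, not treated as an error
        # The numeric prefix keeps files with the same name from colliding.
        target = archive_dir / f"{index:03d}_{source.name}"
        shutil.copy2(source, target)  # keeps the original modification time
        copied.append(target)
    return copied
```

Keeping the original modification times makes it easier to correlate the saved logs with the recorded recovery steps during a later RCA.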

3.1   Problem Solved

Keep the site and the affected functions under extra observation for a period of time to ensure that the fault does not recur. Record the incident according to local procedures, using a log book or similar.

3.2   Consult Next Level of Support

Provide the receiving support organization with the following information:

For information on how to collect data and log files, see Data Collection Guideline for MTAS.