Emergency Recovery Procedure for IPWorks

Contents

1     Introduction
1.1   Prerequisites
1.2   Related Information
2     Emergency Definitions
3     Scenarios for Recovery
3.1   Node Reboots Cyclically
3.2   Node PXE Booting Fails
3.3   PL Node Does Not Start Installation from SC By PXE
3.4   IPWorks System Data Restore Fails
3.5   IPWorks User Data Restore Fails
3.6   IPWorks Application Component Cannot Start
3.7   IPWorks Application Component Patch Fails
3.8   Node Damaged
3.9   Blade is Damaged
3.10  IPWorks Environment File, Images, and YAML File Missing
4     Recovery Procedures
4.1   Node Reboots Cyclically
4.2   Node PXE Booting Fails
4.3   PL Node Does Not Start Installation from SC By PXE
4.4   IPWorks System Data Restore Fails
4.5   IPWorks User Data Restore Fails
4.6   IPWorks Application Component Cannot Start
4.7   IPWorks Application Component Patch Fails
4.8   Hard Reboot Instance
4.9   Blade Replacement
4.10  IPWorks Environment File, Images, and YAML File Missing
5     Problem Reporting
5.1   Problem Solved
5.2   Consult Next Level of Support

Reference List

1   Introduction

This document gives an overview of the emergency recovery tasks to be performed on the IPWorks that is deployed on the Ericsson Cloud Execution Environment (CEE) system.

Typically, an emergency procedure is required for conditions that make communication or normal management and alarm handling impossible. In a worst-case scenario, a procedure is required to restore the product. Emergency in this document refers to the situations described in Section 3 Scenarios for Recovery.

Scope

This document focuses on the hardware, platform, and application recovery. The network level recovery is out of the scope.

The system is assumed to have been in a fully working state before the problems started. Therefore no troubleshooting procedures that relate to faulty configuration or incorrect software version or hardware version, or both, are explained. For this type of information, refer to IPWorks Configuration Management and IPWorks Troubleshooting Guideline.

Some steps that have been identified as risky from an In-Service Performance (ISP) point of view are avoided in this document. When such steps are necessary, it is recommended to contact the next level of support, see Section 5.2 Consult Next Level of Support. Thus, at least two levels of support are involved before making a risky decision.

The recovery actions described in the recovery procedures are expected to be executed by Ericsson local or global support organizations, or both.

Note:  
Problematic situations and all the recovery actions that have been taken should be carefully documented.

Target Groups

This document is intended for telecommunication technicians authorized to perform emergency recovery procedures on Ericsson IPWorks systems.

1.1   Prerequisites

This section states the prerequisites for performing the emergency recovery procedures.

1.1.1   Personnel

The personnel performing the emergency recovery procedure must have solid knowledge of and training in the following areas:

1.1.2   Documents

Before starting this procedure, ensure that the following information or documents are available:

1.1.3   Tools

The following tools are required:

In addition, verify that all network, hardware, and cables are free of faults.

1.1.4   Access

The following access information is required for both on-site and remote access:

Note:  
  • Ensure that IP connectivity between the IPWorks and the management terminal has been correctly established before attempting to perform emergency handling procedures.
  • Ensure that the console port connection between the IPWorks nodes and the management terminal has been correctly established before attempting to perform emergency handling procedures.
  • For cloud emergency recovery, refer to Emergency Recovery Procedure.

1.2   Related Information

Definition and explanation of acronyms and terminology, trademark information, and typographic conventions can be found in the following documents:

2   Emergency Definitions

Emergency in this document refers to the situation when a loss of service occurs in an IPWorks application.

Refer to the following documents to handle situations where some processes cannot be started, or where redundancy is affected but traffic and provisioning can still continue:

3   Scenarios for Recovery

This section describes different scenarios from which the system must recover, as shown in Table 1.

Table 1    Scenarios for Recovery

  • Node Reboots Cyclically
    Symptom: A node keeps rebooting cyclically.
    See Section 3.1.

  • Node PXE Booting Fails
    Symptom: DHCP or TFTP fails when performing a PXE boot for the system.
    See Section 3.2.

  • PL Node Does Not Start Installation from SC By PXE
    Symptom: A PL node has successfully booted and connected to an SC node by PXE, but does not start installation. Also, the PL node keeps returning an error message.
    See Section 3.3.

  • IPWorks System Data Restore Fails
    Symptom: IPWorks System Data restore fails.
    See Section 3.4.

  • IPWorks User Data Restore Fails
    Symptom: IPWorks User Data restore fails.
    See Section 3.5.

  • IPWorks Application Component Cannot Start
    Symptom: An IPWorks application component cannot be started by using ipw-ctr.
    See Section 3.6.

  • IPWorks Application Component Patch Fails
    Symptom: An IPWorks application component patch fails to install.
    See Section 3.7.

  • Node Damaged
    Symptom: An IPWorks VNF node is damaged unexpectedly.
    See Section 3.8.

  • Blade is Damaged
    Symptom: A BSP GEP blade is damaged unexpectedly.
    See Section 3.9.

  • IPWorks Environment File, Images, and YAML File Missing
    Symptom: The IPWorks environment file, images, and YAML file are missing.
    See Section 3.10.

3.1   Node Reboots Cyclically

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

A node keeps rebooting cyclically.

Possible Reasons:

The possible reasons are the following:


  • The cluster.conf file contains a wrong configuration.

  • The evip.xml file contains a wrong configuration.

  • A component (for example, COM, Storage Server, ENUM, or the DNS service) cannot start up, and the failure escalates to a node reboot in an attempt to recover it.

  • The /dev/drbd0 device is lost.

  • The BIOS boot order is incorrect.

  • No free memory left.

  • No disk space left.

Recovery procedures:

Section 4.1 Node Reboots Cyclically

Risks:

If the /dev/drbd0 device is lost, reinstallation of the whole system might be necessary.

Duration:

  • Recover the cluster.conf file and then reboot the system: about 30 minutes.

  • Reboot the system from the GRUB boot loader: about 20 minutes.

  • If the /dev/drbd0 device is lost, the duration depends on the actions needed to resolve the issue.

Expected outcome:

The system boots correctly with no cyclic reboot.

3.2   Node PXE Booting Fails

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

When trying to PXE boot the system, a node cannot boot from the network:


1. DHCP fails.

2. The DHCP connection is successful, but the boot image cannot be downloaded.

Recovery procedures:

Section 4.2 Node PXE Booting Fails

Risks:

Not applicable.

Duration:

About 20 minutes.

Expected outcome:

The system PXE boots successfully with no error message displayed.

3.3   PL Node Does Not Start Installation from SC By PXE

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

A PL node has successfully booted and connected to an SC node by PXE, but does not start installation.

Possible Reasons:

The possible reasons are the following:


  • The symbolic boot link on the SC node that is dedicated to this PL node, used at booting, is invalid.

  • The DHCP service does not start properly.

  • The backplane port is disabled.

Recovery procedures:

Section 4.3 PL Node Does Not Start Installation from SC By PXE

Risks:

Not applicable

Duration:

About 30 minutes.

Expected outcome:

The PL node successfully starts installation.

3.4   IPWorks System Data Restore Fails

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

IPWorks system data restore fails.

Possible Reasons:

The possible reasons are the following:


  • The system backup file is corrupted.

  • The system backup was taken from an abnormal system.

Recovery procedures:

Section 4.4 IPWorks System Data Restore Fails. Before applying this recovery, contact the next level of support first (see Section 5.2).

Risks:

The restore can fail for unexpected reasons.

Duration:

About 30 minutes for the system restore alone. The total duration depends on the following:


  • Whether the IPWorks User Data backup file includes NDB data

  • The data size in NDB

Expected outcome:

The restore operation returns "PERMIT_PHASE is completed". For details, refer to Restore Backup.

3.5   IPWorks User Data Restore Fails

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

IPWorks User Data restore fails. ECLI returns error message for the restore operation.

Possible Reasons:

The possible reasons are the following:


  • The IPWorks backup file is corrupted.

  • MySQL NDB processes are not running.

  • Unknown system problem.

Recovery procedures:

Section 4.5 IPWorks User Data Restore Fails. Before applying this recovery, contact the next level of support first (see Section 5.2).

Risks:

The restore can fail for unexpected reasons.

Duration:

About 30 minutes.

Expected outcome:

The restore operation returns "PERMIT_PHASE is completed". For details, refer to Restore Backup.

3.6   IPWorks Application Component Cannot Start

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

The AMF SI Unassigned alarm related to Storage Server, DNS, ENUM, and so on.


For more information on checking active alarms in the system, refer to Check Alarm Status.

Notification/Event:

Not applicable

Symptom:

The command ipw-ctr start [comp] [<hostname>] cannot start an IPWorks application component.


The following command does not return empty output:


#ipw-ctr status [comp] [<hostname>] | grep saAmfSUPresenceState | grep FAIL

Possible Reasons:

 

Recovery procedures:

Section 4.6 IPWorks Application Component Cannot Start

Risks:

Not applicable

Duration:

About 30 minutes.

Expected outcome:

The IPWorks application is working properly.

3.7   IPWorks Application Component Patch Fails

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

An IPWorks application component patch fails and cannot be rolled back.

Possible Reasons:

The possible reasons are the following:


  • The IPWorks patch fails to install and fails to be rolled back.

  • The IPWorks patch causes another serious problem and must be removed.

Recovery procedures:

Section 4.7 IPWorks Application Component Patch Fails

Risks:

The restore can fail for unexpected reasons.

Duration:

About 30 minutes for the system restore alone. The total duration depends on the following:


  • Whether the IPWorks User Data backup includes NDB data

  • The data size in NDB

Expected outcome:

IPWorks is determined to be healthy, and the patched application can provide service.

3.8   Node Damaged

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

The IPWorks VNF node cannot be accessed from the console (without console port hardening), and the node cannot be accessed by SSH from a healthy node (such as an SC or PL through the internal network) or from the external network. The command "tipc-config -n" on a healthy node shows that the node is down.

Possible Reasons:

Not applicable

Recovery procedures:

Section 4.8 Hard Reboot Instance

Risks:

Not applicable

Duration:

About 1 hour.

Expected outcome:

The new VNF node can power on and boot up, and "tipc-config -n" can find it.

3.9   Blade is Damaged

Hardware Platform:

BSP GEP

Operating System:

CEE

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

The GEP blade cannot be accessed by console (without console port hardening) and cannot be recovered by a reboot or power cycle.

Possible Reasons:

Not applicable

Recovery procedures:

Section 4.9

Risks:

Not applicable

Duration:

Not applicable

Expected outcome:

The replacement GEP blade can power on and boot up.

3.10   IPWorks Environment File, Images, and YAML File Missing

Hardware Platform:

BSP CEE

Operating System:

SUSE Linux Enterprise Server 12 (x86_64)

Alarm:

Not applicable

Notification/Event:

Not applicable

Symptom:

The IPWorks environment file, images, and YAML file are missing.

Possible Reasons:

Not applicable

Recovery procedures:

Section 4.10 IPWorks Environment File, Images, and YAML File Missing

Risks:

Not applicable

Duration:

Not applicable

Expected outcome:

The missing files are recovered.

4   Recovery Procedures

The procedures in this section describe how to find and resolve the faults that can cause an IPWorks emergency situation in the various scenarios.

The execution of the emergency recovery procedure follows the workflow as described in the following steps and shown in Figure 1:

  1. Identify the problem type best matching the problem experienced.
  2. Identify the recovery scenario best matching the problem experienced.
  3. Execute recovery actions in increasing order of severity.
  4. If recovery is successful, take preventive actions to prevent the problem from reoccurring.

Figure 1   Workflow

4.1   Node Reboots Cyclically

Perform the recovery procedure according to the following scenario:

  1. If cluster.conf contains an invalid configuration, compare the current cluster.conf with the template and make sure all network information is correct. Then modify cluster.conf in maintenance mode, see Section 4.1.2 Rebooting the System from GRUB Boot Loader.
  2. If evip.xml contains an invalid configuration, modify /cluster/storage/system/config/evip-apr9010467/evip.xml by following the IP plan, and then validate it against the schema by executing the command xmllint --schema /opt/vip/etc/evipconf.xsd /cluster/storage/system/config/evip-apr9010467/evip.xml. The command prints the whole evip.xml body if no error is found. This issue occurs only on PL nodes, because SC nodes do not use the eVIP function.
  3. If an application keeps restarting and escalating to a node reboot, follow the steps below:
    1. Investigate which application causes the node reboot.
    2. Stop the application from a node that is not cyclically rebooting, see Section 4.1.1 Stop Application Causing Node Cyclic Reboot.
  4. If the /dev/drbd0 device is lost, check the DRBD configuration with cat /proc/drbd, and execute the command "drbd-overview" to get the DRBD overview status. A power cycle can be performed to try to recover it; see Section 4.8 Hard Reboot Instance for details. If the problem is not resolved, see Section 5.2 Consult Next Level of Support. Reinstallation and IPWorks restoration might be necessary.
  5. If no free memory is left, after the system reboots, check the system memory usage with the top command to find which process occupies excessive memory, and then stop it. Refer to the Section Checking CPU and Memory in IPWorks Manual Health Check.
  6. If no disk space is left, after the system reboots, check the system disk usage from the rebooting node's console terminal. Refer to the Section Checking Disk Usage in IPWorks Manual Health Check.
  7. In all other cases, go to Section 4.1.2 Rebooting the System from GRUB Boot Loader, and then use OS maintenance mode to find what is wrong, such as disk usage, disk labels, and the cluster.conf configuration.
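Steps 5 and 6 can be condensed into a quick triage sketch. The 90% threshold and the helper names are illustrative assumptions, not product-mandated values:

```shell
# Hedged sketch for steps 5 and 6: flag low disk space or free memory
# after the node comes back up. The 90% threshold is illustrative.
disk_usage_pct() {
    # Print the Use% column of df(1) for the given mount point.
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

mem_usage_pct() {
    # Print the used-memory percentage computed from free(1).
    free | awk '/^Mem:/ { printf "%d\n", $3 / $2 * 100 }'
}

if [ "$(disk_usage_pct /)" -gt 90 ]; then
    echo "disk almost full: check usage with du before the next reboot"
fi
if [ "$(mem_usage_pct)" -gt 90 ]; then
    echo "memory almost full: find the offending process with top"
fi
```

If either message appears, continue with the Checking Disk Usage or Checking CPU and Memory sections referenced above.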

4.1.1   Stop Application Causing Node Cyclic Reboot

An IPWorks application can fail to restart and then escalate to a node reboot to try to recover the system automatically. If the failure persists, the node reboots again and again. When cyclic node reboot occurs, first check the node-related logs to find the possible cause.

  1. Check console terminal output.

    Check the console terminal output for any abnormal, fail, or error messages. Check the status of the IPWorks applications, such as Storage Server, DNS, ENUM, DNS SM, and AAA SM, on the SC or PL.

  2. Check the reboot node messages log.

    The SC messages log is /var/log/<SC-ID>/messages. If only one SC reboots cyclically, check the log content on the other healthy SC, because the messages log can be accessed from both SCs. The PL messages log can also be accessed on both SC nodes at /var/log/<PL-ID>/messages.

    Note:  
    <SC-ID> can be SC-1 or SC-2; <PL-ID> can be PL-3, PL-4, and so on.

  3. Search failed application information.

    Search for restart, reboot, recovery, and escalate in the messages log to find which IPWorks component triggered the node reboot.

  4. Stop application which causes cyclic reboot.

    The ipw-ctr command can be executed on any IPWorks cluster node (both SCs and both PLs). Execute the command to stop the application from a healthy node. The command can also be executed from the console terminal.

    #ipw-ctr stop <IPW application> <hostname>

    For example:

    • Stop Storage Server in SC-1:

      #ipw-ctr stop ss SC-1

    • Stop ENUM/DNS in PL-3:

      #ipw-ctr stop enum PL-3;ipw-ctr stop dns PL-3

    • Stop AAA Diameter in PL-3:

      #ipw-ctr stop aaa_diameter PL-3

  5. Troubleshoot why the application fails to start by checking application log.

    For Storage Server in SC:

    Refer to the Section Failed to Stop/Start/Restart Storage Server by ipw-ctr in IPWorks Troubleshooting Guideline.

    For DNS Server in PL:

    >dn ManagedElement=1,IpworksFunction=1,IpworksDnsRoot=1,DnsServer=1,BindService=1,DnsLog=1 
    >configure
    (config-DnsLog=1)>level=DNS_LOG_LEVEL_DEBUG

    The log can be found in /cluster/storage/no-backup/ipworks/logs/<PL-ID>.

    Refer to the Section DNS Server Fails to Start after System Boot in IPWorks Troubleshooting Guideline.

    For ENUM Server in PL:

    Modify log level in ECLI:

    > dn ManagedElement=1,IpworksFunction=1,IpworksDnsRoot=1,IpworksEnumRoot=1,EnumServer=1,Log=1
    >configure
    (config-Log=1)>level=LOG_LEVEL_TRACE
    

    Refer to the Section Failed to Stop/Start/Restart ENUM Server by ipw-ctr in IPWorks Troubleshooting Guideline.

    For AAA Server in PL:

    Modify log level in ECLI (use PL-3 as example):

    >ManagedElement=1,IpworksFunction=1,IPWorksAAARoot=1,IPWorksAAACommonRoot=1,AAAServer=PL-3,LogManagement=1,IPWorksLog=AAA_DIAMETER_SERVER

    (IPWorksLog=AAA_DIAMETER_SERVER)>configure

    (config-IPWorksLog=AAA_DIAMETER_SERVER)> level=LOG_LEVEL_DEBUG

    The log can be found in /cluster/storage/no-backup/ipworks/logs/<PL-X>.

    Refer to Section AAA Server in IPWorks Troubleshooting Guideline.
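The log search in steps 2 and 3 above can be sketched as follows. scan_reboot_cause is an illustrative helper name, not a product tool, and the log path must be adjusted to the actual <SC-ID> or <PL-ID>:

```shell
# Hedged sketch of step 3: search a node's messages log for the keywords
# that point to the component escalating to a node reboot.
scan_reboot_cause() {
    # $1: path to /var/log/<node-id>/messages on a healthy SC
    grep -nE 'restart|reboot|recovery|escalate' "$1" | tail -n 20
}

# Example (adjust the node ID):
#   scan_reboot_cause /var/log/SC-1/messages
```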

4.1.2   Rebooting the System from GRUB Boot Loader

Rebooting the system from the GRUB boot loader is necessary in the following scenarios:

To reboot the system from the GRUB boot loader, follow the steps below:

  1. Go to the GRUB boot loader and initiate booting the system.
  2. Press any key when the following text is displayed:

    Press any key to continue.

  3. When the GRUB boot menu is displayed, select Maintenance mode (Serial console).
    Note:  
    To see the GRUB boot menu, a serial console must be attached to the machine.

    Note:  
    Do not select Maintenance mode (VGA console), because it hangs.

  4. Log on to the system as root in maintenance mode.

    The password for logging in as root is rootroot.

    [ OK ] Started LSB: Early LDE configuration.
    [ OK ] Reached target Rescue Mode.
    Welcome to rescue mode.
    Give root password for maintenance (or press Control-D to continue):
    linux:~#

  5. Execute the following command:

    # cluster config --create-devices

  6. Execute the following command:

    # swapon /dev/part_swap

  7. Mount /boot using the following command:

    # mount -t ext3 -o data=journal,commit=1 /dev/part_boot /boot

  8. Update the cached version of the cluster.conf file with the new input by entering the following:

    # vi /boot/.cluster.conf

  9. Reboot the system.

    #reboot

    Note:  
    When the system has rebooted, update the original cluster.conf file in /cluster/etc with the new input as in Step 8.

    In case the problem was that a node kept rebooting cyclically, go to Step 11. If the problem was that PXE booting the system failed, go to Step 10.


  10. Initiate PXE booting for the system again.
  11. When the procedure is completed, do the following:

4.2   Node PXE Booting Fails

Perform the following recovery procedure when PL node PXE booting fails.

Check whether the ipw_lde_sp network configuration in BSP is correct.

4.3   PL Node Does Not Start Installation from SC By PXE

Perform the recovery procedure according to the following scenario:

4.3.1   Creating a New Symbolic Boot Link to PL Node

To create a new symbolic boot link to a PL node that does not start installation after connecting to an SC node, follow the steps below:

  1. Go to the following directory on one of the SC nodes:

    # cd /cluster/nodes/<PL_id>/

    The variable <PL_id> refers to the PL node that does not start installation after connecting to the SC node.

    For example, if PL3 is the node that does not start installation after connecting to the SC node, execute the following command:

    # cd /cluster/nodes/3/

  2. Delete the symbolic boot link dedicated to the PL node:

    # rm boot

    Example boot link:

    boot -> ../.sw/linux-payload-R3B02-0/fa2a5eab751fa45fe91b4417e59cab5e

  3. Go to the directory of another PL node, for example, PL-4:

    # cd /cluster/nodes/4/

  4. List the boot link for this PL node (PL-4):

    # ls -l boot

    Example output:

    lrwxrwxrwx 1 root root 61 Aug 23 11:46 boot -> 
    ../.sw/linux-payload-R3B02-
    0/fa2a5eab751fa45fe91b4417e59cab5e
    

  5. Go back to the directory of the PL node that does not start installation (PL-3):

    # cd /cluster/nodes/3/

  6. Create a new symbolic boot link using the path for the other PL node (PL-4), for example:

    # ln -s ../.sw/linux-payload-R3B02-0/fa2a5eab751fa45fe91b4417e59cab5e boot

  7. When the procedure is completed, do the following:
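Steps 1 to 6 above can be condensed into the following sketch. recreate_boot_link is an illustrative helper, not part of the product, taking the nodes directory and the broken and healthy PL IDs:

```shell
# Hedged sketch of steps 1-6: recreate the boot link of a PL node that
# does not start installation, by copying the link target from a
# healthy PL node.
recreate_boot_link() {
    # $1: nodes directory (normally /cluster/nodes)
    # $2: ID of the PL node that does not start installation (e.g. 3)
    # $3: ID of a healthy PL node (e.g. 4)
    target=$(readlink "$1/$3/boot") || return 1
    rm -f "$1/$2/boot"
    ln -s "$target" "$1/$2/boot"
    ls -l "$1/$2/boot"    # verify the new link
}

# Example: recreate_boot_link /cluster/nodes 3 4
```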

4.3.2   Restart DHCP Server

Execute the following command on one SC node to restart DHCP server:

# systemctl restart dhcpd.service

Execute the following command to check the DHCP status:

# systemctl status dhcpd.service

4.3.3   Check IPW_INT_SP Connection

Check the SC /var/log/messages to see whether there are any DHCP and TFTP log entries when the PL node tries to boot by using PXE. If there are no such entries, the network connection in IPW_INT_SP is broken. In that case, the L2 connection generated by the cloud BSP plug-in is broken; contact the cloud administrator for support.
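The log check can be sketched as follows. pxe_activity is an illustrative helper name, and the dhcpd/tftp patterns assume standard daemon log tags:

```shell
# Hedged sketch: count DHCP/TFTP log entries seen on the SC while the
# PL node attempts a PXE boot. A count of 0 suggests the IPW_INT_SP
# L2 connection is broken.
pxe_activity() {
    # $1: path to the SC /var/log/messages
    grep -ciE 'dhcpd|tftp' "$1"
}

# Example: pxe_activity /var/log/messages
```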

4.4   IPWorks System Data Restore Fails

Prerequisites:

Before the IPWorks System Data restore, periodic System Data backups and User Data backups (with or without NDB data) must have been performed, so that several backup files exist before the restore operation.

Perform the recovery procedure according to the following scenario:

4.4.1   Perform System Data Restore By Selecting Another Backup File

  1. Select another system backup file and then perform the restore operation again. Refer to Restore Backup for details.

    It is recommended to restore the system backup file that is generated just after the whole IPWorks installation and initial configuration.

  2. Start MySQL NDB Cluster manually. Refer to Configure MySQL NDB Cluster for details.
    Note:  
    IPWorks MySQL NDB does not start automatically after IPWorks System Data restore.

4.5   IPWorks User Data Restore Fails

Prerequisites:

Before the IPWorks User Data restore, periodic System Data backups and User Data backups (with or without NDB data) must have been performed, so that several backup files exist before the restore operation.

Perform the recovery procedure according to the following scenario:

4.5.1   Recover MySQL NDB for IPWorks User Data Restore

Recovering MySQL NDB for IPWorks User Data restore is necessary in the following scenarios:

To start the IPWorks User Data restore, all MySQL NDB nodes on both SCs must start up first. If MySQL NDB has crashed and you want to restore IPWorks MySQL NDB data, do the following to recover MySQL NDB to the "running" status first:

  1. Check MySQL NDB status.

    <SC hostname>:~ # /etc/init.d/ipworks.mysql show-status

    Connected to Management Server at: localhost:1186
    Cluster Configuration
    ---------------------
    [ndbd(NDB)]     2 node(s)
    id=27   @169.254.101.1  (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0, *)
    id=28   @169.254.101.2  (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0, Master)
    
    [ndb_mgmd(MGM)] 2 node(s)
    id=1    @169.254.101.1  (mysql-5.6.31 ndb-7.4.12)
    id=2    @169.254.101.2  (mysql-5.6.31 ndb-7.4.12)
    
    [mysqld(API)]   24 node(s)
    id=3    @169.254.101.1  (mysql-5.6.31 ndb-7.4.12)
    id=4 (not connected, accepting connect from SC-2)
    id=5 (not connected, accepting connect from any host)
    id=6 (not connected, accepting connect from any host)
    id=7 (not connected, accepting connect from any host)
    id=8 (not connected, accepting connect from any host)
    id=9 (not connected, accepting connect from any host)
    id=10 (not connected, accepting connect from any host)
    id=11 (not connected, accepting connect from any host)
    id=12 (not connected, accepting connect from any host)
    id=13 (not connected, accepting connect from any host)
    id=14 (not connected, accepting connect from any host)
    id=15 (not connected, accepting connect from any host)
    id=16 (not connected, accepting connect from any host)
    id=17 (not connected, accepting connect from any host)
    id=18 (not connected, accepting connect from any host)
    id=19 (not connected, accepting connect from any host)
    id=20 (not connected, accepting connect from any host)
    id=21 (not connected, accepting connect from any host)
    id=22 (not connected, accepting connect from any host)
    id=23 (not connected, accepting connect from any host)
    id=24 (not connected, accepting connect from any host)
    id=25 (not connected, accepting connect from any host)
    id=26 (not connected, accepting connect from any host)
    

    This example shows that all the MySQL NDB nodes are running. If any node is not running, try to restart it. Refer to the Section MySQL NDB Cluster in IPWorks Troubleshooting Guideline for details.

    • If any of the MySQL NDB nodes (Management Node, Data Node, SQL Node) still cannot be started in one SC, go to Step 2.
    • If MySQL NDB nodes in both SC still cannot be started, go to Step 3.
  2. Recover MySQL NDB in one SC.
    1. Log on to SC where the MySQL NDB nodes cannot start. Meanwhile MySQL NDB nodes in another SC are running.
    2. Stop Storage Server in the SC.

      # ipw-ctr stop ss

      If Storage Server processes are started in both SCs in Active-Standby mode and Storage Server is "Active" in this SC, stopping it activates Storage Server in the other SC.

    3. Stop all MySQL NDB nodes in this SC that have the problem.

      # /etc/init.d/ipworks.mysql stop

    4. Recover MySQL NDB in this SC.

      # /etc/init.d/ipworks.mysql recover

      After this step, the MySQL NDB in the problematic SC shall synchronize with MySQL NDB in another healthy SC and recover the status.

    If MySQL NDB can start in the previously problematic SC, you can continue to perform the IPWorks User Data restore that includes NDB data. Refer to Restore Backup for details.

    If the problem is not resolved, perform Step 3 to recover the whole NDB cluster in both SCs.

  3. Recover MySQL NDB in both SCs.

    Refer to the Section MySQL NDB Cluster Cannot Work Normally in IPWorks Troubleshooting Guideline to recover whole MySQL NDB cluster to initialization status. You can continue to perform IPWorks User Data restore that includes NDB data running in both SCs.

4.6   IPWorks Application Component Cannot Start

If an application is in a failure state, use the command ipw-ctr repaired [comp] [<hostname>] to recover it. If the problem cannot be resolved, go to Section 5.2 Consult Next Level of Support. Execute the following recovery procedures if the next level of support requests them:

  1. Reboot the node to recover it, go to Section 4.6.1 Rebooting PL Node.
  2. Perform power cycle to recover the SC node when SC OS is available, go to Section 4.6.2 SC Node Power Cycle Graceful.
  3. Perform health check. Refer to IPWorks Manual Health Check for details.
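The failure check from Section 3.6 and the ipw-ctr repaired recovery above can be combined into a sketch. failed_components is an illustrative helper, and the saAmfSUPresenceState output format is assumed from the grep pipeline shown in Section 3.6:

```shell
# Hedged sketch: list components whose AMF presence state is FAILED,
# from a captured "ipw-ctr status" output file (format assumed from
# the pipeline in Section 3.6).
failed_components() {
    # $1: file holding the output of: ipw-ctr status [comp] [<hostname>]
    grep saAmfSUPresenceState "$1" | grep FAIL
}

# For each failed component reported, the text above suggests:
#   ipw-ctr repaired <comp> <hostname>
```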

4.6.1   Rebooting PL Node

If the IPWorks application cannot start up for an unknown reason, rebooting the node can help.

To reboot the SC or PL node, follow the steps below:

  1. Reboot the PL node:

    # lde-reboot -n <node id>

    For example:

    # lde-reboot -n 3

    This command reboots PL-3.

  2. When the procedure is completed, do the following:

4.6.2   SC Node Power Cycle Graceful

Normally, use the lde-reboot -n <node id> command instead of powering the SC node off and then on (power cycle) after IPWorks installation. If you must perform a node power cycle while the node OS is still available, follow the steps below.

Note:  
If the node OS is unavailable, only perform Step 4.

  1. Log on to the SC that will be powered off.

    #ssh root@<SC IP>

    Enter root password.

  2. Stop Storage Server.

    # ipw-ctr stop ss

  3. Stop MySQL NDB nodes (Management Node, Data Node, and SQL Node).

    # /etc/init.d/ipworks.mysql stop

  4. Perform power cycle for node.

    Perform Section 4.8 Hard Reboot Instance in Atlas GUI or CLI.

  5. Check MySQL NDB status after SC node startup.

    # /etc/init.d/ipworks.mysql show-status

    If there is any issue in MySQL NDB, refer to the Section MySQL NDB Cluster in IPWorks Troubleshooting Guideline.

  6. Start Storage Server.

    # ipw-ctr start ss

    For a PL node, this graceful power cycle is unnecessary.

4.7   IPWorks Application Component Patch Fails

Prerequisites:

Before the IPWorks patching, periodic System Data backups and User Data backups (with or without NDB data) must have been performed, so that several backup files exist before the patching.

Perform the following recovery procedures:

  1. Restore System Data backup.

    User shall not cancel the system restore. Refer to Restore Backup for details.

    Note:  
    System Data restore only reinstalls the OS and IPWorks application components; it does not remove MySQL NDB data.

  2. Perform IPWorks User Data restore that includes MySQL NDB data. Refer to Restore Backup for details.
    Note:  
    • You shall not perform any cancel operation during this system restore.
    • You shall not start or stop any services during the system restore.
    • You shall not reboot or power cycle any node during the system restore.

    Wait for the system restore to finish before performing any further actions.


  3. Manually start MySQL NDB cluster in both SCs.

    In case the IPWorks User Data backup file includes NDB data, make sure that the MySQL NDB cluster is running in both SCs before the restore. Otherwise, the restore fails. Refer to the Section MySQL NDB Status in IPWorks Troubleshooting Guideline to start the MySQL NDB cluster. If the restore fails, go to Section 4.5.1 Recover MySQL NDB for IPWorks User Data Restore.

  4. In case the IPWorks User Data backup file does not include MySQL NDB data, the restore does not impact the existing NDB data. You shall not cancel the User Data restore.
  5. Perform health check. Refer to IPWorks Manual Health Check for details.

4.8   Hard Reboot Instance

Use one of the methods to hard reboot instance (node):

4.8.1   By Using Atlas GUI

Log on to the Atlas GUI and go to the Project view. Select Instances in the left Compute panel; then, in the right panel, search for the instance name, click the Actions drop-down list, and click Hard Reboot Instance. A hard reboot power cycles the instance.

Figure 2   Hard Reboot Instance

4.8.2   By Using Atlas CLI

You can also execute the operation from the Atlas CLI or the CIC CLI.

$nova reboot --hard <instance-id or instance-name>

The instance-id or instance-name can be found by using the command:

$nova list

Example:

$nova reboot --hard ipw6a_SC-1

$nova reboot --hard 2212c933-71aa-4b92-b53f-3e0081946203
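For repeated use, the hard reboot can be wrapped in a small helper. The sketch below is illustrative only: NOVA_CMD defaults to echo so it is a dry run that only prints the command; set NOVA_CMD=nova in a sourced Atlas or CIC CLI session to execute it for real. The instance name is taken from the example above.

```shell
# Dry-run helper sketch (assumption: nova CLI available on Atlas/CIC).
# NOVA_CMD defaults to "echo nova" so the command is only printed.
NOVA_CMD=${NOVA_CMD:-echo nova}

hard_reboot_instance() {
    # "nova reboot --hard" power cycles the instance.
    $NOVA_CMD reboot --hard "$1"
}

cmd=$(hard_reboot_instance ipw6a_SC-1)
echo "$cmd"
```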

4.9   Blade Replacement

This section describes how to replace and recover a GEP blade (SC or PL node blade) when the blade is damaged.

4.9.1   Duration for Blade Replacement

Table 2 lists the estimated time for each part of the blade replacement.

Table 2    Estimated Time for Blade Replacement

Replacement Area    | Estimated Time (min) | Replacement Period
--------------------|----------------------|------------------------
CEE                 | 60~80                | Server replacement
IPWorks (SC node)   | 5~10(1)              | Storage Server recovery
                    | 5~10                 | DRBD synchronization
                    | 10(2)                | MySQL NDB recovery
IPWorks (PL node)   | 5~10(3)              | PL recovery

(1)  This duration is from "heat stack-update is executed" to "Storage Server is running".

(2)  This operation takes several minutes; the actual duration depends on the data size in the MySQL NDB database.

(3)  This duration covers the normal operation only; it does not include the extra time needed to load a large amount of data when the service starts.


4.9.2   Replacing GEP Blade

To replace the GEP blade physically and recover the system successfully, perform the following steps:

Replace the server

  1. Replace the server in the CEE.

    For details, refer to Server Replacement.

Note:  
To avoid node recovery failure, perform the following operations before executing the expandcee command:
  1. Navigate to the /tmp directory on the Fuel node.

    cd /tmp/

  2. Download the network file.

    fuel --env 1 network --download

  3. Modify the downloaded file (network_1.yaml).

    Remove the NIC part for the compute node. For example:

          compute-0-11:
            if1: eth0
            if2: eth1
            if3: eth6
            if4: eth2
            if5: eth7
            if6: eth5
    

  4. Upload the modified file.

    fuel --env 1 network --upload
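The manual edit in step 3 can also be scripted. The following sketch is illustrative only: it operates on a small sample fragment standing in for the real network_1.yaml, and assumes the node's NIC mapping is a two-space-indented key followed by four-space-indented interface lines, as in the example above.

```shell
# Illustrative sketch: strip the NIC block of one compute node from the
# downloaded network file. The sample data stands in for network_1.yaml.
NODE=compute-0-11
cat > network_1.yaml <<'EOF'
  compute-0-10:
    if1: eth0
  compute-0-11:
    if1: eth0
    if2: eth1
    if3: eth6
  compute-0-12:
    if1: eth0
EOF

# Skip the node's key line, then every indented interface line under it.
awk -v node="$NODE" '
    $0 ~ "^  " node ":" { skip = 1; next }
    skip && /^    /     { next }
    { skip = 0; print }
' network_1.yaml > network_1.yaml.tmp && mv network_1.yaml.tmp network_1.yaml
```

Review the resulting file manually before uploading it, since the indentation assumption may not hold for every deployment.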


Inspect the lost node

  1. Log in to a healthy SC node, then execute command:

    SC-<x>:~ # tipc-config -n

  2. Confirm the lost node type (SC or PL) from the command output. For example:

    SC-1:~ # tipc-config -n
    Neighbors:
    <1.1.2>: up
    <1.1.3>: down
    <1.1.4>: up
    <1.1.5>: up
    <1.1.6>: up
    

    In the example, the IPWorks system includes two SC nodes and four PL nodes. The mapping between the label number and the node number is as follows:

    <1.1.1>: SC-1
    <1.1.2>: SC-2
    <1.1.3>: PL-3
    <1.1.4>: PL-4
    <1.1.5>: PL-5
    <1.1.6>: PL-6
    


    The command output shows that the status of PL-3 is down, which indicates that the lost node is PL-3.

    If no node is shown in down status, the lost node is a PL node. Since a PL node is evacuated to another compute resource based on the ha-offline policy, recovering the PL node is not necessary. To check all the compute resources to which the PL has been evacuated, refer to Step Get the resource name.
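To avoid reading the neighbor list by eye, the down label can be extracted with a one-line filter. The sketch below runs against the sample output shown above; on a live SC, pipe tipc-config -n into the same awk filter.

```shell
# Sample output copied from the example above; on a live SC, use:
#   tipc-config -n | awk '/down/ {print $1}'
tipc_output='Neighbors:
<1.1.2>: up
<1.1.3>: down
<1.1.4>: up
<1.1.5>: up
<1.1.6>: up'

# Print the label of any neighbor reported as down, stripped of <>: marks.
down_label=$(printf '%s\n' "$tipc_output" | awk '/down/ {print $1}' | tr -d '<>:')
echo "down node label: ${down_label:-none}"
```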

Recover the lost node

  1. To recover the PL node, refer to Recover the PL node.
  2. To recover the SC node, refer to Recover the SC node.

Recover the PL node

Use nova rebuild to recover the PL node. During the recovery, the PL boots up from a healthy SC blade and installs all RPM packages automatically.

  1. Connect to Atlas.

    #ssh atlasadm@<ATLAS_VM_IP_ADDRESS>

    atlasadm@atlas:~$ source openrc

  2. Get the relative information of the lost PL node.
    1. Identify the stack name of the IPWorks.

      atlasadm@atlas:~$ openstack stack list

    2. Get the resource name.

      atlasadm@atlas:~$ openstack stack resource list <stack_name>

    3. Get the PL image ID.

      atlasadm@atlas:~$ openstack stack resource show <stack_name> <resource_name> |grep image

      In the following example, the image ID is 3637a421-25ea-41f3-b878-8e716a9fff95.

      atlasadm@atlas:~$ openstack stack resource show sub811_lsv22_0827 ⇒
      ipw_PL-3 |grep image
      |                        |   "image": {                                                                                                                            ⇒
                                      |
      |                        |         "href": "https://⇒
      ipworks.sub8.ctrl.ericsson.se:8774/images/⇒
      3637a421-25ea-41f3-b878-8e716a9fff95",
      

  3. Rebuild a new PL.

    atlasadm@atlas:~$ nova rebuild <server> <image>

    Note:  
    server is the instance ID, which can be obtained with the command:
    nova list | grep <stack_name> | grep <Internal IP>
    Internal IP is the internal IP address of the lost PL node; it can be found in /etc/hosts on a healthy SC for the PL node confirmed in Step 2.

    image is the image ID obtained in Step Get the PL image ID.


    The following is an example of how to get the server ID number.

    atlasadm@atlas:~$ 
    nova list |grep sub811_lsv22_0827 |grep 169.254.100.3
    | 08516d2f-9dae-4c33-9f92-52767c2557ea | sub811_lsv22_0827_⇒
    PL-3 | ACTIVE | - | Running | ⇒ 
    sub811_lsv22_0827_int_sp=169.254.100.3; ⇒
    sub811_lsv22_0827_sig_sp=192.168.15.3; ⇒
    sub811_lsv22_0827_data_sp=192.168.16.3 |

    The following is an example of how to rebuild a new PL.

    atlasadm@atlas:~$ 
    
    nova rebuild 08516d2f-9dae-4c33-9f92-52767c2557ea 3637a421-25ea-41f3-b878-8e716a9fff95

  4. Wait for several minutes, then check the status of the recovered PL.
    1. Log in to a healthy SC.
    2. To check that the status of the recovered PL is up, execute the command:

      SC-<x>:~ # tipc-config -n

    3. To check that the service is running on the recovered PL, execute the command:

      SC-<x>:~ # ipw-ctr status all
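The lookup-and-rebuild steps above can be combined into a single dry-run sketch. The nova list row and the IDs are copied from the examples; NOVA_CMD defaults to echo so nothing is executed, and the awk filter simply takes field 2 of the pipe-separated row.

```shell
# Dry-run sketch of the PL rebuild. Set NOVA_CMD=nova on a real Atlas
# session; here it defaults to echo so the command is only printed.
NOVA_CMD=${NOVA_CMD:-echo nova}
IMAGE_ID=3637a421-25ea-41f3-b878-8e716a9fff95

# Sample `nova list` row for the lost PL (from the example above):
nova_row='| 08516d2f-9dae-4c33-9f92-52767c2557ea | sub811_lsv22_0827_PL-3 | ACTIVE |'

# Field 2 of the pipe-separated row is the server ID.
server_id=$(printf '%s\n' "$nova_row" | awk -F'|' '{gsub(/ /, "", $2); print $2}')

$NOVA_CMD rebuild "$server_id" "$IMAGE_ID"
```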

Recover the SC node

Use heat stack-update to recover the SC with the new SC image and yaml file.

Note:  
The environment file (env_file.yaml) is saved after the IPWorks deployment process.

  1. Connect to Atlas.

    #ssh atlasadm@<ATLAS_VM_IP_ADDRESS>

    atlasadm@atlas:~$ source openrc

  2. Get image sc-pxeboot.qcow2.
    1. Get the IPW1.9 (or later) package from the GW link.
    2. Upload the package to the /tmp directory.
    3. Uncompress the package (shown below), and find the image sc-pxeboot.qcow2 in the images directory.

      atlasadm@atlas:~$ cd /tmp
      atlasadm@atlas:~$ tar -xzvf <package>
      atlasadm@atlas:~$ cd images/
      atlasadm@atlas:~$ ls

  3. Create SC image based on sc-pxeboot.qcow2.

    atlasadm@atlas:~$
    glance image-create --name <sc_pxeboot_image_name> --disk-format qcow2 --container-format bare --file sc-pxeboot.qcow2

    Note:  
    sc_pxeboot_image_name is a specific name used for GEP replacement process.

    The following is an example of how to create a new SC image:

    atlasadm@atlas:~$
    glance image-create --name SC_pxeboot --disk-format qcow2 --container-format bare --file /tmp/images/sc-pxeboot.qcow2

  4. Modify the HOT yaml file.
    1. Get the image ID of the image name (sc_pxeboot_image_name).

      atlasadm@atlas:~$ glance image-list | grep <sc_pxeboot_image_name>

      In the following example, the image ID is 55e4b5d7-3c78-4fce-b4b0-70fc9d4bff94.

      atlasadm@atlas:~$ glance image-list |grep SC_pxeboot
      | 55e4b5d7-3c78-4fce-b4b0-70fc9d4bff94 | SC_pxeboot               |
      

    2. Get the active HOT yaml file.

      atlasadm@atlas:~$ openstack stack template show <stack_id> > ipw_hot_onboarding.yaml

    3. Modify the active HOT yaml file.

      In the yaml file, replace the image param of the damaged resource (for example, ipw_SC-1) with the image ID obtained in Step a. In the following example, the lost node is SC-2:

      ipw_SC-2:
                 ...
                 properties:
                    config_drive: 'True'
                    flavor:
                      get_param: SC_FLAVOR_NAME
                    image:
                        get_param: SC_IMAGE
                 ...

      Replace the image param with the image ID.

      ipw_SC-2:
                 ...
                 properties:
                    config_drive: 'True'
                    flavor:
                      get_param: SC_FLAVOR_NAME
                    image: 55e4b5d7-3c78-4fce-b4b0-70fc9d4bff94
                 ...
      

  5. Update the stack to rebuild the new SC.

    atlasadm@atlas:~$ heat stack-update -f ipw_hot_onboarding.yaml -e <env_file.yaml> <stack_name> --rollback true

  6. Keep checking the status until UPDATE_COMPLETE is shown.

    atlasadm@atlas:~$ openstack stack event list <stack_name>

    UPDATE_COMPLETE means that the recovery is successful.
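Rather than rerunning the command by hand, the check can be looped. In this sketch, stack_events is a stub returning a sample status; on Atlas, replace the function body with the real openstack stack event list <stack_name> call.

```shell
# Polling sketch. The stub below stands in for:
#   openstack stack event list <stack_name>
stack_events() { echo 'UPDATE_COMPLETE'; }   # stub with sample output

until stack_events | grep -q 'UPDATE_COMPLETE'; do
    sleep 30   # poll every 30 seconds
done
echo "stack update finished"
```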

  7. Log in to an SC node and check the DRBD synchronization status.

    SC-<x>:~ # drbd-overview

    DRBD synchronizes automatically. In the following example, the synchronization progress is 3.5%.

    SC-1:~ # drbd-overview
          0:drbd0/0 SyncSource Primary/Secondary UpToDate/⇒
    Inconsistent C r----- lvm-pv: lde-cluster-vg 100.00g 50.06g 
                     [>....................] ⇒
    sync'ed: 3.5% (98916/102400)M         
    

    The following shows that the synchronization is completed.

    SC-x:~ # drbd-overview
      0:drbd0/0  Connected Primary/Secondary UpToDate/UpToDate ⇒
    C r----- lvm-pv: lde-cluster-vg 100.00g 50.06g
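The synchronization percentage can be pulled out of the drbd-overview output for scripted monitoring. The line below is copied from the in-progress example above; the sed pattern assumes the sync'ed: N% format shown there.

```shell
# Sample line from the drbd-overview output above.
drbd_line="sync'ed: 3.5% (98916/102400)M"

# Extract the percentage (assumes the "sync'ed: N%" format shown above).
pct=$(printf '%s\n' "$drbd_line" | sed -n "s/.*sync'ed: *\([0-9.]*\)%.*/\1/p")
echo "DRBD sync progress: ${pct}%"
```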
    

  8. Resize the disk and reboot the SC node.

    # ssh root@<MIP_OAM_IP>

    SC-X:~# ipworks.mysql stop-ndbcluster

    SC-X:~# umount /local/ipworks

    SC-X:~# /opt/ipworks/common/scripts/ipwResizePartition.sh -d /dev/vda -p 7

    Note:  
    If you receive the following error message, the new partition cannot be used immediately; this issue is resolved after the SC reboots.

    Calling ioctl() to re-read partition table.
    Re-reading the partition table failed.: Device or resource busy
    

    SC-X:~# reboot


    After the reboot is done, resize the partition on both SC nodes.

    # ssh root@<MIP_OAM_IP>

    SC-X:~# resize2fs /dev/vda7

    SC-X:~# ipworks.mysql start-ndbcluster

  9. Recover the MySQL NDB in the recovered SC blade.
    1. Log in to the recovered SC.
    2. Stop the Storage Server and MySQL.

      SC-X:~# ipw-ctr stop ss

      SC-X:~# /etc/init.d/ipworks.mysql stop

    3. Recover MySQL NDB.

      SC-X:~# /etc/init.d/ipworks.mysql recover

      After the recovery, MySQL can be logged in to, and all the data in the /local/ipworks directory is restored.

    4. Check the status of all the nodes on SCs.

      SC-X:~# /etc/init.d/ipworks.mysql show-status

      For example:

      SC-2:~ # /etc/init.d/ipworks.mysql show-status
      Connected to Management Server at: localhost:1186
      Cluster Configuration
      ---------------------
      [ndbd(NDB)]     2 node(s)
      id=27   @169.254.100.1  (mysql-5.6.31 ndb-7.4.12, ⇒
      Nodegroup: 0, *)
      id=28   @169.254.100.2  (mysql-5.6.31 ndb-7.4.12, ⇒
      Nodegroup: 0)
      [ndb_mgmd(MGM)] 2 node(s)
      id=1    @169.254.100.1  (mysql-5.6.31 ndb-7.4.12)
      id=2    @169.254.100.2  (mysql-5.6.31 ndb-7.4.12)
      [mysqld(API)]   24 node(s)
      id=3 (not connected, accepting connect from SC-1)
      id=4    @169.254.100.2  (mysql-5.6.31 ndb-7.4.12)
      id=5 (not connected, accepting connect from any host)
      id=6 (not connected, accepting connect from any host)
      id=7 (not connected, accepting connect from any host)
      id=8 (not connected, accepting connect from any host)
      

      Note:  
      The expected result is that at least the nodes with IDs 1, 2, 27, and 28 are started, and that at least one of the nodes with ID 3 or 4 is started.

    5. Restart Storage Server on the new SC.

      SC-X:~# ipw-ctr restart ss

      SC-X:~# ipw-ctr restart sqlnodemgr
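The connected nodes in the show-status output can also be counted mechanically: connected nodes print an @<address>, while disconnected ones print "not connected". The sample below reuses lines from the example output above.

```shell
# Sample lines from the show-status example above.
status='id=27   @169.254.100.1  (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0, *)
id=28   @169.254.100.2  (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0)
id=3 (not connected, accepting connect from SC-1)
id=4    @169.254.100.2  (mysql-5.6.31 ndb-7.4.12)'

# Connected nodes report an @<address>; count those lines.
connected=$(printf '%s\n' "$status" | grep -c '@')
echo "connected nodes: $connected"
```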

4.9.3   Health Check

Perform health check activities, refer to IPWorks Auto Health Check or IPWorks Manual Health Check for details.

4.9.4   Create a Backup

Create a backup after the replacement, refer to Create Backup.

4.10   IPWorks environment file, images and yaml file missing

Perform the following recovery procedures:

  1. Log on to the Atlas server.

    #ssh atlasadm@<Atlas_addr>

  2. Locate the stack name.

    #openstack stack list

  3. Recover IPWorks environment information.

    #openstack stack output show --all <stack_name>

  4. Create a new environment file (such as <stack_name>_env.yaml), and copy ALL the Property and Value information from the output of the previous command into this file according to the following format.

  5. Upload the images and yaml file from the IPWorks VNF package to /home/atlasadm/ipworks, then unpack them. For detailed information, refer to the section Transfer IPWorks VNF Package to Atlas Server in IPWorks Deployment Guide.

5   Problem Reporting

In general, all the described recovery situations must be seen as abnormal and must be reported to the next level of support, or according to another documented procedure such as a log book, even if the recovery has been successful. Often a Customer Service Request (CSR) is written to the responsible support organization.

If the situation has affected the ISP, it must be reported as such according to documented procedure.

In many situations, it is required to perform a Root Cause Analysis (RCA) afterwards to determine the source of the problem. It is therefore important to carefully document the problematic situation and all the recovery steps that have been taken.

Many log files in the system must be saved or copied to another place to prevent them from being overwritten with newer information. It is important that these logs are available for any future RCA.

For information about how to collect data and log files, refer to Data Collection Guideline for IPWorks.

5.1   Problem Solved

The recovery seems to have worked. Keep the site and the affected functions under extra observation for a while to ensure that the fault does not reoccur.

Record the incident according to local procedures using a log book or similar.

5.2   Consult Next Level of Support

Provide the receiving support organization with the following information:


Reference List

Ericsson Documents
[1] Glossary of Terms and Acronyms.
[2] Trademark Information.
[3] Typographic Conventions.
[4] IPWorks Configuration Management.
[5] IPWorks Troubleshooting Guideline.
[6] Data Collection Guideline for IPWorks.
[7] Backup and Restore.
[8] Personal Health and Safety Information.
[9] System Safety Information.
[10] IPWorks Auto Health Check.
[11] IPWorks Manual Health Check.
[12] Configure MySQL NDB Cluster.
[13] Create Backup.
[14] Restore Backup.
[15] Check Alarm Status.
[16] IPWorks Deployment Guide, 21/1553-AVA 901 33/3 Uen
[17] Emergency Recovery Procedure, 2/154 32-AZE 102 01 Uen
[18] Server Replacement, 4/1543-CSA 113 125/4 Uen


Copyright

© Ericsson 2017, 2018. All rights reserved. No part of this document may be reproduced in any form without the written permission of the copyright owner.

Disclaimer

The contents of this document are subject to revision without notice due to continued progress in methodology, design and manufacturing. Ericsson shall have no liability for any error or damage of any kind resulting from the use of this document.

Trademark List
All trademarks mentioned herein are the property of their respective owners. These are shown in the document Trademark Information.
