BRM, Scheduled Backup Failed

1   Alarm Description

The alarm is raised when a scheduled backup has failed.

Table 1    BRM, Scheduled Backup Failed Alarm Causes

Alarm Cause: A scheduled backup has failed.

Description: A scheduled backup event was triggered but failed to create a backup.

Fault Reason                          Fault Location
Insufficient disk space               Local hard disk
Conflict with another ongoing task    Managed Element (ME)
Error reported by participant         Managed Element (ME)
System failover or reboot             Managed Element (ME)

Impact: The Managed Element (ME) cannot be restored to its current state later. Bringing the ME back from an unstable state to a controlled state can then require more effort and can affect service availability. Subsequent scheduled backups can also fail until the fault condition is cleared.

Attention!

Risk of data loss or data corruption.

For Insufficient Disk Space faults, the fault is non-transient: unless the user takes action, all subsequent scheduled backups also fail.

For all other fault reasons, subsequent scheduled backups fail until the fault condition reported in the alarm no longer exists.

This alarm is only cleared after the creation of a scheduled backup of the type (System Data or User Data) that raised the alarm. For example, if the alarm is raised for a failed System Data backup, it can only be cleared when a scheduled System Data backup is successfully created.
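
The clearing rule above can be sketched as a small shell function. This is an illustration only: alarm_is_cleared is a hypothetical helper, not a BRM command, and the SCHEDULED, MANUAL, SUCCESS, and FAILURE labels are assumed names chosen for the example.

```shell
# Sketch only: models the clearing rule described above.
# alarm_is_cleared is a hypothetical helper, not part of BRM.
# Arguments:
#   $1  backup type that raised the alarm (SYSTEM_DATA or USER_DATA)
#   $2  backup type of the backup that was just attempted
#   $3  how the backup was created (SCHEDULED or MANUAL, assumed labels)
#   $4  result of the attempt (SUCCESS or FAILURE, assumed labels)
alarm_is_cleared() {
    alarm_type=$1
    backup_type=$2
    creation=$3
    result=$4
    # Only a successful scheduled backup of the same type clears the alarm.
    if [ "$creation" = "SCHEDULED" ] &&
       [ "$result" = "SUCCESS" ] &&
       [ "$backup_type" = "$alarm_type" ]; then
        echo "cleared"
    else
        echo "still raised"
    fi
}
```

For example, alarm_is_cleared SYSTEM_DATA USER_DATA SCHEDULED SUCCESS prints "still raised", because the successful backup is of a different type than the one that raised the alarm.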

2   Procedure

2.1   Handle Alarm BRM, Scheduled Backup Failed

Prerequisites

Steps

  1. Check the Additional Text attribute of the alarm.
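  2. Select action based on the attribute value:
    • Insufficient disk space: Proceed with Section 2.2, Handle Reason Insufficient Disk Space.
    • Conflict with another ongoing task: Proceed with Section 2.3, Handle Reason BRF Conflict with Other Task.
    • Error reported by participant: Proceed with Section 2.4, Handle Reason Participant Reported Error.
    • System failover or reboot: Proceed with Section 2.5, Handle Reason System Failover or Reboot.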
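
The selection in Step 2 can be sketched as a lookup from the Additional Text to the section that handles it. The function name section_for_additional_text and the keywords it matches are assumptions made for illustration. The only Additional Text wording shown in this instruction is the conflict example in Section 2.3, so the patterns may need adjusting to the text the ME actually reports.

```shell
# Sketch only: section_for_additional_text is a hypothetical helper.
# The keywords below are assumed from the fault reasons in Table 1,
# not taken from a documented Additional Text format.
section_for_additional_text() {
    # Lower-case the text so that matching is case-insensitive.
    text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$text" in
        *"disk space"*)          echo "2.2" ;;  # Insufficient disk space
        *"conflict"*)            echo "2.3" ;;  # Conflict with another task
        *"participant"*)         echo "2.4" ;;  # Error reported by participant
        *"failover"*|*"reboot"*) echo "2.5" ;;  # System failover or reboot
        *)                       echo "unknown" ;;
    esac
}
```

If the text matches none of the assumed keywords, the function prints "unknown", in which case the Additional Text must be read manually.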

2.2   Handle Reason Insufficient Disk Space

Steps

  1. Does this alarm occur every time a scheduled backup takes place?

    Yes: Continue with the next step.

    No: Proceed with Step 7.

  2. Contact the backup administrator about the backup policy. Is the maximum number of stored scheduled backups too high?

    Yes: Continue with the next step.

    No: Proceed with Step 6.

  3. Decrease the maximum number of stored scheduled backups.

    Decreasing the value of attribute maxStoredScheduledBackups below the number of scheduled backups in the system automatically deletes the oldest scheduled backups and triggers a new scheduled backup. If the new scheduled backup is successful, the alarm is cleared.

    For information on how to decrease the maxStoredScheduledBackups value, refer to Set Maximum Number of Scheduled Backups.

  4. Check whether a scheduled backup is triggered and successfully created.

    For information on how to list the backups, refer to List Backups.

  5. Is the alarm cleared?

    Yes: Proceed with Step 25.

    No: Proceed with Step 7.

  6. More storage capacity might be needed on the ME. Contact the planning organization and proceed with Step 25.
  7. List the backups locally stored in the ME.

    For information on how to list the backups, refer to List Backups.

  8. Is any locally stored manual or scheduled backup no longer required on the ME?

    Yes: Continue with the next step.

    No: Proceed with Step 16.

    Note:  
    A local backup file is not required if there is no immediate need to restore it on the ME or once it has been exported to a remote file storage.

  9. If needed, export the following locally stored backups to the remote file storage:
    • Backups that need to be preserved and have not been exported yet
    • Backups that have been deleted from the remote file storage

    For information on how to export a backup, refer to Export Backup.

  10. Delete any locally stored backup not required on the ME.
    Attention!

    Risk of data loss or data corruption.

    Do not delete backups listed in attribute restoreEscalationList.

    For information on how to delete a backup, refer to Delete Backup.

  11. Has any scheduled backup been manually deleted?

    Yes: Continue with the next step.

    No: Proceed with Step 14.

  12. Check whether a scheduled backup is triggered and successfully created.

    For information on how to list the backups, refer to List Backups.

  13. Is the alarm cleared?

    Yes: Proceed with Step 25.

    No: Proceed with Step 16.

  14. Schedule a single backup.

    For information on how to schedule a single backup, refer to Schedule Single Backup.

    Note:  
    Make sure to create a scheduled backup of the backup type that generated the alarm. The backup type, SYSTEM_DATA or USER_DATA, is indicated by additionalText in the alarm.

  15. Is the new scheduled backup successfully created and is the alarm cleared?

    Yes: Proceed with Step 25.

    No: Continue with the next step.

  16. Identify which files are taking the most space and which files are the oldest by listing the files in the file system as follows:
    1. Show the 20 largest files and directories, with sizes in kilobytes, for example:

      du -xak / | sort -n | tail -20

      The following is an example output:

      37120   /usr/lib/perl5/5.10.0
      46616   /usr/bin
      46908   /usr/lib/perl5
      47916   /usr/share
      51800   /var
      60688   /lib/modules/3.0.74-0.6.10.1.5564.0.PTF-default/kernel/drivers
      62752   /opt/lpmsv/loader
      66364   /usr/lib
      71100   /opt/com/lib/comp
      77900   /opt/com/lib
      82564   /opt/lpmsv
      90328   /lib/modules/3.0.74-0.6.10.1.5564.0.PTF-default/kernel
      94164   /lib/modules/3.0.74-0.6.10.1.5564.0.PTF-default
      100168  /lib/modules
      103560  /opt/com
      111096  /lib
      128280  /usr/lib64
      308568  /usr
      333108  /opt
      851148  /
    2. Show a list of files last modified more than a given number of days ago, for example more than five days:

      find /cluster/ -mtime +5

      The following is an example output:

      [...]
      /cluster/home
      /cluster/hooks
      /cluster/hooks/2
      /cluster/snapshot
      /cluster/lost+found
      /cluster/dumps
      /cluster/etc/pam.d
      /cluster/etc/login.allow
      [...]
  17. Are some of these files normally deleted automatically?

    Yes: Continue with the next step.

    No: Proceed with Step 20.

  18. Schedule a single backup.

    For information on how to schedule a single backup, refer to Schedule Single Backup.

    Note:  
    Make sure to create a scheduled backup of the backup type that generated the alarm. The additionalText attribute, displayed by the show command on the alarm, identifies the backup type.

  19. Is the new scheduled backup successfully created and is the alarm cleared?

    Yes: Proceed with Step 25.

    No: Proceed with Step 23.

  20. Can significant file space be saved by deleting some of these files without damaging the system?

    Yes: Continue with the next step.

    No: Proceed with Step 23.

  21. Delete the files:

    rm <file1> [<file2> …]

  22. Proceed with Step 18.
  23. Perform data collection. Refer to Data Collection Guideline.
  24. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  25. Job is completed.
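
The two listing commands in Step 16 can be wrapped in small shell functions so that the directory, the number of entries, and the age threshold are parameters. This is a minimal sketch: the names largest_entries and older_than are hypothetical, and the -type f filter is an addition not present in the step, used here so that only regular files are listed as deletion candidates.

```shell
# Sketch only: helper names are hypothetical, not part of the ME software.

# Print the N largest files and directories under DIR (sizes in KB),
# staying on one file system, as in Step 16, substep 1.
# Defaults: DIR=/ and N=20, matching the command in the step.
largest_entries() {
    dir=${1:-/}
    count=${2:-20}
    du -xak "$dir" 2>/dev/null | sort -n | tail -n "$count"
}

# Print regular files under DIR last modified more than DAYS days ago,
# as in Step 16, substep 2. The -type f filter is an assumption added
# here; the original command also lists directories.
# Defaults: DIR=/cluster and DAYS=5, matching the example in the step.
older_than() {
    dir=${1:-/cluster}
    days=${2:-5}
    find "$dir" -type f -mtime +"$days" 2>/dev/null
}
```

For example, largest_entries /var 10 lists the ten largest entries under /var, and older_than /cluster 30 lists files under /cluster that have not been modified for more than 30 days. Review the output before deleting anything in Step 21.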

2.3   Handle Reason BRF Conflict with Other Task

Steps

  1. Refer to the alarm Additional Text to determine which task conflicted with the scheduled backup, for example:

    Scheduled Backup for System Data failed due to conflict with Create Backup task for MANUAL backup CMWBackup_20190502_10 of type BRM_SYSTEM_DATA

  2. Confirm that the ongoing operation has completed.

    Navigate to the BrmBackupManager Managed Object (MO) corresponding to the scheduled backup type, for example:

    >dn ManagedElement=NODE06ST,SystemFunctions=1,BrM=1,BrmBackupManager=SYSTEM_DATA

  3. Wait until all tasks have completed. Continue to check the state of the current operation until it is finished:

    >show progressReport,state

    state=FINISHED

  4. Schedule a single backup.

    For information on how to schedule a single backup, refer to Schedule Single Backup.

    Note:  
    It is assumed that there are no scheduled backup events left in the ME, or that the existing scheduled backup events are too far in the future to wait for in order to clear the alarm.

  5. Wait for the scheduled backup to complete.
  6. Is the alarm cleared?

    Yes: Proceed with Step 9.

    No: Continue with the next step.

  7. Perform data collection. Refer to Data Collection Guideline.
  8. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  9. Job is completed.
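
Steps 1 and 3 above involve reading two pieces of text: the conflicting task in the Additional Text and the state in the progress report. The following sketch shows how both could be extracted from saved text. The helper names are hypothetical, and the Additional Text format is assumed from the single example in Step 1, so other wordings may not match.

```shell
# Sketch only: helper names and the assumed text format are hypothetical.

# Print the conflicting task named in the alarm Additional Text.
# Assumes the wording "... failed due to conflict with <task>", as in
# the example in Step 1. Prints nothing if the text does not match.
conflicting_task() {
    printf '%s\n' "$1" | sed -n 's/.*failed due to conflict with \(.*\)$/\1/p'
}

# Succeed (exit status 0) if the CLI output read from standard input
# contains a line reporting state=FINISHED, as shown in Step 3.
operation_finished() {
    grep -q '^[[:space:]]*state=FINISHED[[:space:]]*$'
}
```

For example, saving the output of show progressReport,state to a file and running operation_finished < file returns success only when the operation has finished. How the CLI output is captured depends on the access method and is not covered here.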

2.4   Handle Reason Participant Reported Error

Steps

  1. Perform data collection. Refer to Data Collection Guideline.
  2. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  3. Job is completed.

2.5   Handle Reason System Failover or Reboot

Steps

  1. Wait for the system to fully recover from the failover or reboot. Continue to check the status using:

    # cmw-status app

    Status OK

  2. Schedule a single backup.

    For information on how to schedule a single backup, refer to Schedule Single Backup.

    Note:  
    It is assumed that there are no scheduled backup events left in the ME, or that the existing scheduled backup events are too far in the future to wait for in order to clear the alarm.

  3. Wait for the scheduled backup to complete.
  4. Is the alarm cleared?

    Yes: Proceed with Step 7.

    No: Continue with the next step.

  5. Perform data collection. Refer to Data Collection Guideline.
  6. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  7. Job is completed.
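
The repeated status check in Step 1 can be sketched as a polling loop. The function name wait_for_status_ok and its retry parameters are assumptions made for illustration. The status command is passed as arguments, so the loop itself does not depend on any particular tool. The loop only assumes that a healthy system prints "Status OK", as shown in Step 1.

```shell
# Sketch only: wait_for_status_ok is a hypothetical helper.
# Usage: wait_for_status_ok TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping INTERVAL_SECONDS between
# attempts, and succeeds as soon as its output contains "Status OK".
wait_for_status_ok() {
    tries=$1
    interval=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        # Run the status command and look for the healthy marker.
        if "$@" 2>/dev/null | grep -q "Status OK"; then
            return 0
        fi
        tries=$((tries - 1))
        # Sleep only if another attempt follows.
        if [ "$tries" -gt 0 ]; then
            sleep "$interval"
        fi
    done
    return 1
}
```

For example, wait_for_status_ok 30 10 cmw-status app checks the status every 10 seconds for up to 30 attempts. The values 30 and 10 are arbitrary and should be chosen to suit the expected recovery time of the system.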