| 1 | Introduction |
| 1.1 | Alarm Description |
| 1.2 | Prerequisites |
| 2 | Procedure |
| 2.1 | Severity MINOR |
| 2.2 | Severity MAJOR and CRITICAL |
1 Introduction
This instruction concerns alarm handling.
1.1 Alarm Description
The High Memory Utilization alarm is issued by the Managed Object (MO) Host.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The memory utilization is high. | This alarm is sent when the memory utilization exceeds a set threshold level. | The memory utilization is higher than expected; more memory is needed. | This is a | The system capacity can be degraded when the memory utilization exceeds the threshold. |
Note: The High Memory Utilization alarm can appear as a result of maintenance activity.
The alarm attributes are listed in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Major Type | 193 |
| Minor Type | 2031689 |
| Managed Object Class | Host |
| Managed Object Instance | Region=<region_name>, |
| Specific Problem | High memory utilization |
| Event Type | equipmentAlarm (5) |
| Probable Cause | systemResourcesOverload (207) |
| Additional Text | Available memory fell below the threshold, alarm is cleared when the available memory exceeds the threshold;uuid=<hw_uuid_of_corresponding_server> |
| Severity | |
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
The following document is used in this procedure:
- Data Collection Guideline
1.2.2 Tools
No tools are required.
1.2.3 Conditions
Before starting this procedure, ensure that the following conditions are met:
- No maintenance activities are ongoing at application level.
- SSH credentials for the vCIC node and the compute node are available.
2 Procedure
This section describes the procedure to follow when this alarm is received.
Based on the severity indicated in the alarm text, continue with the relevant section:
- If the severity is MINOR, continue with Section 2.1.
- If the severity is MAJOR or CRITICAL, continue with Section 2.2.
2.1 Severity MINOR
If the alarm severity is MINOR, do the following at the maintenance center:
1. Check if any related alarms are active. Act on any related alarms.
2. Wait 10 minutes.
3. Check the available memory by executing the following command:
   /etc/zabbix/scripts/check_free_memory.sh
   - If the available memory is more than 1 GB, the alarm has ceased. Exit this procedure.
   - If the available memory is less than 1 GB and more than 0.5 GB, the alarm severity remains MINOR. Return to Step 2.
   - If the available memory is less than 0.5 GB and more than 0.25 GB, the alarm severity is MAJOR. Continue with Section 2.2.
   - If the available memory is less than 0.25 GB, the alarm severity is CRITICAL. Continue with Section 2.2.
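The thresholds listed above can be sketched as a small shell helper. This is an illustration only, not part of the product: the function name `classify_available_memory` is hypothetical, it assumes the available memory has already been obtained as a number in GB (the output format of check_free_memory.sh is not described in this instruction), and the handling of values exactly on a threshold is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: map available memory (in GB, passed as $1) to the
# alarm state described in Section 2.1.
# Assumption: a value exactly on a threshold falls into the more severe state.
classify_available_memory() {
    awk -v gb="$1" 'BEGIN {
        gb = gb + 0                      # force numeric comparison
        if (gb > 1)         print "CEASED"
        else if (gb > 0.5)  print "MINOR"
        else if (gb > 0.25) print "MAJOR"
        else                print "CRITICAL"
    }'
}
```

For example, `classify_available_memory 0.75` prints `MINOR`, which corresponds to returning to the 10-minute wait, while `MAJOR` and `CRITICAL` correspond to continuing with Section 2.2.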
2.2 Severity MAJOR and CRITICAL
If the alarm severity is MAJOR or CRITICAL, continue with the relevant section depending on the type of the reported host:
- If the alarm is related to a compute node, continue with Section 2.2.1.
- If the alarm is related to a vCIC node, continue with Section 2.2.2.
2.2.1 Procedure for Compute Node
Do the following at the maintenance center:
1. Log in to the affected compute node using SSH:
   ssh <compute_name>
2. Find out which VM is using the largest amount of memory on the node:
   ps -eo pid,rss,cmd --sort rss | grep '[q]emu.*instance-' | awk '{print "PID:"$1 " MEMORY:"$2 " NAME:"$6}'
   Example output:
   PID:43397 MEMORY:59380 NAME:instance-000002f9
   PID:25380 MEMORY:59560 NAME:instance-000002f6
   PID:47210 MEMORY:60136 NAME:instance-000002fc
3. Log out from the compute node.
4. Perform either of the following steps:
5. Execute source openrc.
6. Use the following command to obtain the instance ID:
   nova list --fields name,status,host,instance_name | grep <affected_host>
7. If the VM's High Availability (HA) policy allows it to be migrated, migrate the VM to another node that has sufficient memory. Use the command nova migrate <server> with the appropriate parameters.
8. Wait 10 minutes and check the active alarm list.
   - If the alarm persists, continue removing VMs from the node, always starting with the VM that uses the most memory, by repeating Step 1 to Step 7.
   - If no VMs are left on the node (or the only VMs left cannot be migrated because of their HA policy) and the alarm still persists, continue with Step 9.
   - If the alarm ceases, continue with Step 11.
9. Restart the node:
   reboot
   Note: Never reboot a compute node running vFuel.
   Wait 15 minutes for the restart process to complete.
10. Log in to the affected compute node using SSH:
    ssh <compute_name>
    - If logging in is not possible, contact the next level of maintenance support and exit this procedure.
    - If logging in is successful, continue with Step 11.
11. Collect troubleshooting data as described in the Data Collection Guideline.
12. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
13. The job is completed.
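The step that finds the VM using the most memory produces one line per VM, sorted by resident set size (reported by ps in kilobytes), so the largest consumer is the last line. As a minimal sketch, the selection can also be made explicitly from that output. The function name `largest_vm` is hypothetical, and the sketch assumes the input lines have exactly the `PID:... MEMORY:... NAME:...` form shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form
#   PID:<pid> MEMORY:<rss> NAME:<instance_name>
# on standard input and print the instance name with the largest MEMORY value.
largest_vm() {
    awk '{
        split($2, mem, ":")     # mem[2] holds the memory value
        split($3, name, ":")    # name[2] holds the instance name
        if (NR == 1 || mem[2] + 0 > max) {
            max = mem[2] + 0
            top = name[2]
        }
    } END { if (NR > 0) print top }'
}
```

Piping the three example output lines into `largest_vm` prints `instance-000002fc`, which is the VM to consider first for migration.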
2.2.2 Procedure for vCIC Node
Do the following at the maintenance center:
1. Log in to the node using SSH:
   ssh <admin_user>@<vcic_address>
   - If logging in is not possible, contact the next level of maintenance support and exit this procedure.
   - If logging in is successful, collect troubleshooting data as described in the Data Collection Guideline.
2. Restart the node with the following command:
   reboot
3. Wait 15 minutes for the restart to complete.
4. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
5. The job is completed.
