1 Introduction
This instruction concerns alarm handling.
1.1 Alarm Description
The alarm is issued by the Managed Object (MO) Host.
The alarm is sent if the CPU workload, CPU utilization, or both exceed the threshold configured in the monitoring tool for triggering the alarm. The alarm ceases when the triggering measures fall below the threshold configured for ceasing.
Note: Generally, the configured threshold for ceasing is lower than the threshold for triggering the alarm.
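The two-threshold behavior described above (a higher threshold for triggering, a lower one for ceasing) is a form of hysteresis. The following is a minimal sketch of that logic, not the monitoring tool's implementation. The function name, the sample values, and the thresholds 0.90 and 0.80 are assumptions chosen for illustration only.

```python
def next_alarm_state(active: bool, value: float,
                     raise_at: float, cease_at: float) -> bool:
    """Return the alarm state after one measurement (True means active)."""
    if cease_at > raise_at:
        raise ValueError("cease threshold must not be above the raise threshold")
    if not active:
        # An inactive alarm is raised only when the raise threshold is exceeded.
        return value > raise_at
    # An active alarm stays active until the value falls below the cease threshold.
    return value >= cease_at


# Hypothetical CPU utilization samples (fraction of 1.0) and thresholds.
samples = [0.70, 0.92, 0.85, 0.78, 0.91]
state = False
history = []
for sample in samples:
    state = next_alarm_state(state, sample, raise_at=0.90, cease_at=0.80)
    history.append(state)

print(history)  # [False, True, True, False, True]
```

The third sample (0.85) is below the raise threshold but does not cease the alarm, because it is still above the cease threshold. This gap keeps the alarm from toggling when the load hovers around a single limit.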
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
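| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The CPU workload exceeds the configured threshold. | The alarm is sent if the CPU workload or CPU utilization or both exceed the configured threshold. | | Compute node or vCIC node. | The system capacity can be degraded causing loss of payload. |
| The CPU utilization exceeds the configured threshold. | The alarm is sent if the CPU workload or CPU utilization or both exceed the configured threshold. | | Compute node or vCIC node. | The system capacity can be degraded causing loss of payload. |
| Both the CPU workload and utilization exceed the configured thresholds. | The alarm is sent if the CPU workload or CPU utilization or both exceed the configured threshold. | | Compute node or vCIC node. | The system capacity can be degraded causing loss of payload. |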
Note: The High CPU Load alarm can appear as a result of network disturbances or of a maintenance activity at infrastructure or application level. If a maintenance activity is ongoing, wait until it is completed, then wait five additional minutes.
The alarm attributes are listed in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Major Type | 193 |
| Minor Type | 2031688 |
| Managed Object Class | Host |
| Managed Object Instance | Region=<region_name>, |
| Specific Problem | High CPU load |
| Event Type | equipmentAlarm (5) |
| Probable Cause | systemResourcesOverload (207) |
| Additional Text | The average load per CPU or the CPU utilization or both exceeded the configured thresholds during the measuring period;uuid=<hw_uuid_of_corresponding_server> |
| Severity | |
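The Additional Text attribute in Table 2 ends with a `uuid=` field that identifies the affected server. When alarms are processed by a script, that field can be pulled out with a small parser. The sketch below assumes the text always ends with `;uuid=<value>` as shown in the table. The function name and the example UUID are hypothetical.

```python
import re
from typing import Optional

# Matches the trailing ";uuid=<value>" field of the Additional Text.
_UUID_FIELD = re.compile(r";\s*uuid=([^;\s]+)\s*$")


def extract_hw_uuid(additional_text: str) -> Optional[str]:
    """Return the hardware UUID from the alarm's Additional Text, or None."""
    match = _UUID_FIELD.search(additional_text)
    return match.group(1) if match else None


# Example text following the pattern in Table 2; the UUID value is made up.
example = ("The average load per CPU or the CPU utilization or both exceeded "
           "the configured thresholds during the measuring period;"
           "uuid=00000000-0000-0000-0000-000000000001")
print(extract_hw_uuid(example))
```

The function returns `None` instead of raising an error when the field is absent, so a caller can decide how to handle an alarm text that does not follow the expected pattern.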
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
The following documents are used in the procedure:
1.2.2 Tools
No tools are required.
1.2.3 Conditions
Before starting this procedure, ensure that the following conditions are met:
- No maintenance activities are ongoing at application level.
- SSH credentials for the vCIC node and the compute node are available.
2 Procedure
This section describes the procedure to follow when this alarm is received.
Based on the severity indicated in the alarm text, continue with the relevant section:
- If the severity is MINOR, continue with Section 2.1.
- If the severity is MAJOR or CRITICAL, continue with Section 2.2.
2.1 Severity MINOR
If the alarm severity is MINOR, do the following at the maintenance center:
1. Check if any related alarms are active. Act on any related alarms.
2. Wait 10 minutes for the alarm to cease.
   - If the alarm ceases, exit this procedure.
   - If the alarm severity increases to MAJOR or CRITICAL, continue with Section 2.2.

Note: The Graphical User Interface (GUI) of the Zabbix monitoring tool or the performance management northbound API shows the actual CPU load and utilization, see Section 3.
2.2 Severity MAJOR and CRITICAL
If the alarm severity is MAJOR or CRITICAL, continue with the relevant section depending on the type of the reported node:
- If the alarm is related to a compute node, continue with Section 2.2.1.
- If the alarm is related to a vCIC node, continue with Section 2.2.2.
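The branching between sections can be summarized as a small lookup, which may be useful when alarm handling is partly scripted. This is only a sketch of the routing described in this section. The function name and the node type strings `compute` and `vcic` are assumptions, not values taken from the alarm itself.

```python
def next_section(severity: str, node_type: str = "") -> str:
    """Return the section of this instruction to continue with."""
    sev = severity.strip().upper()
    if sev == "MINOR":
        return "2.1"
    if sev in ("MAJOR", "CRITICAL"):
        node = node_type.strip().lower()
        if node == "compute":
            return "2.2.1"
        if node == "vcic":
            return "2.2.2"
        raise ValueError(f"unknown node type: {node_type!r}")
    raise ValueError(f"unhandled severity: {severity!r}")


print(next_section("MINOR"))             # 2.1
print(next_section("CRITICAL", "vcic"))  # 2.2.2
```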
2.2.1 Procedure for Compute Nodes
Do the following at the maintenance center:
1. Investigate the total resource use on the available nodes. Use the following commands:
   nova hypervisor-stats
   nova host-describe <host_id>
   Then perform either of the following steps:
   - If there are not enough resources in the region or if they are too fragmented to move VMs, refer to Region Expansion to install additional compute servers and increase the number of compute nodes. Exit this procedure.
   - If there are enough resources to migrate VMs, start migrating to decrease CPU load or CPU utilization or both.
     - In case of MAJOR severity, start with the VM that is using the least amount of CPU resource on the node issuing the alarm.
     - In case of CRITICAL severity, start migrating VMs immediately. Migrate at least half of the VMs to decrease the CPU load or utilization.

     Note: Never migrate a vCIC.
2. Migrate the selected VMs to a node with available CPU resources, if they can be migrated. Use the following command:
   nova migrate <server>
   Verify the migration with the command:
   nova resize-confirm <server>
3. Check the actual CPU load and utilization either by using the Performance Management Northbound API or the GUI of the Zabbix monitoring tool as described in Section 3.
4. Wait 10 minutes, then check the active alarm list and perform the relevant action:
   - If the alarm has ceased, exit this procedure.
   - If the alarm remains, migrate all remaining VMs from the node issuing the alarm. Use the following command:
     nova migrate <server>
     Verify the migration with the command:
     nova resize-confirm <server>
5. If all VMs have been migrated from the node, log in to the node by using SSH:
   ssh <admin_user>@<node_address>
6. Restart the node by using the command:
   reboot
7. Wait 15 minutes for the restart to complete.
8. Collect troubleshooting data as described in the Data Collection Guideline. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
9. The job is completed.
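The VM selection rules in the steps above (least CPU use first for MAJOR, at least half of the VMs for CRITICAL, never a vCIC) can be sketched as a short helper. This is an illustration under stated assumptions, not part of the product. The per-VM CPU figures are assumed to be collected separately, for example from the monitoring tool. Ordering the CRITICAL batch by lowest CPU use and counting "half" over the VMs that may be moved are assumptions, because the procedure does not specify them. All names are hypothetical.

```python
from math import ceil
from typing import Dict, FrozenSet, List


def migration_candidates(vm_cpu: Dict[str, float], severity: str,
                         vcic_names: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the VMs to migrate first from the node issuing the alarm."""
    # Never migrate a vCIC: exclude them before ranking.
    movable = sorted((name for name in vm_cpu if name not in vcic_names),
                     key=lambda name: vm_cpu[name])
    sev = severity.strip().upper()
    if sev == "MAJOR":
        # Start with the VM that uses the least CPU resource.
        return movable[:1]
    if sev == "CRITICAL":
        # Migrate at least half of the movable VMs (rounded up).
        return movable[:ceil(len(movable) / 2)]
    raise ValueError(f"unhandled severity: {severity!r}")


# Hypothetical CPU use per VM (percent) on the affected node.
usage = {"vm-a": 35.0, "vm-b": 5.0, "vm-c": 20.0, "cic-1": 40.0}
print(migration_candidates(usage, "MAJOR", frozenset({"cic-1"})))     # ['vm-b']
print(migration_candidates(usage, "CRITICAL", frozenset({"cic-1"})))  # ['vm-b', 'vm-c']
```

Each returned name would then be passed to `nova migrate <server>` as described in the procedure.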
2.2.2 Procedure for vCIC Nodes
Do the following at the maintenance center:
1. Log in to the node using SSH:
   ssh <admin_user>@<vcic_address>
2. Check if the other two vCICs are running.
   - If both of the other vCICs are running, restart the node with the following command:
     reboot
   - Otherwise, continue with Step 4.
3. Wait 15 minutes for the restart to complete.
   - If the alarm reappears after the node has been restarted, continue with Step 4.
   - If the alarm ceases, do the following:
     - Log in to another vCIC using SSH:
       ssh <admin_user>@<vcic_address>
     - Check that all three vCIC nodes are up in normal operation by issuing the command:
       crm status
       Verify that the response in the line starting with Online: contains all three vCIC nodes:
       Online: [cic<id> cic<id> cic<id>]
       - If any of the three vCICs is not running, continue with Step 4.
       - If both of the other two vCICs are running, exit this procedure.
4. Collect troubleshooting data as described in the Data Collection Guideline. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
5. The job is completed.
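The check of the `Online:` line in the `crm status` response can also be done by a script. The sketch below parses captured output and compares the listed nodes with the expected vCIC names. It assumes the line has the form shown in the procedure, `Online: [cic<id> cic<id> cic<id>]`. The node names and the sample output are hypothetical and abbreviated.

```python
import re
from typing import Iterable, List


def online_nodes(crm_status_output: str) -> List[str]:
    """Return the node names listed on the 'Online:' line, if any."""
    for line in crm_status_output.splitlines():
        match = re.match(r"\s*Online:\s*\[(.*?)\]", line)
        if match:
            return match.group(1).split()
    return []


def all_vcics_online(crm_status_output: str, expected: Iterable[str]) -> bool:
    """True if every expected vCIC appears on the 'Online:' line."""
    return set(expected) <= set(online_nodes(crm_status_output))


# Hypothetical, abbreviated output of `crm status`.
sample = "3 Nodes configured\nOnline: [cic-1 cic-2 cic-3]\n"
print(all_vcics_online(sample, ["cic-1", "cic-2", "cic-3"]))  # True
```

If the function returns `False`, the procedure continues with Step 4 in the same way as for a manual check.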
3 Checking CPU Load and Utilization
To check the CPU load and CPU utilization, use either of the following tools:
- The GUI of the Zabbix monitoring tool, see Section 3.1.
- The performance management northbound API, see Section 3.2.
3.1 Zabbix Monitoring Tool
To access the Zabbix monitoring tool, use the address:
https://192.168.2.22/zabbix
The user group, user name, and password can be configured before deployment by setting the correct parameters in the config.yaml file. Refer to the Configuration File Guide.
The default user group is CEEUserGroup and the default user is ceeuser. The default password is generated during deployment and can be found on vFuel in /etc/openstack_deploy/user_secrets.yml under zabbix_cee_user_password.
3.2 Performance Management Northbound API
To check CPU load in the performance management northbound API, refer to the section Monitoring API in the Performance Management Northbound API.
