1 Introduction
This instruction concerns alarm handling.
1.1 Alarm Description
The alarm is issued by the Managed Object (MO) Host.
The alarm is sent if the CPU workload, CPU utilization, or both exceed the threshold configured in the monitoring tool for triggering the alarm. The alarm ceases when the triggering measures fall below the threshold configured for ceasing.
- Note:
- Generally, the configured threshold for ceasing is lower than the threshold for triggering the alarm.
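The interplay between the triggering threshold and the lower ceasing threshold can be illustrated with a small state model. The following Python sketch is illustrative only: the class name, the threshold values, and the use of a single scalar measurement are assumptions and are not taken from the monitoring tool configuration.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    """Alarm with separate trigger and cease thresholds (hypothetical model)."""

    trigger_threshold: float  # assumed value above which the alarm is sent
    cease_threshold: float    # assumed value below which the alarm ceases
    active: bool = False

    def update(self, load: float) -> bool:
        """Feed one measurement and return whether the alarm is active."""
        if not self.active and load > self.trigger_threshold:
            self.active = True
        elif self.active and load < self.cease_threshold:
            self.active = False
        return self.active


# Example values are made up for illustration.
alarm = HysteresisAlarm(trigger_threshold=0.9, cease_threshold=0.7)
states = [alarm.update(x) for x in (0.5, 0.95, 0.8, 0.65)]
```

In this model a measurement between the two thresholds (0.8 in the example) leaves the alarm in its current state, which avoids the alarm toggling when the load hovers around a single limit.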
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The CPU workload exceeds the configured threshold. | The alarm is sent if the CPU workload or CPU utilization or both exceed the configured threshold. | | | The system capacity can be degraded causing loss of payload. |
| The CPU utilization exceeds the configured threshold. | | | | |
| Both the CPU workload and utilization exceed the configured thresholds. | | | | |
- Note:
- The High CPU Load alarm can appear as a result of network disturbances, or a maintenance activity on infrastructure or application level. If a maintenance activity is ongoing, wait until it is completed, then wait five additional minutes.
The alarm attributes are listed in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Major Type | 193 |
| Minor Type | 2031688 |
| Managed Object Class | Host |
| Managed Object Instance | Region=<region_name>, |
| Specific Problem | High CPU load |
| Event Type | equipmentAlarm (5) |
| Probable Cause | systemResourcesOverload (207) |
| Additional Text | The average load per CPU or the CPU utilization or both exceeded the configured thresholds during the measuring period;uuid=<HW_UUID_of_corresponding_server>(1) |
| Severity | |

(1) The format of this field is expected to change in CEE R6.
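When the alarm is processed by a script, the server UUID can be extracted from the Additional Text. The following Python sketch assumes the current layout shown in Table 2, that is, a message followed by `;`-separated `key=value` fields. The function name and the placeholder UUID are hypothetical, and because the field format is expected to change, the parser ignores fragments it does not recognize.

```python
def parse_additional_text(text: str) -> dict:
    """Split the Additional Text into its message and key=value fields."""
    message, *fields = text.split(";")
    parsed = {"message": message.strip()}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:  # ignore fragments without '='
            parsed[key.strip()] = value.strip()
    return parsed


sample = (
    "The average load per CPU or the CPU utilization or both exceeded "
    "the configured thresholds during the measuring period;"
    "uuid=00000000-0000-0000-0000-000000000000"  # placeholder UUID
)
info = parse_additional_text(sample)
```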
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
The following documents are used in the procedure:
1.2.2 Tools
No tools are required.
1.2.3 Conditions
- It is assumed that no maintenance activities are ongoing on application level.
- SSH credentials for the vCIC node and the compute node are available.
2 Procedure
This section describes the procedure to follow when this alarm is received.
Based on the severity indicated in the alarm text, continue with the relevant section:
- If the severity is MINOR, continue with Section 2.1.
- If the severity is MAJOR or CRITICAL, continue with Section 2.2.
2.1 Severity MINOR
If the alarm severity is MINOR, do the following at the maintenance center:
- Check if any related alarms are active. Act on any related alarms.
- Wait 10 minutes for the alarm to cease.
- If this alarm ceases, exit this procedure.
- If the alarm severity increases to MAJOR or CRITICAL, continue with Section 2.2.
- Note:
- The Graphical User Interface (GUI) of the Zabbix monitoring tool or the performance management northbound API shows the actual CPU load and utilization, see Section 3.
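The waiting step above can be sketched as a polling loop. This is a minimal Python sketch under stated assumptions: `get_severity` is a hypothetical callback that returns the current severity of this alarm, or `None` when the alarm has ceased, and the 30-second polling interval is an arbitrary choice. The case where the alarm is still MINOR after 10 minutes is not covered by this instruction, so the sketch only reports it.

```python
import time
from typing import Callable, Optional


def wait_for_minor_alarm(
    get_severity: Callable[[], Optional[str]],
    wait_seconds: float = 600,
    poll_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the alarm until it ceases, escalates, or the wait time elapses."""
    waited = 0.0
    while True:
        severity = get_severity()
        if severity is None:
            return "exit"          # the alarm ceased
        if severity in ("MAJOR", "CRITICAL"):
            return "section 2.2"   # the severity increased
        if waited >= wait_seconds:
            return "still minor"   # not covered by this instruction
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable so that the loop can be exercised without actually waiting.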
2.2 Severity MAJOR and CRITICAL
If the alarm severity is MAJOR or CRITICAL, continue with the relevant section depending on the type of the reported node:
- If the alarm is related to a compute node, continue with Section 2.2.1.
- If the alarm is related to a vCIC node, continue with Section 2.2.2.
2.2.1 Procedure for Compute Nodes
Do the following at the maintenance center:
- Perform either of the following steps:
- Investigate the total resource use on the available nodes.
Use the following commands:
nova hypervisor-stats
nova host-describe <hostid>
- If there are not enough resources in the region or if they are too fragmented to move VMs, refer to Region Expansion to install additional compute servers and increase the number of compute nodes. Exit this procedure.
- If there are enough resources to migrate VMs, start migrating to decrease CPU load or CPU utilization or both.
- In case of MAJOR severity, start with the VM that is using the least amount of CPU resource on the node issuing the alarm.
- In case of CRITICAL severity, start migrating VMs immediately. Migrate at least half of the VMs to decrease the CPU load or utilization.
- Note:
- Never migrate a vCIC.
- Migrate the selected VMs to a node with available CPU resources, if they can be migrated. Use the following command:
nova migrate <server>
Confirm the migration with the command:
nova resize-confirm <server>
- Check the actual CPU load and utilization either by using the Performance Management Northbound API or the GUI of the Zabbix monitoring tool as described in Section 3.
- Wait 10 minutes, then check the active alarm list and perform the relevant action:
- If all VMs have been migrated from the node:
- Log in to the node by using SSH:
ssh <admin-user>@<node_address>
If logging in was not possible, continue with Step 7.
- If logging in was successful, collect troubleshooting data as described in the Data Collection Guideline. For alarm-specific logs, refer to the Table Data Collection for Alarms and Alerts in the Data Collection Guideline.
- Restart the node by using the command:
reboot
- Wait 15 minutes for the restart to complete.
- If the alarm ceases, exit this procedure.
- If the alarm reappears when the node has been restarted and VMs are running on the node, run the check config command on the vCIC to collect log files and perform data collection, as described in the Data Collection Guideline.
- Consult the next level of maintenance support. Attach the previously collected sosreport or the screenshot of the running processes to the customer service request. Further actions are outside the scope of this instruction.
- The job is completed.
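The VM selection rules in the procedure above (least CPU-consuming VM first for MAJOR, at least half of the VMs for CRITICAL, never a vCIC) can be sketched as follows. This Python sketch is illustrative only: the function name and the `cpu_by_vm` input, which maps VM names to their CPU usage, are hypothetical. The procedure does not state an order for CRITICAL severity or whether a vCIC counts toward the half, so ascending CPU usage and counting only migratable VMs are assumptions.

```python
import math
from typing import Dict, FrozenSet, List


def select_vms_to_migrate(
    cpu_by_vm: Dict[str, float],
    severity: str,
    vcic_names: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the VMs to migrate first from the node issuing the alarm."""
    # Never migrate a vCIC; sort the remaining VMs by ascending CPU usage.
    candidates = sorted(
        (name for name in cpu_by_vm if name not in vcic_names),
        key=lambda name: cpu_by_vm[name],
    )
    if severity == "MAJOR":
        # Start with the VM using the least amount of CPU resource.
        return candidates[:1]
    if severity == "CRITICAL":
        # Migrate at least half of the (migratable) VMs.
        return candidates[: math.ceil(len(candidates) / 2)]
    raise ValueError(f"unexpected severity: {severity}")
```

Each returned name would then be passed to `nova migrate <server>` as described in the procedure.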
2.2.2 Procedure for vCIC Nodes
Do the following at the maintenance center:
- Log in to the node using SSH:
ssh <admin-user>@<vcic_address>
- If logging in was not possible, continue with Step 3.
- If logging in was successful, collect troubleshooting data as described in the Data Collection Guideline. For alarm-specific logs, refer to the Table Data Collection for Alarms and Alerts in the Data Collection Guideline.
- Check if the other two vCICs are running.
- Wait 15 minutes for the restart to complete.
- If the alarm ceases, do the following:
- Log in to another vCIC using SSH:
ssh <admin-user>@<vcic_address>
- Check that all three vCIC nodes are up in normal operation by issuing the command:
crm status
Verify that the response in the line starting with Online: contains all three vCIC nodes:
Online: [cic<id> cic<id> cic<id>]
- If any of the three vCICs is not running, continue with Step 3.
- If both of the other two vCICs are running, exit the procedure.
- If the alarm reappears when the node has been restarted and VMs are running on the node, run the following command on the vCIC to collect log files and perform data collection, as described in the Data Collection Guideline:
check config
Continue with Step 3.
- Consult the next level of maintenance support. Attach the previously collected sosreport or the screenshot of the running processes to the customer service request. Further actions are outside the scope of this instruction.
- The job is completed.
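The check of the Online: line in the `crm status` response can be automated. The following Python sketch is a minimal example that assumes the response contains a single line of the form `Online: [ ... ]` as shown in the procedure above. The function names and the node names used in the usage example are hypothetical.

```python
import re
from typing import Iterable, List


def online_nodes(crm_status_output: str) -> List[str]:
    """Return the node names from the line starting with 'Online:'."""
    for line in crm_status_output.splitlines():
        match = re.match(r"\s*Online:\s*\[(.*)\]", line)
        if match:
            return match.group(1).split()
    return []


def all_vcics_online(crm_status_output: str, expected_nodes: Iterable[str]) -> bool:
    """Return True if every expected vCIC is listed as online."""
    return set(expected_nodes) <= set(online_nodes(crm_status_output))
```

For example, with the made-up response `Online: [ cic-1 cic-3 ]`, calling `all_vcics_online(response, ["cic-1", "cic-2", "cic-3"])` returns `False`, which corresponds to continuing with Step 3.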
3 Checking CPU Load and Utilization
To check the CPU load and CPU utilization, use either of the following tools:
- The GUI of the Zabbix monitoring tool, see Section 3.1.
- The performance management northbound API, see Section 3.2.
3.1 Zabbix Monitoring Tool
To access the Zabbix monitoring tool, use the address:
https://192.168.2.22/zabbix
The user group, user name, and password can be configured before deployment by setting the correct parameters in the config.yaml file. Refer to the Configuration File Guide.
The default user group is CEEUserGroup, the default user is ceeuser. The default password is generated during deployment, and can be found in /etc/openstack_deploy/user_secrets.yml under zabbix_cee_user_password on vFuel.
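If the generated password needs to be looked up from a script, the value can be read from the secrets file. The following Python sketch uses only the standard library and assumes that `zabbix_cee_user_password` appears as a top-level `key: value` line in the file; the function name is hypothetical, and a full YAML parser is preferable if one is available on the node.

```python
import re
from typing import Optional


def read_secret(yaml_text: str, key: str) -> Optional[str]:
    """Return the scalar value of a top-level 'key: value' line, if present."""
    pattern = re.compile(
        rf"^{re.escape(key)}:[ \t]*(.*?)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(yaml_text)
    if not match:
        return None
    # Remove optional surrounding quotes from the value.
    return match.group(1).strip("'\"")
```

On vFuel, the function could be applied to the content of /etc/openstack_deploy/user_secrets.yml with the key `zabbix_cee_user_password`. Avoid printing or logging the returned value.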
3.2 Performance Management Northbound API
To check CPU load in the performance management northbound API, refer to the section Monitoring API in the Performance Management Northbound API.
