1 Introduction
This instruction describes how to handle the LOTC Time Synchronization alarm.
1.1 Alarm Description
The alarm is raised in the following situations:
- The time difference within the cluster exceeds tolerance.
- The Managed Element (ME) fails to use the Network Time Protocol (NTP) service.
The possible alarm causes and fault locations are explained in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Time difference within the cluster exceeds tolerance | There are time differences between the hosts in the cluster exceeding the threshold value of 10 seconds | Timing within the cluster is disrupted because of maintenance activities | One or more blades can be rebooting | Time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time |
| | | A blade is rebooting | | |
| Failing to use the NTP service | The configured NTP servers do not respond to a request for time synchronization or provide an invalid answer to the ME. | NTP server | | If one or more NTP servers are unreachable, the result is a loss in resilience with no service impact. If all NTP servers are unreachable, then time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time. |
| | | NTP server configuration, firewall configuration | | |
| | | Loss of connectivity to one or more NTP servers (unreachable NTP server) | Network problems | |
| | | The NTP server is unusable and its Fully Qualified Domain Name (FQDN) cannot be resolved | DNS server | |
| | | Faulty network interface | Network interface | |
Note: This alarm can appear as a result of a maintenance activity.
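The tolerance condition in Table 1 can be illustrated with a small check. This is only a sketch: the instruction does not describe how the cluster measures the difference, so the function name `exceeds_tolerance` and the input format (a mapping from host name to clock reading in seconds) are assumptions made for illustration. Only the 10-second threshold comes from the table.

```python
def exceeds_tolerance(host_times, threshold_s=10.0):
    """Return True if the spread between host clocks exceeds the threshold.

    host_times maps a host name to its clock reading in seconds
    (hypothetical input format, not taken from the instruction).
    """
    readings = list(host_times.values())
    # The spread is the gap between the fastest and the slowest clock.
    spread = max(readings) - min(readings)
    return spread > threshold_s


# Two blades that are 11.5 seconds apart exceed the 10-second threshold.
print(exceeds_tolerance({"blade-1": 1000.0, "blade-2": 1011.5}))
```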
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Major Type | 193 |
| Minor Type | 3341942785 |
| Source | One of the following: |
| Specific Problem | LOTC Time Synchronization |
| Event Type | environmentalAlarm (6) |
| Probable Cause | x736UnspecifiedReason (418) |
| Additional Text | Time incorrect (off by <value> seconds) |
| | One of the following: Time servers not reachable: <ip_address/name> (unreachable) |
| Perceived Severity | critical (3): there are time differences between the blades in the cluster exceeding the threshold value |
| | major (4): contact with all NTP servers is lost |
| | minor (5): one or more NTP servers cannot be used or reached |
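The two Additional Text formats in Table 2 can be picked apart with regular expressions, for example when alarms are processed by a script. The following is a minimal sketch, not part of the product: the function name `parse_additional_text` and the returned dictionary layout are hypothetical, and it assumes that `<value>` is a plain decimal number and that the server is a single token without spaces.

```python
import re

# Patterns follow the two Additional Text formats listed in Table 2.
# Assumption: <value> is a decimal number and the server name has no spaces.
_OFFSET = re.compile(r"Time incorrect \(off by (-?\d+(?:\.\d+)?) seconds\)")
_SERVER = re.compile(r"Time servers not reachable: (\S+) \((\w+)\)")


def parse_additional_text(text):
    """Extract the offset or the affected server from Additional Text."""
    match = _OFFSET.search(text)
    if match:
        return {"kind": "offset", "seconds": float(match.group(1))}
    match = _SERVER.search(text)
    if match:
        return {"kind": "server", "server": match.group(1), "state": match.group(2)}
    # Anything else is left for manual analysis.
    return {"kind": "unknown"}


print(parse_additional_text("Time incorrect (off by 12 seconds)"))
```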
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
This instruction references the following document:
- Data Collection Guideline
1.2.2 Tools
No tools are required.
1.2.3 Conditions
Before starting this procedure, ensure that the following conditions are met:
- A LOTC Time Synchronization alarm is raised.
- It is known how to map the HostName (part of alarm attribute Source) to its IP address.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Analyzing Alarm
Select the appropriate action:
- If Additional Text contains Time incorrect, proceed with Section 2.2 Actions for Time Difference over Threshold.
- If Additional Text contains unusable, proceed with Section 2.3 Actions for Unusable Time Servers.
- If Additional Text contains rejected, proceed with Section 2.4 Actions for Rejected Time Servers.
- If Additional Text contains unreachable, proceed with Section 2.5 Actions for Unreachable Time Servers.
In all other situations, do the following:
- Perform data collection; refer to Data Collection Guideline.
- Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
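The selection rules in this section amount to a keyword lookup on Additional Text. A minimal sketch follows, assuming the keywords are matched as plain substrings in the order listed above. The function name `select_section` is hypothetical.

```python
def select_section(additional_text):
    """Return the section that handles the given Additional Text."""
    # Keyword-to-section mapping taken from Section 2.1, checked in order.
    rules = [
        ("Time incorrect", "2.2 Actions for Time Difference over Threshold"),
        ("unusable", "2.3 Actions for Unusable Time Servers"),
        ("rejected", "2.4 Actions for Rejected Time Servers"),
        ("unreachable", "2.5 Actions for Unreachable Time Servers"),
    ]
    for keyword, section in rules:
        if keyword in additional_text:
            return section
    # No keyword matched: collect data and escalate.
    return "Data collection and next level of maintenance support"


print(select_section("Time incorrect (off by 12 seconds)"))
```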
2.2 Actions for Time Difference over Threshold
Do the following:
1. Log on to the host to access a Linux® shell:
   ssh <user>@<hostname> -p 22
   The hostname is part of alarm attribute Source.
2. Wait up to 20 minutes until the cluster reaches a stable state (that is, no node is rebooting). Check the state:
   >cmw-status node
   Status OK
3. Is the alarm cleared?
   Yes: Proceed with Step 6.
   No: Continue with the next step.
4. Perform data collection; refer to Data Collection Guideline.
5. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
6. Job is completed.
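The wait in this procedure can be scripted as a polling loop. The sketch below assumes that `cmw-status node` prints `Status OK` when the cluster is stable, as in the example above. The 30-second polling interval and the function names are assumptions, not values from this instruction.

```python
import subprocess
import time


def cluster_is_stable():
    """Run 'cmw-status node' and report whether it prints 'Status OK'."""
    result = subprocess.run(["cmw-status", "node"], capture_output=True, text=True)
    return "Status OK" in result.stdout


def wait_until_stable(check=cluster_is_stable, timeout_s=20 * 60,
                      interval_s=30, sleep=time.sleep, clock=time.monotonic):
    """Poll until check() succeeds or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        # Assumed polling interval; adjust to local practice.
        sleep(interval_s)
```

Calling `wait_until_stable()` on the host returns `True` as soon as the status is OK and `False` if 20 minutes pass first. The `check`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a real cluster.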
2.3 Actions for Unusable Time Servers
Do the following:
1. Log on to the host to access a Linux shell:
   ssh <user>@<hostname> -p 22
   The hostname is part of alarm attribute Source.
2. Perform a lookup of the NTP server:
   >nslookup <ntp_fqdn>
3. Does the command return an error?
   Yes: The DNS server can have a configuration fault. Request the DNS server administrator to act on the fault. Proceed with Step 6.
   No: Continue with the next step.
4. Perform data collection; refer to Data Collection Guideline.
5. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
6. Job is completed.
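The name lookup can also be checked from a script. The following sketch uses the system resolver through `socket.getaddrinfo`, which is an assumption: unlike `nslookup`, it also consults local sources such as the hosts file, so the two can disagree. The function name `ntp_fqdn_resolves` is hypothetical.

```python
import socket


def ntp_fqdn_resolves(fqdn, resolver=socket.getaddrinfo):
    """Return (resolved, addresses) for the NTP server name."""
    try:
        # Port 123 over UDP is the standard NTP service port.
        infos = resolver(fqdn, 123, proto=socket.IPPROTO_UDP)
    except socket.gaierror:
        # Resolution failed: corresponds to the "Yes" branch (lookup error).
        return False, []
    # Each entry ends with a socket address whose first field is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

A result of `(False, [])` corresponds to the case where the lookup returns an error and the DNS server administrator is to be contacted.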
2.4 Actions for Rejected Time Servers
Do the following:
1. Log on to the host to access a Linux shell:
   ssh <user>@<hostname> -p 22
   The hostname is part of alarm attribute Source.
2. Check the NTP status:
   >ntpq -p
   NTP is functional if the output includes an active server, indicated by *. Backup sources are indicated with + in the output.
   The following is an example output:
   node1-kvm1:~ # ntpq -p
        remote           refid      st t when poll reach   delay   offset  jitter
   =================================================================
   +ns1.ericsson.se 192.0.2.10       2 u  239 1024  377    1.390    1.099   0.147
   *ns2.ericsson.se 192.0.2.11       2 u  287 1024  377    1.260    1.272   0.181
   +node2-kvm1      193.180.251.38   3 u  735 1024  377    0.321    0.121   0.142
3. Does the output show that an NTP server is active?
   Yes: The NTP server can have a configuration fault. Request the NTP server administrator to act on the fault. Proceed with Step 5.
   No: Continue with the next step.
4. The network can be blocking the NTP traffic because of a configuration fault. Request the network administrator to act on the fault. Continue with the next step.
5. Job is completed.
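The check for an active server can be automated by reading the first character of each peer line in the `ntpq -p` output. This is a minimal sketch, assuming the output layout shown in the example above. The function name `parse_ntpq_peers` is hypothetical.

```python
def parse_ntpq_peers(output):
    """Return (active, backups) server names from 'ntpq -p' output."""
    active, backups = [], []
    for line in output.splitlines():
        fields = line[1:].split()
        if not fields:
            continue
        if line.startswith("*"):
            # '*' marks the server currently used for synchronization.
            active.append(fields[0])
        elif line.startswith("+"):
            # '+' marks a backup source.
            backups.append(fields[0])
    return active, backups


SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
=================================================================
+ns1.ericsson.se 192.0.2.10       2 u  239 1024  377    1.390    1.099   0.147
*ns2.ericsson.se 192.0.2.11       2 u  287 1024  377    1.260    1.272   0.181
+node2-kvm1      193.180.251.38   3 u  735 1024  377    0.321    0.121   0.142
"""

active, backups = parse_ntpq_peers(SAMPLE)
print(active, backups)
```

An empty `active` list corresponds to the "No" branch of the question whether an NTP server is active.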
2.5 Actions for Unreachable Time Servers
Do the following:
1. Log on to the host to access a Linux shell:
   ssh <user>@<hostname> -p 22
   The hostname is part of alarm attribute Source.
2. Is the affected node a payload node?
   Yes: Proceed with Step 9.
   No: Continue with the next step.
3. Check the connection to the NTP server using ping and traceroute.
   The NTP server FQDN is given in alarm attribute Additional Text.
4. Can the NTP server be reached with a delay of less than 10 seconds?
   Yes: Proceed with Step 6.
   No: Continue with the next step.
5. The network can have a configuration fault. Request the NTP server administrator or network administrator to act on the fault. Proceed with Step 17.
6. Check the NTP configuration in configuration file cluster.conf.
7. Is the NTP server FQDN or IP address correct?
   Yes: Proceed with Step 12.
   No: Continue with the next step.
8. Update the NTP server FQDN or IP address in configuration file cluster.conf.
9. Restart the alarm service:
   >service alarmd restart
10. Restart the NTP service:
    >service ntp restart
11. Wait up to 20 minutes and check if the alarm is cleared. Is the alarm cleared?
    Yes: Proceed with Step 17.
    No: Continue with the next step.
12. Check the connection to the DNS server using ping and traceroute.
13. Can the DNS server be reached with a delay of less than 10 seconds?
    Yes: Proceed with Step 15.
    No: Continue with the next step.
14. The network can have a configuration fault. Request the DNS server administrator or network administrator to act on the fault. Proceed with Step 17.
15. Perform data collection; refer to Data Collection Guideline.
16. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
17. Job is completed.
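The reachability questions in this procedure (for the NTP server and for the DNS server) can be sketched as a single ping with a 10-second reply timeout. The example assumes the Linux iputils `ping`, where `-c` sets the number of requests and `-W` the reply timeout in seconds. The function name `reachable_within` is hypothetical, and the sketch does not replace the traceroute check.

```python
import subprocess


def reachable_within(host, max_delay_s=10, run=subprocess.run):
    """Send one echo request and report whether a reply arrived in time."""
    # Assumes Linux iputils ping: -c 1 sends a single request,
    # -W sets how long to wait for the reply, in seconds.
    result = run(
        ["ping", "-c", "1", "-W", str(max_delay_s), host],
        capture_output=True,
        text=True,
    )
    # ping exits with status 0 only if a reply was received.
    return result.returncode == 0
```

For example, `reachable_within("<ntp_fqdn>")` returning `False` corresponds to the "No" branch, where the network can have a configuration fault. The `run` parameter exists only so the function can be tested without sending traffic.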
