1 Alarm Description
The alarm is raised in the following situations:
- The time difference within the cluster exceeds tolerance.
- The Managed Element (ME) fails to use the Network Time Protocol (NTP) service.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| Time difference within the cluster exceeds tolerance | There are time differences between the hosts in the cluster exceeding the threshold value of 10 seconds | Timing within the cluster is disrupted because of maintenance activities | One or more hosts can be rebooting | Time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time |
| | | A host is rebooting | | |
| Failing to use the NTP service | The configured NTP servers do not respond to a request for time synchronization or provide an invalid answer to the ME. The ME cannot use the NTP service. | | NTP server | If one or more NTP servers are unreachable, the result is a loss in resilience with no service impact. If all NTP servers are unreachable, then time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time. |
| | | The ME rejected the time offered by the NTP server | NTP server configuration, firewall configuration | |
| | | Loss of connectivity to one or more NTP servers (unreachable NTP server) | Network problems | |
| | | The NTP server is unusable and its Fully Qualified Domain Name (FQDN) cannot be resolved | Domain Name System (DNS) server | |
| | | Faulty network interface | Network interface | |
Note: This alarm can appear as a result of a maintenance activity.
2 Procedure
2.1 Handle Alarm LOTC Time Synchronization
Prerequisites
- This instruction references the following document:
- No tools are required.
- The following conditions must apply:
- The alarm is raised.
- It is known how to map the HostName (part of alarm attribute Source) to its IP address.
- The user has administrative privileges.
Steps
- Check the Additional Text attribute of the alarm.
- Select the appropriate action based on the attribute value:
- If Additional Text contains Time incorrect, proceed with Section 2.2 Handle Reason Time Difference over Threshold.
- If Additional Text contains unusable, proceed with Section 2.3 Handle Reason Unusable Time Servers.
- If Additional Text contains rejected, proceed with Section 2.4 Handle Reason Rejected Time Servers.
- If Additional Text contains unreachable, proceed with Section 2.5 Handle Reason Unreachable Time Servers.
- For other cases, proceed with the next step.
- Perform data collection. Refer to Data Collection Guideline.
- Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
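The selection in the steps above can be sketched as a small shell function. This is a minimal illustration, not part of the product: the function name `route_alarm` is hypothetical, and the exact wording of real Additional Text values can differ from the sample strings used here. Only the keywords listed in the steps are matched, in the same order.

```shell
# route_alarm ADDITIONAL_TEXT
# Hypothetical helper: prints the section of this instruction that handles
# the given Additional Text value, using the keywords from the steps above.
route_alarm() {
    case "$1" in
        *"Time incorrect"*) echo "2.2" ;;   # Time difference over threshold
        *unusable*)         echo "2.3" ;;   # Unusable time servers
        *rejected*)         echo "2.4" ;;   # Rejected time servers
        *unreachable*)      echo "2.5" ;;   # Unreachable time servers
        *)                  echo "collect data and escalate" ;;
    esac
}

# Example call (the alarm text is an invented sample):
# route_alarm "NTP server ntp1.example.com unreachable"
```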
2.2 Handle Reason Time Difference over Threshold
Steps
- Log on to the host to access a Linux® shell, for example:
ssh <user>@<hostname> -p 7022
The hostname is part of alarm attribute Source.
- Wait up to 20 minutes until the cluster reaches a stable state (that is, no node is rebooting). Check the state:
>cmw-status node
The following is an example output:
Status OK
- Is the alarm cleared?
Yes: Proceed with Step 6.
No: Continue with the next step.
- Perform data collection. Refer to Data Collection Guideline.
- Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
- Job is completed.
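The waiting step above can be sketched as a polling loop. This is a minimal illustration under stated assumptions: the function name `wait_until_ok` and the 30-second polling interval are hypothetical choices, and the loop treats the cluster as stable when the command output contains the text Status OK shown in the example output.

```shell
# wait_until_ok TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until its output contains "Status OK"
# (returns 0) or until TIMEOUT_SECONDS have been spent waiting (returns 1).
wait_until_ok() {
    timeout_s="$1"
    interval_s="$2"
    shift 2
    elapsed=0
    while :; do
        if "$@" 2>/dev/null | grep -q "Status OK"; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Example call on the host: poll every 30 seconds for up to 20 minutes.
# wait_until_ok 1200 30 cmw-status node
```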
2.3 Handle Reason Unusable Time Servers
Steps
- Log on to the host to access a Linux shell, for example:
ssh <user>@<hostname> -p 7022
The hostname is part of alarm attribute Source.
- Perform a lookup of the NTP server:
>nslookup <ntp_fqdn>
- Does the command return an error?
Yes: The DNS server can have a configuration fault. Request the DNS server administrator to act on the fault. Proceed with Step 6.
No: Continue with the next step.
- Perform data collection. Refer to Data Collection Guideline.
- Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
- Job is completed.
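The lookup check above can be wrapped in a small function that reports only success or failure. This is a sketch: the function name `ntp_fqdn_resolves` and the `LOOKUP_CMD` override are hypothetical, and it assumes that the installed nslookup returns a nonzero exit status when the lookup fails, which is not guaranteed for every nslookup version.

```shell
# ntp_fqdn_resolves FQDN
# Hypothetical helper: returns 0 if the lookup command succeeds for FQDN.
# LOOKUP_CMD can be set to another command for a dry run; default is nslookup.
ntp_fqdn_resolves() {
    ${LOOKUP_CMD:-nslookup} "$1" >/dev/null 2>&1
}

# Example call on the host:
# ntp_fqdn_resolves <ntp_fqdn> || echo "Lookup failed, contact the DNS server administrator"
```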
2.4 Handle Reason Rejected Time Servers
Steps
- Log on to the host to access a Linux shell, for example:
ssh <user>@<hostname> -p 7022
The hostname is part of alarm attribute Source.
- Check the NTP status:
>ntpq -p
The NTP is functional if the output includes an active server, indicated by *. Backup sources are indicated with + in the output.
The following is an example output:
node1-kvm1:~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ns1.ericsson.se 192.0.2.10       2 u  239 1024  377    1.390    1.099   0.147
*ns2.ericsson.se 192.0.2.11       2 u  287 1024  377    1.260    1.272   0.181
+node2-kvm1      193.180.251.38   3 u  735 1024  377    0.321    0.121   0.142
- Does the output show that an NTP server is active?
Yes: The NTP server can have a configuration fault. Request the NTP server administrator to act on the fault. Proceed with Step 5.
No: Continue with the next step.
- The network can have a configuration fault that blocks the NTP traffic. Request the network administrator to act on the fault. Continue with the next step.
- Job is completed.
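The check for an active server in the ntpq output can be sketched as a filter. This is a minimal illustration: the function name `has_active_peer` is hypothetical, and it relies only on the convention described above, namely that the active server is marked with * at the start of its line.

```shell
# has_active_peer
# Hypothetical helper: reads `ntpq -p` output on standard input and
# returns 0 if any line starts with "*", the marker of the active server.
has_active_peer() {
    grep -q '^\*'
}

# Example call on the host:
# ntpq -p | has_active_peer && echo "An NTP server is active"
```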
2.5 Handle Reason Unreachable Time Servers
Steps
- Log on to the host to access a Linux shell, for example:
ssh <user>@<hostname> -p 7022
The hostname is part of alarm attribute Source.
- Is the affected node a payload node?
Yes: Proceed with Step 11.
No: Continue with the next step.
- Check the connection to the NTP server using ping and traceroute.
The NTP server FQDN is given in alarm attribute Additional Text.
- Can the NTP server be reached with a delay less than 10 seconds?
Yes: Proceed with Step 6.
No: Continue with the next step.
- The network can have a configuration fault. Request the NTP server administrator or network administrator to act on the fault. Proceed with Step 16.
- Check the NTP configuration in configuration file cluster.conf.
- Is the NTP server FQDN or IP address correct?
Yes: Proceed with Step 11.
No: Continue with the next step.
- Update the NTP server FQDN or IP address in configuration file cluster.conf.
- Validate the configuration:
>lde-config -v
- Reload the updated configuration:
>lde-config --reload
- Wait up to 20 minutes and check if the alarm is cleared.
Is the alarm cleared?
Yes: Proceed with Step 16.
No: Continue with the next step.
- Reboot the node.
As a consequence of a reboot, applications can lose sessions or traffic. Therefore, restart only one node at a time and only if the state of the cluster as a whole is stable and running.
- Wait up to 20 minutes and check if the alarm is cleared.
Is the alarm cleared?
Yes: Proceed with Step 16.
No: Continue with the next step.
- Perform data collection. Refer to Data Collection Guideline.
- Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
- Job is completed.
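The reachability check with ping can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ntp_server_reachable` and the `PING_CMD` override are hypothetical, the `-W` option is the reply timeout of the Linux iputils ping, and a single echo request answered within 10 seconds is used as an approximation of the 10-second delay criterion. The traceroute part of the check is not covered.

```shell
# ntp_server_reachable SERVER
# Hypothetical helper: returns 0 if SERVER answers one echo request
# within 10 seconds.
# PING_CMD can be set to another command for a dry run; default is ping.
ntp_server_reachable() {
    # -c 1: send a single echo request
    # -W 10: wait at most 10 seconds for the reply (Linux iputils ping)
    ${PING_CMD:-ping} -c 1 -W 10 "$1" >/dev/null 2>&1
}

# Example call on the host:
# ntp_server_reachable <ntp_fqdn> || echo "NTP server not reachable within 10 seconds"
```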
