LOTC Time Synchronization

Contents

1   Introduction
1.1   Alarm Description
1.2   Prerequisites

2   Procedure
2.1   Analyzing Alarm
2.2   Actions for Time Difference over Threshold
2.3   Actions for Unusable Time Servers
2.4   Actions for Rejected Time Servers
2.5   Actions for Unreachable Time Servers

1   Introduction

This instruction concerns alarm handling.

1.1   Alarm Description

The alarm is raised in the following situations:

The possible alarm causes and fault locations are explained in Table 1.

Table 1    Alarm Causes

Alarm Cause: Time difference within the cluster exceeds tolerance

  Description: There are time differences between the hosts in the cluster exceeding the threshold value of 10 seconds

  Fault Reason: Timing within the cluster is disrupted because of maintenance activities
  Fault Location: One or more blades can be rebooting

  Fault Reason: A blade is rebooting

  Impact: Time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time

Alarm Cause: Failing to use the NTP service

  Description: The configured NTP servers do not respond to a request for time synchronization or provide an invalid answer to the ME. The ME cannot use the NTP service.

  Fault Reason: One or more NTP servers are down (unreachable NTP servers)
  Fault Location: NTP server

  Fault Reason: The ME rejected the time offered by the NTP server
  Fault Location: NTP server configuration, firewall configuration

  Fault Reason: Loss of connectivity to one or more NTP servers (unreachable NTP server)
  Fault Location: Network problems

  Fault Reason: The NTP server is unusable and its Fully Qualified Domain Name (FQDN) cannot be resolved
  Fault Location: Domain Name System (DNS) server

  Fault Reason: Faulty network interface
  Fault Location: Network interface

  Impact: If one or more NTP servers are unreachable, the result is a loss in resilience with no service impact. If all NTP servers are unreachable, then time stamps used in cluster services (such as logging, alarms, or charging records) start to differ from the real time.

Note:  
This alarm can appear as a result of a maintenance activity.
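
To illustrate the 10-second threshold named in Table 1, the following Python sketch checks whether the spread between host clock readings exceeds the tolerance. How the cluster actually measures the difference is not described in this instruction. The function name, the input format, and the host names are assumptions made for this example only.

```python
THRESHOLD_S = 10.0  # threshold value taken from Table 1

def exceeds_tolerance(host_times, threshold_s=THRESHOLD_S):
    """Return True if the clock spread between hosts exceeds threshold_s.

    host_times maps a host name to a clock reading in seconds, all
    readings taken against the same epoch (assumed input format).
    """
    if len(host_times) < 2:
        # A single host cannot differ from itself.
        return False
    values = host_times.values()
    return max(values) - min(values) > threshold_s
```

For example, readings of 1000.0 and 1012.5 seconds differ by 12.5 seconds, which is over the threshold.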

The alarm attributes are listed and explained in Table 2.

Table 2    Alarm Attributes

Attribute Name        Attribute Value

Major Type            193

Minor Type            3341942785

Source                One of the following:
                        • ManagedElement=<node_name>,HostName=<hostname>,ERIC-LINUX_CONTROL-*
                        • ManagedElement=<node_name>,HostName=<hostname>,ERIC-LINUX_PAYLOAD-*

Specific Problem      LOTC Time Synchronization

Event Type            environmentalAlarm (6)

Probable Cause        x736UnspecifiedReason (418)

Additional Text       Time incorrect (off by <value> seconds)

                      One of the following:
                        • Time servers not reachable: <ip_address/name> (unusable)
                        • Time servers not reachable: <ip_address/name> (rejected at initial selection)
                        • Time servers not reachable: <ip_address/name> (rejected at reselection)
                        • Time servers not reachable: <ip_address/name> (unreachable)
                        • Could not initialize socket
                        • Could not send/receive ntp system status query
                        • Could not send/receive ntp peer status query
                        • Failed to interpret answer from NTP server
                        • Time servers not configured
                        • Local time server raised alarm condition

                      Time servers not reachable: <ip_address/name> (unreachable)

Perceived Severity    critical (3): there are time differences between the blades in the cluster exceeding the threshold value
                      major (4): contact with all NTP servers is lost
                      minor (5): one or more NTP servers cannot be used or reached
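
As an illustration of how the Additional Text values in Table 2 could be read by a script, the following Python sketch extracts the offset from a "Time incorrect" text and the server and status from a "Time servers not reachable" text. The function name and the regular expressions are assumptions made for this example. The exact text produced by the node can differ, for instance if several servers are listed.

```python
import re

# Patterns modelled on the Additional Text values in Table 2 (assumed format).
OFFSET_RE = re.compile(r"^Time incorrect \(off by (-?\d+(?:\.\d+)?) seconds\)$")
SERVER_RE = re.compile(r"^Time servers not reachable: (\S+) \((.+)\)$")

def parse_additional_text(text):
    """Return a dict describing the alarm text, or None if not recognized."""
    text = text.strip()
    match = OFFSET_RE.match(text)
    if match:
        return {"kind": "offset", "seconds": float(match.group(1))}
    match = SERVER_RE.match(text)
    if match:
        return {"kind": "server",
                "server": match.group(1),
                "status": match.group(2)}
    # Texts such as "Could not initialize socket" carry no further fields.
    return None
```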

1.2   Prerequisites

This section provides information on the documents, tools, and conditions that apply to the procedure.

1.2.1   Documents

This instruction references the following document:

1.2.2   Tools

No tools are required.

1.2.3   Conditions

Before starting this procedure, ensure that the following conditions are met:

2   Procedure

This section describes the procedure to follow when this alarm is received.

2.1   Analyzing Alarm

Select the appropriate action:

In all other situations, do the following:

  1. Perform data collection. Refer to Data Collection Guideline.
  2. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
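
The list of actions to select from is not reproduced in this section. The following Python sketch assumes, based only on the titles of Sections 2.2 to 2.5 and the Additional Text values in Table 2, that the choice depends on alarm attribute Additional Text. The mapping and the function name are hypothetical and must be checked against the complete instruction.

```python
def select_section(additional_text):
    """Map an Additional Text value to a procedure section (assumed mapping)."""
    text = additional_text.strip()
    if text.startswith("Time incorrect"):
        return "2.2"  # Actions for Time Difference over Threshold
    if text.startswith("Time servers not reachable:"):
        if text.endswith("(unusable)"):
            return "2.3"  # Actions for Unusable Time Servers
        if "(rejected at" in text:
            return "2.4"  # Actions for Rejected Time Servers
        if text.endswith("(unreachable)"):
            return "2.5"  # Actions for Unreachable Time Servers
    # All other situations: data collection and next level of support.
    return "other"
```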

2.2   Actions for Time Difference over Threshold

Do the following:

  1. Log on to the host to access a Linux® shell:

    ssh <user>@<hostname> -p 22

    The hostname is part of alarm attribute Source.

  2. Wait up to 20 minutes until the cluster reaches a stable state (that is, no node is rebooting). Check the state:

    >cmw-status node

    Status OK

  3. Is the alarm cleared?

    Yes: Proceed with Step 6.

    No: Continue with the next step.

  4. Perform data collection. Refer to Data Collection Guideline.
  5. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  6. Job is completed.
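
The waiting in Step 2 can be sketched as a polling loop. The following Python example runs cmw-status node repeatedly until the output contains "Status OK" or 20 minutes have passed. The function names, the 30-second polling interval, and the injectable run, sleep, and clock parameters are assumptions added for illustration and testing. They are not part of the node software.

```python
import subprocess
import time

def cluster_is_stable(run=None):
    """Return True if the output of `cmw-status node` contains "Status OK"."""
    if run is None:
        # Default runner: execute the command from Step 2 on the local host.
        def run():
            result = subprocess.run(["cmw-status", "node"],
                                    capture_output=True, text=True)
            return result.stdout
    return "Status OK" in run()

def wait_for_stable(timeout_s=20 * 60, interval_s=30, run=None,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the cluster is stable; return False when the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if cluster_is_stable(run):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If wait_for_stable() returns False, the cluster did not reach a stable state within the waiting time. Whether the alarm is cleared must still be checked as described in Step 3.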

2.3   Actions for Unusable Time Servers

Do the following:

  1. Log on to the host to access a Linux shell:

    ssh <user>@<hostname> -p 22

    The hostname is part of alarm attribute Source.

  2. Perform a lookup of the NTP server:

    >nslookup <ntp_fqdn>

    Note:  
    The NTP server FQDN is given in alarm attribute Additional Text.

  3. Does the command return an error?

    Yes: The DNS server can have a configuration fault. Request the DNS server administrator to act on the fault. Proceed with Step 6.

    No: Continue with the next step.

  4. Perform data collection. Refer to Data Collection Guideline.
  5. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  6. Job is completed.
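
The lookup in Step 2 can also be approximated from a script. The following Python sketch resolves the NTP server FQDN with socket.getaddrinfo and returns an empty list when resolution fails. Unlike nslookup, this call uses the system resolver of the host, which can also consult local sources such as the hosts file, so the result is only an approximation of the DNS lookup. The function name and the resolver parameter are assumptions made for this example.

```python
import socket

def resolve_ntp_server(fqdn, resolver=socket.getaddrinfo):
    """Return the sorted addresses for fqdn, or [] if it cannot be resolved."""
    try:
        # Port 123/UDP is the standard NTP port.
        infos = resolver(fqdn, 123, proto=socket.IPPROTO_UDP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the address.
    return sorted({info[4][0] for info in infos})
```

An empty result corresponds to answer Yes in Step 3, that is, the lookup returned an error.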

2.4   Actions for Rejected Time Servers

Do the following:

  1. Log on to the host to access a Linux shell:

    ssh <user>@<hostname> -p 22

    The hostname is part of alarm attribute Source.

  2. Check the NTP status:

    >ntpq -p

    NTP is functional if the output includes an active server, indicated by *. Backup sources are indicated with + in the output.

    The following is an example output:

    node1-kvm1:~ # ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    +ns1.ericsson.se 192.0.2.10       2 u  239 1024  377    1.390    1.099   0.147
    *ns2.ericsson.se 192.0.2.11       2 u  287 1024  377    1.260    1.272   0.181
    +node2-kvm1      193.180.251.38   3 u  735 1024  377    0.321    0.121   0.142

  3. Does the output show that an NTP server is active?

    Yes: The NTP server can have a configuration fault. Request the NTP server administrator to act on the fault. Proceed with Step 5.

    No: Continue with the next step.

  4. The network can have a configuration fault that blocks the NTP traffic. Request the network administrator to act on the fault. Continue with the next step.
  5. Job is completed.
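
The check in Step 3 can be sketched in Python. The example below reads the output of ntpq -p and returns the servers marked with * (active) and + (backup), as described in Step 2. The function name is an assumption made for this example, and only the two markers mentioned in this instruction are handled.

```python
def parse_ntpq_peers(output):
    """Return (active, backups): peer names marked * and + in `ntpq -p` output."""
    active, backups = [], []
    for line in output.splitlines():
        line = line.strip()
        if not line or line[0] not in "*+":
            # Header, separator, prompt, and unmarked peers are skipped.
            continue
        fields = line[1:].split()
        if not fields:
            continue
        if line[0] == "*":
            active.append(fields[0])
        else:
            backups.append(fields[0])
    return active, backups
```

An empty active list corresponds to answer No in Step 3.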

2.5   Actions for Unreachable Time Servers

Do the following:

  1. Log on to the host to access a Linux shell:

    ssh <user>@<hostname> -p 22

    The hostname is part of alarm attribute Source.

  2. Is the affected node a payload node?

    Yes: Proceed with Step 9.

    No: Continue with the next step.

  3. Check the connection to the NTP server using ping and traceroute.

    The NTP server FQDN is given in alarm attribute Additional Text.

  4. Can the NTP server be reached with a delay less than 10 seconds?

    Yes: Proceed with Step 6.

    No: Continue with the next step.

  5. The network can have a configuration fault. Request the NTP server administrator or network administrator to act on the fault. Proceed with Step 17.
  6. Check the NTP configuration in configuration file cluster.conf.
  7. Is the NTP server FQDN or IP address correct?

    Yes: Proceed with Step 12.

    No: Continue with the next step.

  8. Update the NTP server FQDN or IP address in configuration file cluster.conf.
  9. Restart the alarm service:

    >service alarmd restart

  10. Restart the NTP service:

    >service ntp restart

  11. Wait up to 20 minutes and check if the alarm is cleared. Is the alarm cleared?

    Yes: Proceed with Step 17.

    No: Continue with the next step.

  12. Check the connection to the DNS server using ping and traceroute.
  13. Can the DNS server be reached with a delay less than 10 seconds?

    Yes: Proceed with Step 15.

    No: Continue with the next step.

  14. The network can have a configuration fault. Request the DNS server administrator or network administrator to act on the fault. Proceed with Step 17.
  15. Perform data collection. Refer to Data Collection Guideline.
    Note:  
    Collect the NTP status and ARP tables status.

  16. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  17. Job is completed.
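
The delay checks in Steps 4 and 13 can be sketched in Python. The example below reads the text printed by the Linux ping command and reports whether at least one reply was received and every round-trip time is below 10 seconds. The function name and the assumed reply format ("time=<value> ms") are illustrative. Output from other ping implementations can differ.

```python
import re

# Matches the round-trip time in a reply line, for example "time=1.39 ms".
RTT_RE = re.compile(r"time[=<]([\d.]+) ms")

def reachable_within(ping_output, limit_s=10.0):
    """True if ping_output has a reply and every round-trip time is below limit_s."""
    rtts_ms = [float(value) for value in RTT_RE.findall(ping_output)]
    return bool(rtts_ms) and max(rtts_ms) < limit_s * 1000.0
```

A result of False corresponds to answer No in Step 4 or Step 13.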


Copyright

© Ericsson AB 2014, 2015. All rights reserved. No part of this document may be reproduced in any form without the written permission of the copyright owner.

Disclaimer

The contents of this document are subject to revision without notice due to continued progress in methodology, design and manufacturing. Ericsson shall have no liability for any error or damage of any kind resulting from the use of this document.

Trademark List
All trademarks mentioned herein are the property of their respective owners. These are shown in the document Trademark Information.
