LOTC Disk Replication Communication
[Use only for Virtualized]


1   Alarm Description

The alarm is raised when the control nodes have lost connection to each other for more than 20 minutes. Without this connection, the control node pair is no longer in redundant mode.

Table 1    LOTC Disk Replication Communication Alarm Causes

Alarm Cause: Loss of connection between control nodes for more than 20 minutes

Description: The control nodes have lost connection to each other for more than 20 minutes. The Linux® service Distributed Replicated Block Device (DRBD) is not in connected mode.

Fault Reason: Network failure leading to communication problems between the control nodes
Fault Location: Network
Impact: Both controllers take the primary role and no data is transferred between the nodes

Fault Reason: Hardware failure on the secondary control node
Fault Location: Secondary control node
Impact: If one of the controller nodes is down, the cluster does not have a controller node to which it can fail over

Note:  
This alarm can appear as a result of a maintenance activity.

2   Procedure

2.1   Handle Alarm LOTC Disk Replication Communication

Prerequisites

Steps

  1. Log on to the host to access a Linux shell, for example:

    ssh <user>@<hostname>

    The hostname is part of the alarm attribute Source.

  2. Is the alarm raised during initial installation or replacement of a control node?

    Yes: Continue with the next step.

    No: Proceed with Step 5.

  3. Wait for the DRBD connection to be established. Check whether the following command returns Connected:

    drbdadm cstate drbd0

    The following is an example output in a normal situation. The connection state (cstate) is Connected. The alarm is cleared within 5 seconds.

    Connected

    The following is an example output in a faulty situation when running DRBD version 9 or newer. The connection state (cstate) is Connecting.

    Connecting

    The following is an example output in a faulty situation when running DRBD version 8 or older. The connection state (cstate) is WFConnection (Waiting For Connection).

    WFConnection
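
The wait in Step 3 can be sketched as a small polling function. This is a minimal sketch, not part of the official procedure: the function name wait_for_drbd_connected, the 300-second timeout, and the 5-second polling interval are assumptions chosen for illustration. Only the command drbdadm cstate and the resource name drbd0 come from the step above.

```shell
# Poll the DRBD connection state until it reports Connected or a timeout expires.
# Arguments (all optional, defaults are illustrative assumptions):
#   $1 resource name (default drbd0), $2 timeout in seconds, $3 polling interval in seconds
wait_for_drbd_connected() {
    resource="${1:-drbd0}"
    timeout_s="${2:-300}"
    interval_s="${3:-5}"
    elapsed=0
    while :; do
        # Same command as in Step 3; errors are discarded so a missing tool ends in a timeout.
        state="$(drbdadm cstate "$resource" 2>/dev/null)"
        if [ "$state" = "Connected" ]; then
            echo "Connected after ${elapsed}s"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "Timed out after ${elapsed}s, last state: ${state:-unknown}" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}
# Example call on a control node: wait_for_drbd_connected drbd0 600 10
```

A return code of 0 corresponds to the normal situation. A return code of 1 corresponds to a state such as Connecting or WFConnection persisting for the whole wait. Whether the alarm has cleared must still be checked separately.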
  4. Does the output contain Connected and is the alarm cleared?

    Yes: Proceed with Step 15.

    No: Continue with the next step.

  5. Identify the DRBD interfaces as follows:
    1. Get the name of the interface (eth<x>):

      cat /etc/cluster/nodes/this/networks/internal/primary/interface/name

      The following is an example output:

      eth0
    2. Get the IP address (<ip>):

      cat /etc/cluster/nodes/this/networks/internal/primary/address

      The following is an example output:

      169.254.43.11
    3. Get the network mask (<netmask>):

      cat /etc/cluster/nodes/this/networks/internal/primary/network/netmask

      The following is an example output:

      255.255.255.0
  6. Check /var/log/messages for recent system log messages that indicate issues with the DRBD interface. For example, to show the last 1000 lines of the log:

    tail -1000 /var/log/messages

    The following is an example output in a faulty situation:

    Aug 26 12:17:52 SC-1 kernel: [  277.720545] hrtimer: interrupt took 572013 ns
    Aug 26 12:32:50 SC-1 kernel: [ 1175.612842] tipc: Resetting bearer <eth:eth0>
    Aug 26 12:32:50 SC-1 dhcpd: receive_packet failed on eth0: Network is down
    Aug 26 12:32:50 SC-1 syslog-ng[1810]: I/O error occurred while writing; fd='6', error='Network is unreachable (101)'
    Aug 26 12:32:50 SC-1 syslog-ng[1810]: Connection broken; time_reopen='10'
    Aug 26 12:32:59 SC-1 ntpd[2240]: sendto(192.0.2.10) (fd=23): Network is unreachable
    Aug 26 12:33:00 SC-1 syslog-ng[1810]: Connection failed; error='Network is unreachable (101)'
    Aug 26 12:33:00 SC-1 syslog-ng[1810]: Initiating connection failed, reconnecting; time_reopen='10'
    Aug 26 12:33:10 SC-1 syslog-ng[1810]: Connection failed; error='Network is unreachable (101)'
    Aug 26 12:33:10 SC-1 syslog-ng[1810]: Initiating connection failed, reconnecting; time_reopen='10'
    Aug 26 12:33:20 SC-1 syslog-ng[1810]: Connection failed; error='Network is unreachable (101)'
    Aug 26 12:33:20 SC-1 syslog-ng[1810]: Initiating connection failed, reconnecting; time_reopen='10'
    Note:  
    In this output, eth0 is the interface used by the DRBD.
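
To narrow down the 1000 lines in Step 6, the output can be filtered for the DRBD interface name and for network error messages. This is a sketch only: the function name drbd_log_issues is hypothetical, and the search patterns are assumptions derived from the example output above, not an exhaustive list of relevant messages.

```shell
# Print recent log lines that mention the given interface or a network-down condition.
#   $1 interface name, for example eth0
#   $2 log file (default /var/log/messages, as in Step 6)
# The exit status is that of grep: 0 if at least one line matched, 1 if none did.
drbd_log_issues() {
    iface="$1"
    logfile="${2:-/var/log/messages}"
    tail -n 1000 "$logfile" | grep -E -e "$iface" -e 'Network is (down|unreachable)'
}
# Example call on a control node: drbd_log_issues eth0
```

An empty result does not prove that the interface is healthy. It only means that none of the assumed patterns occurred in the inspected lines.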

  7. Are there any issues with the network interface used for the DRBD?

    Yes: Continue with the next step.

    No: Proceed with Step 13.

  8. Check the status of the interface used by the DRBD:

    ifconfig

    The following is an example output:

    eth0      Link encap:Ethernet  HWaddr 00:50:56:92:02:38
              inet addr:10.64.87.136  Bcast:10.64.87.191  Mask:255.255.255.192
              inet6 addr: fe80::250:56ff:fe92:238/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:3884520 errors:0 dropped:0 overruns:0 frame:0
              TX packets:178358 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000
              RX bytes:333841297 (318.3 Mb)  TX bytes:10087705 (9.6 Mb)
    Note:  
    The keywords UP and RUNNING in the output mean that the DRBD interface is operational.
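
The check for the keywords UP and RUNNING can be automated as sketched below. The function name interface_is_operational is hypothetical. Reading the ifconfig output from standard input is a choice made for this sketch, so that saved output can be checked as well. The sketch assumes a grep that supports the -w (whole word) option.

```shell
# Succeed when ifconfig-style output on stdin has a line containing both UP and RUNNING.
# Intended to be fed the output for a single interface, for example: ifconfig eth0
interface_is_operational() {
    # Keep only lines that contain UP as a whole word; fail if there are none.
    flags="$(grep -w 'UP')" || return 1
    # Among those lines, require RUNNING as a whole word.
    printf '%s\n' "$flags" | grep -qw 'RUNNING'
}
# Example call on a control node:
#   ifconfig eth0 | interface_is_operational && echo "operational"
```

A return code of 0 corresponds to answering Yes in the step that asks whether the DRBD interface is operational. Any other return code corresponds to No.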

  9. Is the DRBD interface operational?

    Yes: A network issue is the most likely cause. Perform data collection (refer to Data Collection Guideline) and contact the network administrator. Proceed with Step 15.

    No: Continue with the next step.

  10. Try to bring up the interface used by the DRBD:

    ifconfig eth<x> <ip> netmask <netmask>

    Use the values collected in Step 5, for example:

    ifconfig eth0 169.254.43.11 netmask 255.255.255.0

  11. Check the status of the interface used by the DRBD:

    ifconfig

    The following is an example output:

    eth0      Link encap:Ethernet  HWaddr 00:50:56:92:02:38
              inet addr:10.64.87.136  Bcast:10.64.87.191  Mask:255.255.255.192
              inet6 addr: fe80::250:56ff:fe92:238/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:3884520 errors:0 dropped:0 overruns:0 frame:0
              TX packets:178358 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000
              RX bytes:333841297 (318.3 Mb)  TX bytes:10087705 (9.6 Mb)
  12. Is the DRBD interface operational and is the alarm cleared?

    Yes: Proceed with Step 15.

    No: Continue with the next step.

  13. Perform data collection. Refer to Data Collection Guideline.
  14. Consult the next level of maintenance support. Further actions are outside the scope of this instruction.
  15. Job is completed.