Replacing a faulty node in the cluster using the CLI

You can use the command-line interface (CLI) and the SAN Volume Controller front panel to replace a faulty node in the cluster.

Before you attempt to replace a faulty node with a spare node, ensure that you meet the following requirements:
  • You know the name of the cluster that contains the faulty node.
  • A spare node is installed in the same rack as the cluster that contains the faulty node.
  • You have a record of the last five characters of the original worldwide node name (WWNN) of the spare node. If you repair a faulty node and want to make it a spare node, you can use the WWNN of the node. Each WWNN is unique, so do not duplicate it. It is easier to swap in a node when you use the WWNN.
Attention: Never connect a node with a WWNN of 00000 to the cluster. If this node is no longer required as a spare and is to be used for normal attachment to a cluster, you must change the WWNN to the number you recorded when a spare was created. Using any other number might cause data corruption.
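
The check described in the Attention notice can be expressed as a small validation routine. The following Python sketch is illustrative only and is not part of the SAN Volume Controller software; the function name check_connectable_wwnn_suffix is hypothetical. It assumes that the value being checked is the five-character hexadecimal suffix that is shown on the front panel.

```python
import re


def check_connectable_wwnn_suffix(suffix: str) -> str:
    """Return the normalized five-character WWNN suffix, or raise ValueError.

    Hypothetical helper: it rejects anything that is not five hexadecimal
    characters, and it rejects 00000, which marks a node as a spare.
    """
    cleaned = suffix.strip().upper()
    if not re.fullmatch(r"[0-9A-F]{5}", cleaned):
        raise ValueError("WWNN suffix must be exactly five hexadecimal characters")
    if cleaned == "00000":
        raise ValueError("a node with a WWNN of 00000 must not be connected to the cluster")
    return cleaned


if __name__ == "__main__":
    print(check_connectable_wwnn_suffix("000f6"))
```

A script that records WWNNs before a replacement could call a check like this before an operator is told to attach the fibre-channel cables.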

If a node fails, the cluster continues to operate with degraded performance until the faulty node is repaired. If the repair operation takes an unacceptable amount of time, it is useful to replace the faulty node with a spare node. However, the appropriate procedures must be followed and precautions must be taken so you do not interrupt I/O operations and compromise the integrity of your data.

The following table describes the changes that are made to your configuration when you replace a faulty node in the cluster:
Node attributes Description
Front panel ID This is the number that is printed on the front of the node and is used to select the node that is added to a cluster.
Node ID This is the ID that is assigned to the node. A new node ID is assigned each time a node is added to a cluster; the node name remains the same following service activity on the cluster. You can use the node ID or the node name to perform management tasks on the cluster. However, if you are using scripts to perform those tasks, use the node name rather than the node ID. This ID will change during this procedure.
Node name This is the name that is assigned to the node. If you are using SAN Volume Controller version 5.1.0 nodes, the SAN Volume Controller automatically re-adds nodes that have failed back to the cluster. If the cluster reports an error for a node missing (error code 1195) and that node has been repaired and restarted, the cluster automatically re-adds the node back into the cluster. For releases prior to 5.1.0, if you do not specify a name, the SAN Volume Controller assigns a default name. The SAN Volume Controller creates a new default name each time a node is added to a cluster. If you choose to assign your own names, you must type the node name on the Adding a node to a cluster panel. You cannot manually assign a name that matches the naming convention used for names assigned automatically by SAN Volume Controller. If you are using scripts to perform management tasks on the cluster and those scripts use the node name, you can avoid the need to make changes to the scripts by assigning the original name of the node to a spare node. This name might change during this procedure.
Worldwide node name This is the WWNN that is assigned to the node. The WWNN is used to uniquely identify the node and the fibre-channel ports. During this procedure, the WWNN of the spare node is changed to that of the faulty node, so the WWNN that the cluster uses for this node does not change. The node replacement procedures must be followed exactly to avoid any duplication of WWNNs.
Worldwide port names These are the WWPNs that are assigned to the node. WWPNs are derived from the WWNN that is written to the spare node as part of this procedure. For example, if the WWNN for a node is 50050768010000F6, the four WWPNs for this node are derived as follows:
WWNN                          50050768010000F6
WWNN displayed on front panel 000F6
WWPN Port 1                   50050768014000F6
WWPN Port 2                   50050768013000F6
WWPN Port 3                   50050768011000F6
WWPN Port 4                   50050768012000F6
These names do not change during this procedure.
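
The WWPN example in the table can be reproduced with a short Python sketch. The substitution digits (4, 3, 1, and 2 for ports 1 through 4, placed in the eleventh character of the WWNN) are read from the single example above; whether the same pattern applies to every node model is an assumption. The names PORT_DIGITS, derive_wwpns, and front_panel_wwnn are hypothetical.

```python
from typing import Dict

# Port number -> hex digit that replaces the eleventh character of the WWNN.
# Assumption: generalized from the one worked example in the table.
PORT_DIGITS = {1: "4", 2: "3", 3: "1", 4: "2"}

HEX_CHARS = set("0123456789ABCDEF")


def derive_wwpns(wwnn: str) -> Dict[int, str]:
    """Return a mapping of port number to WWPN for a 16-character WWNN."""
    wwnn = wwnn.strip().upper()
    if len(wwnn) != 16 or not set(wwnn) <= HEX_CHARS:
        raise ValueError("WWNN must be 16 hexadecimal characters")
    return {port: wwnn[:10] + digit + wwnn[11:] for port, digit in PORT_DIGITS.items()}


def front_panel_wwnn(wwnn: str) -> str:
    """Return the last five characters, as displayed on the front panel."""
    return wwnn.strip().upper()[-5:]


if __name__ == "__main__":
    example = "50050768010000F6"
    print(front_panel_wwnn(example))
    for port, wwpn in sorted(derive_wwpns(example).items()):
        print(port, wwpn)
```

Comparing the derived values with the WWPNs that you recorded for the faulty node is one way to confirm that the WWNN was noted correctly before it is entered on the spare node.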

Complete the following steps to replace a faulty node in the cluster:

  1. Verify the name and ID of the node that you want to replace.

    Complete the following step to verify the name and ID:

    1. Issue the svcinfo lsnode CLI command to ensure that the partner node in the I/O group is online.
    • If the other node in the I/O group is offline, start Directed Maintenance Procedures (DMPs) to determine the fault.
    • If you have been directed here by the DMPs, and subsequently the partner node in the I/O group has failed, see the procedure for recovering from offline VDisks after a node or an I/O group failed.
    • If you are replacing the node for other reasons, determine the node you want to replace and ensure that the partner node in the I/O group is online.
    • If the partner node is offline, you will lose access to the VDisks that belong to this I/O group. Start the DMPs and fix the other node before proceeding to the next step.
  2. Find and record the following information about the faulty node using steps 2.a through 2.h:
    • Node serial number
    • Worldwide node name
    • All of the worldwide port names
    • Name or ID of the I/O group that contains the node
    • Front panel ID
    • Uninterruptible power supply serial number
    1. Issue the svcinfo lsnode CLI command to find and record the node name and I/O group name. The faulty node will be offline.
    2. Issue the following CLI command:
      svcinfo lsnodevpd nodename

      Where nodename is the name that you recorded in step 2.a.

    3. Find the WWNN field in the output.
    4. Record the last five characters of the WWNN.
    5. Find the front_panel_id field in the output.
    6. Record the front panel ID.
    7. Find the UPS_serial_number field in the output.
    8. Record the uninterruptible power supply serial number.
  3. Ensure that the faulty node has been powered off.
  4. Issue the following CLI command to remove the faulty node from the cluster:
    svctask rmnode nodename/id

    Where nodename/id is the name or ID of the faulty node.

  5. Disconnect all four fibre-channel cables from the node.
    Important: Do not plug the fibre-channel cables into the spare node until the spare node is configured with the WWNN of the faulty node.
  6. Connect the power and signal cables from the spare node to the uninterruptible power supply that has the serial number you recorded in step 2.h.
    Note: For 2145 UPS-1U units, you must disconnect the cables from the faulty node.
  7. Disconnect the faulty node's power and serial cable from the 2145 UPS-1U and connect the new node's power and signal cable in their place.
  8. Power on the spare node.
  9. Display the node status on the front-panel display.
  10. Change the WWNN of the spare node to that of the faulty node. The procedure depends on the SAN Volume Controller version that is installed on the spare node. Press and release the down button until the Node: panel is displayed. Then press and release the right button until the Node WWNN: panel is displayed. If repeatedly pressing the right button returns you to the Node: panel without displaying a Node WWNN: panel, go to step 12; otherwise, continue with step 11.
  11. Change the WWNN of the spare node (with SAN Volume Controller V4.3 and above installed) to match the WWNN of the faulty node by completing the following steps:
    1. With the Node WWNN: panel displayed, press and hold the down button, press and release the select button, and then release the down button. The display switches into edit mode. Edit WWNN is displayed on line 1. Line 2 of the display contains the last five numbers of the WWNN.
    2. Change the WWNN that is displayed to match the last five numbers of the WWNN that you recorded in step 2.d. To edit the highlighted number, use the up and down buttons to increase or decrease the numbers. The numbers wrap F to 0 or 0 to F. Use the left and right buttons to move between the numbers.
    3. When the five numbers match the last five numbers of the WWNN that you recorded in step 2.d, press the select button to accept the numbers.
  12. Change the WWNN of the spare node (with SAN Volume Controller versions prior to V4.3 installed) to match the WWNN of the faulty node by performing the following steps:
    1. Press and release the right button until the Status: panel is displayed.
    2. With the node status displayed on the front panel, press and hold the down button; press and release the select button; release the down button. WWNN is displayed on line 1 of the display. Line 2 of the display contains the last five numbers of the WWNN.
    3. With the WWNN displayed on the front panel, press and hold the down button; press and release the select button; release the down button. The display switches into edit mode.
    4. Change the WWNN that is displayed to match the last five numbers of the WWNN that you recorded in step 2.d. To edit the highlighted number, use the up and down buttons to increase or decrease the numbers. The numbers wrap F to 0 or 0 to F. Use the left and right buttons to move between the numbers.
    5. When the five numbers match the last five numbers of the WWNN that you recorded in step 2.d, press the select button to retain the numbers that you have updated and return to the WWNN panel.
    6. Press the select button to apply the numbers as the new WWNN for the node.
  13. Connect the four fibre-channel cables that you disconnected from the faulty node to the spare node.

    If the spare node has fewer Ethernet cables connected than the faulty node, move the Ethernet cables from the faulty node to the spare node. Ensure that you connect each cable to the same port on the spare node that it occupied on the faulty node.

  14. Issue the following command to add the spare node to the cluster:
    svctask addnode -wwnodename WWNN -iogrp iogroupname/id

    Where WWNN and iogroupname/id are the values that you recorded for the original node.

    SAN Volume Controller V5.1 automatically reassigns the original name to the node. For versions prior to V5.1, use the name parameter with the svctask addnode command to assign a name. If the original name was assigned automatically by SAN Volume Controller (that is, it starts with node), you cannot reuse it. In that case, either specify a different name that does not start with node, or omit the name parameter so that SAN Volume Controller assigns a new name automatically.

    If necessary, the new node is updated to the same SAN Volume Controller software version as the cluster. This update can take up to 20 minutes.

  15. Use the tools that are provided with your multipathing device driver on the host systems to verify that all paths are now online. See the documentation that is provided with your multipathing device driver for more information. For example, if you are using the subsystem device driver (SDD), see the IBM® System Storage® Multipath Subsystem Device Driver User's Guide for instructions on how to use the SDD management tool on host systems. It might take up to 30 minutes for the paths to come online.
  16. Repair the faulty node.
    Attention: When the faulty node is repaired, do not connect the fibre-channel cables to it. Connecting the cables might cause data corruption because the spare node is using the same WWNN as the faulty node.

    If you want to use the repaired node as a spare node, perform the following steps.

    For SAN Volume Controller V4.3 and later versions:

    1. With the Node WWNN: panel displayed, press and hold the down button, press and release the select button, and then release the down button. The display switches into edit mode. Edit WWNN is displayed on line 1. Line 2 of the display contains the last five numbers of the WWNN.
    2. Change the displayed number to 00000. To edit the highlighted number, use the up and down buttons to increase or decrease the numbers. The numbers wrap F to 0 or 0 to F. Use the left and right buttons to move between the numbers.
    3. Press the select button to accept the numbers.

      This node can now be used as a spare node.

    For SAN Volume Controller versions prior to V4.3:

    1. Press and release the right button until the Status: panel is displayed.
    2. With the node status displayed on the front panel, press and hold the down button; press and release the select button; release the down button. WWNN is displayed on line 1 of the display. Line 2 of the display contains the last five numbers of the WWNN.
    3. With the WWNN displayed on the front panel, press and hold the down button; press and release the select button; release the down button. The display switches into edit mode.
    4. Change the displayed number to 00000. To edit the highlighted number, use the up and down buttons to increase or decrease the numbers. The numbers wrap F to 0 or 0 to F. Use the left and right buttons to move between the numbers.
    5. Press the select button to accept the numbers.
    6. Press the select button to retain the numbers that you have updated and return to the WWNN panel.

      This node can now be used as a spare node.
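
The information-gathering part of the procedure (step 2) and the two CLI commands (steps 4 and 14) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an IBM-provided tool. It assumes that the svcinfo lsnodevpd output can be read as lines of the form "field value", which might not match the real layout exactly. The sample text, the node name svc_a, the I/O group name io_grp0, and the function names are all hypothetical. The sketch only builds the command strings; it does not run them.

```python
from typing import Dict, Iterable, List

# Fields that step 2 asks you to find in the svcinfo lsnodevpd output.
WANTED_FIELDS = ("WWNN", "front_panel_id", "UPS_serial_number")


def parse_vpd_fields(text: str, wanted: Iterable[str] = WANTED_FIELDS) -> Dict[str, str]:
    """Extract the wanted fields, assuming one 'field value' pair per line."""
    wanted = tuple(wanted)
    found: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Keep the first occurrence of each wanted field.
        if len(parts) == 2 and parts[0] in wanted and parts[0] not in found:
            found[parts[0]] = parts[1].strip()
    missing = [name for name in wanted if name not in found]
    if missing:
        raise ValueError("fields not found: " + ", ".join(missing))
    return found


def build_replacement_commands(node_name: str, wwnn: str, iogrp: str) -> List[str]:
    """Return the command strings for step 4 (rmnode) and step 14 (addnode)."""
    return [
        "svctask rmnode " + node_name,
        "svctask addnode -wwnodename " + wwnn + " -iogrp " + iogrp,
    ]


# Hypothetical sample that only imitates the assumed layout.
SAMPLE_VPD = """\
WWNN 50050768010000F6
front_panel_id 123456
UPS_serial_number EXAMPLE0001
"""

fields = parse_vpd_fields(SAMPLE_VPD)
commands = build_replacement_commands("svc_a", fields["WWNN"], "io_grp0")
for command in commands:
    print(command)
```

The two commands are separated by the cabling and front-panel steps of the procedure, so a script like this is best used to prepare and review the exact command lines in advance rather than to issue them back to back.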

© Copyright IBM Corporation 2003, 2009. All Rights Reserved.