You can use the svctask chcluster CLI command or the SAN Volume Controller Console to
set the gmlinktolerance feature. The gmlinktolerance feature represents
the number of seconds that the primary SAN Volume Controller cluster
tolerates slow response times from the secondary cluster.
If the poor response extends past the specified tolerance,
a 1920 error is logged and one or more Global Mirror relationships
are automatically stopped. This protects the application hosts at
the primary site. During normal operation, application hosts see a
minimal impact to response times because the Global Mirror feature
uses asynchronous replication. However, if Global Mirror operations
experience degraded response times from the secondary cluster for
an extended period of time, I/O operations begin to queue at the primary
cluster. This results in an extended response time to application
hosts. In this situation, the gmlinktolerance feature stops Global
Mirror relationships and the application hosts' response time returns
to normal. After a 1920 error has occurred, the Global Mirror auxiliary
VDisks are no longer in the consistent_synchronized state until you
fix the cause of the error and restart your Global Mirror relationships.
For this reason, ensure that you monitor the cluster to track when
this occurs.
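The tolerance is set cluster-wide from the CLI. A minimal sketch, assuming the -gmlinktolerance parameter of the svctask chcluster command on your code level (the value of 300 seconds here is only an example; verify the supported range for your release):

```shell
# Set the Global Mirror link tolerance to 300 seconds (example value).
# Parameter name assumed from the chcluster command; confirm against your CLI level.
svctask chcluster -gmlinktolerance 300
```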
You can disable the gmlinktolerance feature by setting
the gmlinktolerance value to 0 (zero). However, the feature cannot
protect applications from extended response times while it is disabled.
It might be appropriate to disable the gmlinktolerance feature
in the following circumstances:
- During SAN maintenance windows where degraded performance is expected
from SAN components and application hosts can withstand extended response
times from Global Mirror VDisks.
- During periods when application hosts can tolerate extended response
times and it is expected that the gmlinktolerance feature might stop
the Global Mirror relationships. For example, if you are testing with
an I/O generator that is configured to stress the backend storage,
the gmlinktolerance feature might detect the high latency and stop
the Global Mirror relationships. Disabling gmlinktolerance prevents
this at the risk of exposing the test host to extended response times.
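During such a window, the feature can be disabled and later re-enabled with the same command. A sketch, again assuming the -gmlinktolerance parameter of svctask chcluster (the 300-second value used to re-enable is only an example):

```shell
# Disable the gmlinktolerance feature for the maintenance window.
svctask chcluster -gmlinktolerance 0
# ... perform the maintenance activity ...
# Re-enable the feature afterward (300 seconds is an example value).
svctask chcluster -gmlinktolerance 300
```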
Diagnosing and fixing 1920 errors
A 1920
error indicates that one or more of the SAN components are unable
to provide the performance that is required by the application hosts.
This can be temporary (for example, a result of maintenance activity)
or permanent (for example, a result of a hardware failure or unexpected
host I/O workload).
If you are experiencing 1920 errors, set up
a SAN performance analysis tool, such as the IBM® Tivoli® Storage Productivity Center,
and make sure that it is correctly configured and monitoring
statistics when the problem occurs. Set your SAN performance analysis
tool to the minimum available statistics collection interval. For
the
IBM Tivoli Storage Productivity Center,
the minimum interval is five minutes. If several 1920 errors have
occurred, diagnose the cause of the earliest error first. The following
questions can help you determine the cause of the error:
- Was maintenance occurring at the time of the error? This might
include replacing a storage controller's physical disk, upgrading
a storage controller's firmware, or performing a code upgrade
on one of the SAN Volume Controller clusters.
Wait until the maintenance procedure is complete before you restart
the Global Mirror relationships; restarting too early can cause a second
1920 error because the system has not yet returned to a stable state
with good performance.
- Were there any unfixed errors on either the source or target system?
If yes, analyze them to determine whether they might have caused
the error. In particular, check whether they relate to the VDisks
or MDisks that are being used in the relationship, or whether they
might have reduced the performance of the target system. Ensure
that the error is fixed before you restart the Global Mirror relationship.
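After the underlying error is fixed, the stopped relationships can be restarted from the CLI. A sketch using hypothetical object names (GMrel0, GMgroup0); depending on the relationship state, a -force parameter might also be required:

```shell
# Restart a stand-alone Global Mirror relationship (name is an example).
svctask startrcrelationship GMrel0
# Restart a Global Mirror consistency group (name is an example).
svctask startrcconsistgrp GMgroup0
```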
- Is the long distance link overloaded? If your link is not capable
of sustaining the short-term peak Global Mirror workload, a 1920 error
can occur. Perform the following checks to determine if the long distance
link is overloaded:
- Look at the total Global Mirror auxiliary VDisk write throughput
before the Global Mirror relationships were stopped. If this is approximately
equal to your link bandwidth, your link might be overloaded. This
might be due to application host I/O operations or a combination of
host I/O and background (synchronization) copy activities.
- Look at the total Global Mirror source VDisk write throughput
before the Global Mirror relationships were stopped. This represents
the I/O operations that are being performed by the application hosts.
If these operations are approaching the link bandwidth, upgrade the
link's bandwidth, reduce the I/O operations that the application
is attempting to perform, or use Global Mirror to copy fewer VDisks.
If the auxiliary disks show significantly more write I/O operations
than the source VDisks, there is a high level of background copy.
Decrease the Global Mirror partnership's background copy rate
parameter to bring the total application I/O bandwidth and background
copy rate within the link's capabilities.
- Look at the total Global Mirror source VDisk write throughput
after the Global Mirror relationships were stopped. If write throughput
increases by 30% or more when the relationships are stopped, the application
hosts are attempting to perform more I/O operations than the link
can sustain. While the Global Mirror relationships are active, the
overloaded link causes higher response times to the application host,
which decreases the throughput it can achieve. After the Global Mirror
relationships have stopped, the application host sees lower response
times. In this case, the link bandwidth must be increased, the application
host I/O rate must be decreased, or fewer VDisks must be copied using
Global Mirror.
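The link checks above reduce to simple threshold tests. The following helper functions are illustrative only, not part of the product: the 90% cutoff for "approximately equal to your link bandwidth" is an assumption, while the 30% figure comes from the guidance above. Throughput figures are integers in MB/s as read from your SAN performance analysis tool.

```shell
# Illustrative check: is auxiliary VDisk write throughput at or near the
# link bandwidth? The 90% cutoff is an assumed interpretation of
# "approximately equal"; adjust it to suit your environment.
link_near_capacity() {
  aux_write_mbps=$1; link_mbps=$2
  [ $((aux_write_mbps * 100)) -ge $((link_mbps * 90)) ] && echo yes || echo no
}

# Illustrative check: did source VDisk write throughput rise by 30% or more
# after the relationships stopped? If yes, the hosts were constrained by the link.
hosts_constrained_by_link() {
  before_mbps=$1; after_mbps=$2
  [ $((after_mbps * 100)) -ge $((before_mbps * 130)) ] && echo yes || echo no
}
```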
- Are the storage controllers at the secondary cluster overloaded?
If one or more of the MDisks on a storage controller are providing
poor service to the SAN Volume Controller cluster,
a 1920 error occurs if this prevents application I/O operations from
proceeding at the rate that is required by the application host. If
the backend storage controller requirements have been followed, the
error might have been caused by a decrease in controller performance.
Use IBM Tivoli Storage Productivity Center to
obtain the backend write response time for each MDisk at the secondary
cluster. If the response time for any individual MDisk shows a
sudden increase of 50 ms or more, or if the response time is above
100 ms, this indicates a problem. Perform the following checks to
determine if the storage controllers are overloaded:
- Check the storage controller for error conditions such as media
errors, a failed physical disk, or associated activity such as RAID
array rebuilding. If there is an error, you should fix the problem
and then restart the Global Mirror relationships.
- If there is no error, determine if the secondary controller is
capable of processing the required level of application host I/O operations.
It might be possible to improve the performance of the controller
by adding more physical disks to a RAID array, changing the RAID level
of the array, changing the controller's cache settings and checking
the cache battery to ensure that it is operational, or changing other controller-specific
configuration parameters.
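The MDisk response-time test above can be sketched the same way. The function is illustrative only; both thresholds (a 50 ms jump over the baseline, or any reading above 100 ms) come from the guidance above, and times are integers in milliseconds:

```shell
# Illustrative check: flag an MDisk whose backend write response time has
# jumped by 50 ms or more over its baseline, or is above 100 ms outright.
mdisk_response_suspect() {
  baseline_ms=$1; current_ms=$2
  if [ $((current_ms - baseline_ms)) -ge 50 ] || [ "$current_ms" -gt 100 ]; then
    echo suspect
  else
    echo ok
  fi
}
```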
- Are the storage controllers at the primary cluster overloaded?
Analyze the performance of the primary backend storage using the same
steps as for the secondary backend storage. If performance is bad,
limit the amount of I/O operations that can be performed by application
hosts. Monitor the backend storage at the primary site even if the
Global Mirror relationships have not been affected. If bad performance
continues for a prolonged period, a 1920 error occurs and the Global
Mirror relationships are stopped.
- Is one of your SAN Volume Controller clusters
overloaded? Use IBM Tivoli Storage Productivity Center to
obtain the port to local node send response time and the port to local
node send queue time. If the total of these two statistics for either
cluster is above 1 millisecond, the SAN Volume Controller might
be experiencing a very high I/O load. Also check the SAN Volume Controller node
CPU utilization. If CPU utilization is above 50%, it might also be
contributing to the problem. In either case, contact your IBM service
representative for
further assistance. If CPU utilization is much higher for one node
than for the other node in the same I/O group, this might be caused
by having different node hardware types within the same I/O group.
For example, a SAN Volume Controller 2145-8F4 in
the same I/O group as a SAN Volume Controller 2145-8G4.
If this is the case, contact your IBM service
representative.
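The node checks above can also be sketched as a threshold test. The function is illustrative only; the 1-millisecond limit on combined port-to-local-node send response plus queue time and the 50% CPU limit come from the guidance above. Times are in microseconds here to keep the arithmetic integral (1 ms = 1000 us):

```shell
# Illustrative check: flag a node whose combined port-to-local-node send
# response + queue time exceeds 1 ms, or whose CPU utilization exceeds 50%.
node_overloaded() {
  send_response_us=$1; send_queue_us=$2; cpu_pct=$3
  if [ $((send_response_us + send_queue_us)) -gt 1000 ] || [ "$cpu_pct" -gt 50 ]; then
    echo overloaded
  else
    echo ok
  fi
}
```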
- Do you have FlashCopy® operations
in the prepared state at the secondary cluster? If the Global Mirror
auxiliary VDisks are the sources of a FlashCopy mapping and that mapping is in
the prepared state for an extended time, performance to those VDisks
can be impacted because the cache is disabled. Start the FlashCopy mapping to enable the cache and
improve performance for Global Mirror I/O operations.
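Starting the prepared mapping re-enables the cache. A sketch using the svctask startfcmap command with a hypothetical mapping name:

```shell
# Start the prepared FlashCopy mapping to re-enable caching on its source
# VDisks (the mapping name fcmap0 is an example).
svctask startfcmap fcmap0
```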