<!-- "@(#)ha 1.2 97/10/16 Sun Microsystems Inc" -->
 
<title>High Availability</title>
<h1>High Availability</h1>

Sun Cluster 2.0 provides high-availability support and automatic data
service failover for selected applications. These applications include
<a href="sc/hanfs">NFS</a>, <a href="sc/haipro">Netscape Web services</a>,
and <a href="sc/hadbms">standard Oracle, Sybase, and Informix RDBMS</a>.
Various combinations of these applications can be mixed on the same
cluster, each with independently configurable failover capabilities. <p>

In addition, Sun Cluster 2.0 provides a <a href="sc/haapi">highly
available Data Services API</a> through which customers can register
an arbitrary application with the cluster framework, thereby making
that application highly available.<p>

The Sun Cluster 2.0 framework provides hardware and software fault
detection, system administration, and, in the event of system or data
service failures, system takeover and automatic restart of registered
data services. Each HA data service also performs fault detection that
is specific to that data service. This detection addresses whether
the data service is performing useful work, not just whether the
machine and operating system appear to be running.<p>

Each data service fault probe behaves like a client of the data
service. The fault probes running on a machine monitor both the data
service exported by that machine and, more importantly, the data
services exported by peer nodes in the cluster. A faulty server is not
reliable enough to monitor its own data services, so each server
monitors other servers in addition to itself.<p>
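
The probe logic described above can be sketched roughly as follows.
This is a minimal illustration, not the Sun Cluster implementation:
the names probe_service, probe_targets, run_probes, and make_request
are hypothetical, and the client request is left as a caller-supplied
function because the real check depends on the data service.<p>

```python
def probe_service(client_request):
    """Behave like a client: the service is healthy only if a real
    request completes without raising an error."""
    try:
        client_request()
        return True
    except Exception:
        return False


def probe_targets(local_node, cluster_nodes):
    """Return the nodes to probe: peers first, then the local node.
    (The ordering is an assumption made for this sketch.)"""
    peers = [n for n in cluster_nodes if n != local_node]
    return peers + [local_node]


def run_probes(local_node, cluster_nodes, make_request):
    """Probe every target and map node name to a healthy flag.
    make_request(node) is a hypothetical factory that returns a
    zero-argument function issuing one client request to that node."""
    return {
        node: probe_service(make_request(node))
        for node in probe_targets(local_node, cluster_nodes)
    }
```

In this sketch a node that answers requests is reported as healthy,
and any node whose request raises an error is reported as faulty,
whether it is the local node or a peer.<p>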

The fault probes react to a loss of service by having one server
forcibly take over the data service from its faulty peer. In some
cases, the fault probes first attempt to restart the data service on
the faulty server before attempting the takeover. If many restarts
occur within a short time, this indicates that the server has serious
problems. In this case, the peer server takes over immediately,
without attempting another local restart.<p>
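
The restart-then-takeover decision described above can be sketched as
a sliding-window counter. This is a minimal illustration under stated
assumptions: the class name RestartPolicy and the values of
max_restarts and window_s are hypothetical, since the text says only
"many restarts" within "a short time".<p>

```python
from collections import deque


class RestartPolicy:
    """Decide between a local restart and a takeover by a peer."""

    def __init__(self, max_restarts=3, window_s=600.0):
        # Both thresholds are assumed values chosen for this sketch.
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._restarts = deque()  # times of recent local restarts

    def on_fault(self, now):
        """Return "restart" or "takeover" for a fault seen at time
        now (in seconds)."""
        # Forget restarts that are older than the window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        # Too many recent restarts: the server has serious problems.
        if len(self._restarts) >= self.max_restarts:
            return "takeover"
        self._restarts.append(now)
        return "restart"
```

With max_restarts=2 and window_s=60, faults at 0 and 10 seconds each
lead to a local restart, and a third fault at 20 seconds leads to a
takeover.<p>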
