HSZ40 Switching - Design Spec
V2 Draft, 1-Jul-1996
Glenn C. Everhart

--------------------------------------------------------------------

Problem Statement:
The HSZ40 series with the next release of HSOF will offer a dual bus
failover capability. This is characterized by some new INQUIRY
information so that a host can be informed that the failover is possible,
and new logic to provide a "preferred" initial path. (A single SCSI
bus failover is also offered, but that requires no software changes.)

When the devices come up under current autoconfiguration, it is to be
expected that each device will appear twice, once via its path over the
first SCSI bus from the HSZ40 to the host, once via the other.
Notwithstanding this, the devices are not duplicated, and because they
have two aliases, the file system can readily corrupt file structures
located on these devices.

Some means to control access so that a single path is used at any given
moment, and so that normal VMS operations will not notice the dual path,
is needed, which will allow access to the devices via the second SCSI bus
in the event the first fails. Allowing accesses to be shared over the
busses initially is highly desirable as well, and is supported to a
degree by the HSZ firmware. (This is done by allowing a preference to be
stated for each device, so that some devices can be set to be "preferred"
over each bus.)

This failover must be available for disks. It should be available for
other devices also.

Background:

Some HSZ devices have multiple SCSI bus connections, and the issue of
failover between them has arisen. These connections can be connected
either to the same SCSI bus (providing dual paths to that bus so that the
failure of either controller does not prevent access to devices connected
to the HSZ) or to different SCSI busses.
   G If both SCSI controllers on the HSZ are connected to the same SCSI bus, G the HSZ will be able to handle failover within itself so that a host on D the bus will not notice any change. However, when each controller is= connected to a different SCSI bus, the host must be involved.   E In this case, an HSZ might be on two ports on a system, with two SCSI D controllers, and all LUNs attached to the HSZ will therefore show upF twice; a disk might show up as DKB300: and as DKD300:, for example, ifD the HSZ were connected to the second and fourth SCSI adapters on theI machine. At the HSZ itself, it is possible to set a preferred path to the G device, and it will appear unready on the other path, but both could be . configured and would refer to the same device.  F Having dual names for the same storage violates the VMS cluster namingH scheme and can result in disk corruption, so this situation by itself is not satisfactory.   H Fortunately the HSZ itself provides certain bits of information which an; operating system can use to figure which devices are which.   I First, when in this dual-bus configuration, an HSZ will return some extra . data in INQUIRY responses. This data includes:  )    * The serial number of this controller 2    * The serial number of the alternate controller?    * A bitmask of LUNs which are preferred for this controller.   G Therefore one can determine, from the INQUIRY data, if the device is an I HSZ, what this and the "other" controller is, and whether this particular I device is preferred on "this" controller. (The bitmask changes to reflect C the actual situation, so that if one controller fails, all LUNs are I marked as preferred on the other). This extra information is present only G in the dual bus case (the serial numbers being nulled otherwise).  This C permits a driver to determine, when configuring a device, that this E particular path to the device is the preferred one or is an alternate E non-preferred one. 
Moreover, the controller serial numbers are unique
and visible to all nodes on a cluster, so that if a device name is
chosen based on them, it will automatically be the same for all cluster
nodes.

In addition, the HSZ firmware is being given the ability to notify
drivers when a controller fails. This presumes that some devices are
active on each controller, and works by having the HSZ detect the
controller failure. If this happens, the next I/O to the good controller
will receive a CHECK CONDITION status (unit attention). The sense data
then uses some vendor unique sense codes for failover (and eventually
failback) events and returns the good controller serial number, the
failed controller serial number, failed controller target number, and a
bitmask of LUNs moved. In addition, when this happens, the surviving
controller kills (resets) the other controller to keep it from trying to
continue operation.

This information can permit the processor to be notified of a path
failure without necessarily having to incur timeout and mount verify
delays. On VMS, however, a SCSI adapter on a failed path may have I/O in
various states within its control, and if this is the case, some method
of extracting it is needed. The usual path for this function is for
timeouts to occur and force I/O requeue and mount verify. Where I/O is in
progress to a device, there is no convenient external handle available to
extract it (and the notion that as a side effect of a successful I/O on,
say, MKB200:, we might stop and redirect all I/O active on DKD400:, seems
likely to be far more complex and error prone than can be tolerated, if
it can be done at all on all adapters). Therefore this information is
likely to be most useful where the failed path devices are in fact idle.
Where I/O is in progress at some stage within a SCSI adapter, it will
have to be timed out or otherwise cleared from the adapter before a path
switchover can take place.
(This also means that in the event a transient
failure occurs, nothing will be left "in the pipeline" to a device at
switch time.)

Actual HSZ switchover is done by a SCSI START command (which is done as
part of the IO$_PACKACK operation in VMS) so that host software has some
control.

There is a proposal to the SCSI-3 committee which details a more general
configuration, in which some number of devices are controlled by a set of
controllers, where a device may be accessible from one or more of the
controllers at a time. It is anticipated that LUN ownership might have to
be established in this case via reserve/release to set initial path
preference (if only one path at a time may be used).

This proposal defines some SCSI commands which may be sent to a storage
control device to report which controllers and devices are associated and
to set up access. Since these devices will have their own LUNs and device
types (apart from disks, tapes, etc. behind them) it is apparent that an
IO$_PACKACK to a disk would have to have been preceded by some FC
initialization commands. The unit init code of a new class driver may be
the most logical place for such commands. Failover or failback is to be
reported by ASC/ASCQ event codes, same as for the HSZ.

While this suggestion is not yet definite, this specification does
attempt to be generally compatible with it. (A server, for a specific
case, can communicate with a control device if need be when a failover is
signalled.)

Goals:
* Support HSZ failover for HSZ7x type controllers where two SCSI
  busses are connected to a single machine.
* Leave open expansion possibilities
* Be compatible with planned HSG failover mechanism (which is generally
  similar to the HSZ one, with some differences due to the changes
  between SCSI-2 and SCSI-3)
* If possible, facilitate failover between direct SCSI connections and MSCP
  or other server connections. (That is, a design that may help with
  MSCP failover should be preferred over one that cannot.)

Non-Goals:
* Support more than 2 busses
* Support the case where both HSZ controllers are on a single bus (this
  is supported within the HSZ)
* Solve the device naming problem generally
* Dynamic routing or load balancing between paths to a device in full detail
* Describe details of compatibility with the HSG proposed failover
  scheme.

Discussion of goals:

Much more complex situations may arise in the future, where devices are
reachable via any of several paths. Controllers are under discussion
which have 16 bus interconnects available to different computers, and
which will need to do load balancing, and will need to have devices
handled in such a way that confusion does not result due to multiple
names. The approach discussed herein does not attempt to deal with this
complexity yet, but to find a way to deal with the part of the failover
problem defined by the HSZ firmware (HSOF 3.0 and later) which requires
host CPU cooperation. It does attempt not to constrain its implementation
too much, so that extension of the switching to more than two busses,
routing of I/O dynamically between several paths, and failover between
paths regardless of their method of connection can be contemplated as
extensions to it rather than total reworks. All these are possible,
but all will require additional design effort, which is not covered
directly here. The techniques here appear to be usefully extensible
in the directions mentioned, but the full set of issues around any of
these other but related problems has not yet been addressed.

The configurations being addressed are therefore limited at this time to
the dual-bus HSZ cases. The more general case of many paths via many
controller types with possible load balancing is not addressed here, save
in part and with important issues over how to generalize synchronization
boundary conditions not dealt with in their full generality.
That general
discussion is beyond the scope of this design. The design proposed here
is also a VMS variant of the kind of driver interface called "streams" in
the Unix world. This is an interesting sidelight which may be suggestive
but going beyond this sidebar comment is also beyond the scope of this
design.

This document should be considered as the design spec for HSZ failover
primarily, though critiques of the design where it may be over
specialized in ways which will make it harder to solve follow-on problems
might be appropriate.

Approach:

It was initially considered that some form of altering SCSI connections'
purely SCSI structure based "routing" to devices might be feasible for
the switching needed here.  However, in principle, two SCSI busses can be
controlled by entirely different SCSI port drivers, so that an attempt to
alter connections on the fly at port level could involve considerable
complexity in ensuring that port driver specific structures were
initialized as the port drivers expect. (These initializations are not
all alike.) Also, a "port level" approach does not deal with the
appearance of multiple class driver units after autoconfiguration. Idle
drivers might be revectored, but any links between SCSI structures and
class drivers would need to be traced and reset, and any future asynchronous
events would need to be blocked from access to the structures during this
time, and any port driver specificities in the SCDT in particular would
be a problem.

Since the failover scheme used in DUdriver is basically near the top
of the I/O chain in the class driver, this seemed a more promising
direction to go, and had the extra advantage that it might facilitate
failover between DK and DU.

Therefore a simpler approach has been investigated. This approach involves
small modifications to DKdriver (and possibly, but not necessarily, other
drivers) to recognize HSZ units which are non-preferred path aliases of
other devices and to mark them so that the MSCP server and normal VMS
mounting services do not attempt to access them. This will ensure that for
each device, one and only one mountable, serveable class level device
appears. The alternate path will however still be autoconfigured, so that
the SCSI connections will be created and initialized as at present by the
class drivers. The alternate path will however have its data structures set
so that they will be effectively invisible to normal VMS users. This will
mean that the device will exist, but will not be found by VMS search
routines as a device available for channel assignment.

Then at some point moderately early in system startup, but after
autoconfigure, a switching driver will be inserted. This driver will
implement the failover policy by gaining control at the class driver
start_io entry point for the preferred path device, and doing monitoring
or switching. Sufficient units of this virtual driver will be connected
to handle all pairs of disks present, and a server will be started which
will scan the device configuration (from the SCSI data base which by then
will have been set up) and connect the pairs of disks appropriately, and
also remain active awaiting notification of failures so that it can
direct the failover of idle devices to a remaining good path.
Only one2 server is required for any number of such devices.  E (Insertion of the switching component earlier in the boot sequence iscF possible, and may be desired at some point, but it must occur at leastH after all local disks are configured. This may not be difficult with theI new file oriented configurator, but remains to be investigated. The basictH feasibility of the approach appears adequate even if startup is deferredD to one of the startup scripts, though earlier connection may make itH unnecessary to use a sysgen parameter, at the expense of some early boot% code to effectively rename a device.)e  D (The switching driver will also intercept all other relevant entriesC pointed to by the DDTAB tables of drivers, to ensure that where theaH device is being accessed, the accesses are properly routed to the "live"I device. Entries relevant are altstart, mountverify, pending io, auxiliary"H routines, and cancel io from current examinations; register dump appearsJ not to need to be switched due to its calling usage. The pending I/O entryK will be used primarily to ensure that I/O is seen even if a driver directly- pulls requests off its queue.)  D The intercept driver's monitoring function will monitor I/O requestsI coming to the device so that when an IO$_PACKACK coming from mount verifyeH is seen after the Nth time (initially, the third), this will be taken toG mean that the I/O via the currently active path is infeasible, and thataE it is time to try switching. When this happens, I/O packets will havepB their path switches. The driver will either be set to requeue IRPsF received to the alternate path driver (and gain control at I/O postingF time to complete the I/O in the original device's context), or to stopG doing this and allow IRPs to continue to the original start_io entry oftF the initially preferred path's driver. 
Also, some "Special" I/O status
returns will be monitored (implemented as alternate success statuses in
the current thinking) so that a server can be notified if an I/O returns
from one controller and indicates the HSZ has found that its other
controller has failed.

The switching driver can switch paths on command as well, provided that
there is no I/O active on a device being switched. I/O is defined as
active if an IRP has been seen at the driver's start-io entry point and
has not been seen at I/O postprocessing.

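
The monitoring and switch-on-command rules just described can be
illustrated with a small model. This is only a sketch in Python, not
driver code: the class SwitchUnit, its methods start_io, post_io and
switch_on_command, and the constant PACKACK_THRESHOLD are hypothetical
names invented for the illustration. It also assumes that the packack
count is a count of consecutive mount verify packacks, reset by any
ordinary I/O, and that a switch simply toggles between two paths.

```python
from dataclasses import dataclass

# Assumption for this sketch: "initially, the third" mount verify packack.
PACKACK_THRESHOLD = 3


@dataclass
class SwitchUnit:
    """Toy model of one two-way switch unit (hypothetical, not driver code)."""
    path: int = 1          # 1 = primary path, 2 = secondary path
    active_io: int = 0     # IRPs seen at start-io, not yet seen at I/O post
    mv_packacks: int = 0   # consecutive mount verify packacks seen so far

    def start_io(self, is_mv_packack: bool = False) -> int:
        """Model an IRP arriving at start-io; return the path it is routed to."""
        if is_mv_packack:
            self.mv_packacks += 1
            if self.mv_packacks >= PACKACK_THRESHOLD:
                # The current path is taken to be infeasible: toggle paths.
                self.path = 2 if self.path == 1 else 1
                self.mv_packacks = 0
        else:
            # Assumption: any ordinary I/O breaks the run of packacks.
            self.mv_packacks = 0
        self.active_io += 1
        return self.path

    def post_io(self) -> None:
        """Model the same IRP reaching I/O postprocessing."""
        self.active_io -= 1

    def switch_on_command(self, path: int) -> bool:
        """Switch on request, but only when no I/O is active on the unit."""
        if path not in (1, 2):
            raise ValueError("path must be 1 or 2")
        if self.active_io != 0:
            return False
        self.path = path
        self.mv_packacks = 0
        return True
```

In this model a commanded switch is refused while any I/O is counted as
active, and the third mount verify packack in a row is itself routed to
the newly selected path.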

IMPORTANT:

What VMS needs for valid file structures is that the device name as seen
by the rest of the system be uniform. Once the switching component is
present, this name can be that of either path, regardless of which device
is actually preferred. The intent is not to force a preferred allocation of
HSZ slots, but to set names uniformly, permitting the HSZ console choice
of actual path preference to be honored. The switching takes place "under"
the chosen device name, with the initial state of the switch being set so
that the preferred device is used initially. If the switching software is
to be loaded early in the boot path, some cooperation with DKdriver to
honor the HSZ preference (or later, generic SCSI3 preferences) will be
needed. This is not expected to be a large amount of code.

DESIGN:

There are two new components, SWDRIVER and SWCTL, and some modifications
to DKDRIVER used to produce the failover. (Similar changes can be made to
other class drivers in a second pass; the switching software is largely
independent of device class and can readily have those limitations
removed for devices which cannot support mount verify.)

DKDRIVER CHANGES:

DKdriver is to be modified so that in unit init, when it looks at INQUIRY
data from the HSZ, it determines whether this device is on a
"non-preferred" path (this being returned by the HSZ INQUIRY data). If so,
it sets the DEV$V_NOCLU bit in its DEVCHAR2 field so that the MSCP server
will not initially serve the non-preferred path. The preferred path for
naming will be chosen as that with the lower controller serial number
where possible (or the higher, depending on a parameter which must be set
the same clusterwide).
Thus, all nodes will see the same path, and it will
be possible to boot the cluster even if one path's controller is down.
(The problem of a shared SCSI bus being sampled by one node with path
A down and soon after by another node with path A back up again is
otherwise rather intractable.) In this way, boot time consistency can
be assured in naming and access.

In the HSZ failover case, the device will come up with two aliases, and will
return to each DKdriver unit the controller serial numbers of the
"current" and "other" path controllers. Thus a given device might be
visible as, say, DKA300 and DKB300, but while (where the "A" controller
happens to be the preferred path) DKA300 will be identified as a disk and
come up normally, this code will cause DKB300 to be reset to not be visible
to other nodes or to users.

When the switching driver is connected, it will permit the extra paths
to be seen by the MSCP server again, so that DK to DU switchover can
be started once these paths are created. They will still be marked as
invisible by existing code in DUTUSUBS. The switching driver will test
which units are actually on the HSZ "preferred" side and will set its
switch to connect the preferred path initially to the uniformly chosen
device name, provided the paths are idle. (It will not switch during
activity.)

The HSZ will experience timeouts when a bus fails, which will produce
mount verify conditions. In addition, should the HSZ detect a controller
failure, it will allow failover to take place and will signal this by
generating CHECK CONDITION on the next I/O to the "good" side controller.

The CHECK CONDITION operations within DKdriver to handle UNIT ATTENTION
will in fact return success with the current DKdriver. To preserve the
status that the devices are operating correctly, yet allow the switching
server to obtain the signal, DKdriver will, in this situation, return
alternate success reports which will set the 16384 bit of the I/O status
word (unused by DKdriver in any other context) and also the 8192 bit if
this is a failback.  These returns will be sent to the DKdriver caller.
However, it is expected that the switching driver SWdriver will act upon
them. The I/O status will "really" always be SS$_NORMAL in this case, and
DKdriver will check the sense data flags to ensure that the (Digital
vendor unique) codes are present before setting these flag bits in the
return code.  DKdriver will NOT however perform any switching operations
on its own.  This means that minimal DKdriver modification is made here,
but the vital information needed is present and passed on by DKdriver to
layers of the failover system above it. Where these alternate success
statuses are seen by the switching driver, it will remove them prior to
really completing the I/O, thus hiding any unusual behavior from
applications or other VMS layers.

SWDRIVER

SWdriver stands for "SWitching Driver" and is a two way (currently)
toggle switch sending I/O either to one disk or another, assuming the
disks used are in fact the same but accessed over different paths.
(Extending the driver to be an N-way switch should be straightforward,
treating paths 3-N the same as path 2, but is not needed for any
currently known problem.
Future systems may however require this.)

If Bus B fails and some operation is completed on Bus A (these being the
two busses on the HSZ40), the HSZ will generate CHECK CONDITION responses
which DKdriver and other drivers need to be able to turn into statuses
the switch can recognize. The CHECK CONDITION data will indicate that Bus
B has failed, not that anything is wrong with the current device on Bus
A. To perform failover promptly when this happens, it will be necessary
to have some server aware of the whole HSZ configuration and able to
command switchover promptly. Accordingly, the switch driver is programmed
to send a signal to a server when it recognizes such a condition, so that
the server can command switchover to the remaining path. This server can
have the necessary global configuration information so that all devices
can be switched to the good path. (The server will also send an
IO$_PACKACK to get the device to come online at that time, before
anything else is queued there.) Also, some code will be added to DKdriver
to ensure the controller serial numbers are made available to the server,
so that it can find the pairs of controllers automatically, rather than
needing to have it generated by a customer.

Periodic polling of devices will also be added to the server component
here, so that an operator can be notified of device failover. (There
is a special I/O path in the switching driver allowing the server to
contact all actually-known channels in spite of the otherwise opaque
overloading of the chosen device name.) The server will initially determine
device pairs by issuing INQUIRY packets using IO$_DIAGNOSE, so that
DKdriver need not store information about controller IDs.
It will ensure
that the UCB$V_NOASSIGN flag is set in UCB$L_STS of nonpreferred paths
to help set these invisible, and will make such other modifications as
shall be needed to ensure that the scan_device routines in VMS exec
cannot see the extra paths either. These must scale so that multiple
extra paths can be managed.

Operationally, then, autoconfig does not change.  Since DKDRIVER will be
altered to ensure that no disks are served via multiple paths, the switch
logic can be loaded during normal startup commands and need not run very
early in the boot path. Tapes and generic devices for the most part are
not made visible as early, and it is possible that resetting the
alternate units' characteristics for those device types can be done by
the switching software itself, after autoconfiguration shall have run. If
this causes problems, the tape driver will need to be edited also to
prevent too-early detection of tape alternate paths. Loading the
switching code after full VMS is up simplifies it greatly, at the cost of
failover not functioning until this code is loaded. Normal disk operation
would be unchanged by the switch (the actual intercept is synchronized at
fork level, which is necessary for any access to the intercepted path),
but an HSZ controller failure would not be recovered if it occurred
within the first few seconds (up to a few minutes) of system operation.
However, once the software was loaded, a switchover could be accomplished,
presuming the failed devices were in mount verify state and had not timed
out during the interval. Thus even in the case of a very early controller
failure, a remedy could be applied partially "ex post facto". (The
swdriver code would simply have to count MV Packacks starting after they
had been going a while.)
Only a system disk failure early on would not be
covered in this way, since the recovery code would not load, and this can
be considered much the same as a failure during early booting; a reboot
would use the other controller and succeed.

In only one case does something unusual need to be done: when the boot
disk is on the higher numbered controller. In this case, setting a boolean
sysgen parameter will allow boot off a higher serial number controller
by making it preferred. While this effectively changes the device physical
names, a configuration file option will allow them to be effectively
reset for all but the system disk. It is hoped that this will be a rare
circumstance.

The system will then, when running, see one device name per device, and
the path switching will take place below the start_io level in a way
invisible to anything in VMS above driver level. By simply requeueing the
IRP, high performance can be achieved and only minimal changes to driver
operation (mainly to handle the new information in the INQUIRY data and
the extra CHECK CONDITION flags) are needed, none of them of major
import. The functionality here is completely orthogonal to the device
naming scheme in use, and in practice it doesn't matter what the device
name scheme is so long as IOC$SEARCHDEV can still find both devices. It
is further expected that the qio server will eventually perform
operations somewhat akin to this.

By functioning in this way, the system will avoid adding greatly to the
complexity of DKdriver (et alia) and can be extended to handle other
failover situations rather simply, though the custom signals from the HSZ
will be used only in limited ways.

It should be added that for SCSI drivers, the mere startup of mount
verify does not in itself mean that bus failover is appropriate, since
SCSI RESET can be a normal part of system function. This is why the
switch is not set to switch paths at the first pack-ack (or indeed at the
start of the mount verify condition). This is also the reason why the
switch does not simply intercept the start-mount-verify driver entry. In
fact, the IO$_PACKACK will generate a SCSI START command on the new path,
which the HSZ40 needs in order to switch its internal indicators.
ThisD situation is different from that obtaining for DUdriver, where mount< verify generally does mean a path failure may have occurred.   SWDRIVER INTERNALS  G SWdriver is an intercept driver which intercepts disk start-io entries.hF This is done by code which creates a copy of the DDT table, located inI the intercept driver's UCB, and points the intercepted driver's UCB$L_DDTeF vector at it. This permits a per-drive intercept and is done in such aB way that the vector can be intercepted by other similar interceptsH totally reversibly, and in any order, just so they follow the connectionI logic (which has been published). (Because the intercepted DDT is locatedpB within the intercept driver UCB, the intercept code can locate theI intercept driver UCB using this DDT. Some additional code exists to allowiG the code to be sure it has this data for its own intercept, not anothere on a possible chain of them.)   G When the intercept is present, start-io for the "primary" path disk nowoE points at the intercept address within a unit of SWdriver, which alsosI knows the UCB addresses of the "primary" and "secondary" path devices. AnDB IRP entering here is first examined to see if it is a mount verifyD pack-ack IRP (and counted; if 3 of these are seen in a row, SWdriverF switches to the "secondary" path.) By using mount verification in thisF way, SWdriver assures that I/O through the failed path has been idled.F (The mount verify driver entries are NOT used because for SCSI a mount6 verify condition does not necesarily mean a bad path.)  G SWdriver also counts up outstanding I/O and arranges to gain control ateI I/O post time (so it can count down the I/O and post it). 
This is done by
saving IRP$L_PID and replacing it with an address within SWDRIVER which
will count the I/O down and, after restoring modified fields, perform a
real I/O completion on the IRP.

Now if the I/O request is being routed to the primary path, SWdriver just
calls the primary path start-io entry and returns. Since it is entered as
part of the primary driver, it has all needed locks.

If on the other hand the path routed to is the secondary, SWdriver calls
INSIOQC instead, redirecting the IRP to the secondary device. The primary
device is unbusied in this case also, since SWdriver is acting in lieu of
the primary device, which will not in fact get any I/O when it is routed
this way. IRP$L_UCB is pointed at the secondary device during this
operation, to be replaced with its original value when I/O is posted.

In all cases, when the I/O completes (and without a detour through IPL 4
if assembled that way), SWdriver regains control. At this point it
decrements the outstanding I/O count, restores the few IRP fields it
altered to regain control, and completes the I/O (via a call to COM$POST,
since it has no right to alter the underlying driver's busy or unbusy
state). If on the secondary path, SWdriver checks the I/O to ensure that
mount verification is begun on it also, as this would not otherwise be
done. The I/O checking, mount verify processing, and postprocessing are
all done in the context of the primary path, so that the primary path
remains mounted and apparently active, though the secondary path may in
fact be the one in use.

To save volatile parameters from an IRP during the switching, SWdriver
currently overwrites the IRP argument areas (which are used prior to
start_io but are not used after that point) to hold a number of IRP
fields which are being reused to route the packet.

   The usage is as follows:

   Field:            Saves contents of:
   IRP$Q_QIO_P1+4    IRP$L_STS (if fast finish shortcut only)
   IRP$Q_QIO_P2      IRP$L_MEDIA (block number)
   IRP$Q_QIO_P2+4    IRP$L_PID (PID, used to capture post processing)
   IRP$Q_QIO_P2+8    IRP$L_UCB

While it is of course possible to allocate another structure to hold this
information, these IRP fields are used by no other driver code since they
are present only to make the $QIO arguments available to FDT code, which
completes before start-io code can be run. It may be desirable to
consider extending the IRP to supply dedicated fields for this
functionality, or perhaps to consider reusing some of the structures
shadowing uses where the device is not shadowed, and otherwise use some
separate structure. This approach does however provide very fast
operation. The fields mentioned are saved and restored so that the IRP
can be passed to another driver, yet have its I/O posted in the context
of the correct driver. Saving IRP$L_MEDIA is necessary to ensure that
IRPs which are re-inserted in device I/O queues at the start of mount
verify have the correct block information. The UCB and PID fields must be
altered to redirect the IRP to another driver and regain control when the
I/O is posted by that driver. The IRP$L_STS field must also be treated
this way if a "shortcut" to avoid IPL 4 processing is used, which is also
present to minimize extra code caused by this approach, using the fast
path I/O processing to eliminate most of the completion overhead which
would otherwise be seen due to the need for two request completion calls.

SWdriver also has an interface for program controlled path switching.
This is built using the IO$_RETCENTER function code sent to SWdriver
itself. (It is meant as a private interface.) This code passes a single
parameter, 1 or 2, to indicate whether to take the primary or the
secondary path.
When this function is sent to SWdriver, it will switch to
the selected path, provided that its count of active I/O (I/O seen at
start-io and not yet seen at I/O post) is ZERO. When the HSZ sends notice
that "the other controller has failed", the switch server sends a packack
to the currently inactive path to flush out all I/O before switching in
this way. The secondary device exists independently and is just addressed
directly. The primary device, recall, has its start-io entry stolen, so
there is code in SWdriver which will notice an I/O with all I/O function
modifiers set, and which will strip all these and send the I/O to the
primary path, whether it is connected or not for other purposes. The
reason for this packack is to ensure that any "left over" activity on the
path will be flushed, and also to issue the necessary SCSI functions to
activate the path. This will be required for HSZ40 and up, and is likely
to be important for others.

To interact with the failover server, SWdriver sends messages to a
mailbox allocated by the failover server and whose UCB address has been
stored in part of the SWdriver UCB extension. Thus SWdriver can use
CALL_WRTMAILBOX, a documented interface, to send messages to the
controller indicating that a mount-verify-initiated switchover has
occurred, or that an I/O status with the 16384 bit set has been seen.
These messages are simply sent, provided the server is present.  The
server is sent enough information to tell which devices are involved, and
one server can handle any number of pairs of switched devices.  It has
the convention that SW units must be allocated and enabled starting with
unit zero. (There is a UCB table in SWdriver which limits the number of
units permitted, but its size is an assembly parameter and can be made as
large as needed.
Currently it is set for 500 units or less.)

Mount Verify
The mount verify service functions only with a normally mounted device.
It is desirable for similar service to be optionally available for
foreign device pairs, where a database vendor may be handling the disk
itself. This cannot be the default, but is sensible as a general matter.

Fortunately, there is a server available which is able to handle much
of the complexity here. If this function is implemented, it is feasible
for SWdriver to notice error codes that currently result in mount verify
being used, communicate these to the server, and have the server/switch
driver call mount verify entry points (if any) in the appropriate drivers
(to flush I/O) and within the intercept driver to requeue any I/O
that may have been outstanding, handle device busy, and for the server to
issue the periodic packack functions via its private "wormhole" I/O
functions permitting access to separate paths as needed. (The "wormhole"
functions use patterns of some of the function modifier bits as flags
as currently planned, so that the design scales easily to a modest
number of paths, one or two dozen perhaps being a practical maximum. This
should exceed what will be needed.)

By the use of such functions, this system should be able to provide what
amounts to mount verify functions on foreign devices, and thus to handle
failover.

Defect Containment

The investigation has already resulted in a driver and control suite
which will serve as the source of a code count. The software written
for this purpose (not counting some library functions used to allow
the optional configuration file to be free form) totals some 3216
lines of code. It is estimated that another ~250 lines of code
will be needed for the automatic controller-pair recognition, and
the DKdriver lines already added (to side copies) to support these
functions total 180.
Thus there are so far about 3400 lines of code
and the total for HSZ failover functionality may be expected to total
when all is said and done 3650 to 4000 (to pick a round number) lines
of code.

The bogey number of defects expected in 4000 lines of code at one
per 40 lines of code would be 100. However, for code which is unit
tested already (the driver and control daemon code) this estimate
is reported high, and an estimate of 10 defects per KLOC is suggested
for that segment of the code. This would mean about 34 defects in
the code so far, plus another ~10 in code to be generated.

Not all of this code is new (in that some older virtual disk driver
examples were built on which have been functioning for several years)
and the switching driver code has been tested in one system, which
is why it is expected that a lower defect count will cover the
code so far.

Methods for defect removal include (in addition to unit tests):
* Overall design - minimal modifications will be introduced into the
  (already complex) SCSI drivers to support the failover
  functions. This can be expected to be the chief contributor
  to defect containment, since the effects of changes to
  existing SCSI drivers form a small fraction of the overall
  effort and their function is limited to reporting information
  to the failover system on the whole.
* Reviews. It will be important to have the code in the driver reviewed
  so that its design, and particularly its detailed control flow,
  can be checked. The same goes for the server components,
  particularly where privileged.
* Stress testing. The code must be tested in SMP and large cluster
  environments to catch any timing subtleties.
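
The defect estimates given earlier in this section follow from simple
arithmetic, which the short Python sketch below reproduces. The variable
names are illustrative only. The range computed for code still to be
written assumes the one-defect-per-40-lines rate applies to it, which the
text does not state explicitly.

```python
# Line counts quoted in the text.
written = 3216           # driver and control suite
dkdriver_added = 180     # DKdriver lines added to side copies
to_be_written = 250      # estimated controller-pair recognition code

so_far = written + dkdriver_added       # 3396, quoted as "about 3400"
low_total = so_far + to_be_written      # 3646, quoted as 3650

# Bogey estimate: one defect per 40 lines over a round 4000 lines.
bogey_defects = 4000 // 40

# Unit-tested code: 10 defects per KLOC over about 3400 lines.
tested_defects = 3400 * 10 // 1000

# Assumption: new code is estimated at the bogey rate of 1 per 40 lines,
# for between 250 lines and the 600 needed to reach the 4000 total.
new_code_low = 250 / 40
new_code_high = (4000 - 3400) / 40

print(so_far, low_total, bogey_defects, tested_defects)
print(new_code_low, new_code_high)
```

Under that assumption the new code would account for roughly 6 to 15
defects, which brackets the ~10 quoted above.
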