OS/2 LAN Manager Sleeping Server Stops Responding to Requests

PSS ID Number: Q102777
Article last modified on 09-20-1993

2.00 2.10 2.10a 2.20
OS/2

SUMMARY
=======
LAN Manager servers may fail or "sleep" under extreme operating
stress, refusing to initiate new sessions while allowing active
sessions to continue for a while. Three problems cause stress-related
symptoms:
 - SCSI bid time-out failures
 - Kernel failures
 - NETAPI.DLL or SCSI bid time-out problems
This article discusses these three problems and several other topics
in the following order:
 - Causes of Server Stress
 - Server-Stress-Related Problems
 - Non-Server-Stress-Related Problems
 - Problem 1: SCSI Bid Time-out Failures
 - Problem 2: Kernel Failures
 - Problem 3: NETAPI.DLL or SCSI Bid Time-out Problems
 - Configuration (of servers prone to these problems)
 - Tuning Recommendations
 - Utilities and Diagnostics
Each problem section is organized in the standard style: Symptoms,
Cause, Resolution/Workaround.
You can obtain Microsoft OS/2 LAN Manager updates from Microsoft
Corporate Network Support. Some fixes will be included in an update in
LAN Manager 2.2a.

MORE INFORMATION
================

                    CAUSES OF SERVER STRESS
                    =======================
Several system conditions can create server stress: a large number of
workstations with active sessions accessing the file server, or
problems with foreground processes such as ring 3 services (SQL
Server, the Netlogon service, and backup operations).

                  SERVER-STRESS-RELATED PROBLEMS
                  ==============================
Stress failures can disrupt some server processes while other OS/2
processes continue unaffected. Even when critical ring 0 or kernel-
level failures affect the server, trap error messages are not always
displayed on the server monitor or through an OS/2 kernel debugger,
and some OS/2 processes can continue to operate for a while (for
instance, the HPFS386 cache and operations serviced by it) although
the Program Manager cannot be accessed and keyboard interrupts are
shut down.

                NON-SERVER-STRESS-RELATED PROBLEMS
                ==================================
Complete server hangs often have causes other than the sleeping-server
problems described in this article:
 - System CPU hardware failures that hang the server and require a
   cold boot.
 - A system device or ring 0 device driver that shuts down interrupts
   and does not reenable them, locking the server keyboard without an
   error message.
 - Network adapters that halt all server processing.
 - A tight I/O-bound application loop such as a process that
   continuously writes to the screen, starving other processes on the
   server. If a process remains at high priority in a PM screen group,
   it blocks other processes. It is a good idea to avoid running local
   processes involving continuous screen or other device I/O on an
   OS/2 LAN Manager file server. Run only specifically designed server
   applications, and avoid running applications such as tape backup
   software during peak hours.

              PROBLEM 1: SCSI BID TIME-OUT FAILURES
              =====================================
Sleeping or comatose server; disk access blocked.

SYMPTOMS
========
There are many possible symptoms. When a SCSI bid time-out occurs for
disk drive requests, drive access is blocked even though network I/O
does not stop, and users can connect to the server by way of NET USE
or NET ADMIN. If primary disk drive access is blocked and SWAPPER.DAT
is located there, local OS/2 foreground processes may become
inaccessible, the system may not respond to keyboard input, and Task
Manager may fail. No operation can write to the blocked disk drive.
Only operations serviced by the HPFS386 cache continue to operate. A
PSTAT listing shows every process blocked except those serviced by the
cache.

CAUSE
=====
This often is caused by a lack of time-out handling code in the OS/2
bid, so disk requests that time out because of server stress and slow
responses from I/O devices are not handled. Requests (called RCBs) are
passed through the file system to the SCSI bid through IOS$, which
provides I/O access to the bid from file systems. The requests include
time-out values. For SCSI bids, this value is passed as the time-out
parameter of a SCSI request block (SRB), and time-outs cannot be
handled properly unless the bid monitors the SRB time-out value for
all I/O queues. If a request expires, the SCSI bus and I/O queue
time-outs should be reset and the operations allowed to retry. If
operations are not allowed to retry, threads involved with the I/O
process hang.
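
The monitoring behavior described above can be sketched in a few lines
of Python. This is only an illustration of the idea, not bid code: the
Request class, the tick() function, and the max_retries limit are
hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a queued disk request and its SRB time-out."""
    name: str
    timeout: int        # ticks the request may stay outstanding
    age: int = 0        # ticks outstanding so far
    retries: int = 0    # times the request has been reissued


def tick(queue, completed, max_retries=3):
    """Advance one tick: drop finished requests, retry expired ones.

    Returns the requests that failed for good, so the caller can report
    an error instead of leaving the issuing thread blocked.
    """
    failed = []
    for req in list(queue):
        if req.name in completed:
            queue.remove(req)       # device answered in time
            continue
        req.age += 1
        if req.age < req.timeout:
            continue                # still within its time-out
        if req.retries < max_retries:
            req.age = 0             # reset the time-out and reissue
            req.retries += 1
        else:
            queue.remove(req)       # give up and surface an error
            failed.append(req)
    return failed
```

A monitor without the retry branch would leave an expired request in
the queue forever, which corresponds to the hung threads described
above.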
Certain hardware items and quantities can cause or contribute to these
problems:
 - Slow SCSI devices
 - Multiple devices
 - Multiple large hard-disk drives
 - Multiple tape drives
 - CD-ROM drives
When server performance falls, access to CD-ROM drives fails, and
attempting to access a drive letter associated with a CD-ROM from the
server console hangs that OS/2 screen group. Also, if multiple CD-ROM
devices containing large amounts of data are attached to the server,
this failure can result in a sleeping server hang:
   Net start server
   The server is starting.....................................

RESOLUTION
==========
Update SCSI bids to address the lack of time-out code. Following are
instructions for current SCSI bids divided into four classes.
A. SCSI controller bids for which updates are available
B. SCSI controllers without time-out handling code that can be
   replaced with monolithic drivers 
C. SCSI controllers without time-out handling code or currently
   available monolithic drivers 
D. SCSI bids with time-out code for OS/2 1.301 LAN Manager 2.2
The manufacturers are working on fixes for deficient controller bids.
A. SCSI controller bids for which updates are available:
    - COMPAQ Cpq710 bid
    - UltraStor Ultra24 bid (installed as BOOTBID.BID,
      not preinstalled)
    - Adaptec 174x bid
   Updates are available from Microsoft PSS, on CompuServe in the MS
   Networks forum (see BIDS.ZIP), and on the PSS Internet server in
   CS\LANMAN\UNSUP-ED (see GOWINNT.MICROSOFT.COM). Use FTP to get the
   files.
B. SCSI controllers without time-out handling code that can be
   replaced with monolithic drivers:
    - IBM PS/2 ABIOS.BID   (From OS/2 1.3 csd5050 or later)
    - COMPAQ CPQARRAY BID  (From OS/2 1.21)
   To work around the problem, replace these with monolithic drivers.
   Monolithic drivers do not support LADDR-specific features such as
   FT or CdRomIfs, but proper time-out code is available for hard disk
   drives.
   NOTE: Monolithic drivers have not been certified or exhaustively
   tested with OS/2 1.301 csd5015 LAN Manager 2.2.
C. SCSI controllers without time-out handling code or currently
   available monolithic drivers:
    - Adaptec 154X and 164X
    - Future Domain WD7000EX and FD16-700 bids
    - Dell001 bid
   To work around the problem, install an adapter and driver that
   support time-outs.
D. SCSI bids with time-out code for OS/2 1.301 LAN Manager 2.2:
    - ESDI-506 bid used for IDE, ESDI, and ST-506 compatible 
      controllers
    - DPT201X bid
    - NCRC700, NCRC710 and NCRC90
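
As a quick reference, the four classes above can be expressed as a
lookup table. The sketch below is illustrative only: the class letters
and bid names come from the lists above, while the dictionary layout,
the shortened key spellings, and the recommended_action() function are
assumptions made for the example.

```python
# Class letter for each bid named in the lists above (keys are lowercase).
BID_CLASS = {
    "cpq710": "A", "ultra24": "A", "adaptec 174x": "A",
    "abios": "B", "cpqarray": "B",
    "adaptec 154x": "C", "adaptec 164x": "C",
    "wd7000ex": "C", "fd16-700": "C", "dell001": "C",
    "esdi-506": "D", "dpt201x": "D",
    "ncrc700": "D", "ncrc710": "D", "ncrc90": "D",
}

# Action for each class, paraphrased from the text above.
ACTION = {
    "A": "Install the updated bid.",
    "B": "Replace the bid with a monolithic driver.",
    "C": "Install an adapter and driver that support time-outs.",
    "D": "Bid already has time-out code.",
}


def recommended_action(bid_name):
    """Return the action for a bid, or None if the bid is not listed."""
    bid_class = BID_CLASS.get(bid_name.strip().lower())
    return ACTION.get(bid_class)
```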

                  PROBLEM 2: KERNEL FAILURES
                  ==========================
Server slows down (OS2KRNL).

SYMPTOMS
========
The server slows or halts. CPU utilization increases and workstations
receive poor server response.

CAUSE
=====
The kernel memory-management routines for CSD5050 and subsequent
revisions have not been updated from 286-specific code to 386-specific
code. As a result, memory compaction on a 386 does not take advantage
of the 386 processor double word capability, resulting in poor
performance, especially with memory-intensive operations.
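
A rough way to see the cost is to count the moves needed to copy a
block of memory one 16-bit word at a time versus one 32-bit double
word at a time. The sketch below is simple arithmetic under that
assumption; it ignores instruction timing and bus effects, and the
64K block size is only an example value.

```python
def move_count(num_bytes, word_size):
    """Number of word moves needed to copy num_bytes, rounding up."""
    return -(-num_bytes // word_size)


block = 64 * 1024                   # example block size in bytes
moves_16 = move_count(block, 2)     # 16-bit word moves
moves_32 = move_count(block, 4)     # 32-bit double word moves
```

For the same block, word moves need twice as many operations as
double word moves.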

RESOLUTION
==========
Update the OS/2 kernel and redirector. Current versions are:
   OS2KRNL           OS2 1.301 CSD01.001
   NETWKSTA.SYS      LM22 CSD00.013

       PROBLEM 3: NETAPI.DLL OR SCSI BID TIME-OUT PROBLEMS
       ===================================================
Server rejects new sessions. (NETAPI.DLL, SCSI bid, or resource
problems).

SYMPTOMS
========
If the server is experiencing SCSI bid time-outs and disk access is
blocked, the server service degrades gradually. As long as an HPFS386
server is operating in RAM, it maintains existing sessions and
continues to report that no listens are available; when the server
begins workstation file-copy sessions, however, NetBIOS session-alive
traffic remains active but workstations cannot connect to the server.
Before long, the only continuing traffic is LLC NetBEUI low-level
transport operations.
Workstations attempting new connections receive error 51. Error 53 is
sometimes returned, but this actually is error 51 erroneously reported
by LAN Manager. Likewise, error net3779, sometimes returned to users
attempting to log on to a sleeping primary domain controller (PDC), is
incorrect and should be error 51. PSTAT may show that Netlogon is
stuck in a critical section because of an Announcer thread failure.
The Scavenger thread of the ring 3 server (Netservr) posts no new
listens. NET SESSION at the server reports existing sessions, but the
ring 3 server refuses new requests.
Possible error messages:
   Error 51:  The remote computer is not available.
   Error 53:  The network path was not found.
   Error 240: The network connection is disconnected.
   Net3779:   Your logon attempt has failed due to an incorrect 
              username or password.
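
When collecting reports from users of a suspected sleeping server, it
can help to fold the misreported codes back into error 51. The sketch
below assumes the server is already suspected of sleeping; errors 53
and Net3779 can be genuine in other situations. The names
MISREPORTED_AS and effective_error() are hypothetical.

```python
# Codes that, per the description above, stand in for error 51 when the
# server is sleeping.
MISREPORTED_AS = {"53": "51", "NET3779": "51"}


def effective_error(code):
    """Return the underlying error for a code reported by a workstation."""
    key = str(code).strip().upper()
    return MISREPORTED_AS.get(key, key)
```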
Servers with a NETAPI.DLL failure respond to console keyboard
commands, and their workstation services allow network access to other
servers by means of NET USE for users logged on at the server. Since a
NETAPI.DLL failure leads to a Netlogon service failure, users cannot
log on if the server is a domain controller.
Attempting to shut down the server may return:
   Net Stop Server
   Net2190: The service ended abnormally
   Net Stop Workstation
   Net2189: The service cannot be controlled in its present state
   OS/2 shutdown may cause the server to hang.

CAUSE
=====
The symptoms indicate that the ring 3 server Scavenger thread has
failed. Among its other duties, this thread checks disk drives for free
space for alert purposes, and it hangs if a SCSI bid time-out failure
occurs while the check is in progress. If the ring 3 server fails,
active sessions continue operating but new connections are refused.
Primary domain controller ring 3 server threads can also hang if a
backup domain controller causes a semaphore deadlock by calling
NetAccountSync().
On an OS/2 LAN Manager HPFS386 server, the ring 3 server (Netservr)
provides all file services for FAT partitions, but only new server
connection requests for HPFS386 partitions. The ring 0 HPFS386 server
is optimized for performance, and it--not the ring 3 server--handles
file service. As a result, if the ring 3 server fails, HPFS386
partitions continue to service requests, new connections are refused,
and the server appears to sleep.
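
The division of work between the two servers can be modeled as a small
decision function. This is a sketch of the description above, not of
any LAN Manager interface; the function names and the "connect" and
"file" request labels are assumptions made for the example.

```python
def handled_by(request, partition):
    """Return which server component handles a request.

    request:   "connect" for a new connection, "file" for file service
    partition: "FAT" or "HPFS386"
    """
    if request == "file" and partition == "HPFS386":
        return "ring0"      # ring 0 HPFS386 server handles file service
    return "ring3"          # Netservr handles everything else


def is_serviced(request, partition, ring3_alive):
    """A request succeeds unless it needs a ring 3 server that has failed."""
    return ring3_alive or handled_by(request, partition) == "ring0"
```

With ring3_alive set to False, file requests on HPFS386 partitions are
still serviced while new connections are not, which matches the
sleeping-server symptoms.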

RESOLUTION
==========
Update NETAPI.DLL and NETLOGON.EXE CSD00.036. Following are procedures
for installing the fixes on a LAN Manager 2.2 OS/2 1.301 server:

Procedure 1: Installing OS2KRNL
-------------------------------
1. From the OS/2 File Manager, do the following:
   a. Select these options:
      - View
      - Include
      - All File Flags (Hidden)
      - Set View
      - Select OS2KRNL
      - File
      - Change Flags
   b. Deselect these options:
      - System
      - Hidden
      - Read Only
2. Issue a NET STOP WORKSTATION command at the server.
3. Shut down the server from the OS/2 desktop.
4. Use an HPFS386 recovery disk to boot the server.
5. Use OS/2 Disk 1 to perform the following command:
      chkdsk c: /f:386
6. Issue the following commands:
      rename c:\OS2KRNL OS2KRNL.old
      copy a:\OS2KRNL c:\OS2KRNL (USE CAPITAL LETTERS ONLY)
      rename c:\lanman\netprog\netwksta.sys *.old
      copy a:\netwksta.sys c:\lanman\netprog
7. Restart the server.

Procedure 2: Installing monolithic drivers 
on a LAN Manager 2.2 OS/2 1.301 server
--------------------------------------
1. In the root directory, issue the following commands:
      md laddr
      copy *.sys laddr (EXCEPT CONFIG.SYS)
      copy *.bid laddr
      copy *.tsd laddr
      copy *.vsd laddr
2. Copy the following files from OS/2 1.21 or 1.3 installation Disk 1
   (ISA computers) or Disk 2 (PS/2--Micro Channel computers) to the
   root directory:
      BASEDD0X.SYS
      DISK0X.SYS
   where X = 1 for ISA computers and 2 for PS/2--Micro Channel.
3. REM out the lines in the CONFIG.SYS file from DEVICE=DENON.VSD to
   IFS=CDROM.IFS.
4. Reboot.

Procedure 3: Installing updated NETAPI.DLL and NETLOGON.EXE
-----------------------------------------------------------
1. Issue the following command: 
      copy c:\config.sys c:\config.sav
2. Make this TEMPORARY change to the CONFIG.SYS file:
      E Config.sys
      Libpath=c:\lanman\netlib;...
         (remove c:\lanman\netlib)
      Libpath=...
3. Shut down the server.
4. Restart the server.
5. Issue the following commands:
      rename c:\lanman\netlib\netapi.dll netapi.old
      copy a:\netapi.dll c:\lanman\netlib
      rename c:\lanman\services\netlogon.exe netlogon.old
      copy a:\netlogon.exe c:\lanman\services
      copy c:\config.sav c:\config.sys
6. Restart the server so that the restored CONFIG.SYS takes effect.
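
The temporary LIBPATH change in step 2 of Procedure 3 amounts to
dropping one entry from a semicolon-separated list. The Python sketch
below shows that text transformation only; it is not a tool for the
server itself, and the function name remove_libpath_entry() is
hypothetical.

```python
def remove_libpath_entry(config_text, entry):
    """Return config_text with entry removed from the LIBPATH= line."""
    out = []
    for line in config_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip().upper() == "LIBPATH":
            # Keep every path except the one being removed.
            kept = [p for p in value.split(";")
                    if p.strip().lower() != entry.lower()]
            line = name + "=" + ";".join(kept)
        out.append(line)
    return "\n".join(out)
```

All other lines in the file are passed through unchanged, so copying
the saved CONFIG.SAV back afterward restores the original LIBPATH.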

                          CONFIGURATION
                          =============
Here is how servers exhibiting "sleeping" problems were configured:
 - 486 (> 33 MHz) PC server
 - SCSI controller or IDE controller (16-bit ISA or
   32-bit EISA or MCA)
 - LAN Manager 2.1, 2.1a, 2.2
 - Microsoft OS/2 1.301
 - HPFS386 partitions
 - Primary domain controller operation (Netlogon service)
 - Ifs ...... /cache:4096 or larger cache size
 - OS/2 ring 3 applications such as Netlogon or SQL Server

                      TUNING RECOMMENDATIONS
                      ======================
Check the server error log for this error:
   Net3101: The system ran out of a resource controlled
            by the *** option
where *** is the numbigbuf or numreqbuf parameter. If you find this
error, edit LANMAN.INI and increase the corresponding parameter in the
[Server] section to correct the problem and prevent future server
failures:
   Lanman.ini
   [Server]
   Numbigbuf = x    (1-80)
   Numreqbuf = x    (1-300)
LAN Manager allocates request and big buffers statically at server
startup. Under high-stress operating conditions, these resources can
be depleted, causing the ring 3 server threads (including the
Scavenger) to fail.
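
The check described above can be sketched as follows. The sketch
assumes the error log has been saved as text and that each entry
carries the Net3101 message with the option name filled in; the real
log layout may differ. The function names and the step argument are
hypothetical, and the ranges are those listed above.

```python
import re

# Valid ranges for the [Server] section of LANMAN.INI, as listed above.
LIMITS = {"numbigbuf": (1, 80), "numreqbuf": (1, 300)}

# Assumed log text: the Net3101 message, possibly wrapped across lines.
NET3101 = re.compile(
    r"Net3101:.*?controlled\s+by\s+the\s+(\w+)\s+option",
    re.IGNORECASE | re.DOTALL,
)


def depleted_options(log_text):
    """Return tunable options named in Net3101 messages, first seen first."""
    seen = []
    for match in NET3101.finditer(log_text):
        option = match.group(1).lower()
        if option in LIMITS and option not in seen:
            seen.append(option)
    return seen


def raised_value(option, current, step):
    """Return current + step, clamped to the listed range for option."""
    low, high = LIMITS[option]
    return max(low, min(high, current + step))
```

The clamp matters because a value outside the listed range is not a
valid setting.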

                    UTILITIES AND DIAGNOSTICS
                    =========================
PSTAT: PSTAT reports made before and after the failure can show
whether one or more Netlogon threads became stuck in a critical
section, or whether Netservr threads, including the Scavenger thread,
have been terminated.
Process and Thread Information on a sample PSTAT screen:
   Process    Thread
   Name       ID      Priority   Block ID   State
   NETLOGON   04      06FF       00000000   CritSec
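
If PSTAT output is captured to a text file, threads stuck in a
critical section can be picked out mechanically. The sketch below
assumes the five-column layout of the sample above (name, thread ID,
priority, block ID, state); actual PSTAT output may contain more
columns. The function name critsec_threads() is hypothetical.

```python
def critsec_threads(pstat_text):
    """Return (process, thread_id) pairs whose state column reads CritSec."""
    hits = []
    for line in pstat_text.splitlines():
        fields = line.split()
        # Expect: name, thread ID, priority, block ID, state
        if len(fields) == 5 and fields[4].lower() == "critsec":
            hits.append((fields[0], fields[1]))
    return hits
```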
Sniffer protocol analyzer traces will reveal that the server has no
listen commands outstanding. Each time the workstation tries to
connect, it receives the packet shown below and returns error 51 to
the user.
Sample detail of a Sniffer screen:
                        - Frame 1 -
   SUMMARY  Delta T     Destination   Source   Summary
   M        1           Workstation   Server   NETB Name 2TFRUIT Recognized
   NETB: ----- NETBIOS Name Recognized -----
   NETB:
   NETB: Header length = 44, Data length = 0
   NETB: Delimiter = EFFF (NETBIOS)
   NETB: Command = 0E
   NETB: No LISTEN command outstanding for this name.
   NETB: Caller's name type = 00 (Unique name)
   NETB: Transmit correlator = 000D
   NETB: Response correlator = 0000
   NETB: Receiver's name = Workstation<00>
   NETB: Sender's name = Server
   NETB:
If the server service is swapped to disk, however, sessions are
dropped and only LLC and NETB traffic remains active. The NETB traffic
may eventually end as well.
Sample from a Sniffer summary report:
     98    0.0369  SERVER        WORKSTATION   SMB C Open  \test.cmd
     99    0.0429  WORKSTATION   SERVER        NETB D=68 S=05 Data ACK
    100    0.0011  SERVER        WORKSTATION   LLC R D=F0 S=F0 RR NR=117
    101   15.5651  SERVER        WORKSTATION   NETB Session alive
    102    0.2152  WORKSTATION   SERVER        LLC R D=F0 S=F0 RR NR=36
    103    2.0314  WORKSTATION   SERVER        NETB Session alive
    104    0.0008  SERVER        WORKSTATION   LLC R D=F0 S=F0 RR NR=118
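
A summary report like the one above can be checked for the pattern in
which SMB traffic stops and only NETB and LLC frames remain. The
sketch below assumes a report saved as text with the column order
shown above (frame number, delta T, destination, source, summary); the
function name protocols_after_last_smb() is hypothetical.

```python
def protocols_after_last_smb(summary_text):
    """Return the sorted protocols seen after the last SMB frame."""
    protocols = []
    for line in summary_text.splitlines():
        parts = line.split(None, 4)
        # Expect: frame number, delta T, destination, source, summary
        if len(parts) == 5 and parts[0].isdigit():
            protocols.append(parts[4].split()[0])
    # Index of the last SMB frame, or -1 if the report has none.
    last_smb = max(
        (i for i, p in enumerate(protocols) if p == "SMB"), default=-1
    )
    return sorted(set(protocols[last_smb + 1:]))
```

A result containing only LLC and NETB suggests that higher-level
traffic has stopped while the transport is still alive.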

Additional reference words: Sleeping, 51, scsi, bid, hang, 2.00 2.10
2.10a 2.20
Copyright Microsoft Corporation 1993.