Networking 
on the Sun Workstation 


Sun Microsystems, Inc. • 2550 Garcia Avenue • Mountain View, CA 94043 » 415-960-1300 


Part No: 800-1177-01 
Revision A of 15 May, 1985 


Copyright ® 1985 by Sun Microsystems. 

This publication is protected by Federal Copyright Law, with all rights reserved. No part of this 
publication may be reproduced, stored in a retrieval system, translated, transcribed, or transmit- 
ted, in any form, or by any means manual, electric, electronic, electro-magnetic, mechanical, 
chemical, optical, or otherwise, without prior explicit written permission from Sun Microsystems. 


Revision History 


Sun’s Network File System 


Contents 


1. Introduction j 

1.1. Computing Environments 2 

1.2. Terms and Concepts 3 

1.3. Comparison with Predecessors 3 

1.3.1. NFS and RCP 4 

1.3.2. NFS and ND 4 

2. Examples of How it Works 5 

2.1. Mounting a Remote Filesystem 5 

2.2. Exporting a Filesystem 6 

2.3. Administering a Server Machine 6 

3. Architecture of NFS 7 

3.1. Design Goals 7 

3.1.1. Transparent Information Access 7 

3.1.2. Different Machines and Operating Systems 7 

3.1.3. Easily Extensible 7 

3.1.4. Easy Network Administration 7 

3.1.5. Reliable 8 

3.1.6. High Performance 8 

3.2. The NFS Implementation 9 

3.3. The NFS Interface 10 

4. Network Documentation Roadmap 12 


Sun’s Network File System 


1. Introduction 

This document gives an overview of Sun’s network file system, which allows users to mount 
directories across the network, and then to treat remote files as if they were local. The first sec- 
tion is a bit elementary, so advanced users may want to skip straight to the examples of how it 
works. Beginning users may not be interested in the third section, which discusses network file 
system architecture. 

The Network File System (NFS) is a facility for sharing files in a heterogeneous environment of 
machines, operating systems, and networks. Sharing is accomplished by mounting a remote 
filesystem, then reading or writing files in place. The NFS is open-ended, and Sun Microsystems 
encourages customers and other vendors to take advantage of the interface to extend the capa- 
bilities of other systems. 

A distributed network of personal workstations can provide more aggregate computing power 
than a mainframe computer, with far less variation in response time over the course of the day. 
Thus, a network of personal computers is generally more cost-effective than a central mainframe 
computer, particularly when considering the value of people’s time. However, for large program- 
ming projects and database applications, a mainframe has often been preferred, because all files 
can be stored on a single machine. 

Those who work with unconnected personal computers know the inconveniences resulting from 
data fragmentation. Even in a network environment, sharing programs and data has sometimes 
been difficult. Files either had to be copied to each machine where they were needed, or users 
had to log in to the remote machine with the required files. Network logins were time- 
consuming, and having multiple copies of a file got confusing as incompatible changes were made 
to separate copies. 

To solve this problem, Sun designed a distributed filesystem that permits client systems to gain 
access to shared files on a remote system. Client machines request resources provided by other 
machines, called servers. A server machine makes particular filesystems available, which client 
machines can mount as local filesystems. Thus, users can access remote files as if they were on 
the local machine. 

The NFS was not designed by extending the UNEXf operating system onto the network. Instead, 
the NFS was designed to fit into Sun’s network services architecture. Thus, NFS is not a distri- 
buted operating system, but rather, an interface to allow a variety of machines and operating 
systems to play the role of client or server. Sun has opened the NFS interface to customers and 
other vendors, in order to encourage the development of a rich set of applications working 
together on a single network. 


f UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


Page 2 


Network File System 


1.1. Computing Environments 

The current computing environment in many businesses and universities looks like this: 


The major problem with this environment is competition for CPU cycles. The workstation 
environment solves that problem, but introduces more disk drives into the picture. A network of 
workstations looks like this: 


Sun’s goal with NFS was to make all disks available as needed. Individual workstations have 
access to all information residing anywhere on the network. Printers and supercomputers may 
also be available somewhere on the network. 


Sun Microsystems 


Release 2.0 


Network File System 


Page 3 


1.2. Terms and Concepts 

A machine that provides resources to the network is a server, while a machine that employs 
these resources is a client. A machine may be both a server and a client. A person logged in on 
a client machine is a user, while a program or set of programs that run on a client is an applica- 
tion. There is a distinction between the code implementing the operations of a filesystem, 
(called filesystem operations), and the data making up the filesystem’s structure and contents, 
(called filesystem data). 

A traditional UNIX filesystem is composed of directories and files, each of which has a 
corresponding inode (index node), containing administrative information about the file, such as 
location, size, ownership, permissions, and access times. Inodes are assigned unique numbers 
within a filesystem, but a file on one filesystem could have the same number as a file on another 
filesystem. This is a problem in a network environment, because remote filesystems need to be 
mounted dynamically, and numbering conflicts would cause havoc. To solve this problem, Sun 
has designed the virtual file system (VFS), based on the vnode, a generalized implementation of 
inodes that are unique across filesystems. 

The Remote Procedure Call (RPC) facility provides a mechanism whereby one process (the 
caller process) can have another process (the server process) execute a procedure call, as if the 
caller process had executed the procedure call in its own address space (as in the local model of a 
procedure call). Because the caller and the server are now two separate processes, they no 
longer have to live on the same physical machine. 

The RPC mechanism is implemented as a library of procedures, plus a specification for portable 
data transmission, known as the eXternal Data Representation (XDR). Both RPC and XDR are 
portable, providing a kind of standard I/O library for interprocess communication. Thus pro- 
grammers now have a standardized access to sockets without having to be concerned about the 
low-level details of the accept, bind, and select procedures. 

The Yellow Pages (YP) is a network service to ease the job of administering networked 
machines. The YP is a centralized read-only database. For a client on the network file system, 
this means that an application’s access to data served by the YP is independent of the relative 
locations of the client and the server. The YP database on the server provides password, group, 
network, and host information to client machines. 


1.3. Comparison with Predecessors 

The Network File System (NFS) is composed of a modified UNIX kernel, a set of library routines, 
and a collection of utility commands. The NFS presents a network client with a complete 
remote filesystem. Since NFS is largely transparent to the user, this document tells you about 
things you might not otherwise notice. Sun’s NFS is an open system that can accommodate 
other machines on the net, even non-UNIX systems, without compromising security. 

Sun users may be familiar with two previous networking schemes, rep and ND. The first is a 
remote copy utility program that uses the networking facilities of 4.2 BSD to copy files from one 
machine to another. The second is a proprietary device driver for the Sun that makes raw disk 
available over the network. The NFS does not completely replace ND, so servers and clients will 
be running both ND and NFS. 

Because machines need ND to boot, an NFS server still needs a /pub partition. However, unlike 
the old ND configuration, under NFS this partition contains only /pub/vmunix, / pub/ boot , 


Sun Microsystems 


Release 2.0 


Page 4 


Network File System 


/pub/ stand and /pub/bin. There is a separate file system mounted on /usr containing every- 
thing else important. For example, /usr/bin used to be a symbolic link to /pub/ usr /bin; now 
the server gets /usr/bin off its own disk, while a client gets it by mounting the remote /usr 
filesystem onto the local /usr directory. This is true of /lib as well. The other standard NFS 
remote mount is called / usr2 , where users’ home directories reside. 

An exception arises when a client mounts a server’s /usr filesystem on its directory. Some files 
in /usr should be private, such as /usr/adm, /usr/ spool, / usr/tmp , among others. To get 
around the problem, these private files are symbolic links to / private/ usr . In an ND 
configuration, a few files in /usr/ lib, such as crontab, aliases, and sendmail.cf were private; 
these files are now symbolic links to / private/ usr /lib . 

1.3.1. NFS and RCP 

The remote copy utility (rep) allows data transfer only in units of files. The client of rep sup- 
plies the path name of a file on a remote machine, and receives a stream of bytes in return. 
Access control is based on the client’s login name and host name. 

The major problem with rep is that it is not transparent to the user, who winds up with a 
redundant copy of the desired file. The NFS, by contrast, is transparent — only one copy of the 
file is necessary. Another problem is that rep does nothing but copy files. In a sense, there 
needs to be one remote command for every regular command: for example, rdiff to perform 
differential file comparisons across machines. By providing entire filesystems, NFS makes this 
unnecessary. 


1.3.2. NFS and ND 

Sun’s Network Disk (ND) is a device driver that makes a raw disk available using a simple proto- 
col. The ND client builds its own filesystem, given the disk. Disk space on the server machine is 
partitioned, and diskless client machines mount one partition as their root filesystem, and 
another as their /usr filesystem. Symbolic links can be made between this pseudo-filesystem and 
files on the server machine. 

Under ND, access control of disk areas is based solely on the requester’s Internet Protocol (IP) 
address. Since IP addresses are assumed to be unique, this does not permit file sharing by the 
ND server. The NFS, on the other hand, allows file sharing. The use of the IP address as the 
basis of access control has two other drawbacks: first, an erroneous or malicious piece of network 
software can easily corrupt a user’s disk just by supplying an IP address; and second, it violates 
protocol layering concepts and makes it difficult to change a client’s IP address or ND server. 
Since the server emulates only a disk and not a filesystem, there can be no cacheing on the 
server side. The NFS permits cacheing, with concomitant performance improvements. 


^ Sun Microsystems 


Release 2.0 


Network File System 


Page 5 


2o Examples of How it Works 


2.1. Mounting a Remote Filesystem 


Suppose that you want to read some on-line manual pages. These pages are not available on the 
server machine, called server, but are available on a machine called docserv. You can 
mount the directory containing the manuals as follows: 


client# /etc/mount docserv : /usr/man /usr/man 


Note that you have to be superuser in order to do this. Now you can use the man command 
whenever you want. Try running the df command after you’ve mounted the remote filesystem. 
Its output will look something like this: 


Filesystem 
/dev/ndO 
/dev/ndpO 
server : /lib 
server : /usr 
server : /usr2 
docserv : /usr/man 

Here is a diagram of the three 


kbytes 

used 

avail 

capacity 

Mounted on 

4775 

2765 

1532 

64% 

/ 

5695 

3666 

1459 

72% 

/pub 

7295 

4137 

2428 

63% 

/lib 

39315 

31451 

3932 

89% 

/usr 

326215 

245993 

47600 

84% 

/usr 2 

346111 

216894 

94605 

70% 

/usr/man 


machines involved here. Ellipses represent machines, boxes 


represent remote filesystems, and dotted boxes represent ND partitions. 


man 


Sun Microsystems 


Release 2.0 


Page 6 


Network File System 


2.2. Exporting a Filesystem 

Suppose that you and a colleague need to work together on a programming project. The source 
code is on your machine, in the directory /usr/proj . It does not matter whether your worksta- 
tion is a diskless node, or has local disk. Suppose that after creating the proper directory, your 
colleague tried to remote mount your directory. Unless you have explicitly exported the direc- 
tory, your colleague’s remote mount will fail with a “permission denied” message. 

To export a directory, become superuser, and edit the file / etc/ exports . If your colleague is on a 
machine named cohort, then you need to put this one line in / etc/ exports: 

/usr/proj cohort 

Without the keyword cohort, anybody on the network could remote mount your directory 
/usr/proj. The NFS mount request server mountd( 8c) will read the / etc/ exports file if neces- 
sary whenever it receives a request for a remote mount. Now your colleague can remote mount 
the source directory by issuing this command: 

cohort# /etc/mount client: /usr/proj /usr/proj 

Since both you and your colleague will be able to change files on / usr/proj , it would be best to 
use the «cc«(l) source code control system for concurrency control. 


2.3. Administering a Server Machine 

System administrators must know how to set up the NFS server machine so that client worksta- 
tions can mount all the necessary filesystems. You export filesystems (that is, make them avail- 
able) by placing appropriate lines in the /etc/ exports file. Here is a sample / etc/ exports file for 
a typical server machine: 

/ 

/us r 
/usr2 

/usr/src staff 

The pathnames specified in / etc/ exports must be real filesystems — that is, directory mount 
points for disk devices. The root filesystem must be exported so that /lib is available to NFS 
clients. A netgroup, such as staff, may be specified after the filesystem, in which case remote 
mounts are limited to machines that are a member of this netgroup. At any one time, the sys- 
tem administrator can see which filesystems have been remote mounted, by executing the 
showmount{8) command. 


Sun Microsystems 


Release 2.0 


Network File System 


Page 7 


3. Architecture of NFS 


3.1. Design Goals 


8.1.1. Transparent Information Access 

Users are able to get directly to the files they want without knowing the network address of the 
data. To the user, all universes look alike: there seems to be no difference between reading or 
writing a file contained on a private disk, and reading or writing a file on a disk in the next 
building. Information on the network is truly distributed. 


8.1.2. Different Machines and Operating Systems 

No single vendor can supply tools for all the work that needs to get done, so appropriate services 
must be integrated on a network. In keeping with its policy of supplying open systems, Sun is 
promoting the NFS as a standard for the exchange of data between different machines and 
operating systems. 

8.1.8. Easily Extensible 

A distributed system must have an architecture that allows integration of new software technolo- 
gies without disturbing the extant software environment. To allow this, the NFS provides net- 
work services, rather than a new network operating system. That is, the NFS does not depend 
on extending the underlying operating system onto the network, but instead offers a set of proto- 
cols for data exchange. These protocols can be easily extended. 


8.1.4- Easy Network Administration 

The administration of large networks can be complicated and time-consuming. Sun wishes to 
make sure that a set of network filesystems is no more difficult to administer than a set of local 
filesystems on a timesharing system. UNIX has a convenient set of maintenance commands 
developed over the years. Some new utilities are provided for network administration, but most 
of the old utilities have been retained. 

The Yellow Pages (YP) facility is the first example of a network service made possible with NFS. 
By storing password information and host addresses in a centralized database, the yellow pages 
ease the task of network administration. An overview of the YP facility is presented in the Net- 
work Services Guide . 

The most obvious use of the YP is for administration of /etc/paaawd. Since the NFS uses a 
UNIX protection scheme across the network, it is advantageous to have a common / etc/paaawd 
database for all machines on the network. The YP allows a single point of administration, and 
gives all machines access to a recent version of the data, whether or not it is held locally. To 
install the YP version of / etc/paaawd , existing applications were not changed; they were simply 
relinked with library routines that know about the YP service. Conventions have been added to 


Sun Microsystems 


Release 2.0 


Page 8 


Network File System 


library routines that access /etc/paaawd to allow each client to administer its own local subset of 
/ etc/passwd ; the local subset modifies the client’s view of the system version. Thus, a client is 
not forced to completely bypass the system administrator in order to accomplish a small amount 
of personalization. 

The YP interface is implemented using RPC and XDR, so the service is available to non-UNIX 
operating systems and non-Sun machines. YP servers do not interpret data, so it is possible for 
new databases to take advantage of the YP service without modifying the servers. 


8.1.5. Reliable 

Reliability of the UNIX-based filesystem derives primarily from the robustness of the 4.2BSD 
filesystem. In addition, the file server protocol is designed so that client workstations can con- 
tinue to operate even when the server crashes and reboots. This property is shared with the 
current ND protocol, and has proven to be quite desirable. Sun achieves continuation after 
reboot without making assumptions about the fail-stop nature of the underlying server hardware. 

The major advantage of a stateless server is robustness in the face of client, server, or network 
failures. Should a client fail, it is not necessary for a server (or human administrator) to take 
any action to continue normal operation. Should a server or the network fail, it is only necessary 
that clients continue to attempt to complete NFS operations until the server or network gets 
fixed. This robustness is especially important in a complex network of heterogeneous systems, 
many of which are not under the control of a disciplined operations staff, and which may be run- 
ning untested systems often rebooted without warning. 


8 . 1 . 6 . High Performance 

The flexibility of the NFS allows configuration for a variety of cost and performance trade-offs. 
For example, configuring servers with large, high-performance disks, and clients with no disks, 
may yield better performance at lower cost than having many machines with small, inexpensive 
disks. Furthermore, it is possible to distribute the filesystem data across many servers and get 
the added benefit of multiprocessing without losing transparency. In the case of read-only files, 
copies can be kept on several servers to avoid bottlenecks. 

Sun has also added several performance enhancements to the NFS, such as “fast paths” to elim- 
inate the work done for high-runner operations, asynchronous service of multiple requests, cache- 
ing of disk blocks, and asynchronous read-ahead and write-behind. The fact that cacheing and 
read-ahead occur on both client and server effectively increases the cache size and read-ahead 
distance. Cacheing and read-ahead do not add state to the server; nothing (except performance) 
is lost if cached information is thrown away. In the case of write-behind, both the client and 
server attempt to flush critical information to disk whenever necessary, to reduce the impact of 
an unanticipated failure; clients do not free write-behind blocks until the server verifies that the 
data is written. 

Our performance goal was to achieve the same throughput as a previous release of the system 
that used the network only as a disk (and thus did not permit sharing). This goal has been 
achieved. 


Sun Microsystems 


Release 2.0 


Network File System 


Page 9 


3.2. The NFS Implementation 

In the Sun implementation of the NFS, there are three entities to be considered: the operating 
system interface, the virtual file system (VFS) interface, and the network file system (NFS) inter- 
face. The UNIX operating system interface has been preserved in the Sun implementation of the 
NFS, thereby insuring compatibility for existing applications. 

Vnodes are a re-implementation of inodes that cleanly separate filesystem operations from the 
semantics of their implementation. Above the VFS interface, the operating system deals in 
vnodes ; below this interface, the filesystem may or may not implement inodes. The VFS inter- 
face can connect the operating system to a variety of filesystems (for example, 4.2 BSD or MS- 
DOS). A local VFS connects to filesystem data on a local device. 

The remote VFS defines and implements the NFS interface, using the remote procedure call 
(RPC) mechanism. RPC allows communication with remote services in a manner similar to the 
procedure calling mechanism available in many programming languages. The RPC protocols are 
described using the external data representation (XDR) package. XDR permits a machine- 
independent representation and definition of high-level protocols on the network. 

The figure below shows the flow of a request from a client (at the top left) to a collection of 
filesystems. 


sys calls 


In the case of access through a local VFS, requests are directed to filesystem data on devices con- 
nected to the client machine. In the case of access through a remote VFS, the request is passed 
through the RPC and XDR layers onto the net. In the current implementation, Sun uses the 
UDP/IP protocols and the Ethernet. On the server side, requests are passed through the RPC 
and XDR layers to an NFS server; the server uses vnodes to access one of its local VFSs and ser- 
vice the request. This path is retraced to return results. 


Sun Microsystems 


Release 2.0 


Page 10 


Network File System 


Sun’s implementation of the NFS provides five types of transparency: 

1. Filesystem Type: The vnode, in conjunction with one or more local VFSs (and possibly 
remote VFSs) permits an operating system (hence client and application) to interface tran- 
sparently to a variety of filesystem types. 

2. Filesystem Location: Since there is no differentiation between a local and a remote VFS, the 
location of filesystem data is transparent. 

3. Operating System Type: The RPC mechanism allows interconnection of a variety of operat- 
ing systems on the network, and makes the operating system type of a remote server tran- 
sparent. 

4. Machine Type: The XDR definition facility allows a variety of machines to communicate on 
the network and makes the machine type of a remote server transparent. 

5. Network Type: RPC and XDR can be implemented for a variety of network and internet 
protocols, thereby making the network type-transparent. 

Simpler NFS implementations are possible at the expense of some advantages of the Sun version. 
In particular, a client (or server) may be added to the network by implementing one side of the 
NFS interface. An advantage of the Sun implementation is that the client and server sides are 
identical; thus, it is possible for any machine to be client, server or both. Users at client 
machines with disks can arrange to share over the NFS without having to appeal to a system 
administrator, or configure a different system on their workstation. 


3.3. The NFS Interface 

As mentioned in the preceding section, a major advantage of the NFS is the ability to mix 
filesystems. In keeping with this, Sun encourages other vendors to develop products to interface 
with Sun network services. RPC and XDR have been placed in the public domain, and serve as 
a standard for anyone wishing to develop applications for the network. Furthermore, the NFS 
interface itself is open and can be used by anyone wishing to implement an NFS client or server 
for the network. 

The NFS interface defines traditional filesystem operations for reading directories, creating and 
destroying files, reading and writing files, and reading and setting file attributes. The interface is 
designed so that file operations address files with an uninterpreted identifier, starting byte 
address, and length in bytes. 

Commands are provided for NFS servers to initiate service ( mountd ), and to serve a portion of 
their filesystem to the network (/ etc/ exports). Many commands are provided for constructing 
the YP database facility. A client builds its view of the filesystems available on the network 
with the mount command. 

The NFS interface is defined so that a server can be stateless. This means that a server does 
not have to remember from one transaction to the next anything about its clients, transactions 
completed or files operated on. For example, there is no open operation, as this would imply 
state in the server; of course, the UNIX interface uses an open operation, but the information in 
the UNIX operation is remembered by the client for use in later NFS operations. 

An interesting problem occurs when a UNIX application unlinks an open file. This is done to 
achieve the effect of a temporary file that is automatically removed when the application ter- 
minates. If the file in question is served by the NFS, the unlink will remove the file, since the 
server does not remember that the file is open. Thus, subsequent operations on the file will fail. 


Sun Microsystems 


Release 2.0 


Network File System 


Page 11 


In order to avoid state on the server, the client operating system detects the situation, renames 
the file rather than unlinking it, and unlinks the file when the application terminates. In certain 
failure cases, this leaves unwanted “temporary” files on the server; these files are removed as a 
part of periodic filesystem maintenance. 

Another example of how the NFS provides a friendly interface to UNIX without introducing state 
is the mount command. A UNIX client of the NFS “builds” its view of the filesystem on its local 
devices using the mount command; thus, it is natural for the UNIX client to initiate its contact 
with the NFS and build its view of the filesystem on the network via an extended mount com- 
mand. This mount command does not imply state in the server, since it only acquires informa- 
tion for the client to establish contact with a server. The mount command may be issued at any 
time, but is typically executed as a part of client initialization. The corresponding unmount 
command (which replaces the UNIX umount) is only an informative message to the server, but it 
does change state in the client by modifying its view of the filesystem on the network. 

The major advantage of a stateless server is robustness in the face of client, server or network 
failures. Should a client fail, it is not necessary for a server (or human administrator) to take 
any action to continue normal operation. Should a server or the network fail, it is only necessary 
that clients continue to attempt to complete NFS operations until the server or network is fixed. 
This robustness is especially important in a complex network of heterogeneous systems, many of 
which are not under the control of a disciplined operations staff and may be running untested 
systems and/or may be rebooted without warning. 

An NFS server can be a client of another NFS server. However, a server will not act as an 
intermediary between a client and another server. Instead, a client may ask what remote 
mounts the server has and then attempt to make similar remote mounts. The decision to disal- 
low intermediary servers is based on several factors. First, the existence of an intermediary will 
impact the performance characteristics of the system; the potential performance implications 
are so complex that it seems best to require direct communication between a client and server. 
Second, the existence of an intermediary complicates access control; it is much simpler to require 
a client and server to establish direct agreements for service. Finally, disallowing intermediaries 
prevents cycles in the service arrangements; Sun prefers this to detection or avoidance schemes. 

The NFS currently implements UNIX file protection by making use of the authentication 
mechanisms built into RPC. This retains transparency for clients and applications that make 
use of UNIX file protection. Although the RPC definition allows other authentication schemes, 
their use may have adverse effects on transparency. 

Although the NFS is UNIX-friendly, it does not support all UNIX filesystem operations. For 
example, the “special file” abstraction of devices is not supported for remote filesystems because 
it is felt that the interface to devices would greatly complicate the NFS interface; instead, dev- 
ices are implemented in a local /dev VFS. Other incompatibilities are due to the fact that NFS 
servers are stateless. For example, file locking and guaranteed APPEND_MODE are not sup- 
ported in the remote case. 

Our decision to omit certain features from the NFS is motivated by a desire to preserve the 
stateless implementation of servers and to define a simple, general interface to be implemented 
and used by a wide variety of customers. The availability of open RPC and NFS interfaces 
means that customers and users who need stateful or complex features can implement them 
“beside” or “within” the NFS. Sun is considering implementation of a set of tools for use by 
applications that need file or record locking, replicated data, or other features implying state 
and/or distributed synchronization; however, these will not be made part of the base NFS 
definition. 


Sun Microsystems 


Release 2.0 


Page 12 


Network File System 


4. Network Documentation Roadmap 

The document Network Services Guide is intended for users who have a general interest in net- 
work services. It explains the yellow pages facility in some detail. Although it is not a manual 
for system administrators, the material is heavily slanted in that direction. 

The document Remote Procedure Call Programming Guide is intended for programmers who 
wish to write network applications using remote procedure calls, thus avoiding low-level system 
primitives based on sockets. Readers must be familiar with the C programming language, and 
should have a working knowledge of network theory. 

The document External Data Representation Protocol Specification is intended for programmers 
writing complicated applications using remote procedure calls, who need to pass complicated 
data across the network. It is also a reference guide for system programmers implementing Sun’s 
Network File System on new machines. 

The document Remote Procedure Call Protocol Specification is a reference guide for system pro- 
grammers implementing Sun’s Network File System on new machines. It is of little interest to 
programmers writing network applications. 

The document Network File System Protocol Specification is a reference guide for system pro- 
grammers implementing Sun’s Network File System on new machines. It is of little interest to 
programmers writing network applications. 

The document Yellow Pages Protocol Specification is a reference guide for system programmers 
implementing a Yellow Pages database facility on new machines. It is of little interest to pro- 
grammers writing network applications. 

The document Inter-Process Communications Primer, taken from Berkeley’s 4.2 release, is for 
system programmers who need to use low-level networking primitives based on sockets. Since 
remote procedure calls are easier to use than sockets, this primer is of little interest to most net- 
work programmers. 

The document Network Implementation describes the low-level networking primitives in the 4.2 
UNIX kernel. It is of interest primarily to system programmers and aspiring UNIX gurus. 


Sun Microsystems 


Release 2.0 


Network Services Guide 


Contents 


1. Introduction 1 

2. What Are The Yellow Pages? 1 

2.1. The YP Map 1 

2.2. The YP Domain 1 

2.3. Servers And Clients 2 

2.4. Masters and Slaves 2 

3. Overview of the Yellow Pages 3 

3.1. The YP Network Service 3 

3.1.1. Naming 3 

3.1.2. Data Storage 3 

3.1.3. Servers 4 

3.1.4. Clients 4 

3.2. Default YP Files 4 

3.2.1. Hosts 4 

3.2.2. Passwd 5 

3.2.3. Others 5 

3.2.4. Changing your passwd 5 


Network Services Guide 


1. Introduction 

This document is intended for users who have a general interest in network services. Although 
this is not a manual for system administrators, the material is heavily slanted in that direction. 

Sun provides several network services, such as Network Disk (ND), and the Network File System 
(NFS), discussed in the document Sun's Network File System. The yellow pages are another net- 
work service offered for the first time on the 2.0 release. They permit password information and 
host addresses for an entire network to be held in a single database. This greatly eases the task 
of system and network administration. Sun will provide more network services in the future. 


2. What Are The Yellow Pages? 

The yellow pages (YP) constitute a distributed network lookup service: 

• YP is a lookup service: it maintains a set of databases for querying. Programs can ask for 
the value associated with a particular key, or all the keys, in a database. 

• YP is a network service: programs need not know the location of data, or how it is stored. 
Instead, they use a network protocol to communicate with a database server that knows 
those details. 

• YP is distributed: databases are fully replicated on several machines, known as YP servers. 
Servers propagate updated databases among themselves, ensuring consistency. At steady 
state, it doesn’t matter which server answers a request; the answer is the same everywhere. 


2.1. The YP Map 

The yellow pages serve information stored in YP maps. Each map contains a set of keys and 
associated values. For example, the hosts map contains (as keys) all host names on a network, 
and (as values) the corresponding Internet addresses. Each YP map has a mapname, used by 
programs to access data in the map. Programs must know the format of the data in the map. 
Currently, most maps are derived from ASCII files formerly found in /etc: passwd, group, hosts, 
networks, and others. The format of data in the YP map is in most cases identical to the format 
of the ASCII file. Maps are implemented by dbm(3) files located in subdirectories of / etc/yp on 
YP server machines. 


2.2. The YP Domain 

A YP domain is a named set of YP maps. You can determine your YP domain by executing the 
domainname(l ) command. Note that YP domains are different from both Internet domains and 
sendmail domains. A YP domain is simply a directory in j etc/yp containing a set of maps. 


Sun Microsystems 


Release 2.0 


Page 2 


Network Services 


A domain name is required for retrieving data from a YP database. For instance, if your YP 
domain is sun and you want to find the Internet address of host dbscrver , you must ask YP for 
the value associated with the key dbserver in the map hosta.bynamc within the YP domain sun. 
Each machine on the network belongs to a default domain, set in /etc/rc.local at boot time with 
the domainname( 8) command. 

A YP server holds all the maps of a YP domain in a subdirectory of /etc/yp, named after the 
domain. In the example above, maps for the sun domain would be held in /etc/yp/ sun. Every 
YP server must have the directory / etc/yp/ yp_private , which contains information about 
servers, domains, and maps. This information is used internally by the YP. For completeness, 
the YP server machine is its own client. 


2.3. Servers And Clients 

Servers provide resources, while clients consume them. A server or a client is not necessarily the 
same thing as a machine. To illustrate, let’s consider three different services: ND (network disk), 
the YP, and the NFS (network file system). 

ND ND is a method of providing virtual disk, used by diskless nodes. With ND, it makes sense 
to speak of server and client machines, since both provider and consumer are coterminous 
with machines. Furthermore, the server and client are always the same. 

NFS The NFS allows client machines to mount remote filesystems and access files in place, pro- 
vided a server machine has exported the filesystem. However, a server that exports filesys- 
tems may also mount remote filesystems exported by other machines, thus becoming a 
client. So a given machine may be both server and client, or client only, or server only. 
Furthermore, NFS servers and clients need not coincide with ND servers and clients. 

YP The YP server, by contrast, is a process rather than a machine, running on a machine that 
may be neither ND server nor NFS server. All processes that make use of YP services are 
YP clients. Sometimes clients are served by YP servers on the same machine, but other 
times by YP servers running on another machine. To further muddy the waters, processes 
on master YP server machines (discussed below) don’t use YP services at all, and aren’t YP 
clients. But processes using YP services on slave YP servers are YP clients. 

2.4. Masters and Slaves 

YP servers are either master or slave. For any map, one YP server is designated the master, 
and all changes to the map should be made on that machine. The changes propagate from mas- 
ter to slaves. A newly built map is timestamped internally when makedbm creates it. If you 
build a YP map on a slave server, you will break the YP update algorithm (temporarily), and 
you will have to get all versions in synch manually. Moral: after you decide which server is the 
master, do all database updates and builds there, not on slaves. 

It is possible for different maps to have different servers as master. Therefore, a given server 
may be a master with regard to one map, and a slave with regard to another. This can get 
confusing quickly. It is suggested that a single server be master for all the maps created by 
ypinit in a single domain. This document assumes the simple case, in which one server is the 
master for all maps in the database. 


Sun Microsystems 


Release 2.0 


Network Services 


Page 3 


3. Overview of the Yellow Pages 

In releases before 2.0, each machine on the network had its own copy of /etc/ hosts, a file con- 
taining the Internet address of each machine on the network. Every time a machine was added 
to the network, each /etc /hosts file had to be updated. 

The YP is a network service containing network-wide databases such as /etc/ hosts. There are 
servers spread throughout the network containing copies of the databases. When an arbitrary 
machine on the network wants to look up something in /etc/ hosts, it makes an RPC call to one 
of the servers to get the information. One server is the master — -the only one whose database 
may be modified. The other servers are slaves, and they are periodically updated so that their 
information is in synch with that of the master. 

The YP can serve up any number of databases. Normally that will include files that previously 
lived in /etc, such as /etc/ hosts and / etc/ networks . However, users can add their own datar 
bases to the YP. 

The YP itself simply serves up information, and has no idea what it means. Thus there are two 
parts of YP we need to consider: how it operates, and what files formerly in / etc now live in the 
YP. This has serious ramifications for users. 


3.1. The YP Network Service 


3.1.1. Naming 

Imagine a company with two different networks, each of which has its own separate list of hosts 
and passwords. Within each network, user names, numerical user IDs, and host names are 
unique. However, there is duplication between the two networks. If these two networks are ever 
connected, chaos could result. The host name, returned by the hostname(l) command and the 
gethostname{ ) system call, may no longer uniquely identify a machine. Thus a new command 
and system call, domainname(l) and getdomainname(2) have been added. In the example 
above, each of the two networks could be given a different domain name. However, it is always 
simpler to use a single domain whenever possible. 

The relevance of domains to YP is that data is stored in /etc/yp/ domainname. In particular, a 
machine can contain data for several different domains. 


3.1.2. Data Storage 

The data is stored in dbm(3) format. Thus the database hosts. byname for the domain sun is 
stored as / etc/yp/ sun/ hosts. byname.pag and fetc/yp/ sun/ hosts. byname. dir. The command 
makedbm(8) takes an ASCII file such as /etc/ hosts and converts it into a dbm file suitable for 
use by the YP. However, system administrators normally use the makefile in /etc/yp to create 
new dbm files (read on for details). This makefile in turn calls makedbm. 


Sun Microsystems 


Release 2.0 


Page 4 


Network Services 


3.1.8. Servers 

To become a server, a machine must contain the YP databases, and must also be running the YP 
daemon ypserv. The ypinit( 8) command invokes this daemon automatically. It also takes a flag 
saying whether you are creating a master or a slave. When updating the master copy of a data/- 
base, you can force the change to be propagated to all the slaves with the yppush( 8) command. 
This pushes the information out to all the slaves. Conversely, from a slave, the yppull( 8) com- 
mand gets the latest information from the master. The makefile in /etc/yp first executes mak- 
edbm to make a new database, and then calls yppush to propagate the change throughout the 
network. 


3.1.4- Clients 

Remember that a client machine (which is not a server) does not contain any data itself, but 
rather makes an RPC call to a YP server each time it needs information from a YP database. 
The ypbind(8 ) daemon caches the name of a server. When a client boots, ypbind broadcasts ask- 
ing for the name of the YP server. Similarly, if the cached server crashes, ypbind broadcasts 
asking for the name of a new server. The ypwhich(l) command gives the name of the server 
that ypbind currently points at. 

Since client machines no longer have entire copies of files in the YP, a new command ypcat{ 1) 
has been provided. The command ypcat hosts is equivalent to cat /etc/hosts in a pre 
2.0 system; as you might guess, ypcat passwd is equivalent to cat /etc/passwd. To look 
for someone’s password entry, searching through the password file no longer suffices; you have to 
issue the following command 

% ypcat passwd | grep userid 

where you replace userid with the login name you’re searching for. 


3.2. Default YP Files 

By default, Sun workstations have six files from /etc in the YP: / etc/passwd , / etc/ groups , 
/ etc/ networks , / etc/ hosts , / etc/ services , and / etc/ protocols . In addition, there is a new file 
netgroup, which many sites ought to create and put in the YP database. 

Library routines such as ,getpwent(3) ,getgrent(3) and gethostent{ 3) have been rewritten to take 
advantage of the YP. Thus, C programs that call these library routines will have to be relinked 
in order to function correctly. 


8.2.1. Hosts 

The hosts file is stored as two different files in the YP. The first, hosts. byname , is indexed by 
hostname. The second, hosts. byaddr , is indexed by Internet address. Remember that this actu- 
ally expands into four files, with suffixes .pag, and .dir. When a user program calls the library 
routine gethostbyname(3), a single RPC call to a server retrieves the entry from the 
hosts. byname file. Similarly, gethostbyaddr{3) retrieves the entry from the hosts. byaddr file. Of 
course if the YP is not running (which is caused by commenting ypbind out of the /etc/rc file), 
then gethostbyname will read the /etc/ hosts files, just as it always has. 


Sun Microsystems 


Release 2.0 


Network Services 


Page 5 


Although the ypcat command is a general YP database print program, it knows about the stan- 
dard files in the YP. Thus ypcat hosts is translated into ypcat hosts .byaddr, since 
there is no file called hosts in the YP. 

Normally, the hosts file for the YP will be the same as the /etc/hosts file on the machine serving 
as a YP master. In this case, the makefile in /etc/yp will check to see if /etc/hosts is newer 
than the dbm file. If it is, it will use a simple sed script to recreate hosts.byname and 
hosts. byaddr , run them through makedbm(8 ) and then call yppush( 8). See ypmake( 8) for details. 


3. 2. 2. Passwd 

The passwd file is similar to the hosts file. It exists as two separate files, passwd. byname and 
passwd.byuid . The ypcat program prints it, and ypmake updates it. However, if getpwent{8 ) 
always went directly to the YP as does gethostent{ 3), then everyone would be forced to have an 
identical password file! Consequently, getpwent reads the local /etc/ passwd file, just as it always 
did. But now it interprets “+” entries in the password file to mean, interpolate entries from the 
YP database. If you wrote a simple program using getpwent to print out all the entries from 
your password file, it would print out a virtual password file: rather than printing out + signs, it 
would print out whatever entries the local password file included from the YP database. The 
difference between /etc/ hosts and /etc/ passwd is discussed in more detail in the section “How 
Security is Changed with the Yellow Pages,” part of the System Administrator’s Manual. 


3.2.3. Others 

Of the other four files in /etc, / etc/ group is treated like / etc/ passwd , in that getgrentf) will 
only consult the YP if explicitly told to do so by the /etc/group file. The files / etc / networks , 
/ etc/ protocols , / etc/ services , and / etc/ networks are treated like / etc/hosts : for these files, the 
library routines go directly to the YP, without consulting the local files. 


3.2.4- Changing your passwd 

To change data in the YP, you must log onto the master machine, and edit databases there; 
ypwhich{\) tells where the master server is. However, since changing a password is so commonly 
done, the yppasswd(l) command has been provided to change your YP password. It has the 
same user interface as the pa5«u;</(l) command. This command will only work if the 
yppasswdd{ 8c) server has been started up on the YP master server machine. 


Sun Microsystems 


Release 2.0 


Remote Procedure Call 
Programming Guide 


Contents 


1. Introduction 1 

2. Introductory Examples 3 

2.1. Highest Layer 3 

2.2. Intermediate Layer 4 

2.3. Assigning Program Numbers 5 

2.4. Passing Arbitrary Data Types 0 

3. Lower Layers of RPC g 

3.1. More on the Server Side 9 

3.2. Memory Allocation with XDR 11 

3.3. The Calling Side 13 

4. Other RPC Features 15 

4.1. Select on the Server Side 15 

4.2. Broadcast RPC 15 

4.2.1. Broadcast RPC Synopsis 16 

4.3. Batching 16 

4.4. Authentication 20 

4.4.1. The Client Side 20 

4.4.2. The Server Side 21 

4.5. Using Inetd 23 

6. More Examples 24 

5.1. Versions 24 

5.2. TCP 25 

5.3. Callback Procedures 29 

A, Synopsis of RPC Routines 34 


Remote Procedure Gall 

Programming Guide 


lo Introduction 

This document is intended for programmers who wish to write network applications using remote 
procedure calls (explained below), thus avoiding low-level system primitives based on sockets. 
The reader must be familiar with the C programming language, and should have a working 
knowledge of network theory. 

Programs that communicate over a network need a paradigm for communication. A low-level 
mechanism might send a signal on the arrival of incoming packets, causing a network signal 
handler to execute. A high-level mechanism would be the Ada rendezvous. The method used 
at Sun is the Remote Procedure Call (RPC) paradigm, in which a client communicates with a 
server. In this process, the client first calls a procedure to send a data packet to the server. 
When the packet arrives, the server calls a dispatch routine, performs whatever service is 
requested, sends back the reply, and the procedure call returns to the client. 

The RPC interface is divided into three layers. The highest layer is totally transparent to the 
programmer. To illustrate, at this level a program can contain a call to rnusers () , which 
returns the number of users on a remote machine. You don’t have to be aware that RPC is 
being used, since you simply make the call in a program, just as you would call malloc () . 

At the middle layer, the routines registerrpc () and callrpc() are used to make RPC 
calls: registerrpc () obtains a unique system-wide number, while callrpc() executes a 
remote procedure call. The rnusers () call is implemented using these two routines The 
middle-layer routines are designed for most common applications, and shield the user from know- 
ing about sockets. 

The lowest layer is used for more sophisticated applications, which may want to alter the 
defaults of the routines. At this layer, you can explicitly manipulate sockets used for transmit- 
ting RPC messages. This level should be avoided if possible. 

Section 2 of this manual illustrates use of the highest two layers while Section 3 presents the 
low-level interface. Section 4 of the manual discusses miscellaneous topics. The final section 
summarizes all the entry points into the RPC system. 

Although this document only discusses the interface to C, remote procedure calls can be made 
from any language. Even though this document discusses RPC when it is used to communicate 
between processes on different machines, it works just as well for communication between 
different processes on the same machine. 

There is a diagram of the RPC paradigm on the next page. 


Sun Microsystems 


Release 2.0 


Page 2 


RPC Programming 


client 

program 


cal lrpc () 
function 


Machine A 


return 


reply 


service 

daemon 


execute 

request 


Machine B 


call 


service 


request 

completed 


return 


service 

executes 


answer 


program 

continues 


Si 


Figure 1: Network Communication with the Remote Procedure Call 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 3 


2. Introductory Examples 


2.1. Highest Layer 

Imagine you’re writing a program that needs to know how many users are logged into a remote 
machine. You can do this by calling the library routine rnusers () , as illustrated below: 

#include <stdio.h> 

main (argc, argv) 
int argc; 
char * * argv ; 

{ 

unsigned num; 
if (argc < 2) { 

fprintf (stderr , "usage: rnusers hostname\n") ; 
exit (1) ; 

> 

if ((num = rnusers (argv [1] ) ) < 0) { 

fprintf (stderr , "error: rnusers\n") ; 
exit (-1) ; 

> 

printf("%d users on %s\n", num, argv[l]); 
exit (0) ; 

> 

RPC library routines such as rnusers () are included in the C library libc.a. Thus, the pro- 
gram above could be compiled with 

% cc program. c 

Some other library routines are rstat () to gather remote performance statistics, and 
ypmatch () to glean information from the yellow pages (YP). The YP library routines are docu- 
mented on the manual page ypclnt(3N). 


Sun Microsystems 


Release 2.0 


Page 4 


RPC Programming 


2.2. Intermediate Layer 

The simplest interface, which explicitly makes RPC calls, uses the functions callrpc() and 
registerrpc () . Using this method, another way to get the number of remote users is: 

#includ© <stdio.h> 

^include <rpcsvc/rusers .h> 

main (argc , argv) 
int argc; 
char **argv; 

{ 

unsigned long nusers; 
if (argc < 2) { 

fprintf (stderr , "usage: nusers hostname\n") ; 
exit (-1) ; 

> 

if (callrpc (argv [1] , RUSERSPROG, RUSERSVERS, RUSERSPROC_NUM ( 
xdr_void, 0, xdr_u_long, finusers) != 0) { 
fprintf (stderr, "error : callrpc\n") ; 
exit (1) ; 

> 

print f ("number of users on %s is %d\n", argv [1] , nusers); 
exit (0) ; 

> 

A program number, version number, and procedure number defines each RPC procedure. The 
program number defines a group of related remote procedures, each of which has a different pro- 
cedure number. Each program also has a version number, so when a minor change is made to a 
remote service (adding a new procedure, for example), a new program number doesn’t have to be 
assigned. When you want to call a procedure to find the number of remote users, you look up 
the appropriate program, version and procedure numbers in a manual, similar to when you look 
up the name of memory allocator when you want to allocate memory. 

The simplest routine in the RPC library used to make remote procedure calls is callrpc () . It 
has eight parameters. The first is the name of the remote machine. The next three parameters 
are the program, version, and procedure numbers. The following two parameters define the 
argument of the RPC call, and the final two parameters are for the return value of the call. If it 
completes successfully, callrpc () returns zero, but nonzero otherwise. The exact meaning of 
the return codes is found in <rpc/clrvt .h>, and is in fact an enum clnt_stat cast into an 
integer. 

Since data types may be represented differently on different machines, callrpc () needs both 
the type of the RPC argument, as well as a pointer to the argument itself (and similarly for the 
result). For RUSERSPROC_NUM, the return value is an unsigned long, so callrpc () has 
xdr_u_long as its first return parameter, which says that the result is of type unsigned 
long, and inusers as its second return parameter, which is a pointer to where the long result 
will be placed. Since RUSERSPROC_NUM takes no argument, the argument parameter of 
callrpc() is xdr_void. 

After trying several times to deliver a message, if callrpc () gets no answer, it returns with an 
error code. The delivery mechanism is UDP, which stands for User Datagram Protocol. 
Methods for adjusting the number of retries or for using a different protocol require you to use 
the lower layer of the RPC library, discussed later in this document. The remote server 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 5 


procedure corresponding to the above might look like this: 


char 4 

nuser (indata) 

char * indata; 


< 


static int nusers; 


/* 

* code here to compute the number of users 

* and place result in variable nusers 

V 

return ((char *)&nusers); 


It takes one argument, which is a pointer to the input of the remote procedure call (ignored in 
our example), and it returns a pointer to the result. In the current version of C, character 
pointers are the generic pointers, so both the input argument and the return value are cast to 
char *. 

Normally, a server registers all of the RPC calls it plans to handle, and then goes into an infinite 
loop waiting to service requests. In this example, there is only a single procedure to register, so 
the main body of the server would look like this: 

^include <stdio.h> 

#include <rpcsvc/rusers.h> 

char * nuser (); 

main () 

{ 

registerrpc (RUSERSPROG, RUSERSVERS, RUSERSPROC_NUM, nuser, 
xdr_void, xdr_u_long) ; 
svc_run () ; /* never returns */ 

fprintf (stderr , "Error: svc_run returned !\n") ; 
exit (1) ; 

> 


The registerrpc () routine establishes what C procedure corresponds to each RPC procedure 
number. The first three parameters, RUSERPROG, RUSERSVERS, and RUSERSPROC_NUM 
are the program, version, and procedure numbers of the remote procedure to be registered; 
nuser is the name of the C procedure implementing it; and xdr_void and xdr_u_long are 
the types of the input to and output from the procedure. 

Only the UDP transport mechanism can use registerrpc () ; thus, it is always safe in conjunc- 
tion with calls generated by callrpc () . 

Warning: the UDP transport mechanism can only deal with arguments and results less than 8K 
bytes in length. 


2.3. Assigning Program Numbers 

Program numbers are assigned in groups of 0x20000000 (536870912) according to the following 
chart: 


$ 0 ? Sun Microsystems 


Release 2.0 


Page 6 


RPC Programming 


0 - 

lfffffff 

defined by sun 

20000000 - 


defined by user 

40000000 - 


transient 

60000000 - 

7fffffff 

reserved 

80000000 - 

9fffffff 

reserved 

aOOOOOOO - 

bfffffff 

reserved 

cOOOOOOO - 

dfffffff 

reserved 

eOOOOOOO - 

ffffffff 

reserved 


Sun Microsystems administers the first group of numbers, which should be identical for all Sun 
customers. If a customer develops an application that might be of general interest, that applica- 
tion should be given an assigned number in the first range. The second group of numbers is 
reserved for specific customer applications. This range is intended primarily for debugging new 
programs. The third group is reserved for applications that generate program numbers dynami- 
cally. The final groups are reserved for future use, and should not be used. 

The exact registration process for Sun defined numbers is yet to be established. 


2.4. Passing Arbitrary Data Types 

In the previous example, the RPC call passes a single unsigned long. RPC can handle arbi- 
trary data structures, regardless of different machines’ byte orders or structure layout conven- 
tions, by always converting them to a network standard called external Data Representation 
(XDR) before sending them over the wire. The process of converting from a particular machine 
representation to XDR format is called serializing , and the reverse process is called deserializing . 
The type field parameters of callrpc() and registerrpc () can be a built-in procedure like 
xdr_u_long() in the previous example, or a user supplied one. XDR has these built-in type 
routines: 

xdr_int () xdr_u_int() 

xdr_long() xdr_u_long() 

xdr_short () xdr_u_short () 

As an example of a user-defined type routine 

struct simple { 
int a; 
short b ; 

} simple; 

then you would call callrpc as 

cal lrpc (hostname, PROGNUM, VERSNUM, PROCNUM, xdr_simple, fisimple ...); 
where xdr_simple() is written as: 


xdr_enum() 
xdr_bool () 
xdr_string () 

, if you wanted to send the structure 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 7 


#include <rpc/rpc.h> 


xdr_simple (xdrsp, simplep) 

XDR ‘xdrsp; 

struct single ‘simplep; 


{ 

if ( !xdr_int (xdrsp, &simplep->a) ) 
return (O) ; 

if ( !xdr_short (xdrsp, &simplep->b) ) 
return (O) ; 
return (1) ; 

> 


An XDR routine returns nonzero (true in the sense of C) if it completes successfully, and zero 
otherwise. A complete description of XDR is in the XDR Protocol Specification, so this section 
only gives a few examples of XDR implementation. 

In addition to the built-in primitives, there are also the prefabricated building blocks: 

xdr_array () xdr_bytes () 

xdr_reference () xdr_union() 

To send a variable array of integers, you might package them up as a structure like this 

struct varintarr ■{ 
int ‘data; 
int arrlnth; 

> arr; 

and make an RPC call such as 

callrpc (hostname, PROGNUM, VERSNUM, PROCNUM, xdr_varintarr , &arr...); 

with xdr_varintarr () defined as: 

xdr_varintarr (xdrsp, varintarr) 

XDR ‘xdrsp; 

struct varintarr *arrp; 

{ 

xdr_arr ay (xdrsp, &arrp->data, &arrp->arrlnth, MAXLEN, 
sizeof (int) , xdr_int) ; 

> 

This routine takes as parameters the XDR handle, a pointer to the array, a pointer to the size of 
the array, the maximum allowable array size, the size of each array element, and an XDR rou- 
tine for handling each array element. 

If the size of the array is known in advance, then the following could also be used to send out an 
array of length SIZE: 


Sun Microsystems 


Release 2.0 


Page 8 


RPC Programming 


int intarr[SIZE]; 

xdr_intarr (xdrsp, intarr) 

XDR ‘xdrsp; 
int intarr [] ; 

int i ; 

for (i = 0; i < SIZE; i++) { 

if (! xdr_int (xdrsp , &intarr[i])) 
return (0) ; 

> 

return (1) ; 

> 

XDR always converts quantities to 4-byte multiples when deserializing. Thus, if either of the 
examples above involved characters instead of integers, each character would occupy 32 bits. 
That is the reason for the XDR routine xdr_bytes () , which is like xdr_array() except that 
it packs characters. It has four parameters, the same as the first four parameters of 
xdr_array(). For null-terminated strings, there is also the xdr_string () routine, which is 
the same as xdr_bytes () without the length parameter. On serializing it gets the string 
length from strlen () , and on deserializing it creates a null-terminated string. 

Here is a final example that calls the previously written xdr_simple() as well as the built-in 
functions xdr_string() and xdr_reference () , which chases pointers: 

struct finalexample ■{ 
char * string; 
struct simple ‘simplep; 

} finalexample; 

xdr_finalexample (xdrsp, f inalp) 

XDR ‘xdrsp; 

struct finalexample * f inalp; 

{ 

int i ; 

if ( !xdr_string (xdrsp, &f inalp->string, MAXSTRLEN) ) 
return (0) ; 

if ( ! xdr_r e f erence (xdrsp , &finalp->simplep , 
sizeof (struct simple), xdr_simple) ; 
return (0) ; 
return (1) ; 

> 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 9 


3o Lower Layers of RPC 

In the examples given so far, RPC takes care of many details automatically for you. In this sec- 
tion, we’ll show you how you can change the defaults by using lower layers of the RPC library. 
It is assumed that you are familiar with sockets and the system calls for dealing with them. If 
not, consult The IPC Tutorial. 


3.1. More on the Server Side 

There are a number of assumptions built into registerrpc () . One is that you are using the 
UDP datagram protocol. Another is that you don’t want to do anything unusual while deserializ- 
ing, since the deserialization process happens automatically before the user’s server routine is 
called. The server for the nusers program shown below is written using a lower layer of the 
RPC package, which does not make these assumptions. 

^include <stdio.h> 

#include <rpc/rpc.h> 

#include <rpcsvc/rusers.h> 

int nuser () ; 

main () 

{ 

SVCXPRT *transp ; 

transp = svcudp_create (RPC_ANYSOCK) ; 
if (transp == NULL) { 

fprintf (stderr , "couldn't create an RPC server\n") ; 
exit (1) ; 

> 

pmap_unset (RUSERSPROG, RUSERSVERS) ; 

if (! svc_register (transp , RUSERSPROG, RUSERSVERS, nuser, 

I PPROTO_UDP ) ) { 

fprintf (stderr, "couldn't register RUSER service\n") ; 
exit (1) ; 

> 

svc_run () ; /* never returns */ 

fprintf (stderr, "should never reach this point\n") ; 

> 


Sun Microsystems 


Release 2.0 


Page 10 


RPC Programming 


nuser(rqstp, tranp) 

struct svc_req *rqstp; 

SVCXPRT ‘transp; 

unsigned long nusers; 

switch (rqstp->rq_proc) ■( 
case NULLPROC: 

if (! svc_sendreply (transp , xdr_void, O) ) •{ 

fprintf (stderr , "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return ; 

case RUSERSPROC_NUM : 

/* 

* code here to compute the number of users 

* and put in variable nusers 

V 

if ( !svc_sendreply (transp, xdr_u_long, finusers) •{ 

fprintf (stderr , "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return; 

default : 

svcerr_noproc (transp) ; 
return; 

> 

> 

First, the server gets a transport handle, which is used for sending out RPC messages, 
registerrpc () uses svcudp_create () to get a UDP handle. If you require a reliable pro- 
tocol, call svctcp_create () instead. If the argument to svcudp_create () is 
RPC_ANYSOCK, the RPC library creates a socket on which to send out RPC calls. Otherwise, 
svcudp_create () expects its argument to be a valid socket number. If you specify your own 
socket, it can be bound or unbound. If it is bound to a port by the user, the port numbers of 
svcudp_create () and clntudp_create () (the low-level client routine) must match. 

When the user specifies RPC_ANYSOCK for a socket or gives an unbound socket, the system 
determines port numbers in the following way; when a server starts up, it advertises to a port 
mapper demon on its local machine, which picks a port number for the RPC procedure if the 
socket specified to svcudp_create () isn’t already bound. When the clntudp_create () 
call is made with an unbound socket, the system queries the port mapper on the machine to 
which the call is being made, and gets the appropriate port number. If the port mapper is not 
running or has no port corresponding to the RPC call, the RPC call fails. Users can make RPC 
calls to the port mapper themselves. The appropriate procedure numbers are in the include file 
<rpc/pmap_prot . h>. 

After creating an SVCXPRT, the next step is to call pmap_unset () so that if the nusers 
server crashed earlier, any previous trace of it is erased before restarting. More precisely, 
pmap_unset () erases the entry for RUSERS from the port mapper’s tables. 

Finally, we associate the program number for nusers with the procedure nuser () . The final 
argument to svc_register () is normally the protocol being used, which, in this case, is 
IPPROTO_UDP. Notice that unlike registerrpc () , there are no XDR routines involved in 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 11 


the registration process. Also, registration is done on the program, rather than procedure, level. 

The user routine nuser() must call and dispatch the appropriate XDR routines based on the 
procedure number. Note that two things are handled by nuser () that registerrpc () han- 
dles automatically. The first is that procedure NULLPROC (currently zero) returns with no 
arguments. This can be used as a simple test for detecting if a remote program is running. 
Second, there is a check for invalid procedure numbers. If one is detected, svcerr_noproc () 
is called to handle the error. 

The user service routine serializes the results and returns them to the RPC caller via 
svc_sendreply () . Its first parameter is the SVCXPRT handle, the second is the XDR rou- 
tine, and the third is a pointer to the data to be returned. Not illustrated above is how a server 
handles an RPC program that passes data. As an example, we can add a procedure 
RUSERSPROC_BOOL, which has an argument nusers, and returns TRUE or FALSE depend- 
ing on whether there are nusers logged on. It would look like this: 

case RUSERSPROC_BOOL : { 
int bool; 

unsigned nuserquery; 

if ( !svc_getargs (transp, xdr_u_int, finuserquery) { 
svcerr_decode (transp) ; 
return ; 

> 

/* 

* code to set nusers = number of users 

V 

if (nuserquery == nusers) 
bool = TRUE; 

else 

bool = FALSE; 

if (! svc_sendreply (transp , xdr_bool, fibool) ■{ 

fprintf (stderr , "couldn’t reply to RPC call\n") ; 
exit (1) ; 

> 

return ; 

> 

The relevant routine is svc_getargs () , which takes an SVCXPRT handle, the XDR routine, 
and a pointer to where the input is to be placed as arguments. 


3.2. Memory Allocation with XDR 

XDR routines not only do input and output, they also do memory allocation. This is why the 
second parameter of xdr_array () is a pointer to an array, rather than the array itself. If it is 
NULL, then xdr.array () allocates space for the array and returns a pointer to it, putting the 
size of the array in the third argument. As an example, consider the following XDR routine 
xdr_chararrl () , which deals with a fixed array of bytes with length SIZE: 


Sun Microsystems 


Release 2.0 


Page 12 


RPC Programming 


xdr_chararrl (xdrsp, chararr) 

XDR * xdrsp; 
char chararr [] ; 

{ 

char *p ; 
int len; 

p = chararr; 
len = SIZE; 

return (xdr_bytes (xdrsp , &p, &len, SIZE)); 

> 

It might be called from a server like this, 
char chararr [SIZE] ; 

svc_getargs (transp, xdr_chararrl, chararr); 

where chararr has already allocated space. If you want XDR to do the allocation, you would 
have to rewrite this routine in the following way: 

xdr_chararr2 (xdrsp, chararrp) 

XDR * xdrsp; 
char “chararrp; 

{ 

int len; 
len = SIZE; 

return (xdr_bytes (xdrsp, charrarrp, &len, SIZE)); 

> 

Then the RPC call might look like this: 
char ‘arrptr; 
arrptr = NULL ; 

svc_getargs (transp, xdr_chararr2 , &arrptr) ; 

/* 

* use the result here 

V 

svc_freeargs (xdrsp , xdr_chararr2 , fiarrptr) ; 

After using the character array, it can be freed with svc_freeargs () . In the routine 
xdr_ final example () given earlier, if finalp->string was NULL in the call 

svc_getargs (transp , xdr_finalexample, & f inalp) ; 

then 

svc_freeargs (xdrsp , xdr_f inalexample, fifinalp) ; 

frees the array allocated to hold finalp->string; otherwise, it frees nothing. The same is 
true for finalp->simplep. 

To summarize, each XDR routine is responsible for serializing, deserializing, and allocating 
memory. When an XDR routine is called from callrpcQ, the serializing part is used. When 
called from svc_getargs () , the deserializer is used. And when called from 
svc_freeargs () , the memory deallocator is used. When building simple examples like those 
in this section, a user doesn’t have to worry about the three modes. The XDR reference manual 
has examples of more sophisticated XDR routines that determine which of the three modes they 
are in to function correctly. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 13 


3.3. The Calling Side 


When you use callrpc, you have no control over the RPC delivery mechanism or the socket 
used to transport the data. To illustrate the layer of RPC that lets you adjust these parameters, 
consider the following code to call the nusers service: 


#include <stdio.h> 

#include <rpc/rpc.h> 

#include <rpcsvc/rusers.h> 

#include <sys/socket .h> 

#include <sys/time.h> 

#include <netdb.h> 

main (argc, argv) 
int argc; 
char “argv; 

{ 

struct hostent *hp; 

struct timeval pertry_timeout , total_timeout; 

struct sockaddr_in server_addr; 

int addrlen, sock = RPC^ANYSOCK; 

register CLIENT ^client; 

enum clnt_stat clnt_stat; 

unsigned long nusers; 


if (argc < 2) { 

fprintf (stderr , "usage: nusers hostname\n") ; 
exit (-1) ; 

> 

if ( (hp = gethostbyname (argv [1] ) ) == NULL) •( 

fprintf (stderr , "cannot get addr for '%s'\n" , argv[l]); 
exit (-1) ; 

> 

pertry_timeout . tv_sec = 3; 

pertry_timeout .tv_usec = 0; 

addrlen = sizeof (struct sockaddr_in) ; 

bcopy (hp->h_addr , (caddr_t) &server_addr . sin_addr , hp->h_length) ; 
server_addr . sin_family = AF_INET; 
server_addr . sin_port = 0; 

if ((client = clntudp_create (&server_addr , RUSERSPROG, 
RUSERSVERS, pertry_timeout , &sock) ) == NULL) { 
perror ("clntudp_create") ; 
exit (-1) ; 

> 

total_timeout . tv_sec = 20; 
total_timeout . tv_usec = 0; 

clnt_stat = clnt_call (client, RUSERSPROC_NUM, xdr_void, 0, 
xdr_u_long, &nusers , total_timeout) ; 
if (clnt_stat != RPC_SUCCESS) { 

clnt_perror (client , "rpc") ; 
exit (-1) ; 

> 

clnt_destroy (client) ; 

> 

The low-level version of callrpc () is clnt_call(), which takes a CLIENT pointer rather 


Sun Microsystems 


Release 2.0 


Page 14 


RPC Programming 


than a host name. The parameters to clnt_call() are a CLIENT pointer, the procedure 
number, the XDR routine for serializing the argument, a pointer to the argument, the XDR rou- 
tine for deserializing the return value, a pointer to where the return value will be placed, and 
the time in seconds to wait for a reply. 

The CLIENT pointer is encoded with the transport mechanism. callrpcQ uses UDP, thus it 
calls clntudp_create () to get a CLIENT pointer. To get TCP (Transport Control Protocol), 
you would use clnttcp_create () . 

The parameters to clntudp_create () are the server address, the length of the server 
address, the program number, the version number, a timeout value (between tries), and a pointer 
to a socket. The final argument to clnt_call() is the total time to wait for a response. 
Thus, the number of tries is the clnt_call() timeout divided by the clntudp_create () 
timeout. 

There is one thing to note when using the clrvt_destroy () call. It deallocates any space asso- 
ciated with the CLIENT handle, but it does not close the socket associated with it, which was 
passed as an argument to clntudp_create () . The reason is that if there are multiple client 
handles using the same socket, then it is possible to close one handle without destroying the 
socket that other handles are using. 

To make a stream connection, the call to clntudp_create () is replaced with a call to 
clnttcp_create () . 

clnttcp_create (&server_addr , prognum, versnum, fisocket, inputsize, 
outputs ize) ; 

There is no timeout argument; instead, the receive and send buffer sizes must be specified. 
When the clnttcp_create () call is made, a TCP connection is established. All RPC calls 
using that CLIENT handle would use this connection. The server side of an RPC call using TCP 
has svcudp_create () replaced by svctcp_create () . 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 15 


4. Other RPC Features 

This section discusses some other aspects of RPC that are occasionally useful. 


4.1. Select on the Server Side 


Suppose a process is processing RPC requests while performing some other activity. If the other 
activity involves periodically updating a data structure, the process can set an alarm signal 
before calling svc_run(). But if the other activity involves waiting on a a file descriptor, the 
svc_run() call won’t work. The code for svc_run() is as follows: 

void 

svc_run () 

< 

int readfds; 


> 


for 


> 


(;;) { 

readfds = svc_fds; 

switch (select (32, fireadfds, NULL, NULL, NULL)) 
case -1: 

if (errno == EINTR) 
continue; 

perror ("rstat : select") ; 
return; 

case O: 

break ; 

default : 


> 


svc_getreq (readfds) ; 


You can bypass svc_run () and call svc_getreq() yourself. All you need to know are the 
file descriptors of the socket(s) associated with the programs you are waiting on. Thus you can 
have your own select () that waits on both the RPC socket, and your own descriptors. 


4.2. Broadcast RPC 

The pmap and RPC protocols implement broadcast RPC. Here are the main differences between 

broadcast RPC and normal RPC calls: 

1) Normal RPC expects one answer, whereas broadcast RPC expects many answers (one or 
more answer from each responding machine). 

2) Broadcast RPC can only be supported by packet-oriented (connectionless) transport proto- 
cols like UPD/IP. 

3) The implementation of broadcast RPC treats all unsuccessful responses as garbage by filter- 
ing them out. Thus, if there is a version mismatch between the broadcaster and a remote 
service, the user of broadcast RPC never knows. 


Sun Microsystems 


Release 2.0 


Page 16 


RPC Programming 


4) All broadcast messages are sent to the portmap port. Thus, only services that register them- 
selves with their portmapper are accessible via the broadcast RPC mechanism. 


4-2.1. Broadcast RPC Synopsis 


#include <rpc/pmap_clnt .h> 
enum clnt_stat clnt_stat; 


clnt_stat = 

clnt_broadcast (prog, vers. 


u_long 

u_long 

u_long 

xdrproc_t 

caddr_t 

xdrproc_t 

caddr_t 


prog; 

vers; 

proc; 

xargs; 

argsp ; 

xresults; 

resultsp; 


bool_t (‘eachresult) () ; 


proc, xargs, argsp, xresults, resultsp, eachresult) 
/* program number */ 

/* version number */ 

/* procedure number */ 

/* xdr routine for args */ 

/* pointer to args */ 

/* xdr routine for results */ 

/* pointer to results */ 

/* call with each result obtained */ 


The procedure eachresult () is called each time a valid result is obtained. It returns a 
boolean that indicates whether or not the client wants more responses. 


bool_t 


done ; 


done = 

eachresult (resultsp, 
caddr_t 

struct sockaddr_in 


raddr) 

resultsp; 

♦raddr; /* address of machine that sent response */ 


If done is TRUE, then broadcasting stops and clnt_broadcast () returns successfully. Oth- 
erwise, the routine waits for another response. The request is rebroadcast after a few seconds of 
waiting. If no responses come back, the routine returns with RPC_TIMEDOUT. To interpret 
clnt_stat errors, feed the error code to clnt_perrno () . 


4«3. Batching 

The RPC architecture is designed so that clients send a call message, and wait for servers to 
reply that the call succeeded. This implies that clients do not compute while servers are process- 
ing a call. This is inefficient if the client does not want or need an acknowledgement for every 
message sent. It is possible for clients to continue computing while waiting for a response, using 
RPC batch facilities. 

RPC messages can be placed in a “pipeline” of calls to a desired server; this is called batching. 
Batching assumes that: 1) each RPC call in the pipeline requires no response from the server, 
and the server does not send a response message; and 2) the pipeline of calls is transported on a 
reliable byte stream transport such as TCP/IP. Since the server does not respond to every call, 
the client can generate new calls in parallel with the server executing previous calls. Further- 
more, the TCP/IP implementation can buffer up many call messages, and send them to the 
server in one write system call. This overlapped execution greatly decreases the interprocess 
communication overhead of the client and server processes, and the total elapsed time of a series 
of calls. 


Sun Microsystems 


Release 2.0 


RPC Programming Page 17 

Since the batched calls are buffered, the client should eventually do a legitimate call in order to 
flush the pipeline. 

A contrived example of batching follows. Assume a string rendering service (like a window sys- 
tem) has two similar calls: one renders a string and returns void results, while the other renders 
a string and remains silent. The service (using the TCP/IP transport) may look like: 

#include <stdio.h> 

#include <rpc/rpc.h> 

#include <rpcsvc/windows .h> 

void windowdispatch () ; 

main () 

{ 

SVCXPRT *transp; 

transp = svctcp_create (RPC^ANYSOCK , 0, O) ; 
if (transp == NULL) { 

fprintf (stderr , "couldn't create an RPC server\n") ; 
exit (1) ; 

> 

pmap_unset (WINDOWPROG, WINDOWVERS) ; 

if ( !svc_register (transp, WINDOWPROG, WINDOWVERS, windowdispatch, 
IPPROTO_TCP) ) { 

fprintf (stderr , "couldn't register WINDOW service\n") ; 
exit (1) ; 

> 

svc_run(); /* never returns */ 

fprintf (stderr , "should never reach this point\n") ; 

> 

void 

windowdispatch (rqstp, transp) 
struct svc_req ‘rqstp; 

SVCXPRT ‘transp; 

{ 

char *s = NULL; 

switch (rqstp ->rq_proc) { 
case NULLPROC: 

if ( ! svc_sendrep ly (transp , xdr_void, 0)) { 

fprintf (stderr , "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return ; 

case RENDERSTRING: 

if ( ! svc_getargs (transp , xdr_wrapstring, &s) ) { 

fprintf (stderr , "couldn't decode arguments\n") ; 
svcerr_decode (transp) ; /* tell caller he screwed up */ 
break ; 

> 

/* 

* call here to to render the string s 

V 

if ( !svc_sendreply (transp, xdr_void, NULL)) { 

fprintf (stderr , "couldn't reply to RPC call\n") ; 


Sun Microsystems 


Release 2.0 


Page 18 


RPC Programming 


exit (1) ; 

> 

break ; 

case RENDERSTRING_BATCHED: 

if ( !svc_getargs (transp, xdr_wrapstring, &s) ) { 

fprintf (stderr , "couldn't decode arguments\n") ; 

/* 

* we are silent in the face of protocol errors 

V 

break ; 

> 

/* 

* call here to to render the string s, 

* but sends no reply! 

V 

break ; 

default : 

svcerr_noproc (transp) ; 
return ; 

> 

/* 

* now free string allocated while decoding arguments 

*/ 

svc_freeargs (transp, xdr_wrapstring, &s) ; 

> 

Of course the service could have one procedure that takes the string and a boolean to indicate 
whether or not the procedure should respond. 

In order for a client to take advantage of batching, the client must perform RPC calls on a 
TCP-based transport and the actual calls must have the following attributes: 1) the result’s XDR 
routine must be zero (NULL), and 2) the RPC call’s timeout must be zero. 

Here is an example of a client that uses batching to render a bunch of strings; the batching is 
flushed when the client gets a null string: 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 19 


#include <stdio.h> 

#include <rpc/rpc.h> 

#include <rpcsvc/windows .h> 

#include <sys/socket .h> 

#include <sys/time.h> 

#include <netdb.h> 

main(argc, argv) 
int argc ; 
char “argv; 

< 

struct hostent *hp; 

struct timeval pertry_timeout, total_timeout ; 

struct sockaddr_in server _addr; 

int addrlen, sock = RPC _ANYSOCK; 

register CLIENT *client; 

enum clnt_stat clnt_stat; 

char buf[1000]; 

char *s = buf; 

/* 

* initial as in example 3.3 

V 

if ((client = clnttcp_create (&server_addr , WINDOWPROG, 
WINDOWVERS, fisock, 0, 0) ) == NULL) < 
perror ("clnttcp_create") ; 
exit (-1) ; 

> 

total_timeout . tv_sec = 0; 
total_timeout . tv_usec = 0; 
while (scanf ("%s" , s) 1 = EOF) { 

clnt_stat = clnt_call (client, RENDERSTRING_BATCHED, 
xdr_wrapstring, &s, NULL, NULL, total_timeout) ; 
if (clnt_stat != RPC_SUCCESS) { 

clnt_perror (client, "batched rpc") ; 
exit (-1) ; 

> 

> 

/* 

* now flush the pipeline 

V 

total_timeout . tv_sec = 20; 

clnt_stat = clnt_call (client, NULLPROC, 

xdr_void, NULL, xdr_void, NULL, total_timeout) ; 
if (clnt_stat != RPC_SUCCESS) < 

clnt_perror (client , "rpc"); 
exit (-1) ; 

> 

clnt_destroy (client) ; 

> 

Since the server sends no message, the clients cannot be notified of any of the failures that may 
occur. Therefore, clients are on their own when it comes to handling errors. 


Sun Microsystems 


Release 2.0 


Page 20 


RPC Programming 


The above example was completed to render all of the (2000) lines in the file / etc/termcap . The 
rendering service did nothing but to throw the lines away. The example was run in the following 
four configurations: 1) machine to itself, regular RPC; 2) machine to itself, batched RPC; 3) 
machine to another, regular RPC; and 4) machine to another, batched RPC. The results are as 
follows: 1) 50 seconds; 2) 16 seconds; 3) 52 seconds; 4) 10 seconds. Running fscanfQ on 
/etc/termcap only requires six seconds. These timings show the advantage of protocols that 
allow for overlapped execution, though these protocols are often hard to design. 


4.4. Authentication 

In the examples presented so far, the caller never identified itself to the server, and the server 
never required an ID from the caller. Clearly, some network services, such as a network filesys- 
tem, require stronger security than what has been presented so far. 

In reality, every RPC call is authenticated by the RPC package on the server, and similarly, the 
RPC client package generates and sends authentication parameters. Just as different transports 
(TCP/IP or UDP/IP) can be used when creating RPC clients and servers, different forms of 
authentication can be associated with RPC clients; the default authentication type used as a 
default is type none. 

The authentication subsystem of the RPC package is open ended. That is, numerous types of 
authentication are easy to support. However, this section deals only with unix type authentica- 
tion, which besides none is the only supported type. 


4-4-1- The Client Side 

When a caller creates a new RPC client handle as in: 

clnt = clntudp_create (address, prognum, versnum, wait, sockp) 
the appropriate transport instance defaults the associate authentication handle to be 

clnt->cl_auth = authnone_create () ; 

The RPC client can choose to use unix style authentication by setting clnt->cl_auth after 
creating the RPC client handle: 

clnt->cl_auth = authunix_create_de fault () ; 

This causes each RPC call associated with clnt to carry with it the following authentication 
credentials structure: 

/* 

a Unix style credentials. 

V 

struct authunix_parms { 

u_l ong aup_t ime ; 

char *aup_machnams ; 

int aup_uid; 

int aup_gid; 

u_int aup_len; 

int *aup_gids; 

>; 

These fields are set by authunix_create 


/* credentials creation time */ 

/* host name of where the client is calling */ 
/* client's UNIX -effective uid */ 

/* client's current UNIX group id */ 

/* the element length of aup_gids array */ 

/* array of 4.2 groups to which user belongs */ 

de fault () by invoking the appropriate system 


Sun Microsyste 


ms 


Release 2.0 


RPC Programming 


Page 21 


calls. 

Since the RPC user created this new style of authentication, he is responsible for destroying it 
with: 

auth_destroy (clnt->cl_auth) ; 


4-4- The Server Side 

Service implementors have a harder time dealing with authentication issues since the RPC pack- 
age passes the service dispatch routine a request that has an arbitrary authentication style asso- 
ciated with it. Consider the fields of a request handle passed to a service dispatch routine: 

/* 

* An ] 

V 

struct 


>-• 

The rq_cred 

credentials: 

/* 

* Authentication info. Mostly opaque to the programmer . 

V 

struct opaque_auth { 

enum_t oa_f lavor ; /* style of credentials */ 

caddr_t oa_base; /* address of more auth stuff */ 

u_int oa_length; /* not to exceed MAX_AUTH_BYTES */ 

}; 

The RPC package guarantees the following to the service dispatch routine: 

1) That the request’s rq_cred is well formed. Thus the service implementor may inspect the 
request’s rq_cred . oa_ flavor to determine which style of authentication the caller used. 
The service implementor may also wish to inspect the other fields of rq_cred if the style is 
not one of the styles supported by the RPC package. 

2) That the request’s rq_clrvtcred field is either NULL or points to a well formed structure 
that corresponds to a supported style of authentication credentials. Remember that only 
unix style is currently supported, so (currently) rq_clntcred could be cast to a pointer to 
an authunix_parms structure. If rq_clntcred is NULL, the service implementor may 
wish to inspect the other (opaque) fileds of rq_cred in case the service knows about a new 
type of authentication that the RPC package does not know about. 

Our remote users service example can be extended so that it computes results for all users 

except UID 16: 


Service request 


svc_req { 

u_long rq_prog; /* 
u_long rq_vers; /* 
u_long rq_proc; /* 
struct opaque_auth rq_cred; /* 
caddr_t rq_clntcred; /* 


is mostly opaque, except for one field of 


service program number */ 
service protocol version number*/ 
the desired procedure number*/ 
raw credentials from the "wire* ' 
read only, cooked credentials */ 

interest: the style of authentication 


Sun Microsystems 


Release 2.0 


Page 22 


RPC Programming 


nuser (rqstp, tranp) 

struct svc_req ‘rqstp; 

SVCXPRT ‘transp; 

{ 

struct authunix_parms *unix_cred; 
int uid; 

unsigned long misers; 

/* 

* we don't care about authentication for the null procedure 

V 

if (rqstp ->rq_proc == NULLPROC) { 

if (! svc_sendreply (transp, xdr_void, 0) ) { 

fprintf (stderr , "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return ; 

> 

/* 

* now get the uid 

V 

switch (rqstp- >rq_cred . oa_flavor) { 
case AUTH_UNIX : 

unix_cred = (struct authunix_parms *) rqstp- >rq_clntcred; 
uid = unix_cred->aup_uid; 
break ; 

case AUTH_NULL : 
default : 

svcerr_weakauth (transp) ; 
return; 

> 

switch (rqstp ->rq_proc) { 
case RUSERSPROC_NUM : 

/* 

* make sure the caller is allow to call this procedure. 

V 

if (uid == 16) { 

svcerr_systemerr (transp) ; 
return; 

> 

/* 

* code here to compute the number of users 

* and put in variable nusers 

V 

if (!svc_sendreply (transp, xdr_u_long, finusers) { 

fprintf (stderr, "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return; 

default : 

svcerr_noproc (transp) ; 
return ; 

> 

> 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 23 


A few things should be noted here. First, it is customary not to check the authentication param- 
eters associated with the NULLPROC (procedure number zero). Second, if the authentication 
parameter’s type is not suitable for your service, you should call svcerr.weakauth () . And 
finally, the service protocol itself should return status for access denied; in the case of our exam- 
ple, the protocol does not have such a status, so we call the service primitive 
svcerr_systemerr () instead. 

The last point underscores the relation between the RPC authentication package and the ser- 
vices; RPC deals only with authentication and not with individual services’ access control. The 
services themselves must implement their own access control policies and reflect these policies as 
return statuses in their protocols. 


4.5. Using Inetd 

An RPC server can be started from inetd. The only difference from the usual code is that 
svcudp_create () should be called as 

transp = svcudp_create (0) ; 

since inet passes a socket as file descriptor 0. Also, svc_register () should be called as 
svc_register (PROGNUM, VERSNUM, service, transp, O) ; 

with the final flag as 0, since the program would already be registered by inetd. Remember 
that if you want to exit from the server process and return control to inet, you need to expli- 
citly exit, since svc_run () never returns. 

The format of entries in /etc/servers for RPC services is 
rpc udp server program version 

where server is the C code implementing the server, and program and version are the program 
and version numbers of the service. The key word udp can be replaced by tcp for TCP-based 
RPC services. 

If the same program handles multiple versions, then the version number can be a range, as in 
this example: 

rpc udp /usr/etc/rstatd 100001 1-2 


Sun Microsystems 


Release 2.0 


Page 24 


RPC Programming 


5. More Examples 


5.1. Versions 

By convention, the first version number of program FOO is FOOVERS_ORIG and the most 
recent version is FOOVERS. Suppose there is a new version of the user program that returns 
an unsigned short rather than a long. If we name this version RUSERSVERS_SHORT, 
then a server that wants to support both versions would do a double register. 

if ( ! svc_register (transp, RUSERSPROG, RUSERSVERS_ORIG, nuser, 
IPPROTO_TCP) ) { 

fprintf (stderr , "couldn't register RUSER service\n") ; 
exit (1) ; 

> 

if ( !svc_register (transp, RUSERSPROG, RUSERSVERS_SHORT , nuser, 

I PPROTO_TCP ) ) { 

fprintf (stderr , "couldn't register RUSER service\n") ; 
exit (1) ; 

> 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 25 


Both versions can be handled by the same C procedure: 

nuser(rqstp, tranp) 

struct svc_req *rqstp; 

SVCXPRT 4 transp; 

{ 

unsigned long nusers; 
unsigned short nusers2 

switch (rqstp->rq_proc) ■{ 
case NULLPROC: 

if ( !svc_sendreply (transp, xdr_void, 0)) { 

fprintf (stderr, "couldn't reply to RPC call\n") ; 
exit (1) ; 

> 

return; 

case RUSERSPROC_NUM : 

/* 

* code here to compute the number of users 

* and put in variable nusers 

V 

nusers 2 = nusers; 

if (rqstp->rq_vers == RUSERSVERS_ORIG) 

if ( !svc_sendreply (transp, xdr_u_long, inusers) { 

fprintf (stderr , "couldn't reply to RPC call\n") 
exit (1) ; 

> 

else 

if ( !svc_sendreply (transp, xdr_u_short, &nusers2) { 

fprintf (stderr , "couldn't reply to RPC call\n") 
exit (1) ; 

return; 

default : 

svcerr_noproc (transp) ; 
return; 

> 

> 


5.2. TCP 

Here is an example that is essentially rep. The initiator of the RPC snd() call takes its stan- 
dard input and sends it to the server rcv() , which prints it on standard output. The RPC call 
uses TCP. This also illustrates an XDR procedure that behaves differently on serialization than 
on deserialization. 


Sun Microsystems 


Release 2.0 


Page 26 


RPC Programming 


/* 

* The xdr routine: 

a 

* on decode, read from wire, write onto fp 

* on encode, read from fp, write onto wire 

V 

#include <stdio.h> 
tinclude <rpc/rpc.h> 


xdr_rcp (xdrs, fp) 

XDR *xdrs; 

FILE * fp ; 

unsigned long size; 
char buf [MAXCHUNK] , *p; 


> 


if (xdrs->x_op == XDR_FREE) /* nothing to free */ 
return 1 ; 


while (1) •( 

if (xdrs->x_op == XDR_ENCODE) { 
if ((size = fread (buf, 
== 0 && ferror(fp)) 
fprintf (stderr, 
exit (1) ; 

> 

> 

p = buf; 

if ( !xdr_bytes (xdrs, &p, &size, 
return O; 


sizeof (char) , MAXCHUNK, 
"couldn't fread\n") ; 


MAXCHUNK) ) 


fp)) 


if (size == 0) 

return 1; 

if (xdrs->x_op == XDR_DECODE) { 

if (fwrite(buf, sizeof (char) , size, fp) != size) { 
fprintf (stderr , "couldn't fwrite\n") ; 
exit (1) ; 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 27 


/* 

* The sender routines 

V 

#include <stdio.h> 
#include <netdb.h> 
#include <rpc/rpc.h> 
#include <sys/socket .h> 
#include <sys/time.h> 

malnjargc, argv) 
int argc; 


char 


'argv; 


int err; 
if (argc < 2) •( 

fprintf (stderr , "usage: %s server-name\n", argv[0]); 
exit (-1) ; 

> 

if ((err = cal lrpctcp (argv [1] , RCPPROG, RCPPROC_FP, RCPVERS, 
xdr_rcp, stdin, xdr_void, 0) ! = 0) ) { 

clnt_perrno (err) ; 

fprintf (stderr , " couldn't make RPC call\n") ; 
exit (1) ; 

> 

> 

cal lrpctcp (host, prognum, procnum, versnum, inproc, in, outproc, out) 
char *host, *in, *out; 
xdrproc_t inproc, outproc; 

< 

struct sockaddr_in server.addr ; 
int socket = RPC_ANYSOCK; 
enum clnt_stat clnt_stat; 
struct hostent *hp; 
register CLIENT ‘client; 
struct timeval total_timeout ; 

if ( (hp = gethostbyname (host) ) == NULL) { 

fprintf (stderr , "cannot get addr for '%s'\n", host) ; 
exit (-1) ; 

> 

bcopy (hp->h_addr , (caddr_t) &server_addr . sin_addr , hp->h_length) ; 
server_addr . sin_family = AF_INET; 
server_addr . sin_port = 0; 

if ((client = clnttcp_create (&server_addr , prognum, 
versnum, fisocket, BUFSIZ, BUFSIZ)) == NULL) { 
perror ("rpctcp_create") ; 
exit (-1) ; 

> 

total_timeout . tv_sec = 20; 
total_timeout . tv_usec = 0; 

clnt_stat = clnt_call (client, procnum, inproc, in, outproc, out, total_timeout) 
clnt_destroy (client) 
return (int) clnt_stat; 


Sun Microsystems 


Release 2.0 


Page 28 


RPC Programming 


/* 

4 The receiving routines 

V 

#include <stdio.h> 

#include <rpc/rpc.h> 

main () 

< 

register SVCXPRT ‘transp; 

if ( (transp = svctcp_create (RPC^ANYSOCK , 1024, 1024)) == NULL) { 
fprintf ("svctcp_create: error\n") ; 
exit (1) ; 

> 

pmap_unset (RCPPROG, RCPVERS) ; 

if (! svc_register (transp , RCPPROG, RCPVERS, rcp_service, IPPROTO_TCP) ) { 
fprintf (stderr, "svc_register : error\n") ; 
exit (1) ; 

> 

svc_run() ; /* never returns */ 

fprintf (stderr , "svc_run should never return\n") ; 

> 

rcp_service (rqstp , transp) 

register struct svc_req ‘rqstp; 
register SVCXPRT ‘transp; 

{ 

switch (rqstp ->rq_proc) { 
case NULLPROC: 

if (svc_sendreply (transp, xdr_void, 0) == 0) { 
fprintf (stderr , "err: rcp_service") ; 
exit (1) ; 

> 

return ; 

case RCPPROC_FP : 

if (! svc_getargs (transp, xdr_rcp, stdout) ) •{ 
svcerr_decode (transp) ; 
return; 

> 

if (! svc_sendreply (transp , xdr_void, O) ) •{ 
fprintf (stderr , "can't reply\n") ; 
return; 

> 

exit (0) ; 

default : 

svcerr_noproc (transp) ; 
return; 

> 

> 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 29 


5.3. Callback Procedures 

Occasionally, it is useful to have a server become a client, and make an RPC call back the pro- 
cess which is its client. An example is remote debugging, where the client is a window system 
program, and the server is a debugger running on the remote machine. Most of the time, the 
user clicks a mouse button at the debugging window, which converts this to a debugger com- 
mand, and then makes an RPC call to the server (where the debugger is actually running), tel- 
ling it to execute that command. However, when the debugger hits a breakpoint, the roles are 
reversed, and the debugger wants to make an rpc call to the window program, so that it can 
inform the user that a breakpoint has been reached. 

In order to do an RPC callback, you need a program number to make the RPC call on. Since 
this will be a dynamically generated program number, it should be in the transient range, 
0x40000000 - 0x5Sfffff. The routine gettransient () returns a valid program number in the 
transient range, and registers it with the portmapper. It only talks to the portmapper running 
on the same machine as the gettransient () routine itself. The call to pmap_set () is a test 
and set operation, in that it indivisibly tests whether a program number has already been 
registered, and if it has not, then reserves it. On return, the sockp argument will contain a 
socket that can be used as the argument to an svcudp.create () or svctcp.create () call. 


♦ Sun Microsystems 


Release 2.0 


Page 30 


RPC Programming 


tincluda <stdio.h> 
tinclude <rpc/rpc.h> 

#include <sys/socket .h> 

gettransient (proto, vers, sockp) 
int * sockp; 

{ 

static int prognum = 0x40000000; 
int s, len, socktype; 
struct sockaddr_in addr; 


> 


switch (proto) •( 

case IPPR0T0_UDP : 

socktype = S0CK_DGRAM; 
break ; 

case IPPR0T0_TCP: 

socktype = S0CK_STREAM; 
break; 

default : 

fprintf (stderr , "unknown protocol type\n") ; 
return 0; 

> 

if (*sockp == RPC_ANYS0CK) { 

if ( (s = socket (AF_INET, socktype, 0)) < 0) ■{ 
perror ("socket") ; 
return (0) ; 

> 

‘sockp = s; 

> 

else 

s = * sockp ; 

addr . sin_addr . s_addr = 0; 
addr . sin_family = AF_INET; 
addr . sin_port = 0; 
len = sizeof (addr) ; 

/* 

* may be already bound, so don't check for err 

V 

bind(s, &addr, len); 

if (getsockname (s, fiaddr, &len) < 0) •( 
perror ("getsockname") ; 
return (0) ; 

> 

while (pmap_set (prognum++ , vers, proto, addr . sin_port) == 0) 
continue; 

return (prognum- 1) ; 


The following pair of programs illustrate how to use the gettransient () routine. The client 
makes an RPC call to the server, passing it a transient program number. Then the client waits 
around to receive a callback from the server at that program number. The server registers the 
program EXAMPELPROG, so that it can receive the RPC call informing it of the callback pro- 
gram number. Then at some random time (on receiving an ALRM signal in this example), it 
sends a callback RPC call, using the program number it received earlier. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 31 


/* 

4 client 

V 

#include <stdio.h> 

#include <rpc/rpc.h> 

int callback(); 
char hostname [256] ; 

main(argc, argv) 

char * * argv ; 

int x, ans, s; 

SVCXPRT *xprt ; 

gethostname (hostname, sizeof (hostname) ) ; 
s = RPC ^ANYSOCK; 

x = gettransient (IPPROTO_UDP, 1, &s) ; 

fprintf (stderr , "client gets prognum %d\n", x) ; 

if ( (xprt = svcudp_create (s) ) == NULL) ■( 

fprintf (stderr, "rpc_server : svcudp_create\n") ; 
exit (1) ; 

> 

(void) svc_register (xprt, x, 1, callback, O) ; 

ans = cal lrpc (hostname, EXAMPLEPROG, EXAMPLEPROC_CALLBACK , 
EXAMPLEVERS, xdr_int, &x, xdr_void, 0) ; 
if (ans != 0) { 

fprintf (stderr , "call: ") ; 
clnt_perrno (ans) ; 
fprintf (stderr , "\n") ; 

> 

svc.run () ; 

fprintf (stderr , "Error: svc_run shouldn't have returned\n") ; 

> 


Sun Microsystems 


Release 2.0 


Page 32 


RPC Programming 


callback (rqstp, transp) 

register struct sve_req ‘rqstp; 
register SVCXPRT ‘transp; 

{ 

switch (rqstp- >rq_proc) -( 
case O: 


if (svc_sendreply (transp, xdr_void, 0) = 

fprintf (stderr , "err: rusersd\n") 
exit (1) ; 

> 

exit (O) ; 

case 1 : 


if ( !svc_getargs (transp, xdr_void, 0)) { 
svcerr_decode (transp) ; 
exit (1) ; 

> 

fprintf (stderr, "client got callback\n") ; 
if (svc_sendreply (transp , xdr_void, 0) = 

fprintf (stderr, "err : rusersd") ; 
exit (1) ; 


Sun Microsystems 


= FALSE) { 


= FALSE) { 


Release 2.0 


RPC Programming 


Page 33 


/* 

4 server 

V 

#include <stdio.h> 

#include <rpc/rpc.h> 

#include <sys/signal .h> 

char ‘getnewprog () ; 
char hostname [256] ; 
int docallback () ; 

int pnum; /‘program number for callback routine */ 


main(argc, argv) 

char “argv; 

{ 

gethostname (hostname, sizeof (hostname) ) ; 

registerrpc (EXAMPLEPROG, EXAMPLEPROC_CALLBACK, EXAMPLEVERS, 
getnewprog, xdr_int, xdr_void) ; 
fprintf (stderr , "server going into svc_run\n") ; 
alarm (10) ; 

signal (SIGALRM, docallback) ; 
svc_run () ; 

fprintf (stderr , "Error: svc_run shouldn't have returned\n") ; 

> 

char * 

getnewprog (pnump) 

char *pnump ; 

pnum = * (int * ) pnump; 
return NULL; 

> 


docallback () 

{ 

int ans; 


ans = callrpc (hostname, pnum, 1, 1, xdr_void, 0, xdr_void, 0) ; 
if (ans != 0) { 

fprintf (stderr , "server: ") ; 
clnt_perrno (ans) ; 
fprintf (stderr , "\n") ; 

> 

> 


Sun Microsystems 


Release 2.0 


Page 34 


RPC Programming 


Appendix A„ Synopsis of RPC Routines 

auth_destroy() 

void 

auth_destroy (auth) 

AUTH * auth ; 

A macro that destroys the authentication information associated with auth. Destruction usually 
involves deallocation of private data structures. The use of auth is undefined after calling 
auth_destroy () . 

authnone_create() 

AUTH * 

authnone_create () 

Creates and returns an RPC authentication handle that passes no usable authentication informa- 
tion with each remote procedure call. 

authunix_create() 

AUTH * 

authunix_create (host , uid, gid, len, aup_gids) 
char *host ; 

int uid, gid, len, *aup_gids; 

Creates and returns an RPC authentication handle that contains UNIXf authentication informa- 
tion. The parameter host is the name of the machine on which the information was created; 
uid is the user’s user ID; gid is the user’s current group ID; len and aup_gids refer to a 
counted array of groups to which the user belongs. It is easy to impersonate a user. 

authunix_create_default() 

AUTH * 

authunix_create_de fault () 

Calls authunix_create () with the appropriate parameters. 

callrpc() 

callrpc (host, prognum, versnum, procnum, inproc, in, outproc, out) 
char *host; 

u_long prognum, versnum, procnum; 

char *in, *out; 

xdrproc_t inproc, outproc; 

Calls the remote procedure associated with prognum, versnum, and procnum on the machine, 
host. The parameter in is the address of the procedure’s argument(s), and out is the address 
of where to place the result(s); inproc is used to encode the procedure’s parameters, and 
outproc is used to decode the procedure’s results. This routine returns zero if it succeeds, or 

t UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 35 


the value of enum clnt_stat cast to an integer if it fails. The routine clnt_perrno () is 
handy for translating failure statuses into messages. Warning: calling remote procedures with 
this routine uses UDP/IP as a transport; see clntudp_create () for restrictions. 

clnt_broadcast() 

enum clnt_stat 

clnt_broadcast (prognum, versnum, procnum, inproc, in, outproc, out, eachresult) 
u_long prognum, versnum, procnum; 
char *in, *out; 
xdrproc_t inproc, outproc; 
resultproc_t eachresult; 

Like callrpc(), except the call message is broadcast to all locally connected broadcast nets. 

Each time it receives a response, this routine calls eachresult, whose form is 

eachresult (out, addr) 
char * out ; 

struct sockaddr_in *addr ; 

where out is the same as out passed to clnt_broadcast () , except that the remote 
procedure’s output is decoded there; addr points to the address of the machine that sent the 
results. If eachresult () returns zero, clnt_broadcast () waits for more replies; otherwise 
it returns with appropriate status. 

clnt_call() 

enum clnt_stat 

clnt_call (clnt, procnum, inproc, in, outproc, out, tout) 

CLIENT *clnt; long procnum; 
xdrproc_t inproc, outproc; 
char *in, *out; 
struct timeval tout; 

A macro that calls the remote procedure procnum associated with the client handle, clnt, 
which is obtained with an RPC client creation routine such as clntudp_create. The parame- 
ter in is the address of the procedure’s argument(s), and out is the address of where to place 
the result(s); inproc is used to encode the procedure’s parameters, and outproc is used to 
decode the procedure’s results; tout is the time allowed for results to come back. 

clnt_destroy() 

clnt_destroy (clnt) 

CLIENT *clnt; 

A macro that destroys the client’s RPC handle. Destruction usually involves deallocation of 
private data structures, including clnt itself. Use of clnt is undefined after calling 
clnt_destroy () . Warning: client destruction routines do not close sockets associated with 
clnt; this is the responsibility of the user. 


Sun Microsystems 


Release 2.0 


Page 36 


RPC Programming 


c!nt_freeres() 

clnt_freeres (clnt, outproc, out) 

CLIENT *clnt; 
xdrproc_t outproc; 
char * out ; 

A macro that frees any data allocated by the RPC/XDR system when it decoded the results of 
an RPC call. The parameter out is the address of the results, and outproc is the XDR routine 
describing the results in simple primitives. This routine returns one if the results were success- 
fully freed, and zero otherwise. 

clnt_geterr() 

void 

clnt_geterr (clnt, errp) 

CLIENT *clnt ; 
struct rpc_err *errp; 

A macro that copies the error structure out of the client handle to the structure at address 
errp. 

clnt_pcreateerror() 

void 

clnt_pcreateerror (s) 
char *s; 

Prints a message to standard error indicating why a client RPC handle could not be created. 
The message is prepended with string s and a colon. 

clnt_perrno() 

void 

clnt_perrno (stat) 

enum clnt_stat; 

Prints a message to standard error corresponding to the condition indicated by stat. 
clnt_perror() 

clnt_perror (clnt, s) 

CLIENT *clnt ; 
char * s ; 

Prints a message to standard error indicating why an RPC call failed; clnt is the handle used to 
do the call. The message is prepended with string s and a colon. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 37 


clntraw_create() 

CLIENT * 

clntraw_create (prognum, versnum) 
u_long prognum, versnum; 

This routine creates a toy RPC client for the remote program prognum, version versnum. The 
transport used to pass messages to the service is actually a buffer within the process’s address 
space, so the corresponding RPC server should live in the same address space; see 
svcrav_create () . This allows simulation of RPC and acquisition of RPC overheads, such as 
round trip times, without any kernel interference. This routine returns NULL if it fails. 

clnttcp_create() 

CLIENT 4 

clnttcp_create (addr , prognum, versnum, sockp, sendsz, recvsz) 
struct sockaddr_in 4 addr; 
u_long prognum, versnum; 
int 4 sockp; 
u_int sendsz , recvsz ; 

This routine creates an RPC client for the remote program prognum, version versnum; the 
client uses TCP/IP as a transport. The remote program is located at Internet address 4 addr. 
If addr->sin_port is zero, then it is set to the actual port that the remote program is listen- 
ing on (the remote portmap service is consulted for this information). The parameter *sockp is 
a socket; if it is RPC_ANYSOCK, then this routine opens a new one and sets *sockp. Since 
TCP-based RPC uses buffered I/O, the user may specify the size of the send and receive buffers 
with the parameters sendsz and recvsz; values of zero choose suitable defaults. This routine 
returns NULL if it fails. 

clntudp_create() 

CLIENT 4 

clntudp_create (addr , prognum, versnum, wait, sockp) 
struct sockaddr_in 4 addr; 
u_long prognum, versnum; 
struct timeval wait; 
int 4 sockp; 

This routine creates an RPC client for the remote program prognum, version versnum; the 
client uses use UDP/IP as a transport. The remote program is located at Internet address 
*addr. If addr->sin_port is zero, then it is set to actual port that the remote program is 
listening on (the remote portmap service is consulted for this information). The parameter 
♦sockp is a socket; if it is RPC_ANYSOCK, then this routine opens a new one and sets 
♦sockp. The UDP transport resends the call message in intervals of wait time until a response 
is received or until the call times out. Warning: since UDP-based RPC messages can only hold 
up to 8 Kbytes of encoded data, this transport cannot be used for procedures that take large 
arguments or return huge results. 


Sun Microsystems 


Release 2.0 


Page 38 


RPC Programming 


get_my address () 
void 

get_myaddress (addr) 

struct sockaddr_in *addr; 

Stuffs the machine’s IP address into *addr, without consulting the library routines that deal 
with /etc/ hosts. The port number is always set to htons (PMAPPORT) . 

pmap_getmaps() 

struct pmaplist * 
pmap_getmaps (addr) 

struct sockaddr_in *addr; 

A user interface to the portmap service, which returns a list of the current RPC program-to-port 
mappings on the host located at IP address *addr. This routine can return NULL. The com- 
mand rpcinfo -p uses this routine. 

pmap_getportQ 

u_short 

pmap_getport (addr , prognum, versnum, protocol) 
struct sockaddr_in *addr; 
u_long prognum, versnum, protocol; 

A user interface to the portmap service, which returns the port number on which waits a service 
that supports program number prognum, version versnum, and speaks the transport protocol 
associated with protocol. A return value of zero means that the mapping does not exist or that 
the RPC system failured to contact the remote portmap service. In the latter case, the global 
variable rpc_createerr contains the RPC status. 

pmap_rmtcall() 

enum clnt_stat 

pmap_rmtcall (addr , prognum, versnum, procnum, 

inproc, in, outproc, out, tout, portp) 
struct sockaddr_in *addr; 
u_long prognum, versnum, procnum; 
char *in, *out; 
xdrproc_t inproc, outproc; 
struct timeval tout; 
u_long *portp; 

A user interface to the portmap service, which instructs portmap on the host at IP address 
*addr to make an RPC call on your behalf to a procedure on that host. The parameter 
*portp will be modified to the program’s port number if the procedure succeeds. The 
definitions of other parameters are discussed in callrpcQ and clnt_call(); see also 
clnt_broadcast () . 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 39 


pmap_set() 

pmap_set (prognum, versnum, protocol, port) 
u_long prognum, versnum, protocol; 
u_short port; 

A user interface to the portmap service, which establishes a mapping between the triple 
[prognum, versnum, protocol] and port on the machine’s portmap service. The value of 
protocol is most likely IPPROTO_UDP or IPPROTO_TCP. This routine returns one if it 
succeeds, zero otherwise. 

pmap_unset() 

pmap_unset (prognum, versnum) 

u_long prognum, versnum; 

A user interface to the portmap service, which destroys all mappings between the triple 
[prognum, versnum, *] and ports on the machine’s portmap service. This routine returns 
one if it succeeds, zero otherwise. 

registerrpc() 

registerrpc (prognum, versnum, procnum, procname, inproc, outproc) 
u_long prognum, versnum, procnum; 
char * (‘procname) () ; 
xdrproc_t inproc, outproc; 

Registers procedure procname with the RPC service package. If a request arrives for program 
prognum, version versnum, and procedure procnum, procname is called with a pointer to its 
parameter(s); progname should return a pointer to its static result(s); inproc is used to decode 
the parameters while outproc is used to encode the results. This routine returns zero if the 
registration succeeded, —1 otherwise. Warning: remote procedures registered in this form are 
accessed using the UDP/IP transport; see svcudp_create () for restrictions. 

rpc_createerr 

struct rpc_createerr rpc_createerr; 

A global variable whose value is set by any RPC client creation routine that does not succeed. 
Use the routine clnt_pcreateerror () to print the reason why. 

svc_destroy() 

svc_destroy (xprt) 

SVCXPRT ‘xprt; 

A macro that destroys the RPC service transport handle, xprt. Destruction usually involves 
deallocation of private data structures, including xprt itself. Use of xprt is undefined after cal- 
ling this routine. 


Sun Microsystems 


Release 2.0 


Page 40 


RPC Programming 


svc_fds 

int svc_fds; 

A global variable reflecting the RPC service side’s read file descriptor bit mask; it is suitable as a 
parameter to the select system call. This is only of interest if a service implementor does not 
call svc.run () , but rather does his own asynchronous event processing. This variable is read- 
only (do not pass its address to select!), yet it may change after calls to svc_getreq () or 
any creation routines. 

svc_freeargs() 

svc_freeargs (xprt, inproc. In) 

SVCXPRT *xprt ; 
xdrproc_t inproc; 
char *in; 

A macro that frees any data allocated by the RPC/XDR system when it decoded the arguments 
to a service procedure using svc_getargs () . This routine returns one if the results were suc- 
cessfully freed, and zero otherwise. 

svc_getargs() 

svc_getargs (xprt , inproc, in) 

SVCXPRT *xprt; 
xdrproc_t inproc; 
char * in ; 

A macro that decodes the arguments of an RPC request associated with the RPC service tran- 
sport handle, xprt. The parameter in is the address where the arguments will be placed; 
inproc is the XDR routine used to decode the arguments. This routine returns one if decoding 
succeeds, and zero otherwise. 

svc_getcaller() 

struct sockaddr_in 
svc_getcaller (xprt) 

SVCXPRT *xprt ; 

The approved way of getting the network address of the caller of a procedure associated with 
the RPC service transport handle, xprt. 

svc_getreq() 

svc_getreq (rdfds) 
int rdfds; 

This routine is only of interest if a service implementor does not call svc_run(), but instead 
implements custom asynchronous event processing. It is called when the select system call has 
determined that an RPC request has arrived on some RPC socket(s); rdfds is the resultant read 
file descriptor bit mask. The routine returns when all sockets associated with the value of 
rdfds have been serviced. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 41 


svc_register() 

svc_register (xprt, prognum, versnum, dispatch, protocol) 

SVCXPRT ‘xprt; 
u_long prognum, versnum; 
void (‘dispatch) () ; 
u_long protocol; 

Associates prognum and versnum with the service dispatch procedure, dispatch. If 
protocol is non-zero, then a mapping of the triple [prognum, versnum, protocol] to 
xprt - >xp_port is also established with the local portmap service (generally protocol is 
zero, IPPROTO_UDP or IPPROTO_TCP). The procedure dispatch () has the following 
form: 

dispatch (request, xprt) 

struct svc_req ‘request; 

SVCXPRT ‘xprt; 

The svc_register routine returns one if it succeeds, and zero otherwise. 

svc_run() 

svc_run() 

This routine never returns. It waits for RPC requests to arrive and calls the appropriate service 
procedure (using svc_getreq) when one arrives. This procedure is usually waiting for a 
select system call to return. 

svc_sendreply() 

svc_sendreply (xprt, outproc, out) 

SVCXPRT ‘xprt; 
xdrproc_t outproc; 
char ‘out; 

Called by an RPC service’s dispatch routine to send the results of a remote procedure call. The 
parameter xprt is the caller’s associated transport handle; outproc is the XDR routine which 
is used to encode the results; and out is the address of the results. This routine returns one if it 
succeeds, zero otherwise. 

svc_unregister() 


void 

svc_unregister (prognum, versnum) 
u_long prognum, versnum; 

Removes all mapping of the double [prognum, versnum] to dispatch routines, and of the triple 
[prognum, versnum, *] to port number. 


Sun Microsystems 


Release 2.0 


Page 42 


RPC Programming 


svcerr_aufch() 

void 

svcerr_auth (xprt , why) 

SVCXPRT 4 xprt ; 
enum auth_stat why; 

Called by a service dispatch routine that refuses to perform a remote procedure call due to an 
authentication error. 


svcerr_decode() 

void 

svcerr_decode (xprt) 

SVCXPRT *xprt; 

Called by a service dispatch routine that can’t successfully decode its parameters. See also 
svc.getargs () . 

svcerr_noproc() 

void 

svcerr_noproc (xprt) 

SVCXPRT *xprt ; 

Called by a service dispatch routine that doesn’t implement the desired procedure number the 
caller request. 

svcerr_noprog() 

void 

svcerr_noprog (xprt) 

SVCXPRT *xprt ; 

Called when the desired program is not registered with the RPC package. Service implementors 
usually don’t need this routine. 

svcerr_progvers() 

void 

svcerr_progvers (xprt) 

SVCXPRT *xprt ; 

Called when the desired version of a program is not registered with the RPC package. Service 
implementors usually don’t need this routine. 

svcerr_systemerr() 

void 

svcerr_systemerr (xprt) 

SVCXPRT *xprt ; 

Called by a service dispatch routine when it detects a system error not covered by any particular 
protocol. For example, if a service can no longer allocate storage, it may call this routine. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 43 


svcerr_weakauth() 

void 

svcerr_weakauth (xprt) 

SVCXPRT *xprt; 

Called by a service dispatch routine that refuses to perform a remote procedure call due to 
insufficient (but correct) authentication parameters. The routine calls 
svcerr.auth (xprt , AUTH_TOOWEAK) . 

svcraw_create() 

SVCXPRT * 
svcraw_create () 

This routine creates a toy RPC service transport, to which it returns a pointer. The transport is 
really a buffer within the process’s address space, so the corresponding RPC client should live in 
the same address space; see clntrav_create(). This routine allows simulation of RPC and 
acquisition of RPC overheads (such as round trip times), without any kernel interference. This 
routine returns NULL if it fails. 

svctcp_create() 

SVCXPRT * 

svctcp_create (sock , send_bu f_size , recv_bu f_size) 
int sock; 

u_int send_buf_size, recv_buf_size; 

This routine creates a TCP/IP-based RPC service transport, to which it returns a pointer. The 
transport is associated with the socket sock, which may be RPC_ANYSOCK, in which case a 
new socket is created. If the socket is not bound to a local TCP port, then this routine binds it 
to an arbitrary port. Upon completion, xprt->xp_sock is the transport’s socket number, and 
xprt->xp_port is the transport’s port number. This routine returns NULL if it fails. Since 
TCP-based RPC uses buffered I/O, users may specify the size of the send and receive buffers; 
values of zero choose suitable defaults. 

svcudp_create() 

SVCXPRT * 

svcudp_create (sock) 
int sock; 

This routine creates a UDP/IP-based RPC service transport, to which it returns a pointer. The 
transport is associated with the socket sock, which may be RPC_ANYSOCK, in which case a 
new socket is created. If the socket is not bound to a local UDP port, then this routine binds it 
to an arbitrary port. Upon completion, xprt->xp_sock is the transport’s socket number, and 
xprt->xp_port is the transport’s port number. This routine returns NULL if it fails. Warn- 
ing: since UDP-based RPC messages can only hold up to 8 Kbytes of encoded data, this tran- 
sport cannot be used for procedures that take large arguments or return huge results. 


Sun Microsystems 


Release 2.0 


Page 44 


RPC Programming 


xdr_accepted_reply() 


xdr_accepted_reply (xdrs, ar) 

XDR 4 xdrs; 

struct accepted_reply *ar; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC-style messages without using the RPC package. 

xdr_array() 

xdr_array (xdrs , arrp, sizep, maxsize, elsize, elproc) 

XDR *xdrs; 
char “arrp; 

u_int * sizep , maxsize, elsize; 
xdrproc_t elproc; 

A filter primitive that translates between arrays and their corresponding external representa- 
tions. The parameter arrp is the address of the pointer to the array, while sizep is the 
address of the element count of the array; this element count cannot exceed maxsize. The 
parameter elsize is the sizeof () each of the array’s elements, and elproc is an XDR filter 
that translates between the array elements’ C form, and their external representation. This rou- 
tine returns one if it succeeds, zero otherwise. 

xdr_authunix_parms() 

xdr_authunix_parms (xdrs , au PP) 

XDR ‘xdrs ; 

struct authunix_parms *aupp; 

Used for describing UNIX credentials, externally. This routine is useful for users who wish to 
generate these credentials without using the RPC authentication package. 

xdr_bool() 

xdr_bool (xdrs, bp) 

XDR *xdrs ; 
bool_t *bp; 

A filter primitive that translates between booleans (C integers) and their external representa- 
tions. When encoding data, this filter produces values of either one or zero. This routine 
returns one if it succeeds, zero otherwise. 

xdr_bytes() 

xdr_bytes (xdrs, sp, sizep, maxsize) 

XDR ‘xdrs; 
char “sp; 

u_int ‘sizep, maxsize; 

A filter primitive that translates between counted byte strings and their external representations. 
The parameter sp is the address of the string pointer. The length of the string is located at 
address sizep; strings cannot be longer than maxsize. This routine returns one if it succeeds, 
zero otherwise. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 45 


xdr_callhdr() 

void 

xdr_callhdr (xdrs, chdr) 

XDR *xdrs; 

struct rpc_msg *chdr; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC-style messages without using the RPC package. 

xdr_callmsg() 

xdr_callmsg(xdrs ( cmsg) 

XDR *xdrs ; 

struct rpc_msg *cmsg; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC-style messages without using the RPC package. 

xdr_double() 

xdr_double(xdrs, dp) 

XDR *xdrs ; 
double *dp; 

A filter primitive that translates between C double precision numbers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_enum() 

xdr_enum(xdrs, ep) 

XDR *xdrs ; 
enum_t *ep; 

A filter primitive that translates between C enums (actually integers) and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_float() 

xdr_f loat (xdrs , fp) 

XDR *xdrs ; 
float *fp; 

A filter primitive that translates between C floats and their external representations. This 
routine returns one if it succeeds, zero otherwise. 

xdr_inline() 
long * 

xdr_inline (xdrs, len) 

XDR *xdrs; 
int len; 

A macro that invokes the in-line routine associated with the XDR stream, xdrs. The routine 
returns a pointer to a contiguous piece of the stream’s buffer; len is the byte length of the 
desired buffer. Note that pointer is cast to long *. Warning: xdr_inline() may return 0 


Sun Microsystems 


Release 2.0 


Page 46 


RPC Programming 


(NULL) if it cannot allocate a contiguous piece of a buffer. Therefore the behavior may vary 
among stream instances; it exists for the sake of efficiency. 

xdr_int() 

xdr_int (xdrs, ip) 

XDR *xdrs; 
int * ip ; 

A filter primitive that translates between C integers and their external representations. This 
routine returns one if it succeeds, zero otherwise. 

xdr_long() 

xdr_long (xdrs , lp) 

XDR *xdrs ; 
long *lp; 

A filter primitive that translates between C long integers and their external representations. 
This routine returns one if it succeeds, zero otherwise. 

xdr_opaque() 

xdr_opaque (xdrs , cp, cnt) 

XDR *xdrs ; 
char * cp ; 
u_int cnt ; 

A filter primitive that translates between fixed size opaque data and its external representation. 
The parameter cp is the address of the opaque object, and cnt is its size in bytes. This routine 
returns one if it succeeds, zero otherwise. 

xd r_opaque_aut h () 

xdr_opaque_auth (xdrs , ap) 

XDR *xdrs ; 

struct opaque_auth *ap; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC-style messages without using the RPC package. 

xdr_pmap() 

xdr_pmap (xdrs , regs) 

XDR *xdrs ; 
struct pmap *regs; 

Used for describing parameters to various portmap procedures, externally. This routine is useful 
for users who wish to generate these parameters without using the pmap interface. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 47 


xdr_pmaplist() 

xdr_pmaplist (xdrs, rp) 

XDR *xdrs; 

struct pmaplist **rp; 

Used for describing a list of port mappings, externally. This routine is useful for users who wish 
to generate these parameters without using the pmap interface. 

xdr_reference() 

xdr_re ference (xdrs , pp, size, proc) 

XDR *xdrs; 
char * *pp ; 
u_int size; 
xdrproc_t proc; 

A primitive that provides pointer chasing within structures. The parameter pp is the address of 
the pointer; size is the sizeof () the structure that *pp points to; and proc is an XDR pro- 
cedure that filters the structure between its C form and its external representation. This routine 
returns one if it succeeds, zero otherwise. 

xdr_rejected_reply() 

xdr_rejected_reply (xdrs, rr) 

XDR *xdrs; 

struct rejected_reply *rr; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC-style messages without using the RPC package. 

xdr_replymsg() 

xdr_replymsg (xdrs, rmsg) 

XDR *xdrs; 

struct rpc_msg *rmsg; 

Used for describing RPC messages, externally. This routine is useful for users who wish to gen- 
erate RPC style messages without using the RPC package. 

xdr_short() 

xdr_short (xdrs, sp) 

XDR *xdrs ; 
short *sp; 

A filter primitive that translates between C short integers and their external representations. 
This routine returns one if it succeeds, zero otherwise. 


Sun Microsystems 


Release 2.0 


Page 48 


RPC Programming 


xdr_string() 

xdr_string(xdrs, sp, maxsize) 

XDR *xdrs; 
char * * sp ; 
u_int maxsize; 

A filter primitive that translates between C strings and their corresponding external representa- 
tions. Strings cannot cannot be longer than maxsize. Note that sp is the address of the 
string’s pointer. This routine returns one if it succeeds, zero otherwise. 

xdr_u_int() 

xdr_u_int (xdrs, up) 

XDR *xdrs; 
unsigned *up; 

A filter primitive that translates between C unsigned integers and their external representa- 
tions. This routine returns one if it succeeds, zero otherwise. 

xdr_u_long() 

xdr_u_long(xdrs, ulp) 

XDR *xdrs ; 
unsigned long *ulp; 

A filter primitive that translates between C unsigned long integers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_u_short() 

xdr_u_short (xdrs , usp) 

XDR *xdrs ; 

unsigned short *usp; 

A filter primitive that translates between C unsigned short integers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_union() 

xdr_union (xdrs , dscmp, unp, choices, dfault) 

XDR *xdrs ; 
int * dscmp; 
char * unp ; 

struct xdr_discrim *choices; 
xdrproc_t dfault; 

A filter primitive that translates between a discriminated C union and its corresponding exter- 
nal representation. The parameter dscmp is the address of the union’s discriminant, while unp 
in the address of the union. This routine returns one if it succeeds, zero otherwise. 


Sun Microsystems 


Release 2.0 


RPC Programming 


Page 49 


xdr_void() 

xdr_void () 

This routine always returns one. 

xdr_wrapstring() 

xdr_wrapstring (xdrs, sp) 

XDR *xdrs; 
char **sp; 

A primitive that calls xdr_string (xdrs, sp, MAXUNSIGNED) ; where MAXUNSIGNED is the 
maximum value of an unsigned integer. This is handy because the RPC package passes only two 
parameters XDR routines, whereas xdr_string () , one of the most frequently used primitives, 
requires three parameters. This routine returns one if it succeeds, zero otherwise. 

xprt_register() 

void 

xprt_register (xprt) 

SVCXPRT *xprt; 

After RPC service transport handles are created, they should register themselves with the RPC 
service package. This routine modifies the global variable svc_fds. Service implementors usu- 
ally don’t need this routine. 

xprt_unregister() 

void 

xprt_unregister (xprt) 

SVCXPRT *xprt; 

Before an RPC service transport handle is destroyed, it should unregister itself with the RPC 
service package. This routine modifies the global variable svc_fds. Service implementors usu- 
ally don’t need this routine. 


Sun Microsystems 


Release 2.0 


External Data Representation 
Protocol Specification 


Contents 


1. Introduction j 

2 . Justification j 

3. XDR Library Primitives g 

3.1. Number Filters 5 

3.2. Floating Point Filters g 

3.3. Enumeration Filters 7 

3.4. No Data 7 

3.5. Constructed Data Type Filters 7 

3.5.1. Strings g 

3.5.2. Byte Arrays g 

3.5.3. Arrays g 

3.5.4. Opaque Data jj 

3.5.5. Fixed Sized Arrays jj 

3.5.6. Discriminated Unions 12 

3.5.7. Pointers 13 

3. 5. 7.1. Pointer Semantics and XDR 14 

3.6. Non-filter Primitives 15 

3.7. XDR Operation Directions 15 

4. XDR Stream Access 

4.1. Standard I/O Streams lg 

4.2. Memory Streams lg 

4.3. Record (TCP/IP) Streams 17 

5. XDR Stream Implementation jg 

5.1. The XDR Object lg 

8. XDR Standard gQ 

6.1. Basic Block Size 20 

6.2. Integer 20 

6.3. Unsigned Integer 20 

6.4. Enumerations 20 

6.5. Booleans 21 

6.6. Hyper Integer and Hyper Unsigned 21 

6.7. Floating Point and Double Precision 21 


6.8. Opaque Data 22 

6.9. Counted Byte Strings 22 

6.10. Fixed Arrays 22 

6.11. Counted Arrays 23 

6.12. Structures 23 

6.13. Discriminated Unions 23 

6.14. Missing Specifications 23 

6.15. Library Primitive / XDR Standard Cross Reference 24 

7 . Advanced Topics 25 

7.1. Linked Lists 25 

A. The Record Marking Standard 29 

B. Synopsis of XDR Routines 30 


External Data Representation 
Protocol Specification 


1. Introduction 

This manual describes library routines that allow a C programmer to describe arbitrary data 
structures in a machine-independent fashion. The external Data Representation (XDR) standard 
is the backbone of Sun’s Remote Procedure Call package, in the sense that data for remote pro- 
cedure calls is transmitted using the standard. XDR library routines should be used to transmit 
data that is accessed (read or written) by more than one type of machine. 

This manual contains a description of XDR library routines, a guide to accessing currently avail- 
able XDR streams, information on defining new streams and data types, and a formal definition 
of the XDR standard. XDR was designed to work across different languages, operating systems, 
and machine architectures. Most users (particularly RPC users) only need the information in 
sections 2 and 3 of this document. Programmers wishing to implement RPC and XDR on new 
machines will need the information in sections 4 through 6. Advanced topics, not necessary for 
all implementations, are covered in section 7. 

On Sun systems, C programs that want to use XDR routines must include the file 
<rpc/rpc.h>, which contains all the necessary interfaces to the XDR system. Since the C 
library libc . a contains all the XDR routines, compile as normal. 

cc program. c 


2. Justification 

Consider the following two programs, writer: 

^include <stdio.h> 

main () /* writer. c */ 

< 

long i; 

for (i = 0; i < 8; i + + ) { 

if (fwrite( (char *)&i, sizeof(i), 1, stdout) != 1) < 
fprintf (stderr , "failed!\n") ; 
exit (1) ; 


and reader: 


Sun Microsystems 


Release 2.0 


Page 2 


XDR Protocol Spec 


#include <stdio.h> 

main() /* reader. c */ 

long i, j; 

for (j = 0; j < 8; j ++ ) < 

if (fread ( (char *)&i, sizeof (i) , 1, stdin) != 1) { 
fprintf (stderr, "failed!\n") ; 
exit (1) ; 

> 

printf ("%ld ", i) ; 

> 

printf ("\n") ; 

> 

The two programs appear to be portable, because (a) they pass lint checking, and (b) they 
exhibit the same behavior when executed on two different hardware architectures, a Sun and a 
VAX. 

Piping the output of the writer program to the reader program gives identical results on a 
Sun or a VAX.f 

sun% writer | reader 

01234567 

sun% 

vax% writer | reader 

01234567 

vax% 

With the advent of local area networks and Berkeley’s 4.2 BSD UNEXf came the concept of “net- 
work pipes” — a process produces data on one machine, and a second process consumes data on 
another machine. A network pipe can be constructed with writer and reader. Here are the 
results if the first produces data on a Sun, and the second consumes data on a VAX. 

sun% writer | rsh vax reader 

O 16777216 33554432 50331648 67108864 83886080 100663296 117440512 

sun % 

Identical results can be obtained by executing writer on the VAX and reader on the Sun. 
These results occur because the byte ordering of long integers differs between the VAX and the 
Sun, even though word size is the same. Note that 16777216 is 2 24 — when four bytes are 
reversed, the 1 winds up in the 24th bit. 

Whenever data is shared by two or more machine types, there is a need for portable data. Pro- 
grams can be made data-portable by replacing the read() and write () calls with calls to an 
XDR library routine xdr_long(), a filter that knows the standard representation of a long 
integer in its external form. Here are the revised versions of writer: 


t VAX is a trademark of Digital Equipment Corporation, 
t UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 3 


#include <stdio.h> 

#include <rpc/rpc.h> /* xdr is a sub-library of the rpc library */ 

main () /* writer. c */ 

{ 

XDR xdrs; 
long i; 

xdrstdio_create (&xdrs , stdout, XDR_ENCODE) ; 
for (i =0; i < 8; i++) { 

if ( ! xdr_long(&xdrs, &i) ) { 

fprlnt f (stderr , "failed!\n") ; 
exit (1) ; 


and reader: 

#include <stdio.h> 

#include <rpc/rpc.h> /* xdr is a sub-library of the rpc library */ 
main () /* reader. c */ 

XDR xdrs; 
long i, j; 

xdrstdio_create (&xdrs , stdin, XDR_DECODE) ; 
for (j =0; j <8; j++) { 

if ( ! xdr_long(&xdrs, &i) ) < 

fprintf (stderr , " failed !\n") ; 
exit (1) ; 

> 

printf ("%ld ", i) ; 

> 

printf ("\n") ; 

> 

The new programs were executed on a Sun, on a VAX, and from a Sun to a VAX, the results are 
shown below. 

sun% writer | reader 

01234567 

sun% 

vax% writer | reader 

01234567 

vax% 

sun% writer | rsh vax reader 

01234567 

sun% 

Dealing with integers is just the tip of the portable-data iceberg. Arbitrary data structures 
present portability problems, particularly with respect to alignment and pointers. Alignment on 
word boundaries may cause the size of a structure to vary from machine to machine. Pointers 
are convenient to use, but have no meaning outside the machine where they are defined. 


Sun Microsystems 


Release 2.0 


Page 4 


XDR Protocol Spec 


The XDR library package solves data portability problems. It allows you to write and read arbi- 
trary C constructs in a consistent, specified, well-documented manner. Thus, it makes sense to 
use the library even when the data is not shared among machines on a network. 

The XDR library has filter routines for strings (null-terminated arrays of bytes), structures, 
unions, and arrays, to name a few. Using more primitive routines, you can write your own 
specific XDR routines to describe arbitrary data structures, including elements of arrays, arms of 
unions, or objects pointed at from other structures. The structures themselves may contain 
arrays of arbitrary elements, or pointers to other structures. 

Let’s examine the two programs more closely. There is a family of XDR stream creation rou- 
tines in which each member treats the stream of bits differently. In our example, data is mani- 
pulated using standard I/O routines, so we use xdrstdio_create () . The parameters to XDR 
stream creation routines vary according to their function. In our example, 
xdrstdio_create () takes a pointer to an XDR structure that it initializes, a pointer to a 
FILE that the input or output is performed on, and the operation. The operation may be 
XDR_ENCODE for serializing in the writer program, or XDR_DECODE for deserializing in 
the reader program. 

Note: RPC clients never need to create XDR streams; the RPC system itself creates these 
streams, which are then passed to the clients. 

The xdr_long () primitive is characteristic of most XDR library primitives and all client XDR 
routines. First, the routine returns FALSE (0) if it fails, and TRUE (1) if it succeeds. Second, 
for each data type, xxx, there is an associated XDR routine of the form: 

xdr_xxx(xdrs, fp) 

XDR *xdrs ; 
xxx * fp ; 

{ 

> 

In our case, xxx is long, and the corresponding XDR routine is a primitive, xdr_long. The 
client could also define an arbitrary structure xxx in which case the client would also supply the 
routine xdr_xxx, describing each field by calling XDR routines of the appropriate type. In all 
cases the first parameter, xdrs can be treated as an opaque handle, and passed to the primitive 
routines. 

XDR routines are direction independent; that is, the same routines are called to serialize or 
deserialize data. This feature is critical to software engineering of portable data. The idea is to 
call the same routine for either operation — this almost guarantees that serialized data can also 
be deserialized. One routine is used by both producer and consumer of networked data. This is 
implemented by always passing the address of an object rather than the object itself — only in 
the case of deserialization is the object modified. This feature is not shown in our trivial exam- 
ple, but its value becomes obvious when nontrivial data structures are passed among machines. 
If needed, you can obtain the direction of the XDR operation. See section 3.7 for details. 

Let’s look at a slightly more complicated example. Assume that a person’s gross assets and lia- 
bilities are to be exchanged among processes. Also assume that these values are important 
enough to warrant their own data type: 

struct gnumbers { 

long g_assets; 
long g_liabiliti.es ; 

}; 

The corresponding XDR routine describing this structure would be: 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 5 


bool_t /* TRUE is success, FALSE is failure */ 

xdr_gnumbers (xdrs , gp) 

XDR *xdrs ; 

struct gn umbers *gp; 

{ 

if (xdr_long (xdrs , &gp->g_assets) && 

xdr_long(xdrs, &gp->g_liabilities) ) 
return (TRUE) ; 
return (FALSE) ; 

> 

Note that the parameter xdrs is never inspected or modified; it is only passed on to the subcom- 
ponent routines. It is imperative to inspect the return value of each XDR routine call, and to 
give up immediately and return FALSE if the subroutine fails. 

This example also shows that the type bool_t is declared as an integer whose only values are 
TRUE (1) and FALSE (0). This document uses the following definitions: 

#define bool_t int 
#define TRUE 1 
#def ine FALSE O 

tdefine enum_t int /* enum_t's are used for generic enun's */ 


Keeping these conventions in mind, xdr_gnumbers () can be rewritten as follows: 


xdr_gnumbers (xdrs, gp) 

XDR *xdrs; 

struct gnumbers *gp; 

{ 

return (xdr_long (xdrs, 
xdr_long (xdrs , 


> 


&gp->g_assets) && 
&gp->g_liabilities) ) ; 


This document uses both coding styles. 


Sun Microsystems 


Release 2.0 


Page 6 


XDR Protocol Spec 


3o XDR Library Primitives 

This section gives a synopsis of each XDR primitive. It starts with basic data types and moves 
on to constructed data types. Finally, XDR utilities are discussed. The interface to these primi- 
tives and utilities is defined in the include file <rpc/xdr.h>, automatically included by 
<rpc/rpc .h>. 

3.1. Number Filters 

The XDR library provides primitives that translate between C numbers and their corresponding 
external representations. The primitives cover the set of numbers in: 

[signed , unsigned] * [short ,int ,long\ 

Specifically, the six primitives are: 

bool_t xdr_int (xdrs, ip) 

XDR *xdrs ; 
int * ip ; 

bool_t xdr_u_int (xdrs, up) 

XDR *xdrs; 
unsigned *up; 

bool_t xdr_long (xdrs , lip) 

XDR ‘xdrs; 
long *lip; 

bool_t xdr_u_long(xdrs, lup) 

XDR *xdrs ; 
u_long *lup; 

bool_t xdr_short (xdrs, sip) 

XDR *xdrs; 
short *sip; 

bool_t xdr_u_short (xdrs, sup) 

XDR *xdrs ; 
u_short *sup; 

The first parameter, xdrs, is an XDR stream handle. The second parameter is the address of 
the number that provides data to the stream or receives data from it. All routines return TRUE 
if they complete successfully, and FALSE otherwise. 

3.2. Floating Point Filters 

The XDR library also provides primitive routines for C’s floating point types: 

bool_t xdr_float (xdrs, fp) 

XDR ‘xdrs; 
float *fp; 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 7 


bool_t xdr_double (xdrs, dp) 

XDR *xdrs ; 
double *dp; 

The first parameter, xdrs is an XDR stream handle. The second parameter is the address of 
the floating point number that provides data to the stream or receives data from it. All routines 
return TRUE if they complete successfully, and FALSE otherwise. 

Note: Since the numbers are represented in IEEE floating point, routines may fail when decoding 
a valid IEEE representation into a machine-specific representation, or vice-versa. 


3.3. Enumeration Filters 

The XDR library provides a primitive for generic enumerations. The primitive assumes that a C 
enum has the same representation inside the machine as a C integer. The boolean type is an 
important instance of the enum. The external representation of a boolean is always one (TRUE) 
or zero (FALSE). 

#define bool_t int 
#define FALSE 0 
tfdefine TRUE 1 

#define enum_t int 

bool_t xdr_enum(xdrs, ep) 

XDR *xdrs; 
enum_t *ep; 

bool_t xdr_bool (xdrs, bp) 

XDR *xdrs; 
bool_t *bp; 

The second parameters ep and bp are addresses of the associated type that provides data to, or 
receives data from, the stream xdrs. The routines return TRUE if they complete successfully, 
and FALSE otherwise. 


3.4. No Data 

Occasionally, an XDR routine must be supplied to the RPC system, even when no data is passed 
or required. The library provides such a routine: 

bool_t xdr_void(); /* always returns TRUE */ 


3.5. Constructed Data Type Filters 

Constructed or compound data type primitives require more parameters and perform more com- 
plicated functions then the primitives discussed above. This section includes primitives for 
strings, arrays, unions, and pointers to structures. Constructed data type primitives may use 
memory management. In many cases, memory is allocated when deserializing data with 
XDR_DECODE. Therefore, the XDR package must provide means to deallocate memory. This 
is done by an XDR operation, XDR_FREE. To review, the three XDR directional operations are 
XDR_ENCODE, XDR_DECODE, and XDR_FREE. 


Sun Microsystems 


Release 2.0 


Page 8 


XDR Protocol Spec 


3.5.1. Strings 

In C, a string is defined as a sequence of bytes terminated by a null byte, which is not considered 
when calculating string length. However, when a string is passed or manipulated, a pointer to it 
is employed. Therefore, the XDR library defines a string to be a char *, and not a sequence of 
characters. The external representation of a string is drastically different from its internal 
representation. Externally, strings are represented as sequences of ASCII characters, while inter- 
nally, they are represented with character pointers. Conversion between the two representations 
is accomplished with the routine xdr_string() : 

bool_t xdr_string(xdrs, sp, maxlength) 

XDR ‘xdrs ; 
char “ sp ; 
u_int maxlength; 

The first parameter xdrs is the XDR stream handle. The second parameter sp is a pointer to a 
string (type char “). The third parameter maxlength specifies the maximum number of 
bytes allowed during encoding or decoding; its value is usually specified by a protocol. For 
example, a protocol specification may say that a file name may be no longer than 255 characters. 
The routine returns FALSE if the number of characters exceeds maxlength, and TRUE if it 
doesn’t. 

The behavior of xdr_string() is similar to the behavior of other routines discussed in this 
section. The direction XDR_ENCODE is easiest to understand. The parameter sp points to a 
string of a certain length; if it does not exceed maxlength, the bytes are serialized. 

The effect of deserializing a string is subtle. First the length of the incoming string is deter- 
mined; it must not exceed maxlength. Next sp is dereferenced; if the the value is NULL, then 
a string of the appropriate length is allocated and *sp is set to this string. If the original value 
of *sp is non-NULL, then the XDR package assumes that a target area has been allocated, 
which can hold strings no longer than maxlength. In either case, the string is decoded into the 
target area. The routine then appends a null character to the string. 

In the XDR_FREE operation, the string is obtained by dereferencing sp. If the string is not 
NULL, it is freed and *sp is set to NULL. In this operation, xdr_string ignores the 
maxlength parameter. 


3.5.2. Byte Arrays 

Often variable-length arrays of bytes are preferable to strings. Byte arrays differ from strings in 
the following three ways: 1) the length of the array (the byte count) is explicitly located in an 
unsigned integer, 2) the byte sequence is not terminated by a null character, and 3) the external 
representation of the bytes is the same as their internal representation. The primitive 
xdr_bytes () converts between the internal and external representations of byte arrays: 

bool_t xdr_bytes (xdrs , bpp, lp, maxlength) 

XDR ‘xdrs; 
char “bpp; 
u_int * lp ; 
u_int maxlength; 

The usage of the first, second and fourth parameters are identical to the first, second and third 
parameters of xdr_string () , respectively. The length of the byte area is obtained by dere- 
ferencing lp when serializing; *lp is set to the byte length when deserializing. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 9 


3.5.3. Arrays 


The XDR library package provides a primitive for handling arrays of arbitrary elements. The 
xdr_bytes () routine treats a subset of generic arrays, in which the size of array elements is 
known to be 1, and the external description of each element is built-in. The generic array primi- 
tive, xdr_array() requires parameters identical to those of xdr_bytes () plus two more: the 
size of array elements, and an XDR routine to handle each of the elements. This routine is 
called to encode or decode each element of the array. 

bool_t xdr_array (xdrs, ap, lp, maxlength, elementsize, xdr_element) 

XDR *xdrs ; 

char * * ap ; 

u_int * lp ; 

u_int maxlength; 

u_int elementsize; 

bool_t (*xdr_element) () ; 

The parameter ap is the address of the pointer to the array. If *ap is NULL when the array is 
being deserialized, XDR allocates an array of the appropriate size and sets *ap to that array. 
The element count of the array is obtained from *lp when the array is serialized; *lp is set to 
the array length when the array is deserialized. The parameter maxlength is the maximum 
number of elements that the array is allowed to have; elementsize is the byte size of each ele- 
ment of the array (the C function sizeofQ can be used to obtain this value). The routine 
xdr_element is called to serialize, deserialize, or free each element of the array. 

Examples 

Before defining more constructed data types, it is appropriate to present three examples. 

Example A 

A user on a networked machine can be identified by (a) the machine name, such as krypton: 
see gethostname( 3); (b) the user’s UID: see geteuid( 2); and (c) the group numbers to which the 
user belongs: see getgroups{ 2). A structure with this information and its associated XDR routine 
could be coded like this: 


struct netuser { 

char *nu_machinename; 

int nu_uid; 

u_int nu_glen; 

int *nu_gids; 

}; 

#define NLEN 255 /* machine names must be shorter than 256 chars */ 
ttdefine NGRPS 20 /* user can't be a member of more than 20 groups */ 


bool_t 

xdr_netuser (xdrs, nup) 

XDR *xdrs; 

struct netuser *nup; 

< 

return (xdr_string (xdrs , &nup->nu_machinename, NLEN) && 
xdr_int (xdrs, &nup- >nu_uid) && 

xdr_array (xdrs, &nup->nu_gids, &nup->nu_glen, NGRPS, 
sizeof (int), xdr_int) ) ; 


> 


Sun Microsystems 


Release 2.0 


Page 10 


XDR Protocol Spec 


Example B 

A party of network users could be implemented as an array of netuser structure. The declara- 
tion and its associated XDR routines are as follows: 


struct party { 

u_int p_len; 

struct netuser *p_nusers; 

}; 

#define PLEN 500 /* max number of users in a party */ 


bool_t 

xdr_party (xdrs , pp) 

XDR *xdrs; 
struct party *pp; 

{ 


> 


return (xdr_array (xdrs , &pp->p_nusers , &pp->p_len, 
sizeof (struct netuser), xdr_netuser) ) ; 


PLEN, 


Example C 

The well-known parameters to main () , argc and argv can be combined into a structure. An 
array of these structures can make up a history of commands. The declarations and XDR rou- 
tines might look like: 

struct cmd { 

u_int c_argc; 
char **c_argv; 

}; 

#define ALEN 1000 /* args can be no longer than 1000 chars */ 

#def ine NARGC 100 /* commands may have no more than 100 args */ 

struct history •( 

u_int h_len; 
struct cmd *h_cmds; 

>; 

#def ine NCMDS 75 /* history is no more than 75 commands */ 


bool_t 

xdr_wrap_string (xdrs , sp) 
XDR *xdrs ; 
char * * sp ; 


> 


return (xdr_string (xdrs , 


sp , ALEN) ) ; 


bool_t 

xdr_cmd (xdrs, cp) 

XDR *xdrs ; 
struct cmd * cp ; 


> 


return (xdr_array (xdrs , &cp->c_argv, &cp->c_argc, 
sizeof (char *) , xdr_wrap_string) ) ; 


NARGC, 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 11 


bool_t 

xdr_history (xdrs, hp) 

XDR *xdrs; 

struct history *hp; 

{ 

return (xdr_array (xdrs, &hp->h_cmds, 
sizeof (struct cmd) , xdr_cmd) ) ; 


> 


&hp->h_len. 


NCMDS, 


The most confusing part of this example is that the routine xdr_vrap_string () is needed to 
package the xdr_string() routine, because the implementation of xdr_array () only passes 
two parameters to the array element description routine; xdr_wrap_string () supplies the 
third parameter to xdr_string() . 


By now the recursive nature of the XDR library should be obvious. Let’s continue with more 
constructed data types. 


8.5.4 ■ Opaque Data 

In some protocols, handles are passed from a server to client. The client passes the handle back 
to the server at some later time. Handles are never inspected by clients; they are obtained and 
submitted. That is to say, handles are opaque. The primitive xdr_opaque () is used for 
describing fixed sized, opaque bytes. 

bool_t xdr_opaque (xdrs, p, len) 

XDR *xdrs; 
char *p ; 
u_int len; 

The parameter p is the location of the bytes; len is the number of bytes in the opaque object. 
By definition, the actual data contained in the opaque object are not machine portable. 


3.5.5. Fixed Sized Arrays 

The XDR library does not provide a primitive for fixed-length arrays (the primitive 
xdr_array() is for varying-length arrays). Example A could be rewritten to use fixed-sized 
arrays in the following fashion: 

#define NLEN 255 /* machine names must be shorter than 256 chars */ 

#def ine NGRPS 20 /* user cannot be a member of more than 20 groups */ 

struct netuser { 

char *nu_machinename; 

int nu_uid; 

int nu_gids [NGRPS] ; 

}; 


Sun Microsystems 


Release 2.0 


Page 12 


XDR Protocol Spec 


bool_t 

xdr_netuser (xdrs, nup) 

XDR *xdrs; 

struct netuser *nup; 

{ 

int i ; 

if (! xdr_string (xdrs, &nup->nu_machinename, NLEN) ) 
return (FALSE) ; 

if (! xdr_int (xdrs, &nup->nu_uid) ) 
return (FALSE) ; 
for (i = 0; i < NGRPS; i++) { 

if (! xdr_int (xdrs , &nup->nu_gids [i] ) ) 
return (FALSE) ; 

> 

return (TRUE) ; 

> 

Exercise: Rewrite example A so that it uses varying-length arrays and so that the netuser 
structure contains the actual nu_gids array body as in the example above. 

8.5.6. Discriminated Unions 

The XDR library supports discriminated unions. A discriminated union is a C union and an 
enum_t value that selects an “arm” of the union. 

struct xdr_discrim ■{ 
enum_t value; 
bool_t (*proc) () ; 

}; 

bool_t xdr_union (xdrs, dscmp, unp, arms, defaultarm) 

XDR *xdrs; 
enum_t *dscmp; 
char *unp; 

struct xdr_discrim ‘arms; 

bool_t (*defaultarm) () ; /* may equal NULL */ 

First the routine translates the discriminant of the union located at * dscmp. The discriminant 
is always an enum_t. Next the union located at *unp is translated. The parameter arms is a 
pointer to an array of xdr_discrim structures. Each structure contains an order pair of 
[value , proc] . If the union’s discriminant is equal to the associated value, then the proc is 
called to translate the union. The end of the xdr_discrim structure array is denoted by a 
routine of value NULL (0). If the discriminant is not found in the arms array, then the 
defaultarm procedure is called if it is non-NULL; otherwise the routine returns FALSE. 

Example D 

Suppose the type of a union may be integer, character pointer (a string), or a gnumbers struc- 
ture. Also, assume the union and its current type are declared in a structure. The declaration 
is: 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 13 


enum utype { INTEGER=1, STRING=2, GNUMBERS=3 >; 
struct u_tag •{ 

enum utype utype; /* this is the union's discriminant */ 

union •{ 

int ival; 

char *pval; 

struct gnumbars gn; 

)• uval; 

>; 

The following constructs and XDR procedure (de)serialize the discriminated union: 

struct xdr_discrim u_tag_arms [4] = ■{ 

{ INTEGER, xdr_int }, 

{ GNUMBERS, xdr_gnumbers } 

{ STRING, xdr_wrap_string }, 

{ dontcare , NULL } 

/* always terminate arms with a NULL xdr_proc */ 

> 

bool_t 

xdr_u_tag (xdrs , utp) 

XDR *xdrs; 
struct u_tag *utp; 

{ 

return (xdr_union (xdrs, &utp-> utype, &utp->uval, u_tag_arms , 
NULL) ) ; 

> 

The routine xdr_gnumbers () was presented in Section 2; xdr_wrap_string () was 
presented in example C. The default arm parameter to xdr_union () (the last parameter) is 
NULL in this example. Therefore the value of the union’s discriminant legally may take on only 
values listed in the u_tag_arms array. This example also demonstrates that the elements of 
the arm’s array do not need to be sorted. 

It is worth pointing out that the values of the discriminant may be sparse, though in this exam- 
ple they are not. It is always good practice to assign explicitly integer values to each element of 
the discriminant’s type. This practice both documents the external representation of the 
discriminant and guarantees that different C compilers emit identical discriminant values. 

Exercise: Implement xdr_union() using the other primitives in this section. 


3.5.7. Pointers 

In C it is often convenient to put pointers to another structure within a structure. The primitive 
xdr_reference () makes it easy to serialize, deserialize, and free these referenced structures. 

bool_t xdr_reference(xdrs, pp, size, proc) 

XDR *xdrs ; 
char * *pp ; 
u_int ssize; 
bool_t (*proc) () ; 


Sun Microsystems 


Rele ase 2.0 


Page 14 


XDR Protocol Spec 


Parameter pp is the address of the pointer to the structure; parameter ssize is the size in 
bytes of the structure (use the C function sizeof () to obtain this value); and proc is the XDR 
routine that describes the structure. When decoding data, storage is allocated if *pp is NULL. 

There is no need for a primitive xdr_struct () to describe structures within structures, 
because pointers are always sufficient. 

Exercise: Implement xdr_reference() using xdr_array () . Warning: xdr_reference() 
and xdr_array () are NOT interchangeable external representations of data. 

Example E 

Suppose there is a structure containing a person’s name and a pointer to a gnumbers structure 
containing the person’s gross assets and liabilities. The construct is: 

struct pgn { 

char ‘name; 

struct gnumbers *gnp; 

>; 

The corresponding XDR routine for this structure is: 
bool_t 

xdr_pgn (xdrs, pp) 

XDR *xdrs ; 
struct pgn *pp; 

if (xdr_string(xdrs, &pp->name, NLEN) && 

xdr_re f erence (xdrs , &pp->gnp, sizeof (struct gnumbers), 
xdr_gnumbers) ) 

return (TRUE) ; 
return (FALSE) ; 

> 


8.5. 7.1. Pointer Semantics and XDR 

In many applications, C programmers attach double meaning to the values of a pointer. Typi- 
cally the value NULL (or zero) means data is not needed, yet some application-specific interpre- 
tation applies. In essence, the C programmer is encoding a discriminated union efficiently by 
overloading the interpretation of the value of a pointer. For instance, in example E a NULL 
pointer value for gnp could indicate that the person’s assets and liabilities are unknown. That 
is, the pointer value encodes two things: whether or not the data is known; and if it is known, 
where it is located in memory. Linked lists are an extreme example of the use of application- 
specific pointer interpretation. 

The primitive xdr_re f erence () cannot and does not attach any special meaning to a NULL- 
value pointer during serialization. That is, passing an address of a pointer whose value is NULL 
to xdr_reference () when serialing data will most likely cause a memory fault and, on UNIX, 
a core dump for debugging. 

It is the explicit responsibility of the programmer to expand non-dereferenceable pointers into 
their specific semantics. This usually involves describing data with a two-armed discriminated 
union. One arm is used when the pointer is valid; the other is used when the pointer is invalid 
(NULL). Section 7 has an example (linked lists encoding) that deals with invalid pointer 
interpretation. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 15 


Exercise: After reading Section 7, return here and extend example E so that it can correctly deal 
with null pointer values. 

Exercise: Using the xdr_union () , xdr_reference() and xdr_void() primitives, imple- 
ment a generic pointer handling primitive that implicitly deals with NULL pointers. The XDR 
library does not provide such a primitive because it does not want to give the illusion that 
pointers have meaning in the external world. 


3.6. Non-filter Primitives 

XDR streams can be manipulated with the primitives discussed in this section. 

u_int xdr_getpos (xdrs) 

XDR *xdrs ; 

bool_t xdr_setpos (xdrs, pos) 

XDR *xdrs ; 
u_int pos; 

xdr_destroy (xdrs) 

XDR *xdrs; 

The routine xdr_getpos () returns an unsigned integer that describes the current position in 
the data stream. Warning: In some XDR streams, the returned value of xdr_getpos () is 
meaningless; the routine returns a —1 in this case (though —1 should be a legitimate value). 

The routine xdr_setpos () sets a stream position to pos. Warning: In some XDR streams, 
setting a position is impossible; in such cases, xdr_setpos () will return FALSE. This routine 
will also fail if the requested position is out-of-bounds. The definition of bounds varies from 
stream to stream. 

The xdr_destroy () primitive destroys the XDR stream. Usage of the stream after calling this 
routine is undefined. 


3.7. XDR Operation Directions 

At times you may wish to optimize XDR routines by taking advantage of the direction of the 
operation (XDR_ENCODE, XDR_DECODE, or XDR_FREE). The value xdrs->x_op always 
contains the direction of the XDR operation. Programmers are not encouraged to take advan- 
tage of this information. Therefore, no example is presented here. However, an example in Sec- 
tion 7 demonstrates the usefulness of the xdrs->x_op field. 


Sun Microsystems 


Release 2.0 


Page 16 


XDE Protocol Spec 


4. XBR. Stream Access 

An XDR stream is obtained by calling the appropriate creation routine. These creation routines 
take arguments that are tailored to the specific properties of the stream. 

Streams currently exist for (de)serialization of data to or from standard I/O FILE streams, 
TCP/IP connections and UNIX files, and memory. Section 5 documents the XDR object and how 
to make new XDR streams when they are required. 


4.1. Standard I/O Streams 

XDR streams can be interfaced to standard I/O using the xdrstdio_create () routine as fol- 
lows: 


^include <stdio.h> 

^include <rpc/rpc.h> /* xdr streams are a part of the rpc library */ 
void 

xdrstdio_create (xdrs , fp, x_op) 

XDR *xdrs ; 

FILE * fp; 

enum xdr_op x_op; 

The routine xdrstdio_create () initializes an XDR stream pointed to by xdrs. The XDR 
stream interfaces to the standard I/O library. Parameter fp is an open file, and x_op is an 
XDR direction. 


4.2. Memory Streams 

Memory streams allow the streaming of data into or out of a specified area of memory: 

#include <rpc/rpc.h> 
void 

xdrmem_create (xdrs, addr, len, x_op) 

XDR *xdrs ; 
char *addr; 
u_int len; 
enum xdr_op x_op ; 

The routine xdrmem_create () initializes an XDR stream in local memory. The memory is 
pointed to by parameter addr; parameter len is the length in bytes of the memory. The 
parameters xdrs and x_op are identical to the corresponding parameters of 
xdrstdio_create () . Currently, the UDP/IP implementation of RPC uses 
xdrmem_crsate () . Complete call or result messages are built in memory before calling the 
sendto () system routine. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 17 


4.3. Record (TCP/IP) Streams 

A record stream is an XDR stream built on top of a record marking standard that is built on top 
of the UNIX file or 4.2 BSD connection interface. 

#include <rpc/rpc.h> /* xdr streams are a part of the rpc library */ 

xdrrec_create (xdrs , sendsize, recvsize, iohandle, readproc, vrriteproc) 
XDR ‘xdrs ; 

u_int sendsize, recvsize; 
char ‘iohandle; 

int (*readproc) () , (‘vrriteproc) () ; 

The routine xdrrec.create () provides an XDR stream interface that allows for a bidirec- 
tional, arbitrarily long sequence of records. The contents of the records are meant to be data in 
XDR form. The stream’s primary use is for interfacing RPC to TCP connections. However, it 
can be used to stream data into or out of normal UNIX files. 

The parameter xdrs is similar to the corresponding parameter described above. The stream 
does its own data buffering similar to that of standard I/O. The parameters sendsize and 
recvsize determine the size in bytes of the output and input buffers, respectively; if their 
values are zero (0), then predetermined defaults are used. When a buffer needs to be filled or 
flushed, the routine readproc or writeproc is called, respectively. The usage and behavior 
of these routines are similar to the UNIX system calls read() and write () . However, the first 
parameter to each of these routines is the opaque parameter iohandle. The other two parame- 
ters (buf and nbytes) and the results (byte count) are identical to the system routines. If xxx 
is readproc or writeproc, then it has the following form: 

/* returns the actual number of bytes transferred. 

* -1 is an error 

V 

int 

xxx (iohandle, buf, len) 
char ‘iohandle; 
char *buf; 
int nbytes; 

The XDR stream provides means for delimiting records in the byte stream. The implementation 
details of delimiting records in a stream are discussed in appendix 1. The primitives that are 
specific to record streams are as follows: 

bool_t 

xdrrec_endof record (xdrs, flushnow) 

XDR ‘xdrs; 
bool_t flushnow; 


bool_t 

xdrrec_skiprecord (xdrs) 
XDR ‘xdrs; 


bool_t 

xdrrec_eof (xdrs) 

XDR ‘xdrs; 

The routine xdrrec_endo f record () causes the current outgoing data to be marked as a 
record. If the parameter flushnow is TRUE, then the stream’s writeproc () will be called; 
otherwise, writeproc () will be called when the output buffer has been filled. 


Sun Microsystems 


Release 2.0 


Page 18 


XDR Protocol Spec 


The routine xdrrec_skiprecord () causes an input stream’s position to be moved past the 
current record boundary and onto the beginning of the next record in the stream. 

If there is no more data in the stream’s input buffer, then the routine xdrrec_eof() returns 
TRUE. That is not to say that there is no more data in the underlying file descriptor. 

5. XDR Stream Implementation 

This section provides the abstract data types needed to implement new instances of XDR 
streams. 

5.1. The XDR Object 

The following structure defines the interface to an XDR stream: 

enura xdr_op { XDR_ENCODE = O, XDR_DECODE = 1, XDR_FREE = 2 }; 

typedef struct { 

enum xdr_op x_op; 

struct xdr_ops { 

bool_t (*x_getlong) () ; 
bool_t (*x_putlong) () ; 
bool_t (*x_getbytes) () ; 
bool_t (*x_putbytes) () ; 
u_int (*x_getpostn) () ; 

bool_t (*x_setpostn) () ; 
caddr_t (*x_inline) () ; 

VOID (*x_destroy) () ; 

> *x_ops ; 

caddr_t x_public; 

caddr_t x_private; 

caddr_t x_base; 

int x_handy; 

> XDR; 

The x_op field is the current operation being performed on the stream. This field is important 
to the XDR primitives, but should not affect a stream’s implementation. That is, a stream’s 
implementation should not depend on this value. The fields x_private, x_base, and 
x_handy are private to the particular stream’s implementation. The field x_public is for the 
XDR client and should never be used by the XDR stream implementations or the XDR primi- 
tives. 

Macros for accessing operations x_getpostn () , x_setpostn () , and x_destroy() were 
defined in Section 3.6. The operation x_inline() takes two parameters: an XDR *, and an 
unsigned integer, which is a byte count. The routine returns a pointer to a piece of the stream’s 
internal buffer. The caller can then use the buffer segment for any purpose. From the stream’s 
point of view, the bytes in the buffer segment have been consumed or put. The routine may 
return NULL if it cannot return a buffer segment of the requested size. (The x_inline routine 
is for cycle squeezers. Use of the resulting buffer is not data-portable. Users are encouraged not 
to use this feature.) 

The operations x_getbytes () and x_putbytes () blindly get and put sequences of bytes 
from or to the underlying stream; they return TRUE if they are successful, and FALSE 


/* operation; fast additional param */ 

/* get a long from underlying stream */ 
/* put a long to ” */ 

/* get some bytes from " */ 

/* put some bytes to " */ 

/* returns byte offset from beginning */ 
/* repositions position in stream */ 

/* buf quick ptr to buffered data */ 

/* free privates of this xdr_stream */ 

/* users' data */ 

/* pointer to private data */ 

/* private used for position info */ 

/* extra private word */ 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 19 


otherwise. The routines have identical parameters (replace xxx): 
bool_t 

xxxbytes (xdrs , buf, bytecount) 

XDR *xdrs ; 
char *buf; 
u_int bytecount; 

The operations x_getlong() and x_putlong() receive and put long numbers from and to 
the data stream. It is the responsibility of these routines to translate the numbers between the 
machine representation and the (standard) external representation. The UNIX primitives 
htonl () and ntohl () can be helpful in accomplishing this. Section 6 defines the standard 
representation of numbers. The higher-level XDR implementation assumes that signed and 
unsigned long integers contain the same number of bits, and that nonnegative integers have the 
same bit representations as unsigned integers. The routines return TRUE if they succeed, and 
FALSE otherwise. They have identical parameters: 

bool_t 

xxxlong (xdrs, lp) 

XDR *xdrs ; 
long *lp; 

Implementors of new XDR streams must make an XDR structure (with new operation routines) 
available to clients, using some kind of create routine. 


Sun Microsystems 


Release 2.0 


Page 20 


XDR Protoco! Spec 


@o XDR Standard 

This section defines the external data representation standard. The standard is independent of 
languages, operating systems and hardware architectures. Once data is shared among machines, 
it should not matter that the data was produced on a Sun, but is consumed by a VAX (or vice 
versa). Similarly the choice of operating systems should have no influence on how the data is 
represented externally. For programming languages, data produced by a C program should be 
readable by a FORTRAN or Pascal program. 

The external data representation standard depends on the assumption that bytes (or octets) are 
portable. A byte is defined to be eight bits of data. It is assumed that hardware that encodes 
bytes onto various media will preserve the bytes’ meanings across hardware boundaries. For 
example, the Ethernet standard suggests that bytes be encoded “little endian” style. Both Sun 
and VAX hardware implementations adhere to the standard. 

The XDR standard also suggests a language used to describe data. The language is a bastardized 
C; it is a data description language, not a programming language. (The Xerox Courier Standard 
uses bastardized Mesa as its data description language.) 


6.1. Basic Block Size 

The representation of all items requires a multiple of four bytes (or 32 bits) of data. The bytes 
are numbered 0 through n— 1, where (n mod 4)=0. The bytes are read or written to some byte 
stream such that byte m always precedes byte m+1. 


0.2. Integer 

An XDR signed integer is a 32-bit datum that encodes an integer in the range 
[-2147483648, 2147483647] . The integer is represented in two’s complement notation. The 
most and least significant bytes are 0 and 3, respectively. The data description of integers is 
integer. 


6.3. Unsigned Integer 

An XDR unsigned integer is a 32-bit datum that encodes a nonnegative integer in the range 
[0,4294967295]. It is represented by an unsigned binary number whose most and least 
significant bytes are 0 and 3, respectively. The data description of unsigned integers is 
unsigned. 


0.4. Enumerations 

Enumerations have the same representation as integers. Enumerations are handy for describing 
subsets of the integers. The data description of enumerated data is as follows: 

typedef enum •( name = value }• type-name; 

For example the three colors red, yellow and blue could be described by an enumerated type: 


& 


.A 

3^5? Sun Microsyste 


ms 


Release 2.0 


XDR Protocol Spec 


Page 21 


typedef enum { RED = 2, YELLOW = 3, BLUE = 5 } colors ; 


6.5. Booleans 

Booleans are important enough and occur frequently enough to warrant their own explicit type 
in the standard. Boolean is an enumeration with the following form: 

typedef enum { FALSE = O, TRUE = 1 }■ boolean; 


6.6. Hyper Integer and Hyper Unsigned 

The standard also defines 64-bit (8-byte) numbers called hyper integer and hyper 
unsigned. Their representations are the obvious extensions of the integer and unsigned defined 
above. The most and least significant bytes are 0 and 7, respectively. 


6.7. Floating Point and Double Precision 

The standard defines the encoding for the floating point data types float (32 bits or 4 bytes) 
and double (64 bits or 8 bytes). The encoding used is the IEEE standard for normalized single- 
and double-precision floating point numbers. See the IEEE floating point standard for more 
information. The standard encodes the following three fields, which describe the floating point 
number: 

S The sign of the number. Values 0 and 1 represent positive and negative, respectively. 

E The exponent of the number, base 2. Floats devote 8 bits to this field, while doubles devote 
11 bits. The exponents for float and double are biased by 127 and 1023, respectively. 

F The fractional part of the number’s mantissa, base 2. Floats devote 23 bits to this field, 
while doubles devote 52 bits. 

Therefore, the floating point number is described by: 

Just as the most and least significant bytes of a number are 0 and 3, the most and least 
significant bits of a single-precision floating point number are 0 and 31. The beginning bit (and 
most significant bit) offsets of 5, E, and F are 0, 1, and 9, respectively. 

Doubles have the analogous extensions. The beginning bit (and most significant bit) offsets of 5, 
E, and F are 0, 1, and 12, respectively. 

The IEEE specification should be consulted concerning the encoding for signed zero, signed 
infinity (overflow), and denormalized numbers (underflow). Under IEEE specifications, the 
“NaN” (not a number) is system dependent and should not be used. 


Sun Microsystems 


Release 2.0 


Page 22 


XDR Protocol Spec 


@.8. Opaque Data 

At times fixed-sized uninterpreted data needs to be passed among machines. This data is called 
opaque and is described as: 

typedef opaque type-name [n] ; 
opaque name [n] ; 

where n is the (static) number of bytes necessary to contain the opaque data. If n is not a multi- 
ple of four, then the n bytes are followed by enough (up to 3) zero-valued bytes to make the 
total byte count of the opaque object a multiple of four. 


6.9. Counted Byte Strings 

The standard defines a string of n (numbered 0 through n— 1) bytes to be the number n encoded 
as unsigned, and followed by the n bytes of the string. If n is not a multiple of four, then the 
n bytes are followed by enough (up to 3) zero-valued bytes to make the total byte count a multi- 
ple of four. The data description of strings is as follows: 

typedef string type-name<N> ; 
typedef string type-name<>; 
string name<N>; 
string name<>; 

Note that the data description language uses angle brackets (< and >) to denote anything that 
is varying-length (as opposed to square brackets to denote fixed-length sequences of data). 

The constant N denotes an upper bound of the number of bytes that a string may contain. If N 
is not specified, it is assumed to be 2 s2 — 1, the maximum length. The constant N would normally 
be found in a protocol specification. For example, a filing protocol may state that a file name 
can be no longer than 255 bytes, such as: 

string f ilename<255> ; 

The XDR specification does not say what the individual bytes of a string represent; this impor- 
tant information is left to higher-level specifications. A reasonable default is to assume that the 
bytes encode ASCII characters. 


6.10. Fixed Arrays 

The data description for fixed-size arrays of homogeneous elements is as follows: 

typedef elementtype type-name [n] ; 
elementtype name [n] ; 

Fixed-size arrays of elements numbered 0 through n— 1 are encoded by individually encoding the 
elements of the array in their natural order, 0 through n— 1. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 23 


(3.11. Counted Arrays 

Counted arrays provide the ability to encode varyiable-length arrays of homogeneous elements. 
The array is encoded as: the element count n (an unsigned integer), followed by the encoding of 
each of the array’s elements, starting with element 0 and progressing through element n— 1. The 
data description for counted arrays is similar to that of counted strings: 

typedef elementtype type-name<N> ; 
typedef elementtype type-name<> ; 
elementtype name<N>; 
elementtype name<>; 

Again, the constant N specifies the maximum acceptable element count of an array; if N is not 
specified, it is assumed to be 2 32 — 1. 


6.12. Structures 

The data description for structures is very similar to that of standard C: 

typedef struct { 

component -type component -name; 

} type -name; 

The components of the structure are encoded in the order of their declaration in the structure. 


6.13. Discriminated Unions 

A discriminated union is a type composed of a discriminant followed by a type selected from a 
set of prearranged types according to the value of the discriminant. The type of the discrim- 
inant is always an enumeration. The component types are called “arms” of the union. The 
discriminated union is encoded as its discriminant followed by the encoding of the implied arm. 
The data description for discriminated unions is as follows: 

typedef union switch (discriminant-type) { 
discriminant-value: arm-type; 

default: de fault -arm- type; 

} type -name; 

The default arm is optional. If it is not specified, then a valid encoding of the union cannot take 
on unspecified discriminant values. Most specifications neither need nor use default arms. 


6.14. Missing Specifications 

The standard lacks representations for bit fields and bitmaps, since the standard is based on 
bytes. This is not to say that no specification should be attempted. 


4^ Sun Microsystems 


Release 2.0 


Page 24 


XDR Protocol Spec 


6.15. Library Primitive / XDR Standard Cross Reference 

The following table describes the association between the C library primitives discussed in Sec- 
tion 3, and the standard data types defined in this section: 


C Primitive 

XDR Type 

Sections 

xdr_int 

xdrjong 

xdr_short 

integer 

3.1, 6.2 

xdr_u_int 

xdr_u_long 

xdr_u_short 

unsigned 

3.1, 6.3 

- 

hyper integer 
hyper unsigned 

6.6 

xdr_float 

float 

3.2, 6.7 

xdr_double 

double 

3.2, 6.7 

xdr_enum 

enum_t 

3.3, 6.4 

xdr_bool 

bool_t 

3.3, 6.5 

xdr_string 

xdr_bytes 

string 

3.5.1, 6.9 
3.5.2 

xdr_array 

(varying arrays) 

3.5.3, 6.11 


(fixed arrays) 

3.5.5, 6.10 

xdr_opaque 

opaque 

3.5.4, 6.8 

xdr_union 

union 

3.5.6, 6.13 

xdr_reference 

- 

3.5.7 

- 

struct 

6.6 


# 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 25 


7. Advanced Topics 

This section describes techniques for passing data structures that are not covered in the preced- 
ing sections. Such structures include linked lists (of arbitrary lengths). Unlike the simpler exam- 
ples covered in the earlier sections, the following examples are written using both the XDR C 
library routines and the XDR data description language. Section 6 describes the XDR data 
definition language used below. 


7.1. Linked Lists 

The last example in Section 2 presented a C data structure and its associated XDR routines for a 
person’s gross assets and liabilities. The example is duplicated below: 

struct gnumbers { 

long g_assets; 
long g_liabilities ; 

>; 

bool_t 

xdr_gnumbers (xdrs , gp) 

XDR *xdrs ; 

struct gnumbers *gp; 

{ 

if (xdr_long(xdrs, & (gp->g_assets) ) ) 

return (xdr_long (xdrs , & (gp->g_liabilities) ) ) ; 
return (FALSE) ; 

> 

Now assume that we wish to implement a linked list of such information. A data structure could 
be constructed as follows: 

typedef struct gnnode { 

struct gnumbers gn_numbers ; 
struct gnnode *nxt; 

>; 

typedef struct gnnode *gnumbers_list ; 

The head of the linked list can be thought of as the data object; that is, the head is not merely a 
convenient shorthand for a structure. Similarly the nxt field is used to indicate whether or not 
the object has terminated. Unfortunately, if the object continues, the nxt field is also the 
address of where it continues. The link addresses carry no useful information when the object is 
serialized. 

The XDR data description of this linked list is described by the recursive type declaration of 
gnumbersjist: 

struct gnumbers { 

unsigned g_assets; 
unsigned g_liabilities ; 

>; 


Sun Microsystems 


Release 2.0 


Page 26 


XDR Protocol Spec 


typedef union switch (boolean) { 
case TRUE : struct { 

struct gnumbers current_element; 
gnumbers_l ist rest_of_list; 

>; 

case FALSE: struct {}; 

)■ gnumbers_list ; 

In this description, the boolean indicates whether there is more data following it. If the boolean 
is FALSE, then it is the last data field of the structure. If it is TRUE, then it is followed by a 
gnumbers structure and (recursively) by a gnumbers_list (the rest of the object). Note that 
the C declaration has no boolean explicitly declared in it (though the nxt field implicitly carries 
the information), while the XDR data description has no pointer explicitly declared in it. 

Hints for writing a set of XDR routines to successfully (de)serialize a linked list of entries can be 
taken from the XDR description of the pointer-less data. The set consists of the mutually recur- 
sive routines xdr_gnumbers_list, xdr_wrap_list, and xdr_gnnode. 


bool_t 

xdr_gnnode (xdrs, gp) 

XDR *xdrs; 
struct gnnode *gp; 

return (xdr_gnumbers (xdrs, & (gp->gn_numbers) ) 
xdr_gnumbers_l ist (xdrs, &(gp->nxt)) ) 


> 


&& 


bool_t 

xdr_wrap_l ist (xdrs, glp) 

XDR ‘xdrs; 
gnumbers_list *glp; 

{ 

return (xdr_reference (xdrs , 
xdr_gnnode) ) ; 


> 


glp, sizeof (struct gnnode). 


struct xdr_discrim choices [2] = { 

/* called if another node needs (de) serializing */ 

{ TRUE, xdr_wrap_list }, 

/* called when there are no more nodes to be (de) serialized */ 
•C FALSE, xdr_void > 

> 

bool_t 

xdr_gnumbers_l ist (xdrs, glp) 

XDR *xdrs ; 
gnumbers_list *glp; 

< 

bool_t more_data; 

more_data = (*glp != (gnumbers_l ist) NULL) ; 

return (xdr_union (xdrs , &more_data, glp, choices, NULL); 


The entry routine is xdr_gnumbers_list () ; its job is to translate between the boolean value 
more_data and the list pointer values. If there is no more data, the xdr_union() primitive 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 27 


calls xdr_void and the recursion is terminated. Otherwise, xdr_union () calls 
xdr_wrap_list () , whose job is to dereference the list pointers. The xdr_gnnode () routine 
actually (de)serializes data of the current node of the linked list, and recursively calls 
xdr_gnumbers_list () to handle the remainder of the list. 

You should convince yourself that these routines function correctly in all three directions 
(XDR_ENCODE, XDR_DECODE and XDR_FREE) for linked lists of any length (including 
zero). Note that the boolean more_data is always initialized, but in the XDR_DECODE case it 
is overwritten by an externally generated value. Also note that the value of the bool_t is lost 
in the stack. The essence of the value is reflected in the list’s pointers. 

The unfortunate side effect of (de)serializing a list with these routines is that the C stack grows 
linearly with respect to the number of nodes in the list. This is due to the recursion. The rou- 
tines are also hard to code (and understand) due to the number and nature of primitives involved 
(such as xdr_reference, xdr_union, and xdr_void). 

The following routine collapses the recursive routines. It also has other optimizations that are 
discussed below. 


bool_t 

xdr_gnumbers_list (xdrs, glp) 
XDR *xdrs ; 
gnumbers_list *glp; 

bool_t more_data; 


> 


while (TRUE) { 

more_data = (*glp != (gnumbers_l 1st) NULL) ; 
if (! xdr_bool (xdrs, &more_data) ) 
return (FALSE) ; 
if (! more_data) 

return (TRUE) ; /* we are done */ 

if (! xdr_reference (xdrs, glp, sizeof (struct gnnode) , 
xdr_gnumbers) ) 

return (FALSE) ; 
glp = & ( (*glp) ->nxt) ; 

> 


The claim is that this one routine is easier to code and understand than the three recursive rou- 
tines above. (It is also buggy, as discussed below.) The parameter glp is treated as the address 
of the pointer to the head of the remainder of the list to be (de)serialized. Thus, glp is set to 
the address of the current node’s nxt field at the end of the while loop. The discriminated 
union is implemented in-line; the variable more_data has the same use in this routine as in the 
routines above. Its value is recomputed and re-(de)serialized each iteration of the loop. Since 
*glp is a pointer to a node, the pointer is dereferenced using xdr_re ference () . Note that 
the third parameter is truly the size of a node (data values plus nxt pointer), while 
xdr.gnumbers () only (de)serializes the data values. We can get away with this tricky optimi- 
zation only because the nxt data comes after all legitimate external data. 


The routine is buggy in the XDR_FREE case. The bug is that xdr_re ference () will free the 
node *glp. Upon return the assignment glp = & ( (*glp) ->nxt) cannot be guaranteed to 
work since *glp is no longer a legitimate node. The following is a rewrite that works in all 
cases. The hard part is to avoid dereferencing a pointer which has not been initialized or which 
has been freed. 


Sun Microsystems 


Release 2.0 


Page 28 


XDR Protocol Spec 


bool_t 

xdr_gnumbers_l 1st (xdrs , glp) 

XDR ‘xdrs ; 
gnumbers_list *glp; 

{ 

bool_t more_data; 
bool_t freeing; 

gnumbers_list ‘next; /* the next value of glp */ 


> 


freeing = (xdrs->x_op == XDR_FREE) ; 
while (TRUE) { 

more_data = (*glp != (gnumbers_l 1st) NULL) ; 
if (! xdr_bool (xdrs , &more_data) ) 
return (FALSE) ; 
if (! more_data) 

return (TRUE) ; /* we are done */ 

if (freeing) 

next = & ( (*glp) ->nxt) ; 

if (! xdr_reference (xdrs, glp, sizeof (struct gnnode) , 
xdr_gnumbers) ) 

return (FALSE) ; 

glp = (freeing) ? next : & ( (*glp) ->nxt) ; 

> 


Note that this is the first example in this document that actually inspects the direction of the 
operation (xdrs->x_op). The claim is that the correct iterative implementation is still easier 
to understand or code than the recursive implementation. It is certainly more efficient with 
respect to C stack requirements. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 29 


Appendix A. The Record Marking Standard 

A record is composed of one or more record fragments. A record fragment is a four-byte header 
followed by 0 to 2 S1 — 1 bytes of fragment data. The bytes encode an unsigned binary number; as 
with XDR integers, the byte order is from highest to lowest. The number encodes two values — 
a boolean that indicates whether the fragment is the last fragment of the record (bit value 1 
implies the fragment is the last fragment), and a 31-bit unsigned binary value which is the length 
in bytes of the fragment’s data. The boolean value is the high-order bit of the header; the length 
is the 31 low-order bits. 

(Note that this record specification is not in XDR standard form and cannot be implemented 
using XDR primitives!) 


Sun Microsystems 


Release 2.0 


Page 30 


XDR Protocol Spec 


Appendix B* Synopsis of XDR Routines 


xdr_array() 

xdr_array (xdrs , arrp, sizep, maxsize, elsize, elproc) 

XDR *xdrs; 
char * * arrp ; 

u_int *sizep, maxsize, elsize; 
xdrproc_t elproc; 

A filter primitive that translates between arrays and their corresponding external representa- 
tions. The parameter arrp is the address of the pointer to the array, while sizep is the 
address of the element count of the array; this element count cannot exceed maxsize. The 
parameter elsize is the sizeof () each of the array’s elements, and elproc is an XDR filter 
that translates between the array elements’ C form, and their external representation. This rou- 
tine returns one if it succeeds, zero otherwise. 

xdr_bool() 

xdr_bool (xdrs , bp) 

XDR ‘xdrs; 
bool_t *bp; 

A filter primitive that translates between booleans (C integers) and their external representa- 
tions. When encoding data, this filter produces values of either one or zero. This routine 
returns one if it succeeds, zero otherwise. 

xdr_bytes() 

xdr_bytes (xdrs, sp, sizep, maxsize) 

XDR *xdrs ; 
char * * sp ; 

u_int *sizep, maxsize; 

A filter primitive that translates between counted byte strings and their external representations. 
The parameter sp is the address of the string pointer. The length of the string is located at 
address sizep; strings cannot be longer than maxsize. This routine returns one if it succeeds, 
zero otherwise. 

xdr_destroy() 

void 

xdr_destroy (xdrs) 

XDR *xdrs ; 

A macro that invokes the destroy routine associated with the XDR stream, xdrs. Destruction 
usually involves freeing private data structures associated with the stream. Using xdrs after 
invoking xdr_destroy () is undefined. 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 31 


xdr_double() 

xdr_double (xdrs, dp) 

XDR *xdrs ; 
double *dp; 

A filter primitive that translates between C double precision numbers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_enum() 

xdr_enum(xdrs, ep) 

XDR *xdrs ; 
enunv_t *ep; 

A filter primitive that translates between C enums (actually integers) and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_float() 

xdr_f loat (xdrs, fp) 

XDR *xdrs ; 
float *fp; 

A filter primitive that translates between C floats and their external representations. This 
routine returns one if it succeeds, zero otherwise. 

xdr_getpos() 

u_int 

xdr_getpos (xdrs) 

XDR *xdrs; 

A macro that invokes the get-position routine associated with the XDR stream, xdrs. The rou- 
tine returns an unsigned integer, which indicates the position of the XDR byte stream. A desir- 
able feature of XDR streams is that simple arithmetic works with this number, although the 
XDR stream instances need not guarantee this. 

xdr_inline() 

long 4 

xdr_inline (xdrs , len) 

XDR *xdrs; 
int len; 

A macro that invokes the in-line routine associated with the XDR stream, xdrs. The routine 
returns a pointer to a contiguous piece of the stream’s buffer; len is the byte length of the 
desired buffer. Note that the pointer is cast to long *. Warning: xdr_inline() may return 
0 (NULL) if it cannot allocate a contiguous piece of a buffer. Therefore the behavior may vary 
among stream instances; it exists for the sake of efficiency. 


Sun Microsystems 


Release 2.0 


Page 32 


XDR Protocol Spec 


xdr_int() 

xdr_lnt (xdrs, ip) 

XDR *xdr sp- 
irit * ip ; 

A filter primitive that translates between C integers and their external representations. This 
routine returns one if it succeeds, zero otherwise. 

xdr_long() 

xdr_long (xdrs , lp) 

XDR *xdrs; 
long *lp; 

A filter primitive that translates between C long integers and their external representations. 
This routine returns one if it succeeds, zero otherwise. 

xdr_opaque() 

xdr_opaque (xdrs , cp, cnt) 

XDR *xdrs; 
char * cp ; 
u_int cnt ; 

A filter primitive that translates between fixed size opaque data and its external representation. 
The parameter cp is the address of the opaque object, and cnt is its size in bytes. This routine 
returns one if it succeeds, zero otherwise. 

xdr_reference() 

xdr_re terence (xdrs , pp, size, proc) 

XDR ‘xdrs; 
char * *pp ; 
u_int size; 
xdrproc_t proc; 

A primitive that provides pointer chasing within structures. The parameter pp is the address of 
the pointer; size is the sizeof () the structure that *pp points to; and proc is an XDR pro- 
cedure that filters the structure between its C form and its external representation. This routine 
returns one if it succeeds, zero otherwise. 

xdr_setpos() 

xdr_setpos (xdrs, pos) 

XDR *xdrs; 
u_int pos; 

A macro that invokes the set position routine associated with the XDR stream xdrs. The 
parameter pos is a position value obtained from xdr_getpos () . This routine returns one if 
the XDR stream could be repositioned, and zero otherwise. Warning: it is difficult to reposition 
some types of XDR streams, so this routine may fail with one type of stream and succeed with 
another. 


Sun Microsyste 


ms 


Release 2.0 


XDR Protocol Spec 


Page 33 


xdr_ghort() 

xdr_short (xdrs, sp) 

XDR *xdrs ; 
short *sp; 

A filter primitive that translates between C short integers and their external representations. 
This routine returns one if it succeeds, zero otherwise. 

xdr_jstring() 

xdr_strlng (xdrs, sp, maxsize) 

XDR *xdrs ; 
char * * sp ; 
u_int maxsize; 

A filter primitive that translates between C strings and their corresponding external representa- 
tions. Strings cannot cannot be longer than maxsize. Note that sp is the address of the 
string’s pointer. This routine returns one if it succeeds, zero otherwise. 

xdr_u_int() 

xdr_u_int (xdrs , up) 

XDR *xdrs ; 
unsigned *up; 

A filter primitive that translates between C unsigned integers and their external representa- 
tions. This routine returns one if it succeeds, zero otherwise. 

xdr_u_long() 

xdr_u_long (xdrs, ulp) 

XDR ‘xdrs; 
unsigned long *ulp; 

A filter primitive that translates between C unsigned long integers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 

xdr_u_short() 

xdr_u_short (xdrs, usp) 

XDR *xdrs ; 

unsigned short *usp; 

A filter primitive that translates between C unsigned short integers and their external 
representations. This routine returns one if it succeeds, zero otherwise. 


Sun Microsystems 


Release 2.0 


Page 34 


XDR Protocol Spec 


xdr_union() 

xdr_union(xdrs, dscnp, unp, choices, dfault) 

XDR *xdrs ; 
int ‘dscnqp; 
char *unp; 

struct xdr_discrim ‘choices; 
xdrproc_t dfault; 

A filter primitive that translates between a discriminated C union and its corresponding exter- 
nal representation. The parameter dscmp is the address of the union’s discriminant, while unp 
in the address of the union. This routine returns one if it succeeds, zero otherwise. 

xdr_void() 

xdr_void () 

This routine always returns one. It may be passed to RPC routines that require a function 
parameter, where nothing is to be done. 

xdr_wrapstring() 

xdr_wrapstring(xdrs, sp) 

XDR *xdrs ; 
char “sp; 

A primitive that calls xdr_string (xdrs, sp, MAXUNSIGNED) ; where MAXUNSIGNED is the 
maximum value of an unsigned integer. This is handy because the RPC package passes only two 
parameters XDR routines, whereas xdr_string () , one of the most frequently used primitives, 
requires three parameters. This routine returns one if it succeeds, zero otherwise. 

xdrmem_create() 

void 

xdrmem_create (xdrs , addr, size, op) 

XDR *xdrs ; 
char ‘addr; 
u_int size; 
enum xdr_op op; 

This routine initializes the XDR stream object pointed to by xdrs. The stream’s data is written 
to, or read from, a chunk of memory at location addr whose length is no more than size bytes 
long. The op determines the direction of the XDR stream (either XD R E NCODE 
XDR_DECODE, or XDR_FREE). 

xdrrec_create() 

void 

xdrrec_create (xdrs , sendsize, recvsize, handle, readit, vriteit) 

XDR ‘xdrs; 

u_int sendsize, recvsize; 
char ‘handle; 

int (‘readit) () , (‘writeit) () ; 

This routine initializes the XDR stream object pointed to by xdrs. The stream’s data is written 


Sun Microsystems 


Release 2.0 


XDR Protocol Spec 


Page 35 


to a buffer of size sendsize; a value of zero indicates the system should use a suitable default. 
The stream’s data is read from a buffer of size recvsize; it too can be set to a suitable default 
by passing a zero value. When a stream’s output buffer is full, writeit () is called. Similarly, 
when a stream’s input buffer is empty, readitQ is called. The behavior of these two routines 
is similar to the UNIX system calls read and write, except that handle is passed to the 
former routines as the first parameter. Note that the XDR stream’s op field must be set by the 
caller. Warning: this XDR stream implements an intermediate record stream. Therefore there 
are additional bytes in the stream to provide record boundary information. 

xdrrec_endofrecord() 

xdrrec_endo f record (xdrs , sendnow) 

XDR *xdrs; 
int sendnow; 

This routine can be invoked only on streams created by xdrrec.create () . The data in the 
output buffer is marked as a completed record, and the output buffer is optionally written out if 
sendnow is non-zero. This routine returns one if it succeeds, zero otherwise. 

xdrrec_eof() 

xdrrec_eof (xdrs) 

XDR *xdrs ; 
int empty; 

This routine can be invoked only on streams created by xdrrec.create () . After consuming 
the rest of the current record in the stream, this routine returns one if the stream has no more 
input, zero otherwise. 

xdrrec_skiprecord() 

xdrrec_skiprecord (xdrs) 

XDR *xdrs ; 

This routine can be invoked only on streams created by xdrrec.create () . It tells the XDR 
implementation that the rest of the current record in the stream’s input buffer should be dis- 
carded. This routine returns one if it succeeds, zero otherwise. 

xdrstdio_create() 

void 

xdrstdio_create (xdrs, file, op) 

XDR *xdrs ; 

FILE * f ile; 
enum xdr_op op; 

This routine initializes the XDR stream object pointed to by xdrs. The XDR stream data is 
written to, or read from, the Standard I/O stream file. The parameter op determines the 
direction of the XDR stream (either XDR_ENCODE, XDR_DECODE, or XDR_FREE). Warn- 
ing: the destroy routine associated with such XDR streams calls fflushQ on the file stream, 
but never fclose () . 


Sun Microsystems 


Release 2.0 


Remote Procedure Call 
Protocol Specification 


Contents 


1. Introduction 1 

1.1. Terminology 1 

1.2. The RPC Model 1 

1.3. Transports and Semantics 2 

1.4. Binding and Rendezvous Independence 2 

1.5. Message Authentication 2 

2 . Requirements 3 

2.1. Remote Programs and Procedures 3 

2.2. Authentication 4 

2.3. Program Number Assignment 4 

3 . Other Uses and Abuses of the RPC Protocol 5 

3.1. Batching 5 

3.2. Broadcast RPC 5 

4 . The RPC Message Protocol 5 

A. Authentication Parameter Specification 9 

A.l. Null Authentication 9 

A. 2. UNIX Authentication 9 

B. Record Marking Standard 10 

C. Port Mapper Program Protocol 11 

C.l. The Port Mapper RPC Protocol 11 


Remote Procedure Call 
Protocol Specification 


1. Introduction 

This document specifies a message protocol used in implementing Sun’s Remote Procedure Call 
(RPC) package. The message protocol is specified with the eXternal Data Representation (XDR) 
language. 

This document assumes that the reader is familiar with both RPC and XDR. It does not 
attempt to justify RPC or its uses. Also, the casual user of RPC does not need to be familiar 
with the information in this document. 

1.1. Terminology 

The document discusses servers, services, programs, procedures, clients and versions. A server is 
a machine where some number of network services are implemented. A service is a collection of 
one or more remote programs. A remote program implements one or more remote procedures; 
the procedures, their parameters and results are documented in the specific program’s protocol 
specification (see Appendix C for an example). Network clients are pieces of software that ini- 
tiate remote procedure calls to services. A server may support more than one version of a 
remote program in order to be forward compatible with changing protocols. 

For example, a network file service may be composed of two programs. One program may deal 
with high level applications such as file system access control and locking. The other may deal 
with low-level file I/O, and have procedures like “read” and “write”. A client machine of the 
network file service would call the procedures associated with the two programs of the service on 
behalf of some user on the client machine. 


1.2. The RPC Model 

The remote procedure call model is similar to the local procedure call model. In the local case, 
the caller places arguments to a procedure in some well-specified location (such as a result regis- 
ter). It then transfers control to the procedure, and eventually gains back control. At that 
point, the results of the procedure are extracted from the well-specified location, and the caller 
continues execution. 

The remote procedure call is similar, except that one thread of control winds through two 
processes — one is the caller’s process, the other is a server’s process. That is, the caller process 
sends a call message to the server process and waits (blocks) for a reply message. The call mes- 
sage contains the procedure’s parameters, among other things. The reply message contains the 
procedure’s results, among other things. Once the reply message is received, the results of the 
procedure are extracted, and caller’s execution is resumed. 


Sun Microsystems 


Release 2.0 


Page 2 


RPC Protocol Spec 


On the server side, a process is dormant awaiting the arrival of a call message. When one 
arrives the server process extracts the procedure’s parameters, computes the results, sends a 
reply message, and then awaits the next call message. Note that in this model, only one of the 
two processes is active at any given time. That is, the RPC protocol does not explicitly support 
multi-threading of caller or server processes. 


1.3. Transports and Semantics 

The RPC protocol is independent of transport protocols. That is, RPC does not care how a mes- 
sage is passed from one process to another. The protocol only deals with the specification and 
interpretation of messages. 

Because of transport independence, the RPC protocol does not attach specific semantics to the 
remote procedures or their execution. Some semantics can be inferred from (but should be expli- 
citly specified by) the underlying transport protocol. For example, RPC message passing using 
UDP/IP is unreliable. Thus, if the caller retransmits call messages after short time-outs, the 
only thing he can infer from no reply message is that the remote procedure was executed zero or 
more times (and from a reply message, one or more times). On the other hand, RPC message 
passing using TCP/IP is reliable. No reply message means that the remote procedure was exe- 
cuted at most once, whereas a reply message means that the remote procedure was exactly once. 
(Note: At Sun, RPC is currently implemented on top of TCP/IP and UDP/IP transports.) 


1.4. Binding and Rendezvous Independence 

The act of binding a client to a service is NOT part of the remote procedure call specification. 
This important and necessary function is left up to some higher level software. (The software 
may use RPC itself; see Appendix C.) 

Implementors should think of the RPC protocol as the jump-subroutine instruction (“JSR”) of a 
network; the loader (binder) makes JSR useful, and the loader itself uses JSR to accomplish its 
task. Likewise, the network makes RPC useful, using RPC to accomplish this task. 


1.5. Message Authentication 

The RPC protocol provides the fields necessary for a client to identify himself to a service and 
vice versa. Security and access control mechanisms can be built on top of the message authenti- 
cation. 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


Page 3 


2* Requirements 

The RPC protocol must provide for the following: 

1. Unique specification of a procedure to be called. 

2. Provisions for matching response messages to request messages. 

3. Provisions for authenticating the caller to service and vice versa. 

Besides these requirements, features that detect the following are worth supporting because of 
protocol roll-over errors, implementation bugs, user error, and network administration: 

1. RPC protocol mismatches. 

2. Remote program protocol version mismatches. 

3. Protocol errors (like mis-specification of a procedure’s parameters). 

4. Reasons why remote authentication failed. 

5. Any other reasons why the desired procedure was not called. 


2.1. Remote Programs and Procedures 

The RPC call message has three unsigned fields: remote program number, remote program ver- 
sion number, and remote procedure number. The three fields uniquely identify the procedure to 
be called. Program numbers are administered by some central authority (like Sun). Once an 
implementor has a program number, he can implement his remote program; the first implemen- 
tation would most likely have the version number of 1. Because most new protocols evolve into 
better, stable and mature protocols, a version field of the call message identifies which version of 
the protocol the caller is using. Version numbers make speaking old and new protocols through 
the same server process possible. 

The procedure number identifies the procedure to be called. These numbers are documented in 
the specific program’s protocol specification. For example, a file service’s protocol specification 
may state that its procedure number 5 is read and procedure number 12 is write. 

Just as remote program protocols may change over several versions, the actual RPC message 
protocol could also change. Therefore, the call message also has the RPC version number in it; 
this field must be two (2). 

The reply message to a request message has enough information to distinguish the following error 
conditions: 

1) The remote implementation of RPC does speak protocol version 2. The lowest and highest 
supported RPC version numbers are returned. 

2) The remote program is not available on the remote system. 

3) The remote program does not support the requested version number. The lowest and 
highest supported remote program version numbers are returned. 

4) The requested procedure number does not exist (this is usually a caller side protocol or pro- 
gramming error). 

5) The parameters to the remote procedure appear to be garbage from the server’s point of 
view. (Again, this is caused by a disagreement about the protocol between client and ser- 
vice.) 


Sun Microsystems 


Release 2.0 


Page 4 


RPC Protocol Spec 


2.2. Authentication 


Provisions for authentication of caller to service and vice versa are provided as a wart on the 
side of the RPC protocol. The call message has two authentication fields, the credentials and 
verifier. The reply message has one authentication field, the response verifier. The RPC proto- 
col specification defines all three fields to be the following opaque type: 

enum auth_ flavor •{ 

AUTH_NULL = 0, 

AUTH_UNIX = 1. 

AUTH_SHORT = 2 

/* and more to be defined */ 

in- 


struct 


>; 


opaque_auth •( 

union switch (enum auth_flavor) •{ 

default: string auth_body<400> ; 

}; 


In simple English, any opaque_auth structure is an auth_flavor enumeration followed by a 
counted string, whose bytes are opaque to the RPC protocol implementation. 

The interpretation and semantics of the data contained within the authentication fields is 
specified by individual, independent authentication protocol specifications. Appendix A defines 
three authentication protocols. 


If authentication parameters were rejected, the response message contains information stating 
why they were rejected. 


2.3. Program Number Assignment 

Program numbers are given out in groups of 0x20000000 (536870912) according to the following 
chart: 


O 

20000000 

40000000 

60000000 

80000000 

aOOOOOOO 

cOOOOOOO 

eOOOOOOO 


lfffffff 

3fffffff 

5fffffff 

7fffffff 

9fffffff 

bfffffff 

dfffffff 

ffffffff 


defined by Sun 

defined by user 

transient 

reserved 

reserved 

reserved 

reserved 

reserved 


The first group is a range of numbers administered by Sun Microsystems, and should be identical 
for all Sun customers. The second range is for applications peculiar to a particular customer. 
This range is intended primarily for debugging new programs. When a customer develops an 
application that might be of general interest, that application should be given an assigned 
number in the first range. The third group is for applications that generate program numbers 
dynamically. The final groups are reservered for future use, and should not be used. 

The exact registration process for Sun defined numbers is yet to be established. 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


3. Other Uses and Abuses of the RPC Protocol 

The intended use of this protocol is for calling remote procedures. That is, each call message is 
matched with a response message. However, the protocol itself is a message passing protocol 
with which other (non-RPC) protocols can be implemented. Sun currently uses (abuses) the 
RPC message protocol for the following two (non-RPC) protocols: batching (or pipelining) and 
broadcast RPC. These two protocols are discussed (but not defined) below. 


3.1. Batching 

Batching allows a client to send an arbitrarily large sequence of call messages to a server; batch- 
ing uses reliable bytes stream protocols (like TCP/IP) for their transport. In the case of batch- 
ing, the client never waits for a reply from the server and the server does not send replies to 
batch requests. A sequence of batch calls is usually terminated by a legitimate RPC in order to 
flush the pipeline (with positive acknowledgement). 

3.2. Broadcast RPC 

In broadcast RPC based protocols, the client sends an a broadcast packet to the network and 
waits for numerous replies. Broadcast RPC uses unreliable, packet based protocols (like 
UDP/IP) as their transports. Servers that support broadcast protocols only respond when the 
request is successfully processed, and are silent in the face of errors. 


4. The RPC Message Protocol 

This section defines the RPC message protocol in the XDR data description language. The mes- 
sage is defined in a top down style. Note: This is an XDR specification, not C code. 

enum ms g_ type •{ 

CALL = 0, 

REPLY = 1 

}; 


/* 

* A reply to a call message can take on two forms: 

* the message was either accepted or rejected. 

V 

enum reply_stat •{ 

MSG_ACCEPTED = 0, 

MSG_DENIED = 1 

}; 


Sun Microsystems 


Release 2.0 


Page 6 


RPC Protocol Spec 


/* 

4 Given that a call message was accepted, the following is the status of 
4 an attempt to call a remote procedure. 


V 

enum accept_stat •( 


SUCCESS = 0, 
PROG_UNAVAIL = 1, 
PROG_MI SMATCH = 2, 
PROC_UNAVAIL = 3, 
GARBAGE _ARGS = 4 


/* remote procedure was successfully executed */ 

/* remote machine exports the program number */ 

/* remote machine can't support version number */ 
/* remote program doesn't know about procedure */ 
/* remote procedure can't figure out parameters */ 


>; 


/* 

* Reasons why a call message was rejected: 

V 

enum reject_stat •{ 

RPC_MISMATCH = 0, /* RPC version number was not two (2) */ 
AUTH_ERROR =1 /* caller not authenticated on remote machine */ 

>; 


/* 

4 Why authentication failed: 

V 

enum auth_stat { 

AUTH_BADCRED = 1, 
AUTH_REJECTEDCRED = 
AU TH_ BAD VE R F = 3, 
AUTH_RE JECTEDVERF = 
AUTH_TOOWEAK = 5, 


/* 

* The RPC message : 

4 All messages start with a transaction identifier, xid, followed by 
4 a two-armed discriminated union. The union's discriminant is a msg_type 
4 which switches to one of the two types of the message. The xid of a 
4 REPLY message always matches that of the initiating CALL message. 

4 NB: The xid field is only used for clients matching reply messages with 
4 call messages; the service side cannot treat this id as any type of 
4 sequence number . 

V 

struct rpc_msg •{ 

unsigned xid; 

union switch (enum msg_type) •( 

CALL: struct call_body; 

REPLY: struct reply_body; 

>; 

>; 


/* bogus credentials (seal broken) */ 
2, /* client should begin new session 4 / 

/* bogus verifier (seal broken) */ 

4, / 4 verifier expired or was replayed */ 

/* rejected due to security reasons */ 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


Page 7 


/* 

* Body of an RPC request call: 

* In version 2 of the RPC protocol specification, rpcvers must be equal to 2. 

* The fields prog, vers, and proc specify the remote program, its version, 

* and the procedure within the remote program to be called. These fields are 

* followed by two authentication parameters, cred (authentication credentials) 

* and verf (authentication verifier) . The authentication parameters are 

* followed * by the parameters to the remote procedure; these parameters are 

* specified by the specific program protocol. 

V 

struct call_body { 

unsigned rpcvers; /* must be equal to two (2) */ 

unsigned prog; 

unsigned vers; 

unsigned proc; 

struct opaque_auth cred; 

struct opaque_auth verf; 

/* procedure specific parameters start here */ 

>; 

/* 

* Body of a reply to an RPC request. 

* The call message was either accepted or rejected. 

V 

struct reply_body -( 

union switch (enum reply_stat) { 

MSG^ACCEPTED : struct accepted_reply ; 

MSG_DENIED: struct re jected_reply; 

}; 

>; 


Sun Microsystems 


Release 2.0 


Page 8 


RPC Protocol Spec 


/* 

* Reply to an RPC request that was accepted by the server. 

* Note: there could be an error even though the request was accepted. 

* The first field is an authentication verifier which the server generates 

* in order to validate itself to the caller. It is followed by a union 

* whose discriminant is an enum accept_stat. The SUCCESS arm of the union is 

* protocol specific. The PROG_UNAVAIL, PROC_UNAVAIL, and GARBAGE_ARGS arms 

* of the union are void. The PROG_MI SMATCH arm specifies the lowest and 

* highest version numbers of the remote program that are supported by the 
4 server . 


V 

struct accepted_reply { 

struct opaque_auth verf; 

union switch (enum accept_stat) { 

SUCCESS: struct { 

/* 

* procedure-specific results start here 

V 

}; 

PROG_MI SMATCH: struct { 
unsigned low; 
unsigned high; 

>; 

default: struct ■( 

/* 


* void. Cases include PROG_UNAVAIL, 

* PROC_UNAVAI L , and GARBAGE^ARGS . 

V 


/* 


* Reply to an RPC request that was rejected by the server. 

* The request can be rejected because of two reasons - either the server is 
4 not running a compatible version of the RPC protocol (RPC_MI SMATCH) , or 

* the server refused to authenticate the caller (AUTH_ERROR) . In the case of 

* an RPC version mismatch, the server returns the lowest and highest supported 

* RPC version numbers. In the case of refused authentication, the failure 

* status is returned. 

V 

struct re jected_reply { 

union switch (enum reject_stat) { 

RPC_MI SMATCH: struct { 
unsigned low; 
unsigned high; 

>; 

AUTH_ERROR: enum auth_stat; 

>; 

>; 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


Page 9 


Appendix A. Authentication Parameter Specification 

As previously stated, authentication parameters are opaque, but open-ended to the rest of the 
RPC protocol. This section defines some “flavors” of authentication which have been imple- 
mented at (and supported by) Sun. 


A.l. Null Authentication 

Often calls must be made where the caller does not know who he is and the server does not care 
who the caller is. In this case, the auth_flavor value (the discriminant of the opaque_auth’s 
union) of the RPC message’s credentials, verifier, and response verifier is AUTH_NULL (0). The 
bytes of the auth_body string are undefined. It is recommended that the string length be zero. 


A.2. UNIX Authentication 

The caller of a remote procedure may wish to identify himself as he is identified on a UNEXf sys- 
tem. The value of the credential’s discriminant of an RPC call message is AUTHLUNIX (1). 
The bytes of the credential’s string encode the the following (XDR) structure: 

struct auth_unix { 
unsigned 
string 
unsigned 
unsigned 
unsigned 

>-• 

The stamp is an arbitrary id which the caller machine may generate. The machinename is the 
name of the caller’s machine (like “krypton”). The uid is the caller’s effective user id. The 
gid is the callers effective group id. The gids is a counted array of groups which contain the 
caller as a member. The verifier accompanying the credentials should be of AUTH_NULL 
(defined above). 

The value of the discriminate of the response verifier received in the reply message from the 
server may be AUTH_NULL or AUTH_SHORT (2). In the case of AUTH_SHORT, the bytes of 
the response verifier’s string encode an auth_opaque structure. This new auth_opaque 
structure may now be passed to the server instead of the original AUTH_UNIX flavor creden- 
tials. The server keeps a cache which maps short hand auth_opaque structures (passed back via 
a AUTH_SHORT style response verifier) to the original credentials of the caller. The 
caller can save network bandwidth and server cpu cycles by using the new credentials. 

The server may flush the short hand auth_opaque structure at any time. If this happens, the 
remote procedure call message will be rejected due to an authentication error. The reason for 
the failure will be AUTH_REJECTEDCRED. At this point, the caller may wish to try the origi- 
nal AUTH_UNIX style of credentials. 


stamp ; 

machinename<255> ; 

uid; 

gid; 

gids<10> ; 


t UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


Page 10 


RPC Protocol Spec 


Appendix B. Record Marking Standard 

When RPC messages are passed on top of a byte stream protocol (like TCP/IP), it is necessary, 
or at least desirable, to delimit one message from another in order to detect and possibly recover 
from user protocol errors. This is called record marking (RM). Sun uses this RM/TCP/IP tran- 
sport for passing RPC messages on TCP streams. One RPC message fits into one RM record. 

A record is composed of one or more record fragments. A record fragment is a four-byte header 
followed by 0 to 2 s1 —! bytes of fragment data. The bytes encode an unsigned binary number; 
as with XDR integers, the byte order is from highest to lowest. The number encodes two values 
— a boolean which indicates whether the fragment is the last fragment of the record (bit value 1 
implies the fragment is the last fragment) and a 31-bit unsigned binary value which is the length 
in bytes of the fragment’s data. The boolean value is the highest-order bit of the header; the 
length is the 31 low-order bits. (Note that this record specification is not in XDR standard 
form!) 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


Page 11 


Appendix C. Port Mapper Program Protocol 

The port mapper program maps RPC program and version numbers to UDP/IP or TCP/IP port 
numbers. This program makes dynamic binding of remote programs possible. 

This is desirable because the range of reserved port numbers is very small and the number of 
potential remote programs is very large. By running only the port mapper on a reserved port, 
the port numbers of other remote programs can be ascertained by querying the port mapper. 


C.l. The Port Mapper RPC Protocol 

The protocol is specified by the XDR description language. 

Port Mapper RPC Program Number : 100000 
Version Number: 1 
Supported Transports : 

UDP/IP on port 111 
RM/TCP/IP on port 111 


/* 

* Handy transport protocol numbers 

V 

#define IPPR0T0_TCP 6 /* protocol number used for rpc/rm/tcp/ip */ 

#define IPPR0T0_UDP 17 /* protocol number used for rpc/udp/ip */ 

/* Procedures */ 

/* 

* Convention: procedure zero of any protocol takes no parameters 

* and returns no results. 

V 

0. PMAPPROC_NULL () returns () 

/* 

* Procedure 1, setting a mapping: 

4 When a program first becomes available on a 

* machine, it registers itself with the port mapper program on the 

* same machine. The program passes its program number (prog) , 

* version number (vers) , transport protocol number (prot) , 

4 and the port (port) on which it awaits service request . The 
4 procedure returns success whose value is TRUE if the procedure 
4 successfully established the mapping and FALSE otherwise. The 
4 procedure will refuse to establish a mapping if one already exists 
4 for the tuple [prog, vers, prot] . 

V 

1. PMAPPROC_SET (prog, vers, prot, port) returns (success) 

unsigned prog; 
unsigned vers; 
unsigned prot; 
unsigned port; 
boolean success; 


Sun Microsystems 


Release 2.0 


Page 12 


RPC Protoco! Spec 


/* 

4 Procedure 2, Unsetting a mapping: 

* When a program becomes unavailable, it should unregister itself 

4 with the port mapper program on the same machine. The parameters 
4 and results have meanings identical to those of PMAPPROC_SET . 

V 

2. PMAPPROC_UNSET (prog, vers, dummyl, dummy2) returns (success) 

unsigned prog; 
unsigned vers; 

unsigned dummyl; /* this value is always ignored */ 
unsigned dummy2; /* this value is always ignored */ 
boolean success; 

/ 4 

* Procedure 3, looking-up a mapping: 

* Given a program number (prog) , version number (vers) and 

* transport protocol number (prot) , this procedure returns the port 

* number on which the program is awaiting call requests. A port 

* value of zeros means that the program has not been registered. 

*/ 

3. PMAPPROC_GETPORT (prog, vers, prot, dummy) returns (port) 

unsigned prog; 
unsigned vers; 
unsigned prot; 

unsigned dummy; /* this value is always ignored */ 
unsigned port; /* zero means the program is not registered */ 


/* 

* Procedure 4, dumping the mappings: 

* This procedure enumerates all entries in the port mapper's database. 

* The procedure takes no parameters and returns a ''list'' of 

* [program, version, prot, port] values. 

V 

4. PMAPPROC_DUMP () returns (maplist) 
struct maplist -( 

union switch (boolean) { 

FALSE: struct { /* void, end of list */ }; 

TRUE : struct { 

unsigned prog; 
unsigned vers; 
unsigned prot; 
unsigned port; 
struct maplist the_rest; 

>; 

}; 

> maplist; 


Sun Microsystems 


Release 2.0 


RPC Protocol Spec 


Page 13 


/* 

* Procedure 5, indirect call routine: 

4 The procedures allows a caller to call another remote procedure 

4 on the same machine without knowing the remote procedure's port 

* number. Its intended use is for supporting broadcasts to arbitrary 

4 remote programs via the well-known port mapper's port. The parameters 

* prog, vers, proc, and the bytes of args are the program number, 

4 version number, procedure number, and parameters the the remote 

* procedure . 

t 

4 NB: 

4 1. This procedure only sends a response if the procedure was 

4 successfully executed and is silent (No response) otherwise. 

4 2 . The port mapper communicates with the remote program via 

4 UDP/IP only. 

* 

4 The procedure returns the port number of the remote program and 

4 the bytes of results are the results of the remote procedure. 

V 

5. PMAPPROC_CALLIT (prog, vers, proc, args) returns (port, results) 
unsigned prog; 
unsigned vers; 
unsigned proc; 
string argsO; 
unsigned port; 
string resultsO; 


Sun Microsystems 


Release 2.0 


Network File System 
Protocol Specification 


Contents 


1. Introduction 1 

1.1. Remote Procedure Call 1 

1.2. External Data Representation 1 

1.3. Stateless Servers 2 

2. NFS Protocol Definition 3 

2.1. Version 2 3 

2.1.1. Server/Client Relationship 3 

2.1.2. Permission Issues 4 

2.1.3. RPC Information 4 

2.1.4. Sizes 5 

2.1.5. Basic Data Types 6 

2. 1.5.1. stat 6 

2. 1.5. 2. ftype 7 

2. 1.5. 3. fhandle 7 

2. 1.5. 4. timeval 8 

2. 1.5. 5. fattr 8 

2. 1.5. 6. sattr 9 

2. 1.5. 7. filename 9 

2. 1.5. 8. path 9 

2. 1.5.9. attrstat 10 

2.1.5.10. diropargs 10 

2.1.5.11. diropres 10 

2.1.6. Server Procedures 11 

2. 1.6.1. Do Nothing (Procedure 0, Version 2) 11 

2. 1.6. 2. Get File Attributes (Procedure 1, Version 2) 11 

2. 1.6. 3. Set File Attributes (Procedure 2, Version 2) 12 

2. 1.6. 4. Get Filesystem Root (Procedure 3, Version 2) 12 

2. 1.6. 5. Look Up File Name (Procedure 4, Version 2) 12 

2. 1.6. 6. Read From Symbolic Link (Procedure 5, Version 2) 12 

2. 1.6.7. Read From File (Procedure 6, Version 2) 13 

2. 1.6.8. Write to Cache (Procedure 7, Version 2) 13 

2. 1.6.9. Write to File (Procedure 8, Version 2) 13 

2.1.6.10. Create File (Procedure 9, Version 2) 14 

2.1.6.11. Remove File (Procedure 10, Version 2) 14 

2.1.6.12. Rename File (Procedure 11, Version 2) 14 

2.1.6.13. Create Link to File (Procedure 12, Version 2) 14 


2.1.6.14. Create Symbolic Link (Procedure 13, Version 2) 15 

2.1.6.15. Create Directory (Procedure 14, Version 2) 15 

2.1.6.16. Remove Directory (Procedure 15, Version 2) 15 

2.1.6.17. Read From Directory (Procedure 16, Version 2) 16 

2.1.6.18. Get Filesystem Attributes (Procedure 17, Version 2) 16 

3. Mount Protocol Definition jg 

3.1. Version 1 18 

3.1.1. RPC Information 18 

3.1.2. Sizes 18 

3.1.3. Basic Data Types ig 

3. 1.3.1. fhandle ig 

3. 1.3.2. fhstatus ig 

3. 1.3. 3. dirpath ig 

3. 1.3. 4. name ig 

3.1.4. Server Procedures 20 

3.1. 4.1. Do Nothing (Procedure 0, Version 1) 20 

3. 1.4. 2. Add Mount Entry (Procedure 1, Version 1) 20 

3. 1.4. 3. Return Mount Entries (Procedure 2, Version 1) 20 

3.1. 4.4. Remove Mount Entry (Procedure 3, Version 1) 21 

3. 1.4. 5. Remove All Mount Entries (Procedure 4, Version 1) 21 

3. 1.4. 6. Return Export List (Procedure 5, Version 1) 21 


Network File System 
Protocol Specification 


1. Introduction 

The Sun Network Filesystem (NFS) protocol provides transparent remote access to shared 
filesystems over local area networks. The NFS protocol is designed to be machine, operating sys- 
tem, network architecture, and transport protocol independent. This independence is achieved 
through the use of Remote Procedure Call (RPC) primitives built on top of an external Data 
Representation (XDR). 

The supporting mount protocol allows the server to hand out remote access privileges to a res- 
tricted set of clients. Thus, it allows clients to attach a remote directory tree at any point on 
some local filesystem. 


1.1. Remote Procedure Call 

Sun’s remote procedure call specification, described in the RPC Programming Guide, provides a 
clean, procedure-oriented interface to remote services. Each server supplies a program that is a 
set of procedures. The combination of host address, program number, and procedure number 
specifies one remote service procedure. 

RPC is a high-level protocol built on top of low-level transport protocols. It does not depend on 
services provided by specific protocols, so it can be used easily with any underlying transport 
protocol. Currently the only supported transport protocol is UDP/IP. 

The RPC protocol includes a slot for authentication parameters on every call. The contents of 
the authentication parameters are determined by the “flavor” (type) of authentication used by 
the server and client. A server may support several different flavors of authentication at once: 
AUTH_NONE passes no authentication information (this is called null authentication); 
AUTH_UNIX passes the UNIXf uid, gid, and groups with each call. 

Servers have been known to change over time, and so can the protocol that they use. So RPC 
provides a version number with each RPC request. Thus, one server can service requests for 
several different versions of the protocol at the same time. 


1.2. External Data Representation 

Sun’s external data representation specification, described in the XDR Protocol Specification, 
provides a common way of representing a set of data types over a network. This takes care of 
problems such as different byte ordering on different communicating machines. It also defines 

f UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


Page 2 


NFS Protocol Spec 


the size of each data type so that machines with different structure alignment algorithms can 
share a common format over the network. 

In this document we use the XDR data definition language to specify the parameters and results 
of each RPC service procedure that a NFS server provides. The XDR data definition language 
reads a lot like C, although a few new constructs have been added. The notation 

string name [SIZE]; 
string data<DSIZE> ; 

defines name, which is a fixed size block of SIZE bytes, and data, which is a variable size block 
of up to DSIZE bytes. This same notation is used to indicate fixed length arrays, and arrays 
with a variable number of elements up to some maximum. 

The discriminated union definition 

union switch (enum status) •{ 

NFS_OK : 

struct ■( 

filename 
filename 
integer 

> 

NFSJERROR: 
struct { 

errstat 
integer 

> 

default : 

struct {} 

> 

means the first thing over the network is an enumeration type called status; if its value is 
NFS_OK, the next thing on the network will be the structure containing filel, file2, and 
count. If the value of status is neither NFS_OK nor NFS_ERROR, then there is no more data 
to look at. 


filel; 
file2; 
count ; 


error ; 
errno ; 


1.3. Stateless Servers 

The NFS protocol is stateless. That is, a server does not need to maintain state about any of its 
clients in order to function correctly. Stateless servers have a distinct advantage over stateful 
servers in the event of a crash. With stateless servers, a client need only retry a request until 
the server responds; it does not even need to know that the server has crashed. The client of a 
stateful server, on the other hand, needs to detect a server crash and rebuild the server’s state 
when it comes back up. 

This may not sound like an important issue, but it affects the protocol in some strange ways. We 
feel that it is worth a bit of extra complexity in the protocol to be able to write very simple 
servers that don’t need fancy crash recovery. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 3 


2o NFS Protocol Definition 

The NFS protocol is designed to be operating system independent, but let’s face it, it was 
designed in a UNIX environment. As such, it has some features which are very UNEXish. When 
in doubt about how something should work, a quick look at how it is done on UNIX will probably 
put you on the right track. 

The protocol definition is given as a set of procedures with arguments and results defined using 
XDR. A brief description of the function of each procedure should provide enough information 
to allow implementation on most machines. There is a different section provided for each sup- 
ported version of the protocol. Most of the procedures, and their parameters and results, are 
self-explanatory. A few do not fit into the normal UNIX mold, however. 

The LOOKUP procedure looks up one component of a pathname at a time. It is not obvious at 
first why it does not just take the whole pathname, traipse down the directories, and return a file 
handle when it is done. There are two good reasons not to do this. First, pathnames need 
separators between the directory components, and different operating systems use different 
separators. We could define a Network Standard Pathname Representation, but then every 
pathname would have to be parsed and converted at each end. Second, if pathnames were 
passed, the server would have to keep track of the mounted filesystems for all of its clients, so 
that it could break the pathname at the right point and pass the remainder on to the correct 
server. 

Another procedure which might seem strange to UNIX people is the READDIR procedure. What 
READDIR does is provide a network standard format for representing directories. The same 
argument as above could have been used to justify a READDIR procedure that returns only one 
directory entry per call. The problem is efficiency. Directories can contain many entries, and a 
remote call to return each would just be too slow. 


2.1. Version 2 

The released version of the NFS protocol is actually the second. Even in the second version, 
there are various obsolete procedures and parameters, which will probably be removed in later 
versions. 


2.1.1. Server/ Client Relationship 

The NFS protocol is designed to allow servers to be as simple and general as possible. Some- 
times the simplicity of the server can be a problem, if the client wants to implement complicated 
filesystem semantics. 

For example, UNIX allows removal of open files. A process can open a file and, while it is open, 
remove it from the directory. The file can be read and written as long as the process keeps it 
open, even though the file has no name in the filesystem. It is impossible for a stateless server to 
implement these semantics. The client can do some tricks like renaming the file on remove, and 
only removing it on close. We believe that the server provides enough functionality to imple- 
ment most filesystem semantics on the client. 

Every NFS client can also be a server, and remote and local mounted filesystems can be freely 
intermixed. This leads to some interesting problems when a client travels down the directory 
tree of a remote filesystem and reaches the mount point on the server for another remote 


Sun Microsystems 


Release 2.0 


Page 4 


NFS Protocol Spec 


filesystem. Allowing the server to following the second remote mount means it must do loop 
detection, server lookup, and user revalidation. Instead, we decided not to let clients cross a 
server’s mount point. When a client does a LOOKUP on a directory that the server has mounted 
a filesystem on, the client sees the underlying directory instead of the mounted directory. A 
client can do remote mounts that match the server’s mount points to maintain the server’s view. 


2.1.2. Permission Issues 

The NFS protocol, strictly speaking, does not define the permission checking used by servers. 
However, it is expected that a server will do normal UNIX permission checking using 
AUTH_UNIX style authentication as the basis of its protection mechanism. The server gets the 
client’s effective uid , effective gid and groups on each call, and uses them to check permission. 
There are various problems with this method that can been resolved in interesting ways. 

Using uid and gid implies that the client and server share the same uid list. Every server and 
client pair must have the same mapping from user to uid and from group to gid. Since every 
client can also be a server this tends to imply that the whole network shares the same uid/ gid 
space. This is acceptable for the short term, but a more workable network authentication 
method will be necessary before long. 

Another problem arises due to the semantics of open. UNIX does its permission checking at open 
time and then that the file is open, and has been checked on later read and write requests. With 
stateless servers this breaks down, because the server has no idea that the file is open and it 
must do permission checking on each read and write call. On a local filesystem, a user can open 
a file then change the permissions so that no one is allowed to touch it, but will still be able to 
write to the file because it is open. On a remote filesystem, by contrast, the write would fail. 
To get around this problem the server’s permission checking algorithm should allow the owner of 
a file to access it no matter what the permissions are set to. 

A similar problem has to do with paging in from a file over the network. The UNIX kernel 
checks for execute permission before opening a file for demand paging, then reads blocks from 
the open file. The file may not have read permission but after it is opened it doesn’t matter. An 
NFS server can’t tell the difference between a normal file read and a demand page-in read. To 
make this work the server allows reading of files if the uid given in the call has execute or read 
permission on the file. 

In UNIX, the user ID zero has access to all files no matter what permission and ownership they 
have. This super-user permission is not allowed on the server since anyone who can become 
super-user on their workstation could gain access to all remote files. Instead, the server maps 
uid 0 to —2 before doing its access checking. This works as long as the NFS is not used to sup- 
ply root filesystems, where super-user access cannot be avoided. Eventually servers will have to 
allow some kind of limited super-user access. 


2.1.8. RPC Information 

Authentication 

The NFS service uses AUTH_UNIX style authentication except in the NULL procedure where 
AUTH_NONE is also allowed. 

Protocols 

NFS currently is supported on UDP/IP only. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 5 


Constants 

These are the RPC constants needed to call the NFS service. They are given in decimal. 

PROGRAM 100003 

VERSION 2 

Port Number 

The NFS protocol currently uses the UDP port number 2049. This is a bug in the protocol 
and will be changed very shortly. 


2.1.4. Sizes 

These are the sizes, given in decimal bytes, of various XDR structures used in the protocol. 
MAXDATA 8192 

The maximum number of bytes of data in a READ or WRITE request. 

MAXPATHLEN 1024 

The maximum number of bytes in a pathname argument. 

MAXNAMLEN 255 

The maximum number of bytes in a file name argument. 

COOKIESIZE 4 

The size in bytes of the opaque “cookie” passed by READDIR. 

FHSIZE 32 

The size in bytes of the opaque file handle. 


Sun Microsystems 


Release 2.0 


Page 6 


NFS Protocol Spec 


2.1.5. Basic Data Types 

The following XDR definitions are basic structures and types used in other structures later on. 


2. 1.5.1. itat 

typedef enum { 

NFS_OK = 0, 

NFSERR_PERM=1 , 

NFSERR_N0ENT=2, 

NFSERR_I0=5 , 

NFSERR_NXI0=6, 

NFSERR_ACCES=13 , 

NFSERR_EXIST=17, 

NFSERR_N0DEV=19 , 

NFSERR_N0TDIR=20 , 

NFSERR_ISDIR=21, 

NFSERR_FBIG=27, 

NFSERR_N0SPC=28, 

NF SERR_ROF S= 30 ( 

NFSERR_NAMET00L0NG=63 , 

NFSERR_N0TEMPTY=66 , 

NFSERR_DQU0T=69, 

NFSERR_STALE=70, 

NFSERR_WFLUSH=99 

> stat; 

The stat type is returned with every procedure’s results. A value of NFS_OK indicates that the 

call completed successfully and the results are valid. The other values indicate some kind of 

error occurred on the server side during the servicing of the procedure. The error values are 

derived from UNIX error numbers. 

NFSERR_PERM 

Not owner. The caller does not have correct ownership to perform the requested operation. 
NFSERR_NOENT 

No such file or directory. The file or directory specified does not exist. 

NFSERRJO 

I/O error. Some sort of hard error occurred when the operation was in progress. This could 
be a disk error, for example. 

NFSERR_NXIO 

No such device or address. 

NFSERR_ACCES 

Permission denied. The caller does not have the correct permission to perform the requested 
operation. 

NFSERR_EXIST 

File exists. The file specified already exists. 

NFSERR_NODEV 
No such device. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 7 


NFSERR_NOTDIR 

Not a directory. The caller specified a non-directory in a directory operation. 
NFSERRJSDIR 

Is a directory. The caller specified a directory in a non-directory operation. 

NFSERR_FBIG 

File too large. The operation caused a file to grow beyond the server’s limit. 
NFSERR_NOSPC 

No space left on device. The operation caused the server’s filesystem to reach its limit. 
NFSERR_ROFS 

Read-only filesystem. Write attempted on a read-only filesystem. 
NFSERR_NAMETOOLONG 

File name too long. The file name in an operation was too long. 

NFSERR_NOTEMPTY 

Directory not empty. Attempted to remove a directory that was not empty. 
NFSERR_DQUOT 

Disk quota exceeded. The client’s disk quota on the server has been exceeded. 
NFSERR_STALE 

The fhandle given in the arguments was invalid. That is, the file referred to by that file 
handle no longer exists, or access to it has been revoked. 

NFSERR_WFLUSH 

The server’s write cache used in the WRITECACHE call got flushed to disk. 

2. 1.5. 2. ftype 

typedef enum { 

NFNON = 0 . 

NFREG = 1, 

NFDIR = 2, 

NFBLK = 3, 

NFCHR = 4, 

NFLNK = 5 

> ftype; 

The enumeration ftype gives the type of a file. The type NFNON indicates a non-file, NFREG is 
a regular file, NFDIR is a directory, NFBLK is a block-special device, NFCHR is a character- 
special device, and NFLNK is a symbolic link. 


2. 1.5.3. fhandle 

typedef opaque fhandle [FHS I ZE] ; 

The fhandle is the file handle that the server passes to the client. All file operations are done 
using file handles to refer to a file or directory. The file handle can contain whatever informa- 
tion the server needs to distinguish an individual file. 


Sun Microsystems 


Release 2.0 


Page 8 


NFS Protocol Spec 


2. 1.5. 4- timeval 

typed© f struct { 

unsigned seconds; 
unsigned useconds; 

} timeval; 

The timeval structure is the number of seconds and microseconds since midnight January 1, 
1970 Greenwich Mean Time. It is used to pass time and date information. 

2. 1.5. 5. fattr 

typedef struct { 

ftype type; 
unsigned mode; 
unsigned nlink; 
unsigned uid; 
unsigned gid; 
unsigned size; 
unsigned blocksize; 
unsigned rdev; 
unsigned blocks; 
unsigned fsid; 
unsigned fileid; 
timeval atime; 
t imeva 1 mt ime ; 
timeval ctime; 

} fattr; 

The fattr structure contains the attributes of a file; type is the type of the file; nlink is the 
number of hard links to the file, that is, the number of different names for the same file; uid is 
the user identification number of the owner of the file; gid is the group identification number of 
the group of the file; size is the size in bytes of the file; blocksize is the size in bytes of a 
block of the file; rdev is the device number of the file if it is type NFCHR or NFBLK; blocks is 
the number of blocks that the file takes up on disk; fsid is the file system identifier for the 
filesystem that contains the file; fileid is a number that uniquely identifies the file within its 
filesystem; atime is the time when the file was last accessed for either read or write; mtime is 
the time when the file data was last modified (written); and ctime is the time when the status of 
the file was last changed. Writing to the file also changes ctime if the size of the file changes. 

Mode is the access mode encoded as a set of bits. The bits are the same as the mode bits 
returned by the stat(2 ) system call in UNIX. Notice that the file type is specified both in the 
mode bits and in the file type. This is really a bug in the protocol and should be fixed in future 
versions. The descriptions given below specify the bit positions using octal numbers. 

0040000 This is a directory. The type field should be NFDIR. 

0020000 This is a character special file. The type field should be NFCHR. 

0060000 This is a block special file. The type field should be NFBLK. 

0100000 This is a regular file. The type field should be NFREG. 

0120000 This is a symbolic link file. The type field should be NFLNK. 

.A 

Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


0140000 This is a named socket. The type field should be NFNON. 
0004000 Set user id on execution. 

0002000 Set group id on execution. 

0001000 Save swapped text even after use. 

0000400 Read permission for owner. 

0000200 Write permission for owner. 

0000100 Execute and search permission for owner. 

0000040 Read permission for group. 

0000020 Write permission for group. 

0000010 Execute and search permission for group. 

0000004 Read permission for others. 

0000002 Write permission for others. 

0000001 Execute and search permission for others. 

2. 1.5.6. aattr 


typedef struct 


unsigned 

mode; 

unsigned 

uid; 

unsigned 

gid; 

unsigned 

size; 

timeval 

at ime ; 

timeval 

mt ime ; 


)■ sattr; 


The sattr structure contains the file attributes which can be set from the client. The fields are 
the same as for fattr above. A size of zero means the file should be truncated. A value of 
— 1 indicates a field that should be ignored. 

2.1.5. 7. filename 

typedef string filename<MAXNAMLEN>; 

The type filename is used for passing file names or pathname components. 

2. 1.5. 8. path 

typedef string path<MAXPATHLEN> ; 

The type path is a pathname. The server considers it as a string with no internal structure, but 
to the client it is the name of a node in a filesystem tree. 


Sun Microsystems 


Release 2.0 


Page 10 


NFS Protocol Spec 


2. 1.5. 9. attrstat 

typedef union switch (stat status) •( 

NES_OK : 

fattr attributes; 
default : 

struct {}■ 

)■ attrstat; 

The attrstat structure is a common procedure result. It contains a status and, if the call 
succeeded, it also contains the attributes of the file on which the operation was done. 


2.1.5.10. diroparga 

typedef struct { 

fhandle dir; 

filename name; 

}• diropargs; 

The diropargs structure is used in directory operations. The fhandle dir is the directory 
in which to find the file name. A directory operation is one in which the directory is affected. 


2.1.5.11. diroprea 

typedef union switch (stat status) { 

NFS_OK: 

struct •{ 

fhandle file; 
fattr attributes; 

> 

default : 

struct {} 

)• diropres; 

The results of a directory operation are returned in a diropres structure. If the call succeeded 
a new file handle file and the attributes associated with that file are returned along with 
the status. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 11 


2.1.6. Server Procedures 

The following sections define the RPC procedures supplied by a NFS server. The RPC pro- 
cedure number and version are given in the header, along with the name of the prodedure. The 
synopsis of prodecures has this format: 

<proc #> . <proc name> ( <arguments> ) returns ( <results> ) 

<argument declarations> 

<results declarations> 

In the first line, proc name is the name of the procedure, arguments is a list of the names of the 
arguments, and results is a list of the names of the results. The second and third lines give the 
XDR argument declarations and results declarations . Afterwards, there is a description of what 
the procedure is expected to do, and how its arguments and results are used. If there are bugs 
or problems with the procedure, they are listed at the end. 

All of the procedures in the NFS protocol are assumed to be synchronous. When a procedure 
returns to the client, the client can assume that the operation has completed and any data asso- 
ciated with the request is now on stable storage. For example, a client WRITE request may 
cause the server to update data blocks, filesystem information blocks (such as indirect blocks in 
UNIX), and file attribute information (size and modify times). When the WRITE returns to the 
client, it can assume that the write is safe, even in case of a server crash, and it can discard the 
data written. This is a very important part of the statelessness of the server. If the server 
waited to flush data from remote requests the client would have to save those requests so that it 
could resend them in case of a server crash. 


2. 1.6.1. Do Nothing (Procedure 0, Version 2) 

O. NESPROC_NULL ( ) returns ( ) 

This procedure does no work. It is made available in all RPC services to allow server response 
testing and timing. 


2. 1.6. 2. Get File Attributes (Procedure 1, Version 2) 

1. NFSPROC_GETATTR (file) returns (reply) 
fhandle file; 
attrstat reply; 

If reply, status is NFS_OK then reply . attributes contains the attributes for the file 
given by file. 

Bugs: the rdev field in the attributes structure is a UNIX device specifier. It should be removed 
or generalized. 


Sun Microsystems 


Release 2.0 


Page 12 


NFS Protocol Spec 


2. 1.6. 3. Set File Attributes (Procedure 2, Version 2) 

2 . NFSPROC_SETATTR (file, attributes) returns (reply) 
fhandle file; 
sattr attributes; 
attrstat reply; 

The attributes argument contains fields which are either —1 or are the new value for the 
attributes of file. If reply .status is NFS_OK then reply . attributes has the attributes 
of the file after the setattr operation has completed. 

Bugs: the use of —1 to indicate an unused field in attributes is wrong. 


2. 1.6. 4 . Get Filesystem Root (Procedure S, Version 2) 

3. NESPROC_ROOT ( ) returns ( ) 

Obsolete. This procedure is no longer used because finding the root file handle of a filesystem 
requires moving pathnames between client and server. To do this right we would have to define 
a network standard representation of pathnames. Instead, the function of looking up the root 
file handle is done by the MNTPROC_MNT procedure (see section entitled Mount Protocol 
Definition for details). 


2. 1.6. 5. Look Up File Name (Procedure Version 2) 

4. NFSPROC_LOOKUP (which) returns (reply) 
diropargs which; 
diropres reply; 

If reply. status is NFS_OK then reply, file and reply . attributes are the file handle 
and attributes for the file which. name in the directory given by which. dir. 

Bugs: there is some question as to what is the correct reply to a LOOKUP request when 
which. name is a mount point on the server for a remote mounted filesystem. Currently, we 
return the fhandle of the underlying directory. This is not completely acceptable, as the 
clients see a different view of the filesystem than the server does. 


2. 1.6. 6. Read From Symbolic Link (Procedure 5, Version 2) 

5. NF SPROC_RE ADL I NK (file) returns (reply) 
fhandle file; 

union switch (stat status) { 

NFS_OK : 

path data; 
default : 

struct {)• 

> reply; 

If status has the value NFS_OK then reply. data is the data in the symbolic link given by 
file. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


2.1.6. 7. Read From File (Procedure 6, Version 2) 

6. NFSPROC_READ (file, offset, count, totalcount) returns (reply) 

fhandle file; 

unsigned offset; 

unsigned count; 

unsigned totalcount; 

union switch (stat status) { 

NFS_OK : 

fattr attributes; 
string data<MAXDATA> ; 
default : 

struct {}■ 

> reply; 

Returns up to count bytes of data from the file given by file, starting at offset bytes from 
the beginning of the file. The first byte of the file is at offset zero. The file attributes after the 
read takes place are returned in attributes. 

Bugs; the argument totalcount is unused, and should be removed. 

2. 1.6.8. Write to Cache (Procedure 7, Version 2) 

7. NFSPROC_WRI TECACHE ( ) returns ( ) 

Obsolete. 

2. 1.6.9. Write to File (Procedure 8, Version 2) 

8. NFSPROC_WRITE ( file, beginof fset , of fset , totalcount , data) returns (reply) 

fhandle file; 
unsigned beginof fset; 
unsigned offset; 
unsigned totalcount; 
string data<MAXDATA> ; 
attrstat reply; 

Writes data beginning offset bytes from the beginning of file. The first byte of the file is 
at offset zero. If reply, status is NFS_OK then reply . attributes contains the attributes 
of the file after the write has completed. The write operation is atomic. Data from this WRITE 
will not be mixed with data from another client’s WRITE. 

Bugs: the arguments beginof fset and totalcount are ignored and should be removed. 


Sun Microsystems 


Release 2.0 


Page 14 


NFS Protocol Spec 


2.1.6.10. Create File (Procedure 9, Version 2) 

9. NF SPROC_CRE ATE (where, attributes) returns (dir) 
diropargs where; 
sattr attributes; 

diropres dir; 

The file where. name is created in the directory given by where. dir. The initial attributes of 
the new file are given by attributes. A reply. status of NFS_OK indicates that the file 
was created and reply, file and reply . attributes are its file handle and attributes. Any 
other reply .status means that the operation failed and no file was created. 

Bugs: this routine should pass an exclusive create flag meaning, create the file only if it is not 
already there. 


2.1.6.11. Remove File (Procedure 10, Version 2) 

10. NF SPR0C_REM0TE (which) returns (status) 
diropargs which; 
stat status; 

The file which. name is removed from the directory given by which, dir. A status of 
NFS_0K means the directory entry was removed. 


2.1.6.12. Rename File (Procedure 11, Version 2) 

11. NFSPROC_RENAME (from, to) returns (status) 
diropargs from; 
diropargs to; 
stat status; 

The existing file from. name in the directory given by from. dir is renamed to to. name in 
the directory given by to. dir. If status is NFS_0K the file was renamed. The RENAME 
operation is atomic on the server; it cannot be interrupted in the middle. 


2.1.6.13. Create Link to File (Procedure 12, Version 2) 

12. NF SPR0C_L INK (from, to) returns (status) 
fhandle from; 

diropargs to; 
stat status; 

Creates the file to. name in the directory given by to. dir, which is a hard link to the existing 
file given by from. If the return value of status is NFS_0K a link was created. Any other 
return value indicates an error and the link is not created. 

A hard link should have the property that changes to either of the linked files are reflected in 
both files. When a hard link is made to a file, the attributes for the file should have a value for 
nlink which is one greater than the value before the link. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 15 


2.1.6.14- Create Symbolic Link (Procedure IS, Version 2) 

13. NFSPROC_SYMLINK (from, to, attributes) returns (status) 

diropargs from; 
path to; 

sattr attributes; 

stat status; 

Creates the file from. name with ftype NFLNK in the directory given by from. dir. The new 
file contains the pathname to and has initial attributes given by attributes. If the return 
value of status is NFS_OK a link was created. Any other return value indicates an error and 
the link is not created. 

A symbolic link is a pointer to another file. The name given in to is not interpreted by the 
server, just stored in the newly created file. A READLINK operation returns the data to the 
client for interpretation. 

Bugs: on UNIX servers the attributes are never used, since symbolic links always have mode 
0777. 

2.1.6.15. Create Directory ( Procedure 14, Version 2) 

14. NFSPROC_MKDIR (where, attributes) returns (reply) 

diropargs where; 
sattr attributes; 

diropres reply; 

The new directory where. name is created in the directory given by where. dir. The initial 
attributes of the new directory are given by attributes. A reply .status of NFS_0K indi- 
cates that the new directory was created and reply, file and reply . attributes are its file 
handle and attributes. Any other reply .status means that the operation failed and no direc- 
tory was created. 


2.1.6.16. Remove Directory (Procedure 15, Version 2) 

15. NFSPROC_RMDIR (which) returns (status) 
diropargs which; 

stat status; 

The existing, empty directory which. name in the directory given by which. dir is removed. 
If status is NES_0K the directory was removed. 


Sun Microsystems 


Release 2.0 


Page 16 


NFS Protocol Spec 


2.1.6.17. Read From Directory (Procedure 16, Version 2) 

16. NFSPROC_READDIR (dir, cookie, count) returns (entries) 

fhandle dir; 

opaque cookie [COOKIESIZE] ; 

unsigned count; 

union switch (stat status) ■( 

NFS_OK: 

typedef union switch (boolean valid) •{ 

TRUE; 

struct { 

unsigned fileid; 

filename name; 

opaque cookie [COOKIESIZE] ; 

entry nextentry; 

> 

FALSE: 

struct {} 

> entry; 
boolean eof; 
default : 

} entries; 

Returns a variable number of directory entries, with a total size of up to count bytes, from the 
directory given by dir. Each entry contains a fileid which is a unique number to identify 
the file within a filesystem, the name of the file, and a cookie which is an opaque pointer to the 
next entry in the directory. The cookie is used in the next READDIR call to get more entries 
starting at a given point in the directory. The special cookie zero (all bits zero) can be used to 
get the entries starting at the beginning of the directory. The fileid field should be the same 
number as the fileid in the the attributes of the file (see the section entitled fattr under Basic 
Data Types). The eof flag has a value of TRUE if there are no more entries in the directory; 
valid is used to mark the end of the entries. If the returned value of status is NFS_OK then 
it is followed by a variable number of entries. 

2.1.6.18. Get Filesystem Attributes (Procedure 17, Version 2) 

17. NFSPROC_STATFS (file) returns (reply) 

fhandle file; 

union switch (stat status) { 

NF S_OK : 

struct { 

unsigned tsize; 

unsigned bsize; 

unsigned blocks; 

unsigned bfree; 

unsigned bavail; 

}• fsattr; 
default : 

struct -Q 

> reply; 

If reply .status is NFS_OK then reply, fsattr gives the attributes for the filesystem that 
contains file. The attribute fields contain the following values: 

Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 17 


tsize The optimum transfer size of the server in bytes. This is the number of bytes the 
server would like to have in the data part of READ and WRITE requests. 

bsize The block size in bytes of the filesystem. 

blocks The total number of bsize blocks on the filesystem. 

bfree The number of free bsize blocks on the filesystem. 

bavail The number of bsize blocks available to non-privileged users. 

Bugs: this call does not work well if a filesystem has variable size blocks. 


Sun Microsystems 


Release 2.0 


Page 18 


NFS Protocol Spec 


3. Mount Protocol Definition 

The mount protocol is separate from, but related to, the NFS protocol. It provides all of the 
operating system specific services to get the NFS off the ground — looking up path names, vali- 
dating user identity, and checking access permissions. Clients use the mount protocol to get the 
first file handle, which allows them entry into a remote filesystem. 

The mount protocol is kept separate from the NFS protocol to make it easy to plug in new 
access checking and validation methods without changing the NFS server protocol. 

Notice that the protocol definition implies stateful servers because the server maintains a list of 
client’s mount requests. The mount list information is not critical for the correct functioning of 
either the client or the server. It is intented for advisory use only, for example, to warn possible 
clients when a server is going down. 


3.1. Version 1 

Version one of the mount protocol communicates with the version two of the NFS protocol. The 
only connecting point is the fhandle structure, which is the same for both protocols. 


8.1.1. RPC Information 
Authentication 

The mount service uses AUTH_UNIX style authentication only. 

Protocols 

The mount service is currently supported on UDP/IP only. 

Constants 

These are the RPC constants needed to call the MOUNT service. They are given in 
decimal. 

PROGRAM 100005 
VERSION 1 

Port Number 

Consult the server’s portmapper, described in the RPC Protocol Specification, to find which 
port number the mount service is registered on. 


3.1.2. Sizes 

These are the sizes given in decimal bytes of various XDR structures used in the protocol. 
MNTPATHLEN 1024 

The maximum number of bytes in a pathname argument. 

MNTNAMLEN 255 

The maximum number of bytes in a name argument. 

FHSIZE 32 

The size in bytes of the opaque file handle. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 19 


3.1.8. Basic Data Types 


3. 1.3.1. f handle 

typedef opaque fhandle [FHSIZE] ; 

The fhandle is the file handle that the server passes to the client. All file operations are done 
using file handles to refer to a file or directory. The file handle can contain whatever informa- 
tion the server needs to distinguish an individual file. 

This is the same as the fhandle XDR definition in version 2 of the NFS protocol; see the sec- 
tion on fhandle under Basic Data Types. 


3.1.3. 2. fhstatus 

typedef union switch (unsigned status) •{ 

0 : 

fhandle directory; 
default : 

struct {} 

> 

If a status of zero is returned, the call completed successfully, and a file handle for the 
directory follows. A non-zero status indicates some sort of error. In this case the status is a 
UNIX error number. 


3. 1.3. 3. dir path 

typedef string dirpath<MNTPATHLEN> ; 

The type dirpath is a normal UNIX pathname of a directory. 


3. 1.3. 4- name 

typedef string name<MNTNAMLEN> ; 

The type name is an arbitrary string used for various names. 


Sun Microsystems 


Release 2.0 


Page 20 


NFS Protocol Spec 


3.1.4. Server Procedures 

The following sections define the RPC procedures supplied by a mount server. The RPC pro- 
cedure number and version are given in the header, along with the name of the procedure. The 
synopsis of procedures has this format: 

<proc #> . <proc name> ( <arguments> ) returns ( <results> ) 

<argument declarations> 

<results declarations> 

In the first line, proc name is the name of the procedure, arguments is a list of the names of the 
arguments, and results is a list of the names of the results. The second and third lines give the 
XDR argument declarations and results declarations . Afterwards, there is a description of what 
the procedure is expected to do, and how its arguments and results are used. If there are bugs 
or problems with the procedure, they are listed at the end. 

3. 1 . 4 .I. Do Nothing (Procedure 0, Version 1) 

0. MNTPROC_NULL ( ) returns ( ) 

This procedure does no work. It is made available in all RPC services to allow server response 
testing and timing. 


3. 1.4-2. Add Mount Entry (Procedure 1, Version 1) 

1. MNTPROC_MNT (directory) returns (reply) 
dirpath dirname; 
fhstatus reply; 

If reply .status is 0, reply . directory contains the file handle for the directory dirname. 
This file handle may be used in the NFS protocol. This procedure also adds a new entry to the 
mount list for this client mounting dirname. 


3. 1-4-3. Return Mount Entries (Procedure 2, Version 1) 

2. MNTPROC_DUMP ( ) returns (mountlist) 

union switch (boolean more_entries) •{ 

TRUE: 

struct ■( 

name hostname; 

dirpath directory; 
mountlist nextentry; 

> 

FALSE: 

struct {}■ 

} mountlist; 

Returns the list of remote mounted filesystems. The mountlist contains one entry for each 
hostname and directory pair. 


Sun Microsystems 


Release 2.0 


NFS Protocol Spec 


Page 21 


3.1. 4- 4- Remove Mount Entry (Procedure 3, Version 1) 

3. MNTPROC_UMNT (directory) returns ( ) 

dirpath directory; 

Removes the mount list entry for directory. 

3. 1.4- 5. Remove All Mount Entries (Procedure 4, Version 1) 

4. MNTPROC_UMNTALL ( ) returns ( ) 

Removes all of the mount list entries for this client. 


3. 1.4.6. Return Export List (Procedure 5, Version 1) 

5. MNTPROC_EXPORT ( ) returns (exportlist) 

union switch (boolean more_entries) { 

TRUE: 

struct { 

dirpath filesys; 

typedef union switch (boolean more_groups) ■{ 

TRUE: 

struct { 

name grname; 
groups nextgroup; 

> 

FALSE: 

struct {)■ 

> groups; 

mount list nextentry; 

> 

FALSE: 

struct {} 

}• exportlist; 

Returns in exportlist a variable number of export list entries. Each entry contains a filesys- 
tem name and a list of groups that are allowed to import it. The filesystem name is in 
exportlist. filesys, and the group name is in exportlist . groups . grname. 

Bugs: the exportlist should contain more information about the status of the filesystem, such as a 
read-only flag. 


Sun Microsystems 


Release 2.0 


Index 


A 

atime, 8 

attributes, 10, 12, 13, 14, 15, 15 
attrstat, 10 
AUTH_NONE, 1, 4 
AUTHJJNIX, 1, 4, 4, 18 

B 

beginofTset, 13 
blocks, 8 
blocksize, 8 

c 

cookie, 16 
COOKIESIZE, 5 
count, 2, 13, 16 
ctime, 8 

D 

data, 2, 13, 13 
dir, 10, 16 
directory, 19, 20, 21 
dirname, 20, 20 
diropargs, 10 
diropres, 10 
dirpath, 10 , 19 
DS1ZE, 2 

E 

entries, 16 

entry, 16 

eof, 16 

exportlist, 21 

exportlist.filesys, 21 

exportlist.groups.grname, 21 

F 

fattr, 8 

fhandle, 7 , 7, 18, 19 
FHSIZE, 5 , 18 
fhstatus, 19 

file, 10, 11, 12, 12, 13, 13, 16 
filel, 2 
file2, 2 

fileid, 8, 16, 16 


filename, 9 
from, 14 
from.dir, 14, 15 
from. name, 14, 15 
fsid, 8 
ftype, 7 

G 

gid, 8 

H 

hostname, 20 

L 

LOOKUP, 3, 4 

M 

MAXDATA, 5 
MAXNAMLEN, 5 
MAXPATHLEN, 5 
MNTNAMLEN, 18 
MNTPATHLEN, 18 
MNTPROC_DUMP, 20 
MNTPROC_EXPORT, 21 
MNTPROC_MNT, 20 , 12 
MNTPROCLNULL, 20 
MNTPROCJJMNT, 21 
MNTPRO C_UMNT ALL , 21 
Mode, 8 
mountlist, 20 
mtime, 8 

N 

name, 19 , 2, 10, 16, 19 
NFBLK, 7, 8 
NFCHR, 7, 8 
NFDIR, 7, 8 
NFLNK, 7, 8, 15 
NFNON, 7, 9 
NFREG, 7, 8 
NFS_ERROR, 2 
NFS_OK, 2, 6 
NFSERR_ACCES, 6 
NFSERR_DQUOT, 7 
NFSERR_EXIST, 0 


NFSERR_FBIG, 7 
NFSERR_IO, 0 
NFSERRJSDIR, 7 
NFSERR_NAMETOOLONG, 7 
NFSERR_NODEV, « 
NFSERR_NOENT, 6 
NFSERRJMOSPC, 7 
NFSERR_NOTDIR, 7 
NFSERRJMOTEMPTY, 7 
NFSERR_NXIO, ft 
NFSERR_PERM, fi 
NFSERR_ROFS, 7 
NFSERR_STALE, 7 
NFSERR_WFLUSH, 7 
NFSPROC_CREATE, 14 
NFSPROC_GETATTR, 11 
NFSPROC_LINK, 14 
NFSPROC_LOOKUP, 12 
NFSPROC_MKDIR, 15 
NFSPROC_NULL, 11 
NFSPROC_READ, 13 
NFSPROC_READDIR, 1© 
NFSPROC_READLINK, 12 
NFSPROC_REMOVE, 14 
NFSPROC_RENAME, 14 
NFSPROC_RMDIR, 15 
NFSPROC_ROOT, 12 
NFSPROC_SETATTR, 12 
NFSPROC_STATFS, 1ft 
NFSPROC_SYMLINK, 15 
NFSPROC_ WRITE, 13 
NFSPROC_WRITECACHE, 13 
nlink, 8, 14 
NULL, 4 

O 

offset, 13, 13 

P 

path, 0 

R 

rdev, 8, 11 

READDIR, 3, 5 

reply. attributes, 11, 13, 14, 15 

reply. data, 12 

reply. directory, 20 

reply. file, 14, 15 

reply .fsattr, 16 

reply .status, 11, 13, 14, 14, 15, 15, 16, 20 

s 

sattr, 9 
SIZE, 2, 8 


stat, 6 

status, 2, 10, 10, 12, 14, 14, 14, 15, 15, 16, 19 

T 

timeval, 8 
to, 15 

to.dir, 14, 14 
to. name, 14, 14 
totalcount, 13, 13 
type, 8 

u 

uid, 8 

V 

valid, 16 

w 

where.dir, 14, 15 
where. name, 14, 15 
which. dir, 12, 14, 15 
which. name, 12, 14, 15 
WRITECACHE, 7 


— n — 


Yellow Pages 
Protocol Specification 


Contents 


1. Introduction and Terminology 1 

1.1. RPC — Remote Procedure Call 1 

1.2. XDR — External Data Representation 2 

2. YP Data Base Servers 3 

2.1. Maps and Operations on Maps 3 

2.1.1. Map Structure 3 

2.1.2. YP Private Key Symbols 3 

2.1.3. Match Operation 3 

2.1.4. Map Entry Enumeration Operations 3 

2.1.5. Map Update 4 

2.2. Master and Slave YP Data Base Servers 4 

2.3. Map Propagation, and Consistency 4 

2.3.1. Functions to Aid in Map Propagation 5 

2.3.2. Map Transfer Mechanism 5 

2.4. Domains 5 

2.5. Non-features 6 

2.5.1. Map Update Within the YP 6 

2.5.2. Version Commitment Across Multiple Requests 6 

2.5.3. Guaranteed Global Consistency 6 

2.5.4. Access Control 6 

2.6. YP Data Base Server Protocol Definition 6 

2.6.1. RPC Constants 6 

2.6.2. Other Manifest Constants 7 

2.6.3. Remote Procedure Return Values 7 

2.6.4. Basic Data Structures 7 

2.6.5. YP Data Base Server Remote Procedures 10 

2. 6. 5.1. Do Nothing (Procedure 0, Version 1) 10 

2. 6. 5. 2. Do You Serve This Domain? (Procedure 1, Version 1) 10 

2. 6.5. 3. Answer Only If You Serve This Domain (Procedure 2, 

Version 1) 10 

2. 6. 5. 4. Return Value of a Key (Procedure 3, Version 1) 10 

2. 6.5. 5. Get First Key-Value Pair in Map (Procedure 4, Version 

!) 11 

2. 6.5. 6. Get Next Key-Value Pair in Map (Procedure 5, Version 

1) 11 

2. 6.5. 7. Return Map Parameters (Procedure 6, Version 1) 11 


2. 6.5. 8. Tell Peers About New Map (Procedure 7, Version 1) 11 

2. 6. 5.9. Get Latest Version of Map (Procedure 8, Version 1) 12 

2.6.5.10. Get New Map Version From Here (Procedure 9, 

Version 1) 22 

3. YP Binders 23 

3.1. Introduction 23 

3.2. YP Binder Protocol Definition 13 

3.2.1. RPC Constants 23 

3.2.2. Other Manifest Constants 24 

3.2.3. Basic Data Structures 24 

3.2.4. YP Binder Remote Procedures lg 

3.2.4. 1. Do Nothing (Procedure 0, Version 1) 26 

3. 2. 4. 2. Get Current Binding for a Domain (Procedure 1, 

Version 1) 26 

3. 2. 4.3. Set Domain Binding (Procedure 2, Version 1) 26 


Yellow Pages 
Protocol Specification 


1. Introduction and Terminology 

The Yellow Pages (YP), Sun’s distributed lookup service, is a network service providing read 
access to a replicated database. The lookup service is provided by a set of YP database servers, 
which communicate among themselves to keep their databases consistent. The client interface 
to this service uses the Remote Procedure Call (RPC) mechanism. 

Translating or mapping a name to its value is one of the most common operations performed in 
computer systems. Common examples are the translation of a variable name to a virtual 
memory address, the translation of a user name to a system ID or list of capabilities, and the 
translation of a network node name to an internet address. There are two fundamental read- 
only operations that can be performed on a map: matching and enumeration. Match means to 
look up a name (which we call a key) and return its current value. Enumerate means to return 
each key-value pair in turn. 

The YP supplies matching and enumeration operations in a network environment, in which high 
availability and reliability are required. It provides that availability and reliability by replicating 
both databases and database servers on multiple nodes within a single local net, and within the 
internet. The database is replicated, but not distributed: all changes are made at a single server 
and eventually propagate to the remaining servers without locking. The YP is appropriate for 
an environment in which changes to the mapping databases occur on the order of tens per day. 

The YP operates on an arbitrary number of map databases. Map names provide the lower of 
two levels of a naming hierarchy. Maps are themselves grouped into named sets, called 
domains. Domain names provides a second, higher level of naming. Map names must be unique 
within a domain, but may be duplicated in different domains. The YP client interface requires 
that both a map name and a domain name be supplied to perform match and enumeration 
operations. 

The YP achieves high availability by replication. One area not addressed by the protocol which 
has to be addressed by the implementors is global consistency among the replicated copies of the 
database. Every implementation should be designed so that at steady state a request yields the 
same result when it is made of any YP database server. Update and update-propagation 
mechanisms must be implemented to supply the required degree of consistency. 


1.1. RPC — Remote Procedure Call 

Sun’s Remote Procedure Call (RPC) mechanism defines a paradigm for interprocess communica- 
tion modeled on function calls. Clients call functions that optionally return values. All inputs 
and outputs to the functions are in the client’s address space. The function is executed by a 
server program. 


Sun Microsystems 


Release 2.0 


Page 2 


YP Protocol Spec 


Using RPC, clients address servers by a program number (this identifies the application level 
protocol that the server speaks), and a version number. Additionally, each server procedure has 
a procedure number assigned to it. 

In an internet environment, a client must also know the server’s host internet address, and the 
server’s rendezvous port. The server listens for service requests at ports that are associated with 
a particular transport protocol — TCP/IP and/or UDP/IP. 

The format of the data structures used as inputs to and outputs from the remotely-executed pro- 
cedures are typically defined by header files that are included when the client interface functions 
are compiled. Levels above the client interface package need not know any particulars of the 
RPC interface to the server. 


1.2. XDR — External Data Representation 

The Sun External Data Representation (XDR) specification establishes standard representations 
for basic data types (such as strings, signed and unsigned integers, and structures and unions) in 
a way that allows them to be transferred among machines with varying architectures. XDR pro- 
vides primitives to encode (that is, translate from the local host’s representation to the standard 
representation) and decode (translate from the standard representation to the local host’s 
representation) basic data types. Constructor primitives allow arbitrarily complex data types to 
be made from the basic types. 

The YP’s RPC input and output data structures are described using XDR’s data description 
language. In general, the data description language looks like the C language, with a few extra 
constructs. One such extra construct is the discriminated union. This is like a C language 
union, in that it can hold various objects, but differs from it in that a discriminant indicates 
which object it currently holds. The discriminant is the first thing across the wire. Consider a 
simple example: 

union switch (long int) { 

1 : 

string exmpl_name< 16 > 

O: 

unsigned int exmpl_error_code 
default : 

struct {} 

> 

The example should be interpreted as follows: the first object to be encoded/decoded (that is, 
the discriminant) is a long integer. If it has the value one, the next object is a string. If the 
discriminant has the value zero, the next object is an unsigned integer. If the discriminant takes 
any other value, don’t encode or decode any more data. 

A string data type in the XDR data definition language adds the ability to specify the maximum 
number of elements in an byte array or string of potentially variable size. For instance: 

string domain<YPMAXDOMAIN> ; 

states that the byte sequence domain may be less than or equal to YPMAXDOMAIN bytes long. 

An additional primitive data type is a boolean, which takes the value one to mean TRUE and 
zero to mean FALSE. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


Page 3 


2. YP Data Base Servers 


2.1. Maps and Operations on Maps 


2.1.1. Map Structure 

Maps are named sets of key-value pairs. The keys and their values are counted binary objects. 
The keys and their values may be ASCII information, but they need not be. The data compris- 
ing a map is determined by the client applications that are the final customers for the data, not 
by the YP. The YP has no syntactic nor semantic knowledge of the map contents. Neither does 
the YP determine or know any map’s name. Map names are managed by the YP’s clients. 
Conflict in the map namespace must be resolved by human administrators outside the YP sys- 
tem. 

Typical implementations for YP maps are files or DBMS systems. The design of the YP’s map 
database is an implementation detail, and is unspecified by the protocol. 


2.1.2. YP Private Key Symbols 

It is useful to be able to embed key-value pairs that may be used by the YP subsystem itself, or 
by human administrators or administration programs within all maps. Keys beginning with YP_ 
may be conventionally used to embed out-of-band information within a map, and should be con- 
sidered to be YP-private. The client interface to the YP’s enumeration functions should be 
implemented to filter out YP-private keys. Client programs should not see them; they won’t 
know what to do with them, and client parsers should not be forced to do the filtration. 

A unfiltered interface to the YP enumeration functions may also be supplied for programs that 
need to see YP-private keys. Alternatively, it could be assumed that any client that needs to see 
a YP-private key knows the name of that key. If that assumption is made, the YP match opera- 
tion is sufficient, and no unfiltered flavor of the YP enumeration operations needs to be supplied. 

The price paid for the ability to imbed administrative information within maps is that the key 
namespace is reduced. 


2.1.3. Match Operation 

The YP supports an exact match operation in the YPPROC_MATCH procedure. That is, if a 
match string and some key in the map are exactly the same, the value of the key is returned. 
No pattern matching, case conversion, or wildcarding is supported. 


2.1-4 ■ Map Entry Enumeration Operations 

The two operations which exist to enumerate the entries of a map are a “get first key-value 
pair” operation (the YPPROC_FIRST procedure), and a “get next key-value pair” operation (the 
YPPROC_NEXT procedure). If “get first” is called once, and then “get next” is called until the 
return value indicates that there are no more entries in the map, each entry in the map will be 


Sun Microsystems 


Release 2.0 


Page 4 


YP Protocol Spec 


seen exactly once. Further, if the same sequence of calls is made again on the same map at the 
same YP database server, the order in which the entries will be seen is the same. 

The actual ordering function is unspecified, and may not be assumed. It also may not be 
assumed that enumerating a map at a different YP database server will return the entries in the 
same order, whether that server represents the same implementation or not. 


2.1.5. Map Update 

The update of YP maps is an implementation detail which is outside the specification of the YP 
service. 


2.2. Master and Slave YP Data Base Servers 

The protocols assume that for each map there is one distinguished YP database server, called the 
map’s master. Map updates take place only on the master. An updated map should be 
transferred from the master to the rest of the YP database servers, which are slave servers for 
this map. 

It is possible for each map to have a different YP database server as its master, or for all maps to 
have the same master, or any other combination. The choice of how to set up map masters is 
one of implementation and administrative policy. 


2.3. Map Propagation, and Consistency 

Getting map updates from the master to the slaves is called map propagation. Neither technol- 
ogy nor algorithms for map propagation are specified by the protocol. Map propagation may be 
entirely manual: for instance, a person could copy the maps from the master to the slaves at a 
regular interval, or when a change is made on the master. This is unnecessarily labor intensive. 
There are hooks within the protocol for automatic convergence. The procedures designed for 
server-to-server communications are described in the next section. 

In order to escape from the idiosyncrasies of any particular implementation, all maps should be 
uniformly timestamped internally. An internal timestamp allows the map to be copied to or 
reconstructed at any number of nodes, without the time format, local clock time, or file creation 
or modification algorithms at that site having any effect on the map’s version. 

The timestamp should be created at the site where the map was created, or was last modified. 
The timestamp is out-of-band data, as far as the applications using the map are concerned, and 
should be associated with the YP-private key YP_LAST_MODIFIED. Its value should be an 
ASCII numeric sequence representing the time the map was created or last modified as the 
number of seconds since January 1, 1970 (GMT). The ASCII numeric sequence may be zero- 
padded to the left, up to a total length of ten characters. Each YP database server can read the 
YP_LAST_MODIFIED entry from each map it serves, and compare it with the version its peers 
have. 

The intent is for a slave to try to get the current copy from the master. If the master is 
unreachable, the subnet can still converge at the highest available order number. The slaves 
communicate among themselves to guarantee that all agree on the current version. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


Page 5 


2.3.1. Functions to Aid in Map Propagation 

Any YP database server can communicate with any other. Any server may call 
YPPROC_MATCH, YPPROC_FIRST, or YPPROC_NEXT in a second server, in which case the first 
server is a client of the second. The protocol also has four functions that exist to help servers 
converge on a single version of a map. 

YPPROC_GET is called by a master server in a peer slave server. It tells the slave server to get a 
new version of a map from the master. 

YPPROC_PUSH is called by an administrative program in a master. It tells the master to notice 
that a new version of the map exists, and tell the peer slaves to get the new version. 

YPPROC_PULL is called by an administrative program in a slave. It tells the slave to get a new 
version of a map. 

YPPROC_POLL can be called either by a server or by an administrative program in any server. 
It is called to find out what the server’s current map version is, and which server it thinks is the 
map’s master. 


2.3.2. Map Transfer Mechanism 

The way a map is transferred from one server to another is not specified by the protocol. One 
possibility is the manual process described above. Another might be that a YP database server 
could activate some other process that would exist only to do the map transfer. A third might 
be for a server to enumerate the more recent version of the map, by using the normal client map 
enumeration functions. 

If the enumeration method is used, it will take several functions to transfer the whole map, and 
the map version may change at the supplying site. A version change over the lifetime of the 
transfer can be detected by the consumer server if the consumer brackets the enumeration with 
calls to the YPPROC_POLL procedure in the supplier. 


2.4. Domains 

Domain provide a second level for naming within the YP subsystem. They are names for sets of 
maps, therefore create separate map name spaces. Domains provide an opportunity to break 
large organizations up into administerable chunks, and the ability to create parallel, non- 
interfering test and production environments. 

Ideally, the domain of interest to a client ought to be associated with the invoking user, but in 
practice it is useful for client machines to be in a default domain. Implementations of the YP 
client interface should supply some mechanism for telling processes the domain name they should 
use. This is needed not only because the concept of domain is a useless one as far as most pro- 
grams are concerned, but, more importantly, so that programs can be written that are insensi- 
tive to both location and the invoking user. 

Information logically associated with all domains (or to no domain) can be held in a domain that 
is really a meta-domain. This domain may have a well-known name, so that information within 
it can be accessed regardless of the machine’s default domain, or of the domain of the invoking 
user. 


Sun Microsystems 


Release 2.0 


Page 6 


YP Protocol Spec 


2.5* Non-features 

The following capabilities are not included in the current YP protocols: 


2.5.1. Map Update Within the YP 

All write (and delete) access to the YP’s map database is assumed to be outside of the YP sub- 
system. It is probable that write access to the map database will be included in later versions of 
the YP protocols. 


2.5.2. Version Commitment Across Multiple Requests 

The YP protocol was designed to keep the YP database server stateless with regard to its clients. 
Therefore, there is no facility for contracting with a server to preallocate any resource beyond 
that required to service any single request. In particular, there is no way to get a server to com- 
mit to use a single version of a map while trying to enumerate that map’s entries. 


2.5.8. Guaranteed Global Consistency 

There is no facility for locking maps during the update or propagation phases, therefore it is vir- 
tually guaranteed that the map database be globally inconsistent during those phases. The set of 
client applications for which the YP is an appropriate lookup service is one that (by definition) 
must be tolerant of transient inconsistencies. 


2.5.4 ■ Access Control 

The YP database servers make no attempt to restrict access to the map data by any means. All 
syntactically correct requests are serviced. 


2.6. YP Data Base Server Protocol Definition 

This section describes version 1 of the protocol. It is likely that changes will be made to succes- 
sive versions as the service matures. 


2.6.1. RPC Constants 

All numbers are in decimal. 

YPPROG 100004 

The YP database server protocol program number. 
YPVERS 1 

The current YP protocol version. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


Page 7 


2.6.2. Other Manifest Constants 

All numbers are in decimal. 

YPMAXRECORD 1024 

The total maximum size of key and value for any pair. The absolute sizes of the key and 

value may divide this maximum arbitrarily. 

YPMAXDOMA1N 64 

The maximum number of characters in a domain name. 

YPMAXMAP 64 

The maximum number of characters in a map name. 

YPMAXPEER 256 

The maximum number of characters in a YP server host name. 

2.6.3. Remote Procedure Return Values 

This section presents the return status values returned by several of the YP remote procedures. 

All numbers are in decimal. 

typedef enun { 

YP_TRUE = 1, 

YP_NOMORE = 2, 

YP_FALSE = 0, 

YP_N0MAP = -1, 

YP_N0D0M = -2, 

YP_N0KEY = -3, 

YP_BAD0P = -4, 

YP_BADDB = -5, 

YP_YPERR = -6, 

YP_BADARGS = -7 
> ypstat; 


2.6.4- Basic Data Structures 

This section defines the data structures used as inputs to and outputs from the YP remote pro- 
cedures. 

domainname 

typedef string domainname<YPMAXDOMAIN> 
mapname 

typedef string mapname <YPMAXMAP> 

peername 

typedef string peername<YPMAXPEER> 


/* General purpose success code. */ 
/* No more entries in map. */ 

/* General purpose failure code.*/ 
/* No such map in domain.*/ 

/* Domain not supported.*/ 

/* No such key in map.*/ 

/* Invalid operation.*/ 

/* Server database is bad . */ 

/* YP server error . */ 

/* Request arguments bad.*/ 


Sun Microsystems 


Release 2.0 


Page 8 


YP Protocol Spec 


keydat 

typedef string keydat<YPMAXRECORD> 


valdat 


typedef string valdat<YPMAXRECORD> 


ypmap_parms 

struct ypmap_parms •{ 
domainname 
map name 

unsigned long int ordernum 
peername 

> 

This contains parameters giving information about map mapname within domain domainname . 
The peername parameter is the name of the map’s master YP database server. If any of the 
three string pointers represent unknown (or unavailable) information, the parameters will be null 
strings. The ordernum parameter contains a binary value representing the value of the map’s 
YP_LAST_MODIEIED key. If the YP_LAST_MODIFIED value is unavailable, ordernum contains 
the value 0. 


yprequest 


struct yprequest { 

union switch (enum ypreqtype) { 
YPREQ_KEY : 

struct { 

domainname 

mapname 

keydat 

> 

YPREQ_NOKEY : 

struct { 

domainname 

mapname 

> 

YPRE Q_MAP_P ARMS : 

struct ypmap_parms 

default : 


> 


> 


o 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


Page 9 


ypresponse 

struct ypresponse { 

union switch (enun ypresptype) { 
YPRESP_VAL : 

struct ■{ 

ypstat 

valdat 

> 

YPRE SP_KE Y_VAL : 

struct •( 

ypstat 

valdat 

keydat 

> 

YPRE SP_MAP_P ARMS : 

struct ypmap_parms 

default : 


Sun Microsystems 


Release 2.0 


Page 10 


YP Protocol Spec 


2.6.5. YP Data Base Server Remote Procedures 

This section contains a specification for each function that can be called as a remote procedure. 
The input and output parameters are described using the XDR data definition language. When- 
ever the input parameter is a struct yprequest, the mapname and domainname parameters 
fully specify the map. 


2. 6. 5.1. Do Nothing (Procedure 0, Version 1) 

0. YPPROC_NULL ( ) returns ( ) 

This does no work. It is made available in all RPC services to allow server response testing and 
timing. 


2. 6. 5. 2. Do You Serve This Domain? (Procedure 1, Version 1) 

1. YPPROC_DOMAI N (domain) returns (servesp) 
domainname domain; 
boolean servesp; 

The server returns TRUE if it serves the passed domain, and FALSE otherwise. This function 
allows a potential client to ascertain whether or not a given server supports a named domain. 


2.6. 5. S. Answer Only If You Serve This Domain (Procedure 2, Version 1) 

2 . YPPROC_DOMAIN_NONACK (domain) returns (servesp) 
domainname domain; 
boolean servesp; 

The server returns TRUE if it serves the passed domain; otherwise it does not return. The 
intent of the function is that it be called in a broadcast environment, in which it is useful to res- 
trict the number of useless messages. If this function is called, the client interface implementa- 
tion must be written so as to regain control in the negative case, for instance by means of a 
timeout on the response. 

Sun’s current implementation currently does return in the FALSE case by forcing an RPC 
decode error. 


2.6.5. 4- Return Value of a Key (Procedure 3, Version 1) 

3. YPPROC_MATCH (req) returns (resp) 
struct yprequest req; 
struct ypresponse resp; 

The type of the req must be YPREQ_KEY. This returns the value associated with the key 
keydat. The type of the resp is YPRESP_VAL. If the ypstat parameter in the resp has the 
value YP_TRUE, the value data are returned in valdat. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


2.6.5. 5. Get First Key-Value Pair in Map (Procedure 4, Version 1) 

4. YPPROC_FIRST (req) returns (resp) 
struct yprequest req; 
struct ypresponse resp; 

The type of the req must be YPREQ_NOKEY. The reap is of type YPRESP_KEY_VAL. If the 
value of the ypstat is YP_TRUE, this returns the first key-value pair from the map named in 
the req to the keydat and valdat parameters. An empty map is indicated by ypstat con- 
taining the value YP_NOMORE. 


2.6.5. 6. Get Next Key-Value Pair in Map ( Procedure 5, Version 1) 

5 . YPPROC_NEXT (req) returns (resp) 
struct yprequest req; 
struct ypresponse resp; 

The type of the req must be YPREQ_KEY. The resp is type YPRESP_KEY_VAL. If the value 
of the ypstat is YP_TRUE, this returns the key-value pair following the key-value named in the 
req parameter to the keydat and valdat parameters within resp. If the passed key is the 
last key in the map, the value of ypstat is YP_NOMORE. 


2.6.5. 7. Return Map Parameters (Procedure 6, Version 1) 

6 . YPPROC_POLL (req) returns (resp) 
struct yprequest req; 
struct ypresponse resp; 

The type of the req must be YPREQ_NOKEY. The resp is of type YPREQ_MAP_PARMS. The 
YP server returns the order number (binary timestamp value) and master server name for the 
map. If the domain is not supported, the domainname is a null string. If the map is unknown, 
the mapname is a null string. If unknown, the ordernum parameter has the value zero. If unk- 
nown, the peername is a null string. 


2. 6.5.8. Tell Peers About New Map (Procedure 7, Version 1) 

7. YPPROC_P U SH (req) returns ( ) 
struct yprequest req; 

The type of the req must be YPREQ_NOKEY. The master server rechecks the named map to 
make sure that the map parameters are up-to-date. It then calls the YPPROC_GET procedure in 
each reachable peer. If the server is not the master of the named map, it takes no action. 


Sun Microsystems 


Release 2.0 


Page 12 


YP Protocol Spec 


2. 6. 5. 9. Get Latest Version of Map (Procedure 8, Version 1) 

8. YPPROC_PULL (req) returns ( ) 
struct yprequest req; 

The type of the req must be YPREQ_NOKEY. The slave server attempts to get a more recent 
version of the named map from a peer. The master, if reachable, is checked first. If the master’s 
version is not greater than the slave’s version, the slave does not try any further. If the master’s 
version is greater than the slave’s, the slave attempts to transfer the map. If the master is not 
reachable, the slave attempts to find a greater version held at some other peer. If the server is 
the master of the named map, it takes no action. 


2.6.5.10. Get New Map Version From Here (Procedure 9, Version 1) 

9. YPPROC_GET (req) returns ( ) 
struct yprequest req; 

The type of the req must be YPREQ_NOKEY. The server assumes that the caller is the master 
of the map, and tries to get a new version from that master server. In terms of version numbers 
and peer reachability, it follows the course of action described for YPPROC_PULL. If the server 
is the master of the named map after replacing the master peer’s name with the caller’s name, it 
takes no action. That is, if a master calls YPPROC_GET in itself, it takes no action. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


3c YP Binders 


3.1. Introduction 

In order that any network service be usable, there must be some way for potential clients to find 
the servers. This section describes the YP binder, an optional element in the YP subsystem that 
supplies YP database server addressing information to potential YP clients. 

In order to address a YP server in an ARPA internet environment, a client must know the 
server’s internet address, and the port at which the server is listening for service requests. No 
contract is negotiated between a YP server and a potential client, therefore the addressing infor- 
mation is sufficient to bind the client to the server. 

Of the many possible ways for a client to get the addressing information, one alternative is to 
supply an entity to cache the bindings, and to serve that binding database to potential YP 
clients. The theory is that if finding the service takes a lot of work, allocate a specialist to do it, 
rather than burden every client with a job that is irrelevant to its real function. A YP binder 
only makes sense if it is easier for a client to find the YP binder than to find a YP database 
server, and if the YP binder can itself find a YP database server. 

We make the assumption that a YP binder is present at every network node, and because of 
this, addressing the YP binder is easier than addressing a YP database server. The scheme for 
finding a local resource is implementation-specific, but given that a resource is guaranteed to be 
local, there may be some efficient way of finding it. We further assume that the YP binder can 
find a YP database server in some way, but that that way is either complicated or time- 
consuming to do. If either of these assumptions is untrue, then probably your implementation is 
not a good bet for a YP binder. 

If a YP binder is implemented, it can provide added value beyond the binding: it can verify that 
the binding is correct and that the YP database server is alive and well, for instance. The 
degree of sureness in a binding that the YP binder gives to a client is a parameter that can be 
tuned appropriately in the implementation. 


3.2. YP Binder Protocol Definition 

This section describes version 1 of the protocol. It is likely that changes will be made to succes- 
sive versions as the service matures. 


3.2.1. RPC Constants 

All numbers are decimal. 

YPBINDPROG 100007 

The YP binder protocol program number. 

YPBINDVERS 1 

The current YP binder protocol version. 


ypj? Sun Microsyste 


ms 


Release 2.0 


Page 14 


YP Protoco! Spec 


3.2.2. Other Manifest Constants 

All numbers are decimal. 

YPMAXDOMAIN 64 

The maximum number of characters in a domain name. This is identical to the constant 
defined above within the YP database server protocol section. 

ypbind_resptype 

enum ypbind_resptype •{ 

YPBIND_SUCC_VAL = 1 # 

YPBIND_FAIL_VAL = 2 

> 

This discriminates between success responses and failure responses to a 
YPBINDPROC_DOMAIN request. 

ypbinderr 

typedef enum •( 

YPBIND_ERR_ERR 1 
YPB I ND_ERR_NOSERV 2 
YP B I ND_E RR_RE S C 3 
} ypbinderr 

The error case of most interest to a YP binder client is YPBIND_ERR_NOSERV; it means that 
the binding request cannot be satisfied because the YP binder doesn’t know how to address any 
YP database server in the named domain. 


/* Internal error */ 

/* No bound server for passed domain */ 
/* System resource allocation failure */ 


3.2.3. Basic Data Structures 

This section defines the data structures used as inputs to and outputs from the YP binder 
remote procedures. 

domainname 

typedef string domainname<YPMAXDOMAIN> 

This is identical to the domainname string defined above within the YP database server protocol 
section. 

ypbind_binding 

struct ypbind_binding { 

unsigned long int ypbind_binding_addr 
unsigned short int ypbind_binding_port 

> 

This contains the information necessary to bind a client to a YP database server in the ARPA 
internet environment. ypbind_binding_addr holds the host IP address (4 bytes), and 
ypbind_binding_port holds the port address (2 bytes). Both IP address and port address must be 
in ARPA network byte order (most significant byte first, or big endian), regardless of the host 
machine’s native architecture. 


Sun Microsystems 


Release 2.0 


YP Protocol Spec 


ypbind_resp 

struct ypbind_resp { 

union switch (enum ypbind_resptype status) { 
YPBIND_SUCC_VAL : 

struct ypbind_binding 
YP B I ND_F AI L_VAL : 

ypbinderr 

default : 


This is the response to a YPBINDPROC_DOMAIN request. 


ypbind_setdom 

struct ypbind_setdom { 
domainname 

struct ypbind_binding 

> 

This is the input data structure for the YPBINDPROC_SETDOM procedure. 


Sun Microsystems 


Release 2.0 


Page 16 


YP Protocol Spec 


S. 2-4- YP Binder Remote Procedures 

Like the YP procedures earlier, these procedures are described using the XDR data definition 
language. 


3. 2. 4-1. Do Nothing (Procedure 0, Version 1) 

0. YPBINDPROC_NULL ( ) returns ( ) 

This does no work. It is made available in all RPC services to allow server response testing and 
timing. 


3.24.2. Get Current Binding for a Domain (Procedure 1, Version 1) 

1 . YP B I NDPR0C_D0MAI N (domain) returns (resp) 
domainname domain; 
struct ypbind_resp resp; 

This returns the binding information necessary to address a YP database server within the 
ARPA internet environment. 


3.24.3. Set Domain Binding (Procedure 2, Version 1) 

2. YP B I NDPROC_SE TDOM (setdom) returns ( ) 
struct ypbind_setdom setdom; 

This instructs a YP binder to use the passed information as its current binding information for 
the passed domain. 


Sun Microsystems 


Release 2.0 


Index 


A 

ARPA network byte order, 

B 

boolean, 2 
byte order, 14 

D 

discriminated union, 2 
domain, 1 
domainname, 7, 14 

E 

enumeration, 1, 3 
enumeration defined, 1 

G 

global consistency, 1 

K 

keydat, 8 

M 

map, 1 
mapname, 7 
master, 4 
match, 1, 3 
match defined, 1 

P 

peername, 7 
propagation, 1 

s 

slave, 4 
string, 2 

T 

timestamps, 4 

u 

update, 1 


V 

14 valdat, 8 

X 

XDR data description language, 2 

Y 

YP binder detailed error codes, 14 
YP private keys, 3 
YP server return status values, 7 
YP_LAST_MODIFIED, 4, 8 
ypbind_binding, 14 
ypbind_resp, 15 
ypbind_resptype, 14 
ypbind_setdom, 15 
ypbinderr, 14 

YPBINDPROCJDOMAIN, 1© 
YPBINDPROC_NULL, 18 
YPBINDPROC_SETDOM, 16 
YPBINDPROG, 13 
YPBINDVERS, 13 
ypmap_parms, 8 
YPMAXDOMAIN, 7, 14 
YP MAX MAP, 7 
YPMAXPEER, 7 
YPMAXRECORD, 7 
YPPROCLDOMAIN, 10 
YPPROC_DOMAIN_NONACK, 10 
YPPROCLFIRST, 11 
YPPROCLGET, II 
YPPROCLMATCH, 10 
YPPROC_NEXT, 11 
YPPROC_NULL, 10 
YPPROC_POLL, 11 
YPPROC_PULL, 12 
YPPROC_PUSH, 11 
YPPROG, 6 
yprequest, 8 
ypresponse, 9 
ypstat, 7 
YPVERS, © 


Inter-Process Communication 

Primer 


Contents 


1. Introduction 1 

2 . Basics 2 

2.1. Socket Types 2 

2.2. Socket Creation 3 

2.3. Binding Names 3 

2.4. Connection Establishment 4 

2.5. Data Transfer 6 

2.6. Discarding Sockets 6 

2.7. Connectionless Sockets 6 

2.8. Input/Output Multiplexing 7 

3. Network Library Routines 8 

3.1. Host Names 8 

3.2. Network Names 9 

3.3. Protocol Names 10 

3.4. Service Names 10 

3.5. Miscellaneous 10 

4 . Client/Server Model 13 

4.1. Servers 13 

4.2. Clients 15 

4.3. Connectionless Servers 16 

5 . Advanced Topics 20 

5.1. Out of Band Data 20 

5.2. Signals and Process Groups 21 

5.3. Pseudo Terminals 21 

5.4. Internet Address Binding 22 

5.5. Broadcasting and Datagram Sockets 24 

5.6. Signals 24 


Inter-Process Communication 

Primer 


This document provides an introduction to the inter-process communication (IPC) facilities on 
Sun’s version of the UNIXf operating system. It discusses the overall model for IPC, and intro- 
duces IPC primitives that have been added to the system. The majority of the document consid- 
ers the use of these primitives in developing applications. The reader is expected to be familiar 
with the C programming language, as all examples are written in C. 


1. Introduction 

One of the most important features added in the Berkeley 4.2 release of the UNIX operating sys- 
tem is substantial new IPC facilities. These facilities are the result of more than two years of 
discussion and research. The facilities provided in this release incorporate many of the ideas 
from current research, while trying to maintain simplicity and conciseness. These IPC facilities 
have already established a de facto standard. 

UNIX has previously been weak in doing IPC. Until recently, the only standard mechanism that 
allowed two processes to communicate were pipes (the mpx files in Version 7 were experimental). 
Unfortunately, pipes are restrictive in that two communicating processes must be related 
through a common ancestor. Further, the semantics of pipes makes them impossible to maintain 
in a distributed environment. 

Earlier attempts at extending the IPC facilities of UNIX have met with mixed reaction. The 
majority of problems have been related to these facilities being tied to the UNIX filesystem, 
either through naming or implementation. Consequently, the IPC facilities provided in this 
release have been designed as a totally independent subsystem, and allow processes to rendez- 
vous in many ways. Processes may rendezvous through a UNIX filesystem-like name space (a 
space where all names are path names) as well as through a network name space. In fact, new 
name spaces may be added at a future time with only minor changes visible to users. Further- 
more, the communication facilities have been extended to include more than the simple byte 
stream provided by pipes. These extensions have resulted in a completely new part of the sys- 
tem, which users will need time to familiarize themselves with. It is likely that as more use is 
made of these facilities, they will be refined; only time will tell. 

The remainder of this document is organized in four sections. Section 2 introduces new system 
calls and the basic model of communication. Section 3 describes some of the supporting library 
routines users may find useful in constructing distributed applications. Section 4 is concerned 
with the client/server model used in developing applications; it includes examples of the two 
major types of servers. Section 5 delves into advanced topics that sophisticated users may need 
to know when using IPC facilities. 


t UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


Page 2 


IPC Primer 


2o Basics 

The basic building block for communication is the socket. A socket is an endpoint of communi- 
cation to which a name may be bound. Each socket in use has a type and one or more associated 
processes. Sockets exist within communication domains. A communication domain is an 
abstraction introduced to bundle common properties of processes communicating through sock- 
ets. One such property is the scheme used to name sockets. For example, in the UNIX com- 
munication domain sockets are named with UNIX path names; e.g. a socket may be named 
/dev/foo . Sockets normally exchange data only with sockets in the same domain (it may be pos- 
sible to cross domain boundaries, but only if some translation process is performed). The IPC 
supports two separate communication domains: the UNIX domain, and the Internet domain is 
used by processes which communicate using the the DARPA standard communication protocols. 
The underlying communication facilities provided by these domains have a significant influence 
on the internal system implementation as well as the interface to socket facilities available to a 
user. An example of the latter is that a socket operating in the UNIX domain sees a subset of 
the possible error conditions which are possible when operating in the Internet domain. 


2.1. Socket Types 

Sockets are typed according to the communication properties visible to a user. Processes are 
presumed to communicate only between sockets of the same type, although there is nothing that 
prevents communication between sockets of different types should the underlying communication 
protocols support this. 

Three types of sockets are currently available to a user. A stream socket provides for the 
bidirectional, reliable, sequenced, and unduplicated flow of data without record boundaries. 
Aside from the bidirectionality of data flow, a pair of connected stream sockets provides an 
interface nearly identical to that of pipes. 1 

A datagram socket supports bidirectional flow of data that is not promised to be sequenced, reli- 
able, or unduplicated. That is, a process receiving messages on a datagram socket may find dupli- 
cate messages, and possibly in an order different from the order in which it was sent. An impor- 
tant characteristic of a datagram socket is that record boundaries in data are preserved. 
Datagram sockets closely model the facilities found in many contemporary packet switched net- 
works such as the Ethernet. 

A raw socket provides access to underlying communication protocols that support socket abstrac- 
tions. These sockets are normally datagram oriented, though their exact characteristics depend 
on the interface provided by the protocol. Raw sockets are not intended for the general user; 
they have been provided mainly for those interested in developing new communication protocols, 
who must gain access to the more esoteric facilities of an existing protocol. 

Two interesting, but implemented, socket types are the sequenced packet socket and the reliably 
delivered message socket. The first is identical to a stream socket, except that record boundaries 
are preserved; it is similar to the Xerox NS Sequenced Packet protocol. The second has similar 
properties to a datagram socket, but with reliable delivery. This document discusses only imple- 
mented sockets. 


1 In the UNIX domain, in fact, the semantics are identical and, as one might expect, pipes have been 
implemented internally as simply a pair of connected stream sockets. 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 3 


2.2. Socket Creation 

To create a socket, use the socket system call: 

s = socket (domain, type, protocol); 

This call requests that the system create a socket in the specified domain and of the specified 
type. A particular protocol may also be requested. If the protocol is left unspecified (a value of 
0), the system will select an appropriate protocol from those protocols which comprise the com- 
munication domain and which may be used to support the requested socket type. The user is 
returned a descriptor (a small integer number) which may be used in later system calls which 
operate on sockets. The domain is specified as one of the manifest constants defined in the file 
<sys/socket.h>. For the UNIX domain the constant is AF_UNIX; 2 for the Internet domain 
AFJNET. The socket types are also defined in this file and one of SOCK_STREAM, 
SOCK_DGRAM, or SOCK_RAW must be specified. To create a stream socket in the Internet 
domain the following call might be used: 

s = socket (AF_INET, SOCK_STREAM, 0) ; 

This call would result in a stream socket being created with the TCP protocol providing the 
underlying communication support. To create a datagram socket for on-machine use a sample 
call might be: 

s = socket (AF_UNIX, SOCK_DGRAM, O) ; 

To obtain a particular protocol one selects the protocol number, as defined within the communi- 
cation domain. For the Internet domain the available protocols are defined in <netinet/in.h> 
or, better yet, one may use one of the library routines discussed in section 3, such as getproto- 
byname : 

#include <sys/types .h> 

#include <sys/socket .h> 

#include <netinet/in . h> 

#include <netdb.h> 

pp = getprotobyname ("tcp") ; 

s = socket (AF_I NET, SOCK_STREAM, pp->p_proto) ; 

There are several reasons a socket call may fail. Aside from the rare occurrence of lack of 
memory (ENOBUFS), a socket request may fail due to a request for an unknown protocol 
(EPROTONOSUPPORT), or a request for a type of socket for which there is no supporting pro- 
tocol (EPROTOTYPE). 


2.3. Binding Names 

A socket is created without a name. Until a name is bound to a socket, processes have no way 
to reference it and, consequently, no messages may be received on it. The bind call is used to 
assign a name to a socket: 


2 The manifest constants are named AF_whatever as they indicate the address format to use in 
interpreting names. 


Sun Microsystems 


Release 2.0 


Page 4 


IPC Primer 


bind (s , name, namelen) ; 

The bound name is a variable length byte string which is interpreted by the supporting 
protocol(s). Its interpretation may vary from communication domain to communication domain 
(this is one of the properties which comprise the domain). In the UNIX domain names are path 
names while in the Internet domain names contain an Internet address and port number. If one 
wanted to bind the name /dev/foo to a UNIX domain socket, the following would be used: 

#include <sys/un.h> 

struct sockaddr_un sun; 

sun . sun_family = AF_UNIX; 

strcpy (sun . sun_path, "/dev/foo") ; 

bind(s, &sun, strlen ("/dev/foo") +2) ; 

In binding an Internet address things become more complicated. The actual call is simple, 

#include <sys /types .h> 

#Include <netinet/in .h> 

struct sockaddr_in sin; 

bind(s, &sin, sizeof (sin)); 

but the selection of what to place in the address tin requires some discussion. We will come 
back to the problem of formulating Internet addresses in section 3 when the library routines used 
in name resolution are discussed. 


2.4. Connection Establishment 

With a bound socket it is possible to rendezvous with an unrelated process. This operation is 
usually asymmetric with one process a client and the other a server. The client requests ser- 
vices from the server by initiating a connection to the server’s socket. The server, when willing 
to offer its advertised services, passively listens on its socket. On the client side the connect call 
is used to initiate a connection. Using the UNIX domain, this might appear as, 

struct sockaddr_un server; 

connect (s , ^server, strlen (server . sun_path) +2) ; 

while in the Internet domain, 

struct sockaddr_in server; 

connect (s, fiserver, sizeof (server)); 

If the client process’s socket is unbound at the time of the connect call, the system will automati- 
cally select and bind a name to the socket; c.f. section 5. 4. 3 An error is returned when the con- 
nection was unsuccessful (any name automatically bound by the system, however, remains). 
Otherwise, the socket is associated with the server and data transfer may begin. 

Many errors can be returned when a connection attempt fails. The most common are: 

ETIMEDOUT 

After failing to establish a connection for a period of time, the system decided there was no 
point in retrying the connection attempt any more. This usually occurs because the 

3 You must do a geteoekname (2) call to retrieve the binding. 


Sun Microsystems 


Release 2.0 


IPC Primer 


destination host is down, or because problems in the network resulted in transmissions being 
lost. 

ECONNREFUSED 

The host refused service for some reason. When connecting to a host running the 0.9 release 
version of UNIX this is usually due to a server process not being present at the requested 
name. 

ENETDOWN or EHOSTDOWN 

These operational errors are returned based on status information delivered to the client 
host by the underlying communication services. 

ENETUNREACH or EHOSTUNREACH 

These operational errors can occur either because the network or host is unknown (no route 
to the network or host is present), or because of status information returned by intermediate 
gateways or switching nodes. Many times the status returned is not sufficient to distinguish 
a network being down from a host being down. In these cases the system is conservative and 
indicates the entire network is unreachable. 

For the server to receive a client's connection it must perform two steps after binding its socket. 
The first is to indicate a willingness to listen for incoming connection requests: 

listen (s, 5); 

The second parameter to the listen call specifies the maximum number of outstanding connec- 
tions which may be queued awaiting acceptance by the server process. Should a connection be 
requested while the queue is full, the connection will not be refused, but rather the individual 
messages which comprise the request will be ignored. This gives a harried server time to make 
room in its pending connection queue while the client retries the connection request. Had the 
connection been returned with the ECONNREFUSED error, the client would be unable to tell if 
the server was up or not. As it is now it is still possible to get the ETIMEDOUT error back, 
though this is unlikely. The backlog figure supplied with the listen call is limited by the system 
to a maximum of 5 pending connections on any one queue. This avoids the problem of processes 
hogging system resources by setting an infinite backlog, then ignoring all connection requests. 

With a socket marked as listening, a server may accept a connection: 

fromlen = sizeof (from) ; 

snew = accept (s, fifrom, fifromlen) ; 

A new descriptor is returned on receipt of a connection (along with a new socket). If the server 
wishes to find out who its client is, it may supply a buffer for the client socket’s name. The 
value-result parameter fromlen is initialized by the server to indicate how much space is associ- 
ated with from, then modified on return to reflect the true size of the name. If the client s name 
is not of interest, the second parameter may be zero. 

Accept normally blocks. That is, the call to accept will not return until a connection is available 
or the system call is interrupted by a signal to the process. Further, there is no way for a pro- 
cess to indicate it will accept connections from only a specific individual, or individuals. It is up 
to the user process to consider who the connection is from and close down the connection if it 
does not wish to speak to the process. If the server process wants to accept connections on more 
than one socket, or not block on the accept call there are alternatives; they will be considered in 
section 5. 


Sun Microsystems 


Release 2.0 


Page 6 


IPC Primer 


2.5. Data Transfer 

With a connection established, data may begin to flow. To send and receive data there are a 
number of possible calls. With the peer entity at each end of a connection anchored, a user can 
send or receive a message without specifying the peer. As one might expect, in this case, then 
the normal read and write system calls are useable, 

write (s, buf, sizeof (buf)); 
read(s, buf, sizeof (buf)); 

In addition to read and write, the new calls send and recv may be used: 

send(s, buf, sizeof (buf), flags); 
recv(s, buf, sizeof (buf), flags); 

While send and recv are virtually identical to read and write, the extra flags argument is impor- 
tant. The flags may be specified as a non-zero value if one or more of the following is required: 

MSG_OOB send/receive out of band data 

MSG_PEEK look at data without reading 

MSG_DONTROUTE send data without routing packets 

Out of band data is a notion specific to stream sockets, and one which we will not immediately 
consider. The option to have data sent without routing applied to the outgoing packets is 
currently used only by the routing table management process, and is unlikely to be of interest to 
the casual user. The ability to preview data is, however, of interest. When MSG_PREVIEW is 
specified with a recv call, any data present is returned to the user, but treated as still unread. 
That is, the next read or recv call to the socket will return data previously previewed. 


2.6. Discarding Sockets 

Once a socket is no longer of interest, it may be discarded by applying a close to the descriptor, 
close (s) ; 

If data is associated with a socket which promises reliable delivery (e.g. a stream socket) when a 
close takes place, the system will continue to attempt to transfer the data. However, after a 
fairly long period of time, if the data is still undelivered, it will be discarded. Should a user have 
no use for any pending data, it may perform a shutdown on the socket prior to closing it. This 
call is of the form: 

shutdown (s, how); 

where how is 0 if the user is no longer interested in reading data, 1 if no more data will be sent, 
or 2 if no data is to be sent or received. Applying shutdown to a socket causes any data queued 
to be immediately discarded. 


2.7. Connectionless Sockets 

To this point we have been concerned mostly with sockets which follow a connection oriented 
model. There is also support for connectionless interactions typical of datagram facilities found 
in contemporary packet switched networks. A datagram socket provides a symmetric interface 
to data exchange. While processes are still likely to be client and server, there is no requirement 
for connection establishment. Instead, each message includes the destination address. 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 7 


Datagram sockets are created as before, and each should have a name bound to it in order that 
the recipient of a message may identify the sender. To send data, the sendto primitive is used, 

sendto (s , buf, buflen, flags, &to. tolen) ; 

The s, buf, buflen, and flags parameters are used as before. The to and tolen values are used to 
indicate the intended recipient of the message. When using an unreliable datagram interface, it 
is unlikely any errors will be reported to the sender. Where information is present locally to 
recognize a message which may never be delivered (for instance when a network is unreachable), 
the call will return —1 and the global value errno will contain an error number. 

To receive messages on an unconnected datagram socket, the recvfrom primitive is provided: 

recv from(s, buf, buflen, flags, &from, fifromlen) ; 

Once again, the fromlen parameter is handled in a value-result fashion, initially containing the 
size of the from buffer. 

In addition to the two calls mentioned above, datagram sockets may also use the connect call to 
associate a socket with a specific address. In this case, any data sent on the socket will automat- 
ically be addressed to the connected peer, and only data received from that peer will be 
delivered to the user. Only one connected address is permitted for each socket (i.e. no multi- 
casting). Connect requests on datagram sockets return immediately, as this simply results in the 
system recording the peer’s address (as compared to a stream socket where a connect request ini- 
tiates establishment of an end to end connection). Other of the less important details of 
datagram sockets are described in section 5. 


2.8. Input/Output Multiplexing 

One last facility often used in developing applications is the ability to multiplex I/O requests 
among multiple sockets and/or files. This is done using the select call: 

select (nfds, &readfds, fiwritefds, Sexecptfds, &timeout) ; 

Select takes as arguments three bit masks, one for the set of file descriptors for which the caller 
wishes to be able to read data on, one for those descriptors to which data is to be written, and 
one for which exceptional conditions are pending. Bit masks are created by or-ing bits of the 
form l<<fd. That is, a descriptor fd is selected if a 1 is present in the ftf th bit of the mask. 
The parameter nfds specifies the range of file descriptors (i.e. one plus the value of the largest 
descriptor) specified in a mask. 

A timeout value may be specified if the selection is not to last more than a predetermined period 
of time. If timeout is set to 0, the selection takes the form of a poll, returning immediately. If 
the last parameter is a null pointer, the selection will block indefinitely.^ Select normally returns 
the number of file descriptors selected. If the select call returns due to the timeout expiring, 
then a value of -1 is returned along with the error number EINTR. 

Select provides a synchronous multiplexing scheme. Asynchronous notification of output comple- 
tion, input availability, and exceptional conditions is possible through use of the SIGIO and 
SIGURG signals described in section 5. 


4 To be more specific, a return takes place only when a descriptor is selectable, or when a signal is 
received by the caller, interrupting the system call. 


Sun Microsystems 


Release 2.0 


Page 8 


IPC Primer 


3. Network Library Routines 

The discussion in section 2 indicated the possible need to locate and construct network addresses 
when using the IPC facilities in a distributed environment. To aid in this task a number of rou- 
tines have been added to the standard C run-time library. In this section we will consider the 
new routines provided to manipulate network addresses. While the Sun system release network- 
ing facilities support only the DARPA standard Internet protocols, these routines have been 
designed with flexibility in mind. As more communication protocols become available, we hope 
the same user interface will be maintained in accessing network-related address data bases. The 
only difference should be the values returned to the user. Since these values are normally sup- 
plied the system, users should not need to be directly aware of the communication protocol 
and/or naming conventions in use. 

Locating a service on a remote host requires many levels of mapping before client and server 
may communicate. A service is assigned a name which is intended for human consumption; for 
example, “the login server on host monet”. This name, and the name of the peer host, must 
then be translated into network addresses which are not necessarily suitable for human consump- 
tion. Finally, the address must then used in locating a physical location and route to the service. 
The specifics of these three mappings is likely to vary between network architectures. For 
instance, it is desirable for a network to not require hosts be named in such a way that their 
physical location is known by the client host. Instead, underlying services in the network may 
discover the actual location of the host at the time a client host wishes to communicate. This 
ability to have hosts named in a location independent manner may induce overhead in connec- 
tion establishment, as a discovery process must take place, but allows a host to be physically 
mobile without requiring it to notify its clientele of its current location. 

Standard routines are provided for: mapping host names to network addresses, network names to 
network numbers, protocol names to protocol numbers, and service names to port numbers and 
the appropriate protocol to use in communicating with the server process. The file <netdb.h> 
must be included when using any of these routines. 


3.1. Host Names 

A host name to address mapping is represented by the hostent structure: 


hostent 

{ 


char 

*h_name; 

/* 

official name of host */ 

char 

**h_aliases; 

/* 

alias list */ 


int 

h_addrtype; 

/* 

host address type 

V 

int 

h_length; 

/* 

length of address 

V 

char 

*h_addr ; 

/* 

address */ 


>; 

Note that the h_addr field in the structure definition is defined as a pointer to char. In the case 
of Internet addresses (the only case implemeted to date) you should cast this to a (struct 
in_addr *) when using the item. 

The official name of the host and its public aliases are returned, along with a variable length 
address and address type. The routine gethostbyname/ 3N) takes a host name and returns a hos- 
tent structure, while the routine gethostbyaddr(3N ) maps host addresses into a hostent structure. 
It is possible for a host to have many addresses, all having the same name. Geihostybyname 
returns the first matching entry in the data base file /etc/ hosts] if this is unsuitable, the lower 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 9 


level routine gcthostcnt^ 3N) may be used. Por example, to obtain a hostent structure for a host 
on a particular network the following routine might be used (for simplicity, only Internet 
addresses are considered): 

#include <sys/types .h> 

#include <sys/socket .h> 

#include <netinet/in.h> 

#include <netdb.h> 

struct hostent * 

gethostbynameandnet (name, net) 
char ‘name; 
int net ; 

register struct hostent *hp ; 
register char * * cp ; 

sethostent (0) ; 

while ( (hp = gethostent () ) != NULL) { 

if (hp->h_addrtype != AF_INET) 
continue; 

if (strcmp (name, hp->h_name)) { 

for (cp = hp->h_a liases; cp && *cp != NULL; cp++) 
if (strcmp (name, *cp) == 0) 
goto found; 
continue; 

> 

found : 

if (in_netof (* (struct in_addr *)hp->h_addr) ) == net) 
break ; 

> 

endhostent (0) ; 
return (hp) ; 

> 

(«n_ne<o/(3N) is a standard routine which returns the network portion of an Internet address.) 


3.2. Network Names 

As for host names, routines for mapping network names to numbers, and back, are provided. 
These routines return a netent structure: 

/* 

* Assumption here is that a network number 

* fits in 32 bits -- probably a poor one. 

V 

struct netent { 

char *n_name; /* official name of net */ 

char **n_aliases; /* alias list */ 

int n_addrtype; /* net address type */ 

int n_net ; /* network # */ 

>; 

The routines getnetbyname(3N), getnetbynumber{3N), and getnetent(3N) are the network 


Sun Microsystems 


Release 2.0 


Page 10 


IPC Primer 


counterparts to the host routines described above. 


3.3. Protocol Names 

For protocols the protoent structure defines the protocol-name mapping used with the routines 
getprotobyname(3N), getprotobynumber(3N), and getprotoent{ 3N): 

struct protoent •{ 

char *p_name ; /* official protocol name */ 

char 4 *p_aliases; /* alias list */ 

int p_proto; /* protocol # */ 


3.4. Service Names 

Information regarding services is a bit more complicated. A service is expected to reside at a 
specific port and employ a particular communication protocol. This view is consistent with the 
Internet domain, but inconsistent with other network architectures. Further, a service may 
reside on multiple ports or support multiple protocols. If either of these occurs, the higher level 
library routines will have to be bypassed in favor of homegrown routines similar in spirit to the 
gethostbynameandnet routine described above. A service mapping is described by the servent 
structure, 

struct servent { 

char * s_name ; 

char * *s_a liases ; 

int s_port; 

char *s_proto; 

>; 

The routine getservbyname (3N) maps service names to a servent structure by specifying a ser- 
vice name and, optionally, a qualifying protocol. Thus the call 

sp = getservbyname ("telnet" , (char *)0); 

returns the service specification for a telnet server using any protocol, while the call 

sp = getservbyname ("telnet" , "tcp") ; 

returns only that telnet server which uses the TCP protocol. The routines getservbyport(3N) 
and getservent(3N) are also provided. The getservbyport routine has an interface similar to that 
provided by getservbyname-, an optional protocol name may be specified to qualify lookups. 


/* official service name */ 
/* alias list */ 

/* port # */ 

/* protocol to use */ 


3.5. Miscellaneous 

With the support routines described above, an application program should rarely have to deal 
directly with addresses. This allows services to be developed as much as possible in a network 
independent fashion. It is clear, however, that purging all network dependencies is very difficult. 
So long as the user is required to supply network addresses when naming services and sockets 
there will always some network dependency in a program. For example, the normal code 
included in client programs, such as the remote login program, is of the form shown in Figure 1. 


Sun Microsystems 


Release 2.0 


IPC Primer 


#include <sys/types .h> 
#include <sys/socket .h> 
#include <netinet/in .h> 
^include <stdio.h> 
^include <netdb.h> 


main(argc, argv) 

char *argv[]; 

{ 

struct sockaddr_in sin; 
struct servent *sp; 
struct hostent *hp; 
int s ; 

sp = getservbyname ("login" , "tcp") ; 
if (sp == NULL) { 

fprintf (stderr, "rlogin : tcp/login : unknown service\n") ; 
exit (1) ; 

> 

hp = gethostbyname (argv [1] ) ; 
if (hp == NULL) < 

fprintf ( st derr, "rlogin: %s: unknown host\n" , argv [1] ) ; 
exit (2) ; 

> 

bzero ( (char *)&sin, sizeof (sin)); 

bcopy (hp->h_addr, (char *) &sin . sin_addr , hp->h_length) ; 

sin . sin_family = hp->h_addrtype; 

sin.sin_port = sp->s_port; 

s = socket (AF_INET, SOCK_STREAM, 0) ; 

if (s < 0) { 

perror ("rlogin : socket"); 
exit (3) ; 

> 

if (connect (s, (char *)&sin, sizeof (sin)) < 0) { 

perror ("rlogin : connect"); 
exit (5) ; 

> 


> 


Figure 1: Remote login client code 


This example will be considered in more detail in section 4. 

If we wanted to make the remote login program independent of the Internet protocols and 
addressing scheme we would be forced to add a layer of routines which masked the network 
dependent aspects from the mainstream login code. For the current facilities available in the 
system this does not appear to be worthwhile. Perhaps when the system is adapted to different 
network architectures the utilities will be reorganized more cleanly. 

Aside from the address-related data base routines, there are several other routines available in 
the run-time library which are of interest to users. These are intended mostly to simplify 


Sun Microsystems 


Release 2.0 


Page 12 


IPC Primer 


manipulation of names and addresses. The following table summarizes the routines for manipu- 
lating variable length byte strings and handling byte swapping of network addresses and values. 


C Run-Time Routines 

Call 

Synopsis 

bcmp (si, s2,n) 
bcopy (si , s2 , n) 
bzero (base , n) 
htonl (val) 
htons (val) 
ntohl (val) 
ntohs (val) 

compare byte-strings; 0 if same, not 0 otherwise 
copy n bytes from si to s2 
zero-fill n bytes starting at base 

convert 32-bit quantity from host to network byte order 
convert 16-bit quantity from host to network byte order 
convert 32-bit quantity from network to host byte order 
convert 16-bit quantity from network to host byte order 


The byte swapping routines are provided because the operating system expects addresses to be 
supplied in network order. On a VAX, or machine with similar architecture, this is usually 
reversed. Consequently, programs are sometimes required to byte swap quantities. The library 
routines which return network addresses provide them in network order so that they may simply 
be copied into the structures provided to the system. This implies users should encounter the 
byte swapping problem only when interpreting network addresses. For example, if an Internet 
port is to be printed out the following code would be required: 

printf("port number %d\n", ntohs (sp->s_port) ) ; 

On machines other than the VAX these routines are defined as null macros. 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 13 


4* Client/Server Model 

The most commonly used paradigm in constructing distributed applications is the client/server 
model. In this scheme client applications request services from a server process. This implies an 
asymmetry in establishing communication between the client and server which has been exam- 
ined in section 2. In this section we will look more closely at the interactions between client and 
server, and consider some of the problems in developing client and server applications. 

Client and server require a well known set of conventions before service may be rendered (and 
accepted). This set of conventions comprises a protocol which must be implemented at both 
ends of a connection. Depending on the situation, the protocol may be symmetric or asym- 
metric. In a symmetric protocol, either side may play the master or slave roles. In an asym- 
metric protocol, one side is immutably recognized as the master, with the other the slave. An 
example of a symmetric protocol is the TELNET protocol used in the Internet for remote termi- 
nal emulation. An example of an asymmetric protocol is the Internet file transfer protocol, FTP. 
No matter whether the specific protocol used in obtaining a service is symmetric or asymmetric, 
when accessing a service there is a client process and a server process. We will first consider 
the properties of server processes, then client processes. 

A server process normally listens at a well know address for service requests. Alternative 
schemes which use a service server may be used to eliminate a flock of server processes clogging 
the system whde remaining dormant most of the time. The Xerox Courier protocol uses the 
latter scheme. When using Courier, a Courier client process contacts a Courier server at the 
remote host and identifies the service it requires. The Courier server process then creates the 
appropriate server process based on a data base and splices the client and server together, void- 
ing its part in the transaction. This scheme is attractive in that the Courier server process may 
provide a single contact point for all services, as well as carrying out the initial steps in authenti- 
cation. However, while this is an attractive possibility for standardizing access to services, it 
does introduce a certain amount of overhead due to the intermediate process involved. Imple- 
mentations which provide this type of service within the system can minimize the cost of client 
server rendezvous. 


4.1. Servers 

In this release, most servers are accessed at well known Internet addresses or UNIX domain 
names. When a server is started at boot time it advertises it services by listening at a well know 
location. For example, the remote login server’s main loop is of the form shown in Figure 2. 

The first step taken by the server is look up its service definition: 

sp = getservbyname ("login" , "tcp") ; 
if (sp == NULL) { 

fprintf (stderr , "rlogind: tcp/login: unknown service\n") ; 
exit (1) ; 

> 

This definition is used in later portions of the code to define the Internet port at which it listens 
for service requests (indicated by a connection). 

Step two is to disassociate the server from the controlling terminal of its invoker. This is impor- 
tant as the server will likely not want to receive signals delivered to the process group of the 
controlling terminal. 


Sun Microsystems 


Release 2.0 


Page 14 


IPC Primer 


main(argc, argv) 
int argc; 
char **argv; 

int f ; 

struct sockaddr_in from; 
struct servent *sp; 

sp = getservbyname ("login" , "tcp") ; 
if (sp == NULL) { 

fprintf (stderr , "rlogind: tcp/login: unknown service\n") 
exit (1) ; 

> 

#ifndef DEBUG 

<<disassociate server from controlling terminal>> 

tend if 

sin.sin_port = sp->s_port; 

f = socket (AF_INET, SOCK_STREAM, 0) ; 

if (bind(f, (caddr_t) &sin, sizeof (sin)) < 0) { 

> 

listen (f, 5); 
for (;;) { 

int g, len = sizeof (from) ; 

g = accept(f, fifrom, &len) ; 
if (g < 0) { 

if (errno != EINTR) 

perror ("rlogind: accept"); 
continue; 

> 

if ( fork () == 0) < 
close ( f) ; 
doit (g, &from) ; 

> 

close (g) ; 

> 

> 


Figure 2: Remote login server 


Once a server has established a pristine environment, it creates a socket and begins accepting 
service requests. The bind call is required to insure the server listens at its expected location. 
The main body of the loop is fairly simple; 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 15 


for (;;) { 

int g, len = sizeof (from) ; 


g = accept (f, &from, &len) ; 
if (g < o) { 

if (errno != EINTR) 

perror ("rlogind : accept”); 
continue; 


> 

if (fork() == 0) { 
close (f) ; 
doit(g, &from) ; 

> 

close (g) ; 


An accept call blocks the server until a client requests service. This call could return a failure 
status if the call is interrupted by a signal such as SIGCHLD (to be discussed in section 5). 
Therefore, the return value from accept is checked to insure a connection has actually been esta- 
blished. With a connection in hand, the server then forks a child process and invokes the main 
body of the remote login protocol processing. Note how the socket used by the parent for queue- 
ing connection requests is closed in the child, while the socket created as a result of the accept is 
closed in the parent. The address of the client is also handed the doit routine because it requires 
it in authenticating clients. 


4.2. Clients 

The client side of the remote login service was shown earlier in Figure 1. One can see the 
separate, asymmetric roles of the client and server clearly in the code. The server is a passive 
entity, listening for client connections, while the client process is an active entity, initiating a 
connection when invoked. 

Let us consider more closely the steps taken by the client remote login process. As in the server 
process the first step is to locate the service definition for a remote login: 

sp = getservbyname (" login" , "tcp") ; 
if (sp == NULL) { 

fprintf (stderr, "rlogin: tcp/login: unknown service\n") ; 
exit (1) ; 

> 

Next the destination host is looked up with a gethostbyname call: 

hp = gethostbyname (argv [1] ) ; 
if (hp == NULL) { 

fprintf (stderr , "rlogin: %s : unknown host\n" , argv[l]); 
exit (2) ; 

> 

With this accomplished, all that is required is to establish a connection to the server at the 
requested host and start up the remote login protocol. The address buffer is cleared, then filled 
in with the Internet address of the foreign host and the port number at which the login process 
resides: 


Sun Microsystems 


Release 2.0 


Page 16 


IPC Primer 


bzero ( (char *)&sin, sizeof (sin)); 

bcopy (hp->h_addr , (char *) sin . sin_addr , hp->h_length) ; 
sin.sin_family = hp->h_addrtype; 
sin.sin_port = sp->s_port; 

A socket is created, and a connection initiated. 

s = socket (hp - >h_addrtype , SOCK_STREAM, 0) ; 
if (s < 0) { 

perror ("rlogin : socket"); 
exit (3) ; 

> 

if (connect (s, (char *)&sin, sizeof (sin)) < 0) { 
perror ("rlogin : connect") ; 
exit (4) ; 

> 

The details of the remote login protocol will not be considered here. 


4.3. Connectionless Servers 

While connection-based services are the norm, some services are based on the use of datagram 
sockets. One, in particular, is the rwho service which provides users with status information for 
hosts connected to a local area network. This service, while predicated on the ability to broad- 
cast information to all hosts connected to a particular network, is of-interest as an example usage 
of datagram sockets. 

A user on any machine running the rwho server may find out the current status of a machine 
with the ruptime( 1) program. The output generated is illustrated in Figure 3. 


arpa 

up 

9:45. 

5 

users , 

load 

1.15, 1.39, 1.31 

cad 

up 

2+12:04. 

8 

users , 

load 

4.67, 5.13, 4.59 

calder 

up 

10:10, 

0 

users , 

load 

0.27, 0.15, 0.14 

dali 

up 

2+06:28, 

9 

users , 

load 

1.04, 1.20, 1.65 

degas 

up 

25+09:48, 

0 

users , 

load 

1.49, 1.43, 1.41 

ear 

up 

5+00:05, 

0 

users , 

load 

1.51, 1.54, 1.56 

ernie 

down 

0:24 


esvax 

down 

17:04 


ingres 

down 

0:26 


kim 

up 

3+09:16, 

8 

users , 

load 

2.03, 2.46, 3.11 

matisse 

up 

3+06:18, 

0 

users , 

load 

0.03, 0.03, 0.05 

medea 

up 

3+09:39, 

2 

users , 

load 

0.35, 0.37, 0.50 

merlin 

down 

19 + 15:3 


miro 

up 

1+07:20, 

7 

users. 

load 

4.59, 3.28, 2.12 

monet 

up 

1+00:43, 

2 

users , 

load 

0.22, 0.09, 0.07 

oz 

down 

16:09 


statvax 

up 

2+15:57, 

3 

users , 

load 

1.52, 1.81, 1.86 

ucbvax 

up 

9:34, 

2 

users , 

load 

6.08, 5.16, 3.28 


Figure 3: ruptime output 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 17 


Status information for each host is periodically broadcast by rwho server processes on each 
machine. The same server process also receives the status information and uses it to update a 
database. This database is then interpreted to generate the status information for each host. 
Servers operate autonomously, coupled only by the local network and its broadcast capabilities. 

The rwho server, in a simplified form, is pictured in Figure 4. There are two separate tasks per- 
formed by the server. The first task is to act as a receiver of status information broadcast by 
other hosts on the network. This job is carried out in the main loop of the program. Packets 
received at the rwho port are interrogated to insure they’ve been sent by another rwho server 
process, then are time stamped with their arrival time and used to update a file indicating the 
status of the host. When a host has not been heard from for an extended period of time, the 
database interpretation routines assume the host is down and indicate such on the status reports. 
This algorithm is prone to error as a server may be down while a host is actually up, but serves 
our current needs. 


Sun Microsystems 


Release 2.0 


Page 18 


IPC Primer 


main () 

{ 

sp = getservbyname ("who" , "udp") ; 
net = getnetbyname ("localnet") ; 
sin.sin_addr = inet_makeaddr (INADDR^ANY, net); 
sin.sin_port = sp->s_port; 

s = socket (AF_INET, SOCK_DGRAM, 0) ; 

bind(s, &sin, sizeof (sin)); 

sigset (SIGALRM, onalrm) ; 
onalrm () ; 
for (;;) { 

struct whod wd; 

int cc, whod, len = sizeof (from) ; 

cc = recv from (s , (char *)&wd, sizeof (struct whod), 0, &from, &len) 
if (cc <= 0) ■( 

if (cc < 0 && errno != EINTR) 
perror ("rwhod : recv"); 
continue; 

> 

if (from. sin_port != sp->s_port) { 

fprintf (stderr , "rwhod: %d: bad from port\n", 
ntohs ( from . sin_port) ) ; 
continue; 

> 

if (! verify (wd . wd_hostname) ) { 

fprintf (stderr, "rwhod: malformed host name from %x\n", 
ntohl ( from . sin_addr . s_addr ) ) ; 
continue; 

> 

(void) sprintf (path, "%s/whod . %s" . RWHODIR, wd . wd_hostname) ; 
whod = open (path, 0_FWR0NLY | 0_F CREATE | 0_F TRUNCATE , 0666); 

(void) time (&wd . wd_recvtime) ; 

(void) write (whod, (char *)&wd, cc) ; 

(void) close (whod) ; 

> 

> 


Figure 4: rwho server 


The second task performed by the server is to supply information regarding the status of its 
host. This involves periodically acquiring system status information, packaging it up in a mes- 
sage and broadcasting it on the local network for other rwho servers to hear. The supply func- 
tion is triggered by a timer and runs off a signal. Locating the system status information is 
somewhat involved, but uninteresting. Deciding where to transmit the resultant packet does, 
however, indicates some problems with the current protocol. 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 19 


Status information is broadcast on the local network. For networks which do not support the 
notion of broadcast another scheme must be used to simulate or replace broadcasting. One pos- 
sibility is to enumerate the known neighbors (based on the status received). This, unfortunately, 
requires some bootstrapping information, as a server started up on a quiet network will have no 
known neighbors and thus never receive, or send, any status information. This is the identical 
problem faced by the routing table management process in propagating routing status informa- 
tion. The standard solution, unsatisfactory as it may be, is to inform one or more servers of 
known neighbors and request that they always communicate with these neighbors. If each server 
has at least one neighbor supplying it, status information may then propagate through a neighbor 
to hosts which are not (possibly) directly neighbors. If the server is able to support networks 
which provide a broadcast capability, as well as those which do not, then networks with an arbi- 
trary topology may share status information. 5 

The second problem with the current scheme is that the rwho process services only a single local 
network, and this network is found by reading a file. It is important that software operating in a 
distributed environment not have any site-dependent information compiled into it. This would 
require a separate copy of the server at each host and make maintenance a severe headache. 
The Sun system attempts to isolate host-specific information from applications by providing sys- 
tem calls which return the necessary information. 6 The rwho server performs a lookup in a file to 
find its local network. A better, though still unsatisfactory, scheme used by the routing process 
is to interrogate the system data structures to locate those directly connected networks. A 
mechanism to acquire this information from the system would be a useful addition. 


8 One must, however, be concerned about loops. That is, if a host is connected to multiple networks, it 
will receive status information from itself. This can lead to an endless, wasteful, exchange of information. 

0 An example of such a system call is the gethoetname(2) call which returns the host’s official name. 


Sun Microsystems 


Release 2.0 


Page 20 


IPC Primer 


5. Advanced Topics 

A number of facilities have yet to be discussed. For most users of the IPC the mechanisms 
already described will suffice in constructing distributed applications. However, others will find 
need to utilize some of the features which we consider in this section. 


5.1. Out of Band Data 

The stream socket abstraction includes the notion of out of band data. Out of band data is a 
logically independent transmission channel associated with each pair of connected stream sock- 
ets. Out of band data is delivered to the user independently of normal data along with the 
SIGURG signal. In addition to the information passed, a logical mark is placed in the data 
stream to indicate the point at which the out of band data was sent. The remote login and 
remote shell applications use this facility to propagate signals from between client and server 
processes. When a signal is expected to flush any pending output from the remote process(es), 
all data up to the mark in the data stream is discarded. 

The stream abstraction defines that the out of band data facilities must support the reliable 
delivery of at least one out of band message at a time. This message may contain at least one 
byte of data, and at least one message may be pending delivery to the user at any one time. For 
communications protocols which support only in-band signaling (that is, the urgent data is 
delivered in sequence with the normal data) the system extracts the data from the normal data 
stream and stores it separately. This allows users to choose between receiving the urgent data in 
order and receiving it out of sequence without having to buffer all the intervening data. 

To send an out of band message the MSG_OOB flag is supplied to a send or sendto calls, while 
to receive out of band data MSG_OOB should be indicated when performing a recvfrom or recv 
call. To find out if the read pointer is currently pointing at the mark in the data stream, the 
SIOCATMARK ioctl is provided: 

ioctl(s, SIOCATMARK, &yes) ; 

If yes is a 1 on return, the next read will return data after the mark. Otherwise (assuming out of 
band data has arrived), the next read will provide data sent by the client prior to transmission of 
the out of band signal. The routine used in the remote login process to flush output on receipt of 
an interrupt or quit signal is shown in Figure 5. 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 21 


oob () 

< 

int out = 1+1; 

char waste [BUFSIZ] , mark; 

signal (SIGURG, oob) ; 

/* flush local terminal input and output */ 
ioctl (1, TIOCFLUSH, (char *)&out); 
for ( ; ; ) { 

if (ioctl (rem, SIOCATMARK, &mark) < 0) { 
perror ("ioctl") ; 
break ; 

> 

if (mark) 

break; 

(void) read (rem, waste, sizeof (waste)); 

> 

recv (rem, &mark, 1, MSG_00B) ; 

> 


Figure 5: Flushing terminal I/O on receipt of out of band data 


5.2. Signals and Process Groups 

Due to the existence of the SIGURG and SIGIO signals each socket has an associated process 
group (just as is done for terminals). This process group is initialized to the process group of its 
creator, but may be redefined at a later time with the SIOCSPGRP ioctl: 

ioctl (s, SIOCSPGRP, fipgrp) ; 

A similar ioctl, SIOCGPGRP, is available for determining the current process group of a socket. 


5.3. Pseudo Terminals 

Many programs will not function properly without a terminal for standard input and output. 
Since a socket is not a terminal, it is often necessary to have a process communicating over the 
network do so through a pseudo terminal. A pseudo terminal is actually a pair of devices, mas- 
ter and slave, which allow a process to serve as an active agent in communication between 
processes and users. Data written on the slave side of a pseudo terminal is supplied as input to a 
process reading from the master side. Data written on the master side is given the slave as 
input. In this way, the process manipulating the master side of the pseudo terminal has control 
over the information read and written on the slave side. The remote login server uses pseudo 
terminals for remote login sessions. A user logging in to a machine across the network is pro- 
vided a shell with a slave pseudo terminal as standard input, output, and error. The server pro- 
cess then handles the communication between the programs invoked by the remote shell and the 
user’s local client process. When a user sends an interrupt or quit signal to a process executing 
on a remote machine, the client login program traps the signal, sends an out of band message to 
the server process who then uses the signal number, sent as the data value in the out of band 


Sun Microsystems 


Release 2.0 


Page 22 


IPC Primer 


message, to perform a killpg(2 ) on the appropriate process group. 


5.4. Internet Address Binding 

Binding addresses to sockets in the Internet domain can be fairly complex. Communicating 
processes are bound by an association. An association is composed of local and foreign 
addresses, and local and foreign ports. Port numbers are allocated out of separate Spaces, one 
for each Internet protocol. Associations are always unique. That is, there may never be dupli- 
cate <protocol, local address, local port, foreign address, foreign port> tuples. 

The bind system call allows a process to specify half of an association, <local address, local 
port>, while the connect and accept primitives are used to complete a socket’s association. 
Since the association is created in two steps the association uniqueness requirement indicated 
above could be violated unless care is taken. Further, it is unrealistic to expect user programs to 
always know proper values to use for the local address and local port since a host may reside on 
multiple networks and the set of allocated port numbers is not directly accessible to a user. 

To simplify local address binding the notion of a wildcard address has been provided. When an 
address is specified as INADDR_ANY (a manifest constant defined in <netinet/in.h>), the sys- 
tem interprets the address as meaning, any valid address. For example, to bind a specific port 
number to a socket, but leave the local address unspecified, the following code might be used: 

#include <sys/types .h> 

^include <netinet/in .h> 

struct sockaddr_in sin; 

s = socket (AF_I NET. SOCK_STREAM, O) ; 

sin . sin_family = AF_INET; 

sin . sin_addr . s_addr = INADDR_ANY; 

sin.sin_port = MYPORT; 

bind(s, (char *)&sin, sizeof (sin)); 

Sockets with wildcarded local addresses may receive messages directed to the specified port 
number, and addressed to any of the possible addresses assigned a host. For example, if a host is 
on networks 46 and 10 and a socket is bound as above, then an accept call is performed, the pro- 
cess will be able to accept connection requests which arrive either from network 46 or network 
10. 

In a similar fashion, a local port may be left unspecified (specified as zero), in which case the sys- 
tem will select an appropriate port number for it. For example: 

sin . sin_addr . s_addr = MYADDRESS; 
sin.sin_port = 0; 

bind (s , (char *)&sin, sizeof (sin)); 

The system selects the port number based on two criteria. The first is that ports numbered 0 
through IPPORT_RESERVED— 1 are reserved for privileged users (that is, the super user). The 
second is that the port number is not currently bound to some other socket. In order to find a 
free port number in the privileged range the following code is used by the remote shell server: 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 23 


struct sockaddr_in sin; 

Iport = IPPORT_RESERVED - 1; 
sin . sin_addr . s_addr = I NADDR_ANY ; 

for (;;) { 

sin.sin_port = htons ( (u_short) Iport) ; 
if (bind(s, (caddr_t) &sin, sizeof (sin)) >= 0) 
break; 

if (errno != EADDRINUSE && errno != EADDRNOTAVAIL) •( 
perror ("socket") ; 
break; 

> 

Iport-- ; 

if (Iport == IPP0RT_RESERVED/2) { 

fprintf (stderr , "socket: All ports in use\n") ; 
break ; 

> 

> 

The restriction on allocating ports was done to allow processes executing in a secure environment 
to perform authentication based on the originating address and port number. 

In certain cases the algorithm used by the system in selecting port numbers is unsuitable for an 
application. This is due to associations being created in a two step process. For example, the 
Internet file transfer protocol, FTP, specifies that data connections must always originate from 
the same local port. However, duplicate associations are avoided by connecting to different 
foreign ports. In this situation the system would disallow binding the same local address and 
port number to a socket if a previous data connection’s socket were around. To override the 
default port selection algorithm then an option call must be performed prior to address binding: 

setsockopt (s, S0L_S0CKET, SO_REUSEADDR, (char *)0, 0) ; 
bind (s, (char *) &sin, sizeof (sin)); 

With the above call, local addresses may be bound which are already in use. This does not 
violate the uniqueness requirement as the system still checks at connect time to be sure any 
other sockets with the same local address and port do not have the same foreign address and 
port (if an association already exists, the error EADDRINUSE is returned). 

Local address binding by the system is currently done somewhat haphazardly when a host is on 
multiple networks. Logically, one would expect the system to bind the local address associated 
with the network through which a peer was communicating. For instance, if the local host is 
connected to networks 46 and 10 and the foreign host is on network 32, and traffic from network 
32 were arriving via network 10, the local address to be bound would be the host’s address on 
network 10, not network 46. This unfortunately, is not always the case. For reasons too compli- 
cated to discuss here, the local address bound may be appear to be chosen at random. This pro- 
perty of local address binding will normally be invisible to users unless the foreign host does not 
understand how to reach the address selected. 7 


7 For example, if network 46 were unknown to the host on network 32, and the local address were bound 
to that located on network 46, then even though a route between the two hosts existed through network 
10, a connection would fail. 


Sun Microsystems 


Release 2.0 


Page 24 


IPC Primer 


5.5. Broadcasting and Datagram Sockets 

By using a datagram socket it is possible to send broadcast packets on many networks supported 
by the system (the network itself must support the notion of broadcasting; the system provides 
no broadcast simulation in software). Broadcast messages can place a high load on a network 
since they force every host on the network to service them. 

To send a broadcast message, an Internet datagram socket should be created: 

s = socket (AF_INET, SOCK_DGRAM, O) ; 

and at least a port number should be bound to the socket: 

sin . sin_family = AF_INET; 

sin . sin_addr . s_addr = INADDR — ANY; 

sin.sin_port = MYPORT; 

bind (s ( (char *)&sin, sizeof (sin) ) ; 

Then the message should be addressed as: 

dst . sin_family = AF_INET; 
inet_makeaddr (net , INADDR_ANY) ; 
dst.sin_port = DESTPORT; 

and, finally, a sendto call may be used: 

sendto(s, buf, buflen, O, &dst, sizeof (dst)); 

Received broadcast messages contain the senders address and port (datagram sockets are 
anchored before a message is allowed to go out). 

There are a couple of minor problems in the above example. One is created because 
INADDR_ANY has two meanings: 

1. Fill in my own address, and, 

2. Broadcast. 

Unfortunately, broadcast must at some time in the future be changed to —1 instead of 0, so that 
broadcast will no longer be The second problem is how do you get your net number? You could 
use the SIOCGICONF ioctl call, or you could get your own address and do a \net_netof on that. 
INADDR_ANY. 

5.6. Signals 

Two new signals have been added to the system which may be used in conjunction with the IPC 
facilities. The SIGURG signal is associated with the existence of an urgent condition. The 
SIGIO signal is used with interrupt driven I/O (not presently implemented). SIGURG is 
currently supplied a process when out of band data is present at a socket. If multiple sockets 
have out of band data awaiting delivery, a select call may be used to determine those sockets 
with such data. 

An old signal which is useful when constructing server processes is SIGCHLD. This signal is 
delivered to a process when any children processes have changed state. Normally servers use the 
signal to reap child processes after exiting. For example, the remote login server loop shown in 
Figure 2 may be augmented as follows: 


Sun Microsystems 


Release 2.0 


IPC Primer 


Page 25 


int reaper () ; 

signal (SIGCHLD, reaper) ; 
listen (f , 10) ; 

for ( ; ; ) { 

int g, len = sizeof (from) ; 

g = accept (f, &from, &len, 0) ; 
if (g < o) { 

if (errno != EINTR) 

perror ("rlogind : accept"); 
continue; 

> 


> 

#include <wait.h> 
reaper () 

{ 

union wait status; 

while (wait3 (fistatus, WNOHANG, 0) > 0) 


> 

If the parent server process fails to reap its children, a large number of zombie processes may be 
created. 


Sun Microsystems 


Release 2.0 


Network Implementation 


Contents 


1. Introduction 2 

2. Overview 1 

3. Goals 2 

4. Internal Address Representation 2 

5. Memory Management g 

0. Internal Layering 4 

6.1. Socket Layer 4 

6.1.1. Socket State 5 

6.1.2. Socket Data Queues 6 

6.1.3. Socket Connection Queueing 6 

6.2. Protocol Layer(s) 7 

6.3. Network-Interface Layer 8 

7. Socket/Protocol Interface 10 

8. Protocol/Protocol Interface 13 

8.1. pr_output 13 

8.2. pr_input 14 

8.3. pr_ctlinput 14 

8.4. pr_ctloutput 15 

9. Protocol/Network-Interface Interface 15 

9.1. Packet Transmission 15 

9.2. Packet Reception 15 

10. Gateways and Routing Issues ig 

10.1. Routing Tables 15 

10.2. Routing Table Interface 18 

10.3. User-Level Routing Policies 18 

11. Raw Sockets ig 

11.1. Control Blocks 19 


11.2. Input Processing 19 

11.3. Output Processing 20 

12. Buffering and Congestion Control 20 

12.1. Memory Management 20 

12.2. Protocol Buffering Policies 21 

12.3. Queue Limiting 21 

12.4. Packet Forwarding 21 

13. Out of Band Data 22 

A. Acknowledgements and References 22 

B. References 22 


Network Implementation 


1. Introduction 

This report describes the internal structure of the networking facilities of the Sun Workstation 
version of the UNEXf operating system. These facilities are derived from the networking facili- 
ties added at U.C. Berkeley in the Berkeley 4.2 release of the system. The system provides a 
uniform user interface to networking, and a structure that permits system implementors to add 
new facilities. The internal structure is not visible to the user, rather it is intended to aid imple- 
mentors of communication protocols and network services by providing a framework that pro- 
motes code sharing and minimizes implementation effort. 

The reader is expected to be familiar with the C programming language and system interface, as 
described in the System Interface Overview at the beginning of the Sun System Interface 
Manual. Basic understanding of network communication concepts is assumed; where required 
any additional ideas are introduced. 

The remainder of this document provides a description of the system internals, avoiding, when 
possible, those portions utilized only by the interprocess communication facilities. 


2* Overview 

If we consider the International Standards Organization’s (ISO) Open System Interconnection 
(OSI) model of network communication [IS081] [Zimmermann80], the networking facilities 
described here correspond to a portion of the session layer (layer 3) and all of the transport and 
network layers (layers 2 and 1, respectively). 

The network layer provides possibly imperfect data transport services with minimal addressing 
structure. Addressing at this level is normally host to host, with implicit or explicit routing 
optionally supported by the communicating agents. 

At the transport layer the notions of reliable transfer, data sequencing, flow control, and service 
addressing are normally included. Reliability is usually managed by explicit acknowledgement of 
data delivered. Failure to acknowledge a transfer results in retransmission of the data. Sequenc- 
ing may be handled by tagging each message handed to the network layer by a sequence number 
and maintaining state at the endpoints of communication to utilize received sequence numbers in 
reordering data which arrives out of order. 

The session layer facilities may provide forms of addressing which are mapped into formats 
required by the transport layer, service authentication and client authentication, etc. Various 
systems also provide services such as data encryption and address and protocol translation. 

The following sections begin by describing some of the common data structures and utility rou- 
tines, then examine the internal layering. The contents of each layer and its interface are con- 
sidered. Certain of the interfaces are protocol implementation specific. For these cases 

t UNIX is a trademark of Bell Laboratories. 


Sun Microsystems 


Release 2.0 


Page 2 


Network Implementation 


examples have been drawn from the Internet [Cerf78] protocol family. Later sections cover rout- 
ing issues, the design of the raw socket interface and other miscellaneous topics. 


3. Goals 

The networking system was designed with the goal of supporting multiple protocol families and 
addressing styles. This required information to be “hidden” in common data structures which 
could be manipulated by all the pieces of the system, but which required interpretation only by 
the protocols which “controlled” it. The system described here attempts to minimize the use of 
shared data structures to those kept by a suite of protocols (a protocol family), and those used 
for rendezvous between “synchronous” and “asynchronous” portions of the system (for example, 
queues of data packets are filled at interrupt time and emptied based on user requests). 

A major goal of the system was to provide a framework within which new protocols and 
hardware could easily be supported. To this end, a great deal of effort has been extended to 
create utility routines which hide many of the more complex and/or hardware dependent chores 
of networking. Later sections describe the utility routines and the underlying data structures 
they manipulate. 


4. Internal Address Representation 

Common to all portions of the system are two data structures. These structures are used to 
represent addresses and various data objects. Addresses, internally are described by the 
sockaddr structure, 

struct sockaddr •( 

short sa_family ; /* data format identifier */ 

char sa_data [14] ; /* address */ 

>; 

All addresses belong to one or more address families which define their format and interpreta- 
tion. The sa_family field indicates which address family the address belongs to, the sa^data field 
contains the actual data value. The size of the data field, 14 bytes, was selected based on a 
study of current address formats 


5. Memory Management 

A single mechanism is used for data storage: memory buffers, or mbufs. An mbuf is a structure 
of the form: 

struct mbuf ■( 


struct 

mbuf 4 m_next; 

/* 

next 

buffer in chain */ 

u_long 

m_of f ; 

/* 

offset of data */ 

short 

m_len; 

/* 

amount of data in this mbuf */ 

short 

m_ type; 

/* 

mbuf 

type (accounting) */ 

u_char 

m_dat [MLEN] ; 

/* 

data 

storage */ 

struct 

mbuf *m_act; 

/* 

link 

in higher -level mbuf list 


>; 

The m_next field is used to chain mbufs together on linked lists, while the m_act field allows lists 
of mbufs to be accumulated. By convention, the mbufs common to a single object (for example, 


w Sun Microsystems 


Release 2.0 


Network Implementation 


Page 3 


a packet) are chained together with the m_next field, while groups of objects are linked via the 
m_act field (possibly when in a queue). 

Each mbuf has a small data area for storing information, m_dat. The mjlen field indicates the 
amount of data, while the m_off field is an offset to the beginning of the data from the base of 
the mbuf. Thus, for example, the macro mtod, which converts a pointer to an mbuf to a pointer 
to the data stored in the mbuf, has the form 

#define mtod(x,t) ( (t) ( (int) (x) + (x) ->m_of f) ) 

(note the t parameter, a C type cast, is used to cast the resultant pointer for proper assignment). 

In addition to storing data directly in the mbuf’s data area, data of page size may be also be 
stored in a separate area of memory. The mbuf utility routines maintain a pool of pages for this 
purpose and manipulate a private page map for such pages. The virtual addresses of these data 
pages precede those of mbufs, so when pages of data are separated from an mbuf, the mbuf data 
offset is a negative value. An array of reference counts on pages is also maintained so that copies 
of pages may be made without core to core copying (copies are created simply by duplicating 
the relevant page table entries in the data page map and incrementing the associated reference 
counts for the pages). Separate data pages are currently used only when copying data from a 
user process into the kernel, and when bringing data in at the hardware level. Routines which 
manipulate mbufs are not normally aware if data is stored directly in the mbuf data array, or if 
it is kept in separate pages. 

The following utility routines are available for manipulating mbuf chains: 
m = m_copy(mO, off, len); 

The m_copy routine create a copy of all, or part, of a list of the mbufs in mO. Len bytes of 
data, starting off bytes from the front of the chain, are copied. Where possible, reference 
counts on pages are used instead of core to core copies. The original mbuf chain must have 
at least off + len bytes of data. If len is specified as M_COPYALL, all the data present, 
offset as before, is copied. 

m_cat (m , n) ; 

The mbuf chain, n, is appended to the end of m. Where possible, compaction is performed. 
m_adj (m, diff) ; 

The mbuf chain, m is adjusted in size by diff bytes. If diff is non-negative, diff bytes are 
shaved off the front of the mbuf chain. If diff is negative, the alteration is performed from 
back to front. No space is reclaimed in this operation, alterations are accomplished by 
changing the mjlen and m_off fields of mbufs. 

m = m_pul lup (mO, size); 

After a successful call to m_pullup, the mbuf at the head of the returned list, m, is 
guaranteed to have at least size bytes of data in contiguous memory (allowing access via a 
pointer, obtained using the mtod macro). If the original data was less than size bytes long, 
len was greater than the size of an mbuf data area (112 bytes), or required resources were 
unavailable, m is 0 and the original mbuf chain is deallocated. 

This routine is particularly useful when verifying packet header lengths on reception. For 
example, if a packet is received and only 8 of the necessary 16 bytes required for a valid 
packet header are present at the head of the list of mbufs representing the packet, the 
remaining 8 bytes may be “pulled up” with a single m_pu//up call. If the call fails the 
invalid packet will have been discarded. 

By insuring mbufs always reside on 128 byte boundaries it is possible to always locate the mbuf 
associated with a data area by masking off the low bits of the virtual address. This allows 


Sun Microsystems 


Release 2.0 


Page 4 


Network Implementation 


modules to store data structures in mbufs and pass them around without concern for locating the 
original mbuf when it comes time to free the structure. The dtom macro is used to convert a 
pointer into an mbuf’s data area to a pointer to the mbuf, 

#define dtom(x) ((struct mbuf *)((int)x & ~ (MSIZE-1) ) ) 

Mbufs are used for dynamically allocated data structures such as sockets, as well as memory allo- 
cated for packets. Statistics are maintained on mbuf usage and can be viewed by users using the 
netstat( 8) program. 


6. Internal Layering 

The internal structure of the network system is divided into three layers. These layers 
correspond to the services provided by the socket abstraction, those provided by the communica- 
tion protocols, and those provided by the hardware interfaces. The communication protocols are 
normally layered into two or more individual cooperating layers, though they are collectively 
viewed in the system as one layer providing services supportive of the appropriate socket 
abstraction. 

The following sections describe the properties of each layer in the system and the interfaces each 
must conform to. 


6.1. Socket Layer 

The socket layer deals with the interprocess communications facilities provided by the system. 
A socket is a bidirectional endpoint of communication which is “typed” by the semantics of com- 
munication it supports. The system calls described in the System Interface Overview are used to 
manipulate sockets. 


A socket consists of the following data structure: 

struct socket { 


short 

so_type; 

short 

so_options ; 

short 

so_linger ; 

short 

so_state; 

caddr_t 

so_pcb ; 

struct 

protosw *so_proto; 

struct 

socket *so_head; 

struct 

socket *so_qO; 

short 

so_q01en; 

struct 

socket *so_q; 

short 

so_qlen; 

short 

so_qlimit; 

struct 

sockbuf so_snd; 

struct 

s o ckbu f s o_r cv ; 

short 

so_timeo; 

u_short 

so_error ; 

short 

so_oobmark ; 

short 

so_pgrp ; 


/* generic type */ 

/* from socket call */ 

/* time to linger while closing */ 
/* internal state flags */ 

/* protocol control block */ 

/* protocol handle */ 

/* back pointer to accept socket */ 
/* queue of partial connections */ 
/* partials on so_qO */ 

/* queue of incoming connections */ 
/* number of connections on so_q */ 
/* max number queued connections */ 
/* send queue */ 

/* receive queue */ 

/* connection timeout */ 

/* error affecting connection */ 

/* chars to oob mark */ 

/* pgrp for signals */ 


}; 


Each socket contains two data queues, «o_rcr and so_snd, and a pointer to routines which 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 5 


provide supporting services. The type of the socket, so_type is defined at socket creation time 
and used in selecting those services which are appropriate to support it. The supporting protocol 
is selected at socket creation time and recorded in the socket data structure for later use. Pro- 
tocols are defined by a table of procedures, the protosw structure, which will be described in 
detail later. A pointer to a protocol specific data structure, the “protocol control block” is also 
present in the socket structure. Protocols control this data structure and it normally includes a 
back pointer to the parent socket structure(s) to allow easy lookup when returning information 
to a user (for example, placing an error number in the so_error field). The other entries in the 
socket structure are used in queueing connection requests, validating user requests, storing 
socket characteristics (for example, options supplied at the time a socket is created), and main- 
taining a socket’s state. 

Processes “rendezvous at a socket” in many instances. For instance, when a process wishes to 
extract data from a socket’s receive queue and it is empty, or lacks sufficient data to satisfy the 
request, the process blocks, supplying the address of the receive queue as an “wait channel’ to be 
used in notification. When data arrives for the process and is placed in the socket’s queue, the 
blocked process is identified by the fact it is waiting “on the queue”. 


6.1.1. Socket State 


A socket’s state is defined from the following: 


#def ine SS_NOFDREF OxOOl 
#def ine SS_ISCONNECTED 0x002 
#def ine SS_ISCONNECTING 0x004 
#def ine SS_ISDISCONNECTING 0x008 
#def ine SS_CANTSENDMORE 0x010 
#def ine SS_CANTRCVMORE 0x020 
#def ine SS_CONNAWAITING 0x040 
#def ine S S_RC VATMARK 0x080 
#def ine SS_PRIV OxlOO 
#def ine SS_NBI0 0x200 
#def ine SS_ASYNC 0x400 


/* no file table ref any more */ 

/* socket connected to a peer */ 

/* in process of connecting to peer */ 
/* in process of disconnecting */ 

/* can't send more data to peer */ 

/* can't receive more data from peer */ 
/* connections awaiting acceptance */ 

/* at mark on input */ 

/* privileged */ 

/* non-blocking ops */ 

/* async i/o notify */ 


The state of a socket is manipulated both by the protocols and the user (through system calls). 
When a socket is created the state is defined based on the type of input/output the user wishes 
to perform. “Non-blocking” I/O implies a process should never be blocked to await resources. 
Instead, any call which would block returns prematurely with the error EWOULDBLOCK (the 
service request may be partially fulfilled, for example, a request for more data than is present). 


If a process requested “asynchronous” notification of events related to the socket the SIGIO sig- 
nal is posted to the process. An event is a change in the socket’s state, examples of such occu- 
rances are: space becoming available in the send queue, new data available in the receive queue, 
connection establishment or disestablishment, etc. 


A socket may be marked “priviledged” if it was created by the super-user. Only priviledged 
sockets may send broadcast packets, or bind addresses in priviledged portions of an address 
space. 


Sun Microsystems 


Release 2.0 


Page 6 


Network Implementation 


6.1.2. Socket Data Queues 


A socket’s data queue contains a pointer to the data stored in the queue and other entries 
related to the management of the data. The following structure defines a data queue: 


struct sockbuf •( 


short 

sb_cc ; 

short 

sb_hiwat ; 

short 

sb_mbcnt ; 

short 

sb_mbmax ; 

short 

sb_lowat ; 

short 

sb_timeo ; 

struct 

mbuf *sb_mb; 

struct 

proc *sb_sel ; 

short 

sb_ flags; 


/* actual chars in buffer */ 

/* max actual char count */ 

/* chars of mbufs used */ 

/* max chars of mbufs to use */ 

/* low water mark */ 

/* timeout */ 

/* the mbuf chain */ 

/* process selecting read/write */ 
/* flags, see below */ 


Data is stored in a queue as a chain of mbufs. The actual count of characters as well as high and 
low water marks are used by the protocols in controlling the flow of data. The socket routines 
cooperate in implementing the flow control policy by blocking a process when it requests to send 
data and the high water mark has been reached, or when it requests to receive data and less 
than the low water mark is present (assuming non-blocking I/O has not been specified). 


When a socket is created, the supporting protocol “reserves” space for the send and receive 
queues of the socket. The actual storage associated with a socket queue may fluctuate during a 
socket’s lifetime, but is assumed this reservation will always allow a protocol to acquire enough 
memory to satisfy the high water marks. 

The timeout and select values are manipulated by the socket routines in implementing various 
portions of the interprocess communications facilities and will not be described here. 

A socket queue has a number of flags used in synchronizing access to the data and in acquiring 
resources; 


tdefine SB_LOCK 0x01 
#define SB_WANT 0x02 
#def ine SB_WAIT 0x04 
#define SB_SEL 0x08 
#def ine SB_C0LL 0x10 


/* lock on data queue (so_rcv only) */ 
/* someone is waiting to lock */ 

/* someone is waiting for data/space */ 
/* buffer is selected */ 

/* collision selecting */ 


The last two flags are manipulated by the system in implementing the select mechanism. 


6 . 1 . 8 . Socket Connection Queueing 

In dealing with connection oriented sockets (for example, SOCK_STREAM) the two sides are 
considered distinct. One side is termed active, and generates connection requests. The other 
side is called passive and accepts connection requests. 

From the passive side, a socket is created with the option SO_ACCEPTCONN specified, creat- 
ing two queues of sockets: so_qO for connections in progress and so_q for connections already 
made and awaiting user acceptance. As a protocol is preparing incoming connections, it creates 
a socket structure queued on so_qO by calling the routine sonewconnQ. When the connection is 
established, the socket structure is then transfered to so_q, making it available for an accept. 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 7 


If an SO_ACCEPTCONN socket is closed with sockets on either so_qO or ao_g, these sockets are 
dropped. 


6.2. Protocol Layer(s) 


Protocols are described by a set of entry points and certain socket visible characteristics, some of 
which are used in deciding which socket type(s) they may support. 

An entry in the “protocol switch” table exists for each protocol module configured into the sys- 
tem. It has the following form: 


struct protosw •( 

short pr_type ; 
short pr_f amily ; 
short pr_protocol; 
short pr_flags; 

/* protocol-protocol hooks */ 
int (*pr_input) () ; 

int (*pr_output) () ; 

int ( *pr_ctl input) () ; 

int (*pr_ctloutput) () ; 

/* user-protocol hook */ 

int (*pr_usrreq) () ; 

/* utility hooks */ 

int (*pr_init) () ; 

int (*pr_fasttimo) () ; 

int (*pr_slowtimo) () ; 

int (*pr_drain) () ; 

>; 

A protocol is called through the pr_init entry before any other. Thereafter it is called every 200 
milliseconds through the prjasttimo entry and every 500 milliseconds through the pr_slowtimo 
for timer based actions. The system will call the pr_drain entry if it is low on space and this 
should throw away any non-critical data. 

Protocols pass data between themselves as chains of mbufs using the prjnput and pr_output rou- 
tines. Prjnput passes data up (towards the user) and pr_output passes it down (towards the net- 
work); control information passes up and down on pr_ctltnput and pr_ctloutput. The protocol is 
responsible for the space occupied by any the arguments to these entries and must dispose of it. 

The pr_userreq routine interfaces protocols to the socket code and is described below. 


/* socket type used for */ 

/* protocol family * / 

/* protocol number */ 

/* socket visible attributes */ 

/* input to protocol (from below) */ 
/* output to protocol (from above) */ 
/* control input (from below) */ 

/* control output (from above) */ 

/* user request */ 

/* initialization routine */ 

/* fast timeout (200ms) */ 

/* slow timeout (500ms) */ 

/* flush any excess space possible */ 


The pr_flags field is constructed from the following values: 


#def ine PR_AT0MIC 0x01 
#def ine PR_ADDR 0x02 
#def ine PR_CONNREQUIRED 0x04 
#def ine PR_WANTRCVD 0x08 
#def ine PR_RIGHTS 0x10 


/* exchange atomic messages only */ 

/* addresses given with messages */ 

/* connection required by protocol */ 
/* want PRU_RCVD calls */ 

/* passes capabilities */ 


Protocols which are connection-based specify the PR_CONNREQUIRED flag so that the socket 
routines will never attempt to send data before a connection has been established. If the 
PR_WANTRCVD flag is set, the socket routines will notfiy the protocol when the user has 
removed data from the socket’s receive queue. This allows the protocol to implement ack- 
nowledgement on user receipt, and also update windowing information based on the amount of 


Sun Microsystems 


Release 2.0 


Page 8 


Network Implementation 


space available in the receive queue.. The PR_ADDR field indicates any data placed in the 
socket’s receive queue will be preceded by the address of the sender. The PR_ATOMIC flag 
specifies each user request to send data must be performed in a single protocol send request; it is 
the protocol’s responsibility to maintain record boundaries on data to be sent. The PR_RIGHTS 
flag indicates the protocol supports the passing of capabilities; this is currently used only the 
protocols in the UNIX protocol family. 

When a socket is created, the socket routines scan the protocol table looking for an appropriate 
protocol to support the type of socket being created. The pr_type field contains one of the possi- 
ble socket types (for example, SOCK_STREAM), while the pr^famtly field indicates which proto- 
col family the protocol belongs to. The pr_protocol field contains the protocol number of the 
protocol, normally a well known value. 


6.3. Network-Interface Layer 

Each network-interface configured into a system defines a path through which packets may be 
sent and received. Normally a hardware device is associated with this interface, though there is 
no requirement for this (for example, all systems have a software “loopback” interface used for 
debugging and performance analysis). In addition to manipulating the hardware device, an inter- 
face module is responsible for encapsulation and deencapsulation of any low level header infor- 
mation required to deliver a message to it’s destination. The selection of which interface to use 
in delivering packets is a routing decision carried out at a higher level than the network- 
interface layer. Each interface normally identifies itself at boot time to the routing module so 
that it may be selected for packet delivery. 

An interface is defined by the following structure, 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 9 


struct if net •{ 


char 

4 i f _name ; 

short 

if_unit; 

short 

i f _mtu ; 

int 

i f _net ; 

short 

if_flags; 

short 

if_timer ; 

int 

if_host [2] ; 

struct 

sockaddr if_addr; 

union ■( 

struct sockaddr 
struct sockaddr 
> i f _i f u ; 

struct 

if queue if_snd; 

int 

(* if— init) () ; 

int 

(*if_output) () ; 

int 

(*if_ioctl) () ; 

int 

(*if_reset) () ; 

int 

( 4 if_watchdog) () ; 

int 

if_ipackets; 

int 

if_ierrors; 

int 

if_opackets; 

int 

if_oerrors; 

int 

if_collisions; 

struct 

if net *if_next; 


}; 

Each interface has a send queue 
if_output. If the interface resides on a 
reset has been performed. An interface 
should be called every if_timer seconds (if non-zero). 


/* name, for example , en ' 1 or "lo’’ 
/* sub-unit for lower level driver */ 
/* maximum transmission unit */ 

/* network number of interface */ 

/* up/down, broadcast, etc. */ 

/* time 'til if_watchdog called 4 / 

/* local net host number */ 

/* address of interface */ 

i f u_broadaddr ; 

ifu_dstaddr ; 

/* output queue */ 

/* init routine */ 

/* output routine */ 

/* ioctl routine */ 

/* bus reset routine */ 

/* timer routine */ 

/* packets received on interface */ 

/* input errors on interface */ 

/* packets sent on interface */ 

/* output errors on interface */ 

/* collisions on csma interfaces */ 


and routines used for initialization, if_init, and output, 
system bus, the routine if^reset will be called after a bus 
may also specify a timer routine, if_watchdog, which 


The state of an interface and certain characteristics are stored in the ifjlags field. The follow- 
ing values are possible: 


#def ine 

IFF_UP 

Oxl 

/* 

interface is up */ 

#def ine 

IFF_BROADCAST 

0x2 

/* 

broadcast address valid */ 

#define 

IFF_DEBUG 

0x4 

/* 

turn on debugging */ 

#define 

IFF_R0UTE 

0x8 

/* 

routing entry installed */ 

#define 

IFF_POINTOPOINT 

0x10 

/* 

interface is point-to-point link */ 

#def ine 

IFF_NOTRAILERS 

0x20 

/* 

avoid use of trailers */ 

#define 

IFF_RUNNING 

0x40 

/* 

resources allocated */ 


If the interface is connected to a network which supports transmission of broadcast packets, the 
IFF_BROADCAST flag will be set and the if_broadaddr field will contain the address to be used 
in sending or accepting a broadcast packet. If the interface is associated with a point to point 
hardware link (for example, a DEC DMR-11), the IFF_POINTOPOINT flag will be set and 
if_dstaddr will contain the address of the host on the other side of the connection. These 
addresses and the local address of the interface, if_addr, are used in filtering incoming packets. 
The interface sets IFF_RUNNING after it has allocated system resources and posted an initial 
read on the device it manages. This state bit is used to avoid multiple allocation requests when 
an interface’s address is changed. The IFF_NOTRAILERS flag indicates the interface should 
refrain from using a trailer encapsulation on outgoing packets. 1 


1 Trailer protocols are normally disabled on the Sun Workstation. 


Sun Microsystems 


Release 2.0 


Page 10 


Network Implementation 


The information stored in an ifnet structure for point to point communication devices is not 
currently used by the system internally. Rather, it is used by the user level routing process in 
determining host network connections and in initially devising routes (refer to chapter 10 for 
more information). 

Various statistics are also stored in the interface structure. These may be viewed by users using 
the netatat{ 1) program. 

The interface address and flags may be set with the SIOCSIFADDR and SIOCSIFFLAGS ioctls. 
SIOCSIFADDR is used to initially define each interface’s address; SIOGSIFFLAGS can be used 
to mark an interface down and perform site-specific configuration. 


7. Socket /Protocol Interface 

The interface between the socket routines and the communication protocols is through the 
pr_usrreq routine defined in the protocol switch table. The following requests to a protocol 
module are possible: 

#define PRU_ATTACH O /* attach protocol */ 

#define PRU_DETACH 1 /* detach protocol */ 

#define PRU_BIND 2 /* bind socket to address */ 

#define PRU_LISTEN 3 /* listen for connection */ 

#define PRU_CONNECT 4 /* establish connection to peer */ 

#define PRU_ACCEPT 5 /* accept connection from peer */ 

#define PRU_DISCONNECT 6 /* disconnect from peer */ 

#define PRU_SHUTDOWN 7 /* won't send any more data */ 

#define PRU_RCVD 8 /* have taken data; more room now */ 

#define PRU_SEND 9 /* send this data */ 

#def ine PRU_ABORT 10 /* abort (fast DISCONNECT, DETATCH) */ 

#define PRU_CONTROL 11 /* control operations on protocol */ 

#define PRU_SENSE 12 /* return status into m */ 

#define PRU_RCVOOB 13 /* retrieve out of band data */ 

#define PRU_SENDOOB 14 /* send out of band data */ 

#define PRU_SOCKADDR 15 /* fetch socket's address */ 

#define PRU_PEERADDR 16 /* fetch peer's address */ 

#define PRU_C0NNECT2 17 /* connect two sockets */ 

/* begin for protocols internal use */ 

#define PRU_FASTTIMO 18 /* 200ms timeout */ 

#define PRU_SLOWTIMO 19 /* 500ms timeout */ 

#define PRU_PR0T0RCV 20 /* receive from below */ 

#def ine PRU_PR0T0SEND 21 / 4 send to below */ 

A call on the user request routine is of the form, 

error = (‘protosw[] ,pr_usrreq) (up, req, m, addr, rights) ; 
int error; 
struct socket *up; 
int req; 

struct mbuf 4 m, ‘rights; 
caddr_t addr ; 

The mbuf chain, m, and the address are optional parameters. The rights parameter is an 
optional pointer to an mbuf chain containing user specified capabilities (see the sendmsg and 
recvmsg system calls). The protocol is responsible for disposal of both mbuf chains. A non-zero 
return value gives a UNIX error number which should be passed to higher level software. The 


Sun Microsystems 


Release 2.0 


Network Implementation 


following paragraphs describe each of the requests possible. 

PRU_ATTACH 

When a protocol is bound to a socket (with the socket system call) the protocol module is 
called with this request. It is the responsibility of the protocol module to allocate any 
resources necessary. The “attach” request will always precede any of the other requests, 
and should not occur more than once. 

PRU_DETACH 

This is the antithesis of the attach request, and is used at the time a socket is deleted. The 
protocol module may deallocate any resources assigned to the socket. 

PRU_BIND 

When a socket is initially created it has no address bound to it. This request indicates an 
address should be bound to an existing socket. The protocol module must verify the 
requested address is valid and available for use. 

PRU_LISTEN 

The “listen” request indicates the user wishes to listen for incoming connection requests on 
the associated socket. The protocol module should perform any state changes needed to 
carry out this request (if possible). A “listen” request always precedes a request to accept a 
connection. 

PRU_CONNECT 

The “connect” request indicates the user wants to a establish an association. The addr 
parameter supplied describes the peer to be connected to. The effect of a connect request 
may vary depending on the protocol. Virtual circuit protocols, such as TCP [Postel80b], use 
this request to initiate establishment of a TCP connection. Datagram protocols, such as 
UDP [Postel79], simply record the peer’s address in a private data structure and use it to tag 
all outgoing packets. There are no restrictions on how many times a connect request may be 
used after an attach. If a protocol supports the notion of multi-casting, it is possible to use 
multiple connects to establish a multi-cast group. Alternatively, an association may be bro- 
ken by a PRU_DISCONNECT request, and a new association created with a subsequent con- 
nect request; all without destroying and creating a new socket. 

PRU_ACCEPT 

Following a successful PRU_LISTEN request and the arrival of one or more connections, this 
request is made to indicate the user has accepted the first connection on the queue of pend- 
ing connections. The protocol module should fill in the supplied address buffer with the 
address of the connected party. 

PRU_DISCONNECT 

Eliminate an association created with a PRU_CONNECT request. 

PRU_SHUTDOWN 

This call is used to indicate no more data will be sent and/or received (the addr parameter 
indicates the direction of the shutdown, as encoded in the soshutdown system call). The pro- 
tocol may, at its discretion, deallocate any data structures related to the shutdown. 

PRU_RCVD 

This request is made only if the protocol entry in the protocol switch table includes the 
PR_WANTRCVD flag. When a user removes data from the receive queue this request will 
be sent to the protocol module. It may be used to trigger acknowledgements, refresh win- 
dowing information, initiate data transfer, etc. 


Sun Microsystems 


Release 2.0 


Page 12 


Network Implementation 


PRU_SEND 

Each user request to send data is translated into one or more PRU_SEND requests (a proto- 
col may indicate a single user send request must be translated into a single PRU_SEND 
request by specifying the PR_ATOMIC flag in its protocol description). The data to be sent 
is presented to the protocol as a list of mbufs and an address is, optionally, supplied in the 
addr parameter. The protocol is responsible for preserving the data in the socket’s send 
queue if it is not able to send it immediately, or if it may need it at some later time (for 
example, for retransmission). 

PRU_ABORT 

This request indicates an abnormal termination of service. The protocol should delete any 
existing assoc iation(s). 

PRU_CONTROL 

The “control” request is generated when a user performs a UNIX ioctl system call on a 
socket (and the ioctl is not intercepted by the socket routines). It allows protocol-specific 
operations to be provided outside the scope of the common socket interface. The addr 
parameter contains a pointer to a static kernel data area where relevant information may be 
obtained or returned. The m parameter contains the actual ioctl request code (note the 
non-standard calling convention). 

PRU_SENSE 

The “sense” request is generated when the user makes an fstat system call on a socket; it 
requests status of the associated socket. There currently is no common format for the status 
returned. Information which might be returned includes per-connection statistics, protocol 
state, resources currently in use by the connection, the optimal transfer size for the connec- 
tion (based on windowing information and maximum packet size). The addr parameter con- 
tains a pointer to a static kernel data area where the status buffer should be placed. 

PRU_RCVOOB 

Any “out-of-band” data presently available is to be returned. An mbuf is passed in to the 
protocol module and the protocol should either place data in the mbuf or attach new mbufs 
to the one supplied if there is insufficient space in the single mbuf. 

PRU_SENDOOB 

Like PRU_SEND, but for out-of-band data. 

PRU_SOCKADDR 

The local address of the socket is returned, if any is currently bound to the it. The address 
format (protocol specific) is returned in the addr parameter. 

PRU_PEERADDR 

The address of the peer to which the socket is connected is returned. The socket must be in 
a SS_ISCONNECTED state for this request to be made to the protocol. The address format 
(protocol specific) is returned in the addr parameter. 

PRU_CONNECT2 

The protocol module is supplied two sockets and requested to establish a connection between 
the two without binding any addresses, if possible. This call is used in implementing the 
socketpair(2) system call. 

The following requests are used internally by the protocol modules and are never generated by 
the socket routines. In certain instances, they are handed to the pr_usrreq routine solely for 
convenience in tracing a protocol’s operation (for example, PRU_SLOWTIMO). 


^ Sun Microsystems 


Release 2.0 


Network Implementation 


PRU_FASTTIMO 

A “fast timeout” has occured. This request is made when a timeout occurs in the protocol’s 
pr_fastimo routine. The addr parameter indicates which timer expired. 

PRUJ3LOWTIMO 

A “slow timeout” has occured. This request is made when a timeout occurs in the protocol’s 
pr_slowtimo routine. The addr parameter indicates which timer expired. 

PRU_PROTORCV 

This request is used in the protocol-protocol interface, not by the routines. It requests 
reception of data destined for the protocol and not the user. No protocols currently use this 
facility. 

PRU_PROTOSEND 

This request allows a protocol to send data destined for another protocol module, not a user. 
The details of how data is marked “addressed to protocol” instead of “addressed to user” 
are left to the protocol modules. No protocols currently use this facility. 


8. Protocol/Protocol Interface 

The interface between protocol modules is through the pr_usrreq, pr_input, pr_output, 
pr_ctlinput, and pr_ctloutput routines. The calling conventions for all but the pr_usrreq routine 
are expected to be specific to the protocol modules and are not guaranteed to be consistent 
across protocol families. We will examine the conventions used for some of the Internet proto- 
cols in this section as an example. 


8.1. pr_output 

The Internet protocol UDP uses the convention, 

error = udp_output (inp , m) ; 
int error; 
struct inpcb *inp; 
struct mbuf *m; 

where the inp, “internet protocol control 61ock”, passed between modules conveys per connec- 
tion state information, and the mbuf chain contains the data to be sent. UDP performs con- 
sistency checks, appends its header, calculates a checksum, etc. before passing the packet on to 
the IP module: 

error = ip_output(m, opt, ro, allowbroadcast); 
int error; 

struct mbuf *m, *opt; 
struct route *ro; 
int allowbroadcast; 

The call to IP’s output routine is more complicated than that for UDP, as befits the additional 
work the IP module must do. The m parameter is the data to be sent, and the opt parameter is 
an optional list of IP options which should be placed in the IP packet header. The ro parameter 
is is used in making routing decisions (and passing them back to the caller). The final parame- 
ter, allowbroadcast is a flag indicating if the user is allowed to transmit a broadcast packet. This 
may be inconsequential if the underlying hardware does not support the notion of broadcasting. 


Sun Microsystems 


Release 2.0 


Page 14 


Network Implementation 


All output routines return 0 on success and a UNIX error number if a failure occured which could 
be immediately detected (no buffer space available, no route to destination, etc.). 


8.2. pr_input 

Both UDP and TCP use the following calling convention, 

(void) (*protosw[] .pr_input) (m) ; 
struct mbuf *m; 

Each mbuf list passed is a single packet to be processed by the protocol module. 

The IP input routine is a software interrupt level routine, and so is not called with any parame- 
ters. It instead communicates with network interfaces through a queue, ipintrq, which is identi- 
cal in structure to the queues used by the network interfaces for storing packets awaiting 
transmission. 


8.3. pr_ctlinput 


This routine is used to convey “control” information to a protocol module (i.e. information which 
might be passed to the user, but is not data). This routine, and the pr_ctloutput routine, have 
not been extensively developed, and thus suffer from a “clumsiness” that can only be improved 
as more demands are placed on it. 

The common calling convention for this routine is, 

(void) (*protosw[] .pr_ctlinput) (req, info) ; 
int req; 
caddr_t info; 


The req parameter is one of the following, 


#define PRC_IEDOWN 0 
#define PRC_ROUTEDEAD 1 
#def ine PRC_QUENCH 4 
^define PRC_HOSTDEAD 6 
#def ine PRC_HOSTUNREACH 7 
#def ine PRC_UNREACH_NET 8 
#def in© PRC_UNREACH_HOST 9 
#def ine PRC_UNREACH_PROTOCOL 10 
#def ine PRC_UNREACH_PORT 11 
#def ine PRC_MSGSIZE 12 
#define PRC_RE D I RE CT_NE T 13 
#def ine PRC_RE D I RE CT_HOS T 14 
#def in© PRC_TIMXCEED_INTRANS 17 
#define PRC_TIMXCEED_REASS 18 
#def in© PRC_PARAMPROB 19 


/* interface transition */ 

/* select new route if possible */ 

/* some said to slow down */ 

/* normally from IMP */ 

/* ditto */ 

/* no route to network */ 

/* no rout© to host */ 

/* dst says bad protocol */ 

/* bad port # */ 

/* message size forced drop */ 

/* net routing redirect */ 

/* host routing redirect */ 

/* packet lifetime expired in transit */ 
/* lifetime expired on reass q a / 

/* header incorrect */ 


while the info parameter is a “catchall” value which is request dependent. Many of the requests 
have obviously been derived from ICMP (the Internet Control Message Protocol), and from error 
messages defined in the 1822 host/IMP convention [BBN78], Mapping tables exist to convert 
control requests to UNIX error codes which are delivered to a user. 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 15 


8.4. pr_ctloutput 

This routine is not currently used by any protocol modules. 


9. Protocol/Network-Interface Interface 

The lowest layer in the set of protocols which comprise a protocol family must interface itself to 
one or more network interfaces in order to transmit and receive packets. It is assumed that any 
routing decisions have been made before handing a packet to a network interface, in fact this is 
absolutely necessary in order to locate any interface at all (unless, of course, one uses a single 
“hardwired” interface). There are two cases to be concerned with, transmission of a packet, and 
receipt of a packet; each will be considered separately. 


9.1. Packet Transmission 

Assuming a protocol has a handle on an interface, ifp, a (struct ifnet *), it transmits a fully for- 
matted packet with the following call, 

error = (*ifp->if_output) (ifp. m, dst) 
int error; 
struct ifnet *ifp; 
struct mbuf ♦in- 
struct sockaddr *dst; 

The output routine for the network interface transmits the packet m to the dst address, or 
returns an error indication (a UNIX error number). In reality transmission may not be immedi- 
ate, or successful; normally the output routine simply queues the packet on its send queue and 
primes an interrupt driven routine to actually transmit the packet. For unreliable mediums, 
such as the Ethernet, “successful” transmission simply means the packet has been placed on the 
cable without a collision. On the other hand, an 1822 interface guarantees proper delivery or an 
error indication for each message transmitted. The model employed in the networking system 
attaches no promises of delivery to the packets handed to a network interface, and thus 
corresponds more closely to the Ethernet. Errors returned by the output routine are normally 
trivial in nature (no buffer space, address format not handled, etc.). 


9.2. Packet Reception 

Each protocol family must have one or more “lowest level” protocols. These protocols deal with 
internetwork addressing and are responsible for the delivery of incoming packets to the proper 
protocol processing modules. In the PUP model [Boggs78] these protocols are termed Level 1 
protocols, in the ISO model, network layer protocols. In our system each such protocol module 
has an input packet queue assigned to it. Incoming packets received by a network interface are 
queued up for the protocol module and a software interrupt is posted to initiate processing. 

Three macros are available for queueing and dequeueing packets, 

IF_ENQUEUE(ifq, m) 

This places the packet m at the tail of the queue ifq. 


Sun Microsystems 


Release 2.0 


Page 16 


Network Implementation 


IF_DEQUEUE(ifq, m) 

This places a pointer to the packet at the head of queue ifq in m. A zero value will be 
returned in m if the queue is empty. 

IF_PREPEND(ifq, m) 

This places the packet m at the head of the queue ifq. 

Each queue has a maximum length associated with it as a simple form of congestion control. 
The macro IF_QFULL(ifq) returns 1 if the queue is filled, in which case the macro IF_DROP(ifq) 
should be used to bump a count of the number of packets dropped and the offending packet 
dropped. For example, the following code fragment is commonly found in a network interface’s 
input routine, 

if (IF_QFULL (inq) ) { 

IF_DROP (inq) ; 
m_freem (m) ; 

} else 

IF_ENQUEUE (inq, m) ; 


10. Gateways and Routing Issues 

The system has been designed with the expectation that it will be used in an internetwork 
environment. The “canonical” environment was envisioned to be a collection of local area net- 
works connected at one or more points through hosts with multiple network interfaces (one on 
each local area network), and possibly a connection to a long haul network (for example, the 
ARPANET). In such an environment, issues of gatewaying and packet routing become very 
important. Certain of these issues, such as congestion control, have been handled in a simplistic 
manner or specifically not addressed. Instead, where possible, the network system attempts to 
provide simple mechanisms upon which more involved policies may be implemented. As some of 
these problems become better understood, the solutions developed will be incorporated into the 
system. 

This section will describe the facilities provided for packet routing. The simplistic mechanisms 
provided for congestion control are described in chapter 12. 

10.1. Routing Tables 

The network system maintains a set of routing tables for selecting a network interface to use in 
delivering a packet to its destination. These tables are of the form: 

struct rtentry { 

u_long rt_hash; 
struct sockaddr rt_dst; 
struct sockaddr rt_gateway; 
short r t_ f 1 ags ; 

short rt_refcnt; 
u_long rt_use; 
struct ifnet *rt_ifp; 

>; 

The routing information is organized in two separate tables, one for routes to a host and one for 
routes to a network. The distinction between hosts and networks is necessary so that a single 

Sun Microsystems 


/* hash key for lookups */ 

/* destination net or host */ 

/* forwarding agent */ 

/* see below */ 

/* no. of references to structure */ 
/* packets sent using route */ 

/* interface to give packet to */ 


Release 2.0 


Network Implementation 


mechanism may be used for both broadcast and multi-drop type networks, and also for networks 

built from point-to-point links (e.g DECnet [DEC80]). 

Each table is organized as a hashed set of linked lists. Two 32-bit hash values are calculated by 
routines defined for each address family; one based on the destination being a host, and one 
assuming the target is the network portion of the address. Each hash value is used to locate a 
hash chain to search (by taking the value modulo the hash table size) and the entire 32-bit value 
is then used as a key in scanning the list of routes. Lookups are applied first to the routing table 
for hosts, then to the routing table for networks. If both lookups fail, a final lookup is made for 
a “wildcard” route (by convention, network 0). By doing this, routes to a specific host on a net- 
work may be present as well as routes to the network. This also allows a “fall back” network 
route to be defined to an “smart” gateway which may then perform more intelligent routing. 

Each routing table entry contains a destination (who’s at the other end of the route), a gateway 
to send the packet to, and various flags which indicate the route’s status and type (host or net- 
work). A count of the number of packets sent using the route is kept for use in deciding 
between multiple routes to the same destination (see below), and a count of “held references” to 
the dynamically allocated structure is maintained to insure memory reclamation occurs only 
when the route is not in use. Finally a pointer to the a network interface is kept; packets sent 
using the route should be handed to this interface. 

Routes are typed in two ways: either as host or network, and as “direct” or “indirect”. The 
host/network distinction determines how to compare the rt_dst field during lookup. If the route 
is to a network, only a packet’s destination network is compared to the rt_dst entry stored in the 
table. If the route is to a host, the addresses must match bit for bit. 

The distinction between “direct” and “indirect” routes indicates whether the destination is 
directly connected to the source. This is needed when performing local network encapsulation. 
If a packet is destined for a peer at a host or network which is not directly connected to the 
source, the internetwork packet header will indicate the address of the eventual destination, 
while the local network header will indicate the address of the intervening gateway. Should the 
destination be directly connected, these addresses are likely to be identical, or a mapping 
between the two exists. The RTF_GATEWAY flag indicates the route is to an “indirect” gate- 
way agent and the local network header should be filled in from the rt_gateway field instead of 
rt_dst, or from the internetwork destination address. 

It is assumed multiple routes to the same destination will not be present unless they are deemed 
equal in cost (the current routing policy process never installs multiple routes to the same desti- 
nation). However, should multiple routes to the same destination exist, a request for a route will 
return the “least used” route based on the total number of packets sent along this route. This 
can result in a “ping-pong” effect (alternate packets taking alternate routes), unless protocols 
“hold onto” routes until they no longer find them useful; either because the destination has 
changed, or because the route is lossy. 

Routing redirect control messages are used to dynamically modify existing routing table entries 
as well as dynamically create new routing table entries. On hosts where exhaustive routing 
information is too expensive to maintain (for example, work stations), the combination of wild- 
card routing entries and routing redirect messages can be used to provide a simple routing 
management scheme without the use of a higher level policy process. Statistics are kept by the 
routing table routines on the use of routing redirect messages and their affect on the routing 
tables. These statistics may be viewed using netgtat( 1). 

Status information other than routing redirect control messages may be used in the future, but 
at present they are ignored. Likewise, more intelligent “metrics” may be used to describe routes 


Sun Microsystems 


Release 2.0 


Page 18 


Network Implementation 


in the future, possibly based on bandwidth and monetary costs. 


10.2. Routing Table Interface 

A protocol accesses the routing tables through three routines, one to allocate a route, one to free 
a route, and one to process a routing redirect control message. The routine rtalloc performs 
route allocation; it is called with a pointer to the following structure, 

struct route { 

struct rtentry *ro_rt; 
struct sockaddr ro_dst; 

}; 

The route returned is assumed “held” by the caller until disposed of with an rtfree call. Proto- 
cols which implement virtual circuits, such as TCP, hold onto routes for the duration of the 
circuit’s lifetime, while connection-less protocols, such as UDP, currently allocate and free routes 
on each transmission. 

The routine rtredirect is called to process a routing redirect control message. It is called with a 
destination address and the new gateway to that destination. If a non-wildcard route exists to 
the destination, the gateway entry in the route is modified to point at the new gateway supplied. 
Otherwise, a new routing table entry is inserted reflecting the information supplied. Routes to 
interfaces and routes to gateways which are not directly accesible from the host are ignored. 


10.3. User-Level Routing Policies 

Routing policies implemented in user processes manipulate the kernel routing tables through two 
ioctl calls. The commands SIOCADDRT and SIOCDELRT add and delete routing entries, 
respectively; the tables are read through the /dev/kmem device. The decision to place policy 
decisions in a user process implies routing table updates may lag a bit behind the identification of 
new routes, or the failure of existing routes, but this period of instability is normally very small 
with proper implementation of the routing process. Advisory information, such as ICMP error 
messages and IMP diagnostic messages, may be read from raw sockets (described in the next sec- 
tion). 

One routing policy process has already been implemented. The system standard “routing dae- 
mon” uses a variant of the Xerox NS Routing Information Protocol [Xerox82] to maintain up to 
date routing tables in our local environment. Interaction with other existing routing protocols, 
such as the Internet GGP (Gateway-Gateway Protocol), may be accomplished using a similar 
process. 


11. Raw Sockets 

A raw socket is a mechanism which allows users direct access to a lower level protocol. Raw 
sockets are intended for knowledgeable processes which wish to take advantage of some protocol 
feature not directly accessible through the normal interface, or for the development of new pro- 
tocols built atop existing lower level protocols. For example, a new version of TCP might be 
developed at the user level by utilizing a raw IP socket for delivery of packets. The raw IP 
socket interface attempts to provide an identical interface to the one a protocol would have if it 
were resident in the kernel. 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 19 


The raw socket support is built around a generic raw socket interface, and (possibly) augmented 
by protocol-specific processing routines. This section will describe the core of the raw socket 
interface. 


11.1. Control Blocks 

Every raw socket has a protocol control block of the following form, 
struct rawcb { 

struct rawcb ‘rcb_next; /* doubly linked list */ 

struct rawcb *rcb_prev; 

struct socket *rcb_socket; /* back pointer to socket */ 

struct sockaddr rcb_faddr; /* destination address */ 

struct sockaddr rcb_laddr; /* socket's address */ 

caddr_t rcb_pcb; /* protocol specific stuff */ 

short rcb_flags; 

>; 

All the control blocks are kept on a doubly linked list for performing lookups during packet 
dispatch. Associations may be recorded in the control block and used by the output routine in 
preparing packets for transmission. The addresses are also used to filter packets on input; this 
will be described in more detail shortly. If any protocol specific information is required, it may 
be attached to the control block using the rcb_pcb field. 

A raw socket interface is datagram oriented. That is, each send or receive on the socket requires 
a destination address. This address may be supplied by the user or stored in the control block 
and automatically installed in the outgoing packet by the output routine. Since it is not possible 
to determine whether an address is present or not in the control block, two flags, RAW_LADDR 
and RAW_FADDR, indicate if a local and foreign address are present. Another flag, 
RAW_DONTROUTE, indicates if routing should be performed on outgoing packets. If it is, a 
route is expected to be allocated for each “new” destination address. That is, the first time a 
packet is transmitted a route is determined, and thereafter each time the destination address 
stored in rcb_route differs from rcb_faddr , or rcb_routc.ro_rt is zero, the old route is discarded 
and a new one allocated. 


11.2. Input Processing 

Input packets are “assigned” to raw sockets based on a simple pattern matching scheme. Each 
network interface or protocol gives packets to the raw input routine with the call: 

raw_input (m, proto, src, dst) 
struct mbuf *m; 

struct sockproto ‘proto, struct sockaddr *src, *dst; 

The data packet then has a generic header prepended to it of the form 

struct raw_header { 

struct sockproto raw_proto; 
struct sockaddr raw_dst; 
struct sockaddr raw_src; 

>; 

and it is placed in a packet queue for the “raw input protocol” module. Packets taken from this 


Sun Microsystems 


Release 2.0 


Page 20 


Network Implementation 


queue are copied into any raw sockets that match the header according to the following rules, 

1) The protocol family of the socket and header agree. 

2) If the protocol number in the socket is non-zero, then it agrees with that found in the packet 
header. 

3) If a local address is defined for the socket, the address format of the local address is the 
same as the destination address’s and the two addresses agree bit for bit. 

4) The rules of 3) are applied to the socket’s foreign address and the packet’s source address. 

A basic assumption is that addresses present in the control block' and packet header (as con- 
structed by the network interface and any raw input protocol module) are in a canonical form 
which may be “block compared”. 


11.3. Output Processing 

On output the raw pr_usrreq routine passes the packet and raw control block to the raw proto- 
col output routine for any processing required before it is delivered to the appropriate network 
interface. The output routine is normally the only code required to implement a raw socket 
interface. 


12. Buffering and Congestion Control 

One of the major factors in the performance of a protocol is the buffering policy used. Lack of a 
proper buffering policy can force packets to be dropped, cause falsified windowing information to 
be emitted by protocols, fragment host memory, degrade the overall host performance, etc. Due 
to problems such as these, most systems allocate a fixed pool of memory to the networking sys- 
tem and impose a policy optimized for “normal” network operation. 

The networking system developed for UNIX is little different in this respect. At boot time a fixed 
amount of memory is allocated by the networking system. At later times more system memory 
may be requested as the need arises, but at no time is memory ever returned to the system. It is 
possible to garbage collect memory from the network, but difficult. In order to perform this gar- 
bage collection properly, some portion of the network will have to be “turned off” as data struc- 
tures are updated. The interval over which this occurs must kept small compared to the average 
inter-packet arrival time, or too much traffic may be lost, impacting other hosts on the network, 
as well as increasing load on the interconnecting mediums. In our environment we have not 
experienced a need for such compaction, and thus have left the problem unresolved. 

The mbuf structure was introduced in chapter 5. In this section a brief description will be given 
of the allocation mechanisms, and policies used by the protocols in performing connection level 
buffering. 


12.1. Memory Management 

The basic memory allocation routines place no restrictions on the amount of space which may be 
allocated. Any request made is filled until the system memory allocator starts refusing to allo- 
cate additional memory. When the current quota of memory is insufficient to satisfy an mbuf 
allocation request, the allocator requests enough new pages from the system to satisfy the 


Sun Microsyste 


ms 


Release 2.0 


Network Implementation 


Page 21 


current request only. All memory owned by the network is described by a private page table 
used in remapping pages to be logically contiguous as the need arises. In addition, an array of 
reference counts parallels the page table and is used when multiple copies of a page are present. 

Mbufs are 128 byte structures, 16 fitting in a 2048 byte page of memory. When data is placed in 
mbufs, if possible, it is copied or remapped into logically contiguous pages of memory from the 
network page pool. Data smaller than the size of a page is copied into one or more 112 byte 
mbuf data areas. 


12.2. Protocol Buffering Policies 

Protocols reserve fixed amounts of buffering for send and receive queues at socket creation time. 
These amounts define the high and low water marks used by the socket routines in deciding 
when to block and unblock a process. The reservation of space does not currently result in any 
action by the memory management routines, though it is clear if one imposed an upper bound on 
the total amount of physical memory allocated to the network, reserving memory would become 
important. 

Protocols which provide connection level flow control do this based on the amount of space in 
the associated socket queues. That is, send windows are calculated based on the amount of free 
space in the socket’s receive queue, while receive windows are adjusted based on the amount of 
data awaiting transmission in the send queue. Care has been taken to avoid the “silly window 
syndrome” described in [Clark82] at both the sending and receiving ends. 


12.3. Queue Limiting 

Incoming packets from the network are always received unless memory allocation fails. How- 
ever, each Level 1 protocol input queue has an upper bound on the queue’s length, and any pack- 
ets exceeding that bound are discarded. It is possible for a host to be overwhelmed by excessive 
network traffic (for instance a host acting as a gateway from a high bandwidth network to a low 
bandwidth network). As a “defensive” mechanism the queue limits may be adjusted to throttle 
network traffic load on a host. Consider a host willing to devote some percentage of its machine 
to handling network traffic. If the cost of handling an incoming packet can be calculated so that 
an acceptable “packet handling rate” can be determined, then input queue lengths may be 
dynamically adjusted based on a host’s network load and the number of packets awaiting pro- 
cessing. Obviously, discarding packets is not a satisfactory solution to a problem such as this 
(simply dropping packets is likely to increase the load on a network); the queue lengths were 
incorporated mainly as a safeguard mechanism. 


12.4. Packet Forwarding 

When packets can not be forwarded because of memory limitations, the system generates a 
“source quench” message. In addition, any other problems encountered during packet forward- 
ing are also reflected back to the sender in the form of ICMP packets. This helps hosts avoid 
unneeded retransmissions. 

Broadcast packets are never forwarded due to possible dire consequences. In an early stage of 
network development, broadcast packets were forwarded and a “routing loop” resulted in net- 
work saturation and every host on the network crashing. 


Sun Microsystems 


Release 2.0 


Page 22 


Network Implementation 


13. Out of Band Data 

Out of band data is a facility peculiar to the stream socket abstraction defined. Little agreement 
appears to exist as to what its semantics should be. TCP defines the notion of “urgent data” as 
in-line, while the NBS protocols [Burruss8l] and numerous others provide a fully independent 
logical transmission channel along which out of band data is to be sent. In addition, the amount 
of the data which may be sent as an out of band message varies from protocol to protocol; every- 
thing from 1 bit to 16 bytes or more. 

A stream socket’s notion of out of band data has been defined as the lowest reasonable common 
denominator (at least reasonable in our minds); clearly this is subject to debate. Out of band 
data is expected to be transmitted out of the normal sequencing and flow control constraints of 
the data stream. A minimum of 1 byte of out of band data and one outstanding out of band 
message are expected to be supported by the protocol supporting a stream socket. It is a proto- 
cols prerogative to support larger sized messages, or more than one outstanding out of band mes- 
sage at a time. 

Out of band data is maintained by the protocol and usually not stored in the socket’s send 
queue. The PRU_SENDOOB and PRU_RCVOOB requests to the pr_uarreq routine are used in 
sending and receiving data. 


Appendix A. Acknowledgements and References 

The internal structure of the system is patterned after the Xerox PUP architecture [Boggs79], 
while in certain places the Internet protocol family has had a great deal of influence in the 
design. The use of software interrupts for process invocation is based on similar facilities found 
in the VMS operating system. Many of the ideas related to protocol modularity, memory 
management, and network interfaces are based on Rob Gurwitz’s TCP/IP implementation for 
the 4.1BSD version of UNIX on the VAX [Gurwitz8l]. 


Appendix B. References 


[Boggs79] 


[BBN78] 

[Cerf78] 

[Clark82] 

[DEC80] 


[Gurwitz81] 


Boggs, D. R., J. F. Shoch, E. A. Taft, and R. M. Metcalfe; PUP: An Inter- 
network Architecture. Report CSL-79-10. XEROX Palo Alto Research 
Center, July 1979. 

Bolt Beranek and Newman; Specification for the Interconnection of Host 
and IMP. BBN Technical Report 1822. May 1978. 

Cerf, V. G.; The Catenet Model for Internetworking. Internet Working 
Group, IEN 48. July 1978. 

Clark, D. D.; Window and Acknowledgement Strategy in TCP. Internet 
Working Group, IEN Draft Clark-2. March 1982. 

Digital Equipment Corporation; DECnet DIGITAL Network Architecture 
— General Description. Order No. AA-K179A-TK. October 1980. 

Gurwitz, R. F.; VAX-UNIX Networking Support Project - Implementa- 
tion Description. Internetwork Working Group, IEN 168. January 1981. 


Sun Microsystems 


Release 2.0 


Network Implementation 


Page 23 


[IS081] 

International Organization for Standardization. ISO Open Systems Inter- 
connection — Basic Reference Model. ISO/TC 97/SC 16 N 719. August 
1981. 

[Joy82a] 

Joy, W.; Cooper, E.; Fabry, R.; Leffler, S.; and McKusick, M.; System 
Interface Overview. Computer Systems Research Group, Technical 
Report 5. University of California, Berkeley. Draft of September 1, 1982. 

[Postel79] 

Postel, J., ed. DOD Standard User Datagram Protocol. Internet Working 
Group, IEN 88. May 1979. 

[Postel80a] 

Postel, J., ed. DOD Standard Internet Protocol. Internet Working 
Group, IEN 128. January 1980. 

[Postel80b] 

Postel, J., ed. DOD Standard Transmission Control Protocol. Internet 
Working Group, IEN 129. January 1980. 

[Xerox8l] 

Xerox Corporation. Internet Transport Protocols. Xerox System Integra- 
tion Standard 028112. December 1981. 

[Zimmermann80] 

Zimmermann, H. OSI Reference Model — The ISO Model of Architecture 
for Open Systems Interconnection. IEEE Transactions on Communica- 
tions. Com-28(4); 425-432. April 1980. 


Sun Microsystems 


Release 2.0