ONLamp.com: IRIX Binary Compatibility, Part 3

Published on ONLamp.com (http://www.onlamp.com/)
http://www.onlamp.com/pub/a/bsd/2002/09/12/irix.html

IRIX Binary Compatibility, Part 3

by Emmanuel Dreyfus
09/19/2002

IRIX Oddities: system calls that you will not see anywhere else!

Now that we are able to launch dynamic binaries, the goal is to get them linking. The dynamic linker has to do a lot of system calls before actually launching the program. Most of them are plain SVR4, and hence are taken from sys/compat/svr4. Here, we will deal with IRIX-specific system calls.

`syssgi()` Overview

One of the very first things on which we fail when running IRIX 6.5 binaries is the syssgi(2) system call. In fact, syssgi(2) is more like a meta-system call. Its first argument is an int named request. Depending on request's value, syssgi(2) will run literally dozens of different commands. The remaining arguments to syssgi(2) are interpreted according to the request argument.

syssgi(2) commands include some quite standard functionality that is implemented in plain system calls on NetBSD, such as getgroups(2), getsid(2), or getpid(2). There are also some SGI-specific things, such as commands to get hardware inventory, system configuration, or NVRAM values.

The big question is why SGI decided to fold so much functionality into a single system call. There must be a good reason for doing so, but it is not easy to guess. The only thing that is obvious when you are doing some reverse engineering on syssgi(2) is that for many requests, you do not know what arguments are used when calling syssgi(2). The only information available is the name of the request, and it makes things much more difficult.

syssgi(2) emulation in NetBSD is done in sys/compat/irix/irix_syssgi.c:irix_sys_syssgi(). It is just a giant switch on the request value, which will branch to various kernel functions implementing the request. All standard features, such as getpid(2), are quite easy to implement. Others are more tricky.

The first difficulty with syssgi(2) is the ELFMAP request. The dynamic linker invokes syssgi(2) with this request, and all we can find in the syssgi(2) man page is that this is an interface to implement a system library function, and that this interface is subject to change. Not very helpful.

Fortunately, Linux already tried to get there, and the person who worked on it managed to discover that ELFMAP takes a file descriptor, an ELF program header array, and the array length, and then maps the ELF sections described in the array into the calling process's user space. Information on this can be found inside Linux kernel sources, in linux/arch/mips/kernel/sysirix.c:irix_syssgi() and linux/arch/mips/kernel/irixelf.c:irix_mapelf().

In fact, syssgi(ELFMAP) is a kernel implementation of a part of the dynamic linker. Native binaries on a NetBSD system map each code section by calling mmap(2). Here again, one could wonder what the reasons are for pushing that code from userland to the kernel. One reason could be to improve performance by saving system calls: an IRIX binary can map a library with only one system call.

Reverse Engineering `syssgi(ELFMAP)`

Another way of guessing what the syssgi(ELFMAP) function does is to use the par(1) command in IRIX. This command is similar to ktrace(1) on NetBSD: it reports the system call activity of a user program. Fortunately, syssgi(ELFMAP) gets disassembled into a more system-call-looking presentation:

31mS          : open("/lib/rld", O_RDONLY, 04) = 3
31mS          : read(3, <7f 45 4c 46 01 02 01 00 00 00>..., 512) = 512
32mS          : elfmap(3, 0x7fff2d98, 2) = 0xfb60000

Here is the ktrace(1)/kdump(1) output on NetBSD for this:

1343 ftp      CALL  open(0xfb3509c,0,0x4)
1343 ftp      NAMI  "/emul/irix/lib/rld"
1343 ftp      NAMI  "/emul/irix"
1343 ftp      NAMI  "/emul/irix/lib/rld"
1343 ftp      RET   open 3
1343 ftp      CALL  read(0x3,0x7fffe76c,0x200)
1343 ftp      RET   read 512/0x200
1343 ftp      CALL  syssgi(0x44,0x3,0x7fffe7e0,0x2,0,0xfb3509c)

NB: 0x44 is the request code for ELFMAP. This is defined in IRIX's <sys/syssgi.h>.

From this we learn that:

We should never fully trust par(1) about what is going on, because it disguises some system calls: what it reports as elfmap() is really a syssgi(2) call.
syssgi(ELFMAP) really expects three arguments.
The first one is very likely to be the file descriptor just acquired from open(2).
The second and third arguments are much more difficult to guess.
We still need to work out what the returned value means.

If we had to rediscover the second and third argument usage without looking at Linux sources, we could try to dig up some information using gdb. The goal is to look at what is at the address pointed to by the second argument. To achieve this, we can break just at the syssgi(2) system call stub in libc. We can get the address of the syssgi(2) system call stub in libc using the GNU nm(1) command with the -D option. This will list all of the dynamic symbols from a binary.

$ nm -D /lib/libc.so.1 | grep syssgi
0fa33260 A _syssgi
0fa33260 W syssgi

Then we just have to set the breakpoint and run:

$ gdb ftp
(gdb) b *0x0fa33260
Breakpoint 1 at 0xfa33260
(gdb) run
Starting program: ./ftp

Breakpoint 1, 0xfa33260 in ?? ()
(gdb) info registers
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 00000001 7fffe788 00000001 00000044 00000005 7fffe7c8 00000002
(snip)

The SVR4 ABI states that registers A0 to A3 are used to pass the first four arguments to a function. A0 is equal to 0x44, which corresponds to the ELFMAP request. This is the first syssgi(2) argument. Here we are!

A1 is still the file descriptor. It is 5 and not 3 because we are running with gdb and there are more files open. However, a ktrace(1) would show that this is the file descriptor just returned by open(2). What we are looking for is the buffer pointed to by the second argument to syssgi(ELFMAP), which is the third argument to syssgi(2), stored in A2. Let us dump the memory pointed to by A2:

(gdb) x/20wx $a2
0x7fffe7c8:     0x00000001      0x00000000      0x0fb60000      0x0fb60000
0x7fffe7d8:     0x00035000      0x00035000      0x00000005      0x00004000
0x7fffe7e8:     0x00000001      0x00038000      0x0fbd8000      0x0fbd8000
0x7fffe7f8:     0x00002000      0x00002000      0x00000006      0x00004000
0x7fffe808:     0x00000000      0x01200000      0x00000000      0xf3fffffe

It is difficult here to recognize a program header array. But if we re-read the kernel trace, we can see that the program has just read 512 bytes at 0x7fffe754.

1613 ftp      CALL  read(0x5,0x7fffe754,0x200)
1613 ftp      RET   read 512/0x200
1613 ftp      CALL  syssgi(0x44,0x5,0x7fffe7c8,0x2,0,0xfb3509c)

These 512 bytes are the beginning of the /lib/rld file; that is, the ELF headers. If the program has just called syssgi(2) without modifying this area, then 0x7fffe7c8 should point to data that is a plain copy of what is in the /lib/rld file, at offset 0x7fffe7c8 - 0x7fffe754 = 0x74.

We can check that this is true:

$ hexdump -s 0x74 -n 80 /lib/rld
0000074 0000 0001 0000 0000 0fb6 0000 0fb6 0000
0000084 0003 5000 0003 5000 0000 0005 0000 4000
0000094 0000 0001 0003 8000 0fbd 8000 0fbd 8000
00000a4 0000 2000 0000 2000 0000 0006 0000 4000
00000b4 0000 0000 0120 0000 0000 0000 f3ff fffe

Now we know that the program passed some data to syssgi(ELFMAP) from the file. We do not know yet that this is a program header array, but we are getting closer. The question is: what data is in the file at offset 0x74?

Probably some header information, since this is not that far away from the beginning of the file.

The job can be finished using a small C program:

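/* cc -o elfdump elfdump.c */
#include <elf.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
        Elf32_Ehdr buf;

        /* Read the ELF header from standard input. */
        if (read(0, &buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                fprintf(stderr, "short read\n");
                return 1;
        }

        /* Location, entry size, and entry count of the program header table. */
        printf("buf.e_phoff = 0x%08x\n", (unsigned int)buf.e_phoff);
        printf("buf.e_phentsize = 0x%08x\n", (unsigned int)buf.e_phentsize);
        printf("buf.e_phnum = 0x%08x\n", (unsigned int)buf.e_phnum);

        return 0;
}

Here is the elfdump output: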
$ elfdump < /lib/rld
buf.e_phoff = 0x00000034
buf.e_phentsize = 0x00000020
buf.e_phnum = 0x00000004

Now we know that the program header table is at offset 0x34, that each entry is 0x20 bytes long, and that there are 4 entries. syssgi(ELFMAP) was hence passed a pointer to the third program header: 0x34 + 2 * 0x20 = 0x74.

If we list the program headers, we are now fully convinced that syssgi(ELFMAP) was given a pointer to the two loadable (see LOAD lines below) program headers. Note that vaddr, paddr, memsz, and other values fit values we saw in the memory dump earlier.

$ objdump -p /lib/rld

/lib/rld:     file format elf32-bigmips

Program Header:
0x70000002 off    0x000000b8 vaddr 0x0fb600b8 paddr 0x0fb600b8 align 2**3
         filesz 0x00000080 memsz 0x00000080 flags r--
0x70000000 off    0x00000138 vaddr 0x0fb60138 paddr 0x0fb60138 align 2**2
         filesz 0x00000018 memsz 0x00000018 flags r--
    LOAD off    0x00000000 vaddr 0x0fb60000 paddr 0x0fb60000 align 2**14
         filesz 0x00035000 memsz 0x00035000 flags r-x
    LOAD off    0x00038000 vaddr 0x0fbd8000 paddr 0x0fbd8000 align 2**14
         filesz 0x00002000 memsz 0x00002000 flags rw-

We also discover that the address returned by syssgi(ELFMAP), 0xfb60000 in the par(1) output, is the vaddr field of the first loadable section; that is, the first entry in the program header array passed to the call.

As a conclusion on this topic, I would say that it is possible to guess what an undocumented system call such as syssgi(ELFMAP) does, and generally what arguments it expects, but it is really much easier if you already have an idea of what you are looking for.

`irix_syssgi_mapelf()` Implementation

On NetBSD, syssgi(ELFMAP) is implemented through the irix_syssgi_mapelf() function. Let us now talk about what this function does.

We already have some support in the kernel for mapping ELF program sections: the kernel needs to load the ELF program sections of the executable and the interpreter.

The code to do this is split into two parts: the first part is the elf32_load_psection() function, from sys/kern/exec_elf32.c. This function takes a program header and builds a set of virtual memory (VM) commands that will load the code section. The VM commands are described by the struct exec_vmcmd, which is defined in sys/sys/exec.h. One struct exec_vmcmd contains a pointer to a function and holds its arguments. The functions that can be used are in sys/kern/exec_subr.c:

vmcmd_map_pagedvn(): maps a file area into user space.
vmcmd_map_readvn(): reads a file area into user space.
vmcmd_map_zero(): zeroes a user space area.

The function elf32_load_psection() builds VM commands that use vmcmd_map_pagedvn() when it has to load a section that fits within memory page range. For pages that are not completely filled, the data is copied instead of being mapped, and this is done using vmcmd_map_readvn(). vmcmd_map_zero() is then used to zero the end of the page.

The set of VM commands is returned by elf32_load_psection() in a struct exec_vmcmd_set (defined in sys/sys/exec.h). Once we have the struct exec_vmcmd_set filled, we can use the second part of the ELF section load, which consists of running the VM commands found in the set.

Although this works well for loading just an executable and its interpreter, calling elf32_load_psection() and running the VM commands does not work very well for the syssgi(ELFMAP) implementation. The reason is that when the kernel loads an executable and its interpreter, it doesn't have to deal with the possibility that the virtual address range where a section was to be loaded is already mapped to another object. This is because the process address space is completely unused at that time.

When mapping several shared libraries, the likelihood that the load address of an object is already allocated to another object is very high. In fact, it does happen for any X11-related o32 IRIX binary:

The load addresses of the program sections of libX11 overlap with the load addresses of libXaw:

$ objdump -p /usr/lib/libX11.so.1
(snip)
    LOAD off    0x00000000 vaddr 0x0f5b0000 paddr 0x0f5b0000 align 2**14
         filesz 0x000f1000 memsz 0x000f1000 flags r-x
    LOAD off    0x000f4000 vaddr 0x0f6a4000 paddr 0x0f6a4000 align 2**14
         filesz 0x0000a000 memsz 0x0000a000 flags rw-

$ objdump -p /usr/lib/libXaw.so.2
(snip)
    LOAD off    0x00000000 vaddr 0x0f5a0000 paddr 0x0f5a0000 align 2**14
         filesz 0x00041000 memsz 0x00041000 flags r-x
    LOAD off    0x00044000 vaddr 0x0f5f4000 paddr 0x0f5f4000 align 2**14
         filesz 0x00005000 memsz 0x00005000 flags rw-

The first section of libXaw.so.2 loads at 0x0f5a0000 and is 0x00041000 bytes long. Therefore, the section is loaded from 0x0f5a0000 to 0x0f5e1000, whereas the first section of libX11.so.1 wants to be loaded at 0x0f5b0000, which falls into that range.

Using par(1) on IRIX, it is possible to check what IRIX does to work around this: the value returned by syssgi(ELFMAP) is not the default load address of the first LOAD section, but another place. By building test programs linked with libX11 and libXaw, it is possible to check that libX11 is indeed loaded at the address returned by syssgi(ELFMAP). The library has been relocated in memory.

elf32_load_psection() contains no code to check if the address range requested by the program header is available. We then have to check this in our irix_syssgi_mapelf() function. This is done using the uvm_findspace(9) function. uvm_findspace(9) can be used in several ways. Given an area's virtual address, an area's length, and the UVM_FLAG_FIXED flag, it will tell if the area can be allocated at the given virtual address or not. Without the UVM_FLAG_FIXED flag, uvm_findspace(9) will find a virtual address where the area can be allocated.

uvm_findspace(9) is first used to check that there is enough free space to load each program section. If there is a problem with any of them, then we will have to relocate all the sections from this shared object.

Relocation is a difficult job. syssgi(ELFMAP) only returns the virtual address of the first section. If the sections are relocated, the only way for the calling program to find them is by using offsets from the first section. If the first section is moved by 0x4000 bytes, all of the other sections should be moved by 0x4000 bytes.

We want to keep the code in irix_syssgi_mapelf() simple, so that it has some chance to work correctly. We do this by making a few assumptions:

The sections described by the program header array will never overlap.
The load address of a section in the program header array is always higher than that of the section described by the previous entry.

Since syssgi(ELFMAP) is used to map shared libraries, the first assumption is likely to be okay: no shared library will come with overlapping code sections. The second assumption seems okay, but one could build a bad binary with program headers reverse-ordered. At least this nasty kind of object does not seem to exist in a real IRIX system.

Once we have made the two assumptions, we compute the section union area: an area enclosing all of the code sections described in the program header array. Then we use uvm_findspace(9) without the UVM_FLAG_FIXED flag to find a place for this area. Once we have the address of a free place, we add the offset between the old and the new location to the load addresses of all entries in the program header array. elf32_load_psection() can then do its job, since the new load addresses are not in use.

Here is a quick summary of irix_syssgi_mapelf() behavior:

Copy the program header array into kernel memory.
For each section, check that:
    The section is loadable.
    The section's load address is higher than that of the previous section.
    There is some free space to load the section at the default address.
  If not, we will need a relocation.
If we need a relocation:
    Compute the section union size.
    Find some free place for the section union.
    For each section:
        Apply the relocation offset to the section entry in the table.
For each section:
    Run elf32_load_psection().
    For each VM command in the returned set:
        Run the VM command.

Anyone reading the Linux implementation of syssgi(ELFMAP) might wonder why the NetBSD version is so much more complicated. This is because the Linux version does not handle relocations, nor does it properly handle the loading of sections that are not aligned on a page boundary.

Other IRIX Oddities

There are a few other IRIX-specific system calls that are used in nearly every IRIX binary: sysmp() and prctl(). Both are meta-system calls like syssgi().

sysmp() is supposed to gather various multiprocessor-related functionality. The most commonly-used request is PGSIZE, which returns the memory page size. There are also requests to get well-known kernel structure offsets in /dev/kmem (KERNADDR), or the number of available processors (NPROCS). All of the requests are defined in IRIX's <sys/sysmp.h>.

prctl() implements functions related to multi-threading. The most-used request is LASTSHEXIT, which tells the kernel that the caller is the last thread of the process. Every IRIX process calls this before terminating. The emulation of this command is simple, for now: we just do nothing. All of the commands of prctl() are defined in IRIX's <sys/prctl.h>.

Finally, sginap() is another widely used, IRIX-specific system call. It is an equivalent of sleep(2) that returns the number of ticks elapsed. This was easy to emulate by checking the system clock before and after a sleep(9), and then returning the difference.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.
