Stupid tricks at the userspace/kernelspace boundary, part 1

Basically every operating system structures execution into two logical parts, userspace and kernelspace. The former is where the application’s own code executes, and the latter is where the OS kernel services requests from applications that require privileged access of various kinds — reading and writing data on disk, getting more memory, interacting with other applications, making network connections, and so forth.

Let’s take a look at how recent versions of Linux (on x86-compatible CPUs) implement the transition between the two.

To start off, kernelspace is privileged and can easily return to userspace. However, as userspace is unprivileged, it can’t just start running code in kernelspace at will. So in order to make a request of the OS, a system call, it generates a software interrupt via the INT instruction, which causes essentially the same sort of response as a hardware interrupt generated by some physical device (a timer, an external input device, etc.): the processor switches to kernelspace and starts running a predefined function in the kernel. This interrupt handler determines what’s going on, acts on the system call, and returns (via IRET) from the interrupt back to userspace.

Usually application programmers don’t have to think about any of this. The standard C library includes a C function for each system call, which additionally takes care of things like passing function parameters the way the system call interrupt handler wants them, and providing unified error reporting. Moreover, this abstraction layer allows the interface to be changed. INT is somewhat time-consuming, since it sends the processor through its general-purpose interrupt machinery, the same path that has to cope with hardware interrupts and faults. Eventually Intel introduced the faster SYSENTER instruction, and AMD a similar one called SYSCALL, both designed specifically for system calls and nothing else.

Well and good, except now code running on older processors needs to continue to use INT, newer Intel ones should use SYSENTER, and newer AMD ones should use SYSCALL, so we would need a different C library for each processor type. And the problem isn’t restricted to the C library: nothing stops applications from ignoring it and making system calls directly, and there are legitimate uses for skipping standard libraries entirely (statically-linked rescue utilities being one of the big ones). So we need a different abstraction layer, preferably one right at the syscall level with no C-library–style tricks.

Enter the VDSO, the “virtual dynamic shared object” that looks like it might be a normal library (“dynamic shared object”) on disk, but is actually provided by the kernel. Since the kernel does a lot of hardware probing at boot up, it knows the fastest way to make a system call, so in a stroke of genius the kernel provides to every process a userspace library that implements that fastest system call method. We can see evidence of this by looking at a process’s own “maps” file in the /proc directory, which contains information about each process:

dr-wily:~ geofft$ cat /proc/self/maps
08048000-0804f000 r-xp 00000000 fd:00 65412      /bin/cat
0804f000-08050000 rw-p 00006000 fd:00 65412      /bin/cat
08050000-08071000 rw-p 08050000 00:00 0          [heap]
b74e2000-b761c000 r--p 00000000 fd:00 630461     /usr/lib/locale/locale-archive
b761c000-b761d000 rw-p b761c000 00:00 0
b761d000-b7772000 r-xp 00000000 fd:00 66165      /lib/i686/cmov/libc-2.7.so
b7772000-b7773000 r--p 00155000 fd:00 66165      /lib/i686/cmov/libc-2.7.so
b7773000-b7775000 rw-p 00156000 fd:00 66165      /lib/i686/cmov/libc-2.7.so
b7775000-b7778000 rw-p b7775000 00:00 0
b778f000-b7791000 rw-p b778f000 00:00 0
b7791000-b7792000 r-xp b7791000 00:00 0          [vdso]
b7792000-b77ac000 r-xp 00000000 fd:00 49513      /lib/ld-2.7.so
b77ac000-b77ae000 rw-p 0001a000 fd:00 49513      /lib/ld-2.7.so
bffeb000-c0000000 rw-p bffeb000 00:00 0          [stack]

The line marked “[vdso]” indicates the virtual memory location of the VDSO, and its permissions: readable but not writable, executable, and private (i.e., not shared with other processes). If we’re curious, we can write out the VDSO with a simple C program to read its own /proc/self/maps, search for the “[vdso]” line, parse the addresses, and write out that section of its own memory.

Let’s take a look at it with the objdump command, which includes a simple disassembler:

dr-wily:/tmp/geofft geofft$ ./dump-vdso > vdso
dr-wily:/tmp/geofft geofft$ objdump -d vdso

vdso:     file format elf32-i386

Disassembly of section .text:

ffffe400 <__kernel_sigreturn>:
ffffe400:       58                      pop    %eax
ffffe401:       b8 77 00 00 00          mov    $0x77,%eax
ffffe406:       cd 80                   int    $0x80
ffffe408:       90                      nop    
ffffe409:       8d 76 00                lea    0x0(%esi),%esi

ffffe40c <__kernel_rt_sigreturn>:
ffffe40c:       b8 ad 00 00 00          mov    $0xad,%eax
ffffe411:       cd 80                   int    $0x80
ffffe413:       90                      nop

ffffe414 <__kernel_vsyscall>:
ffffe414:       51                      push   %ecx
ffffe415:       52                      push   %edx
ffffe416:       55                      push   %ebp
ffffe417:       89 e5                   mov    %esp,%ebp
ffffe419:       0f 34                   sysenter 
ffffe41b:       90                      nop    
ffffe41c:       90                      nop    
ffffe41d:       90                      nop    
ffffe41e:       90                      nop    
ffffe41f:       90                      nop    
ffffe420:       90                      nop    
ffffe421:       90                      nop    
ffffe422:       eb f3                   jmp    ffffe417 <__kernel_vsyscall+0x3>
ffffe424:       5d                      pop    %ebp
ffffe425:       5a                      pop    %edx
ffffe426:       59                      pop    %ecx
ffffe427:       c3                      ret

So, on the computer I’m using (a server with four dual-core AMD Opterons), the kernel recommends SYSENTER. Apparently AMD’s processors also implement Intel’s SYSENTER when running in 32-bit mode, as this machine is (SYSCALL went on to become the standard mechanism for 64-bit code). Meanwhile, on my netbook with an Intel Atom that’s good for nothing except low power consumption, I get the much simpler:

geofft@white-elephant:/tmp$ ./dump-vdso > vdso
geofft@white-elephant:/tmp$ objdump -d vdso

vdso:     file format elf32-i386


Disassembly of section .text:

ffffe400 <__kernel_sigreturn>:
ffffe400:       58                      pop    %eax
ffffe401:       b8 77 00 00 00          mov    $0x77,%eax
ffffe406:       cd 80                   int    $0x80
ffffe408:       90                      nop
ffffe409:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi

ffffe410 <__kernel_rt_sigreturn>:
ffffe410:       b8 ad 00 00 00          mov    $0xad,%eax
ffffe415:       cd 80                   int    $0x80
ffffe417:       90                      nop
ffffe418:       90                      nop
ffffe419:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi

ffffe420 <__kernel_vsyscall>:
ffffe420:       cd 80                   int    $0x80
ffffe422:       c3                      ret

Now here’s where I got curious. /proc/self/maps, as noted above, listed the page as not writable. Could you make it writable? Let’s try calling mprotect, the standard function for changing memory permissions, and see if it will let us:

        if (mprotect(vdso_begin, vdso_size, PROT_EXEC | PROT_READ | PROT_WRITE) != 0) {
                perror("mprotect");
        }

… and we get no error, and in fact if we print out /proc/self/maps we see

b774b000-b774c000 rwxp b774b000 00:00 0          [vdso]

Huh. I wonder if it’s actually writable. Going back to objdump, it looks like the SYSENTER operation is at 0x419 bytes into the page (the VDSO is one page, and a page is 0x1000 bytes long). What if we overwrite it to never make the system call?

        *(char *)(vdso_begin + 0x419) = 0x90;
        *(char *)(vdso_begin + 0x41A) = 0x90;
        printf("Hello world!\n");

0x90 being the opcode for a NOP (no operation, other than stepping one byte forward); SYSENTER is two bytes long, hence the two writes. Compile it, run it, and… no error, but it doesn’t terminate either. Seems like printf is waiting for some system call to report success, and now that we’ve severed communication with the kernel, that’s never going to happen…

It turns out that even without the printf, the program hangs forever: “return 0;” is just another convenience from the C library. The actual “entry point” of a program is in the C library, which calls main() and waits for its return value. Once it has that, it makes the exit() system call — but again, without the ability to make system calls, the program can’t exit, and the C library’s exit() wrapper function just keeps trying to exit, forever.

Stay tuned for more useful stupid tricks involving modifying __kernel_vsyscall — what sorts of things can we do with this abstraction layer? Here’s a hint: take a look at the “BUGS” section of fakeroot’s man page. And yes, I’m aware of ptrace, but there is a reason not to use it…

(See this blog post from Johan Petersson for another treatment of the first 3/5 of this, and this blog post from Anomit Ghosh for a Python version of dump-vdso.)

19 July 2010