Finding a kernel regression in half an hour with git bisect run

The git bisect command helps you identify the first commit in a range that broke something. You give it a good commit and a bad one, and it will do a binary search between the two to find the first bad commit. At each step, you say either git bisect good or git bisect bad depending on whether it passes your test, and it will move you halfway through the remaining commits in the range.

There are several guides for using git bisect with the Linux kernel (e.g., upstream, Gentoo, and Ubuntu all have one). Unfortunately, they're pretty time-intensive operations; they all say something to the effect of, "now build the kernel, reboot into it, and test it, then type git bisect good or git bisect bad depending on whether it worked." For a tricky hardware compatibility bug, this might be your only option. But if you're testing something about the kernel's behavior, this is unnecessarily slow and manual, and you might be tempted to do something else, like read commit logs.

At work a few days ago, someone reported that a certain application no longer worked in a new VM. After some initial debugging with strace, we determined that the program was calling the listen system call with a backlog of 0: that is, it was saying it was willing to accept up to zero connections. By the specification, it shouldn't work—and yet it did work on their older VM. A few things were different between the new systems, but one notable one was that the new VM had kernel 4.9 and the old one kernel 4.1. (Another was that it was deployed in a new cloud environment that my team is responsible for, with some networking changes, so we wanted to ensure we had not broken anything!)

I tried reading through git log --grep listen v4.1..v4.9 net/, but there was entirely too much and I couldn't find anything. So I decided to see if bisection could help me, with the use of git bisect run, which enables fully automated bisecting. I wasn't excited about rebooting my machine to do a binary search across eight kernel releases, but if I could get it to run in some other way, I could just leave it running.

For a normal program, it's pretty easy to use git bisect run, which just wants a command that returns success (0) or failure (1): you can usually do something like git bisect run make test. For a kernel regression, though, we'll need a command to boot the kernel and run some code. We can use the qemu virtual machine software for this, which has two properties that make it particularly suitable as such a command: it can boot a Linux kernel directly, instead of emulating a bootloader on a hard disk, and it can run a temporary VM in a single command line without any additional setup.

We'll build ourselves a tiny "initrd" (initial RAM disk), which is what's commonly used to load enough drivers to access your hard drive and completely boot your system. However, our initrd will just contain our one test program, which will possibly print a success message, and shut down the system. We can't meaningfully get a return value out of qemu, so we'll just grep its output for the success message.

The first step is to check out the kernel sources, if we don't have them already, and build a kernel:

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux
$ cd linux
$ make defconfig
$ make -j8

which prints this after a while:

Kernel: arch/x86/boot/bzImage is ready  (#21)

Then make sure we can boot our new kernel with qemu, without an initrd:

$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage

That is, run it in text mode with the VM's serial console on standard input/output instead of trying to pop up a graphical window, and tell the kernel to use the serial port for console output. If your system supports it, you can add -enable-kvm to make it a faster, although since we want to shut down the VM immediately once we run our test, it doesn't make a huge difference (2 seconds vs. 4 on my machine).

This will panic, because we gave the kernel neither a root filesystem nor a working initrd. (You can kill the VM by typing Ctrl-A and then X.) So let's write an initrd with a single binary, init. It needs to shut down the system, so we get back to our prompt:

$ mkdir initrd
$ cd initrd
$ cat > init.c << EOF
#include <sys/reboot.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
        printf("Hello world!\n");
        reboot(RB_POWER_OFF);
}
EOF

(Yes, the system call for shutting down the system is named "reboot", because the name "shutdown" was already used for the system call to close a socket. I guess early UNIX computers didn't support initiating a hardware poweroff from software, so the shutdown command would just stop all processes, sync and unmount disks, and print a message asking the operator to cut power.)

Compile this program statically, so it's a single binary, put it in the particular form required for an initrd (a compressed cpio archive, an old but very simple format with a weird command-line tool) and make sure it's named init, and then we can boot it up with qemu:

$ cd initrd
$ cc -static -o init init.c
$ echo init | cpio -H newc -o | gzip > initrd.gz
1621 blocks
$ cd ..
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz
...
[    0.502593] ALSA device list:
[    0.502889]   No soundcards found.
[    0.503554] Freeing unused kernel memory: 1088K (ffffffff81f2f000 - ffffffff8203f000)
[    0.504262] Write protecting the kernel read-only data: 14336k
[    0.505004] Freeing unused kernel memory: 1680K (ffff88000185c000 - ffff880001a00000)
[    0.505855] Freeing unused kernel memory: 1340K (ffff880001cb1000 - ffff880001e00000)
Hello world!
[    1.089618] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    1.092997] ACPI: Preparing to enter system sleep state S5
[    1.094083] reboot: Power down

Great. We've built our own kernel, passed it a test binary to run, and got it booted in a qemu command that exits. This is turning into something we can pass to git bisect run. Now it's time to write the actual test. Here's what I ultimately ended up with to track down my bug:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/reboot.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The problem I was tracing was only reproducible with syncookies
       disabled. While the initrd gets unpacked into a writable temporary
       filesystem, nothing exists yet, so if I need /proc, I need to create
       and mount it myself. */
    if (getpid() == 1) {
        mkdir("/proc");
        mount("proc", "/proc", "proc", 0, NULL);
        char buf[] = "0\n";
        int fd = open("/proc/sys/net/ipv4/tcp_syncookies", O_WRONLY);
        write(fd, buf, 2);
        close(fd);
    }

    int server = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    /* Also, while a loopback ethernet device exist, it's not
       enabled, so network tests won't work. This code is equivalent
       to `ifconfig lo up`. */
    struct ifreq ifreq = {
        .ifr_name = "lo",
    };
    ioctl(server, SIOCGIFFLAGS, &ifreq);
    if (!(ifreq.ifr_flags & IFF_UP)) {
        ifreq.ifr_flags |= IFF_UP;
        ioctl(server, SIOCSIFFLAGS, &ifreq);
    }

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(54321),
        .sin_addr = {htonl(INADDR_LOOPBACK)},
    };
    bind(server, (struct sockaddr *)&addr, sizeof(addr));
    listen(server, 0);

    int client = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct timeval timeout = {3, 0};
    setsockopt(client, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("Success\n");
    } else {
        perror("connect");
    }
    if (getpid() == 1) {
        reboot(RB_POWER_OFF);
    }
    return 0;
}

Most of it is specific to the thing I was trying to test, but you may also need the code to create and mount /proc or to enable lo. Also, I put a few things conditional on getpid() == 1 so that I could safely test the program on my host system, where it wasn't running as root and where I didn't want it powering anything off. (I ran it a few times under strace to make sure it was doing what I expected it to do, and I didn't want to bother with getting strace inside my initrd.)

So I first made sure this is reproducible on a stock kernel by itself, isolated from any config my workplace might add:

$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success
$ git checkout v4.1
$ make defconfig && make -j8
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success
Success

Cool, it's definitely a regression somewhere between those versions. (The set of config options change from kernel version to kernel version, so across this wide of a range, the easiest thing is to just get the current kernel's default config - if you need custom config options, you might want to edit .config after running make defconfig or something.) Now time to let git bisect run do its thing:

$ git bisect start
$ git bisect bad v4.9
$ git bisect good v4.1
$ git bisect run sh -c 'make defconfig && make -j8 && qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success'

It started printing a bunch of build logs and I went to work on something else. About half an hour later (I expected it to take longer!), it prints this out:

ef547f2ac16bd9d77a780a0e7c70857e69e8f23f is the first bad commit
commit ef547f2ac16bd9d77a780a0e7c70857e69e8f23f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Oct 2 11:43:37 2015 -0700

    tcp: remove max_qlen_log

    This control variable was set at first listen(fd, backlog)
    call, but not updated if application tried to increase or decrease
    backlog. It made sense at the time listener had a non resizeable
    hash table.

    Also rounding to powers of two was not very friendly.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
$ git describe --contains ef547f2ac16bd9d77a780a0e7c70857e69e8f23f
v4.4-rc1~141^2~238^2~2

which looks awfully relevant—it implies they were previously rounding off the backlog. Looking at commit, we can see what happened: before kernel 4.4, the backlog argument was always capped to at least 8, and also rounded up to the next power of two. So listen(fd, 0) was turning into listen(fd, 8) on older kernels, and the program previously worked despite using listen() incorrectly. This commit was actually somewhere in the git log I was trying to read, but I must have scrolled past it.

git reflog shows that git bisect went through sixteen commits before settling on this one: it found this one on its 11th try, and then spent 5 more commits confirming that all the commits before this one were good. So I'm glad git bisect run found this commit, and I'm especially glad it found it in half an hour unattended, without me having to manually compile and test sixteen kernels by hand.

2 March 2018