Sunday, January 9, 2011

Implementing breakpoints on x86 Linux

Debugger features such as breakpoints go well beyond the normal ways in which programs interact. How can one process force another process to pause at a particular line of code? We can bet that the debugger is using some special features of the CPU and/or operating system.

As I learned about implementing breakpoints, I was surprised to find that these interfaces, while arcane, are not actually very complex. In this article we'll implement a minimal but working breakpoints library in about 75 lines of C code. All of the code is shown here, and is also available on GitHub.

True, most of us will never write a debugger. But breakpoints are useful in many other tools: profilers, compatibility layers, security scanners, etc. If your program has to interact deeply with another process, without modifying its source code, breakpoints might just do the trick. And I think there's value in knowing what your debugger is really doing.

Our code will compile with gcc and will run on Linux machines with 32- or 64-bit x86 processors. I'll assume you're familiar with the basics of UNIX programming in C. We'll make heavy use of the ptrace system call, which I'll try to explain as we go. If you're looking for other reading about ptrace, I highly recommend this article and of course the manpage.

The API

Our little breakpoint library is named breakfast. Here's breakfast.h:

#ifndef _BREAKFAST_H
#define _BREAKFAST_H

#include <sys/types.h> /* for pid_t */

typedef void *target_addr_t;
struct breakpoint;

void breakfast_attach(pid_t pid);
target_addr_t breakfast_getip(pid_t pid);
struct breakpoint *breakfast_break(pid_t pid, target_addr_t addr);
int breakfast_run(pid_t pid, struct breakpoint *bp);

#endif

Before we dive into the implementation, we'll describe the API our library provides. breakfast is used by one process (the tracing process) to place breakpoints inside the code of another process (the target process). To establish this relationship, the tracing process calls breakfast_attach with the target's process ID. This also suspends execution of the target.

We use breakfast_getip to read the target's instruction pointer, which holds the address of the next instruction it will execute. Note that this is a pointer to machine code within the target's address space. Dereferencing the pointer within our tracing process would make little sense. The type alias target_addr_t reminds us of this fact.

We use breakfast_break to create a breakpoint at a specified address in the target's machine code. This function returns a pointer to a breakpoint structure, which has unspecified contents.

Calling breakfast_run lets the target run until it hits a breakpoint or it exits; breakfast_run returns 1 or 0 respectively. On the first invocation, we'll pass NULL for the argument bp. On subsequent calls, we're resuming from a breakpoint, and we must pass the corresponding breakpoint struct pointer. The breakfast_getip function will tell us the breakpoint's address, but it's our responsibility to map this to the correct breakpoint structure.

The trap instruction

We will implement a breakpoint by inserting a new CPU instruction into the target's memory at the breakpoint address. This instruction should suspend execution of the target and give control back to the OS.

There are many ways to return control to the OS, but we'd like to minimize disruption to the code we're hot-patching. x86 provides an instruction for exactly this purpose. It's named int3 and is encoded as the single byte 0xCC. When the CPU executes int3, it will stop what it's doing and jump to the interrupt 3 service routine, which is a piece of code chosen by the OS. On Linux, this routine will send the signal SIGTRAP to the current process.

Since we're placing int3 in the target's code, the target will receive a SIGTRAP. Under normal circumstances this would invoke the target's SIGTRAP handler, which usually kills the process. Instead we'd like the tracing process to intercept this signal and interpret it as the target hitting a breakpoint. We'll accomplish this through the magic of ptrace.

Let's start off breakfast.c with some includes:

#include <sys/ptrace.h>
#include <sys/reg.h>
#include <sys/wait.h>
#include <stdlib.h>
#include "breakfast.h"

Then we'll define our trap instruction:

#if defined(__i386)
#define REGISTER_IP EIP
#define TRAP_LEN 1
#define TRAP_INST 0xCC
#define TRAP_MASK 0xFFFFFF00

#elif defined(__x86_64)
#define REGISTER_IP RIP
#define TRAP_LEN 1
#define TRAP_INST 0xCC
#define TRAP_MASK 0xFFFFFFFFFFFFFF00

#else
#error Unsupported architecture
#endif

We use some GCC pre-defined macros to discover the machine architecture, and we define a few constants accordingly. REGISTER_IP expands to a constant from sys/reg.h which identifies the machine register holding the instruction pointer.

The trap instruction is stored as an integer TRAP_INST and its length in bytes is TRAP_LEN. These are the same on 32- and 64-bit x86. The trap instruction is a single byte, but we'll read and write the target's memory in increments of one machine word, meaning 32 or 64 bits. So we'll read 4 or 8 bytes of machine code, clear out the first byte with TRAP_MASK, and substitute 0xCC. Since x86 is a little-endian architecture, the first byte in memory is the least-significant byte of an integer machine word.

Invoking ptrace

All of the various ptrace requests are made through a single system call named ptrace. The first argument specifies the type of request and the meaning of any remaining arguments. The second argument is almost always the process ID of the target.

Before we can mess with the target process, we need to attach to it:

void breakfast_attach(pid_t pid) {
int status;
ptrace(PTRACE_ATTACH, pid);
waitpid(pid, &status, 0);
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT);
}

The PTRACE_ATTACH request will stop the target with SIGSTOP. We wait for the target to receive this signal. We also request to receive a notification when the target exits, which I'll explain below.

Note that ptrace and waitpid could fail in various ways. In a real application you'd want to check the return value and/or errno. For brevity I've omitted those checks in this article.

We use another ptrace request to get the value of the target's instruction pointer:

target_addr_t breakfast_getip(pid_t pid) {
long v = ptrace(PTRACE_PEEKUSER, pid, sizeof(long)*REGISTER_IP);
return (target_addr_t) (v - TRAP_LEN);
}

Since the target process is suspended, it's not running on any CPU, and its instruction pointer is not stored in a real CPU register. Instead, it's saved to a "user area" in kernel memory. We use the PTRACE_PEEKUSER request to read a machine word from this area at a specified byte offset. The constants in sys/regs.h give the order in which registers appear, so we simply multiply by sizeof(long).

After we hit a breakpoint, the saved IP points to the instruction after the trap instruction. When we resume execution, we'll go back and execute the original instruction that we overwrote with the trap. So we subtract TRAP_LEN to give the true address of the next instruction.

Creating breakpoints

We need to remember two things about a breakpoint: the address of the code we replaced, and the code which originally lived there.

struct breakpoint {
target_addr_t addr;
long orig_code;
};

To enable a breakpoint, we save the original code and insert our trap instruction:

static void enable(pid_t pid, struct breakpoint *bp) {
long orig = ptrace(PTRACE_PEEKTEXT, pid, bp->addr);
ptrace(PTRACE_POKETEXT, pid, bp->addr, (orig & TRAP_MASK) | TRAP_INST);
bp->orig_code = orig;
}

The PTRACE_PEEKTEXT request reads a machine word from the target's code address space, which is named "text" for historical reasons. PTRACE_POKETEXT writes to that space. On x86 Linux there's actually no difference between code and data spaces, so PTRACE_PEEKDATA and PTRACE_POKEDATA would work just as well.

Creating a breakpoint is straightforward:

struct breakpoint *breakfast_break(pid_t pid, target_addr_t addr) {
struct breakpoint *bp = malloc(sizeof(*bp));
bp->addr = addr;
enable(pid, bp);
return bp;
}

To disable a breakpoint we just write back the saved word:

static void disable(pid_t pid, struct breakpoint *bp) {
ptrace(PTRACE_POKETEXT, pid, bp->addr, bp->orig_code);
}

Running and stepping

Once we attach to the target, its execution stops. Here's how to resume it:

static int run(pid_t pid, int cmd);  /* defined momentarily */

int breakfast_run(pid_t pid, struct breakpoint *bp) {
if (bp) {
ptrace(PTRACE_POKEUSER, pid, sizeof(long)*REGISTER_IP, bp->addr);

disable(pid, bp);
if (!run(pid, PTRACE_SINGLESTEP))
return 0;
enable(pid, bp);
}
return run(pid, PTRACE_CONT);
}

We ask ptrace to continue execution — but if we're resuming from a breakpoint, we have to do some cleanup first. We rewind the instruction pointer so that the next instruction to execute is at the breakpoint. Then we disable the breakpoint and make the target execute just one instruction. Once we're past the breakpoint, we can re-enable it for next time. run returns 0 if the target exits, which theoretically could happen during our single-step.

The last piece of the puzzle is run:

static int run(pid_t pid, int cmd) {
int status, last_sig = 0, event;
while (1) {
ptrace(cmd, pid, 0, last_sig);
waitpid(pid, &status, 0);

if (WIFEXITED(status))
return 0;

if (WIFSTOPPED(status)) {
last_sig = WSTOPSIG(status);
if (last_sig == SIGTRAP) {
event = (status >> 16) & 0xffff;
return (event == PTRACE_EVENT_EXIT) ? 0 : 1;
}
}
}
}

cmd is either PTRACE_CONT or PTRACE_SINGLESTEP. In the latter case, the OS will set a control bit to make the CPU raise interrupt 3 after one instruction has completed.

ptrace will let the target run until it exits or receives a signal. We use the wait macros to see what happened. If the target exited, we simply return 0. If it received a signal, we check which one. Anything other than SIGTRAP should be delivered through to the target, which we accomplish by passing the signal number to the next ptrace call.

In the case of SIGTRAP, we check bits 16-31 of status for the value PTRACE_EVENT_EXIT, which indicates that the target is about to exit. Recall that we requested this notification by setting the option PTRACE_O_TRACEEXIT. You'd think (at least, I did) that checking WIFEXITED should be enough. But I ran into a problem where delivering a fatal signal to the target would make the tracing process loop forever. I fixed it by adding this extra check. I'm certainly no ptrace expert and I'd be interested to hear more about what's going on here.

Testing it

That's all for breakfast.c! Now let's write a small test.c.

#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include "breakfast.h"

void child();
void parent(pid_t);

int main() {
pid_t pid;
if ((pid = fork()) == 0)
child();
else
parent(pid);
return 0;
}

We fork two separate processes, a parent and a child. The child will do some computation we hope to observe. Let's make it compute the famous factorial function:

int fact(int n) {
if (n <= 1)
return 1;
return n * fact(n-1);
}

void child() {
kill(getpid(), SIGSTOP);
printf("fact(5) = %d\n", fact(5));
}

The parent is going to invoke PTRACE_ATTACH and send the child SIGSTOP, but the child might finish executing before the parent gets a chance to call ptrace. So we make the child preemptively stop itself. This isn't a problem when attaching to a long-running process.

Actually, for this fork-and-trace-child pattern, we should be using PTRACE_TRACEME. But we implemented a minimal library, so we work with what we have.

The parent uses breakfast to place breakpoints in the child:

void parent(pid_t pid) {
struct breakpoint *fact_break, *last_break = NULL;
void *fact_ip = fact, *last_ip;
breakfast_attach(pid);
fact_break = breakfast_break(pid, fact_ip);
while (breakfast_run(pid, last_break)) {
last_ip = breakfast_getip(pid);
if (last_ip == fact_ip) {
printf("Break at fact()\n");
last_break = fact_break;
} else {
printf("Unknown trap at %p\n", last_ip);
last_break = NULL;
}
}
}

In principle we can use breakfast to trace any running process we own, even if we don't have its source code. But we still need a way to find interesting breakpoint addresses. Here it's as easy as we could want: the child forked from our process, so we get the address of its fact function just by taking the address of our fact function.

Let's try it out:

$ gcc -Wall -o test test.c breakfast.c
$ ./test
Break at fact()
Break at fact()
Break at fact()
Break at fact()
Break at fact()
fact(5) = 120

We can see the five calls to fact.

Inspecting the target

The ability to count function calls is already useful for performance profiling. But we'll usually want to get more information out of the stopped target. Let's see if we can read the arguments passed to fact. This part will be specific to 64-bit x86, though the idea generalizes.

Each architecture defines a C calling convention, which specifies how function arguments are passed, often using a combination of registers and stack slots. On 64-bit x86, the first argument is passed in the RDI register. You can verify this by running objdump -d test and looking at the disassembled code for fact.

So we'll modify test.c to read this register:

#include <sys/ptrace.h>
#include <sys/reg.h>
...
if (last_ip == fact_ip) {
int arg = ptrace(PTRACE_PEEKUSER, pid, sizeof(long)*RDI);
printf("Break at fact(%d)\n", arg);
...

And we get:

$ gcc -Wall -o test test.c breakfast.c
$ ./test
Break at fact(5)
Break at fact(4)
Break at fact(3)
Break at fact(2)
Break at fact(1)
fact(5) = 120

Cool!

Notes

This is my first non-trivial project using ptrace, and it's likely I got some details wrong. Please reply with corrections or advice and I'll update the code if necessary.

This code is available for download at GitHub. It's released under a BSD license, which imposes few restrictions. It's definitely not suitable as-is for production use, but feel free to take the code and build something more with it.

The x86 architecture has dedicated registers for setting breakpoints, subject to various limitations. We ignored this feature in favor of the more flexible software breakpoints. Hardware breakpoints can do some things this technique can't, such as breaking on reads from a particular memory address.

3 comments:

  1. Nice post.

    One subtlety: It seems that replacing a long's worth in disable() means that bad things could happen when two breakpoints are within sizeof(long)-1 of each other. To avoid this I think it suffices for disable() to replace only the single byte occupied by the int3 instruction.

    ReplyDelete
  2. still don't know what happens in:
    ptrace(PTRACE_POKETEXT, pid, bp->addr, (orig & TRAP_MASK) | TRAP_INST);
    overwrite this single assemble code ? what happens?
    record pc, jump to int3, and then jump back?

    ReplyDelete
  3. Sophia,

    That is replacing the instruction at the breakpoint address with the trap machine instruction int3.

    ptrace(PTRACE_POKETEXT, /* write into data area */
    pid, /* process identifier */
    bp->addr, /* address of the breakpoint */
    (orig & TRAP_MASK) | TRAP_INST); /* see below */

    Here, the variable orig contains the original instruction data at our desired breakpoint address. For x86, this is 4 bytes. Imagine orig contains the value 0xDEADBEEF (some arbitrary machine instruction), then orig & TRAP_MASK will evaluate to 0xDEADBE00. 0xDEADBE00 | TRAP_INST is the same as 0XDEADBE00 | 0xCC since TRAP_INST is defined as 0xCC (the int3 machine instruction). This expression will evaluate to 0xDEADBECC.

    Because of an oddity with x86 (it's little endian), the bytes are actually reversed when they're written back. So the value 0xCCBEADDE will be written to memory at our breakpoint address. The end result is that the x86 instruction int3, our "break" instruction, is now located at the address we wanted to break at: bp->addr.

    ReplyDelete