Printing a program's state as it runs is the simple but effective debugging tool of programmers everywhere. For efficiency, we usually disable the most verbose output in production. But sometimes you need to diagnose a problem in a deployed system. It would be convenient to declare "tracepoints" and enable them at runtime, like so:
tracepoint foo_entry;
int foo(int n) {
TRACE(foo_entry, "called foo(%d)\n", n);
// ...
}
// Called from UI, monitoring interface, etc.
void debug_foo() {
enable(&foo_entry);
}
Here's a simple implementation of this API:
typedef int tracepoint;
#define TRACE(_point, _args...) \
do { \
if (_point) printf(_args); \
} while (0)
static inline void enable(tracepoint *point) {
*point = 1;
}
Each tracepoint is simply a global variable. The construct do { ... } while (0)
is a standard trick to make macro-expanded code play nicely with its surroundings. We also use GCC's syntax for macros with a variable number of arguments.
This approach does introduce a bit of overhead. One concern is that reading a global variable will cause a cache miss and will also evict a line of useful data from the cache. There's also some impact from adding a branch instruction. We'll develop a significantly more complicated implementation which avoids both of these problems.
Our new solution will be specific to x86-64 processors running Linux, though the idea can be ported to other platforms. This approach is inspired by various self-modifying-code schemes in the Linux kernel, such as ftrace, kprobes, immediate values, etc. It's mostly intended as an example of how these tricks work. The code in this article is not production-ready.
The design
Our new TRACE
macro will produce code like the following pseudo-assembly:
foo:
...
; code before tracepoint
...
tracepoint:
nop
after_tracepoint:
...
; rest of function
...
ret
do_tracepoint:
push args to printf
call printf
jmp after_tracepoint
In the common case, the tracepoint is disabled, and the overhead is only a single nop
instruction. To enable the tracepoint, we replace the nop
instruction in memory with jmp do_tracepoint
.
The TRACE
macro
Our nop
instruction needs to be big enough that we can overwrite it with an unconditional jump. On x86-64, the standard jmp
instruction has a 1-byte opcode and a 4-byte signed relative displacement, so we need a 5-byte nop
. Five one-byte 0x90
instructions would work, but a single five-byte instruction will consume fewer CPU resources. Finding the best way to do nothing is actually rather difficult, but the Linux kernel has already compiled a list of favorite nops. We'll use this one:
#define NOP5 ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"
Let's check this instruction using udcli
:
$ echo 0f 1f 44 00 00 | udcli -x -64 -att
0000000000000000 0f1f440000 nop 0x0(%rax,%rax)
GCC's extended inline assembly lets us insert arbitrarily bizarre assembly code into a normal C program. We'll use the asm goto
flavor, new in GCC 4.5, so that we can pass C labels into our assembly code. (The tracing use case inspired the asm goto
feature, and my macro is adapted from an example in the GCC manual.)
Here's how it looks:
typedef int tracepoint;
#define TRACE(_point, _args...) \
do { \
asm goto ( \
"0: " NOP5 \
".pushsection trace_table, \"a\";" \
".quad " #_point ", 0b, %l0;" \
".popsection" \
: : : : __lbl_##_point); \
if (0) { \
__lbl_##_point: printf(_args); \
} \
} while (0)
We use the stringify and concat macro operators, and rely on the gluing together of adjacent string literals. A call like this:
TRACE(foo_entry, "called foo(%d)\n", n);
will produce the following code:
do {
asm goto (
"0: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"
".pushsection trace_table, \"a\";"
".quad foo_entry, 0b, %l0;"
".popsection"
: : : : __lbl_foo_entry);
if (0) {
__lbl_foo_entry: printf("called foo(%d)\n", n);
}
} while (0);
Besides emitting the nop
instruction, we write three 64-bit values ("quad
s"). They are, in order:
- The address of the
tracepoint
variable declared by the user. We never actually read or write this variable. We're just using its address as a unique key. - The address of the
nop
instruction, by way of a local assembler label. - The address of the C label for our
printf
call, as passed toasm goto
.
This is the information we need in order to patch in a jmp
at runtime. The .pushsection
directive makes the assembler write into the trace_table
section without disrupting the normal flow of code and data. The "a"
section flag marks these bytes as "allocatable", i.e. something we actually want available at runtime.
We count on GCC's optimizer to notice that the condition 0
is unlikely to be true, and therefore move the if
body to the end of the function. It's still considered reachable due to the label passed to asm goto
, so it will not fall victim to dead code elimination.
The linker script
We have to collect all of these trace_table
records, possibly from multiple source files, and put them somewhere for use by our C code. We'll do this with the following linker script:
SECTIONS {
trace_table : {
trace_table_start = .;
*(trace_table)
trace_table_end = .;
}
}
This concatenates all trace_table
sections into a single section in the resulting binary. It also provides symbols trace_table_start
and trace_table_end
at the endpoints of this section.
Memory protection
Linux systems will prevent an application from overwriting its own code, for good security reasons, but we can explicitly override these permissions. Memory permissions are managed per page of memory. There's a correct way to determine the size of a page, but our code is terribly x86-specific anyway, so we'll hardcode the page size of 4096 bytes.
#define PAGE_SIZE 4096
#define PAGE_OF(_addr) ( ((uint64_t) (_addr)) & ~(PAGE_SIZE-1) )
Then we can unprotect an arbitrary region of memory by calling mprotect
for the appropriate page(s):
static void unprotect(void *addr, size_t len) {
uint64_t pg1 = PAGE_OF(addr),
pg2 = PAGE_OF(addr + len - 1);
if (mprotect((void *) pg1, pg2 - pg1 + PAGE_SIZE,
PROT_READ | PROT_EXEC | PROT_WRITE)) {
perror("mprotect");
abort();
}
}
We're calling mprotect
on a page which was not obtained from mmap
. POSIX does not define this behavior, but Linux specifically allows mprotect
on any page except the vsyscall page.
Enabling a tracepoint
Now we need to implement the enable
function:
void enable(tracepoint *point);
We will scan through the trace_table
records looking for a matching tracepoint
pointer. The C struct corresponding to a trace_table
record is:
struct trace_desc {
tracepoint *point;
void *jump_from;
void *jump_to;
} __attribute__((packed));
The packed
attribute tells GCC not to insert any padding within or after these structs. This ensures that their layout will match the records we produced from assembly. Now we can implement a linear search through this table.
void enable(tracepoint *point) {
extern struct trace_desc trace_table_start[], trace_table_end[];
struct trace_desc *desc;
for (desc = trace_table_start; desc < trace_table_end; desc++) {
if (desc->point != point)
continue;
int64_t offset = (desc->jump_to - desc->jump_from) - 5;
if ((offset > INT32_MAX) || (offset < INT32_MIN)) {
fprintf(stderr, "offset too big: %lx\n", offset);
abort();
}
int32_t offset32 = offset;
unsigned char *dest = desc->jump_from;
unprotect(dest, 5);
dest[0] = 0xe9;
memcpy(dest+1, &offset32, 4);
}
}
We enable a tracepoint by overwriting its nop
with an unconditional jump. The opcode is 0xe9
. The operand is a 32-bit displacement, interpreted relative to the instruction after the jump. desc->jump_from
points to the beginning of what will be the jump instruction, so we subtract 5 from the displacement. Then we unprotect memory and write the new bytes into place.
That's everything. You can grab all of this code from GitHub, including a simple test program.
Pitfalls
Where to start?
This code is extremely non-portable, relying on details of x86-64, Linux, and specific recent versions of the GNU C compiler and assembler. The idea can be ported to other platforms, with some care. For example, ARM processors require an instruction cache flush after writing to code. Linux on ARM implements the cacheflush
system call for this purpose.
Our code is not thread-safe, either. If one thread reaches a nop
while it is being overwritten by another thread, the result will surely be a crash or other horrible bug. The Ksplice paper [PDF] discusses how to prevent this, in the context of live-patching the Linux kernel.
Is it worth opening this can of worms in order to improve performance a little? In general, no. Obviously we'd have to measure the performance difference to be sure. But for most projects, concerns of maintainability and avoiding bugs will preclude tricky hacks like this one.
The Linux kernel is under extreme demands for both performance and flexibility. It's part of every application on a huge number of systems, so any small performance improvement has a large aggregate effect. And those systems are incredibly diverse, making it likely that someone will see a large difference. Finally, kernel development will always involve tricky low-level code as a matter of course. The infrastructure is already there to support it — both software infrastructure and knowledgeable developers.