In which x86 has too many registers after all.
Introduction
The fiendish complexity of the x86 instruction set means that even bizarrely restricted subsets are capable of arbitrary computation. As others have shown, we can compute using alphanumeric machine code or English sentences, using only the mov
instruction, or using the MMU as it handles a never-ending double-fault. Here is my contribution to this genre of Turing tarpit: x86 is Turing-complete with no registers.
No registers?
What do I mean by "no registers"? Well, really just whatever makes the puzzle interesting, but the basic guideline is:
No instruction's observable behavior can depend on the contents of any ordinary user-space register.
So we can't read from R[ABCD]X
, R[SD]I
, R[SB]P
(that's right, no stack), R8
-R15
, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like push
or movsb
as well as explicit operands. I think I would allow RIP
-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.
We also can't use the condition flags in EFLAGS
, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.
All memory access depends on segment selectors, the page table base in CR3
, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of EFLAGS
(such as the trap bit), and numerous MSRs which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.
So what's left that we can work with? Just
- the instruction pointer,
- memory operands, and
- self-modifying code.
But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. Later on I'll demonstrate experimentally that we aren't cheating.
The instruction set
In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:
c6042500004000ba mov byte [0x400000], 0xba
66c7042500004000dbba mov word [0x400000], 0xbadb
c7042500004000efbead0b mov dword [0x400000], 0xbadbeef
48c7042500004000efbead0b mov qword [0x400000], 0xbadbeef
In the latter case the 4-byte constant is sign-extended to 8 bytes.
We can also perform arithmetic on a memory location in place:
8304250000400010 add dword [0x400000], 0x10
But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example movsb
, depends on registers in some way.
Conditional control flow is possible thanks to this gem of an instruction:
ff242500004000 jmp qword [0x400000]
This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.
Compiling from Brainfuck
Let's get more concrete and talk about compiling Brainfuck code to this subset of x86. Brainfuck isn't the simplest language out there (try Subleq) but it's pretty familiar as an imperative, structured-control language. So I think compiling from Brainfuck makes this feel "more real" than compiling from something super weird.
A Brainfuck program executes on a linear tape of (typically) byte-size cells.
TAPE_SIZE equ 30000
tape_start:
times TAPE_SIZE dq cell0
head equ tape_start + (TAPE_SIZE / 2)
Like many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. head
is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.
Since our only way to read memory is jmp [addr]
, the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.
cont_zero: dq 0
cont_nonzero: dq 0
out_byte: db 0
align 16
cell_underflow:
jmp inc_cell
align 16
cell0:
mov byte [out_byte], 0
jmp [cont_zero]
%assign cellval 1
%rep 255
align 16
mov byte [out_byte], cellval
jmp [cont_nonzero]
%assign cellval cellval+1
%endrep
align 16
cell_overflow:
jmp dec_cell
There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into out_byte
and jumps to the address stored at either cont_zero
or cont_nonzero
.
We produce most of the routines using a NASM macro. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. (We could implement the more typical wrap-around behavior with somewhat more code.)
The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's +
and -
by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.
test_cell:
jmp [head]
inc_cell:
add qword [head], 16
jmp test_cell
dec_cell:
sub qword [head], 16
jmp test_cell
move_right:
add dword [inc_cell+4], 8
add dword [dec_cell+4], 8
add dword [test_cell+3], 8
jmp [cont_zero]
move_left:
sub dword [inc_cell+4], 8
sub dword [dec_cell+4], 8
sub dword [test_cell+3], 8
jmp [cont_zero]
Recall that head
is an assembler constant for the middle of the tape. So inc_cell
etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.
Also note that inc_cell
and dec_cell
jump to test_cell
in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to [cont_zero]
unconditionally.
To output a byte we perform the system call write
(1, &out_byte, 1)
. There's no escaping the fact that the Linux system call ABI uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is messier still but it's not fundamentally different from what we've seen here. Code that self-modifies by calling read
()
is clearly the future of computing.
Putting it all together, I wrote a small Brainfuck compiler which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a NASM macro which will load cont_[non]zero
and jump to one of test_cell
, inc_cell
, etc. For the program [+]
the compiler's output looks like
k00000000: do_branch k00000003, k00000001
k00000001: do_inc k00000002
k00000002: do_branch k00000003, k00000001
k00000003: jmp exit
which blows up into something like
401205: 48c70425611240005c124000 mov qword ptr [0x401261], 0x40125c
401211: 48c704256912400022124000 mov qword ptr [0x401269], 0x401222
40121d: e90cefffff jmp 40012e <test_cell>
401222: 48c70425611240003f124000 mov qword ptr [0x401261], 0x40123f
40122e: 48c70425691240003f124000 mov qword ptr [0x401269], 0x40123f
40123a: e9f6eeffff jmp 400135 <inc_cell>
40123f: 48c70425611240005c124000 mov qword ptr [0x401261], 0x40125c
40124b: 48c704256912400022124000 mov qword ptr [0x401269], 0x401222
401257: e9d2eeffffe9d2eeffff jmp 40012e <test_cell>
40125c: e9c1eeffff jmp 400122 <exit>
Even within our constraints, this code could be a lot more compact. For example, a test
could be merged with a preceding inc
or dec
.
Demos
Let's try it out on some of Daniel B Cristofani's Brainfuck examples.
$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | ./rip
Hello, world!
$ curl -s http://www.hevanet.com/cristofd/brainfuck/fib.b | ./compiler
$ ./rip
0
1
1
2
3
5
8
13
…
And now let's try a Brainfuck interpreter written in Brainfuck. There are several, but we will choose the fastest one (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.
$ curl -s http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b | ./compiler
$ (curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b;
echo '!Uryyb, jbeyq!') | ./rip
Hello, world!
This takes about 4.5 seconds on my machine.
Verifying it with ptrace
How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's ptrace
system call allows us to manipulate the state of a target process.
uint64_t regs_boundary;
void clobber_regs(pid_t child) {
struct user_regs_struct regs_int;
struct user_fpregs_struct regs_fp;
CHECK(ptrace(PTRACE_GETREGS, child, 0, ®s_int));
if (regs_int.rip < regs_boundary)
return;
CHECK(ptrace(PTRACE_GETFPREGS, child, 0, ®s_fp));
// Clear everything before the instruction pointer,
// plus the stack pointer and some bits of EFLAGS.
memset(®s_int, 0, offsetof(struct user_regs_struct, rip));
regs_int.rsp = 0;
regs_int.eflags &= EFLAGS_MASK;
// Clear x87 and SSE registers.
memset(regs_fp.st_space, 0, sizeof(regs_fp.st_space));
memset(regs_fp.xmm_space, 0, sizeof(regs_fp.xmm_space));
CHECK(ptrace(PTRACE_SETREGS, child, 0, ®s_int));
CHECK(ptrace(PTRACE_SETFPREGS, child, 0, ®s_fp));
clobber_count++;
}
For the layout of struct user_regs_struct
, see /usr/include/sys/user.h
.
We allow registers in the first part of the program, which is responsible for system calls. regs_boundary
is set by looking for the symbol FORBID_REGS
in the binary.
We run the target using PTRACE_SINGLESTEP
, which sets the trap flag so that the CPU will raise a debug exception after one instruction. Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on waitpid
.
while (1) {
// For this demo it's simpler if we don't deliver signals to
// the child, so the last argument to ptrace() is zero.
CHECK(ptrace(PTRACE_SINGLESTEP, child, 0, 0));
CHECK(waitpid(child, &status, 0));
if (WIFEXITED(status))
finish(WEXITSTATUS(status));
inst_count++;
clobber_regs(child);
}
And the demo:
$ gcc -O2 -Wall -o regclobber regclobber.c
$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | time ./regclobber ./rip
Hello, world!
Executed 81366 instructions; clobbered registers 81189 times.
0.36user 1.81system 0:01.96elapsed 110%CPU (0avgtext+0avgdata 1392maxresident)k
At almost two seconds elapsed, this is hundreds of times slower than running ./rip
directly. Most of the time is spent in the kernel, handling all those system calls and debug exceptions.
I wrote about ptrace
before if you'd like to see more of the things it can do.
Notes on universality
Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. So does C, because sizeof(void *)
is finite. These systems are Turing-complete when you add an external tape using I/O, but so is a finite-state machine!
So while x86 isn't really Turing-complete, with or without registers, I think the above construction "feels like" arbitrary computation enough to meet the informal definition of "Turing-complete" commonly used by programmers, for example in the mov
is Turing-complete paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets @miuaf
than comments here).