Saturday, January 10, 2015

151-byte static Linux binary in Rust

Part of the sales pitch for Rust is that it's "as bare metal as C".1 Rust can do anything C can do, run anywhere C can run,2 with code that's just as efficient, and at least as safe (but usually much safer).

I'd say this claim is about 95% true, which is pretty good by the standards of marketing claims. A while back I decided to put it to the test, by making the smallest, most self-contained Rust program possible. After resolving a few issues along the way, I ended up with a 151-byte, statically linked executable for AMD64 Linux. With the release of Rust 1.0-alpha, it's time to show this off.

Here's the Rust code:

#![crate_type="rlib"]
#![allow(unstable)]

#[macro_use] extern crate syscall;

use std::intrinsics;

fn exit(n: usize) -> ! {
    unsafe {
        syscall!(EXIT, n);
        intrinsics::unreachable()
    }
}

fn write(fd: usize, buf: &[u8]) {
    unsafe {
        syscall!(WRITE, fd, buf.as_ptr(), buf.len());
    }
}

#[no_mangle]
pub fn main() {
    write(1, "Hello!\n".as_bytes());
    exit(0);
}

This uses my syscall library, which provides the syscall! macro. We wrap the underlying system calls with Rust functions, each exposing a safe interface to the unsafe syscall! macro. The main function uses these two safe functions and doesn't need its own unsafe annotation. Even in such a small program, Rust allows us to isolate memory unsafety to a subset of the code.

Because of crate_type="rlib", rustc will build this as a static library, from which we extract a single object file tinyrust.o:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 00 00 00 00          mov    $0x0,%esi
                        b: R_X86_64_32  .rodata.str1625
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

We disable stack exhaustion checking, as well as position-independent code, in order to slim down the output. After optimization, the only instructions that survive come from inline assembly blocks in the syscall library.

Note that main doesn't end in a ret instruction. The exit function (which gets inlined) is marked with a "return type" of !, meaning "doesn't return". We make good on this by invoking the unreachable intrinsic after syscall!. LLVM will optimize under the assumption that we can never reach this point, making no guarantees about the program behavior if it is reached. This represents the fact that the kernel is actually going to kill the process before syscall!(EXIT, n) can return.

Because we use inline assembly and intrinsics, this code is not going to work on a stable-channel build of Rust 1.0. It will require an alpha or nightly build until such time as inline assembly and intrinsics::unreachable are added to the stable language of Rust 1.x.

Note that I didn't even use #![no_std]! This program is so tiny that everything it pulls from libstd is a type definition, macro, or fully inlined function. As a result there's nothing of libstd left in the compiler output. In a larger program you may need #![no_std], although its role is greatly reduced following the removal of Rust's runtime.

Linking

This is where things get weird.

Whether we compile from C or Rust,3 the standard linker toolchain is going to include a bunch of junk we don't need. So I cooked up my own linker script:

SECTIONS {
    . = 0x400078;
  
    combined . : AT(0x400078) ALIGN(1) SUBALIGN(1) {
        *(.text*)
        *(.data*)
        *(.rodata*)
        *(.bss*)
    }
}

We smash all the sections together, with no alignment padding, then extract that section as a headerless binary blob:

$ ld --gc-sections -e main -T script.ld -o payload tinyrust.o
$ objcopy -j combined -O binary payload payload.bin

Finally we stick this on the end of a custom ELF header. The header is written in NASM syntax but contains no instructions, only data fields. The base address 0x400078 seen above is the end of this header, when the whole file is loaded at 0x400000. There's no guarantee that ld will put main at the beginning of the file, so we need to separately determine the address of main and fill that in as the e_entry field in the ELF file header.

$ ENTRY=$(nm -f posix payload | grep '^main ' | awk '{print $3}')
$ nasm -f bin -o tinyrust -D entry=0x$ENTRY elf.s
$ chmod +x ./tinyrust
$ ./tinyrust
Hello!

It works! And the size:

$ wc -c < tinyrust
158

Seven bytes too big!

The final trick

To get down to 151 bytes, I took inspiration from this classic article, which observes that padding fields in the ELF header can be used to store other data. Like, say, a string constant. The Rust code changes to access this constant:

use std::{mem, raw};

#[no_mangle]
pub fn main() {
    let message: &'static [u8] = unsafe {
        mem::transmute(raw::Slice {
            data: 0x00400008 as *const u8,
            len: 7,
        })
    };

    write(1, message);
    exit(0);
}

A Rust slice like &[u8] consists of a pointer to some memory, and a length indicating the number of elements that may be found there. The module std::raw exposes this as an ordinary struct that we build, then transmute to the actual slice type. The transmute function generates no code; it just tells the type checker to treat our raw::Slice<u8> as if it were a &[u8]. We return this value out of the unsafe block, taking advantage of the "everything is an expression" syntax, and then print the message as before.

Trying out the new version:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:        
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 08 00 40 00          mov    $0x400008,%esi
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

...
$ wc -c < tinyrust
151
$ ./tinyrust
Hello!

The object code is the same as before, except that the relocation for the string constant has become an absolute address. The binary is smaller by 7 bytes (the size of "Hello!\n") and it still works!

You can find the full code on GitHub. The code in this article works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub, I will also update the version number printed by the included build script.

I'd be curious to hear if anyone can make my program smaller!


  1. C is not really "bare metal", but that's another story

  2. From a pure language perspective. If you want to talk about availability of compilers and libraries, then Rust still has a bit of a disadvantage ;) 

  3. In fact, this code grew out of an earlier experiment with really small binaries in C. 

12 comments:

  1. Cool, here is an equivalent 71 bytes haskell executable (for comparison):

    #!/usr/local/bin/runhaskell
    module Main where
    main = putStrLn "Hello!"

    ReplyDelete
    Replies
    1. Surely you're trolling, but that's not a stand-alone binary. It would require /usr/local/bin/runhaskell and everything that thing requires...

      Delete
    2. HugoDaniel, very nicely done.

      Delete
  2. @HugoDaniel - That looks like 71 bytes of text but just with a hashbang so bash can execute it. This article is talking about ELF binaries which are directly executable. Your example would require a haskell compiler or interpreter to run.

    ReplyDelete
  3. ...And since I don't know the haskell ecosystem that well, I'd just get the whole dev env. Last time, that weighed in at over a gibibyte, but hey, I'm sure it'll fit on a /modern/ toaster.

    ReplyDelete
  4. Graeme, the shebang is the executable magic number (located in the first 2 bytes of the file) for the os interpreter (following at most 128 bytes for the location of the interpreter). I guess not many people might know this. Shebang is actually and executable format such as AOUT, ELF or gzipped ELF.
    A bit more info: https://en.wikipedia.org/wiki/Shebang_%28Unix%29#Magic_number

    Anders Eurenius, binary size doesn't really matter much since the OS can demand pages as it needs them (the whole thing is not loaded into memory, even in a small static linked binary).
    A bit more info here: https://en.wikipedia.org/wiki/Demand_paging

    ReplyDelete
  5. @HugoDaniel -- the point is that your program requires Haskell runtime to be able to run. The program in this blog post requires only standard libc, so it will run without installing anything.

    ReplyDelete
  6. It doesn't depend on libc, either. Just the syscall ABI that's built in to the kernel.

    ReplyDelete
  7. We should let someone just check the execution times and memory overhead of both the haskell wrapper and the binary code, then the difference might be noticeable even for HD.
    Regarding binary size, try using both implementations on a 8bit microcontroller and see who succeeds.

    ReplyDelete
  8. Running a *nix kernel on a 8-bit microcontroller must be very similar to eating ice-cream with your forehead.

    ReplyDelete
  9. It looks like there's actually a missed optimisation in the LLVM code gen, in the first example it should produce xor %esi,%esi instead of mov $0x0,%esi, that would reduce the code size by 4 bytes and make it run quicker.

    It's probably a bug in the way constants are passed into an inline asm, although I'd have thought the peephole pass would have picked it up.

    ReplyDelete
  10. The article looks out of step with current no_std-wielding Rust. Shall there be "151-byte Rust binary revisited"?

    ReplyDelete