Wednesday, June 21, 2017

A Rust view on Effective Modern C++

Recently I've been reading Effective Modern C++ by Scott Meyers. It's a great book that contains tons of practical advice, as well as horror stories to astound your friends and confuse your enemies. Since Rust shares many core ideas with modern C++, I thought I'd describe how some of the C++ advice translates to Rust, or doesn't.

This is not a general-purpose Rust / C++ comparison. Honestly, it might not make a lot of sense if you haven't read the book I'm referencing. There are a number of C++ features missing in Rust, for example integer template arguments and advanced template metaprogramming. I'll say no more about those because they aren't new to modern C++.

No doubt I have a bias here: I think Rust is a better language for most new development. However, I massively respect the effort the C++ designers have put into modernizing the language, and I think it's still the best choice for many tasks.

There's a common theme that I'll avoid repeating: most of the C++ pitfalls that result in undefined behavior will produce compiler or occasionally runtime errors in Rust.

Chapters 1 & 2: Deducing Types / auto

This is what Rust and many other languages call "type inference". C++ has always had it for calls to function templates, but it became much more powerful in C++11 with the auto keyword.

Rust's type inference seems to be a lot simpler. I think the biggest reason is that Rust treats references as just another type, rather than the weird quasi-transparent things that they are in C++. Also, Rust doesn't require the auto keyword — whenever you want type inference, you just don't write the type. Rust also lacks std::initializer_list, which simplifies the rules further.
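
For example, here's a small sketch of my own:

fn main() {
    let v = vec![1, 2, 3];           // inferred as Vec<i32>
    let r = &v;                      // inferred as &Vec<i32>: a reference is just another type
    let total: i32 = r.iter().sum(); // annotate only where you want to
    assert_eq!(total, 6);
}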

The main disadvantage in Rust is that return types are never inferred for ordinary fn items, only for lambdas. Mostly I think it's good style to write out those types anyway; GHC Haskell warns when you don't. But it does mean that returning a closure without boxing is impossible, and returning a complex iterator chain without boxing is extremely painful. Rust is starting to improve the situation with -> impl Trait.
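
Here's a sketch of returning an unboxed closure with -> impl Trait (still feature-gated on nightly as I write this):

fn make_adder(x: i32) -> impl Fn(i32) -> i32 {
    // The closure's type can't be named, but callers only need to know
    // that it implements Fn(i32) -> i32.
    move |y| x + y
}

fn main() {
    let add5 = make_adder(5);
    assert_eq!(add5(2), 7);
}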

Rust lacks decltype and this is certainly a limitation. Some of the uses of decltype are covered by trait associated types. For example,

template<typename Container, typename Index>
auto get(Container& c, Index i)
    -> decltype(c[i])
{ … }

becomes

fn get<Container, Index, Output>(c: &Container, i: Index) -> &Output
    where Container: ops::Index<Index, Output=Output>
{ … }

The advice to see inferred types by intentionally producing a type error applies equally well in Rust.

Chapter 3: Moving to Modern C++

Initializing values in Rust is much simpler. Constructors are just static methods named by convention, and they take arguments in the ordinary way. For good or for ill, there's no std::initializer_list.
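
For example, here's the usual convention (the type is my own invention):

struct Point {
    x: f64,
    y: f64,
}

impl Point {
    // "new" is just a name by convention; this is an ordinary static method.
    fn new(x: f64, y: f64) -> Point {
        Point { x: x, y: y }
    }
}

fn main() {
    let p = Point::new(1.0, 2.0);
    assert_eq!(p.x + p.y, 3.0);
}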

nullptr is not an issue in Rust. &T and &mut T can't be null, and you can make null raw pointers with ptr::null() or ptr::null_mut(). There are no implicit conversions between pointers and integral types.
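
A quick illustration:

use std::ptr;

fn main() {
    let p: *const i32 = ptr::null();
    let q: *mut i32 = ptr::null_mut();
    assert!(p.is_null() && q.is_null());
    // let r: &i32 = p; // error: no implicit conversion from a raw pointer
}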

Regarding aliases vs. typedefs, Rust also supports two syntaxes:

use foo::Bar as Baz;
type Baz = foo::Bar;

type is a lot more common, and it supports type parameters.
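
For example (the alias here is made up):

use std::collections::HashMap;

// A parameterized alias, which `use ... as ...` can't express:
type Registry<V> = HashMap<String, V>;

fn main() {
    let mut reg: Registry<u32> = HashMap::new();
    reg.insert("answer".to_string(), 42);
    assert_eq!(reg["answer"], 42);
}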

Rust enums are always strongly typed. They are scoped unless you explicitly use MyEnum::*;. A C-like enum (one with no data fields) can be cast to an integral type.
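
For example:

#[allow(dead_code)]
enum Direction { North, East, South, West }

// Opting in to unscoped names would be: use Direction::*;

fn main() {
    let d = Direction::East;  // scoped by default
    assert_eq!(d as u8, 1);   // a C-like enum can be cast to an integral type
}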

f() = delete; has no equivalent in Rust, because Rust doesn't implicitly define functions for you in the first place.

Rust has a keyword in the same spirit as C++'s override: the default keyword, which marks a trait impl item as open to specialization. Unlike override, it's mandatory.
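
Here's a sketch of what that looks like (nightly-only, since specialization is still unstable):

#![feature(specialization)]

trait Describe {
    fn describe(&self) -> String;
}

impl<T> Describe for T {
    // `default` marks this method as open to specialization.
    default fn describe(&self) -> String {
        "something".to_string()
    }
}

impl Describe for u32 {
    // A more specific impl may then override it.
    fn describe(&self) -> String {
        format!("the u32 {}", self)
    }
}

fn main() {
    assert_eq!(().describe(), "something");
    assert_eq!(7u32.describe(), "the u32 7");
}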

As in C++, Rust methods can be declared to take self either by reference or by move. Unlike in C++, you can't easily overload the same method to allow either.
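
For example (types and names my own):

struct Token(String);

impl Token {
    fn peek(&self) -> &str {     // borrows self
        &self.0
    }
    fn consume(self) -> String { // takes self by move
        self.0
    }
}

fn main() {
    let t = Token("abc".to_string());
    assert_eq!(t.peek(), "abc");
    let s = t.consume();
    // t.peek(); // error: use of moved value: `t`
    assert_eq!(s, "abc");
}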

Rust supports const iterators smoothly. It's up to the iterator whether it yields T, &T, or &mut T (or even something else entirely).

The IntoIterator trait takes the place of functions like std::begin that produce an iterator from any collection.
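
For example, with Vec:

fn main() {
    let mut v = vec![1, 2, 3];
    for x in v.iter() {     // yields &i32
        println!("{}", x);
    }
    for x in v.iter_mut() { // yields &mut i32
        *x += 1;
    }
    for x in v {            // IntoIterator by value: yields i32, consuming the Vec
        println!("{}", x);
    }
}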

Rust has no equivalent to noexcept. Any function can panic, unless panics are disabled globally. This is pretty unfortunate when writing unsafe code to implement data types that have to be exception-safe. However, recoverable errors in Rust use Result, which is part of the function's type.
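
For example, a fallible function advertises its errors in its signature, and the ? operator propagates them:

use std::num::ParseIntError;

fn double(input: &str) -> Result<i32, ParseIntError> {
    let n: i32 = input.parse()?; // on error, return it to the caller
    Ok(n * 2)
}

fn main() {
    assert_eq!(double("21"), Ok(42));
    assert!(double("twenty-one").is_err());
}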

Rust supports a limited form of compile-time evaluation, but it's not yet nearly as powerful as C++14 constexpr. This is set to improve with the introduction of miri.

In Rust you mostly don't have to worry about "making const member functions thread safe". If something is shared between threads, the compiler will ensure it's free of thread-related undefined behavior. (This to me is one of the coolest features of Rust!) However, you might run into higher-level issues such as deadlocks that Rust's type system can't prevent.

There are no special member functions in Rust, e.g. copy constructors. If you want your type to be Clone or Copy, you have to opt-in with a derive or a manual impl.
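
For example:

#[derive(Clone, Copy, Debug)]
struct Color {
    r: u8,
    g: u8,
    b: u8,
}

fn main() {
    let c = Color { r: 255, g: 0, b: 0 };
    let d = c;                   // a copy, because Color opted in to Copy
    println!("{:?} {:?}", c, d); // c is still usable afterwards
}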

Chapter 4: Smart Pointers

Smart pointers are very important in Rust, as in modern C++. Much of the advice in this chapter applies directly to Rust.

std::unique_ptr corresponds directly to Rust's Box type. However, Box doesn't support custom deallocation code. If you need that, you have to either make it part of impl Drop on the underlying type, or write your own smart pointer. Box also does not support custom allocators.
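
Here's a sketch of cleanup code via Drop on the pointee (the type is made up):

struct TempFile {
    path: String,
}

impl Drop for TempFile {
    // Runs whenever a TempFile is dropped, including when a
    // Box<TempFile> is deallocated; this stands in for a custom deleter.
    fn drop(&mut self) {
        println!("cleaning up {}", self.path);
    }
}

fn main() {
    let b = Box::new(TempFile { path: "/tmp/scratch".to_string() });
    drop(b); // prints "cleaning up /tmp/scratch"
}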

std::shared_ptr corresponds to Rust's Arc type. Both provide thread-safe reference counting. Rust also supports much faster thread-local refcounting with the Rc type. Don't worry, the compiler will complain if you try to send an Rc between threads.
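
A sketch of the difference:

use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    let shared = Arc::new(vec![1, 2, 3]);
    let child = {
        let shared = shared.clone();
        thread::spawn(move || shared.len())
    };
    assert_eq!(child.join().unwrap(), 3);

    let local = Rc::new(5);
    // thread::spawn(move || *local);
    // ^ error: `Rc<i32>` cannot be sent between threads safely
    let _ = local;
}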

C++ standard libraries usually implement shared_ptr as a "fat pointer" containing both a pointer to the underlying value and a pointer to a refcount struct. Rust's Rc and Arc store the refcounts directly before the value in memory. This means that Rc and Arc are half the size of shared_ptr, and may perform better due to fewer indirections. On the downside, it means you can't upgrade Box to Rc/Arc without a reallocation and copy. It could also introduce performance problems on certain workloads, due to cache line sharing between the refcounts and the data. (I would love to hear from anyone who has run into this!) Boost supports intrusive_ptr which should perform very similarly to Rust's Arc.

Like Box, Rc and Arc don't support custom deleters or allocators.

Rust supports weak pointer variants of both Rc and Arc. Rather than panicking or returning NULL, the "upgrade" operation returns None, as you'd expect in Rust.
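
For example:

use std::rc::Rc;

fn main() {
    let strong = Rc::new("data".to_string());
    let weak = Rc::downgrade(&strong);

    assert!(weak.upgrade().is_some()); // a strong reference still exists

    drop(strong);
    assert!(weak.upgrade().is_none()); // the value is gone; no panic, no null
}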

Chapter 5: Rvalue References, Move Semantics, and Perfect Forwarding

This is a big one. Move semantics are rare among programming languages, but they're key in both Rust and C++. However, the two languages take very different approaches, owing to the fact that Rust was designed around moves whereas they're a late addition to C++.

There's no std::move in Rust. Moves are the default for non-Copy types. The behavior of a move or copy is always a shallow bit-wise copy; there is no way to override it. This can greatly improve performance. For example, when a Rust Vec changes address due to resizing, it will use a highly optimized memcpy. In comparison, C++'s std::vector has to call the move constructor on every element, or the copy constructor if there's no noexcept move constructor.

However the inability to hook moves and the difficulty of creating immovable types is an obstacle for certain kinds of advanced memory management, such as intrusive pointers and interacting with external garbage collectors.

Moves in C++ leave the source value in an unspecified but valid state — for example, an empty vector or a NULL unique pointer. This has several weird consequences:

  • A move counts as mutating a source variable, so "Move requests on const objects are silently transformed into copy operations". This is a surprising performance leak.
  • The moved-out-of variable can still be used after the move, and you don't necessarily know what you'll get.
  • The destructor will still run and must take care not to invoke undefined behavior.

The first two points don't apply in Rust. You can move out of a non-mut variable. The value isn't considered mutated, it's considered gone. And the compiler will complain if you try to use it after the move.
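
For example:

fn main() {
    let s = String::from("hello"); // note: not declared mut
    let t = s;                     // moving out of `s` is fine; now it's gone
    // println!("{}", s);          // error: use of moved value: `s`
    println!("{}", t);
}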

The third point is somewhat similar to old Rust, where types with a destructor would contain an implicit "drop flag" indicating whether they had already been moved from. As of Rust 1.12 (September 2016), these hidden struct fields are gone, and good riddance! If a variable has been moved from, the compiler simply omits the call to its destructor. In situations where a value may or may not have been moved (e.g. a move inside only one branch of an if), the drop flag lives in an ordinary local variable on the stack instead of hiding inside the struct.

Rust doesn't have a feature for perfect forwarding. There's no need to treat references specially, as they're just another type. Because there are no rvalue references in Rust, there's also no need for universal / forwarding references, and no std::forward.

However, Rust lacks variadic generics, so you can't do things like "factory function that forwards all arguments to constructor".

Item 29 says "Assume that move operations are not present, not cheap, and not used". I find this quite dispiriting! There are so many ways in C++ to think that you're moving a value when you're actually calling an expensive copy constructor — and compilers won't even warn you!

In Rust, moves are always available, always as cheap as memcpy, and always used when passing by value. Copy types don't have move semantics, but they act the same at runtime. The only difference is whether the static checks allow you to use the source location afterwards.

All in all, moves in Rust are more ergonomic and less surprising. Rust's treatment of moves should also perform better, because there's no need to leave the source object in a valid state, and there's no need to call move constructors on individual elements of a collection. (But can we benchmark this?)

There's a bunch of other stuff in this chapter that doesn't apply to Rust. For example, "The interaction among perfect-forwarding constructors and compiler-generated copy and move operations develops even more wrinkles when inheritance enters the picture." This is the kind of sentence that will make me run away screaming. Rust doesn't have any of those features, gets by fine without them, and thus avoids such bizarre interactions.

Chapter 6: Lambda Expressions

C++ allows closures to be copied; Rust doesn't.

In C++ you can specify whether a lambda expression's captures are taken into the closure by reference or by value, either individually or for all captures at once. In Rust this is mostly inferred by how you use the captures: whether they are mutated, and whether they are moved from. However, you can prefix the move keyword to force all captures to be taken by value. This is useful when the closure itself will outlive its environment, common when spawning threads for example.
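
For example:

use std::thread;

fn main() {
    let message = String::from("hello from a thread");
    // Without `move`, the closure would only borrow `message`, and the
    // compiler would reject it, since the thread may outlive this function.
    let child = thread::spawn(move || {
        println!("{}", message);
    });
    child.join().unwrap();
}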

Rust uses this inference for another purpose: determining which Fn* traits a closure will implement. If the lambda body moves out of a capture, it can only implement FnOnce, whose "call" operator takes self by value. If it doesn't move but does mutate captures, it will implement FnOnce and FnMut, whose "call" takes &mut self. And if it neither moves nor mutates, it will implement all of FnOnce, FnMut, and Fn. C++ doesn't have traits (yet) and doesn't distinguish these cases. If your lambda moves from a capture, you can call it again and you'll see whatever "empty" value was left behind by the move constructor.
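
A sketch of all three cases:

fn main() {
    let s = String::from("owned");
    let consume = move || drop(s); // moves from a capture: FnOnce only
    consume();
    // consume(); // error: this closure can only be called once

    let mut count = 0;
    {
        let mut bump = || count += 1; // mutates a capture: FnMut (and FnOnce)
        bump();
        bump();
    }
    assert_eq!(count, 2);

    let word = "hi";
    let read = || word.len(); // neither moves nor mutates: Fn as well
    assert_eq!(read(), 2);
    assert_eq!(read(), 2);    // callable any number of times
}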

Rust doesn't support init capture; however, move capture is supported natively. You can do whatever init you like outside the lambda and then move the result in.

Like C++, Rust allows inference of closure parameter types. Unlike C++, an individual closure cannot be generic.

Chapter 7: The Concurrency API

Rust doesn't have futures in the standard library; they're part of an external library maintained by a core Rust developer. They're also used for async I/O.

In C++, dropping a std::thread that is still running terminates the program, which certainly seems un-fun to me. The behavior is justified by the possibility that the thread captures by reference something from its spawning context. If the thread then outlived that context, it would result in undefined behavior. In Rust, this can't happen because thread::spawn(f) has a 'static bound on the type of f. So, when a Rust JoinHandle falls out of scope, the thread is safely detached and continues to run.
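
For example, dropping the handle detaches the thread instead of aborting:

use std::thread;
use std::time::Duration;

fn main() {
    // The JoinHandle is dropped immediately; the thread is detached
    // and keeps running rather than terminating the program.
    thread::spawn(|| {
        thread::sleep(Duration::from_millis(10));
        println!("still running after the handle was dropped");
    });
    thread::sleep(Duration::from_millis(50)); // give it a chance to finish
}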

The other possibility, in either language, is to join threads on drop, waiting for the thread to finish. However this has surprising performance implications and still isn't enough to allow threads to safely borrow from their spawning environment. Such "scoped threads" are provided by libraries in Rust and use a different technique to ensure safety.

C++ and Rust both provide atomic variables. In C++ they support standard operations such as assignment, ++, and atomic reads by conversion to the underlying type. These all use the "sequentially consistent" memory ordering, which provides the strongest guarantees. Rust is more explicit, using dedicated methods like fetch_add which also specify the memory ordering. (This kind of API is also available in C++.)
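
For example:

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4).map(|_| {
        let counter = counter.clone();
        // Both the operation and the memory ordering are spelled out.
        thread::spawn(move || counter.fetch_add(1, Ordering::SeqCst))
    }).collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::SeqCst), 4);
}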

This chapter also talks about the C++ type qualifier volatile, even though it has to do with stuff like memory-mapped I/O and not threads. Rust doesn't have volatile types; instead, a volatile read or write is done using an intrinsic function.
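
The stable wrappers for those intrinsics live in std::ptr:

use std::ptr;

fn main() {
    let mut x: u32 = 0;
    unsafe {
        // Volatile accesses are function calls, not a type qualifier,
        // and the compiler won't elide or merge them.
        ptr::write_volatile(&mut x, 42);
        assert_eq!(ptr::read_volatile(&x), 42);
    }
}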

Chapter 8: Tweaks

Rust containers don't have methods like emplace_back. You can, however, use the experimental placement-new feature.

Conclusions

Rust and C++ share many features, allowing a detailed comparison between them. Rust is a much newer design that isn't burdened with 20 years of backwards compatibility. This I think is why Rust's versions of these core features tend to be simpler and easier to reason about. On the other hand, Rust gains some complexity by enforcing strong static guarantees.

There are of course some differences of principle, not just historical quirks. C++ has an object system based on classes and inheritance, even allowing multiple inheritance. There's no equivalent in Rust. Rust also prefers simple and explicit semantics, while C++ allows a huge amount of implicit behavior. You see this for example with implicit copy construction, implicit conversions, ad-hoc function overloading, quasi-transparent references, and the operators on atomic values. There are still some implicit behaviors in Rust, but they're carefully constrained. Personally I prefer Rust's explicit style; I find there are too many cases where C++ doesn't "do what I mean". But other programmers may disagree, and that's fine.

I hope and expect that C++ and Rust will converge on similar feature-sets. C++ is scheduled to get a proper module system, a "concepts" system similar to traits, and a subset with statically-checkable memory safety. Rust will eventually have integer generics, variadic generics, and more powerful const fn. It's an exciting time for both languages :)

Sunday, May 3, 2015

Modeling garbage collectors with Alloy: part 1

Formal methods for software verification are usually seen as a high-cost tool that you would only use on the most critical systems, and only after extensive informal verification. The Alloy project aims to be something completely different: a lightweight tool you can use at any stage of everyday software development. With just a few lines of code, you can build a simple model to explore design issues and corner cases, even before you've started writing the implementation. You can gradually make the model more detailed as your requirements and implementation get more complex. After a system is deployed, you can keep the model around to evaluate future changes at low cost.

Sounds great, doesn't it? I have only a tiny bit of prior experience with Alloy and I wanted to try it out on something more substantial. In this article we'll build a simple model of a garbage collector, visualize its behavior, and fix some problems. This is a warm-up for exploring more complex GC algorithms, which will be the subject of future articles.

I won't describe the Alloy syntax in full detail, but you should be able to follow along if you have some background in programming and logic. See also the Alloy documentation and especially the book Software Abstractions: Logic, Language, and Analysis by Daniel Jackson, which is a very practical and accessible introduction to Alloy. It's a highly recommended read for any software developer.

You can download Alloy as a self-contained Java executable, which can do analysis and visualization and includes an editor for Alloy code.

The model

We will start like so:

open util/ordering [State]

sig Object { }
one sig Root extends Object { }

sig State {
    pointers: Object -> set Object,
    collected: set Object,
}

The garbage-collected heap consists of Objects, each of which can point to any number of other Objects (including itself). There is a distinguished object Root which represents everything that's accessible without going through the heap, such as global variables and the function call stack. We also track which objects have already been garbage-collected. In a real implementation these would be candidates for re-use; in our model they stick around so that we can detect use-after-free.

The open statement invokes a library module to provide a total ordering on States, which we will interpret as the progression of time. More on this later.

Relations

In the code that follows, it may look like Alloy has lots of different data types, overloading operators with total abandon. In fact, all these behaviors arise from an exceptionally simple data model:

Every value is a relation; that is, a set of tuples of the same non-zero length.

When each tuple has length 1, we can view the relation as a set. When each tuple has length 2, we can view it as a binary relation and possibly as a function. And a singleton set is viewed as a single atom or tuple.

Since everything in Alloy is a relation, each operator has a single definition in terms of relations. For example, the operators . and [] are syntax for a flavor of relational join. If you think of the underlying relations as a database, then Alloy's clever syntax amounts to an object-relational mapping that is at once very simple and very powerful. Depending on context, these joins can look like field access, function calls, or data structure lookups, but they are all described by the same underlying framework.

The elements of the tuples in a relation are atoms, which are indivisible and have no meaning individually. Their meaning comes entirely from the relations and properties we define. Ultimately, atoms all live in the same universe, but Alloy gives "warnings" when the type system implied by the sig declarations can prove that an expression is always the empty relation.

Here are the relations implied by our GC model, as tuple sets along with their types:

Object: {Object} = {O1, O2, ..., Om}
Root:   {Root}   = {Root}
State:  {State}  = {S1, S2, ..., Sn}

pointers:  {(State, Object, Object)}
collected: {(State, Object)}

first: {State} = {S1}
last:  {State} = {Sn}
next:  {(State, State)} = {(S1, S2), (S2, S3), ..., (S(n-1), Sn)}

The last three relations come from the util/ordering library. Note that a sig implicitly creates some atoms.

Dynamics

The live objects are everything reachable from the root:

fun live(s: State): set Object {
    Root.*(s.pointers)
}

*(s.pointers) constructs the reflexive, transitive closure of the binary relation s.pointers; that is, the set of objects reachable from each object.

Of course the GC is only part of a system; there's also the code that actually uses these objects, which in GC terminology is called the mutator. We can describe the action of each part as a predicate relating "before" and "after" states.

pred mutate(s, t: State) {
    t.collected = s.collected
    t.pointers != s.pointers
    all a: Object - t.live |
        t.pointers[a] = s.pointers[a]
}

pred gc(s, t: State) {
    t.pointers = s.pointers
    t.collected = s.collected + (Object - s.live)
    some t.collected - s.collected
}

The mutator cannot collect garbage, but it can change the pointers of any live object. The GC doesn't touch the pointers, but it collects any dead object. In both cases we require that something changes in the heap.

It's time to state the overall facts of our model:

fact {
    no first.collected
    first.pointers = Root -> (Object - Root)
    all s: State - last |
        let t = s.next |
        mutate[s, t] or gc[s, t]
}

This says that in the initial state, no object has been collected, and every object is in the root set except Root itself. This means we don't have to model allocation as well. Each state except the last must be followed by a mutator step or a GC step.

The syntax all x: e | P says that the property P must hold for every tuple x in e. Alloy supports a variety of quantifiers like this.

Interacting with Alloy

The development above looks nice and tidy — I hope — but in reality, it took a fair bit of messing around to get to this point. Alloy provides a highly interactive development experience. At any time, you can visualize your model as a collection of concrete examples. Let's do that now by adding these commands:

pred Show {}
run Show for 5

Now we select this predicate from the "Execute" menu, then click "Show". The visualizer provides many options to customize the display of each atom and relation. The config that I made for this project is "projected over State", which means you see a graph of the heap at one moment in time, with forward/back buttons to reach the other States.

After clicking around a bit, you may notice some oddities:

Diagram of a heap with an object pointing to the root

The root isn't a heap object; it represents all of the pointers that are reachable without accessing the heap. So it's meaningless for an object to point to the root. We can exclude these cases from the model easily enough:

fact {
    all s: State | no s.pointers.Root
}

(This can also be done more concisely as part of the original sig.)

Now we're ready to check the essential safety property of a garbage collector:

assert no_dangling {
    all s: State | no (s.collected & s.live)
}

check no_dangling for 5 Object, 10 State

And Alloy says:

Executing "Check no_dangling for 5 Object, 10 State"
   ...
   8338 vars. 314 primary vars. 17198 clauses. 40ms.
   Counterexample found. Assertion is invalid. 14ms.

Clicking "Counterexample" brings up the visualization:

Diagram of four states. A single heap object is unrooted, then collected, but then the root grows a new pointer to it!

Whoops, we forgot to say that only pointers to live objects can be stored! We can fix this by modifying the mutate predicate:

pred mutate(s, t: State) {
    t.collected = s.collected
    t.pointers != s.pointers
    all a: Object - t.live |
        t.pointers[a] = s.pointers[a]

    // new requirement!
    all a: t.live |
        t.pointers[a] in s.live
}

With the result:

Executing "Check no_dangling for 5 Object, 10 State"
   ...
   8617 vars. 314 primary vars. 18207 clauses. 57ms.
   No counterexample found. Assertion may be valid. 343ms.

SAT solvers and bounded model checking

"May be" valid? Fortunately this has a specific meaning. We asked Alloy to look for counterexamples involving at most 5 objects and 10 time steps. This bounds the search for counterexamples, but it's still vastly more than we could ever check by exhaustive brute force search. (See where it says "8617 vars"? Try raising 2 to that power.) Rather, Alloy turns the bounded model into a Boolean formula, and feeds it to a SAT solver.

This all hinges on one of the weirdest things about computing in the 21st century. In complexity theory, SAT (along with many equivalents) is the prototypical "hardest problem" in NP. Why do we intentionally convert our problem into an instance of this "hardest problem"? I guess for me it illustrates a few things:

  • The huge gulf between worst-case complexity (the subject of classes like NP) and average or "typical" cases that we encounter in the real world. For more on this, check out Impagliazzo's "Five Worlds" paper.

  • The fact that real-world difficulty involves a coordination game. SAT solvers got so powerful because everyone agrees SAT is the problem to solve. Standard input formats and public competitions were a key part of the amazing progress over the past decade or two.

Of course SAT solvers aren't quite omnipotent, and Alloy can quickly get overwhelmed when you scale up the size of your model. Applicability to the real world depends on the small scope hypothesis:

If an assertion is invalid, it probably has a small counterexample.

Or equivalently:

Systems that fail on large instances almost always fail on small instances with similar properties.

This is far from a sure thing, but it already underlies a lot of approaches to software testing. With Alloy we have the certainty of proof within the size bounds, so we don't have to resort to massive scale to find rare bugs. It's difficult (but not impossible!) to imagine a GC algorithm that absolutely cannot fail on fewer than 6 nodes, but is buggy for larger heaps. Implementations will often fall over at some arbitrary resource limit, but algorithms and models are more abstract.

Conclusion

It's not surprising that our correctness property

all s: State | no (s.collected & s.live)

holds, since it's practically a restatement of the garbage collection "algorithm":

t.collected = s.collected + (Object - s.live)

Because reachability is built into Alloy, via transitive closure, the simplest model of a garbage collector does not really describe an implementation. In the next article we'll look at incremental garbage collection, which breaks the reachability search into small units and allows the mutator to run in-between. This is highly desirable for interactive or real-time apps; it also complicates the algorithm quite a bit. We'll use Alloy to uncover some of these complications.

In the meantime, you can play around with the simple GC model and ask Alloy to visualize any scenario you like. For example, we can look at runs where the final state includes at least 5 pointers, and at least one collected object:

pred Show {
    #(last.pointers) >= 5
    some last.collected
}

run Show for 5

Thanks for reading! You can find the code in a GitHub repository which I'll update if/when we get around to modeling more complex GCs.

Wednesday, March 18, 2015

html5ever project update: one year!

I started the html5ever project just a bit over one year ago. We adopted the current name last July.

<kmc> maybe the whole project needs a better name, idk
<Ms2ger> htmlparser, perhaps
<jdm> tagsoup
<Ms2ger> UglySoup
<Ms2ger> Since BeautifulSoup is already taken
<jdm> html5ever
<Ms2ger> No
<jdm> you just hate good ideas
<pcwalton> kmc: if you don't call it html5ever that will be a massive missed opportunity

By that point we already had a few contributors. Now we have 469 commits from 18 people, which is just amazing. Thank you to everyone who helped with the project. Over the past year we've upgraded Rust almost 50 times; I'm extremely grateful to the community members who had a turn at this Sisyphean task.

Several people have also contributed major enhancements. For example:

  • Clark Gaebel implemented zero-copy parsing. I'm in the process of reviewing this code and will be landing pieces of it in the next few weeks.

  • Josh Matthews made it possible to suspend and resume parsing from the tree sink. Servo needs this to do async resource fetching for external <script>s of the old-school (non-async/defer) variety.

  • Chris Paris implemented fragment parsing and improved serialization. This means Servo can use html5ever not only for parsing whole documents, but also for the innerHTML/outerHTML getters and setters within the DOM.

  • Adam Roben brought us dramatically closer to spec conformance. Aside from foreign (XML) content and <template>, we pass 99.6% of the html5lib tokenizer and tree builder tests! Adam also improved the build and test infrastructure in a number of ways.

I'd also like to thank Simon Sapin for doing the initial review of my code, and finding a few bugs in the process.

html5ever makes heavy use of Rust's metaprogramming features. It's been something of a wild ride, and we've collaborated with the Rust team in a number of ways. Felix Klock came through in a big way when a Rust upgrade broke the entire tree builder. Lately, I've been working on improvements to Rust's macro system ahead of the 1.0 release, based in part on my experience with html5ever.

Even with the early-adopter pains, the use of metaprogramming was absolutely worth it. Most of the spec-conformance patches were only a few lines, because our encoding of parser rules is so close to what's written in the spec. This is especially valuable with a "living standard" like HTML.

The future

Two upcoming enhancements are a high priority for Web compatibility in Servo:

  • Character encoding detection and conversion. This will build on the zero-copy UTF-8 parsing mentioned above. Non-UTF-8 content (~15% of the Web) will have "one-copy parsing" after a conversion to UTF-8. This keeps the parser itself lean and mean.

  • document.write support. This API can insert arbitrary UTF-16 code units (which might not even be valid Unicode) in the middle of the UTF-8 stream. To handle this, we might switch to WTF-8. Along with document.write we'll start to do speculative parsing.

It's likely that I'll work on one or both of these in the next quarter.

Servo may get SVG support in the near future, thanks to canvg. SVG nodes can be embedded in HTML or loaded from an external XML file. To support the first case, html5ever needs to implement WHATWG's rules for parsing foreign content in HTML. To handle external SVG we could use a proper XML parser, or we could extend html5ever to support "XML5", an error-tolerant XML syntax similar to WHATWG HTML. Ygg01 made some progress towards implementing XML5. Servo would most likely use it for XHTML as well.

Improved performance is always a goal. html5ever describes itself as "high-performance" but does not have specific comparisons to other HTML parsers. I'd like to fix that in the near future. Zero-copy parsing will be a substantial improvement, once some performance issues in Rust get fixed. I'd like to revisit SSE-accelerated parsing as well.

I'd also like to support html5ever on some stable Rust 1.x version, although it probably won't happen for 1.0.0. The main obstacle here is procedural macros. Erick Tryzelaar has done some great work recently with syntex, aster, and quasi. Switching to this ecosystem will get us close to 1.x compatibility and will clean up the macro code quite a bit. I'll be working with Erick to use html5ever as an early validation of his approach.

Simon has extracted Servo's CSS selector matching engine as a stand-alone library. Combined with html5ever this provides most of the groundwork for a full-featured HTML manipulation library.

The C API for html5ever still builds, thanks to continuous integration. But it's not complete or well-tested. With the removal of Rust's runtime, maintaining the C API does not restrict the kind of code we can write in other parts of the parser. All we need now is to complete the C API and write tests. This would be a great thing for a community member to work on. Then we can write bindings for every language under the sun and bring fast, correct, memory-safe HTML parsing to the masses :)

Friday, February 20, 2015

Turing tarpits in Rust's macro system

Bitwise Cyclic Tag is an extremely simple automaton slash programming language. BCT uses a program string and a data string, each made of bits. The program string is interpreted as if it were infinite, by looping back around to the first bit.

The program consists of commands executed in order. There is a single one-bit command:

0: Delete the left-most data bit.

and a single two-bit command:

1 x: If the left-most data bit is 1, copy bit x onto the right end of the data string.

We halt if ever the data string is empty.

Remarkably, this is enough to do universal computation. Implementing it in Rust's macro system gives a proof (probably not the first one) that Rust's macro system is Turing-complete, aside from the recursion limit imposed by the compiler.

#![feature(trace_macros)]

macro_rules! bct {
    // cmd 0:  d ... => ...
    (0, $($ps:tt),* ; $_d:tt)
        => (bct!($($ps),*, 0 ; ));
    (0, $($ps:tt),* ; $_d:tt, $($ds:tt),*)
        => (bct!($($ps),*, 0 ; $($ds),*));

    // cmd 1p:  1 ... => 1 ... p
    (1, $p:tt, $($ps:tt),* ; 1)
        => (bct!($($ps),*, 1, $p ; 1, $p));
    (1, $p:tt, $($ps:tt),* ; 1, $($ds:tt),*)
        => (bct!($($ps),*, 1, $p ; 1, $($ds),*, $p));

    // cmd 1p:  0 ... => 0 ...
    (1, $p:tt, $($ps:tt),* ; $($ds:tt),*)
        => (bct!($($ps),*, 1, $p ; $($ds),*));

    // halt on empty data string
    ( $($ps:tt),* ; )
        => (());
}

fn main() {
    trace_macros!(true);
    bct!(0, 0, 1, 1, 1 ; 1, 0, 1);
}

This produces the following compiler output:

bct! { 0 , 0 , 1 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
...
bct.rs:19:13: 19:45 error: recursion limit reached while expanding the macro `bct`
bct.rs:19         => (bct!($($ps),*, 1, $p ; $($ds),*));
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can try it online, as well.

Notes about the macro

I would much rather drop the commas and write

// cmd 0:  d ... => ...
(0 $($ps:tt)* ; $_d:tt $($ds:tt)*)
    => (bct!($($ps)* 0 ; $($ds)*));

// cmd 1p:  1 ... => 1 ... p
(1 $p:tt $($ps:tt)* ; 1 $($ds:tt)*)
    => (bct!($($ps)* 1 $p ; 1 $($ds)* $p));

// cmd 1p:  0 ... => 0 ...
(1 $p:tt $($ps:tt)* ; $($ds:tt)*)
    => (bct!($($ps)* 1 $p ; $($ds)*));

but this runs into the macro future-proofing rules.

If we're required to have commas, then it's at least nice to handle them uniformly, e.g.

// cmd 0:  d ... => ...
(0 $(, $ps:tt)* ; $_d:tt $(, $ds:tt)*)
    => (bct!($($ps),*, 0 ; $($ds),*));

// cmd 1p:  1 ... => 1 ... p
(1, $p:tt $(, $ps:tt)* ; $($ds:tt),*)
    => (bct!($($ps),*, 1, $p ; 1 $(, $ds)*, $p));

// cmd 1p:  0 ... => 0 ...
(1, $p:tt $(, $ps:tt)* ; $($ds:tt),*)
    => (bct!($($ps),*, 1, $p ; $($ds),*));

But this too is disallowed. An $x:tt variable cannot be followed by a repetition $(...)*, even though it's (I believe) harmless. There is an open RFC about this issue. For now I have to handle the "one" and "more than one" cases separately, which is annoying.

In general, I don't think macro_rules! is a good language for arbitrary computation. This experiment shows the hassle involved in implementing one of the simplest known "arbitrary computations". Rather, macro_rules! is good at expressing patterns of code reuse that don't require elaborate compile-time processing. It does so in a way that's declarative, hygienic, and high-level.

However, there is a big middle ground of non-elaborate, but non-trivial computations. macro_rules! is hardly ideal for that, but procedural macros have problems of their own. Indeed, the bct! macro is an extreme case of a pattern I've found useful in the real world. The idea is that every recursive invocation of a macro gives you another opportunity to pattern-match the arguments. Some of html5ever's macros do this, for example.

Saturday, January 10, 2015

151-byte static Linux binary in Rust

Part of the sales pitch for Rust is that it's "as bare metal as C".1 Rust can do anything C can do, run anywhere C can run,2 with code that's just as efficient, and at least as safe (but usually much safer).

I'd say this claim is about 95% true, which is pretty good by the standards of marketing claims. A while back I decided to put it to the test, by making the smallest, most self-contained Rust program possible. After resolving a few issues along the way, I ended up with a 151-byte, statically linked executable for AMD64 Linux. With the release of Rust 1.0-alpha, it's time to show this off.

Here's the Rust code:

#![crate_type="rlib"]
#![allow(unstable)]

#[macro_use] extern crate syscall;

use std::intrinsics;

fn exit(n: usize) -> ! {
    unsafe {
        syscall!(EXIT, n);
        intrinsics::unreachable()
    }
}

fn write(fd: usize, buf: &[u8]) {
    unsafe {
        syscall!(WRITE, fd, buf.as_ptr(), buf.len());
    }
}

#[no_mangle]
pub fn main() {
    write(1, "Hello!\n".as_bytes());
    exit(0);
}

This uses my syscall library, which provides the syscall! macro. We wrap the underlying system calls with Rust functions, each exposing a safe interface to the unsafe syscall! macro. The main function uses these two safe functions and doesn't need its own unsafe annotation. Even in such a small program, Rust allows us to isolate memory unsafety to a subset of the code.

Because of crate_type="rlib", rustc will build this as a static library, from which we extract a single object file tinyrust.o:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 00 00 00 00          mov    $0x0,%esi
                        b: R_X86_64_32  .rodata.str1625
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

We disable stack exhaustion checking, as well as position-independent code, in order to slim down the output. After optimization, the only instructions that survive come from inline assembly blocks in the syscall library.

Note that main doesn't end in a ret instruction. The exit function (which gets inlined) is marked with a "return type" of !, meaning "doesn't return". We make good on this by invoking the unreachable intrinsic after syscall!. LLVM will optimize under the assumption that we can never reach this point, making no guarantees about the program behavior if it is reached. This represents the fact that the kernel is actually going to kill the process before syscall!(EXIT, n) can return.

Because we use inline assembly and intrinsics, this code is not going to work on a stable-channel build of Rust 1.0. It will require an alpha or nightly build until such time as inline assembly and intrinsics::unreachable are added to the stable language of Rust 1.x.

Note that I didn't even use #![no_std]! This program is so tiny that everything it pulls from libstd is a type definition, macro, or fully inlined function. As a result there's nothing of libstd left in the compiler output. In a larger program you may need #![no_std], although its role is greatly reduced following the removal of Rust's runtime.

Linking

This is where things get weird.

Whether we compile from C or Rust,3 the standard linker toolchain is going to include a bunch of junk we don't need. So I cooked up my own linker script:

SECTIONS {
    . = 0x400078;
  
    combined . : AT(0x400078) ALIGN(1) SUBALIGN(1) {
        *(.text*)
        *(.data*)
        *(.rodata*)
        *(.bss*)
    }
}

We smash all the sections together, with no alignment padding, then extract that section as a headerless binary blob:

$ ld --gc-sections -e main -T script.ld -o payload tinyrust.o
$ objcopy -j combined -O binary payload payload.bin

Finally we stick this on the end of a custom ELF header. The header is written in NASM syntax but contains no instructions, only data fields. The base address 0x400078 seen above is the end of this header, when the whole file is loaded at 0x400000. There's no guarantee that ld will put main at the beginning of the file, so we need to separately determine the address of main and fill that in as the e_entry field in the ELF file header.

$ ENTRY=$(nm -f posix payload | grep '^main ' | awk '{print $3}')
$ nasm -f bin -o tinyrust -D entry=0x$ENTRY elf.s
$ chmod +x ./tinyrust
$ ./tinyrust
Hello!

It works! And the size:

$ wc -c < tinyrust
158

Seven bytes too big!

The final trick

To get down to 151 bytes, I took inspiration from this classic article, which observes that padding fields in the ELF header can be used to store other data. Like, say, a string constant. The Rust code changes to access this constant:

use std::{mem, raw};

#[no_mangle]
pub fn main() {
    let message: &'static [u8] = unsafe {
        mem::transmute(raw::Slice {
            data: 0x00400008 as *const u8,
            len: 7,
        })
    };

    write(1, message);
    exit(0);
}

A Rust slice like &[u8] consists of a pointer to some memory, and a length indicating the number of elements that may be found there. The module std::raw exposes this as an ordinary struct that we build, then transmute to the actual slice type. The transmute function generates no code; it just tells the type checker to treat our raw::Slice<u8> as if it were a &[u8]. We return this value out of the unsafe block, taking advantage of the "everything is an expression" syntax, and then print the message as before.

Trying out the new version:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:        
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 08 00 40 00          mov    $0x400008,%esi
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

...
$ wc -c < tinyrust
151
$ ./tinyrust
Hello!

The object code is the same as before, except that the relocation for the string constant has become an absolute address. The binary is smaller by 7 bytes (the size of "Hello!\n") and it still works!

You can find the full code on GitHub. The code in this article works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub, I will also update the version number printed by the included build script.

I'd be curious to hear if anyone can make my program smaller!


  1. C is not really "bare metal", but that's another story

  2. From a pure language perspective. If you want to talk about availability of compilers and libraries, then Rust still has a bit of a disadvantage ;) 

  3. In fact, this code grew out of an earlier experiment with really small binaries in C. 

Wednesday, October 29, 2014

A taste of Rust (yum) for C/C++ programmers

If, like me, you've been frustrated with the status quo in systems languages, this article will give you a taste of why Rust is so exciting. In a tiny amount of code, it shows a lot of ways that Rust really kicks ass compared to C and C++. It's not just safe and fast, it's a lot more convenient.

Web browsers do string interning to condense the strings that make up the Web, such as tag and attribute names, into small values that can be compared quickly. I recently added event logging support to Servo's string interner. This will allow us to record traces from real websites, which we can use to guide further optimizations.

Here are the events we can log:

#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

Interned strings have a 64-bit ID, which is recorded in every event. The String we store for "insert" events is like C++'s std::string; it points to a buffer in the heap, and it owns that buffer.

This enum is a bit fancier than a C enum, but its representation in memory is no more complex than a C struct. There's a tag for the three alternatives, a 64-bit ID, and a few fields that make up the String. When we pass or return an Event by value, it's at worst a memcpy of a few dozen bytes. There's no implicit heap allocation, garbage collection, or anything like that. We didn't define a way to copy an event; this means the String buffer always has a unique owner who is responsible for freeing it.

The deriving(Show) attribute tells the compiler to auto-generate a text representation, so we can print an Event just as easily as a built-in type.

Next we declare a global vector of events, protected by a mutex:

lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}

lazy_static! will initialize the Mutex and the Vec inside it the first time LOG is used. Like String, the Vec is a growable buffer. We won't turn on event logging in release builds, so it's fine to pre-allocate space for 50,000 events. (You can put underscores anywhere in an integer literal to improve readability.)

lazy_static!, Mutex, and Vec are all implemented in Rust using gnarly low-level code. But the amazing thing is that all three expose a safe interface. It's simply not possible to use the variable before it's initialized, or to read the value the Mutex protects without locking it, or to modify the vector while iterating over it.

The worst you can do is deadlock. And Rust considers that pretty bad, still, which is why it discourages global state. But it's clearly what we need here. Rust takes a pragmatic approach to safety. You can always write the unsafe keyword and then use the same pointer tricks you'd use in C. But you don't need to be quite so guarded when writing the other 95% of your code. I want a language that assumes I'm brilliant but distracted :)

Rust catches these mistakes at compile time, and produces the same code you'd see with equivalent constructs in C++. For a more in-depth comparison, see Ruud van Asseldonk's excellent series of articles about porting a spectral path tracer from C++ to Rust. The Rust code performs basically the same as Clang / GCC / MSVC on the same platform. Not surprising, because Rust uses LLVM and benefits from the same backend optimizations as Clang.

lazy_static! is not a built-in language feature; it's a macro provided by a third-party library. Since the library uses Cargo, I can include it in my project by adding

[dependencies.lazy_static]
git = "https://github.com/Kimundi/lazy-static.rs"

to Cargo.toml and then adding

#[phase(plugin)]
extern crate lazy_static;

to src/lib.rs. Cargo will automatically fetch and build all dependencies. Code reuse becomes no harder than in your favorite scripting language.

Finally, we define a function that pushes a new event onto the vector:

pub fn log(e: Event) {
    LOG.lock().push(e);
}

LOG.lock() produces an RAII handle that will automatically unlock the mutex when it falls out of scope. In C++ I always hesitate to use temporaries like this because if they're destroyed too soon, my program will segfault or worse. Rust has compile-time lifetime checking, so I can do things that would be reckless in C++.

If you scroll up you'll see a lot of prose and not a lot of code. That's because I got a huge amount of functionality for free. Here's the logging module again:

#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}

pub fn log(e: Event) {
    LOG.lock().push(e);
}

This goes in src/event.rs and we include it from src/lib.rs.

#[cfg(feature = "log-events")]
pub mod event;

The cfg attribute is how Rust does conditional compilation. Another project can specify

[dependencies.string_cache]
git = "https://github.com/servo/string-cache"
features = ["log-events"]

and add code to dump the log:

for e in string_cache::event::LOG.lock().iter() {
    println!("{}", e);
}

Any project which doesn't opt in to log-events will see zero impact from any of this.

If you'd like to learn Rust, the Guide is a good place to start. We're getting close to 1.0 and the important concepts have been stable for a while, but the details of syntax and libraries are still in flux. It's not too early to learn, but it might be too early to maintain a large library.

By the way, here are the events generated by interning the three strings foobarbaz foo blockquote:

Insert(0x7f1daa023090, foobarbaz)
Intern(0x7f1daa023090)
Intern(0x6f6f6631)
Intern(0xb00000002)

There are three different kinds of IDs, indicated by the least significant bits. The first is a pointer into a standard interning table, which is protected by a mutex. The other two are created without synchronization, which improves parallelism between parser threads.

In UTF-8, the string foo is smaller than a 64-bit pointer, so we store the characters directly. blockquote is too big for that, but it corresponds to a well-known HTML tag. 0xb is the index of blockquote in a static list of strings that are common on the Web. Static atoms can also be used in pattern matching, and LLVM's optimizations for C's switch statements will apply.

Wednesday, September 17, 2014

Raw system calls for Rust

I wrote a small library for making raw system calls from Rust programs. It provides a macro that expands into in-line system call instructions, with no run-time dependencies at all. Here's an example:

#![feature(phase)]

#[phase(plugin, link)]
extern crate syscall;

fn write(fd: uint, buf: &[u8]) {
    unsafe {
        syscall!(WRITE, fd, buf.as_ptr(), buf.len());
    }
}

fn main() {
    write(1, "Hello, world!\n".as_bytes());
}

Right now it only supports x86-64 Linux, but I'd love to add other platforms. Pull requests are much appreciated. :)