If, like me, you've been frustrated with the status quo in systems languages, this article will give you a taste of why Rust is so exciting. In a tiny amount of code, it shows a lot of ways that Rust really kicks ass compared to C and C++. It's not just safe and fast, it's a lot more convenient.
Web browsers do string interning to condense the strings that make up the Web, such as tag and attribute names, into small values that can be compared quickly. I recently added event logging support to Servo's string interner. This will allow us to record traces from real websites, which we can use to guide further optimizations.
Here are the events we can log:
#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}
Interned strings have a 64-bit ID, which is recorded in every event. The String we store for "insert" events is like C++'s std::string; it points to a buffer in the heap, and it owns that buffer.
This enum is a bit fancier than a C enum, but its representation in memory is no more complex than a C struct. There's a tag for the three alternatives, a 64-bit ID, and a few fields that make up the String. When we pass or return an Event by value, it's at worst a memcpy of a few dozen bytes. There's no implicit heap allocation, garbage collection, or anything like that. We didn't define a way to copy an event; this means the String buffer always has a unique owner who is responsible for freeing it.
The deriving(Show) attribute tells the compiler to auto-generate a text representation, so we can print an Event just as easily as a built-in type.
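For readers on a current toolchain, here is a rough sketch of the same enum. It assumes today's spelling, where deriving(Show) has become derive(Debug); the describe helper is hypothetical and exists only to show an Event being moved into a function and printed.

```rust
use std::mem::size_of;

// Current spelling of the article's `#[deriving(Show)]` (an assumption
// about how the code would be written today, not the original source).
#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

/// Hypothetical helper: takes ownership of the event and returns the
/// auto-generated text representation.
fn describe(e: Event) -> String {
    format!("{:?}", e)
}

fn main() {
    // The whole enum is a tag, an ID, and the String's few fields.
    println!("size of Event: {} bytes", size_of::<Event>());

    let e = Event::Insert(1, String::from("foobarbaz"));
    // This is a move, not a copy: `e` cannot be used afterwards, so the
    // String buffer keeps exactly one owner.
    let moved = e;
    println!("{}", describe(moved));
    println!("{}", describe(Event::Intern(2)));
    println!("{}", describe(Event::Remove(2)));
}
```

Trying to use e after the move is rejected by the compiler, which is the "unique owner" rule in action.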
Next we declare a global vector of events, protected by a mutex:
lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}
lazy_static! will initialize both the Mutex and the Vec when LOG is first used. Like String, the Vec is a growable buffer. We won't turn on event logging in release builds, so it's fine to pre-allocate space for 50,000 events. (You can put underscores anywhere in an integer literal to improve readability.)
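The same lazily initialized global can be approximated with the standard library alone. This is a minimal sketch that swaps lazy_static! for std::sync::OnceLock, which did not exist when this was written; the log_buffer function name is made up for the example.

```rust
use std::sync::{Mutex, OnceLock};

#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

/// Hypothetical accessor: returns the global log, creating the Mutex and
/// the pre-allocated Vec the first time it is called.
pub fn log_buffer() -> &'static Mutex<Vec<Event>> {
    static LOG: OnceLock<Mutex<Vec<Event>>> = OnceLock::new();
    LOG.get_or_init(|| Mutex::new(Vec::with_capacity(50_000)))
}

fn main() {
    // The first call runs the initializer; later calls reuse the same value.
    let mut log = log_buffer().lock().unwrap();
    log.push(Event::Intern(1));
    log.push(Event::Insert(2, String::from("foobarbaz")));
    log.push(Event::Remove(2));
    println!("{} events, capacity {}", log.len(), log.capacity());
}
```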
lazy_static!, Mutex, and Vec are all implemented in Rust using gnarly low-level code. But the amazing thing is that all three expose a safe interface. It's simply not possible to use the variable before it's initialized, or to read the value the Mutex protects without locking it, or to modify the vector while iterating over it. The worst you can do is deadlock. Rust still considers that pretty bad, which is why it discourages global state. But it's clearly what we need here.
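To make two of those guarantees concrete, here is a small sketch in current Rust. Both function names are hypothetical. The commented-out loop is the kind of code the compiler refuses, because the iterator's borrow of the vector is still live when push needs exclusive access.

```rust
use std::sync::Mutex;

/// Appends a copy of every element to the vector.
fn append_copy(v: &mut Vec<u32>) {
    // Rejected at compile time: `v` is borrowed by the iterator, so it
    // cannot also be mutated inside the loop.
    //
    //     for x in v.iter() { v.push(*x); }
    //
    // Taking a snapshot first ends the shared borrow before we mutate.
    let snapshot = v.clone();
    v.extend(snapshot);
}

/// Sums the protected values. The only path to the Vec is through the
/// guard returned by `lock()`.
fn total(m: &Mutex<Vec<u32>>) -> u32 {
    let guard = m.lock().unwrap();
    guard.iter().sum()
}

fn main() {
    let m = Mutex::new(vec![1, 2, 3]);
    // The temporary guard is released at the end of this statement.
    append_copy(&mut m.lock().unwrap());
    println!("{}", total(&m));
}
```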
Rust takes a pragmatic approach to safety. You can always write the unsafe keyword and then use the same pointer tricks you'd use in C. But you don't need to be quite so guarded when writing the other 95% of your code. I want a language that assumes I'm brilliant but distracted :)
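As an illustration of what "the same pointer tricks" can look like, here is a hypothetical sum_raw function that walks a slice with raw pointers, C style. It is a sketch rather than something you would want in real code, since xs.iter().sum() does the same job safely.

```rust
/// Hypothetical example: sums a slice by walking raw pointers.
fn sum_raw(xs: &[u32]) -> u32 {
    let mut total = 0u32;
    let mut p = xs.as_ptr();
    // SAFETY: `end` is at most one past the last element of `xs`, which is
    // a valid pointer to compute as long as it is never dereferenced.
    let end = unsafe { p.add(xs.len()) };
    while p != end {
        // SAFETY: `p` lies in `[start, end)`, so it points at a live element.
        unsafe {
            total = total.wrapping_add(*p);
            p = p.add(1);
        }
    }
    total
}

fn main() {
    println!("{}", sum_raw(&[1, 2, 3]));
}
```

The unsafe blocks mark exactly the lines a reviewer has to check by hand; everything outside them is still checked by the compiler.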
Rust catches these mistakes at compile time, and produces the same code you'd see with equivalent constructs in C++. For a more in-depth comparison, see Ruud van Asseldonk's excellent series of articles about porting a spectral path tracer from C++ to Rust. The Rust code performs basically the same as Clang / GCC / MSVC on the same platform. Not surprising, because Rust uses LLVM and benefits from the same backend optimizations as Clang.
lazy_static! is not a built-in language feature; it's a macro provided by a third-party library. Since the library uses Cargo, I can include it in my project by adding
[dependencies.lazy_static]
git = "https://github.com/Kimundi/lazy-static.rs"
to Cargo.toml and then adding
#[phase(plugin)]
extern crate lazy_static;
to src/lib.rs. Cargo will automatically fetch and build all dependencies. Code reuse becomes no harder than in your favorite scripting language.
Finally, we define a function that pushes a new event onto the vector:
pub fn log(e: Event) {
    LOG.lock().push(e);
}
LOG.lock() produces an RAII handle that will automatically unlock the mutex when it falls out of scope. In C++ I always hesitate to use temporaries like this because if they're destroyed too soon, my program will segfault or worse. Rust has compile-time lifetime checking, so I can do things that would be reckless in C++.
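The lifetime of that temporary can be observed directly. In this sketch, written against today's std::sync::Mutex (whose lock() returns a Result, hence the unwrap()), two hypothetical helpers use try_lock to check whether the mutex is still held.

```rust
use std::sync::Mutex;

/// Pushes through a temporary guard, then reports whether the mutex is
/// free again.
fn push_then_probe(m: &Mutex<Vec<u32>>, x: u32) -> bool {
    // The guard is a temporary and is dropped at the end of this statement.
    m.lock().unwrap().push(x);
    m.try_lock().is_ok()
}

/// Keeps a named guard alive and probes the mutex while it is still held.
fn probe_while_held(m: &Mutex<Vec<u32>>) -> bool {
    let _guard = m.lock().unwrap();
    // The mutex is still locked here, so `try_lock` fails.
    m.try_lock().is_ok()
}

fn main() {
    let m = Mutex::new(Vec::new());
    println!("free after temporary guard: {}", push_then_probe(&m, 1));
    println!("free while guard is held:   {}", probe_while_held(&m));
}
```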
If you scroll up you'll see a lot of prose and not a lot of code. That's because I got a huge amount of functionality for free. Here's the logging module again:
#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}
lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}
pub fn log(e: Event) {
    LOG.lock().push(e);
}
This goes in src/event.rs and we include it from src/lib.rs.
#[cfg(feature = "log-events")]
pub mod event;
The cfg attribute is how Rust does conditional compilation. Another project can specify
[dependencies.string_cache]
git = "https://github.com/servo/string-cache"
features = ["log-events"]
and add code to dump the log:
for e in string_cache::event::LOG.lock().iter() {
    println!("{}", e);
}
Any project which doesn't opt in to log-events will see zero impact from any of this.
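Here is a self-contained sketch of the same opt-in pattern within a single file. The intern and logging_enabled functions are stand-ins invented for the example, and it assumes the crate's Cargo.toml declares a log-events feature. Built without that feature, the event module and the call into it are not compiled at all.

```rust
// Compiled only when the crate is built with `--features log-events`
// (assumes the feature is declared in Cargo.toml).
#[cfg(feature = "log-events")]
mod event {
    use std::sync::Mutex;

    pub static LOG: Mutex<Vec<u64>> = Mutex::new(Vec::new());

    pub fn log(id: u64) {
        LOG.lock().unwrap().push(id);
    }
}

/// Hypothetical stand-in for the interner: returns the ID unchanged and
/// records it only when logging is compiled in.
fn intern(id: u64) -> u64 {
    #[cfg(feature = "log-events")]
    event::log(id);
    id
}

/// Reports at run time whether the feature was enabled at compile time.
fn logging_enabled() -> bool {
    cfg!(feature = "log-events")
}

fn main() {
    intern(42);
    println!("logging enabled: {}", logging_enabled());
}
```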
If you'd like to learn Rust, the Guide is a good place to start. We're getting close to 1.0 and the important concepts have been stable for a while, but the details of syntax and libraries are still in flux. It's not too early to learn, but it might be too early to maintain a large library.
By the way, here are the events generated by interning the three strings "foobarbaz", "foo", and "blockquote":
Insert(0x7f1daa023090, foobarbaz)
Intern(0x7f1daa023090)
Intern(0x6f6f6631)
Intern(0xb00000002)
There are three different kinds of IDs, indicated by the least significant bits. The first is a pointer into a standard interning table, which is protected by a mutex. The other two are created without synchronization, which improves parallelism between parser threads.
In UTF-8, the string foo is smaller than a 64-bit pointer, so we store the characters directly. blockquote is too big for that, but it corresponds to a well-known HTML tag. 0xb is the index of blockquote in a static list of strings that are common on the Web.
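The three sample IDs are enough to guess at a plausible encoding. The sketch below is inferred only from those values and is not taken from the string-cache source: it assumes the low two bits are the tag (0 = pointer, 1 = inline, 2 = static), that an inline ID keeps its length in bits 4 to 7 with the bytes following in little-endian order, and that a static ID keeps its index in the upper 32 bits. All names here are hypothetical.

```rust
/// Hypothetical decoded form of a 64-bit ID.
#[derive(Debug, PartialEq)]
enum Atom {
    Dynamic(u64),   // pointer into the mutex-protected table
    Inline(String), // short string packed into the ID itself
    Static(u32),    // index into the static list of common strings
}

/// Decodes an ID under the layout assumed above.
fn decode(id: u64) -> Option<Atom> {
    match id & 0b11 {
        0 => Some(Atom::Dynamic(id)),
        1 => {
            let len = ((id >> 4) & 0xf) as usize;
            if len > 7 {
                return None; // only 7 bytes fit next to the tag byte
            }
            let bytes = id.to_le_bytes();
            String::from_utf8(bytes[1..1 + len].to_vec())
                .ok()
                .map(Atom::Inline)
        }
        2 => Some(Atom::Static((id >> 32) as u32)),
        _ => None,
    }
}

/// Packs a short string into an inline ID, or returns None if it is too long.
fn encode_inline(s: &str) -> Option<u64> {
    let b = s.as_bytes();
    if b.len() > 7 {
        return None;
    }
    let mut id = 1u64 | ((b.len() as u64) << 4);
    for (i, &byte) in b.iter().enumerate() {
        id |= (byte as u64) << (8 * (i + 1));
    }
    Some(id)
}

fn main() {
    for id in [0x7f1daa023090u64, 0x6f6f6631, 0xb00000002] {
        println!("{:#x} -> {:?}", id, decode(id));
    }
    println!("{:?}", encode_inline("foo").map(|id| format!("{:#x}", id)));
}
```

Under these assumptions, "foo" packs to 0x6f6f6631, matching the log above, while "blockquote" at ten bytes does not fit inline.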
Static atoms can also be used in pattern matching, and LLVM's optimizations for C's switch statements will apply.
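A sketch of what such a match might look like, treating static atoms as plain integer constants. Only the blockquote value comes from the log above; the other two constants and the is_block_level function are made up for illustration.

```rust
// Only BLOCKQUOTE's value (index 0xb) comes from the sample log output;
// the other two indices are hypothetical.
const ATOM_BLOCKQUOTE: u64 = 0xb_0000_0002;
const ATOM_DIV: u64 = 0xc_0000_0002; // hypothetical index
const ATOM_SPAN: u64 = 0xd_0000_0002; // hypothetical index

/// Hypothetical classifier: a match on integer constants, which the
/// compiler is free to lower the same way as a C switch.
fn is_block_level(atom: u64) -> bool {
    match atom {
        ATOM_BLOCKQUOTE | ATOM_DIV => true,
        ATOM_SPAN => false,
        _ => false,
    }
}

fn main() {
    println!("{}", is_block_level(ATOM_BLOCKQUOTE));
    println!("{}", is_block_level(ATOM_SPAN));
}
```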