One reason I'm excited about Rust is that I can compile Rust code to a simple native-code library, without heavy runtime dependencies, and then call it from any language. Imagine writing performance-critical extensions for Python, Ruby, or Node in a safe, pleasant language that has static lifetime checking, pattern matching, a real macro system, and other goodies like that. For this reason, when I started html5ever some six months ago, I wanted it to be more than another "Foo for BarLang" project. I want it to be the HTML parser of choice, for a wide variety of applications in any language.
Today I started work in earnest on the C API for html5ever. In only a few hours I had a working demo. And this is a fairly complicated library, with 5,000+ lines of code incorporating
- most of the hilariously complicated parsing rules in the HTML spec,
- a Rust syntax extension for writing parse rules in a concise form that matches the spec,
- compile-time perfect hash maps for string interning and named characters, and
- lots and lots of generic code — if this library were written in C++, almost all of it would be in header files.
It's pretty cool that we can use all this machinery from C, or any language that can call C. I'll describe first how to build and use the library, and then I'll talk about the implementation of the C API.
html5ever (for C or for Rust) is not finished yet, but if you're feeling adventurous, you are welcome to try it out! And I'd love to have more contributors. Let me know on GitHub about any issues you run into.
Using html5ever from C
Like most Rust libraries, html5ever builds with Cargo.
$ git clone https://github.com/kmcallister/html5ever
$ cd html5ever
$ git checkout dev
$ cargo build
Updating git repository `https://github.com/sfackler/rust-phf`
Compiling phf_mac v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
Compiling html5ever-macros v0.0.0 (file:///tmp/html5ever)
Compiling phf v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
Compiling html5ever v0.0.0 (file:///tmp/html5ever)
The C API isn't Cargo-ified yet, so we'll build it using the older Makefile-based system.
$ mkdir build
$ cd build
$ ../configure
$ make libhtml5ever_for_c.a
rustc -D warnings -C rpath -L /tmp/html5ever/target -L /tmp/html5ever/target/deps \
-o libhtml5ever_for_c.a --cfg for_c --crate-type staticlib /tmp/html5ever/src/lib.rs
warning: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: rt
note: library: dl
note: library: pthread
note: library: gcc_s
note: library: pthread
note: library: c
note: library: m
Now we can build an example C program using that library, following the link instructions produced by rustc.
$ H5E_PATH=/tmp/html5ever
$ gcc -Wall -o tokenize tokenize.c -I $H5E_PATH/capi -L $H5E_PATH/build \
-lhtml5ever_for_c -lrt -ldl -lpthread -lgcc_s -lpthread -lc -lm
$ ./tokenize 'Hello, <i class=excellent>world!</i>'
CHARS : Hello
CHARS : ,
CHARS :
TAG : <i>
ATTR: class="excellent"
CHARS : world!
TAG : </i>
The build process is pretty standard for C; we just link a .a file and its dependencies. The biggest obstacle right now is that you won't find the Rust compiler in your distro's package manager, because the language is still changing so rapidly. But there's a ton of effort going into stabilizing the language for a Rust 1.0 release this year. It won't be too long before rustc is a reasonable build dependency.
Let's look at the C client code.
#include <stdio.h>
#include "html5ever.h"

void put_str(const char *x) {
    fputs(x, stdout);
}

void put_buf(struct h5e_buf text) {
    fwrite(text.data, text.len, 1, stdout);
}

void do_start_tag(void *user, struct h5e_buf name, int self_closing, size_t num_attrs) {
    put_str("TAG : <");
    put_buf(name);
    if (self_closing) {
        putchar('/');
    }
    put_str(">\n");
}

// ...

struct h5e_token_ops ops = {
    .do_chars = do_chars,
    .do_start_tag = do_start_tag,
    .do_tag_attr = do_tag_attr,
    .do_end_tag = do_end_tag,
};

struct h5e_token_sink sink = {
    .ops = &ops,
    .user = NULL,
};

int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("Usage: %s 'HTML fragment'\n", argv[0]);
        return 1;
    }
    struct h5e_tokenizer *tok = h5e_tokenizer_new(&sink);
    h5e_tokenizer_feed(tok, h5e_buf_from_cstr(argv[1]));
    h5e_tokenizer_end(tok);
    h5e_tokenizer_free(tok);
    return 0;
}
The struct h5e_token_ops contains pointers to callbacks. Any events we don't care to handle are left as NULL function pointers. Inside main, we create a tokenizer and feed it a string. html5ever for C uses a simple pointer+length representation of buffers, which is this struct h5e_buf you see being passed by value.
This demo only does tokenization, not tree construction. html5ever can perform both phases of parsing, but the API surface for tree construction is much larger and I haven't gotten around to writing C bindings for it yet.
Implementing the C API
Some parts of Rust's libstd depend on runtime services, such as task-local data, that a C program may not have initialized. So the first step in building a C API was to eliminate all std:: imports. This isn't nearly as bad as it sounds, because large parts of libstd are just re-exports from other libraries like libcore that we can use with no trouble. To be fair, I did write html5ever with the goal of a C API in mind, and I avoided features like threading that would be difficult to integrate. So your library might give you more trouble, depending on which Rust features you use.
The next step was to add the #![no_std] crate attribute. This means we no longer import the standard prelude into every module. To compensate, I added use core::prelude::*; to most of my modules. This brings in the parts of the prelude that can be used without runtime system support. I also added many imports for ubiquitous types like String and Vec, which come from libcollections.
After that I had to get rid of the last references to libstd. The biggest obstacle here involved macros and deriving, which would produce references to names under std::. To work around this, I created a fake little mod std which re-exports the necessary parts of core and collections. This is similar to libstd's "curious inner-module".
I also had to remove all uses of format!(), println!(), etc., or move them inside #[cfg(not(for_c))]. I needed to copy in the vec!() macro, which is only provided by libstd, even though the Vec type is provided by libcollections. And I had to omit debug log messages when building for C; I did this with conditionally-defined macros.
With all this preliminary work done, it was time to write the C bindings. Here's how the struct of function pointers looks on the Rust side:
#[repr(C)]
pub struct h5e_token_ops {
    do_start_tag: extern "C" fn(user: *mut c_void, name: h5e_buf,
                                self_closing: c_int, num_attrs: size_t),
    do_tag_attr: extern "C" fn(user: *mut c_void, name: h5e_buf,
                               value: h5e_buf),
    do_end_tag: extern "C" fn(user: *mut c_void, name: h5e_buf),
    // ...
}
The processing of tokens is straightforward. We pattern-match and then call the appropriate function pointer, unless that pointer is NULL. (Edit: eddyb points out that storing NULL as an extern "C" fn is undefined behavior. Better to use Option<extern "C" fn ...>, which will optimize to the same one-word representation.)
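To make the Option suggestion concrete, here is a minimal sketch in current Rust of a callback table with nullable slots and a dispatcher that pattern-matches on the token. Every name in it (DemoOps, DemoToken, dispatch, count_start_tag) is hypothetical and chosen for illustration; it is not html5ever's actual code, and the callbacks take fewer arguments than the real ones.

```rust
use std::os::raw::{c_int, c_void};

// Hypothetical names throughout; this is an illustration, not html5ever's code.
pub type StartTagFn = extern "C" fn(user: *mut c_void, self_closing: c_int);
pub type EndTagFn = extern "C" fn(user: *mut c_void);

#[repr(C)]
pub struct DemoOps {
    // `None` is represented the same way as a NULL function pointer in C.
    pub do_start_tag: Option<StartTagFn>,
    pub do_end_tag: Option<EndTagFn>,
}

pub enum DemoToken {
    StartTag { self_closing: bool },
    EndTag,
}

// Pattern-match on the token, then call the matching callback if one is set.
// Returns true if a callback ran, false if the slot was empty.
pub fn dispatch(ops: &DemoOps, user: *mut c_void, token: &DemoToken) -> bool {
    match *token {
        DemoToken::StartTag { self_closing } => match ops.do_start_tag {
            Some(f) => {
                f(user, self_closing as c_int);
                true
            }
            None => false,
        },
        DemoToken::EndTag => match ops.do_end_tag {
            Some(f) => {
                f(user);
                true
            }
            None => false,
        },
    }
}

// Example callback: assumes `user` points at a valid u32 counter.
pub extern "C" fn count_start_tag(user: *mut c_void, _self_closing: c_int) {
    unsafe { *(user as *mut u32) += 1 };
}

// Checks that the optional callback still occupies a single machine word.
pub fn callback_slot_is_one_word() -> bool {
    std::mem::size_of::<Option<StartTagFn>>() == std::mem::size_of::<usize>()
}
```

A C client would see DemoOps as a plain struct of function pointers and could leave any slot as NULL; on the Rust side that slot simply reads as None.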
To create a tokenizer, we heap-allocate the Rust data structure in a Box, and then transmute that to a raw C pointer. When the C client calls h5e_tokenizer_free, we transmute this pointer back to a box and drop it, which will invoke destructors and finally free the memory.
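The round trip from Box to raw pointer and back can be sketched as follows. Note the swap: instead of the transmute described above, this sketch uses Box::into_raw and Box::from_raw, which current Rust provides for exactly this purpose. DemoTokenizer, the demo_tokenizer_* functions, and the DROPS counter are all hypothetical names, and the drop counter exists only to make the destructor call observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts destructor runs so the effect of freeing is observable.
pub static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the real tokenizer state.
pub struct DemoTokenizer {
    pub bytes_fed: usize,
}

impl Drop for DemoTokenizer {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

// Heap-allocate the state and hand ownership to the caller as a raw pointer.
pub extern "C" fn demo_tokenizer_new() -> *mut DemoTokenizer {
    Box::into_raw(Box::new(DemoTokenizer { bytes_fed: 0 }))
}

// The caller must pass a pointer obtained from demo_tokenizer_new
// that has not been freed yet.
pub unsafe extern "C" fn demo_tokenizer_feed(tok: *mut DemoTokenizer, len: usize) {
    unsafe { (*tok).bytes_fed += len };
}

// Rebuild the Box and drop it: this runs the destructor, then frees the memory.
// A null pointer is ignored.
pub unsafe extern "C" fn demo_tokenizer_free(tok: *mut DemoTokenizer) {
    if !tok.is_null() {
        drop(unsafe { Box::from_raw(tok) });
    }
}
```

The C side only ever holds an opaque pointer, which is why the header can declare the struct without defining it.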
You'll note that the functions exported to C have several special annotations:
- #[no_mangle]: skip name mangling, so we end up with a linker symbol named h5e_tokenizer_free instead of _ZN5for_c9tokenizer18h5e_tokenizer_free.
- unsafe: don't let Rust code call these functions unless it promises to be careful.
- extern "C": make sure the exported function has a C-compatible ABI. The data structures similarly get a #[repr(C)] attribute.
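As an illustration of how these annotations combine, here is a sketch in current Rust of a pointer+length buffer passed by value across a C-compatible boundary. DemoBuf and both functions are hypothetical and only loosely modeled on the buffer type described earlier; they are not html5ever's implementation. The no-mangle attribute is left off because its required spelling differs between Rust editions; a real export would carry it.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Hypothetical pointer+length buffer. #[repr(C)] fixes the field order and
// layout so that a matching C struct declaration describes the same bytes.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct DemoBuf {
    pub data: *const u8,
    pub len: usize,
}

// Build a buffer that borrows the bytes of a NUL-terminated C string.
// `unsafe` because the caller must pass a valid, NUL-terminated pointer
// that outlives every use of the returned buffer.
// (A real export would also be marked no-mangle; omitted in this sketch.)
pub unsafe extern "C" fn demo_buf_from_cstr(s: *const c_char) -> DemoBuf {
    let bytes = unsafe { CStr::from_ptr(s) }.to_bytes();
    DemoBuf {
        data: bytes.as_ptr(),
        len: bytes.len(),
    }
}

// View the buffer as a byte slice on the Rust side.
// The caller chooses the lifetime and must keep the underlying bytes alive.
pub unsafe fn demo_buf_as_slice<'a>(buf: DemoBuf) -> &'a [u8] {
    unsafe { std::slice::from_raw_parts(buf.data, buf.len) }
}

// With #[repr(C)], the struct is exactly one pointer plus one size_t wide.
pub fn demo_buf_is_two_words() -> bool {
    std::mem::size_of::<DemoBuf>() == 2 * std::mem::size_of::<usize>()
}
```

The buffer does not own its bytes and carries no terminator, so the length has to travel with the pointer; that is what makes passing the struct by value cheap.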
Then I wrote a C header file matching this ABI:
struct h5e_buf {
    unsigned char *data;
    size_t len;
};

struct h5e_buf h5e_buf_from_cstr(const char *str);

struct h5e_token_ops {
    void (*do_start_tag)(void *user, struct h5e_buf name,
                         int self_closing, size_t num_attrs);
    void (*do_tag_attr)(void *user, struct h5e_buf name,
                        struct h5e_buf value);
    void (*do_end_tag)(void *user, struct h5e_buf name);
    /// ...
};

struct h5e_tokenizer;

struct h5e_tokenizer *h5e_tokenizer_new(struct h5e_token_sink *sink);
void h5e_tokenizer_free(struct h5e_tokenizer *tok);
void h5e_tokenizer_feed(struct h5e_tokenizer *tok, struct h5e_buf buf);
void h5e_tokenizer_end(struct h5e_tokenizer *tok);
One remaining issue is that Rust is hard-wired to use jemalloc, so linking html5ever will bring that in alongside the system's libc malloc. Having two separate malloc heaps will likely increase memory consumption, and it prevents us from doing fun things like allocating Boxes in Rust that can be used and freed in C. Before Rust can really be a great choice for writing C libraries, we need a better solution for integrating the allocators.
If you'd like to talk about calling Rust from C, you can find me as kmc in #rust and #rust-internals on irc.mozilla.org. And if you run into any issues with html5ever, do let me know, preferably by opening an issue on GitHub. Happy hacking!