Wednesday, March 18, 2015

html5ever project update: one year!

I started the html5ever project just a bit over one year ago. We adopted the current name last July.

<kmc> maybe the whole project needs a better name, idk
<Ms2ger> htmlparser, perhaps
<jdm> tagsoup
<Ms2ger> UglySoup
<Ms2ger> Since BeautifulSoup is already taken
<jdm> html5ever
<Ms2ger> No
<jdm> you just hate good ideas
<pcwalton> kmc: if you don't call it html5ever that will be a massive missed opportunity

By that point we already had a few contributors. Now we have 469 commits from 18 people, which is just amazing. Thank you to everyone who helped with the project. Over the past year we've upgraded Rust almost 50 times; I'm extremely grateful to the community members who had a turn at this Sisyphean task.

Several people have also contributed major enhancements. For example:

  • Clark Gaebel implemented zero-copy parsing. I'm in the process of reviewing this code and will be landing pieces of it in the next few weeks.

  • Josh Matthews made it possible to suspend and resume parsing from the tree sink. Servo needs this to do async resource fetching for external <script>s of the old-school (non-async/defer) variety.

  • Chris Paris implemented fragment parsing and improved serialization. This means Servo can use html5ever not only for parsing whole documents, but also for the innerHTML/outerHTML getters and setters within the DOM.

  • Adam Roben brought us dramatically closer to spec conformance. Aside from foreign (XML) content and <template>, we pass 99.6% of the html5lib tokenizer and tree builder tests! Adam also improved the build and test infrastructure in a number of ways.

I'd also like to thank Simon Sapin for doing the initial review of my code, and finding a few bugs in the process.

html5ever makes heavy use of Rust's metaprogramming features. It's been something of a wild ride, and we've collaborated with the Rust team in a number of ways. Felix Klock came through in a big way when a Rust upgrade broke the entire tree builder. Lately, I've been working on improvements to Rust's macro system ahead of the 1.0 release, based in part on my experience with html5ever.

Even with the early-adopter pains, the use of metaprogramming was absolutely worth it. Most of the spec-conformance patches were only a few lines, because our encoding of parser rules is so close to what's written in the spec. This is especially valuable with a "living standard" like HTML.

The future

Two upcoming enhancements are a high priority for Web compatibility in Servo:

  • Character encoding detection and conversion. This will build on the zero-copy UTF-8 parsing mentioned above. Non-UTF-8 content (~15% of the Web) will have "one-copy parsing" after a conversion to UTF-8. This keeps the parser itself lean and mean.

  • document.write support. This API can insert arbitrary UTF-16 code units (which might not even be valid Unicode) in the middle of the UTF-8 stream. To handle this, we might switch to WTF-8. Along with document.write we'll start to do speculative parsing.

It's likely that I'll work on one or both of these in the next quarter.

Servo may get SVG support in the near future, thanks to canvg. SVG nodes can be embedded in HTML or loaded from an external XML file. To support the first case, html5ever needs to implement WHATWG's rules for parsing foreign content in HTML. To handle external SVG we could use a proper XML parser, or we could extend html5ever to support "XML5", an error-tolerant XML syntax similar to WHATWG HTML. Ygg01 made some progress towards implementing XML5. Servo would most likely use it for XHTML as well.

Improved performance is always a goal. html5ever describes itself as "high-performance" but does not have specific comparisons to other HTML parsers. I'd like to fix that in the near future. Zero-copy parsing will be a substantial improvement, once some performance issues in Rust get fixed. I'd like to revisit SSE-accelerated parsing as well.

I'd also like to support html5ever on some stable Rust 1.x version, although it probably won't happen for 1.0.0. The main obstacle here is procedural macros. Erick Tryzelaar has done some great work recently with syntex, aster, and quasi. Switching to this ecosystem will get us close to 1.x compatibility and will clean up the macro code quite a bit. I'll be working with Erick to use html5ever as an early validation of his approach.

Simon has extracted Servo's CSS selector matching engine as a stand-alone library. Combined with html5ever this provides most of the groundwork for a full-featured HTML manipulation library.

The C API for html5ever still builds, thanks to continuous integration. But it's not complete or well-tested. With the removal of Rust's runtime, maintaining the C API does not restrict the kind of code we can write in other parts of the parser. All we need now is to complete the C API and write tests. This would be a great thing for a community member to work on. Then we can write bindings for every language under the sun and bring fast, correct, memory-safe HTML parsing to the masses :)