Case study: bloom filter

This post is part of a series in the context of my PhD work, showcasing the future Rox* (Rust-Oxide-F*) toolchain that will bring deductive verification to Rust. For the full context, see the initial post. The code for this case study is available here.

### Servo’s bloom filter

Like most major browser engines, Servo cares about optimizing CSS rendering of Web pages. It works like this: each element of the DOM (Document Object Model) is qualified with attributes like classes and identifiers. On the other hand, your styling is declared via rules associating a CSS selector (a combination of attributes) with a specific styling. When you render your Web page, you go through each DOM element and figure out which CSS selector applies to it. However, figuring out which CSS selector applies to a DOM element is expensive, and to optimize it most major browsers use an auxiliary data structure called a bloom filter.

A bloom filter can be seen as some kind of set of elements. You can add and remove elements from it, and query if an element is in the bloom filter. All these operations have a constant-time asymptotic complexity and are really fast (usually a bloom filter is implemented as a simple array), so there is a catch. Indeed, the query operation is not exact:

• if the query returns false, then you know for sure that the queried element is not in the bloom filter;
• if the query returns true, then the element is probably in the bloom filter, but maybe not. The probability of false positives can be computed theoretically from the parameters of the bloom filter.

Inside Servo, the bloom filter is used in the following way for each different DOM element: first you fill the bloom filter with all the attributes of the DOM element. Then for each attribute of the CSS selector, you query inside the bloom filter. If all the queries return true, then your selector migh match the DOM element and you proceed with a expensive but exact check. If at least one of the query returns false, then the selector cannot match the DOM element and you can fast-reject it.

The implementation of the bloom filter in Servo is quite simple: it is just an array of byte-sized counters. When you want to add an element, you compute two indexes of the array from the hash of the element, and increment the counters at these indexes. Removing is similar but you decrement the counters. Query returns true if both indexes have non-zero counters. However, there are some tricky corner cases even with this simple implementation: what if the two indexes of the elements are the same? What happens with overflow and underflow? What happens if you remove an element that you have not added in the first place?

I wanted to make sure that all those cases were accounted for in Servo’s implementation and proceeded to translate the core logic of the bloom filter into F*. The property that should be preserved by all the operations of the bloom filter is: for each element that you’ve inserted into the bloom filter, the query for this element should return true. Indeed, if this property is preserved by all the operations manipulating the bloom filter, then it is always true. However, I need to keep track of all the elements that are exactly inside the bloom filter.

For that, I move out of the Rust world into F* where I define a ghost data structure, which will represent exactly all the elements that are inside the bloom filter at any moment. This data structure is “ghost” because it is not meant to be implemented or executed, but it is rather a proof artifact. Given a bloom filter and its ghost set of present elements, I can then move on to prove the preservation of the property.

The proof takes up around 500 lines of F* code, and took me about two weeks to complete. It is manually and carefully crafted, because it is full of recursive invariants that Z3 cannot figure out by itself. Finishing it gave me three important insights on how the bloom filter actually works.

1. To prove the preservation of the property, I actually had to prove a much stronger invariant: “the value of each counter inside the bloom filter array is either MAX_U8, or is equal to the sum of all the times an element whose hash corresponds to the counter has been inserted (minus the times it’s been removed), counted twice if the two hash indexes coincide for an element”.
2. When a counter reaches MAX_U8, it enters an absorbing state where the query always returns true. This increases the rate of false positives returned by the bloom filter.
3. If you remove from the bloom filter an element that was actually not in it, then you may break the correctness of the bloom filter.

This third property is very interesting because it adds a precondition on the remove function of the bloom filter: the element you remove has to have been inserted before. As a consequence, the code calling the bloom filter’s remove must respect the precondition. In Servo, the bloom filter is updated incrementally during the DOM tree traversal with a clever mechanism heavily commented and debug-asserted by the developer. It would be interesting to continue and prove that the incremental update mechanism never violates the bloom filter precondition along the tree traversal. For now, we can say that this file, up to trusting a manual translation from Rust to F*, is the first verified implementation of a bloom filter in Rust.