
Bloom Filters

Bloom filters let you be lazy, and that makes them awesome.

Fundamentally, a Bloom filter is a probabilistic data structure that can answer the question "is X in set S?" with one of two answers:

  1. No, definitely not.
  2. Maybe.

If that doesn't sound terribly useful, let's go build a NoSQL database storage engine. It'll be fun, and it won't take that long at all!

Everybody Gets A NoSQL Engine!

Let's start with some domain models.

An object is a collection of attributes, with values.

Our access pattern analysis (yes, we did one) indicates that most objects are created, referenced immediately and then accessed infrequently as they age. To maximize our I/O performance, we will store sets of objects in separate blocks, each of which lives in a file on-disk, and gets memory-mapped in when needed.

A naïve data architecture might look something like this:

A boxes-and-arrows diagram showing how block-allocated memory map data structures might naively be implemented.  An arrow from the left points to the first block.  There are five total blocks, and each block consists of two cells: one pointing to the next block, and the other containing the data for this block.

(We're going to assume that the blocks themselves are structured sanely, but we'll skip the particulars.)

Finding an object is straightforward. Start at the first block and scan it sequentially until you find an object that has the named attributes, with the requisite values. When you reach the end of one block, continue to the next. If you hit the NULL at the end, the object doesn't exist.
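In code, the naïve chain and its search might look something like this sketch (the Block and find names, and objects-as-dicts, are purely our illustration, not the engine's actual layout):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Block:
        data: list = field(default_factory=list)  # objects stored in this block
        next: Optional["Block"] = None            # the next block, or None at the end

    def find(head: Optional[Block], criteria: dict):
        block = head
        while block is not None:                  # walk from block to block...
            for obj in block.data:                # ...scanning each one sequentially
                if all(obj.get(k) == v for k, v in criteria.items()):
                    return obj
            block = block.next
        return None                               # hit the NULL; object doesn't exist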

Functional, yes. Performant, hell no.

The problem stems from the non-uniform nature of each block: they can contain all kinds of objects. These objects may be so dissimilar from our search criteria that it's no use even scanning the block. If only we had a compact way of representing the attributes present in each block's object set.

Ooh! Bloom filters!

If we stuff a moderately sized Bloom filter into our block structure (probably after next, but before you get to the data itself), we can do a quick gut check to see if we should even bother scanning. Hey Bloom filter, does the block even have the attributes we're searching on?

The no/maybe nature of the answers you get from a Bloom filter doesn't hurt us too much here. For starters, false negatives, where the filter mistakenly says that an attribute is definitely not present, are prohibited by the underpinning math. That means no matter what, we'll find all of the relevant blocks. The ambiguity of the "it might be here" answer is also not a problem. If the filter mistakenly infers that the attribute is present when in fact it isn't, we end up scanning the block needlessly, but it doesn't affect the correctness of the search algorithm. These mistakes, by the way, are called false positives, and we can control their chance of occurrence by tuning the Bloom filter.

The Man, The Math, The Legend

Bloom filters were proposed by Burton H. Bloom in his 1970 paper, Space/Time Trade-offs in Hash Coding with Allowable Errors. In it, he proposes the use of imperfect hash-coding as an acceptable substitute where perfect hashing would require too much memory.

As an example use case, Bloom turns to the English language. 90% of words in English can be hyphenated correctly using a few simple (low cost) rules. The other 10% require more costly (computationally speaking) measures. So Bloom proposed a solution whereby it would be simple to determine if a single element of a set was either (a) definitely not in the 10% or (b) possibly in the 10%, using probability and some clever math.

The basic idea is this: using a few different hash functions, calculate several hashed values. Use those hashed values as indices into an array of bits, and set those positions to 1. Then, to test for set membership, repeat the hashing operations and see if the bits at all of those indices are non-zero.
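Here's that idea as a compact Python sketch. Everything in it is illustrative: the class name, the bit-vector-as-list representation, and the trick of deriving \(k\) "different" hash functions by salting a single real one:

    import hashlib

    class BloomFilter:
        def __init__(self, m: int, k: int):
            self.m = m           # size of the bit vector
            self.k = k           # number of hash functions
            self.bits = [0] * m  # the bit vector, all slots empty

        def _indices(self, key: str):
            # Fake up k distinct hash functions by salting one real
            # hash with the function's index.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, key: str):
            for idx in self._indices(key):
                self.bits[idx] = 1

        def might_contain(self, key: str) -> bool:
            # True means "maybe"; False means "definitely not".
            return all(self.bits[idx] for idx in self._indices(key))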

Let's illustrate, graphically.

A diagram consisting of 16 empty cells, as if on a segmented tape.  The leftmost cell is labeled '0'; the rightmost is labeled '15'.

We start with a 16-element bit vector. Each box above is a single slot in the vector, and they are all empty.

A diagram, built from the previous, showing how the string 'key-the-first' gets hashed three times and placed into one of the three different cells in the 16-cell vector.

Here, we insert key-the-first into the filter. We do so by calculating three hashed values, using three different hash functions, \(h_1\), \(h_2\) and \(h_3\). We get the values 2, 11, and 14, and we set the bits at those three positions in our bit vector.

Another iteration on the 16-cell vector diagram, this time showing how 'key-the-second' gets inserted over the top of the previous insertion.

Next, we insert key-the-second, using the same process. Calculate the three hashed values (5, 11, and 15), and set the bit vector positions appropriately. Note that \(h_2\) collided on this particular key — it was bound to happen eventually, and it's actually why we triangulate with more than one hashing function.

Checking keys is similar:

A diagram running the hashing operation 'in reverse' and checking to see if the string 'key-the-first' exists in the Bloom filter by asking if the 3 hash locations in the bit vector (2, 11, and 14) are on.

To see if key-the-first is in the filter set, we again calculate the hashed values using our hashing functions. However, instead of modifying the bit vector, we're just going to query it. Since all three bits (2, 11, and 14) are set, we can say that key-the-first might be in the filter set.

Conversely, looking for a key that we have not yet added to the filter set:

A diagram illustrating how a query for a non-present key, like 'key-the-third' works.

fails. We calculate the hashed values to be 0, 2, and 14, but since not all of those bit vector slots have been set, we can guarantee that key-the-third is definitely not in the filter set.
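In terms of the sketch from earlier, that whole walkthrough is just a few lines (the real bit positions will differ from the diagrams, since actual hash functions won't obligingly land on 2, 11, and 14):

    bf = BloomFilter(m=16, k=3)
    bf.add("key-the-first")
    bf.add("key-the-second")

    bf.might_contain("key-the-first")   # True:  "maybe"
    bf.might_contain("key-the-third")   # False: "definitely not"
                                        # (barring a false positive)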

Why can't a Bloom filter give a definitive positive answer? Because there just isn't enough information. Intuitively, note that the size of the Bloom filter bit vector is fixed, but the key space is infinite; eventually, our hash functions will collide so much that we start getting false positives as residue from prior insertions. All it takes is three collisions, across the three hash functions we're using, to accidentally set all three bit vector positions to 1.

We can actually calculate the probability of getting a false positive, as a function of the number of hash functions, \(k\), the size of the bit vector, \(m\), and the number of items expected to be inserted into the filter set, \(n\):

$$f = (1 - e^{-kn/m})^k$$

(For more rigorous maths, check the literature in the Further Reading section.)

Armed with the false positive probability function, we can not only calculate the false positive rate for any Bloom filter configuration, but we can also calculate the ideal value for \(k\), given the ratio \(m/n\).

Broder and Mitzenmacher do just that by taking the derivative of \(f\) with respect to \(k\), and find:

$$k_{ideal} = \frac{m}{n} \ln 2$$

Bloom filter implementations like this one can use that equation to determine the correct \(k\) given a bloom factor represented as an integer floor/ceiling of \(m/n\).
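That arithmetic is easy to sketch out, if you want to play with the trade-offs yourself (the function names here are ours):

    import math

    def false_positive_rate(k: int, m: int, n: int) -> float:
        # f = (1 - e^(-kn/m))^k
        return (1 - math.exp(-k * n / m)) ** k

    def ideal_k(m: int, n: int) -> float:
        # the k that minimizes f for a given bits-per-item ratio m/n
        return (m / n) * math.log(2)

    ideal_k(m=1000, n=100)                   # ~6.93, so round to k = 7
    false_positive_rate(k=7, m=1000, n=100)  # ~0.008, i.e. under 1%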

Back To NoSQL!

So, if we modify the data architecture of our NoSQL storage engine to look like this:

A diagram detailing an improvement on our original block-structured memory storage data structure, this time with an added Bloom filter cell between the pointer and data cells.

We can modify our search algorithm to work thusly:

  1. Start with the first block
  2. For each attribute in the query:
    1. Check the Bloom filter for that attribute's presence
    2. If the attribute is definitely not present, skip to the next block
  3. Sequentially scan the block data looking for the object

Terminating, of course, when next is NULL.
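Assuming each Block from the earlier sketch grows a bloom field holding that block's filter (again, our own illustrative names), the improved search looks like:

    def find(head, criteria: dict):
        block = head
        while block is not None:
            # Gut check: if any queried attribute is definitely not
            # in this block, skip the scan entirely.
            if all(block.bloom.might_contain(attr) for attr in criteria):
                for obj in block.data:
                    if all(obj.get(k) == v for k, v in criteria.items()):
                        return obj
            block = block.next
        return None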

We incur a bit more overhead on the write side, since our insertion process for new objects becomes:

  1. For each attribute in the new object:
    1. Set the attribute in the Bloom filter.
  2. Append the object to the block data.
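Or, under the same assumptions as the search sketch:

    def insert(block, obj: dict):
        for attr in obj:             # set each attribute name...
            block.bloom.add(attr)    # ...in the block's Bloom filter
        block.data.append(obj)       # then append the object itself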

This can be batched using something like an LSM-tree to bolster insertion performance, at the cost of either durability or the introduction of a write-ahead log. We can even pull the Bloom filter out of each block and put it in special indexing blocks that can remain memory-mapped, further minimizing disk access. And it's all possible thanks to Bloom filters!

Further Reading

If you can't get enough of research papers, this section is for you!

  1. Space/Time Trade-offs in Hash Coding with Allowable Errors is the seminal paper by Burton H. Bloom, ca. 1970.
  2. Network Applications of Bloom Filters: A Survey expands more rigorously on the underlying math, and provides several real-world applications of Bloom filters.
  3. In On The False Positive Rate of Bloom Filters, Bose et al. investigate the nature of the false positive calculation.
  4. Less Hashing, Same Performance: A Better Bloom Filter covers some mathematical and implementation improvements on standard Bloom filters to reduce both computation time and reliance on randomization.

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently exploring Kubernetes, as both a floor wax and a dessert topping.