Bloom filters let you be lazy, and that makes them awesome.
Fundamentally, a Bloom filter is a probabilistic data structure that can answer the question "is X in set S?" with one of two answers:
- No, definitely not.
- Maybe.
If that doesn't sound terribly useful, let's go build a NoSQL database storage engine. It'll be fun, and it won't take that long at all!
Everybody Gets A NoSQL Engine!
Let's start with some domain models.
An object is a collection of attributes, with values.
Our access pattern analysis (yes, we did one) indicates that most objects are created, referenced immediately and then accessed infrequently as they age. To maximize our I/O performance, we will store sets of objects in separate blocks, each of which lives in a file on-disk, and gets memory-mapped in when needed.
A naïve data architecture might look something like this:
(We're going to assume that the blocks themselves are structured sanely, but we'll skip the particulars.)
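To give that layout a concrete shape, here's a rough sketch in Go; the type and field names are hypothetical, and we'll assume simple string-valued attributes throughout.

```go
// A hypothetical in-memory view of the naïve layout: a singly linked
// chain of blocks, each holding a batch of objects.
type Object struct {
	Attributes map[string]string // attribute name -> value
}

type Block struct {
	Objects []Object // the objects packed into this block
	Next    *Block   // nil marks the end of the chain
}
```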
Finding an object is straightforward. Start at the first block, searching each sequentially until you find an object that has the named attributes, with the requisite values. When you reach the end of one block, continue to the next. If you hit the NULL at the end, the object doesn't exist.
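As a sketch, using the hypothetical types from above, that scan might look like this:

```go
// FindObject walks the chain of blocks, scanning each one sequentially,
// and returns the first object whose attributes match the query.
// A nil return means we hit the end of the chain: the object doesn't exist.
func FindObject(head *Block, query map[string]string) *Object {
	for b := head; b != nil; b = b.Next {
		for i := range b.Objects {
			if matches(&b.Objects[i], query) {
				return &b.Objects[i]
			}
		}
	}
	return nil
}

// matches reports whether the object carries every queried attribute,
// with the requisite value.
func matches(o *Object, query map[string]string) bool {
	for name, want := range query {
		if o.Attributes[name] != want {
			return false
		}
	}
	return true
}
```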
Functional, yes. Performant, hell no.
The problem stems from the non-uniform nature of each block: a block can contain all kinds of objects. These objects may be so dissimilar from our search criteria that it's no use even scanning the block. If only we had a compact way of representing the attributes present in each block's object set.
Ooh! Bloom filters!
If we stuff a moderately sized Bloom filter into our block structure (probably after next, but before you get to the data itself), we can do a quick gut check to see if we should even bother scanning. Hey Bloom filter, does the block even have the attributes we're searching on?
The no/maybe nature of the answers you get from a Bloom filter doesn't hurt us too much here. For starters, false negatives, where the filter mistakenly says that an attribute is definitely not present, are prohibited by the underpinning math. That means no matter what, we'll find all of the relevant blocks. The ambiguity of the "it might be here" answer is also not a problem. If the filter mistakenly infers that the attribute is present when in fact it isn't, we end up scanning the block needlessly, but it doesn't affect the correctness of the search algorithm. These mistakes, by the way, are called false positives, and we can control their chance of occurrence by tuning the Bloom filter.
The Man, The Math, The Legend
Bloom filters were proposed by Burton Bloom, in his 1970 paper, Space/Time Trade-offs in Hash Coding With Allowable Errors. In it, he proposes the use of imperfect hash-coding as an acceptable substitute where perfect hashing would require too much memory.
As an example use case, Bloom turns to the English language. 90% of words in English can be hyphenated correctly using a few simple (low cost) rules. The other 10% require more costly (computationally speaking) measures. So Bloom proposed a solution whereby it would be simple to determine if a single element of a set was either (a) definitely not in the 10% or (b) possibly in the 10%, using probability and some clever math.
The basic idea is this: using a couple of different hash functions, calculate a bunch of different hashed values. Use those hashed values as indices into an array of bits, and set those positions to 1. Then, to test for set membership, repeat the hashing operations, and see if the bits at all indices are non-zero.
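In code, a minimal sketch might look like the following (Go, with FNV-1a seeded by the hash-function number standing in for the \(k\) independent hash functions; that particular trick is my choice for the example, not something the algorithm mandates):

```go
import "hash/fnv"

// Filter is a minimal Bloom filter: m bits and k hash functions.
type Filter struct {
	bits []byte
	m, k uint32
}

func New(m, k uint32) *Filter {
	return &Filter{bits: make([]byte, (m+7)/8), m: m, k: k}
}

// indexes derives k bit positions for a key by seeding FNV-1a with the
// hash-function number, approximating k independent hash functions.
func (f *Filter) indexes(key string) []uint32 {
	idx := make([]uint32, f.k)
	for i := uint32(0); i < f.k; i++ {
		h := fnv.New32a()
		h.Write([]byte{byte(i)})
		h.Write([]byte(key))
		idx[i] = h.Sum32() % f.m
	}
	return idx
}

// Add sets every bit position the key hashes to.
func (f *Filter) Add(key string) {
	for _, i := range f.indexes(key) {
		f.bits[i/8] |= 1 << (i % 8)
	}
}

// Test returns false if the key is definitely not in the set, and true
// if it might be.
func (f *Filter) Test(key string) bool {
	for _, i := range f.indexes(key) {
		if f.bits[i/8]&(1<<(i%8)) == 0 {
			return false
		}
	}
	return true
}
```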
Let's illustrate, graphically.
We start with a 16-element bit vector. Each box above is a single slot in the vector, and they are all empty.
Here, we insert key-the-first into the filter. We do so by calculating three hashed values, using three different hash functions, \(h_1\), \(h_2\) and \(h_3\). We get the values 2, 11, and 14, and we set the bits at those three positions in our bit vector.
Next, we insert key-the-second, using the same process. Calculate the three hashed values (5, 11, and 15), and set the bit vector positions appropriately. Note that \(h_2\) collided on this particular key; it was bound to happen eventually, and it's actually why we triangulate with more than one hashing function.
Checking keys is similar:
To see if key-the-first is in the filter set, we again calculate the hashed values using our hashing functions. However, instead of modifying the bit vector, we're just going to query it. Since all three bits (2, 11, and 14) are set, we can say that key-the-first might be in the filter set.
Conversely, looking for a key that we have not yet added to the filter set:
fails. We calculate the hashed values to be 0, 2, and 14, but since not all of those bit vector slots have been set, we can guarantee that key-the-third is definitely not in the filter set.
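Driving the filter sketched earlier through the same walkthrough looks like this (the actual bit positions you get depend entirely on the hash functions, so don't expect to see 2, 11, and 14):

```go
f := New(16, 3) // a 16-slot bit vector and three hash functions, as in the diagrams

f.Add("key-the-first")
f.Add("key-the-second")

f.Test("key-the-first")  // true: every bit it hashes to is set, so "maybe"
f.Test("key-the-third")  // false (barring an unlucky collision): "definitely not"
```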
Why can't a Bloom filter give a definitive positive answer? Because there just isn't enough information. Intuitively, take note that the size of the Bloom filter bit vector is fixed, but the key space is infinite; eventually our hash functions will collide so much that we start getting false positives as residuals from prior insertions. All it takes is three collisions, across the three hash functions we are using, to accidentally set all three bit vector positions to 1.
We can actually calculate the probability of getting a false positive, as a function of the number of hash functions, \(k\), the size of the bit vector, \(m\), and the number of items expected to be inserted into the filter set, \(n\):
$$f = (1 - e^{-kn/m})^k$$
(For more rigorous maths, check the literature)
Armed with the false positive probability function, we can not only calculate the false positive rate for any Bloom filter configuration, but we can also calculate the ideal value for \(k\), given the ratio \(m/n\).
Broder and Mitzenmacher do just that by taking the derivative of \(f\), and find:
$$k_{ideal} = \ln(2) \cdot \frac{m}{n}$$
Bloom filter implementations like this one can use that equation to determine the correct \(k\) given a bloom factor represented as an integer floor/ceiling of \(m/n\).
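To make the formulas concrete, here's a small, self-contained sketch that evaluates both; the function names are mine. At 10 bits of filter per expected key (\(m/n = 10\)), the ideal \(k\) comes out to about 7, and the false positive rate to roughly 0.8%.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate evaluates f = (1 - e^(-kn/m))^k.
func falsePositiveRate(k, m, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

// idealK evaluates k = ln(2) * m/n, the value of k that minimizes f.
func idealK(m, n float64) float64 {
	return math.Ln2 * m / n
}

func main() {
	m, n := 10000.0, 1000.0 // 10 bits of filter per expected key
	k := math.Round(idealK(m, n))
	fmt.Printf("k = %.0f, f = %.4f\n", k, falsePositiveRate(k, m, n))
	// prints: k = 7, f = 0.0082
}
```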
Back To NoSQL!
So, if we modify the data architecture of our NoSQL storage engine to look like this:
We can modify our search algorithm to work thusly:
- Start with the first block.
- For each attribute in the query:
  - Check the Bloom filter for presence.
  - If the attribute is definitely not present, skip to the next block.
- Sequentially scan the block data looking for the object.
- Move to the next block and repeat.

Terminating, of course, when next is NULL.
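Continuing the earlier sketches, and assuming the revised layout gives each Block a Filter field built from its objects' attribute names, the revised search might look like this:

```go
// FindObject, revised: consult each block's Bloom filter before paying
// for a sequential scan.
func FindObject(head *Block, query map[string]string) *Object {
	for b := head; b != nil; b = b.Next {
		if !mightContain(b, query) {
			continue // "definitely not present" lets us skip this block entirely
		}
		for i := range b.Objects {
			if matches(&b.Objects[i], query) {
				return &b.Objects[i]
			}
		}
	}
	return nil // end of the chain: the object doesn't exist
}

// mightContain returns false as soon as any queried attribute is
// definitely absent from the block's filter.
func mightContain(b *Block, query map[string]string) bool {
	for attr := range query {
		if !b.Filter.Test(attr) {
			return false
		}
	}
	return true
}
```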
We incur a bit more overhead on the write side, since our insertion process for new objects becomes:
- For each attribute in the new object:
  - Set the attribute in the Bloom filter.
- Append the object to the block data.
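In code, again building on the same hypothetical types:

```go
// InsertObject records each attribute name in the block's Bloom filter,
// then appends the object to the block data.
func InsertObject(b *Block, o Object) {
	for attr := range o.Attributes {
		b.Filter.Add(attr)
	}
	b.Objects = append(b.Objects, o)
}
```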
This can be batched using something like an LSM-tree to bolster insertion performance, at the cost of either durability or the introduction of a write-ahead log. We can even pull the Bloom filter out of each block and put it in special indexing blocks that can remain memory-mapped, further minimizing disk access. And it's all possible thanks to Bloom filters!
Further Reading
If you can't get enough of research papers, this section is for you!
- Space/Time Trade-offs in Hash Coding with Allowable Errors is the seminal paper by Burton H. Bloom, published in 1970.
- Network Applications of Bloom Filters: A Survey expands more rigorously on the underlying math, and provides several real-world applications of Bloom filters.
- In On The False Positive Rate of Bloom Filters, Bose et al. investigate the nature of the false positive calculation.
- Less Hashing, Same Performance: A Better Bloom Filter covers some mathematical and implementation improvements on standard Bloom filters to reduce both computation time and reliance on randomization.