B-trees are the fascinating data structures at the heart of almost every piece of data storage and retrieval technology you've ever dealt with. Ever created a relational database table with an indexed column? B-trees. Ever accessed a file in a filesystem? B-trees.
They are so useful because they:
- Keep keys sorted, no matter what order they are inserted
- Build up wide but shallow trees to traverse
- Work well with external index storage (i.e. disks)
How They Work, A Love Letter In Pictures
To illustrate how B-trees work, I'm going to use pictures. Later we can get into the code. To make the pictures intelligible, we're going to have to use a smaller degree; each node in our B-tree will only be able to hold 5 keys and 6 links.
We're also going to skip some of the boring parts and start out at "step 1" with 4 elements already inserted into our B-tree:
The blue boxes are the slots in the B-tree node, each containing a key. There are pointers on each side of each slot. For leaf nodes (and our root node is a leaf node at this point), those pointers point to the actual data being stored for each key, i.e., A -> a, B -> b, etc.
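If you'd like something more concrete than pictures to hold onto, here's a minimal Python sketch of such a node. It's illustrative, not the structure from the pictures verbatim: rather than modeling a pointer on each side of every slot, it keeps a parallel list of values alongside the keys, plus a list of children for interior nodes.

```python
MAX_KEYS = 5  # matches the pictures: at most 5 keys and 6 links per node

class Node:
    """Illustrative B-tree node; a simplification of the structure in the pictures."""
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []      # keys, always kept in sorted order
        self.values = []    # one stored value per key
        self.children = []  # empty for leaves; otherwise len(keys) + 1 child nodes

    def full(self):
        return len(self.keys) == MAX_KEYS
```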
Now, let's insert a new key/value pair, E -> e.
Both F and f get moved to make room for E -> e in the correct (sorted) position.
This is the killer feature of B-trees; as elements are inserted, keys remain in sort order. This is why databases and filesystems make such extensive use of them.
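In code, that shift-to-make-room step might look like the sketch below (building on the hypothetical Node above): find the slot the new key belongs in, and everything at or after that slot moves one position to the right, exactly as F and f did.

```python
import bisect

def leaf_insert(node, key, value):
    """Insert key/value into a leaf that still has room, keeping keys sorted."""
    i = bisect.bisect_left(node.keys, key)  # first slot whose key is >= ours
    node.keys.insert(i, key)                # list.insert shifts everything after it right
    node.values.insert(i, value)
```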
We could just stop here, and wait for the next insertion, but we're going to do some proactive housekeeping first. Our B-tree node is now full. We need to grow the tree to give ourselves some breathing room.
We'll do this by splitting our node in two, right down the middle, at D, and migrating D "up" to become the new root. To the left we will place keys A and B (which are strictly less than D), and to the right, E and F.
The choice of D is important. We're hedging our bets by picking the median, because it leaves the two halves evenly balanced, which serves us best when key/value pairs are inserted in random order.
Our B-tree now spans three nodes, and looks like this:
Note, particularly, that it is the pointer after B that points to d. You can (and should!) think of this as the pointer between B and D — viewed that way, it's no different from the structure we had in steps 1 and 2.
Now we are starting to see the tree structure emerge.
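In code, the median split might look something like this sketch (still using the hypothetical Node from earlier, with one simplification: the median's value travels up alongside its key, rather than staying reachable from the leaf level as in the pictures).

```python
def split(node):
    """Split a full node at its median key; return (median key, median value, left, right)."""
    mid = len(node.keys) // 2  # index of the median key, D in the pictures
    left, right = Node(leaf=node.leaf), Node(leaf=node.leaf)

    left.keys,   right.keys   = node.keys[:mid],   node.keys[mid + 1:]
    left.values, right.values = node.values[:mid], node.values[mid + 1:]
    if not node.leaf:
        left.children  = node.children[:mid + 1]
        right.children = node.children[mid + 1:]

    return node.keys[mid], node.values[mid], left, right

def grow(root):
    """Split a full root and push its median up into a brand-new root."""
    key, value, left, right = split(root)
    new_root = Node(leaf=False)
    new_root.keys, new_root.values = [key], [value]
    new_root.children = [left, right]
    return new_root
```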
What happens if we insert C -> c?
Insertion starts by considering the root node, which just contains D. Since C < D, and the root node is not a leaf node, we follow the pointer link to the left of D and arrive at the leftmost child node [A, B]. This is a leaf node, so we insert C -> c right after B, remembering to shift the link to d one slot to the right (D > C, after all). This does not fill the child node, so we are done.
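Lookup works the same way as that descent: at each node, find where the key would sit among the slots, and either it's there or you follow the link at that position. A hedged sketch, using the same hypothetical Node:

```python
import bisect

def search(node, key):
    """Return the value stored for key, or None if the key isn't in the tree."""
    while True:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]      # the key lives in this node
        if node.leaf:
            return None                # a leaf with no match means it isn't here
        node = node.children[i]        # e.g. C < D, so follow the link to the left of D
```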
On The Importance of Being Proactive
Above, we split the node as soon as it was full.
We didn't have to. We could just as easily have waited until the next insertion, and split the node then, to make room for our new key/value. That would lead to more compact trees, at the expense of a more complicated implementation.
To understand why, let's consider a pathological case where we have several full B-tree nodes along a single ancestry line:
Then, let's try to insert the key/value pair C -> c. To do so, we traverse from the root node, through the H-node, and down to the A-node. Since the A-node is full, we need to split it at the median, D:
Uh-oh. We have nowhere to put D. Normally, it would migrate up the tree and be inserted into the H-node, but the H-node is full!
We could split the H-node, but it suffers the same problem; its parent (the N-node) is also full! Our simple insertion has cascaded into a sequence of partitioning operations:
Being proactive, and splitting on the way down (as the literature puts it), means that we can never get into this problematic scenario in the first place.
In Algorithms (1983), Sedgewick has this to say:
As with top-down 2-3-4 trees, it is necessary to "split" nodes
that are "full" on the way down the tree; any time we see a
k-node attached to an M node, we replace it by a (k + 1)-node
attached to two M/2 nodes. This guarantees that when the bottom
is reached there is room to insert the new node.
Aside: Algorithms is a wonderful text, worthy of study. Don't let anyone tell you that seminal computer science texts from the 80's (or before!) are obsolete.
By taking this advice one step further, and splitting on insertion, we can ensure that traversal / retrieval doesn't need to be bothered with such trifles.
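Putting the earlier sketches together, a "split on the way down" insertion might look roughly like this. The invariant is that we never descend into a full node: a full root is handled by grow(), and any full child is split (with its median folded into the current node) before we step into it, so the leaf we eventually reach always has room. Duplicate keys aren't handled; this is a sketch, not a production insert.

```python
import bisect

def insert(root, key, value):
    """Insert key/value, splitting full nodes on the way down; returns the (possibly new) root."""
    if root.full():
        root = grow(root)                       # the root has no parent, so grow the tree instead

    node = root                                 # invariant: node is never full at this point
    while not node.leaf:
        i = bisect.bisect_left(node.keys, key)
        child = node.children[i]
        if child.full():
            mid_key, mid_value, left, right = split(child)
            node.keys.insert(i, mid_key)        # fold the median into the current node...
            node.values.insert(i, mid_value)
            node.children[i:i + 1] = [left, right]
            if key > mid_key:                   # ...and pick whichever half our key belongs to
                i += 1
            child = node.children[i]
        node = child

    leaf_insert(node, key, value)               # guaranteed to have room
    return root
```

One small difference from the walkthrough above: this sketch only splits a full node when it next encounters one on the way down, rather than immediately after the insertion that filled it. For the sequence in the pictures (A, B, D, F, then E, then C) it ends up with the same tree.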
Some Mathematics To Consider
I spent a fair amount of time studying B-trees for a project of mine, focusing mainly on how best to size the B-tree node for optimal density and tree height.
Tree height is important because our data values are only stored in the leaf nodes. Therefore, every single lookup has to traverse the height of the tree, period. The shorter the tree, the fewer nodes we have to examine to find what we want.
Given that most B-trees live on-disk, with only the root node kept in memory, tree height becomes even more important; fewer disk accesses = faster lookup.
The following equations consider B-trees of degree \(m\) (the maximum number of children allowed in a node).
First, we need to figure out the minimum number of children (links) required of interior nodes, which we'll call \(d\). Usually, this is:
$$d = \lceil m/2 \rceil$$
Using a median split, we should never have fewer than \(d\) children (and therefore \(d - 1\) keys) per interior node. The root node is special, since we have to start somewhere, and the leaf nodes are likewise exempt.
The worst-case height then, given \(n\) items inserted, is:
$$h_{worst} \le \left\lfloor \log_d\left(\frac{n + 1}{2}\right) \right\rfloor$$
and the best-case height is:
$$h_{best} = \left\lceil \log_m(n + 1) \right\rceil - 1$$
Common wisdom has it that you should match your node size to a natural disk block size, in octets: typically either 4096 bytes (4k) or 8192 bytes (8k). Let's run through the calculations for both.
For a node size of \(T = 4096\) bytes:
$$m = 340$$
$$d = \lceil m/2 \rceil = \lceil 340/2 \rceil = \lceil 170 \rceil = 170$$
At minimum fill, then, each interior node still holds 170 pointers and 169 keys. Our worst-case height, given such minimally-packed nodes, is:
$$h_{worst} \le \left\lfloor \log_d\left(\frac{n + 1}{2}\right) \right\rfloor$$
For my analysis, I was trying to find best- and worst-case scenarios given \(2^{32} - 1\) items (one for every second I could fit in a 32-bit unsigned integer). \(n + 1\) then simplifies to just \(2^{32}\), and dividing by two knocks one off of the exponent, leaving \((n + 1)/2 = 2^{31}\):
$$h_{worst} \le \left\lfloor \log_{170}\left(2^{31}\right) \right\rfloor = \lfloor 4.184 \rfloor = 4$$
So with minimally dense nodes, we can expect to traverse at most 4 nodes to find the value corresponding to any given key. Let's see about the best-case scenario. As before, \(n + 1\) simplifies down to \(2^{32}\):
$$h_{best} = \left\lceil \log_m(n + 1) \right\rceil - 1$$
$$h_{best} = \left\lceil \log_{340}\left(2^{32}\right) \right\rceil - 1 = \lceil 3.805 \rceil - 1 = 3$$
So, 3 in the best case, one node better than the worst case.
For \(T = 8192\):
$$m = 681$$
$$d = \lceil m/2 \rceil = \lceil 681/2 \rceil = \lceil 340.5 \rceil = 341$$
Using \((n + 1)/2 = 2^{31}\), as before, we can calculate the worst case for an 8k-page B-tree:
$$h_{worst} \le \left\lfloor \log_d\left(\frac{n + 1}{2}\right) \right\rfloor$$
$$h_{worst} \le \left\lfloor \log_{341}\left(2^{31}\right) \right\rfloor = \lfloor 3.684 \rfloor = 3$$
So with 8k pages, even the worst case has us looking at only 3 nodes before we find the value we seek, one fewer than the 4k worst case. Not bad.
What about \(h_{best}\)?
$$h_{best} = \left\lceil \log_m(n + 1) \right\rceil - 1$$
$$h_{best} = \left\lceil \log_{681}\left(2^{32}\right) \right\rceil - 1 = \lceil 3.4 \rceil - 1 = 3$$
It would seem that the choice between 4k and 8k barely affects our tree heights: the best case is 3 either way, and the larger page only improves the worst case from 4 to 3.
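If you'd rather have the machine do the logarithms, a few lines of Python reproduce the figures above. The node capacities are taken straight from the calculations (\(m = 340\) and \(m = 681\)); how those follow from the per-node byte budget is left as given.

```python
import math

def worst_height(m, n):
    """h_worst <= floor(log_d((n + 1) / 2)), where d = ceil(m / 2)."""
    d = math.ceil(m / 2)
    return math.floor(math.log((n + 1) / 2, d))

def best_height(m, n):
    """h_best = ceil(log_m(n + 1)) - 1."""
    return math.ceil(math.log(n + 1, m)) - 1

n = 2**32 - 1                        # one item for every second in a 32-bit unsigned integer
for m in (340, 681):                 # 4k and 8k pages, respectively
    print(m, worst_height(m, n), best_height(m, n))
# prints: 340 4 3
#         681 3 3
```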
The Literature
If you're interested in the genesis of B-trees, you'll want to read Bayer and McCreight's 1970 paper Organization and Maintenance of Large Ordered Indices.
Douglas Comer's 1979 review / survey of B-trees, The Ubiquitous B-Tree, covers the basics of B-trees, explains their ascent to ubiquity, and discusses several major variations, like B+-trees, and their relative merits.
Happy Hacking!