I love round-robin databases. They neatly solve the problem of efficiently storing time-series metrics. When you provision them, they allocate all the storage they will ever need, so you don't have to worry about monitoring servers filling up at oh-dark-thirty. Each RRD is a self-contained file, making maintenance a breeze.
The only downside is that they don't scale well. The more metrics you collect, the more RRDs you need to store the data in. More RRDs mean more files and more updates. The increasing load on the I/O subsystem can be mitigated via rrdcached, which buffers updates in memory so that more data is written in each write op.
That works, to a point, but eventually you will hit the wall. Eventually, there won't be enough I/O throughput. Eventually, there won't be enough disk. Eventually, the server holding your round-robin databases will die.
We could mirror our data submission, sending RRD updates to multiple servers in concert. This is similar to mirroring disks in a RAID-1 array, or multicasting requests to identical servers in a server farm. Unfortunately for us, this only addresses the failure scenario, and doesn't help us track more data.
Scaling Out
When one thing exhibits failure tendencies, we throw more things at the problem. Disks fail, so we put lots of them in a RAID array, ensuring that the failure of a single drive is non-fatal. Servers fail, so we put lots of them in a load-balanced pool to guard against a complete outage.
We can do the same with RRD, as long as we keep certain things in mind:
- Clients need to be able to update RRDs incrementally
- Updates must be routed consistently
- Replication is the only defense against server loss
- We must be able to add / remove nodes at will
Point #1 is important, if somewhat obvious. Existing distributed data systems like Cassandra, Redis, and MongoDB fail us here because they do not allow partial updates. Primarily, this is because partial updates require a priori knowledge of the data being stored, and these systems are built as general-purpose, commodity infrastructure.
The second and fourth points are related. When a client submits an update for RRD $X$, we need to ensure that we always give that update to the node responsible for that RRD. Routing it anywhere else serves little purpose; the other nodes don't know anything about the historical data that the update applies to. Worse, if the node that does handle the RRD misses the update, you have data loss.
We can honor point #2 by using a hashing strategy to translate an RRD filename into a responsible node. A naïve design may rely on modulo operations to turn any hashed value into an index into a list of nodes.
Mathematically, it looks something like this:
$$L(k) = S_{H(k) \bmod n}$$
$L(k)$ is the locator function, which translates an RRD filename key, $k$, into a node address, in the range $S_0…S_{n-1}$ (assuming that we have $n$ nodes). $H(k)$ is a hash function that maps an arbitrarily long string of characters (our filename) into a number.
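In code, that naïve locator is a one-liner. Here's a minimal C sketch (the locate() name and node_count parameter are mine, just for illustration; hash() stands in for whichever $H(k)$ we pick, and a concrete one appears below):

unsigned long hash(unsigned char *str);   /* H(k): any string-to-number hash */

/* naive modulo-based locator: RRD filename key -> node index */
int
locate(unsigned char *rrd_filename, int node_count)
{
    /* an index into our node list, S_0 .. S_(n-1) */
    return (int)(hash(rrd_filename) % (unsigned long)node_count);
}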
We'll come back to why this is wrong later.
Point #3 is vital to the viability of the distributed storage system because it is the only way we can protect against server failure. If we distribute our databases across 12 servers, each of which fails, on average, once every 12 months, we will see, on average, an outage every single month. Without replicating the RRDs to other servers, we will suffer data loss, and it will happen faster than you would intuitively expect.
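The back-of-the-envelope arithmetic, treating failures as independent and evenly spread out over the year, goes like this:

$$12\ \text{servers} \times \frac{1\ \text{failure per server}}{12\ \text{months}} = 1\ \text{expected failure per month}$$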
Distributing via Hashing
Let's go back to hashing for a minute.
$$L(k) = S_{H(k) \bmod n}$$
Earlier, I pointed out that this was wrong. To understand why, let's look at some real hash values and the distribution across our nodes.
Here is Dan Bernstein's djb2 hash function:
/* djb2: start from 5381 and compute hash = hash * 33 + c for each character */
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))                   /* stop at the NUL terminator */
        hash = ((hash << 5) + hash) + c;   /* hash * 33 + c */

    return hash;
}
Let's start out by taking a set of 7 RRD filenames, and hashing them with djb2 to see how they spread. For brevity's sake, I'm going to drop all but the least-significant 8 bits, to get a number between 0 and 255 (inclusive).
"host01.example.com:cpu" = 77
"host01.example.com:memory" = 62
"host01.example.com:load" = 229
"host02.example.com:cpu" = 78
"host02.example.com:memory" = 159
"host02.example.com:load" = 6
"www.example.com:requests_per_second" = 168
In other words, $H(\text{host01.example.com:cpu}) = 77$. Now, assuming we have $n = 3$ nodes (and that $S$ ranges from $S_0$ to $S_2$), we can calculate the values for $L(k)$:
$$L(\text"host01.example.com:cpu") = S_{H(\text"host01.example.com:cpu")\:\:mod\:\:3}$$
$$= S_{77\:\:mod\:\:3}$$
$$= S_2$$
Likewise, we can calculate the location of all the other RRD filename keys:
L(host01.example.com:cpu) = 77 mod 3 = 2
L(host01.example.com:memory) = 62 mod 3 = 2
L(host01.example.com:load) = 229 mod 3 = 1
L(host02.example.com:cpu) = 78 mod 3 = 0
L(host02.example.com:memory) = 159 mod 3 = 0
L(host02.example.com:load) = 6 mod 3 = 0
L(www.example.com:requests_per_second) = 168 mod 3 = 0
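If you want to check my math, here's a small, self-contained C program (of my own devising) that reproduces the table above: hash each key with djb2, keep the low 8 bits, then take the result modulo 3:

#include <stdio.h>

/* djb2, exactly as defined above */
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return hash;
}

int main(void)
{
    unsigned char *keys[] = {
        (unsigned char *)"host01.example.com:cpu",
        (unsigned char *)"host01.example.com:memory",
        (unsigned char *)"host01.example.com:load",
        (unsigned char *)"host02.example.com:cpu",
        (unsigned char *)"host02.example.com:memory",
        (unsigned char *)"host02.example.com:load",
        (unsigned char *)"www.example.com:requests_per_second",
    };
    unsigned long n = 3;                          /* number of nodes */

    for (int i = 0; i < 7; i++) {
        unsigned long h = hash(keys[i]) & 0xff;   /* low 8 bits: 0 - 255 */
        printf("L(%s) = %lu mod %lu = %lu\n", (char *)keys[i], h, n, h % n);
    }
    return 0;
}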
Seems to be working so far. Armed with only knowledge of the locator function $L(k)$, clients can reliably and repeatably figure out what node they should direct their updates at.
But all is not well! Don't forget about point #4! What happens if we add a new node into the mix? Assuming we can inform all clients about the new topology, how will it affect hashing? The introduction of a new member of $S$ changes the range of $S$, and the value of $n$, which you'll recall was used as part of the definition of $L(k)$. Now, $n = 4$.
Let's revisit host01.example.com:cpu:
$$L(\text"host01.example.com:cpu") = S_{H(\text"host01.example.com:cpu")\:\:mod\:\:\bo4}$$
$$= S_{77\:\:mod\:\:\bo4}$$
$$= S_\bo1$$
The hash function $H(k)$ didn't change, so our hashed value is still 77, but the change of $n$ from 3 to 4 perturbed the results of the modulo operation, causing host01.example.com:cpu to move from $S_2$ to $S_1$. Naturally, we expect some part of the keyspace to be redistributed, otherwise adding a new node would have zero effect. So just how many keys migrated?
L(host01.example.com:cpu) = 77 mod 4 = 1
L(host01.example.com:memory) = 62 mod 4 = 2 # same
L(host01.example.com:load) = 229 mod 4 = 1 # same
L(host02.example.com:cpu) = 78 mod 4 = 2
L(host02.example.com:memory) = 159 mod 4 = 3
L(host02.example.com:load) = 6 mod 4 = 2
L(www.example.com:requests_per_second) = 168 mod 4 = 0 # same
It turns out that four of our seven keys, a little over half, migrated to new hosts. Intuitively, we would have expected something close to a fourth, since the new node (given uniform distribution of keyspace) should be responsible for only one quarter of the entire keyspace. Looking at the recalculated results above, the new node ($S_3$) is in fact only responsible for one of our seven example keys.
In real life, this mass-migration would cause many I/O-intensive and network-heavy transfers of RRD files. Ideally, we would like to minimize this as much as possible.
Consistent Hashing To The Rescue!
Mathematically speaking, the problem with the modulo-hash strategy is that the resulting node assignment depends directly on $n$, which we expect to be able to change up or down as we see fit. What we need is a hashing strategy that doesn't depend on the number of nodes in $S$, and that's what consistent hashing gives us.
The scholarly literature on consistent hashing, primarily David Karger's 1997 ACM paper and Daniel Lewin's 1998 thesis, develops the idea of a consistent hashing function in the context of scalable web cache architecture. That context allows them to sidestep certain issues that are unavoidable in authoritative data storage systems, like the one we're trying to build.
Nevertheless, the theory is sound, so let's jump right in!
It starts with a circle.
If we define a new function, $r(h)$, which takes an arbitrary number and maps it to a point on the circle, we can feed it the result of our hash function, like so:
$$r(H(k)) = (x_k, y_k)$$
(For now, don't worry about how we actually implement $r(h)$. Also, because djb2 doesn't avalanche well for short keys, we'll be switching to a modified SHA-1, taking only the first 8 bits of the hash result.)
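If you'd like something concrete to picture anyway, here's one possible $r(h)$, sketched in C: spread the 8-bit hash value evenly around the unit circle. (The type and function names are mine; any consistent mapping onto the ring works.)

#include <math.h>

#define TWO_PI 6.28318530717958647692

/* one possible r(h): place an 8-bit hash value on the unit circle */
typedef struct { double x, y; } point;

point ring_point(unsigned long h)
{
    double theta = TWO_PI * (double)(h & 0xff) / 256.0;   /* angle, 0 .. 2*pi */
    point p = { cos(theta), sin(theta) };
    return p;
}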
Assume that we calculate $r(H(k))$ for all of our RRD filename keys, and we get this:
Next, we assign our storage nodes from $S$ to points $A$, $B$ and $C$ on the ring:
To figure out which node $S_x$ is responsible for a key $k$, we find the point $r(H(k))$ on the ring, and then visit each subsequent point, in a clockwise direction, until we find one that corresponds to a node.
For example, let's assume that we hash host02.example.com:memory, run it through $r(h)$, and it spits out the point labeled $4$ in the diagram. Starting there, we walk the ring clockwise, passing point $7$, until we reach $B$.
Effectively, this means that each node point "owns" all key points that precede it. The exciting (and useful!) quality of this is that adding other nodes will not affect the placement of the existing nodes, since each node's point on the ring is determined solely by the hash function $H(k)$ and information intrinsic to the node itself. In our case, we are hashing the node's fully-qualified, canonical hostname.
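Here's a minimal C sketch of that lookup, under the assumption that ring positions are 8-bit hash values and "clockwise" means increasing position (the names are mine, not a reference implementation):

#include <stddef.h>

/* nodes[] must be sorted by ascending position */
struct node {
    unsigned int  position;   /* r(H(nodename)), in the range 0 - 255 */
    const char   *name;       /* fully-qualified canonical hostname */
};

const char *
locate(unsigned int key_position, struct node *nodes, size_t node_count)
{
    for (size_t i = 0; i < node_count; i++)
        if (nodes[i].position >= key_position)
            return nodes[i].name;   /* first node point at or past the key */
    return nodes[0].name;           /* wrapped around the top of the ring */
}

A real implementation would keep the node points in a sorted structure and binary-search it, but a linear scan keeps the idea obvious.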
Looking again at the diagram, we come up with the following ownership table:
| Node | Key points owned |
|------|------------------|
| A    | 3                |
| B    | 2, 4, 5, 7       |
| C    | 1, 6             |
Visually, we can see that $B$ accounts for more than 50% of the ring; our ownership table bears this out. We can correct this disparity by adding another node, and hope that it bisects the arc between $A$ and $B$ (preferably between key points $5$ and $4$).
Just for fun, let's go ahead and add that fourth $D$ node, to demonstrate how resilient consistent hashing is:
And here's the updated ownership table:
| Node | Key points owned |
|------|------------------|
| A    | 3                |
| B    | 4, 7             |
| C    | 1, 6             |
| D    | 2, 5             |
Adding new nodes just to balance out statistical clustering is a pretty raw deal. Luckily, there's another way. The more node points we add, the smoother the distribution will be across nodes. What if, instead of adding real, physical nodes, we just assign multiple points to each node? We'll call them virtual nodes:
A good scheme (assuming that we chose an $H(k)$ with good avalanche properties) is to prefix the node name with a number from $0$ to $v - 1$, where $v$ is the number of virtual nodes to create per physical node. Whatever differentiating scheme you pick, just make sure the clients all know (and agree on) it!
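Here's a sketch of that scheme in C (the function name is mine, and the 8-bit truncation carries over from earlier; hash() is whatever $H(k)$ the clients agreed on):

#include <stdio.h>

unsigned long hash(unsigned char *str);   /* whatever H(k) the clients agreed on */

/* derive v ring positions for one physical node by prefixing its name
   with 0 .. v-1 before hashing; out[] must hold at least v entries */
void
virtual_positions(const char *node_name, int v, unsigned int *out)
{
    char vname[256];

    for (int i = 0; i < v; i++) {
        snprintf(vname, sizeof(vname), "%d%s", i, node_name);
        out[i] = hash((unsigned char *)vname) & 0xff;   /* 8-bit ring position */
    }
}

Weighting a beefier node, as we'll do with $A$ shortly, is then just a matter of passing a larger $v$ for that node.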
Note that each node now exists as four separate points on the ring. The same ownership behaviors apply (walk the ring clockwise until you hit a node point), so we can regenerate our ownership table:
| Node | Key points owned |
|------|------------------|
| A    | 4, 7             |
| B    | 1, 2             |
| C    | 3, 5, 6          |
That's a much more even distribution of the keyspace, and it gets more even as (1) the hash function gets better and (2) the number of virtual nodes goes up.
We can actually use this virtual node trick to weight some nodes differently than others. Let's suppose that node $A$ has twice as much capacity (both I/O and raw disk storage) as the other two nodes. If we assign twice as many virtual nodes (8 instead of 4) to $A$, we should see an increase in the amount of keyspace that $A$ is responsible for:
And here's the new ownership table:
| Node | Key points owned |
|------|------------------|
| A    | 1, 3, 4, 7       |
| B    | 2                |
| C    | 5, 6             |
As expected, $A$ now holds twice as much keyspace as either of the other two nodes!