https://new.jameshunt.us

The Birthday Paradox

Did you know that in a group of only 23 people, there's a 50% chance that at least two of those people share the exact same birthday?

Yup. It's called The Birthday Paradox, and it's got some interesting math behind it. Surprisingly, it has some pretty far-reaching implications for crytography, and highlights a fundamental thing I've noticed about crypto: your intuition is usually wrong.

The Math

Let's flip the problem on its head to make the math a little easier. Instead of calculating the probability that people are sharing birthdays, let's consider the probability that no one shares a birthday. Together, these two quantities must add up to 1, since there are no other possible outcomes - either we have collisions, or we don't.

We'll call the probability of unique birthdays \(P\). Let's also ignore February 29th (sorry leap-year kids), and assume that we have a uniform distribution (that is, no dates are special). This gives us a total of 365 possible values, each one equally likely.

If we have a group with only one person in it, we have all 365 possible dates to choose from, so \(P = 1\).

$$ P_1 = 365 / 365 = 1 $$

If we add a person, there are now 364 unique birthdays left, so the problem becomes one of selection:

$$ P_2 = {365 - 1} / 365 = 0.9973 $$

There's a 99.73% chance these two people won't share a birthday.

Incidentally (although it's easy to miss in a cursory glance), we can calculate the total probability \(P\) by multiplying the two independent event probabilities we have so far:

$$ P = P_2 P_1 = ({365 - 1} / 365) (365 / 365) = 0.9973 $$

If we add a third person, we not only have to pick a unique birthday for the second person (\(P_2 = {365-1}/365\)), we now have to pick a unique date from the remaining birthdays for the third person, or:

$$ P_3 = {365 - 2} / 365 = 0.9945 $$

Again, we can calculate the total probability \(P\) from the independent event probabilities:

$$ P = P_3 P_2 P_1 $$

$$ P = ({365 - 2} / 365) ({365 - 1} / 365) (365 / 365) $$

$$ P = 0.9945 × 0.9973 × 1 = 0.9918 $$

Since the probabilities of each event after the first are less than 1, the compound probability is less than either independent event. This is where we start to see that the intuition is wrong, because most people think in terms of the individual events.

In a more general form, for a set of \(N\) unique values, selecting \(k\) of them leads to a probability \(P\) that all values are unique of:

$$ P = {N - 1} / N {N - 2} / N … {N - (k -1 )}/N $$

There will be \(k-1\) terms in this equation, so we can simplify it thusly:

$$ P = {(N - 1) (N - 2) … (N - (k - 1))} / N^{k-1} $$

Introducing an \(N\) term to both numerator denominator, we start to approach \(N!\):

$$ P = (N / N) ({(N - 1) (N - 2) … (N - (k - 1))} / N^{k-1}) $$

Then the denominator simplifies to \(N^{k-1+1}\), which is just \(N^k\):

$$ P = {N (N - 1) (N - 2) … (N - k)} / N^k $$

Further simplification can be had by introducing the remainder of the factorial into both the numerator and denominator:

$$ P = {N (N - 1) (N - 2) … (N - (k - 1)) (N - k)! } / {N^k (N - k)!} $$

This allows us to collapse the numerator to just \(N!\):

$$ P = N! / {N^k (N - k)!} $$

That is the final (factorial) form of what we'll call The Birthday Bound.

Looking Ahead

Armed with our formula for calculating the probability of no collisions, we can easily figure out the probability we actually care about:

$$ P_c = 1 - N! / {N^k (N - k)!} $$

That's still a lot to calculate with a computer. If you want to extend this topic to cover hash collision (and I most certainly do), you have to start looking at rather large n-bit values for \(N\), like \(2^64\), or \(2^192\) (for XSalsa20). I really don't want to have to calculate \(2^192!\). There may not even be enough time before the heat death of the Universe to grind on that calculation, even with our fastest supercomputers.

The general rule of thumb is that you will start seeing hash collisions at \(2^{n/2}\) uniformly distributed random values in the range \([0, 2^n)\). The math behind the approximation seems to involve Taylor Series approximations and the natural logarithm. Hopefully I can dust off the old college Calc texts and re-acquaint myself with those topics and bring you a more in-depth treatment next time!

Happy Hacking!

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently exploring Kubernetes, as both a floor wax and a dessert topping.