How PoW Works

In the previous section, we explained what PoW does. In this segment, we dive into the details of how it works. This section also marks our first encounter with math. The level of formality will steadily increase as we make our presentation more precise. If you find math unsavory, the descriptive parts alone should be enough to carry you through the descriptive discussions in the rest of the book. Even if you do have some math mileage under your belt, be sure to read deliberately and let the details sink in: these details are essential if you want to understand the mathematics underlying the discussions to follow.

You might have heard that PoW works by presenting miners with "complex mathematical puzzles". I argue that these mathematical puzzles are not "complex" at all. In fact, they are designed to be incredibly dumb: there is no better way to solve them than banging your head against them, trying out all possible solutions in no particular order until you find one that works. It takes exceptional cleverness to design a puzzle this dumb: a game so impossible to strategise that the dumbest and smartest players have exactly the same odds.

When constructing a dumb puzzle, there is a tunable parameter called the difficulty target that adjusts how hard the puzzle is. By correctly adjusting the difficulty target, we determine how long we should expect to wait before a solution emerges. In this section, we assume that the correct difficulty target is given to us by divine inspiration. In the next segment, we explain how Bitcoin adjusts the difficulty target in practice.

My gratitude to Quai network for sponsoring this segment of the book

Hashes as Random Oracles

The cryptographic building block of PoW is something called a cryptographic hash function (CH). Even those of you who have never heard about CHs can name some that are used in practice. This includes the $\textsf{SHA-256}$ hash used in Bitcoin, and many other hashes you might have heard of, such as $\textsf{Keccak}$, $\textsf{Blake}$, $\textsf{Ethash}$, and $\textsf{Scrypt}$.

Each of these hash functions is a universe within itself, packing many clever ideas. But they all have the same goal: to appear random on previously untested inputs. Let us choose an arbitrary CH and denote it $\textsf{H}$. The property that a CH is trying to emulate is that it is impossible to predict the outcome of $\textsf{H}$. We will carefully define what that means in a moment.

In cryptography, we like to have neat formalism to streamline a division of labor. We do so by assuming we have an ideal H\textsf{H}, and leaving it to others to figure out how to implement H\textsf{H} in a way that is sufficiently close to our ideal description. Such an ideal hash is called a random oracle, and assuming it exists means we are working in something called the random oracle model (ROM).

I provide a more substantial survey of random oracles in the appendix. But for now, all you need to know is that a random oracle is a magic box floating in the sky. Anyone can approach the box and give it any string $s$ they want as an input. The box replies with a string of 256 bits we call $\textsf{H}(s)$. We call this querying the oracle on $s$. We are promised that when the universe was created, the reply for each input was chosen uniformly and independently. That is, the value of $\textsf{H}(s)$ for each $s$ was selected at the dawn of time by a series of 256 coin flips.
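To make the magic box a little more concrete, here is a toy Python sketch (purely illustrative, nothing used in practice) of a "lazy" random oracle: it flips 256 coins the first time it sees an input and remembers that answer forever. Of course, it can only remember the inputs it has actually been asked about, which already hints at the problem discussed next.

```python
import secrets

class LazyRandomOracle:
    """A toy stand-in for the magic box: every new input gets a fresh,
    uniformly random 256-bit reply, and repeated queries get the same reply."""

    def __init__(self):
        self._replies = {}  # coin flips for the inputs queried so far

    def query(self, s: bytes) -> int:
        if s not in self._replies:
            self._replies[s] = secrets.randbits(256)  # 256 fresh coin flips
        return self._replies[s]

oracle = LazyRandomOracle()
print(oracle.query(b"hello"))                             # some 256-bit number
print(oracle.query(b"hello") == oracle.query(b"hello"))   # True: the reply never changes
```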

Random oracles cannot exist in reality. They are too ideal. Hash functions are attempts to provide clever substitutes: efficient functions that are not truly random, but that scramble the input so thoroughly that it becomes infeasible to tell the difference without a galactic-scale computer.

The reason random oracles can't exist is simply storage. A random oracle provides, for each possible value of $s$, a truly random string $\textsf{H}(s)$, and there are infinitely many possible values of $s$.

Even if we limit the length of $s$ to, say, $100$ bits (which is much, much smaller than what we actually need), we will find that there are $2^{100}$ possible inputs, requiring a storage of $2^{100} \cdot 256$ bits, which is more than a petabyte for each star in our galaxy.
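To see where these numbers come from, here is the back-of-the-envelope arithmetic (the star count is a rough ballpark figure I am assuming for illustration):

```python
inputs = 2**100                    # number of possible 100-bit inputs
total_bits = inputs * 256          # one 256-bit reply per input
petabytes = total_bits / 8 / 10**15
stars_in_galaxy = 2 * 10**11       # rough estimate of stars in the Milky Way

print(f"{petabytes:.2e} petabytes in total")             # ~4.06e16 PB
print(f"{petabytes / stars_in_galaxy:.2e} PB per star")  # ~2e5 PB per star
```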

The crux is that random data is not compressible. If it admits a more compact representation, then it is not truly random. Compression requires strong patterns, which are the opposite of randomness.

Knowing this, I find CHs that much more impressive. Emulating an inherently inefficient phenomenon with fiendishly light and fast circuits, so well that no one has ever managed to distinguish the two, is truly a marvel. How it is actually done, and why we feel comfortable using these "mere" approximations, is a topic for an entirely different book.

A consequence of the independence is that it is hard to find a preimage. Given some output $y$, there is no better way to find an $s$ such that $\textsf{H}(s)=y$ than guessing different values of $s$ until you find one that works. For any $s$, there are $2^{256}$ equally likely possibilities for $\textsf{H}(s)$. This implies that, on average, finding a correct $s$ should take about $2^{256}$ attempts. To understand how fantastically large this number is, check out this video:

This form of security is called preimage resistance. It is a useful property, but on its own it is not strong enough to call a function a CH.
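To get a feel for how "dumb" this search is, here is a toy Python sketch with $\textsf{SHA-256}$ standing in for $\textsf{H}$. Matching the full 256-bit output is hopeless, so the example only tries to match the first 16 bits of the target, which already takes tens of thousands of attempts on average:

```python
import hashlib
from itertools import count

# Toy preimage search: match only the first two bytes of the target hash,
# so the loop finishes quickly (about 2**16 = 65536 attempts on average).
target_prefix = hashlib.sha256(b"some message").digest()[:2]

for n in count():
    candidate = str(n).encode()
    if hashlib.sha256(candidate).digest()[:2] == target_prefix:
        print(f"found a (truncated) preimage after {n + 1} attempts: {candidate!r}")
        break
```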

There are other important security properties that you might encounter. For example, collision resistance: the infeasibility of finding two inputs $s\ne s'$ such that $\textsf{H}(s) = \textsf{H}(s')$. A function that satisfies this is called a collision-resistant hash (CRH).

It turns out that being a CRH is strictly stronger than being preimage resistant. And even that is not strong enough for some applications.

A random oracle satisfies any property we could dream of. But hashes don't always follow suit. Sometimes, hashes are known to deviate from behaving like random oracles in certain scenarios, but are still assumed to be close enough in other scenarios. Interestingly, $\textsf{SHA-256}$ is such a function. There are known cryptographic schemes that are provably secure with respect to a random oracle, but break down when this oracle is replaced with $\textsf{SHA-256}$. See the exercises for more information.

For that reason, hash functions should be handled with the utmost care. Using a hash function outside its recommended scope, or rolling out a home-brewed hash, can be very dangerous.

To streamline the notation a bit further, we set $N=2^{256}$ and think of $\textsf{H}(s)$ as a natural number $0\le n < N$. This is just a convenient way to order all possible outcomes. This will allow us to easily specify more general tasks, such as "finding an $s$ such that $\textsf{H}(s) < 42$".
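In code, this identification is a one-liner. Here is a sketch with $\textsf{SHA-256}$ standing in for $\textsf{H}$: the 32-byte digest is read as an integer between $0$ and $N$:

```python
import hashlib

N = 2**256

def H(s: bytes) -> int:
    # Read the 32-byte digest as a natural number 0 <= n < N
    return int.from_bytes(hashlib.sha256(s).digest(), "big")

print(0 <= H(b"hello") < N)   # always True
print(H(b"hello") < 42)       # essentially never True for any input you will ever try
```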

We need just one more quantity before we can continue, something that will bridge oracle queries with clock time. We define $t$ such that the oracle is queried exactly once per $t$ seconds. For example, if the oracle is queried once a minute, we set $t=60$. If the oracle is queried ten times a second, we set $t=0.1$. In practice, we are more likely to encounter values such as $t=10^{-20}$. The value $t$ is called the global hash delay, and its reciprocal $1/t$ is called the global hash rate.

Approximating the value of $t$ and maintaining it over time is the difficulty adjustment problem that we consider in the next chapter. For now, we assume that $t$ is fixed and known. Knowing $t$, we can say things like "the expected time it would take to find $s$ is $t\cdot N$ seconds".

Tuning the Difficulty

We already considered the problem of finding a preimage of $y$. Since all outputs are equally likely, we go ahead and assume that $y=0$. The problem can then be restated a bit awkwardly as "find $s$ such that $\textsf{H}(s) < 1$".

Why is this useful? Because we can now talk about a more general problem: find $s$ such that $\textsf{H}(s) < T$. That is, we don't require one specific output, but any one of $T$ specific outputs. (Again, since all outputs are uniformly likely and independent, it does not matter which $T$ values are allowed, so we choose $0,\ldots,T-1$ out of convenience.)

We call $T$ the target, and our first task is to figure out how to set it. To do that, we work out the expected time to solve the puzzle, and how it is affected by $T$.

This is a simple computation. For any $s$, there is a one in $N$ chance that $\textsf{H}(s) = 0$ and a one in $N$ chance that $\textsf{H}(s)=1$. Note that it is impossible that both $\textsf{H}(s) = 0$ and $\textsf{H}(s)=1$ hold. In probability theory, two events that cannot happen together are called disjoint. A basic property of disjoint events is that the probability that either of them happens is the sum of their probabilities. In other words, the probability that $\textsf{H}(s)=0$ or $\textsf{H}(s)=1$ is two (= one + one) in $N$. So we expect it should take about $N/2$ queries, or $t\cdot N/2$ seconds, to find an $s$ such that $\textsf{H}(s)<2$.

This argument generalizes directly to show that we expect it would take $t\cdot N/T$ seconds to find $s$ such that $\textsf{H}(s) < T$.

With this insight, we can control how long it takes to solve the puzzle by adjusting $T$. If we want it to be $\lambda$ seconds, we need $t\cdot N/T = \lambda$. Isolating $T$, we arrive at an all-important formula:

$$T = t\cdot N/\lambda$$

Let's see an example. Say that the global hash rate is one trillion queries per second; in other words, $t=10^{-12}$. Say we want a block delay of ten minutes. Since we are working in units of seconds, we set $\lambda = 600$ and plug everything into the equation to get

$$T=10^{-12}\cdot 2^{256}/600\approx 1.876\times 2^{206}.$$
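You can verify this number yourself; here is the computation in Python:

```python
t = 10**-12    # global hash delay: one trillion queries per second
N = 2**256
lam = 600      # desired block delay, in seconds

T = t * N / lam
print(T / 2**206)   # ~1.876, so T is roughly 1.876 * 2^206
```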

You might be concerned that the solution is generally not an integer. This concern is much more valid than most people expect. Yes, rounding is obviously the solution, but how do you ensure different implementations — or even the same implementation running on different architectures — all handle rounding errors exactly the same way? Even a slight incompatibility, including a round-off bug in a future popular CPU, can split the network and cause vicious complications. For that reason, cryptocurrencies apply measures that ensure that roundoff error handling remains uniform and platform-independent.
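Bitcoin, for example, stores the target in a compact four-byte integer format (the "bits" field of the block header), so every implementation works with exactly the same integer target and no floating-point rounding is involved. Here is a sketch of the decoding rule as I understand it; the real consensus code also handles sign and overflow edge cases:

```python
def decode_compact_target(bits: int) -> int:
    """Decode a compact 'bits' value into a full integer target (sketch only)."""
    exponent = bits >> 24           # size of the target in bytes
    mantissa = bits & 0x007FFFFF    # the three coefficient bytes (the top bit is a sign flag)
    return mantissa * 256 ** (exponent - 3)

# The compact target of Bitcoin's genesis block:
print(hex(decode_compact_target(0x1D00FFFF)))   # 0xffff followed by 52 hex zeros, i.e. 0xffff * 2^208
```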

One peculiarity with the difficulty target is that lower targets make for harder puzzles. This makes sense, as hitting a number below one million is harder than hitting a number below one trillion. But we have to be mindful of this when using TT as a measure of difficulty.

To circumvent this, the difficulty is often defined as the reciprocal of the target, $\tau = 1/T$. That way, the hardness of the puzzle grows with $\tau$. Unfortunately, the terms "difficulty" and "target" are sometimes used interchangeably. If that's not confusing enough, sometimes when people say "the difficulty" they actually mean $\log_2\tau$. There is a perfectly good reason for that, which I will soon explain.

The best advice I can give is that whenever anyone uses the terms difficulty or target in a context where these details are important, be sure to ask them explicitly what they mean.

Header Coupling

How do we use the puzzle above to space out Bitcoin blocks?

Requiring a block to contain a solution to the puzzle, an $s$ such that $\textsf{H}(s)<T$, is not enough. Why? Because there is nothing preventing the miner from using the same solution for several blocks. To solve this, we need the problem to depend on the block itself.

For technical reasons I won't get into, we don't actually want to feed entire blocks into the hash function. We need a smaller piece of information that still completely defines the block. This piece of information is called the block header. The header contains all sorts of metadata about the block. Importantly, it contains a pointer to the previous block (so whenever the chain progresses, the header necessarily changes), and a magic trick called the Merkle transaction root, which is much smaller than the content of the block, but is cleverly built such that you can't change the content of the block without changing the transaction root. Even though the header does not contain the transactions, it is committed to them: it is impossible to find a different set of transactions with the same Merkle root. We explain how this magic works in an appendix.

To accommodate the "dumb puzzle", another field called the nonce is added. Importantly, the nonce field can be set to anything. To make our lives easy, we will use $B[n]$ to denote the header of the block $B$ when the nonce is set to $n$. And here is the crux: instead of requiring an $s$ such that $\textsf{H}(s) < T$, we require the nonce value $n$ to satisfy $\textsf{H}(B[n])<T$. Unlike the string $s$, the nonce is only good for the block $B$!

The expression $\textsf{H}(B[n])$ might cause some distress. We described query inputs as "strings"; in what sense is the header "a string"? There are many different ways to spell out the header information as a string, and each will yield different results!

The truth is that the inputs are not even general "strings" like English messages; they are binary strings, lists of ones and zeros. So how can $\textsf{H}(B[n])$ ever be well defined?

The keyword is serialization. Serialization is the process of encoding abstract data in a binary string. There are many ways to serialize data, and the specific serialization doesn't particularly matter. (Some serializations are preferred to others because of entrenched conventions or particular optimizations, but mathematically, it is all the same.) But it is of utmost importance that everyone use the same serialization. That way, everyone agrees on $\textsf{H}(B[n])$.
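Putting the pieces together, here is a toy miner in Python. The "header" and its serialization are made up for illustration (Bitcoin's real header layout and its double-$\textsf{SHA-256}$ are different), but the structure of the search is the same: fix the header, grind nonces until the hash falls below the target.

```python
import hashlib

N = 2**256
T = N // 10**6   # toy target: roughly one in a million hashes succeeds

def toy_header(prev_hash: bytes, merkle_root: bytes, nonce: int) -> bytes:
    # A made-up, fixed serialization: previous hash || merkle root || 4-byte nonce
    return prev_hash + merkle_root + nonce.to_bytes(4, "little")

def mine(prev_hash: bytes, merkle_root: bytes) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(toy_header(prev_hash, merkle_root, nonce)).digest()
        if int.from_bytes(digest, "big") < T:
            return nonce   # this nonce is only good for this particular header
        nonce += 1

print("found nonce:", mine(b"\x00" * 32, hashlib.sha256(b"some transactions").digest()))
```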

The nonce field in Bitcoin has a length of $32$ bits. When it was defined, it seemed absurd to even imagine that we would ever need a larger space. (ChatGPT estimates the $\textsf{SHA-256}$ hashrate of an average CPU in 2010 to be "several millions". You are invited to do the math and see how many CPUs need to be mining to have more than $2^{32}$ hashes per ten minutes.)
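If you would rather not do the math by hand, here is the arithmetic (the per-CPU figure is just the rough estimate quoted above):

```python
nonce_space = 2**32        # possible values of a 32-bit nonce
block_delay = 600          # seconds
cpu_hashrate = 5 * 10**6   # "several million" hashes per second, a rough 2010-era guess

print(nonce_space / block_delay)                   # ~7.2 million hashes per second uses up the nonce space
print(nonce_space / (cpu_hashrate * block_delay))  # so one or two such CPUs already exhaust it every block
```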

However, reality struck, Bitcoin exploded, ASICs hit the mining market, and the hashrate exceeded everyone's expectations by orders of magnitude.

To compensate, Bitcoin miners use several tricks to add more entropy. One popular trick is the "coinbase extranonce": injecting an additional nonce into the first transaction of the block. This changes the Merkle tree root, and hence the hash of the block.

Adding these 32 bits means Bitcoin can now handle hash rates that are $2^{32}$ (about 4.3 billion) times larger.

Isn't Difficulty About Counting Zeros?

Some of you may have heard that the difficulty target counts "how many leading zeros" there are in the hash, and might be wondering what this has to do with all of the above.

To answer this, we need to go into how to interpret a string of bits as an integer. We do it in a very standard way that many readers are probably already familiar with, called right-LSB, where LSB stands for least significant bit.

This just means that the rightmost bit is the smallest one, just like in the decimal system, where the rightmost digit represents units, the next digit represents tens, and so on. We are going to use both decimal and binary numbers in this explanation, so to make things readable, we introduce a common convention: a number written on its own is always considered decimal, and a binary number is prefixed by $0b$.

In the decimal system, each digit represents a unit ten times larger than the one to its right. Incrementing the second digit from the right has ten times the effect of incrementing the first digit: $27 = 17 + 10$, $18 = 17+1$.

In binary, this is the same, except each digit becomes twice as significant as we move left. For example, $0b11=3$. Why? Because the rightmost $1$ adds $1$, and the second rightmost $1$ adds twice as much, which is $2$, so we have a total of $1+2=3$.

In decimal, a $1$ followed by $n$ zeros can be written as $10^n$. For example, $1000=10^3$. Similarly, in binary, a $1$ followed by $n$ zeros equals $2^n$. So, for example, $0b1000=2^3=8$. We can use this to corroborate our computation from before: $0b11 = 0b10 + 0b1 = 2^1 + 2^0 = 2+1 = 3$. This also gives us a general formula for converting binary to decimal. We can denote a general $(n+1)$-digit binary number as $0bw_n\ldots w_0$, where each digit $w_i$ is either $0$ or $1$. By breaking it up into digits and using the formula for a $1$ followed by zeros, we get that

$$0bw_{n}\ldots w_{0}=\sum_{k=0}^{n}w_{k}2^{k}=\sum_{k:w_{k}=1}2^{k}.$$
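Here is the formula spelled out in Python, checked against the built-in parser:

```python
def binary_to_decimal(bits: str) -> int:
    # Sum 2^k over the positions k (counted from the right, starting at 0) that hold a 1
    return sum(2**k for k, w in enumerate(reversed(bits)) if w == "1")

print(binary_to_decimal("11"))                       # 3
print(binary_to_decimal("1000"))                     # 8
print(binary_to_decimal("1000") == int("1000", 2))   # True: agrees with Python's own parser
```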

To count digits, we use logarithms. You might know already that $\log_2(1)=0$, and more generally that $\log_2(2^k)=k$. Saying a number $n$ has $k$ binary digits is equivalent to saying that $2^{k-1} \le n < 2^k$. Why? Because $2^{k-1} = 0b10\ldots 0$ is the smallest number with $k$ digits, so $n$ must be at least that, and $2^k$ is the smallest number with $k+1$ digits, so $n$ must be smaller.

Logarithms are monotonically increasing: if we put a larger number in the logarithm, we get a larger result. Hence

$$k-1=\log_{2}2^{k-1}\le\log_{2}n<\log_{2}2^{k}=k.$$

We now introduce the floor notation $\left\lfloor x\right\rfloor$ to denote the largest integer that is at most $x$. For example, $\left\lfloor 1.4\right\rfloor =1$, $\left\lfloor -4.5\right\rfloor =-5$, $\left\lfloor \pi\right\rfloor =3$. We have proven that the number of digits in the binary representation of any positive integer $k$ is exactly $\left\lfloor \log_{2}k\right\rfloor +1$.

Why does this matter? Because now we can use this to count zeros! Say we always represent our number using $256$ binary digits. If it has fewer digits, we pad it with zeros on the left. This is implicitly what we did above when interpreting a string as a number: we just ignored all the leading zeros. If our number $k$ has $m$ digits (not counting the leading zeros), then there are $256-m$ leading zeros. But we know that $m=\left\lfloor \log_{2}k\right\rfloor +1$, so the number of leading zeros is $255-\left\lfloor \log_{2}k\right\rfloor$.
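Incidentally, Python's integers expose this quantity directly: for any positive integer $k$, k.bit_length() equals $\left\lfloor \log_{2}k\right\rfloor +1$, so counting leading zeros is a one-liner:

```python
import hashlib

def leading_zeros(k: int, width: int = 256) -> int:
    # k.bit_length() is floor(log2(k)) + 1 for any positive integer k
    return width - k.bit_length()

h = int.from_bytes(hashlib.sha256(b"example").digest(), "big")
print(h.bit_length(), leading_zeros(h))   # digits used, and leading zeros out of 256
```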

So if we require that $\textsf{H}(s)<T$, we in particular require that it has at least $255-\left\lfloor \log_{2}T\right\rfloor=256+\left\lfloor \log_{2}\tau\right\rfloor$ leading zeros. (I'll leave it to you to figure out how the $255$ magically became $256$.)

The zero-counting approach is actually too coarse-grained for difficulty targets: requiring one more zero doubles the difficulty, so counting zeros only lets us adjust the difficulty in jumps of a factor of two. This is exactly the information lost when going from $\log_2\tau$ to $\left\lfloor \log_{2}\tau\right\rfloor$. With the full value $\log_2\tau$, one can implement a recursive version of the argument above to compute the accurate difficulty. This is a conceptually clunky representation, but some prefer it, because this is how difficulty is represented in Bitcoin.

The Average Wait Time*

This final segment is a detour; it is not pertinent to anything that follows. It describes an interesting experiment I ran on my followers that is relevant to the discussion, might give you some food for thought, and emphasizes how unintuitive the block creation process can get.

In a Twitter poll I made, I asked my followers: given that the Bitcoin network produces a block every 10 minutes, if you start waiting for a block at some random time, how long will you wait for the next block on average? These are the results:

It is not surprising that most people think the answer is five minutes. After all, if you know that a bus arrives at the stop once every ten minutes, then (under the unrealistic assumption that buses arrive on time) if you arrive at the station at a random time, you expect to wait five minutes. (People understand this so deeply on the intuitive level that they aren't even bothered by the fact that actually proving it requires solving an integral.)

But block creation does not behave like a bus line. The probability that a nonce is correct is independent of how many nonces you have already tried. The process is memoryless. (In contrast, I like to believe that as you wait longer for a bus, the chance that it'll arrive in the next minute increases.)

Imagine repeatedly rolling a fair die until you roll a 1. How many rolls, on average, should this take? The probability of rolling a 1 in a given try is one in six, and the result of each roll is not affected by whatever happened on previous rolls, so it should take six tries on average.

More generally, if an experiment succeeds with probability $p$, it would take an average of $1/p$ attempts before the experiment is successful. This is not a trivial fact (at least, I am not aware of any way to prove it that doesn't require a bit of tricky calculus), yet people seem comfortable accepting it at face value. In fact, I used it several times throughout this section!
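If you would rather not take this on faith, a quick simulation makes it quite convincing. Here is one for the die example, $p=1/6$:

```python
import random

def attempts_until_success(p: float) -> int:
    n = 1
    while random.random() > p:   # each attempt succeeds with probability p
        n += 1
    return n

p = 1 / 6
trials = [attempts_until_success(p) for _ in range(100_000)]
print(sum(trials) / len(trials))   # close to 1/p = 6
```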

Now say that you rolled a die ten times, and annoyingly enough, none of the results were 1. How many additional rolls do you expect to need? Emotionally, we are biased to feel that "I've failed so many times, what are the chances that I fail again?" But, with some clarity, we realize that the die does not remember how many times it was rolled. The expected number of rolls remains six, whether you just started or have already failed a million times.

Mathematically, this can be notated as follows. If $X$ is the random variable that describes the number of attempts required, then the probability that we need $n+k$ attempts given that the first $k$ attempts failed is the same, regardless of $k$.

That is: $\mathbb{P}[X=n+k\mid X>k] = \mathbb{P}[X=n]$.
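For the die-rolling example, this identity can be checked directly: the first $k$ attempts all fail with probability $(1-p)^{k}$, and exactly $m$ attempts are needed with probability $(1-p)^{m-1}p$, so

$$\mathbb{P}[X=n+k\mid X>k]=\frac{\mathbb{P}[X=n+k]}{\mathbb{P}[X>k]}=\frac{(1-p)^{n+k-1}p}{(1-p)^{k}}=(1-p)^{n-1}p=\mathbb{P}[X=n].$$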

Now all we have to notice is that block creation is no different than rolling dice. More exactly, rolling an $N/T$-sided die at a rate of one roll per $t$ seconds, until the die hits $1$. We expect $N/T$ attempts to succeed, taking a total of $t\cdot N/T = \lambda$ seconds.

The immediate but hard-to-swallow consequence is the following: at any point in time, the expected time until the next block is always $\lambda$. Yes, even if you have been waiting for a Bitcoin block for 20 minutes now, the average time remaining is still 10 minutes!

People have a hard time coming to terms with this, so I took the time to complement the theoretical discussion above with some simulations. I simulated mining at a hash rate of one million hashes per block delay. Sampling a single block interval like this is very easy: just sample uniformly random numbers between $0$ and $1$ and count how many attempts it takes before you hit a number smaller than $1/1000000$. Repeating this process ten thousand times, I obtained a sample of ten thousand block intervals.

I then did the following: I removed all intervals shorter than a minute, and subtracted one minute from the remaining intervals. In other words, I removed all miners that waited less than one minute, and adjusted the remaining samples to represent how many additional minutes those miners waited. If my claims are correct, and how long you already waited doesn't matter, the new data and the old data should have the same distribution. However, if I am wrong, and having waited does affect how much longer remains to wait, we will see a difference representing the effect of waiting.

I also repeated the process above, this time for intervals more than ten minutes long.

The results are displayed in the following graph, and we can see quite clearly that the block creation process couldn't care less about how long you already waited:

The code that created this graph is very simple; feel free to experiment with it:

```python
from random import random
import matplotlib.pyplot as plt

HASHRATE = 10**6  # hashes (attempts) per block delay
SAMPLES = 10**3   # sampled block intervals (the text uses ten thousand; increase it if you have the patience)

def sample():
    """Count attempts until a uniform draw falls below 1/HASHRATE; return the interval in block delays."""
    n = 1
    while random() > 1 / HASHRATE:
        n += 1
    return n / HASHRATE

data = [sample() for _ in range(SAMPLES)]
# Keep only intervals longer than a minute (0.1 block delays) and measure the remaining wait
trimmed = [d - 0.1 for d in data if d > 0.1]
# Same, for intervals longer than ten minutes (one full block delay)
very_trimmed = [d - 1 for d in data if d > 1]

plt.hist([data, trimmed, very_trimmed], bins=30, density=True,
         label=["all intervals", "already waited 1+ minute", "already waited 10+ minutes"])
plt.legend()
plt.show()
```

To conclude, I will remind you once again that there is a rich probability theory of so-called Poisson processes. Poisson processes describe many things, including radioactive decay, typing mistakes in long texts, and PoW block creation. The theory tells us much more than just the average time we will need to wait: it allows us to precisely compute the probability that the waiting time falls within any range of interest. The theory is not too deep, and is accessible to any patient reader with a bit of background. Using it, one can easily derive powerful formulas to examine the nature of block creation. I did my best to provide a self-contained exposition in the appendix.
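As a small taste of what the theory gives: the waiting time in a Poisson process is exponentially distributed, so the probability of waiting longer than $x$ is $e^{-x/\lambda}$. For Bitcoin's $\lambda$ of ten minutes:

```python
import math

lam = 10   # expected block delay, in minutes
for x in (10, 20, 60):
    # For a Poisson process, P[wait longer than x] = exp(-x / lam)
    print(f"P[wait > {x} min] = {math.exp(-x / lam):.3f}")
# 10 -> 0.368, 20 -> 0.135, 60 -> 0.002
```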
