How PoW Works
In the previous section, we explained what PoW does. In this segment, we dive into the details of how it works. This section also marks our first encounter with math. The level of formality will steadily increase as we make our presentation more precise. If you find math unsavory, the descriptive parts alone should be enough to carry you through the rest of the book. Even if you do have some math mileage under your belt, be sure to read deliberately and let the details sink in. If you want to understand the mathematics underlying the descriptive discussions to follow, these details are essential.
You might have heard that PoW works by presenting miners with "complex mathematical puzzles". I argue that these mathematical puzzles are not "complex" at all. In fact, they are designed to be incredibly dumb: there is no better way to solve them than banging your head against them, trying out all possible solutions in no particular order until you find one that works. It takes exceptional cleverness to design a puzzle this dumb: a game so impossible to strategise that the dumbest and smartest players have exactly the same odds.
When constructing a dumb puzzle, there is a tunable parameter called the difficulty target that adjusts how hard the puzzle is. By correctly adjusting the difficulty target, we determine how long we should expect to wait before a solution emerges. In this section, we assume that the correct difficulty target is given to us by divine inspiration. In the next segment, we explain how Bitcoin adjusts the difficulty target in practice.

Hashes as Random Oracles
The cryptographic building block of PoW is something called a cryptographic hash function (CH). Even those of you who have never heard about CHs can name some that are used in practice. This includes SHA-256, the hash used in Bitcoin, as well as many other hashes you might have heard of, such as MD5, SHA-1, and SHA-3.
Each of these hash functions is a universe within itself, packing many clever ideas. But they all have the same goal: to appear random on previously untested inputs. Let us choose an arbitrary CH and denote it $H$. The property that a CH is trying to emulate is that it is impossible to predict the outcome of $H$. We will carefully define what that means in a moment.
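Before we do, here is a tiny Python sketch to get a feel for what "appearing random" means, using the standard hashlib module (the input strings are arbitrary):

import hashlib

# Two inputs that differ in a single character...
print(hashlib.sha256(b"hello world").hexdigest())
print(hashlib.sha256(b"hello worle").hexdigest())
# ...produce digests that look completely unrelated.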
In cryptography, we like to have neat formalism to streamline a division of labor. We do so by assuming we have an ideal $H$, and leaving it to others to figure out how to implement $H$ in a way that is sufficiently close to our ideal description. Such an ideal hash is called a random oracle, and assuming it exists means we are working in something called the random oracle model (ROM).
I provide a more substantial survey of random oracles in the appendix. But for now, all you need to know is that a random oracle is a magic box floating in the sky. Anyone can approach the box and give it any string $x$ they want as an input. The box replies with a string of 256 bits we call $H(x)$. We call this querying the oracle on $x$. We are promised that when the universe was created, the reply for each input was chosen uniformly and independently. That is, the value of $H(x)$ for each $x$ was selected at the dawn of time by a series of 256 coin flips.
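You can mimic this behavior in code, though only as a toy: the sketch below draws a fresh 256-bit reply the first time an input is queried, and remembers it for later queries. The function name and inputs here are mine, purely for illustration.

import secrets

replies = {}  # remembers the reply chosen for each input

def oracle(x: str) -> int:
    # The first query on x draws a uniformly random 256-bit reply;
    # repeated queries on the same x always return the same reply.
    if x not in replies:
        replies[x] = secrets.randbits(256)
    return replies[x]

print(oracle("satoshi"))
print(oracle("satoshi") == oracle("satoshi"))  # True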
Random oracles cannot exist in reality. They are too ideal. Hash functions are attempts to provide clever substitutes: efficient functions that are not really random, but that scramble the input so thoroughly that predicting their output becomes infeasible without a galactic-scale computer.
A consequence of the independence is that it is hard to find a preimage. Given some output $y$, there is no better way to find an $x$ such that $H(x) = y$ than guessing different values of $x$ until you find one that works. For any $x$, there are $2^{256}$ equally likely possibilities for $H(x)$. This implies that, on average, finding a correct $x$ should take about $2^{256}$ attempts. It is hard to overstate how fantastically large this number is.
To streamline the notation a bit further, we set $N = 2^{256}$ and think of $H(x)$ as a natural number between $0$ and $N - 1$. This is just a convenient way to order all possible outcomes. This will allow us to easily specify more general tasks, such as "finding an $x$ such that $H(x)$ is below some threshold".
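In code, this amounts to nothing more than reading the 256-bit digest as an integer. A minimal sketch, again with SHA-256 standing in for $H$:

import hashlib

N = 2**256

def hash_as_number(x: bytes) -> int:
    # Read the 32-byte digest as a natural number between 0 and N - 1.
    return int.from_bytes(hashlib.sha256(x).digest(), "big")

h = hash_as_number(b"hello")
print(h)
print(0 <= h < N)  # True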
We need just one more quantity before we can continue. Something that will bridge oracle queries with clock times. We define $\lambda$ such that the oracle is queried exactly once per $\lambda$ seconds. For example, if the oracle is queried once a minute, we set $\lambda = 60$. If the oracle is queried ten times a second, we set $\lambda = 0.1$. In practice, we are more likely to encounter much smaller values. The value $\lambda$ is called the global hash delay, and its reciprocal $1/\lambda$ is called the global hash rate.
Approximating the value of $\lambda$ and tracking it over time is the difficulty adjustment problem that we consider in the next chapter. For now, we assume that $\lambda$ is fixed and known. Knowing this $\lambda$, we can say things like "the expected time it would take to find a preimage is $N \cdot \lambda$ seconds".
Tuning the Difficulty
We already considered the problem of finding a preimage of some output $y$. Since the outputs are equally likely, we go ahead and assume that $y = 0$. The problem can be restated a bit awkwardly as "find $x$ such that $H(x) < 1$".
Why is it useful? Because we can now talk about a more general problem: find $x$ such that $H(x) < T$. That is, we don't require a specific output, but one out of $T$ specific outputs. (Again, since all outputs are uniformly likely and independent, it does not matter which $T$ values are allowed, so we choose $0, 1, \ldots, T-1$ out of convenience).
We call $T$ the target, and our first task is to figure out how to set it. To do that, we work out the expected time to solve the puzzle, and how it is affected by $T$.
This is a simple computation. For any $x$, there is a one in $N$ chance that $H(x) = 0$ and a one in $N$ chance that $H(x) = 1$. Note that it is impossible that both $H(x) = 0$ and $H(x) = 1$. In probability theory, two events that cannot both happen are called disjoint. A basic property of disjoint events is that the probability that either of them happens is the sum of their probabilities. In other words, the probability that $H(x) = 0$ or $H(x) = 1$ is two (= one + one) in $N$. So we expect it should take about $N/2$ queries, or $N\lambda/2$ seconds, to find an $x$ such that $H(x) < 2$.
This argument generalizes directly to show that we expect it would take $N\lambda/T$ seconds to find $x$ such that $H(x) < T$.
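If you don't trust the computation, you can check it empirically with small numbers. The sketch below uses a toy $N$ and $T$ (uniform random draws stand in for oracle queries) and confirms that the average number of attempts is about $N/T$:

from random import randrange

N, T = 2**16, 2**8
TRIALS = 10_000

def attempts_until_success():
    # Draw until we land below T; each draw plays the role of one query.
    count = 1
    while randrange(N) >= T:
        count += 1
    return count

average = sum(attempts_until_success() for _ in range(TRIALS)) / TRIALS
print(average, N / T)  # both should be about 256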
With this insight, we can control how long it would take to solve the puzzle by adjusting $T$. If we want it to take $\Delta$ seconds, we need $N\lambda/T = \Delta$. Isolating $T$, we arrive at an all-important formula:

$$T = \frac{N\lambda}{\Delta}.$$
Let's see an example. Say that the global hash rate is one trillion queries per second. In other words, $\lambda = 10^{-12}$. Say we want a block delay of ten minutes. Since we are working in units of seconds, we set $\Delta = 600$ and plug everything into the equation to get

$$T = \frac{2^{256} \cdot 10^{-12}}{600} \approx 1.9 \cdot 10^{62} \approx 2^{207}.$$
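The arithmetic is easy to reproduce; here is the same computation spelled out in Python:

N = 2**256
lam = 10**-12  # one trillion queries per second
delta = 600    # ten minutes, in seconds

T = N * lam / delta
print(T)  # roughly 1.9e62, i.e. about 2**207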
One peculiarity with the difficulty target is that lower targets make for harder puzzles. This makes sense, as hitting a number below one million is harder than hitting a number below one trillion. But we have to be mindful of this when using $T$ as a measure of difficulty.
To circumvent this, the difficulty is often defined as the reciprocal of the target, $1/T$. That way, the hardness of the puzzle grows with the difficulty. Unfortunately, the terms "difficulty" and "target" are sometimes used interchangeably. If that's not confusing enough, sometimes when people say "the difficulty" they actually mean the number of zeros at the beginning of the hash. There is a perfectly good reason for that, which I will soon explain.
The best advice I can give is that whenever anyone uses the terms difficulty or target in a context where these details are important, be sure to ask them explicitly what they mean.
Header Coupling
How do we use the puzzle above to space out Bitcoin blocks?
Requiring a block to contain a solution to the puzzle, an $x$ such that $H(x) < T$, is not enough. Why? Because there is nothing preventing the miner from using the same solution for several blocks. To solve this, we need the problem to depend on the block itself.
For technical reasons I won't get into, we don't actually want to feed entire blocks into the hash function. We need a smaller piece of information that still completely defines the block. This piece of information is called the block header. The header contains all sorts of metadata about the block. Importantly, it contains a pointer to the previous block (so whenever the chain progresses, the header necessarily changes), and a magic trick called a Merkle transaction root, which is much smaller than the content of the block but is cleverly built such that you can't change the content of the block without changing the transaction root. Even though the header does not contain the transactions, it is committed to them. It is impossible to find a different set of transactions with the same Merkle root. We explain how this magic works in an appendix.
To accommodate the "dumb puzzle", another field called the nonce is added. Importantly, the nonce field can be set to anything. To make our lives easy, we will use $B_n$ to denote the header of the block $B$ when the nonce is set to $n$. And here is the crux: instead of requiring an $x$ such that $H(x) < T$, we require the nonce value $n$ to satisfy $H(B_n) < T$. Unlike the string $x$, the nonce $n$ is only good for the block $B$!
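Here is a toy mining loop in Python. The header format and field values are made up for illustration (real Bitcoin headers have a fixed binary layout), and the target is set absurdly high so the search finishes almost instantly:

import hashlib

T = 2**240  # a very easy target

def header(nonce: int) -> bytes:
    # A pretend header: a pointer to the previous block, a transaction
    # commitment, and the nonce. The field values are placeholders.
    return f"prev=00ab|merkle_root=7f31|nonce={nonce}".encode()

nonce = 0
while int.from_bytes(hashlib.sha256(header(nonce)).digest(), "big") >= T:
    nonce += 1

print("found nonce:", nonce)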
Isn't Difficulty About Counting Zeros?
Some of you may have heard that the difficulty target counts "how many left zeros" there are in the hash, and might be wondering what this has to do with all of the above.
To answer this, we need to go into how to interpret a string of bits as an integer. We do it in a very standard way that many readers are probably already familiar with, called right-LSB, where LSB stands for least significant bit.
This just means that the rightmost bit is the smallest one, just like in the decimal system, where the rightmost digit represents units, the next digit represents tens, and so on. We are going to use both decimal and binary numbers in this explanation, so to make things readable, we introduce a common convention. A number written on its own is always considered decimal. A binary number will be prefixed by $0b$.
In the decimal system, each digit represents a unit ten times larger than the one to its right. Incrementing the second digit from the right has ten times the effect of incrementing the first: for example, $25 + 10 = 35$, while $25 + 1 = 26$.
In binary, this is the same, except each digit becomes only twice as significant as we move to the left. For example, $0b11 = 3$. Why? Because the rightmost $1$ adds $1$, and the second rightmost adds twice as much, which is $2$, so we have a total of $3$.
In decimal, a $1$ followed by $k$ zeros can be written as $10^k$. For example, $1000 = 10^3$. Similarly, in binary, a $1$ followed by $k$ zeros equals $2^k$. So, for example, $0b100 = 2^2 = 4$. We can use this to corroborate our computation from before: $0b11 = 0b10 + 0b1 = 2 + 1 = 3$. This also gives us a general formula for converting binary to decimal. We can denote a general $k$-digit binary number as $0b\,b_{k-1} \cdots b_1 b_0$, where each digit $b_i$ is either $0$ or $1$. By breaking it up into digits and using the formula for a $1$ followed by zeros, we get that

$$0b\,b_{k-1} \cdots b_1 b_0 = b_{k-1} \cdot 2^{k-1} + \cdots + b_1 \cdot 2 + b_0.$$
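You can corroborate the conversion formula in Python, which understands the same $0b$ prefix we use here:

digits = [1, 1, 0, 1, 0, 1]  # b_{k-1} ... b_1 b_0, most significant first
k = len(digits)
value = sum(b * 2**(k - 1 - i) for i, b in enumerate(digits))
print(value, 0b110101)  # both print 53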
To count digits, we use logarithms. You might know already that $\log_2 2 = 1$, and more generally that $\log_2 2^k = k$. Saying a number $x$ has $k$ binary digits is equivalent to saying that $2^{k-1} \le x < 2^k$. Why? Because $2^{k-1}$ is the smallest number with $k$ digits, so $x$ must be at least that, and $2^k$ is the smallest number with $k+1$ digits, so $x$ must be smaller.
Logarithms are monotonically increasing: if we put a larger number in the logarithm, we get a larger result. Hence

$$k - 1 = \log_2 2^{k-1} \le \log_2 x < \log_2 2^k = k.$$
We now introduce the floor notation $\lfloor y \rfloor$ to denote the largest integer that is at most $y$. For example, $\lfloor 3.7 \rfloor = 3$. We have proven that the number of digits in the binary representation of any number $x$ is exactly $\lfloor \log_2 x \rfloor + 1$.
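Python can confirm this: the built-in bit_length method reports the number of binary digits, and it agrees with the floor-of-logarithm formula:

from math import floor, log2

for x in (1, 5, 53, 2**255):
    print(x, floor(log2(x)) + 1, x.bit_length())  # the last two always match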
Why does this matter? Because now we can use this to count zeros! Say we represent our number as $256$ binary digits, no matter what. If it has fewer digits, we pad it with zeros on the left. This is implicitly what we did above when interpreting a string as a number: we just ignored all the leading zeros. If our number $x$ has $k$ digits (not counting the leading zeros), then there are $256 - k$ leading zeros. But we know that $k = \lfloor \log_2 x \rfloor + 1$, so the number of leading zeros is $255 - \lfloor \log_2 x \rfloor$.
So if we require that $H(B_n) < T$, we in particular require that it has at least $255 - \lfloor \log_2 T \rfloor$ leading zeros. (I'll leave it to you to figure out how the $x$ magically became a $T$.)
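As a quick sanity check, here is the zero count written out in Python (the leading zeros of a 256-bit value are just 256 minus its number of binary digits):

from math import floor, log2

def leading_zeros(h: int) -> int:
    # 255 - floor(log2(h)) equals 256 minus the number of binary digits of h.
    return 255 - floor(log2(h))

print(leading_zeros(2**255))      # 0: the topmost bit is set
print(leading_zeros(2**200 + 7))  # 55
print(leading_zeros(1))           # 255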
The zero-counting approach is actually too coarse-grained for difficulty targets: requiring one more zero doubles the difficulty, so counting zeros can only express difficulties that are powers of two. This is exactly the information lost when going from $T$ to $\lfloor \log_2 T \rfloor$. With the full value of $T$, one can implement a recursive version of the argument above to compute the accurate difficulty. This is a conceptually clunky representation, but some prefer it, because this is how difficulty is represented in Bitcoin.
The Average Wait Time*
This final segment is a detour. It is not pertinent to anything that follows. It describes an interesting experiment I ran on my followers that is relevant to the discussion, and might give you some food for thought, and emphasize how unintuitive the block creation process might get.
In a Twitter poll I made, I asked my followers: given that the Bitcoin network produces a block every 10 minutes, if you start waiting for a block at some random time, how long will you wait for the next block on average? These are the results:

It is not surprising that most people think that the answer is five minutes. After all, if you know that a bus arrives at the stop once every ten minutes, then (under the unrealistic assumption that buses arrive on time) if you arrive at the stop at a random time, you expect to wait five minutes. (People understand this so deeply on an intuitive level that they aren't even bothered by the fact that actually proving it requires solving an integral.)
But block creation does not behave like a bus line. The probability that a nonce is correct is independent of how many nonces you have already tried. The process is memoryless. (In contrast, I like to believe that as you wait longer for a bus, the chance that it'll arrive in the next minute increases.)
Imagine repeatedly rolling a fair die until you roll a 1. How many rolls, on average, should this take? The probability of rolling a 1 in a given try is one in six, and the result of each roll is not affected by whatever happened on previous rolls, so it should take six tries on average.
Now say that you rolled a die ten times, and annoyingly enough, none of the results were 1. How many additional rolls do you expect to need? Emotionally, we are biased to feel that "I've failed so many times, what are the chances that I fail again?" But, with some clarity, we realize that the die does not remember how many times it was rolled. The expected number of rolls remains six, whether you just started or have already failed a million times.
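A quick simulation drives the point home. We roll until a 1 shows up, and separately look only at the runs that survived ten failures, recording how many additional rolls they needed. Both averages come out around six:

from random import randint

TRIALS = 100_000
totals, remainders = [], []

for _ in range(TRIALS):
    rolls = 1
    while randint(1, 6) != 1:
        rolls += 1
    totals.append(rolls)
    if rolls > 10:
        remainders.append(rolls - 10)  # extra rolls after ten failures

print(sum(totals) / len(totals))          # about 6
print(sum(remainders) / len(remainders))  # also about 6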
Now all we have to notice is that block creation is no different from rolling dice. More exactly, it is like rolling an $N/T$-sided die at a rate of one roll per $\lambda$ seconds, until the die hits $1$. We expect $N/T$ attempts to succeed, taking a total of $N\lambda/T = \Delta$ seconds.
The immediate but hard-to-swallow consequence is the following: at any point in time, the expected time until the next block is always $\Delta$. Yes, even if you have been waiting for a Bitcoin block for 20 minutes now, the average time remaining is still 10 minutes!
People have a hard time coming to terms with this, so I took the time to complement the theoretical discussion above with some simulations. I simulated mining at a hash rate of one million hashes per block delay. Sampling a single block delay like this is very easy: just sample uniformly random numbers between $0$ and $1$ and count how many attempts it takes before you hit a number smaller than one in a million. Repeating this process ten thousand times, I obtained a sample of ten thousand block intervals.
I then did the following: I removed all intervals smaller than a minute, and subtracted one minute from the remaining intervals. In other words, I removed all miners that waited less than one minute, and adjusted the remaining samples to represent how many additional minutes those miners waited. If my claims are correct, and how long you have already waited doesn't matter, the new data and the old data should have the same distribution. However, if I am wrong, and having waited does affect how much longer remains to wait, we will see a difference representing the effect of waiting.
I also repeated the process above, this time for intervals more than ten minutes long.
The results are displayed in the following graph, and we can see quite clearly that the block creation process couldn't care less about how long you already waited:

The code that created this graph is very simple, feel free to experiment with it:
from random import random
import matplotlib.pyplot as plt

HASHRATE = 10**6   # hashes per block delay
SAMPLES = 10**4    # ten thousand sampled block intervals, as described above

def sample():
    # Count hashes until one lands below the target, then convert the
    # count into time, measured in units of the block delay.
    n = 1
    while random() > 1 / HASHRATE:
        n += 1
    return n / HASHRATE

data = [sample() for _ in range(SAMPLES)]
# Keep only the intervals longer than a tenth of a block delay (one minute)
# and record how much longer than that they lasted; likewise for intervals
# longer than a full block delay (ten minutes).
trimmed = list(filter(lambda x: x > 0, [d - 0.1 for d in data]))
very_trimmed = list(filter(lambda x: x > 0, [d - 1 for d in data]))

plt.hist([data, trimmed, very_trimmed], bins=30, density=True)
plt.show()

To conclude, I will remind you once again that there is a rich probability theory of so-called Poisson processes. Poisson processes describe many things, including radioactive decay, typing mistakes in long texts, and PoW block creation. This theory tells us much more than just the average time we will need to wait. It allows us to precisely compute the probability that the time we have to wait falls within any range of interest. The theory is not too deep, and is accessible to any patient reader with a bit of background. Using it, one can easily derive powerful formulas to examine the nature of block creation. I did my best to provide a self-contained exposition in the appendix.