How PoW Works

Our next goal is to understand the mechanics that govern proof-of-work block creation. First, we describe the so-called "complex mathematical puzzles" miners solve to create blocks. The key takeaway is that the mathematical puzzle is not "complex". In fact, it is incredibly dumb: the only way to solve it is to bang your head against it by trying all possible solutions in no particular order until you find one that works.

Don't get me wrong, it takes exceptional cleverness to design such a dumb puzzle. A game that is impossible to cheat, where the smartest player has no advantage over the dumbest player. Cryptographers call such puzzles cryptographic hash functions.

Building such a puzzle is not enough. To regulate the block creation rate to a prescribed length of time, we need to parameterize it correctly. You might have heard the name of the relevant parameter: the difficulty target.

In the first part of this section, we will assume we can magically know what the difficulty target should be. In the real world, not only do we not have access to this parameter, but it doesn't even stay fixed. The difficulty target has to adjust to changing hash rates. This is the responsibility of the difficulty adjustment algorithm (DAA). In the second part of this section, we will describe how difficulty adjustment is handled in Bitcoin.

The Dumb Puzzle

You know, that proverbial "complex mathematical puzzle" that miners are trying to solve? Well, first of all, it is not a "complex mathematical puzzle" in any way. It is actually the dumbest possible puzzle, a puzzle that can only be solved by trying all possible solutions one by one until we hit one that works. In other words, a puzzle that can only be solved with brute force.

It takes a whole lot of cleverness to create such a dumb puzzle. Such a puzzle is expected to outsmart all attempts to cleverly solve it faster than a brute-force approach would. But the puzzle itself, the task given to the miner, is very dumb by design, a magic 8-ball you keep shaking until you get the desired output. And that's a good thing. You do not want miners to get clever and find all sorts of shortcuts that will undermine the security. The entire point of proof-of-work is that it proves that you did the work. If there are clever ways to shirk the work and still obtain the proof, then your Sybil resistance is broken.

Hashes as Random Oracles

As I said, it takes seriously smart people to design a sufficiently dumb "puzzle", so I will treat the problem like a cryptographer and assume someone already solved the problem for me. More formally, I will assume the existence of a random oracle: a magic random function that floats in the sky, accessible to anyone.

I survey random oracles to some extent in the appendix, but all you need to know about them is this: a random oracle transforms arbitrary strings (such as block headers) into strings of 256 bits, such that if you input a new string, you get a completely random output, but if you input the same string again, you get the same output. Crucially, the output is completely different no matter how similar the inputs are. Changing the input by as little as a single bit will produce an output that seems completely uncorrelated with the original.

Obviously, random oracles do not exist in reality. But we can create things that are pretty darn close. Hash functions attempt to do just that. We compartmentalize the details of designing a hash function by modeling it as a random oracle and ignoring the details of how this is actually achieved (or rather, delegating them to hash designers).

To simplify our notation, we call the hash function $\mathsf{H}$ and let $\mathsf{H}(s)$ denote the string it outputs on input $s$. For even more simplicity, we will not think of $\mathsf{H}(s)$ as a bit string, but as an integer $0\le \mathsf{H}(s) < N$, where $N$ is usually some power of two, most commonly (at least in the context of proof-of-work mining) $N = 2^{256} \approx 1.158 \times 10^{77}$. If you want to get some bearing on how hyper-astronomically huge this number is, check out this cool video:
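This abstraction is easy to play with in code. Here is a minimal sketch that uses Python's SHA-256 as a stand-in for the random oracle (a convenient assumption; the point is only the interface: bytes in, an integer in $[0, N)$ out):

```python
import hashlib

N = 2 ** 256  # number of possible outputs


def H(s: bytes) -> int:
    """Interpret the 256-bit SHA-256 digest of s as an integer in [0, N)."""
    return int.from_bytes(hashlib.sha256(s).digest(), "big")


# The same input always maps to the same output...
assert H(b"hello") == H(b"hello")
# ...while a one-character change yields a seemingly unrelated output.
assert H(b"hello") != H(b"hellp")
assert 0 <= H(b"hello") < N
```

Nothing here relies on SHA-256 specifically; any function we are willing to model as a random oracle would do.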

To use a hash function for PoW we need one additional assumption beyond the random oracle model: that making a query takes time. More quantitatively, we assume we know some $t$ (that could be very small) such that the random oracle answers exactly one query per $t$ seconds. Why can we assume such a $t$ exists? Because computation is not free. Why can we assume that we know it or that it remains fixed? That question is a bit more... difficult (pun not-intended).

The "Dumb" Puzzle

As a warmup, consider the following problem: I choose an arbitrary number $0\le n < 2^{256}$ and ask you to find any string $s$ such that $\mathsf{H}(s) = n$. This problem is called finding a preimage or inverting the hash. How would you go about it? One way would be to feed $\mathsf{H}$ different strings until you hit one that fits the bill. Is there a better way? Apparently (if we regard $\mathsf{H}$ as a random oracle) the answer is no. Proving this requires a bit of finesse, but the intuition is clear: say I choose some string $s'$ and note that $\mathsf{H}(s') \ne n$, what can I learn from this? Well, I can learn that $s'$ is not a solution, but can I learn anything else? Not really. Since for any $s'' \ne s'$ the outputs $\mathsf{H}(s')$ and $\mathsf{H}(s'')$ are random and uncorrelated, it follows that $\mathsf{H}(s')$ tells us nothing about $\mathsf{H}(s'')$. In other words, the output $\mathsf{H}(s')$ is the only thing we learn, and that information is almost useless for finding a solution: it eliminates exactly one option.
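To make brute force feasible on a laptop, here is a toy sketch with the output truncated to 16 bits (an assumption for illustration only; with 256 bits the loop below would outlive the universe):

```python
import hashlib

BITS = 16  # toy output size, so brute force finishes instantly


def h(s: bytes) -> int:
    """Truncated SHA-256: a stand-in for a random oracle with 2**BITS outputs."""
    return int.from_bytes(hashlib.sha256(s).digest(), "big") >> (256 - BITS)


def find_preimage(n: int) -> bytes:
    """The only known strategy: try inputs one by one until h(s) == n."""
    i = 0
    while True:
        s = str(i).encode()
        if h(s) == n:
            return s
        i += 1


target = h(b"some string")  # pick a value we know is attainable
s = find_preimage(target)
assert h(s) == target
```

On average the loop runs about $2^{16} = 65536$ times, exactly the $N$ attempts the argument above predicts.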

There are stronger security properties that are also crucial. One of them is collision resistance: the inability to find two inputs $s\ne s'$ such that $\mathsf{H}(s) = \mathsf{H}(s')$. This is the property commonly required of hash functions (though for some applications even that is not enough). Collision resistance is strictly stronger than preimage resistance: we can conceive of a hash that is preimage resistant but not collision resistant, but not the other way around.

Random oracles are collision-resistant, and hashes used for proof-of-work are also believed to be collision-resistant. However, there are even stronger properties that hold for random oracles but are known to fail for some of the known hash functions. For that reason, extreme caution should be exercised when analyzing a hash as a random oracle.

See the exercises for more details.

So if we accept that spamming inputs is the only way to solve the problem, how long should it take? Each attempt has a probability of one in $N$ to succeed, so it will take (on average) $t\cdot N$ seconds to find a solution. For $N=2^{256}$, if $t$ is a trillionth of a second, this would only take just over three billion trillion trillion trillion trillion years. Grab a coffee while you wait.

That's a bit too long to wait, so let us make the problem easier: I choose two arbitrary numbers $n_1$ and $n_2$ and ask you to find $s$ such that either $\mathsf{H}(s)=n_1$ or $\mathsf{H}(s) = n_2$. The logic above still convinces us that brute force is the only approach. But now, each attempt has twice the probability of being successful, cutting our total running time in half, to $t\cdot N / 2$.

Now, since the random oracle is, well, random, it doesn't actually matter what $n$ is in the first problem. We can just replace it with $0$. And in the second problem, we could have similarly replaced $n_1$ and $n_2$ with $0$ and $1$. More generally, we can choose some number $T$, called the difficulty target, and ask you to find an input $s$ such that $\mathsf{H}(s) < T$. The two examples we've seen are the special cases where the target is set to $T=1$ or $T=2$ respectively. The number $T$ is exactly the number of outputs that are considered small enough.

For a general $T$, how long will we wait? Well, whenever we input a fresh string $s$, there are $N$ possible, equally likely values for $\mathsf{H}(s)$, and exactly $T$ of them are sufficiently small. So an attempt will succeed $T$ out of $N$ times, meaning that we will need around $N/T$ attempts, taking a total of $t\cdot N / T$ seconds. So if we want it to take, on average, $\lambda$ seconds to solve the puzzle, we should choose $T$ such that $t\cdot N/T = \lambda$, or after rearranging:

$$T = t\cdot N/\lambda$$

For example, say that $t$ is one trillionth of a second, $t=10^{-12}$, and that we want to have a block delay of ten minutes, so $\lambda = 600$ (because we are working in units of seconds), then we get that the appropriate difficulty target is $T=10^{-12}\cdot 2^{256}/600\approx 1.876\times 2^{206}$.
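The arithmetic in this example is easy to check directly (the values of $t$ and $\lambda$ are the assumptions stated above, not real-world measurements):

```python
t = 10 ** -12   # assumed seconds per hash
lam = 600       # desired average block delay, in seconds
N = 2 ** 256    # number of possible hash outputs

T = t * N / lam           # the target formula T = t*N/lambda
print(T / 2 ** 206)       # roughly 1.876, matching the text

# Sanity check: with this target, mining takes t*N/T seconds on average,
# which recovers the desired block delay.
assert abs(t * N / T - lam) < 1e-6
```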

You might be concerned that the solution is generally not an integer. This concern is much more valid than most people expect. Yes, rounding is obviously the solution, but how do you ensure that different implementations — or even the same implementation running on different architectures — all handle rounding errors exactly the same way? Even a slight incompatibility, including a round-off bug in a future popular CPU, can split the network and cause vicious complications. For that reason, cryptocurrencies apply measures that ensure that roundoff error handling remains uniform and platform-independent.

One peculiarity of the difficulty target is that lower targets make for harder blocks. This makes sense, as hitting a number below one million is harder than hitting a number below one trillion. But we have to be mindful of this if we ever want to use $T$ as a measure of how hard solving the puzzle was. For that reason, it is common to define the difficulty of the puzzle as $1/T$, so that a higher difficulty means a harder puzzle. Despite the similar names, the terms difficulty and difficulty target refer to inverse quantities, and the fact that these terms are commonly used interchangeably does not help resolve the confusion. The best advice I can give you is that whenever anyone uses the terms difficulty or target in a context where these details matter, be sure to ask them explicitly what they mean.

OK, so now that we have a grasp of how "dumb puzzles" are constructed, is our problem solved? Can we just require a miner to find some $s$ such that $\mathsf{H}(s)$ is sufficiently small, knowing this would delay them? Well, obviously not. If the puzzle is just to find some $s$ such that $\mathsf{H}(s) < T$, then a miner could just find this number once, and use it as "proof" for all their future blocks.

The next step is to modify the puzzle such that the string $s$ depends on the block. That's why we say that the miner is mining a block. The idea is this: we can separate a block into two parts, header and data. The header of the block contains a few fields of information that depend on the data therein, and an important field called the nonce that can be any number. Why is the nonce field there? So the miner can set it to arbitrary values over and over until the hash of the block header is sufficiently low. The nonce is where there is room for brute force.

Let $B$ be a block header without the nonce set, and let $B[n]$ be that same header with the nonce set to $n$. When a miner mines the block $B$, what they actually do is compute $\mathsf{H}(B[n])$ for different values of $n$ until they find an $n$ for which $\mathsf{H}(B[n]) < T$.
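The mining loop can be sketched in a few lines. This is an illustrative toy, not Bitcoin's actual header layout: the header bytes, the nonce encoding, and the very easy target (one success per $2^{16}$ hashes) are all assumptions made so the example runs instantly:

```python
import hashlib

N = 2 ** 256
T = N >> 16  # toy target: roughly one in 65536 hashes succeeds


def header_with_nonce(header: bytes, nonce: int) -> bytes:
    """B[n]: the header with the nonce field set to n (layout is illustrative)."""
    return header + nonce.to_bytes(8, "big")


def mine(header: bytes) -> int:
    """Try nonces one by one until the header hash falls below the target."""
    nonce = 0
    while int.from_bytes(hashlib.sha256(header_with_nonce(header, nonce)).digest(), "big") >= T:
        nonce += 1
    return nonce


n = mine(b"example header")
digest = int.from_bytes(hashlib.sha256(header_with_nonce(b"example header", n)).digest(), "big")
assert digest < T
```

Note that the loop embodies the whole "dumb puzzle" idea: no step is smarter than "increment and retry".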

So are we done? Not quite. The remaining question is: if we only hash the header, what does this say about the data? Can't we just change the data and use the same header as proof? The "obvious" solution is to hash the entire block instead of just the header, but that would make mining much harder and more centralized. Instead, Satoshi had the foresight to use a cool construction called Merkle trees. For our current purposes, you don't have to understand exactly what a Merkle tree is (though you should, because they are very useful and very cool), only that it allows us to take an arbitrary amount of information and extract from it a 256-bit string called the Merkle root, such that it is impossible to manipulate the data without changing the Merkle root. By including the Merkle root in the block header, we get that the string $B[n]$ affirms the veracity of the block contents as well.
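For the curious, here is a simplified Merkle-root sketch: hash the leaves, then repeatedly hash adjacent pairs until one value remains (real Bitcoin uses a double SHA-256 and a specific serialization, which this toy omits):

```python
import hashlib


def sha256(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash up the tree until a single 32-byte root remains."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


txs = [b"tx1", b"tx2", b"tx3"]
root = merkle_root(txs)
# Tampering with any transaction moves the root, invalidating the header.
assert merkle_root([b"tx1", b"tx2", b"tampered"]) != root
```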

The Average Wait Time

In a Twitter poll I made, I asked my followers the following question: given that the Bitcoin network produces a block once per ten minutes, if you start waiting for a block at some random time, how long will you wait for the next block on average? These are the results:

It is not surprising that most people think the answer is five minutes. After all, if you know that a bus arrives at the stop once every ten minutes, then (under the unrealistic assumption that buses arrive on time) arriving at the station at a random time, you expect to wait five minutes. (People understand this so deeply on the intuitive level that they aren't even bothered by the fact that actually proving it requires solving an integral.)

But block creation does not behave like a bus line. The probability that a nonce is correct is independent of how many nonces you already tried, and the process is memoryless.

Imagine rolling a fair die until you roll a 1. How many rolls, on average, should this take? The probability of rolling a 1 on a given try is one in six, and the result of each roll is not affected by whatever happened on previous rolls, so it should take six tries on average.

More generally, if some experiment succeeds with probability $p$, it would take an average of $1/p$ attempts before the experiment is successful. This is not a trivial fact (at least, I am not aware of any way to prove it that doesn't require a bit of tricky calculus), yet people seem comfortable accepting it at face value.
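If the calculus is off-putting, the $1/p$ rule is easy to check empirically. A quick Monte Carlo sketch with $p = 1/6$ (the die example):

```python
from random import random


def attempts_until_success(p: float) -> int:
    """Repeat a p-probability experiment until it succeeds; count the attempts."""
    n = 1
    while random() >= p:
        n += 1
    return n


p = 1 / 6
trials = 100_000
avg = sum(attempts_until_success(p) for _ in range(trials)) / trials
# avg should land close to 1/p = 6 (up to statistical noise)
print(avg)
```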

Now say that you rolled a die ten times, and annoyingly enough, none of the results were 1. How many additional rolls do you expect to make? Emotionally, we are biased to feel that "I've failed so many times, what are the chances that I fail again?" But, with some clarity, we realize that the die does not remember how many times it was rolled. The expected number of rolls remains six, whether you just started, or failed a million times already.

Mathematically, this can be notated as follows. If $X$ is the random variable that describes the number of attempts required, then the probability that we need $n+k$ attempts given that the first $k$ attempts failed is the same as the probability of needing $n$ attempts from a fresh start, regardless of $k$.

That is: $\mathbb{P}[X=n+k\mid X>k] = \mathbb{P}[X=n]$.
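For completeness, here is the short calculation behind this identity for the geometric case (the appendix treats the theory more formally):

```latex
\mathbb{P}[X=n] = (1-p)^{n-1}p, \qquad \mathbb{P}[X>k] = (1-p)^{k},
\]
so
\[
\mathbb{P}[X=n+k \mid X>k]
  = \frac{\mathbb{P}[X=n+k]}{\mathbb{P}[X>k]}
  = \frac{(1-p)^{n+k-1}\,p}{(1-p)^{k}}
  = (1-p)^{n-1}p
  = \mathbb{P}[X=n].
```

The factor $(1-p)^k$ contributed by the $k$ failed attempts cancels exactly, which is the algebraic face of "the die does not remember".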

Now all we have to notice is that block creation is no different from rolling dice. More precisely, rolling an $N/T$-sided die at a rate of one roll per $t$ seconds, until it lands on $1$. We expect $N/T$ attempts to succeed, taking a total of $t\cdot N/T = \lambda$ seconds.

So the hard-to-swallow pill is the following: at any point in time, the expected time until the next block is always $\lambda$. Yes, even if you have been waiting for a Bitcoin block for twenty minutes now, the average time remaining to wait is still ten minutes.

People have a hard time coming to terms with this, so I took the time to complement the theoretical discussion above with some simulations. I simulated mining at a hash rate of one million hashes per block delay. Sampling a single block delay like this is very easy: just sample uniformly random numbers between $0$ and $1$ and count how many attempts it took before you hit a number smaller than $1/1000000$. Repeating this process ten thousand times, I obtained a sample of ten thousand block intervals.

I then did the following: I removed all intervals smaller than a minute, and subtracted one minute from the remaining intervals. In other words, I removed all miners that waited less than one minute, and adjusted the remaining samples to represent how many additional minutes miners waited. If my claims are correct, and how long you already waited doesn't matter, the new data and old data should be distributed the same. However, if I am wrong, and having waited does affect how much longer it remains to wait, we will see a difference representing the effect of waiting.

I also repeated the process above, this time for intervals more than ten minutes long.

The results are displayed in the following graph, and we can see quite clearly that the block creation process couldn't care less about how long you already waited:

The code that created this graph is very simple, feel free to experiment with it:

from random import random

import matplotlib.pyplot as plt

# One million hashes per block delay: each hash succeeds with probability 1/HASHRATE.
HASHRATE = 10**6
SAMPLES = 10**4

def sample():
    """Sample one block interval, in units of the block delay."""
    n = 1
    while random() > 1 / HASHRATE:
        n += 1
    return n / HASHRATE

data = [sample() for _ in range(SAMPLES)]
# Miners who already waited a minute (0.1 block delays): how much longer did they wait?
trimmed = [d - 0.1 for d in data if d > 0.1]
# Same, for miners who already waited ten minutes (a full block delay).
very_trimmed = [d - 1 for d in data if d > 1]

plt.hist([data, trimmed, very_trimmed], bins=30, density=True)
plt.show()

To conclude, I will point out that there is more to the theory of block creation. We know exactly what the distribution we see in the image is, and can explain very well why it is the distribution we expect to see. We can derive a formula for it and use it to analyze block creation in many ways. For a somewhat formal treatment, see the appendix.

Difficulty Adjustment

So remember we assumed there is this number $t$ that measures how long a single hash takes? This is not a realistic assumption. The number $t$ represents, in a sense, both how hard it is to compute $\mathsf{H}$ and the global amount of effort dedicated to solving the puzzle. The value of $t$ is not fixed, but time dependent. If a miner replaces their machine with a faster one, $t$ has decreased. If another miner joins the party, $t$ has decreased. If a miner went out of business or had a long-term technical failure, $t$ has increased. This value of $t$ that we so arrogantly took for granted not only depends on information that is not available to us, it isn't even fixed.

In practice, instead of pretending we know $t$, we try to estimate it from the information we do have: the blocks themselves. If we see that blocks are created too fast or too slow, we adjust the target $T$ accordingly.

As the person who designed Kaspa's difficulty adjustment, I can give you first-hand testimony that designing difficulty adjustment is very intricate. There are many approaches to choosing how to sample the blocks, and then a plethora of methods for evaluating the difficulty from the data. Later in the book I will describe Kaspa's difficulty adjustment algorithm, and maybe also provide a general discussion of the more common difficulty adjustment approaches (material that is unfortunately not covered in any textbook or other unified source). For now, the task of understanding how difficulty adjustment works in Bitcoin is daunting enough.

Difficulty Adjustment in Bitcoin

Bitcoin was naturally the first crypto to suggest any form of difficulty adjustment. Up to a few tweaks, it was lifted almost verbatim from the original Bitcoin whitepaper, which was published before cryptocurrencies even existed. Many argue that Bitcoin's approach to difficulty adjustment is outdated, has many drawbacks, and can even be dangerous, while others argue that it is time-tested and its simplicity protects us from unexpected attacks. So basically, a carbon copy of any other argument about Bitcoin.

Bitcoin uses a fixed difficulty window. A difficulty epoch (or window) lasts $N=2016$ blocks (or approximately two weeks), and at the end of each epoch, the difficulty is adjusted according to the blocks within that epoch.

The alternative is a sliding window approach, where the difficulty is recalculated for each block according to the $N$ blocks that preceded it.

The advantage of a sliding window approach is that it is more responsive, making the network react quickly to changes in the global hashing power instead of waiting for the end of the epoch. Note that in cataclysmic circumstances, waiting for the end of the epoch could take a very long time, as less mining means longer block delays, delaying the end of the epoch, and possibly causing more miners to jettison, making the end of the epoch even more distant. I know of exactly one example: a small chain that had more than 80% of its mining on a single private pool. One day the pool decided to switch to another coin, making the hashrate instantly drop to 20% of what it was, increasing block delays five times over. Displeased miners hopped away as well, dropping the hashrate to about 1% of what it was, effectively halting the chain. With a two-week difficulty epoch, the network would have had to wait literal years for such a drop to be properly adjusted. The stall was resolved by a hard fork that manually increased the target while implementing a sliding-window difficulty adjustment.

The disadvantage of a sliding window approach is that it is, well, more responsive. Responsiveness is a double-edged sword and a difficulty adjustment mechanism that is too sensitive can be abused to manipulate the network in ways that the robust difficulty epoch would not allow. Moreover, sliding windows are naturally more complex, and have complicated and not completely understood dynamics, such as difficulty fluctuations that create new vectors for opportunistic mining, further disrupting block creation rate consistency.

TODO: find references, I remember a series of posts explaining this problem in BCH or BSV's attempts to use ASERT and in Zawy's posts, I need to dig for it

So given a window of $N$ blocks, and the times they were created, how do we adjust $T$? Quite simply, compare how long it was supposed to take with how long it actually took, and adjust $T$ accordingly.

If the block delay is $\lambda$, then the expected length of an epoch is $\lambda \cdot N$. For example, in Bitcoin we have $\lambda = 10\text{ min}$ and $N=2016$, so the length of a difficulty epoch is expected to be $20160$ minutes, which is exactly two weeks.

A comfortable way to state what we observed is via the ratio $\alpha$ satisfying that the epoch lasted $\alpha \cdot \lambda \cdot N$. For example, $\alpha = 1/2$ means the epoch was half as long as expected, while $\alpha = 2$ means it was twice as long. For reasonably large values of $N$ ($2016$ is more than enough) we can deduce from this that the average block time is extremely close to $\alpha\cdot \lambda$ (to understand why a small $N$ is too noisy, you can review the discussion on the math of block creation). We want to adjust this to the desired $\lambda$, which means that we want to make mining $1/\alpha$ times harder: if $\alpha = 1/2$ we want to make it twice as hard, if $\alpha=2$ we want to make it twice as easy.

Recall that a larger target means easier blocks, so to make mining $1/\alpha$ times harder, we choose $\alpha\cdot T$ as our new target.
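The retargeting rule fits in a few lines. A minimal sketch (floating-point for readability; real implementations use integer arithmetic, and the factor-of-four clamp shown here is Bitcoin's):

```python
def retarget(old_target: int, actual: float, expected: float) -> int:
    """New target = alpha * T, where alpha = actual / expected epoch length.

    Bitcoin additionally clamps the adjustment to a factor of four
    in either direction.
    """
    alpha = actual / expected
    alpha = max(0.25, min(4.0, alpha))
    return int(old_target * alpha)


T = 2 ** 220  # an arbitrary illustrative target
# Epoch took one week instead of two: blocks came twice as fast,
# so alpha = 1/2 and the new target halves (mining gets twice as hard).
assert retarget(T, 7 * 24 * 3600, 14 * 24 * 3600) == T // 2
```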

Handling Timestamps

This is all good and well, but there is one problem: how do we know how long it took to create these $N$ blocks?

Each block includes a timestamp that allegedly reports when it was discovered, but it cannot be authenticated. It could be that the miner's clock is out of sync, or worse, that she is deliberately trying to interfere with the network by messing with the difficulty adjustment.

As a first line of defense, Bitcoin bounds the difficulty change to a factor of four. No matter what the timestamps look like, if the old and new targets are $T$ and $T'$, then we will always have $\frac{1}{4} \le \frac{T}{T'} \le 4$. But that's obviously not enough.

So how can we avoid timestamp manipulation? If we just take the first and last timestamps of the epoch and check their difference, then a malfeasant miner who happens to create the first or last block essentially gets carte blanche to set the difficulty to whatever she wants. The adjustment cannot depend on just two blocks.

One might be tempted to tackle this by analyzing the timestamps within the epoch better, and trying to isolate the "true" ones. Indeed, one can improve the robustness using such approaches, but any such mechanism would be exploitable if it can be fed arbitrary timestamps.

Instead, the solution is to impose smart limitations on the block's timestamp when it is processed.

The idea is not to allow a timestamp to be more than two hours off the mark. If we can guarantee that, then the worst an attacker can do is compress or stretch the difficulty window by four hours, which is about $1\%$ of two weeks. But how can we guarantee it?

The first observation is that we can verify that a timestamp is not too deep in the past using the consensus data itself. Since the block delay is ten minutes, we expect $12$ blocks to be created in two hours. So what should we do? Should we go $12$ blocks back from the candidate block and check that it has an earlier timestamp? That happens to be exploitable: the miner might have created that block too, and the block $12$ blocks before that one, and so on. Each block in such a chain allows two more hours of manipulation. This attack might not seem practical, but if we let an adversary try it consistently, it will succeed at some point, and it requires less than $10\%$ of the hashing power, a far cry from the honest-majority security we were promised. Bitcoin overcomes this by taking the timestamps of the last $23$ blocks and picking the median. This is a great idea because highly irregular timestamps will not be the median, but in the fringes.

So say we want to enforce a timestamp deviation of at most $N$ block delays. Say that the block $B$ has a timestamp $S$, and let $M$ be the median timestamp of the $2N-1$ blocks preceding it. Then we have the following rule:

Rule I: if $S<M$ then the block is invalid.
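Rule I is a one-liner, and a tiny sketch shows why the median is robust: a single wildly manipulated timestamp among the $2N-1$ predecessors barely moves it (the timestamps below are made up for illustration):

```python
from statistics import median


def passes_rule_one(timestamp: int, prior_timestamps: list[int]) -> bool:
    """Rule I: a block is invalid if its timestamp precedes the median
    timestamp of the 2N-1 blocks preceding it."""
    return timestamp >= median(prior_timestamps)


# 23 predecessors (N = 12), with one absurd outlier planted by an attacker.
prior = list(range(100, 123))  # hypothetical timestamps, one per block
prior[0] = -10**6              # the manipulated timestamp
# The median is barely affected, so honest blocks pass and backdated ones fail.
assert passes_rule_one(112, prior)
assert not passes_rule_one(50, prior)
```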

OK, but what about blocks whose timestamp deviates into the future? Now we don't have any blocks to rely on. We can't have the validity of a block depend on the blocks succeeding it!

The idea is to use the actual system clock. Let $C$ be the system clock while validating the block. The length of time $S-M$ is an approximation of $N$ actual block delays (that is, according to the current, possibly inaccurate difficulty target). We would like to invalidate blocks whose timestamps are more than $S-M$ later than $C$, and it is easy to check that this just means that $M>C$. The problem is that $C$, unlike $M$, is not in consensus, and can vary from miner to miner. If we tell miners to invalidate such blocks, we are practically begging for a net split. The correct solution is to delay the block:

Rule II: if $M>C$, wait until $C=M$ before including the block; if a new valid block arrives while waiting, prefer it over the delayed block.

Note that miners are incentivized to follow this validation rule, as it gives them more time to create a competing block.

This does not prohibit blocks with timestamps set to the far future, but it strongly discourages them by degrading their probability of making it into the chain. The later the timestamp, the more miners will delay mining on top of the block, increasing the probability that it will be orphaned.

With these two rules in place, we can finally enforce a very simple policy: at the end of an epoch, find the earliest and latest timestamps $S, S'$ in the epoch, set $\alpha = \frac{S'-S}{\lambda N}$, and adjust the difficulty as explained in the previous section. The timestamp deviation rules we outlined strongly limit the ability to abuse this mechanism through timestamp manipulation.

One might be curious why we look for the earliest and latest timestamps and not just take the first and last one. The answer is that timestamps in Bitcoin are not always monotone. There are known examples of later blocks with earlier timestamps. Amusingly enough, I found that out from a 2014 paper that quotes a paper from 2015.
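The policy above can be sketched as a small function. Computing $\alpha$ from the extreme timestamps makes the order of the timestamps irrelevant, which is exactly the point of using min/max rather than first/last (the evenly spaced timestamps below are a made-up example):

```python
def epoch_alpha(timestamps: list[float], lam: float) -> float:
    """alpha = observed epoch span / expected span (lam * N),
    taken between the earliest and latest timestamps in the epoch."""
    n = len(timestamps)
    return (max(timestamps) - min(timestamps)) / (lam * n)


# 2016 perfectly spaced ten-minute timestamps give alpha just shy of 1.
stamps = [600 * i for i in range(2016)]
alpha = epoch_alpha(stamps, 600)
# Shuffling the timestamps (non-monotone order) changes nothing.
assert epoch_alpha(list(reversed(stamps)), 600) == alpha
```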
