Byzantine Fault Tolerance
Like many good stories, ours also begins with war. In the fifth century, the titular city-state of Byzantium was expending. In one particular excusion, several Byzantine legions all surrounded the same fortress. Knowing their only chance to topple its impregnable walls is if all generals work together, they need to somehow agree among themselves whether to attack or retreat. Most crucially, they must all reach the same decision. But here is the crux: due to their importance, it is never allowed that more than two generals confer. They cannot all convene in one tent and hash it out. The entire decision must only rely on one-on-one communication. This is not just quite a problem, it is the problem.
The Byzantine Generals Problem (BGT)
While the historical novel style has its charm, we will shed it for a more colloquial jargon. (After all, it would be weird if three chapters from now we would refer to miners as generals).
The BGT problem essentially desccribes a network of nodes communicating with each other, having to agree on one of given list of options. (e.g. a bunch of generals having to decide whether to attack or retreat). The nodes communicate with each other in rounds, where in each round, any node can send a message to any other nodes. However, nodes only start to receive messages after all nodes are done sending their messages. Their goal is to devise a protocol that guarantees that, if all nodes follow the protocol they will all reach the same decision after a known number of rounds.
We've actually already mentioned one such protocol: majority voting. In this protocol, each node tells all other nodes its preference, and nodes agree to accept the popular choice. This protocol only takes one round.
Majority voting turns out to be a very bad protocol. Recall the story of the generals, and assume there are only three generals: Nicephorus, who wants to fight; Martin, who wants to retreat; and Euphemius, who is secretly a traitor to Byzantium.
Nicephorus and Martin are loyal to Byzantium and its protocols, so they broadcast their intentions. However, Euphemius the treacherous had other plans: he told Martin that he wants to fight, and Nicephorus that he wants to retreat! Consequentially, Nicephorus the fierce was coerced to retreat, while Martin the calm-minded was coerced to fight alone, leading to a great defeat!
That example shows where majority voting lacks as a protocol: in its fault tolerance. We call a node faulty if it deviates from the protocol for any reason. Adversarial generals and hackers are a reason for concern, but in real world scenarios nodes that are simply malfunctioning are also a serious concern.
A byzantine fault tolerance (BFT) protocol is a protocol that can guarantee that the nodes will reach a decision in the prescribed amount of time even if some of the nodes are faulty. The level of security such a protocol provides is usually described as a number between and that describes the fraction of faulty nodes a protocol can tolerate.
In this terms, we can say that PSL created a protocol with fault tolerance, and proved that this is the maximal fault tolerance possible.
Remark: The BFT property has two requirements. One is that the less than \alpha minority cannot coerce a decision, and the other is that it cannot delay it. The first form of security is called safety while the other is called liveness, and security requires both. PSLs protocol requires a fixed number of rounds that can not be changed as a result of faulty nodes (though it does depend on the number of nodes), but more modern types of BFT adapt the number of rounds according to how the protocol is progressing, making liveness a concern.
PSL's Protocol
We describe the PSL protocol for four nodes called John, Paul, George and Ringo who are trying to decide whether they should call their new album Revolver or Rubber Soul.
In the first round, each Beetle tells just tells all other Beetles how they want to call the album. That's a good start, but we also know that's not quite enough yet. In the second round, each Beetle shares with each other Beetles what the other Beetles told them.
After the first round, John should have three pieces of information, for example:
Paul wants to call the album Revolver
George wants to call the album Revolver
Ringo wants to call the album Rubber Soul
After the round is over, John should have six more pieces of information. If anyone were honest, they should look like this:
Paul said that George wants to call the album Revolver
Paul said that Ringo wants to call the album Rubber Soul
George said that Paul wants to call the album Revolver
George said that Ringo wants to call the album Rubber Soul
Ringo said that Paul wants to call the album Revolver
Ringo said that George wants to call the album Revolver
From which John can clearly understand that (discounting his vote) it is a two-to-one for Revolver. If John also want to call the album Revolver, then he would agree with this name. If all other Beetles followed the protocol, they will also agree on the name Revolver, including Ringo who preferred Rubber Soul!
What if John wants to call the album Revolver too? That's an interesting question I am going to ignore.
This works nicely if everyone is playing nicely, but what if someone tries to cheat? Say that Ringo really wants to call the album Rubber soul, and he knows John is on his side. What if he lies to John about what George and Paul had to say? Then John would see the following messages:
Paul wants to call the album Revolver
George wants to call the album Revolver
Ringo wants to call the album Rubber Soul
Paul said that George wants to call the album Revolver
Paul said that Ringo wants to call the album Rubber Soul
George said that Paul wants to call the album Revolver
George said that Ringo wants to call the album Rubber Soul
Ringo said that Paul wants to call the album Rubber Soul
Ringo said that George wants to call the album Rubber Soul
Here John will note a contradiction: Paul told him they want to call the album revolver, George also told him that Paul wants to call the album revolver, only Ringo for some reason claims that George wants to call the album Rubber Soul.
What John essentially does is second order majority voting. He sees how Paul, George and Ringo voted on how Paul wants to call the album. Note that even though Paul is the subject of discussion, his vote counts exactly as much as George and Paul.
It is not hard to complete this line of thought, and convince ourselves that no matter how Ringo tries to cheat, as long the rest of the Beetles follow the protocol there is no way for him to coerce a decision that is not backed by a strict majority.
But what if Ringo was innocent, and it was Paul and George who collude to enact the name Revolver, despite half of the Beetles being against it? Now John will see the following messages:
Paul wants to call the album Revolver
George wants to call the album Revolver
Ringo wants to call the album Rubber Soul
Paul said that George wants to call the album Revolver
Paul said that Ringo wants to call the album Revolver
George said that Paul wants to call the album Revolver
George said that Ringo wants to call the album Revolver
Ringo said that Paul wants to call the album Revolver
Ringo said that George wants to call the album Revolver
Upon the contradictions regarding Ringo's position, Paul will again carry out the second order voting, and see that two Beetles claim Ringo is pro Revolver, but only one Beetle claims Ringo wants to call the album Rubber Soul.
But wait, you might ask, the Beetle who thinks Ringo prefers Rubber soul is Ringo, shouldn't that account for something? Maybe, but the ability to override Ringo is exactly what protects us from Ringo playing a double game, telling each Beetle they have a different preference in an attempt to split the consensus. It is the crux of the protocol and how it works.
This protocol can be extended to Beetles (or any other coleopterans) by adding more rounds. So by the end of the say fourth round John will hold messages such as
George said that Ringo said that said that Paul said that Yoko wants to call the album Let it Be
One can prove by way of induction that if at most of the nodes are faulty, then the protocol is guaranteed to provide consensus within rounds.
Practical BFT*
A BFT protocol is not only measured by its security. Fault tolerance is an important feature, but your fault tolerant protocol isn't going to help anyone if it requires a trillion rounds or terabytes of RAM. The communication complexity is the number of rounds the protocol requires, the space complexity is how much data it has to keep track of, and the time complexity is the amount of processing required to compute the result from the received messages. One might wonder whether the length of the messages should play a part, and it does: it manifests itself in the time complexity, as we consider reading the messages a part of the work.
In the PSL protocol, the number of rounds is a third of the nodes. That's already less than ideal. For example, the Ethereum network has more than 750,000 validators, so using the PSL protocol on this network would require 250,000 rounds per block.
However, it is the runtime complexity where the real problem shows. To see that, it suffices just to look at how many times a Beetle is named in each message.
Let us assume there are Beetles and consider the messages Paul sends John. In the first round, Paul just sends his messages, so he names no Beetles. In the second round, Paul lists all Beetles besides Paul and John (and their alleged preferences), making him name Beetles.
What happens in the third round? Well, consider Ringo. For any Beetle X, Paul's message will include a line starting with "Ringo says that X wants", where there are options for X, as it could be any Beetle beside John, Paul, or Ringo. This means that we have lines starting with "Ringo said that". However, we don't really have to name Ringo in each and every line, as we can only write "Ringo says" once, and then list everything he told us. The total number of Beetle names is . But wait, that's just Ringo! We have a similar list for all Beetles besides Paul and John, making us name a total of Beetles.
By the time we reach round we find that each message contains Beetle names, which is a lot. Even if we assume storing the name of Beetle only requires one bit, we get that if there are Beetles, each message in the last round is more than 10 megabytes. Doesn't sound like a lot? Well, for Beetles the message size would surpass 10 terabytes, for we succeed 344 exabyte, and Beetles will topple two brontobyte, a unit of storage so large I never heard of it before writing this paragraph, and even my spell checker does not recognize. And what about 750,000 Ethereum validators? Well, the number is so large you would need about 170 kilobytes just to write down how many digits it has.
Fortunately, BFT protocols only started with PSL's work. In 1999, Miguel Castro and Barbara Liskov introduced the Parctical Byzantine Fault Tolerance (PBFT) protocol. PBFT most significant contribution is actually in terms we have not defined here: synchronicity model. Roughly, a synchronicity model models message latency. In the PSL analysis, there is an implicit assumption of full synchronicity. That is, all participants know when each round starts and when it ends. Full synchronicity is not how networks such as the internet naturally behave, and imposing it incurs large overheads. The internet is best modeled as admitting partial synchronicity, where we don't know how long it takes messages to transmit, but we do have an upper bound. Castro and Liskov's protocol is the first to provide fault-tolerance in the partially synchronous model. But more than that, it reduces the complexity to . Several improvements of various sorts succeeded PBFT, such as Zyzzyva and ABsTRACTs who provide improved performence, Aaardvark who provides improved robustness, and Adapt that switches between protocols to respond to changing conditions. More recent protocols such as HotStuff and Marlin further reduce the complexity to linear, and we should probably also mention BFT protocols crafted specifically for proof-of-stake such as Alphabet and Tendermint.
Last updated