Part 2: Technical Overview
The Incentive Layer
- Beacon chain incentives strongly encourage diversity among client deployments, hosting infrastructure, and staking pools.
- Lack of diversity puts at risk both the chain in general and all those running the majority client.
- The greater the share of validators hosted by a single client implementation, the greater the risk.
- The beacon chain is at its most robust and fault-tolerant when no single client type manages more than one-third (33%) of validators.
Just as diversity makes biological ecosystems more resilient, and monocultures make them very fragile (yes, I've been watching David Attenborough), so it is with Ethereum staking.
For example, the inactivity leak is much more likely to occur on a network in which a single client implementation runs over 33% of validators, or a single staking operator controls over 33% of validators, or over 33% of validators are deployed to the same hosting infrastructure. All these scenarios constitute single points of failure that could prevent the beacon chain from finalising and lead to a leak that penalises those running the majority (offline) client most harshly.
Let's consider some scenarios. For the sake of this exercise you are running the beacon chain client X. In each scenario you and others using client X host validators managing a certain fraction of the total stake. We will consider what happens if client X has a bug that takes it down. It might be a consensus bug or another kind of bug that takes the client off the network: we saw examples of both of these on the pre-launch testnets.
When a client managing less than one-third of the total stake goes down, the consequences are minimal. The beacon chain can continue to finalise as normal. Users of client X will suffer only the normal offline penalties until the bug is fixed, though rewards will be lower across the board for the other validators. But this is not catastrophic and there is time to recover without a panic, either by fixing the bug or swapping to a different client.
If client X goes down while managing more than one-third of the total stake, then the beacon chain will be unable to finalise and will enter the inactivity leak.
In this situation no validators will receive rewards for attesting. Users of non-X clients will not lose stake, but users of client X will suffer much bigger losses than usual, due to the quadratically increasing inactivity leak. There is strong time pressure to get the issue with client X resolved either by fixing the bug or swapping to a different client.
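To make "quadratically increasing" concrete, here is a minimal sketch of the leak for a single validator that stays offline throughout. The constants are the Bellatrix-era values as I understand them, and the model is deliberately simplified: it ignores effective-balance rounding, ejection, and the ordinary attestation penalties, so treat the numbers as indicative only.

```python
# Simplified inactivity-leak model for one validator that is offline for the
# whole leak. Constants are assumed Bellatrix-era spec values; the real spec
# works on integer effective balances, which this sketch ignores.
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_PENALTY_QUOTIENT = 2**24
SECONDS_PER_EPOCH = 32 * 12          # 32 slots of 12 seconds
EPOCHS_PER_DAY = 86400 // SECONDS_PER_EPOCH


def leaked_fraction(epochs_offline: int) -> float:
    """Fraction of stake lost after `epochs_offline` epochs of leak."""
    balance = 1.0
    score = 0
    for _ in range(epochs_offline):
        # The inactivity score grows linearly while the validator is absent...
        score += INACTIVITY_SCORE_BIAS
        # ...and the per-epoch penalty is proportional to the score, so the
        # cumulative loss grows roughly with the square of the time offline.
        balance -= balance * score / (INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT)
    return 1.0 - balance


if __name__ == "__main__":
    for days in (1, 2, 4, 8):
        print(days, "days:", f"{leaked_fraction(days * EPOCHS_PER_DAY):.4%}")
```

Doubling the time spent offline roughly quadruples the loss, which is why the time pressure on client X's users builds so quickly.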
The situation becomes potentially much worse when X hosts around half of the validators. If X were to have a consensus bug, but otherwise keep running, the beacon chain would split into two similarly sized chains. Each chain would see half its validators missing and start leaking out the stakes of those validators. Within three to four weeks each chain would have leaked out enough of the stake of the missing validators that the present validators would control two-thirds of the remaining stake, meaning that the chains could each finalise separately. It would be extremely difficult – effectively impossible – to reunite these chains ever again since they would contain conflicting finalised checkpoints. The beacon chain would be permanently partitioned.
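The three-to-four-week figure can be roughly sanity-checked with the same kind of simplified leak model. The sketch below assumes Bellatrix-era constants, treats stake as a continuous quantity, and ignores ejections and effective-balance rounding; the function name `epochs_until_finality` is mine, not the spec's.

```python
# Rough estimate of how long one side of a split must leak before its present
# validators hold two-thirds of the remaining stake. Assumed constants and a
# deliberately simplified model; not a reproduction of the spec's arithmetic.
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_PENALTY_QUOTIENT = 2**24
SECONDS_PER_EPOCH = 32 * 12


def epochs_until_finality(present_share: float) -> int:
    """Epochs of leak until present stake is at least 2/3 of what remains."""
    present = present_share
    missing = 1.0 - present_share
    score = 0
    epochs = 0
    # present / (present + missing) >= 2/3  is equivalent to  present >= 2 * missing
    while present < 2 * missing:
        epochs += 1
        score += INACTIVITY_SCORE_BIAS
        missing -= missing * score / (INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT)
    return epochs


if __name__ == "__main__":
    epochs = epochs_until_finality(0.5)
    print(f"{epochs} epochs, about {epochs * SECONDS_PER_EPOCH / 86400:.1f} days")
```

With half the validators missing, their stake has to fall to about half its original value before the chain can finalise again, which in this model takes around three weeks.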
Hopefully, three to four weeks is sufficient time for client X to fix its bug or for users of X to migrate to other clients. Meanwhile, users of X are suffering large inactivity penalties on the correct chain, just as in the previous scenario in which X managed more than one-third of the stake.
A scenario in which a single client approaches¹ hosting two-thirds (66%) of the validators is potentially catastrophic. A consensus bug in that client would very quickly – possibly within 13 minutes – finalise a broken version of the chain with no chance to intervene.
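The 13 minute figure is simply the time taken by two epochs, on the assumption that a checkpoint can be justified in one epoch and finalised in the next. As a quick check:

```python
# Worked arithmetic for the "13 minutes" figure. EPOCHS_TO_FINALISE is an
# assumption: the best case of justifying in one epoch and finalising in the next.
SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12
EPOCHS_TO_FINALISE = 2

minutes = EPOCHS_TO_FINALISE * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60
print(f"{minutes:.1f} minutes")  # 12.8 minutes
```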
That would leave the Ethereum community with a horrible dilemma.
One possible response would be to modify the other clients (and the specification) to reproduce the bug and allow them to join X's chain. The feasibility of this depends on the nature of the consensus bug. For a trivial bug it might be possible, but it would be very unfair to the non-X clients since they would suffer penalties despite having acted perfectly correctly. In any case, many types of consensus bug would make this infeasible: one way or another X's chain is broken and now incompatible with the entirety of the rest of the ecosystem.
The correct – but nuclear – option is to fix the bug in client X. Unfortunately, however, there would be no way for the stakers on the incorrect X chain to rejoin the correct chain. Any that tried to do so would be slashed, having previously finalised a checkpoint on the incorrect chain. The only reasonable strategy for (former) users of client X would be to stop validating and voluntarily exit their stakes. Exiting could take a long time due to the queuing mechanism, resulting in large penalties from the inactivity leak. Many of the affected stakers are likely to try to start validating again and would surely be slashed.
There are no good outcomes here, which is why it is critical that we never have a client with a two-thirds or more supermajority.²
As for slashing, once again running a majority client could be an act of self-harm. In the unlikely event that a client implementation has a bug that leads to its validators being slashed en masse, the correlated slashing penalties would be much more severe than if the same thing happened to those running a minority client.
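A minimal sketch of why correlation matters: the correlation penalty scales with the share of total stake slashed in the same period. I am assuming the Bellatrix-era multiplier of 3 here, working with fractions rather than the spec's integer arithmetic, and leaving out the initial slashing penalty and the subsequent inactivity penalties.

```python
# Simplified correlation penalty: the fraction of a slashed validator's
# effective balance that is removed, given the share of total stake slashed
# in the same window. The multiplier is an assumed Bellatrix-era value.
PROPORTIONAL_SLASHING_MULTIPLIER = 3


def correlation_penalty_fraction(slashed_share: float) -> float:
    """Fraction of effective balance lost to the correlation penalty."""
    return min(PROPORTIONAL_SLASHING_MULTIPLIER * slashed_share, 1.0)


if __name__ == "__main__":
    for share in (0.01, 0.10, 0.40):
        print(f"{share:.0%} slashed together -> "
              f"{correlation_penalty_fraction(share):.0%} of balance lost")
```

Under these assumptions a bug that slashes a client running 10% of validators costs each of them around 30% of their stake, whereas the same bug in a client running 40% of validators would cost them everything.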
Danny Ryan has presented a slightly different angle on client diversity that's insightful:
If a single client:
- Does not exceed 66.6%, a fault/bug in a single client cannot be finalized.
- Does not exceed 50%, a fault/bug in a single client's fork choice cannot dominate the head of the chain.
- Does not exceed 33.3%, a fault/bug in a single client cannot disrupt finality.
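These thresholds can be expressed as a small check. The function and key names below are hypothetical, chosen only to mirror the list above; exact fractions are used to avoid rounding problems at the boundaries.

```python
# Which of the three guarantees hold for a client with a given share of
# validators. Names are illustrative only; thresholds follow the list above.
from fractions import Fraction


def guarantees(share: Fraction) -> dict[str, bool]:
    """Map each guarantee to whether it holds at the given client share."""
    return {
        "bug cannot be finalised": share <= Fraction(2, 3),
        "fork-choice bug cannot dominate the head": share <= Fraction(1, 2),
        "bug cannot disrupt finality": share <= Fraction(1, 3),
    }


if __name__ == "__main__":
    for percent in (30, 45, 60, 70):
        print(f"{percent}%:", guarantees(Fraction(percent, 100)))
```

A client at 45%, for example, keeps the first two guarantees but loses the third: a bug in it could not be finalised or dominate the head, but it could stop the chain from finalising.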
Let me emphasise that these scenarios are far from theoretical. It is of existential importance to the Ethereum network that stakers pay attention to the distribution of client software and avoid adding to the share of the majority client.
It is instructive to revisit the major incident that occurred on the Medalla testnet, in which an issue in the majority client caused a high degree of chaos and led to large numbers of slashings. Had that client managed a smaller proportion of the network, the consequences for everybody would have been much less severe.