Part 2: Technical Overview
The Building Blocks
- The beacon chain uses a novel serialisation method called Simple Serialize (SSZ).
- After much debate we chose to use SSZ for both consensus and communication.
- SSZ is not self-describing; you need to know in advance what you are deserialising.
- An offset scheme allows fast access to subsets of the data.
- SSZ plays nicely with Merkleization and generalised indices in Merkle proofs.
Serialisation is the process of taking structured information (in our case, a data structure) and transforming it into a representation that can be stored or transmitted.
A cooking recipe is a kind of serialisation. I can write down a method for cooking something in such a way that you and others can recreate the method to cook the same thing. The recipe can be written in a book, appear online, even be spoken and memorised – this is serialisation. Using the recipe to cook something is deserialisation.
Serialisation is used for three main purposes on the beacon chain.
- Consensus: if you and I each have information in a data structure, such as the beacon state, how can we know if our data structures are the same or not? Serialisation allows us to answer this question, as long as all clients use the same method. Note that this is also bound up with Merkleization.
- Peer-to-peer communication: we need to exchange data structures over the Internet, such as attestations and blocks. We can't transmit structured data as-is, it must be serialised for transmission and deserialised at the other end. All clients must use the same p2p serialisation, but it doesn't need to be the same as the consensus serialisation.
- Similarly, data structures need to be serialised for users accessing a beacon node's API. Clients are free to choose their own API serialisation. For example, the Prysm client has an API that uses Protocol Buffers (which is being deprecated now that we have agreed a common API format that uses both SSZ and JSON).
In addition, data must be serialised before being written to disk. Each client is free to do this internally however they wish.
Ethereum 2.0 uses a bespoke serialisation scheme called Simple Serialize, or more commonly just "SSZ"[1], for all of these purposes.
It seems like we spent months over the end of 2018 and the start of 2019 talking about serialisation, and the story below is highly simplified. But I think it's worth recording some of the considerations and design decisions.
So, we had the freedom to choose a new serialisation protocol. What kind of decision points did we consider?
Starting with serialisation in the consensus protocol, the first big question was whether to adopt an existing off-the-shelf protocol or to roll our own.
One major issue with many existing schemes is that they do not guarantee that the serialisation is deterministic: they sometimes re-order fields in unpredictable ways. This makes them totally unsuitable for consensus; the same data must result in the same output every time.
A more general concern was around using third-party libraries in a consensus-critical situation. Back in 2014, Vitalik wrote a justification, titled Why not use X?, of Ethereum implementing its own technology (such as RLP) for so many things. Here's an excerpt:
One of our core principles in Ethereum is simplicity; the protocol should be as simple as possible, and the protocol should not contain any black boxes. Every single feature of every single sub-protocol should be precisely 100% documented on the whitepaper or wiki, and implemented using that as a specification.
Certainly, with respect to serialisation, some third-party libraries are far more generic than we need, which can lead to issues. Others don't map nicely to the data types that we want to use.
In view of these concerns, momentum was in favour of adopting a bespoke, tightly specified serialisation method. It was the development of Merkleization on top of SSZ that cemented this, making SSZ (in some form) the clear leader for consensus serialisation.
That decision made, the next big question was whether to use the same scheme for both consensus serialisation and peer-to-peer communications serialisation (the "wire-protocol"). This was finely balanced, and good arguments were made in favour of using Protocol Buffers for p2p communication and SSZ for consensus.
The factors that tipped the balance in favour of SSZ for communications were (1) a desire to maintain only one serialisation library, and (2) some possible performance benefit.
On the first of these, there is a bias in Ethereum 2 to favour "simplicity over efficiency". Maintaining two serialisation libraries is arguably more overhead than any potential gain from using different ones. Having said that, RLP is still used in Eth2's discovery layer (since it is shared with Eth1), so this argument loses some of its force.
On the second, when we receive an object over the wire, often the first thing we will want to do is to serialise it to calculate its data root for consensus. If we receive it already serialised in the right format then it saves a deserialise/reserialise round trip.
SSZ does not make any effort to compact or compress the serialised data, and there were concerns that this might make it inefficient for the wire transfer protocol. These concerns were alleviated by adding Snappy compression on the wire, as is already done in Ethereum 1.
SSZ is based on Ethereum's smart contract ABI, but with 4-byte position and size records rather than 32-byte, and different basic data types. It will immediately feel familiar to anyone who has fiddled with that. The rudiments of SSZ were laid down by Vitalik in August 2017.
A big step forward in the utility of SSZ, and what established it as the serialisation protocol of choice for consensus, was the development of Merkleization (also known as tree hashing), first discussed in October 2018 and adopted into the spec in November.
Also in November 2018 we agreed to switch the byte ordering for integer types from big-endian to little-endian at the request of the Nimbus team. This means that the 32-bit number representing 66 decimal is now serialised as 0x42000000 rather than 0x00000042. The main motivation for the change was to map better to byte-ordering in typical microprocessors.
April 2019 saw a major change to SSZ with the adoption of offsets. This came from a scheme, Simple Offset Serialisation, previously proposed by Péter Szilágyi. The idea is to split the objects we are serialising according to whether they are fixed length or variable length. The serialisation then has two sections. The first section contains both actual serialisations of any fixed length objects, and pointers (offsets) to the serialisations of any variable length objects. The second section contains the serialisations of the variable length objects. The motivation for this is to allow fast access to arbitrary parts of the serialised data without having to deserialise the whole structure.
The specification of SSZ is maintained in the main consensus specs repo, and that's the place to go for all the details. I will only be presenting an introductory overview here, with a few examples.
The ultimate goal of SSZ is to be able to represent complex internal data structures such as the BeaconState as strings of bytes.
The formal properties that we require for SSZ to be useful for both consensus and communications are as defined in the SSZ formal verification exercise. Given objects $O_1$ and $O_2$, both of type $T$, we require that SSZ be
- involutive: $\mathrm{deserialise}_T(\mathrm{serialise}_T(O_1)) = O_1$ (required for communications), and
- injective: $\mathrm{serialise}_T(O_1) = \mathrm{serialise}_T(O_2)$ implies that $O_1 = O_2$ (required for consensus).
The first property says that when we serialise an object of a certain type then deserialise the result, we end up with an object identical to the one we started with. This is essential for the communications protocol.
The second says that if we serialise two objects of the same type and get the same result then the two objects are identical. Equivalently, if we have two different objects of the same type then their serialisations will differ. This is essential for the consensus protocol.
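For a small type, both properties can be checked exhaustively. The following is a minimal sketch using only the Python standard library rather than the Eth2 spec package; the helpers serialise_uint16 and deserialise_uint16 are hypothetical stand-ins that I have made up for this illustration.

```python
# Hypothetical stand-ins for SSZ (de)serialisation of the uint16 type.
def serialise_uint16(value: int) -> bytes:
    return value.to_bytes(2, "little")


def deserialise_uint16(data: bytes) -> int:
    if len(data) != 2:
        raise ValueError("a uint16 is exactly two bytes")
    return int.from_bytes(data, "little")


all_values = range(2**16)

# Involutive: deserialising a serialisation gives back the original object.
round_trips = all(deserialise_uint16(serialise_uint16(v)) == v for v in all_values)

# Injective: distinct objects of the same type have distinct serialisations.
distinct_outputs = len({serialise_uint16(v) for v in all_values}) == len(all_values)

print(round_trips, distinct_outputs)
```

For the composite types an exhaustive check is not feasible, which is where the formal verification work comes in.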
Beyond those basic functional requirements, other goals for SSZ are to be (relatively) simple, to create (fairly) compact serialisations, and to be compatible with Merkleization. It is also useful to be able to quickly access specific bits of data within the serialisation without deserialising the entire object. The adoption of offsets into SSZ improved its performance in that respect.
Unlike RLP, SSZ is not self-describing. You can decode RLP data into a structured object without knowing in advance what that object looks like. This is not the case for SSZ: you must know in advance exactly what you are deserialising. In practice this has not been a problem for Eth2: we always know in advance what class of object a particular deserialised blob of data corresponds to. A consequence of this is that, while in RLP two objects of different types cannot serialise to the same output, in SSZ they can. We'll see an example of this shortly.
The building blocks of SSZ are its basic types and its composite types.
SSZ's basic types are very simple and limited, comprising only the following two classes.
- Unsigned integers: a uintN is an N-bit unsigned integer, where N can be 8, 16, 32, 64, 128 or 256.
- Booleans: a boolean is either True or False.
The serialisation of basic types lives up to the "simple" name:
- uintN types are encoded as the little-endian representation in N/8 bytes. For example, the decimal number 12345 (0x3039 in hexadecimal) as a uint16 type is serialised as 0x3930 (two bytes). The same number as a uint32 type is serialised as 0x39300000 (four bytes).
- boolean types are always one byte and serialised as 0x01 for true and 0x00 for false.
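The rules for basic types are simple enough to reproduce in a few lines of plain Python. This is only a sketch to show the byte layout; the function names serialise_uint and serialise_boolean are my own and are not part of the Eth2 spec package.

```python
def serialise_uint(value: int, bits: int) -> bytes:
    """Little-endian encoding of an unsigned integer in bits / 8 bytes."""
    if bits not in (8, 16, 32, 64, 128, 256):
        raise ValueError("unsupported uintN size")
    # int.to_bytes raises OverflowError if the value is negative or too large.
    return value.to_bytes(bits // 8, "little")


def serialise_boolean(value: bool) -> bytes:
    """A boolean is always exactly one byte."""
    return b"\x01" if value else b"\x00"


print(serialise_uint(12345, 16).hex())  # 3930
print(serialise_uint(12345, 32).hex())  # 39300000
print(serialise_boolean(True).hex())    # 01
print(serialise_boolean(False).hex())   # 00
```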
I have embedded some examples in the following descriptions. You can run them yourself if you set up the Eth2 spec as per the instructions in the Appendices. The examples can be run via the Python REPL or by putting the commands in a file (I show both approaches).
    >>> from eth2spec.utils.ssz.ssz_typing import uint64, boolean
    >>> uint64(0x0123456789abcdef).encode_bytes().hex()
    'efcdab8967452301'
    >>> boolean(True).encode_bytes().hex()
    '01'
    >>> boolean(False).encode_bytes().hex()
    '00'
Composite types hold combinations of or multiples of smaller types. The spec defines the following composite types: vectors, lists, bitvectors, bitlists, unions, and containers. I will skip unions in the following as they are not currently used in Ethereum 2.
A vector is an ordered fixed-length homogeneous collection with exactly N values. "Homogeneous" means that all the elements of a vector must be of the same type, but they do not need to be of the same size. For example, we could have a vector containing lists that each have different numbers of elements.
In the SSZ spec a vector is denoted by Vector[type, N]. For example Vector[uint8, 32] is a 32 element list of uint8 types (bytes). The type can be anything, including other vectors or even containers.
Vectors provide a simple example of needing to know what kind of object you are deserialising before you attempt it. In the following example, the same string of bytes encodes both a four element set of two-byte integers, and an eight element set of one-byte integers. When we deserialise this we need to know which of these (or many other possibilities) we are expecting to get.
    >>> from eth2spec.utils.ssz.ssz_typing import uint8, uint16, Vector
    >>> Vector[uint16, 4](1, 2, 3, 4).encode_bytes().hex()
    '0100020003000400'
    >>> Vector[uint8, 8](1, 0, 2, 0, 3, 0, 4, 0).encode_bytes().hex()
    '0100020003000400'
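The same point can be made from the deserialising side. Here is a small sketch, using Python's standard struct module rather than the Eth2 spec package, that reads the same eight bytes under each of the two assumed types.

```python
import struct

data = bytes.fromhex("0100020003000400")

# Interpreted as Vector[uint16, 4]: four little-endian two-byte integers.
as_uint16 = list(struct.unpack("<4H", data))

# Interpreted as Vector[uint8, 8]: eight one-byte integers.
as_uint8 = list(data)

print(as_uint16)  # [1, 2, 3, 4]
print(as_uint8)   # [1, 0, 2, 0, 3, 0, 4, 0]
```

Nothing in the bytes themselves tells us which reading is the right one; that knowledge has to come from the context.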
Fun fact: in early versions of the SSZ spec, vectors were called tuples.
A list is an ordered variable-length homogeneous collection with a maximum of N values.
In the SSZ spec a list is denoted by List[type, N]. For example, List[uint64, 100] is a list containing anywhere between zero and one hundred uint64 types.
The maximum length parameter, N, on lists is not used in serialisation or deserialisation. It is used, however, in Merkleization, and in particular enables generalised indices in Merkle proof generation.
Both vectors and lists have the same serialisation when they are treated as stand-alone objects:
    >>> from eth2spec.utils.ssz.ssz_typing import uint8, List, Vector
    >>> List[uint8, 100](1, 2, 3).encode_bytes().hex()
    '010203'
    >>> Vector[uint8, 3](1, 2, 3).encode_bytes().hex()
    '010203'
So why not use lists everywhere? Since lists are variable sized objects in SSZ they are encoded differently from fixed sized vectors when contained within another object, so there is a small overhead. The container Foo holding the variable sized list is encoded with an extra four-byte offset at the start. We'll see why a bit later.
    >>> from eth2spec.utils.ssz.ssz_typing import uint8, Vector, List, Container
    >>> class Foo(Container):
    ...     x: List[uint8, 3]
    ...
    >>> class Bar(Container):
    ...     x: Vector[uint8, 3]
    ...
    >>> Foo(x = [1, 2, 3]).encode_bytes().hex()
    '04000000010203'
    >>> Bar(x = [1, 2, 3]).encode_bytes().hex()
    '010203'
A bitvector is an ordered fixed-length collection of boolean values with N bits. In the SSZ spec, a bitvector is denoted by Bitvector[N].
It is not obvious from the spec, but bitvectors use little-endian bit format:
    >>> from eth2spec.utils.ssz.ssz_typing import Bitvector
    >>> Bitvector[8](0,0,0,0,0,0,0,1).encode_bytes().hex()
    '80'
Bitvectors are encoded into the minimum necessary number of whole bytes ((N + 7) // 8, that is, N / 8 rounded up) and padded with zeroes in the high bits if N is not a multiple of 8.
As noted in the spec, functionally we could use either Vector[boolean, N] or Bitvector[N] to represent a list of bits. However, the latter will have a serialisation up to eight times shorter in practice since the former will use a whole byte per bit.
    >>> from eth2spec.utils.ssz.ssz_typing import Vector, Bitvector, boolean
    >>> Bitvector[5](1,0,1,0,1).encode_bytes().hex()
    '15'
    >>> Vector[boolean,5](1,0,1,0,1).encode_bytes().hex()
    '0100010001'
The same consideration applies for lists and bitlists.
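To make the bit ordering and padding concrete, here is a plain Python sketch of bitvector packing. The function serialise_bitvector is a hypothetical helper of my own, not the spec's implementation; it assumes bit i of the collection lands in byte i // 8 at bit position i % 8, which is consistent with the outputs shown above.

```python
def serialise_bitvector(bits: list) -> bytes:
    """Pack bits into whole bytes, lowest-order bit first within each byte."""
    out = bytearray((len(bits) + 7) // 8)  # round up to a whole number of bytes
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


print(serialise_bitvector([0, 0, 0, 0, 0, 0, 0, 1]).hex())  # 80
print(serialise_bitvector([1, 0, 1, 0, 1]).hex())           # 15
```

Any unused high-order bits of the final byte are simply left as zeroes.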
A bitlist is an ordered variable-length collection of boolean values with a maximum of N bits. In the SSZ spec, a bitlist is denoted by Bitlist[N].
An interesting feature of bitlists[3] is that they use a sentinel bit to indicate the length of the list. The number of whole bytes in the bitlist is easily derived from the offsets in the serialisation, but that doesn't give us the precise number of bits. For example, in a naive scheme 13 bits would be serialised into two bytes, so we would only know that the actual list length is somewhere between 9 and 16 bits.
To resolve this problem, bitlist serialisation adds an extra 1 bit at the end of the list (which becomes the highest-order bit in the little-endian encoding). The exact length of the bitlist can then be found by ignoring any consecutive high-order zero bits and then stripping off the single sentinel bit.
As an example, this bitlist with three elements is encoded into a single byte. To deserialise this, we take the total length in bits (eight), skip the four high-order zero bits, skip the sentinel bit, and then our list comprises the remaining three bits. Equivalently, the bitlist length is the index of the highest 1 bit in the serialisation.
    >>> from eth2spec.utils.ssz.ssz_typing import Bitlist
    >>> # The maximum length (10 here, chosen arbitrarily) does not affect the serialisation
    >>> Bitlist[10](0,0,0).encode_bytes().hex()
    '08'
As a consequence of the sentinel, we require an extra byte to serialise a bitlist if its actual length is a multiple of eight (irrespective of the maximum length). This is not the case for a bitvector.
    >>> Bitlist[8](0,0,0,0,0,0,0,0).encode_bytes().hex()
    '0001'
    >>> Bitvector[8](0,0,0,0,0,0,0,0).encode_bytes().hex()
    '00'
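The sentinel mechanism is easy to get wrong by one bit, so here is a plain Python sketch of both directions. The helpers serialise_bitlist and deserialise_bitlist are hypothetical names of my own rather than anything from the Eth2 spec package, and the sketch ignores the maximum length N since it plays no part in the byte layout.

```python
def serialise_bitlist(bits: list) -> bytes:
    """Append a 1 as the sentinel, then pack lowest-order bit first."""
    with_sentinel = list(bits) + [1]
    out = bytearray((len(with_sentinel) + 7) // 8)
    for i, bit in enumerate(with_sentinel):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def deserialise_bitlist(data: bytes) -> list:
    """Recover the bits; the highest set bit overall is the sentinel."""
    if not data or data[-1] == 0:
        raise ValueError("missing sentinel bit")
    as_int = int.from_bytes(data, "little")
    length = as_int.bit_length() - 1  # index of the sentinel bit
    return [(as_int >> i) & 1 for i in range(length)]


print(serialise_bitlist([0, 0, 0]).hex())             # 08
print(serialise_bitlist([0] * 8).hex())               # 0001
print(deserialise_bitlist(bytes.fromhex("08")))       # [0, 0, 0]
print(len(deserialise_bitlist(bytes.fromhex("0001"))))  # 8
```

Note that a final byte of zero can never be valid, since the sentinel always occupies the last byte.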
A container is an ordered heterogeneous collection of values. Basically, a container can contain any arbitrary mix of types, including containers.
We define containers using Python's dataclass notation with key–type pairs. For example, this is a Deposit container. In the following examples I have indicated the underlying types in the appended comments.
    class Deposit(Container):
        proof: Vector[Bytes32, DEPOSIT_CONTRACT_TREE_DEPTH + 1]  # Vector[Vector[uint8, 32], N]
        data: DepositData
The Deposit container contains a DepositData container which is defined as follows.
    class DepositData(Container):
        pubkey: BLSPubkey               # Bytes48 / Vector[uint8, 48]
        withdrawal_credentials: Bytes32 # Vector[uint8, 32]
        amount: Gwei                    # uint64
        signature: BLSSignature         # Bytes96 / Vector[uint8, 96]
We'll see how containers are serialised in the worked example, below.
SSZ distinguishes between fixed size and variable size types, and treats them differently when they are contained within other types.
- Variable size types are lists, bitlists, and any type that contains a variable size type.
- Everything else is fixed size.
This distinction is important when we serialise a compound type. The serialised output is created in two parts, as follows.
- The serialisation of fixed length types, along with 32-bit offsets to any variable length types.
- The serialisation of any variable length types.
This split between a fixed length part and a variable length part came about as a result of the offset encoding described earlier: it allows fast access to specific fields within a serialised data structure without needing to deserialise the whole thing.
As an example, consider the following container. It has a single fixed length uint8 type, followed by a variable length List[uint8,10] type, followed again by a fixed length uint8 type.
    >>> from eth2spec.utils.ssz.ssz_typing import uint8, List, Container
    >>> class Baz(Container):
    ...     x: uint8
    ...     y: List[uint8, 10]
    ...     z: uint8
    ...
    >>> Baz(x = 1, y = [2, 3], z = 4).encode_bytes().hex()
    '0106000000040203'
We see that the serialisation contains an unexpected 0x06 byte and some zero bytes. To see where they come from I'll break down the output as follows, where the first column is the byte number in the serialised string.
    Start of Part 1 (fixed size elements)
    00 01       - The serialisation of x = uint8(1)
    01 06000000 - A 32-bit offset to byte 6 (in little-endian format), the start of the serialisation of y
    05 04       - The serialisation of z = uint8(4)
    Start of Part 2 (variable size elements)
    06 0203     - The serialisation of y = List[uint8, N]([2, 3])
In Part 1, instead of directly encoding the variable size list in place, it is replaced with a pointer (an offset) to its serialisation in Part 2. So, for any container, the size of Part 1 is known and fixed no matter what kinds of variable size types are present. The actual lengths of the variable size objects can be deduced from the offsets in Part 1 and the overall length of the serialisation string.
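The two-part layout can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not the spec's algorithm verbatim: the function serialise_container is hypothetical, and it takes each field as a pair of a flag (is the field variable size?) and the field's already-serialised bytes.

```python
def serialise_container(fields: list) -> bytes:
    """Serialise a container from (is_variable_size, serialised_bytes) pairs."""
    # Part 1 holds the bytes of each fixed size field, or a 4-byte offset
    # for each variable size field, so its total length is known up front.
    part1_len = sum(4 if is_variable else len(data) for is_variable, data in fields)
    part1, part2 = b"", b""
    for is_variable, data in fields:
        if is_variable:
            # Offsets count from the start of this container's serialisation.
            part1 += (part1_len + len(part2)).to_bytes(4, "little")
            part2 += data
        else:
            part1 += data
    return part1 + part2


baz = serialise_container([
    (False, bytes([1])),    # x: uint8
    (True, bytes([2, 3])),  # y: List[uint8, 10]
    (False, bytes([4])),    # z: uint8
])
print(baz.hex())  # 0106000000040203
```

The same helper reproduces the four-byte overhead we saw earlier when a container holds a single list: a lone variable size field gets an offset of 4 in front of its data.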
It's not only containers that use this format, it applies to any type that contains variable size types. Here's a vector whose elements are lists. As an exercise for the reader I'll leave you to decode what's going on here.
    >>> from eth2spec.utils.ssz.ssz_typing import uint8, List, Vector
    >>> Vector[List[uint8,3],4]([1,2],[3,4,5],[],[6]).encode_bytes().hex()
    '10000000120000001500000015000000010203040506'
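If you would like to check your answer to the exercise, the following plain Python sketch splits such a serialisation back into its elements. The helper split_variable_items is hypothetical; it relies on the fact that the first offset points just past Part 1, so dividing it by four gives the number of elements.

```python
def split_variable_items(data: bytes) -> list:
    """Split the serialisation of a sequence of variable size items."""
    if not data:
        return []
    first_offset = int.from_bytes(data[0:4], "little")
    count = first_offset // 4  # Part 1 holds one 4-byte offset per item
    offsets = [int.from_bytes(data[4 * i:4 * i + 4], "little") for i in range(count)]
    # Each item ends where the next one starts; the last ends with the data.
    ends = offsets[1:] + [len(data)]
    return [data[start:end] for start, end in zip(offsets, ends)]


data = bytes.fromhex("10000000120000001500000015000000010203040506")
items = [list(item) for item in split_variable_items(data)]
print(items)  # [[1, 2], [3, 4, 5], [], [6]]
```

Two equal consecutive offsets, as with 0x15 here, indicate an empty element. A real deserialiser would also need to validate that the offsets are in range and non-decreasing, which this sketch does not do.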
For convenience we alias:
- byte to uint8 (this is a basic type)
- BytesN to Vector[byte, N] (this is not a basic type)
In the main beacon chain spec, a bunch of custom types are also defined in terms of the standard SSZ types and aliases. For example, Slot is an SSZ uint64 type, BLSPubkey is an SSZ Bytes48 type, and so on.
Finally, each type has a default value. Once again directly from the SSZ spec:
Let's explore a worked example to gather all of this together. I'd rather use a real example than make up a synthetic object, so we are going to look at the aggregate IndexedAttestation that was included in the beacon chain block at slot 3080831, at position 87 within the block. (It would actually have been an Attestation object in the block, but those bitlists are fiddly, so we'll look at the equivalent IndexedAttestation instead.)
The IndexedAttestation container looks like this.
    class IndexedAttestation(Container):
        attesting_indices: List[ValidatorIndex, MAX_VALIDATORS_PER_COMMITTEE]
        data: AttestationData
        signature: BLSSignature
It contains an AttestationData container,
    class AttestationData(Container):
        slot: Slot
        index: CommitteeIndex
        beacon_block_root: Root
        source: Checkpoint
        target: Checkpoint
which in turn contains two Checkpoint containers.
    class Checkpoint(Container):
        epoch: Epoch
        root: Root
Now we have enough information to build the IndexedAttestation object and calculate its SSZ serialisation.
    from eth2spec.utils.ssz.ssz_typing import *
    from eth2spec.capella import mainnet
    from eth2spec.capella.mainnet import *

    attestation = IndexedAttestation(
        attesting_indices = [33652, 59750, 92360],
        data = AttestationData(
            slot = 3080829,
            index = 9,
            beacon_block_root = '0x4f4250c05956f5c2b87129cf7372f14dd576fc152543bf7042e963196b843fe6',
            source = Checkpoint(
                epoch = 96274,
                root = '0xd24639f2e661bc1adcbe7157280776cf76670fff0fee0691f146ab827f4f1ade'
            ),
            target = Checkpoint(
                epoch = 96275,
                root = '0x9bcd31881817ddeab686f878c8619d664e8bfa4f8948707cba5bc25c8d74915d'
            )
        ),
        signature = '0xaaf504503ff15ae86723c906b4b6bac91ad728e4431aea3be2e8e3acc888d8af'
                    + '5dffbbcf53b234ea8e3fde67fbb09120027335ec63cf23f0213cc439e8d1b856'
                    + 'c2ddfc1a78ed3326fb9b4fe333af4ad3702159dbf9caeb1a4633b752991ac437'
    )

    print(attestation.encode_bytes().hex())
The resulting serialised blob of data that represents this IndexedAttestation object is (in hexadecimal):
    e40000007d022f000000000009000000000000004f4250c05956f5c2b87129cf7372f14dd576fc15
    2543bf7042e963196b843fe61278010000000000d24639f2e661bc1adcbe7157280776cf76670fff
    0fee0691f146ab827f4f1ade13780100000000009bcd31881817ddeab686f878c8619d664e8bfa4f
    8948707cba5bc25c8d74915daaf504503ff15ae86723c906b4b6bac91ad728e4431aea3be2e8e3ac
    c888d8af5dffbbcf53b234ea8e3fde67fbb09120027335ec63cf23f0213cc439e8d1b856c2ddfc1a
    78ed3326fb9b4fe333af4ad3702159dbf9caeb1a4633b752991ac437748300000000000066e90000
    00000000c868010000000000
This can be transmitted as a string of bytes over the wire and, knowing at the other end that it represents an IndexedAttestation, reconstituted into an identical copy.
To make sense of this, we'll break down the serialisation into its parts. The first column is the byte-offset from the start of the byte string (in hexadecimal). Before each line I've indicated which part of the data structure it corresponds to, and I've translated the type aliases into their basic underlying SSZ types. Remember that all integer types are little-endian, so 7d022f0000000000 is the hexadecimal number 0x2f027d, which is 3080829 in decimal (the slot number).
    Start of Part 1 (fixed size elements)
    4-byte offset to the variable length attestation.attesting_indices starting at 0xe4
    00 e4000000
    attestation.data.slot: Slot / uint64
    04 7d022f0000000000
    attestation.data.index: CommitteeIndex / uint64
    0c 0900000000000000
    attestation.data.beacon_block_root: Root / Bytes32 / Vector[uint8, 32]
    14 4f4250c05956f5c2b87129cf7372f14dd576fc152543bf7042e963196b843fe6
    attestation.data.source.epoch: Epoch / uint64
    34 1278010000000000
    attestation.data.source.root: Root / Bytes32 / Vector[uint8, 32]
    3c d24639f2e661bc1adcbe7157280776cf76670fff0fee0691f146ab827f4f1ade
    attestation.data.target.epoch: Epoch / uint64
    5c 1378010000000000
    attestation.data.target.root: Root / Bytes32 / Vector[uint8, 32]
    64 9bcd31881817ddeab686f878c8619d664e8bfa4f8948707cba5bc25c8d74915d
    attestation.signature: BLSSignature / Bytes96 / Vector[uint8, 96]
    84 aaf504503ff15ae86723c906b4b6bac91ad728e4431aea3be2e8e3acc888d8af
    a4 5dffbbcf53b234ea8e3fde67fbb09120027335ec63cf23f0213cc439e8d1b856
    c4 c2ddfc1a78ed3326fb9b4fe333af4ad3702159dbf9caeb1a4633b752991ac437

    Start of Part 2 (variable size elements)
    attestation.attesting_indices: List[uint64, MAX_VALIDATORS_PER_COMMITTEE]
    e4 748300000000000066e9000000000000c868010000000000
The first thing to notice is that the attesting_indices list is variable size, so it is represented in Part 1 by an offset pointing to where the actual data is: in this case, 0xe4 bytes (228 bytes) from the start of the serialised data. The number of elements in the list is the length of the whole string (252 bytes) minus the start of the list (228 bytes), all divided by 8 bytes per element: (252 - 228) / 8 = 3. And so, we recover our list of three validator indices.
All the remaining items are fixed size, and are encoded in-place, including recursively encoding the fixed size AttestationData object, and its fixed size Checkpoint objects.
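Because every field in Part 1 sits at a known position, we can pull individual values straight out of the serialised bytes without deserialising the whole object. The following plain Python sketch does this for the blob above, using only the byte positions from the breakdown; the helper name uint is my own.

```python
blob = bytes.fromhex(
    "e40000007d022f000000000009000000000000004f4250c05956f5c2b87129cf7372f14dd576fc15"
    "2543bf7042e963196b843fe61278010000000000d24639f2e661bc1adcbe7157280776cf76670fff"
    "0fee0691f146ab827f4f1ade13780100000000009bcd31881817ddeab686f878c8619d664e8bfa4f"
    "8948707cba5bc25c8d74915daaf504503ff15ae86723c906b4b6bac91ad728e4431aea3be2e8e3ac"
    "c888d8af5dffbbcf53b234ea8e3fde67fbb09120027335ec63cf23f0213cc439e8d1b856c2ddfc1a"
    "78ed3326fb9b4fe333af4ad3702159dbf9caeb1a4633b752991ac437748300000000000066e90000"
    "00000000c868010000000000"
)


def uint(data: bytes) -> int:
    """Decode a little-endian unsigned integer of any width."""
    return int.from_bytes(data, "little")


offset = uint(blob[0x00:0x04])        # start of attesting_indices
slot = uint(blob[0x04:0x0c])          # data.slot
source_epoch = uint(blob[0x34:0x3c])  # data.source.epoch

# The list runs from its offset to the end of the blob, 8 bytes per uint64.
count = (len(blob) - offset) // 8
indices = [uint(blob[offset + 8 * i:offset + 8 * i + 8]) for i in range(count)]

print(len(blob), offset, count)  # 252 228 3
print(slot, source_epoch)        # 3080829 96274
print(indices)                   # [33652, 59750, 92360]
```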
It is instructive to see how a container with multiple variable size child objects is serialised. For this example we will make an AttesterSlashing object that contains two of the above IndexedAttestation objects. This is a contrived example; the slashing report is not valid since the contents are duplicates.
The AttesterSlashing container is defined as follows,
    class AttesterSlashing(Container):
        attestation_1: IndexedAttestation
        attestation_2: IndexedAttestation
which we can populate and serialise like this, using our previously defined attestation object.
    slashing = AttesterSlashing(
        attestation_1 = attestation,
        attestation_2 = attestation
    )

    print(slashing.encode_bytes().hex())
From this we get the following serialisation, again shown with the byte-offset within the byte string in the first column.
    Start of Part 1 (fixed size elements)
    0000 08000000
    0004 04010000
    Start of Part 2 (variable size elements)
    0008 e40000007d022...
    0104 e40000007d022...
This time we have two variable length types, so they are both replaced by offsets pointing to the start of the actual variable length data which appears in Part 2. The length of attestation_1 is calculated as the difference between the two offsets, and the length of attestation_2 is calculated as the length from its offset to the end of the string.
Another thing to note is that, since attestation_1 and attestation_2 are identical, their serialisations within this compound object are identical, including their internal offsets to their own variable length parts. That is, both attestations have variable length data at offset 0xe4 within their own serialisations; the offset is relative to the start of each sub-object's serialisation, not the entire string. This property simplifies recursive serialisation and deserialisation: a given object will have the same serialisation no matter what context it is found in.
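Here is a plain Python sketch of the relative offsets at work. To keep it short, I use a 252-byte stand-in for the serialised IndexedAttestation: only its leading internal offset (0xe4) is real, and the rest is zero padding. The function serialise_two_variable_fields is a hypothetical helper, not part of the Eth2 spec package.

```python
def serialise_two_variable_fields(first: bytes, second: bytes) -> bytes:
    """Serialise a container whose two fields are both variable size."""
    part1_len = 2 * 4  # one 4-byte offset per field
    offsets = part1_len.to_bytes(4, "little")
    offsets += (part1_len + len(first)).to_bytes(4, "little")
    return offsets + first + second


# Stand-in child: the real internal offset followed by 248 bytes of padding.
child = (0xe4).to_bytes(4, "little") + bytes(248)
slashing = serialise_two_variable_fields(child, child)

# Part 1 of the outer container: offsets 0x08 and 0x104.
print(slashing[:8].hex())  # 0800000004010000

# Each child's own internal offset is unchanged by being embedded.
print(slashing[0x008:0x00c].hex(), slashing[0x104:0x108].hex())  # e4000000 e4000000

# The child lengths follow from the offsets and the total length.
first_len = 0x104 - 0x008
second_len = len(slashing) - 0x104
print(first_len, second_len)  # 252 252
```

Had the internal offsets been absolute, the second child would have needed rewriting when it was embedded, and a given object would no longer have a single serialisation.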
The historical discussion threads around whether to use SSZ for both consensus and p2p serialisation or not are a goldmine of insight and wisdom.
- Possibly the first substantial discussion around which serialisation scheme to adopt. It covers various alternatives, touches on the p2p vs. consensus issues, and rehearses some of the desirable properties.
- An early discussion of SSZ went over some of the issues and led into the discussion below.
- Proposal to use SSZ for consensus only.
- Piper Merriam's Everything You Never Wanted To Know About Serialization remains a good summary of many of the considerations.
Other SSZ resources:
- SSZ encoding diagrams by Protolambda.
- Formal verification of the SSZ specification: Notes and Code.
- An excellent SSZ explainer by Raul Jordan with a deep dive into implementing it in Golang. (Note that the specific library referenced in the article has now been deprecated in favour of fastssz.)
- An interactive SSZ serialiser/deserialiser by ChainSafe with all the containers for the various consensus layer upgrades available to play with. On the "Deserialize" tab you can paste the data from the IndexedAttestation above and verify that it deserialises correctly (you'll need to remove line breaks).
1. Thus enshrining that ugly "z" in the full name, and the ghastly "ess-ess-zee" pronunciation.
2. Vitalik, "As the inventor of RLP, I'm inclined to prefer SSZ", and again, "RLP honestly sucks" (with some explanation as to why!).
3. Though not entirely uncontroversial. Basically, if the application layer already knows what length of bitlist it expects – which it generally does in Eth2, since although committee sizes change, the sizes are known – then we could in principle dispense with the sentinel bit.
4. An instructive discussion of the wisdom or otherwise of aliasing byte to uint8 was sparked when we began defining a canonical JSON mapping for SSZ data. The words "fight to the death" were used.