[This is an expanded idea, I don't have an implementation to back it up yet]
Every piece of data in Ethereum is identified by its hash. For structured data, this means serializing it down into a flat binary blob and then running a cryptographically secure hash function on it. Since the hash depends on the data format, the serialization algorithm needs to be standardized and deterministic across all implementations.
There's a wide range of data serialization formats in existence, each usually honed to specific use cases. Ethereum has a unique requirement that makes most of the well known serialization formats completely inappropriate: to permit smart contracts to access encoded data, we are bound by the abysmal computing capacity of the Ethereum Virtual Machine. The primary goal is to make data access fast and dumb; everything else is secondary.
A previous proposal called Simple Serialization (SSZ) achieved many of the simplicity requirements for primitive data types, but it incurs O(N) access time for complex dynamic types (e.g. list of transactions). This spec defines an alternative that guarantees O(log N) access times for arbitrary nested data.
SOS supports three primitive types: booleans, unsigned integers and fixed-size binary blobs. The encoding of the primitive data types is simply their binary content (little endian for integers).
Type | Constraints | Examples |
---|---|---|
bool | | false → 0x00, true → 0x01 |
uintN | N % 8 == 0, 8 <= N <= 256 | uint16(256) → 0x0001 |
bytesN | N < 2^32 | bytes3("sos") → 0x736f73 |
SOS supports a single list type, containing arbitrary heterogeneous items. All other list variations (homogeneous, primitives) are subsets and implicitly supported.
Lists are encoded into three concatenated binary segments (some may be empty):
- The first segment contains all the primitive items and is the concatenation of the individually encoded primitives.
- The third segment contains all the dynamic items and is the concatenation of the individually encoded dynamics.
- The middle segment contains an offset index to permit direct data lookups and is the concatenation of the starting offsets of the dynamic items, encoded as `[]uint32`, where the starting offset of an item is the length of the primitive segment + the length of the offset segment + the length of all previous dynamic items.
Type | Example |
---|---|
[]bool | {true, false, true} → 0x010001 |
[]uint8 | "sos" → 0x736f73 |
[]uint16 | {256, 255} → 0x0001ff00 |
[]bytes3 | {"foo", "bar", "baz"} → 0x666f6f62617262617a |
[][]uint8 | {"Simple", "Offset", "Serialization"} → 0x0c000000 0x12000000 0x18000000 (offsets 0+12+0, 0+12+6, 0+12+12) 0x53696d706c65 0x4f6666736574 0x53657269616c697a6174696f6e (content) |
struct{bool, []uint8} | {true, "sos"} → 0x01 (primitive) 0x05000000 (offset 1+4+0) 0x736f73 (content "sos") |
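To make the offset arithmetic concrete, here is a rough list-encoder sketch in the same vein (untested; splitting the pre-encoded items into primitives and dynamics is an assumption of the sketch, not something the format mandates):

```go
package sos

import "encoding/binary"

// encodeList concatenates the three segments of a list: the encoded
// primitive items, the []uint32 offset index and the encoded dynamic
// items. Each item is assumed to be individually encoded already.
func encodeList(primitives, dynamics [][]byte) []byte {
	var out []byte
	for _, p := range primitives {
		out = append(out, p...)
	}
	// The first dynamic item starts after the primitive segment and the
	// offset index itself (4 bytes per dynamic item); each subsequent
	// item starts after all the previous dynamic items.
	offset := uint32(len(out) + 4*len(dynamics))
	for _, d := range dynamics {
		var idx [4]byte
		binary.LittleEndian.PutUint32(idx[:], offset)
		out = append(out, idx[:]...)
		offset += uint32(len(d))
	}
	for _, d := range dynamics {
		out = append(out, d...)
	}
	return out
}
```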
Note, the `bytes` type defined in the SSZ format is simply a fancy name for `[]uint8` or `[]bytes1` in SOS. Implementations can feature this syntactic sugar, but there's no difference in the binary format.
Retrieving primitive items is straightforward. To extract a dynamic item, retrieve the starting offset of the required item and use the next offset to calculate the desired item's length. The last dynamic item's length can be calculated from the overall size of the list being parsed, which is propagated down during parsing.
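A sketch of that lookup, under the same assumptions as above; the primitive segment length and the dynamic item count are known from the list's static type, and the overall blob size is known out of band:

```go
package sos

import "encoding/binary"

// dynamicItem slices the i-th dynamic item out of an encoded list in
// O(1), without touching any of the other items.
func dynamicItem(blob []byte, primLen, dynamics, i int) []byte {
	// Read the i-th entry of the offset index.
	idx := primLen + 4*i
	start := binary.LittleEndian.Uint32(blob[idx : idx+4])

	// The item ends where the next one starts; the last item ends at
	// the end of the blob, whose size the caller must supply.
	end := uint32(len(blob))
	if i < dynamics-1 {
		end = binary.LittleEndian.Uint32(blob[idx+4 : idx+8])
	}
	return blob[start:end]
}
```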
To retrieve the very last dynamic item, the parser needs to know where the data stream ends (only the starting offset is calculable from the encoded data). If we load the data from a self-contained location (e.g. a database entry), this is fine, as we can pass the size to the parser. If we're streaming data over the network, the addition of a delimiter (e.g. a length prefix) solves it.
The length prefix could be added as an envelope around all encodings, but that would be redundant data that can be derived by other means in all practical cases, so it should not be part of the actual serialized binary blob.
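For the streaming case, one possible framing (illustrative only; the prefix lives in the transport layer, outside the SOS blob):

```go
package sos

import (
	"encoding/binary"
	"io"
)

// writeFrame length-prefixes an encoded blob so a streaming receiver
// can recover the overall size needed to bound the last dynamic item.
func writeFrame(w io.Writer, blob []byte) error {
	var size [4]byte
	binary.LittleEndian.PutUint32(size[:], uint32(len(blob)))
	if _, err := w.Write(size[:]); err != nil {
		return err
	}
	_, err := w.Write(blob)
	return err
}
```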
Neither SSZ nor SOS is a streaming encoder. This means it might take quite a lot of memory shuffling and allocations to actually implement either encoder. Are we sure this won't hit back at larger data structures?