We’ve been saying for a long time that we need “something like a CAP theorem for IPLD.” By this, I mean that we need a way to talk about the tradeoffs that are made when designing data structures in IPLD.
This is going to end up looking a lot different from the CAP theorem, but the end goal is to have something that gives us a blueprint for discussing the performance tradeoffs of different data structure designs.
Anyway, here’s my attempt at a first draft. It’s incomplete but I need some feedback in order to iterate on it and turn it into a proper PR.
To put it simply, the performance of a data structure can be measured in two ways: read speed and write speed.
When examined thoroughly, there are only two primary sources of cost in these designs:
- Storage Space
- Link Traversal
In other words, the amount of storage space a particular data structure consumes in bytes, and the number of links that must be traversed to perform an operation.
Quantifying the cost of each of these factors is non-trivial and application specific.
The storage space consumed by a structure depends not only on its design but also on the codec used and on overhead in the storage layer: it isn’t just the block data that contributes to storage consumption, but also the key of each block. Advanced storage engines may also index the links of each block in order to provide efficient graph replication or to improve garbage collection.
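As a rough illustration, here’s a minimal sketch of per-block storage accounting in Python. The key size and index-entry size are assumptions for the sake of the example, not measurements of any particular codec or storage engine.

```python
# Hypothetical per-block storage accounting: encoded data + block key,
# plus optional per-link index overhead in the storage engine.

def block_storage_cost(data_size: int, num_links: int,
                       key_size: int = 36,         # e.g. a CIDv1 sha2-256 key
                       index_entry_size: int = 0   # > 0 if the engine indexes links
                       ) -> int:
    return data_size + key_size + num_links * index_entry_size

# A structure's footprint is the sum over its blocks, so the codec (which
# determines data_size) and the engine's indexing strategy both matter.
blocks = [{"size": 1024, "links": 4}, {"size": 256, "links": 0}]
total = sum(block_storage_cost(b["size"], b["links"], index_entry_size=40)
            for b in blocks)
```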
While the number of links in a given structure is predictable based on its design, the cost of link traversal itself is not. Data locality, expected cache states, and advanced replication protocols all play a role in the cost of link traversal. To a lesser extent, de-serialization and hashing speed also play a role.
The primary cost of mutation is garbage generation. Yes, there is a cost to writing the data itself (serialization and hashing), but it is minimal compared to the cost of garbage generation. Writes can be made concurrent by distributing the mutation task, but garbage collection is much harder to parallelize, and orphaned nodes have often already been persisted through the network and stored all over the world.
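Here’s a small sketch of that dynamic, under the (hypothetical) assumption that a mutation rewrites every node on the path from the changed leaf to the root, orphaning the old copies:

```python
# Every mutation of a hash-linked structure produces new blocks for the
# changed node and all of its ancestors; the old copies become garbage.

def orphaned_bytes(path_node_sizes: list[int]) -> int:
    """Bytes orphaned by a single mutation: the old copy of each rewritten node."""
    return sum(path_node_sizes)

# A 3-level path of 8 KiB nodes orphans ~24 KiB per write, no matter how
# small the logical change was.
print(orphaned_bytes([8192, 8192, 8192]))  # 24576
```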
While we do spend a lot of time optimizing the computational speed of creating these data structures, those costs are an order of magnitude less than the costs associated with storage and traversal, since the latter touch, at best, a local disk or, at worst, the network.
Read speed can be quantified as the storage cost in bytes of the sub-graph required for a specific read, divided by the retrieval speed of the block data. The retrieval cost for blocks stored in memory will be near zero, a bit more for local storage, and significantly more for network retrievals.
You must then add the round-trip time of each link traversal over the given protocol if the blocks are not available locally.
( bytes / retrieval speed ) + ( links * round-trip time )
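In code, that might look like the sketch below; the throughput and latency figures are illustrative assumptions, not benchmarks.

```python
def read_cost(bytes_read: int, links: int,
              retrieval_speed: float,   # bytes per second
              round_trip_time: float    # seconds per remote link traversal
              ) -> float:
    """Estimated seconds for a read: transfer time plus per-link round trips."""
    return (bytes_read / retrieval_speed) + (links * round_trip_time)

# The same 1 MiB read over 10 links under three hypothetical scenarios:
in_memory  = read_cost(2**20, 10, retrieval_speed=10e9,  round_trip_time=0)
local_disk = read_cost(2**20, 10, retrieval_speed=500e6, round_trip_time=0.0001)
network    = read_cost(2**20, 10, retrieval_speed=10e6,  round_trip_time=0.05)
```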
In a multi-block structure, like a HAMT, using larger intermediary nodes could reduce the number of links while simultaneously increasing the number of bytes required for a given read. You can imagine which retrieval and caching scenarios would make this tradeoff worth it.
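Plugging hypothetical numbers into the formula above makes the tradeoff concrete; none of these figures are measurements of a real HAMT.

```python
def read_cost(bytes_read, links, retrieval_speed, round_trip_time):
    return bytes_read / retrieval_speed + links * round_trip_time

rtt, speed = 0.05, 10e6  # assumed: 50 ms per round trip, 10 MB/s transfer

# Small intermediary nodes: deeper tree, more traversals, fewer bytes per read.
small_nodes = read_cost(64 * 1024, links=6, retrieval_speed=speed, round_trip_time=rtt)   # ~0.31 s

# Large intermediary nodes: shallower tree, fewer traversals, more bytes per read.
large_nodes = read_cost(512 * 1024, links=2, retrieval_speed=speed, round_trip_time=rtt)  # ~0.15 s

# With a 50 ms round trip the larger nodes win; with a warm local cache
# (round_trip_time near zero) the smaller nodes would win instead.
```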
Round-trip times are highly protocol dependent, and when designing replication protocols you also need to keep these tradeoffs in mind. A point-to-point protocol like Graphsync can reduce the round trips to one no matter how many links are in the retrieval, but as a result it has to ignore any local cache state that can’t be included in the read query itself. As an example, syncing a blockchain works quite well with this protocol because I can ask for everything from the latest root node back to the last root node I stored locally. But if I have a directory of files in IPFS and a single byte changes in a large file, using Graphsync to get the graph of that file will return a large amount of data I already have in cache.
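The same formula can sketch that scenario; the file size, changed bytes, and timings below are all assumptions chosen to illustrate the shape of the tradeoff.

```python
def read_cost(bytes_read, links, retrieval_speed, round_trip_time):
    return bytes_read / retrieval_speed + links * round_trip_time

speed, rtt = 10e6, 0.05  # assumed: 10 MB/s link, 50 ms round trip

# Point-to-point (Graphsync-style): a single round trip, but the responder
# can't see my cache, so it streams the whole 10 MiB file graph.
point_to_point = read_cost(10 * 2**20, links=1, retrieval_speed=speed, round_trip_time=rtt)  # ~1.1 s

# Per-link fetching with a warm cache: only the ~256 KiB of changed blocks
# move, and only the 3 uncached links pay a round trip.
per_link = read_cost(256 * 1024, links=3, retrieval_speed=speed, round_trip_time=rtt)        # ~0.18 s

# Reverse the cache assumptions (a cold sync of a long chain) and the single
# round trip wins instead, which is why the tradeoff is protocol dependent.
```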
We need to start by considering not just what the initial data structure will consume when stored but also what storage it will consume over time as it is mutated.
As an example: using large intermediary nodes in a HAMT will increase the storage cost associated with each write because the blocks that are orphaned as a result will be larger.
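A sketch of how that compounds over time, assuming (hypothetically) that each write rewrites one root and two intermediary nodes, orphaning the old copies:

```python
def garbage_per_write(node_size: int, rewritten_nodes: int = 3) -> int:
    """Bytes orphaned by one write when `rewritten_nodes` blocks are replaced."""
    return node_size * rewritten_nodes

writes = 10_000
small = garbage_per_write(4 * 1024) * writes     # 4 KiB nodes   -> ~123 MB of garbage
large = garbage_per_write(256 * 1024) * writes   # 256 KiB nodes -> ~7.9 GB of garbage
```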
The specific garbage collection method is not entirely relevant to the cost calculation here, because it will vary depending on the storage and persistence layer. Even if the writer has a fast GC, if old copies of the data structure are persisted in the network, then the true cost of the write is related to how much is stored, and for how long, in every persistence layer of the network.