@ZenGround0
Last active July 10, 2024 16:28
Fast Snapshots

Motivation

Today snapshotting of filecoin state into a CAR file from a lotus node takes 2.5 hours on a 256 GB RAM machine. This is the bottleneck for providing snapshots. The other phases are SHA256 hashing (10 minutes) and S3 upload (20 minutes).

This snapshotting time holds even in the presence of significant hardware resources -- 256 GB RAM + 64 GB swap. Filecoin snapshots are smaller than this today; the latest snapshot shows ~120 GB for snapshot size: ipfs://bafybeia5wm3jpf4nviu66w5sz3gu3xcmtwmsasswoar2bnupu3t5wdyl6y/

It should be possible to reduce lotus snapshotting time to the same order as the other steps and remove the bottleneck. A nice feature of the proposed approach is that it should interoperate well with forest snapshot generation and potentially speed that up significantly as well.

Overview

  • We can write a snapshotter (or a better name) -- software solely responsible for in-memory tracking of snapshot state using the lotus API
  • The snapshotter generates a CAR file directly from this in-memory state
  • The snapshotter only uses memory on the order of the snapshot size
  • It keeps the latest 2000 epochs of data in a sliding window, adding the newest and GCing the oldest state in its window as the chain progresses.
  • Estimates are that these operations (GC + lotus API fetches are the bottleneck) should be feasible within 30s, so this will keep up with the chain in steady state.
  • From @Magik6k last year: ~30MB of data and ~20k new blocks every epoch. This should be feasible to transfer in 30s, especially if we use the websocket API which does intelligent batching.
  • To handle reorgs the snapshotter will GC and re-add blocks back and forward along reorged chains. Keeping the snapshotter head a few epochs behind the chain head will help reduce this work if it gets expensive.
  • We'll also need some auxiliary memory for CAR writing and, most importantly, for ref-counted tracking of CID state. Conservatively I guess 20% overhead for this, putting us at ~150 GB RAM needed at the current snapshot size.

Implementation

I'm imagining a simple Go process launched from the CLI.

Lotus API communication

To hear about new blocks in the filecoin blockchain and potential reorgs the snapshotter will use ChainNotify. To transfer IPLD blocks during DAG traversal the snapshotter will use ChainReadObj. IIRC there are some other alternatives worth exploring too, but I couldn't find them at first glance.
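A minimal sketch of that wiring, assuming a lotus node exposing the v1 JSON-RPC API over websocket at ws://127.0.0.1:1234/rpc/v1 (add an auth token header if the node requires one); the "current"/"apply"/"revert" strings correspond to the HeadChange types that ChainNotify reports:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/filecoin-project/lotus/api/client"
)

func main() {
	ctx := context.Background()

	// Connect to a lotus node's v1 JSON-RPC API over websocket (address is illustrative).
	full, closer, err := client.NewFullNodeRPCV1(ctx, "ws://127.0.0.1:1234/rpc/v1", http.Header{})
	if err != nil {
		log.Fatal(err)
	}
	defer closer()

	// ChainNotify streams head changes: the current head on subscribe, then
	// "apply" / "revert" events as the chain advances or reorgs.
	notifs, err := full.ChainNotify(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for changes := range notifs {
		for _, hc := range changes {
			switch hc.Type {
			case "current", "apply":
				// New (or initial) tipset: Link its root CIDs into the sliding
				// window and fetch any missing IPLD blocks with ChainReadObj.
				for _, c := range hc.Val.Cids() {
					raw, err := full.ChainReadObj(ctx, c)
					if err != nil {
						log.Fatal(err)
					}
					_ = raw // hand the block off to the snapshotter's in-memory store
				}
			case "revert":
				// Reorg: Unlink the reverted tipset's root CIDs
				// (see the sliding window section below).
			}
		}
	}
}
```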

Snapshotter sliding window

Last year I built a very simple implementation of indirect ref counting with IPLD interfaces in mind: https://github.com/ZenGround0/saaf. The idea is that with references tracked at the level closest to the root in the DAG, both GC and addition scale in the number of nodes + links being removed or added. It should be straightforward to adapt.

Then just use the saaf MapNodeStore to maintain an in-memory mapping from CID to block for all snapshot state.

At the top level we'll maintain a slice of tipset root CIDs to pin. When new tipset CIDs come in from the lotus API the snapshotter will add them to the slice and run saaf Link on each CID. If everything is hooked up as expected this should take care of the full state traversal. In the steady state the snapshotter will run Unlink on the tipset CIDs 2001 epochs back from the head. To handle reorgs we'll run Unlink to wind back and Link to fast forward.
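To make the window bookkeeping concrete, here is a rough sketch. The Pinner interface below is a hypothetical stand-in for saaf's ref-counted DAG: the Link/Unlink names come from the description above, but the exact saaf signatures are assumptions, so treat this as illustration rather than a drop-in integration.

```go
package snapshotter

import "github.com/ipfs/go-cid"

// Pinner is a hypothetical stand-in for saaf's ref-counted DAG; Link/Unlink
// names follow the text above but the real saaf signatures may differ.
type Pinner interface {
	Link(root cid.Cid) error   // pin root and everything reachable from it
	Unlink(root cid.Cid) error // unpin; blocks whose refcount hits zero get GCed
}

const window = 2000 // epochs of state kept pinned

// slidingWindow tracks the tipset root CIDs pinned for each epoch, oldest first.
type slidingWindow struct {
	pinner Pinner
	epochs [][]cid.Cid
}

// Apply handles a newly applied tipset: pin its root CIDs, then expire the
// epoch that falls out of the window.
func (s *slidingWindow) Apply(tipsetRoots []cid.Cid) error {
	for _, c := range tipsetRoots {
		if err := s.pinner.Link(c); err != nil {
			return err
		}
	}
	s.epochs = append(s.epochs, tipsetRoots)
	if len(s.epochs) > window {
		expired := s.epochs[0]
		s.epochs = s.epochs[1:]
		for _, c := range expired {
			if err := s.pinner.Unlink(c); err != nil {
				return err
			}
		}
	}
	return nil
}

// Revert handles a reorg: unpin the reverted head so Apply can fast forward
// along the new chain.
func (s *slidingWindow) Revert(tipsetRoots []cid.Cid) error {
	for _, c := range tipsetRoots {
		if err := s.pinner.Unlink(c); err != nil {
			return err
		}
	}
	if n := len(s.epochs); n > 0 {
		s.epochs = s.epochs[:n-1]
	}
	return nil
}
```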

Snapshotter CAR generation

We can likely make small modifications to lotus snapshot export functions.

I think this code supports streaming writes to disk since the CAR writer is called throughout traversal, though we need to confirm that and, if false, build it out. Of course we'll need to pass the in-memory map of cid |-> data blocks as the underlying blockstore, which probably means importing the lotus chainstore and using the code unmodified is more trouble than it's worth. We should just copy-paste, add the blockstore as a parameter, and replace the call here with that blockstore.
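For reference, if reusing the lotus export path turns out to be more trouble than it's worth, streaming the pinned blocks out as a CARv1 directly is not much code. A rough sketch using go-car v1's header and LdWrite helpers (which I believe are the same low-level calls lotus's export uses), assuming the sliding window exposes its state as a cid -> raw bytes map:

```go
package snapshotter

import (
	"io"

	"github.com/ipfs/go-cid"
	car "github.com/ipld/go-car"
	carutil "github.com/ipld/go-car/util"
)

// writeSnapshotCAR streams the pinned in-memory blocks out as a CARv1:
// a header listing the root tipset CIDs, followed by (length | cid | data)
// sections for every block in the window.
func writeSnapshotCAR(w io.Writer, roots []cid.Cid, blocks map[cid.Cid][]byte) error {
	if err := car.WriteHeader(&car.CarHeader{Roots: roots, Version: 1}, w); err != nil {
		return err
	}
	for c, data := range blocks {
		if err := carutil.LdWrite(w, c.Bytes(), data); err != nil {
			return err
		}
	}
	return nil
}
```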

Risks and Caveats

Big

  • I might be wrong about lotus API block fetching + GC operations fitting within 30s. If they lag this then this approach doesn't work, as the snapshotter will fall further and further behind. Before fully committing to this project this must be tested. But in order to test it you need to know the set of CIDs that are diffed between two block states, and to get that you'll need to implement a significant amount of the snapshotter service.
  • --edit-- after thinking about this more, the naive fetching in saaf is probably not going to cut it. Assume ~10ms per API query and 20k blocks to fetch: that gets us to 200s if we fetch them all in sequence as is naively done today. It should be possible to change that code to resolve the entire queue of known CIDs at once, and then we can make use of JSON-RPC parallelism, potentially enhanced by websocket batching. As long as there is a branching factor of at least 10 we should be good. In reality it will be able to start off much higher as there are ~10^6 actor state roots to traverse, so we'll want to do bounded parallelism (a sketch follows this list). This works assuming lotus API parallelism is smart enough, which I don't know much about.
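A sketch of what bounded-parallel fetching could look like once the batch of CIDs to resolve is known; the worker count is a guess to be tuned against whatever concurrency the lotus API actually sustains:

```go
package snapshotter

import (
	"context"

	"github.com/filecoin-project/lotus/api"
	"github.com/ipfs/go-cid"
	"golang.org/x/sync/errgroup"
)

// fetchAll resolves a batch of needed CIDs with bounded parallelism instead
// of one sequential ChainReadObj round trip per block.
func fetchAll(ctx context.Context, full api.FullNode, queue []cid.Cid, workers int) (map[cid.Cid][]byte, error) {
	out := make([][]byte, len(queue))
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(workers) // e.g. 32-128, tuned against what the lotus API can sustain
	for i, c := range queue {
		i, c := i, c
		g.Go(func() error {
			data, err := full.ChainReadObj(ctx, c)
			if err != nil {
				return err
			}
			out[i] = data
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	res := make(map[cid.Cid][]byte, len(queue))
	for i, c := range queue {
		res[c] = out[i]
	}
	return res, nil
}
```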

Not so big

  • The simplest implementation needs a full snapshot window to warm up. It should be relatively straightforward to do a forced catchup, but that is engineering work.
  • The GC library linked above has not been run in production before or hooked up to true IPLD node storage, so it will need to be reviewed and debugged. There might be more battle-tested things we could use immediately, but I haven't found anything.
  • GC mapping memory overhead might be more than I'm guessing. We could measure the lotus hotstore markset size to get a more accurate estimate.
  • You'll still need to run lotus alongside this service. I think you won't need more than 100 GB of memory to run a simple lotus hotstore API provider, but I might be underestimating requirements. If lotus needs a lot of memory on top of the ~150 GB for the snapshotter, the total RAM requirement goes up.
  • Running two services is more annoying than running one service.
  • I don't know the state of CAR generation streaming from memory to the filesystem. It may be trivial / already exist with today's CAR libraries, or it may need engineering work to avoid requiring twice the memory.

Opportunities

  • If forest has ChainNotify and ChainReadObj APIs then it can also use this approach to speed things up.
  • An observation here is that it's inherently wasteful to do snapshotting on both lotus and forest when we have blockchain hashes to rely upon. Another approach, separate from this doc, is to do away with lotus snapshotting and instead always check the tipset CIDs of the epoch being snapshotted against the lotus API before running the forest snapshot (a sketch of this check follows below). As long as they match we must be all good (assuming the forest snapshot process is good, which is very well tested in production at this point). Then go back and evaluate whether it's worth the cost to reduce forest snapshotting time using the snapshotter service.
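A sketch of that cross-check against the lotus API; how forest reports the tipset block CIDs it is about to snapshot is left as a hypothetical input (forestTipsetCids):

```go
package snapshotter

import (
	"context"
	"fmt"

	"github.com/filecoin-project/go-state-types/abi"
	"github.com/filecoin-project/lotus/api"
	"github.com/filecoin-project/lotus/chain/types"
	"github.com/ipfs/go-cid"
)

// verifyEpoch checks that the tipset forest is about to snapshot at `epoch`
// matches what the lotus API reports for the same height.
func verifyEpoch(ctx context.Context, full api.FullNode, epoch int64, forestTipsetCids []cid.Cid) error {
	ts, err := full.ChainGetTipSetByHeight(ctx, abi.ChainEpoch(epoch), types.EmptyTSK)
	if err != nil {
		return err
	}
	lotusKey := types.NewTipSetKey(ts.Cids()...)
	forestKey := types.NewTipSetKey(forestTipsetCids...)
	if !lotusKey.Equals(forestKey) {
		return fmt.Errorf("tipset mismatch at epoch %d: lotus %s vs forest %s", epoch, lotusKey, forestKey)
	}
	return nil
}
```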
Comment from @ZenGround0 (author):

"As long as they match we must be all good (assuming the forest snapshot process is good, which is very well tested in production at this point)"

If we go the route of forgoing lotus snapshot parity, then to prevent regressions I think it would be important to include a lotus snapshot import (and maybe some syncing) as an alternative sanity check before confirming a snapshot. This should be a lot cheaper in time, compute, and dev time than creating a lotus snapshot.

Comment from @ZenGround0 (author):

With f3 we could probably even sync to f3 finality after importing the snapshot.
