CLVM compression was designed for straightforward implementation within CLVM itself. To understand how it works, we'll first examine the standard deserialization algorithm implemented in CLVM. See also here https://github.com/Chia-Network/clvm_rs/blob/7e58298b85aed07672678e54b9f28571724814dd/cl/deserialize_w_backrefs.cl
A CLVM stack is a nil-terminated list of CLVM objects. Stack operations push and pop from the front:
- Push operation:
(c new_object old_stack)
- Pop operation:
(f stack)
returns topmost object - Get remaining stack:
(r old_stack)
Examples of stack states:
() # empty stack
(foo) # one-element stack with "foo" on top
(bar foo) # two-element stack with "bar" on top
Visual representation as trees:
0 # empty tree
🔲
| \
| \
foo 0 # one-element stack
🔲
| \
| \
foo 🔲
/ \
bar 0 # two-element stack
A cursor is analogous to a C FILE
pointer, consisting of:
- An atom representing the serialized CLVM stream
- An integer index tracking the current position
The core recursive function sexp_from_stream
takes two parameters:
- A CLVM stack
- A cursor
- Read the first byte from the cursor
- If byte ≠ 0xff:
- Deserialize as atom directly
- Push onto stack: (atom ...)
- Return updated stack and cursor
- If byte = 0xff:
- Recursively read LEFT object onto stack
- Recursively read RIGHT object onto stack
- Stack becomes:
(RIGHT LEFT ...)
- Pop RIGHT and LEFT
- Push
(LEFT . RIGHT)
- Final stack:
((LEFT . RIGHT) ...)
In either case, a single new clvm object is pushed onto the stack.
The main algorithm:
- Calls
sexp_from_stream
with empty stack and cursor at index 0 - Returns final stack and cursor
- Pop top element to get deserialized CLVM
Compression introduces the back-reference prefix (0xfe). When encountered:
- Deserialize following atom as path
- Navigate path down stack using standard CLVM conventions:
tree@0 = nil tree@1 = tree tree@2n = (f (tree@n)) tree@2n+1 = (r (tree@n))
- Copy referenced subtree to stack
While decompression is straightforward, compression is complex due to context-sensitive path references that change with stack updates.
If you consider the clvm node, there is likely a fairly straightforward way to write a function that shows how to "find the path" to a prior node from a later node just using the two nodes "final paths". This might lead to a compression algorithm that differs from the existing implementation, so might have different performance characteristics. Being able to quickly calculate how long decompression paths are based on final paths may turn out to be very useful.
Roughly, the "further away" two nodes are in the tree, the longer the path is. Generally, to get from one node to another, you go up to the least common ancestor, then down to the particular node. The total number of steps corresponds to how many bits are in the path of the stack.