Understanding CLVM Compression

Introduction

CLVM compression was designed for straightforward implementation within CLVM itself. To understand how it works, we'll first examine the standard deserialization algorithm implemented in CLVM. See also here https://github.com/Chia-Network/clvm_rs/blob/7e58298b85aed07672678e54b9f28571724814dd/cl/deserialize_w_backrefs.cl

Basic Concepts: CLVM Stack Operations

A CLVM stack is a nil-terminated list of CLVM objects. Stack operations push and pop from the front:

Push operation: (c new_object old_stack)
Pop operation: (f stack) returns topmost object
Get remaining stack: (r old_stack)

Examples of stack states:

()          # empty stack
(foo)       # one-element stack with "foo" on top
(bar foo)   # two-element stack with "bar" on top

Visual representation as trees:

0           # empty tree

🔲
|  \
|   \
foo  0      # one-element stack

🔲
|  \
|   \
foo  🔲
    /  \
   bar  0   # two-element stack

Cursor and Stream Processing

A cursor is analogous to a C FILE pointer, consisting of:

An atom representing the serialized CLVM stream
An integer index tracking the current position

Deserialization Algorithm

The core recursive function sexp_from_stream takes two parameters:

A CLVM stack
A cursor

Processing Steps of `sexp_from_stream`:

Read the first byte from the cursor
If byte ≠ 0xff:
- Deserialize as atom directly
- Push onto stack: (atom ...)
- Return updated stack and cursor
If byte = 0xff:
- Recursively read LEFT object onto stack
- Recursively read RIGHT object onto stack
- Stack becomes: (RIGHT LEFT ...)
- Pop RIGHT and LEFT
- Push (LEFT . RIGHT)
- Final stack: ((LEFT . RIGHT) ...)

In either case, a single new clvm object is pushed onto the stack.

The main algorithm:

Calls sexp_from_stream with empty stack and cursor at index 0
Returns final stack and cursor
Pop top element to get deserialized CLVM

Compression Implementation

Compression introduces the back-reference prefix (0xfe). When encountered:

Deserialize following atom as path

Navigate path down stack using standard CLVM conventions:

tree@0    = nil
tree@1    = tree
tree@2n   = (f (tree@n))
tree@2n+1 = (r (tree@n))

Copy referenced subtree to stack

While decompression is straightforward, compression is complex due to context-sensitive path references that change with stack updates.

If you consider the clvm node, there is likely a fairly straightforward way to write a function that shows how to "find the path" to a prior node from a later node just using the two nodes "final paths". This might lead to a compression algorithm that differs from the existing implementation, so might have different performance characteristics. Being able to quickly calculate how long decompression paths are based on final paths may turn out to be very useful.

Roughly, the "further away" two nodes are in the tree, the longer the path is. Generally, to get from one node to another, you go up to the least common ancestor, then down to the particular node. The total number of steps corresponds to how many bits are in the path of the stack.

richardkiss/clvm-backrefs-details.md

Understanding CLVM Compression

Introduction

Basic Concepts: CLVM Stack Operations

Cursor and Stream Processing

Deserialization Algorithm

Processing Steps of `sexp_from_stream`:

Compression Implementation

richardkiss commented Dec 24, 2024

Uh oh!

richardkiss/clvm-backrefs-details.md

Understanding CLVM Compression

Introduction

Basic Concepts: CLVM Stack Operations

Cursor and Stream Processing

Deserialization Algorithm

Processing Steps of sexp_from_stream:

Compression Implementation

richardkiss commented Dec 24, 2024

Uh oh!

Processing Steps of `sexp_from_stream`: