@leonardoalt
Last active April 16, 2026 22:02

RISCV-X: 1024 Registers for zkVM — Technical Summary

Experiment: modify the LLVM RISC-V backend to support 1024 general-purpose registers (up from 32), integrate with the OpenVM zkVM framework, and measure proof cost reduction. Goal: eliminate register spills in compute-heavy zkVM workloads where memory operations dominate proof cost.

Result Headline (updated after correctness fixes)

For 1000 SHA3-256 iterations through OpenVM with actual STARK proofs, 4 segments, aggregated across all AIRs:

Note: earlier versions of this gist reported ~8% fewer trace cells. Those numbers came from a build that had three latent bugs (two in the LLVM calling convention, one in the OpenVM transpiler's S-type immediate decoding) — the extended binary was still producing correct hashes in simpler benchmarks so the numbers looked plausible, but in tiny_sha3 specifically the broken build was secretly looping ~38× more than the baseline. All three are now fixed; see Part 6.

| Metric | Baseline (32 regs) | Extended (1024 regs) | Change |
|---|---:|---:|---:|
| Executed instructions | 35,896,008 | 31,067,004 | −13.5% |
| total_cells (allocated) | 4,049,774,760 | 3,853,796,520 | −4.8% |
| total_cells_used (actually filled) | 3,275,912,933 | 2,829,414,701 | −13.6% |
| main_cells_used | 1,426,439,001 | 1,229,953,329 | −13.8% |
| Total proof time | 130.8s | 127.7s | −2.4% |

total_cells is the allocated trace matrix — AIR heights rounded up to powers of two for FRI. total_cells_used is the actually filled cells (what matters for proving work in principle). The allocated metric drops less because power-of-two padding wastes the gain; the filled metric is the honest figure.
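The padding effect can be sketched with a toy Python calculation. The AIR width and row counts below are invented for illustration; only the power-of-two rounding rule is real:

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (FRI requires power-of-two trace heights)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

# Hypothetical AIR: 100 columns wide, 3,000,000 rows actually filled.
width, used_rows = 100, 3_000_000
allocated_rows = next_pow2(used_rows)      # 4_194_304

cells_used = width * used_rows             # the "honest" figure
total_cells = width * allocated_rows       # the allocated figure

# Cutting the filled rows by ~13% changes nothing in the allocated matrix
# if the height stays above the same power-of-two boundary.
smaller_used = int(used_rows * 0.87)
assert next_pow2(smaller_used) == allocated_rows
```

This is why a 13.6% drop in filled cells can translate to only a 4.8% drop in allocated cells.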

Proof wall-clock time barely moves because keccak proof time is dominated by FRI/commitment work that doesn't scale down with the trace. Most of the saving would show up as headroom in trace-cell budget (e.g., fitting into one fewer segment for compute-heavier workloads).

Raw metrics.json (combined baseline + extended): https://gist.github.com/leonardoalt/0268bcf198fb657bd3899c7d8376718e

The benchmark is SHA3-256 applied 1000 times in a row to a 32-byte buffer initially full of zeros. Reference hash 52cf48e88ce4dea40f272b6aaf083675ade26504a0129f51ec30204a2fdb1c5b (matches Python hashlib.sha3_256); both baseline and extended ELFs produce it exactly.
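Assuming the iteration simply feeds each 32-byte digest back in as the next input (consistent with the keccak256(buf, buf) driver described later), the reference hash reproduces with Python's hashlib:

```python
import hashlib

buf = bytes(32)  # 32-byte buffer of zeros
for _ in range(1000):
    buf = hashlib.sha3_256(buf).digest()  # each digest becomes the next input

assert buf.hex() == "52cf48e88ce4dea40f272b6aaf083675ade26504a0129f51ec30204a2fdb1c5b"
```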

Motivation

zkVMs like OpenVM use RISC-V as their ISA because it's simple. But RISC-V has only 32 registers, which causes significant spills in compute-heavy code. In hardware, spills hit cache/memory cheaply. In a zkVM, every memory operation is expensive — it must be proven. Each load/store adds constraints.

Powdr's crush demonstrated that compiling from WASM to an ISA with infinite registers (no spills at all) is much more efficient in a zkVM. This experiment takes a different approach: what if we give LLVM 1024 physical registers? LLVM IR already uses infinite virtual registers; the register allocator's job is to map them to physical registers. With 1024 physical registers, the allocator has so much headroom that it trivially succeeds without spills for most functions.

Forks

Metrics Gists (actual STARK proof traces)

Part 1: LLVM Backend Changes

1.1 Feature flag

Added FeatureVendorXRegs1024 in RISCVFeatures.td. Enable via -mattr=+xregs1024 or -march=rv32im_xregs1024.

1.2 Register file expansion

In RISCVRegisterInfo.td:

  • Widened RISCVReg encoding from bits<5> to bits<10>.
  • Added X32-X1023 via foreach i = 32...1023 in { def X#i : RISCVReg<i, "x"#i> }.
  • Included the extended registers in the GPR register class.
  • In RISCVRegisterInfo.cpp, X32-X1023 are marked as reserved in getReservedRegs() when XRegs1024 is not active, so standard RISC-V behavior is preserved.

1.3 Calling convention

In RISCVCallingConv.td, defined CSR_XRegs1024 as an empty callee-saved list. In zkVM there are no interrupts or context switches, so making every register caller-saved eliminates prologue/epilogue saves. The register allocator inserts spills only for values actually live across calls.

GP (x3) and TP (x4) are also unreserved when XRegs1024 is active (not needed in zkVM).

1.4 64-bit instruction encoding

Standard RISC-V encodes instructions in 32 bits with 5-bit register fields — not enough for 1024 registers. We designed a fixup-compatible 64-bit encoding in RISCVInstrFormats.td:

  • Low u32 (bytes 0-3): standard RISC-V bit layout with [6:0] = 0b0111111 marker. Immediate bits stay at standard positions → existing LLVM fixups (fixup_riscv_branch, fixup_riscv_hi20, etc.) work unchanged.
  • High u32 (bytes 4-7): original 7-bit opcode at [16:10], high 5 bits of each register ID at [21:17] (rd), [26:22] (rs1), [31:27] (rs2).

Transpiler reconstructs full 10-bit register IDs as reg = (hi_bits << 5) | lo_bits.
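A Python sketch of the encoding round-trip, using the field positions listed above (the R-type example values are arbitrary, and the encoder is an illustration of the layout, not the actual MC emitter code):

```python
MARKER = 0b0111111  # low-word opcode marker for 64-bit instructions

def encode_rtype_64(opcode, rd, rs1, rs2, funct3=0, funct7=0):
    """Build the (lo, hi) word pair for an R-type instruction."""
    lo = (MARKER
          | (rd  & 0x1F) << 7    # low 5 bits stay at the standard positions,
          | funct3 << 12         # so existing LLVM fixups keep working
          | (rs1 & 0x1F) << 15
          | (rs2 & 0x1F) << 20
          | funct7 << 25)
    hi = (opcode << 10           # original 7-bit opcode at [16:10]
          | (rd  >> 5) << 17     # high 5 bits of each 10-bit register ID
          | (rs1 >> 5) << 22
          | (rs2 >> 5) << 27)
    return lo, hi

def decode_regs(lo, hi):
    """reg = (hi_bits << 5) | lo_bits, as the transpiler reconstructs it."""
    rd  = ((hi >> 17) & 0x1F) << 5 | (lo >> 7)  & 0x1F
    rs1 = ((hi >> 22) & 0x1F) << 5 | (lo >> 15) & 0x1F
    rs2 = ((hi >> 27) & 0x1F) << 5 | (lo >> 20) & 0x1F
    return rd, rs1, rs2

lo, hi = encode_rtype_64(0b0110011, rd=700, rs1=33, rs2=1023)
assert lo & 0x7F == MARKER
assert decode_regs(lo, hi) == (700, 33, 1023)
```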

1.5 MC code emitter re-encoding

Rather than creating separate 64-bit instruction definitions for every instruction (which would require duplicating ~30 defs with new format classes), we intercept in RISCVMCCodeEmitter.cpp. Each 32-bit instruction is re-encoded as 64-bit on-the-fly when FeatureVendorXRegs1024 is set. Register operands are read from the MCInst at their full 10-bit encoding (via getEncodingValue()) and placed into both the low 5-bit field and the high 5-bit field.

1.6 Function calls (AUIPC + JALR)

AUIPC+JALR is the standard pseudo-CALL sequence. The R_RISCV_CALL_PLT relocation requires these two instructions at consecutive 4-byte offsets — incompatible with arbitrary 8-byte re-encoding. Solution: keep AUIPC+JALR as two consecutive standard 32-bit instructions (8 bytes total). The linker fixup works at its usual offsets. The OpenVM transpiler detects this pair and combines them into a single PC-relative JAL.

Part 2: OpenVM Changes

2.1 Transpiler extension

Created XRegs1024TranspilerExtension. For each 2-u32 chunk:

  1. If lo[6:0] == 0x3F → decode as a 64-bit instruction, extract full 10-bit registers, produce the equivalent 7-field Instruction<F>.
  2. If lo[6:0] == 0x17 (AUIPC) and hi[6:0] == 0x67 (JALR) → combine into a single JAL with the combined offset (halved for PC compression).
  3. Otherwise → return None (fallback to standard transpilers if registered).
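Step 2's offset arithmetic can be sketched in Python. Field extraction follows the standard U-type/I-type immediate layouts; the instruction words are hand-built examples, and the hi20/lo12 rounding that real code generators apply for negative offsets is ignored here:

```python
def sext(value, bits):
    """Sign-extend a `bits`-wide field."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def combine_auipc_jalr(auipc_word, jalr_word):
    """Fold an AUIPC+JALR pair into a single halved JAL offset."""
    imm20 = auipc_word >> 12   # U-type: imm[31:12]
    imm12 = jalr_word >> 20    # I-type: imm[31:20]
    byte_offset = sext(imm20 << 12, 32) + sext(imm12, 12)
    return byte_offset // 2    # halved for OpenVM's compressed PC space

# Hand-built pair: AUIPC hi20 = 1, JALR low12 = -8 -> 4096 - 8 = 4088 bytes ahead.
auipc = (1 << 12) | 0b0010111
jalr  = (0xFF8 << 20) | 0b1100111  # 0xFF8 is -8 in 12-bit two's complement
assert combine_auipc_jalr(auipc, jalr) == 2044
```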

2.2 PC compression

64-bit instructions take 8 bytes in the ELF but OpenVM's DEFAULT_PC_STEP = 4. Each 64-bit instruction occupies 1 PC slot (4 bytes). The transpiler maps:

  • ELF byte address 0 → OpenVM PC base + 0
  • ELF byte address 8 → OpenVM PC base + 4

So OpenVM PC = base + (ELF_offset) / 2. Consequently:

  • Branch/jump offsets are halved in the transpiler (LLVM calculated them for 8-byte instructions).
  • The ELF entry point is halved when building VmExe.
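A minimal sketch of the mapping, using the linker script's .text address from Part 3 as an illustrative PC base:

```python
PC_BASE = 0x200800       # illustrative; matches the .text base in the linker script
DEFAULT_PC_STEP = 4

def elf_to_openvm_pc(elf_offset):
    """Each 8-byte ELF instruction slot occupies one 4-byte PC step."""
    assert elf_offset % 8 == 0, "64-bit instructions are 8-byte aligned"
    return PC_BASE + elf_offset // 2

assert elf_to_openvm_pc(0) == PC_BASE
assert elf_to_openvm_pc(8) == PC_BASE + DEFAULT_PC_STEP
# A branch that LLVM emitted as +80 bytes becomes +40 in PC space.
assert elf_to_openvm_pc(80) - elf_to_openvm_pc(0) == 40
```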

2.3 Register file expansion

Previously RV32_NUM_REGISTERS = 32, register file in address space 1 allocated 32 * 4 = 128 bytes. Changed to 1024 in riscv.rs; MemoryConfig now allocates RV32_NUM_REGISTERS * 4 = 4096 bytes for the register file.

2.4 Register byte offset widening

Circuit execution files stored register byte offsets as u8 (max value 255). With 1024 registers, offsets go up to 1023 × 4 = 4092. Changed u8 → u16 in 14 execution files across extensions/rv32im/circuit/src/*/execution.rs (base_alu, shift, less_than, branch_eq, branch_lt, mul, mulh, divrem, jalr, jal_lui, auipc, loadstore, load_sign_extend, hintstore).

Also expanded RISCV_TO_X86_OVERRIDE_MAP in aot/common.rs from a [Option<&str>; 32] to [Option<&str>; 1024] to avoid out-of-bounds access in AOT paths (entries 32-1023 are None).

2.5 LUI enable flag

Subtle bug: the standard from_u_type helper produces f = 0, but process_lui in the Rv32I transpiler explicitly sets f = 1 afterwards — because LUI shares a chip with JAL, and f is the "write rd" enable flag. My transpiler was missing this. Fixed in make_lui in xregs1024.rs.

2.6 R-type opcode handling bug

Critical fix: my original R-type transpiler had a nested match where SLL/SRL/SRA/SLT/SLTU hit the outer catch-all _ => unimp() before the inner shift/lt handling could run (dead code). Flattened into a single match that covers all (funct7, funct3) combinations.
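The shape of the fix can be sketched in Python (the opcode names and table entries below are illustrative, not OpenVM's actual transpiler code): a single flat lookup keyed on (funct7, funct3) has no outer catch-all that can shadow the shift/compare entries.

```python
# Flat dispatch over (funct7, funct3), mirroring the fixed single-match structure.
R_TYPE = {
    (0x00, 0x0): "ADD",  (0x20, 0x0): "SUB",
    (0x00, 0x1): "SLL",  (0x00, 0x2): "SLT",  (0x00, 0x3): "SLTU",
    (0x00, 0x4): "XOR",  (0x00, 0x5): "SRL",  (0x20, 0x5): "SRA",
    (0x00, 0x6): "OR",   (0x00, 0x7): "AND",
    (0x01, 0x0): "MUL",  # ... remaining M-extension entries elided
}

def decode_rtype(funct7, funct3):
    # One flat lookup: no outer `_ =>` arm can swallow SLL/SRL/SRA/SLT/SLTU
    # before the shift/compare cases are checked.
    return R_TYPE.get((funct7, funct3), "UNIMP")

assert decode_rtype(0x00, 0x1) == "SLL"
assert decode_rtype(0x20, 0x5) == "SRA"
assert decode_rtype(0x7F, 0x7) == "UNIMP"
```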

Part 3: Toolchain

3.1 Working: C guests

C programs compile end-to-end:

clang --target=riscv32 -march=rv32im -O2 -ffreestanding -fno-builtin \
  -emit-llvm -S keccak.c -o keccak.ll

# Baseline
llc -march=riscv32 -mattr=+m -O2 -filetype=obj keccak.ll -o baseline.o
ld.lld --no-relax -T keccak.ld baseline.o -o baseline.elf

# Extended (1024 registers, 64-bit encoding)
llc -march=riscv32 -mattr=+m,+xregs1024 -O2 -filetype=obj keccak.ll -o extended.o
ld.lld --no-relax -T keccak.ld extended.o -o extended.elf

Linker script puts .text at OpenVM's expected address (0x200800) and separates .data/.rodata/.bss into distinct segments.

3.2 Partial: Rust guests

The standard OpenVM Rust build (cargo build --target riscv32im-risc0-zkvm-elf) uses rustc's bundled LLVM, not our fork. Attempted approaches:

  • Intercept at bitcode: RUSTFLAGS="--emit=llvm-bc" produces one .bc per crate. llvm-link fails on duplicate memcpy/memset/rust_begin_unwind from compiler_builtins, core, std. The --override option helps partially but doesn't resolve all duplicates.
  • Compile each .bc separately + link: lld complains about duplicates; --allow-multiple-definition helps but then runtime symbols like __rust_no_alloc_shim_is_unstable_v2 and sys_panic are missing.
  • -C linker-plugin-lto: gets close, but rustc's linker invocation adds clang-specific flags incompatible with -Ttext.

The proper solution for Rust guests is to build a custom rustc sysroot with our LLVM fork. Not yet done.

Part 4: How We Measured

4.1 Assembly-level (keccakf function)

Compile tiny-keccak crate to LLVM IR via standard rustc, then recompile with our llc:

keccakf function:
  Standard:  879 instructions, 156 stack stores + 180 stack loads = 336 spill ops
  Extended:  535 instructions, 0 stack ops
  Reduction: 39% fewer instructions, 100% spills eliminated

See docs/keccak-benchmark.md for assembly-level comparison details.

4.2 STARK proof trace cells

The metrics that matter. Generated with:

OUTPUT_PATH=/tmp/metrics_baseline.json XREGS_VARIANT=baseline \
  cargo test -p openvm-toolchain-tests --test transpiler_tests \
  --features metrics -- test_xregs1024_proof_with_metrics --nocapture

OUTPUT_PATH=/tmp/metrics_extended.json XREGS_VARIANT=extended \
  cargo test -p openvm-toolchain-tests --test transpiler_tests \
  --features metrics -- test_xregs1024_proof_with_metrics --nocapture

The metrics feature enables openvm_stark_sdk::bench::run_with_metric_collection which sets up a DebuggingRecorder and dumps a snapshot to the path in OUTPUT_PATH as JSON. The JSON contains gauges and counters for every AIR: constraints, interactions, trace heights, main/perm/quotient cells, and so on.

The workload is a C keccak program (bench/keccak_standalone.c) doing N iterations of keccak256(buf, buf). Baseline compiled for standard RV32IM; extended compiled with +xregs1024.

4.3 Numbers

Aggregating the metrics JSON across all AIRs. Note that these runs predate the Part 6 correctness fixes; the corrected headline numbers are at the top of this gist.

Keccak 1000 iterations (4 segments):

| Metric | Baseline | Extended | Change |
|---|---:|---:|---:|
| total_cells (allocated) | 4,049,020,328 | 3,853,992,104 | −4.8% |
| total_cells_used (actual) | 3,176,459,749 | 2,926,616,970 | −7.9% |
| main_cells_used | 1,382,971,605 | 1,273,258,966 | −7.9% |
| Total proof time | 168.8s | 162.9s | −3.5% |

Keccak 100 iterations (1 segment):

| Metric | Baseline | Extended | Change |
|---|---:|---:|---:|
| total_cells (allocated) | 461,105,706 | 363,573,802 | −21.2% |
| total_cells_used (actual) | 323,386,753 | 298,398,042 | −7.7% |
| Total proof time | 22.9s | 20.6s | −9.9% |

Important distinction: total_cells is the allocated trace matrix (heights rounded up to powers of 2 because of FRI). total_cells_used is actually-filled cells. The difference between runs:

  • At 100 iter: total_cells drops 21% because an AIR crosses a power-of-2 boundary.
  • At 1000 iter (4 segments): the rounding effects wash out and total_cells drops only ~5%, matching total_cells_used more closely.

The total_cells_used reduction stays consistently ~8% across workload sizes — that's the honest figure.

The reduction is modest because keccakf, while the dominant hot function, is surrounded by equally-costly code that doesn't benefit from extended registers: the memcpy between iterations, the loop bookkeeping, the function call overhead, the VM's memory/IO chips, etc. If the workload were pure keccakf (inlined, no function call overhead, no memcpy), the reduction would approach the 39% instruction reduction we see at the assembly level.

Part 5: What Made This Hard

  • Fixup compatibility: LLVM's branch/jump fixups assume standard encoding. Our low u32 preserves standard bit positions so fixups work without modification — this was the key insight that made the encoding practical.
  • AUIPC+JALR addressing: PC-relative addressing combined with byte-aligned ELF offsets vs compressed OpenVM PC space is genuinely subtle. Resolved by combining the pair into a single JAL.
  • Phantom gaps: Initial approach emitted [instruction, phantom_nop] pairs to keep ELF byte addresses matching OpenVM PCs. This worked but the phantoms counted against trace cells (~8% worse). Removing them required halving all branch/jump/call offsets.
  • Scattered u8 assumptions: The register byte offset being u8 was hardcoded in 13+ circuit files. All needed widening to u16.
  • Silent transpiler bugs: The R-type dead-code path silently produced unimp() for SLL/SRL/SRA/SLT/SLTU. Only caught by tracing TERMINATE(2) occurrences in the transpiled program.
  • Linker relaxation: ld.lld by default rewrites 8-byte auipc+jalr call pairs into a 4-byte jal when the target is in range. The transpiler's halving arithmetic assumes every ELF slot is 8 bytes, so relaxation breaks alignment silently. Must pass --no-relax to the linker.

Part 6: Correctness bugs caught post-deployment

After switching the guest-side keccak from a hand-rolled C implementation to the stock tiny_sha3 library, the 1000-iteration benchmark started reporting ~38× more instructions than baseline (1.386B vs 35.9M), eventually faulting on an out-of-bounds load. Investigating the regression uncovered three distinct bugs that had been latent, each masked on simpler code paths:

Bug 1 — getCallPreservedMask still returned the ILP32 mask under XRegs1024

The calling-convention design was "everything caller-saved except whatever the call-preserved mask says is preserved". The CSR list for XRegs1024 was empty, so the callee prologue saved nothing. But getCallPreservedMask was not changed — it still returned CSR_ILP32_LP64_RegMask, telling the register allocator that s0-s11 survive a call. The allocator believed the lie and left live values in s0/s1 across calls; the callee didn't preserve them; values got silently overwritten.

In tiny_sha3 specifically this broke the last step of sha3_final: md and c pointers kept in s0/s1, clobbered by the call to sha3_keccakf, and then the post-call memcpy used garbage pointers. The reloaded "mdlen" was enormous and the copy loop ran millions of iterations before falling off the end of memory. That's where the 38× came from.

Fix: make getCallPreservedMask return CSR_IPRA_RegMask (only x1 preserved) when hasVendorXRegs1024().

Bug 2 — ra wasn't being spilled across calls either

With an empty CSR list, LLVM's prologue/epilogue inserter (PEI) also never saved ra in non-leaf prologues. Every nested call overwrote ra with the inner call's return address, and the outer function's final ret jumped back somewhere inside itself.

Fix: add X1 to CSR_XRegs1024. Non-leaf functions now spill ra in their prologue; all other registers remain caller-saved.

Both of these had been masked in simpler call chains because the compiler happened to pick allocations that didn't need preservation. tiny_sha3 is the first case we tested where values were actually live across a call.

Bug 3 — s_imm in the transpiler didn't sign-extend

The 12-bit S-type immediate extractor in xregs1024.rs masked the high half with & 0x7F after a signed right shift:

let hi7 = ((lo as i32) >> 25) & 0x7F;  // <- wipes the sign
let imm = (hi7 << 5) | lo5;

-1 decoded as +4095. So sb a, -1(ptr) (the final-pad write of state[rsiz-1] ^= 0x80 in sha3_final) stored to ptr + 4095, 4 KiB past the state instead of its last byte. The pad was never applied; keccakf then ran on an unpadded state and produced garbage. The output was a well-formed hash of the wrong initial state (a full 24-round permutation), which is why it looked subtly off rather than catastrophically wrong.

Fix: sign-extend the 12-bit result correctly.
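In Python terms, the buggy and fixed extractors look like this (the instruction word is hand-built; field positions are the standard S-type layout, imm[11:5] at bits 31:25 and imm[4:0] at bits 11:7):

```python
def to_i32(word):
    """Reinterpret a u32 as a signed 32-bit value (Rust's `as i32`)."""
    return word - (1 << 32) if word & (1 << 31) else word

def s_imm_buggy(word):
    """Original extractor: & 0x7F after the arithmetic shift discards the sign."""
    hi7 = (to_i32(word) >> 25) & 0x7F
    lo5 = (word >> 7) & 0x1F
    return (hi7 << 5) | lo5

def s_imm_fixed(word):
    """Correct extraction: assemble the 12 bits, then sign-extend."""
    imm = ((word >> 25) & 0x7F) << 5 | (word >> 7) & 0x1F
    return imm - 0x1000 if imm & 0x800 else imm

# S-type store with imm = -1: imm[11:5] = 0x7F, imm[4:0] = 0x1F.
word = (0x7F << 25) | (0x1F << 7) | 0b0100011
assert s_imm_buggy(word) == 4095   # -1 misread as +4095
assert s_imm_fixed(word) == -1
```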

Why these were latent

All three bugs are on paths that basic tests never exercised:

  • 1 and 2 only trigger when a non-leaf function has values live across a call.
  • 3 only triggers when LLVM emits a store with a negative immediate (most stores use positive offsets).

Anyone bringing up a new codebase on top of a custom backend should test it against a proper hash function early. Keccak state + XOR pad at rsiz-1 + nested function calls through s-registers is a surprisingly thorough calling-convention fuzzer.

Part 7: Build recipe

Full build recipe for the benchmark ELFs (tiny_sha3 source, driver, linker script, llc / ld.lld commands) is at https://gist.github.com/leonardoalt/66ab3ae06c0e21532b10d7e637afedeb. The tiny_sha3 source is byte-identical to upstream mjosaarinen/tiny_sha3@dcbb319 — no modifications.

Related Documentation in the Repo

  • docs/final-results.md — bottom-line summary
  • docs/keccak-benchmark.md — assembly-level keccakf comparison
  • docs/encoding-format.md — 64-bit encoding reference
  • docs/openvm-integration.md — integration plan and flow
  • docs/progress-log.md — chronological progress log
  • docs/thoughts.md — design analysis and rationale
  • docs/phase-{0-7}-*.md — per-phase implementation notes