RISCV-X: 1024 registers for zkVM — technical summary with links
Experiment: modify the LLVM RISC-V backend to support 1024 general-purpose registers (up from 32), integrate with the OpenVM zkVM framework, and measure proof cost reduction. Goal: eliminate register spills in compute-heavy zkVM workloads where memory operations dominate proof cost.
For 1000 SHA3-256 iterations through OpenVM with actual STARK proofs, 4 segments, aggregated across all AIRs:
Note: earlier versions of this gist reported ~8% fewer trace cells. Those numbers came from a build that had three latent bugs (two in the LLVM calling convention, one in the OpenVM transpiler's S-type immediate decoding) — the extended binary was still producing correct hashes in simpler benchmarks so the numbers looked plausible, but in tiny_sha3 specifically the broken build was secretly looping ~38× more than the baseline. All three are now fixed; see Part 6.
| Metric | Baseline (32 regs) | Extended (1024 regs) | Change |
|---|---|---|---|
| Executed instructions | 35,896,008 | 31,067,004 | −13.5% |
| `total_cells` (allocated) | 4,049,774,760 | 3,853,796,520 | −4.8% |
| `total_cells_used` (actually filled) | 3,275,912,933 | 2,829,414,701 | −13.6% |
| `main_cells_used` | 1,426,439,001 | 1,229,953,329 | −13.8% |
| Total proof time | 130.8s | 127.7s | −2.4% |
total_cells is the allocated trace matrix — AIR heights rounded up to powers of two for FRI. total_cells_used is the actually filled cells (what matters for proving work in principle). The allocated metric drops less because power-of-two padding wastes the gain; the filled metric is the honest figure.
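To make the allocated-vs-used gap concrete, here is a minimal sketch (not OpenVM code; numbers are invented for illustration) of how power-of-two padding dilutes a reduction in filled rows:

```rust
// Illustrative only: each AIR's trace height is rounded up to the next power
// of two for FRI, so allocated cells can shrink much less than filled cells.
fn allocated_cells(rows_used: u64, width: u64) -> u64 {
    rows_used.next_power_of_two() * width
}

fn main() {
    let width = 100;
    // A ~13.6% reduction in filled rows...
    let (base_rows, ext_rows) = (900_000u64, 777_600u64);
    // ...can leave allocated cells completely unchanged when both heights
    // round up to the same power of two (here, 2^20 = 1,048,576 rows).
    assert_eq!(
        allocated_cells(base_rows, width),
        allocated_cells(ext_rows, width)
    );
    assert_eq!(900_000u64.next_power_of_two(), 1 << 20);
}
```

The flip side is the 100-iteration run below, where a height does cross a power-of-two boundary and the allocated metric drops by far more than the filled one.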
Proof wall-clock time barely moves because keccak proof time is dominated by FRI/commitment work that doesn't scale down with the trace. Most of the saving would show up as headroom in trace-cell budget (e.g., fitting into one fewer segment for compute-heavier workloads).
Raw metrics.json (combined baseline + extended): https://gist.github.com/leonardoalt/0268bcf198fb657bd3899c7d8376718e
The benchmark is SHA3-256 applied 1000 times in a row to a 32-byte buffer initially full of zeros. Reference hash 52cf48e88ce4dea40f272b6aaf083675ade26504a0129f51ec30204a2fdb1c5b (matches Python hashlib.sha3_256); both baseline and extended ELFs produce it exactly.
zkVMs like OpenVM use RISC-V as their ISA because it's simple. But RISC-V has only 32 registers, which causes significant spills in compute-heavy code. In hardware, spills hit cache/memory cheaply. In a zkVM, every memory operation is expensive — it must be proven. Each load/store adds constraints.
Powdr's crush demonstrated that compiling from WASM to an ISA with infinite registers (no spills at all) is much more efficient in a zkVM. This experiment takes a different approach: what if we give LLVM 1024 physical registers? LLVM IR already uses infinite virtual registers; the register allocator's job is to map them to physical registers. With 1024 physical registers, the allocator has so much headroom that it trivially succeeds without spills for most functions.
- LLVM: https://github.com/leonardoalt/llvm-project/tree/riscvx-1024regs
- OpenVM: https://github.com/leonardoalt/openvm/tree/xregs1024
- Baseline keccak 100 iter: https://gist.github.com/leonardoalt/116252da97292c3efd65bffe5e99eda1
- Extended keccak 100 iter: https://gist.github.com/leonardoalt/d928023d3d216c93d54bbd64d259a1a8
- Combined 1000 iter: https://gist.github.com/leonardoalt/acfa2a722c08ca0a0326bef3c06a01de
Added FeatureVendorXRegs1024 in RISCVFeatures.td. Enable via -mattr=+xregs1024 or -march=rv32im_xregs1024.
- Widened `RISCVReg` encoding from `bits<5>` to `bits<10>`.
- Added X32-X1023 via `foreach i = 32...1023 in { def X#i : RISCVReg<i, "x"#i> }`.
- Included the extended registers in the `GPR` register class.
- In `RISCVRegisterInfo.cpp`, X32-X1023 are marked as reserved in `getReservedRegs()` when XRegs1024 is not active, so standard RISC-V behavior is preserved.
In RISCVCallingConv.td, defined CSR_XRegs1024 as an empty callee-saved list. In zkVM there are no interrupts or context switches, so making every register caller-saved eliminates prologue/epilogue saves. The register allocator inserts spills only for values actually live across calls.
GP (x3) and TP (x4) are also unreserved when XRegs1024 is active (not needed in zkVM).
Standard RISC-V encodes instructions in 32 bits with 5-bit register fields — not enough for 1024 registers. We designed a fixup-compatible 64-bit encoding in RISCVInstrFormats.td:
- Low u32 (bytes 0-3): standard RISC-V bit layout with `[6:0] = 0b0111111` marker. Immediate bits stay at standard positions → existing LLVM fixups (`fixup_riscv_branch`, `fixup_riscv_hi20`, etc.) work unchanged.
- High u32 (bytes 4-7): original 7-bit opcode at `[16:10]`, high 5 bits of each register ID at `[21:17]` (rd), `[26:22]` (rs1), `[31:27]` (rs2).
Transpiler reconstructs full 10-bit register IDs as reg = (hi_bits << 5) | lo_bits.
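As a sanity check on the split-field scheme, here is a standalone encode/decode roundtrip following the field positions described above (the helper names are mine, not the actual emitter or transpiler API):

```rust
// Sketch of the 64-bit encoding's register split: low 5 bits of each register
// at the standard RISC-V positions in the low u32, high 5 bits at
// [21:17]/[26:22]/[31:27] of the high u32. Illustrative, not OpenVM's API.
const MARKER: u32 = 0b0111111; // lo[6:0] for extended instructions

fn encode_regs(opcode7: u32, rd: u32, rs1: u32, rs2: u32) -> (u32, u32) {
    assert!(rd < 1024 && rs1 < 1024 && rs2 < 1024);
    let lo = MARKER
        | (rd & 0x1F) << 7    // standard rd position [11:7]
        | (rs1 & 0x1F) << 15  // standard rs1 position [19:15]
        | (rs2 & 0x1F) << 20; // standard rs2 position [24:20]
    let hi = (opcode7 & 0x7F) << 10 // original opcode at [16:10]
        | (rd >> 5) << 17           // high rd bits at [21:17]
        | (rs1 >> 5) << 22          // high rs1 bits at [26:22]
        | (rs2 >> 5) << 27;         // high rs2 bits at [31:27]
    (lo, hi)
}

fn decode_reg(lo5: u32, hi5: u32) -> u32 {
    (hi5 << 5) | lo5 // reg = (hi_bits << 5) | lo_bits
}

fn main() {
    // R-type opcode 0x33 with rd = x1000, rs1 = x31, rs2 = x32.
    let (lo, hi) = encode_regs(0x33, 1000, 31, 32);
    assert_eq!(lo & 0x7F, MARKER);
    assert_eq!(decode_reg((lo >> 7) & 0x1F, (hi >> 17) & 0x1F), 1000); // rd
    assert_eq!(decode_reg((lo >> 15) & 0x1F, (hi >> 22) & 0x1F), 31);  // rs1
    assert_eq!(decode_reg((lo >> 20) & 0x1F, (hi >> 27) & 0x1F), 32);  // rs2
}
```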
Rather than creating separate 64-bit instruction definitions for every instruction (which would require duplicating ~30 defs with new format classes), we intercept in RISCVMCCodeEmitter.cpp. Each 32-bit instruction is re-encoded as 64-bit on-the-fly when FeatureVendorXRegs1024 is set. Register operands are read from the MCInst at their full 10-bit encoding (via getEncodingValue()) and placed into both the low 5-bit field and the high 5-bit field.
AUIPC+JALR is the standard pseudo-CALL sequence. The R_RISCV_CALL_PLT relocation requires these two instructions at consecutive 4-byte offsets — incompatible with arbitrary 8-byte re-encoding. Solution: keep AUIPC+JALR as two consecutive standard 32-bit instructions (8 bytes total). The linker fixup works at its usual offsets. The OpenVM transpiler detects this pair and combines them into a single PC-relative JAL.
Created XRegs1024TranspilerExtension. For each 2-u32 chunk:
- If `lo[6:0] == 0x3F` → decode as a 64-bit instruction, extract full 10-bit registers, produce the equivalent 7-field `Instruction<F>`.
- If `lo[6:0] == 0x17` (AUIPC) and `hi[6:0] == 0x67` (JALR) → combine into a single JAL with the combined offset (halved for PC compression).
- Otherwise → return None (fallback to standard transpilers if registered).
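The dispatch shape of that extension can be sketched as follows (the enum and function names are illustrative, not the actual `XRegs1024TranspilerExtension` code):

```rust
// Sketch of the per-chunk dispatch: classify each 2-u32 chunk by the low
// opcode bits of each word. Names are illustrative, not OpenVM's API.
#[derive(Debug, PartialEq)]
enum Decoded {
    Extended64,  // 64-bit instruction carrying 10-bit register IDs
    CombinedJal, // AUIPC+JALR call pair folded into one JAL
    None,        // let standard transpiler extensions handle it
}

fn classify(lo: u32, hi: u32) -> Decoded {
    match (lo & 0x7F, hi & 0x7F) {
        (0x3F, _) => Decoded::Extended64,     // lo[6:0] == 0b0111111 marker
        (0x17, 0x67) => Decoded::CombinedJal, // AUIPC followed by JALR
        _ => Decoded::None,
    }
}

fn main() {
    assert_eq!(classify(0xC03F, 0), Decoded::Extended64);
    assert_eq!(classify(0x17, 0x67), Decoded::CombinedJal);
    assert_eq!(classify(0x33, 0x33), Decoded::None); // plain R-type pair
}
```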
64-bit instructions take 8 bytes in the ELF but OpenVM's DEFAULT_PC_STEP = 4. Each 64-bit instruction occupies 1 PC slot (4 bytes). The transpiler maps:
- ELF byte address 0 → OpenVM PC `base + 0`
- ELF byte address 8 → OpenVM PC `base + 4`
So OpenVM PC = `base + ELF_offset / 2`. Consequently:
- Branch/jump offsets are halved in the transpiler (LLVM calculated them for 8-byte instructions).
- The ELF entry point is halved when building VmExe.
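A minimal sketch of that address arithmetic (the helper name is mine, not the transpiler's actual code):

```rust
// Sketch: every ELF slot is 8 bytes (one 64-bit instruction) but every
// OpenVM PC step is 4, so the PC advances half as fast as ELF byte offsets.
const DEFAULT_PC_STEP: u32 = 4;

fn elf_offset_to_pc(base: u32, elf_offset: u32) -> u32 {
    // Linker relaxation would violate this invariant — hence --no-relax.
    assert!(elf_offset % 8 == 0, "every slot must be 8 bytes");
    base + elf_offset / 2
}

fn main() {
    let base = 0x0020_0800; // OpenVM's expected .text address
    assert_eq!(elf_offset_to_pc(base, 0), base);
    assert_eq!(elf_offset_to_pc(base, 8), base + DEFAULT_PC_STEP);
    // A branch offset LLVM computed as +24 bytes becomes +12 in PC space.
    let llvm_branch_bytes = 24;
    assert_eq!(llvm_branch_bytes / 2, 12);
}
```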
Previously RV32_NUM_REGISTERS = 32, register file in address space 1 allocated 32 * 4 = 128 bytes. Changed to 1024 in riscv.rs; MemoryConfig now allocates RV32_NUM_REGISTERS * 4 = 4096 bytes for the register file.
Circuit execution files stored register byte offsets as u8 (max value 255). With 1024 registers, offsets go up to 4092. Changed u8 → u16 in 13 execution files across extensions/rv32im/circuit/src/*/execution.rs (base_alu, shift, less_than, branch_eq, branch_lt, mul, mulh, divrem, jalr, jal_lui, auipc, loadstore, load_sign_extend, hintstore).
Also expanded RISCV_TO_X86_OVERRIDE_MAP in aot/common.rs from a [Option<&str>; 32] to [Option<&str>; 1024] to avoid out-of-bounds access in AOT paths (entries 32-1023 are None).
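The overflow that forced the `u8 → u16` widening, in a few lines (illustrative helper, not circuit code):

```rust
// Register byte offsets in address space 1 are reg * 4. With 32 registers the
// maximum offset is 124 and fits a u8; with 1024 it reaches 4092 and needs u16.
fn reg_byte_offset(reg: u32) -> u32 {
    reg * 4
}

fn main() {
    assert_eq!(reg_byte_offset(31), 124);    // fits in u8
    assert_eq!(reg_byte_offset(1023), 4092); // exceeds u8::MAX (255)
    assert!(reg_byte_offset(1023) > u8::MAX as u32);
    assert!(reg_byte_offset(1023) <= u16::MAX as u32);
}
```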
Subtle bug: the standard from_u_type helper produces f = 0, but process_lui in the Rv32I transpiler explicitly sets f = 1 afterwards — because LUI shares a chip with JAL, and f is the "write rd" enable flag. My transpiler was missing this. Fixed in make_lui in xregs1024.rs.
Critical fix: my original R-type transpiler had a nested match where SLL/SRL/SRA/SLT/SLTU hit the outer catch-all _ => unimp() before the inner shift/lt handling could run (dead code). Flattened into a single match that covers all (funct7, funct3) combinations.
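The shape of the fix, sketched for the base RV32I R-type rows (M-extension rows omitted for brevity; the enum and handler names are illustrative, not the actual xregs1024.rs code): one flat match over `(funct7, funct3)` so no outer wildcard can shadow the shift/compare rows.

```rust
// Sketch of the flattened R-type dispatch: a single match over
// (funct7, funct3) replaces the nested match whose outer `_` arm
// swallowed SLL/SRL/SRA/SLT/SLTU before the inner arms could run.
#[derive(Debug, PartialEq)]
enum Op { Add, Sub, Sll, Slt, Sltu, Xor, Srl, Sra, Or, And, Unimp }

fn decode_rtype(funct7: u32, funct3: u32) -> Op {
    match (funct7, funct3) {
        (0x00, 0x0) => Op::Add,
        (0x20, 0x0) => Op::Sub,
        (0x00, 0x1) => Op::Sll,
        (0x00, 0x2) => Op::Slt,
        (0x00, 0x3) => Op::Sltu,
        (0x00, 0x4) => Op::Xor,
        (0x00, 0x5) => Op::Srl,
        (0x20, 0x5) => Op::Sra,
        (0x00, 0x6) => Op::Or,
        (0x00, 0x7) => Op::And,
        _ => Op::Unimp, // single catch-all, after every valid row
    }
}

fn main() {
    assert_eq!(decode_rtype(0x00, 0x1), Op::Sll); // previously dead code
    assert_eq!(decode_rtype(0x20, 0x5), Op::Sra);
    assert_eq!(decode_rtype(0x20, 0x1), Op::Unimp);
}
```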
C programs compile end-to-end:
```shell
clang --target=riscv32 -march=rv32im -O2 -ffreestanding -fno-builtin \
  -emit-llvm -S keccak.c -o keccak.ll

# Baseline
llc -march=riscv32 -mattr=+m -O2 -filetype=obj keccak.ll -o baseline.o
ld.lld --no-relax -T keccak.ld baseline.o -o baseline.elf

# Extended (1024 registers, 64-bit encoding)
llc -march=riscv32 -mattr=+m,+xregs1024 -O2 -filetype=obj keccak.ll -o extended.o
ld.lld --no-relax -T keccak.ld extended.o -o extended.elf
```

The linker script puts .text at OpenVM's expected address (0x200800) and separates .data/.rodata/.bss into distinct segments.
The standard OpenVM Rust build (cargo build --target riscv32im-risc0-zkvm-elf) uses rustc's bundled LLVM, not our fork. Attempted approaches:
- Intercept at bitcode: `RUSTFLAGS="--emit=llvm-bc"` produces one `.bc` per crate. `llvm-link` fails on duplicate `memcpy`/`memset`/`rust_begin_unwind` from `compiler_builtins`, `core`, `std`. The `--override` option helps partially but doesn't resolve all duplicates.
- Compile each `.bc` separately + link: `lld` complains about duplicates; `--allow-multiple-definition` helps, but then runtime symbols like `__rust_no_alloc_shim_is_unstable_v2` and `sys_panic` are missing.
- `-C linker-plugin-lto`: gets close, but rustc's linker invocation adds clang-specific flags incompatible with `-Ttext`.
The proper solution for Rust guests is to build a custom rustc sysroot with our LLVM fork. Not yet done.
Compile tiny-keccak crate to LLVM IR via standard rustc, then recompile with our llc:
keccakf function:
Standard: 879 instructions, 156 stack stores + 180 stack loads = 336 spill ops
Extended: 535 instructions, 0 stack ops
Reduction: 39% fewer instructions, 100% spills eliminated
See docs/keccak-benchmark.md for assembly-level comparison details.
The metrics that matter. Generated with:

```shell
OUTPUT_PATH=/tmp/metrics_baseline.json XREGS_VARIANT=baseline \
  cargo test -p openvm-toolchain-tests --test transpiler_tests \
  --features metrics -- test_xregs1024_proof_with_metrics --nocapture

OUTPUT_PATH=/tmp/metrics_extended.json XREGS_VARIANT=extended \
  cargo test -p openvm-toolchain-tests --test transpiler_tests \
  --features metrics -- test_xregs1024_proof_with_metrics --nocapture
```

The `metrics` feature enables `openvm_stark_sdk::bench::run_with_metric_collection`, which sets up a `DebuggingRecorder` and dumps a snapshot to the path in `OUTPUT_PATH` as JSON. The JSON contains gauges and counters for every AIR: constraints, interactions, trace heights, main/perm/quotient cells, and so on.
The workload is a C keccak program (bench/keccak_standalone.c) doing N iterations of keccak256(buf, buf). Baseline compiled for standard RV32IM; extended compiled with +xregs1024.
Aggregating the metrics JSON across all AIRs:
Keccak 1000 iterations (4 segments):
| Metric | Baseline | Extended | Change |
|---|---|---|---|
| `total_cells` (allocated) | 4,049,020,328 | 3,853,992,104 | -4.8% |
| `total_cells_used` (actual) | 3,176,459,749 | 2,926,616,970 | -7.9% |
| `main_cells_used` | 1,382,971,605 | 1,273,258,966 | -7.9% |
| Total proof time | 168.8s | 162.9s | -3.5% |
Keccak 100 iterations (1 segment):
| Metric | Baseline | Extended | Change |
|---|---|---|---|
| `total_cells` (allocated) | 461,105,706 | 363,573,802 | -21.2% |
| `total_cells_used` (actual) | 323,386,753 | 298,398,042 | -7.7% |
| Total proof time | 22.9s | 20.6s | -9.9% |
Important distinction: total_cells is the allocated trace matrix (heights rounded up to powers of 2 because of FRI). total_cells_used is actually-filled cells. The difference between runs:
- At 100 iter: `total_cells` drops 21% because an AIR crosses a power-of-2 boundary.
- At 1000 iter (4 segments): the rounding effects wash out and `total_cells` drops only ~5%, matching `total_cells_used` more closely.
The total_cells_used reduction stays consistently ~8% across workload sizes — that's the honest figure.
The reduction is modest because keccakf, while the dominant hot function, is surrounded by equally-costly code that doesn't benefit from extended registers: the memcpy between iterations, the loop bookkeeping, the function call overhead, the VM's memory/IO chips, etc. If the workload were pure keccakf (inlined, no function call overhead, no memcpy), the reduction would approach the 39% instruction reduction we see at the assembly level.
- Fixup compatibility: LLVM's branch/jump fixups assume standard encoding. Our low u32 preserves standard bit positions so fixups work without modification — this was the key insight that made the encoding practical.
- AUIPC+JALR addressing: PC-relative addressing combined with byte-aligned ELF offsets vs compressed OpenVM PC space is genuinely subtle. Resolved by combining the pair into a single JAL.
- Phantom gaps: the initial approach emitted `[instruction, phantom_nop]` pairs to keep ELF byte addresses matching OpenVM PCs. This worked, but the phantoms counted against trace cells (~8% worse). Removing them required halving all branch/jump/call offsets.
- Scattered u8 assumptions: the register byte offset being `u8` was hardcoded in 13+ circuit files. All needed widening to `u16`.
- Silent transpiler bugs: the R-type dead-code path silently produced `unimp()` for SLL/SRL/SRA/SLT/SLTU. Only caught by tracing TERMINATE(2) occurrences in the transpiled program.
- Linker relaxation: `ld.lld` by default rewrites 8-byte `auipc+jalr` call pairs into a 4-byte `jal` when the target is in range. The transpiler's halving arithmetic assumes every ELF slot is 8 bytes, so relaxation silently breaks alignment. Must pass `--no-relax` to the linker.
After switching the guest-side keccak from a hand-rolled C implementation to the stock tiny_sha3 library, the 1000-iteration benchmark started reporting ~38× more instructions than baseline (1.386B vs 35.9M), eventually faulting on an out-of-bounds load. Investigating the regression uncovered three distinct bugs that had been latent, each masked on simpler code paths:
The calling-convention design was "everything caller-saved except whatever the call-preserved mask says is preserved". The CSR list for XRegs1024 was empty, so the callee prologue saved nothing. But getCallPreservedMask was not changed — it still returned CSR_ILP32_LP64_RegMask, telling the register allocator that s0-s11 survive a call. The allocator believed the lie and left live values in s0/s1 across calls; the callee didn't preserve them; values got silently overwritten.
In tiny_sha3 specifically this broke the last step of sha3_final: md and c pointers kept in s0/s1, clobbered by the call to sha3_keccakf, and then the post-call memcpy used garbage pointers. The reloaded "mdlen" was enormous and the copy loop ran millions of iterations before falling off the end of memory. That's where the 38× came from.
Fix: make getCallPreservedMask return CSR_IPRA_RegMask (only x1 preserved) when hasVendorXRegs1024().
With an empty CSR list, PEI (LLVM's prologue/epilogue insertion pass) also didn't save ra in any non-leaf prologue. Every nested call overwrote ra with the inner-call return address, and the outer function's final ret jumped somewhere inside itself.
Fix: add X1 to CSR_XRegs1024. Non-leaf functions now spill ra in their prologue; everything else stays caller-saved.
Both of these had been masked in simpler call chains because the compiler happened to pick allocations that didn't need preservation. tiny_sha3 is the first case we tested where values were actually live across a call.
The 12-bit S-type immediate extractor in xregs1024.rs masked the high half with `& 0x7F` after a signed right shift:

```rust
let hi7 = ((lo as i32) >> 25) & 0x7F; // <- wipes the sign
let imm = (hi7 << 5) | lo5;
```

So −1 decoded as +4095, and `sb a, -1(ptr)` (the final-pad write of `state[rsiz-1] ^= 0x80` in `sha3_final`) stored to `ptr + 4095` — 4 KiB past the state instead of its last byte. The pad never applied; keccakf then ran on an unpadded state and produced garbage. The hash came out almost right (a 24-round permutation of the wrong initial state), which is why it looked subtly off rather than catastrophically wrong.
Fix: sign-extend the 12-bit result correctly.
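A standalone sketch of the corrected extraction (it mirrors the fix rather than quoting xregs1024.rs; variable names are mine):

```rust
// Fixed S-type immediate decode: imm[11:5] at lo[31:25], imm[4:0] at
// lo[11:7], then sign-extend the assembled 12-bit value into an i32.
fn s_imm(lo: u32) -> i32 {
    let hi7 = (lo >> 25) & 0x7F;  // imm[11:5], kept unsigned until assembly
    let lo5 = (lo >> 7) & 0x1F;   // imm[4:0]
    let imm12 = (hi7 << 5) | lo5; // 12-bit two's-complement value
    ((imm12 << 20) as i32) >> 20  // shift bit 11 to bit 31, arithmetic shift back
}

fn main() {
    // sb rs2, -1(rs1): imm = 0xFFF, i.e. hi7 = 0x7F, lo5 = 0x1F.
    let lo = (0x7F << 25) | (0x1F << 7);
    assert_eq!(s_imm(lo), -1); // the buggy masking decoded this as +4095
    // Positive offsets are unaffected by the sign extension.
    let lo_pos = 0x08 << 7;
    assert_eq!(s_imm(lo_pos), 8);
}
```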
All three bugs are on paths that basic tests never exercised:
- 1 and 2 only trigger when a non-leaf function has values live across a call.
- 3 only triggers when LLVM emits a store with a negative immediate (most stores use positive offsets).
Anyone bringing up a new codebase on top of a custom backend should run it against a proper hash function early. Keccak state + XOR pad at rsiz-1 + nested function calls through s-registers is a surprisingly thorough calling-convention fuzzer.
Full build recipe for the benchmark ELFs (tiny_sha3 source, driver, linker script, llc / ld.lld commands) is at https://gist.github.com/leonardoalt/66ab3ae06c0e21532b10d7e637afedeb. The tiny_sha3 source is byte-identical to upstream mjosaarinen/tiny_sha3@dcbb319 — no modifications.
- `docs/final-results.md` — bottom-line summary
- `docs/keccak-benchmark.md` — assembly-level keccakf comparison
- `docs/encoding-format.md` — 64-bit encoding reference
- `docs/openvm-integration.md` — integration plan and flow
- `docs/progress-log.md` — chronological progress log
- `docs/thoughts.md` — design analysis and rationale
- `docs/phase-{0-7}-*.md` — per-phase implementation notes