@leonardoalt
leonardoalt / metrics_combined_1000_v2.json
Created April 16, 2026 22:01
xregs1024 v2: corrected tiny_sha3 proof metrics (baseline vs extended, 1000 SHA3-256 iter, 4 segments)
{"baseline": {"counter": [{"labels": [["air_name", "ProgramAir"]], "metric": "quotient_deg", "value": "1"}, {"labels": [["air_name", "ProgramAir"]], "metric": "constraints", "value": "4"}, {"labels": [["air_name", "ProgramAir"]], "metric": "interactions", "value": "1"}, {"labels": [["air_name", "VmConnectorAir"]], "metric": "quotient_deg", "value": "2"}, {"labels": [["air_name", "VmConnectorAir"]], "metric": "constraints", "value": "11"}, {"labels": [["air_name", "VmConnectorAir"]], "metric": "interactions", "value": "5"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"]], "metric": "quotient_deg", "value": "2"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"]], "metric": "constraints", "value": "7"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"]], "metric": "interactions", "value": "3"}, {"labels": [["air_name", "MemoryMerkleAir<8>"]], "metric": "quotient_deg", "value": "2"}, {"labels": [["air_name", "MemoryMerkleAir<8>"]], "metric": "constraints", "value": "39"}, {"labels": [["air_name", "Me
@leonardoalt
leonardoalt / xregs1024_blog.md
Last active April 16, 2026 22:02
Blog post: giving RISC-V 1024 registers for zkVMs

Giving RISC-V 1024 registers for zkVMs

zkVMs like OpenVM pick RISC-V because it's simple. It's a clean ISA, well-understood, with mature toolchains. But RISC-V was designed for hardware: 32 general-purpose registers is a reasonable number when spills go to cache and cost you a few cycles.

In a zkVM, memory is expensive. Every load and store has to be proven, and memory consistency constraints dominate the proof cost of compute-heavy programs. That 32-register limit, harmless on silicon, becomes a bottleneck.

At Powdr we already built crush, which compiles from WASM to a custom ISA with infinite registers and zero spills. That's the clean solution. But I got curious about a different question: what if we just took plain old RISC-V and gave it a bigger register file? LLVM IR already uses infinite virtual registers internally. The register allocator's job is to map them to a finite physical register set. If we give it 1024 regist

@leonardoalt
leonardoalt / xregs1024_build.md
Created April 16, 2026 21:56
xregs1024 benchmark build recipe (tiny_sha3 driver + llc/ld.lld commands)

xregs1024 benchmark build recipe

All artifacts compiled from the stock mjosaarinen/tiny_sha3 library (sha3.c, sha3.h at commit dcbb319, byte-identical to what we use — no modifications) plus a small freestanding driver that runs 1000 SHA3-256 iterations on a buffer starting as 32 zero bytes.

Verified 1000-iteration reference hash: 52cf48e88ce4dea40f272b6aaf083675ade26504a0129f51ec30204a2fdb1c5b (computed with Python hashlib.sha3_256; both the baseline and extended ELFs produce this after 1000 iterations).
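The reference value can be reproduced with a few lines of Python (assumption: each iteration replaces the 32-byte buffer with the SHA3-256 digest of its previous contents, matching the driver described above):

```python
import hashlib

# Buffer starts as 32 zero bytes; each of the 1000 iterations replaces it
# with the SHA3-256 digest of its previous contents (assumed scheme).
buf = bytes(32)
for _ in range(1000):
    buf = hashlib.sha3_256(buf).digest()

# If the iteration scheme matches the driver, this prints the reference
# hash quoted above.
print(buf.hex())
```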

Driver (tiny_sha3_driver.c)

#include "tiny_sha3/sha3.h"
@leonardoalt
leonardoalt / xregs1024_summary.md
Last active April 16, 2026 22:02
RISCV-X: 1024 registers for zkVM — technical summary with links

RISCV-X: 1024 Registers for zkVM — Technical Summary

Experiment: modify the LLVM RISC-V backend to support 1024 general-purpose registers (up from 32), integrate with the OpenVM zkVM framework, and measure proof cost reduction. Goal: eliminate register spills in compute-heavy zkVM workloads where memory operations dominate proof cost.
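The intuition behind the experiment can be captured with a toy liveness model: with k physical registers, any virtual registers live beyond k at a program point must be spilled to memory. The liveness profile below is hypothetical, purely for illustration:

```python
# Toy model: spills at a program point = live values beyond the k physical
# registers available. The liveness numbers are hypothetical, not measured.
def peak_spills(live_counts, k):
    """live_counts[i] = simultaneously-live virtual registers at point i."""
    return max(max(0, n - k) for n in live_counts)

live = [10, 40, 120, 80, 25]           # hypothetical liveness profile
assert peak_spills(live, 32) == 88     # baseline RISC-V: must spill
assert peak_spills(live, 1024) == 0    # 1024 registers: no spills
```

In a zkVM each spill is a proven load/store pair, so driving this count to zero removes exactly the memory traffic that dominates proof cost.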

Result Headline (updated after correctness fixes)

For 1000 SHA3-256 iterations through OpenVM with actual STARK proofs, 4 segments, aggregated across all AIRs:

@leonardoalt
leonardoalt / evm_transpilation_targets.md
Created April 8, 2026 18:40
EVM Bytecode Transpilation Targets for zkVM Proving (crush, RISC-V, WASM, custom IR)

EVM Bytecode Transpilation Targets for zkVM Proving

Problem statement

To build a faster zkEVM, we must start from EVM bytecode — not Solidity source. Any deployed contract, any bytecode, must be provable. The two approaches are:

  1. Interpret — run an EVM interpreter (e.g., revm) inside a zkVM. This is what we do today. The zkVM proves the interpreter.
  2. Transpile — translate EVM bytecode into a target ISA and prove that directly. The zkVM proves the translated program.

The interpreter approach has ~2-5x overhead from dispatch, stack checks, and stack I/O (see overhead analysis). Can we do better by transpiling?
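The source of that overhead is easy to see in a toy stack machine (opcodes and semantics below are illustrative, not real EVM): the interpreter pays a fetch, a PC bump, and a dispatch per opcode, while the transpiled version is just the underlying arithmetic.

```python
# Toy stack-machine interpreter vs. direct compilation of the same program.
PUSH, ADD, MUL, HALT = range(4)

def interpret(code):
    """Dispatch loop: every opcode pays fetch + PC advance + dispatch."""
    stack, pc = [], 0
    while True:
        op = code[pc]          # 1. fetch opcode
        pc += 1                # 2. advance PC
        if op == HALT:         # 3. dispatch on opcode
            return stack.pop()
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)

# "(2 + 3) * 4" as bytecode vs. compiled straight to the host operations.
prog = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
assert interpret(prog) == (2 + 3) * 4 == 20
```

Transpilation amortizes the fetch/dispatch/stack-shuffling away at translation time, which is where the hoped-for 2-5x comes from.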

@leonardoalt
leonardoalt / interpreter_overhead_analysis.md
Last active April 8, 2026 18:36
EVM Interpreter Overhead vs Direct Compilation: RISC-V Instruction-Level Analysis

EVM Interpreter Overhead vs Direct Compilation: Instruction-Level Analysis

The interpreter loop (per opcode)

From revm-interpreter source (interpreter.rs:282), every single EVM opcode goes through this path:

fn step(&mut self, instruction_table: &InstructionTable, host: &mut H) {
    let opcode = self.bytecode.opcode();        // 1. Fetch opcode byte
    self.bytecode.relative_jump(1);             // 2. Advance PC
@leonardoalt
leonardoalt / AddMul.sol
Last active April 8, 2026 17:49
Solidity → Yul → LLVM IR → RISC-V compilation pipeline (uint256 addMul)
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract AddMul {
    function addMul(uint256 x, uint256 y, uint256 z) public pure returns (uint256) {
        return (x + y) * z;
    }
}
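Since Solidity 0.8, `+` and `*` are checked uint256 operations that revert on overflow, which is what the generated Yul and LLVM IR must encode. A Python sketch of those semantics (hand-written model, not compiler output):

```python
U256 = 1 << 256

def add_mul(x: int, y: int, z: int) -> int:
    """Checked uint256 (x + y) * z, mirroring Solidity >=0.8 semantics."""
    s = x + y
    if s >= U256:
        raise OverflowError("addition overflows uint256")        # Solidity reverts
    p = s * z
    if p >= U256:
        raise OverflowError("multiplication overflows uint256")  # Solidity reverts
    return p

assert add_mul(2, 3, 4) == 20
```

The two overflow branches are why the lowered code is longer than the one-line body suggests: each checked operation compiles to the arithmetic plus a compare-and-revert.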
@leonardoalt
leonardoalt / 010-additional-ideas.md
Created April 7, 2026 23:09
010 - Additional Ideas Tested

010 - Additional Ideas Tested

Ideas Tested After Initial 8 Ideas

Increase parallel stream count (4 -> 6/8 threads)

  • Result: OOM at 6 and 8 threads. Peak concurrent GPU memory from intermediate buffers exceeds 24GB.
  • Root cause: Each thread's AIR evaluation allocates temp_sums and intermediates buffers that coexist across threads.
  • Conclusion: 4 threads is the maximum for RTX 4090 with current buffer management.
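The OOM is just linear scaling of coexisting per-thread buffers against a fixed budget. A back-of-envelope sketch with hypothetical sizes (the real per-thread footprint was not recorded here):

```python
# Hypothetical numbers for illustration only: per-thread temp_sums +
# intermediates coexist across threads, so peak memory is linear in threads.
GB = 1 << 30
per_thread_buffers = 5 * GB   # assumed temp buffer footprint per thread
baseline = 3 * GB             # assumed fixed allocations (traces, etc.)
budget = 24 * GB              # RTX 4090

def fits(threads):
    return baseline + threads * per_thread_buffers <= budget

assert fits(4)        # 3 + 20 = 23 GB: OK
assert not fits(6)    # 3 + 30 = 33 GB: OOM
```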

Cache logup Round 0 DAG rules at keygen

@leonardoalt
leonardoalt / 009-round0-batching.md
Created April 7, 2026 23:09
009 - Round 0 Kernel Batching (Attempted)

009 - Full Round 0 CUDA Kernel Batching (Attempted)

Idea

Batch the 539 zerocheck_ntt_evaluate_constraints_coset_parallel and 396 logup_r0_ntt_eval_interactions_coset_parallel kernel launches into single batched launches, similar to how GKR input evaluation was batched.

Changes Attempted

  • Added ZerocheckR0Ctx struct and R0BlockCtx to CUDA kernel and FFI
  • Implemented batch_zerocheck_r0_coset_parallel_kernel with per-AIR context dispatch
  • Added evaluate_round0_constraints_gpu_batched Rust function
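The batching pattern itself, stripped of CUDA details, is replacing N launches (each paying fixed launch overhead) with one launch whose kernel dispatches on a per-AIR context table. A Python sketch of the idea (function and field names are illustrative stand-ins for the structs above):

```python
# Stand-in for a kernel launch; each call models fixed launch overhead.
def launch(kernel, ctx):
    return kernel(ctx)

def eval_constraints(ctx):
    """Stand-in for per-AIR constraint evaluation."""
    return sum(ctx["coeffs"])

contexts = [{"air_id": i, "coeffs": [i, i + 1]} for i in range(539)]

# Unbatched: 539 launches, each with its own overhead.
unbatched = [launch(eval_constraints, c) for c in contexts]

# Batched: one launch; the kernel indexes into the context table itself,
# analogous to per-block dispatch via an array of R0BlockCtx entries.
def batched_kernel(ctxs):
    return [eval_constraints(c) for c in ctxs]

assert launch(batched_kernel, contexts) == unbatched
```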
@leonardoalt
leonardoalt / metrics_optimized_combined.json
Last active April 7, 2026 22:24
Optimized metrics: pairing apc{0,100,300} separate experiments (branch 002-parallel-streams)
This file has been truncated.
{"metrics_optimized_apc000": {"counter": [{"labels": [["air_name", "ProgramAir"], ["air_id", "0"]], "metric": "constraint_deg", "value": "1"}, {"labels": [["air_name", "ProgramAir"], ["air_id", "0"]], "metric": "constraints", "value": "0"}, {"labels": [["air_name", "ProgramAir"], ["air_id", "0"]], "metric": "interactions", "value": "1"}, {"labels": [["air_name", "VmConnectorAir"], ["air_id", "1"]], "metric": "constraint_deg", "value": "3"}, {"labels": [["air_name", "VmConnectorAir"], ["air_id", "1"]], "metric": "constraints", "value": "8"}, {"labels": [["air_name", "VmConnectorAir"], ["air_id", "1"]], "metric": "interactions", "value": "5"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"], ["air_id", "2"]], "metric": "constraint_deg", "value": "3"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"], ["air_id", "2"]], "metric": "constraints", "value": "3"}, {"labels": [["air_name", "PersistentBoundaryAir<8>"], ["air_id", "2"]], "metric": "interactions", "value": "4"}, {"labels": [["air_name", "MemoryMe