@leonardoalt · Created April 7, 2026 23:09
009 - Full Round 0 CUDA Kernel Batching (Attempted)

Idea

Batch the 539 zerocheck_ntt_evaluate_constraints_coset_parallel and 396 logup_r0_ntt_eval_interactions_coset_parallel kernel launches into a single batched launch each, similar to how GKR input evaluation was batched.

Changes Attempted

  • Added ZerocheckR0Ctx struct and R0BlockCtx to CUDA kernel and FFI
  • Implemented batch_zerocheck_r0_coset_parallel_kernel with per-AIR context dispatch
  • Added evaluate_round0_constraints_gpu_batched Rust function

Issues Encountered

1. cudaErrorLaunchOutOfResources

With 128 threads/block, the batched kernel exceeded the per-block register limit: each thread loads the large ZerocheckR0Ctx struct (~160 bytes), an NttEvalContext, and its intermediates.

Reducing to 64 threads/block resolved the launch error.

2. GPU Memory (OOM)

The GLOBAL intermediate buffer used gridDim.x * gridDim.y * blockDim.x as its indexing stride; with total_blocks=500+ and max_num_cosets=3 this produced enormous stride values. Switching to per-AIR local indexing did not fully resolve the issue.

The batched approach requires all per-AIR intermediate buffers and output buffers to coexist on the GPU, pushing peak memory beyond 24GB for APC300.

3. Alternative: Parallel streams already effective

The parallel CUDA streams approach (Idea 002) already reduces Round 0 from 570ms to 280ms by overlapping 935 per-AIR kernel launches across 4 concurrent streams. This achieves ~40% GPU utilization for Round 0 without kernel batching.

Full kernel batching could raise GPU utilization to ~80%+, but the complexity and memory issues make it impractical without deeper changes to intermediate buffer management.

Result

Reverted. Parallel streams remain the best approach for Round 0 on the current codebase.

Recommendations for Future Work

  1. Implement a shared intermediate buffer pool across AIRs in the batched kernel
  2. Use cooperative groups or grid-stride loops to share intermediates
  3. Consider a hybrid: batch small AIRs (height < 256, no intermediates needed) while keeping large AIRs on per-AIR paths