Batch the 539 zerocheck_ntt_evaluate_constraints_coset_parallel and 396 logup_r0_ntt_eval_interactions_coset_parallel kernel launches into single batched launches, similar to how GKR input evaluation was batched.
- Added ZerocheckR0Ctx struct and R0BlockCtx to the CUDA kernel and FFI
- Implemented batch_zerocheck_r0_coset_parallel_kernel with per-AIR context dispatch
- Added evaluate_round0_constraints_gpu_batched Rust function
With 128 threads/block, the batched kernel exceeded the per-block register limit: each thread loads the large ZerocheckR0Ctx struct (~160 bytes), plus an NttEvalContext, plus intermediate values.
Reducing to 64 threads/block resolved the launch error.
The global intermediate-buffer indexing used gridDim.x * gridDim.y * blockDim.x as the stride; with total_blocks = 500+ and max_num_cosets = 3, this produced enormous stride values. Switching to per-AIR local indexing did not fully resolve the issue.
The batched approach requires all per-AIR intermediate and output buffers to coexist on the GPU, pushing peak memory beyond 24 GB for APC300.
The parallel CUDA streams approach (Idea 002) already reduces Round 0 from 570ms to 280ms by overlapping 935 per-AIR kernel launches across 4 concurrent streams. This achieves ~40% GPU utilization for Round 0 without kernel batching.
Full kernel batching would improve GPU utilization to roughly 80% or more, but the complexity and memory issues make it impractical without deeper changes to intermediate-buffer management.
Reverted. Parallel streams remain the best approach for Round 0 on the current codebase.
- Implement a shared intermediate buffer pool across AIRs in the batched kernel
- Use cooperative groups or grid-stride loops to share intermediates
- Consider a hybrid: batch small AIRs (height < 256, no intermediates needed) while keeping large AIRs on per-AIR paths