Batch the 539 zerocheck_ntt_evaluate_constraints_coset_parallel and 396 logup_r0_ntt_eval_interactions_coset_parallel kernel launches into single batched launches, similar to how GKR input evaluation was batched.
- Added ZerocheckR0Ctx struct and R0BlockCtx to the CUDA kernel and FFI
- Implemented batch_zerocheck_r0_coset_parallel_kernel with per-AIR context dispatch
- Added evaluate_round0_constraints_gpu_batched Rust function
With 128 threads/block, the batched kernel exceeded the per-block register limit: each thread loads the large ZerocheckR0Ctx struct (~160 bytes), plus an NttEvalContext, plus intermediate values.
Reducing to 64 threads/block resolved the launch error.
The global intermediate-buffer indexing used gridDim.x * gridDim.y * blockDim.x as the stride; with total_blocks = 500+ and max_num_cosets = 3, this produced enormous stride values. Switching to per-AIR local indexing did not fully resolve the issue.
The batched approach requires all per-AIR intermediate and output buffers to coexist on the GPU, pushing peak memory beyond 24 GB for APC300.
The parallel CUDA streams approach (Idea 002) already reduces Round 0 from 570ms to 280ms by overlapping 935 per-AIR kernel launches across 4 concurrent streams. This achieves ~40% GPU utilization for Round 0 without kernel batching.
Full kernel batching would improve GPU utilization to roughly 80% or more, but the complexity and memory issues make it impractical without deeper changes to intermediate-buffer management.
Reverted. Parallel streams remain the best approach for Round 0 on the current codebase.
- Implement a shared intermediate buffer pool across AIRs in the batched kernel
- Use cooperative groups or grid-stride loops to share intermediates
- Consider a hybrid: batch small AIRs (height < 256, no intermediates needed) while keeping large AIRs on per-AIR paths