@maaquib
Created September 5, 2025 15:41
PrefilPerfComparision
Benchmark Configuration:
Batch sizes: [1]
Sequence lengths: [32, 64, 128, 256, 512, 1024, 1536, 2048, 4096, 8192, 16384]
Number of heads: [16, 32, 64, 128]
Head dimensions: [64, 128]
Causal: True
Data type: torch.bfloat16
Warmup iterations: 10
Benchmark iterations: 100
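
For reference, a minimal sketch of how a harness with this configuration might time one case, assuming PyTorch with CUDA-event timing; the function and variable names below are illustrative, not the gist's actual script:

import torch
import torch.nn.functional as F

# Mirrors the configuration above; names are illustrative, not the gist's script.
DTYPE = torch.bfloat16
WARMUP_ITERS, BENCH_ITERS = 10, 100

def bench_sdpa(batch, seq_len, num_heads, head_dim):
    """Time causal SDPA prefill with CUDA events; returns (mean_ms, std_ms)."""
    q, k, v = (torch.randn(batch, num_heads, seq_len, head_dim,
                           device="cuda", dtype=DTYPE) for _ in range(3))
    for _ in range(WARMUP_ITERS):  # warm-up: trigger kernel selection/autotuning
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(BENCH_ITERS):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))  # elapsed_time returns milliseconds
    t = torch.tensor(times_ms)
    return t.mean().item(), t.std().item()

# Example: the [1/88] case below
# mean_ms, std_ms = bench_sdpa(1, 32, 16, 64)

The same warmup-then-measure loop would wrap the FlashInfer, Flash Attention, and Max calls that appear in the log.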
================================================================================
Running benchmarks on NVIDIA B200
================================================================================
[1/88] Testing: seq_len=32, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.002 ms
flash_attn : 0.025 ± 0.003 ms
max : 0.032 ± 0.002 ms
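
A note on the recurring validation message: max_diff=0.015625 is exactly 2**-6, about two ulps for bfloat16 values of order one (bf16 machine epsilon is 2**-7 = 0.0078125), so these failures look like ordinary cross-kernel rounding noise against a tolerance sized for fp32, rather than a correctness bug. A dtype-aware re-check might look like this sketch; the function name and tolerances are assumptions, not the gist's validation code:

import torch

def outputs_match(ref: torch.Tensor, out: torch.Tensor) -> bool:
    """Compare two attention outputs with tolerances sized to bf16 precision."""
    max_diff = (ref.float() - out.float()).abs().max().item()
    print(f"max_diff={max_diff}")
    # atol/rtol are assumptions: a couple of bf16 ulps (eps = 2**-7 = 0.0078125)
    return torch.allclose(ref.float(), out.float(), atol=2e-2, rtol=2e-2)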
[2/88] Testing: seq_len=32, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.002 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.051 ± 0.001 ms
[3/88] Testing: seq_len=32, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.052 ± 0.002 ms
[4/88] Testing: seq_len=32, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.000 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.092 ± 0.001 ms
[5/88] Testing: seq_len=32, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.092 ± 0.001 ms
[6/88] Testing: seq_len=32, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.017 ± 0.026 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.174 ± 0.001 ms
[7/88] Testing: seq_len=32, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.019 ± 0.002 ms
flashinfer : 0.018 ± 0.002 ms
flash_attn : 0.035 ± 0.012 ms
max : 0.177 ± 0.002 ms
[8/88] Testing: seq_len=32, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.027 ms
flashinfer : 0.014 ± 0.002 ms
flash_attn : 0.027 ± 0.008 ms
max : 0.340 ± 0.001 ms
[9/88] Testing: seq_len=64, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.013 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.050 ± 0.001 ms
[10/88] Testing: seq_len=64, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.088 ± 0.001 ms
[11/88] Testing: seq_len=64, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.090 ± 0.001 ms
[12/88] Testing: seq_len=64, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.023 ± 0.001 ms
max : 0.169 ± 0.001 ms
[13/88] Testing: seq_len=64, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.002 ms
flash_attn : 0.027 ± 0.005 ms
max : 0.172 ± 0.002 ms
[14/88] Testing: seq_len=64, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.002 ms
flashinfer : 0.016 ± 0.002 ms
flash_attn : 0.029 ± 0.005 ms
max : 0.332 ± 0.001 ms
[15/88] Testing: seq_len=64, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.003 ms
flashinfer : 0.017 ± 0.003 ms
flash_attn : 0.033 ± 0.003 ms
max : 0.340 ± 0.001 ms
[16/88] Testing: seq_len=64, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.665 ± 0.002 ms
[17/88] Testing: seq_len=128, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.018 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.088 ± 0.001 ms
[18/88] Testing: seq_len=128, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.002 ms
flashinfer : 0.018 ± 0.003 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.160 ± 0.001 ms
[19/88] Testing: seq_len=128, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.170 ± 0.001 ms
[20/88] Testing: seq_len=128, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.323 ± 0.002 ms
[22/88] Testing: seq_len=128, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.003 ms
max : 0.647 ± 0.001 ms
[23/88] Testing: seq_len=128, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.001 ms
flashinfer : 0.017 ± 0.029 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.669 ± 0.001 ms
[24/88] Testing: seq_len=128, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.017 ± 0.028 ms
flash_attn : 0.024 ± 0.001 ms
max : 1.314 ± 0.001 ms
[25/88] Testing: seq_len=256, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.172 ± 0.001 ms
[26/88] Testing: seq_len=256, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.029 ± 0.001 ms
max : 0.324 ± 0.001 ms
[27/88] Testing: seq_len=256, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.336 ± 0.001 ms
[28/88] Testing: seq_len=256, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.649 ± 0.001 ms
[29/88] Testing: seq_len=256, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.002 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.671 ± 0.001 ms
[30/88] Testing: seq_len=256, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.002 ms
flashinfer : 0.016 ± 0.001 ms
flash_attn : 0.026 ± 0.001 ms
max : 1.315 ± 0.002 ms
[31/88] Testing: seq_len=256, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.003 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.026 ± 0.003 ms
max : 1.337 ± 0.026 ms
[32/88] Testing: seq_len=256, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.019 ± 0.032 ms
flashinfer : 0.023 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 2.622 ± 0.035 ms
[33/88] Testing: seq_len=512, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.338 ± 0.001 ms
[34/88] Testing: seq_len=512, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.020 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.651 ± 0.002 ms
[35/88] Testing: seq_len=512, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.019 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 0.673 ± 0.002 ms
[36/88] Testing: seq_len=512, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.029 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 1.318 ± 0.002 ms
[37/88] Testing: seq_len=512, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.021 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 1.341 ± 0.006 ms
[38/88] Testing: seq_len=512, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.001 ms
flashinfer : 0.036 ± 0.001 ms
flash_attn : 0.034 ± 0.002 ms
max : 2.642 ± 0.005 ms
[39/88] Testing: seq_len=512, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.035 ± 0.028 ms
flash_attn : 0.033 ± 0.002 ms
max : 2.672 ± 0.008 ms
[40/88] Testing: seq_len=512, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.030 ± 0.019 ms
flashinfer : 0.053 ± 0.005 ms
flash_attn : 0.051 ± 0.002 ms
max : 5.274 ± 0.011 ms
[41/88] Testing: seq_len=1024, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.026 ± 0.001 ms
flash_attn : 0.026 ± 0.003 ms
max : 0.679 ± 0.002 ms
[42/88] Testing: seq_len=1024, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.023 ± 0.001 ms
flashinfer : 0.041 ± 0.001 ms
flash_attn : 0.036 ± 0.002 ms
max : 1.324 ± 0.002 ms
[43/88] Testing: seq_len=1024, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.034 ± 0.001 ms
flash_attn : 0.035 ± 0.002 ms
max : 1.355 ± 0.004 ms
[44/88] Testing: seq_len=1024, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.024 ± 0.001 ms
flashinfer : 0.060 ± 0.001 ms
flash_attn : 0.058 ± 0.002 ms
max : 2.656 ± 0.013 ms
[45/88] Testing: seq_len=1024, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.030 ± 0.001 ms
flashinfer : 0.050 ± 0.002 ms
flash_attn : 0.050 ± 0.002 ms
max : 2.689 ± 0.006 ms
[46/88] Testing: seq_len=1024, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.040 ± 0.001 ms
flashinfer : 0.088 ± 0.003 ms
flash_attn : 0.084 ± 0.003 ms
max : 5.302 ± 0.008 ms
[47/88] Testing: seq_len=1024, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.045 ± 0.001 ms
flashinfer : 0.078 ± 0.002 ms
flash_attn : 0.078 ± 0.002 ms
max : 5.408 ± 0.020 ms
[48/88] Testing: seq_len=1024, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.063 ± 0.018 ms
flashinfer : 0.137 ± 0.003 ms
flash_attn : 0.135 ± 0.005 ms
max : 13.151 ± 0.026 ms
[49/88] Testing: seq_len=1536, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.027 ± 0.002 ms
flashinfer : 0.048 ± 0.001 ms
flash_attn : 0.046 ± 0.002 ms
max : 1.030 ± 0.004 ms
[50/88] Testing: seq_len=1536, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.029 ± 0.001 ms
flashinfer : 0.084 ± 0.001 ms
flash_attn : 0.081 ± 0.002 ms
max : 2.001 ± 0.007 ms
[51/88] Testing: seq_len=1536, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.035 ± 0.002 ms
flashinfer : 0.054 ± 0.002 ms
flash_attn : 0.054 ± 0.003 ms
max : 2.037 ± 0.004 ms
[52/88] Testing: seq_len=1536, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.039 ± 0.002 ms
flashinfer : 0.096 ± 0.005 ms
flash_attn : 0.092 ± 0.004 ms
max : 3.987 ± 0.010 ms
[53/88] Testing: seq_len=1536, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.046 ± 0.002 ms
flashinfer : 0.086 ± 0.002 ms
flash_attn : 0.085 ± 0.002 ms
max : 4.060 ± 0.011 ms
[54/88] Testing: seq_len=1536, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.065 ± 0.002 ms
flashinfer : 0.155 ± 0.004 ms
flash_attn : 0.150 ± 0.004 ms
max : 8.797 ± 0.021 ms
[55/88] Testing: seq_len=1536, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.075 ± 0.002 ms
flashinfer : 0.150 ± 0.021 ms
flash_attn : 0.145 ± 0.003 ms
max : 9.515 ± 0.018 ms
[56/88] Testing: seq_len=1536, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.101 ± 0.011 ms
flashinfer : 0.263 ± 0.004 ms
flash_attn : 0.253 ± 0.004 ms
max : 21.508 ± 0.010 ms
[57/88] Testing: seq_len=2048, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.032 ± 0.001 ms
flashinfer : 0.060 ± 0.001 ms
flash_attn : 0.059 ± 0.002 ms
max : 1.373 ± 0.004 ms
[58/88] Testing: seq_len=2048, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.036 ± 0.001 ms
flashinfer : 0.109 ± 0.001 ms
flash_attn : 0.103 ± 0.002 ms
max : 2.665 ± 0.005 ms
[59/88] Testing: seq_len=2048, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.046 ± 0.001 ms
flashinfer : 0.085 ± 0.003 ms
flash_attn : 0.082 ± 0.003 ms
max : 2.724 ± 0.006 ms
[60/88] Testing: seq_len=2048, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.051 ± 0.002 ms
flashinfer : 0.148 ± 0.005 ms
flash_attn : 0.144 ± 0.006 ms
max : 5.335 ± 0.015 ms
[61/88] Testing: seq_len=2048, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.064 ± 0.002 ms
flashinfer : 0.136 ± 0.003 ms
flash_attn : 0.132 ± 0.003 ms
max : 5.482 ± 0.014 ms
[62/88] Testing: seq_len=2048, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.093 ± 0.003 ms
flashinfer : 0.245 ± 0.006 ms
flash_attn : 0.235 ± 0.006 ms
max : 13.294 ± 0.017 ms
[63/88] Testing: seq_len=2048, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.114 ± 0.002 ms
flashinfer : 0.238 ± 0.003 ms
flash_attn : 0.232 ± 0.003 ms
max : 14.078 ± 0.026 ms
[64/88] Testing: seq_len=2048, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.152 ± 0.007 ms
flashinfer : 0.427 ± 0.005 ms
flash_attn : 0.411 ± 0.005 ms
max : 29.816 ± 0.133 ms
[65/88] Testing: seq_len=4096, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.077 ± 0.001 ms
flashinfer : 0.154 ± 0.009 ms
flash_attn : 0.151 ± 0.006 ms
max : 2.798 ± 0.006 ms
[66/88] Testing: seq_len=4096, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.084 ± 0.002 ms
flashinfer : 0.286 ± 0.010 ms
flash_attn : 0.266 ± 0.010 ms
max : 5.403 ± 0.010 ms
[67/88] Testing: seq_len=4096, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.104 ± 0.002 ms
flashinfer : 0.251 ± 0.006 ms
flash_attn : 0.241 ± 0.005 ms
max : 5.612 ± 0.009 ms
[68/88] Testing: seq_len=4096, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.120 ± 0.002 ms
flashinfer : 0.458 ± 0.010 ms
flash_attn : 0.436 ± 0.009 ms
max : 13.537 ± 0.013 ms
[69/88] Testing: seq_len=4096, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.192 ± 0.002 ms
flashinfer : 0.440 ± 0.005 ms
flash_attn : 0.423 ± 0.006 ms
max : 14.300 ± 0.024 ms
[70/88] Testing: seq_len=4096, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.261 ± 0.005 ms
flashinfer : 0.806 ± 0.009 ms
flash_attn : 0.767 ± 0.011 ms
max : 30.013 ± 0.016 ms
[71/88] Testing: seq_len=4096, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.371 ± 0.001 ms
flashinfer : 0.836 ± 0.020 ms
flash_attn : 0.789 ± 0.005 ms
max : 30.959 ± 0.015 ms
[72/88] Testing: seq_len=4096, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.479 ± 0.004 ms
flashinfer : 1.590 ± 0.075 ms
flash_attn : 1.428 ± 0.015 ms
max : 60.491 ± 0.017 ms
[73/88] Testing: seq_len=8192, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.185 ± 0.002 ms
flashinfer : 0.482 ± 0.011 ms
flash_attn : 0.461 ± 0.011 ms
max : 5.840 ± 0.028 ms
[74/88] Testing: seq_len=8192, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.215 ± 0.001 ms
flashinfer : 0.919 ± 0.036 ms
flash_attn : 0.845 ± 0.021 ms
max : 13.872 ± 0.011 ms
[75/88] Testing: seq_len=8192, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.347 ± 0.002 ms
flashinfer : 0.850 ± 0.011 ms
flash_attn : 0.808 ± 0.012 ms
max : 14.761 ± 0.013 ms
[76/88] Testing: seq_len=8192, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.432 ± 0.002 ms
flashinfer : 1.655 ± 0.114 ms
flash_attn : 1.478 ± 0.019 ms
max : 30.551 ± 0.250 ms
[77/88] Testing: seq_len=8192, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.677 ± 0.002 ms
flashinfer : 1.622 ± 0.049 ms
flash_attn : 1.500 ± 0.011 ms
max : 31.808 ± 0.009 ms
[78/88] Testing: seq_len=8192, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.923 ± 0.049 ms
flashinfer : 3.053 ± 0.260 ms
flash_attn : 2.747 ± 0.019 ms
max : 61.206 ± 0.016 ms
[79/88] Testing: seq_len=8192, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.436 ± 0.099 ms
flashinfer : 3.156 ± 0.185 ms
flash_attn : 2.886 ± 0.010 ms
max : 63.581 ± 0.013 ms
[80/88] Testing: seq_len=8192, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.805 ± 0.113 ms
flashinfer : 5.778 ± 0.427 ms
flash_attn : 5.291 ± 0.017 ms
max : 122.288 ± 0.024 ms
[81/88] Testing: seq_len=16384, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.660 ± 0.002 ms
flashinfer : 1.673 ± 0.024 ms
flash_attn : 1.582 ± 0.018 ms
max : 15.658 ± 0.025 ms
[82/88] Testing: seq_len=16384, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.820 ± 0.023 ms
flashinfer : 3.388 ± 0.445 ms
flash_attn : 2.903 ± 0.036 ms
max : 31.268 ± 0.008 ms
[83/88] Testing: seq_len=16384, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.321 ± 0.036 ms
flashinfer : 3.140 ± 0.076 ms
flash_attn : 2.940 ± 0.021 ms
max : 33.547 ± 0.019 ms
[84/88] Testing: seq_len=16384, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.769 ± 0.093 ms
flashinfer : 5.928 ± 0.457 ms
flash_attn : 5.400 ± 0.030 ms
max : 62.747 ± 0.012 ms
[85/88] Testing: seq_len=16384, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 2.787 ± 0.102 ms
flashinfer : 6.016 ± 0.086 ms
flash_attn : 5.658 ± 0.020 ms
max : 66.786 ± 0.024 ms
[86/88] Testing: seq_len=16384, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 3.593 ± 0.112 ms
flashinfer : 11.100 ± 0.046 ms
flash_attn : 10.406 ± 0.038 ms
max : 124.996 ± 0.025 ms
[87/88] Testing: seq_len=16384, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 5.960 ± 0.187 ms
flashinfer : 11.750 ± 0.096 ms
flash_attn : 11.091 ± 0.021 ms
max : 133.292 ± 0.240 ms
[88/88] Testing: seq_len=16384, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 7.280 ± 0.272 ms
flashinfer : 21.756 ± 0.031 ms
flash_attn : 20.401 ± 0.032 ms
max : 249.674 ± 0.035 ms
Detailed results saved to: ./flash_benchmark_results_20250905_020930.csv
Comparison table saved to: ./flash_benchmark_comparison_20250905_020930.csv
================================================================================
BENCHMARK SUMMARY
================================================================================
Average Latencies by Implementation (mean latency, ms):
flash_attn       0.964
flashinfer       1.031
max             16.071
pytorch_sdpa     0.384
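
These averages can be reproduced from the saved per-run CSV with a pandas groupby (the original summary was printed as a pandas Series named mean_latency_ms); the column names ("implementation", "mean_latency_ms") are assumptions inferred from that output:

import pandas as pd

# Column names are assumptions inferred from the printed summary, not a documented schema.
df = pd.read_csv("./flash_benchmark_results_20250905_020930.csv")
print(df.groupby("implementation")["mean_latency_ms"].mean().round(3))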
================================================================================
SPEEDUP SUMMARY (relative to PyTorch SDPA)
================================================================================
flashinfer:
  Average speedup: 0.660x
  Median speedup:  0.562x
  Max speedup:     1.197x
  Min speedup:     0.234x
flash_attn:
  Average speedup: 0.509x
  Median speedup:  0.537x
  Max speedup:     0.810x
  Min speedup:     0.254x
max:
  Average speedup: 0.047x
  Median speedup:  0.022x
  Max speedup:     0.431x
  Min speedup:     0.005x
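
The speedup statistics can likewise be derived from the CSV by pivoting to one latency column per implementation; same assumed column names as above:

import pandas as pd

# Pivot to one column per implementation, then divide the baseline latency by
# each implementation's latency (speedup > 1 means faster than PyTorch SDPA).
df = pd.read_csv("./flash_benchmark_results_20250905_020930.csv")
wide = df.pivot_table(index=["seq_len", "num_heads", "head_dim"],
                      columns="implementation", values="mean_latency_ms")
for impl in ["flashinfer", "flash_attn", "max"]:
    s = wide["pytorch_sdpa"] / wide[impl]
    print(f"{impl}: avg={s.mean():.3f}x median={s.median():.3f}x "
          f"max={s.max():.3f}x min={s.min():.3f}x")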