@maaquib
Created September 5, 2025 15:41
PrefilPerfComparision
Benchmark Configuration:
Batch sizes: [1]
Sequence lengths: [32, 64, 128, 256, 512, 1024, 1536, 2048, 4096, 8192, 16384]
Number of heads: [16, 32, 64, 128]
Head dimensions: [64, 128]
Causal: True
Data type: torch.bfloat16
Warmup iterations: 10
Benchmark iterations: 100
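
For reference, a minimal sketch of how a harness with this configuration might time one case, assuming PyTorch with CUDA-event timing; the function and variable names below are illustrative, not the gist's actual script:

import torch
import torch.nn.functional as F

# Mirrors the configuration above; names are illustrative, not the gist's script.
DTYPE = torch.bfloat16
WARMUP_ITERS, BENCH_ITERS = 10, 100

def bench_sdpa(batch, seq_len, num_heads, head_dim):
    """Time causal SDPA prefill with CUDA events; returns (mean_ms, std_ms)."""
    q, k, v = (torch.randn(batch, num_heads, seq_len, head_dim,
                           device="cuda", dtype=DTYPE) for _ in range(3))
    for _ in range(WARMUP_ITERS):  # warm-up: trigger kernel selection/autotuning
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(BENCH_ITERS):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))  # elapsed_time returns milliseconds
    t = torch.tensor(times_ms)
    return t.mean().item(), t.std().item()

# Example: the [1/88] case below
# mean_ms, std_ms = bench_sdpa(1, 32, 16, 64)

The same warmup-then-measure loop would wrap the FlashInfer, Flash Attention, and Max calls that appear in the log.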
================================================================================
Running benchmarks on NVIDIA B200
================================================================================
[1/88] Testing: seq_len=32, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.002 ms
flash_attn : 0.025 ± 0.003 ms
max : 0.032 ± 0.002 ms
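
A note on the recurring validation message: max_diff=0.015625 is exactly 2**-6, about two ulps for bfloat16 values of order one (bf16 machine epsilon is 2**-7 = 0.0078125), so these failures look like ordinary cross-kernel rounding noise against a tolerance sized for fp32, rather than a correctness bug. A dtype-aware re-check might look like this sketch; the function name and tolerances are assumptions, not the gist's validation code:

import torch

def outputs_match(ref: torch.Tensor, out: torch.Tensor) -> bool:
    """Compare two attention outputs with tolerances sized to bf16 precision."""
    max_diff = (ref.float() - out.float()).abs().max().item()
    print(f"max_diff={max_diff}")
    # atol/rtol are assumptions: a couple of bf16 ulps (eps = 2**-7 = 0.0078125)
    return torch.allclose(ref.float(), out.float(), atol=2e-2, rtol=2e-2)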
[2/88] Testing: seq_len=32, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.002 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.051 ± 0.001 ms
[3/88] Testing: seq_len=32, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.052 ± 0.002 ms
[4/88] Testing: seq_len=32, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.000 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.092 ± 0.001 ms
[5/88] Testing: seq_len=32, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.092 ± 0.001 ms
[6/88] Testing: seq_len=32, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.017 ± 0.026 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.174 ± 0.001 ms
[7/88] Testing: seq_len=32, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.019 ± 0.002 ms
flashinfer : 0.018 ± 0.002 ms
flash_attn : 0.035 ± 0.012 ms
max : 0.177 ± 0.002 ms
[8/88] Testing: seq_len=32, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.027 ms
flashinfer : 0.014 ± 0.002 ms
flash_attn : 0.027 ± 0.008 ms
max : 0.340 ± 0.001 ms
[9/88] Testing: seq_len=64, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.013 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.050 ± 0.001 ms
[10/88] Testing: seq_len=64, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.088 ± 0.001 ms
[11/88] Testing: seq_len=64, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.090 ± 0.001 ms
[12/88] Testing: seq_len=64, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.023 ± 0.001 ms
max : 0.169 ± 0.001 ms
[13/88] Testing: seq_len=64, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.013 ± 0.002 ms
flash_attn : 0.027 ± 0.005 ms
max : 0.172 ± 0.002 ms
[14/88] Testing: seq_len=64, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.002 ms
flashinfer : 0.016 ± 0.002 ms
flash_attn : 0.029 ± 0.005 ms
max : 0.332 ± 0.001 ms
[15/88] Testing: seq_len=64, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.003 ms
flashinfer : 0.017 ± 0.003 ms
flash_attn : 0.033 ± 0.003 ms
max : 0.340 ± 0.001 ms
[16/88] Testing: seq_len=64, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.014 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.665 ± 0.002 ms
[17/88] Testing: seq_len=128, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.018 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.088 ± 0.001 ms
[18/88] Testing: seq_len=128, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.002 ms
flashinfer : 0.018 ± 0.003 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.160 ± 0.001 ms
[19/88] Testing: seq_len=128, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.032 ± 0.002 ms
max : 0.170 ± 0.001 ms
[20/88] Testing: seq_len=128, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.323 ± 0.002 ms
[22/88] Testing: seq_len=128, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.014 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.003 ms
max : 0.647 ± 0.001 ms
[23/88] Testing: seq_len=128, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.015 ± 0.001 ms
flashinfer : 0.017 ± 0.029 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.669 ± 0.001 ms
[24/88] Testing: seq_len=128, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.017 ± 0.028 ms
flash_attn : 0.024 ± 0.001 ms
max : 1.314 ± 0.001 ms
[25/88] Testing: seq_len=256, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.172 ± 0.001 ms
[26/88] Testing: seq_len=256, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.029 ± 0.001 ms
max : 0.324 ± 0.001 ms
[27/88] Testing: seq_len=256, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.002 ms
max : 0.336 ± 0.001 ms
[28/88] Testing: seq_len=256, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.649 ± 0.001 ms
[29/88] Testing: seq_len=256, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.002 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.024 ± 0.001 ms
max : 0.671 ± 0.001 ms
[30/88] Testing: seq_len=256, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.002 ms
flashinfer : 0.016 ± 0.001 ms
flash_attn : 0.026 ± 0.001 ms
max : 1.315 ± 0.002 ms
[31/88] Testing: seq_len=256, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.003 ms
flashinfer : 0.015 ± 0.001 ms
flash_attn : 0.026 ± 0.003 ms
max : 1.337 ± 0.026 ms
[32/88] Testing: seq_len=256, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.019 ± 0.032 ms
flashinfer : 0.023 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 2.622 ± 0.035 ms
[33/88] Testing: seq_len=512, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.017 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.338 ± 0.001 ms
[34/88] Testing: seq_len=512, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.020 ± 0.001 ms
flash_attn : 0.030 ± 0.002 ms
max : 0.651 ± 0.002 ms
[35/88] Testing: seq_len=512, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.019 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 0.673 ± 0.002 ms
[36/88] Testing: seq_len=512, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.017 ± 0.001 ms
flashinfer : 0.029 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 1.318 ± 0.002 ms
[37/88] Testing: seq_len=512, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.016 ± 0.001 ms
flashinfer : 0.021 ± 0.001 ms
flash_attn : 0.026 ± 0.002 ms
max : 1.341 ± 0.006 ms
[38/88] Testing: seq_len=512, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.018 ± 0.001 ms
flashinfer : 0.036 ± 0.001 ms
flash_attn : 0.034 ± 0.002 ms
max : 2.642 ± 0.005 ms
[39/88] Testing: seq_len=512, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.035 ± 0.028 ms
flash_attn : 0.033 ± 0.002 ms
max : 2.672 ± 0.008 ms
[40/88] Testing: seq_len=512, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.030 ± 0.019 ms
flashinfer : 0.053 ± 0.005 ms
flash_attn : 0.051 ± 0.002 ms
max : 5.274 ± 0.011 ms
[41/88] Testing: seq_len=1024, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.026 ± 0.001 ms
flash_attn : 0.026 ± 0.003 ms
max : 0.679 ± 0.002 ms
[42/88] Testing: seq_len=1024, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.023 ± 0.001 ms
flashinfer : 0.041 ± 0.001 ms
flash_attn : 0.036 ± 0.002 ms
max : 1.324 ± 0.002 ms
[43/88] Testing: seq_len=1024, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.021 ± 0.001 ms
flashinfer : 0.034 ± 0.001 ms
flash_attn : 0.035 ± 0.002 ms
max : 1.355 ± 0.004 ms
[44/88] Testing: seq_len=1024, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.024 ± 0.001 ms
flashinfer : 0.060 ± 0.001 ms
flash_attn : 0.058 ± 0.002 ms
max : 2.656 ± 0.013 ms
[45/88] Testing: seq_len=1024, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.030 ± 0.001 ms
flashinfer : 0.050 ± 0.002 ms
flash_attn : 0.050 ± 0.002 ms
max : 2.689 ± 0.006 ms
[46/88] Testing: seq_len=1024, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.040 ± 0.001 ms
flashinfer : 0.088 ± 0.003 ms
flash_attn : 0.084 ± 0.003 ms
max : 5.302 ± 0.008 ms
[47/88] Testing: seq_len=1024, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.045 ± 0.001 ms
flashinfer : 0.078 ± 0.002 ms
flash_attn : 0.078 ± 0.002 ms
max : 5.408 ± 0.020 ms
[48/88] Testing: seq_len=1024, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.063 ± 0.018 ms
flashinfer : 0.137 ± 0.003 ms
flash_attn : 0.135 ± 0.005 ms
max : 13.151 ± 0.026 ms
[49/88] Testing: seq_len=1536, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.027 ± 0.002 ms
flashinfer : 0.048 ± 0.001 ms
flash_attn : 0.046 ± 0.002 ms
max : 1.030 ± 0.004 ms
[50/88] Testing: seq_len=1536, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.029 ± 0.001 ms
flashinfer : 0.084 ± 0.001 ms
flash_attn : 0.081 ± 0.002 ms
max : 2.001 ± 0.007 ms
[51/88] Testing: seq_len=1536, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.035 ± 0.002 ms
flashinfer : 0.054 ± 0.002 ms
flash_attn : 0.054 ± 0.003 ms
max : 2.037 ± 0.004 ms
[52/88] Testing: seq_len=1536, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.039 ± 0.002 ms
flashinfer : 0.096 ± 0.005 ms
flash_attn : 0.092 ± 0.004 ms
max : 3.987 ± 0.010 ms
[53/88] Testing: seq_len=1536, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.046 ± 0.002 ms
flashinfer : 0.086 ± 0.002 ms
flash_attn : 0.085 ± 0.002 ms
max : 4.060 ± 0.011 ms
[54/88] Testing: seq_len=1536, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.065 ± 0.002 ms
flashinfer : 0.155 ± 0.004 ms
flash_attn : 0.150 ± 0.004 ms
max : 8.797 ± 0.021 ms
[55/88] Testing: seq_len=1536, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.075 ± 0.002 ms
flashinfer : 0.150 ± 0.021 ms
flash_attn : 0.145 ± 0.003 ms
max : 9.515 ± 0.018 ms
[56/88] Testing: seq_len=1536, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.101 ± 0.011 ms
flashinfer : 0.263 ± 0.004 ms
flash_attn : 0.253 ± 0.004 ms
max : 21.508 ± 0.010 ms
[57/88] Testing: seq_len=2048, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.032 ± 0.001 ms
flashinfer : 0.060 ± 0.001 ms
flash_attn : 0.059 ± 0.002 ms
max : 1.373 ± 0.004 ms
[58/88] Testing: seq_len=2048, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.036 ± 0.001 ms
flashinfer : 0.109 ± 0.001 ms
flash_attn : 0.103 ± 0.002 ms
max : 2.665 ± 0.005 ms
[59/88] Testing: seq_len=2048, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.046 ± 0.001 ms
flashinfer : 0.085 ± 0.003 ms
flash_attn : 0.082 ± 0.003 ms
max : 2.724 ± 0.006 ms
[60/88] Testing: seq_len=2048, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.051 ± 0.002 ms
flashinfer : 0.148 ± 0.005 ms
flash_attn : 0.144 ± 0.006 ms
max : 5.335 ± 0.015 ms
[61/88] Testing: seq_len=2048, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.064 ± 0.002 ms
flashinfer : 0.136 ± 0.003 ms
flash_attn : 0.132 ± 0.003 ms
max : 5.482 ± 0.014 ms
[62/88] Testing: seq_len=2048, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.093 ± 0.003 ms
flashinfer : 0.245 ± 0.006 ms
flash_attn : 0.235 ± 0.006 ms
max : 13.294 ± 0.017 ms
[63/88] Testing: seq_len=2048, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.114 ± 0.002 ms
flashinfer : 0.238 ± 0.003 ms
flash_attn : 0.232 ± 0.003 ms
max : 14.078 ± 0.026 ms
[64/88] Testing: seq_len=2048, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.152 ± 0.007 ms
flashinfer : 0.427 ± 0.005 ms
flash_attn : 0.411 ± 0.005 ms
max : 29.816 ± 0.133 ms
[65/88] Testing: seq_len=4096, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.077 ± 0.001 ms
flashinfer : 0.154 ± 0.009 ms
flash_attn : 0.151 ± 0.006 ms
max : 2.798 ± 0.006 ms
[66/88] Testing: seq_len=4096, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.084 ± 0.002 ms
flashinfer : 0.286 ± 0.010 ms
flash_attn : 0.266 ± 0.010 ms
max : 5.403 ± 0.010 ms
[67/88] Testing: seq_len=4096, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.104 ± 0.002 ms
flashinfer : 0.251 ± 0.006 ms
flash_attn : 0.241 ± 0.005 ms
max : 5.612 ± 0.009 ms
[68/88] Testing: seq_len=4096, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.120 ± 0.002 ms
flashinfer : 0.458 ± 0.010 ms
flash_attn : 0.436 ± 0.009 ms
max : 13.537 ± 0.013 ms
[69/88] Testing: seq_len=4096, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.192 ± 0.002 ms
flashinfer : 0.440 ± 0.005 ms
flash_attn : 0.423 ± 0.006 ms
max : 14.300 ± 0.024 ms
[70/88] Testing: seq_len=4096, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.261 ± 0.005 ms
flashinfer : 0.806 ± 0.009 ms
flash_attn : 0.767 ± 0.011 ms
max : 30.013 ± 0.016 ms
[71/88] Testing: seq_len=4096, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.371 ± 0.001 ms
flashinfer : 0.836 ± 0.020 ms
flash_attn : 0.789 ± 0.005 ms
max : 30.959 ± 0.015 ms
[72/88] Testing: seq_len=4096, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.479 ± 0.004 ms
flashinfer : 1.590 ± 0.075 ms
flash_attn : 1.428 ± 0.015 ms
max : 60.491 ± 0.017 ms
[73/88] Testing: seq_len=8192, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.185 ± 0.002 ms
flashinfer : 0.482 ± 0.011 ms
flash_attn : 0.461 ± 0.011 ms
max : 5.840 ± 0.028 ms
[74/88] Testing: seq_len=8192, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.215 ± 0.001 ms
flashinfer : 0.919 ± 0.036 ms
flash_attn : 0.845 ± 0.021 ms
max : 13.872 ± 0.011 ms
[75/88] Testing: seq_len=8192, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.347 ± 0.002 ms
flashinfer : 0.850 ± 0.011 ms
flash_attn : 0.808 ± 0.012 ms
max : 14.761 ± 0.013 ms
[76/88] Testing: seq_len=8192, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.432 ± 0.002 ms
flashinfer : 1.655 ± 0.114 ms
flash_attn : 1.478 ± 0.019 ms
max : 30.551 ± 0.250 ms
[77/88] Testing: seq_len=8192, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.677 ± 0.002 ms
flashinfer : 1.622 ± 0.049 ms
flash_attn : 1.500 ± 0.011 ms
max : 31.808 ± 0.009 ms
[78/88] Testing: seq_len=8192, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.923 ± 0.049 ms
flashinfer : 3.053 ± 0.260 ms
flash_attn : 2.747 ± 0.019 ms
max : 61.206 ± 0.016 ms
[79/88] Testing: seq_len=8192, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.436 ± 0.099 ms
flashinfer : 3.156 ± 0.185 ms
flash_attn : 2.886 ± 0.010 ms
max : 63.581 ± 0.013 ms
[80/88] Testing: seq_len=8192, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.805 ± 0.113 ms
flashinfer : 5.778 ± 0.427 ms
flash_attn : 5.291 ± 0.017 ms
max : 122.288 ± 0.024 ms
[81/88] Testing: seq_len=16384, num_heads=16, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
All outputs match within tolerance
Results:
pytorch_sdpa : 0.660 ± 0.002 ms
flashinfer : 1.673 ± 0.024 ms
flash_attn : 1.582 ± 0.018 ms
max : 15.658 ± 0.025 ms
[82/88] Testing: seq_len=16384, num_heads=16, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 0.820 ± 0.023 ms
flashinfer : 3.388 ± 0.445 ms
flash_attn : 2.903 ± 0.036 ms
max : 31.268 ± 0.008 ms
[83/88] Testing: seq_len=16384, num_heads=32, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.321 ± 0.036 ms
flashinfer : 3.140 ± 0.076 ms
flash_attn : 2.940 ± 0.021 ms
max : 33.547 ± 0.019 ms
[84/88] Testing: seq_len=16384, num_heads=32, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 1.769 ± 0.093 ms
flashinfer : 5.928 ± 0.457 ms
flash_attn : 5.400 ± 0.030 ms
max : 62.747 ± 0.012 ms
[85/88] Testing: seq_len=16384, num_heads=64, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 2.787 ± 0.102 ms
flashinfer : 6.016 ± 0.086 ms
flash_attn : 5.658 ± 0.020 ms
max : 66.786 ± 0.024 ms
[86/88] Testing: seq_len=16384, num_heads=64, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 3.593 ± 0.112 ms
flashinfer : 11.100 ± 0.046 ms
flash_attn : 10.406 ± 0.038 ms
max : 124.996 ± 0.025 ms
[87/88] Testing: seq_len=16384, num_heads=128, head_dim=64
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 5.960 ± 0.187 ms
flashinfer : 11.750 ± 0.096 ms
flash_attn : 11.091 ± 0.021 ms
max : 133.292 ± 0.240 ms
[88/88] Testing: seq_len=16384, num_heads=128, head_dim=128
- Running PyTorch SDPA w/ CUDNN attention...
- Running FlashInfer...
- Running Flash Attention...
- Running Max...
Reference: pytorch_sdpa
Validation failed: pytorch_sdpa vs flashinfer, max_diff=0.015625
Results:
pytorch_sdpa : 7.280 ± 0.272 ms
flashinfer : 21.756 ± 0.031 ms
flash_attn : 20.401 ± 0.032 ms
max : 249.674 ± 0.035 ms
Detailed results saved to: ./flash_benchmark_results_20250905_020930.csv
Comparison table saved to: ./flash_benchmark_comparison_20250905_020930.csv
================================================================================
BENCHMARK SUMMARY
================================================================================
Average Latencies by Implementation (mean latency, ms):
flash_attn       0.964
flashinfer       1.031
max             16.071
pytorch_sdpa     0.384
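
These averages can be reproduced from the saved per-run CSV with a pandas groupby (the original summary was printed as a pandas Series named mean_latency_ms); the column names ("implementation", "mean_latency_ms") are assumptions inferred from that output:

import pandas as pd

# Column names are assumptions inferred from the printed summary, not a documented schema.
df = pd.read_csv("./flash_benchmark_results_20250905_020930.csv")
print(df.groupby("implementation")["mean_latency_ms"].mean().round(3))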
================================================================================
SPEEDUP SUMMARY (relative to PyTorch SDPA)
================================================================================
flashinfer:
  Average speedup: 0.660x
  Median speedup:  0.562x
  Max speedup:     1.197x
  Min speedup:     0.234x
flash_attn:
  Average speedup: 0.509x
  Median speedup:  0.537x
  Max speedup:     0.810x
  Min speedup:     0.254x
max:
  Average speedup: 0.047x
  Median speedup:  0.022x
  Max speedup:     0.431x
  Min speedup:     0.005x
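
The speedup statistics can likewise be derived from the CSV by pivoting to one latency column per implementation; same assumed column names as above:

import pandas as pd

# Pivot to one column per implementation, then divide the baseline latency by
# each implementation's latency (speedup > 1 means faster than PyTorch SDPA).
df = pd.read_csv("./flash_benchmark_results_20250905_020930.csv")
wide = df.pivot_table(index=["seq_len", "num_heads", "head_dim"],
                      columns="implementation", values="mean_latency_ms")
for impl in ["flashinfer", "flash_attn", "max"]:
    s = wide["pytorch_sdpa"] / wide[impl]
    print(f"{impl}: avg={s.mean():.3f}x median={s.median():.3f}x "
          f"max={s.max():.3f}x min={s.min():.3f}x")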