Early benchmarking of GPU data-tiled matmuls

@bjacob, created October 9, 2024

Test files

It's in this branch / single commit: https://github.com/iree-org/iree/compare/main...bjacob:iree:gpu_matmul_benchmarks

experimental/gpu_matmul_benchmarks/benchmark.sh
experimental/gpu_matmul_benchmarks/matmul_f16.mlir
experimental/gpu_matmul_benchmarks/matmul_i8.mlir
experimental/gpu_matmul_benchmarks/mfma_f16.mlir
experimental/gpu_matmul_benchmarks/mfma_i8.mlir

So there's a Bash script that compiles and benchmarks the four .mlir programs.
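
For flavor, here is a minimal sketch of what that compile-and-benchmark loop looks like. This is not the actual benchmark.sh: the `iree-compile` / `iree-benchmark-module` flags are the ones from IREE circa late 2024, and the `@matmul` entry point and splat inputs are illustrative assumptions.

```sh
# Minimal sketch, NOT the actual benchmark.sh. Assumes it is run from an
# IREE build directory (tools/ on hand), targeting MI300X (gfx942).
# The @matmul entry point and the splat inputs are illustrative assumptions.
src=experimental/gpu_matmul_benchmarks/matmul_f16.mlir
tools/iree-compile "$src" \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o /tmp/matmul_f16.vmfb
for size in 128 256 512 1024 2048 4096; do
  tools/iree-benchmark-module \
    --device=hip \
    --module=/tmp/matmul_f16.vmfb \
    --function=matmul \
    --input="${size}x${size}xf16=1" \
    --input="${size}x${size}xf16=1"
done
```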

The `matmul_` ones are a `linalg.matmul` with dynamic shapes and a zero-filled accumulator. The `mfma_` ones are just the `multi_mma` dispatch (still including the zero-fill of the accumulator).
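
To make that concrete, here is an illustrative sketch (not the actual test file) of what a dynamic-shape matmul with a zero-filled accumulator looks like at the linalg level; the f16-inputs/f32-accumulator element types and the `@matmul` name are assumptions:

```mlir
// Illustrative sketch, not the actual test file: a dynamic-shape f16
// matmul accumulating into an f32 accumulator that is zero-filled first.
func.func @matmul(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %m = tensor.dim %lhs, %c0 : tensor<?x?xf16>
  %n = tensor.dim %rhs, %c1 : tensor<?x?xf16>
  %zero = arith.constant 0.0 : f32
  %empty = tensor.empty(%m, %n) : tensor<?x?xf32>
  // Zero-fill the accumulator...
  %acc = linalg.fill ins(%zero : f32) outs(%empty : tensor<?x?xf32>) -> tensor<?x?xf32>
  // ...then the matmul proper.
  %result = linalg.matmul ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>)
                          outs(%acc : tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result : tensor<?x?xf32>
}
```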

Results

Just run the script from an IREE build directory.

Obtained on GPUF280 (an MI300X configured in CPX mode):

Note: the format is CSV `size,Tflop/s`, where `size` is the common side length of the square matrices (M = N = K).
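
(For reference, for square matrices of side `size`, that is `Tflop/s = 2 * size^3 / (1e12 * seconds per matmul)`, counting one multiply and one add per inner-product term.)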

Benchmarking mfma_f16.mlir:
128,0.11
256,0.87
512,5.99
1024,33.67
2048,82.97
4096,116.38

Benchmarking mfma_i8.mlir:
128,0.12
256,0.88
512,6.03
1024,35.47
2048,128.53
4096,215.77

Benchmarking matmul_f16.mlir:
128,0.04
256,0.20
512,0.52
1024,1.46
2048,3.11
4096,6.22

Benchmarking matmul_i8.mlir:
128,0.04
256,0.23
512,0.75
1024,2.35
2048,5.28
4096,10.88

So:

  • The `mfma_` ones are doing well. At size 4096:
    • For f16, the 116.38 compares to 118.9 on my equivalent HIP kernel (98%).
    • For i8, the 215.77 compares to 222.7 on my equivalent HIP kernel (97%).
  • The `matmul_` ones are slow, meaning that the work outside of the MFMA dispatch is slow.