It's in this branch / single commit: https://github.com/iree-org/iree/compare/main...bjacob:iree:gpu_matmul_benchmarks
experimental/gpu_matmul_benchmarks/benchmark.sh
experimental/gpu_matmul_benchmarks/matmul_f16.mlir
experimental/gpu_matmul_benchmarks/matmul_i8.mlir
experimental/gpu_matmul_benchmarks/mfma_f16.mlir
experimental/gpu_matmul_benchmarks/mfma_i8.mlir
So there's a Bash script compiling and benchmarking 4 .mlir programs. The `matmul_` ones are `linalg.matmul` with dynamic shapes and a zero-filled accumulator. The `mfma_` ones are just the `multi_mma` dispatch (still including the zero-fill of the accumulator). Just run the script from an IREE build directory.
Results obtained on GPUF280 (MI300X configured in CPX mode). Note: the format is CSV `size,Tflop/s`, where `size` is the side length of all the square matrices.
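For reference, here is how the Tflop/s column relates to size and measured runtime. This is a minimal sketch, assuming the standard 2*N^3 flop count for an N×N×N matmul; the example runtime is hypothetical, not taken from the script's output:

```python
def tflops(n: int, seconds: float) -> float:
    # A square matmul of side n performs 2*n^3 ops
    # (n^3 multiplies + n^3 adds); the zero-fill of the
    # accumulator is not counted.
    return 2 * n**3 / seconds / 1e12

# A 4096^3 matmul finishing in ~1.18 ms lands around 116 Tflop/s,
# in the ballpark of the mfma_f16 row below.
print(f"{tflops(4096, 1.18e-3):.1f}")
```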
Benchmarking mfma_f16.mlir:
128,0.11
256,0.87
512,5.99
1024,33.67
2048,82.97
4096,116.38
Benchmarking mfma_i8.mlir:
128,0.12
256,0.88
512,6.03
1024,35.47
2048,128.53
4096,215.77
Benchmarking matmul_f16.mlir:
128,0.04
256,0.20
512,0.52
1024,1.46
2048,3.11
4096,6.22
Benchmarking matmul_i8.mlir:
128,0.04
256,0.23
512,0.75
1024,2.35
2048,5.28
4096,10.88
So:
- The `mfma_` ones are doing well. At size 4096:
  - For `f16`, the 116.38 compares to 118.9 on my equivalent HIP kernel (98%).
  - For `i8`, the 215.77 compares to 222.7 on my equivalent HIP kernel (97%).
- The `matmul_` ones are slow, meaning the work outside of the MFMA dispatch is slow.
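The efficiency percentages above are just the ratio of the measured rate to the HIP kernel's rate at size 4096; a quick check:

```python
# Ratio of the IREE-measured Tflop/s to the equivalent HIP kernel's,
# using the size-4096 numbers quoted above.
for name, iree_rate, hip_rate in [("f16", 116.38, 118.9), ("i8", 215.77, 222.7)]:
    print(f"{name}: {iree_rate / hip_rate:.0%}")
# prints:
# f16: 98%
# i8: 97%
```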