Early benchmarking of GPU data-tiled matmuls

@bjacob, created October 9, 2024

Test files

It's in this branch / single commit: https://github.com/iree-org/iree/compare/main...bjacob:iree:gpu_matmul_benchmarks

experimental/gpu_matmul_benchmarks/benchmark.sh
experimental/gpu_matmul_benchmarks/matmul_f16.mlir
experimental/gpu_matmul_benchmarks/matmul_i8.mlir
experimental/gpu_matmul_benchmarks/mfma_f16.mlir
experimental/gpu_matmul_benchmarks/mfma_i8.mlir

So there's a Bash script that compiles and benchmarks the four .mlir programs.
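
For flavor, here is a minimal sketch of what that compile-and-benchmark loop looks like. This is not the actual benchmark.sh: the `iree-compile` / `iree-benchmark-module` flags are the ones from IREE circa late 2024, and the `@matmul` entry point and splat inputs are illustrative assumptions.

```sh
# Minimal sketch, NOT the actual benchmark.sh. Assumes it is run from an
# IREE build directory (tools/ on hand), targeting MI300X (gfx942).
# The @matmul entry point and the splat inputs are illustrative assumptions.
src=experimental/gpu_matmul_benchmarks/matmul_f16.mlir
tools/iree-compile "$src" \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o /tmp/matmul_f16.vmfb
for size in 128 256 512 1024 2048 4096; do
  tools/iree-benchmark-module \
    --device=hip \
    --module=/tmp/matmul_f16.vmfb \
    --function=matmul \
    --input="${size}x${size}xf16=1" \
    --input="${size}x${size}xf16=1"
done
```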

The `matmul_` ones are a `linalg.matmul` with dynamic shapes and a zero-filled accumulator. The `mfma_` ones are just the `multi_mma` dispatch (still including the zero-fill of the accumulator).
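
To make that concrete, here is an illustrative sketch (not the actual test file) of what a dynamic-shape matmul with a zero-filled accumulator looks like at the linalg level; the f16-inputs/f32-accumulator element types and the `@matmul` name are assumptions:

```mlir
// Illustrative sketch, not the actual test file: a dynamic-shape f16
// matmul accumulating into an f32 accumulator that is zero-filled first.
func.func @matmul(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %m = tensor.dim %lhs, %c0 : tensor<?x?xf16>
  %n = tensor.dim %rhs, %c1 : tensor<?x?xf16>
  %zero = arith.constant 0.0 : f32
  %empty = tensor.empty(%m, %n) : tensor<?x?xf32>
  // Zero-fill the accumulator...
  %acc = linalg.fill ins(%zero : f32) outs(%empty : tensor<?x?xf32>) -> tensor<?x?xf32>
  // ...then the matmul proper.
  %result = linalg.matmul ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>)
                          outs(%acc : tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result : tensor<?x?xf32>
}
```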

Results

Just run the script from an IREE build directory.

Obtained on GPUF280 (an MI300X configured in CPX mode):

Note: the format is CSV `size,Tflop/s`, where `size` is the common side length of the square matrices (M = N = K).
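
(For reference, for square matrices of side `size`, that is `Tflop/s = 2 * size^3 / (1e12 * seconds per matmul)`, counting one multiply and one add per inner-product term.)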

Benchmarking mfma_f16.mlir:
128,0.11
256,0.87
512,5.99
1024,33.67
2048,82.97
4096,116.38

Benchmarking mfma_i8.mlir:
128,0.12
256,0.88
512,6.03
1024,35.47
2048,128.53
4096,215.77

Benchmarking matmul_f16.mlir:
128,0.04
256,0.20
512,0.52
1024,1.46
2048,3.11
4096,6.22

Benchmarking matmul_i8.mlir:
128,0.04
256,0.23
512,0.75
1024,2.35
2048,5.28
4096,10.88

So:

  • The `mfma_` ones are doing well. At size 4096:
    • For f16, the 116.38 compares to 118.9 on my equivalent HIP kernel (98%).
    • For i8, the 215.77 compares to 222.7 on my equivalent HIP kernel (97%).
  • The `matmul_` ones are slow, meaning that the work outside of the MFMA dispatch is slow.