@bjacob, last active April 9, 2026
IREE data-tiling: CPU vs GPU differences, April 9 snapshot

The interesting document is: DATA-TILING-CPU-VS-GPU.md.

A ZIP archive of all the ai-generated files (pass-specific MLIR logs) should be attached to this gist. EDIT: Unfortunately, gists no longer allow attaching ZIP files. Grrr. Available on request.

This is a snapshot of CPU vs GPU differences in data-tiling on a simple matmul example, taken on April 9, 2026. The IREE commit is 007b1ee.

To reproduce, run:

```shell
iree-compile \
  matmul_bf16.mlir -o /tmp/matmul_bf16.cpu.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver5 \
  --iree-dispatch-creation-data-tiling \
  -mlir-disable-threading -mlir-print-ir-after-all -mlir-print-ir-before-all \
  2>log-cpu.mlir

iree-compile \
  matmul_bf16.mlir -o /tmp/matmul_bf16.gpu.vmfb \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx950 \
  --iree-dispatch-creation-data-tiling \
  -mlir-disable-threading -mlir-print-ir-after-all -mlir-print-ir-before-all \
  2>log-gpu.mlir
```
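As a quick sanity check before any analysis, one can count how many Before/After dump headers each log contains, e.g. to confirm the two logs pair up. A minimal sketch, assuming the headers contain the literal text `IR Dump Before` / `IR Dump After` (check against your actual logs):

```python
import re

def count_dumps(path):
    """Count Before/After IR dump headers in a -mlir-print-ir-* log.

    Assumes headers contain the literal text "IR Dump Before" /
    "IR Dump After"; adjust if your MLIR version prints differently.
    """
    with open(path) as f:
        text = f.read()
    return (len(re.findall(r"IR Dump Before", text)),
            len(re.findall(r"IR Dump After", text)))
```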

Then use AI (I used Cursor's "auto" mode) with this prompt:

In /home/ossci/data-tiling, I have two MLIR before/after-all compilation logs: log-cpu.mlir, log-gpu.mlir. They were both generated by the commands in README.md. My goal is to document the differences and commonalities between CPU and GPU compilation specifically concerning data-tiling. In the log, look particularly for anything mentioning #iree_encoding. These "tensor encoding" attributes are the core of where data-tiling happens. Look also for any changes to the following kinds of ops: linalg.matmul, linalg.mmt4d, and inner_tiled. I want you to copy out of log-{cpu,gpu}.mlir any before/after pass dump where a relevant change occurs, into its own separate log file. Create an ai-generated/ subdirectory under data-tiling: all your output files should go to ai-generated. I want separate output files (still in the same dir) for each of CPU/GPU, for each pass making a significant difference, and separate before/after files for each. Moreover, I want you to number these files with increasing positive integers reflecting the chronology. If a pass only makes a significant change on one side (e.g. only CPU but not GPU) I still want dumps for both sides, if only to keep the numbering aligned. You should not need to rerun the iree-compile commands: you should only need to read log-{cpu,gpu}.mlir. E.g. if "FooPass" is the first pass making a significant change on CPU then create 4 files: log-1-before-foo-cpu.mlir, log-1-after-foo-cpu.mlir, log-1-before-foo-gpu.mlir, log-1-after-foo-gpu.mlir. Putting the number 1 first, before the pass name, helps ensure that alphanumeric sorting of files remains useful. Create an overview file, ai-generated/DATA-TILING-CPU-VS-GPU.md, giving a high-level overview while linking to some select .mlir files.
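Mechanically, the first step of that task (slicing each log into per-pass dumps) can be sketched in Python. This is a hypothetical sketch assuming dump headers of the form `// -----// IR Dump Before PassName (pass-flag) //----- //`; the exact marker text printed by `-mlir-print-ir-before-all` / `-mlir-print-ir-after-all` should be checked against the actual logs:

```python
import re

# Assumed MLIR dump header format; verify against your actual logs.
DUMP_RE = re.compile(
    r"// -----// IR Dump (Before|After) (\S+) \(([^)]+)\) //----- //")

def split_dumps(log_text):
    """Split a print-ir log into (phase, pass_name, pass_flag, ir_text)
    tuples, one per dump, in chronological order."""
    dumps = []
    matches = list(DUMP_RE.finditer(log_text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        dumps.append((m.group(1), m.group(2), m.group(3),
                      log_text[start:end]))
    return dumps
```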

Data-tiling: CPU (llvm-cpu, znver5) vs GPU (rocm, gfx950)

The commands in README.md build the same matmul_bf16.mlir with --iree-dispatch-creation-data-tiling and full --mlir-print-ir-before-all / --mlir-print-ir-after-all logs redirected to log-cpu.mlir and log-gpu.mlir in the parent directory.

This folder (ai-generated/) holds the extracted slice files, the extractor script, and this note.

Common front end (shared pass sequence)

For the first 347 paired Before/After dumps, CPU and GPU logs list the same passes in the same order. Through that region, both targets see the same data-tiling story in the dispatch/stream pipeline:

  • #iree_encoding appears after SetEncoding: iree_encoding.set_encoding / unset_encoding wrap operands, and linalg.matmul runs on tensors with #iree_encoding.encoding<op_type = matmul, ...> (operand indices 0/1/2, user indexing maps, iteration sizes).
  • HoistEncodingOps and ConvertEncodingToFlow move encoding into the flow/stream layer; later SpecializeEncodings, EncodeHostTensors, and MaterializeEncodings continue to touch the same encoding metadata until executable translation diverges.

Representative extracted slices (four files each: before/after × cpu/gpu):

| Event | Pass flag (filename slug) | What to notice |
|-------|---------------------------|----------------|
| 6 | `iree-dispatch-creation-set-encoding` | Introduction of `#iree_encoding` around the matmul operands. |
| 7–9 | `iree-op-pipeline-adaptor`, `iree-dispatch-creation-hoist-encoding-ops`, `iree-dispatch-creation-convert-encoding-to-flow` | Encoding hoisted and lowered through flow. |
| 22–24 | `iree-stream-specialize-encodings`, `iree-stream-encode-host-tensors`, `iree-stream-materialize-encodings` | Stream passes refine / materialize encodings for the program. |

Where the pipelines diverge

At dump index 347, the names no longer line up index-by-index:

  • CPU next runs codegen passes such as TypePropagation (not significant for the data-tiling snippet filter used here).
  • GPU immediately runs MaterializeDeviceEncoding (iree-codegen-materialize-device-encoding).

Extracts after that point pair by matching pass name on the GPU side when possible, and insert GPU-only or CPU-only slices when one backend runs passes the other does not. Files that duplicate one side’s IR include a leading // NOTE: explaining that.
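The alignment just described amounts to merging two chronological lists of significant pass slugs while keeping the event numbering consistent. A minimal sketch of one way to do it (function and variable names are hypothetical, not the extractor script's actual API):

```python
def merge_events(cpu_passes, gpu_passes):
    """Merge two chronological lists of significant pass slugs into
    numbered events. A pass present on only one side still gets an
    event, so the numbering stays aligned across backends."""
    events, ci, gi = [], 0, 0
    while ci < len(cpu_passes) or gi < len(gpu_passes):
        c = cpu_passes[ci] if ci < len(cpu_passes) else None
        g = gpu_passes[gi] if gi < len(gpu_passes) else None
        if c is not None and c == g:
            # Same pass on both sides: one shared event.
            events.append((c, "both")); ci += 1; gi += 1
        elif g is not None and (c is None or g not in cpu_passes[ci:]):
            # GPU pass that CPU never runs (from here on): GPU-only event.
            events.append((g, "gpu-only")); gi += 1
        else:
            # CPU pass the GPU side has not reached: CPU-only event.
            events.append((c, "cpu-only")); ci += 1
    return events
```

The greedy "does this GPU pass appear later on the CPU side?" check is the simplest heuristic that keeps shared passes paired; a real implementation might instead anchor on the known divergence index.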

Backend-specific lowering (high level)

CPU

  • After MaterializeDeviceEncoding, the dispatch uses linalg.mmt4d on blocked types (e.g. tensor<?x?x16x2xbf16>), reflecting CPU data layout / micro-kernel tiling (e.g. 16×2 tiles), not inner_tiled.
  • Later, CPULowerToUKernels lowers toward iree_codegen.ukernel.generic "iree_uk_mmt4d" (see event 105 and following in the extracts).

GPU (gfx950 / MFMA)

  • The GPU path lowers the contraction to iree_codegen.inner_tiled with #iree_gpu.data_tiled_mma_layout (e.g. MFMA_F32_16x16x32_BF16, subgroup and intrinsic counts). Operand tensor shapes pick up the extra data-tile dimensions (e.g. tensor<?x?x2x4x4x16x8xbf16>).
  • DistributeInnerTiledToLanes (event 42 in the extracts) further reshapes how that inner-tiled op is mapped for the GPU.

So: both sides start from the same #iree_encoding dispatch creation story; CPU converges on mmt4d (+ ukernels) while GPU converges on inner_tiled + MFMA-oriented layouts.

Numbered extract files

  • Script: extract_data_tiling_logs.py — regenerates slices from ../log-cpu.mlir / ../log-gpu.mlir only (no iree-compile rerun).
  • Rule: A pass is included if any of these change between before and after on either backend: lines matching #iree_encoding / iree_encoding, set_encoding / unset_encoding, linalg.matmul, linalg.mmt4d, inner_tiled.
  • Naming: log-<N>-before-<slug>-{cpu,gpu}.mlir and log-<N>-after-<slug>-{cpu,gpu}.mlir, with N increasing in merge order (shared prefix in lockstep, then CPU-ordered significant passes with GPU-only inserts, then trailing GPU-only).
  • Scale: This produces 125 logical events (500 files). Many slugs repeat (e.g. repeated iree-codegen-materialize-device-encoding / HAL configure passes for multiple executables); sort by log-<N>- to follow chronology.
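The inclusion rule above can be expressed as a simple predicate over the before/after dump text. A sketch (the token list mirrors the rule; matching the substring `iree_encoding` also covers `#iree_encoding` occurrences):

```python
# Tokens whose occurrence counts decide whether a pass is "significant".
TOKENS = ("iree_encoding", "set_encoding", "unset_encoding",
          "linalg.matmul", "linalg.mmt4d", "inner_tiled")

def token_counts(ir_text):
    """Occurrence count of each relevant token in one IR dump."""
    return {t: ir_text.count(t) for t in TOKENS}

def is_significant(before_ir, after_ir):
    """A pass is significant if any relevant token's occurrence count
    changes between its before and after dumps."""
    return token_counts(before_ir) != token_counts(after_ir)
```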

Quick pointers

| Event | Slug (abbrev.) | Role |
|-------|----------------|------|
| 32 | `iree-op-pipeline-adaptor` (GPU after) | Early `inner_tiled` with dynamic `tensor<?x?x…>` shapes on GPU. |
| 36 | `iree-convert-accgemm-to-gemm` | Another GPU `inner_tiled` snapshot after ACCGEMM cleanup. |
| 42 | `iree-gpu-distribute-inner-tiled-to-lanes` | Distribution of `inner_tiled` to lanes. |
| 73+ | `iree-codegen-materialize-device-encoding` (CPU) | `linalg.mmt4d` with `?x?x16x2` tiles on CPU. |
| 105 | `iree-codegen-cpu-lower-to-ukernels` | Move from `mmt4d` toward `iree_uk_mmt4d`. |

For a full machine-readable ordering, run from the parent data-tiling directory:

```shell
python3 ai-generated/extract_data_tiling_logs.py
```

(or `cd ai-generated && python3 extract_data_tiling_logs.py`), then inspect the emitted filenames here.
