@bjacob, last active April 9, 2026
IREE data-tiling: CPU vs GPU differences, April 9 snapshot

The interesting document is: DATA-TILING-CPU-VS-GPU.md.

A ZIP archive of all the ai-generated files (pass-specific MLIR logs) should be attached to this gist. EDIT: Unfortunately, gists no longer allow attaching ZIP files. Grrr. Available on request.

This is a snapshot of CPU vs GPU differences in data-tiling on a simple matmul example, taken on April 9, 2026. The IREE commit is 007b1ee.

To reproduce, run:

```shell
iree-compile \
  matmul_bf16.mlir -o /tmp/matmul_bf16.cpu.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver5 \
  --iree-dispatch-creation-data-tiling \
  -mlir-disable-threading -mlir-print-ir-after-all -mlir-print-ir-before-all \
  2>log-cpu.mlir

iree-compile \
  matmul_bf16.mlir -o /tmp/matmul_bf16.gpu.vmfb \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx950 \
  --iree-dispatch-creation-data-tiling \
  -mlir-disable-threading -mlir-print-ir-after-all -mlir-print-ir-before-all \
  2>log-gpu.mlir
```
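As a quick sanity check before any analysis, one can count how many Before/After dump headers each log contains, e.g. to confirm the two logs pair up. A minimal sketch, assuming the headers contain the literal text `IR Dump Before` / `IR Dump After` (check against your actual logs):

```python
import re

def count_dumps(path):
    """Count Before/After IR dump headers in a -mlir-print-ir-* log.

    Assumes headers contain the literal text "IR Dump Before" /
    "IR Dump After"; adjust if your MLIR version prints differently.
    """
    with open(path) as f:
        text = f.read()
    return (len(re.findall(r"IR Dump Before", text)),
            len(re.findall(r"IR Dump After", text)))
```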

Then use AI (I used Cursor's "auto" mode) with this prompt:

In /home/ossci/data-tiling, I have two MLIR before/after-all compilation logs: log-cpu.mlir, log-gpu.mlir. They were both generated by the commands in README.md. My goal is to document the differences and commonalities between CPU and GPU compilation specifically concerning data-tiling. In the log, look particularly for anything mentioning #iree_encoding. These "tensor encoding" attributes are the core of where data-tiling happens. Look also for any changes to the following kinds of ops: linalg.matmul, linalg.mmt4d, and inner_tiled. I want you to copy out of log-{cpu,gpu}.mlir any before/after pass dump where a relevant change occurs, into its own separate log file. Create an ai-generated/ subdirectory under data-tiling: all your output files should go to ai-generated. I want separate output files (still in the same dir) for each of CPU/GPU, for each pass making a significant difference, and separate before/after files for each. Moreover, I want you to number these files with increasing positive integers reflecting the chronology. If a pass only makes a significant change on one side (e.g. only CPU but not GPU) I still want dumps for both sides, if only to keep the numbering aligned. You should not need to rerun the iree-compile commands: you should only need to read log-{cpu,gpu}.mlir. E.g. if "FooPass" is the first pass making a significant change on CPU then create 4 files: log-1-before-foo-cpu.mlir, log-1-after-foo-cpu.mlir, log-1-before-foo-gpu.mlir, log-1-after-foo-gpu.mlir. Putting the number 1 first, before the pass name, helps ensure that alphanumeric sorting of files remains useful. Create an overview file, ai-generated/DATA-TILING-CPU-VS-GPU.md, giving a high-level overview while linking to some select .mlir files.
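Mechanically, the first step of that task (slicing each log into per-pass dumps) can be sketched in Python. This is a hypothetical sketch assuming dump headers of the form `// -----// IR Dump Before PassName (pass-flag) //----- //`; the exact marker text printed by `-mlir-print-ir-before-all` / `-mlir-print-ir-after-all` should be checked against the actual logs:

```python
import re

# Assumed MLIR dump header format; verify against your actual logs.
DUMP_RE = re.compile(
    r"// -----// IR Dump (Before|After) (\S+) \(([^)]+)\) //----- //")

def split_dumps(log_text):
    """Split a print-ir log into (phase, pass_name, pass_flag, ir_text)
    tuples, one per dump, in chronological order."""
    dumps = []
    matches = list(DUMP_RE.finditer(log_text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        dumps.append((m.group(1), m.group(2), m.group(3),
                      log_text[start:end]))
    return dumps
```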

Data-tiling: CPU (llvm-cpu, znver5) vs GPU (rocm, gfx950)

The commands in README.md build the same matmul_bf16.mlir with --iree-dispatch-creation-data-tiling and full --mlir-print-ir-before-all / --mlir-print-ir-after-all logs redirected to log-cpu.mlir and log-gpu.mlir in the parent directory.

This folder (ai-generated/) holds the extracted slice files, the extractor script, and this note.

Common front end (shared pass sequence)

For the first 347 paired Before/After dumps, CPU and GPU logs list the same passes in the same order. Through that region, both targets see the same data-tiling story in the dispatch/stream pipeline:

  • #iree_encoding appears after SetEncoding: iree_encoding.set_encoding / unset_encoding wrap operands, and linalg.matmul runs on tensors with #iree_encoding.encoding<op_type = matmul, ...> (operand indices 0/1/2, user indexing maps, iteration sizes).
  • HoistEncodingOps and ConvertEncodingToFlow move encoding into the flow/stream layer; later SpecializeEncodings, EncodeHostTensors, and MaterializeEncodings continue to touch the same encoding metadata until executable translation diverges.

Representative extracted slices (four files each: before/after × cpu/gpu):

| Event | Pass flag (filename slug) | What to notice |
|-------|---------------------------|----------------|
| 6 | `iree-dispatch-creation-set-encoding` | Introduction of `#iree_encoding` around the matmul operands. |
| 7–9 | `iree-op-pipeline-adaptor`, `iree-dispatch-creation-hoist-encoding-ops`, `iree-dispatch-creation-convert-encoding-to-flow` | Encoding hoisted and lowered through flow. |
| 22–24 | `iree-stream-specialize-encodings`, `iree-stream-encode-host-tensors`, `iree-stream-materialize-encodings` | Stream passes refine / materialize encodings for the program. |

Where the pipelines diverge

At dump index 347, the names no longer line up index-by-index:

  • CPU next runs codegen passes such as TypePropagation (not significant for the data-tiling snippet filter used here).
  • GPU immediately runs MaterializeDeviceEncoding (iree-codegen-materialize-device-encoding).

Extracts after that point pair by matching pass name on the GPU side when possible, and insert GPU-only or CPU-only slices when one backend runs passes the other does not. Files that duplicate one side’s IR include a leading // NOTE: explaining that.
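The alignment just described amounts to merging two chronological lists of significant pass slugs while keeping the event numbering consistent. A minimal sketch of one way to do it (function and variable names are hypothetical, not the extractor script's actual API):

```python
def merge_events(cpu_passes, gpu_passes):
    """Merge two chronological lists of significant pass slugs into
    numbered events. A pass present on only one side still gets an
    event, so the numbering stays aligned across backends."""
    events, ci, gi = [], 0, 0
    while ci < len(cpu_passes) or gi < len(gpu_passes):
        c = cpu_passes[ci] if ci < len(cpu_passes) else None
        g = gpu_passes[gi] if gi < len(gpu_passes) else None
        if c is not None and c == g:
            # Same pass on both sides: one shared event.
            events.append((c, "both")); ci += 1; gi += 1
        elif g is not None and (c is None or g not in cpu_passes[ci:]):
            # GPU pass that CPU never runs (from here on): GPU-only event.
            events.append((g, "gpu-only")); gi += 1
        else:
            # CPU pass the GPU side has not reached: CPU-only event.
            events.append((c, "cpu-only")); ci += 1
    return events
```

The greedy "does this GPU pass appear later on the CPU side?" check is the simplest heuristic that keeps shared passes paired; a real implementation might instead anchor on the known divergence index.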

Backend-specific lowering (high level)

CPU

  • After MaterializeDeviceEncoding, the dispatch uses linalg.mmt4d on blocked types (e.g. tensor<?x?x16x2xbf16>), reflecting CPU data layout / micro-kernel tiling (e.g. 16×2 tiles), not inner_tiled.
  • Later, CPULowerToUKernels lowers toward iree_codegen.ukernel.generic "iree_uk_mmt4d" (see event 105 and following in the extracts).

GPU (gfx950 / MFMA)

  • The GPU path lowers the contraction to iree_codegen.inner_tiled with #iree_gpu.data_tiled_mma_layout (e.g. MFMA_F32_16x16x32_BF16, subgroup and intrinsic counts). Operand tensor shapes pick up the extra data-tile dimensions (e.g. tensor<?x?x2x4x4x16x8xbf16>).
  • DistributeInnerTiledToLanes (event 42 in the extracts) further reshapes how that inner-tiled op is mapped for the GPU.

So: both sides start from the same #iree_encoding dispatch creation story; CPU converges on mmt4d (+ ukernels) while GPU converges on inner_tiled + MFMA-oriented layouts.

Numbered extract files

  • Script: extract_data_tiling_logs.py — regenerates slices from ../log-cpu.mlir / ../log-gpu.mlir only (no iree-compile rerun).
  • Rule: A pass is included if any of these change between before and after on either backend: lines matching #iree_encoding / iree_encoding, set_encoding / unset_encoding, linalg.matmul, linalg.mmt4d, inner_tiled.
  • Naming: log-<N>-before-<slug>-{cpu,gpu}.mlir and log-<N>-after-<slug>-{cpu,gpu}.mlir, with N increasing in merge order (shared prefix in lockstep, then CPU-ordered significant passes with GPU-only inserts, then trailing GPU-only).
  • Scale: This produces 125 logical events (500 files). Many slugs repeat (e.g. repeated iree-codegen-materialize-device-encoding / HAL configure passes for multiple executables); sort by log-<N>- to follow chronology.
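The inclusion rule above can be expressed as a simple predicate over the before/after dump text. A sketch (the token list mirrors the rule; matching the substring `iree_encoding` also covers `#iree_encoding` occurrences):

```python
# Tokens whose occurrence counts decide whether a pass is "significant".
TOKENS = ("iree_encoding", "set_encoding", "unset_encoding",
          "linalg.matmul", "linalg.mmt4d", "inner_tiled")

def token_counts(ir_text):
    """Occurrence count of each relevant token in one IR dump."""
    return {t: ir_text.count(t) for t in TOKENS}

def is_significant(before_ir, after_ir):
    """A pass is significant if any relevant token's occurrence count
    changes between its before and after dumps."""
    return token_counts(before_ir) != token_counts(after_ir)
```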

Quick pointers

| Event | Slug (abbrev.) | Role |
|-------|----------------|------|
| 32 | `iree-op-pipeline-adaptor` (GPU after) | Early `inner_tiled` with dynamic `tensor<?x?x…>` shapes on GPU. |
| 36 | `iree-convert-accgemm-to-gemm` | Another GPU `inner_tiled` snapshot after ACCGEMM cleanup. |
| 42 | `iree-gpu-distribute-inner-tiled-to-lanes` | Distribution of `inner_tiled` to lanes. |
| 73+ | `iree-codegen-materialize-device-encoding` (CPU) | `linalg.mmt4d` with `?x?x16x2` tiles on CPU. |
| 105 | `iree-codegen-cpu-lower-to-ukernels` | Move from `mmt4d` toward `iree_uk_mmt4d`. |

For a full machine-readable ordering, run from the parent data-tiling directory:

```shell
python3 ai-generated/extract_data_tiling_logs.py
```

(or `cd ai-generated && python3 extract_data_tiling_logs.py`), then inspect the emitted filenames here.
