Skip to content

Instantly share code, notes, and snippets.

@powderluv
Created March 24, 2026 06:29
Show Gist options
  • Select an option

  • Save powderluv/76cffd5e34104d5d9e8e9f97449f979b to your computer and use it in GitHub Desktop.

Select an option

Save powderluv/76cffd5e34104d5d9e8e9f97449f979b to your computer and use it in GitHub Desktop.
Split-K Matmul Transpiler Bug: Deep Investigation (GFX1250→GFX942)

Split-K Matmul Transpiler Bug: Deep Investigation

Overview

The GFX1250→GFX942 cross-family ISA transpiler passes 18-19/20 tests. The remaining failure is the matmul_splitk test, where the inner loop exits prematurely and non-deterministically. This document details the exhaustive investigation.

The Bug

The split-K matmul uses two kernels:

  • matmul_splitk_compute (236 GFX12 instructions): Each workgroup computes a partial matmul for a chunk of K. Uses blockIdx.y for split index.
  • matmul_splitk_reduce: Sums partial results across splits.

The compute kernel's inner loop accumulates C[row][col] += A[row][k] * B[k*N+col] for k = start_k..end_k. The loop exits early, producing fewer iterations than expected.

Symptom

K=256 across 5 runs: got=256, 252, 193, 76, 193
K=64  across 5 runs: got=49, 64, 49, 64, 49
K=32  across 5 runs: got=25, 32, 25, 19, 25
K=1-4: always perfect

The results are wildly non-deterministic — the same input produces different iteration counts across consecutive runs.

The Inner Loop (Transpiled Assembly)

.L_br3:
    v_lshlrev_b32_e32 v14, 2, v3        ; B byte offset (scale_offset emulation)
    global_load_dword v4, v14, s[2:3]    ; load B[k*N+col]
    s_load_dword s11, s[6:7], 0x0        ; load A[row, k] (scalar)
    v_add_u32_e32 v3, s5, v3             ; v3 += N (advance B index)
    s_add_i32 s10, s10, 1                ; k++
    s_waitcnt vmcnt(0) lgkmcnt(0)        ; wait for loads
    v_fmac_f32_e32 v2, s11, v4           ; accumulate A[k]*B[k*N+col]
    s_add_u32 s6, s6, 4                  ; advance A pointer lo
    s_addc_u32 s7, s7, 0                 ; advance A pointer hi (carry)
    s_cmp_ge_i32 s10, s4                 ; k >= end_k? → sets SCC
    s_cbranch_scc0 .L_br3                ; loop if k < end_k (SCC=0)

The loop control is:

  • s10 = loop counter (starts at start_k, increments by 1)
  • s4 = loop bound (min(end_k, K))
  • s_cmp_ge_i32 s10, s4 sets SCC=1 when done
  • s_cbranch_scc0 loops back when SCC=0

What Was Ruled Out

1. scale_offset Modifier (RULED OUT)

GFX12 global_load_b32 v4, v3, s[2:3] scale_offset means addr = s[2:3] + v3 * 4 (hardware multiplies offset by data size). On GFX9, no such modifier exists. The transpiler correctly handles this by inserting v_lshlrev_b32_e32 v14, 2, v3 before the load to pre-multiply. Verified in both assembly dump and binary disassembly.

2. SGPR-Base Addressing Mode (RULED OUT)

Initial hypothesis: global_load_dword v4, v14, s[2:3] (SGPR base + VGPR offset) behaves differently from global_load_dword v4, v[addr_lo:addr_hi], off (full VGPR address) on MI300X.

Disproved: Native gfx942 kernels (e.g., stencil_1d) also use SGPR-base addressing. Attempted conversion to full-VGPR addressing caused assembly errors (VGPR allocation issues) and was abandoned.

3. SCC Clobber from s_addc_u32 (RULED OUT)

The s_addc_u32 s7, s7, 0 writes SCC (carry-out = 0 for small addresses). Then s_cmp_ge_i32 s10, s4 writes SCC (comparison result). The concern was that the branch reads the s_addc's SCC instead of s_cmp's.

Testing: Added s_nop 0 through s_nop 7 between s_addc and s_cmp. No improvement. Also tried explicit SCC save/restore (s_cselect_b32/s_cmp_lg_u32). No improvement.

4. SCC Forwarding Hazard (VALU Between CMP and Branch) (RULED OUT)

Hypothesis: On GFX942, a VALU instruction between s_cmp_ge_i32 (SCC write) and s_cbranch_scc0 (SCC read) corrupts SCC forwarding.

Testing: Reordered the loop so v_fmac_f32 executes BEFORE s_cmp_ge_i32, placing s_cbranch_scc0 immediately after the compare with no VALU gap. The bug persists.

5. Memory Loads / Stale Cache (RULED OUT)

NOWAIT test (env var HSA_HOTSWAP_SPLITK_NOWAIT=1): Stripped ALL global_load_dword, s_load_dword, and s_waitcnt from the inner loop. Replaced v_fmac_f32 with v_add_f32 v2, 1.0, v2 (constant accumulation). The resulting loop has zero memory operations and zero waitcnts.

Result: The pure SALU loop still exits early non-deterministically. This definitively rules out memory, cache, and waitcnt as root causes.

6. s_waitcnt Pipeline Interaction (RULED OUT)

By extension from #5 — even without any s_waitcnt in the loop, the bug persists.

7. s_wait_xcnt Translation (FOUND BUG, BUT NOT ROOT CAUSE)

GFX12's s_wait_xcnt waits for global store completions ("transaction count"). The transpiler was mapping this to s_waitcnt expcnt(N), but on GFX9, global stores use vmcnt, not expcnt. Fixed to s_waitcnt vmcnt(N). This is a correctness fix but does not resolve the split-K loop bug.

8. s_add_nc_u64 SCC Preservation (FOUND BUG, BUT NOT ROOT CAUSE)

GFX12's s_add_nc_u64 does NOT write SCC. The GFX9 expansion (s_add_u32 + s_addc_u32) DOES write SCC. Added SCC save/restore, then removed it (the SCC is always overwritten by a subsequent s_cmp). Neither version affects the split-K bug.

9. MSGPACK Metadata (RULED OUT)

The transpiled MSGPACK had stale GFX12 values:

  • vgpr_count: 5 (should be 20 for transpiled kernel)
  • sgpr_count: 20 (should be 24)

Patched to match the kernel descriptor's RSRC1. No improvement for split-K.

10. v_cmpx_e64 Ghost Lanes (RULED OUT)

The _e64 handler was missing s_mov_b32 exec_hi, 0 after v_cmpx (the _e32 handler had it). Fixed. Ghost lanes 32-63 can no longer execute stores. However, the split-K bug persists — and in fact ghost lanes were accidentally masking the bug (contributing correct values by coincidence).

11. ELF Header and Metadata (RULED OUT)

Field Transpiled Native gfx942
e_flags 0x54c (gfx942, xnack, sramecc) 0x54c (same)
COMPUTE_PGM_RSRC2 0x384 0x384 (same)
USER_SGPR 2 2 (same)
TGID_X/Y/Z_EN 1/1/1 1/1/1 (same)
kernel_code_properties 0x0008 0x0008 (same)
wavefront_size (MSGPACK) 64 64 (same)
amdhsa.target gfx942 gfx942 (same)
COMPUTE_PGM_RSRC1 0x00af0102 0x00af0081 (different)
COMPUTE_PGM_RSRC3 0x03 0x01 (different)
kernarg_segment_size 296 280 (different)

RSRC1 differs in VGPR/SGPR allocation (transpiled allocates more for save registers). RSRC3 differs in ACCUM_OFFSET (3 vs 1). kernarg_size differs (transpiled has 16 extra bytes for hidden args). None of these should affect loop control.

12. .text Padding (RULED OUT)

Changed NOP padding (0xBF800000) to s_code_end (0xBF9F0000) after the assembled code. Tells the instruction prefetcher where the kernel ends. No improvement.

13. Instruction Cache Alignment (NOT TESTED)

The inner loop's branch target might straddle a 64-byte instruction cache line boundary, causing non-deterministic instruction fetch behavior. This was not explicitly tested but is unlikely to cause the observed magnitude of failures.

14. Count-Down Loop Conversion (RULED OUT)

Converted the inner loop from count-up (s10++ >= s4) to count-down (s10-- > 0). No improvement.

The Smoking Gun: Non-Determinism in Pure SALU Loop

The NOWAIT test produces this minimal loop:

.L_br3:
    v_lshlrev_b32_e32 v14, 2, v3     ; VALU (harmless index calc)
    s_nop 0                            ; was: global_load_dword
    s_nop 0                            ; was: s_load_dword
    ; stripped: s_waitcnt
    v_add_u32_e32 v3, s5, v3          ; VALU (advance B index, harmless)
    s_add_i32 s10, s10, 1             ; SALU: k++
    s_add_u32 s6, s6, 4              ; SALU: advance A ptr lo
    s_addc_u32 s7, s7, 0             ; SALU: carry
    s_cmp_ge_i32 s10, s4             ; SALU: compare, sets SCC
    v_add_f32_e32 v2, 1.0, v2        ; VALU: constant accumulate
    s_cbranch_scc0 .L_br3            ; SALU: loop if SCC=0

This loop has no memory operations, no waitcnts, no cache interaction. Yet it exits early non-deterministically.

A native gfx942 kernel with an equivalent loop (compiled by the ROCm compiler) works perfectly every time.

Remaining Hypotheses

1. Wavefront Preemption/Save-Restore Bug

MI300X supports workgroup preemption. When the GPU preempts a wavefront, it saves SGPR/VGPR state and restores it later. If the kernel descriptor configuration (RSRC1/RSRC3 differences — more VGPRs allocated, different ACCUM_OFFSET) causes the save/restore to corrupt SGPR values (specifically s10 or s4), the loop bound would be wrong.

Evidence: The non-determinism is consistent with preemption timing — the amount of work done before preemption varies between runs, and corrupted state after restore causes early exit.

Counter-evidence: Preemption save/restore is a well-tested mechanism. Other transpiled kernels (attn_forward, multihead) don't show this issue. However, those kernels have different RSRC1/RSRC3 values.

2. ACCUM_OFFSET Configuration Issue

The transpiled kernel has ACCUM_OFFSET = 3 (accumulators start at VGPR group 3 = v24). The native kernel has ACCUM_OFFSET = 1 (accumulators start at v8). On CDNA3, the ACCUM_OFFSET affects the VGPR register file partition between arch VGPRs and accumulator VGPRs.

If the hardware uses ACCUM_OFFSET to configure the register file in a way that affects how SGPRs are saved during preemption (unlikely, but not verified), a wrong ACCUM_OFFSET could cause SGPR corruption.

3. CU Scheduling / Occupancy Interaction

The transpiled kernel allocates 24 VGPRs (RSRC1 VGPRS=2) vs 16 for native (VGPRS=1). This reduces occupancy (fewer concurrent wavefronts per CU). Lower occupancy changes the CU scheduling patterns, which affects when wavefronts are preempted and how the SALU pipeline is shared among wavefronts.

If there's a subtle SALU pipeline hazard that only manifests under specific occupancy patterns (e.g., when exactly N wavefronts share the CU), the non-determinism would correlate with GPU load and scheduling decisions.

4. Hardware Errata

The pattern — non-deterministic SALU loop termination in a transpiled kernel but not a native one, with no visible instruction-level difference — could indicate an undocumented hardware errata specific to GFX942 when certain KD configurations are used. This would require AMD hardware team involvement to diagnose.

Fixes Applied (Not Root Cause but Correct)

Fix Impact
s_wait_xcnts_waitcnt vmcnt(N) Correct counter mapping for global stores
s_code_end padding Instruction prefetcher termination
MSGPACK vgpr/sgpr patching Metadata consistency
v_cmpx_e64 exec_hi clear Ghost lane prevention
SCC forwarding reorder No VALU between s_cmp and s_cbranch
Inner loop waitcnt merge 3 waitcnts → 1

Reproduction

# On MI300X machine with ROCm 7.2:
cd /home/anush/github/TheRock/therock-build
ROCR=core/ROCR-Runtime/build/rocr/lib
ELFLIB=compiler/amd-comgr/dist/lib/rocm_sysdeps/lib
CLRLIB=core/clr/dist/lib
SYSDEPS=third-party/sysdeps/linux/zlib/build/dist/lib
export LD_LIBRARY_PATH=$ROCR:$ELFLIB:$CLRLIB:$SYSDEPS
export HSA_HOTSWAP_ISA_OVERRIDE=gfx942

# K iteration count test (run 5 times to see non-determinism):
for i in 1 2 3 4 5; do /tmp/splitk_ktest4 2>&1 | grep "K=256"; done

# NOWAIT diagnostic (pure SALU loop, no loads):
HSA_HOTSWAP_SPLITK_NOWAIT=1 /tmp/splitk_ktest4

# Full test suite:
/tmp/ts_fresh

Related Code

  • Transpiler: rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/hotswap/transpiler.cpp
  • Branch: users/powderluv/rocm-hotswap on ROCm/rocm-systems
  • Split-K GFX1250 ELF: /tmp/matmul_splitk_compute_gfx1250.elf
  • Test binaries: /tmp/splitk_ktest4, /tmp/ts_fresh
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment