The GFX1250→GFX942 cross-family ISA transpiler passes 18-19/20 tests. The remaining failure is the matmul_splitk test, where the inner loop exits prematurely and non-deterministically. This document details the exhaustive investigation.
The split-K matmul uses two kernels:
matmul_splitk_compute(236 GFX12 instructions): Each workgroup computes a partial matmul for a chunk of K. UsesblockIdx.yfor split index.matmul_splitk_reduce: Sums partial results across splits.
The compute kernel's inner loop accumulates C[row][col] += A[row][k] * B[k*N+col] for k = start_k..end_k. The loop exits early, producing fewer iterations than expected.
K=256 across 5 runs: got=256, 252, 193, 76, 193
K=64 across 5 runs: got=49, 64, 49, 64, 49
K=32 across 5 runs: got=25, 32, 25, 19, 25
K=1-4: always perfect
The results are wildly non-deterministic — the same input produces different iteration counts across consecutive runs.
.L_br3:
v_lshlrev_b32_e32 v14, 2, v3 ; B byte offset (scale_offset emulation)
global_load_dword v4, v14, s[2:3] ; load B[k*N+col]
s_load_dword s11, s[6:7], 0x0 ; load A[row, k] (scalar)
v_add_u32_e32 v3, s5, v3 ; v3 += N (advance B index)
s_add_i32 s10, s10, 1 ; k++
s_waitcnt vmcnt(0) lgkmcnt(0) ; wait for loads
v_fmac_f32_e32 v2, s11, v4 ; accumulate A[k]*B[k*N+col]
s_add_u32 s6, s6, 4 ; advance A pointer lo
s_addc_u32 s7, s7, 0 ; advance A pointer hi (carry)
s_cmp_ge_i32 s10, s4 ; k >= end_k? → sets SCC
s_cbranch_scc0 .L_br3 ; loop if k < end_k (SCC=0)The loop control is:
s10= loop counter (starts atstart_k, increments by 1)s4= loop bound (min(end_k, K))s_cmp_ge_i32 s10, s4sets SCC=1 when dones_cbranch_scc0loops back when SCC=0
GFX12 global_load_b32 v4, v3, s[2:3] scale_offset means addr = s[2:3] + v3 * 4 (hardware multiplies offset by data size). On GFX9, no such modifier exists. The transpiler correctly handles this by inserting v_lshlrev_b32_e32 v14, 2, v3 before the load to pre-multiply. Verified in both assembly dump and binary disassembly.
Initial hypothesis: global_load_dword v4, v14, s[2:3] (SGPR base + VGPR offset) behaves differently from global_load_dword v4, v[addr_lo:addr_hi], off (full VGPR address) on MI300X.
Disproved: Native gfx942 kernels (e.g., stencil_1d) also use SGPR-base addressing. Attempted conversion to full-VGPR addressing caused assembly errors (VGPR allocation issues) and was abandoned.
The s_addc_u32 s7, s7, 0 writes SCC (carry-out = 0 for small addresses). Then s_cmp_ge_i32 s10, s4 writes SCC (comparison result). The concern was that the branch reads the s_addc's SCC instead of s_cmp's.
Testing: Added s_nop 0 through s_nop 7 between s_addc and s_cmp. No improvement. Also tried explicit SCC save/restore (s_cselect_b32/s_cmp_lg_u32). No improvement.
Hypothesis: On GFX942, a VALU instruction between s_cmp_ge_i32 (SCC write) and s_cbranch_scc0 (SCC read) corrupts SCC forwarding.
Testing: Reordered the loop so v_fmac_f32 executes BEFORE s_cmp_ge_i32, placing s_cbranch_scc0 immediately after the compare with no VALU gap. The bug persists.
NOWAIT test (env var HSA_HOTSWAP_SPLITK_NOWAIT=1): Stripped ALL global_load_dword, s_load_dword, and s_waitcnt from the inner loop. Replaced v_fmac_f32 with v_add_f32 v2, 1.0, v2 (constant accumulation). The resulting loop has zero memory operations and zero waitcnts.
Result: The pure SALU loop still exits early non-deterministically. This definitively rules out memory, cache, and waitcnt as root causes.
By extension from #5 — even without any s_waitcnt in the loop, the bug persists.
GFX12's s_wait_xcnt waits for global store completions ("transaction count"). The transpiler was mapping this to s_waitcnt expcnt(N), but on GFX9, global stores use vmcnt, not expcnt. Fixed to s_waitcnt vmcnt(N). This is a correctness fix but does not resolve the split-K loop bug.
GFX12's s_add_nc_u64 does NOT write SCC. The GFX9 expansion (s_add_u32 + s_addc_u32) DOES write SCC. Added SCC save/restore, then removed it (the SCC is always overwritten by a subsequent s_cmp). Neither version affects the split-K bug.
The transpiled MSGPACK had stale GFX12 values:
vgpr_count: 5(should be 20 for transpiled kernel)sgpr_count: 20(should be 24)
Patched to match the kernel descriptor's RSRC1. No improvement for split-K.
The _e64 handler was missing s_mov_b32 exec_hi, 0 after v_cmpx (the _e32 handler had it). Fixed. Ghost lanes 32-63 can no longer execute stores. However, the split-K bug persists — and in fact ghost lanes were accidentally masking the bug (contributing correct values by coincidence).
| Field | Transpiled | Native gfx942 |
|---|---|---|
| e_flags | 0x54c (gfx942, xnack, sramecc) | 0x54c (same) |
| COMPUTE_PGM_RSRC2 | 0x384 | 0x384 (same) |
| USER_SGPR | 2 | 2 (same) |
| TGID_X/Y/Z_EN | 1/1/1 | 1/1/1 (same) |
| kernel_code_properties | 0x0008 | 0x0008 (same) |
| wavefront_size (MSGPACK) | 64 | 64 (same) |
| amdhsa.target | gfx942 | gfx942 (same) |
| COMPUTE_PGM_RSRC1 | 0x00af0102 | 0x00af0081 (different) |
| COMPUTE_PGM_RSRC3 | 0x03 | 0x01 (different) |
| kernarg_segment_size | 296 | 280 (different) |
RSRC1 differs in VGPR/SGPR allocation (transpiled allocates more for save registers). RSRC3 differs in ACCUM_OFFSET (3 vs 1). kernarg_size differs (transpiled has 16 extra bytes for hidden args). None of these should affect loop control.
Changed NOP padding (0xBF800000) to s_code_end (0xBF9F0000) after the assembled code. Tells the instruction prefetcher where the kernel ends. No improvement.
The inner loop's branch target might straddle a 64-byte instruction cache line boundary, causing non-deterministic instruction fetch behavior. This was not explicitly tested but is unlikely to cause the observed magnitude of failures.
Converted the inner loop from count-up (s10++ >= s4) to count-down (s10-- > 0). No improvement.
The NOWAIT test produces this minimal loop:
.L_br3:
v_lshlrev_b32_e32 v14, 2, v3 ; VALU (harmless index calc)
s_nop 0 ; was: global_load_dword
s_nop 0 ; was: s_load_dword
; stripped: s_waitcnt
v_add_u32_e32 v3, s5, v3 ; VALU (advance B index, harmless)
s_add_i32 s10, s10, 1 ; SALU: k++
s_add_u32 s6, s6, 4 ; SALU: advance A ptr lo
s_addc_u32 s7, s7, 0 ; SALU: carry
s_cmp_ge_i32 s10, s4 ; SALU: compare, sets SCC
v_add_f32_e32 v2, 1.0, v2 ; VALU: constant accumulate
s_cbranch_scc0 .L_br3 ; SALU: loop if SCC=0This loop has no memory operations, no waitcnts, no cache interaction. Yet it exits early non-deterministically.
A native gfx942 kernel with an equivalent loop (compiled by the ROCm compiler) works perfectly every time.
MI300X supports workgroup preemption. When the GPU preempts a wavefront, it saves SGPR/VGPR state and restores it later. If the kernel descriptor configuration (RSRC1/RSRC3 differences — more VGPRs allocated, different ACCUM_OFFSET) causes the save/restore to corrupt SGPR values (specifically s10 or s4), the loop bound would be wrong.
Evidence: The non-determinism is consistent with preemption timing — the amount of work done before preemption varies between runs, and corrupted state after restore causes early exit.
Counter-evidence: Preemption save/restore is a well-tested mechanism. Other transpiled kernels (attn_forward, multihead) don't show this issue. However, those kernels have different RSRC1/RSRC3 values.
The transpiled kernel has ACCUM_OFFSET = 3 (accumulators start at VGPR group 3 = v24). The native kernel has ACCUM_OFFSET = 1 (accumulators start at v8). On CDNA3, the ACCUM_OFFSET affects the VGPR register file partition between arch VGPRs and accumulator VGPRs.
If the hardware uses ACCUM_OFFSET to configure the register file in a way that affects how SGPRs are saved during preemption (unlikely, but not verified), a wrong ACCUM_OFFSET could cause SGPR corruption.
The transpiled kernel allocates 24 VGPRs (RSRC1 VGPRS=2) vs 16 for native (VGPRS=1). This reduces occupancy (fewer concurrent wavefronts per CU). Lower occupancy changes the CU scheduling patterns, which affects when wavefronts are preempted and how the SALU pipeline is shared among wavefronts.
If there's a subtle SALU pipeline hazard that only manifests under specific occupancy patterns (e.g., when exactly N wavefronts share the CU), the non-determinism would correlate with GPU load and scheduling decisions.
The pattern — non-deterministic SALU loop termination in a transpiled kernel but not a native one, with no visible instruction-level difference — could indicate an undocumented hardware errata specific to GFX942 when certain KD configurations are used. This would require AMD hardware team involvement to diagnose.
| Fix | Impact |
|---|---|
s_wait_xcnt → s_waitcnt vmcnt(N) |
Correct counter mapping for global stores |
s_code_end padding |
Instruction prefetcher termination |
| MSGPACK vgpr/sgpr patching | Metadata consistency |
v_cmpx_e64 exec_hi clear |
Ghost lane prevention |
| SCC forwarding reorder | No VALU between s_cmp and s_cbranch |
| Inner loop waitcnt merge | 3 waitcnts → 1 |
# On MI300X machine with ROCm 7.2:
cd /home/anush/github/TheRock/therock-build
ROCR=core/ROCR-Runtime/build/rocr/lib
ELFLIB=compiler/amd-comgr/dist/lib/rocm_sysdeps/lib
CLRLIB=core/clr/dist/lib
SYSDEPS=third-party/sysdeps/linux/zlib/build/dist/lib
export LD_LIBRARY_PATH=$ROCR:$ELFLIB:$CLRLIB:$SYSDEPS
export HSA_HOTSWAP_ISA_OVERRIDE=gfx942
# K iteration count test (run 5 times to see non-determinism):
for i in 1 2 3 4 5; do /tmp/splitk_ktest4 2>&1 | grep "K=256"; done
# NOWAIT diagnostic (pure SALU loop, no loads):
HSA_HOTSWAP_SPLITK_NOWAIT=1 /tmp/splitk_ktest4
# Full test suite:
/tmp/ts_fresh- Transpiler:
rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/hotswap/transpiler.cpp - Branch:
users/powderluv/rocm-hotswapon ROCm/rocm-systems - Split-K GFX1250 ELF:
/tmp/matmul_splitk_compute_gfx1250.elf - Test binaries:
/tmp/splitk_ktest4,/tmp/ts_fresh