Last Updated: March 9, 2026 | Environment: ROCm 7.2 / Poachers Special Ed (PyTorch 2.9.1 / Triton 3.5.1)
- Flash Linear Attention (FLA): ASCENDED. Liberated from Docker; running bare-metal via Triton kernels.
- 4-bit Resident Vision: CONFIRMED. Qwen 3.5 4B running in 4.7GB VRAM with ~27-40s prefill.
- Dual-Core Residency: VERIFIED. Logic (DeepSeek-R1 14B @ 51 tok/s) and Vision (Qwen 3.5 4B) running simultaneously in 16GB VRAM.
- Triton 3.5.1 + PyTorch 2.9.1: Stable native pairing for GFX1201.
- Unsloth 4-bit Native: Works perfectly once vLLM/CUDA dependency checks are bypassed.
- Prefill Latency: Currently 27-40s. Bottleneck identified in Triton prefill kernels; target is <5s.
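Chasing the <5s target starts with separating time-to-first-token (prefill) from steady-state decode throughput, since they stress different kernels. A minimal, framework-agnostic timing sketch — the `generate_first_token`/`generate_next_token` callables are placeholders for the real model invocations, not an actual API:

```python
import time

def measure_prefill_decode(generate_first_token, generate_next_token, n_decode=32):
    """Split end-to-end latency into prefill time (TTFT) and decode tokens/sec."""
    t0 = time.perf_counter()
    generate_first_token()          # runs the whole prompt through the prefill path
    prefill_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(n_decode):
        generate_next_token()       # one autoregressive step per call
    decode_tps = n_decode / (time.perf_counter() - t1)
    return prefill_s, decode_tps

# Placeholder callables stand in for actual model calls.
prefill_s, decode_tps = measure_prefill_decode(lambda: None, lambda: None)
print(f"prefill: {prefill_s:.3f}s, decode: {decode_tps:.1f} tok/s")
```

Running this around the real prefill and decode steps makes it easy to confirm whether the 27-40s is dominated by the Triton prefill kernels rather than, say, weight loading.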
- FP8 Hardware Status: RESEARCHED. GFX1201 supports `float8_e4m3fnuz` natively, but Triton 3.5.1 lacks intrinsic legalization; software emulation causes a ~10x slowdown.
- Frankenstein Build: Setup uses 24.04 container libraries on a 22.04 host. OS migration to 24.04 planned.
- Native FP8 MatMul: PyTorch `addmm` and Triton kernels currently fail legalization/intrinsic mapping for GFX1201.
- Native `pip install causal-conv1d`: Still blocked by hardcoded NVIDIA/NVCC checks.
- vLLM Native Linking: ABI drift in PyTorch Nightly breaks binary extension loading (`getCurrentHIPStream` error).
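Until Triton legalizes FP8 intrinsics for GFX1201, any code path should probe for FP8 support and fall back to BF16 rather than assume it. A hedged capability-probe sketch (the dtype attribute names are real PyTorch identifiers; the fallback policy itself is an assumption):

```python
def pick_matmul_dtype():
    """Return a preferred matmul dtype name: FP8 if this build exposes it, else BF16.

    hasattr-probing avoids hard failures on builds that predate the FP8 dtypes.
    Note: the dtype existing says nothing about it being *fast* -- on GFX1201
    today it may still hit the ~10x software-emulation path, so benchmark too.
    """
    try:
        import torch
    except ImportError:
        return "none"  # no PyTorch installed at all
    if hasattr(torch, "float8_e4m3fnuz"):  # ROCm-flavored FP8 dtype
        return "float8_e4m3fnuz"
    return "bfloat16"

print(pick_matmul_dtype())
```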
Methodology: Sovereign Extraction & Infiltration
- Liberated optimized RDNA 4 wheels (Torch/Triton/Apex) from `rocm/vllm-dev:rocm7.2_navi`.
- Poached internal Triton kernels (`fla`, `causal_conv1d`) directly from container source.
- Engineered local shims to strip `vllm` and `cuda` dependencies.
- Applied a "nuclear patch" to Unsloth to ignore hardware gatekeeping.
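The dependency-stripping shims can be as simple as registering stand-in modules in `sys.modules` before the poached code is imported, so its `import vllm`-style guards resolve harmlessly. A minimal sketch — the stubbed attribute here is hypothetical; a real shim stubs whatever names the importing code actually touches:

```python
import sys
import types

def install_stub(name, attrs=None):
    """Register a fake module under `name` so later imports succeed
    without the real package being installed."""
    mod = types.ModuleType(name)
    for attr, value in (attrs or {}).items():
        setattr(mod, attr, value)
    sys.modules[name] = mod
    return mod

# Hypothetical: satisfy an `import vllm` version check inside liberated kernel code.
install_stub("vllm", {"__version__": "0.0.0+stub"})

import vllm  # resolves to the stub, not a real vLLM install
print(vllm.__version__)
```

Because `sys.modules` is consulted before the import machinery searches disk, the stub wins even on a machine where the real package would fail to load (e.g. due to the `getCurrentHIPStream` ABI break).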
Current Verdict: The RX 9070 XT is a fully-functional, resident-capable AI workstation for Logic (14B) + Vision (4B) workflows.