Last Updated: March 9, 2026 | Environment: ROCm 7.2 / Poachers Special Ed (PyTorch 2.9.1 / Triton 3.5.1)
- Flash Linear Attention (FLA): ASCENDED. Liberated from Docker; running bare-metal via Triton kernels.
- 4-bit Resident Vision: CONFIRMED. Qwen 3.5 4B running in 4.7GB VRAM with ~27-40s prefill.
- Dual-Core Residency: VERIFIED. Logic (DeepSeek-R1 14B @ 51 tok/s) and Vision (Qwen 3.5 4B) running simultaneously in 16GB VRAM.
- Triton 3.5.1 + PyTorch 2.9.1: Stable native pairing for GFX1201.
- Unsloth 4-bit Native: Works perfectly once vLLM/CUDA dependency checks are bypassed.
- Prefill Latency: Currently 27-40s. Bottleneck identified in Triton prefill kernels; target is <5s.
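Chasing the <5s target starts with separating time-to-first-token (prefill) from steady-state decode throughput, since they stress different kernels. A minimal, framework-agnostic timing sketch — the `generate_first_token`/`generate_next_token` callables are placeholders for the real model invocations, not an actual API:

```python
import time

def measure_prefill_decode(generate_first_token, generate_next_token, n_decode=32):
    """Split end-to-end latency into prefill time (TTFT) and decode tokens/sec."""
    t0 = time.perf_counter()
    generate_first_token()          # runs the whole prompt through the prefill path
    prefill_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(n_decode):
        generate_next_token()       # one autoregressive step per call
    decode_tps = n_decode / (time.perf_counter() - t1)
    return prefill_s, decode_tps

# Placeholder callables stand in for actual model calls.
prefill_s, decode_tps = measure_prefill_decode(lambda: None, lambda: None)
print(f"prefill: {prefill_s:.3f}s, decode: {decode_tps:.1f} tok/s")
```

Running this around the real prefill and decode steps makes it easy to confirm whether the 27-40s is dominated by the Triton prefill kernels rather than, say, weight loading.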
- FP8 Hardware Status: RESEARCHED. GFX1201 supports `float8_e4m3fnuz` natively, but Triton 3.5.1 lacks intrinsic legalization; software emulation causes a ~10x slowdown.
- Frankenstein Build: Setup uses 24.04 container libraries on a 22.04 host. OS migration to 24.04 planned.
- Native FP8 MatMul: PyTorch `addmm` and Triton kernels currently fail legalization/intrinsic mapping for GFX1201.
- Native `pip install causal-conv1d`: Still blocked by hardcoded NVIDIA/NVCC checks.
- vLLM Native Linking: ABI drift in PyTorch Nightly breaks binary extension loading (`getCurrentHIPStream` error).
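Until Triton legalizes FP8 intrinsics for GFX1201, any code path should probe for FP8 support and fall back to BF16 rather than assume it. A hedged capability-probe sketch (the dtype attribute names are real PyTorch identifiers; the fallback policy itself is an assumption):

```python
def pick_matmul_dtype():
    """Return a preferred matmul dtype name: FP8 if this build exposes it, else BF16.

    hasattr-probing avoids hard failures on builds that predate the FP8 dtypes.
    Note: the dtype existing says nothing about it being *fast* -- on GFX1201
    today it may still hit the ~10x software-emulation path, so benchmark too.
    """
    try:
        import torch
    except ImportError:
        return "none"  # no PyTorch installed at all
    if hasattr(torch, "float8_e4m3fnuz"):  # ROCm-flavored FP8 dtype
        return "float8_e4m3fnuz"
    return "bfloat16"

print(pick_matmul_dtype())
```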
Methodology: Sovereign Extraction & Infiltration
- Liberated optimized RDNA 4 wheels (Torch/Triton/Apex) from `rocm/vllm-dev:rocm7.2_navi`.
- Poached internal Triton kernels (`fla`, `causal_conv1d`) directly from container source.
- Engineered local shims to strip `vllm` and `cuda` dependencies.
- Applied a "nuclear patch" to Unsloth to ignore hardware gatekeeping.
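The dependency-stripping shims can be as simple as registering stand-in modules in `sys.modules` before the poached code is imported, so its `import vllm`-style guards resolve harmlessly. A minimal sketch — the stubbed attribute here is hypothetical; a real shim stubs whatever names the importing code actually touches:

```python
import sys
import types

def install_stub(name, attrs=None):
    """Register a fake module under `name` so later imports succeed
    without the real package being installed."""
    mod = types.ModuleType(name)
    for attr, value in (attrs or {}).items():
        setattr(mod, attr, value)
    sys.modules[name] = mod
    return mod

# Hypothetical: satisfy an `import vllm` version check inside liberated kernel code.
install_stub("vllm", {"__version__": "0.0.0+stub"})

import vllm  # resolves to the stub, not a real vLLM install
print(vllm.__version__)
```

Because `sys.modules` is consulted before the import machinery searches disk, the stub wins even on a machine where the real package would fail to load (e.g. due to the `getCurrentHIPStream` ABI break).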
Current Verdict: The RX 9070 XT is a fully-functional, resident-capable AI workstation for Logic (14B) + Vision (4B) workflows.