Hardware: AMD Radeon RX 9070 XT (gfx1201)
OS: Ubuntu 22.04 LTS
Software: ROCm 7.1.0, PyTorch 2.8 (Nightly/Custom)
Getting bleeding-edge AMD hardware to play nice with AI workflows often feels like solving a puzzle. If you picked up an RDNA 4 card (like the RX 9070 XT) and tried to install the standard AI libraries, you likely hit a wall of C++ and assembly build errors.
Here is the breakdown of why it fails, and the specific strategies to get Flash Attention 2, ComfyUI, and Flux.1 running natively.
Most ROCm libraries (like flash-attention) default to the Composable Kernel (CK) backend. This backend is heavily optimized for CDNA (Instinct MI200/MI300) architectures, which typically use a "Wave64" execution model (64 threads per wavefront).
RDNA 3 and 4 (Consumer GPUs like the RX 7000/9000 series) utilize Wave32.
When you try to compile standard Flash Attention, the compiler chokes on assembly instructions like v_cmpx_le_u32 exec. On RDNA 4, the execution mask register (exec) is 32-bit, but legacy code often treats it as 64-bit. This results in the dreaded build error:
error: invalid operand for instruction
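You can confirm the Wave32 side of this story straight from Python. A minimal sketch, assuming a ROCm build of PyTorch (the gcnArchName and warp_size attributes are only exposed on reasonably recent ROCm builds, hence the getattr guards):

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                # e.g. "AMD Radeon RX 9070 XT"
print(getattr(props, "gcnArchName", "unknown"))  # expect "gfx1201" on RDNA 4
print(getattr(props, "warp_size", "unknown"))    # expect 32 (Wave32), not 64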
Do not try to patch the C++ assembly code manually. It is a rabbit hole of conditional compilation flags.
The solution is to switch the backend from Composable Kernel to OpenAI Triton. Triton works like a JIT (Just-In-Time) compiler. Instead of relying on pre-compiled binaries that assume specific hardware, Triton generates kernels specifically for your GPU architecture (gfx1201) at runtime.
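To make the JIT model concrete, here is a toy Triton kernel (purely illustrative, not part of flash-attn; it assumes the triton package installed in the next step). The first time it runs, Triton compiles it for whatever GPU it finds, gfx1201 in this case:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")  # "cuda" maps to the ROCm/HIP device here
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)  # JIT-compiled on first call
assert torch.allclose(out, x + y)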
Ensure you have the ROCm 7.1 SDK installed. Then, inside your Python environment:
# Install the Python Triton wrapper (ensure it matches your ROCm version)
pip install triton

You must force the installer to ignore the broken C++ backend and enable Triton support:
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
pip install flash-attn --no-build-isolation
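Once the install finishes, a quick smoke test confirms the Triton backend actually loads. A minimal sketch, assuming the environment variable above is exported in the shell before Python starts:

import torch
from flash_attn import flash_attn_func

# flash_attn expects (batch, seqlen, heads, headdim) tensors in fp16/bf16 on the GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64])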
RDNA 4 GPUs support native FP8 operations, but standard PyTorch kernels often fail to utilize them efficiently. Enabling TunableOp allows PyTorch to benchmark the available matrix multiplication (GEMM) kernels at runtime and select the fastest ones for your specific architecture.

Add these variables to your launch script:
# Enable runtime kernel tuning
export PYTORCH_TUNABLEOP_ENABLED="1"
# Keep startup fast (benchmarks only a few iterations)
export PYTORCH_TUNABLEOP_TUNING_DURATION="short"
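If you prefer to toggle this from Python instead of the shell, recent PyTorch builds expose the same switches through torch.cuda.tunable. A minimal sketch (the module is only present in reasonably recent releases):

import torch
import torch.cuda.tunable as tunable

tunable.enable(True)         # same effect as PYTORCH_TUNABLEOP_ENABLED=1
print(tunable.is_enabled())  # sanity check

# Any GEMM executed from here on is benchmarked once and the winning kernel is
# cached (results go to a CSV controlled by PYTORCH_TUNABLEOP_FILENAME).
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b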
If you are using ROCm 7.1, you are likely on a pre-release version of PyTorch (e.g., 2.8.0). A standard pip install torchaudio will fail because it tries to pull a stable version (e.g., 2.5.1) that relies on older C++ headers (torch/csrc/stable/...) that may not exist, or may have changed, in the nightly build.

The fix: build torchaudio from source, and match the branch to the PyTorch API you are actually running.
- Install CMake:
pip install cmake

- Clone & Build:
git clone https://github.com/pytorch/audio.git
cd audio
# The "main" branch is often too far ahead.
# "release/2.5" is currently the safe bet for 2.8 nightlies.
git checkout release/2.5
# Ensure your venv binaries are in PATH so setup.py finds cmake
export PATH=$PWD/../venv/bin:$PATH
python setup.py install
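After the build, a quick import check confirms that torchaudio links against your nightly torch. A sketch; the exact version strings will differ on your machine:

import torch
import torchaudio

print(torch.__version__)                 # e.g. a 2.8.0 pre-release string
print(torchaudio.__version__)            # the 2.5.x you just built
print(torchaudio.list_audio_backends())  # should list at least one backend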
Use this script to ensure all variables are set correctly every time you run ComfyUI.

#!/bin/bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source "$DIR/venv/bin/activate"
# 1. Enable Triton Backend for Flash Attention (Fixes crash)
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
# 2. Enable TunableOp (Fixes performance/FP8)
export PYTORCH_TUNABLEOP_ENABLED="1"
export PYTORCH_TUNABLEOP_TUNING_DURATION="short"
python "$DIR/ComfyUI/main.py" "$@"If you have an RDNA 4 card on Linux:
If you have an RDNA 4 card on Linux:

- Backend: Don't use the default Flash Attention build; it assumes Wave64. Use Triton.
- Performance: Enable PYTORCH_TUNABLEOP_ENABLED to unlock proper FP8/GEMM speeds.
- Audio: Compile torchaudio from the release/2.5 branch if you are on a PyTorch nightly.
- Status: The RX 9000 series is fully capable of heavy AI workloads on Linux once you bypass the enterprise-default settings.