@apollo-mg
Last active March 31, 2026 21:28
Guide for running AI on AMD RDNA 4 GPUs with ROCm 7.1 on Linux (Updated with Performance Tuning)

Running AI on AMD RDNA 4 (RX 9000 Series): The ROCm 7.1 & Flash Attention Survival Guide

Hardware: AMD Radeon RX 9070 XT (gfx1201)
OS: Ubuntu 22.04 LTS
Software: ROCm 7.1.0, PyTorch 2.8 (Nightly/Custom)

Getting bleeding-edge AMD hardware to play nice with AI workflows often feels like solving a puzzle. If you picked up an RDNA 4 card (like the RX 9070 XT) and tried to install the standard AI libraries, you likely hit a wall of C++ assembly errors.

Here is the breakdown of why it fails, and the specific strategies to get Flash Attention 2, ComfyUI, and Flux.1 running natively.

The Core Problem: Wave32 vs. Wave64

Most ROCm libraries (like flash-attention) default to the Composable Kernel (CK) backend. This backend is heavily optimized for CDNA (Instinct MI200/MI300) architectures, which typically use a "Wave64" execution model (64 threads per wavefront).

RDNA 3 and 4 (Consumer GPUs like the RX 7000/9000 series) utilize Wave32.

When you try to compile standard Flash Attention, the compiler chokes on assembly instructions like v_cmpx_le_u32 exec. On RDNA 4, the execution mask register (exec) is 32-bit, but legacy code often treats it as 64-bit. This results in the dreaded build error:

error: invalid operand for instruction
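The width mismatch is easy to see in plain Python (an illustration, not part of the build fix): a full-wavefront execution mask for Wave64 simply cannot be encoded in a 32-bit exec register.

```python
# Execution masks with every lane active, as the hardware would encode them.
wave32_mask = (1 << 32) - 1   # 0xFFFFFFFF: fits the 32-bit exec register on RDNA 4
wave64_mask = (1 << 64) - 1   # 0xFFFFFFFFFFFFFFFF: assumes CDNA's 64-bit exec register

print(hex(wave32_mask))
print(wave64_mask.bit_length())  # 64 bits: twice as wide as RDNA 4's exec register
```

Legacy CK code that emits 64-bit exec operands is therefore simply invalid on this target, which is exactly what the assembler error is telling you.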

The Fix: The Triton Pivot

Do not try to patch the C++ assembly code manually. It is a rabbit hole of conditional compilation flags.

The solution is to switch the backend from Composable Kernel to OpenAI Triton. Triton works like a JIT (Just-In-Time) compiler. Instead of relying on pre-compiled binaries that assume specific hardware, Triton generates kernels specifically for your GPU architecture (gfx1201) at runtime.
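To make "generates kernels at runtime" concrete, here is a minimal Triton kernel sketch (a generic vector add, not code from flash-attention): the @triton.jit decorator compiles the Python function for the detected GPU target the first time it is launched. The import guard is only so the snippet degrades gracefully on a machine without Triton.

```python
try:
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        # Each program instance handles one BLOCK-sized chunk of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements  # guard the tail of the vector
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    status = "triton kernel defined"
except ImportError:
    status = "triton not installed"

print(status)
```

Nothing here names gfx1201: the architecture is picked up from the driver when the kernel is first compiled, which is why the same wheel works on Wave32 consumer cards.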

1. Install Dependencies

Ensure you have the ROCm 7.1 SDK installed. Then, inside your Python environment:

# Install the Python Triton wrapper (ensure it matches your ROCm version)
pip install triton

2. Install Flash Attention (The Right Way)

You must force the installer to ignore the broken C++ backend and enable Triton support.

export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
pip install flash-attn --no-build-isolation
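Once installed, a quick smoke test confirms the Triton path works end to end. This sketch is not from the original guide: the tensor layout follows flash-attn's (batch, seqlen, nheads, head_dim) convention, and the broad exception guard just lets it report cleanly on a machine without a GPU or the package.

```python
import os

# The backend switch must be set before flash_attn is imported.
os.environ.setdefault("FLASH_ATTENTION_TRITON_AMD_ENABLE", "TRUE")

try:
    import torch
    from flash_attn import flash_attn_func

    # flash-attn expects half-precision (batch, seqlen, nheads, head_dim) tensors.
    q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
    k = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
    v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
    status = f"flash-attn OK, output shape {tuple(flash_attn_func(q, k, v).shape)}"
except Exception as exc:  # ImportError, or a CUDA/HIP init error on CPU-only boxes
    status = f"flash-attn unavailable here: {exc}"

print(status)
```

If the Triton backend is active, the output tensor comes back with the same shape as q and no "invalid operand" errors ever enter the picture, because nothing was compiled from the CK assembly.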

Performance Tuning: TunableOp (The "60% Speedup" Fix)

RDNA 4 GPUs support native FP8 operations, but standard PyTorch kernels often fail to utilize them efficiently. Enabling TunableOp allows PyTorch to benchmark available matrix multiplication kernels (GEMM) at runtime and select the fastest ones for your specific architecture.

Add these variables to your launch script:

# Enable runtime kernel tuning
export PYTORCH_TUNABLEOP_ENABLED="1"

# Keep startup fast (benchmarks only a few iterations)
export PYTORCH_TUNABLEOP_TUNING_DURATION="short"
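TunableOp can also persist its benchmark results so later launches skip re-tuning. The variable below comes from PyTorch's TunableOp documentation and is optional; the path is just an example.

```shell
# Write tuned kernel selections to a CSV; if the file already exists at
# startup, the results are loaded instead of being re-benchmarked.
export PYTORCH_TUNABLEOP_FILENAME="$HOME/tunableop_results.csv"
```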

Handling Bleeding-Edge PyTorch (Audio)

If you are using ROCm 7.1, you are likely on a pre-release version of PyTorch (e.g., 2.8.0). Standard pip install torchaudio will fail because it tries to pull a stable version (e.g., 2.5.1) that relies on older C++ headers (torch/csrc/stable/...) that may not exist or have changed in the nightly build.

The Fix: Build torchaudio from source, choosing a branch whose C++ API matches your installed PyTorch.

  1. Install CMake: pip install cmake
  2. Clone & Build:
git clone https://github.com/pytorch/audio.git
cd audio
# The "main" branch is often too far ahead. 
# "release/2.5" is currently the safe bet for 2.8 nightlies.
git checkout release/2.5 
# Ensure your venv binaries are in PATH so setup.py finds cmake
export PATH=$PWD/../venv/bin:$PATH 
python setup.py install
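After the build finishes, a quick sanity check (not part of the original steps) confirms the import resolves against your nightly torch rather than a stale wheel:

```python
# Report the installed torch/torchaudio pair, or why the import failed
# (an ABI mismatch typically surfaces here as an import-time error).
try:
    import torch
    import torchaudio
    status = f"torchaudio {torchaudio.__version__} against torch {torch.__version__}"
except Exception as exc:
    status = f"import failed: {exc}"

print(status)
```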

Final Launch Script (run_comfy_optimized.sh)

Use this script to ensure all variables are set correctly every time you run ComfyUI.

#!/bin/bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source "$DIR/venv/bin/activate"

# 1. Enable Triton Backend for Flash Attention (Fixes crash)
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"

# 2. Enable TunableOp (Fixes performance/FP8)
export PYTORCH_TUNABLEOP_ENABLED="1"
export PYTORCH_TUNABLEOP_TUNING_DURATION="short"

python "$DIR/ComfyUI/main.py" "$@"
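If you want the launch to fail fast when a variable is missing, a small preflight check can run before main.py. The helper below is hypothetical (the variable names and values come from the script above; the function itself is not part of ComfyUI):

```python
import os

# Environment this guide expects before ComfyUI starts.
REQUIRED_ENV = {
    "FLASH_ATTENTION_TRITON_AMD_ENABLE": "TRUE",
    "PYTORCH_TUNABLEOP_ENABLED": "1",
}

def missing_vars(env=None):
    """Return the names of required variables that are unset or wrong."""
    env = os.environ if env is None else env
    return sorted(k for k, v in REQUIRED_ENV.items() if env.get(k) != v)

# Example: an empty environment is missing both variables.
print(missing_vars({}))
```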

TL;DR Summary

If you have an RDNA 4 card on Linux:

  1. Backend: Don't use the default Flash Attention build; it assumes Wave64. Use Triton.
  2. Performance: Enable PYTORCH_TUNABLEOP_ENABLED to unlock proper FP8/GEMM speeds.
  3. Audio: Compile torchaudio from the release/2.5 branch if on PyTorch Nightly.
  4. Status: The RX 9000 series is fully capable of heavy AI workloads on Linux once you bypass the enterprise-default settings.
@flexusjan

Awesome! Flash-Attention is finally working for me. Good Job!
