@apollo-mg
Created November 24, 2025 19:44
Running Wan 2.1 (14B) on AMD RDNA 4 (RX 9070 XT) - Stability Guide

Hardware: AMD Radeon RX 9070 XT (16GB VRAM)
OS: Ubuntu 22.04 / Linux
ROCm: 7.0 / 7.1 Preview
Goal: Stable text-to-video generation with Wan 2.1 (14B) without crashes or OOM.

The Problem

Running Wan 2.1 on RDNA 4 currently triggers frequent "HIP error: illegal memory access" crashes or immediate out-of-memory (OOM) failures during VAE decoding. The root causes are kernel conflicts with PyTorch's TunableOp and VRAM fragmentation.

The Fix (Launch Script)

Save this as run_wan_safe.sh, make it executable (chmod +x run_wan_safe.sh), and launch ComfyUI through it. The specific environment variables are critical.

#!/bin/bash

# 1. DISABLE System Direct Memory Access (SDMA)
# Prevents data corruption during heavy GGUF transfers on RDNA 4.
export HSA_ENABLE_SDMA=0

# 2. DISABLE PyTorch TunableOp
# Crucial. While TunableOp helps Flux, it causes "Illegal Memory Access" 
# crashes with Wan 2.1 kernels on Navi 4x.
export PYTORCH_TUNABLEOP_ENABLED=0

# 3. ENABLE Triton Backend for Flash Attention
# The default Composable Kernel (CK) backend often fails on RDNA 4.
# Requires flash-attn to be built with this var set.
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"

# 4. Aggressive Memory Fragmentation Control
# Forces PyTorch to split blocks earlier (128MB) and GC sooner (60%).
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.6,max_split_size_mb:128,expandable_segments:True"

echo "Launch Config:"
echo "  SDMA: OFF (Stability)"
echo "  TunableOp: OFF (Fix Illegal Access)"
echo "  Triton FA: ON (Performance)"
echo "  HIP Alloc: Optimized for 16GB"

# Launch ComfyUI with Low VRAM mode to force aggressive offloading
python3 main.py --lowvram --use-split-cross-attention
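If you prefer keeping the exports in your shell profile instead of a wrapper script, a minimal sanity-check sketch (same variable names as above) that sets the flags and echoes what a launched process will inherit:

```bash
#!/bin/bash
# Sanity check: export the stability-critical flags from run_wan_safe.sh,
# then print each one so you can confirm ComfyUI will inherit them.
export HSA_ENABLE_SDMA=0
export PYTORCH_TUNABLEOP_ENABLED=0
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.6,max_split_size_mb:128,expandable_segments:True"

for pair in "HSA_ENABLE_SDMA=$HSA_ENABLE_SDMA" \
            "PYTORCH_TUNABLEOP_ENABLED=$PYTORCH_TUNABLEOP_ENABLED" \
            "FLASH_ATTENTION_TRITON_AMD_ENABLE=$FLASH_ATTENTION_TRITON_AMD_ENABLE" \
            "PYTORCH_HIP_ALLOC_CONF=$PYTORCH_HIP_ALLOC_CONF"; do
  echo "OK: $pair"
done
```

Exported variables propagate to child processes, so a `python3 main.py ...` started from the same shell will see them; an already-running ComfyUI instance will not pick them up until restarted.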

ComfyUI Workflow Settings

Even with the script, you will OOM during the final VAE Decode step unless you use these settings:

  1. Node: Use VAEDecodeTiled (Not standard VAEDecode).
  2. Tile Size: 256 (Default 512 is too large for 16GB VRAM + 14B Model).
  3. Temporal Tiling: 16 (Helps smooth out the decoding).
  4. Overlap: 64.
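To see why the tile size matters, here is a rough back-of-envelope sketch. The numbers (fp16 activations, 3 output channels, a ~16x activation blow-up inside the decoder, 16 frames per temporal tile) are my own illustrative assumptions, not measured values; the point is that per-tile decode memory grows with the square of the tile edge, so halving the tile from 512 to 256 cuts the working set by roughly 4x:

```bash
#!/bin/bash
# Illustrative per-tile memory estimate (assumed constants, for intuition only):
#   bytes   = 2   (fp16)
#   channels= 3   (RGB output)
#   blowup  = 16  (assumed decoder activation blow-up factor)
#   frames  = 16  (temporal tile size from the settings above)
bytes=2; channels=3; blowup=16; frames=16
for tile in 512 256; do
  mb=$(( tile * tile * channels * blowup * frames * bytes / 1024 / 1024 ))
  echo "tile=$tile -> ~${mb} MiB per tile"
done
```

Whatever the true constants are for Wan's VAE, the quadratic scaling holds, which is why dropping to 256 is the difference between OOM and a clean decode on 16GB.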

Notes on Flash Attention

You must build flash-attention from source with the Triton flag enabled:

export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
pip install git+https://github.com/Dao-AILab/flash-attention.git
@flexusjan

In this guide you install flash-attention like this:

export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
pip install git+https://github.com/Dao-AILab/flash-attention.git

In your flash-attention guide you do it like this:

export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
pip install flash-attn --no-build-isolation

What is the preferred way to do it?

@apollo-mg (Author) commented Nov 30, 2025 via email
