# vLLM on DGX Spark (GB10) — Compatibility Issues Report
**Date:** 2026-04-27
**Host:** `octo-et-spark-2.khw.eng.rdu2.dc.redhat.com`
**Image:** `quay.io/aipcc/rhaiis/cuda-ubi9:3.4`
**Model tested:** `RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16`
---
## Hardware Summary
| Property | Value |
|---|---|
| GPU | NVIDIA GB10 (DGX Spark) |
| Architecture | Blackwell |
| Compute Capability | **sm_121** (12.1) |
| GPU Memory | 119.6 GB (unified with CPU) |
| CPU Architecture | aarch64 (ARM Grace) |
| Host OS | Fedora 43, kernel 6.18.10 |
| NVIDIA Driver | 590.48.01 |
| Host CUDA | 13.1 |
## Software Stack in Container
| Component | Version |
|---|---|
| vLLM | 0.18.0+rhaiv.7 |
| PyTorch | 2.10.0 |
| CUDA (container) | 13.0 |
| Triton | 3.6.0 |
| FlashInfer | 0.6.6 |
| flash_attn | **NOT INSTALLED** |
| Transformers | 4.57.6 |
| cuDNN | 9.19.0 |
---
## Key Architectural Context: Two Separate Compilation Pipelines
Understanding the issues below requires distinguishing two completely independent CUDA compilation paths:
1. **Pre-compiled C++/CUDA kernels** (PyTorch, vLLM extensions, FA2, FA3, FlashMLA, FlashInfer):
- Built at image build time via `nvcc`
- Controlled by `TORCH_CUDA_ARCH_LIST`
- Produce `.cubin` (native SASS) or `.ptx` (virtual ISA) embedded in `.so` files
- **Adding `sm_121` to `TORCH_CUDA_ARCH_LIST` and rebuilding gives native SASS for these**
2. **Triton JIT-compiled Python kernels** (FLA/GDN, layernorm fusions, custom ops):
- Compiled at runtime: Python → Triton IR → PTX → `ptxas` → SASS
- Requires a working `ptxas` binary that supports the target arch at **runtime**
- `TORCH_CUDA_ARCH_LIST` has **zero effect** on these — Triton always needs `ptxas`
**These are orthogonal.** Adding `sm_121` to PyTorch's arch list does NOT eliminate the need for `ptxas-blackwell`. Conversely, fixing `ptxas-blackwell` does NOT fix the pre-compiled kernel gaps.
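A quick way to see both pipelines from inside the container (a minimal sketch; it only inspects what is already installed):
```python
import os
import torch
import triton

# Pipeline 1: pre-compiled C++/CUDA kernels — the SASS/PTX targets PyTorch was built with.
print("compiled arch list :", torch.cuda.get_arch_list())
print("device capability  :", torch.cuda.get_device_capability(0))

# Pipeline 2: Triton JIT — needs a working ptxas at runtime, bundled or pointed to by env var.
bundled = os.path.join(os.path.dirname(triton.__file__), "backends", "nvidia", "bin", "ptxas")
print("bundled ptxas exists       :", os.path.exists(bundled))
print("TRITON_PTXAS_PATH          :", os.environ.get("TRITON_PTXAS_PATH"))
print("TRITON_PTXAS_BLACKWELL_PATH:", os.environ.get("TRITON_PTXAS_BLACKWELL_PATH"))
```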
---
## Issue 1: sm_121 IS a real SASS target — but PyTorch was not built with it
**Severity: HIGH (pre-compiled kernels fall back to sm_120 forward-compat)**
Contrary to initial assumption, **`sm_121` is a real, distinct SASS target** in CUDA 13.0+:
```
$ nvcc --list-gpu-code
...
sm_120
sm_121 ← real SASS target, NOT just PTX
$ nvcc --list-gpu-arch
...
compute_120
compute_121 ← corresponding virtual arch
```
Both `nvcc -arch=sm_121` and `ptxas --gpu-name sm_121` succeed. The CUDA 13.0 toolkit in the container fully supports `sm_121` as a native compilation target.
However, PyTorch 2.10.0 in the container was built with:
```python
torch.cuda.get_arch_list() →
['sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_89', 'sm_90a', 'sm_100', 'sm_100a', 'sm_120', 'compute_120']
```
**`sm_121` is absent.** The highest SASS is `sm_120`, and `compute_120` provides PTX forward-compat. On the GB10, `sm_120` SASS runs natively via intra-family forward-compatibility (same ISA family), so this is **not a PTX JIT situation** — the performance impact is minor. But `sm_121`-specific features (if any) are unavailable.
**What needs to happen:** Rebuild PyTorch (and all CUDA extensions) with `sm_121` added to `TORCH_CUDA_ARCH_LIST`. The CUDA 13.0 toolkit in the container already supports it. This gives native SASS with any `sm_121`-specific optimizations.
---
## Issue 2: PyTorch warns about unsupported compute capability 12.1
**Severity: LOW (cosmetic, but signals missing build target)**
```
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (7.5) - (12.0)
```
This warning is emitted because `sm_121` is not in PyTorch's compiled arch list. The GPU works via `sm_120` forward-compat. The warning disappears if PyTorch is rebuilt with `sm_121`.
**What needs to happen:** Same as Issue 1 — add `sm_121` to `TORCH_CUDA_ARCH_LIST` when rebuilding PyTorch.
---
## Issue 3: vLLM Flash Attention 2 — compiled only for sm_80 (SASS)
**Severity: HIGH (real performance regression — cross-generation PTX JIT)**
The `_vllm_fa2_C.abi3.so` (vLLM's bundled Flash Attention 2) contains cubins **only for `sm_80`**:
```
_vllm_fa2_C.abi3.so: all 52 cubins are sm_80 only
```
Unlike the core vLLM `_C.abi3.so` (which has `sm_120` SASS), Flash Attention 2 was compiled only for Ampere. On `sm_121`, this means:
- The kernel must run via **PTX JIT from `sm_80` → `sm_121`**, which is a **cross-generation** JIT (Ampere → Blackwell, skipping Hopper entirely)
- No Blackwell tensor core optimizations, no Blackwell shared memory layout, no Blackwell-specific instruction scheduling
- This is the **hot path for attention computation** — every forward pass goes through this kernel
- Likely significant performance penalty vs native `sm_121` SASS
**What needs to happen:** Rebuild `_vllm_fa2_C.abi3.so` with `sm_120` and `sm_121` in the arch list. The vLLM FA2 build configuration needs these targets added. The CUDA 13.0 toolkit in the container can compile for these targets today.
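The cubin inventory above can be reproduced with `cuobjdump` (a sketch; assumes `cuobjdump` is on `PATH` and that the extensions live somewhere under the installed `vllm` package tree):
```python
import glob
import os
import subprocess
import vllm

pkg_dir = os.path.dirname(vllm.__file__)
for so in sorted(glob.glob(os.path.join(pkg_dir, "**", "*_C*.so"), recursive=True)):
    out = subprocess.run(["cuobjdump", "--list-elf", so],
                         capture_output=True, text=True).stdout
    # Each embedded cubin shows up as "... .sm_XX.cubin"; collect the arch suffixes.
    archs = sorted({"sm_" + ln.split("sm_")[1].split(".")[0]
                    for ln in out.splitlines() if "sm_" in ln})
    print(os.path.basename(so), "->", archs or "no embedded cubins")
```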
---
## Issue 4: vLLM Flash Attention 3 — compiled only for sm_90a
**Severity: MEDIUM (not usable on GB10, falls back to FA2)**
```
_vllm_fa3_C.abi3.so: all 192 cubins are sm_90a only
```
FA3 uses Hopper-specific features (TMA, warp-specialization). On `sm_121`:
- No native cubins → falls back to FA2 (which itself is only `sm_80`, see Issue 3)
- PTX JIT from `sm_90a` to `sm_121` may work but is untested and likely suboptimal
- vLLM correctly selects FA2 over FA3 on this device
**What needs to happen:** FA3 needs Blackwell-native kernels (`sm_120`/`sm_121`). This may require upstream Dao-AILab/flash-attention work, as Blackwell has different TMA semantics than Hopper.
---
## Issue 5: FlashMLA — compiled only for sm_90a and sm_100
**Severity: LOW (only affects DeepSeek-style MLA models)**
```
_flashmla_C.abi3.so: sm_90a, sm_100 only
_flashmla_extension_C.abi3.so: sm_90a, sm_100 only
```
No `sm_120`/`sm_121` cubins. MLA (Multi-Latent Attention) kernels won't run natively on GB10.
**What needs to happen:** Add `sm_120`/`sm_121` to FlashMLA build targets.
---
## Issue 6: flash_attn standalone package is NOT installed
**Severity: MEDIUM**
```
flash_attn import error: No module named 'flash_attn'
```
The standalone Dao-AILab `flash_attn` package is absent. vLLM uses its bundled FA2/FA3 instead. Some models or integrations may expect the standalone package.
**What needs to happen:** Either install `flash_attn` compiled for `sm_120`/`sm_121`, or confirm all code paths use vLLM's bundled attention.
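A one-line presence check from inside the container (sketch):
```python
import importlib.util

print("flash_attn importable:", importlib.util.find_spec("flash_attn") is not None)
```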
---
## Issue 7: FlashInfer — installed but likely no sm_120/sm_121 cubins
**Severity: HIGH (FlashInfer backend unavailable)**
FlashInfer 0.6.6 is installed. `cuobjdump` returned no embedded cubins, suggesting it relies on JIT compilation (which requires Triton, which requires `ptxas-blackwell` — see Issue 8).
vLLM chose FA2 over FlashInfer:
```
Using FLASH_ATTN attention backend out of potential backends:
['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
```
**What needs to happen:** Either rebuild FlashInfer with `sm_120`/`sm_121` SASS pre-compiled, or ensure the Triton JIT path works (requires fixing Issues 8 and 9 first).
---
## Issue 8: Triton missing `ptxas-blackwell` — a container packaging gap
**Severity: CRITICAL (blocks ALL Triton kernel JIT on Blackwell)**
**Background:** Triton 3.6.0 normally ships its own bundled `ptxas` and `ptxas-blackwell` binaries inside `triton/backends/nvidia/bin/`. The `ptxas-blackwell` binary is simply a newer build of NVIDIA's `ptxas` assembler that supports Blackwell PTX. The split exists because Triton wanted to add Blackwell support without risking regressions on older architectures by upgrading the main `ptxas`.
In this container image, the `triton/backends/nvidia/bin/` directory **does not exist** — Triton was installed without its bundled tools. Instead, the container relies on the system CUDA toolkit at `/usr/local/cuda/bin/`. Triton's fallback logic checks env vars:
```python
# triton/backends/nvidia/compiler.py line 35
return knobs.nvidia.ptxas_blackwell if arch >= 100 else knobs.nvidia.ptxas
# Env var: TRITON_PTXAS_BLACKWELL_PATH
```
Neither `TRITON_PTXAS_PATH` nor `TRITON_PTXAS_BLACKWELL_PATH` is set, and neither bundled binary exists, so every Triton kernel compilation crashes.
**The fix is straightforward:** The system `ptxas` from CUDA 13.0 in the container **already fully supports `sm_121`**:
```
$ ptxas --gpu-name sm_121 --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Cuda compilation tools, release 13.0, V13.0.88
```
So `ptxas` IS the "blackwell-capable ptxas" that Triton is looking for. Setting `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` is the correct fix.
**This is independent of `TORCH_CUDA_ARCH_LIST` / `sm_121`.** Triton JIT-compiles kernels at runtime and always needs `ptxas`. Adding `sm_121` to PyTorch's build does not help Triton at all.
**What needs to happen (pick one):**
1. **Best:** Add to the container image Dockerfile:
```dockerfile
ENV TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```
2. **Alternative:** Create the expected directory and symlinks:
```dockerfile
RUN mkdir -p $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin'))") && \
ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas'))") && \
ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas-blackwell'))")
```
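With either fix in place, a minimal Triton kernel confirms the JIT → `ptxas` → SASS path end to end on the GB10 (a sketch, independent of vLLM):
```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 4096
x, y = torch.randn(n, device="cuda"), torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)
assert torch.allclose(out, x + y)  # compilation alone fails without a working ptxas
print("Triton JIT OK on", torch.cuda.get_device_name(0))
```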
---
## Issue 9: Triton runtime memory allocator not configured
**Severity: HIGH (Triton kernels crash at runtime even after ptxas fix)**
After fixing `ptxas-blackwell`, Triton kernels compile successfully but fail at execution:
```
RuntimeError: Kernel requires a runtime memory allocation, but no allocator was set.
Use triton.set_allocator to specify an allocator.
```
This affects the FLA (Flash Linear Attention) GDN prefill kernels in `qwen3_next.py`. On Blackwell, certain Triton kernels use "global scratch" memory, which requires an allocator to be registered via `triton.set_allocator()`.
vLLM catches this as a WARNING and falls back to a slower non-Triton path:
```
WARNING [qwen3_next.py:767] RuntimeError: Kernel requires a runtime memory allocation...
```
**What needs to happen:** vLLM should call `triton.set_allocator()` early in initialization. A `cumem_allocator.abi3.so` exists in the container, suggesting the infrastructure is present but not wired up. Likely fix (a sketch; assumes the allocator callback signature `(size, alignment, stream)` used in Triton's own examples):
```python
import torch
import triton

def torch_alloc_fn(size: int, alignment: int, stream):
    # Back Triton's "global scratch" allocations with a CUDA tensor.
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(torch_alloc_fn)
```
---
## Issue 10: NVML reports "Not Supported" for memory queries
**Severity: MEDIUM (affects monitoring and memory management)**
```python
>>> nvmlDeviceGetMemoryInfo(handle)
# Raises: "Not Supported"
```
`nvidia-smi` shows `Memory-Usage: Not Supported`. The GB10 uses **unified memory** (shared CPU+GPU pool). NVML's discrete GPU memory API doesn't apply.
Consequences:
- `torch.cuda.get_device_properties().total_memory` reports 119.6 GB (the full unified pool), but actual available GPU memory depends on CPU usage
- `--gpu-memory-utilization 0.9` failed because it requested 107.66 GB, but only 39.9 GB was free (CPU page cache consumed the shared pool)
- No standard tool can report real-time GPU memory usage
**What needs to happen:**
- vLLM needs unified-memory-aware memory management for DGX Spark
- Default `gpu_memory_utilization` should be lower (0.5 or less) when unified memory is detected
- Consider detecting GB10/unified memory at startup and auto-adjusting (one possible heuristic is sketched below)
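One possible heuristic, hypothetical and not current vLLM behavior; it assumes `/proc/meminfo`'s `MemAvailable` reflects what is actually left of the shared CPU+GPU pool on GB10:
```python
import torch

def suggested_gpu_memory_utilization(headroom_gib: float = 8.0) -> float:
    # Full unified pool as PyTorch reports it (119.6 GB on GB10).
    total = torch.cuda.get_device_properties(0).total_memory
    with open("/proc/meminfo") as f:
        meminfo = {k: v for k, v in (line.split(":", 1) for line in f)}
    avail = int(meminfo["MemAvailable"].split()[0]) * 1024  # kB -> bytes
    usable = max(avail - headroom_gib * 2**30, 0)
    return round(min(usable / total, 0.9), 2)

print("--gpu-memory-utilization", suggested_gpu_memory_utilization())
```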
---
## Issue 11: SELinux blocks GPU access in containers (podman)
**Severity: MEDIUM (requires workaround for every container launch)**
Running with `--device nvidia.com/gpu=all` alone:
```
Failed to initialize NVML: Insufficient Permissions
CUDA available: False
```
Fix: add `--security-opt=label=disable` to disable SELinux label enforcement for the container.
**What needs to happen:** Either update SELinux policy with nvidia device rules, or document the required flag.
---
## Issue 12: Container image has `HF_HUB_OFFLINE=1` baked in
**Severity: LOW**
Prevents model downloads. This is intentional for production images, but it blocks first-time setup.
**Workaround:** `-e HF_HUB_OFFLINE=0`
---
## Issue 13: torch.compile / torch.inductor blocked — cascading from ptxas issue
**Severity: HIGH (blocks CUDA graphs and compilation optimizations)**
Even with `--enforce-eager`, some model layers use `@torch.compile` decorators. The Inductor backend invokes Triton, which hits the `ptxas-blackwell` issue (Issue 8). This forced the following workarounds:
- `TORCH_COMPILE_DISABLE=1`
- `TORCHINDUCTOR_DISABLE=1`
- `--enforce-eager`
The model runs in pure eager mode — no torch.compile fusion, no CUDA graphs. This is a significant performance penalty.
**What needs to happen:** This is a **cascading effect of Issue 8**. Once `TRITON_PTXAS_BLACKWELL_PATH` is set in the container image, torch.compile and CUDA graphs should work without these env vars.
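Once the env var is in the image, a tiny `torch.compile` smoke test exercises the full Inductor → Triton → `ptxas` chain (sketch):
```python
import torch

@torch.compile
def fused_silu(x):
    # Inductor generates a Triton kernel for this fusion, which must pass through ptxas.
    return torch.nn.functional.silu(x) * x

x = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
fused_silu(x)
print("torch.compile / Inductor OK")
```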
---
## Issue 14: vLLM core CUDA extensions — have sm_120, lack sm_121
**Severity: LOW (functional, minor optimization gap)**
The main vLLM `_C.abi3.so` includes cubins for:
```
sm_75, sm_80, sm_86, sm_87, sm_89, sm_90, sm_90a, sm_100, sm_100a, sm_120, sm_120a
```
`_moe_C.abi3.so` similarly includes up to `sm_120`.
Since `sm_121` is in the same ISA family as `sm_120`, these SASS cubins run **natively** on the GB10 via intra-family forward-compatibility. This is NOT the same as PTX JIT — there's no JIT compilation happening. The performance impact is minimal.
Adding `sm_121` to the build would only matter if there are `sm_121`-specific instruction variants or scheduling optimizations.
---
## Issue 15: Transformers library too old for Gemma 4
**Severity: MEDIUM (blocks specific models)**
Transformers 4.57.6 does not recognize `model_type: gemma4`. Requires >= 4.58.0.
**What needs to happen:** Upgrade transformers if Gemma 4 support is desired.
---
## Working `podman run` Command (All Workarounds Applied)
```bash
podman run -d \
--name vllm-serving \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--user 0 \
-p 8000:8000 \
-v /tmp/hf-cache:/root/.cache/huggingface \
-e HF_HUB_OFFLINE=0 \
-e HF_HOME=/root/.cache/huggingface \
-e TORCH_COMPILE_DISABLE=1 \
-e TORCHINDUCTOR_DISABLE=1 \
-e TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas \
--shm-size=16g \
--entrypoint python3 \
quay.io/aipcc/rhaiis/cuda-ubi9:3.4 \
-m vllm.entrypoints.openai.api_server \
--model RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16 \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.5 \
--enforce-eager \
--host 0.0.0.0 \
--port 8000
```
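Once the container is up, a minimal completion request verifies end-to-end serving (sketch; assumes the server is reachable on `localhost:8000`):
```python
import json
import urllib.request

payload = {
    "model": "RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16",
    "prompt": "Say hello from the DGX Spark.",
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```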
---
## Priority Work Items
| Priority | Issue | Fix | Impact |
|---|---|---|---|
| **P0** | Triton `ptxas-blackwell` missing (#8) | Add `TRITON_PTXAS_BLACKWELL_PATH` env var to container image | Unblocks ALL Triton JIT, torch.compile, CUDA graphs |
| **P0** | torch.compile/inductor blocked (#13) | Cascading fix from #8 — remove `TORCH_COMPILE_DISABLE` once ptxas is fixed | Restores compiled execution + CUDA graphs |
| **P0** | FA2 compiled only for sm_80 (#3) | Rebuild with `sm_120 sm_121` in arch list | Native SASS for attention hot path |
| **P1** | Triton runtime allocator (#9) | Call `triton.set_allocator(TorchAllocator)` in vLLM init | Enables Triton FLA/GDN kernels |
| **P1** | FlashInfer no sm_120 support (#7) | Rebuild or ensure JIT path works after #8 fix | Enables faster attention backend |
| **P1** | FA3 no Blackwell kernels (#4) | Upstream FA3 work + rebuild for sm_120/sm_121 | Best-in-class attention perf |
| **P1** | NVML/unified memory (#10) | Detect unified memory, auto-lower `gpu_memory_utilization` | Prevents OOM on startup |
| **P2** | PyTorch sm_121 not in arch list (#1, #2) | Rebuild PyTorch with `sm_121` in `TORCH_CUDA_ARCH_LIST` | Native SASS for all PyTorch ops, removes warning |
| **P2** | SELinux blocking GPU (#11) | SELinux policy update or documentation | Smoother container UX |
| **P2** | FlashMLA no sm_120 (#5) | Rebuild for sm_120/sm_121 | DeepSeek model support |
| **P3** | HF_HUB_OFFLINE=1 (#12) | Document workaround | Dev experience |
| **P3** | Transformers < 4.58 (#15) | Upgrade | Gemma 4 support |
---
## Summary of Current State
The model **does serve** on the DGX Spark, but in a heavily degraded mode:
- **Eager execution only** — no torch.compile, no CUDA graphs (due to missing `ptxas-blackwell`)
- **Flash Attention 2 running from sm_80 PTX JIT** — cross-generation JIT from Ampere to Blackwell
- **No Triton FLA/GDN kernels** — the allocator is not configured, so execution falls back to a slower non-Triton path
- **No FlashInfer** — JIT compilation blocked by same ptxas issue
- **Memory management fragile** — unified memory pool requires manual `gpu_memory_utilization` tuning
The **single highest-impact fix** is adding `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` to the container image. This one env var unblocks Triton JIT, torch.compile, CUDA graphs, and FlashInfer JIT — effectively unlocking the entire compiled execution stack.
The **second highest-impact fix** is rebuilding Flash Attention 2 with `sm_120`/`sm_121` SASS targets, giving native attention kernels instead of cross-generation PTX JIT from sm_80.