# vLLM on DGX Spark (GB10) — Compatibility Issues Report

**Date:** 2026-04-27
**Host:** `octo-et-spark-2.khw.eng.rdu2.dc.redhat.com`
**Image:** `quay.io/aipcc/rhaiis/cuda-ubi9:3.4`
**Model tested:** `RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16`

---

## Hardware Summary

| Property | Value |
|---|---|
| GPU | NVIDIA GB10 (DGX Spark) |
| Architecture | Blackwell |
| Compute Capability | **sm_121** (12.1) |
| GPU Memory | 119.6 GB (unified with CPU) |
| CPU Architecture | aarch64 (ARM Grace) |
| Host OS | Fedora 43, kernel 6.18.10 |
| NVIDIA Driver | 590.48.01 |
| Host CUDA | 13.1 |
## Software Stack in Container

| Component | Version |
|---|---|
| vLLM | 0.18.0+rhaiv.7 |
| PyTorch | 2.10.0 |
| CUDA (container) | 13.0 |
| Triton | 3.6.0 |
| FlashInfer | 0.6.6 |
| flash_attn | **NOT INSTALLED** |
| Transformers | 4.57.6 |
| cuDNN | 9.19.0 |

---
## Key Architectural Context: Two Separate Compilation Pipelines

Understanding the issues below requires distinguishing two completely independent CUDA compilation paths:

1. **Pre-compiled C++/CUDA kernels** (PyTorch, vLLM extensions, FA2, FA3, FlashMLA, FlashInfer):
   - Built at image build time via `nvcc`
   - Controlled by `TORCH_CUDA_ARCH_LIST`
   - Produce `.cubin` (native SASS) or `.ptx` (virtual ISA) embedded in `.so` files
   - **Adding `sm_121` to `TORCH_CUDA_ARCH_LIST` and rebuilding gives native SASS for these**
2. **Triton JIT-compiled Python kernels** (FLA/GDN, layernorm fusions, custom ops):
   - Compiled at runtime: Python → Triton IR → PTX → `ptxas` → SASS
   - Requires a working `ptxas` binary that supports the target arch at **runtime**
   - `TORCH_CUDA_ARCH_LIST` has **zero effect** on these — Triton always needs `ptxas`

**These are orthogonal.** Adding `sm_121` to PyTorch's arch list does NOT eliminate the need for `ptxas-blackwell`. Conversely, fixing `ptxas-blackwell` does NOT fix the pre-compiled kernel gaps.
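Both paths can be checked from inside the container. A minimal sketch (note: `shutil.which` only reports the first `ptxas` on `PATH`, which is not necessarily the binary Triton ultimately resolves):

```python
import shutil
import torch

# Pipeline 1: pre-compiled SASS/PTX baked into PyTorch and its extensions.
major, minor = torch.cuda.get_device_capability(0)   # (12, 1) on the GB10
print(f"device is sm_{major}{minor}")
print("torch was built for:", torch.cuda.get_arch_list())

# Pipeline 2: Triton JIT needs a ptxas binary at runtime, regardless of the above.
print("ptxas found at:", shutil.which("ptxas"))
```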
---

## Issue 1: sm_121 IS a real SASS target — but PyTorch was not built with it

**Severity: HIGH (pre-compiled kernels fall back to sm_120 forward-compat)**

Contrary to initial assumption, **`sm_121` is a real, distinct SASS target** in CUDA 13.0+:

```
$ nvcc --list-gpu-code
...
sm_120
sm_121       ← real SASS target, NOT just PTX

$ nvcc --list-gpu-arch
...
compute_120
compute_121  ← corresponding virtual arch
```

Both `nvcc -arch=sm_121` and `ptxas --gpu-name sm_121` succeed. The CUDA 13.0 toolkit in the container fully supports `sm_121` as a native compilation target.

However, PyTorch 2.10.0 in the container was built with:

```python
torch.cuda.get_arch_list() →
['sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_89', 'sm_90a', 'sm_100', 'sm_100a', 'sm_120', 'compute_120']
```

**`sm_121` is absent.** The highest SASS is `sm_120`, and `compute_120` provides PTX forward-compat. On the GB10, `sm_120` SASS runs natively via intra-family forward-compatibility (same ISA family), so this is **not a PTX JIT situation** — the performance impact is minor. But `sm_121`-specific features (if any) are unavailable.

**What needs to happen:** Rebuild PyTorch (and all CUDA extensions) with `sm_121` added to `TORCH_CUDA_ARCH_LIST`. The CUDA 13.0 toolkit in the container already supports it. This gives native SASS with any `sm_121`-specific optimizations.
---

## Issue 2: PyTorch warns about unsupported compute capability 12.1

**Severity: LOW (cosmetic, but signals missing build target)**

```
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (7.5) - (12.0)
```

This warning is emitted because `sm_121` is not in PyTorch's compiled arch list. The GPU works via `sm_120` forward-compat. The warning disappears if PyTorch is rebuilt with `sm_121`.

**What needs to happen:** Same as Issue 1 — add `sm_121` to `TORCH_CUDA_ARCH_LIST` when rebuilding PyTorch.
---

## Issue 3: vLLM Flash Attention 2 — compiled only for sm_80 (SASS)

**Severity: HIGH (real performance regression — cross-generation PTX JIT)**

The `_vllm_fa2_C.abi3.so` (vLLM's bundled Flash Attention 2) contains cubins **only for `sm_80`**:

```
_vllm_fa2_C.abi3.so: all 52 cubins are sm_80 only
```

Unlike the core vLLM `_C.abi3.so` (which has `sm_120` SASS), Flash Attention 2 was compiled only for Ampere. On `sm_121`, this means:

- The kernel must run via **PTX JIT from `sm_80` → `sm_121`**, which is a **cross-generation** JIT (Ampere → Blackwell, skipping Hopper entirely)
- No Blackwell tensor core optimizations, no Blackwell shared memory layout, no Blackwell-specific instruction scheduling
- This is the **hot path for attention computation** — every forward pass goes through this kernel
- Likely significant performance penalty vs native `sm_121` SASS

**What needs to happen:** Rebuild `_vllm_fa2_C.abi3.so` with `sm_120` and `sm_121` in the arch list. The vLLM FA2 build configuration needs these targets added. The CUDA 13.0 toolkit in the container can compile for these targets today.
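To verify what a rebuilt extension actually contains, the embedded SASS targets can be listed straight from the shared object. A minimal sketch (assumptions: `cuobjdump` from the CUDA toolkit is on `PATH`, and the `.so` lives under the installed `vllm` package at the path below — adjust for your layout):

```python
import os
import re
import subprocess

import vllm

# Path is an assumption; locate the bundled FA2 extension wherever your install places it.
so_path = os.path.join(os.path.dirname(vllm.__file__), "vllm_flash_attn", "_vllm_fa2_C.abi3.so")

# cuobjdump --list-elf prints one line per embedded cubin, including its sm_XX target.
out = subprocess.run(["cuobjdump", "--list-elf", so_path],
                     capture_output=True, text=True, check=True).stdout
print(sorted(set(re.findall(r"sm_\d+a?", out))))   # expect sm_120 / sm_121 after the rebuild
```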
---

## Issue 4: vLLM Flash Attention 3 — compiled only for sm_90a

**Severity: MEDIUM (not usable on GB10, falls back to FA2)**

```
_vllm_fa3_C.abi3.so: all 192 cubins are sm_90a only
```

FA3 uses Hopper-specific features (TMA, warp-specialization). On `sm_121`:

- No native cubins → falls back to FA2 (which itself is only `sm_80`, see Issue 3)
- PTX JIT from `sm_90a` to `sm_121` may work but is untested and likely suboptimal
- vLLM correctly selects FA2 over FA3 on this device

**What needs to happen:** FA3 needs Blackwell-native kernels (`sm_120`/`sm_121`). This may require upstream Dao-AILab/flash-attention work, as Blackwell has different TMA semantics than Hopper.
---

## Issue 5: FlashMLA — compiled only for sm_90a and sm_100

**Severity: LOW (only affects DeepSeek-style MLA models)**

```
_flashmla_C.abi3.so: sm_90a, sm_100 only
_flashmla_extension_C.abi3.so: sm_90a, sm_100 only
```

No `sm_120`/`sm_121` cubins. MLA (Multi-head Latent Attention) kernels won't run natively on GB10.

**What needs to happen:** Add `sm_120`/`sm_121` to FlashMLA build targets.
---

## Issue 6: flash_attn standalone package is NOT installed

**Severity: MEDIUM**

```
flash_attn import error: No module named 'flash_attn'
```

The standalone Dao-AILab `flash_attn` package is absent. vLLM uses its bundled FA2/FA3 instead. Some models or integrations may expect the standalone package.

**What needs to happen:** Either install `flash_attn` compiled for `sm_120`/`sm_121`, or confirm all code paths use vLLM's bundled attention.
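A quick way to see which implementations a given image can actually import (a sketch; the `vllm.vllm_flash_attn` module path is an assumption about how vLLM vendors its bundled kernels):

```python
import importlib.util

# Standalone Dao-AILab package vs. vLLM's vendored copy — a given code path may want either.
print("standalone flash_attn installed:",
      importlib.util.find_spec("flash_attn") is not None)
print("vLLM bundled flash-attn present:",
      importlib.util.find_spec("vllm.vllm_flash_attn") is not None)
```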
---

## Issue 7: FlashInfer — installed but likely no sm_120/sm_121 cubins

**Severity: HIGH (FlashInfer backend unavailable)**

FlashInfer 0.6.6 is installed. `cuobjdump` returned no embedded cubins, suggesting it relies on JIT compilation (which requires Triton, which requires `ptxas-blackwell` — see Issue 8).

vLLM chose FA2 over FlashInfer:

```
Using FLASH_ATTN attention backend out of potential backends:
['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
```

**What needs to happen:** Either rebuild FlashInfer with `sm_120`/`sm_121` SASS pre-compiled, or ensure the Triton JIT path works (requires fixing Issues 8 and 9 first).
---

## Issue 8: Triton missing `ptxas-blackwell` — a container packaging gap

**Severity: CRITICAL (blocks ALL Triton kernel JIT on Blackwell)**

**Background:** Triton 3.6.0 normally ships its own bundled `ptxas` and `ptxas-blackwell` binaries inside `triton/backends/nvidia/bin/`. The `ptxas-blackwell` binary is simply a newer build of NVIDIA's `ptxas` assembler that supports Blackwell PTX. The split exists because Triton wanted to add Blackwell support without risking regressions on older architectures by upgrading the main `ptxas`.

In this container image, the `triton/backends/nvidia/bin/` directory **does not exist** — Triton was installed without its bundled tools. Instead, the container relies on the system CUDA toolkit at `/usr/local/cuda/bin/`. Triton's fallback logic checks env vars:

```python
# triton/backends/nvidia/compiler.py line 35
return knobs.nvidia.ptxas_blackwell if arch >= 100 else knobs.nvidia.ptxas
# Env var: TRITON_PTXAS_BLACKWELL_PATH
```

Neither env var is set, and neither bundled binary exists → crash.
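A small diagnostic reproduces the failing lookup (a sketch; it only checks the bundled directory and the env-var overrides, not Triton's full resolution order):

```python
import os

import triton

# Where Triton would normally find its bundled ptxas / ptxas-blackwell binaries.
bin_dir = os.path.join(os.path.dirname(triton.__file__), "backends", "nvidia", "bin")
print("bundled bin dir exists:", os.path.isdir(bin_dir))

# The env-var overrides Triton falls back to when the bundled tools are missing.
print("TRITON_PTXAS_PATH           =", os.environ.get("TRITON_PTXAS_PATH"))
print("TRITON_PTXAS_BLACKWELL_PATH =", os.environ.get("TRITON_PTXAS_BLACKWELL_PATH"))
```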
**The fix is straightforward:** The system `ptxas` from CUDA 13.0 in the container **already fully supports `sm_121`**:

```
$ ptxas --gpu-name sm_121 --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Cuda compilation tools, release 13.0, V13.0.88
```

So `ptxas` IS the "blackwell-capable ptxas" that Triton is looking for. Setting `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` is the correct fix.

**This is independent of `TORCH_CUDA_ARCH_LIST` / `sm_121`.** Triton JIT-compiles kernels at runtime and always needs `ptxas`. Adding `sm_121` to PyTorch's build does not help Triton at all.

**What needs to happen (pick one; a smoke test follows the options):**

1. **Best:** Add to the container image Dockerfile:

   ```dockerfile
   ENV TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas
   ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
   ```

2. **Alternative:** Create the expected directory and symlinks:

   ```dockerfile
   RUN mkdir -p $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin'))") && \
       ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas'))") && \
       ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas-blackwell'))")
   ```
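Either way, a one-kernel smoke test exercises the whole JIT → `ptxas` → SASS path on the GB10 (a minimal sketch; run it inside the container after applying one of the fixes above):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    # Trivial elementwise kernel: forces Triton to emit PTX and invoke ptxas for this GPU.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)

x = torch.zeros(1024, device="cuda")
add_one[(triton.cdiv(x.numel(), 256),)](x, x.numel(), BLOCK=256)
assert torch.all(x == 1), "Triton JIT path is still broken"
print("Triton JIT → ptxas → SASS works on this device")
```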
---

## Issue 9: Triton runtime memory allocator not configured

**Severity: HIGH (Triton kernels crash at runtime even after ptxas fix)**

After fixing `ptxas-blackwell`, Triton kernels compile successfully but fail at execution:

```
RuntimeError: Kernel requires a runtime memory allocation, but no allocator was set.
Use triton.set_allocator to specify an allocator.
```

This affects the FLA (Flash Linear Attention) GDN prefill kernels in `qwen3_next.py`. On Blackwell, certain Triton kernels use "global scratch" memory, which requires an allocator to be registered via `triton.set_allocator()`.

vLLM catches this as a WARNING and falls back to a slower non-Triton path:

```
WARNING [qwen3_next.py:767] RuntimeError: Kernel requires a runtime memory allocation...
```
**What needs to happen:** vLLM should register an allocator via `triton.set_allocator()` early in initialization. A `cumem_allocator.abi3.so` exists in the container, suggesting the infrastructure is present but not wired up. A likely fix (a sketch, assuming `triton.set_allocator` accepts an allocator callable and that backing the scratch with a plain CUDA tensor is sufficient):

```python
import torch
import triton

def torch_scratch_alloc(size: int, alignment: int, stream):
    # Back Triton's "global scratch" requests with an ordinary CUDA byte tensor.
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(torch_scratch_alloc)
```
---

## Issue 10: NVML reports "Not Supported" for memory queries

**Severity: MEDIUM (affects monitoring and memory management)**

```python
>>> nvmlDeviceGetMemoryInfo(handle)
# Raises: "Not Supported"
```

`nvidia-smi` shows `Memory-Usage: Not Supported`. The GB10 uses **unified memory** (a shared CPU+GPU pool), so NVML's discrete-GPU memory API doesn't apply.

Consequences:

- `torch.cuda.get_device_properties().total_memory` reports 119.6 GB (the full unified pool), but the memory actually available to the GPU depends on CPU usage
- `--gpu-memory-utilization 0.9` failed because it requested 107.66 GB when only 39.9 GB was free (CPU page cache had consumed the shared pool)
- No standard tool can report real-time GPU memory usage

**What needs to happen:**

- vLLM needs unified-memory-aware memory management for DGX Spark
- Default `gpu_memory_utilization` should be lower (0.5 or less) when unified memory is detected
- Consider detecting GB10/unified memory at startup and auto-adjusting (see the sketch below)
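A minimal detection sketch (assumptions: the NVML "Not Supported" error is a reliable signal for the unified pool, and `torch.cuda.mem_get_info` reports that pool correctly on the GB10):

```python
import torch
from pynvml import (NVMLError, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetMemoryInfo, nvmlInit)

def gpu_memory_is_unified(index: int = 0) -> bool:
    """Treat an NVML 'Not Supported' memory query as a unified-memory signal."""
    try:
        nvmlInit()
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(index))
        return False          # discrete GPU: NVML answers normally
    except NVMLError:
        return True           # GB10-style shared CPU+GPU pool

free, total = torch.cuda.mem_get_info(0)
util = 0.5 if gpu_memory_is_unified() else 0.9
print(f"free {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB "
      f"→ suggest --gpu-memory-utilization {util}")
```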
---

## Issue 11: SELinux blocks GPU access in containers (podman)

**Severity: MEDIUM (requires workaround for every container launch)**

Running with `--device nvidia.com/gpu=all` alone:

```
Failed to initialize NVML: Insufficient Permissions
CUDA available: False
```

Fix: add `--security-opt=label=disable` to disable SELinux label enforcement for the container.

**What needs to happen:** Either update SELinux policy with nvidia device rules, or document the required flag.
---

## Issue 12: Container image has `HF_HUB_OFFLINE=1` baked in

**Severity: LOW**

This prevents model downloads. It is intentional for production images, but it blocks first-time setup.

**Workaround:** `-e HF_HUB_OFFLINE=0`

---
## Issue 13: torch.compile / torch.inductor blocked — cascading from ptxas issue

**Severity: HIGH (blocks CUDA graphs and compilation optimizations)**

Even with `--enforce-eager`, some model layers use `@torch.compile` decorators. The Inductor backend invokes Triton, which hits the `ptxas-blackwell` issue (Issue 8). This forced the following workarounds:

- `TORCH_COMPILE_DISABLE=1`
- `TORCHINDUCTOR_DISABLE=1`
- `--enforce-eager`

The model runs in pure eager mode — no torch.compile fusion, no CUDA graphs. This is a significant performance penalty.

**What needs to happen:** This is a **cascading effect of Issue 8**. Once `TRITON_PTXAS_BLACKWELL_PATH` is set in the container image, torch.compile and CUDA graphs should work without these env vars.
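Once the env var is in place and the disable flags are dropped, a short compile smoke test exercises the Inductor → Triton → `ptxas` chain (a sketch; the fused op is arbitrary):

```python
import torch

@torch.compile
def silu_mul(x: torch.Tensor) -> torch.Tensor:
    # Any pointwise fusion works; Inductor generates and JIT-compiles a Triton kernel for it.
    return torch.nn.functional.silu(x) * x

y = silu_mul(torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16))
print("Inductor/Triton codegen succeeded:", tuple(y.shape))
```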
---

## Issue 14: vLLM core CUDA extensions — have sm_120, lack sm_121

**Severity: LOW (functional, minor optimization gap)**

The main vLLM `_C.abi3.so` includes cubins for:

```
sm_75, sm_80, sm_86, sm_87, sm_89, sm_90, sm_90a, sm_100, sm_100a, sm_120, sm_120a
```

`_moe_C.abi3.so` similarly includes up to `sm_120`.

Since `sm_121` is in the same ISA family as `sm_120`, these SASS cubins run **natively** on the GB10 via intra-family forward-compatibility. This is NOT the same as PTX JIT — there's no JIT compilation happening. The performance impact is minimal.

Adding `sm_121` to the build would only matter if there are `sm_121`-specific instruction variants or scheduling optimizations.
---

## Issue 15: Transformers library too old for Gemma 4

**Severity: MEDIUM (blocks specific models)**

Transformers 4.57.6 does not recognize `model_type: gemma4`. Requires >= 4.58.0.

**What needs to happen:** Upgrade transformers if Gemma 4 support is desired.

---
## Working `podman run` Command (All Workarounds Applied)

```bash
podman run -d \
  --name vllm-serving \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --user 0 \
  -p 8000:8000 \
  -v /tmp/hf-cache:/root/.cache/huggingface \
  -e HF_HUB_OFFLINE=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -e TORCH_COMPILE_DISABLE=1 \
  -e TORCHINDUCTOR_DISABLE=1 \
  -e TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas \
  --shm-size=16g \
  --entrypoint python3 \
  quay.io/aipcc/rhaiis/cuda-ubi9:3.4 \
  -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.5 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8000
```
---

## Priority Work Items

| Priority | Issue | Fix | Impact |
|---|---|---|---|
| **P0** | Triton `ptxas-blackwell` missing (#8) | Add `TRITON_PTXAS_BLACKWELL_PATH` env var to container image | Unblocks ALL Triton JIT, torch.compile, CUDA graphs |
| **P0** | torch.compile/inductor blocked (#13) | Cascading fix from #8 — remove `TORCH_COMPILE_DISABLE` once ptxas is fixed | Restores compiled execution + CUDA graphs |
| **P0** | FA2 compiled only for sm_80 (#3) | Rebuild with `sm_120 sm_121` in arch list | Native SASS for attention hot path |
| **P1** | Triton runtime allocator (#9) | Register an allocator via `triton.set_allocator()` in vLLM init | Enables Triton FLA/GDN kernels |
| **P1** | FlashInfer no sm_120 support (#7) | Rebuild or ensure JIT path works after #8 fix | Enables faster attention backend |
| **P1** | FA3 no Blackwell kernels (#4) | Upstream FA3 work + rebuild for sm_120/sm_121 | Best-in-class attention perf |
| **P1** | NVML/unified memory (#10) | Detect unified memory, auto-lower `gpu_memory_utilization` | Prevents OOM on startup |
| **P2** | PyTorch sm_121 not in arch list (#1, #2) | Rebuild PyTorch with `sm_121` in `TORCH_CUDA_ARCH_LIST` | Native SASS for all PyTorch ops, removes warning |
| **P2** | SELinux blocking GPU (#11) | SELinux policy update or documentation | Smoother container UX |
| **P2** | FlashMLA no sm_120 (#5) | Rebuild for sm_120/sm_121 | DeepSeek model support |
| **P3** | HF_HUB_OFFLINE=1 (#12) | Document workaround | Dev experience |
| **P3** | Transformers < 4.58 (#15) | Upgrade | Gemma 4 support |
---

## Summary of Current State

The model **does serve** on the DGX Spark, but in a heavily degraded mode:

- **Eager execution only** — no torch.compile, no CUDA graphs (due to missing `ptxas-blackwell`)
- **Flash Attention 2 running via sm_80 PTX JIT** — cross-generation JIT from Ampere to Blackwell
- **No Triton FLA/GDN kernels** — allocator not configured, falls back to a slower path
- **No FlashInfer** — JIT compilation blocked by the same ptxas issue
- **Memory management fragile** — unified memory pool requires manual `gpu_memory_utilization` tuning

The **single highest-impact fix** is adding `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` to the container image. This one env var unblocks Triton JIT, torch.compile, CUDA graphs, and FlashInfer JIT — effectively unlocking the entire compiled execution stack.

The **second highest-impact fix** is rebuilding Flash Attention 2 with `sm_120`/`sm_121` SASS targets, giving native attention kernels instead of cross-generation PTX JIT from sm_80.