# vLLM on DGX Spark (GB10) — Compatibility Issues Report

**Date:** 2026-04-27
**Host:** `octo-et-spark-2.khw.eng.rdu2.dc.redhat.com`
**Image:** `quay.io/aipcc/rhaiis/cuda-ubi9:3.4`
**Model tested:** `RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16`

---

## Hardware Summary

| Property | Value |
|---|---|
| GPU | NVIDIA GB10 (DGX Spark) |
| Architecture | Blackwell |
| Compute Capability | **sm_121** (12.1) |
| GPU Memory | 119.6 GB (unified with CPU) |
| CPU Architecture | aarch64 (ARM Grace) |
| Host OS | Fedora 43, kernel 6.18.10 |
| NVIDIA Driver | 590.48.01 |
| Host CUDA | 13.1 |
## Software Stack in Container

| Component | Version |
|---|---|
| vLLM | 0.18.0+rhaiv.7 |
| PyTorch | 2.10.0 |
| CUDA (container) | 13.0 |
| Triton | 3.6.0 |
| FlashInfer | 0.6.6 |
| flash_attn | **NOT INSTALLED** |
| Transformers | 4.57.6 |
| cuDNN | 9.19.0 |

---
## Key Architectural Context: Two Separate Compilation Pipelines

Understanding the issues below requires distinguishing two completely independent CUDA compilation paths:

1. **Pre-compiled C++/CUDA kernels** (PyTorch, vLLM extensions, FA2, FA3, FlashMLA, FlashInfer):
   - Built at image build time via `nvcc`
   - Controlled by `TORCH_CUDA_ARCH_LIST`
   - Produce `.cubin` (native SASS) or `.ptx` (virtual ISA) embedded in `.so` files
   - **Adding `sm_121` to `TORCH_CUDA_ARCH_LIST` and rebuilding gives native SASS for these**
2. **Triton JIT-compiled Python kernels** (FLA/GDN, layernorm fusions, custom ops):
   - Compiled at runtime: Python → Triton IR → PTX → `ptxas` → SASS
   - Requires a working `ptxas` binary that supports the target arch at **runtime**
   - `TORCH_CUDA_ARCH_LIST` has **zero effect** on these — Triton always needs `ptxas`

**These are orthogonal.** Adding `sm_121` to PyTorch's arch list does NOT eliminate the need for `ptxas-blackwell`. Conversely, fixing `ptxas-blackwell` does NOT fix the pre-compiled kernel gaps.
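Both paths can be checked from inside the container. A minimal sketch (note: `shutil.which` only reports the first `ptxas` on `PATH`, which is not necessarily the binary Triton ultimately resolves):

```python
import shutil
import torch

# Pipeline 1: pre-compiled SASS/PTX baked into PyTorch and its extensions.
major, minor = torch.cuda.get_device_capability(0)   # (12, 1) on the GB10
print(f"device is sm_{major}{minor}")
print("torch was built for:", torch.cuda.get_arch_list())

# Pipeline 2: Triton JIT needs a ptxas binary at runtime, regardless of the above.
print("ptxas found at:", shutil.which("ptxas"))
```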
---

## Issue 1: sm_121 IS a real SASS target — but PyTorch was not built with it

**Severity: HIGH (pre-compiled kernels fall back to sm_120 forward-compat)**

Contrary to initial assumption, **`sm_121` is a real, distinct SASS target** in CUDA 13.0+:

```
$ nvcc --list-gpu-code
...
sm_120
sm_121       ← real SASS target, NOT just PTX

$ nvcc --list-gpu-arch
...
compute_120
compute_121  ← corresponding virtual arch
```

Both `nvcc -arch=sm_121` and `ptxas --gpu-name sm_121` succeed. The CUDA 13.0 toolkit in the container fully supports `sm_121` as a native compilation target.

However, PyTorch 2.10.0 in the container was built with:

```python
torch.cuda.get_arch_list() →
['sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_89', 'sm_90a', 'sm_100', 'sm_100a', 'sm_120', 'compute_120']
```

**`sm_121` is absent.** The highest SASS is `sm_120`, and `compute_120` provides PTX forward-compat. On the GB10, `sm_120` SASS runs natively via intra-family forward-compatibility (same ISA family), so this is **not a PTX JIT situation** — the performance impact is minor. But `sm_121`-specific features (if any) are unavailable.

**What needs to happen:** Rebuild PyTorch (and all CUDA extensions) with `sm_121` added to `TORCH_CUDA_ARCH_LIST`. The CUDA 13.0 toolkit in the container already supports it. This gives native SASS with any `sm_121`-specific optimizations.
---

## Issue 2: PyTorch warns about unsupported compute capability 12.1

**Severity: LOW (cosmetic, but signals missing build target)**

```
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (7.5) - (12.0)
```

This warning is emitted because `sm_121` is not in PyTorch's compiled arch list. The GPU works via `sm_120` forward-compat. The warning disappears if PyTorch is rebuilt with `sm_121`.

**What needs to happen:** Same as Issue 1 — add `sm_121` to `TORCH_CUDA_ARCH_LIST` when rebuilding PyTorch.
---

## Issue 3: vLLM Flash Attention 2 — compiled only for sm_80 (SASS)

**Severity: HIGH (real performance regression — cross-generation PTX JIT)**

The `_vllm_fa2_C.abi3.so` (vLLM's bundled Flash Attention 2) contains cubins **only for `sm_80`**:

```
_vllm_fa2_C.abi3.so: all 52 cubins are sm_80 only
```

Unlike the core vLLM `_C.abi3.so` (which has `sm_120` SASS), Flash Attention 2 was compiled only for Ampere. On `sm_121`, this means:

- The kernel must run via **PTX JIT from `sm_80` → `sm_121`**, which is a **cross-generation** JIT (Ampere → Blackwell, skipping Hopper entirely)
- No Blackwell tensor core optimizations, no Blackwell shared memory layout, no Blackwell-specific instruction scheduling
- This is the **hot path for attention computation** — every forward pass goes through this kernel
- Likely significant performance penalty vs native `sm_121` SASS

**What needs to happen:** Rebuild `_vllm_fa2_C.abi3.so` with `sm_120` and `sm_121` in the arch list. The vLLM FA2 build configuration needs these targets added. The CUDA 13.0 toolkit in the container can compile for these targets today.
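To verify what a rebuilt extension actually contains, the embedded SASS targets can be listed straight from the shared object. A minimal sketch (assumptions: `cuobjdump` from the CUDA toolkit is on `PATH`, and the `.so` lives under the installed `vllm` package at the path below — adjust for your layout):

```python
import os
import re
import subprocess

import vllm

# Path is an assumption; locate the bundled FA2 extension wherever your install places it.
so_path = os.path.join(os.path.dirname(vllm.__file__), "vllm_flash_attn", "_vllm_fa2_C.abi3.so")

# cuobjdump --list-elf prints one line per embedded cubin, including its sm_XX target.
out = subprocess.run(["cuobjdump", "--list-elf", so_path],
                     capture_output=True, text=True, check=True).stdout
print(sorted(set(re.findall(r"sm_\d+a?", out))))   # expect sm_120 / sm_121 after the rebuild
```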
---

## Issue 4: vLLM Flash Attention 3 — compiled only for sm_90a

**Severity: MEDIUM (not usable on GB10, falls back to FA2)**

```
_vllm_fa3_C.abi3.so: all 192 cubins are sm_90a only
```

FA3 uses Hopper-specific features (TMA, warp-specialization). On `sm_121`:

- No native cubins → falls back to FA2 (which itself is only `sm_80`, see Issue 3)
- PTX JIT from `sm_90a` to `sm_121` may work but is untested and likely suboptimal
- vLLM correctly selects FA2 over FA3 on this device

**What needs to happen:** FA3 needs Blackwell-native kernels (`sm_120`/`sm_121`). This may require upstream Dao-AILab/flash-attention work, as Blackwell has different TMA semantics than Hopper.
---

## Issue 5: FlashMLA — compiled only for sm_90a and sm_100

**Severity: LOW (only affects DeepSeek-style MLA models)**

```
_flashmla_C.abi3.so: sm_90a, sm_100 only
_flashmla_extension_C.abi3.so: sm_90a, sm_100 only
```

No `sm_120`/`sm_121` cubins. MLA (Multi-head Latent Attention) kernels won't run natively on GB10.

**What needs to happen:** Add `sm_120`/`sm_121` to FlashMLA build targets.
---

## Issue 6: flash_attn standalone package is NOT installed

**Severity: MEDIUM**

```
flash_attn import error: No module named 'flash_attn'
```

The standalone Dao-AILab `flash_attn` package is absent. vLLM uses its bundled FA2/FA3 instead. Some models or integrations may expect the standalone package.

**What needs to happen:** Either install `flash_attn` compiled for `sm_120`/`sm_121`, or confirm all code paths use vLLM's bundled attention.
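A quick way to see which implementations a given image can actually import (a sketch; the `vllm.vllm_flash_attn` module path is an assumption about how vLLM vendors its bundled kernels):

```python
import importlib.util

# Standalone Dao-AILab package vs. vLLM's vendored copy — a given code path may want either.
print("standalone flash_attn installed:",
      importlib.util.find_spec("flash_attn") is not None)
print("vLLM bundled flash-attn present:",
      importlib.util.find_spec("vllm.vllm_flash_attn") is not None)
```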
---

## Issue 7: FlashInfer — installed but likely no sm_120/sm_121 cubins

**Severity: HIGH (FlashInfer backend unavailable)**

FlashInfer 0.6.6 is installed. `cuobjdump` returned no embedded cubins, suggesting it relies on JIT compilation (which requires Triton, which requires `ptxas-blackwell` — see Issue 8).

vLLM chose FA2 over FlashInfer:

```
Using FLASH_ATTN attention backend out of potential backends:
['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
```

**What needs to happen:** Either rebuild FlashInfer with `sm_120`/`sm_121` SASS pre-compiled, or ensure the Triton JIT path works (requires fixing Issues 8 and 9 first).
---

## Issue 8: Triton missing `ptxas-blackwell` — a container packaging gap

**Severity: CRITICAL (blocks ALL Triton kernel JIT on Blackwell)**

**Background:** Triton 3.6.0 normally ships its own bundled `ptxas` and `ptxas-blackwell` binaries inside `triton/backends/nvidia/bin/`. The `ptxas-blackwell` binary is simply a newer build of NVIDIA's `ptxas` assembler that supports Blackwell PTX. The split exists because Triton wanted to add Blackwell support without risking regressions on older architectures by upgrading the main `ptxas`.

In this container image, the `triton/backends/nvidia/bin/` directory **does not exist** — Triton was installed without its bundled tools. Instead, the container relies on the system CUDA toolkit at `/usr/local/cuda/bin/`. Triton's fallback logic checks env vars:

```python
# triton/backends/nvidia/compiler.py line 35
return knobs.nvidia.ptxas_blackwell if arch >= 100 else knobs.nvidia.ptxas
# Env var: TRITON_PTXAS_BLACKWELL_PATH
```

Neither env var is set, and neither bundled binary exists → crash.
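A small diagnostic reproduces the failing lookup (a sketch; it only checks the bundled directory and the env-var overrides, not Triton's full resolution order):

```python
import os

import triton

# Where Triton would normally find its bundled ptxas / ptxas-blackwell binaries.
bin_dir = os.path.join(os.path.dirname(triton.__file__), "backends", "nvidia", "bin")
print("bundled bin dir exists:", os.path.isdir(bin_dir))

# The env-var overrides Triton falls back to when the bundled tools are missing.
print("TRITON_PTXAS_PATH           =", os.environ.get("TRITON_PTXAS_PATH"))
print("TRITON_PTXAS_BLACKWELL_PATH =", os.environ.get("TRITON_PTXAS_BLACKWELL_PATH"))
```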
**The fix is straightforward:** The system `ptxas` from CUDA 13.0 in the container **already fully supports `sm_121`**:

```
$ ptxas --gpu-name sm_121 --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Cuda compilation tools, release 13.0, V13.0.88
```

So `ptxas` IS the "blackwell-capable ptxas" that Triton is looking for. Setting `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` is the correct fix.

**This is independent of `TORCH_CUDA_ARCH_LIST` / `sm_121`.** Triton JIT-compiles kernels at runtime and always needs `ptxas`. Adding `sm_121` to PyTorch's build does not help Triton at all.

**What needs to happen (pick one; a smoke test follows the options):**

1. **Best:** Add to the container image Dockerfile:

   ```dockerfile
   ENV TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas
   ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
   ```

2. **Alternative:** Create the expected directory and symlinks:

   ```dockerfile
   RUN mkdir -p $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin'))") && \
       ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas'))") && \
       ln -s /usr/local/cuda/bin/ptxas $(python3 -c "import triton,os; print(os.path.join(os.path.dirname(triton.__file__),'backends','nvidia','bin','ptxas-blackwell'))")
   ```
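Either way, a one-kernel smoke test exercises the whole JIT → `ptxas` → SASS path on the GB10 (a minimal sketch; run it inside the container after applying one of the fixes above):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    # Trivial elementwise kernel: forces Triton to emit PTX and invoke ptxas for this GPU.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)

x = torch.zeros(1024, device="cuda")
add_one[(triton.cdiv(x.numel(), 256),)](x, x.numel(), BLOCK=256)
assert torch.all(x == 1), "Triton JIT path is still broken"
print("Triton JIT → ptxas → SASS works on this device")
```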
---

## Issue 9: Triton runtime memory allocator not configured

**Severity: HIGH (Triton kernels crash at runtime even after ptxas fix)**

After fixing `ptxas-blackwell`, Triton kernels compile successfully but fail at execution:

```
RuntimeError: Kernel requires a runtime memory allocation, but no allocator was set.
Use triton.set_allocator to specify an allocator.
```

This affects the FLA (Flash Linear Attention) GDN prefill kernels in `qwen3_next.py`. On Blackwell, certain Triton kernels use "global scratch" memory, which requires an allocator to be registered via `triton.set_allocator()`.

vLLM catches this as a WARNING and falls back to a slower non-Triton path:

```
WARNING [qwen3_next.py:767] RuntimeError: Kernel requires a runtime memory allocation...
```
**What needs to happen:** vLLM should register an allocator via `triton.set_allocator()` early in initialization. A `cumem_allocator.abi3.so` exists in the container, suggesting the infrastructure is present but not wired up. A likely fix (a sketch, assuming `triton.set_allocator` accepts an allocator callable and that backing the scratch with a plain CUDA tensor is sufficient):

```python
import torch
import triton

def torch_scratch_alloc(size: int, alignment: int, stream):
    # Back Triton's "global scratch" requests with an ordinary CUDA byte tensor.
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(torch_scratch_alloc)
```
---

## Issue 10: NVML reports "Not Supported" for memory queries

**Severity: MEDIUM (affects monitoring and memory management)**

```python
>>> nvmlDeviceGetMemoryInfo(handle)
# Raises: "Not Supported"
```

`nvidia-smi` shows `Memory-Usage: Not Supported`. The GB10 uses **unified memory** (a shared CPU+GPU pool), so NVML's discrete-GPU memory API doesn't apply.

Consequences:

- `torch.cuda.get_device_properties().total_memory` reports 119.6 GB (the full unified pool), but the memory actually available to the GPU depends on CPU usage
- `--gpu-memory-utilization 0.9` failed because it requested 107.66 GB when only 39.9 GB was free (CPU page cache had consumed the shared pool)
- No standard tool can report real-time GPU memory usage

**What needs to happen:**

- vLLM needs unified-memory-aware memory management for DGX Spark
- Default `gpu_memory_utilization` should be lower (0.5 or less) when unified memory is detected
- Consider detecting GB10/unified memory at startup and auto-adjusting (see the sketch below)
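A minimal detection sketch (assumptions: the NVML "Not Supported" error is a reliable signal for the unified pool, and `torch.cuda.mem_get_info` reports that pool correctly on the GB10):

```python
import torch
from pynvml import (NVMLError, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetMemoryInfo, nvmlInit)

def gpu_memory_is_unified(index: int = 0) -> bool:
    """Treat an NVML 'Not Supported' memory query as a unified-memory signal."""
    try:
        nvmlInit()
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(index))
        return False          # discrete GPU: NVML answers normally
    except NVMLError:
        return True           # GB10-style shared CPU+GPU pool

free, total = torch.cuda.mem_get_info(0)
util = 0.5 if gpu_memory_is_unified() else 0.9
print(f"free {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB "
      f"→ suggest --gpu-memory-utilization {util}")
```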
---

## Issue 11: SELinux blocks GPU access in containers (podman)

**Severity: MEDIUM (requires workaround for every container launch)**

Running with `--device nvidia.com/gpu=all` alone:

```
Failed to initialize NVML: Insufficient Permissions
CUDA available: False
```

Fix: add `--security-opt=label=disable` to disable SELinux label enforcement for the container.

**What needs to happen:** Either update SELinux policy with nvidia device rules, or document the required flag.
---

## Issue 12: Container image has `HF_HUB_OFFLINE=1` baked in

**Severity: LOW**

This prevents model downloads. It is intentional for production images, but it blocks first-time setup.

**Workaround:** `-e HF_HUB_OFFLINE=0`

---
## Issue 13: torch.compile / torch.inductor blocked — cascading from ptxas issue

**Severity: HIGH (blocks CUDA graphs and compilation optimizations)**

Even with `--enforce-eager`, some model layers use `@torch.compile` decorators. The Inductor backend invokes Triton, which hits the `ptxas-blackwell` issue (Issue 8). This forced the following workarounds:

- `TORCH_COMPILE_DISABLE=1`
- `TORCHINDUCTOR_DISABLE=1`
- `--enforce-eager`

The model runs in pure eager mode — no torch.compile fusion, no CUDA graphs. This is a significant performance penalty.

**What needs to happen:** This is a **cascading effect of Issue 8**. Once `TRITON_PTXAS_BLACKWELL_PATH` is set in the container image, torch.compile and CUDA graphs should work without these env vars.
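Once the env var is in place and the disable flags are dropped, a short compile smoke test exercises the Inductor → Triton → `ptxas` chain (a sketch; the fused op is arbitrary):

```python
import torch

@torch.compile
def silu_mul(x: torch.Tensor) -> torch.Tensor:
    # Any pointwise fusion works; Inductor generates and JIT-compiles a Triton kernel for it.
    return torch.nn.functional.silu(x) * x

y = silu_mul(torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16))
print("Inductor/Triton codegen succeeded:", tuple(y.shape))
```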
---

## Issue 14: vLLM core CUDA extensions — have sm_120, lack sm_121

**Severity: LOW (functional, minor optimization gap)**

The main vLLM `_C.abi3.so` includes cubins for:

```
sm_75, sm_80, sm_86, sm_87, sm_89, sm_90, sm_90a, sm_100, sm_100a, sm_120, sm_120a
```

`_moe_C.abi3.so` similarly includes up to `sm_120`.

Since `sm_121` is in the same ISA family as `sm_120`, these SASS cubins run **natively** on the GB10 via intra-family forward-compatibility. This is NOT the same as PTX JIT — there's no JIT compilation happening. The performance impact is minimal.

Adding `sm_121` to the build would only matter if there are `sm_121`-specific instruction variants or scheduling optimizations.
---

## Issue 15: Transformers library too old for Gemma 4

**Severity: MEDIUM (blocks specific models)**

Transformers 4.57.6 does not recognize `model_type: gemma4`. Requires >= 4.58.0.

**What needs to happen:** Upgrade transformers if Gemma 4 support is desired.

---
## Working `podman run` Command (All Workarounds Applied)

```bash
podman run -d \
  --name vllm-serving \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --user 0 \
  -p 8000:8000 \
  -v /tmp/hf-cache:/root/.cache/huggingface \
  -e HF_HUB_OFFLINE=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -e TORCH_COMPILE_DISABLE=1 \
  -e TORCHINDUCTOR_DISABLE=1 \
  -e TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas \
  --shm-size=16g \
  --entrypoint python3 \
  quay.io/aipcc/rhaiis/cuda-ubi9:3.4 \
  -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.5 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8000
```
---

## Priority Work Items

| Priority | Issue | Fix | Impact |
|---|---|---|---|
| **P0** | Triton `ptxas-blackwell` missing (#8) | Add `TRITON_PTXAS_BLACKWELL_PATH` env var to container image | Unblocks ALL Triton JIT, torch.compile, CUDA graphs |
| **P0** | torch.compile/inductor blocked (#13) | Cascading fix from #8 — remove `TORCH_COMPILE_DISABLE` once ptxas is fixed | Restores compiled execution + CUDA graphs |
| **P0** | FA2 compiled only for sm_80 (#3) | Rebuild with `sm_120 sm_121` in arch list | Native SASS for attention hot path |
| **P1** | Triton runtime allocator (#9) | Register an allocator via `triton.set_allocator()` in vLLM init | Enables Triton FLA/GDN kernels |
| **P1** | FlashInfer no sm_120 support (#7) | Rebuild or ensure JIT path works after #8 fix | Enables faster attention backend |
| **P1** | FA3 no Blackwell kernels (#4) | Upstream FA3 work + rebuild for sm_120/sm_121 | Best-in-class attention perf |
| **P1** | NVML/unified memory (#10) | Detect unified memory, auto-lower `gpu_memory_utilization` | Prevents OOM on startup |
| **P2** | PyTorch sm_121 not in arch list (#1, #2) | Rebuild PyTorch with `sm_121` in `TORCH_CUDA_ARCH_LIST` | Native SASS for all PyTorch ops, removes warning |
| **P2** | SELinux blocking GPU (#11) | SELinux policy update or documentation | Smoother container UX |
| **P2** | FlashMLA no sm_120 (#5) | Rebuild for sm_120/sm_121 | DeepSeek model support |
| **P3** | HF_HUB_OFFLINE=1 (#12) | Document workaround | Dev experience |
| **P3** | Transformers < 4.58 (#15) | Upgrade | Gemma 4 support |
---

## Summary of Current State

The model **does serve** on the DGX Spark, but in a heavily degraded mode:

- **Eager execution only** — no torch.compile, no CUDA graphs (due to missing `ptxas-blackwell`)
- **Flash Attention 2 running via sm_80 PTX JIT** — cross-generation JIT from Ampere to Blackwell
- **No Triton FLA/GDN kernels** — allocator not configured, falls back to a slower path
- **No FlashInfer** — JIT compilation blocked by the same ptxas issue
- **Memory management fragile** — unified memory pool requires manual `gpu_memory_utilization` tuning

The **single highest-impact fix** is adding `TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda/bin/ptxas` to the container image. This one env var unblocks Triton JIT, torch.compile, CUDA graphs, and FlashInfer JIT — effectively unlocking the entire compiled execution stack.

The **second highest-impact fix** is rebuilding Flash Attention 2 with `sm_120`/`sm_121` SASS targets, giving native attention kernels instead of cross-generation PTX JIT from sm_80.