Tested 2026-03-27, updated 2026-04-01. Uses sparkrun with a community-patched vLLM container that works around the MIXED_PRECISION whitelist bug in the NGC vLLM container (vllm-project/vllm#37854).
These are the exact versions tested. Using different versions may break things.
| Component | Version | Immutable ref |
|---|---|---|
| sparkrun | 0.2.15 | PyPI |
| Recipe | @eugr/nemotron-3-super-nvfp4 | 63aeced0 in spark-arena/recipe-registry |
| Container image | vllm-node (built locally from ghcr.io/spark-arena/dgx-vllm-eugr-nightly) | sha256:44287d7066bc9a186fb0e2c8e9e4cb62a9bd222429861f30c12a3457672e7071 |
| Container build script | eugr/spark-vllm-docker | e7f2ee69 |
| vLLM | 0.18.1rc1.dev196+g21d2b53f8 | commit 21d2b53f8 in vllm-project/vllm |
| FlashInfer | 0.6.7 | commit 31b63bc3 in flashinfer-ai/flashinfer |
| PyTorch | 2.12.0.dev20260325+cu130 | — |
| CUDA base | 13.2.0-devel-ubuntu24.04 | — |
| Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | — |
- DGX Spark with DGX OS (Ubuntu 24.04, aarch64)
- Docker installed (ships with DGX OS)
- Your user in the `docker` group
- A HuggingFace account and access token
- ~75 GB free disk for model weights, ~15 GB for the container image
```bash
sudo usermod -aG docker $USER
```

Log out and log back in for the group change to take effect.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
mkdir -p ~/.cache/huggingface
echo -n "hf_YOUR_TOKEN_HERE" > ~/.cache/huggingface/token
chmod 600 ~/.cache/huggingface/token
```

```bash
uvx sparkrun@0.2.15 run @eugr/nemotron-3-super-nvfp4 --tp 1 --hosts localhost --max-model-len 200000
```

This will:
- Build the patched vLLM container from source (~3-5 min)
- Download the NVFP4 model from HuggingFace (~75 GB)
- Launch vLLM serving on port 8000
First run takes 15-30 minutes depending on download speed. The container and model are cached for subsequent runs.
Note on reproducibility: sparkrun is pinned above, but the recipe it fetches from the @eugr registry and the container image (ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest) are not pinnable through sparkrun. The recipe and container may change between runs. If the container has already been built locally (cached as vllm-node), sparkrun will reuse it. To guarantee reproducibility, do not delete the local vllm-node image.
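Since the recipe and nightly image can drift, one low-tech hedge is to snapshot the refs from the table above into a lock file and diff it before later runs. This is a sketch, not part of sparkrun itself; the file name is arbitrary, and the values are copied from the versions table:

```bash
# Snapshot the moving parts into a lock file (values from the table above).
# To read back the digest of the locally cached image on your host, use:
#   docker image inspect vllm-node --format '{{.Id}}'
cat > nemotron-lock.txt <<'EOF'
sparkrun=0.2.15
recipe=@eugr/nemotron-3-super-nvfp4@63aeced0
image=sha256:44287d7066bc9a186fb0e2c8e9e4cb62a9bd222429861f30c12a3457672e7071
vllm=21d2b53f8
EOF

# Before a later run, confirm the pinned image digest is still what you expect:
grep -q '44287d7066bc' nemotron-lock.txt && echo "image pin recorded"
```

If the digest reported by `docker image inspect` ever differs from the lock file, the local `vllm-node` image was rebuilt and results may no longer match this guide.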
```bash
# Check the model is listed
curl -s http://localhost:8000/v1/models | python3 -m json.tool

# Test inference
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 500
  }' | python3 -m json.tool

# Test tool calling
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "max_tokens": 500
  }' | python3 -m json.tool
```

To stop the server:

```bash
uvx sparkrun@0.2.15 stop
```

What you get:

- Nemotron-3-Super-120B (120B params, 12B active per token, hybrid Mamba-2/MoE/Attention)
- NVFP4 weights (trained at FP4 precision, not post-hoc quantized)
- OpenAI-compatible API at http://localhost:8000
- Tool calling and reasoning/thinking support
- 200K token context window
- FP8 KV cache
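For the tool-calling test above, the response follows the standard OpenAI shape: `message.tool_calls[].function.arguments` is a JSON-encoded *string*, not an object, so it needs a second parse. A minimal sketch — the heredoc below is an illustrative sample payload (the exact `id` and field values are made up); in practice, pipe the tool-calling `curl` output in instead:

```bash
# Extract the function name and arguments from a chat completion containing
# tool calls. The sample JSON mimics the OpenAI-compatible shape vLLM returns;
# note that "arguments" is a JSON string and needs its own json.loads().
python3 - <<'EOF'
import json

sample = '{"choices": [{"message": {"role": "assistant", "content": null, "tool_calls": [{"id": "call_0", "type": "function", "function": {"name": "get_weather", "arguments": "{\\"city\\": \\"Tokyo\\"}"}}]}}]}'
message = json.loads(sample)["choices"][0]["message"]
for call in message.get("tool_calls") or []:
    args = json.loads(call["function"]["arguments"])
    print(call["function"]["name"], args["city"])
EOF
# prints: get_weather Tokyo
```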
NVIDIA's official NGC vLLM container (26.02) rejects this model because its quant_algo: "MIXED_PRECISION" isn't in the container's whitelist. The sparkrun recipe uses a community-built container (ghcr.io/spark-arena/dgx-vllm-eugr-nightly) with a patched vLLM that:
- Uses `--moe-backend cutlass` instead of Marlin
- Cherry-picks vLLM PR #38126 (NVFP4 quant fix)
- Reverts PRs #34758 and #34302 (broken Hopper-only code paths)
Update 2026-04-01: NVIDIA has since published their own Spark Deployment Guide which takes a different approach: vllm/vllm-openai:cu130-nightly container, Marlin MoE backend (not CUTLASS), MTP speculative decoding, and 1M context at 90% GPU memory utilization. That guide is brand new and untested by us. It may be the better path going forward.
Compared three backends on MMLU-Pro (knowledge/reasoning) and GSM8K (math). All with max_tokens/max_gen_toks = 4096 to avoid truncating thinking tokens.
| Benchmark | vLLM / NVFP4 | Ollama / GGUF Q4_K_M | Gemma4:31b (Ollama) |
|---|---|---|---|
| MMLU-Pro (400q, 2 seeds) | 76.5% | 78.5% | 84.5% |
| GSM8K (200q) | 97.0% | 97.5% | 95.0% |
| Backend | s/question | Notes |
|---|---|---|
| Ollama / GGUF Nemotron (120B MoE, 12B active) | ~58s | Fastest |
| vLLM / NVFP4 Nemotron (120B MoE, 12B active) | ~80s | Ray overhead |
| Gemma4:31b (31B dense, Ollama) | ~156s | 2-3x slower |
- Gemma4:31b beats both Nemotron variants on MMLU-Pro by 6-8 points despite being 4x smaller.
- GSM8K is a near-tie — all above 95%.
- NVFP4 vs GGUF Q4_K_M differs by only 2 points — quantization format barely matters for accuracy.
- Ollama/GGUF Nemotron is faster than vLLM/NVFP4 Nemotron on single-Spark (no batching advantage with one user).
- vLLM's advantages are in serving features (prefix caching, continuous batching, structured tool-call parsing), not accuracy or single-request speed.
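The terminology above (`max_gen_toks`) comes from lm-evaluation-harness, which can drive runs like these against the local OpenAI-compatible endpoint. A hedged sketch, not the exact command used for these numbers — the model name and endpoint come from this guide, but the `local-completions` backend and flags should be checked against `lm_eval --help` for your installed version:

```bash
# Sketch: GSM8K over the local vLLM endpoint with lm-evaluation-harness.
# --gen_kwargs max_gen_toks=4096 avoids truncating thinking tokens;
# --limit 200 matches the 200-question subset used above.
lm_eval \
  --model local-completions \
  --model_args model=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k \
  --gen_kwargs max_gen_toks=4096 \
  --limit 200
```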
- `--max-model-len 200000` keeps VRAM within budget. The model supports 262K, but that slightly exceeds the available KV cache at 70% GPU memory utilization on a single Spark.
- The recipe defaults to TP=2 (two Sparks); `--tp 1` overrides this for single-Spark use.
- There is a tokenizer regex warning in the logs (it references Mistral). This is cosmetic and does not affect output.
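For a dual-Spark setup, the recipe's default TP=2 can be used as-is. A sketch with placeholder hostnames — `spark-01` and `spark-02` are assumptions to be replaced with your machines; the flags are the ones used throughout this guide:

```bash
# Dual-Spark launch: the recipe's default tensor parallelism across two hosts.
# spark-01,spark-02 are placeholder hostnames for your two DGX Sparks.
uvx sparkrun@0.2.15 run @eugr/nemotron-3-super-nvfp4 \
  --tp 2 \
  --hosts spark-01,spark-02 \
  --max-model-len 200000
```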