@algal
Last active April 5, 2026 20:19
Serving Nemotron-3-Super-120B-A12B-NVFP4 on a single DGX Spark


Tested 2026-03-27, updated 2026-04-01. Uses sparkrun with a community-patched vLLM container that works around the MIXED_PRECISION whitelist bug in the NGC vLLM container (vllm-project/vllm#37854).

Pinned versions

These are the exact versions tested. Using different versions may break things.

| Component | Version | Immutable ref |
|---|---|---|
| sparkrun | 0.2.15 | PyPI |
| Recipe | @eugr/nemotron-3-super-nvfp4 | 63aeced0 in spark-arena/recipe-registry |
| Container image | vllm-node (built locally from ghcr.io/spark-arena/dgx-vllm-eugr-nightly) | sha256:44287d7066bc9a186fb0e2c8e9e4cb62a9bd222429861f30c12a3457672e7071 |
| Container build script | eugr/spark-vllm-docker | e7f2ee69 |
| vLLM | 0.18.1rc1.dev196+g21d2b53f8 | commit 21d2b53f8 in vllm-project/vllm |
| FlashInfer | 0.6.7 | commit 31b63bc3 in flashinfer-ai/flashinfer |
| PyTorch | 2.12.0.dev20260325+cu130 | |
| CUDA base | 13.2.0-devel-ubuntu24.04 | |
| Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | |

Prerequisites

  • DGX Spark with DGX OS (Ubuntu 24.04, aarch64)
  • Docker installed (ships with DGX OS)
  • Your user in the docker group
  • A HuggingFace account and access token
  • ~75 GB free disk for model weights, ~15 GB for the container image

1. Docker group (if not already done)

sudo usermod -aG docker $USER

Log out and log back in for the group change to take effect.

2. Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh

3. Save your HuggingFace token

mkdir -p ~/.cache/huggingface
echo -n "hf_YOUR_TOKEN_HERE" > ~/.cache/huggingface/token
chmod 600 ~/.cache/huggingface/token
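As a quick sanity check (not part of the original recipe), the snippet below verifies the saved token file exists, is private, and looks like a HuggingFace token. The `hf_` prefix check assumes a modern HF access token; older tokens may use a different format.

```python
import os
import stat

def check_hf_token(path=os.path.expanduser("~/.cache/huggingface/token")):
    """Return a list of problems with the saved HuggingFace token file (empty = OK)."""
    problems = []
    if not os.path.isfile(path):
        return [f"missing: {path}"]
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # group/other bits set: anyone else on the machine can read the token
        problems.append(f"permissions too open: {oct(mode)} (expected 0o600)")
    with open(path) as f:
        token = f.read().strip()
    if not token.startswith("hf_"):
        # modern HF tokens start with "hf_"; older formats may differ
        problems.append("token does not start with 'hf_'")
    return problems
```

`check_hf_token()` returning an empty list means the file is in the shape the steps above produce.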

4. Run it

uvx sparkrun@0.2.15 run @eugr/nemotron-3-super-nvfp4 --tp 1 --hosts localhost --max-model-len 200000

This will:

  1. Build the patched vLLM container from source (~3-5 min)
  2. Download the NVFP4 model from HuggingFace (~75 GB)
  3. Launch vLLM serving on port 8000

First run takes 15-30 minutes depending on download speed. The container and model are cached for subsequent runs.

Note on reproducibility: sparkrun is pinned above, but the recipe it fetches from the @eugr registry and the container image (ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest) are not pinnable through sparkrun. The recipe and container may change between runs. If the container has already been built locally (cached as vllm-node), sparkrun will reuse it. To guarantee reproducibility, do not delete the local vllm-node image.
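Since the first run can take 15-30 minutes, it helps to poll the server until it answers rather than guessing when it's up. A minimal stdlib-only sketch (the URL assumes the default port 8000; the `fetch` parameter is injectable purely so the loop can be exercised without a live server):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url="http://localhost:8000/v1/models",
                     timeout=1800, poll=5.0, fetch=None):
    """Poll the vLLM models endpoint until it returns HTTP 200.

    Returns the number of seconds waited; raises TimeoutError on timeout.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as r:
                return r.status
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if fetch(url) == 200:
                return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            # server not up yet (connection refused) or still loading weights
            pass
        time.sleep(poll)
    raise TimeoutError(f"{url} not ready after {timeout}s")
```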

5. Verify

# Check the model is listed
curl -s http://localhost:8000/v1/models | python3 -m json.tool

# Test inference
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 500
  }' | python3 -m json.tool

# Test tool calling
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "max_tokens": 500
  }' | python3 -m json.tool
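The tool-calling response above comes back in the standard OpenAI chat-completions shape, with the function name and a JSON-encoded argument string nested under `choices[].message.tool_calls`. A small helper (an illustration, not part of the recipe) can pull out the structured calls:

```python
import json

def extract_tool_calls(response):
    """Return (function_name, parsed_arguments) pairs from an
    OpenAI-style /v1/chat/completions response dict."""
    calls = []
    for choice in response.get("choices", []):
        # tool_calls may be absent or null when the model answers in plain text
        for tc in choice.get("message", {}).get("tool_calls") or []:
            fn = tc.get("function", {})
            # "arguments" is a JSON string, not a dict, per the OpenAI schema
            calls.append((fn.get("name"), json.loads(fn.get("arguments") or "{}")))
    return calls
```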

6. Stop

uvx sparkrun@0.2.15 stop

What this gives you

  • Nemotron-3-Super-120B (120B params, 12B active per token, hybrid Mamba-2/MoE/Attention)
  • NVFP4 weights (trained at FP4 precision, not post-hoc quantized)
  • OpenAI-compatible API at http://localhost:8000
  • Tool calling and reasoning/thinking support
  • 200K token context window
  • FP8 KV cache
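Because the endpoint speaks the OpenAI chat-completions protocol, a stdlib-only client is enough to drive it from scripts. A sketch under the assumptions above (model name and port from this guide; `build_payload` is a helper introduced here for clarity, not a sparkrun or vLLM API):

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"

def build_payload(messages, max_tokens=500, tools=None):
    """Assemble an OpenAI-style chat-completions request body."""
    payload = {"model": MODEL, "messages": messages, "max_tokens": max_tokens}
    if tools is not None:
        payload["tools"] = tools
    return payload

def chat(messages, base_url="http://localhost:8000", **kwargs):
    """POST a chat request to the local vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(messages, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage, with the server running: `chat([{"role": "user", "content": "Hello"}])["choices"][0]["message"]["content"]`.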

What's happening under the hood

NVIDIA's official NGC vLLM container (26.02) rejects this model because its quant_algo: "MIXED_PRECISION" isn't in the container's whitelist. The sparkrun recipe uses a community-built container (ghcr.io/spark-arena/dgx-vllm-eugr-nightly) with a patched vLLM that:

  • Uses --moe-backend cutlass instead of Marlin
  • Cherry-picks vLLM PR #38126 (NVFP4 quant fix)
  • Reverts PRs #34758 and #34302 (broken Hopper-only code paths)

Update 2026-04-01: NVIDIA has since published their own Spark Deployment Guide which takes a different approach: vllm/vllm-openai:cu130-nightly container, Marlin MoE backend (not CUTLASS), MTP speculative decoding, and 1M context at 90% GPU memory utilization. That guide is brand new and untested by us. It may be the better path going forward.

Benchmarks (run on this hardware, 2026-04-05)

Compared three model/backend combinations on MMLU-Pro (knowledge/reasoning) and GSM8K (math), all with max_tokens/max_gen_toks = 4096 to avoid truncating thinking tokens.

Accuracy

| Benchmark | vLLM / NVFP4 | Ollama / GGUF Q4_K_M | Gemma4:31b (Ollama) |
|---|---|---|---|
| MMLU-Pro (400q, 2 seeds) | 76.5% | 78.5% | 84.5% |
| GSM8K (200q) | 97.0% | 97.5% | 95.0% |

Latency (MMLU-Pro, per question)

| Backend | s/question | Notes |
|---|---|---|
| Ollama / GGUF Nemotron (120B MoE, 12B active) | ~58 | Fastest |
| vLLM / NVFP4 Nemotron (120B MoE, 12B active) | ~80 | Ray overhead |
| Gemma4:31b (31B dense, Ollama) | ~156 | 2-3x slower |

Observations

  • Gemma4:31b beats both Nemotron variants on MMLU-Pro by 6-8 points despite being 4x smaller.
  • GSM8K is a near-tie — all above 95%.
  • NVFP4 vs GGUF Q4_K_M differs by only 2 points — quantization format barely matters for accuracy.
  • Ollama/GGUF Nemotron is faster than vLLM/NVFP4 Nemotron on single-Spark (no batching advantage with one user).
  • vLLM's advantages are in serving features (prefix caching, continuous batching, structured tool-call parsing), not accuracy or single-request speed.

Notes

  • --max-model-len 200000 keeps VRAM within budget. The model supports 262K but that slightly exceeds available KV cache at 70% GPU memory utilization on a single Spark.
  • The recipe defaults to TP=2 (two Sparks). --tp 1 overrides for single-Spark use.
  • There is a tokenizer regex warning in the logs (references Mistral). This is cosmetic and does not affect output.