@algal
Last active April 5, 2026 20:19
Serving Nemotron-3-Super-120B-A12B-NVFP4 on a single DGX Spark


Tested 2026-03-27, updated 2026-04-01. Uses sparkrun with a community-patched vLLM container that works around the MIXED_PRECISION whitelist bug in the NGC vLLM container (vllm-project/vllm#37854).

Pinned versions

These are the exact versions tested. Using different versions may break things.

| Component | Version | Immutable ref |
|---|---|---|
| sparkrun | 0.2.15 | PyPI |
| Recipe | @eugr/nemotron-3-super-nvfp4 | 63aeced0 in spark-arena/recipe-registry |
| Container image | vllm-node (built locally from ghcr.io/spark-arena/dgx-vllm-eugr-nightly) | sha256:44287d7066bc9a186fb0e2c8e9e4cb62a9bd222429861f30c12a3457672e7071 |
| Container build script | eugr/spark-vllm-docker | e7f2ee69 |
| vLLM | 0.18.1rc1.dev196+g21d2b53f8 | commit 21d2b53f8 in vllm-project/vllm |
| FlashInfer | 0.6.7 | commit 31b63bc3 in flashinfer-ai/flashinfer |
| PyTorch | 2.12.0.dev20260325+cu130 | |
| CUDA base | 13.2.0-devel-ubuntu24.04 | |
| Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | |

Prerequisites

  • DGX Spark with DGX OS (Ubuntu 24.04, aarch64)
  • Docker installed (ships with DGX OS)
  • Your user in the docker group
  • A HuggingFace account and access token
  • ~75 GB free disk for model weights, ~15 GB for the container image

1. Docker group (if not already done)

sudo usermod -aG docker $USER

Log out and log back in for the group change to take effect.

2. Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh

3. Save your HuggingFace token

mkdir -p ~/.cache/huggingface
echo -n "hf_YOUR_TOKEN_HERE" > ~/.cache/huggingface/token
chmod 600 ~/.cache/huggingface/token
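As a quick sanity check (not part of the original recipe), the snippet below verifies the saved token file exists, is private, and looks like a HuggingFace token. The `hf_` prefix check assumes a modern HF access token; older tokens may use a different format.

```python
import os
import stat

def check_hf_token(path=os.path.expanduser("~/.cache/huggingface/token")):
    """Return a list of problems with the saved HuggingFace token file (empty = OK)."""
    problems = []
    if not os.path.isfile(path):
        return [f"missing: {path}"]
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # group/other bits set: anyone else on the machine can read the token
        problems.append(f"permissions too open: {oct(mode)} (expected 0o600)")
    with open(path) as f:
        token = f.read().strip()
    if not token.startswith("hf_"):
        # modern HF tokens start with "hf_"; older formats may differ
        problems.append("token does not start with 'hf_'")
    return problems
```

`check_hf_token()` returning an empty list means the file is in the shape the steps above produce.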

4. Run it

uvx sparkrun@0.2.15 run @eugr/nemotron-3-super-nvfp4 --tp 1 --hosts localhost --max-model-len 200000

This will:

  1. Build the patched vLLM container from source (~3-5 min)
  2. Download the NVFP4 model from HuggingFace (~75 GB)
  3. Launch vLLM serving on port 8000

First run takes 15-30 minutes depending on download speed. The container and model are cached for subsequent runs.

Note on reproducibility: sparkrun is pinned above, but the recipe it fetches from the @eugr registry and the container image (ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest) are not pinnable through sparkrun. The recipe and container may change between runs. If the container has already been built locally (cached as vllm-node), sparkrun will reuse it. To guarantee reproducibility, do not delete the local vllm-node image.
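Since the first run can take 15-30 minutes, it helps to poll the server until it answers rather than guessing when it's up. A minimal stdlib-only sketch (the URL assumes the default port 8000; the `fetch` parameter is injectable purely so the loop can be exercised without a live server):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url="http://localhost:8000/v1/models",
                     timeout=1800, poll=5.0, fetch=None):
    """Poll the vLLM models endpoint until it returns HTTP 200.

    Returns the number of seconds waited; raises TimeoutError on timeout.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as r:
                return r.status
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if fetch(url) == 200:
                return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            # server not up yet (connection refused) or still loading weights
            pass
        time.sleep(poll)
    raise TimeoutError(f"{url} not ready after {timeout}s")
```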

5. Verify

# Check the model is listed
curl -s http://localhost:8000/v1/models | python3 -m json.tool

# Test inference
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 500
  }' | python3 -m json.tool

# Test tool calling
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "max_tokens": 500
  }' | python3 -m json.tool
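The tool-calling response above comes back in the standard OpenAI chat-completions shape, with the function name and a JSON-encoded argument string nested under `choices[].message.tool_calls`. A small helper (an illustration, not part of the recipe) can pull out the structured calls:

```python
import json

def extract_tool_calls(response):
    """Return (function_name, parsed_arguments) pairs from an
    OpenAI-style /v1/chat/completions response dict."""
    calls = []
    for choice in response.get("choices", []):
        # tool_calls may be absent or null when the model answers in plain text
        for tc in choice.get("message", {}).get("tool_calls") or []:
            fn = tc.get("function", {})
            # "arguments" is a JSON string, not a dict, per the OpenAI schema
            calls.append((fn.get("name"), json.loads(fn.get("arguments") or "{}")))
    return calls
```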

6. Stop

uvx sparkrun@0.2.15 stop

What this gives you

  • Nemotron-3-Super-120B (120B params, 12B active per token, hybrid Mamba-2/MoE/Attention)
  • NVFP4 weights (trained at FP4 precision, not post-hoc quantized)
  • OpenAI-compatible API at http://localhost:8000
  • Tool calling and reasoning/thinking support
  • 200K token context window
  • FP8 KV cache
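Because the endpoint speaks the OpenAI chat-completions protocol, a stdlib-only client is enough to drive it from scripts. A sketch under the assumptions above (model name and port from this guide; `build_payload` is a helper introduced here for clarity, not a sparkrun or vLLM API):

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"

def build_payload(messages, max_tokens=500, tools=None):
    """Assemble an OpenAI-style chat-completions request body."""
    payload = {"model": MODEL, "messages": messages, "max_tokens": max_tokens}
    if tools is not None:
        payload["tools"] = tools
    return payload

def chat(messages, base_url="http://localhost:8000", **kwargs):
    """POST a chat request to the local vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(messages, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage, with the server running: `chat([{"role": "user", "content": "Hello"}])["choices"][0]["message"]["content"]`.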

What's happening under the hood

NVIDIA's official NGC vLLM container (26.02) rejects this model because its quant_algo: "MIXED_PRECISION" isn't in the container's whitelist. The sparkrun recipe uses a community-built container (ghcr.io/spark-arena/dgx-vllm-eugr-nightly) with a patched vLLM that:

  • Uses --moe-backend cutlass instead of Marlin
  • Cherry-picks vLLM PR #38126 (NVFP4 quant fix)
  • Reverts PRs #34758 and #34302 (broken Hopper-only code paths)

Update 2026-04-01: NVIDIA has since published their own Spark Deployment Guide which takes a different approach: vllm/vllm-openai:cu130-nightly container, Marlin MoE backend (not CUTLASS), MTP speculative decoding, and 1M context at 90% GPU memory utilization. That guide is brand new and untested by us. It may be the better path going forward.

Benchmarks (run on this hardware, 2026-04-05)

Compared three model/backend combinations on MMLU-Pro (knowledge/reasoning) and GSM8K (math), all with max_tokens/max_gen_toks = 4096 to avoid truncating thinking tokens.

Accuracy

| Benchmark | vLLM / NVFP4 | Ollama / GGUF Q4_K_M | Gemma4:31b (Ollama) |
|---|---|---|---|
| MMLU-Pro (400q, 2 seeds) | 76.5% | 78.5% | 84.5% |
| GSM8K (200q) | 97.0% | 97.5% | 95.0% |

Latency (MMLU-Pro, per question)

| Backend | s/question | Notes |
|---|---|---|
| Ollama / GGUF Nemotron (120B MoE, 12B active) | ~58 | Fastest |
| vLLM / NVFP4 Nemotron (120B MoE, 12B active) | ~80 | Ray overhead |
| Gemma4:31b (31B dense, Ollama) | ~156 | 2-3x slower |

Observations

  • Gemma4:31b beats both Nemotron variants on MMLU-Pro by 6-8 points despite being 4x smaller.
  • GSM8K is a near-tie — all above 95%.
  • NVFP4 vs GGUF Q4_K_M differs by only 2 points — quantization format barely matters for accuracy.
  • Ollama/GGUF Nemotron is faster than vLLM/NVFP4 Nemotron on single-Spark (no batching advantage with one user).
  • vLLM's advantages are in serving features (prefix caching, continuous batching, structured tool-call parsing), not accuracy or single-request speed.

Notes

  • --max-model-len 200000 keeps VRAM within budget. The model supports 262K but that slightly exceeds available KV cache at 70% GPU memory utilization on a single Spark.
  • The recipe defaults to TP=2 (two Sparks). --tp 1 overrides for single-Spark use.
  • There is a tokenizer regex warning in the logs (references Mistral). This is cosmetic and does not affect output.