Hardware: NVIDIA B200 (180GB VRAM, compute capability 10.0) on RunPod
Image: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
Image contents: CUDA 12.8.1, PyTorch 2.8.0, Ubuntu 24.04
Cost: ~$5/hour — every minute counts
Network volume: /workspace — 500GB (all 3 models + llama.cpp build)
Container disk: 200GB (holds 1 model at a time on fast local NVMe)
Goal: Near-lossless quality, 200K+ context, benchmark-driven optimization
The RunPod image has CUDA 12.8.1 but may lack build tools:
apt-get update && apt-get install -y \
build-essential cmake git curl libcurl4-openssl-dev pciutils
# Verify CUDA is accessible
nvcc --version # Should show 12.8.x
nvidia-smi # Should show B200, compute cap 10.0
git clone https://github.com/ggml-org/llama.cpp /workspace/llama.cpp
cd /workspace/llama.cpp
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_CUDA_ARCHITECTURES=100
cmake --build build --config Release -j $(nproc)
Why `-DCMAKE_CUDA_ARCHITECTURES=100`? The B200 is compute capability 10.0 (Blackwell). Using `native` can fail in Docker containers where cmake can't detect the GPU at configure time. Setting `100` explicitly avoids this and generates optimized sm_100 code.
Why `-DGGML_CUDA_FA_ALL_QUANTS=ON`? This enables quantized KV cache (Q8, Q4) with flash attention. Without it, `-ctk q8_0 -ctv q8_0` silently falls back to FP16, wasting VRAM.
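Since KV cache size drives most of the context-length decisions below, it helps to be able to estimate it up front. A minimal sketch of the arithmetic; the layer count, KV head count, and head dimension in the example are hypothetical, not any of these models' actual configs:

```shell
# Rough KV cache sizing: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# f16 = 2 bytes/elem; q8_0 stores ~8.5 bits/elem (1.0625 bytes), roughly halving it.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / (1024 ^ 3) }'
}

# Hypothetical dims: 60 layers, 8 KV heads, head_dim 128, 131072-token context
kv_cache_gib 60 8 128 131072 2        # f16  -> 30.0 GiB
kv_cache_gib 60 8 128 131072 1.0625   # q8_0 -> 15.9 GiB
```

Plug in the real dimensions from a model's `config.json` to sanity-check the VRAM estimates used later in this guide.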
Verify the build produced all required binaries:
ls -la build/bin/llama-{server,cli,bench}
Copy binaries to a convenient location:
mkdir -p /workspace/bin
cp build/bin/llama-{server,cli,bench,gguf-split} /workspace/bin/
chmod +x /workspace/bin/*
pip install huggingface_hub hf_transfer --break-system-packages
export HF_HUB_ENABLE_HF_TRANSFER=1
Note: As of huggingface_hub v1.4+, the CLI command is `hf` (not `huggingface-cli`).
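If you are unsure which huggingface_hub version an image ships, a small wrapper can pick whichever CLI entry point exists. A sketch (the `hf_cmd` name is just a convenience, not part of either tool):

```shell
# Resolve the HuggingFace CLI name: `hf` on newer huggingface_hub, else
# fall back to the legacy `huggingface-cli`.
hf_cmd() {
  if command -v hf >/dev/null 2>&1; then
    echo hf
  elif command -v huggingface-cli >/dev/null 2>&1; then
    echo huggingface-cli
  else
    echo "error: no HuggingFace CLI found on PATH" >&2
    return 1
  fi
}
# Usage: "$(hf_cmd)" download <repo> --include "*Q8_0*" --local-dir <dir>
```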
Run in parallel using tmux or background jobs. All downloads go to /workspace:
# Terminal 1 — MiniMax M2.5 Q5_K_M (~160GB)
hf download unsloth/MiniMax-M2.5-GGUF \
--include "*Q5_K_M*" \
--local-dir /workspace/models/minimax-m2.5
# Terminal 2 — Devstral 2 123B Q8_0 (~133GB)
hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
--include "*Q8_0*" \
--local-dir /workspace/models/devstral-2
# Terminal 3 — Qwen3.5-122B-A10B Q8_0 (~130GB est.)
# Check availability first:
# https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF/tree/main
hf download unsloth/Qwen3.5-122B-A10B-GGUF \
--include "*Q8_0*" \
--local-dir /workspace/models/qwen3.5-122b
# Fallback if Q8_0 not yet uploaded:
# hf download unsloth/Qwen3.5-122B-A10B-GGUF \
# --include "*UD-Q4_K_XL*" \
#   --local-dir /workspace/models/qwen3.5-122b
Background downloads with nohup: Use the full path to avoid PATH issues in non-interactive shells:
export HF_HUB_ENABLE_HF_TRANSFER=1
nohup /usr/local/bin/hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
--include "*Q8_0*" \
--local-dir /workspace/models/devstral-2 \
> /workspace/download-devstral.log 2>&1 &
# Monitor progress (tqdm doesn't update in log, check size instead):
watch -n5 'du -sh /workspace/models/devstral-2/'
If a download fails or hangs on a lock: Clear stale locks and restart:
rm -rf /workspace/models/<model-dir>/.cache
# Then re-run the download command
Qwen3.5-122B-A10B note: This model was released Feb 25-26, 2026. GGUFs were actively being uploaded, so higher quants may not yet be available. UD-Q4_K_XL (~75-80GB) is confirmed and works great.
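Rather than eyeballing `du` to spot a stuck download, a small helper can compare directory size across an interval and report a stall. A sketch; the directory and interval are illustrative:

```shell
# Report whether a download directory is still growing. A flat size between
# checks usually means a stale lock (see the .cache cleanup above).
check_growth() {
  dir=$1; interval=${2:-30}
  before=$(du -sb "$dir" | cut -f1)
  sleep "$interval"
  after=$(du -sb "$dir" | cut -f1)
  if [ "$after" -gt "$before" ]; then
    echo "growing: $before -> $after bytes"
  else
    echo "stalled at $after bytes"
  fi
}
# check_growth /workspace/models/devstral-2 30
```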
On every new pod start, run this then copy the model you need:
#!/bin/bash
# /workspace/boot.sh — source this on each pod start
# 1. Install runtime deps
apt-get update -qq && apt-get install -y -qq libcurl4-openssl-dev pciutils > /dev/null 2>&1
# 2. Put binaries on PATH
export PATH="/workspace/bin:$PATH"
# 3. Verify GPU
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv,noheader
# 4. Set performance env vars
export GGML_CUDA_GRAPH_OPT=1
# 5. Prepare local model directory
mkdir -p /root/models
echo "=== Ready ==="
echo "=== Now copy your model to local NVMe: ==="
echo " Session 1: cp /workspace/models/minimax-m2.5/*.gguf /root/models/"
echo " Session 2: cp /workspace/models/devstral-2/*.gguf /root/models/"
echo "   Session 3: cp /workspace/models/qwen3.5-122b/*.gguf /root/models/"
This step is critical. Loading 130-160GB over network storage is extremely slow. Copy to container disk first:
# Session 1 — MiniMax M2.5 (~160GB, ~2-4 min on fast NVMe)
cp /workspace/models/minimax-m2.5/*.gguf /root/models/
# Session 2 — Devstral 2 (~133GB)
cp /workspace/models/devstral-2/*.gguf /root/models/
# Session 3 — Qwen3.5 (~130GB)
cp /workspace/models/qwen3.5-122b/*.gguf /root/models/
Between sessions: Delete the previous model before copying the next one to free up container disk space:
rm /root/models/*.gguf && cp /workspace/models/<next-model>/*.gguf /root/models/
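Since the container disk only fits one model plus headroom, a swap helper that checks free space before copying avoids a half-copied model. A sketch under the directory layout above (`swap_model` is just a convenience name):

```shell
# Swap the active model on container disk: remove the old one, verify the
# destination has room, then copy the new shards.
swap_model() {
  src=$1; dst=${2:-/root/models}
  rm -f "$dst"/*.gguf
  need_kb=$(du -sk "$src" | cut -f1)
  free_kb=$(df -Pk "$dst" | awk 'NR==2 {print $4}')
  if [ "$free_kb" -lt "$need_kb" ]; then
    echo "error: need ${need_kb}KB but only ${free_kb}KB free on $dst" >&2
    return 1
  fi
  cp "$src"/*.gguf "$dst"/
}
# swap_model /workspace/models/devstral-2
```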
GGML_CUDA_GRAPH_OPT=1: Enables kernel fusion and concurrent streams in the CUDA backend. On single GPU, this gives a measurable boost to token generation speed. Only works for batch size 1 (single user), which is our use case.
llama-bench measures two critical metrics:
- pp (prompt processing): How fast the model processes input context. Compute-bound. Measured in tok/s.
- tg (token generation): How fast the model generates new tokens. Memory-bandwidth-bound. Measured in tok/s.
It runs each test 5 times by default and reports mean ± stddev. The -d flag tests at different context depths (simulating existing conversation history).
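The two metrics translate directly into latency: prompt tokens divided by pp speed is your time to first token, and output tokens divided by tg speed is generation time. A quick calculator (the example speeds are illustrative, not measured):

```shell
# Back-of-envelope latency: tokens / (tok/s) = seconds.
tok_time() { awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'; }

tok_time 65536 2000   # 65K-token prompt at 2000 tok/s pp -> 32.8s to first token
tok_time 1024 40      # 1K-token answer at 40 tok/s tg    -> 25.6s to generate
```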
GGML_CUDA_GRAPH_OPT=1 llama-bench \
-m <MODEL_PATH> \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 512,2048,8192 \
-n 128 \
-d 0,4096,16384,65536 \
-r 3 \
-o md
Flag reference:
| Flag | Value | Purpose |
|---|---|---|
| `-ngl 999` | 999 | Offload all layers to GPU |
| `-fa 1` | 1 | Enable flash attention |
| `-ctk q8_0 -ctv q8_0` | q8_0 | Quantize KV cache (halves KV memory) |
| `-mmp 0` | 0 | Disable mmap (load weights into memory) |
| `-p 512,2048,8192` | varies | Prompt sizes to benchmark (pp) |
| `-n 128` | 128 | Number of tokens to generate (tg) |
| `-d 0,4096,16384,65536` | varies | Context depth (simulates existing conversation) |
| `-r 3` | 3 | Repetitions per test (default 5; use 3 to save time) |
| `-o md` | md | Output as markdown table |
Run this first to verify the model loads and GPU is working:
llama-bench -m <MODEL_PATH> -ngl 999 -fa 1 -mmp 0 -r 1
This runs the default pp512 + tg128 once. If you see reasonable tok/s numbers, proceed to the full bench.
After the initial benchmark, try these variations to find the sweet spot:
# Test without KV cache quantization (baseline comparison)
llama-bench -m <MODEL> -ngl 999 -fa 1 -mmp 0 -r 3
# Test with Q8 KV cache
llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -r 3
# Test with Q4 KV cache (more aggressive, more VRAM savings)
llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -r 3
# Test with CUDA graph optimization
GGML_CUDA_GRAPH_OPT=1 llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -r 3
# Deep context degradation test
GGML_CUDA_GRAPH_OPT=1 llama-bench \
-m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 \
-p 2048 -n 32 \
-d 0,4096,8192,16384,32768,65536,131072 \
-r 3
What to look for: TG speed degrades as depth increases (there is more KV cache to attend to). If TG drops below ~5 tok/s at your target context length, consider stronger KV quantization or reducing context.
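A small filter can flag the slow rows in the benchmark output automatically. A sketch, assuming `llama-bench -o md` rows end with the test name and a "mean ± stddev" t/s column (the exact table layout may vary across llama.cpp versions):

```shell
# Flag tg rows in llama-bench markdown output that fall under a tok/s threshold.
flag_slow_tg() {
  awk -F'|' -v t="${1:-5}" '
    $0 ~ /tg/ {
      split($(NF-1), a, " ")            # last populated column: "12.3 ± 0.4"
      if (a[1] + 0 > 0 && a[1] + 0 < t) print "SLOW:" $0
    }'
}
# llama-bench ... -o md | flag_slow_tg 5
```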
| Property | Value |
|---|---|
| Architecture | MoE, 256 experts, 8 active per token |
| Total / Active params | 230B / 10B |
| Context | 200K (196,608 tokens) |
| Quant | Q5_K_M |
| Weight size | ~160GB |
| KV cache (200K, Q8_0) | ~10GB |
| Total VRAM est. | ~170GB |
| Strengths | SOTA coding (80.2% SWE-Bench), agentic tool use |
rm -f /root/models/*.gguf
cp /workspace/models/minimax-m2.5/*.gguf /root/models/
ls -lh /root/models/
export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1
MODEL=$(ls /root/models/MiniMax-M2.5-Q5_K_M-00001-of-*.gguf)
# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1
# Full benchmark
llama-bench \
-m $MODEL \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 512,2048,8192 \
-n 128 \
-d 0,4096,16384,65536 \
-r 3 \
-o md
# Deep context stress test
llama-bench \
-m $MODEL \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 2048 -n 32 \
-d 0,16384,65536,131072,196608 \
-r 3 \
-o md
llama-server \
--model $MODEL \
--alias "minimax-m2.5" \
--host 0.0.0.0 \
--port 8000 \
-ngl 999 \
--ctx-size 200000 \
--flash-attn \
-ctk q8_0 -ctv q8_0 \
--no-mmap \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--repeat-penalty 1.0 \
--jinja \
--parallel 1
- `--repeat-penalty 1.0` is mandatory: MiniMax explicitly requires the penalty disabled. Values > 1.0 cause looping/degeneration.
- Temperature 1.0 is the official recommendation, not a typo.
- System prompt: "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax."
- If VRAM is tight at 200K, reduce `--ctx-size` to 131072 and re-bench.
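Once the server is up, you can smoke-test it through the OpenAI-compatible chat endpoint. A sketch; the prompt is arbitrary, and adjust host/port if you changed them:

```shell
# Minimal request against llama-server's OpenAI-compatible API.
PAYLOAD='{
  "model": "minimax-m2.5",
  "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
  "max_tokens": 128
}'
# Sanity-check the JSON locally before sending:
echo "$PAYLOAD" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload ok")'
# Then send it:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```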
If Q5_K_M doesn't fit at 200K context, download UD-Q4_K_XL (~130GB):
hf download unsloth/MiniMax-M2.5-GGUF \
--include "*UD-Q4_K_XL*" \
--local-dir /workspace/models/minimax-m2.5-q4

| Property | Value |
|---|---|
| Architecture | Dense transformer (all params active) |
| Total / Active params | 123B / 123B |
| Context | 256K (using 131K for this session) |
| Quant | Q8_0 |
| Weight size | 133GB |
| KV cache (131K, FP16 fallback) | ~36GB |
| Total VRAM est. | ~169GB |
| Strengths | Agentic coding (72.2% SWE-Bench), multi-file edits |
rm -f /root/models/*.gguf
cp /workspace/models/devstral-2/*.gguf /root/models/
ls -lh /root/models/
export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1
MODEL=$(ls /root/models/Devstral-2-123B-Instruct-2512-Q8_0-00001-of-*.gguf)
# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1
# Full benchmark — this is a dense 123B, expect slower TG than MoE models
llama-bench \
-m $MODEL \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 512,2048,8192 \
-n 128 \
-d 0,4096,16384,65536 \
-r 3 \
-o md
# Compare KV quantization levels — this model is tight on VRAM at 200K
# Q8 KV cache (~28GB)
llama-bench -m $MODEL -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -p 2048 -n 32 -d 0,32768,65536,131072 -r 3
# Q4 KV cache (~14GB, saves ~14GB VRAM at cost of some quality)
llama-bench -m $MODEL -ngl 999 -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -p 2048 -n 32 -d 0,32768,65536,131072 -r 3
llama-server \
--model $MODEL \
--alias "devstral-2" \
--host 0.0.0.0 \
--port 8000 \
-ngl 999 \
--ctx-size 131072 \
--flash-attn \
-ctk q8_0 -ctv q8_0 \
--no-mmap \
--temp 0.15 \
--jinja \
--parallel 1
- This is a dense 123B model: every parameter is active on every forward pass. It WILL be slower per token than MoE models, but it generates very high quality code.
- KV cache quantization is essential. Without it, FP16 KV at 200K needs ~55GB for a total of ~188GB (OOM); with Q8_0 KV it is ~28GB for a total of ~161GB (fits).
- Known llama.cpp issues: Devstral 2 has had reports of broken output at long context and chat template quirks. Ensure you're on the latest llama.cpp. If garbled output appears, try `--ctx-size 131072` first.
- The model uses the Mistral architecture but is identified as "llama" in GGUF metadata; this is normal.
- If 200K OOMs: reduce to `--ctx-size 131072`. Also try `-ctk q4_0 -ctv q4_0`.
- Modified MIT license: free for companies under $20M/month revenue.
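When deciding how far to cut context, you can invert the VRAM budget instead of guessing. A sketch: the 0.14 GB-per-1K-tokens rate below is derived from the ~28GB-at-200K Q8 KV estimate above, and the formula ignores compute-buffer overhead, so treat the result as an upper bound:

```shell
# Largest context that fits: (VRAM - weights) / KV-GB-per-1K-tokens * 1000.
max_ctx() {
  awk -v v="$1" -v w="$2" -v k="$3" 'BEGIN { printf "%d\n", (v - w) / k * 1000 }'
}
max_ctx 180 133 0.14   # B200 VRAM, Q8_0 weights, Q8 KV rate -> upper-bound tokens
```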
| Property | Value |
|---|---|
| Architecture | MoE + Gated Delta Networks (hybrid attention/SSM) |
| Total / Active params | 122B / 10B |
| Context | 256K (extendable to 1M via YaRN) |
| Quant | Q8_0 (~130GB) or UD-Q4_K_XL (~75-80GB) |
| KV cache (256K, Q8_0) | ~10GB |
| Total VRAM est. | ~140GB (Q8_0) / ~88GB (Q4_K_XL) |
| Strengths | Reasoning, multilingual (201 langs), hybrid thinking mode |
rm -f /root/models/*.gguf
cp /workspace/models/qwen3.5-122b/*.gguf /root/models/
ls -lh /root/models/
export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1
MODEL=$(ls /root/models/Qwen3.5-122B-A10B-*-00001-of-*.gguf)
# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1
# Full benchmark
llama-bench \
-m $MODEL \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 512,2048,8192 \
-n 128 \
-d 0,4096,16384,65536 \
-r 3 \
-o md
# This model has massive headroom — push to 256K
llama-bench \
-m $MODEL \
-ngl 999 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-mmp 0 \
-p 2048 -n 32 \
-d 0,32768,65536,131072,196608,262144 \
-r 3 \
-o md
llama-server \
--model $MODEL \
--alias "qwen3.5-122b" \
--host 0.0.0.0 \
--port 8000 \
-ngl 999 \
--ctx-size 262144 \
--flash-attn \
-ctk q8_0 -ctv q8_0 \
--no-mmap \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--jinja \
--parallel 1
- Brand new model (Feb 25-26, 2026): GGUFs are actively being uploaded, so check the repo before each session.
- Thinking mode is enabled by default (the model generates `<think>...</think>` before responses). To disable it for faster direct answers, set it in the system prompt or use the chat template's `enable_thinking: false`.
- Hybrid architecture: Gated Delta Networks plus sparse MoE means a smaller KV cache footprint than a pure MHA model.
- At UD-Q4_K_XL (~75-80GB), this model has massive headroom: full 256K context with room to spare.
- Sampling presets:
  - Coding: `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00`
  - Creative: `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`
  - Tool calling: `--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00`
- Apache 2.0 license: fully open, no revenue restrictions.
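To avoid retyping the sampling presets, a tiny selector can emit the right flag set per task. A sketch (`qwen_sampling` is just a convenience name; the values mirror the preset list above):

```shell
# Emit Qwen3.5 sampling flags for a given task preset.
qwen_sampling() {
  case "$1" in
    coding)   echo "--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00" ;;
    creative) echo "--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00" ;;
    tools)    echo "--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00" ;;
    *) echo "usage: qwen_sampling coding|creative|tools" >&2; return 1 ;;
  esac
}
# llama-server --model $MODEL $(qwen_sampling coding) ...
```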
B200 VRAM: 180 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Session 1 — MiniMax M2.5 Q5_K_M: ~170 GB (94%)
Session 2 — Devstral 2 Q8_0: ~169 GB (94%) [131K ctx, FP16 KV]
Session 3 — Qwen3.5-122B Q8_0: ~140 GB (78%)
Session 3 — Qwen3.5-122B Q4_K_XL: ~88 GB (49%)
/workspace volume: 500 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
llama.cpp build: ~2 GB
MiniMax M2.5 Q5_K_M: ~160 GB
Devstral 2 Q8_0: ~133 GB
Qwen3.5-122B Q8_0: ~130 GB
Total: ~425 GB (+75 GB headroom)
Container disk: 200 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OS + image: ~15 GB
1 model at a time: 130-160 GB
Headroom: 25-55 GB
1. Boot pod
2. source /workspace/boot.sh
3. Copy model: cp /workspace/models/<model>/*.gguf /root/models/
4. Smoke test: llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1
5. Full bench: (see session-specific commands)
6. Tune: Adjust -ctk/-ctv, ctx-size based on bench results
7. Serve: llama-server with best settings
8. Test: Interactive coding tasks via API or CLI
9. Swap model: rm /root/models/*.gguf && cp next model
| Flag | Purpose |
|---|---|
| `-ngl 999` | Offload all layers to GPU |
| `--flash-attn` / `-fa 1` | Enable flash attention (required for KV quant) |
| `-ctk q8_0 -ctv q8_0` | Quantize KV cache to Q8 (halves KV memory) |
| `--no-mmap` / `-mmp 0` | Load weights into memory (no memory-mapped I/O) |
| `--jinja` | Use model's native chat template |
| `--ctx-size N` | Set context window in tokens |
| `--parallel 1` | Single-user inference (max context per user) |
| `GGML_CUDA_GRAPH_OPT=1` | Enable CUDA concurrent streams (~5-10% TG boost) |
- Reduce `--ctx-size` (200K → 131072 → 65536)
- Quantize KV cache harder (`-ctk q4_0 -ctv q4_0`)
- Use a smaller quant of the same model
- Check `nvidia-smi`: all layers must be on GPU
cmake fails with "CUDA_ARCHITECTURES native but no GPU detected"
→ This is expected in Docker. We set -DCMAKE_CUDA_ARCHITECTURES=100 explicitly for B200.
cmake fails with "Unsupported gpu architecture 'compute_120a'"
→ Your llama.cpp version is trying to auto-detect RTX 50-series arch. Pin to -DCMAKE_CUDA_ARCHITECTURES=100 for B200.
Devstral 2 produces garbled/repeated output
→ Known issue. Ensure latest llama.cpp build. Try --ctx-size 32768 first to verify model works, then scale up.
Qwen3.5-122B Q8_0 not available on HuggingFace
→ Model was brand new (Feb 26, 2026). Use UD-Q4_K_XL — confirmed available, excellent quality.
MiniMax M2.5 looping/repetition
→ --repeat-penalty must be 1.0 (disabled). MiniMax models are sensitive to values > 1.0.
Model loads but inference is very slow
→ Model is spilling to CPU RAM. Check nvidia-smi. If VRAM usage ≈ 180GB, reduce context. All layers must be on GPU.
Split GGUF files
→ Large models come as split shards (-00001-of-00003.gguf). Point llama.cpp at the first shard. All shards must be in the same directory.
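An incomplete copy is the usual cause of shard problems, so it's worth verifying the set before loading. A sketch that parses the standard `-NNNNN-of-NNNNN.gguf` naming these repos use (it assumes the 5-digit index width; `check_shards` is just a convenience name):

```shell
# Verify every shard of a split GGUF is present, given the first shard's path.
check_shards() {
  first=$1
  total=$(basename "$first" | sed -n 's/.*-of-\([0-9]*\)\.gguf$/\1/p')
  prefix=${first%-[0-9]*-of-*.gguf}
  missing=0
  i=1
  while [ "$i" -le "$((10#$total))" ]; do
    shard=$(printf '%s-%05d-of-%s.gguf' "$prefix" "$i" "$total")
    [ -f "$shard" ] || { echo "missing: $shard"; missing=1; }
    i=$((i + 1))
  done
  [ "$missing" -eq 0 ] && echo "all $total shards present"
}
# check_shards /root/models/MiniMax-M2.5-Q5_K_M-00001-of-00004.gguf
```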