@co-l
Created February 26, 2026 14:36
B200 180GB VRAM — Model Testing Runbook

Hardware: NVIDIA B200 (180GB VRAM, compute capability 10.0) on RunPod
Image: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
Image contents: CUDA 12.8.1, PyTorch 2.8.0, Ubuntu 24.04
Cost: ~$5/hour — every minute counts
Network volume: /workspace, 500GB (all 3 models + llama.cpp build)
Container disk: 200GB (holds 1 model at a time on fast local NVMe)
Goal: Near-lossless quality, 200K+ context, benchmark-driven optimization


Phase 0: First Boot Setup (Do Once, Results Persist on /workspace)

0.1 Install Build Dependencies

The RunPod image has CUDA 12.8.1 but may lack build tools:

apt-get update && apt-get install -y \
  build-essential cmake git curl libcurl4-openssl-dev pciutils

# Verify CUDA is accessible
nvcc --version        # Should show 12.8.x
nvidia-smi            # Should show B200, compute cap 10.0

0.2 Build llama.cpp with All Optimizations

git clone https://github.com/ggml-org/llama.cpp /workspace/llama.cpp
cd /workspace/llama.cpp

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=100

cmake --build build --config Release -j $(nproc)

Why -DCMAKE_CUDA_ARCHITECTURES=100? The B200 is compute capability 10.0 (Blackwell). Using native can fail in Docker containers where cmake can't detect the GPU at configure time. Setting 100 explicitly avoids this and generates optimized sm_100 code.

Why -DGGML_CUDA_FA_ALL_QUANTS=ON? This enables quantized KV cache (Q8, Q4) with flash attention. Without it, -ctk q8_0 -ctv q8_0 silently falls back to FP16, wasting VRAM.

Verify the build produced all required binaries:

ls -la build/bin/llama-{server,cli,bench}

Copy binaries to a convenient location:

mkdir -p /workspace/bin
cp build/bin/llama-{server,cli,bench,gguf-split} /workspace/bin/
chmod +x /workspace/bin/*

0.3 Install Download Tools

pip install huggingface_hub hf_transfer --break-system-packages
export HF_HUB_ENABLE_HF_TRANSFER=1

Note: As of huggingface_hub v1.4+, the CLI command is hf (not huggingface-cli).
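If your scripts need to survive both CLI generations, a small shim picks whichever name is installed — a sketch; neither name exists until the pip install in step 0.3 has completed:

```shell
# Use `hf` when available (huggingface_hub >= 1.4), otherwise fall back
# to the legacy `huggingface-cli` entry point
if command -v hf >/dev/null 2>&1; then
  HF_CLI=hf
else
  HF_CLI=huggingface-cli
fi
echo "Using: $HF_CLI"
```

Then write download commands as `"$HF_CLI" download ...` and they work either way.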

0.4 Download All Three Models

Run in parallel using tmux or background jobs. All downloads go to /workspace:

# Terminal 1 — MiniMax M2.5 Q5_K_M (~160GB)
hf download unsloth/MiniMax-M2.5-GGUF \
  --include "*Q5_K_M*" \
  --local-dir /workspace/models/minimax-m2.5

# Terminal 2 — Devstral 2 123B Q8_0 (~133GB)
hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
  --include "*Q8_0*" \
  --local-dir /workspace/models/devstral-2

# Terminal 3 — Qwen3.5-122B-A10B Q8_0 (~130GB est.)
# Check availability first:
# https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF/tree/main
hf download unsloth/Qwen3.5-122B-A10B-GGUF \
  --include "*Q8_0*" \
  --local-dir /workspace/models/qwen3.5-122b

# Fallback if Q8_0 not yet uploaded:
# hf download unsloth/Qwen3.5-122B-A10B-GGUF \
#   --include "*UD-Q4_K_XL*" \
#   --local-dir /workspace/models/qwen3.5-122b

Background downloads with nohup: Use full path to avoid PATH issues in non-interactive shells:

export HF_HUB_ENABLE_HF_TRANSFER=1
nohup /usr/local/bin/hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
  --include "*Q8_0*" \
  --local-dir /workspace/models/devstral-2 \
  > /workspace/download-devstral.log 2>&1 &

# Monitor progress (tqdm doesn't update in log, check size instead):
watch -n5 'du -sh /workspace/models/devstral-2/'

If download fails/hangs on lock: Clear stale locks and restart:

rm -rf /workspace/models/<model-dir>/.cache
# Then re-run the download command

Qwen3.5-122B-A10B note: This model was released Feb 25-26, 2026, and GGUFs were still being uploaded at time of writing, so higher quants may not yet be available. UD-Q4_K_XL (~75-80GB) is confirmed available and works well.


Subsequent Boot — Quick Start Script

On every new pod start, run this then copy the model you need:

#!/bin/bash
# /workspace/boot.sh — source this on each pod start

# 1. Install runtime deps
apt-get update -qq && apt-get install -y -qq libcurl4-openssl-dev pciutils > /dev/null 2>&1

# 2. Put binaries on PATH
export PATH="/workspace/bin:$PATH"

# 3. Verify GPU
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv,noheader

# 4. Set performance env vars
export GGML_CUDA_GRAPH_OPT=1

# 5. Prepare local model directory
mkdir -p /root/models

echo "=== Ready ==="
echo "=== Now copy your model to local NVMe: ==="
echo "  Session 1: cp /workspace/models/minimax-m2.5/*.gguf /root/models/"
echo "  Session 2: cp /workspace/models/devstral-2/*.gguf /root/models/"
echo "  Session 3: cp /workspace/models/qwen3.5-122b/*.gguf /root/models/"

Copy Model to Local NVMe

This step is critical. Loading 130-160GB over network storage is extremely slow. Copy to container disk first:

# Session 1 — MiniMax M2.5 (~160GB, ~2-4 min on fast NVMe)
cp /workspace/models/minimax-m2.5/*.gguf /root/models/

# Session 2 — Devstral 2 (~133GB)
cp /workspace/models/devstral-2/*.gguf /root/models/

# Session 3 — Qwen3.5 (~130GB)
cp /workspace/models/qwen3.5-122b/*.gguf /root/models/

Between sessions: Delete the previous model before copying the next one to free up container disk space:

rm /root/models/*.gguf
cp /workspace/models/<next-model>/*.gguf /root/models/

GGML_CUDA_GRAPH_OPT=1: Enables kernel fusion and concurrent streams in the CUDA backend. On single GPU, this gives a measurable boost to token generation speed. Only works for batch size 1 (single user), which is our use case.


Benchmarking Protocol

How llama-bench Works

llama-bench measures two critical metrics:

  • pp (prompt processing): How fast the model processes input context. Compute-bound. Measured in tok/s.
  • tg (token generation): How fast the model generates new tokens. Memory-bandwidth-bound. Measured in tok/s.

It runs each test 5 times by default and reports mean ± stddev. The -d flag tests at different context depths (simulating existing conversation history).

Standard Benchmark Command Template

GGML_CUDA_GRAPH_OPT=1 llama-bench \
  -m <MODEL_PATH> \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 512,2048,8192 \
  -n 128 \
  -d 0,4096,16384,65536 \
  -r 3 \
  -o md

Flag reference:

Flag                      Purpose
-ngl 999                  Offload all layers to GPU
-fa 1                     Enable flash attention
-ctk q8_0 -ctv q8_0       Quantize KV cache to Q8 (halves KV memory)
-mmp 0                    Disable mmap (load weights into memory)
-p 512,2048,8192          Prompt sizes to benchmark (pp)
-n 128                    Number of tokens to generate (tg)
-d 0,4096,16384,65536     Context depths (simulate existing conversation)
-r 3                      Repetitions per test (default 5; use 3 to save time)
-o md                     Output as markdown table

Quick Smoke Test (30 seconds)

Run this first to verify the model loads and GPU is working:

llama-bench -m <MODEL_PATH> -ngl 999 -fa 1 -mmp 0 -r 1

This runs the default pp512 + tg128 once. If you see reasonable tok/s numbers, proceed to full bench.

Performance Tuning Iteration

After the initial benchmark, try these variations to find the sweet spot:

# Test without KV cache quantization (baseline comparison)
llama-bench -m <MODEL> -ngl 999 -fa 1 -mmp 0 -r 3

# Test with Q8 KV cache
llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -r 3

# Test with Q4 KV cache (more aggressive, more VRAM savings)
llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -r 3

# Test with CUDA graph optimization
GGML_CUDA_GRAPH_OPT=1 llama-bench -m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -r 3

# Deep context degradation test
GGML_CUDA_GRAPH_OPT=1 llama-bench \
  -m <MODEL> -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 \
  -p 2048 -n 32 \
  -d 0,4096,8192,16384,32768,65536,131072 \
  -r 3

What to look for: TG speed degrades as depth increases (more KV cache to attend to). If TG drops below ~5 tok/s at your target context length, consider stronger KV quantization or reducing context.
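To scan a long run for that failure mode quickly, you can filter the tg rows out of the markdown table and flag anything under the 5 tok/s floor. A rough sketch over a fabricated sample — the `-o md` column order (model, size, params, backend, ngl, test, t/s) is an assumption and can differ between llama-bench builds, so adjust the field index to match your actual output:

```shell
# Fabricated sample of `llama-bench -o md` output, for illustration only;
# point the awk at your real results file instead
cat > /tmp/bench.md <<'EOF'
| model        | size   | params | backend | ngl | test         | t/s           |
| minimax-m2.5 | 160 GB | 230 B  | CUDA    | 999 | pp2048       | 2500.0 ± 10.0 |
| minimax-m2.5 | 160 GB | 230 B  | CUDA    | 999 | tg32@d65536  | 12.4 ± 0.2    |
| minimax-m2.5 | 160 GB | 230 B  | CUDA    | 999 | tg32@d131072 | 4.1 ± 0.1     |
EOF

# Print token-generation rows; mark anything under 5 tok/s
awk -F'|' '$7 ~ /tg/ {
  gsub(/^ +| +$/, "", $7); split($8, a, " ");
  printf "%-14s %s tok/s%s\n", $7, a[1], (a[1] < 5 ? "  <-- below floor" : "")
}' /tmp/bench.md
```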


Session 1: MiniMax M2.5 (230B MoE, 10B active)

Specs

Property                  Value
Architecture              MoE, 256 experts, 8 active per token
Total / active params     230B / 10B
Context                   200K (196,608 tokens)
Quant                     Q5_K_M
Weight size               ~160 GB
KV cache (200K, Q8_0)     ~10 GB
Total VRAM est.           ~170 GB
Strengths                 SOTA coding (80.2% SWE-Bench), agentic tool use

Step 0: Copy to Local Disk

rm -f /root/models/*.gguf
cp /workspace/models/minimax-m2.5/*.gguf /root/models/
ls -lh /root/models/

Step 1: Benchmark

export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1

MODEL=$(ls /root/models/MiniMax-M2.5-Q5_K_M-00001-of-*.gguf)

# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1

# Full benchmark
llama-bench \
  -m $MODEL \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 512,2048,8192 \
  -n 128 \
  -d 0,4096,16384,65536 \
  -r 3 \
  -o md

# Deep context stress test
llama-bench \
  -m $MODEL \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 2048 -n 32 \
  -d 0,16384,65536,131072,196608 \
  -r 3 \
  -o md

Step 2: Serve (With Best Settings From Bench)

llama-server \
  --model $MODEL \
  --alias "minimax-m2.5" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --ctx-size 200000 \
  --flash-attn \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --top-k 40 \
  --repeat-penalty 1.0 \
  --jinja \
  --parallel 1

MiniMax-Specific Notes

  • --repeat-penalty 1.0 is mandatory — MiniMax explicitly requires it disabled. Values > 1.0 cause looping/degeneration.
  • Temperature 1.0 is the official recommendation, not a typo.
  • System prompt: "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax."
  • If VRAM is tight at 200K, reduce --ctx-size to 131072 and re-bench.
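Once the server is up, one request through its OpenAI-compatible endpoint verifies the chat template and sampling settings end to end. A sketch — the prompt is a placeholder, and the payload is validated locally before sending:

```shell
# Build and validate the request body, then POST it to llama-server's
# OpenAI-compatible chat endpoint (port 8000 per the serve command above)
PAYLOAD='{
  "model": "minimax-m2.5",
  "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
  "max_tokens": 64
}'
printf '%s' "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "server not reachable -- start llama-server first"
```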

Fallback

If Q5_K_M doesn't fit at 200K context, download UD-Q4_K_XL (~130GB):

hf download unsloth/MiniMax-M2.5-GGUF \
  --include "*UD-Q4_K_XL*" \
  --local-dir /workspace/models/minimax-m2.5-q4

Session 2: Devstral 2 123B (Dense Transformer)

Specs

Property                        Value
Architecture                    Dense transformer (all params active)
Total / active params           123B / 123B
Context                         256K (using 131K for this session)
Quant                           Q8_0
Weight size                     133 GB
KV cache (131K, FP16 fallback)  ~36 GB
Total VRAM est.                 ~169 GB
Strengths                       Agentic coding (72.2% SWE-Bench), multi-file edits

Step 0: Copy to Local Disk

rm -f /root/models/*.gguf
cp /workspace/models/devstral-2/*.gguf /root/models/
ls -lh /root/models/

Step 1: Benchmark

export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1

MODEL=$(ls /root/models/Devstral-2-123B-Instruct-2512-Q8_0-00001-of-*.gguf)

# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1

# Full benchmark — this is a dense 123B, expect slower TG than MoE models
llama-bench \
  -m $MODEL \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 512,2048,8192 \
  -n 128 \
  -d 0,4096,16384,65536 \
  -r 3 \
  -o md

# Compare KV quantization levels — this model is tight on VRAM at 200K
# Q8 KV cache (~28GB)
llama-bench -m $MODEL -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -mmp 0 -p 2048 -n 32 -d 0,32768,65536,131072 -r 3

# Q4 KV cache (~14GB, saves ~14GB VRAM at cost of some quality)
llama-bench -m $MODEL -ngl 999 -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -p 2048 -n 32 -d 0,32768,65536,131072 -r 3

Step 2: Serve

llama-server \
  --model $MODEL \
  --alias "devstral-2" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --ctx-size 131072 \
  --flash-attn \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --temp 0.15 \
  --jinja \
  --parallel 1

Devstral-Specific Notes

  • This is a dense 123B model — every parameter is active on every forward pass. It WILL be slower per token than MoE models, but generates very high quality code.
  • KV cache quantization is essential. Without it, FP16 KV at 200K needs ~55GB → total ~188GB (OOM). With Q8_0 KV: ~28GB → total ~161GB (fits).
  • Known llama.cpp issues: Devstral 2 has had reports of broken output at long context and chat template quirks. Ensure you're on latest llama.cpp. If garbled output appears, try --ctx-size 131072 first.
  • Uses Mistral architecture identified as "llama" in GGUF metadata — this is normal.
  • If 200K OOMs: Reduce to --ctx-size 131072. Also try -ctk q4_0 -ctv q4_0.
  • Modified MIT license: free for companies under $20M/month revenue.

Session 3: Qwen3.5-122B-A10B (MoE, Hybrid Architecture)

Specs

Property                  Value
Architecture              MoE + Gated Delta Networks (hybrid attention/SSM)
Total / active params     122B / 10B
Context                   256K (extendable to 1M via YaRN)
Quant                     Q8_0 (~130GB) or UD-Q4_K_XL (~75-80GB)
KV cache (256K, Q8_0)     ~10 GB
Total VRAM est.           ~140 GB (Q8_0) / ~88 GB (Q4_K_XL)
Strengths                 Reasoning, multilingual (201 langs), hybrid thinking mode

Step 0: Copy to Local Disk

rm -f /root/models/*.gguf
cp /workspace/models/qwen3.5-122b/*.gguf /root/models/
ls -lh /root/models/

Step 1: Benchmark

export PATH="/workspace/bin:$PATH"
export GGML_CUDA_GRAPH_OPT=1

MODEL=$(ls /root/models/Qwen3.5-122B-A10B-*-00001-of-*.gguf)

# Smoke test
llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1

# Full benchmark
llama-bench \
  -m $MODEL \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 512,2048,8192 \
  -n 128 \
  -d 0,4096,16384,65536 \
  -r 3 \
  -o md

# This model has massive headroom — push to 256K
llama-bench \
  -m $MODEL \
  -ngl 999 \
  -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 2048 -n 32 \
  -d 0,32768,65536,131072,196608,262144 \
  -r 3 \
  -o md

Step 2: Serve

llama-server \
  --model $MODEL \
  --alias "qwen3.5-122b" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --ctx-size 262144 \
  --flash-attn \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --jinja \
  --parallel 1

Qwen-Specific Notes

  • Brand new model (Feb 25-26, 2026) — GGUFs actively being uploaded. Check repo before session.
  • Thinking mode is enabled by default (generates <think>...</think> before responses). To disable it for faster direct answers, instruct it in the system prompt or pass the chat template's enable_thinking: false.
  • Hybrid architecture: Gated Delta Networks + sparse MoE means smaller KV cache footprint than a pure MHA model.
  • At UD-Q4_K_XL (~75-80GB), this model has massive headroom — full 256K context with room to spare.
  • Sampling presets:
    • Coding: --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
    • Creative: --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00
    • Tool calling: --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00
  • Apache 2.0 license — fully open, no revenue restrictions.
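The thinking toggle can also be flipped per request. A hedged sketch — recent llama-server builds accept a chat_template_kwargs field on /v1/chat/completions, but verify your build honors it before relying on it:

```shell
# Per-request thinking toggle via chat template kwargs (support depends on
# your llama-server version; if it is silently ignored, check the server docs)
PAYLOAD='{
  "model": "qwen3.5-122b",
  "messages": [{"role": "user", "content": "What is 2+2?"}],
  "chat_template_kwargs": {"enable_thinking": false}
}'
printf '%s' "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "server not reachable -- start llama-server first"
```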

Quick Reference

Resource Budget

B200 VRAM:                          180 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Session 1 — MiniMax M2.5 Q5_K_M:   ~170 GB  (94%)
Session 2 — Devstral 2 Q8_0:       ~169 GB  (94%)  [131K ctx, FP16 KV]
Session 3 — Qwen3.5-122B Q8_0:     ~140 GB  (78%)
Session 3 — Qwen3.5-122B Q4_K_XL:   ~88 GB  (49%)

/workspace volume:                    500 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
llama.cpp build:                     ~2 GB
MiniMax M2.5 Q5_K_M:              ~160 GB
Devstral 2 Q8_0:                  ~133 GB
Qwen3.5-122B Q8_0:                ~130 GB
Total:                             ~425 GB  (+75 GB headroom)

Container disk:                     200 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OS + image:                         ~15 GB
1 model at a time:              130-160 GB
Headroom:                        25-55 GB

Session Workflow

1. Boot pod
2. source /workspace/boot.sh
3. Copy model:    cp /workspace/models/<model>/*.gguf /root/models/
4. Smoke test:    llama-bench -m $MODEL -ngl 999 -fa 1 -mmp 0 -r 1
5. Full bench:    (see session-specific commands)
6. Tune:          Adjust -ctk/-ctv, ctx-size based on bench results
7. Serve:         llama-server with best settings
8. Test:          Interactive coding tasks via API or CLI
9. Swap model:    rm /root/models/*.gguf && cp next model
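Steps 3 and 9 can be folded into one hypothetical helper. The paths default to this runbook's layout and are overridable via environment variables:

```shell
# swap_model <dir-name>: drop the current local model, then copy the next
# one from the network volume to fast local NVMe
MODELS_SRC=${MODELS_SRC:-/workspace/models}
MODELS_DST=${MODELS_DST:-/root/models}

swap_model() {
  rm -f "$MODELS_DST"/*.gguf
  cp "$MODELS_SRC/$1"/*.gguf "$MODELS_DST"/
  ls -lh "$MODELS_DST"
}

# Usage: swap_model devstral-2
```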

Key Flags Cheat Sheet

Flag                   Purpose
-ngl 999               Offload all layers to GPU
--flash-attn / -fa 1   Enable flash attention (required for KV quant)
-ctk q8_0 -ctv q8_0    Quantize KV cache to Q8 (halves KV memory)
--no-mmap / -mmp 0     Load weights into memory (no memory-mapped I/O)
--jinja                Use model's native chat template
--ctx-size N           Set context window in tokens
--parallel 1           Single-user inference (max context per user)
GGML_CUDA_GRAPH_OPT=1  Enable CUDA concurrent streams (~5-10% TG boost)

If Something OOMs

  1. Reduce --ctx-size (200K → 131072 → 65536)
  2. Quantize KV cache harder (-ctk q4_0 -ctv q4_0)
  3. Use a smaller quant of the same model
  4. Check nvidia-smi — all layers must be on GPU
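To predict whether a context size fits before loading, estimate the KV term by hand: KV bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes per element. A back-of-envelope sketch — the dimensions below are illustrative, not any of these models' real configs (read the actual values from the metadata llama.cpp prints at load), and q8_0 is approximated as 1 byte/elem, ignoring block-scale overhead:

```shell
# Rough KV cache estimate; n_layers / n_kv_heads / head_dim are hypothetical
n_layers=88; n_kv_heads=8; head_dim=128
ctx=131072
for name_bpe in fp16:2 q8_0:1; do     # bytes per element (q8_0 ~= 1, scales ignored)
  name=${name_bpe%%:*}; bpe=${name_bpe##*:}
  kv=$(( 2 * n_layers * n_kv_heads * head_dim * ctx * bpe ))
  echo "$name KV at $ctx tokens: $(( kv / 1024 / 1024 / 1024 )) GiB"
done
```

The absolute numbers won't match this doc's per-model figures (head counts differ), but the scaling is the useful part: KV grows linearly with context and halves with each quantization step, which is exactly the lever steps 1 and 2 above pull.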

Troubleshooting

cmake fails with "CUDA_ARCHITECTURES native but no GPU detected"
→ This is expected in Docker. We set -DCMAKE_CUDA_ARCHITECTURES=100 explicitly for B200.

cmake fails with "Unsupported gpu architecture 'compute_120a'"
→ Your llama.cpp version is trying to auto-detect RTX 50-series arch. Pin to -DCMAKE_CUDA_ARCHITECTURES=100 for B200.

Devstral 2 produces garbled/repeated output
→ Known issue. Ensure latest llama.cpp build. Try --ctx-size 32768 first to verify model works, then scale up.

Qwen3.5-122B Q8_0 not available on HuggingFace
→ Model was brand new (Feb 26, 2026). Use UD-Q4_K_XL — confirmed available, excellent quality.

MiniMax M2.5 looping/repetition
→ --repeat-penalty must be 1.0 (disabled). MiniMax models are sensitive to values > 1.0.

Model loads but inference is very slow
→ Model is spilling to CPU RAM. Check nvidia-smi. If VRAM usage ≈ 180GB, reduce context. All layers must be on GPU.

Split GGUF files
→ Large models come as split shards (-00001-of-00003.gguf). Point llama.cpp at the first shard. All shards must be in the same directory.
