armand1m/qwen3-vllm.sh

armand1m · 2026-02-09T12:37:48Z

GLM4.7 analysed this instance's logs after I sent a fairly long code-analysis request through OpenCode in plan mode.

Screenshots at the end

Based on the logs, here's a comprehensive table of metadata and performance insights for the vllm_qwen3codernext instance:                                                            
   
  vLLM Instance Metadata & Performance Analysis                                                                                                                                         
                                                                                                                                                                                      
  Instance Configuration                                                                                                                                                                
  ┌──────────────┬───────────┬───────────────────────────┬───────────────────────────────────────┐                                                                                      
  │   Category   │ Parameter │           Value           │                 Notes                 │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ vLLM Version │ Version   │ 0.15.2.dev0+g1892993bc    │ Development build (Feb 4, 2026)       │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ Model        │ Name      │ Qwen/Qwen3-Coder-Next-FP8 │ FP8 quantized coder model             │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ Served Name  │ Alias     │ qwen3-coder-next          │ API model name                        │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ Architecture │ Type      │ Qwen3NextForCausalLM      │ Custom architecture with Mamba layers │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ Quantization │ Type      │ fp8_w8a8                  │ 8-bit weights/activations             │
  ├──────────────┼───────────┼───────────────────────────┼───────────────────────────────────────┤
  │ Data Type    │ dtype     │ torch.bfloat16            │ Compute precision                     │
  └──────────────┴───────────┴───────────────────────────┴───────────────────────────────────────┘
  Model Architecture
  ┌───────────────────┬────────────────────────────────┬───────────────────────────────┐
  │     Component     │             Value              │             Notes             │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Hidden Layers     │ 48                             │ Depth of network              │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Hidden Size       │ 2,048                          │ Dimensionality                │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Attention Heads   │ 16                             │ Total heads                   │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Key-Value Heads   │ 2                              │ GQA (Grouped Query Attention) │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Head Dimension    │ 256                            │ Per-head dimension            │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Vocab Size        │ 151,936                        │ Token vocabulary              │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ MoE Experts       │ 512 total, 10 active per token │ Mixture of Experts            │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ MoE Layers        │ 48                             │ All layers are MoE            │
  ├───────────────────┼────────────────────────────────┼───────────────────────────────┤
  │ Intermediate Size │ 5,120 (FFN), 512 (MoE)         │                               │
  └───────────────────┴────────────────────────────────┴───────────────────────────────┘
  Memory & Cache
  ┌───────────────────┬──────────────────────────────┬────────────────────────────────┐
  │      Metric       │            Value             │             Notes              │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ Model Memory      │ 74.89 GiB                    │ GPU VRAM for model weights     │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ KV Cache Size     │ 195,840 tokens               │ Total cache capacity           │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ KV Cache Memory   │ 17.95 GiB                    │ Available for KV cache         │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ CUDA Graph Memory │ -11.33 GiB                   │ Memory freed via graph capture │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ Block Size        │ 128 tokens (adjusted to 544) │ For attention/Mamba alignment  │
  ├───────────────────┼──────────────────────────────┼────────────────────────────────┤
  │ Max Concurrency   │ 2.95x                        │ For 262K-token requests        │
  └───────────────────┴──────────────────────────────┴────────────────────────────────┘
  Performance Settings
  ┌──────────────────────────┬────────────────────┬──────────────────────────────────┐
  │        Parameter         │       Value        │              Notes               │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Max Sequence Length      │ 262,144 tokens     │ ~256K context window             │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Max Batched Tokens       │ 131,072            │ Chunked prefill limit            │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Max Concurrent Sequences │ 16                 │ Maximum parallel requests        │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Attention Backend        │ FlashInfer         │ Fast attention implementation    │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Load Format              │ fastsafetensors    │ Optimized loading                │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Prefix Caching           │ Enabled            │ Experimental for Mamba           │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ Mamba Cache Mode         │ align              │ Experimental with prefix caching │
  ├──────────────────────────┼────────────────────┼──────────────────────────────────┤
  │ CUDA Graph Mode          │ FULL_AND_PIECEWISE │ 2,1                              │
  └──────────────────────────┴────────────────────┴──────────────────────────────────┘
  Compilation Configuration
  ┌────────────────────┬────────────────────────────────────────────┬───────┐
  │      Setting       │                   Value                    │ Notes │
  ├────────────────────┼────────────────────────────────────────────┼───────┤
  │ Backend            │ torch.compile (inductor)                   │       │
  ├────────────────────┼────────────────────────────────────────────┼───────┤
  │ Custom Ops         │ +quant_fp8, unified_attention, mamba_mixer │       │
  ├────────────────────┼────────────────────────────────────────────┼───────┤
  │ Compilation Sizes  │ 131,072 split point                        │       │
  ├────────────────────┼────────────────────────────────────────────┼───────┤
  │ CUDA Graph Warmups │ 1                                          │       │
  ├────────────────────┼────────────────────────────────────────────┼───────┤
  │ Cache Dir          │ /root/.cache/vllm/torch_compile_cache/     │       │
  └────────────────────┴────────────────────────────────────────────┴───────┘
  Throughput Metrics (Observed)
  ┌───────────────────────┬───────────────────────┬───────────────────────────────────┐
  │        Metric         │         Range         │               Notes               │
  ├───────────────────────┼───────────────────────┼───────────────────────────────────┤
  │ Prompt Throughput     │ 315 - 14,083 tokens/s │ Highly variable based on batching │
  ├───────────────────────┼───────────────────────┼───────────────────────────────────┤
  │ Generation Throughput │ 1.7 - 43.1 tokens/s   │ Single-stream focused             │
  ├───────────────────────┼───────────────────────┼───────────────────────────────────┤
  │ Peak Prompt TPS       │ ~14,083 tok/s         │ With good batching                │
  ├───────────────────────┼───────────────────────┼───────────────────────────────────┤
  │ Peak Generation TPS   │ ~43 tok/s             │ Single long generation            │
  ├───────────────────────┼───────────────────────┼───────────────────────────────────┤
  │ Typical Gen TPS       │ 30-35 tok/s           │ Average sustained output          │
  └───────────────────────┴───────────────────────┴───────────────────────────────────┘
  Model Flops Utilization (MFU)
  ┌────────────────────┬──────────────────────┬─────────────────┐
  │       Metric       │        Range         │      Notes      │
  ├────────────────────┼──────────────────────┼─────────────────┤
  │ Compute (TF/s/GPU) │ 0.1 - 51.6 TF/s      │ Very wide range │
  ├────────────────────┼──────────────────────┼─────────────────┤
  │ Memory Bandwidth   │ 2.9 - 175.6 GB/s/GPU │                 │
  └────────────────────┴──────────────────────┴─────────────────┘
  Prefix Cache Performance
  ┌──────────────────┬───────────────┬───────────────────────────┐
  │      Metric      │     Value     │           Notes           │
  ├──────────────────┼───────────────┼───────────────────────────┤
  │ Hit Rate Range   │ 30.6% - 82.0% │ Varies by workload        │
  ├──────────────────┼───────────────┼───────────────────────────┤
  │ Typical Hit Rate │ ~60-70%       │ During active use         │
  ├──────────────────┼───────────────┼───────────────────────────┤
  │ Cache Usage      │ 0-18.3%       │ Dynamic based on requests │
  └──────────────────┴───────────────┴───────────────────────────┘
  Startup & Initialization
  ┌──────────────────────┬──────────┬─────────────────────────┐
  │        Stage         │ Duration │          Notes          │
  ├──────────────────────┼──────────┼─────────────────────────┤
  │ Model Loading        │ 42.4 sec │ Loading 74.89 GiB model │
  ├──────────────────────┼──────────┼─────────────────────────┤
  │ CUDA Graph Capture   │ 13 sec   │ 51 batch sizes captured │
  ├──────────────────────┼──────────┼─────────────────────────┤
  │ Total Init Time      │ 95.6 sec │ Full engine warmup      │
  ├──────────────────────┼──────────┼─────────────────────────┤
  │ Chat Template Warmup │ 2,411 ms │ First request overhead  │
  └──────────────────────┴──────────┴─────────────────────────┘
  Warnings & Issues
  ┌───────────────────────────┬──────────┬──────────────────────────────────────┐
  │           Issue           │ Severity │             Description              │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ MoE Config Missing        │ Warning  │ Sub-optimal performance for GB10 GPU │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ Mamba Prefix Caching      │ Warning  │ Experimental feature                 │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ AOT Compile Save Failed   │ Warning  │ Cache persistence issue              │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ num_ctx Ignored           │ Info     │ Ollama-compatible param not used     │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ FLA Tensor Shape Warnings │ Info     │ Minor format warnings                │
  ├───────────────────────────┼──────────┼──────────────────────────────────────┤
  │ GDS Not Supported         │ Info     │ GPU Direct Storage unavailable       │
  └───────────────────────────┴──────────┴──────────────────────────────────────┘
  Server Endpoints
  ┌─────────────────┬──────────────────────┐
  │    Endpoint     │       Details        │
  ├─────────────────┼──────────────────────┤
  │ API Server      │ http://0.0.0.0:8000  │
  ├─────────────────┼──────────────────────┤
  │ Supported Tasks │ generate             │
  ├─────────────────┼──────────────────────┤
  │ Tool Parser     │ Qwen3CoderToolParser │
  └─────────────────┴──────────────────────┘
  Key Insights

  1. Excellent for Long Context: 262K token context with 2.95x concurrency for max-length requests
  2. Strong Single-Stream Performance: 30-43 tokens/s generation is solid for coding workloads
  3. High Prompt Throughput Potential: Up to 14K tokens/s with proper batching
  4. Prefix Caching Effective: 60-70% hit rate indicates good cache utilization for repeated prompts
  5. MoE Optimization Opportunity: Missing device-specific config for GB10 GPU - could improve performance
  6. Memory Efficient: FP8 quantization keeps model at ~75GB with ~18GB for KV cache
  7. Experimental Features: Mamba prefix caching is cutting-edge but may have edge cases
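
For anyone who wants to recompute the throughput ranges without an LLM in the loop, here is a rough sketch. It assumes the periodic stats lines contain the labels "Avg prompt throughput: <n> tokens/s" and "Avg generation throughput: <n> tokens/s"; the function name summarize_throughput is made up for this example, so adjust the labels if your build logs something different.

```bash
# Summarize min/max throughput from vLLM stats lines read on stdin.
# Assumption: each stats line contains "Avg prompt throughput: <n> tokens/s"
# and "Avg generation throughput: <n> tokens/s"; adjust the labels otherwise.
summarize_throughput() {
  awk '
    # Return the number that follows "<label>: " on the current line, or -1.
    function grab(label,    s) {
      if (match($0, label ": [0-9.]+")) {
        s = substr($0, RSTART, RLENGTH)
        sub(/^.*: /, "", s)
        return s + 0
      }
      return -1
    }
    {
      p = grab("Avg prompt throughput")
      g = grab("Avg generation throughput")
      if (p >= 0) {
        if (np == 0 || p < pmin) pmin = p
        if (np == 0 || p > pmax) pmax = p
        np++
      }
      if (g >= 0) {
        if (ng == 0 || g < gmin) gmin = g
        if (ng == 0 || g > gmax) gmax = g
        ng++
      }
    }
    END {
      if (np) printf "prompt tok/s: min=%.1f max=%.1f\n", pmin, pmax
      if (ng) printf "generation tok/s: min=%.1f max=%.1f\n", gmin, gmax
    }
  '
}
```

Usage would be something like `docker logs <container> 2>&1 | summarize_throughput`. Idle intervals report 0.0 tokens/s, so the minimum is only meaningful if you filter those lines out first.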

capitangiaco · 2026-02-17T09:52:37Z

--gpu-memory-utilization 0.90 is too high: my Spark went OOM after one hour of coding.
With 0.80, after one day of coding I am at 170,000 tokens, with RAM at 117G/120G and 3.47GB of swap used, but it is still working.
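
A back-of-the-envelope sketch of why the fraction matters here. It assumes roughly 120 GiB of usable memory (taken from the 117G/120G reading above) and that the Spark's memory is shared between GPU and CPU, so whatever vLLM does not claim is all that remains for the OS, page cache and everything else. The function name vllm_budget is hypothetical.

```bash
# Rough budget: memory vLLM may claim vs. what is left for the rest of the system.
# Args: total memory in GiB, --gpu-memory-utilization fraction.
vllm_budget() {
  awk -v total="$1" -v frac="$2" 'BEGIN {
    budget = total * frac
    printf "vllm=%.1f GiB headroom=%.1f GiB\n", budget, total - budget
  }'
}

vllm_budget 120 0.90   # vllm=108.0 GiB headroom=12.0 GiB
vllm_budget 120 0.80   # vllm=96.0 GiB headroom=24.0 GiB
```

This is only arithmetic on the cap, not a measurement; actual usage can exceed it through allocations that vLLM does not account for.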

armand1m · 2026-02-17T13:07:19Z

@capitangiaco indeed, 0.80 is safer. I reduced it as well

armand1m · 2026-02-17T13:11:40Z

also, better to use 0.16.0-t5 at this stage most likely

capitangiaco · 2026-02-18T07:54:38Z

> also, better to use 0.16.0-t5 at this stage most likely

I will try it.
I had to use --max-num-batched-tokens 65536;
with 131072 the system begins to swap at about 130-140K tokens.
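
To catch the point where swapping starts, a small helper like the one below could be polled while the context grows. It is a sketch that reads the standard SwapTotal and SwapFree fields (in kB) from /proc/meminfo; the function name swap_used_mib is made up for this example.

```bash
# Print swap currently in use, in MiB, from meminfo-format text on stdin.
swap_used_mib() {
  awk '
    $1 == "SwapTotal:" { total = $2 }   # values are reported in kB
    $1 == "SwapFree:"  { free = $2 }
    END { printf "%d\n", (total - free) / 1024 }
  '
}

# Example: sample once per 10 seconds on the host (not inside the container)
# while :; do swap_used_mib < /proc/meminfo; sleep 10; done
```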

capitangiaco · 2026-02-18T22:16:45Z

I stopped the Docker container at 114GB; I will retry with -e VLLM_TORCH_COMPILE="0".
The next steps are --max-num-batched-tokens 32K and --gpu-memory-utilization 0.75.
I’m starting to think that with 128GB, the context to use should be 128K.

	#!/bin/bash
	docker run -d \
	--name vllm \
	--restart unless-stopped \
	--gpus all \
	--ipc host \
	--shm-size 64gb \
	--memory 110g \
	--memory-swap 120g \
	--pids-limit 4096 \
	-p 0.0.0.0:18080:8000 \
	-e HF_TOKEN="${HF_TOKEN:-}" \
	-e VLLM_LOGGING_LEVEL="INFO" \
	-e NVIDIA_TF32_OVERRIDE="1" \
	-e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
	-e VLLM_TORCH_COMPILE="1" \
	-e VLLM_FLOAT32_MATMUL_PRECISION="high" \
	-e VLLM_LOG_STATS_INTERVAL="10" \
	-e VLLM_ATTENTION_BACKEND="FLASHINFER" \
	-e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
	-e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
	-e CUDA_VISIBLE_DEVICES="0" \
	-e PYTHONHASHSEED="0" \
	-e VLLM_USE_V2_MODEL_RUNNER="0" \
	-e VLLM_ENABLE_PREFIX_CACHING="1" \
	-e TORCH_CUDA_ARCH_LIST="12.1f" \
	-v "$HOME/huggingface":/root/.cache/huggingface \
	scitrera/dgx-spark-vllm:0.15.1-t5 \
	vllm serve Qwen/Qwen3-Coder-Next-FP8 \
	--served-model-name qwen3-coder-next \
	--load-format fastsafetensors \
	--attention-backend flashinfer \
	--port 8000 \
	--max-model-len 262144 \
	--block-size 128 \
	--max-num-seqs 16 \
	--max-num-batched-tokens 131072 \
	--gpu-memory-utilization 0.80 \
	--kv-cache-dtype auto \
	--enable-prefix-caching \
	--trust-remote-code \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--disable-uvicorn-access-log \
	--kv-cache-metrics \
	--cudagraph-metrics \
	--enable-mfu-metrics \
	-cc.max_cudagraph_capture_size 512 \
	--tensor-parallel-size 1
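
For reference, the next-step values mentioned above would translate into the following overrides of the script. This is a sketch of untested values, assuming "32K" means 32768 and "128K" means 131072 (consistent with 262144 being the 256K setting); the variable and function names are only for illustration.

```bash
# Proposed next-step overrides for the script above; untested, not a recommendation.
MAX_MODEL_LEN=$((128 * 1024))            # 128K context instead of 262144
MAX_NUM_BATCHED_TOKENS=$((32 * 1024))    # "32K" instead of 131072
GPU_MEMORY_UTILIZATION=0.75              # down from 0.80

# Print the flags so they can be pasted into the vllm serve command line.
next_step_flags() {
  printf '%s\n' "--max-model-len $MAX_MODEL_LEN --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS --gpu-memory-utilization $GPU_MEMORY_UTILIZATION"
}

next_step_flags
```

Changing one value at a time makes it easier to tell which setting actually moves the point where swapping begins.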
