Last active February 9, 2026 12:37
qwen3-coder-next - vllm 0.15.1 - transformers 5 - optimized for dgx spark
#!/bin/bash
docker run -d \
  --name vllm \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  --memory 110g \
  --memory-swap 120g \
  --pids-limit 4096 \
  -p 0.0.0.0:18080:8000 \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e NVIDIA_TF32_OVERRIDE="1" \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
  -e VLLM_TORCH_COMPILE="1" \
  -e VLLM_FLOAT32_MATMUL_PRECISION="high" \
  -e VLLM_LOG_STATS_INTERVAL="10" \
  -e VLLM_ATTENTION_BACKEND="FLASHINFER" \
  -e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
  -e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
  -e CUDA_VISIBLE_DEVICES="0" \
  -e PYTHONHASHSEED="0" \
  -e VLLM_USE_V2_MODEL_RUNNER="0" \
  -e VLLM_ENABLE_PREFIX_CACHING="1" \
  -e TORCH_CUDA_ARCH_LIST="12.1f" \
  -v "$HOME/huggingface:/root/.cache/huggingface" \
  scitrera/dgx-spark-vllm:0.15.1-t5 \
  vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --served-model-name qwen3-coder-next \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --port 8000 \
    --max-model-len 262144 \
    --block-size 128 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 131072 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --disable-uvicorn-access-log \
    --kv-cache-metrics \
    --cudagraph-metrics \
    --enable-mfu-metrics \
    -cc.max_cudagraph_capture_size 512 \
    --tensor-parallel-size 1
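Once the container is up, vLLM's OpenAI-compatible API is reachable on host port 18080 (mapped to container port 8000). A minimal smoke-test sketch, assuming the model has finished loading and that `qwen3-coder-next` matches the `--served-model-name` flag above:

```shell
# Confirm startup and watch the periodic stats output (Ctrl-C to detach).
docker logs -f vllm

# List the served models; should include "qwen3-coder-next".
curl -s http://localhost:18080/v1/models

# Minimal chat-completion request against the served model name.
curl -s http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder-next",
        "messages": [{"role": "user", "content": "Write hello world in bash."}],
        "max_tokens": 64
      }'
```

The first model load can take several minutes (weights are cached under `$HOME/huggingface` via the volume mount), so poll `/v1/models` until it responds before sending real traffic.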
GLM4.7 analysed this instance's logs after I made a fairly long code-analysis request using OpenCode in plan mode. Screenshots at the end.