Last active February 9, 2026 12:37
qwen3-coder-next - vllm 0.15.1 - transformers 5 - optimized for dgx spark
#!/bin/bash
docker run -d \
  --name vllm \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  --memory 110g \
  --memory-swap 120g \
  --pids-limit 4096 \
  -p 0.0.0.0:18080:8000 \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e NVIDIA_TF32_OVERRIDE="1" \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
  -e VLLM_TORCH_COMPILE="1" \
  -e VLLM_FLOAT32_MATMUL_PRECISION="high" \
  -e VLLM_LOG_STATS_INTERVAL="10" \
  -e VLLM_ATTENTION_BACKEND="FLASHINFER" \
  -e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
  -e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
  -e CUDA_VISIBLE_DEVICES="0" \
  -e PYTHONHASHSEED="0" \
  -e VLLM_USE_V2_MODEL_RUNNER="0" \
  -e VLLM_ENABLE_PREFIX_CACHING="1" \
  -e TORCH_CUDA_ARCH_LIST="12.1f" \
  -v "$HOME/huggingface:/root/.cache/huggingface" \
  scitrera/dgx-spark-vllm:0.15.1-t5 \
  vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --served-model-name qwen3-coder-next \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --port 8000 \
    --max-model-len 262144 \
    --block-size 128 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 131072 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --disable-uvicorn-access-log \
    --kv-cache-metrics \
    --cudagraph-metrics \
    --enable-mfu-metrics \
    -cc.max_cudagraph_capture_size 512 \
    --tensor-parallel-size 1
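Once the container is up, vLLM's OpenAI-compatible API is reachable on host port 18080 (mapped to container port 8000). A minimal smoke-test sketch, assuming the model has finished loading and that `qwen3-coder-next` matches the `--served-model-name` flag above:

```shell
# Confirm startup and watch the periodic stats output (Ctrl-C to detach).
docker logs -f vllm

# List the served models; should include "qwen3-coder-next".
curl -s http://localhost:18080/v1/models

# Minimal chat-completion request against the served model name.
curl -s http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder-next",
        "messages": [{"role": "user", "content": "Write hello world in bash."}],
        "max_tokens": 64
      }'
```

The first model load can take several minutes (weights are cached under `$HOME/huggingface` via the volume mount), so poll `/v1/models` until it responds before sending real traffic.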
GLM4.7 analysed this instance's logs after I made a fairly long code-analysis request using OpenCode in plan mode. Screenshots at the end.