This also works when saved as `E=129,N=352,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json`, but I don't see a clear improvement.

Server command

$ python3 -m sglang.launch_server \
    --host 0.0.0.0 --port "12345" \
    --model-path zai-org/GLM-4.5-Air-FP8 \
    --tp-size 4 \
    --tool-call-parser glm45 --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
    --cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 256 512 640 \
    --cuda-graph-max-bs 640 \
    --mem-fraction-static 0.8 \
    --max-running-requests 256

BEFORE

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     10240     
Benchmark duration (s):                  352.15    
Total input tokens:                      3110292   
Total generated tokens:                  2013443   
Total generated tokens (retokenized):    2012696   
Request throughput (req/s):              29.08     
Input token throughput (tok/s):          8832.24   
Output token throughput (tok/s):         5717.54   
Total token throughput (tok/s):          14549.77  
Concurrency:                             5087.98   
Accept length:                           2.56      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   174975.07 
Median E2E Latency (ms):                 175330.96 
---------------Time to First Token----------------
Mean TTFT (ms):                          166851.21 
Median TTFT (ms):                        166935.17 
P99 TTFT (ms):                           336071.89 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           41.53     
Median ITL (ms):                         36.15     
P95 ITL (ms):                            110.09    
P99 ITL (ms):                            122.53    
Max ITL (ms):                            974.68    
==================================================

AFTER (with tuned config)

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     10240     
Benchmark duration (s):                  362.95    
Total input tokens:                      3110292   
Total generated tokens:                  2013443   
Total generated tokens (retokenized):    2012703   
Request throughput (req/s):              28.21     
Input token throughput (tok/s):          8569.39   
Output token throughput (tok/s):         5547.38   
Total token throughput (tok/s):          14116.77  
Concurrency:                             5130.75   
Accept length:                           2.55      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   181858.06 
Median E2E Latency (ms):                 182695.72 
---------------Time to First Token----------------
Mean TTFT (ms):                          173482.95 
Median TTFT (ms):                        173571.97 
P99 TTFT (ms):                           346657.98 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           42.82     
Median ITL (ms):                         36.42     
P95 ITL (ms):                            111.32    
P99 ITL (ms):                            126.34    
Max ITL (ms):                            1774.48   
==================================================
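The note above says there is no clear improvement, and the headline numbers bear that out: the tuned run is slightly slower. A quick sanity check on the deltas, using the figures from the two tables:

```python
# Percent change between the BEFORE and AFTER benchmark runs above.
def pct_change(before: float, after: float) -> float:
    return (after - before) / before * 100

total_tok = pct_change(14549.77, 14116.77)    # total token throughput (tok/s)
mean_ttft = pct_change(166851.21, 173482.95)  # mean TTFT (ms); lower is better
print(f"total tok/s: {total_tok:+.1f}%  mean TTFT: {mean_ttft:+.1f}%")
# total tok/s: -3.0%  mean TTFT: +4.0%
```

Roughly a 3% throughput drop and a 4% higher mean TTFT with the tuned config, well within the range of run-to-run noise but certainly not a win.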
Tuned config:

{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  },
  "2": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 64,
    "num_warps": 4,
    "num_stages": 4
  },
  "4": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  },
  "8": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 5
  },
  "16": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  },
  "24": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "32": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 32,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "48": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 5
  },
  "64": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 4,
    "num_stages": 5
  },
  "96": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 32,
    "num_warps": 4,
    "num_stages": 5
  },
  "128": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 5
  },
  "256": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 8,
    "num_stages": 3
  },
  "512": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 8,
    "num_stages": 3
  },
  "1024": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 8,
    "num_stages": 3
  },
  "1536": {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 8,
    "num_stages": 3
  },
  "2048": {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 256,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 8,
    "num_stages": 4
  },
  "3072": {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 256,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 8,
    "num_stages": 4
  },
  "4096": {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 256,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 16,
    "num_warps": 8,
    "num_stages": 4
  }
}
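Per-batch-size configs like the JSON above are typically consumed by picking the tuned entry whose key is closest to the actual token count M of the current batch; the exact lookup SGLang performs may differ, so this is a sketch of the idea only:

```python
# Sketch: select the tuned Triton kernel config whose batch-size key is
# nearest to the actual number of tokens M (assumed lookup behavior).
def pick_config(configs: dict[str, dict], M: int) -> dict:
    key = min(configs, key=lambda k: abs(int(k) - M))
    return configs[key]

# Abbreviated subset of the config above, for illustration.
configs = {
    "1":    {"BLOCK_SIZE_M": 16},
    "256":  {"BLOCK_SIZE_M": 64},
    "2048": {"BLOCK_SIZE_M": 128},
}
print(pick_config(configs, 300))  # nearest key is "256"
```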