
@nerdalert
Created April 23, 2025 22:33
$ python ./benchmark-e2e.py --port 8000 --model "meta-llama/Llama-3.2-1B" --cuda-device 0
Using port: 8000
Removing /home/ubuntu/vllm/benchmark-e2e/benchmark-compare
Removing /home/ubuntu/vllm/benchmark-e2e/venv-vllm
Removing /home/ubuntu/vllm/benchmark-e2e/venv-vllm-src
Removing /home/ubuntu/vllm/benchmark-e2e/venv-sgl
▶ git clone https://github.com/neuralmagic/benchmark-compare.git /home/ubuntu/vllm/benchmark-e2e/benchmark-compare
Cloning into '/home/ubuntu/vllm/benchmark-e2e/benchmark-compare'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 78 (delta 34), reused 55 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (78/78), 19.82 KiB | 9.91 MiB/s, done.
Resolving deltas: 100% (34/34), done.
▶ git clone https://github.com/vllm-project/vllm.git /home/ubuntu/vllm/benchmark-e2e/benchmark-compare/vllm
Cloning into '/home/ubuntu/vllm/benchmark-e2e/benchmark-compare/vllm'...
remote: Enumerating objects: 68654, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (174/174), done.
remote: Total 68654 (delta 151), reused 69 (delta 68), pack-reused 68412 (from 3)
Receiving objects: 100% (68654/68654), 46.16 MiB | 36.33 MiB/s, done.
Resolving deltas: 100% (53425/53425), done.
▶ git -C /home/ubuntu/vllm/benchmark-e2e/benchmark-compare/vllm checkout benchmark-output
branch 'benchmark-output' set up to track 'origin/benchmark-output'.
Switched to a new branch 'benchmark-output'
▶ Running vllm
=== vllm benchmark start ===
▶ uv venv venv-vllm --python 3.12
▶ bash -c source venv-vllm/bin/activate && uv pip install vllm==0.8.3
vllm package installed in venv-vllm
▶ source venv-vllm/bin/activate && vllm serve meta-llama/Llama-3.2-1B --disable-log-requests --port 8000
Started vllm serve (pid=88163)
Waiting for vllm to load…
vllm inference server ready at http://localhost:8000/v1/models
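The driver blocks here until the server answers on the OpenAI-compatible /v1/models route. benchmark-e2e.py's own wait loop isn't shown in this transcript; a minimal sketch of that kind of readiness poll (hypothetical helper name, requests-based) might look like:

```python
# Minimal readiness-poll sketch (assumed logic; benchmark-e2e.py's actual
# implementation is not shown here). Polls the OpenAI-compatible /v1/models
# endpoint until the server answers with HTTP 200 or the timeout expires.
import time
import requests

def wait_for_server(base_url: str = "http://localhost:8000", timeout_s: int = 300) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).status_code == 200:
                return  # server is serving the model list; ready for traffic
        except requests.RequestException:
            pass  # server process is still loading weights
        time.sleep(2)
    raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")
```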
Creating venv-vllm-src in /home/ubuntu/vllm/benchmark-e2e/benchmark-compare/vllm
▶ uv venv venv-vllm-src --python 3.12
▶ source venv-vllm-src/bin/activate && export VLLM_USE_PRECOMPILED=1 && uv pip install -e . && uv pip install numpy pandas datasets
▶ bash -c source venv-vllm-src/bin/activate && export VLLM_USE_PRECOMPILED=1 && uv pip install -e . && uv pip install numpy pandas datasets
vllm-src dependencies installed (precompiled)
>>> Starting vllm benchmark; output → bench-vllm.log
▶ source vllm/venv-vllm-src/bin/activate && VLLM_USE_PRECOMPILED=1 MODEL=meta-llama/Llama-3.2-1B FRAMEWORK=vllm bash ./benchmark_1000_in_100_out.sh
vllm benchmark script completed
Stopping vllm server (pid=88163)
=== vllm benchmark done ===
✓ vllm completed
Killing vllm serve process group
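The serve process is launched with a recorded pid and later torn down as a whole process group, which is what also cleans up vLLM's worker processes, not just the parent. The driver's real code isn't shown in this transcript; a standard-library sketch of that pattern (assumed, not the script's actual implementation):

```python
# Sketch of launching a server in its own process group and tearing it down
# (assumed approach). start_new_session=True makes the child a process-group
# leader, so killpg reaches the whole tree, including vllm's workers.
import os
import signal
import subprocess

proc = subprocess.Popen(
    ["vllm", "serve", "meta-llama/Llama-3.2-1B", "--port", "8000"],
    start_new_session=True,  # child gets its own session and process group
)
# ... run benchmarks against the server ...
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # signal the entire group
proc.wait()
```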
▶ Running sglang
=== sglang benchmark start ===
▶ uv venv venv-sgl --python 3.12
▶ source venv-sgl/bin/activate && uv pip install "sglang[all]==0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
▶ bash -c source venv-sgl/bin/activate && uv pip install "sglang[all]==0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
sglang package installed in venv-sgl
▶ source venv-sgl/bin/activate && python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B --host 0.0.0.0 --port 8000
Started sglang serve (pid=88656)
Waiting for sglang to load…
sglang inference server ready at http://localhost:8000/v1/models
>>> Starting sglang benchmark; output → bench-sglang.log
▶ source vllm/venv-vllm-src/bin/activate && VLLM_USE_PRECOMPILED=1 MODEL=meta-llama/Llama-3.2-1B FRAMEWORK=sgl bash ./benchmark_1000_in_100_out.sh
sglang benchmark script completed
Stopping sglang server (pid=88656)
=== sglang benchmark done ===
✓ sglang completed
✅ Benchmark results are in benchmark-compare/results.json

(venv) ubuntu@ip-172-31-37-101:~/vllm/benchmark-e2e$ cat benchmark-compare/results.json
{"date": "20250423-203243", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 120, "framework": "vllm", "request_rate": 1.0, "burstiness": 1.0, "max_concurrency": null, "duration": 99.20296433399199, "completed": 120, "total_input_tokens": 120000, "total_output_tokens": 12000, "request_throughput": 1.2096412723715544, "request_goodput:": null, "output_throughput": 120.96412723715544, "total_token_throughput": 1330.6053996087098, "mean_ttft_ms": 56.39591719470142, "median_ttft_ms": 56.09072252991609, "std_ttft_ms": 6.521620934952344, "p99_ttft_ms": 80.00854982470631, "mean_tpot_ms": 7.612752920024094, "median_tpot_ms": 7.548823823324508, "std_tpot_ms": 0.5057545087410885, "p99_tpot_ms": 8.889770552026805, "mean_itl_ms": 7.612755184605881, "median_itl_ms": 7.326202990952879, "std_itl_ms": 3.696505481308391, "p99_itl_ms": 8.354500485002056}
{"date": "20250423-203456", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 1200, "framework": "vllm", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 124.88530976703623, "completed": 1200, "total_input_tokens": 1200000, "total_output_tokens": 120000, "request_throughput": 9.608816299038743, "request_goodput:": null, "output_throughput": 960.8816299038742, "total_token_throughput": 10569.697928942616, "mean_ttft_ms": 97.82328926060775, "median_ttft_ms": 72.64946450595744, "std_ttft_ms": 48.52175193647459, "p99_ttft_ms": 265.2696314576315, "mean_tpot_ms": 15.799513965929279, "median_tpot_ms": 14.92470340930264, "std_tpot_ms": 3.900820342697011, "p99_tpot_ms": 27.713377848138204, "mean_itl_ms": 15.799515945636017, "median_itl_ms": 9.164708491880447, "std_itl_ms": 17.522955437768, "p99_itl_ms": 87.39991699578235}
{"date": "20250423-203818", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2400, "framework": "vllm", "request_rate": 20.0, "burstiness": 1.0, "max_concurrency": null, "duration": 191.70802650001133, "completed": 2400, "total_input_tokens": 2400000, "total_output_tokens": 240000, "request_throughput": 12.519037641858246, "request_goodput:": null, "output_throughput": 1251.9037641858247, "total_token_throughput": 13770.94140604407, "mean_ttft_ms": 32163.12611847175, "median_ttft_ms": 32384.989118028898, "std_ttft_ms": 19905.942609982794, "p99_ttft_ms": 65807.87219597376, "mean_tpot_ms": 138.21457556062197, "median_tpot_ms": 143.98218807058805, "std_tpot_ms": 20.21536058423136, "p99_tpot_ms": 146.94779215195345, "mean_itl_ms": 138.21457713794328, "median_itl_ms": 143.79463696968742, "std_itl_ms": 26.379256322744325, "p99_itl_ms": 151.98497631994542}
{"date": "20250423-204142", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 3600, "framework": "vllm", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 194.24181955197128, "completed": 2436, "total_input_tokens": 2436000, "total_output_tokens": 243600, "request_throughput": 12.541068682422555, "request_goodput:": null, "output_throughput": 1254.1068682422554, "total_token_throughput": 13795.17555066481, "mean_ttft_ms": 44395.219771976874, "median_ttft_ms": 51845.26359249139, "std_ttft_ms": 23358.47414596905, "p99_ttft_ms": 67145.04564490053, "mean_tpot_ms": 139.03073822772842, "median_tpot_ms": 144.20075422197561, "std_tpot_ms": 19.124088738974987, "p99_tpot_ms": 145.4723847309431, "mean_itl_ms": 139.0307399507849, "median_itl_ms": 144.15016697603278, "std_itl_ms": 24.654944623205477, "p99_itl_ms": 150.51576657802798}
{"date": "20250423-204507", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 4200, "framework": "vllm", "request_rate": 35.0, "burstiness": 1.0, "max_concurrency": null, "duration": 195.05380157195032, "completed": 2441, "total_input_tokens": 2441000, "total_output_tokens": 244100, "request_throughput": 12.51449589973553, "request_goodput:": null, "output_throughput": 1251.449589973553, "total_token_throughput": 13765.945489709084, "mean_ttft_ms": 47584.552676985964, "median_ttft_ms": 59682.9047119827, "std_ttft_ms": 22970.02669528606, "p99_ttft_ms": 67270.3901653993, "mean_tpot_ms": 139.1113751169461, "median_tpot_ms": 144.10068984837224, "std_tpot_ms": 19.01281195587184, "p99_tpot_ms": 145.62538136772793, "mean_itl_ms": 139.11137690801863, "median_itl_ms": 144.09897604491562, "std_itl_ms": 24.703271322069615, "p99_itl_ms": 151.285376381129}
{"date": "20250423-204639", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2000, "framework": "vllm", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 81.89244579599472, "completed": 1011, "total_input_tokens": 1011000, "total_output_tokens": 101100, "request_throughput": 12.345461051664513, "request_goodput:": null, "output_throughput": 1234.5461051664513, "total_token_throughput": 13580.007156830963, "mean_ttft_ms": 39391.7349319032, "median_ttft_ms": 39235.60468602227, "std_ttft_ms": 23149.29754256831, "p99_ttft_ms": 78942.25678030052, "mean_tpot_ms": 131.68913699540624, "median_tpot_ms": 143.90260938348044, "std_tpot_ms": 27.81174315708015, "p99_tpot_ms": 145.27689228675123, "mean_itl_ms": 131.6891386227997, "median_itl_ms": 143.93187500536442, "std_itl_ms": 35.096189995818726, "p99_itl_ms": 148.05063619976863}
{"date": "20250423-204902", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 120, "framework": "sgl", "request_rate": 1.0, "burstiness": 1.0, "max_concurrency": null, "duration": 99.11668273800751, "completed": 120, "total_input_tokens": 120000, "total_output_tokens": 12000, "request_throughput": 1.2106942714900255, "request_goodput:": null, "output_throughput": 121.06942714900256, "total_token_throughput": 1331.7636986390282, "mean_ttft_ms": 63.842221374216024, "median_ttft_ms": 62.524924491299316, "std_ttft_ms": 9.18954094381228, "p99_ttft_ms": 99.52372966450642, "mean_tpot_ms": 6.89890064829345, "median_tpot_ms": 6.919228757562285, "std_tpot_ms": 0.38155698694133783, "p99_tpot_ms": 8.134621852777213, "mean_itl_ms": 6.898903317589967, "median_itl_ms": 6.520244991406798, "std_itl_ms": 3.5525888365527383, "p99_itl_ms": 18.186672396841455}
{"date": "20250423-205115", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 1200, "framework": "sgl", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 124.96947488398291, "completed": 1200, "total_input_tokens": 1200000, "total_output_tokens": 120000, "request_throughput": 9.602344901537244, "request_goodput:": null, "output_throughput": 960.2344901537244, "total_token_throughput": 10562.579391690968, "mean_ttft_ms": 102.50936468490787, "median_ttft_ms": 86.04621447739191, "std_ttft_ms": 41.184887435585594, "p99_ttft_ms": 228.6287149158306, "mean_tpot_ms": 19.238861276672996, "median_tpot_ms": 18.92460493436742, "std_tpot_ms": 4.364665641491298, "p99_tpot_ms": 31.518538754799103, "mean_itl_ms": 19.238864106771608, "median_itl_ms": 10.437523975269869, "std_itl_ms": 33.700984159225726, "p99_itl_ms": 150.8452844049316}
{"date": "20250423-205339", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2400, "framework": "sgl", "request_rate": 20.0, "burstiness": 1.0, "max_concurrency": null, "duration": 134.6164819910191, "completed": 2400, "total_input_tokens": 2400000, "total_output_tokens": 240000, "request_throughput": 17.82842609243135, "request_goodput:": null, "output_throughput": 1782.842609243135, "total_token_throughput": 19611.268701674486, "mean_ttft_ms": 3616.1961309204344, "median_ttft_ms": 3964.495359483408, "std_ttft_ms": 2363.2753206295847, "p99_ttft_ms": 7834.874848054604, "mean_tpot_ms": 239.20506883086216, "median_tpot_ms": 262.5904571464299, "std_tpot_ms": 56.57863210291997, "p99_tpot_ms": 276.37519921712936, "mean_itl_ms": 239.205071390837, "median_itl_ms": 66.66208349633962, "std_itl_ms": 526.4404915922366, "p99_itl_ms": 2631.3248439948075}
{"date": "20250423-205634", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 3600, "framework": "sgl", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 164.3726693429635, "completed": 2937, "total_input_tokens": 2937000, "total_output_tokens": 293700, "request_throughput": 17.867933956051726, "request_goodput:": null, "output_throughput": 1786.7933956051727, "total_token_throughput": 19654.7273516569, "mean_ttft_ms": 20980.76531821706, "median_ttft_ms": 28319.143644999713, "std_ttft_ms": 11389.005766771883, "p99_ttft_ms": 38260.051296057645, "mean_tpot_ms": 178.5708749128471, "median_tpot_ms": 177.23527537350722, "std_tpot_ms": 66.00290716281911, "p99_tpot_ms": 275.670784812684, "mean_itl_ms": 178.5487708340278, "median_itl_ms": 64.20331704430282, "std_itl_ms": 1181.0858748120659, "p99_itl_ms": 524.3194170287281}
{"date": "20250423-205929", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 4200, "framework": "sgl", "request_rate": 35.0, "burstiness": 1.0, "max_concurrency": null, "duration": 164.29364342597546, "completed": 2944, "total_input_tokens": 2944000, "total_output_tokens": 294400, "request_throughput": 17.919135144911774, "request_goodput:": null, "output_throughput": 1791.9135144911772, "total_token_throughput": 19711.04865940295, "mean_ttft_ms": 23964.350407064496, "median_ttft_ms": 29287.7586199902, "std_ttft_ms": 11026.318904528236, "p99_ttft_ms": 40339.285274511785, "mean_tpot_ms": 175.7090356854724, "median_tpot_ms": 175.12939626796668, "std_tpot_ms": 66.35838080835629, "p99_tpot_ms": 275.290026722477, "mean_itl_ms": 175.6873376531382, "median_itl_ms": 64.22396903508343, "std_itl_ms": 1185.0453740292687, "p99_itl_ms": 539.5825622289263}
{"date": "20250423-210037", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2000, "framework": "sgl", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 58.486214381002355, "completed": 1011, "total_input_tokens": 1011000, "total_output_tokens": 101100, "request_throughput": 17.2861247851322, "request_goodput:": null, "output_throughput": 1728.6124785132197, "total_token_throughput": 19014.737263645417, "mean_ttft_ms": 26123.602393004858, "median_ttft_ms": 28916.340080962982, "std_ttft_ms": 15140.76651942909, "p99_ttft_ms": 55773.4730803757, "mean_tpot_ms": 168.74380156993698, "median_tpot_ms": 167.6001953433804, "std_tpot_ms": 74.29706224953114, "p99_tpot_ms": 470.09156390964574, "mean_itl_ms": 168.7134623395603, "median_itl_ms": 63.90715204179287, "std_itl_ms": 1191.105812308519, "p99_itl_ms": 667.586823563407}