@nerdalert
Created April 23, 2025 02:04
podman run --rm -it \
    --network host \
    -e MODEL=meta-llama/Llama-3.2-1B \
    -e FRAMEWORK=vllm \
    -e HF_TOKEN="${HF_TOKEN}" \
    -e PORT=8000 \
    -e HOST=172.31.37.101 \
    -v "$(pwd)":/host:Z \
    -w /opt/benchmark \
    quay.io/bsalisbu/vllm-benchmark:latest
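
Each run below appears to wrap vLLM's benchmark_serving.py. For reference, a sketch of roughly the equivalent direct invocation for the first (1 QPS, 120-prompt) run; flag names are inferred from the Namespace(...) dump printed by the script and may differ between vLLM versions:

# Sketch only: flags inferred from the Namespace dump below, not taken from the container.
python3 benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model meta-llama/Llama-3.2-1B \
    --dataset-name random \
    --random-input-len 1000 --random-output-len 100 \
    --num-prompts 120 --request-rate 1 --seed 1 \
    --ignore-eos \
    --percentile-metrics ttft,tpot,itl --metric-percentiles 99 \
    --metadata framework=vllm \
    --save-result --result-filename results.json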

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 120 PROMPTS WITH 1 QPS =====

INFO 04-23 01:38:59 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=120, logprobs=None, request_rate=1.0, burstiness=1.0, seed=1, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 175MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 44.6MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 301/301 [00:00<00:00, 3.75MB/s]
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [01:39<00:00,  1.21it/s]
============ Serving Benchmark Result ============
Successful requests:                     120
Benchmark duration (s):                  99.20
Total input tokens:                      120000
Total generated tokens:                  12000
Request throughput (req/s):              1.21
Output token throughput (tok/s):         120.96
Total Token throughput (tok/s):          1330.59
---------------Time to First Token----------------
Mean TTFT (ms):                          56.52
Median TTFT (ms):                        55.59
P99 TTFT (ms):                           81.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.62
Median TPOT (ms):                        7.52
P99 TPOT (ms):                           8.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.62
Median ITL (ms):                         7.32
P99 ITL (ms):                            8.29
==================================================
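
The summary throughputs follow directly from the totals above (requests and tokens divided by the benchmark duration). A minimal sanity-check sketch using the rounded figures from this 1 QPS run; the small differences from the report come from the unrounded duration (99.204 s):

awk 'BEGIN {
    dur = 99.20; reqs = 120; in_tok = 120000; out_tok = 12000
    printf "req/s=%.2f  out_tok/s=%.2f  total_tok/s=%.2f\n", reqs/dur, out_tok/dur, (in_tok+out_tok)/dur
}'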

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 1200 PROMPTS WITH 10 QPS =====

INFO 04-23 01:40:50 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=1200, logprobs=None, request_rate=10.0, burstiness=1.0, seed=10, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1200/1200 [02:04<00:00,  9.61it/s]
============ Serving Benchmark Result ============
Successful requests:                     1200
Benchmark duration (s):                  124.91
Total input tokens:                      1200000
Total generated tokens:                  120000
Request throughput (req/s):              9.61
Output token throughput (tok/s):         960.68
Total Token throughput (tok/s):          10567.47
---------------Time to First Token----------------
Mean TTFT (ms):                          98.16
Median TTFT (ms):                        73.95
P99 TTFT (ms):                           263.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.85
Median TPOT (ms):                        14.92
P99 TPOT (ms):                           28.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.85
Median ITL (ms):                         9.18
P99 ITL (ms):                            87.65
==================================================

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 2400 PROMPTS WITH 20 QPS =====

INFO 04-23 01:43:04 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=2400, logprobs=None, request_rate=20.0, burstiness=1.0, seed=20, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 20.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2400/2400 [03:11<00:00, 12.52it/s]
============ Serving Benchmark Result ============
Successful requests:                     2400
Benchmark duration (s):                  191.71
Total input tokens:                      2400000
Total generated tokens:                  240000
Request throughput (req/s):              12.52
Output token throughput (tok/s):         1251.89
Total Token throughput (tok/s):          13770.75
---------------Time to First Token----------------
Mean TTFT (ms):                          32079.74
Median TTFT (ms):                        32282.15
P99 TTFT (ms):                           65698.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          138.22
Median TPOT (ms):                        144.00
P99 TPOT (ms):                           146.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.22
Median ITL (ms):                         143.88
P99 ITL (ms):                            149.88
==================================================

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 3600 PROMPTS WITH 30 QPS =====

INFO 04-23 01:46:24 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=3600, logprobs=None, request_rate=30.0, burstiness=1.0, seed=30, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3600/3600 [04:48<00:00, 12.48it/s]
============ Serving Benchmark Result ============
Successful requests:                     3600
Benchmark duration (s):                  288.57
Total input tokens:                      3600000
Total generated tokens:                  360000
Request throughput (req/s):              12.48
Output token throughput (tok/s):         1247.52
Total Token throughput (tok/s):          13722.74
---------------Time to First Token----------------
Mean TTFT (ms):                          78454.68
Median TTFT (ms):                        78732.81
P99 TTFT (ms):                           159386.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          140.87
Median TPOT (ms):                        144.18
P99 TPOT (ms):                           147.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           140.87
Median ITL (ms):                         144.07
P99 ITL (ms):                            150.84
==================================================

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 4200 PROMPTS WITH 35 QPS =====

INFO 04-23 01:51:23 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=4200, logprobs=None, request_rate=35.0, burstiness=1.0, seed=35, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 35.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4200/4200 [05:37<00:00, 12.46it/s]
============ Serving Benchmark Result ============
Successful requests:                     4200
Benchmark duration (s):                  337.13
Total input tokens:                      4200000
Total generated tokens:                  420000
Request throughput (req/s):              12.46
Output token throughput (tok/s):         1245.80
Total Token throughput (tok/s):          13703.79
---------------Time to First Token----------------
Mean TTFT (ms):                          103607.32
Median TTFT (ms):                        103651.33
P99 TTFT (ms):                           206849.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          141.48
Median TPOT (ms):                        144.37
P99 TPOT (ms):                           145.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           141.49
Median ITL (ms):                         144.32
P99 ITL (ms):                            150.33
==================================================
INFO 04-23 01:57:10 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=2000, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=100, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [02:41<00:00, 12.39it/s]
============ Serving Benchmark Result ============
Successful requests:                     2000
Benchmark duration (s):                  161.38
Total input tokens:                      2000000
Total generated tokens:                  200000
Request throughput (req/s):              12.39
Output token throughput (tok/s):         1239.28
Total Token throughput (tok/s):          13632.09
---------------Time to First Token----------------
Mean TTFT (ms):                          78936.52
Median TTFT (ms):                        78913.56
P99 TTFT (ms):                           157280.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          137.98
Median TPOT (ms):                        144.11
P99 TPOT (ms):                           145.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           137.98
Median ITL (ms):                         144.17
P99 ITL (ms):                            147.89
==================================================
Copied results.json to /host/results.json
ubuntu@ip-172-31-37-101:~$ cat results.json
{"date": "20250423-014046", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 120, "framework": "vllm", "request_rate": 1.0, "burstiness": 1.0, "max_concurrency": null, "duration": 99.20402940898202, "completed": 120, "total_input_tokens": 120000, "total_output_tokens": 12000, "request_throughput": 1.2096282854125187, "request_goodput:": null, "output_throughput": 120.96282854125187, "total_token_throughput": 1330.5911139537704, "mean_ttft_ms": 56.522783224742554, "median_ttft_ms": 55.592673481442034, "std_ttft_ms": 6.728989469262301, "p99_ttft_ms": 81.07384593226018, "mean_tpot_ms": 7.617808739277057, "median_tpot_ms": 7.515317787747887, "std_tpot_ms": 0.5053597521484597, "p99_tpot_ms": 8.848136582304818, "mean_itl_ms": 7.617811373105117, "median_itl_ms": 7.317398500163108, "std_itl_ms": 3.6924498823862617, "p99_itl_ms": 8.286327839014117}
{"date": "20250423-014259", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 1200, "framework": "vllm", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 124.91169368900592, "completed": 1200, "total_input_tokens": 1200000, "total_output_tokens": 120000, "request_throughput": 9.606786719165411, "request_goodput:": null, "output_throughput": 960.6786719165411, "total_token_throughput": 10567.465391081952, "mean_ttft_ms": 98.15984705720136, "median_ttft_ms": 73.9472495042719, "std_ttft_ms": 48.281000487364004, "p99_ttft_ms": 263.66652443481144, "mean_tpot_ms": 15.850709780365909, "median_tpot_ms": 14.91638925779056, "std_tpot_ms": 3.9411592809454365, "p99_tpot_ms": 28.35926311913727, "mean_itl_ms": 15.850712099952466, "median_itl_ms": 9.17535848566331, "std_itl_ms": 17.584085113844704, "p99_itl_ms": 87.65007978945505}
{"date": "20250423-014620", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2400, "framework": "vllm", "request_rate": 20.0, "burstiness": 1.0, "max_concurrency": null, "duration": 191.71071356604807, "completed": 2400, "total_input_tokens": 2400000, "total_output_tokens": 240000, "request_throughput": 12.51886217184807, "request_goodput:": null, "output_throughput": 1251.886217184807, "total_token_throughput": 13770.748389032879, "mean_ttft_ms": 32079.74456989093, "median_ttft_ms": 32282.153412525076, "std_ttft_ms": 19872.650014440318, "p99_ttft_ms": 65698.2901148073, "mean_tpot_ms": 138.221919501593, "median_tpot_ms": 143.99715812113655, "std_tpot_ms": 20.24949801537539, "p99_tpot_ms": 146.88849325161289, "mean_itl_ms": 138.22192137724542, "median_itl_ms": 143.88436550507322, "std_itl_ms": 26.32745545926573, "p99_itl_ms": 149.87629577866755}
{"date": "20250423-015119", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 3600, "framework": "vllm", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 288.57216349599184, "completed": 3600, "total_input_tokens": 3600000, "total_output_tokens": 360000, "request_throughput": 12.475215753268602, "request_goodput:": null, "output_throughput": 1247.52157532686, "total_token_throughput": 13722.737328595462, "mean_ttft_ms": 78454.67743228166, "median_ttft_ms": 78732.81360047986, "std_ttft_ms": 47516.60846174074, "p99_ttft_ms": 159386.35507877916, "mean_tpot_ms": 140.8697129594008, "median_tpot_ms": 144.17734174257745, "std_tpot_ms": 15.979013714543491, "p99_tpot_ms": 147.78838338305107, "mean_itl_ms": 140.87050527197087, "median_itl_ms": 144.0694895281922, "std_itl_ms": 21.929482712762724, "p99_itl_ms": 150.83982544718302}
{"date": "20250423-015706", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 4200, "framework": "vllm", "request_rate": 35.0, "burstiness": 1.0, "max_concurrency": null, "duration": 337.13310115702916, "completed": 4200, "total_input_tokens": 4200000, "total_output_tokens": 420000, "request_throughput": 12.457987618497695, "request_goodput:": null, "output_throughput": 1245.7987618497693, "total_token_throughput": 13703.786380347465, "mean_ttft_ms": 103607.31987829301, "median_ttft_ms": 103651.3340875099, "std_ttft_ms": 60947.60150211268, "p99_ttft_ms": 206849.30683836402, "mean_tpot_ms": 141.48402127307423, "median_tpot_ms": 144.37082749991347, "std_tpot_ms": 14.771168098280647, "p99_tpot_ms": 145.85994477524903, "mean_itl_ms": 141.48606493697073, "median_itl_ms": 144.31709697237238, "std_itl_ms": 20.9435129708143, "p99_itl_ms": 150.32883231819142}
{"date": "20250423-015957", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 2000, "framework": "vllm", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 161.38386876898585, "completed": 2000, "total_input_tokens": 2000000, "total_output_tokens": 200000, "request_throughput": 12.392812337786468, "request_goodput:": null, "output_throughput": 1239.2812337786468, "total_token_throughput": 13632.093571565114, "mean_ttft_ms": 78936.52451374804, "median_ttft_ms": 78913.55547102285, "std_ttft_ms": 46142.76403234757, "p99_ttft_ms": 157280.687647562, "mean_tpot_ms": 137.98037457196665, "median_tpot_ms": 144.1084595101373, "std_tpot_ms": 20.785586301110747, "p99_tpot_ms": 145.6412611889472, "mean_itl_ms": 137.98037627882812, "median_itl_ms": 144.16968447039835, "std_itl_ms": 25.85542480357516, "p99_itl_ms": 147.8862512583146}