@surajssd
Created February 23, 2025 05:16
Running over qps list 1
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 1
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_1.json --request-rate 1 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_1.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 5.80MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17.2M/17.2M [00:00<00:00, 41.7MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 834kB/s]
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [03:48<00:00, 1.14s/it]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 228.85
Total input tokens: 42659
Total generated tokens: 42657
Request throughput (req/s): 0.87
Output token throughput (tok/s): 186.40
Total Token throughput (tok/s): 372.80
---------------Time to First Token----------------
Mean TTFT (ms): 179.83
Median TTFT (ms): 137.75
P99 TTFT (ms): 431.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 67.84
Median TPOT (ms): 67.34
P99 TPOT (ms): 82.40
---------------Inter-token Latency----------------
Mean ITL (ms): 67.76
Median ITL (ms): 63.25
P99 ITL (ms): 224.09
==================================================
/vllm-workspace
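
The "Traffic request rate" and "Burstiness factor: 1.0 (Poisson process)" lines above describe how the client spaces out its 200 requests: with burstiness 1.0, inter-arrival gaps are exponentially distributed with mean 1/request_rate, i.e. a Poisson arrival process. The sketch below illustrates that sampling; the function name, gamma parameterization, and seed handling are assumptions for illustration, not the verbatim logic of benchmark_serving.py.

```python
import numpy as np

def arrival_gaps(request_rate: float, burstiness: float = 1.0,
                 n: int = 200, seed: int = 0) -> np.ndarray:
    """Sample n inter-arrival gaps (seconds) for a target request rate.

    With burstiness == 1.0 the gamma distribution reduces to an exponential
    with mean 1/request_rate, i.e. Poisson arrivals, matching the
    "Burstiness factor: 1.0 (Poisson process)" line in the log.
    (Hypothetical helper, not taken from benchmark_serving.py.)
    """
    if np.isinf(request_rate):
        # qps inf case: issue every request immediately, no gaps
        return np.zeros(n)
    rng = np.random.default_rng(seed)
    theta = 1.0 / (request_rate * burstiness)  # scale so the mean gap is 1/request_rate
    return rng.gamma(shape=burstiness, scale=theta, size=n)

# At qps 1, 200 requests should be issued over roughly 200 s, consistent with
# the 228.85 s benchmark duration above (the tail requests are still streaming).
print(arrival_gaps(1.0).sum())
```
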
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 4
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_4.json --request-rate 4 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=4.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_4.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:38<00:00, 2.03it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 98.63
Total input tokens: 42659
Total generated tokens: 42673
Request throughput (req/s): 2.03
Output token throughput (tok/s): 432.67
Total Token throughput (tok/s): 865.20
---------------Time to First Token----------------
Mean TTFT (ms): 253.09
Median TTFT (ms): 210.66
P99 TTFT (ms): 593.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 102.31
Median TPOT (ms): 102.09
P99 TPOT (ms): 153.32
---------------Inter-token Latency----------------
Mean ITL (ms): 96.37
Median ITL (ms): 84.40
P99 ITL (ms): 377.67
==================================================
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 16
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_16.json --request-rate 16 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=16.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_16.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 16.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:14<00:00, 2.70it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 74.08
Total input tokens: 42659
Total generated tokens: 42727
Request throughput (req/s): 2.70
Output token throughput (tok/s): 576.74
Total Token throughput (tok/s): 1152.57
---------------Time to First Token----------------
Mean TTFT (ms): 823.37
Median TTFT (ms): 785.28
P99 TTFT (ms): 1791.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 183.30
Median TPOT (ms): 126.12
P99 TPOT (ms): 654.56
---------------Inter-token Latency----------------
Mean ITL (ms): 109.70
Median ITL (ms): 89.01
P99 ITL (ms): 617.00
==================================================
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps inf
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_inf.json --request-rate inf --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_inf.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:11<00:00, 2.81it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 71.16
Total input tokens: 42659
Total generated tokens: 42768
Request throughput (req/s): 2.81
Output token throughput (tok/s): 600.99
Total Token throughput (tok/s): 1200.45
---------------Time to First Token----------------
Mean TTFT (ms): 5338.23
Median TTFT (ms): 5667.59
P99 TTFT (ms): 9723.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 195.32
Median TPOT (ms): 121.49
P99 TPOT (ms): 832.59
---------------Inter-token Latency----------------
Mean ITL (ms): 107.11
Median ITL (ms): 88.85
P99 ITL (ms): 837.97
==================================================
/vllm-workspace
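
Reading the four result blocks together: total token throughput rises from about 373 tok/s at qps 1 to about 1200 tok/s at qps inf, while mean TTFT grows from roughly 180 ms to over 5 s, so this deployment saturates near 2.8 req/s on the ShareGPT workload. Each run also writes its metrics to the JSON file named by --result-filename under /root/results/. Below is a small sketch for loading those files into one comparison table; the key names (request_throughput, output_throughput, mean_ttft_ms, p99_ttft_ms) are assumptions about the saved schema and should be adjusted to whatever the JSON actually contains.

```python
import json
from pathlib import Path

RESULT_DIR = Path("/root/results")
QPS_VALUES = ["1", "4", "16", "inf"]

rows = []
for qps in QPS_VALUES:
    path = RESULT_DIR / f"serving_llama70B_tp2_pp2_sharegpt_qps_{qps}.json"
    with path.open() as f:
        data = json.load(f)
    # Key names below are assumptions about the benchmark's result schema.
    rows.append((
        qps,
        data.get("request_throughput"),
        data.get("output_throughput"),
        data.get("mean_ttft_ms"),
        data.get("p99_ttft_ms"),
    ))

print(f"{'qps':>5} {'req/s':>8} {'out tok/s':>10} {'mean TTFT ms':>13} {'p99 TTFT ms':>12}")
for qps, req_s, tok_s, ttft, p99 in rows:
    print(f"{qps:>5} {req_s!s:>8} {tok_s!s:>10} {ttft!s:>13} {p99!s:>12}")
```
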