Running over qps list 1
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 1
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_1.json --request-rate 1 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_1.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 5.80MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17.2M/17.2M [00:00<00:00, 41.7MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 834kB/s]
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [03:48<00:00, 1.14s/it]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 228.85
Total input tokens: 42659
Total generated tokens: 42657
Request throughput (req/s): 0.87
Output token throughput (tok/s): 186.40
Total Token throughput (tok/s): 372.80
---------------Time to First Token----------------
Mean TTFT (ms): 179.83
Median TTFT (ms): 137.75
P99 TTFT (ms): 431.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 67.84
Median TPOT (ms): 67.34
P99 TPOT (ms): 82.40
---------------Inter-token Latency----------------
Mean ITL (ms): 67.76
Median ITL (ms): 63.25
P99 ITL (ms): 224.09
==================================================
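For reference, the throughput figures in the block above follow directly from the counts and the benchmark duration. A quick sanity check in plain Python, using the numbers from the qps 1 result and assuming each throughput is simply the corresponding count divided by the duration, reproduces the reported values to rounding:

# Sanity-check the qps 1 throughput figures printed above.
successful_requests = 200
duration_s = 228.85
input_tokens = 42659
generated_tokens = 42657

req_per_s = successful_requests / duration_s                       # ~0.87
output_tok_per_s = generated_tokens / duration_s                   # ~186.40
total_tok_per_s = (input_tokens + generated_tokens) / duration_s   # ~372.80
print(f"{req_per_s:.2f} req/s, {output_tok_per_s:.2f} out tok/s, {total_tok_per_s:.2f} total tok/s")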
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 4
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_4.json --request-rate 4 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=4.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_4.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:38<00:00, 2.03it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 98.63
Total input tokens: 42659
Total generated tokens: 42673
Request throughput (req/s): 2.03
Output token throughput (tok/s): 432.67
Total Token throughput (tok/s): 865.20
---------------Time to First Token----------------
Mean TTFT (ms): 253.09
Median TTFT (ms): 210.66
P99 TTFT (ms): 593.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 102.31
Median TPOT (ms): 102.09
P99 TPOT (ms): 153.32
---------------Inter-token Latency----------------
Mean ITL (ms): 96.37
Median ITL (ms): 84.40
P99 ITL (ms): 377.67
==================================================
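Each run reports "Burstiness factor: 1.0 (Poisson process)": inter-arrival gaps between requests are drawn from a gamma distribution whose shape parameter is the burstiness factor, and with burstiness 1.0 that reduces to an exponential distribution, i.e. Poisson arrivals at the requested rate. A minimal sketch of that traffic model (not necessarily the benchmark's exact code; request_rate and burstiness mirror the CLI flags above):

import numpy as np

def arrival_gaps(num_requests: int, request_rate: float, burstiness: float = 1.0):
    # Mean gap is 1/request_rate regardless of burstiness; shape=1 gives
    # exponential gaps (a Poisson process), smaller shapes give burstier traffic.
    scale = 1.0 / (request_rate * burstiness)
    return np.random.gamma(shape=burstiness, scale=scale, size=num_requests)

# e.g. the qps 4 run: 200 prompts arriving at 4 req/s span roughly 50 s
gaps = arrival_gaps(200, request_rate=4.0)
print(f"mean gap {gaps.mean():.3f} s, arrival span ~{gaps.sum():.1f} s")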
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 16
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_16.json --request-rate 16 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=16.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_16.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 16.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:14<00:00, 2.70it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 74.08
Total input tokens: 42659
Total generated tokens: 42727
Request throughput (req/s): 2.70
Output token throughput (tok/s): 576.74
Total Token throughput (tok/s): 1152.57
---------------Time to First Token----------------
Mean TTFT (ms): 823.37
Median TTFT (ms): 785.28
P99 TTFT (ms): 1791.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 183.30
Median TPOT (ms): 126.12
P99 TPOT (ms): 654.56
---------------Inter-token Latency----------------
Mean ITL (ms): 109.70
Median ITL (ms): 89.01
P99 ITL (ms): 617.00
==================================================
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps inf
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_inf.json --request-rate inf --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_inf.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:11<00:00, 2.81it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 71.16
Total input tokens: 42659
Total generated tokens: 42768
Request throughput (req/s): 2.81
Output token throughput (tok/s): 600.99
Total Token throughput (tok/s): 1200.45
---------------Time to First Token----------------
Mean TTFT (ms): 5338.23
Median TTFT (ms): 5667.59
P99 TTFT (ms): 9723.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 195.32
Median TPOT (ms): 121.49
P99 TPOT (ms): 832.59
---------------Inter-token Latency----------------
Mean ITL (ms): 107.11
Median ITL (ms): 88.85
P99 ITL (ms): 837.97
==================================================
/vllm-workspace
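Because every run was launched with --save-result, the per-run metrics also land as JSON files under /root/results/. A rough sketch for tabulating the qps sweep from those files (the JSON key names such as request_throughput and mean_ttft_ms are assumed from the metric labels above, not confirmed by this log, so check them against one of the saved files first):

import json
from pathlib import Path

RESULT_DIR = Path("/root/results")    # --result-dir used in the runs above
QPS_VALUES = ["1", "4", "16", "inf"]  # the sweep covered by this log

print(f"{'qps':>5} {'req/s':>7} {'out tok/s':>10} {'mean TTFT ms':>13} {'p99 TTFT ms':>12}")
for qps in QPS_VALUES:
    path = RESULT_DIR / f"serving_llama70B_tp2_pp2_sharegpt_qps_{qps}.json"
    data = json.loads(path.read_text())
    # Key names assumed, not taken from the log; adjust to match the saved files.
    print(f"{qps:>5} {data['request_throughput']:>7.2f} {data['output_throughput']:>10.2f} "
          f"{data['mean_ttft_ms']:>13.2f} {data['p99_ttft_ms']:>12.2f}")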