@surajssd
Created February 23, 2025 05:16
Running over qps list 1
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 1
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_1.json --request-rate 1 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_1.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 5.80MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17.2M/17.2M [00:00<00:00, 41.7MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 834kB/s]
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [03:48<00:00, 1.14s/it]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 228.85
Total input tokens: 42659
Total generated tokens: 42657
Request throughput (req/s): 0.87
Output token throughput (tok/s): 186.40
Total Token throughput (tok/s): 372.80
---------------Time to First Token----------------
Mean TTFT (ms): 179.83
Median TTFT (ms): 137.75
P99 TTFT (ms): 431.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 67.84
Median TPOT (ms): 67.34
P99 TPOT (ms): 82.40
---------------Inter-token Latency----------------
Mean ITL (ms): 67.76
Median ITL (ms): 63.25
P99 ITL (ms): 224.09
==================================================
/vllm-workspace
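
The "Traffic request rate" and "Burstiness factor: 1.0 (Poisson process)" lines above describe how the client spaces out its 200 requests: with burstiness 1.0, inter-arrival gaps are exponentially distributed with mean 1/request_rate, i.e. a Poisson arrival process. The sketch below illustrates that sampling; the function name, gamma parameterization, and seed handling are assumptions for illustration, not the verbatim logic of benchmark_serving.py.

```python
import numpy as np

def arrival_gaps(request_rate: float, burstiness: float = 1.0,
                 n: int = 200, seed: int = 0) -> np.ndarray:
    """Sample n inter-arrival gaps (seconds) for a target request rate.

    With burstiness == 1.0 the gamma distribution reduces to an exponential
    with mean 1/request_rate, i.e. Poisson arrivals, matching the
    "Burstiness factor: 1.0 (Poisson process)" line in the log.
    (Hypothetical helper, not taken from benchmark_serving.py.)
    """
    if np.isinf(request_rate):
        # qps inf case: issue every request immediately, no gaps
        return np.zeros(n)
    rng = np.random.default_rng(seed)
    theta = 1.0 / (request_rate * burstiness)  # scale so the mean gap is 1/request_rate
    return rng.gamma(shape=burstiness, scale=theta, size=n)

# At qps 1, 200 requests should be issued over roughly 200 s, consistent with
# the 228.85 s benchmark duration above (the tail requests are still streaming).
print(arrival_gaps(1.0).sum())
```
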
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 4
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_4.json --request-rate 4 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=4.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_4.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:38<00:00, 2.03it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 98.63
Total input tokens: 42659
Total generated tokens: 42673
Request throughput (req/s): 2.03
Output token throughput (tok/s): 432.67
Total Token throughput (tok/s): 865.20
---------------Time to First Token----------------
Mean TTFT (ms): 253.09
Median TTFT (ms): 210.66
P99 TTFT (ms): 593.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 102.31
Median TPOT (ms): 102.09
P99 TPOT (ms): 153.32
---------------Inter-token Latency----------------
Mean ITL (ms): 96.37
Median ITL (ms): 84.40
P99 ITL (ms): 377.67
==================================================
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps 16
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_16.json --request-rate 16 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=16.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_16.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 16.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:14<00:00, 2.70it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 74.08
Total input tokens: 42659
Total generated tokens: 42727
Request throughput (req/s): 2.70
Output token throughput (tok/s): 576.74
Total Token throughput (tok/s): 1152.57
---------------Time to First Token----------------
Mean TTFT (ms): 823.37
Median TTFT (ms): 785.28
P99 TTFT (ms): 1791.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 183.30
Median TPOT (ms): 126.12
P99 TPOT (ms): 654.56
---------------Inter-token Latency----------------
Mean ITL (ms): 109.70
Median ITL (ms): 89.01
P99 ITL (ms): 617.00
==================================================
/vllm-workspace
~/vllm/benchmarks /vllm-workspace
Running test case serving_llama70B_tp2_pp2_sharegpt with qps inf
Client command: python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/results/ --result-filename serving_llama70B_tp2_pp2_sharegpt_qps_inf.json --request-rate inf --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts=200
Namespace(backend='vllm', base_url='http://llama-3-3-70b-instruct-leader.default:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/root/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.3-70B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=200, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='/root/results/', result_filename='serving_llama70B_tp2_pp2_sharegpt_qps_inf.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:11<00:00, 2.81it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 71.16
Total input tokens: 42659
Total generated tokens: 42768
Request throughput (req/s): 2.81
Output token throughput (tok/s): 600.99
Total Token throughput (tok/s): 1200.45
---------------Time to First Token----------------
Mean TTFT (ms): 5338.23
Median TTFT (ms): 5667.59
P99 TTFT (ms): 9723.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 195.32
Median TPOT (ms): 121.49
P99 TPOT (ms): 832.59
---------------Inter-token Latency----------------
Mean ITL (ms): 107.11
Median ITL (ms): 88.85
P99 ITL (ms): 837.97
==================================================
/vllm-workspace
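
Reading the four result blocks together: total token throughput rises from about 373 tok/s at qps 1 to about 1200 tok/s at qps inf, while mean TTFT grows from roughly 180 ms to over 5 s, so this deployment saturates near 2.8 req/s on the ShareGPT workload. Each run also writes its metrics to the JSON file named by --result-filename under /root/results/. Below is a small sketch for loading those files into one comparison table; the key names (request_throughput, output_throughput, mean_ttft_ms, p99_ttft_ms) are assumptions about the saved schema and should be adjusted to whatever the JSON actually contains.

```python
import json
from pathlib import Path

RESULT_DIR = Path("/root/results")
QPS_VALUES = ["1", "4", "16", "inf"]

rows = []
for qps in QPS_VALUES:
    path = RESULT_DIR / f"serving_llama70B_tp2_pp2_sharegpt_qps_{qps}.json"
    with path.open() as f:
        data = json.load(f)
    # Key names below are assumptions about the benchmark's result schema.
    rows.append((
        qps,
        data.get("request_throughput"),
        data.get("output_throughput"),
        data.get("mean_ttft_ms"),
        data.get("p99_ttft_ms"),
    ))

print(f"{'qps':>5} {'req/s':>8} {'out tok/s':>10} {'mean TTFT ms':>13} {'p99 TTFT ms':>12}")
for qps, req_s, tok_s, ttft, p99 in rows:
    print(f"{qps:>5} {req_s!s:>8} {tok_s!s:>10} {ttft!s:>13} {p99!s:>12}")
```
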