- Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each query is sampled from a Poisson process (with a fixed random seed); see the sketch after this list.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median and p99), TPOT (time per output token, with mean, median and p99), and ITL (inter-token latency, with mean, median and p99).
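For reference on the request arrival pattern, a Poisson process with a given average QPS is typically simulated by sampling exponentially distributed inter-arrival gaps and accumulating them. The sketch below is purely illustrative and is not taken from the benchmark harness; the function name `poisson_arrival_times` and the seed value are assumptions.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Sample request arrival times (seconds) for a Poisson process with rate `qps`.

    Inter-arrival gaps of a Poisson process are exponentially distributed with
    mean 1 / qps; their cumulative sum gives absolute arrival times. QPS = inf
    corresponds to all requests arriving at t = 0.
    """
    rng = np.random.default_rng(seed)  # fixed seed for reproducible arrival times
    if np.isinf(qps):
        return np.zeros(num_requests)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: arrival times for 200 requests at an average of 4 QPS.
print(poisson_arrival_times(num_requests=200, qps=4.0)[:5])
```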
Test name | GPU | # of req. | Tput (req/s) | Output Tput (tok/s) | Total Tput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_01 | Standard_NC48ads_A100_v4 x 2 | 200 | 0.872355 | 186.627 | 372.696 | 183.928 | 149.645 | 433.518 | 68.6318 | 68.074 | 91.9705 | 68.4992 | 63.9263 | 224.921
serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_04 | Standard_NC48ads_A100_v4 x 2 | 200 | 2.00982 | 427.739 | 856.423 | 250.331 | 210.275 | 595.383 | 104.6 | 104.4 | 159.741 | 97.9462 | 85.5368 | 370.395
serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_16 | Standard_NC48ads_A100_v4 x 2 | 200 | 2.67286 | 569.012 | 1139.12 | 813.433 | 800.52 | 1630.49 | 183.463 | 126.463 | 649.71 | 111.009 | 91.1344 | 627.227
serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_inf | Standard_NC48ads_A100_v4 x 2 | 200 | 2.74262 | 585.508 | 1170.49 | 5586.03 | 5578.4 | 9532.99 | 199.365 | 120.439 | 856.838 | 108.644 | 91.1578 | 864.748
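To make the latency metrics in the table concrete, the snippet below shows one way TTFT, TPOT, and ITL could be derived from the per-token receive timestamps of a single streamed request. It is a minimal sketch under assumed timestamps; the variable names and values are hypothetical and do not come from the benchmark code.

```python
import numpy as np

# Hypothetical timestamps (seconds after the request was sent) at which each
# output token of one streamed request was received.
token_times = np.array([0.15, 0.21, 0.27, 0.34, 0.40])

ttft = token_times[0]            # time to first token
itl = np.diff(token_times)       # gaps between consecutive tokens
# Time per output token: total decode time spread over all tokens after the first.
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)

print(f"TTFT: {ttft * 1000:.1f} ms")
print(f"Mean ITL: {itl.mean() * 1000:.1f} ms, P99 ITL: {np.percentile(itl, 99) * 1000:.1f} ms")
print(f"TPOT: {tpot * 1000:.1f} ms")
```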
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") maps to one table.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
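For example, once the serving table is loaded you can pull a quick side-by-side view of throughput and mean latencies across the QPS settings (column names are taken from the table above):

```python
# Compare throughput and mean latencies across the QPS settings.
summary = serving_results.sort_values("Test name")[
    ["Test name", "Tput (req/s)", "Mean TTFT (ms)", "Mean TPOT (ms)", "Mean ITL (ms)"]
]
print(summary.to_string(index=False))
```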
The JSON string for all benchmarking tables:
{"latency": {}, "throughput": {}, "serving": {"Test name": {"0": "serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_16", "1": "serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_04", "2": "serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_01", "3": "serving_meta-llama-Llama-3.3-70B-Instruct_tp2_pp2_sharegpt_qps_inf"}, "GPU": {"0": "Standard_NC48ads_A100_v4 x 2", "1": "Standard_NC48ads_A100_v4 x 2", "2": "Standard_NC48ads_A100_v4 x 2", "3": "Standard_NC48ads_A100_v4 x 2"}, "# of req.": {"0": 200, "1": 200, "2": 200, "3": 200}, "Tput (req/s)": {"0": 2.6728616119452546, "1": 2.0098153410901136, "2": 0.8723546086118796, "3": 2.7426189362908926}, "Output Tput (tok/s)": {"0": 569.0121442589656, "1": 427.7389499675034, "2": 186.62718319338248, "3": 585.5080036140612}, "Total Tput (tok/s)": {"0": 1139.1201617788286, "1": 856.4225131453192, "2": 372.6960594372533, "3": 1170.494909630227}, "Mean TTFT (ms)": {"0": 813.4334354352177, "1": 250.33063353503167, "2": 183.92802053997002, "3": 5586.032351420108}, "Median TTFT (ms)": {"0": 800.5200850002439, "1": 210.2748959987366, "2": 149.6454604985047, "3": 5578.40267499887}, "P99 TTFT (ms)": {"0": 1630.4850889814402, "1": 595.3828498008077, "2": 433.51827928232507, "3": 9532.991067929393}, "Mean TPOT (ms)": {"0": 183.46253161417457, "1": 104.60007571868582, "2": 68.63183632389541, "3": 199.3647094853612}, "Median TPOT (ms)": {"0": 126.46261202935186, "1": 104.40000343326128, "2": 68.07398973492188, "3": 120.43930869103623}, "P99 TPOT (ms)": {"0": 649.7100239449541, "1": 159.7409162747622, "2": 91.97054489733009, "3": 856.8383692525231}, "Mean ITL (ms)": {"0": 111.00900483125245, "1": 97.94619975543478, "2": 68.4992397737343, "3": 108.64352028194985}, "Median ITL (ms)": {"0": 91.13440799774253, "1": 85.53675200164435, "2": 63.92631100243307, "3": 91.15782000299077}, "P99 ITL (ms)": {"0": 627.2271953195741, "1": 370.39494579963514, "2": 224.9206623390637, "3": 864.7476389203803}}}
You can also check the raw experiment data in the Artifact tab of the Buildkite page.