## Latency tests

- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
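As a minimal sketch, the same fixed-shape setup can be reproduced with vLLM's offline `LLM` API; the model name, dummy prompts, and repeat count below are illustrative assumptions, not the exact harness behind these numbers:

```python
import time

import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model name
params = SamplingParams(max_tokens=128, ignore_eos=True)  # force 128 output tokens
prompts = ["Hello " * 32] * 8  # stand-in for a fixed batch of 8 x 32-token inputs

latencies = []
for _ in range(10):  # repeat to get a distribution of end-to-end latencies
    start = time.perf_counter()
    llm.generate(prompts, params)  # one full batch: prefill + 128 decode steps
    latencies.append(time.perf_counter() - start)

print(f"mean={np.mean(latencies):.3f}s",
      f"median={np.median(latencies):.3f}s",
      f"p99={np.percentile(latencies, 99):.3f}s")
```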
## Throughput tests

- Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts (i.e., the length of the dataset's reference response).
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
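The throughput number is generated tokens divided by wall-clock time, with vLLM free to batch the 200 requests however it likes. A rough offline sketch is below; the file path, model name, and the ShareGPT JSON layout are assumptions, not the exact benchmark script:

```python
import json
import random
import time

from vllm import LLM, SamplingParams

random.seed(0)  # fixed seed -> the same 200 prompts are sampled every run
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:  # assumed path
    dataset = json.load(f)
convs = random.sample([d for d in dataset if len(d["conversations"]) >= 2], 200)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model name
tok = llm.get_tokenizer()

prompts = [c["conversations"][0]["value"] for c in convs]
# "Corresponding output length": cap each request at the token length of the
# dataset's reference response, so outputs mirror real conversation lengths.
params = [
    SamplingParams(
        max_tokens=max(1, len(tok.encode(c["conversations"][1]["value"]))),
        ignore_eos=True,
    )
    for c in convs
]

start = time.perf_counter()
outputs = llm.generate(prompts, params)  # vLLM batches internally
elapsed = time.perf_counter() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{out_tokens / elapsed:.1f} output tok/s, {len(prompts) / elapsed:.2f} req/s")
```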
## Serving tests

- Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each of these 200 prompts (i.e., the length of the dataset's reference response).
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each request is drawn from a Poisson process (with a fixed random seed), as sketched after this list.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), TPOT (time per output token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
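A Poisson arrival process with rate `qps` has i.i.d. exponential inter-arrival gaps with mean `1/qps`, which is how a client can schedule request send times. Below is a minimal sketch; the function name and seed value are illustrative, not the exact benchmark client:

```python
import random

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Send time (seconds from start) of each request at an average rate `qps`."""
    rng = random.Random(seed)  # fixed seed -> reproducible arrival pattern
    if qps == float("inf"):
        return [0.0] * num_requests  # QPS = inf: all requests sent at once
    times, t = [], 0.0
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with mean 1/qps seconds
        times.append(t)
    return times

# e.g. the QPS 4 run: 200 requests spread over ~50 seconds on average
arrivals = poisson_arrival_times(200, qps=4.0)
```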
| Test name | GPU | # of req. | Tput (req/s) | Output Tput (tok/s) | Total Tput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_01 | Standard_NC24ads_A100_v4 x 1 | 200 | 0.987401 | 96.8492 | 309.022 | 67.1317 | 53.4584 | 154.889 | 22.4824 | 22.1032 | 29.2128 | 22.3552 | 21.2732 | 41.4236 |
| serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_04 | Standard_NC24ads_A100_v4 x 1 | 200 | 3.27724 | 325.2 | 1029.41 | 78.5488 | 64.2288 | 201.157 | 28.3271 | 27.6608 | 50.0242 | 27.7427 | 22.9982 | 119.413 |
| serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_16 | Standard_NC24ads_A100_v4 x 1 | 200 | 6.52711 | 646.249 | 2048.79 | 139.655 | 123.262 | 434.906 | 63.0477 | 56.5592 | 147.469 | 42.9211 | 29.0528 | 353.212 |
| serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_inf | Standard_NC24ads_A100_v4 x 1 | 200 | 7.81671 | 787.064 | 2466.72 | 4023.66 | 3319.28 | 5773.6 | 182.849 | 55.0766 | 2586.53 | 42.4817 | 29.3635 | 47.9881 |
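For reference, TPOT (time per output token) and ITL measure decoding speed from different angles: ITL is the gap between consecutive streamed tokens, while TPOT is usually derived per request from end-to-end latency and TTFT. The snippet below illustrates that relationship and how the table's aggregate columns are formed; the sample numbers are made up:

```python
import numpy as np

# Per-request TPOT is typically (end-to-end latency - TTFT) / (output tokens - 1),
# i.e. the average time per decode step after the first token.
e2e_s, ttft_s, output_len = 3.2, 0.12, 128           # made-up sample request
tpot_ms = (e2e_s - ttft_s) * 1000 / (output_len - 1)

# The table then aggregates per-request values into mean/median/p99 columns:
per_request_tpots_ms = [22.1, 24.8, 21.5, 29.0]      # placeholder values
print(f"{tpot_ms:.2f} ms/token;",
      f"mean={np.mean(per_request_tpots_ms):.1f}",
      f"median={np.median(per_request_tpots_ms):.1f}",
      f"p99={np.percentile(per_request_tpots_ms, 99):.1f}")
```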
## JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the JSON string printed below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One DataFrame per test suite. In this run the "latency" and "throughput"
# tables are empty, so only serving_results contains rows.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:
{"latency": {}, "throughput": {}, "serving": {"Test name": {"0": "serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_inf", "1": "serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_01", "2": "serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_16", "3": "serving_microsoft-phi-4_tp1_pp1_sharegpt_qps_04"}, "GPU": {"0": "Standard_NC24ads_A100_v4 x 1", "1": "Standard_NC24ads_A100_v4 x 1", "2": "Standard_NC24ads_A100_v4 x 1", "3": "Standard_NC24ads_A100_v4 x 1"}, "# of req.": {"0": 200, "1": 200, "2": 200, "3": 200}, "Tput (req/s)": {"0": 7.816708267795149, "1": 0.9874006606348092, "2": 6.527105366628564, "3": 3.2772372061643584}, "Output Tput (tok/s)": {"0": 787.0643554842935, "1": 96.84919379836526, "2": 646.2487023498941, "3": 325.2002479676893}, "Total Tput (tok/s)": {"0": 2466.7186280681153, "1": 309.02184775557305, "2": 2048.79310353104, "3": 1029.4129788282867}, "Mean TTFT (ms)": {"0": 4023.6556951649954, "1": 67.13169773499885, "2": 139.65462660999947, "3": 78.54876733999617}, "Median TTFT (ms)": {"0": 3319.2782424999905, "1": 53.45841300004395, "2": 123.26202299993838, "3": 64.22880649995477}, "P99 TTFT (ms)": {"0": 5773.600177669878, "1": 154.88893441993696, "2": 434.90575131011826, "3": 201.15669853015976}, "Mean TPOT (ms)": {"0": 182.84897930821307, "1": 22.48243263621561, "2": 63.04765481115331, "3": 28.32710065808013}, "Median TPOT (ms)": {"0": 55.076568482757565, "1": 22.103191907692437, "2": 56.55915422857799, "3": 27.660848333349957}, "P99 TPOT (ms)": {"0": 2586.527520399977, "1": 29.212842526595843, "2": 147.4692513108478, "3": 50.024199164694096}, "Mean ITL (ms)": {"0": 42.481696395225136, "1": 22.355201681361745, "2": 42.92113829415369, "3": 27.74274770686149}, "Median ITL (ms)": {"0": 29.36348100001851, "1": 21.27322000001186, "2": 29.05283350003174, "3": 22.998229500103662}, "P99 ITL (ms)": {"0": 47.98806364003213, "1": 41.42363359993397, "2": 353.2116533299949, "3": 119.41316620012685}}}
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.