Latency tests measure end-to-end latency for fixed-shape requests:
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed at 8.
- Models: Llama 3.1 8B, Llama 3 70B, Mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99; see the sketch after this list).
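These statistics are computed over repeated end-to-end runs. A minimal sketch of the percentile math (the `latencies` values below are made-up placeholders, not measured data):

```python
import numpy as np

# Hypothetical per-iteration end-to-end latencies in seconds,
# one entry per benchmark iteration at the fixed batch size of 8.
latencies = np.array([1.92, 1.95, 1.91, 2.10, 1.93])

print(f"mean:   {np.mean(latencies):.3f} s")
print(f"median: {np.median(latencies):.3f} s")
print(f"p99:    {np.percentile(latencies, 99):.3f} s")
```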
Throughput tests measure offline throughput on a realistic workload:
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed; see the sketch after this list).
- Output length: the corresponding output length of each sampled prompt.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: Llama 3.1 8B, Llama 3 70B, Mixtral 8x7B.
- Evaluation metrics: throughput.
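A minimal sketch of how a fixed-seed sample could be drawn from ShareGPT (the file name, schema, and seed here are assumptions for illustration, not taken from the actual harness):

```python
import json
import random

# Load a ShareGPT dump; the file name and layout are assumptions.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    dataset = json.load(f)

# Fixing the seed ensures every benchmark run samples the same 200 prompts,
# so results are comparable across runs.
random.seed(0)
prompts = random.sample(dataset, 200)
```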
Serving tests measure online serving performance under different request arrival rates:
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output length of each sampled prompt.
- Batch size: dynamically determined by vLLM and by the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the finite QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed (see the sketch after this list).
- Models: Llama 3.1 8B, Llama 3 70B, Mixtral 8x7B.
- We also added a speculative decoding test for Llama 3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, p99), TPOT (time per output token; mean, median, p99), and ITL (inter-token latency; mean, median, p99).
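A Poisson arrival process has exponentially distributed inter-arrival gaps with mean 1/QPS. A minimal sketch of generating such a schedule (the function name and seed are illustrative, not the harness's actual code):

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return cumulative arrival times (in seconds) for a Poisson process at `qps`."""
    rng = np.random.default_rng(seed)
    if np.isinf(qps):
        # QPS = inf: every request arrives at t = 0, i.e. all at once.
        return np.zeros(num_requests)
    # Exponential inter-arrival gaps with mean 1/qps yield a Poisson process.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

arrivals = poisson_arrival_times(num_requests=200, qps=4.0)
```

The serving results measured under this setup: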
Test name | GPU | # of req. | Tput (req/s) | Output Tput (tok/s) | Total Tput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01 | Standard_ND96asr_v4 x 2 | 200 | 0.928406 | 198.242 | 396.267 | 110.373 | 96.9817 | 230.301 | 43.7218 | 43.5453 | 50.5137 | 43.6313 | 42.2756 | 87.9981 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_04 | Standard_ND96asr_v4 x 2 | 200 | 2.52147 | 539.053 | 1076.87 | 139.838 | 125.156 | 332.965 | 61.6271 | 63.497 | 83.148 | 60.9031 | 57.8237 | 173.346 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_16 | Standard_ND96asr_v4 x 2 | 200 | 3.69792 | 791.891 | 1580.64 | 226.762 | 215.643 | 479.754 | 87.5762 | 81.3718 | 153.844 | 72.7284 | 64.3621 | 233.485 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_inf | Standard_ND96asr_v4 x 2 | 200 | 4.13288 | 880.903 | 1762.43 | 2683.62 | 2771.16 | 4838.22 | 114.093 | 79.1199 | 400.543 | 71.8078 | 64.2954 | 399.109 |
This section contains the data from the table above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# The latency and throughput tables are empty in this run; only serving
# results were collected.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:

```json
{"latency": {}, "throughput": {}, "serving": {"Test name": {"0": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_inf", "1": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01", "2": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_04", "3": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_16"}, "GPU": {"0": "Standard_ND96asr_v4 x 2", "1": "Standard_ND96asr_v4 x 2", "2": "Standard_ND96asr_v4 x 2", "3": "Standard_ND96asr_v4 x 2"}, "# of req.": {"0": 200, "1": 200, "2": 200, "3": 200}, "Tput (req/s)": {"0": 4.132880392447687, "1": 0.9284057358744006, "2": 2.521471685463534, "3": 3.697917701235066}, "Output Tput (tok/s)": {"0": 880.9027912482622, "1": 198.24247678126076, "2": 539.0528242768216, "3": 791.8905861309833}, "Total Tput (tok/s)": {"0": 1762.4255145553916, "1": 396.266778214591, "2": 1076.8701274277662, "3": 1580.6379422159166}, "Mean TTFT (ms)": {"0": 2683.6246253499667, "1": 110.37337160010793, "2": 139.8380736899344, "3": 226.7617405450983}, "Median TTFT (ms)": {"0": 2771.161826000025, "1": 96.9816950000677, "2": 125.15622350110789, "3": 215.64252300049702}, "P99 TTFT (ms)": {"0": 4838.2172842909495, "1": 230.3005734290491, "2": 332.96458055017825, "3": 479.7536375109121}, "Mean TPOT (ms)": {"0": 114.09317725452917, "1": 43.72182021034344, "2": 61.62705314762229, "3": 87.57618569686481}, "Median TPOT (ms)": {"0": 79.11987648951599, "1": 43.54532462942404, "2": 63.49695762410795, "3": 81.37176336238505}, "P99 TPOT (ms)": {"0": 400.5428015899088, "1": 50.513716590712384, "2": 83.14804765725845, "3": 153.84377854204757}, "Mean ITL (ms)": {"0": 71.80776212203955, "1": 43.631314270832306, "2": 60.90314970594733, "3": 72.72835843662789}, "Median ITL (ms)": {"0": 64.29544500133488, "1": 42.27557599915599, "2": 57.82372999965446, "3": 64.36212599874125}, "P99 ITL (ms)": {"0": 399.1085860384919, "1": 87.99811164881247, "2": 173.3455703589425, "3": 233.4853582404321}}}
```
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.