Latency tests:
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed at 8.
- Models: Llama-3.1 8B, Llama-3 70B, Mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99); a minimal measurement sketch follows this list.
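For reference, here is a minimal sketch of such an end-to-end latency measurement using vLLM's offline `LLM` API. The model name, prompt text, and iteration count are illustrative assumptions; the real harness builds prompts with exactly 32 tokens.

```python
import time

import numpy as np
from vllm import LLM, SamplingParams

# Model name and prompt are assumptions for illustration only.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=128, ignore_eos=True)  # force exactly 128 output tokens
batch = ["Hello, my name is"] * 8  # fixed batch size of 8

latencies = []
for _ in range(10):  # repeated iterations so the percentiles are meaningful
    start = time.perf_counter()
    llm.generate(batch, params)
    latencies.append(time.perf_counter() - start)

print(
    f"mean={np.mean(latencies):.3f}s "
    f"median={np.median(latencies):.3f}s "
    f"p99={np.percentile(latencies, 99):.3f}s"
)
```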
Throughput tests:
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed); a sampling sketch follows this list.
- Output length: the output length recorded for each of the 200 sampled prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: Llama-3.1 8B, Llama-3 70B, Mixtral 8x7B.
- Evaluation metrics: throughput.
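A minimal sketch of the fixed-seed sampling described above. The dataset filename and seed value are assumptions; the actual sampling logic lives in the benchmark scripts.

```python
import json
import random

random.seed(0)  # fixed seed, so every run benchmarks the same 200 prompts
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    dataset = json.load(f)

# Each ShareGPT record is a conversation; sampling the records with a
# seeded RNG makes the prompt set reproducible across runs.
conversations = random.sample(dataset, 200)
```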
Serving tests:
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with the same fixed random seed).
- Output length: the output length recorded for each of the 200 sampled prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a Poisson process with a fixed random seed (see the sketch after this list).
- Models: Llama-3.1 8B, Llama-3 70B, Mixtral 8x7B.
- We also added a speculative decoding test for Llama-3 70B at QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), TPOT (time per output token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
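As background for the QPS settings, here is a minimal sketch of Poisson arrival generation (the function name and seed are illustrative, not the benchmark's exact code). Inter-arrival gaps of a Poisson process with rate `qps` are i.i.d. exponential with mean `1/qps`, so their cumulative sum gives the arrival timestamps.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival timestamps (in seconds) for a Poisson process with rate `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. Exponential with
    mean 1/qps, so cumulatively summing exponential draws yields the
    arrival times.
    """
    rng = np.random.default_rng(seed)  # fixed seed, as in the benchmark setup
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# e.g. 200 requests arriving at an average rate of 4 QPS
arrivals = poisson_arrival_times(200, qps=4.0)
```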
Test name | GPU | # of req. | Tput (req/s) | Output Tput (tok/s) | Total Tput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_01 | Standard_NC96ads_A100_v4 x 1 | 200 | 0.927978 | 198.508 | 396.441 | 154.771 | 128.388 | 376.548 | 44.9394 | 44.6347 | 58.0394 | 44.8516 | 43.878 | 131.941 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_04 | Standard_NC96ads_A100_v4 x 1 | 200 | 2.4957 | 532.931 | 1065.25 | 246.338 | 219.599 | 606.801 | 73.9558 | 73.8304 | 129.354 | 69.5497 | 55.5315 | 288.602 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_16 | Standard_NC96ads_A100_v4 x 1 | 200 | 3.3252 | 710.13 | 1419.38 | 2461.81 | 2882.55 | 3660.78 | 199.375 | 117.275 | 704.745 | 96.2799 | 70.8682 | 705.108 |
serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_inf | Standard_NC96ads_A100_v4 x 1 | 200 | 3.3334 | 710.349 | 1421.35 | 8390.68 | 8344.11 | 15300.3 | 205.38 | 116.908 | 700.741 | 96.4424 | 70.6381 | 701.537 |
This section contains the data from the markdown tables above in JSON format. You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json

import pandas as pd

# Paste the JSON string from the section below in place of the placeholder.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# One DataFrame per benchmark suite; empty suites yield empty DataFrames.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
The JSON string for all benchmarking tables:

```json
{"latency": {}, "throughput": {}, "serving": {"Test name": {"0": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_04", "1": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_01", "2": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_inf", "3": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp1_sharegpt_qps_16"}, "GPU": {"0": "Standard_NC96ads_A100_v4 x 1", "1": "Standard_NC96ads_A100_v4 x 1", "2": "Standard_NC96ads_A100_v4 x 1", "3": "Standard_NC96ads_A100_v4 x 1"}, "# of req.": {"0": 200, "1": 200, "2": 200, "3": 200}, "Tput (req/s)": {"0": 2.495698100981504, "1": 0.9279779228604551, "2": 3.3334046649708653, "3": 3.325199159049069}, "Output Tput (tok/s)": {"0": 532.9313724835904, "1": 198.50839736869426, "2": 710.3485341052914, "3": 710.1295324065192}, "Total Tput (tok/s)": {"0": 1065.2512989324402, "1": 396.441448425215, "2": 1421.3470821202523, "3": 1419.3778870358904}, "Mean TTFT (ms)": {"0": 246.33752696000101, "1": 154.77130150999983, "2": 8390.680319614992, "3": 2461.8073529899934}, "Median TTFT (ms)": {"0": 219.59910649991343, "1": 128.38760200008892, "2": 8344.11360350009, "3": 2882.5457675000052}, "P99 TTFT (ms)": {"0": 606.8011953101309, "1": 376.5480166300789, "2": 15300.305423499995, "3": 3660.7770861201493}, "Mean TPOT (ms)": {"0": 73.95579696666762, "1": 44.93937090850136, "2": 205.38025907747684, "3": 199.3748304130239}, "Median TPOT (ms)": {"0": 73.8304143313294, "1": 44.63469464226745, "2": 116.90824614373891, "3": 117.27497017068639}, "P99 TPOT (ms)": {"0": 129.3536884687555, "1": 58.03939859885578, "2": 700.7413872749946, "3": 704.7451063547737}, "Mean ITL (ms)": {"0": 69.54965636861299, "1": 44.85155391470774, "2": 96.44242881350782, "3": 96.27992907936581}, "Median ITL (ms)": {"0": 55.53150799994455, "1": 43.878026000129466, "2": 70.6381174999251, "3": 70.86817749996044}, "P99 ITL (ms)": {"0": 288.60196543984557, "1": 131.9412263799859, "2": 701.5374677201612, "3": 705.1076254899294}}}
```
You can also check the raw experiment data in the Artifacts tab of the Buildkite page.