Benchmark vLLM using the OpenAI backend and a random dataset
```
#!/usr/bin/env bash
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```
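The runs below show 1,024,000 input tokens over 1,000 requests, i.e. 1,024 random input tokens per request at the script's defaults. A sketch of how to vary the load, assuming the `--num-prompts`, `--request-rate`, `--random-input-len`, and `--random-output-len` flags of recent `benchmark_serving.py` versions (flag names can differ between releases, so check `--help` first):

```
# Sketch only: vary request count, arrival rate, and random prompt lengths.
# Flag names are assumptions based on recent vllm/benchmarks/benchmark_serving.py.
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --request-rate 4 \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```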
Flags used:
```
Environment:
PORT: 8080
MODEL: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
GPU_MEMORY_UTILIZATION: 0.90
MAX_MODEL_LEN: 16384
EXTRA_ARGS: --kv-cache-dtype=auto --enable-prefix-caching --max-num-batched-tokens=16384
```
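The gist does not include the actual server launch command. For reference, a minimal sketch that maps the environment values above onto the stock vLLM OpenAI-compatible server (flag spellings are assumptions and may vary by vLLM version):

```
# Sketch only: environment values above mapped onto vLLM OpenAI server flags.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-prefix-caching \
    --max-num-batched-tokens=16384
```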
1st run:

```
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 459.87
Total input tokens: 1024000
Total generated tokens: 98778
Request throughput (req/s): 2.17
Input token throughput (tok/s): 2226.73
Output token throughput (tok/s): 214.80
---------------Time to First Token----------------
Mean TTFT (ms): 200733.41
Median TTFT (ms): 195797.75
P99 TTFT (ms): 451637.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 262.67
Median TPOT (ms): 148.27
P99 TPOT (ms): 4638.86
---------------Inter-token Latency----------------
Mean ITL (ms): 2750.27
Median ITL (ms): 100.00
P99 ITL (ms): 48780.82
```

2nd run:

```
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 457.44
Total input tokens: 1024000
Total generated tokens: 98062
Request throughput (req/s): 2.19
Input token throughput (tok/s): 2238.54
Output token throughput (tok/s): 214.37
---------------Time to First Token----------------
Mean TTFT (ms): 202789.60
Median TTFT (ms): 198844.52
P99 TTFT (ms): 434753.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 331.79
Median TPOT (ms): 168.32
P99 TPOT (ms): 3681.04
---------------Inter-token Latency----------------
Mean ITL (ms): 2718.81
Median ITL (ms): 99.95
P99 ITL (ms): 45873.83
==================================================
```
With chunked prefill:
```
INFO 08-04 06:19:30 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
```
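The engine log above shows v0.5.3.post1 with `enable_prefix_caching=False` for this run; the launch command itself is not in the gist. A hedged sketch of enabling chunked prefill on the same model and limits (port and GPU memory utilization assumed from the earlier environment block, flag spellings may differ across vLLM versions). The results below are from that configuration.

```
# Sketch only: same model and limits as the engine log, chunked prefill enabled.
# Note the log shows prefix caching disabled for this run.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-chunked-prefill
```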
```
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 418.96
Total input tokens: 1024000
Total generated tokens: 99665
Request throughput (req/s): 2.39
Input token throughput (tok/s): 2444.14
Output token throughput (tok/s): 237.89
---------------Time to First Token----------------
Mean TTFT (ms): 185743.66
Median TTFT (ms): 182591.25
P99 TTFT (ms): 398203.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 307.14
Median TPOT (ms): 185.45
P99 TPOT (ms): 3296.57
---------------Inter-token Latency----------------
Mean ITL (ms): 2671.87
Median ITL (ms): 98.71
P99 ITL (ms): 65626.11
==================================================
```
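To compare runs without reading the console output by hand, `benchmark_serving.py` can also write a JSON result file via `--save-result`. The snippet below is a hedged sketch: the `--result-filename` flag and the JSON field names (`request_throughput`, `mean_ttft_ms`, `mean_tpot_ms`) are assumptions based on recent vLLM versions and should be verified against `--help` and the actual output file.

```
# Sketch only: persist results and pull a few headline metrics with jq.
# Flag names and JSON keys are assumptions -- verify for your vLLM version.
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345 \
    --save-result \
    --result-filename run1.json

jq '{request_throughput, mean_ttft_ms, mean_tpot_ms}' run1.json
```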