Benchmark vLLM using the OpenAI backend and a random dataset
```
#!/usr/bin/env bash
# Grab the vLLM benchmark scripts and drive an already-running
# OpenAI-compatible server listening on 127.0.0.1:8080.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```
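With `--dataset-name=random` the script synthesizes fixed-length prompts. The totals below work out to 1,024,000 input tokens over 1,000 requests, i.e. 1024 tokens per prompt, matching the defaults. A sketch of making those knobs explicit (flag names assumed from the benchmark_serving.py revision current at the time; `--random-output-len 128` is the assumed default, and actual generated counts come in lower because sequences can stop early at EOS):

```
# Same benchmark, but with the random-dataset shape spelled out explicitly.
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```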
Server flags used:
```
Environment:
PORT: 8080
MODEL: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
GPU_MEMORY_UTILIZATION: 0.90
MAX_MODEL_LEN: 16384
EXTRA_ARGS: --kv-cache-dtype=auto --enable-prefix-caching --max-num-batched-tokens=16384
```
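These are environment variables consumed by whatever launcher started the server, not vLLM flags themselves. A minimal sketch of the equivalent direct launch, assuming vLLM's stock OpenAI-compatible entrypoint (the individual flags are real vLLM options; the env-var-to-flag mapping is my assumption):

```
# Hypothetical reconstruction of the server launch from the env vars above.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-prefix-caching \
    --max-num-batched-tokens=16384
```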
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  459.87
Total input tokens:                      1024000
Total generated tokens:                  98778
Request throughput (req/s):              2.17
Input token throughput (tok/s):          2226.73
Output token throughput (tok/s):         214.80
---------------Time to First Token----------------
Mean TTFT (ms):                          200733.41
Median TTFT (ms):                        195797.75
P99 TTFT (ms):                           451637.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          262.67
Median TPOT (ms):                        148.27
P99 TPOT (ms):                           4638.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           2750.27
Median ITL (ms):                         100.00
P99 ITL (ms):                            48780.82
==================================================
```
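The aggregate numbers are internally consistent; each throughput is just the corresponding total divided by the wall-clock duration:

```
# Recompute the run-1 throughputs from the reported totals.
awk 'BEGIN {
    dur = 459.87
    printf "req/s: %.2f\n", 1000    / dur   # -> 2.17
    printf "in/s:  %.2f\n", 1024000 / dur   # -> 2226.73
    printf "out/s: %.2f\n", 98778   / dur   # -> 214.80
}'
```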
2nd run:

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  457.44
Total input tokens:                      1024000
Total generated tokens:                  98062
Request throughput (req/s):              2.19
Input token throughput (tok/s):          2238.54
Output token throughput (tok/s):         214.37
---------------Time to First Token----------------
Mean TTFT (ms):                          202789.60
Median TTFT (ms):                        198844.52
P99 TTFT (ms):                           434753.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          331.79
Median TPOT (ms):                        168.32
P99 TPOT (ms):                           3681.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           2718.81
Median ITL (ms):                         99.95
P99 ITL (ms):                            45873.83
==================================================
```
With chunked prefill (note `enable_prefix_caching=False` in the engine config below, so this run trades prefix caching for chunked prefill):
```
INFO 08-04 06:19:30 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
```
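Chunked prefill is a server-side toggle; a sketch of the changed launch, assuming the same entrypoint as above (`--enable-chunked-prefill` is the stock vLLM flag). At this vLLM version, chunked prefill generally could not be combined with prefix caching, which is consistent with `enable_prefix_caching=False` in the log:

```
# Hypothetical launch for the chunked-prefill run; differs from the
# baseline sketch only in swapping prefix caching for chunked prefill.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-chunked-prefill \
    --max-num-batched-tokens=16384
```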
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  418.96
Total input tokens:                      1024000
Total generated tokens:                  99665
Request throughput (req/s):              2.39
Input token throughput (tok/s):          2444.14
Output token throughput (tok/s):         237.89
---------------Time to First Token----------------
Mean TTFT (ms):                          185743.66
Median TTFT (ms):                        182591.25
P99 TTFT (ms):                           398203.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          307.14
Median TPOT (ms):                        185.45
P99 TPOT (ms):                           3296.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           2671.87
Median ITL (ms):                         98.71
P99 ITL (ms):                            65626.11
==================================================
```
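Net effect: chunked prefill shortens the whole run by roughly 9-10% versus both baselines, computed directly from the durations above:

```
# Relative speedup of the chunked-prefill run vs. the two baseline runs.
awk 'BEGIN {
    printf "vs run 1: %.1f%% faster\n", (459.87/418.96 - 1) * 100   # ~9.8%
    printf "vs run 2: %.1f%% faster\n", (457.44/418.96 - 1) * 100   # ~9.2%
}'
```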