Benchmark vLLM using the OpenAI backend and a random dataset
```
#!/usr/bin/env bash
# Grab the vLLM benchmark scripts and drive an already-running
# OpenAI-compatible server listening on 127.0.0.1:8080.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```
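With `--dataset-name=random` the script synthesizes fixed-length prompts. The totals below work out to 1,024,000 input tokens over 1,000 requests, i.e. 1024 tokens per prompt, matching the defaults. A sketch of making those knobs explicit (flag names assumed from the benchmark_serving.py revision current at the time; `--random-output-len 128` is the assumed default, and actual generated counts come in lower because sequences can stop early at EOS):

```
# Same benchmark, but with the random-dataset shape spelled out explicitly.
python3 benchmark_serving.py --backend openai \
    --base-url http://127.0.0.1:8080 \
    --dataset-name=random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --seed 12345
```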
Server flags used:
```
Environment:
PORT: 8080
MODEL: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
GPU_MEMORY_UTILIZATION: 0.90
MAX_MODEL_LEN: 16384
EXTRA_ARGS: --kv-cache-dtype=auto --enable-prefix-caching --max-num-batched-tokens=16384
```
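These are environment variables consumed by whatever launcher started the server, not vLLM flags themselves. A minimal sketch of the equivalent direct launch, assuming vLLM's stock OpenAI-compatible entrypoint (the individual flags are real vLLM options; the env-var-to-flag mapping is my assumption):

```
# Hypothetical reconstruction of the server launch from the env vars above.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-prefix-caching \
    --max-num-batched-tokens=16384
```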
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  459.87
Total input tokens:                      1024000
Total generated tokens:                  98778
Request throughput (req/s):              2.17
Input token throughput (tok/s):          2226.73
Output token throughput (tok/s):         214.80
---------------Time to First Token----------------
Mean TTFT (ms):                          200733.41
Median TTFT (ms):                        195797.75
P99 TTFT (ms):                           451637.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          262.67
Median TPOT (ms):                        148.27
P99 TPOT (ms):                           4638.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           2750.27
Median ITL (ms):                         100.00
P99 ITL (ms):                            48780.82
==================================================
```
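The aggregate numbers are internally consistent; each throughput is just the corresponding total divided by the wall-clock duration:

```
# Recompute the run-1 throughputs from the reported totals.
awk 'BEGIN {
    dur = 459.87
    printf "req/s: %.2f\n", 1000    / dur   # -> 2.17
    printf "in/s:  %.2f\n", 1024000 / dur   # -> 2226.73
    printf "out/s: %.2f\n", 98778   / dur   # -> 214.80
}'
```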
2nd run:

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  457.44
Total input tokens:                      1024000
Total generated tokens:                  98062
Request throughput (req/s):              2.19
Input token throughput (tok/s):          2238.54
Output token throughput (tok/s):         214.37
---------------Time to First Token----------------
Mean TTFT (ms):                          202789.60
Median TTFT (ms):                        198844.52
P99 TTFT (ms):                           434753.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          331.79
Median TPOT (ms):                        168.32
P99 TPOT (ms):                           3681.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           2718.81
Median ITL (ms):                         99.95
P99 ITL (ms):                            45873.83
==================================================
```
With chunked prefill (note `enable_prefix_caching=False` in the engine config below, so this run trades prefix caching for chunked prefill):
```
INFO 08-04 06:19:30 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
```
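Chunked prefill is a server-side toggle; a sketch of the changed launch, assuming the same entrypoint as above (`--enable-chunked-prefill` is the stock vLLM flag). At this vLLM version, chunked prefill generally could not be combined with prefix caching, which is consistent with `enable_prefix_caching=False` in the log:

```
# Hypothetical launch for the chunked-prefill run; differs from the
# baseline sketch only in swapping prefix caching for chunked prefill.
python3 -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype=auto \
    --enable-chunked-prefill \
    --max-num-batched-tokens=16384
```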
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  418.96
Total input tokens:                      1024000
Total generated tokens:                  99665
Request throughput (req/s):              2.39
Input token throughput (tok/s):          2444.14
Output token throughput (tok/s):         237.89
---------------Time to First Token----------------
Mean TTFT (ms):                          185743.66
Median TTFT (ms):                        182591.25
P99 TTFT (ms):                           398203.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          307.14
Median TPOT (ms):                        185.45
P99 TPOT (ms):                           3296.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           2671.87
Median ITL (ms):                         98.71
P99 ITL (ms):                            65626.11
==================================================
```
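Net effect: chunked prefill shortens the whole run by roughly 9-10% versus both baselines, computed directly from the durations above:

```
# Relative speedup of the chunked-prefill run vs. the two baseline runs.
awk 'BEGIN {
    printf "vs run 1: %.1f%% faster\n", (459.87/418.96 - 1) * 100   # ~9.8%
    printf "vs run 2: %.1f%% faster\n", (457.44/418.96 - 1) * 100   # ~9.2%
}'
```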