This guide walks you through using TensorRT-Cloud to run a performance sweep with the TRT LLM PyTorch backend.
--> NOTE: THIS GUIDE IS PSEUDOCODE AND JUST A PRODUCT MANAGER'S SUGGESTION. WE HAVE NOT YET BUILT THIS FEATURE. <--
Unlike the C++ backend, which uses ahead-of-time (AoT) compilation, the TRT LLM PyTorch backend uses just-in-time (JIT) compilation. This means we'll be configuring runtime parameters rather than engine-build parameters for our performance sweep.
Create a pytorch_sweep_spec.json file with the following content:
{
  "sweep_config": {
    "model_inputs": [
      {
        "source": {
          "id": "deepseek-ai/DeepSeek-V3",
          "revision": "main",
          "source_type": "huggingface_repo",
          "token": "*********"
        },
        "type": "huggingface_checkpoint"
      }
    ],
    "trtllm_runtime": {
      "backend": "pytorch",
      "max_batch_size": [64, 128, 256, 384],
      "max_num_tokens": [1024, 2048, 4096],
      "tp_size": [1, 2, 4, 8],
      "pp_size": [1],
      "kv_cache_free_gpu_memory_fraction": [0.8, 0.9, 0.95],
      "pytorch_config": {
        "use_cuda_graph": [true, false],
        "cuda_graph_padding_enabled": [true, false],
        "cuda_graph_batch_sizes": [
          [1, 2, 4, 8, 16, 32, 64, 128],
          [1, 2, 4, 8, 16, 32, 64, 128, 256]
        ],
        "print_iter_log": [true],
        "enable_overlap_scheduler": [true, false],
        "enable_attention_dp": [true, false]
      }
    },
    "hardware": {
      "gpu": "H100"
    },
    "search_strategy": {
      "batch_size": 2,
      "max_trials": 10,
      "name": "grid",
      "optimization_objective": "throughput"
    },
    "benchmark": {
      "perf_configs": [
        {
          "requests_config": {
            "concurrency": 500,
            "input_tokens_mean": 1000,
            "output_tokens_mean": 1000
          }
        }
      ]
    }
  }
}
This configuration will sweep across different values for:
- max_batch_size
- max_num_tokens
- tp_size
- kv_cache_free_gpu_memory_fraction
- PyTorch-specific settings such as use_cuda_graph, cuda_graph_batch_sizes, etc. (the sketch after this list shows how large the resulting grid is)
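Because max_trials is capped at 10 while the grid above is much larger, it can be worth counting how many combinations a full grid would contain before you submit. The sketch below is plain Python over the JSON file we just wrote; the sweep-spec schema itself is speculative, as noted at the top of this guide.

import json

# Load the sweep spec written above (schema is speculative; see the note at the top).
with open("pytorch_sweep_spec.json") as f:
    runtime = json.load(f)["sweep_config"]["trtllm_runtime"]

# Every list-valued field is a swept axis; pytorch_config nests more of them.
axes = {k: v for k, v in runtime.items() if isinstance(v, list)}
axes.update({k: v for k, v in runtime["pytorch_config"].items() if isinstance(v, list)})

total = 1
for name, values in axes.items():
    total *= len(values)
    print(f"{name}: {len(values)} candidate value(s)")

print(f"Full grid: {total} combinations (max_trials caps how many actually run)")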
Use the TensorRT-Cloud CLI to submit your sweep:
trtc sweep submit --config pytorch_sweep_spec.json
This will output a sweep ID you can use to check status:
Sweep submitted successfully!
Sweep ID: sweep-20240701-123456
Check the status of your sweep using:
trtc sweep get --id sweep-20240701-123456
For continuous monitoring:
trtc sweep watch --id sweep-20240701-123456
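If you want to script the monitoring step instead of keeping a terminal open on trtc sweep watch, a small polling loop around trtc sweep get works too. This is only a sketch: the trtc CLI and its output format are hypothetical (the feature is not built yet), so the completion check below is a placeholder to adapt.

import subprocess
import time

SWEEP_ID = "sweep-20240701-123456"  # from the submit output above

while True:
    # Ask the (hypothetical) CLI for the current sweep status.
    result = subprocess.run(
        ["trtc", "sweep", "get", "--id", SWEEP_ID],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())

    # Placeholder check; adjust to whatever status strings the real CLI prints.
    if "COMPLETED" in result.stdout or "FAILED" in result.stdout:
        break
    time.sleep(60)  # poll once a minute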
Once complete, download the results:
trtc sweep download --id sweep-20240701-123456 --output ./sweep_results
Navigate to the results directory:
cd ./sweep_results
Open summary.html in your browser. The file will contain a table of all configurations tested, with metrics like:
- Throughput (tokens/sec)
- Latency (ms)
- Memory usage
- Configuration parameters used
Example summary table row:
Trial ID | Throughput (tokens/s) | Latency P50 (ms) | Latency P90 (ms) | Latency P99 (ms) | Memory Usage (GB) | max_batch_size | max_num_tokens | tp_size | pp_size | kv_cache_fraction | use_cuda_graph | cuda_graph_padding | cuda_graph_batch_sizes | enable_overlap_scheduler | Run Command |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trial-7 | 9872.3 | 120.5 | 145.2 | 178.9 | 28.4 | 256 | 2048 | 8 | 1 | 0.95 | true | true | [1,2,4,8,16,32,64,128,256] | true | trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8000 --backend pytorch --max_batch_size 256 --max_num_tokens 2048 --tp_size 8 --pp_size 1 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./config_trial7.yml |
The final column provides the exact command to run the best-performing configuration.
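If you would rather pick the winner programmatically than scan the HTML by eye, pandas.read_html can parse the report's table into a DataFrame. The column name below mirrors the example header above and is an assumption about the report format, as is the existence of a single table in summary.html.

import pandas as pd

# Parse the (assumed) results table out of the HTML report.
summary = pd.read_html("./sweep_results/summary.html")[0]

# Rank trials by throughput, highest first, and show the top three.
best = summary.sort_values("Throughput (tokens/s)", ascending=False)
print(best.head(3).to_string(index=False))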
- Extract the configuration YAML for the best trial:
cat ./sweep_results/trials/trial-7/extra-llm-api-config.yml
Example output:
pytorch_backend_config:
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
  print_iter_log: true
  enable_overlap_scheduler: true
  enable_attention_dp: true
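Before launching the server, a quick check that the YAML parses and contains the expected knobs can save a failed start. A minimal sketch assuming PyYAML is installed; the path is the trial-7 file shown above.

import yaml

CONFIG_PATH = "./sweep_results/trials/trial-7/extra-llm-api-config.yml"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# List the PyTorch backend settings the sweep selected for this trial.
for key, value in config["pytorch_backend_config"].items():
    print(f"{key}: {value}")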
- Run the model with the optimal configuration:
trtllm-serve \
deepseek-ai/DeepSeek-V3 \
--host localhost \
--port 8000 \
--backend pytorch \
--max_batch_size 256 \
--max_num_tokens 2048 \
--tp_size 8 \
--pp_size 1 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--extra_llm_api_options ./sweep_results/trials/trial-7/extra-llm-api-config.yml
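Once the server reports ready, you can smoke-test it with a single request. trtllm-serve exposes an OpenAI-compatible HTTP API, so a standard chat-completion call against the host and port above should work; the prompt and generation settings here are placeholders.

import requests

# Send one chat completion to the OpenAI-compatible endpoint served above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])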
You've successfully:
- Created a PyTorch backend sweep configuration
- Submitted and monitored a performance sweep
- Retrieved and analyzed the results
- Identified and deployed the optimal configuration for your model
This automated approach takes the manual trial and error out of finding the best-performing runtime configuration for your TRT LLM PyTorch deployment.