This guide walks you through using TensorRT-Cloud to run a performance sweep with the TRT LLM PyTorch backend.
--> NOTE: THIS GUIDE IS PSEUDOCODE AND JUST A PRODUCT MANAGER'S SUGGESTION. WE HAVE NOT YET BUILT THIS FEATURE. <--
Unlike the C++ backend, which uses ahead-of-time (AoT) compilation, the TRT LLM PyTorch backend uses just-in-time (JIT) compilation. This means we'll be configuring runtime parameters rather than engine-build parameters for our performance sweep.
Create a pytorch_sweep_spec.json file with the following content:
{
  "sweep_config": {
    "model_inputs": [
      {
        "source": {
          "id": "deepseek-ai/DeepSeek-V3",
          "revision": "main",
          "source_type": "huggingface_repo",
          "token": "*********"
        },
        "type": "huggingface_checkpoint"
      }
    ],
    "trtllm_runtime": {
      "backend": "pytorch",
      "max_batch_size": [64, 128, 256, 384],
      "max_num_tokens": [1024, 2048, 4096],
      "tp_size": [1, 2, 4, 8],
      "pp_size": [1],
      "kv_cache_free_gpu_memory_fraction": [0.8, 0.9, 0.95],
      "pytorch_config": {
        "use_cuda_graph": [true, false],
        "cuda_graph_padding_enabled": [true, false],
        "cuda_graph_batch_sizes": [
          [1, 2, 4, 8, 16, 32, 64, 128],
          [1, 2, 4, 8, 16, 32, 64, 128, 256]
        ],
        "print_iter_log": [true],
        "enable_overlap_scheduler": [true, false],
        "enable_attention_dp": [true, false]
      }
    },
    "hardware": {
      "gpu": "H100"
    },
    "search_strategy": {
      "batch_size": 2,
      "max_trials": 10,
      "name": "grid",
      "optimization_objective": "throughput"
    },
    "benchmark": {
      "perf_configs": [
        {
          "requests_config": {
            "concurrency": 500,
            "input_tokens_mean": 1000,
            "output_tokens_mean": 1000
          }
        }
      ]
    }
  }
}
This configuration will sweep across different values for:
- max_batch_size
- max_num_tokens
- tp_size
- kv_cache_free_gpu_memory_fraction
- PyTorch-specific settings such as use_cuda_graph, cuda_graph_batch_sizes, etc. (the sketch after this list shows how large the resulting grid is)
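Because max_trials is capped at 10 while the grid above is much larger, it can be worth counting how many combinations a full grid would contain before you submit. The sketch below is plain Python over the JSON file we just wrote; the sweep-spec schema itself is speculative, as noted at the top of this guide.

import json

# Load the sweep spec written above (schema is speculative; see the note at the top).
with open("pytorch_sweep_spec.json") as f:
    runtime = json.load(f)["sweep_config"]["trtllm_runtime"]

# Every list-valued field is a swept axis; pytorch_config nests more of them.
axes = {k: v for k, v in runtime.items() if isinstance(v, list)}
axes.update({k: v for k, v in runtime["pytorch_config"].items() if isinstance(v, list)})

total = 1
for name, values in axes.items():
    total *= len(values)
    print(f"{name}: {len(values)} candidate value(s)")

print(f"Full grid: {total} combinations (max_trials caps how many actually run)")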
Use the TensorRT-Cloud CLI to submit your sweep:
trtc sweep submit --config pytorch_sweep_spec.json
This will output a sweep ID you can use to check status:
Sweep submitted successfully!
Sweep ID: sweep-20240701-123456
Check the status of your sweep using:
trtc sweep get --id sweep-20240701-123456
For continuous monitoring:
trtc sweep watch --id sweep-20240701-123456
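If you want to script the monitoring step instead of keeping a terminal open on trtc sweep watch, a small polling loop around trtc sweep get works too. This is only a sketch: the trtc CLI and its output format are hypothetical (the feature is not built yet), so the completion check below is a placeholder to adapt.

import subprocess
import time

SWEEP_ID = "sweep-20240701-123456"  # from the submit output above

while True:
    # Ask the (hypothetical) CLI for the current sweep status.
    result = subprocess.run(
        ["trtc", "sweep", "get", "--id", SWEEP_ID],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())

    # Placeholder check; adjust to whatever status strings the real CLI prints.
    if "COMPLETED" in result.stdout or "FAILED" in result.stdout:
        break
    time.sleep(60)  # poll once a minute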
Once complete, download the results:
trtc sweep download --id sweep-20240701-123456 --output ./sweep_results
Navigate to the results directory:
cd ./sweep_results
Open summary.html in your browser. The file will contain a table of all configurations tested, with metrics like:
- Throughput (tokens/sec)
- Latency (ms)
- Memory usage
- Configuration parameters used
Example summary table row:
Trial ID | Throughput (tokens/s) | Latency P50 (ms) | Latency P90 (ms) | Latency P99 (ms) | Memory Usage (GB) | max_batch_size | max_num_tokens | tp_size | pp_size | kv_cache_fraction | use_cuda_graph | cuda_graph_padding | cuda_graph_batch_sizes | enable_overlap_scheduler | Run Command |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trial-7 | 9872.3 | 120.5 | 145.2 | 178.9 | 28.4 | 256 | 2048 | 8 | 1 | 0.95 | true | true | [1,2,4,8,16,32,64,128,256] | true | trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8000 --backend pytorch --max_batch_size 256 --max_num_tokens 2048 --tp_size 8 --pp_size 1 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./config_trial7.yml |
The final column provides the exact command to run the best-performing configuration.
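If you would rather pick the winner programmatically than scan the HTML by eye, pandas.read_html can parse the report's table into a DataFrame. The column name below mirrors the example header above and is an assumption about the report format, as is the existence of a single table in summary.html.

import pandas as pd

# Parse the (assumed) results table out of the HTML report.
summary = pd.read_html("./sweep_results/summary.html")[0]

# Rank trials by throughput, highest first, and show the top three.
best = summary.sort_values("Throughput (tokens/s)", ascending=False)
print(best.head(3).to_string(index=False))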
- Extract the configuration YAML for the best trial:
cat ./sweep_results/trials/trial-7/extra-llm-api-config.yml
Example output:
pytorch_backend_config:
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
  print_iter_log: true
  enable_overlap_scheduler: true
  enable_attention_dp: true
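Before launching the server, a quick check that the YAML parses and contains the expected knobs can save a failed start. A minimal sketch assuming PyYAML is installed; the path is the trial-7 file shown above.

import yaml

CONFIG_PATH = "./sweep_results/trials/trial-7/extra-llm-api-config.yml"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# List the PyTorch backend settings the sweep selected for this trial.
for key, value in config["pytorch_backend_config"].items():
    print(f"{key}: {value}")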
- Run the model with the optimal configuration:
trtllm-serve \
deepseek-ai/DeepSeek-V3 \
--host localhost \
--port 8000 \
--backend pytorch \
--max_batch_size 256 \
--max_num_tokens 2048 \
--tp_size 8 \
--pp_size 1 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--extra_llm_api_options ./sweep_results/trials/trial-7/extra-llm-api-config.yml
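Once the server reports ready, you can smoke-test it with a single request. trtllm-serve exposes an OpenAI-compatible HTTP API, so a standard chat-completion call against the host and port above should work; the prompt and generation settings here are placeholders.

import requests

# Send one chat completion to the OpenAI-compatible endpoint served above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])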
You've successfully:
- Created a PyTorch backend sweep configuration
- Submitted and monitored a performance sweep
- Retrieved and analyzed the results
- Identified and deployed the optimal configuration for your model
This automated approach takes the manual trial and error out of finding the best-performing runtime configuration for your TRT LLM PyTorch deployment.