FAKE TensorRT-Cloud PyTorch Sweep Quickstart Guide

This guide walks you through using TensorRT-Cloud to run a performance sweep with the TRT LLM PyTorch backend.

--> THIS GUIDE IS PSEUDOCODE AND JUST A PRODUCT MANAGER'S SUGGESTION. WE HAVE NOT YET BUILT THIS FEATURE <--

Overview

Unlike the C++ backend, which uses ahead-of-time (AoT) compilation, the TRT LLM PyTorch backend uses just-in-time (JIT) compilation. This means we'll be configuring runtime parameters rather than build parameters for the sweep.

Example Sweep Configuration

Create a pytorch_sweep_spec.json file with the following content:

{
    "sweep_config": {
        "model_inputs": [
            {
                "source": {
                    "id": "deepseek-ai/DeepSeek-V3",
                    "revision": "main",
                    "source_type": "huggingface_repo",
                    "token": "*********"
                },
                "type": "huggingface_checkpoint"
            }
        ],
        "trtllm_runtime": {
            "backend": "pytorch",
            "max_batch_size": [64, 128, 256, 384],
            "max_num_tokens": [1024, 2048, 4096],
            "tp_size": [1, 2, 4, 8],
            "pp_size": [1],
            "kv_cache_free_gpu_memory_fraction": [0.8, 0.9, 0.95],
            "pytorch_config": {
                "use_cuda_graph": [true, false],
                "cuda_graph_padding_enabled": [true, false],
                "cuda_graph_batch_sizes": [
                    [1, 2, 4, 8, 16, 32, 64, 128],
                    [1, 2, 4, 8, 16, 32, 64, 128, 256]
                ],
                "print_iter_log": [true],
                "enable_overlap_scheduler": [true, false],
                "enable_attention_dp": [true, false]
            }
        },
        "hardware": {
            "gpu": "H100"
        },
        "search_strategy": {
            "batch_size": 2,
            "max_trials": 10,
            "name": "grid",
            "optimization_objective": "throughput"
        },
        "benchmark": {
            "perf_configs": [
                {
                    "requests_config": {
                        "concurrency": 500,
                        "input_tokens_mean": 1000,
                        "output_tokens_mean": 1000
                    }
                }
            ]
        }
    }
}

This configuration will sweep across different values for:

  • max_batch_size
  • max_num_tokens
  • tp_size
  • kv_cache_free_gpu_memory_fraction
  • PyTorch-specific settings like use_cuda_graph, cuda_graph_batch_sizes, etc.
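Before submitting, it can help to gauge how large this grid is relative to the max_trials cap in search_strategy. The sketch below is plain Python with no TensorRT-Cloud dependency; it simply loads the spec file defined above and counts the Cartesian product of all list-valued runtime options.

import json
from math import prod

# Load the sweep spec written above.
with open("pytorch_sweep_spec.json") as f:
    spec = json.load(f)["sweep_config"]

runtime = dict(spec["trtllm_runtime"])
pytorch_cfg = runtime.pop("pytorch_config")

# Every list-valued field contributes one axis to the grid.
axes = [v for v in runtime.values() if isinstance(v, list)]
axes += [v for v in pytorch_cfg.values() if isinstance(v, list)]

grid_size = prod(len(axis) for axis in axes)
max_trials = spec["search_strategy"]["max_trials"]

print(f"Full grid: {grid_size} combinations")
print(f"Capped at max_trials = {max_trials}")

With the example spec above this reports a full grid of 4,608 combinations, of which the grid search would only evaluate the first max_trials = 10, so you may want to trim the candidate lists or raise the cap.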

Running the Sweep

Use the TensorRT-Cloud CLI to submit your sweep:

trtc sweep submit --config pytorch_sweep_spec.json

This will output a sweep ID you can use to check status:

Sweep submitted successfully!
Sweep ID: sweep-20240701-123456
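If you want to script the workflow end to end, a thin wrapper around the CLI can capture the sweep ID for later steps. This is only a sketch: it assumes the hypothetical trtc commands shown in this guide and that the ID appears on a "Sweep ID:" line exactly as in the sample output above.

import re
import subprocess

# Submit the sweep and capture stdout (assumes the hypothetical `trtc` CLI).
result = subprocess.run(
    ["trtc", "sweep", "submit", "--config", "pytorch_sweep_spec.json"],
    capture_output=True, text=True, check=True,
)

# Pull the ID out of a line like "Sweep ID: sweep-20240701-123456".
match = re.search(r"Sweep ID:\s*(\S+)", result.stdout)
sweep_id = match.group(1) if match else None
print(f"Submitted sweep: {sweep_id}")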

Monitoring Progress

Check the status of your sweep using:

trtc sweep get --id sweep-20240701-123456

For continuous monitoring:

trtc sweep watch --id sweep-20240701-123456
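For unattended runs you can poll trtc sweep get instead of keeping watch open. The loop below is a sketch only: the "RUNNING" status keyword is an assumption, since this feature and its output format do not exist yet.

import subprocess
import time

def wait_for_sweep(sweep_id: str, poll_seconds: int = 60) -> str:
    """Poll the sweep until its output no longer reports it as running."""
    while True:
        out = subprocess.run(
            ["trtc", "sweep", "get", "--id", sweep_id],
            capture_output=True, text=True, check=True,
        ).stdout
        if "RUNNING" not in out:  # assumed status keyword
            return out
        time.sleep(poll_seconds)

final_status = wait_for_sweep("sweep-20240701-123456")
print(final_status)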

Viewing Results

Once complete, download the results:

trtc sweep download --id sweep-20240701-123456 --output ./sweep_results

Navigate to the results directory:

cd ./sweep_results

Analyzing the Summary HTML

Open summary.html in your browser. The file will contain a table of all configurations tested, with metrics like:

  • Throughput (tokens/sec)
  • Latency (ms)
  • Memory usage
  • Configuration parameters used

Example summary table row:

Trial ID: trial-7
Throughput (tokens/sec): 9872.3
Latency P50 (ms): 120.5
Latency P90 (ms): 145.2
Latency P99 (ms): 178.9
Memory Usage: 28.4 GB
max_batch_size: 256
max_num_tokens: 2048
tp_size: 8
pp_size: 1
kv_cache_fraction: 0.95
use_cuda_graph: true
cuda_graph_padding: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
enable_overlap_scheduler: true
Run Command: trtllm-serve deepseek-ai/DeepSeek-V3 --host localhost --port 8000 --backend pytorch --max_batch_size 256 --max_num_tokens 2048 --tp_size 8 --pp_size 1 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./config_trial7.yml

The Run Command column provides the exact command to launch the best-performing configuration.
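If you prefer to analyze the results programmatically rather than in the browser, pandas can read the table straight out of summary.html. The column names used below ("Throughput", "Trial ID", "Run Command") are taken from the example row above and are assumptions about the report layout.

import pandas as pd

# summary.html is assumed to contain a single results table (see the example row above).
tables = pd.read_html("./sweep_results/summary.html")
results = tables[0]

# Sort by the assumed "Throughput" column and show the best trial's run command.
best = results.sort_values("Throughput", ascending=False).iloc[0]
print("Best trial:", best["Trial ID"])
print("Run command:", best["Run Command"])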

Using the Best Configuration

  1. Extract the configuration YAML for the best trial:
cat ./sweep_results/trials/trial-7/extra-llm-api-config.yml

Example output:

pytorch_backend_config:
    use_cuda_graph: true
    cuda_graph_padding_enabled: true
    cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    print_iter_log: true
    enable_overlap_scheduler: true
    enable_attention_dp: true
  2. Run the model with the optimal configuration:
trtllm-serve \
  deepseek-ai/DeepSeek-V3 \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 256 \
  --max_num_tokens 2048 \
  --tp_size 8 \
  --pp_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.95 \
  --extra_llm_api_options ./sweep_results/trials/trial-7/extra-llm-api-config.yml
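Once the server is up, you can smoke-test it over HTTP. The sketch below assumes trtllm-serve exposes an OpenAI-compatible completions endpoint at the host and port configured above; adjust the route and payload to match your TRT LLM version.

import requests

# Assumes an OpenAI-compatible completions endpoint at the configured host/port.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",
        "prompt": "Summarize what a performance sweep does in one sentence.",
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])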

Conclusion

You've successfully:

  1. Created a PyTorch backend sweep configuration
  2. Submitted and monitored a performance sweep
  3. Retrieved and analyzed the results
  4. Identified and deployed the optimal configuration for your model

This automated approach helps you find the best-performing runtime configuration for your TRT LLM PyTorch deployment.
