vLLM Batch

A high-performance batch inference library for Large Language Models (LLMs) powered by vLLM and Ray.

Overview

vLLM Batch enables efficient, large-scale batch processing of LLM inference workloads with:

  • High Throughput: Optimized performance using vLLM's paged attention and continuous batching
  • Distributed Processing: Scale across multiple GPUs and machines using Ray Data
  • Smart Batching: Automatic request batching and scheduling for optimal throughput
  • Prefix Caching: Memory-efficient processing of prompts with shared prefixes
  • Flexible Workloads: Support for raw prompts, chat completions, and custom templates

Quick Start

import ray
from vllm_batch import BatchProcessor, LLMConfig, PromptConfig

# Configure the processor with the model and number of GPU workers
processor = BatchProcessor(
    model_config=LLMConfig(
        model_id="meta-llama/Llama-2-70b-chat-hf",
        num_workers=4
    )
)

# Read prompts from Parquet and run batch inference over the "question" column
ds = ray.data.read_parquet("s3://my-bucket/questions.parquet")
ds = processor.process(
    ds,
    input_column="question",
    output_column="answer"
)
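The returned dataset is a regular Ray Data Dataset, so results can be inspected or persisted with Ray Data's usual calls (write_parquet and take are part of the Ray Data API; the output path below is a placeholder):

# Persist the generated answers alongside the original questions
ds.write_parquet("s3://my-bucket/answers/")

# Or pull a few rows locally for a quick sanity check
print(ds.take(3))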

Advanced Usage

Shared Prefix Optimization

Optimize throughput for datasets with common prefixes:

from vllm_batch import SharedPrefixConfig

# Process dataset with shared prefix
ds = processor.process(
    ds,
    prompt_config=SharedPrefixConfig(
        prefix="You are a math tutor. Solve step by step:\n\n",
        input_column="problem",
        output_column="solution"
    )
)
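For intuition, here is a minimal sketch of how each row's prompt is assembled under this configuration. The exact composition logic inside SharedPrefixConfig is an assumption; the point is that every prompt starts with the same token sequence, so vLLM's prefix caching can compute the prefix's KV cache once and reuse it across requests:

# Hypothetical illustration of prompt assembly for a single row
# (assumes SharedPrefixConfig simply prepends the prefix to the input column)
prefix = "You are a math tutor. Solve step by step:\n\n"
row = {"problem": "What is 12 * 13?"}

prompt = prefix + row["problem"]
# -> "You are a math tutor. Solve step by step:\n\nWhat is 12 * 13?"
# The shared leading tokens are processed once and reused across requests.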

Chat Completions

Process chat-style datasets:

from vllm_batch import ChatConfig

# Process chat dataset
ds = processor.process(
    ds,
    prompt_config=ChatConfig(
        system_prompt="You are a helpful customer service agent.",
        template="Customer: {input}\nAgent:",
        input_column="query",
        output_column="response"
    )
)
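For quick experiments without a Parquet file, the same pipeline works against an in-memory dataset built with Ray Data's from_items (a standard Ray API); the queries and model choice below are illustrative:

import ray
from vllm_batch import BatchProcessor, LLMConfig, ChatConfig

processor = BatchProcessor(
    model_config=LLMConfig(model_id="meta-llama/Llama-2-70b-chat-hf", num_workers=4)
)

# Small in-memory dataset with a "query" column
ds = ray.data.from_items([
    {"query": "Where is my order?"},
    {"query": "How do I reset my password?"},
])

ds = processor.process(
    ds,
    prompt_config=ChatConfig(
        system_prompt="You are a helpful customer service agent.",
        template="Customer: {input}\nAgent:",
        input_column="query",
        output_column="response",
    )
)

print(ds.take(2))  # inspect the generated "response" column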

Configuration Reference

LLMConfig

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class LLMConfig:
    model_id: str                     # e.g. "meta-llama/Llama-2-70b-chat-hf"
    num_workers: int = 1              # number of Ray workers (model replicas)
    tensor_parallel_size: int = 1     # GPUs per worker (tensor parallelism)
    sampling_params: Dict[str, Any] = field(default_factory=lambda: {
        "temperature": 0.7,
        "max_tokens": 512,
    })
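A minimal sketch of overriding the defaults, using the field names from the dataclass above (the specific values are illustrative, not recommendations):

from vllm_batch import LLMConfig

# Greedy decoding with a larger generation budget
config = LLMConfig(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    sampling_params={"temperature": 0.0, "max_tokens": 1024},
)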

Prompt Configurations

@dataclass
class SharedPrefixConfig:
    prefix: str
    input_column: str
    output_column: str

@dataclass
class ChatConfig:
    system_prompt: str
    template: str
    input_column: str
    output_column: str

Performance Tips

  1. Shared Prefix Optimization

    • Use SharedPrefixConfig when many prompts share the same prefix
    • Especially useful for instruction tuning or consistent role prompts
  2. GPU Utilization

    • Use tensor_parallel_size for large models
    • Match num_workers to available GPUs (see the sketch after this list)
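For example, on a single node with 8 GPUs, one plausible split (the numbers are illustrative, not benchmarked) is four workers, each holding a 2-way tensor-parallel shard of the model:

from vllm_batch import BatchProcessor, LLMConfig

# 8 GPUs total: 4 workers x 2-way tensor parallelism
processor = BatchProcessor(
    model_config=LLMConfig(
        model_id="meta-llama/Llama-2-70b-chat-hf",
        num_workers=4,            # one Ray worker per model replica
        tensor_parallel_size=2,   # GPUs per replica
    )
)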

Requirements

  • Ray >= 2.6.0
  • vLLM >= 0.2.0
  • Python >= 3.8
  • CUDA >= 11.8

License

MIT
