A high-performance batch inference library for Large Language Models (LLMs) powered by vLLM and Ray.
vLLM-Batch enables efficient, large-scale batch processing of LLM inference workloads with:
- High Throughput: Optimized performance using vLLM's paged attention and continuous batching
- Distributed Processing: Scale across multiple GPUs and machines using Ray Data
- Smart Batching: Automatic request batching and scheduling for optimal throughput
- Prefix Caching: Memory-efficient processing of prompts with shared prefixes
- Flexible Workloads: Support for raw prompts, chat completions, and custom templates
import ray

from vllm_batch import BatchProcessor, LLMConfig

# Basic usage: configure the model and run batch inference over a Ray Dataset
processor = BatchProcessor(
    model_config=LLMConfig(
        model_id="meta-llama/Llama-2-70b-chat-hf",
        num_workers=4,
    )
)

# Read the input dataset and generate an answer for each question
ds = ray.data.read_parquet("s3://my-bucket/questions.parquet")
ds = processor.process(
    ds,
    input_column="question",
    output_column="answer",
)
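The result of processor.process is assigned back to ds above; assuming it is a Ray Dataset with the answer column added, results can be written out or sampled with standard Ray Data calls (the destination path below is illustrative):

# Write the answered dataset back to storage (illustrative destination path)
ds.write_parquet("s3://my-bucket/answers/")

# Or inspect a few rows locally
for row in ds.take(3):
    print(row["question"], "->", row["answer"])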
Optimize throughput for datasets with common prefixes:
from vllm_batch import SharedPrefixConfig

# Process a dataset with a shared prefix
ds = processor.process(
    ds,
    prompt_config=SharedPrefixConfig(
        prefix="You are a math tutor. Solve step by step:\n\n",
        input_column="problem",
        output_column="solution",
    )
)
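As a rough mental model (an assumption about behavior, not the library's documented internals), each prompt is the prefix concatenated with that row's input column, which is exactly the pattern vLLM's prefix caching can exploit:

# Rough sketch of the prompt assembled for one row (assumed behavior)
prefix = "You are a math tutor. Solve step by step:\n\n"
row = {"problem": "What is 12 * 7?"}
prompt = prefix + row["problem"]
# Every prompt starts with the identical prefix, so its KV cache can be
# computed once and reused across requests by vLLM's prefix caching.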
Process chat-style datasets:
from vllm_batch import ChatConfig

# Process a chat dataset
ds = processor.process(
    ds,
    prompt_config=ChatConfig(
        system_prompt="You are a helpful customer service agent.",
        template="Customer: {input}\nAgent:",
        input_column="query",
        output_column="response",
    )
)
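For chat-style rows, the template's {input} placeholder is filled from the input column; a hypothetical rendering of a single prompt (how it is combined with the system prompt is assumed, not specified by the API above):

# Hypothetical rendering of one chat prompt
system_prompt = "You are a helpful customer service agent."
template = "Customer: {input}\nAgent:"
row = {"query": "Where is my order?"}
prompt = system_prompt + "\n\n" + template.format(input=row["query"])
# -> "You are a helpful customer service agent.\n\n
#     Customer: Where is my order?\nAgent:"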
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class LLMConfig:
    model_id: str
    num_workers: int = 1
    tensor_parallel_size: int = 1
    sampling_params: Dict[str, Any] = field(default_factory=lambda: {
        "temperature": 0.7,
        "max_tokens": 512,
    })

@dataclass
class SharedPrefixConfig:
    prefix: str
    input_column: str
    output_column: str

@dataclass
class ChatConfig:
    system_prompt: str
    template: str
    input_column: str
    output_column: str
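Putting the configuration options together, a sketch of a non-default setup; the GPU split and sampling values are illustrative choices, not recommendations from the library:

# Example: shard a 70B model across 4 GPUs per worker, run 2 workers,
# and decode greedily with a shorter output budget
config = LLMConfig(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    num_workers=2,
    tensor_parallel_size=4,
    sampling_params={
        "temperature": 0.0,  # greedy decoding
        "max_tokens": 256,
    },
)
processor = BatchProcessor(model_config=config)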
- Shared Prefix Optimization
  - Use SharedPrefixConfig when many prompts share the same prefix
  - Especially useful for instruction tuning or consistent role prompts
- GPU Utilization
  - Use tensor_parallel_size for large models
  - Match num_workers to available GPUs (see the sketch after this list)
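A rough sizing sketch for the GPU-utilization tip above, assuming one vLLM engine per worker and tensor_parallel_size GPUs per engine (this heuristic is an assumption, not part of the library):

import ray

ray.init()

tensor_parallel_size = 4  # e.g. a large model sharded across 4 GPUs
total_gpus = int(ray.cluster_resources().get("GPU", 0))

# One worker per group of tensor_parallel_size GPUs
num_workers = max(1, total_gpus // tensor_parallel_size)

config = LLMConfig(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    num_workers=num_workers,
    tensor_parallel_size=tensor_parallel_size,
)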
- Ray >= 2.6.0
- vLLM >= 0.2.0
- Python >= 3.8
- CUDA >= 11.8
License: MIT