High-performance batch inference for large language models, powered by Ray Data.
Ray Data LLM provides an efficient, scalable solution for batch processing LLM inference workloads with:
- High Throughput: Optimized performance using vLLM's paged attention and continuous batching
- Distributed Processing: Scale across multiple GPUs and machines using Ray Data
- Smart Batching: Automatic request batching and scheduling for optimal throughput
- Prefix Caching: Memory-efficient processing of prompts with shared prefixes
- Flexible Workloads: Support for raw prompts, chat completions, and multimodal inputs
- LoRA Support: Dynamic loading of LoRA adapters from HuggingFace, local paths, or S3
```python
import ray
from ray.data.llm import LLMConfig, SamplingParams, VLLMBatchInferencer

# Initialize the processor with model/infrastructure config
processor = VLLMBatchInferencer(
    model_config=LLMConfig(
        model_id="meta-llama/Llama-2-70b-chat-hf",
        num_workers=4,
        tensor_parallel_size=2,
        gpu_memory_utilization=0.95,
        enable_prefix_caching=True,
    )
)

# Process the dataset
ds = ray.data.read_parquet("s3://my-bucket/questions.parquet")
ds = processor.transform(
    ds,
    input_column="question",
    output_column="answer",
    sampling_params=SamplingParams(
        temperature=0.7,
        max_tokens=512,
        top_p=0.95,
    ),
)
```
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LLMConfig:
    """Configuration for LLM model and infrastructure."""
    model_id: str
    num_workers: int = 1
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    gpu_memory_utilization: float = 0.95
    enable_prefix_caching: bool = False
    enforce_eager: bool = False
    lora_config: Optional[LoRAConfig] = None

@dataclass
class SamplingParams:
    """Parameters for controlling text generation."""
    temperature: float = 1.0
    max_tokens: int = 512
    top_p: float = 1.0
    top_k: Optional[int] = None
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    stop: Optional[List[str]] = None
    ignore_eos: bool = False
```
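For intuition about `top_p`: nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold, then renormalizes and samples from that set. A minimal pure-Python sketch of the filtering step (illustrative only, not vLLM's actual implementation):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return the renormalized distribution."""
    # Rank token probabilities from highest to lowest.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        total += p
        if total >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / total for token, p in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(nucleus_filter(probs, top_p=0.8))  # keeps only "the" and "a"
```

Lower `top_p` truncates the tail more aggressively; `top_p=1.0` (the default) disables the filter.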
Process chat-style conversations:
```python
from ray.data.llm import ChatConfig

ds = processor.transform(
    ds,
    input_column="query",
    output_column="response",
    prompt_config=ChatConfig(
        system_prompt="You are a helpful customer service agent.",
        template="Customer: {input}\nAgent:",
    ),
    sampling_params=SamplingParams(
        temperature=0.9,
        max_tokens=1024,
    ),
)
```
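Conceptually, a chat config like this interpolates each row's input into the template after the system prompt. A rough sketch of how the final prompt string could be assembled (the library's exact formatting may differ; `render_chat_prompt` is a hypothetical helper):

```python
def render_chat_prompt(system_prompt, template, user_input):
    # Prepend the system prompt, then fill the {input} placeholder.
    return f"{system_prompt}\n\n" + template.format(input=user_input)

prompt = render_chat_prompt(
    system_prompt="You are a helpful customer service agent.",
    template="Customer: {input}\nAgent:",
    user_input="Where is my order?",
)
print(prompt)
```

The prompt ends at `Agent:`, so the model's completion becomes the `response` column.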
Optimize throughput for prompts with common prefixes:
```python
from ray.data.llm import SharedPrefixConfig

ds = processor.transform(
    ds,
    input_column="problem",
    output_column="solution",
    prompt_config=SharedPrefixConfig(
        prefix="Solve step by step:\n\n"
    ),
    sampling_params=SamplingParams(max_tokens=512),
)
```
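To see why this helps: with prefix caching, the shared prefix is prefilled once and its KV cache is reused for every row, so prefill work scales with the suffixes only. A rough, illustrative estimate of the savings (whitespace "tokens" assumed purely for simplicity):

```python
def prefill_token_counts(prefix, prompts):
    """Compare prefill work with and without a cached shared prefix.
    Whitespace splitting stands in for real tokenization."""
    n_prefix = len(prefix.split())
    # Without caching, the prefix is re-prefilled for every prompt.
    without_cache = sum(n_prefix + len(p.split()) for p in prompts)
    # With caching, the prefix is prefilled once; suffixes every time.
    with_cache = n_prefix + sum(len(p.split()) for p in prompts)
    return without_cache, with_cache

prefix = "Solve step by step:"
prompts = ["2 + 2", "10 / 5", "3 * 7"]
print(prefill_token_counts(prefix, prompts))  # → (21, 13)
```

The longer the shared prefix relative to the per-row suffix, the larger the win.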
Process datasets with images (requires compatible models):
```python
from ray.data.llm import MultimodalConfig

ds = processor.transform(
    ds,
    input_column="image_path",
    output_column="description",
    prompt_config=MultimodalConfig(
        template="Describe this image in detail:\n{image}",
        image_size=(512, 512),  # optional resizing
    ),
    sampling_params=SamplingParams(max_tokens=256),
)
```
Enable fault-tolerant processing with checkpointing:
```python
from ray.data.llm import CheckpointConfig

ds = processor.transform(
    ds,
    input_column="article",
    output_column="summary",
    sampling_params=SamplingParams(max_tokens=200),
    checkpoint_config=CheckpointConfig(
        path="s3://my-bucket/checkpoints/cnn-summary",
        checkpoint_frequency=100,
        resume_from_checkpoint=True,
        cleanup_checkpoint=True,
    ),
)
```
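The behavior can be pictured as periodically persisting the set of completed row IDs so a restarted job skips finished work. An illustrative local-file sketch of that pattern (not the library's actual on-disk format):

```python
import json
import os
import tempfile

def process_with_checkpoint(rows, path, checkpoint_frequency=100):
    """Process rows, persisting completed IDs every `checkpoint_frequency`
    rows so a rerun resumes where the last run stopped."""
    done = set()
    if os.path.exists(path):  # resume_from_checkpoint behavior
        with open(path) as f:
            done = set(json.load(f))
    results = []
    for i, row in enumerate(rows, start=1):
        if row["id"] in done:
            continue  # completed in an earlier run
        results.append({"id": row["id"], "summary": row["article"][:20]})
        done.add(row["id"])
        if i % checkpoint_frequency == 0:
            with open(path, "w") as f:
                json.dump(sorted(done), f)
    with open(path, "w") as f:  # final checkpoint
        json.dump(sorted(done), f)
    return results

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
rows = [{"id": i, "article": f"article {i}"} for i in range(5)]
first_run = process_with_checkpoint(rows, path, checkpoint_frequency=2)
second_run = process_with_checkpoint(rows, path)  # resumes: nothing left
print(len(first_run), len(second_run))  # → 5 0
```

A lower `checkpoint_frequency` loses less work on failure but writes checkpoints more often.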
- Adjust `tensor_parallel_size` based on model size and GPU memory
- Set `num_workers` based on the available GPU count
- Use `pipeline_parallel_size` for very large models
- Monitor `gpu_memory_utilization` (default 0.95)
- Enable `prefix_caching` for workloads with common prefixes
- Set `enforce_eager=True` for more predictable latency
- Adjust `max_tokens` based on your use case
- Use checkpointing for fault tolerance on long-running jobs
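The GPU sizing tips above reduce to simple arithmetic: assuming each worker holds one model replica sharded across `tensor_parallel_size * pipeline_parallel_size` GPUs (the usual vLLM layout), the number of workers that fit is a floor division:

```python
def max_workers(total_gpus, tensor_parallel_size=1, pipeline_parallel_size=1):
    """Each worker pins one replica spread over tp * pp GPUs, so the
    worker count is however many such groups fit in the cluster."""
    gpus_per_worker = tensor_parallel_size * pipeline_parallel_size
    return total_gpus // gpus_per_worker

print(max_workers(8, tensor_parallel_size=2))  # → 4
```

This matches the quick-start config: `num_workers=4` with `tensor_parallel_size=2` occupies 8 GPUs.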
- Ray >= 2.6.0
- vLLM >= 0.2.0
- Python >= 3.8
- CUDA >= 11.8
All configuration classes provide complete type hints for better IDE support:
```python
from ray.data.llm import (
    LLMConfig,
    SamplingParams,
    ChatConfig,
    SharedPrefixConfig,
    MultimodalConfig,
    CheckpointConfig,
    LoRAConfig,
)
```
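The LoRA feature above mentions adapters loaded from HuggingFace, local paths, or S3; the actual `LoRAConfig` resolution logic is not documented here, but classifying an adapter reference into those three sources might look like this hypothetical sketch:

```python
import os

def classify_adapter_source(ref):
    """Classify a LoRA adapter reference as 's3', 'local', or
    'huggingface'. Illustrative only; mirrors the three sources
    named in the feature list."""
    if ref.startswith("s3://"):
        return "s3"
    if os.path.isabs(ref) or ref.startswith(("./", "../")):
        return "local"
    return "huggingface"  # e.g. "org/adapter-name" repo IDs

print(classify_adapter_source("s3://bucket/adapters/support"))  # → s3
```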
- Checkpointing support requires Ray Turbo 2.39+ and is Anyscale-only
- Image processing may require additional GPU memory; leave headroom accordingly
- Some vLLM features may not be available with certain attention backends
MIT