Here is the full set of supported engine configuration fields:

```yaml
model_id: <HF model ID or local model path>
llm_engine: vllm
accelerator_type: <GPU type>
engine_kwargs:
  <vLLM engine kwargs>
lora_config:
  dynamic_lora_loading_path: <HF LoRA model ID, local path, or S3 path>
runtime_env:
  env_vars:
    <vLLM environment variables>
    <Anyscale inference engine environment variables>
    LOCAL_LORA_ROOT: <optional local path for LoRA models>
```
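As a concrete illustration, a filled-in config for the 70B FP8-on-L40S setup used in the examples below might look like the following. This is a hypothetical sketch: the `model_id`, `engine_kwargs` values, and environment variable are assumptions for illustration, not the contents of the shipped `examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml`.

```yaml
# Hypothetical example config; all field values are illustrative.
model_id: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8  # assumed HF model ID
llm_engine: vllm
accelerator_type: L40S
engine_kwargs:
  tensor_parallel_size: 2  # shard across 2x L40S, as in the examples below
  max_model_len: 4096      # assumed; tune to your workload's prompt lengths
runtime_env:
  env_vars:
    VLLM_ATTENTION_BACKEND: FLASH_ATTN  # assumed vLLM env var override
```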
The following example uses Llama-3.1-70B-Instruct-FP8 on 2x L40S GPUs to batch process the CNN/DailyMail summarization workload (sampling 0.2% of the dataset). In this example, we do not upload the outputs; we only report the engine execution time in seconds.
```python
from rayllm_batch import RayLLMBatch
from rayllm_batch.workload import CNNDailySummary

# Sample 0.2% of the CNN/DailyMail dataset.
workload = CNNDailySummary(dataset_fraction=0.002)

batch = RayLLMBatch(
    "examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml",
    workload,
    num_replicas=1,
    batch_size=None,
)
ds = batch.run()
```
Sample outputs:

```text
... preprocessing logs ...
#Requests: 530 (1 partitions), Avg Prompt Tokens: 958.34, Max Prompt Tokens: 2336
... execution logs ...
Total tokens processed: 613921
Engine throughput (tokens/s): 1551.82
Projected 1M token time (mins): 11.58
```
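As a sanity check on these numbers, the engine execution time follows from the totals in the log. Note that the naive throughput-based projection for 1M tokens comes out slightly lower than the reported figure, which presumably folds in fixed overheads; the exact projection formula is not documented here, so this is only a back-of-the-envelope sketch.

```python
# Figures taken from the sample log above.
total_tokens = 613_921
throughput = 1551.82  # tokens/s

# Engine execution time implied by the log totals.
engine_seconds = total_tokens / throughput
print(f"{engine_seconds:.1f} s")  # ~395.6 s

# Naive projection for 1M tokens at steady-state throughput.
naive_1m_mins = 1_000_000 / throughput / 60
print(f"{naive_1m_mins:.2f} mins")  # ~10.74; the log reports 11.58
```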
The following example uses Llama-3.1-70B-Instruct-FP8 on 2x L40S GPUs to batch process a synthetic workload with a pre-configured shared prefix. In this example, we do not upload the outputs; we only report the engine execution time in seconds.
```python
from rayllm_batch import RayLLMBatch, init_engine_from_config
from rayllm_batch.workload import SyntheticWithSharedPrefix

# 500 requests sharing a single 1000-token prefix, each with
# 100 unique prompt tokens.
workload = SyntheticWithSharedPrefix(
    num_synthetic_requests=500,
    num_synthetic_prefixes=1,
    num_synthetic_prefix_tokens=1000,
    num_unique_synthetic_prompt_tokens=100,
)

# Manually initialize the engine config with enable_prefix_caching overridden.
override_config = {"engine_kwargs": {"enable_prefix_caching": True}}
engine_cfg = init_engine_from_config(
    "examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml",
    override_config,
)

batch = RayLLMBatch(
    engine_cfg,
    workload,
    num_replicas=1,
    batch_size=None,
)
ds = batch.run()
```
Sample outputs:

```text
... preprocessing logs ...
#Requests: 500 (1 partitions), Avg Prompt Tokens: 1100.00, Max Prompt Tokens: 1100
... execution logs ...
Total tokens processed: 650000
Engine throughput (tokens/s): 3208.49
Projected 1M token time (mins): 5.69
```
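Comparing the two sample runs above, the shared-prefix run with prefix caching enabled roughly doubles engine throughput. Keep in mind this is not a controlled comparison, since the workloads themselves differ (real CNN/DailyMail prompts versus synthetic prompts with a 1000-token shared prefix), but the arithmetic from the logged numbers is:

```python
# Engine throughput (tokens/s) from the two sample runs above.
baseline = 1551.82       # CNN/DailyMail run, no shared prefix
prefix_cached = 3208.49  # synthetic shared-prefix run, prefix caching on

speedup = prefix_cached / baseline
print(f"{speedup:.2f}x")  # → 2.07x
```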