RayLLM-Batch

Engine Configurations

The full set of supported engine configuration options is shown below:

model_id: <HF model ID or local model path>
llm_engine: vllm
accelerator_type: <GPU type>
engine_kwargs:
  <vLLM engine kwargs>
lora_config:
  dynamic_lora_loading_path: <HF LoRA model ID, local path, or S3 path>
runtime_env:
  env_vars:
    <vLLM environment variables>
    <Anyscale inference engine environment variables>
    LOCAL_LORA_ROOT: <optional local path for LoRA models>
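
For illustration, a filled-in configuration might look like the following. All values here (model, accelerator type, engine kwargs, and environment variables) are placeholders, not the settings used by the examples below:

# Illustrative example only; substitute your own model, GPU type, and kwargs.
model_id: meta-llama/Llama-3.1-8B-Instruct
llm_engine: vllm
accelerator_type: L40S
engine_kwargs:
  tensor_parallel_size: 2
  max_num_seqs: 256
runtime_env:
  env_vars:
    VLLM_ATTENTION_BACKEND: FLASH_ATTN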

Examples

CNN Summarization

The following example uses Llama-3.1-70B-Instruct-FP8 to batch process the CNN/DailyMail summarization workload (sampling 0.2% of the dataset) on 2x L40S GPUs. In this example, we do not upload the outputs; we only report the engine execution time in seconds.

from rayllm_batch import RayLLMBatch
from rayllm_batch.workload import CNNDailySummary

# Sample 0.2% of the CNN/DailyMail dataset as the workload.
workload = CNNDailySummary(dataset_fraction=0.002)

# Build the batch processor from the engine config YAML and the workload.
batch = RayLLMBatch(
    "examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml",
    workload,
    num_replicas=1,
    batch_size=None,
)

# Run the batch job; the results are returned as a dataset.
ds = batch.run()

Sample outputs:

... preprocessing logs ...
#Requests: 530 (1 partitions), Avg Prompt Tokens: 958.34, Max Prompt Tokens: 2336
... execution logs ...
Total tokens processed: 613921
Engine throughput (tokens/s): 1551.82
Projected 1M token time (mins): 11.58
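
The example above only reports timing, but the dataset returned by batch.run() can be persisted like any Ray Data dataset. A minimal sketch, assuming ds is a Ray Data Dataset and the output path is an arbitrary local placeholder:

# Preview a few generated rows.
ds.show(3)

# Write all outputs to Parquet; replace the path with your own destination.
ds.write_parquet("/tmp/rayllm_batch_cnn_outputs")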

Synthetic Prefix Sharing Workloads

The following example uses Llama-3.1-70B-Instruct-FP8 to batch process a synthetic workload with a pre-configured shared prefix on 2x L40S GPUs. In this example, we do not upload the outputs; we only report the engine execution time in seconds.

from rayllm_batch import RayLLMBatch, init_engine_from_config
from rayllm_batch.workload import SyntheticWithSharedPrefix

# Generate 500 synthetic requests that all share a single 1000-token prefix,
# each followed by 100 unique prompt tokens.
workload = SyntheticWithSharedPrefix(
    num_synthetic_requests=500,
    num_synthetic_prefixes=1,
    num_synthetic_prefix_tokens=1000,
    num_unique_synthetic_prompt_tokens=100,
)
override_config = {"engine_kwargs": {"enable_prefix_caching": True}}

# Manually initialize the engine config with enable_prefix_caching overridden.
engine_cfg = init_engine_from_config(
    "examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml",
    override_config,
)
batch = RayLLMBatch(
    engine_cfg,
    workload,
    num_replicas=1,
    batch_size=None,
)
ds = batch.run()

Sample outputs:

... preprocessing logs ...
#Requests: 500 (1 partitions), Avg Prompt Tokens: 1100.00, Max Prompt Tokens: 1100
... execution logs ...
Total tokens processed: 650000
Engine throughput (tokens/s): 3208.49
Projected 1M token time (mins): 5.69
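
The override_config mechanism used above is not limited to enable_prefix_caching; any engine kwarg from the configuration schema can be overridden without editing the YAML file. A minimal sketch, where the specific kwargs (max_num_seqs, gpu_memory_utilization) are illustrative values rather than recommended settings:

from rayllm_batch import init_engine_from_config

# Override several vLLM engine kwargs on top of the base YAML config.
override_config = {
    "engine_kwargs": {
        "enable_prefix_caching": True,
        "max_num_seqs": 128,
        "gpu_memory_utilization": 0.90,
    }
}
engine_cfg = init_engine_from_config(
    "examples/configs/vllm-llama-3.1-70b-fp8-l40s.yaml",
    override_config,
)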