
@richardliaw
Created December 11, 2024 23:59

Ray Serve LLM Primitives

Deploy LLMs with full OpenAI API compatibility using Ray Serve primitives.

Features

  • 🚀 Automatic scaling and load balancing
  • 🔄 Tensor parallelism for large models
  • 🎯 JSON mode and function calling
  • 📦 Multi-LoRA adapter support
  • 🛠 Custom model composition
  • Out-of-the-box support for TRT-LLM and vLLM

Quick Start

Get started with the LLM features using the VLLMDeployment and LLMConfig APIs.

from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig

# Create a basic LLM deployment
@serve.deployment
class MyLLMApp:
    def __init__(self):
        self.llm_config = LLMConfig(
            model_id="meta-llama/Llama-3.1-8b-instruct",
            accelerator_type="A100_80G",
            tensor_parallelism=2
        )
        deployment_params = self.llm_config.get_deployment_args()
        self.deployment = VLLMDeployment.create_from(self.llm_config) \
            .options(**deployment_params) \
            .bind(base_config=self.llm_config)

    async def __call__(self, prompt: str):
        return await self.deployment.predict(prompt)

# Deploy the app
app = MyLLMApp.bind()

Then you can run this using:

serve run app:app
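
You can also start the application and query it from Python with a deployment handle. This is a minimal sketch that relies only on the standard serve.run and handle APIs, and assumes the snippet above was saved as app.py:

# run_app.py -- start the application and send a test prompt through its handle
from ray import serve

from app import app  # the bound MyLLMApp application defined above

handle = serve.run(app)  # deploys the app and returns a handle to its ingress

# handle.remote() returns a response object; .result() blocks until the reply arrives
print(handle.remote("What is the capital of France?").result())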

We also offer a configuration generation script:

# Generate a base configuration file
python generate-config.py --model meta-llama/Llama-2-7b-chat-hf > serve-config.yaml

The generated YAML is simple to understand and modify:

# serve-config.yaml
model_id: "meta-llama/Llama-2-7b-chat-hf"
name: "llama-2-chat"
accelerator_type: "A100_80G"
tensor_parallelism: 2
max_batch_size: 64
# Optional LoRA support
lora_config:
  weights_path: "s3://my-bucket/loras/"
  max_lora_rank: 8

You can then load this YAML file programmatically:

from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig

# Load from config file
config = LLMConfig.from_yaml("serve-config.yaml")
deployment = VLLMDeployment.bind(base_config=config)
...

⭐️ Key Features

OpenAI API Compatible

# Deploy with OpenAI API compatibility using Ray Serve
from ray import serve
from ray.serve.llm import OpenAICompatRouter, VLLMDeployment, LLMConfig

llm_config = LLMConfig(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    name="local-llama",  # Name shown in /v1/models
    accelerator_type="A100_80G",
    tensor_parallelism=2
)

deployment = OpenAICompatRouter.bind(
    model_configs=[llm_config],
    llm_deployment=VLLMDeployment.bind(base_config=llm_config)
)

To serve the application, you can use the following command:

# Save the above code in a file named `app.py` and run the following command in your terminal:

serve run app:deployment

The following endpoints are supported out of the box:

  • ✅ /v1/chat/completions: Chat interface (ChatGPT-style)
  • ✅ /v1/completions: Text completion
  • ✅ /v1/models: List available models
  • ✅ /v1/models/{model}: Model information
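
For example, once the application is running you can list the registered models with a plain HTTP request (a quick sketch; it only assumes the /v1/models route above and the default local port):

import requests

# List the models registered with the router
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())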

You can then call the endpoints with the standard OpenAI client:

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Basic completion
response = client.chat.completions.create(
    model="local-llama",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Using a LoRA adapter (passed via extra_body so the OpenAI client forwards it)
response = client.chat.completions.create(
    model="local-llama",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"lora_id": "my-custom-lora"}
)

Multi-Model support

You can also deploy multiple models behind a single OpenAICompatRouter:

# Deploy with OpenAI API compatibility using Ray Serve
from ray import serve
from ray.serve.llm import OpenAICompatRouter, LLMConfig, VLLMDeployment

# Define configurations for different models
llm_7b = LLMConfig(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    name="llama-7b",
    accelerator_type="A100_80G",
    tensor_parallelism=2
)

llm_13b = LLMConfig(
    model_id="meta-llama/Llama-2-13b-chat-hf",
    name="llama-13b",
    accelerator_type="A100_80G",
    tensor_parallelism=4
)

llm_70b = LLMConfig(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    name="llama-70b",
    accelerator_type="A100_80G",
    tensor_parallelism=8
)

# Create separate deployments for each model
deployment_7b = VLLMDeployment.bind(base_config=llm_7b)
deployment_13b = VLLMDeployment.bind(base_config=llm_13b)
deployment_70b = VLLMDeployment.bind(base_config=llm_70b)

# Create router that manages all deployments
deployment = OpenAICompatRouter.bind(
    model_configs=[llm_7b, llm_13b, llm_70b],
    llm_deployments={
        "llama-7b": deployment_7b,
        "llama-13b": deployment_13b,
        "llama-70b": deployment_70b
    }
)
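
Clients then select a model by the name defined in its LLMConfig, which is how the router decides which deployment handles the request. A hedged sketch using the names above (the /v1 base URL and local port are carried over from the single-model example):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# The `model` field selects between llama-7b, llama-13b, and llama-70b
response = client.chat.completions.create(
    model="llama-13b",
    messages=[{"role": "user", "content": "Summarize Ray Serve in one sentence."}]
)
print(response.choices[0].message.content)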

LoRA support

Our LLM deployment APIs come with LoRA support:

from typing import Optional

from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig, LoRAConfig

@serve.deployment
class MyLLMApp:
    def __init__(self):
        # Add LoRA configuration to the LLMConfig
        self.llm_config = LLMConfig(
            model_id="meta-llama/Llama-3.1-8b-instruct",
            accelerator_type="A100_80G",
            tensor_parallelism=2,
            lora_config=LoRAConfig(
                weights_path="s3://my-bucket/loras/",
                max_lora_rank=8,
                max_loras_per_replica=4
            )
        )
        deployment_params = self.llm_config.get_deployment_args()
        self.deployment = VLLMDeployment \
            .options(**deployment_params) \
            .bind(base_config=self.llm_config)

    async def __call__(self, prompt: str, lora_id: Optional[str] = None):
        return await self.deployment.predict(
            prompt,
            lora_id=lora_id  # Pass through LoRA ID if specified
        )

# Deploy the app
app = MyLLMApp.bind()
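
To exercise an adapter end to end, you can run the app and pass a lora_id through its handle. A minimal sketch that reuses the serve.run and handle pattern from the Quick Start; the adapter name is just a placeholder:

from ray import serve

handle = serve.run(app)

# Same prompt against the base model and against a specific LoRA adapter
base_reply = handle.remote("Hello!").result()
lora_reply = handle.remote("Hello!", lora_id="my-custom-lora").result()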

JSON Mode

JSON Mode ensures structured outputs from your LLM deployments. Here is an example:

from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig, JSONMode

# Define model with JSON schema
llm_config = LLMConfig(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    name="llama-2-chat",
    accelerator_type="A100_80G",
    tensor_parallelism=2,
    json_mode=JSONMode(
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                "score": {"type": "number", "minimum": 0, "maximum": 1}
            },
            "required": ["sentiment", "score"]
        }
    )
)

@serve.deployment
class SentimentAnalyzer:
    def __init__(self):
        self.deployment = VLLMDeployment.bind(base_config=llm_config)

    async def __call__(self, text: str):
        return await self.deployment.predict(
            f"Analyze: '{text}'",
            response_format={"type": "json"}
        )

# Deploy
app = SentimentAnalyzer.bind()

# Example usage:
# response = await app.remote("This product exceeded my expectations!")
# Returns: {"sentiment": "positive", "score": 0.95}
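
Because the schema constrains the model output, the reply can be parsed and checked like any other JSON document. A short sketch; it assumes the deployment returns the JSON as a string and uses the third-party jsonschema package for validation:

import json

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["sentiment", "score"]
}

def parse_result(raw: str) -> dict:
    # Raises if the output is not valid JSON or violates the schema
    result = json.loads(raw)
    validate(instance=result, schema=schema)
    return result

print(parse_result('{"sentiment": "positive", "score": 0.95}'))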
@akshay-anyscale

Important note: this should only be in addition to what we already have, which is suitable for production deployments that are usually managed with configs. What we have today is essentially equivalent to

deployment_handle = OpenAICompatibleRouter.build([LLMConfig])

except that it is called rayllm.app instead of OpenAICompatibleRouter.build.

For chaining with other Ray Serve deployments, e.g.:

self.rayllm_deployment.generate(llm_request)

we should expose and document the generate Python API at this layer. This can be done first, before moving to the per-deployment handles, since the latter would increase the surface area to maintain and document.

@akshay-anyscale

Also, it looks like you forgot to remove .options/.bind after switching to create_from.
