@rbiswasfc
Created February 15, 2025 11:49
SGLang 0.4.3

File: README.md

SGLang Documentation

We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. Most documentation files are located under the docs/ folder. We prefer Jupyter Notebooks over Markdown so that all examples can be executed and validated by our docs CI pipeline.

Docs Workflow

Install Dependencies

pip install -r requirements.txt

Update Documentation

Update your Jupyter notebooks in the appropriate subdirectories under docs/. If you add new files, remember to update index.rst (or relevant .rst files) accordingly.

  • pre-commit run --all-files manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks before creating a Pull Request.
  • Do not commit directly to the main branch. Always create a new branch (e.g., feature/my-new-feature), push your changes, and open a PR from that branch.
# 1) Compile all Jupyter notebooks
make compile

# 2) Generate static HTML
make html

# 3) Preview documentation locally
# Open your browser at the displayed port to view the docs
bash serve.sh

# 4) Clean notebook outputs
# nbstripout removes notebook outputs so your PR stays clean
pip install nbstripout
find . -name '*.ipynb' -exec nbstripout {} \;

# 5) Pre-commit checks and create a PR
# After these checks pass, push your changes and open a PR on your branch
pre-commit run --all-files

If you need to run and shut down an SGLang server or engine, follow these examples:

  1. Launch and close Server:
# Launch Server

from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)

wait_for_server("http://localhost:30000")

# Terminate Server

terminate_process(server_process)
  2. Launch and close Engine:
# Launch Engine

import sglang as sgl
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

# Terminate Engine
llm.shutdown()

File: backend/function_calling.md

Tool and Function Calling

This guide demonstrates how to use SGLang’s Tool Calling functionality.

OpenAI Compatible API

Launching the Server

from openai import OpenAI
import json
from sglang.utils import wait_for_server, print_highlight, terminate_process
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --host 0.0.0.0"  # llama3
)
wait_for_server(f"http://localhost:{port}")

Note that --tool-call-parser defines the parser used to interpret responses. Currently supported parsers include:

  • llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).
  • mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).
  • qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).
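
For example, to serve a Qwen 2.5 model with its matching parser, the same launch helper can be reused. This is a hedged sketch (it assumes the Qwen/Qwen2.5-7B-Instruct weights are available, and would replace the Llama 3.1 launch above or run in a fresh session):

# Hedged sketch: launch a Qwen 2.5 model with the qwen25 tool-call parser.
qwen_server_process, qwen_port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{qwen_port}")
# Remember to terminate_process(qwen_server_process, qwen_port) when you are done.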

Define Tools for Function Call

Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes the tool name, a description, and the tool's parameters, which are defined as a JSON schema with properties.

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is"
                        " in, e.g. 'CA' which would mean 'California'",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    }
]

Define Messages

def get_messages():
    return [
        {
            "role": "user",
            "content": "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}",
        }
    ]


messages = get_messages()

Initialize the Client

# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id

Non-Streaming Request

# Non-streaming mode test
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,  # Non-streaming
    tools=tools,
)
print_highlight("Non-stream response:")
print(response_non_stream)

Streaming Request

# Streaming mode test
print_highlight("Streaming response:")
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=True,  # Enable streaming
    tools=tools,
)

chunks = []
for chunk in response_stream:
    chunks.append(chunk)
    if chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])

Handle Tool Calls

When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly.

Non-Streaming Request

name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name
arguments_non_stream = (
    response_non_stream.choices[0].message.tool_calls[0].function.arguments
)

print_highlight(f"Non-stream function call name: {name_non_stream}")
print_highlight(f"Non-stream function call arguments: {arguments_non_stream}")

Streaming Request

# Parse and combine function call arguments
arguments = []
for chunk in chunks:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.tool_calls:
        tool_call = delta.tool_calls[0]
        if tool_call.function.name:
            print_highlight(f"Streamed function call name: {tool_call.function.name}")

        if tool_call.function.arguments:
            arguments.append(tool_call.function.arguments)
            print(f"Streamed function call arguments: {tool_call.function.arguments}")

# Combine all fragments into a single JSON string
full_arguments = "".join(arguments)
print_highlight(f"Final streamed function call arguments: {full_arguments}")

Define a Tool Function

# This is only a demonstration; define a real function according to your use case.
def get_current_weather(city: str, state: str, unit: str):
    return (
        f"The weather in {city}, {state} is 85 degrees {unit}. It is "
        "partly cloudy, with highs in the 90's."
    )


available_tools = {"get_current_weather": get_current_weather}

Execute the Tool

call_data = json.loads(full_arguments)

messages.append(
    {
        "role": "user",
        "content": "",
        "tool_calls": {"name": "get_current_weather", "arguments": full_arguments},
    }
)

# Call the corresponding tool function
tool_name = messages[-1]["tool_calls"]["name"]
tool_to_call = available_tools[tool_name]
result = tool_to_call(**call_data)
print_highlight(f"Function call result: {result}")
messages.append({"role": "tool", "content": result, "name": tool_name})

print_highlight(f"Updated message history: {messages}")

Send Results Back to Model

final_response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,
)
print_highlight("Non-stream response:")
print(final_response)

Native API and SGLang Runtime (SRT)

from transformers import AutoTokenizer
import requests

# generate an answer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = get_messages()

input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools,
)

gen_url = f"http://localhost:{port}/generate"
gen_data = {"text": input, "sampling_params": {"skip_special_tokens": False}}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print(gen_response)

# parse the response
parse_url = f"http://localhost:{port}/function_call"

function_call_input = {
    "text": gen_response,
    "tool_call_parser": "llama3",
    "tools": tools,
}

function_call_response = requests.post(parse_url, json=function_call_input)
function_call_response_json = function_call_response.json()
print("function name: ", function_call_response_json["calls"][0]["name"])
print("function arguments: ", function_call_response_json["calls"][0]["parameters"])
terminate_process(server_process, port)

Offline Engine API

import sglang as sgl
from sglang.srt.function_call_parser import FunctionCallParser
from sglang.srt.managers.io_struct import Tool, Function

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer = llm.tokenizer_manager.tokenizer
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, tools=tools
)

sampling_params = {
    "max_new_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.95,
    "skip_special_tokens": False,
}

# 1) Offline generation
result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
generated_text = result["text"]  # Assume there is only one prompt

print("=== Offline Engine Output Text ===")
print(generated_text)


# 2) Parse using FunctionCallParser
def convert_dict_to_tool(tool_dict: dict) -> Tool:
    function_dict = tool_dict.get("function", {})
    return Tool(
        type=tool_dict.get("type", "function"),
        function=Function(
            name=function_dict.get("name"),
            description=function_dict.get("description"),
            parameters=function_dict.get("parameters"),
        ),
    )


tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]

parser = FunctionCallParser(tools=tools, tool_call_parser="llama3")
normal_text, calls = parser.parse_non_stream(generated_text)

print("\n=== Parsing Result ===")
print("Normal text portion:", normal_text)
print("Function call portion:")
for call in calls:
    # call: ToolCallItem
    print(f"  - tool name: {call.name}")
    print(f"    parameters: {call.parameters}")

# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc.
llm.shutdown()

How to support a new model?

  1. Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model's tool tags. Currently supported tags include:
    TOOLS_TAG_LIST = [
        "<|plugin|>",
        "<function=",
        "<tool_call>",
        "<|python_tag|>",
        "[TOOL_CALLS]",
    ]
  2. Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model's specific function call format (an illustrative sketch follows this list). For example:
    class NewModelDetector(BaseFormatDetector):
  3. Add the new detector to the MultiFormatParser class that manages all the format detectors.
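
As a rough illustration only, such a detector might look like the sketch below. The attribute and method names here are assumptions; check the actual BaseFormatDetector interface in sglang/srt/function_call_parser.py before implementing.

# Illustrative sketch (inside sglang/srt/function_call_parser.py): the names
# bot_token, eot_token, and detect_and_parse are assumptions; verify them
# against the real BaseFormatDetector interface.
class NewModelDetector(BaseFormatDetector):
    def __init__(self):
        super().__init__()
        # Hypothetical start/end markers for the new model's tool-call format.
        self.bot_token = "<new_tool_call>"
        self.eot_token = "</new_tool_call>"

    def detect_and_parse(self, text: str, tools):
        # Extract the text between the markers and convert it into tool-call
        # results, typically by reusing helpers from BaseFormatDetector.
        ...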

File: backend/native_api.md

Native APIs

Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:

  • /generate (text generation model)
  • /get_model_info
  • /get_server_info
  • /health
  • /health_generate
  • /flush_cache
  • /update_weights
  • /encode (embedding model)
  • /classify (reward model)

We mainly use requests to test these APIs in the following examples. You can also use curl.

Launch A Server

import requests
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")

Generate (text generation model)

Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.

url = f"http://localhost:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())

Get Model Info

Get the information of the model.

  • model_path: The path/name of the model.
  • is_generation: Whether the model is used as a generation model or an embedding model.
  • tokenizer_path: The path/name of the tokenizer.
url = f"http://localhost:{port}/get_model_info"

response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json.keys() == {"model_path", "is_generation", "tokenizer_path"}

Get Server Info

Gets the server information including CLI arguments, token limits, and memory pool sizes.

  • Note: get_server_info merges the following deprecated endpoints:
    • get_server_args
    • get_memory_pool_size
    • get_max_total_num_tokens
# get_server_info

url = f"http://localhost:{port}/get_server_info"

response = requests.get(url)
print_highlight(response.text)

Health Check

  • /health: Check the health of the server.
  • /health_generate: Check the health of the server by generating one token.
url = f"http://localhost:{port}/health_generate"

response = requests.get(url)
print_highlight(response.text)
url = f"http://localhost:{port}/health"

response = requests.get(url)
print_highlight(response.text)

Flush Cache

Flush the radix cache. It will be automatically triggered when the model weights are updated by the /update_weights API.

# flush cache

url = f"http://localhost:{port}/flush_cache"

response = requests.post(url)
print_highlight(response.text)

Update Weights From Disk

Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size.

SGLang supports the update_weights_from_disk API for continuous evaluation during training (save a checkpoint to disk and update the weights from disk).

# successful update with same architecture and size

url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B"}

response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
assert response.json().keys() == {"success", "message"}
# failed update with different parameter size or wrong name

url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B-wrong"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
    "Failed to get weights iterator: "
    "meta-llama/Llama-3.2-1B-wrong"
    " (repository not found)."
)

Encode (embedding model)

Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.

terminate_process(server_process, port)

embedding_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
    --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"http://localhost:{port}")
# successful encode for embedding model

url = f"http://localhost:{port}/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": "Once upon a time"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")
terminate_process(embedding_process, port)

Classify (reward model)

SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations.

terminate_process(embedding_process, port)

# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.

reward_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"http://localhost:{port}")
from transformers import AutoTokenizer

PROMPT = (
    "What is the range of the numeric output of a sigmoid node in a neural network?"
)

RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."

CONVS = [
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)

url = f"http://localhost:{port}/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}

responses = requests.post(url, json=data).json()
for response in responses:
    print_highlight(f"reward: {response['embedding'][0]}")
terminate_process(reward_process, port)

Skip Tokenizer and Detokenizer

SGLang Runtime also supports skipping the tokenizer and detokenizer. This is useful in cases like integrating with an RLHF workflow.

tokenizer_free_server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --skip-tokenizer-init
"""
)

wait_for_server(f"http://localhost:{port}")
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input_text = "What is the capital of France?"

input_tokens = tokenizer.encode(input_text)
print_highlight(f"Input Text: {input_text}")
print_highlight(f"Tokenized Input: {input_tokens}")

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "input_ids": input_tokens,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
            "stop_token_ids": [tokenizer.eos_token_id],
        },
        "stream": False,
    },
)
output = response.json()
output_tokens = output["token_ids"]

output_text = tokenizer.decode(output_tokens, skip_special_tokens=False)
print_highlight(f"Tokenized Output: {output_tokens}")
print_highlight(f"Decoded Output: {output_text}")
print_highlight(f"Finish reason: {output['meta_info']['finish_reason']}")
terminate_process(tokenizer_free_server_process, port)

File: backend/offline_engine_api.md

Offline Engine API

SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where an additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:

  • Offline Batch Inference
  • Custom Server on Top of the Engine

This document focuses on the offline batch inference, demonstrating four different inference modes:

  • Non-streaming synchronous generation
  • Streaming synchronous generation
  • Non-streaming asynchronous generation
  • Streaming asynchronous generation

Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example of a working Python script can be found in custom_server; a rough sketch of the idea follows.
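
The sketch below is only an illustration of that idea, not the custom_server example itself: FastAPI, the /generate route, and the request shape are assumptions made for the sake of a minimal, self-contained snippet.

# Minimal illustrative sketch of a custom server wrapping sgl.Engine.
# FastAPI and the endpoint/request shape are assumptions; see custom_server
# for the real implementation.
import sglang as sgl
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


class GenerateRequest(BaseModel):
    text: str
    sampling_params: dict = {}


@app.post("/generate")
async def generate(req: GenerateRequest):
    # async_generate takes a list of prompts and returns a list of outputs.
    outputs = await llm.async_generate([req.text], req.sampling_params)
    return {"text": outputs[0]["text"]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)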

Offline Batch Inference

SGLang offline engine supports batch inference with efficient scheduling.

# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Non-streaming Synchronous Generation

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Streaming Synchronous Generation

prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()

Non-streaming Asynchronous Generation

prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())

Streaming Asynchronous Generation

prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
llm.shutdown()

Return Hidden States

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(
        f"Prompt: {prompt}\nGenerated text: {output['text']}\nPrompt_Tokens: {output['meta_info']['prompt_tokens']}\tCompletion_tokens: {output['meta_info']['completion_tokens']}\nHidden states: {[i.shape for i in output['meta_info']['hidden_states']]}"
    )
    print()
llm.shutdown()

File: backend/openai_api_completions.md

OpenAI APIs - Completions

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.

This tutorial covers the following popular APIs:

  • chat/completions
  • completions
  • batches

Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.

Launch A Server

Launch the server in your terminal and wait for it to initialize.

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

Chat Completions

Usage

The server fully implements the OpenAI API. It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available. You can also specify a custom chat template with --chat-template when launching the server.

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")

Parameters

The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to OpenAI Chat Completions API for more details.

Here is an example of a detailed chat completion request:

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)

Streaming mode is also supported.

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Completions

Usage

The Completions API is similar to the Chat Completions API, but without the messages parameter or chat templates.

response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")

Parameters

The completions API accepts OpenAI Completions API's parameters. Refer to OpenAI Completions API for more details.

Here is an example of a detailed completions request:

response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")

Structured Outputs (JSON, Regex, EBNF)

For the OpenAI-compatible structured outputs API, refer to Structured Outputs for more details. A brief, hedged example of the JSON case is sketched below.
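
The snippet below follows the OpenAI-style json_schema response_format; the exact payload accepted by the server should be verified against the Structured Outputs document, and the schema itself is only illustrative.

# Hedged sketch: constrain the response to a JSON object via response_format.
# Verify the exact payload against the Structured Outputs documentation.
import json

json_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give the capital of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",  # illustrative schema name
            "schema": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["country", "capital"],
            },
        },
    },
    temperature=0,
)
print_highlight(json.loads(json_response.choices[0].message.content))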

Batches

The Batches API for chat completions and completions is also supported. You can upload your requests in JSONL files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).

The batches APIs are:

  • batches
  • batches/{batch_id}/cancel
  • batches/{batch_id}

Here is an example of a batch job for chat completions; completions are similar.

import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")

    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")

It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.

  1. batches/{batch_id}: Retrieve the batch job status.
  2. batches/{batch_id}/cancel: Cancel the batch job.

Here is an example to check the batch job status.

import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(20):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 64,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)

    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )

    time.sleep(3)

Here is an example to cancel a batch job.

import json
import time
from openai import OpenAI
import os

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(5000):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 128,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")

except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise e

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise e
terminate_process(server_process, port)

File: backend/openai_api_embeddings.md

OpenAI APIs - Embedding

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.

This tutorial covers the embedding APIs for embedding models, such as Alibaba-NLP/gte-Qwen2-7B-instruct, which is used in the examples below.

Launch A Server

Launch the server in your terminal and wait for it to initialize. Remember to add --is-embedding to the command.

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

embedding_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
    --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"http://localhost:{port}")

Using cURL

import subprocess, json

text = "Once upon a time"

curl_text = f"""curl -s http://localhost:{port}/v1/embeddings \
  -d '{{"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": "{text}"}}'"""

text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))["data"][0][
    "embedding"
]

print_highlight(f"Text embedding (first 10): {text_embedding[:10]}")

Using Python Requests

import requests

text = "Once upon a time"

response = requests.post(
    f"http://localhost:{port}/v1/embeddings",
    json={"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": text},
)

text_embedding = response.json()["data"][0]["embedding"]

print_highlight(f"Text embedding (first 10): {text_embedding[:10]}")

Using OpenAI Python Client

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Text embedding example
response = client.embeddings.create(
    model="Alibaba-NLP/gte-Qwen2-7B-instruct",
    input=text,
)

embedding = response.data[0].embedding[:10]
print_highlight(f"Text embedding (first 10): {embedding}")

Using Input IDs

SGLang also supports input_ids as input to get the embedding.

import json
import os
from transformers import AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
input_ids = tokenizer.encode(text)

curl_ids = f"""curl -s http://localhost:{port}/v1/embeddings \
  -d '{{"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": {json.dumps(input_ids)}}}'"""

input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))["data"][
    0
]["embedding"]

print_highlight(f"Input IDs embedding (first 10): {input_ids_embedding[:10]}")
terminate_process(embedding_process, port)

File: backend/openai_api_vision.md

OpenAI APIs - Vision

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.

SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and Qwen2-VL.

Launch A Server

Launch the server in your terminal and wait for it to initialize.

Remember to add --chat-template llama_3_vision to specify the vision chat template, otherwise the server only supports text, and performance degradation may occur.

We need to specify --chat-template for vision language models because the chat template provided in the Hugging Face tokenizer only supports text.

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

embedding_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
    --chat-template=llama_3_vision
"""
)

wait_for_server(f"http://localhost:{port}")

Using cURL

Once the server is up, you can send test requests using curl or requests.

import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -d '{{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)

Using Python Requests

import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)

Using OpenAI Python Client

from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)

Multiple-Image Inputs

The server also supports multiple images and interleaved text and images if the model supports it.

from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
terminate_process(embedding_process, port)

Chat Template

As mentioned before, if you do not specify a vision model's --chat-template, the server uses Hugging Face's default template, which only supports text.

We list popular vision models with their chat templates:


File: backend/send_request.md

Quick Start: Sending Requests

This notebook provides a quick-start guide to using SGLang for chat completions after installation.

Launch A Server

This code block is equivalent to executing

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --host 0.0.0.0

in your terminal and waiting for the server to be ready. Once the server is running, you can send test requests using curl or requests. The server implements the OpenAI-compatible APIs.

from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server, print_highlight, terminate_process

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --host 0.0.0.0
"""
)

wait_for_server(f"http://localhost:{port}")

Using cURL

import subprocess, json

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)

Using Python Requests

import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

Using OpenAI Python Client

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)

Streaming

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Using Native Generation APIs

You can also use the native /generate endpoint with requests, which provides more flexibility. An API reference is available at Sampling Parameters.

import requests

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print_highlight(response.json())

Streaming

import requests, json

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
terminate_process(server_process, port)

File: backend/server_arguments.md

Server Arguments

Common launch commands

  • To enable multi-GPU tensor parallelism, add --tp 2. If it reports the error "peer access is not supported between these two devices", add --enable-p2p-check to the server launch command.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
  • To enable multi-GPU data parallelism, add --dp 2. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend SGLang Router for data parallelism.
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
  • If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
  • See the hyperparameter tuning documentation for guidance on tuning hyperparameters for better performance.
  • If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
  • To enable torch.compile acceleration, add --enable-torch-compile. It accelerates small models on small batch sizes. This does not work for FP8 currently.

  • To enable torchao quantization, add --torchao-config int4wo-128. It supports other quantization strategies (INT8/FP8) as well.

  • To enable fp8 weight quantization, add --quantization fp8 on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.

  • To enable fp8 kv cache quantization, add --kv-cache-dtype fp8_e5m2.

  • If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template.

  • To run tensor parallelism on multiple nodes, add --nnodes 2. If you have two nodes with two GPUs on each node and want to run TP=4, let sgl-dev-0 be the hostname of the first node and 50000 be an available port; then you can use the following commands. If you encounter a deadlock, please try to add --disable-cuda-graph.

# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
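
Several of the flags above can also be combined on a single launch. The sketch below takes the flag values verbatim from the bullets in this section and uses the launch_server_cmd helper from sglang.utils shown in the other tutorials; whether a particular combination of optimizations is supported for your model should still be verified.

# Hedged sketch: combine optimization flags from this section on one launch.
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct "
    "--enable-torch-compile --torchao-config int4wo-128 --kv-cache-dtype fp8_e5m2"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process, port)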

Please consult the documentation below to learn more about the parameters you may provide when launching a server.

Model and tokenizer

  • model_path: Path to the model that will be served.
  • tokenizer_path: Defaults to the model_path.
  • tokenizer_mode: By default auto; see here for the different modes.
  • load_format: The format the weights are loaded in. Defaults to *.safetensors/*.bin.
  • trust_remote_code: If True, will use locally cached config files, otherwise use remote configs in HuggingFace.
  • dtype: Dtype used for the model, defaults to bfloat16.
  • kv_cache_dtype: Dtype of the kv cache, defaults to the dtype.
  • context_length: The number of tokens our model can process, including the input. Note that extending the default might lead to strange behavior.
  • device: The device the model is placed on; defaults to cuda.
  • chat_template: The chat template to use. Deviating from the default might lead to unexpected responses. For multi-modal chat templates, refer to here.
  • is_embedding: Set to true to perform embedding / encode and reward tasks.
  • revision: Adjust if a specific version of the model should be used.
  • skip_tokenizer_init: Set to true to provide the tokens to the engine and get the output tokens directly, typically used in RLHF.
  • json_model_override_args: Override model config with the provided JSON.
  • delete_ckpt_after_loading: Delete the model checkpoint after loading the model.

Important

Make sure the correct chat_template is passed, or performance degradation may occur.

Serving: HTTP & API

HTTP Server configuration

  • port and host: Set up the host and port for the HTTP server. By default, host: str = "127.0.0.1" and port: int = 30000.

API configuration

  • api_key: Sets an API key for the server and the OpenAI-compatible API (a brief sketch follows this list).
  • file_storage_pth: Directory for storing uploaded or generated files from API calls.
  • enable_cache_report: If set, includes detailed usage of cached tokens in the response usage.
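
As a hedged illustration of the API key setting: the --api-key CLI spelling below is assumed from the api_key parameter name, following the same underscore-to-dash convention as the other flags, and the client must pass the matching key.

# Hedged sketch: protect the server with an API key and pass the same key
# from the OpenAI client. The --api-key spelling is assumed from api_key.
import openai
from sglang.utils import launch_server_cmd

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --api-key my-secret-key"
)

# Requests must now send the matching key.
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="my-secret-key")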

Parallelism

Tensor parallelism

  • tp_size: The number of GPUs the model weights get sharded over. Mainly for saving memory rather than for high throughput, see this blogpost.

Data parallelism

  • dp_size: Will be deprecated. The number of data-parallel copies of the model. SGLang router is recommended instead of the current naive data parallel.
  • load_balance_method: Will be deprecated. Load balancing strategy for data parallel requests.

Expert parallelism

  • ep_size: Distribute the experts onto multiple GPUs for MoE models. Remember to shard the model weights with tp_size=ep_size, for detailed benchmarking refer to this PR.

Memory and scheduling

  • mem_fraction_static: Fraction of the free GPU memory used for static memory like model weights and KV cache. If building KV cache fails, it should be increased. If CUDA runs out of memory, it should be decreased.
  • max_running_requests: The maximum number of requests to run concurrently.
  • max_total_tokens: The maximum number of tokens that can be stored into the KV cache. Use mainly for debugging.
  • chunked_prefill_size: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases VRAM consumption. If CUDA runs out of memory, it should be decreased.
  • max_prefill_tokens: Token budget of how many tokens to accept in one prefill batch. The actual number is the max of this parameter and the context_length.
  • schedule_policy: The scheduling policy to control the processing order of waiting prefill requests in a single engine.
  • schedule_conservativeness: Can be used to decrease/increase the conservativeness of the server when taking new requests. Highly conservative behavior leads to starvation, but low conservativeness leads to slowed-down performance.
  • cpu_offload_gb: Reserve this amount of RAM in GB for offloading of model parameters to the CPU.
  • prefill_only_one_req: When this flag is turned on, the engine prefills only one request at a time.

Other runtime options

  • stream_interval: Interval (in tokens) for streaming responses. Smaller values lead to smoother streaming, and larger values lead to better throughput.
  • random_seed: Can be used to enforce more deterministic behavior.
  • watchdog_timeout: Adjusts the watchdog thread’s timeout before killing the server if batch generation takes too long.
  • download_dir: Use to override the default Hugging Face cache directory for model weights.
  • base_gpu_id: Use to adjust first GPU used to distribute the model across available GPUs.
  • allow_auto_truncate: Automatically truncate requests that exceed the maximum input length.

Logging

  • log_level: Global log verbosity.
  • log_level_http: Separate verbosity level for the HTTP server logs (if unset, defaults to log_level).
  • log_requests: Logs the inputs and outputs of all requests for debugging.
  • show_time_cost: Prints or logs detailed timing info for internal operations (helpful for performance tuning).
  • enable_metrics: Exports Prometheus-like metrics for request usage and performance.
  • decode_log_interval: How often (in tokens) to log decode progress.

Multi-node distributed serving

  • dist_init_addr: The TCP address used for initializing PyTorch’s distributed backend (e.g. 192.168.0.2:25000).
  • nnodes: Total number of nodes in the cluster. Refer to how to run the Llama 405B model.
  • node_rank: Rank (ID) of this node among the nnodes in the distributed setup.

LoRA

  • lora_paths: You may provide a list of adapters to your model. Each batch element will get the model response with the corresponding LoRA adapter applied. Currently cuda_graph and radix_attention are not supported with this option, so you need to disable them manually. We are still working through these issues.
  • max_loras_per_batch: Maximum number of LoRAs in a running batch including base model.
  • lora_backend: The backend for running GEMM kernels for LoRA modules; can be one of triton or flashinfer. Defaults to triton.

Kernel backend

  • attention_backend: The backend for attention computation and KV cache management.
  • sampling_backend: The backend for sampling.

Constrained Decoding

  • grammar_backend: The grammar backend for constraint decoding. Detailed usage can be found in this document.
  • constrained_json_whitespace_pattern: Use with the Outlines grammar backend to allow JSON with syntactic newlines, tabs or multiple spaces. Details can be found here.

Speculative decoding

  • speculative_draft_model_path: The draft model path for speculative decoding.
  • speculative_algorithm: The algorithm for speculative decoding. Currently only Eagle is supported. Note that the radix cache, chunked prefill, and overlap scheduler are disabled when using eagle speculative decoding.
  • speculative_num_steps: How many draft passes we run before verifying.
  • speculative_num_draft_tokens: The number of tokens proposed in a draft.
  • speculative_eagle_topk: The number of top candidates we keep for verification at each step for Eagle.

Double Sparsity

  • enable_double_sparsity: Enables double sparsity which increases throughput.
  • ds_channel_config_path: The double sparsity config. For a guide on how to generate the config for your model see this repo.
  • ds_heavy_channel_num: Number of channel indices to keep for each layer.
  • ds_heavy_token_num: Number of tokens used for attention during decode. Skip sparse decoding if min_seq_len in batch < this number.
  • ds_heavy_channel_type: The type of heavy channels. Either q, k or qk.
  • ds_sparse_decode_threshold: Don't apply sparse decoding if max_seq_len in batch < this threshold.
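
A sketch of enabling these options together is shown below; the channel config path is hypothetical and must be generated for your model as described above, and the flag spellings are assumed to mirror the argument names.

# Sketch: double sparsity. "ds_channel_config.json" is a hypothetical config
# generated for the target model (see the linked repo above).
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server "
    "--model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--enable-double-sparsity "
    "--ds-channel-config-path ds_channel_config.json "
    "--host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process)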

Debug options

Note: We recommend staying with the defaults and only using these options for debugging, to get the best possible performance.

  • disable_radix_cache: Disable Radix backend for prefix caching.
  • disable_jump_forward: Disable jump-forward for outlines grammar backend.
  • disable_cuda_graph: Disable cuda graph for model forward. Use if encountering uncorrectable CUDA ECC errors.
  • disable_cuda_graph_padding: Disable cuda graph when padding is needed; otherwise, cuda graph is still used.
  • disable_outlines_disk_cache: Disable disk cache for outlines grammar backend.
  • disable_custom_all_reduce: Disable usage of custom all reduce kernel.
  • disable_mla: Disable Multi-Head Latent Attention for Deepseek model.
  • disable_overlap_schedule: Disable the Overhead-Scheduler.
  • enable_nan_detection: Turning this on makes the sampler print a warning if the logits contain NaN.
  • enable_p2p_check: Enables P2P checks when accessing GPUs, instead of the default behavior of always assuming P2P access is allowed.
  • triton_attention_reduce_in_fp32: In triton kernels this will cast the intermediate attention result to float32.

Optimization

Note: Some of these options are still in experimental stage.

  • enable_mixed_chunk: Enables mixing prefill and decode, see this discussion.
  • enable_dp_attention: Enable Data Parallelism Attention for Deepseek models. Note that you need to choose dp_size = tp_size for this.
  • enable_ep_moe: Enables expert parallelism, see the description of ep_size.
  • enable_torch_compile: Torch compile the model. This is an experimental feature.
  • torch_compile_max_bs: The maximum batch size when using torch_compile.
  • cuda_graph_max_bs: Adjust the maximum batch size when using cuda graph. By default this is chosen for you based on GPU specifics.
  • cuda_graph_bs: The batch sizes to capture by CudaGraphRunner. By default this is done for you.
  • torchao_config: Experimental feature that optimizes the model with torchao. Possible choices are: int8dq, int8wo, int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-per_row.
  • triton_attention_num_kv_splits: Use to adjust the number of KV splits in triton kernels. Default is 8.
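
These experimental options can be combined; below is a sketch with illustrative values (int8wo is just one of the torchao_config choices listed above), using the launch_server_cmd helper from these docs.

# Sketch: experimental optimizations. Values are illustrative, not recommendations.
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server "
    "--model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--enable-torch-compile --torch-compile-max-bs 8 "
    "--cuda-graph-max-bs 8 --torchao-config int8wo "
    "--host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process)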

File: backend/speculative_decoding.md

Speculative Decoding

SGLang now provides an EAGLE-based speculative decoding option. The implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.

Note: Currently, Speculative Decoding in SGLang does not support radix cache.

Performance Highlights

  • Official EAGLE code (SafeAILab/EAGLE): ~200 tokens/s
  • Standard SGLang Decoding: ~156 tokens/s
  • EAGLE Decoding in SGLang: ~297 tokens/s
  • EAGLE Decoding in SGLang (w/ torch.compile): ~316 tokens/s

All benchmarks above were run on a single H100.

EAGLE Decoding

To enable EAGLE-based speculative decoding, specify the draft model (--speculative-draft) and the relevant EAGLE parameters:

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE \
    --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
"""
)

wait_for_server(f"http://localhost:{port}")
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
terminate_process(server_process)

EAGLE Decoding with torch.compile

You can also enable torch.compile for further optimizations and optionally set --cuda-graph-max-bs:

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE \
    --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
            --enable-torch-compile --cuda-graph-max-bs 2
"""
)

wait_for_server(f"http://localhost:{port}")
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
terminate_process(server_process)

File: backend/structured_outputs.md

Structured Outputs

You can specify a JSON schema, regular expression or EBNF to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified for a request.

SGLang supports two grammar backends:

  • Outlines (default): Supports JSON schema and regular expression constraints.
  • XGrammar: Supports JSON schema, regular expression, and EBNF constraints.

We suggest using XGrammar for its better performance and utility. XGrammar currently uses the GGML BNF format. For more details, see XGrammar technical overview.

To use XGrammar, simply add --grammar-backend xgrammar when launching the server. If no backend is specified, Outlines will be used as the default.

For better output quality, it's advisable to explicitly include instructions in the prompt that guide the model toward the desired format. For example, you can specify: 'Please generate the output in the following JSON format: ...'.

OpenAI Compatible API

import openai
import os
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

os.environ["TOKENIZERS_PARALLELISM"] = "false"


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --grammar-backend xgrammar"
)

wait_for_server(f"http://localhost:{port}")
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

JSON

You can directly define a JSON schema or use Pydantic to define and validate the response.

Using Pydantic

from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Please generate the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "foo",
            # convert the pydantic model to json schema
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

response_content = response.choices[0].message.content
# validate the JSON response by the pydantic model
capital_info = CapitalInfo.model_validate_json(response_content)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")

JSON Schema Directly

import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(response.choices[0].message.content)

EBNF

ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(response.choices[0].message.content)

Regular expression

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)

print_highlight(response.choices[0].message.content)

Native API and SGLang Runtime (SRT)

JSON

Using Pydantic

import requests
import json
from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


# Make API request
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json.dumps(CapitalInfo.model_json_schema()),
        },
    },
)
print_highlight(response.json())


response_data = json.loads(response.json()["text"])
# validate the response by the pydantic model
capital_info = CapitalInfo.model_validate(response_data)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")

JSON Schema Directly

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)

print_highlight(response.json())

EBNF

import requests

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print_highlight(response.json())

Regular expression

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print_highlight(response.json())
terminate_process(server_process, port)

Offline Engine API

import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar"
)

JSON

Using Pydantic

import json
from pydantic import BaseModel, Field


prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


sampling_params = {
    "temperature": 0.1,
    "top_p": 0.95,
    "json_schema": json.dumps(CapitalInfo.model_json_schema()),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}")  # validate the output by the pydantic model
    capital_info = CapitalInfo.model_validate_json(output["text"])
    print_highlight(f"Validated output: {capital_info.model_dump_json()}")

JSON Schema Directly

prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

EBNF

prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Regular expression

prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
llm.shutdown()

File: developer/development_guide_using_docker.md

Development Guide Using Docker

Setup VSCode

Download the VS Code CLI from https://code.visualstudio.com/docs/?dv=linux64cli

wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz
tar xf vscode_cli_alpine_x64_cli.tar.gz

# https://code.visualstudio.com/docs/remote/tunnels
./code tunnel

Setup Docker Container

The following startup command is an example for internal development by the SGLang team. You can modify or add directory mappings as needed, especially for model weight downloads, to prevent repeated downloads by different Docker containers.

H100

# Change the name to yours
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh

H200

docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh

Profile

# Change batch size, input, output and add `disable-cuda-graph` (for easier analysis)
# e.g. DeepSeek V3
nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph

Evaluation

# e.g. gsm8k 8 shot
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8

File: developer/release_process.md

PyPI Package Release Process

Update the version in code

Update the package version in python/pyproject.toml and python/sglang/__init__.py.

Upload the PyPI package

pip install build twine
cd python
bash upload_pypi.sh

Make a release in GitHub

Make a new release https://github.com/sgl-project/sglang/releases/new.


File: developer/setup_github_runner.md

Set Up Self-Hosted Runners for GitHub Action

Add a Runner

Step 1: Start a docker container.

You can mount a folder for the shared huggingface model weights cache. The command below uses /tmp/huggingface as an example.

docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
# Nvidia
docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
# AMD
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.3-rocm630 /bin/bash
# AMD just the last 2 GPUs
docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.3-rocm630 /bin/bash

Step 2: Configure the runner by config.sh

Run these commands inside the container.

apt update && apt install -y curl python3-pip git
export RUNNER_ALLOW_RUNASROOT=1

Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run config.sh

Notes

  • You do not need to specify the runner group.
  • Give it a name (e.g., test-sgl-gpu-0) and some labels (e.g., 1-gpu-runner). The labels can be edited later in GitHub Settings.
  • You do not need to change the work folder.

Step 3: Run the runner by run.sh

  • Set up environment variables
export HF_HOME=/hf_home
export SGLANG_IS_IN_CI=true
export HF_TOKEN=hf_xxx
export OPENAI_API_KEY=sk-xxx
export CUDA_VISIBLE_DEVICES=0
  • Run it forever
while true; do ./run.sh; echo "Restarting..."; sleep 2; done

File: frontend/choices_methods.md

Choices Methods in SGLang

This doc describes the choices methods supported by SGLang.

The optional choices_method arg determines how options supplied to SGLang's choices primitive are selected. Only the RuntimeEndpoint backend supports the choices_method arg. Other backends, such as OpenAI, have bespoke selection implementations due to API limitations.

Methods

Token Length Normalized

Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.

Usage example (alternatively, simply omit the choices_method arg):

@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.token_length_normalized,
        )
    )

This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are ["Paris", "Antidisestablishmentarianism"].
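
This failure mode can be seen with a toy calculation (not SGLang internals): averaging per-token logprobs lets a long option whose later tokens are near-certain outscore a short, genuinely likely option. The numbers below are made up purely for illustration.

# Toy illustration of token-length normalized selection.
def token_length_normalized(option_logprobs):
    # option_logprobs: {option: [logprob of each of its tokens]}
    scores = {opt: sum(lps) / len(lps) for opt, lps in option_logprobs.items()}
    return max(scores, key=scores.get), scores

choice, scores = token_length_normalized(
    {
        # Short, plausible answer: one moderately likely token.
        "Paris": [-1.5],
        # Long distractor: unlikely first token, but each later token is almost
        # fully determined by the previous ones, so the average looks good.
        "Antidisestablishmentarianism": [-7.0] + [-0.01] * 7,
    }
)
print(choice, scores)  # the long distractor wins on average logprob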

Greedy Token Selection

Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.

Usage example:

@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.greedy_token_selection,
        )
    )

This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:

@sgl.function
def us_president_example(s):
    s += sgl.user("Name a US president.")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["Donald Duck", "Millard Fillmore"],
            choices_method=sgl.greedy_token_selection,
        )
    )

Unconditional Likelihood Normalized

Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in this EleutherAI blogpost. This method incurs an additional LLM call to obtain the unconditional likelihoods.

Usage example:

@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.unconditional_likelihood_normalized,
        )
    )

File: frontend/frontend.md

Frontend: Structured Generation Language (SGLang)

The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflow.

Quick Start

The example below shows how to use SGLang to answer a multi-turn question.

Using Local Models

First, launch a server with

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Then, connect to the server and answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

Using OpenAI Models

Set the OpenAI API Key

export OPENAI_API_KEY=sk-******

Then, answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

More Examples

Anthropic and VertexAI (Gemini) models are also supported. You can find more examples at examples/quick_start.

Language Feature

To begin with, import sglang.

import sglang as sgl

sglang provides some simple primitives such as gen, select, fork, image. You can implement your prompt flow in a function decorated by sgl.function. You can then invoke the function with run or run_batch. The system will manage the state, chat template, parallelism and batching for you.

The complete code for the examples below can be found at readme_examples.py

Control Flow

You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")

Parallelism

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.

@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")

Multi-Modality

Use sgl.image to pass an image as input.

@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

See also local_example_llava_next.py.

Constrained Decoding

Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.

@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )

JSON Decoding

Use regex to specify a JSON schema with a regular expression.

character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)

See also json_decode.py for an additional example of specifying formats with Pydantic models.

Batching

Use run_batch to run a batch of requests with continuous batching.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True
)

Streaming

Add stream=True to enable streaming.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)

Roles

Use sgl.system, sgl.user and sgl.assistant to set roles when using Chat models. You can also define more complex role prompts using begin and end tokens.

@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()

Tips and Implementation Details

  • The choices argument in sgl.gen is implemented by computing the token-length normalized log probabilities of all choices and selecting the one with the highest probability.
  • The regex argument in sgl.gen is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with temperature=0 and temperature != 0.
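
As a toy illustration of this mechanism (not SGLang internals), logit bias masking assigns an effectively negative-infinite logit to every token the regex state does not allow at the current step, so only constraint-satisfying tokens can be emitted. The token ids and logits below are hypothetical.

# Toy sketch: one constrained decoding step via logit masking.
def constrained_step(logits, allowed_token_ids):
    # logits: {token_id: raw logit}; allowed_token_ids: tokens permitted by the
    # regex/grammar state at this step.
    masked = {
        tok: (logit if tok in allowed_token_ids else float("-inf"))
        for tok, logit in logits.items()
    }
    # Greedy pick (temperature=0); with temperature > 0, sample from the softmax
    # over the masked logits instead.
    return max(masked, key=masked.get)

# Suppose the regex state only allows digits next.
logits = {17: 3.2, 42: 5.1, 99: 4.0}
allowed = {17, 99}
print(constrained_step(logits, allowed))  # -> 99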

File: references/accuracy_evaluation.md

Measuring Model Accuracy in SGLang

This guide shows how to evaluate model accuracy using SGLang's built-in benchmarks. Please include accuracy on crucial benchmarks in your PR if you make modifications on the model side, like the kernel and model architecture.

Benchmarking Model Accuracy

This is a reference workflow for the MMLU benchmark. For more details or other benchmarks, please refer to the README in each specific benchmark folder under sglang/benchmark.

# Step 1: Download the dataset
bash download_data.sh

# Step 2: Launch the server (model selection, port, memory optimization)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Math-1.5B-Instruct \
  --port 30000 \
  --mem-fraction-static 0.8

# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10  # Test 10 subjects

# Step 4: Extract the accuracy
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'

Customizing Benchmark Scripts

Some benchmark implementations may differ from ours, causing accuracy discrepancies. To match Qwen2.5-Math's reported 76.8% GSM8K accuracy, customization is required.

# The GSM8K benchmark script includes few shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
    questions.append(get_one_example(lines, i, False))
    labels.append(get_answer_value(lines[i]["answer"]))
@sgl.function
def few_shot_gsm8k(s, question):
    # System prompt given in https://github.com/QwenLM/Qwen2.5-Math
    s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.") # Include system prompt
    s += few_shot_examples + question
    # Stopwords given in evaluation/math_eval.py of the Qwen2.5-Math repo
    s += sgl.gen(
        "answer", max_tokens=2048, stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"]
    )

These adjustments should return the desired accuracy.

Extending Evaluation Capabilities

  1. Contribute New Benchmarks
  2. Request Implementations
    • Feel free to open an issue describing your evaluation needs
  3. Use Alternative Tools

File: references/amd.md

SGLang on AMD

Introduction

This document describes how to set up an AMD-based environment for SGLang. If you encounter issues or have questions, please open an issue on the SGLang repository.

System Configure

When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

NOTE: We strongly recommend reading these docs in their entirety to fully utilize your system.

Below are a few key settings to confirm or enable:

Update GRUB Settings

In /etc/default/grub, append the following to GRUB_CMDLINE_LINUX:

pci=realloc=off iommu=pt

Afterward, run sudo update-grub (or your distro’s equivalent) and reboot.

Disable NUMA Auto-Balancing

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

You can automate or verify this change using this helpful script.

Again, please go through the entire documentation to confirm your system is using the recommended configuration.

Installing SGLang

For general installation instructions, see the official SGLang Installation Docs. Below are the AMD-specific steps summarized for convenience.

Install from Source

git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all_hip]"

Install Using Docker (Recommended)

  1. Build the docker image.
docker build -t sglang_image -f Dockerfile.rocm .
  2. Create a convenient alias.
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri \
    --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx \
    -v /data:/data'
  3. Launch the server.

NOTE: Replace <secret> below with your huggingface hub token.

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path NousResearch/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000
  4. To verify the setup, you can run a benchmark in another terminal or refer to other docs to send requests to the engine.
drun sglang_image \
    python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 4000 \
    --random-input 128 \
    --random-output 128

With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.

Examples

Running DeepSeek-V3

The only difference in running DeepSeek-V3 is when starting the server. Here's an example command:

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000

Running Llama3.1

Running Llama3.1 is nearly identical. The only difference is in the model specified when starting the server, shown by the following example command:

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000

Warmup Step

When the server displays "The server is fired up and ready to roll!", it means the startup is successful.


File: references/benchmark_and_profiling.md

Benchmark and Profiling

Benchmark

  • Benchmark the latency of running a single static batch without a server. The arguments are the same as for launch_server.py. Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
    python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
    
  • Benchmark offline processing. This script will start an offline engine and run the benchmark.
    python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
    
  • Benchmark online serving. Please use sglang.launch_server to launch a server first and run the following command.
    python3 -m sglang.bench_serving --backend sglang --num-prompts 10
    

Profile with PyTorch Profiler

PyTorch Profiler is a convenient basic tool for inspecting kernel execution time, call stacks, and kernel overlap and occupancy.

  • To profile a server
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# send profiling request from client
python -m sglang.bench_serving --backend sglang --model-path meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile

Please make sure that SGLANG_TORCH_PROFILER_DIR is set on both the server and client side; otherwise, the trace file cannot be generated correctly. A reliable way is to set SGLANG_TORCH_PROFILER_DIR in your shell's rc file (e.g., ~/.bashrc for bash shells).

  • To profile offline
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
  • View Traces

Trace files can be loaded and visualized from:

  1. https://ui.perfetto.dev/ (any browser)
  2. chrome://tracing (Chrome browser only)

If the browser cannot open a trace file due to its large size, the client can generate a small trace file (<100MB) by limiting the number of prompts and the lengths of the prompt outputs. For example, when profiling a server,

python -m sglang.bench_serving --backend sglang --model-path meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile

sets the number of prompts to 2 with the --num-prompts argument and limits the length of output sequences to 100 with the --sharegpt-output-len argument, which generates a small trace file that the browser can open smoothly.

Profile with Nsight

Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.

  1. Prerequisite: install using apt, or run inside an NVIDIA Docker container or SGLang Docker container.
# install nsys
# https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
  2. To profile a single batch, use nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512

  3. To profile a server, e.g.

# server
# set the delay and duration times according to needs
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512
  4. Use NVTX to annotate code regions, e.g. to see their execution time.
# install nvtx
pip install nvtx
# code snippets
import nvtx
with nvtx.annotate("my_region", color="green"):
    ...  # some critical code

Other tips

  1. You can benchmark a model using dummy weights by only providing the config.json file. This allows for quick testing of model variants without training. To do so, add --load-format dummy to the above commands and then you only need a correct config.json under the checkpoint folder.
  2. You can benchmark a model with modified configs (e.g., less layers) by using --json-model-override-args. For example, you can benchmark a model with only 2 layers and 2 kv heads using python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
  3. You can use --python-backtrace=cuda to see python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event based timing)
  4. For more args please see https://docs.nvidia.com/nsight-systems/UserGuide/index.html

File: references/contribution_guide.md

Contribution Guide

Welcome to SGLang! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.

Setting Up & Building from Source

Fork and Clone the Repository

Note: New contributors do not have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.

git clone https://github.com/<your_user_name>/sglang.git

Install Dependencies & Build

Refer to Install SGLang from Source documentation for more details on setting up the necessary dependencies.

Code Formatting with Pre-Commit

We use pre-commit to maintain consistent code style checks. Before pushing your changes, please run:

pip3 install pre-commit
pre-commit install
pre-commit run --all-files
  • pre-commit run --all-files manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks before creating a Pull Request.
  • Do not commit directly to the main branch. Always create a new branch (e.g., feature/my-new-feature), push your changes, and open a PR from that branch.

Running Unit Tests & Adding to CI

SGLang uses Python's built-in unittest framework. For detailed instructions on running tests and adding them to CI, please refer to test/README.md.

Writing Documentation & Running Docs CI

We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to docs/README.md.

Tips for Newcomers

If you want to contribute but don’t have a specific idea in mind, pick issues labeled “good first issue” or “help wanted”. These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this code walk-through for a deeper look into SGLang’s workflow.

If you have any questions or want to start a discussion, please feel free to ask in our Slack channel.

Thank you for your interest in SGLang—happy coding!


File: references/custom_chat_template.md

Custom Chat Template in SGLang Runtime

NOTE: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at conversation.py). It is NOT related to the chat template used in the SGLang language frontend (defined at chat_template.py).

By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.

If needed, you can also override the chat template when launching the server:

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2

If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.

JSON Format

You can load the JSON format, which is defined by conversation.py.

{
  "name": "my_model",
  "system": "<|im_start|>system",
  "user": "<|im_start|>user",
  "assistant": "<|im_start|>assistant",
  "sep_style": "CHATML",
  "sep": "<|im_end|>",
  "stop_str": ["<|im_end|>", "<|im_start|>"]
}
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json

Jinja Format

You can also use the Jinja template format, defined by Hugging Face transformers https://huggingface.co/docs/transformers/main/en/chat_templating

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja

File: references/deepseek.md

DeepSeek Model Usage and Optimizations

SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for DeepSeek V3.

Launch DeepSeek V3 with SGLang

SGLang is recognized as one of the top engines for DeepSeek model inference.

Download Weights

If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to the DeepSeek V3 official guide to download the weights.

Launch with One node of 8 H200

Please refer to the example. Note that DeepSeek V3 is already in FP8, so we should not run it with any quantization arguments like --quantization fp8 --kv-cache-dtype fp8_e5m2. Also, --enable-dp-attention can be useful for improving DeepSeek V3/R1's throughput. Please refer to Data Parallelism Attention for details.

Running examples on Multi-node

Optimizations

Multi-head Latent Attention (MLA) Throughput Optimizations

Description: MLA is an innovative attention mechanism introduced by the DeepSeek team, aimed at improving inference efficiency. SGLang has implemented specific optimizations for this, including:

  • Weight Absorption: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.

  • Triton Decoding Kernel Optimization: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.

  • FP8 Quantization: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.

  • CUDA Graph & Torch.compile: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.

Overall, with these optimizations, we have achieved up to a 7x acceleration in output throughput compared to the previous version.

Multi-head Latent Attention for DeepSeek Series Models

Usage: MLA optimization is enabled by default; to disable it, use --disable-mla.

Reference: Check Blog and Slides for more details.

Data Parallelism Attention

Description: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer.

Data Parallelism Attention for DeepSeek Series Models

Usage: This optimization is aimed at improving throughput and should be used for scenarios with high QPS (Queries Per Second). Data Parallelism Attention optimization can be enabled by --enable-dp-attention for DeepSeek Series Models.
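
A sketch of a launch command follows, assuming a single node with 8 sufficiently large GPUs. The flag spellings here mirror the server arguments (tp_size, dp_size, enable_dp_attention) with dp_size set equal to tp_size as required; check python -m sglang.launch_server --help on your install for the exact flags.

# Sketch: DP attention for a DeepSeek model (dp_size == tp_size).
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server "
    "--model-path deepseek-ai/DeepSeek-V3 "
    "--tp 8 --dp-size 8 --enable-dp-attention "
    "--trust-remote-code --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process)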

Data Parallelism Attention Performance Comparison

Reference: Check Blog.

Multi Node Tensor Parallelism

Description: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.

Usage: Check here for usage examples.

Block-wise FP8

Description: SGLang implements block-wise FP8 quantization with two key optimizations:

  • Activation: E4M3 format using per-token-per-128-channel sub-vector scales with online casting.

  • Weight: Per-128x128-block quantization for better numerical stability.

Usage: turned on by default for DeepSeek V3 models.


File: references/faq.md

Frequently Asked Questions

The results are not deterministic, even with a temperature of 0

You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.

From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to the final nondeterministic outputs.

To achieve more deterministic outputs in the current code, you can add --disable-radix-cache and send only one request at a time. The results will be mostly deterministic under this setting.
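
A minimal sketch of this setting follows, using the launch_server_cmd helper from these docs and the native /generate endpoint; the prompt is arbitrary and chosen only for illustration.

# Sketch: radix cache disabled, identical requests sent one at a time.
import requests
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server "
    "--model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--disable-radix-cache --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")

payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 8},
}
# Send the same request twice, sequentially (never concurrently), and compare.
out1 = requests.post(f"http://localhost:{port}/generate", json=payload).json()["text"]
out2 = requests.post(f"http://localhost:{port}/generate", json=payload).json()["text"]
print(out1 == out2)  # expected to be True in most runs under this setting

terminate_process(server_process)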

We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.

We have two issues to track our progress:


File: references/hyperparameter_tuning.md

Guide on Hyperparameter Tuning

Achieving Peak Throughput

Achieving a large batch size is the most important thing for attaining high throughput.

When the server is running at full load, look for the following in the log:

Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317

Tune Your Request Submission Speed

#queue-req indicates the number of requests in the queue. If you frequently see #queue-req == 0, it suggests you are bottlenecked by the request submission speed. A healthy range for #queue-req is 50 - 500. On the other hand, do not make #queue-req too large because it will also increase the scheduling overhead on the server, especially when using the default longest-prefix-match schedule policy (--schedule-policy lpm).

Tune --schedule-conservativeness

token usage indicates the KV cache memory utilization of the server. token usage > 0.9 means good utilization. If you frequently see token usage < 0.9 and #queue-req > 0, it means the server is too conservative about taking in new requests. You can decrease --schedule-conservativeness to a value like 0.3. The server can be too conservative when users send many requests with a large max_new_tokens but the requests stop very early due to EOS or stop strings.

On the other hand, if you see token usage very high and you frequently see warnings like decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000, you can increase --schedule-conservativeness to a value like 1.3. If you see decode out of memory happened occasionally but not frequently, it is okay.

Tune --dp-size and --tp-size

Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.

Avoid out-of-memory by Tuning --chunked-prefill-size, --mem-fraction-static, --max-running-requests

If you see out of memory (OOM) errors, you can try to tune the following parameters.

  • If OOM happens during prefill, try to decrease --chunked-prefill-size to 4096 or 2048.
  • If OOM happens during decoding, try to decrease --max-running-requests.
  • You can also try to decrease --mem-fraction-static, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
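
For example, a re-launch with more conservative settings might look like the sketch below; the values are illustrative starting points rather than tuned recommendations, and the launch_server_cmd helper from these docs is assumed.

# Sketch: relaunching with more conservative memory settings after an OOM.
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server "
    "--model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--chunked-prefill-size 2048 "   # smaller prefill chunks if OOM happens during prefill
    "--max-running-requests 64 "     # fewer concurrent requests if OOM happens during decode
    "--mem-fraction-static 0.8 "     # smaller KV cache memory pool
    "--host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
terminate_process(server_process)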

Try Advanced Options

  • To enable torch.compile acceleration, add --enable-torch-compile. It accelerates small models on small batch sizes. This does not work for FP8 currently.

Tune --schedule-policy

If the workload has many shared prefixes, use the default --schedule-policy lpm. lpm stands for longest prefix match. When you have no shared prefixes at all or you always send the requests with the shared prefixes together, you can try --schedule-policy fcfs. fcfs stands for first come first serve. fcfs has a lower scheduling overhead.


File: references/learn_more.md

Learn more

You can find more blogs, slides, and videos about SGLang at https://github.com/sgl-project/sgl-learning-materials.


File: references/modelscope.md

Use Models From ModelScope

To use a model from ModelScope, set the environment variable SGLANG_USE_MODELSCOPE.

export SGLANG_USE_MODELSCOPE=true

We take Qwen2-7B-Instruct as an example.

Launch the Server:

python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000

Or start it by docker:

docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/modelscope:/root/.cache/modelscope \
    --env "SGLANG_USE_MODELSCOPE=true" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000

Note that ModelScope uses a different cache directory than Hugging Face. You may need to set it manually to avoid running out of disk space.


File: references/multi_node.md

Run Multi-Node Inference

Llama 3.1 405B

Run 405B (fp16) on Two Nodes

# replace 172.16.4.52:20000 with your own node ip address and port of the first node

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0

# replace 172.18.45.52:20000 with your own node ip address and port of the second node

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.18.45.52:20000 --nnodes 2 --node-rank 1

Note that Llama 405B (FP8) can also be launched on a single node.

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

DeepSeek V3/R1

Please refer to the DeepSeek documentation for reference.

Multi-Node Inference on SLURM

This example showcases how to serve SGLang server across multiple nodes by SLURM. Submit the following job to the SLURM cluster.

#!/bin/bash -l

#SBATCH -o SLURM_Logs/%x_%j_master.out
#SBATCH -e SLURM_Logs/%x_%j_master.err
#SBATCH -D ./
#SBATCH -J Llama-405B-Online-Inference-TP16-SGL

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1  # Ensure 1 task per node
#SBATCH --cpus-per-task=18
#SBATCH --mem=224GB
#SBATCH --partition="lmsys.org"
#SBATCH --gres=gpu:8
#SBATCH --time=12:00:00

echo "[INFO] Activating environment on node $SLURM_PROCID"
if ! source ENV_FOLDER/bin/activate; then
    echo "[ERROR] Failed to activate environment" >&2
    exit 1
fi

# Define parameters
model=MODEL_PATH
tp_size=16

echo "[INFO] Running inference"
echo "[INFO] Model: $model"
echo "[INFO] TP Size: $tp_size"

# Set NCCL initialization address using the hostname of the head node
HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
NCCL_INIT_ADDR="${HEAD_NODE}:8000"
echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR"

# Launch the model server on each node using SLURM
srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node$SLURM_NODEID.out" \
    --error="SLURM_Logs/%x_%j_node$SLURM_NODEID.err" \
    python3 -m sglang.launch_server \
    --model-path "$model" \
    --grammar-backend "xgrammar" \
    --tp "$tp_size" \
    --nccl-init-addr "$NCCL_INIT_ADDR" \
    --nnodes 2 \
    --node-rank "$SLURM_NODEID" &

# Wait for the NCCL server to be ready on port 30000
while ! nc -z "$HEAD_NODE" 30000; do
    sleep 1
    echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections"
done

echo "[INFO] $HEAD_NODE:30000 is ready to accept connections"

# Keep the script running until the SLURM job times out
wait

Then, you can test the server by sending requests, following the examples in the other documents.
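For instance, here is a minimal sketch of such a request against the head node (the /generate payload follows the sampling parameters documented later in these docs; the hostname placeholder is illustrative):

import requests

# Replace <head-node> with the hostname of the SLURM head node from the script above.
response = requests.post(
    "http://<head-node>:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(response.json())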

Thanks to aflah02 for providing this example, based on his blog post.


File: references/nvidia_jetson.md

Apply SGLang on NVIDIA Jetson Orin

Prerequisites

Before starting, ensure the following:

  • Install torch from the Jetson AI Lab index:

pip install torch --index-url https://pypi.jetson-ai-lab.dev/jp6/cu126

Installation

Please refer to the Installation Guide to install FlashInfer and SGLang.


Running Inference

Launch the server:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --dtype half \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192

The reduced precision and limited context length (--dtype half --context-length 8192) are due to the limited computational resources of the NVIDIA Jetson Orin developer kit. A detailed explanation can be found in Server Arguments.

After launching the engine, refer to Chat completions to test the usability.


Running quantization with TorchAO

TorchAO is recommended for NVIDIA Jetson Orin.

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128

This enables TorchAO's int4 weight-only quantization with a group size of 128, which also improves memory efficiency.


Structured output with XGrammar

Please refer to the SGLang structured output documentation.


Thanks to shahizat for their support.

References


File: references/production_metrics.md

Production Metrics

SGLang exposes the following metrics via Prometheus. The metrics are namespaced by $name (the model name).

An example of the monitoring dashboard is available in examples/monitoring/grafana.json.

Here is an example of the metrics:

$ curl http://localhost:30000/metrics

# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.30457592010498047
sglang:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.25",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="7.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="25.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.30521273612976074
sglang:e2e_request_latency_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="50.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0381757915019989
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.015",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:func_latency_seconds Function latency in seconds
# TYPE sglang:func_latency_seconds histogram
sglang:func_latency_seconds_sum{name="generate_request"} 0.3061351010110229
sglang:func_latency_seconds_bucket{le="0.05",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.07500000000000001",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.1125",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.16875",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.253125",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.3796875",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="0.56953125",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="0.8542968750000001",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="1.2814453125",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="1.9221679687500002",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="2.8832519531250003",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="4.3248779296875",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="6.487316894531251",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="9.730975341796876",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="14.596463012695313",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="21.89469451904297",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="32.84204177856446",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="49.26306266784668",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="+Inf",name="generate_request"} 1.0
sglang:func_latency_seconds_count{name="generate_request"} 1.0
# HELP sglang:num_running_reqs The number of running requests
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:num_used_tokens The number of used tokens
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:gen_throughput The generate throughput (token/s)
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:token_usage The token usage
# TYPE sglang:token_usage gauge
sglang:token_usage{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0

Setup Guide

To set up a monitoring dashboard, you can use the following docker compose file: examples/monitoring/docker-compose.yaml.

Assuming you have an SGLang server running at localhost:30000.

To start the monitoring dashboard (prometheus + grafana), cd to examples/monitoring and run:

docker compose -f compose.yaml -p monitoring up

Then you can access the Grafana dashboard at http://localhost:3000.
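If you prefer to run Prometheus directly on the host instead of using the provided compose file, a minimal scrape-config sketch looks like the following (the job name and scrape interval are illustrative; it assumes the SGLang server exposes /metrics at localhost:30000, as in the curl example above):

# prometheus.yml (sketch)
scrape_configs:
  - job_name: sglang
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:30000"]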

Grafana Dashboard

To import the Grafana dashboard, click + -> Import -> Upload JSON file -> Upload and select grafana.json.


File: references/quantization.md

Quantization

SGLang supports various quantization methods, including offline quantization and online dynamic quantization.

Offline quantization loads pre-quantized model weights directly during inference. This is useful for methods requiring pre-computed stats, such as AWQ, which collects activation statistics from a calibration set.

Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime. Like NVIDIA FP8 training's delayed scaling mechanism, online quantization calculates the appropriate scaling factors on-the-fly to convert high-precision weights into a lower-precision format.

Note that, for better performance, usability, and convenience, offline quantization is recommended over online quantization. If you use a pre-quantized model, do not also add --quantization to enable online quantization. For popular pre-quantized models, please visit the Neural Magic collection on Hugging Face.

Offline Quantization

To load already-quantized models, simply load the model weights and config. If the model has been quantized offline, there is no need to add the --quantization argument when starting the engine; the quantization method is parsed from the downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.

python3 -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --port 30000 --host 0.0.0.0

To quantize your model offline, first install the llm-compressor library:

pip install llmcompressor

Here, we take quantizing meta-llama/Meta-Llama-3-8B-Instruct to FP8 as an example of how to do offline quantization.

from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Step 1: Load the original model.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
  MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Step 2: Perform offline quantization.
# Step 2.1: Configure the simple PTQ quantization.
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Step 2.2: Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Step 3: Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

Then, you can directly use the quantized model with SGLang using the following command:

python3 -m sglang.launch_server \
    --model-path $PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic \
    --port 30000 --host 0.0.0.0

Online Quantization

To enable online quantization, you can simply specify --quantization in the command line. For example, you can launch the server with the following command to enable FP8 quantization for model meta-llama/Meta-Llama-3.1-8B-Instruct:

python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000 --host 0.0.0.0

Our team is working on supporting more online quantization methods. We will soon support methods including but not limited to ["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"].

We also support quantization methods based on torchao. You can simply specify --torchao-config in the command line to enable this feature. For example, if you want to enable int4wo-128 for model meta-llama/Meta-Llama-3.1-8B-Instruct, you can launch the server with the following command:

python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --torchao-config int4wo-128 \
    --port 30000 --host 0.0.0.0

We support the following quantization methods based on torchao: ["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"].

Note: According to this issue, the "int8dq" method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using "int8dq". Namely, please use the following command:

python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --torchao-config int8dq \
    --disable-cuda-graph \
    --port 30000 --host 0.0.0.0

Reference


File: references/sampling_params.md

Sampling Parameters in SGLang Runtime

This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime. If you want a high-level endpoint that can automatically handle chat templates, consider using the OpenAI Compatible API.

The /generate endpoint accepts the following arguments in the JSON format.

@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; one can specify either text or input_ids or input_embeds.
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a url, or base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # If return logprobs, the start location in the prompt for returning logprobs.
    # By default, this value is "-1", which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # If return logprobs, the number of top logprobs to return at each position.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
    # Whether to log metrics for this request (e.g. health_generate calls do not log metrics)
    log_metrics: bool = True

    # The modalities of the image data [image, multi-images, video]
    modalities: Optional[List[str]] = None
    # LoRA related
    lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None

    # Session info for continual prompting
    session_params: Optional[Union[List[Dict], Dict]] = None
    # Custom logit processor for advanced sampling control. Must be a serialized instance
    # of `CustomLogitProcessor` in python/sglang/srt/sampling/custom_logit_processor.py
    # Use the processor's `to_str()` method to generate the serialized string.
    custom_logit_processor: Optional[Union[List[Optional[str]], str]] = None

The sampling_params follow this format:

# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization
spaces_between_special_tokens: bool = True,
# Do parallel sampling and return `n` outputs.
n: int = 1,

## Structured Outputs
# Only one of the below three can be set for a request.

# Constrain the output to follow a given JSON schema.
json_schema: Optional[str] = None,
# Constrain the output to follow a given regular expression.
regex: Optional[str] = None,
# Constrain the output to follow a given EBNF grammar.
ebnf: Optional[str] = None,

## Penalties.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing logits of tokenizer's
# EOS token and `stop_token_ids` to -inf, until the output token reaches given length.
# Note that any of the `stop` string can be generated before reaching `min_new_tokens`, as it is
# difficult to infer the correct token ID by given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,


## Custom Parameters for Custom Logit Processor.
# A dictionary of custom parameters for the custom logit processor.
# The custom logit processor takes a list of dictionaries as input, where each
# dictionary is the custom parameters for one token in a batch of the input.
# See also python/sglang/srt/sampling/custom_logit_processor.py
custom_params: Optional[Dict[str, Any]] = None,

Examples

Normal

Launch a server

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Send a request

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())

Streaming

Send a request and stream the output

import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")

Multi modal

Launch a server

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava

Download an image

curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true

Send a request

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())

The image_data can be a file name, a URL, or a base64 encoded string. See also python/sglang/srt/utils.py:load_image. Streaming is supported in a similar manner as above.
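For instance, here is a minimal sketch of sending a base64-encoded image instead of a file name (it assumes the server accepts the raw base64 string directly, as the note above suggests):

import base64
import requests

# Encode the downloaded example image as a base64 string.
with open("example_image.png", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": encoded_image,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())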

Structured Outputs (JSON, Regex, EBNF)

You can specify a JSON schema, regular expression or EBNF to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified for a request.

SGLang supports two grammar backends:

  • Outlines (default): Supports JSON schema and regular expression constraints.
  • XGrammar: Supports JSON schema, regular expression, and EBNF constraints.

Initialize the XGrammar backend using the --grammar-backend xgrammar flag:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)

Then send requests with the desired constraint:

import json
import requests

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

# JSON (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression (Outlines backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

# EBNF (XGrammar backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a greeting.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
        },
    },
)
print(response.json())

Custom Logit Processor

Launch a server with --enable-custom-logit-processor flag on.

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor

Define a custom logit processor that will always sample a specific token id.

from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor

class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits

Send a request

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())

File: references/supported_models.md

Supported Models

Generative Models

  • Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
  • Mistral / Mixtral / Mistral NeMo / Mistral Small 3
  • Gemma / Gemma 2
  • Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
  • DeepSeek / DeepSeek 2 / DeepSeek 3
  • OLMoE
  • LLaVA-OneVision
    • python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava
    • python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava
    • Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
  • LLaVA 1.5 / 1.6 / NeXT
    • python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3
    • python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava
    • Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
  • Yi-VL
  • StableLM
  • Command-R
  • DBRX
  • Grok
  • ChatGLM
  • InternLM 2
  • Exaone 3
  • BaiChuan2
  • MiniCPM / MiniCPM 3 / MiniCPMV
  • XVERSE / XVERSE MoE
  • SmolLM
  • GLM-4
  • Phi-3 / Phi-4
  • Phi-3-Small
  • IBM Granite 3

Embedding Models

  • LlamaEmbeddingModel
  • Mistral embedding models
  • Qwen embedding models
    • python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding

Reward Models

  • LlamaForSequenceClassification
    • python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --is-embedding
  • Gemma2ForSequenceClassification
    • python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Gemma-2-27B-v0.2 --is-embedding
  • InternLM2ForRewardModel
    • python -m sglang.launch_server --model-path internlm/internlm2-7b-reward --is-embedding --trust-remote-code

How to Support a New Language Model

To support a new model in SGLang, you only need to add a single file under SGLang Models Directory. You can learn from existing model implementations and create new files for the new models. For most models, you should be able to find a similar model to start with (e.g., starting from Llama).

How to Support a New Vision LLM

To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM.

  1. Register your new model as multimodal: Extend is_multimodal_model in model_config.py to return True for your model.
  2. Process Images: Create a new ImageProcessor class that inherits from BaseImageProcessor and register this processor as your model's dedicated processor. See image_processor.py for more details.
  3. Handle Image Tokens: Implement a pad_input_ids function for your new model, in which image tokens in the prompt should be expanded and replaced with image-hashes, so that SGLang can recognize different images for RadixAttention.
  4. Replace Multi-headed Attention of ViT with SGLang's VisionAttention.

You can refer to Qwen2VL or other VLMs. These models demonstrate how to properly handle both visual and textual inputs.

Test the correctness

Interactive debugging

For interactive debugging, you can compare the outputs of huggingface/transformers and SGLang. The following two commands should give the same text output and very similar prefill logits.

  • Get the reference output by python3 scripts/playground/reference_hf.py --model [new model]
  • Get the SGLang output by python3 -m sglang.bench_one_batch --correct --model [new model]

Add the model to the test suite

To make sure the new model is well maintained in the future, it is better to add it to the test suite. You can add it to the ALL_OTHER_MODELS list in test_generation_models.py and run the following command to test it.

For example, if the model is Qwen/Qwen2-1.5B:

ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others

Port a model from vLLM to SGLang

Another valuable resource is the vLLM Models Directory. vLLM has extensive coverage of models, and SGLang reuses vLLM's interface and some layers to implement the models. This similarity makes it easy to port many models from vLLM to SGLang.

To port a model from vLLM to SGLang, you can compare these two files SGLang Llama Implementation and vLLM Llama Implementation. This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of Attention with RadixAttention. The other parts are almost identical. Specifically,

  • Replace vLLM's Attention with RadixAttention. Note that you need to pass layer_id all the way to RadixAttention.
  • Replace vLLM's LogitsProcessor with SGLang's LogitsProcessor.
  • Replace Multi-headed Attention of ViT with SGLang's VisionAttention.
  • Replace other vLLM layers with SGLang layers (e.g., RMSNorm, SiluAndMul).
  • Remove Sample.
  • Change forward() functions, and add forward_batch.
  • Add EntryClass at the end.
  • Please ensure the new implementation uses only SGLang components and does not rely on any vLLM components.

Registering an external model implementation

In addition to the methods described above, you can also register your new model with the ModelRegistry before launching the server. This approach is useful if you want to integrate your model without needing to modify the source code.

Here is how you can do it:

from sglang.srt.models.registry import ModelRegistry
from sglang.srt.entrypoints.http_server import launch_server

# for a single model, you can add it to the registry
ModelRegistry.models[model_name] = model_class

# for multiple models, you can imitate the import_model_classes() function in sglang/srt/models/registry.py
from functools import lru_cache

@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {}
    ...
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

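# `server_args` is assumed to be a prepared `ServerArgs` instance
# (see sglang/srt/server_args.py); it is not constructed in this snippet.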
launch_server(server_args)

File: references/torch_compile_cache.md

Enabling cache for torch.compile

SGLang uses max-autotune-no-cudagraphs mode of torch.compile. The auto-tuning can be slow. If you want to deploy a model on many different machines, you can ship the torch.compile cache to these machines and skip the compilation steps.

This is based on https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html

  1. Generate the cache by setting TORCHINDUCTOR_CACHE_DIR and running the model once.

TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile

  2. Copy the cache folder to other machines and launch the server with TORCHINDUCTOR_CACHE_DIR (see the example below).
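For example, on a target machine (assuming the cache folder was copied to the same path as on the machine that generated it):

TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile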

File: references/troubleshooting.md

Troubleshooting

This page lists some common errors and tips for fixing them.

CUDA out of memory

If you see out of memory (OOM) errors, you can try to tune the following parameters.

  • If OOM happens during prefill, try to decrease --chunked-prefill-size to 4096 or 2048.
  • If OOM happens during decoding, try to decrease --max-running-requests.
  • You can also try to decrease --mem-fraction-static, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.

CUDA error: an illegal memory access was encountered

This error may be due to kernel errors or out-of-memory issues.

  • If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
  • If it is an out-of-memory issue, it may report this error instead of "Out of memory." Please refer to the section above to avoid OOM.

File: router/router.md

Router for Data Parallelism

Given multiple GPUs running multiple SGLang Runtimes, SGLang Router distributes the requests to different Runtimes with its unique cache-aware load-balancing algorithm.

The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API.

Installation

$ pip install sglang-router

Detailed usage of the router can be found in launch_router and launch_server. You can also run the following commands to see the usage of the router.

$ python -m sglang_router.launch_server --help
$ python -m sglang_router.launch_router --help

The router supports two working modes:

  1. Co-launch Router and Runtimes
  2. Launch Runtimes and Router separately

Co-launch Router and Runtimes

This will be a drop-in replacement for the existing --dp-size argument of SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to be ready, and then connects the router to all of them.

$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1

After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.

import requests

url = "http://localhost:30000/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print(response.json())

Launch Runtimes and Router Separately

This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.

$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2

Dynamic Scaling APIs

We offer /add_worker and /remove_worker APIs to dynamically add or remove workers from the router.

  • /add_worker

Usage:

$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1

Example:

$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
Successfully added worker: http://127.0.0.1:30001
  • /remove_worker

Usage:

$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1

Example:

$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
Successfully removed worker: http://127.0.0.1:30001

Note:

  • For the cache-aware router, the worker will be removed from the tree and the queues.

Fault Tolerance

We provide retry-based fault tolerance.

  1. If a request to a worker fails max_worker_retries times, the router will remove that worker and move on to the next worker.
  2. If the total number of retries exceeds max_total_retries, the router will return an error.

Note:

  • max_worker_retries is 3 and max_total_retries is 6 by default.

Routing Strategies

Cache-Aware Load-Balancing Router

The native router combines two strategies to optimize both cache utilization and request distribution:

  1. Cache-Aware Routing (Approximate Tree)
  2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)

The router dynamically switches between these strategies based on load conditions:

  • Uses load balancing when the system is imbalanced
  • Uses cache-aware routing when the system is balanced

A system is considered imbalanced if both of the following conditions are met (see the sketch after this list):

  1. (max_load - min_load) > balance_abs_threshold
  2. max_load > balance_rel_threshold * min_load
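
As a minimal sketch (not the router's actual implementation; function and variable names are illustrative), the switching logic can be thought of as follows:

def is_imbalanced(loads, balance_abs_threshold=32, balance_rel_threshold=1.0001):
    # The system is imbalanced only when BOTH the absolute and relative
    # thresholds are exceeded.
    max_load, min_load = max(loads), min(loads)
    return (
        (max_load - min_load) > balance_abs_threshold
        and max_load > balance_rel_threshold * min_load
    )


def pick_strategy(loads):
    # Imbalanced -> shortest-queue load balancing; balanced -> cache-aware routing.
    return "shortest_queue" if is_imbalanced(loads) else "cache_aware"


# `loads` stands for the pending request count per worker.
print(pick_strategy([10, 12, 11]))  # balanced -> "cache_aware"
print(pick_strategy([100, 2, 5]))   # imbalanced -> "shortest_queue"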

Cache-Aware Routing (Approximate Tree)

When the workers are considered to be balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need for direct cache state queries on each worker. The tree stores raw text characters instead of token IDs to avoid tokenization overhead.

Process:

  1. For each request, find the worker with the highest prefix match.

    • If match rate > cache_threshold, route the request to the worker with highest match (likely has relevant data cached)
    • If match rate ≤ cache_threshold, route the request to the worker with smallest tree size (most available cache capacity)
  2. Background maintenance: Periodically evict least recently used leaf nodes on the approximate tree to prevent memory overflow.

Load-Balancing (Shortest Queue)

For unbalanced systems, this strategy tracks pending request counts per worker and routes new requests to the least busy worker. This helps maintain optimal load distribution across workers.

Configuration Parameters

  1. cache_threshold: (float, 0.0 to 1.0, default: 0.5)

    • Minimum prefix match ratio to use highest-match routing.
    • Below this threshold, the request will be routed to the worker with most available cache space.
  2. balance_abs_threshold: (integer, default: 32)

    • Absolute difference threshold for load imbalance detection.
    • The system is potentially imbalanced if (max_load - min_load) > abs_threshold.
  3. balance_rel_threshold: (float, default: 1.0001)

    • Relative ratio threshold for load imbalance detection.
    • The system is potentially imbalanced if max_load > min_load * rel_threshold.
    • Used in conjunction with balance_abs_threshold to determine the final imbalance state.
  4. eviction_interval: (integer, default: 60)

    • Interval in seconds between LRU eviction cycles for the approximate trees.
    • Background thread periodically evicts least recently used nodes to maintain tree size.
  5. max_tree_size: (integer, default: 16777216)

    • Maximum nodes on the approximate tree.
    • When exceeded, LRU leaf nodes are evicted during the next eviction cycle.
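
A hedged sketch of passing these parameters when launching the router; the flag names are assumed to mirror the parameter names above and are not verified here, so confirm them with python -m sglang_router.launch_router --help:

python -m sglang_router.launch_router \
    --worker-urls http://worker_url_1 http://worker_url_2 \
    --cache-threshold 0.5 \
    --balance-abs-threshold 32 \
    --balance-rel-threshold 1.0001  # flag names assumed; confirm with --help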

File: start/install.md

Install SGLang

You can install SGLang using any of the methods below. For running DeepSeek V3/R1 with SGLang, refer to DeepSeek V3 Support. It is always recommended to use the latest release version and deploy it with Docker to avoid already-fixed issues and environment-related problems.

Method 1: With pip

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version for torch 2.5. If you want to install flashinfer separately, please refer to FlashInfer installation doc. Please note that the package currently used by FlashInfer is named flashinfer-python, not flashinfer.

If you experience an error like OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root, please try either of the following solutions:

  • Use export CUDA_HOME=/usr/local/cuda-<your-cuda-version> to set the CUDA_HOME environment variable.
  • Follow the procedure described in FlashInfer installation doc first, then install SGLang as described above.

Method 2: From source

# Use the last release branch
git clone -b v0.4.3 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version for torch 2.5. If you want to install flashinfer separately, please refer to FlashInfer installation doc.

If you want to work on development in SGLang, it is highly recommended that you use docker. Please refer to setup docker container for guidance. The image used is lmsysorg/sglang:dev.

Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:

# Use the last release branch
git clone -b v0.4.3 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"

Method 3: Using docker

The docker images are available on Docker Hub as lmsysorg/sglang, built from Dockerfile. Replace <secret> below with your huggingface hub token.

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to use docker/Dockerfile.rocm to build the image. Example usage is shown below:

docker build --build-arg SGL_BRANCH=v0.4.3 -t v0.4.3-rocm630 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.4.3-rocm630 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# Until the flashinfer backend is available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.4.3-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8

Method 4: Using docker compose

This method is recommended if you plan to run SGLang as a service. For Kubernetes deployments, a better approach is to use k8s-sglang-service.yaml.

  1. Copy the compose.yml to your local machine
  2. Execute the command docker compose up -d in your terminal.

Method 5: Run on Kubernetes or Clouds with SkyPilot

To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.

  1. Install SkyPilot and set up a Kubernetes cluster or cloud access: see SkyPilot's documentation.
  2. Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
  3. To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.

Common Notes

  • FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding --attention-backend triton --sampling-backend pytorch and open an issue on GitHub.
  • If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using pip install "sglang[openai]".
  • The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run pip install sglang, and for the backend, use pip install sglang[srt]. srt is the abbreviation of SGLang runtime.
  • To reinstall flashinfer locally, use the following command: pip install "flashinfer-python>=0.2.1.post1" -i https://flashinfer.ai/whl/cu124/torch2.5 --force-reinstall --no-deps and then delete the cache with rm -rf ~/.cache/flashinfer.
