We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. Most documentation files are located under the docs/ folder. We prefer Jupyter Notebooks over Markdown so that all examples can be executed and validated by our docs CI pipeline.
pip install -r requirements.txt
Update your Jupyter notebooks in the appropriate subdirectories under docs/. If you add new files, remember to update index.rst (or the relevant .rst files) accordingly.
pre-commit run --all-files
manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks before creating a Pull Request.
- Do not commit directly to the main branch. Always create a new branch (e.g., feature/my-new-feature), push your changes, and open a PR from that branch.
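For example, a typical branch workflow looks like this (the branch name is just the placeholder used above):
# create a feature branch, commit your changes, and push it
git checkout -b feature/my-new-feature
git add docs/
git commit -m "docs: add new notebook"
git push -u origin feature/my-new-feature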
# 1) Compile all Jupyter notebooks
make compile
# 2) Generate static HTML
make html
# 3) Preview documentation locally
# Open your browser at the displayed port to view the docs
bash serve.sh
# 4) Clean notebook outputs
# nbstripout removes notebook outputs so your PR stays clean
pip install nbstripout
find . -name '*.ipynb' -exec nbstripout {} \;
# 5) Pre-commit checks and create a PR
# After these checks pass, push your changes and open a PR on your branch
pre-commit run --all-files
If you need to run and shut down an SGLang server or engine, follow these examples:
- Launch and close the server:
# Launch the server
from sglang.utils import (
execute_shell_command,
wait_for_server,
terminate_process,
print_highlight,
)
server_process = execute_shell_command(
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)
wait_for_server("http://localhost:30000")
# Terminate the server
terminate_process(server_process)
- Launch and close the engine:
# Launch Engine
import sglang as sgl
import asyncio
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Terminate the engine
llm.shutdown()
This guide demonstrates how to use SGLang’s Tool Calling functionality.
from openai import OpenAI
import json
from sglang.utils import wait_for_server, print_highlight, terminate_process
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
server_process, port = launch_server_cmd(
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --host 0.0.0.0" # llama3
)
wait_for_server(f"http://localhost:{port}")
Note that --tool-call-parser defines the parser used to interpret responses. Currently supported parsers include:
- llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).
- mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).
- qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).
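For example, a Qwen 2.5 model can be served with its matching parser in the same way (an untested sketch using the parser name and model from the list above):
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")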
Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes a tool name, a description, and parameters defined with typed properties.
# Define tools
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city to find the weather for, e.g. 'San Francisco'",
},
"state": {
"type": "string",
"description": "the two-letter abbreviation for the state that the city is"
" in, e.g. 'CA' which would mean 'California'",
},
"unit": {
"type": "string",
"description": "The unit to fetch the temperature in",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["city", "state", "unit"],
},
},
}
]
def get_messages():
return [
{
"role": "user",
"content": "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}",
}
]
messages = get_messages()
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id
# Non-streaming mode test
response_non_stream = client.chat.completions.create(
model=model_name,
messages=messages,
temperature=0.8,
top_p=0.8,
stream=False, # Non-streaming
tools=tools,
)
print_highlight("Non-stream response:")
print(response_non_stream)
# Streaming mode test
print_highlight("Streaming response:")
response_stream = client.chat.completions.create(
model=model_name,
messages=messages,
temperature=0.8,
top_p=0.8,
stream=True, # Enable streaming
tools=tools,
)
chunks = []
for chunk in response_stream:
chunks.append(chunk)
if chunk.choices[0].delta.tool_calls:
print(chunk.choices[0].delta.tool_calls[0])
When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly.
Non-Streaming Request
name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name
arguments_non_stream = (
response_non_stream.choices[0].message.tool_calls[0].function.arguments
)
print_highlight(f"Final streamed function call name: {name_non_stream}")
print_highlight(f"Final streamed function call arguments: {arguments_non_stream}")
Streaming Request
# Parse and combine function call arguments
arguments = []
for chunk in chunks:
choice = chunk.choices[0]
delta = choice.delta
if delta.tool_calls:
tool_call = delta.tool_calls[0]
if tool_call.function.name:
print_highlight(f"Streamed function call name: {tool_call.function.name}")
if tool_call.function.arguments:
arguments.append(tool_call.function.arguments)
print(f"Streamed function call arguments: {tool_call.function.arguments}")
# Combine all fragments into a single JSON string
full_arguments = "".join(arguments)
print_highlight(f"Final streamed function call arguments: {full_arguments}")
# This is a demonstration; define a real function according to your use case.
def get_current_weather(city: str, state: str, unit: str):
    return (
        f"The weather in {city}, {state} is 85 degrees {unit}. It is "
        "partly cloudy, with highs in the 90's."
    )
available_tools = {"get_current_weather": get_current_weather}
call_data = json.loads(full_arguments)
messages.append(
{
"role": "user",
"content": "",
"tool_calls": {"name": "get_current_weather", "arguments": full_arguments},
}
)
# Call the corresponding tool function
tool_name = messages[-1]["tool_calls"]["name"]
tool_to_call = available_tools[tool_name]
result = tool_to_call(**call_data)
print_highlight(f"Function call result: {result}")
messages.append({"role": "tool", "content": result, "name": tool_name})
print_highlight(f"Updated message history: {messages}")
final_response = client.chat.completions.create(
model=model_name,
messages=messages,
temperature=0.8,
top_p=0.8,
stream=False,
tools=tools,
)
print_highlight("Non-stream response:")
print(final_response)
from transformers import AutoTokenizer
import requests
# generate an answer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = get_messages()
input = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
tools=tools,
)
gen_url = f"http://localhost:{port}/generate"
gen_data = {"text": input, "sampling_params": {"skip_special_tokens": False}}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print(gen_response)
# parse the response
parse_url = f"http://localhost:{port}/function_call"
function_call_input = {
"text": gen_response,
"tool_call_parser": "llama3",
"tools": tools,
}
function_call_response = requests.post(parse_url, json=function_call_input)
function_call_response_json = function_call_response.json()
print("function name: ", function_call_response_json["calls"][0]["name"])
print("function arguments: ", function_call_response_json["calls"][0]["parameters"])
terminate_process(server_process, port)
import sglang as sgl
from sglang.srt.function_call_parser import FunctionCallParser
from sglang.srt.managers.io_struct import Tool, Function
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer = llm.tokenizer_manager.tokenizer
input_ids = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True, tools=tools
)
sampling_params = {
"max_new_tokens": 128,
"temperature": 0.3,
"top_p": 0.95,
"skip_special_tokens": False,
}
# 1) Offline generation
result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
generated_text = result["text"] # Assume there is only one prompt
print("=== Offline Engine Output Text ===")
print(generated_text)
# 2) Parse using FunctionCallParser
def convert_dict_to_tool(tool_dict: dict) -> Tool:
function_dict = tool_dict.get("function", {})
return Tool(
type=tool_dict.get("type", "function"),
function=Function(
name=function_dict.get("name"),
description=function_dict.get("description"),
parameters=function_dict.get("parameters"),
),
)
tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]
parser = FunctionCallParser(tools=tools, tool_call_parser="llama3")
normal_text, calls = parser.parse_non_stream(generated_text)
print("\n=== Parsing Result ===")
print("Normal text portion:", normal_text)
print("Function call portion:")
for call in calls:
# call: ToolCallItem
print(f" - tool name: {call.name}")
print(f" parameters: {call.parameters}")
# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc.
llm.shutdown()
- Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model’s tool tags. Currently supported tags include:
TOOLS_TAG_LIST = [
    "<|plugin|>",
    "<function=",
    "<tool_call>",
    "<|python_tag|>",
    "[TOOL_CALLS]",
]
- Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model's specific function call format (a rough sketch of such a class is shown after this list). For example:
class NewModelDetector(BaseFormatDetector):
- Add the new detector to the MultiFormatParser class that manages all the format detectors.
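For orientation only, here is a hypothetical sketch of such a detector. The marker tokens, the detect_and_parse method name, and the returned structure are assumptions for illustration and must be adapted to the actual BaseFormatDetector interface in sglang/srt/function_call_parser.py:
import json

from sglang.srt.function_call_parser import BaseFormatDetector


class NewModelDetector(BaseFormatDetector):
    """Hypothetical detector for tool calls wrapped in <new_tool> ... </new_tool> tags."""

    bot_token = "<new_tool>"   # assumed start marker emitted by the model
    eot_token = "</new_tool>"  # assumed end marker

    def detect_and_parse(self, text: str, tools):
        # Split the plain text from the JSON payload of the tool call (sketch logic).
        if self.bot_token not in text:
            return text, []
        normal_text, _, rest = text.partition(self.bot_token)
        payload = rest.split(self.eot_token)[0]
        call = json.loads(payload)  # e.g. {"name": "...", "parameters": {...}}
        return normal_text, [call]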
Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:
- /generate (text generation model)
- /get_model_info
- /get_server_info
- /health
- /health_generate
- /flush_cache
- /update_weights
- /encode (embedding model)
- /classify (reward model)
We mainly use requests to test these APIs in the following examples. You can also use curl.
import requests
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
server_process, port = launch_server_cmd(
"python -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
Generate completions. This is similar to the /v1/completions endpoint in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.
url = f"http://localhost:{port}/generate"
data = {"text": "What is the capital of France?"}
response = requests.post(url, json=data)
print_highlight(response.json())
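The same request can also be sent with curl; in this sketch, replace 30000 with the port printed by launch_server_cmd:
# equivalent curl request to the /generate endpoint
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the capital of France?"}'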
Get the information of the model.
- model_path: The path/name of the model.
- is_generation: Whether the model is used as a generation model or an embedding model.
- tokenizer_path: The path/name of the tokenizer.
url = f"http://localhost:{port}/get_model_info"
response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json.keys() == {"model_path", "is_generation", "tokenizer_path"}
Get the server information, including CLI arguments, token limits, and memory pool sizes.
- Note: get_server_info merges the following deprecated endpoints:
  - get_server_args
  - get_memory_pool_size
  - get_max_total_num_tokens
# get_server_info
url = f"http://localhost:{port}/get_server_info"
response = requests.get(url)
print_highlight(response.text)
- /health: Check the health of the server.
- /health_generate: Check the health of the server by generating one token.
url = f"http://localhost:{port}/health_generate"
response = requests.get(url)
print_highlight(response.text)
url = f"http://localhost:{port}/health"
response = requests.get(url)
print_highlight(response.text)
Flush the radix cache. It will be automatically triggered when the model weights are updated by the /update_weights API.
# flush cache
url = f"http://localhost:{port}/flush_cache"
response = requests.post(url)
print_highlight(response.text)
Update model weights from disk without restarting the server. This is only applicable to models with the same architecture and parameter size.
SGLang supports the update_weights_from_disk API for continuous evaluation during training (save a checkpoint to disk, then update the weights from disk).
# successful update with same architecture and size
url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B"}
response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
assert response.json().keys() == {"success", "message"}
# failed update with different parameter size or wrong name
url = f"http://localhost:{port}/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B-wrong"}
response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
"Failed to get weights iterator: "
"meta-llama/Llama-3.2-1B-wrong"
" (repository not found)."
)
Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.
terminate_process(server_process, port)
embedding_process, port = launch_server_cmd(
"""
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
--host 0.0.0.0 --is-embedding
"""
)
wait_for_server(f"http://localhost:{port}")
# successful encode for embedding model
url = f"http://localhost:{port}/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": "Once upon a time"}
response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")
terminate_process(embedding_process, port)
SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations.
# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.
reward_process, port = launch_server_cmd(
"""
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding
"""
)
wait_for_server(f"http://localhost:{port}")
from transformers import AutoTokenizer
PROMPT = (
"What is the range of the numeric output of a sigmoid node in a neural network?"
)
RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."
CONVS = [
[{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
[{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]
tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)
url = f"http://localhost:{port}/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}
responses = requests.post(url, json=data).json()
for response in responses:
print_highlight(f"reward: {response['embedding'][0]}")
terminate_process(reward_process, port)
SGLang Runtime also supports skipping the tokenizer and detokenizer. This is useful in cases like integrating with an RLHF workflow.
tokenizer_free_server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --skip-tokenizer-init
"""
)
wait_for_server(f"http://localhost:{port}")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
input_text = "What is the capital of France?"
input_tokens = tokenizer.encode(input_text)
print_highlight(f"Input Text: {input_text}")
print_highlight(f"Tokenized Input: {input_tokens}")
response = requests.post(
f"http://localhost:{port}/generate",
json={
"input_ids": input_tokens,
"sampling_params": {
"temperature": 0,
"max_new_tokens": 256,
"stop_token_ids": [tokenizer.eos_token_id],
},
"stream": False,
},
)
output = response.json()
output_tokens = output["token_ids"]
output_text = tokenizer.decode(output_tokens, skip_special_tokens=False)
print_highlight(f"Tokenized Output: {output_tokens}")
print_highlight(f"Decoded Output: {output_text}")
print_highlight(f"Output Text: {output['meta_info']['finish_reason']}")
terminate_process(tokenizer_free_server_process, port)
SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where an additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:
- Offline Batch Inference
- Custom Server on Top of the Engine
This document focuses on the offline batch inference, demonstrating four different inference modes:
- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed working example in a Python script can be found in custom_server.
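As a rough sketch of that second use case, you could wrap the offline engine in a small FastAPI app. Apart from sgl.Engine and async_generate (both shown elsewhere in this document), everything below is an assumption for illustration: the endpoint name, request schema, port, and the expectation that async_generate on a single prompt returns a dict with a "text" field.
import sglang as sgl
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64  # illustrative default


@app.post("/custom_generate")  # hypothetical endpoint name
async def custom_generate(req: GenerateRequest):
    # async_generate keeps the event loop free while the engine runs the request
    output = await llm.async_generate(req.prompt, {"max_new_tokens": req.max_new_tokens})
    return {"text": output["text"]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)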
SGLang offline engine supports batch inference with efficient scheduling.
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci
if is_in_ci():
import patch
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
prompts = [
"Write a short, neutral self-introduction for a fictional character. Hello, my name is",
"Provide a concise factual statement about France’s capital city. The capital of France is",
"Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {
"temperature": 0.2,
"top_p": 0.9,
}
print("\n=== Testing synchronous streaming generation with overlap removal ===\n")
for prompt in prompts:
print(f"Prompt: {prompt}")
merged_output = stream_and_merge(llm, prompt, sampling_params)
print("Generated text:", merged_output)
print()
prompts = [
"Write a short, neutral self-introduction for a fictional character. Hello, my name is",
"Provide a concise factual statement about France’s capital city. The capital of France is",
"Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous batch generation ===")
async def main():
outputs = await llm.async_generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print(f"\nPrompt: {prompt}")
print(f"Generated text: {output['text']}")
asyncio.run(main())
prompts = [
"Write a short, neutral self-introduction for a fictional character. Hello, my name is",
"Provide a concise factual statement about France’s capital city. The capital of France is",
"Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous streaming generation (no repeats) ===")
async def main():
for prompt in prompts:
print(f"\nPrompt: {prompt}")
print("Generated text: ", end="", flush=True)
# Replace direct calls to async_generate with our custom overlap-aware version
async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
print(cleaned_chunk, end="", flush=True)
print() # New line after each prompt
asyncio.run(main())
llm.shutdown()
Return Hidden States
llm = sgl.Engine(
model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}
outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(
f"Prompt: {prompt}\nGenerated text: {output['text']}\nPrompt_Tokens: {output['meta_info']['prompt_tokens']}\tCompletion_tokens: {output['meta_info']['completion_tokens']}\nHidden states: {[i.shape for i in output['meta_info']['hidden_states']]}"
)
print()
llm.shutdown()
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.
This tutorial covers the following popular APIs:
- chat/completions
- completions
- batches
Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.
Launch the server in your terminal and wait for it to initialize.
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
server_process, port = launch_server_cmd(
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
The server fully implements the OpenAI API. It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available. You can also specify a custom chat template with --chat-template when launching the server.
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to OpenAI Chat Completions API for more details.
Here is an example of a detailed chat completion request:
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{
"role": "system",
"content": "You are a knowledgeable historian who provides concise responses.",
},
{"role": "user", "content": "Tell me about ancient Rome"},
{
"role": "assistant",
"content": "Ancient Rome was a civilization centered in Italy.",
},
{"role": "user", "content": "What were their major achievements?"},
],
temperature=0.3, # Lower temperature for more focused responses
max_tokens=128, # Reasonable length for a concise response
top_p=0.95, # Slightly higher for better fluency
presence_penalty=0.2, # Mild penalty to avoid repetition
frequency_penalty=0.2, # Mild penalty for more natural language
n=1, # Single response is usually more stable
seed=42, # Keep for reproducibility
)
print_highlight(response.choices[0].message.content)
Streaming mode is also supported.
stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Say this is a test"}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
The Completions API is similar to the Chat Completions API, but without the messages parameter or chat templates.
response = client.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
prompt="List 3 countries and their capitals.",
temperature=0,
max_tokens=64,
n=1,
stop=None,
)
print_highlight(f"Response: {response}")
The completions API accepts OpenAI Completions API's parameters. Refer to OpenAI Completions API for more details.
Here is an example of a detailed completions request:
response = client.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
prompt="Write a short story about a space explorer.",
temperature=0.7, # Moderate temperature for creative writing
max_tokens=150, # Longer response for a story
top_p=0.9, # Balanced diversity in word choice
stop=["\n\n", "THE END"], # Multiple stop sequences
presence_penalty=0.3, # Encourage novel elements
frequency_penalty=0.3, # Reduce repetitive phrases
n=1, # Generate one completion
seed=123, # For reproducible results
)
print_highlight(f"Response: {response}")
For the OpenAI-compatible structured outputs API, refer to Structured Outputs for more details.
The Batches API for chat completions and completions is also supported. You can upload your requests in jsonl files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).
The batches APIs are:
- batches
- batches/{batch_id}/cancel
- batches/{batch_id}
Here is an example of a batch job for chat completions; completions work similarly.
import json
import time
from openai import OpenAI
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
requests = [
{
"custom_id": "request-1",
"method": "POST",
"url": "/chat/completions",
"body": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Tell me a joke about programming"}
],
"max_tokens": 50,
},
},
{
"custom_id": "request-2",
"method": "POST",
"url": "/chat/completions",
"body": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "What is Python?"}],
"max_tokens": 50,
},
},
]
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
for req in requests:
f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
file_response = client.files.create(file=f, purpose="batch")
batch_response = client.batches.create(
input_file_id=file_response.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
print_highlight(f"Batch job created with ID: {batch_response.id}")
while batch_response.status not in ["completed", "failed", "cancelled"]:
time.sleep(3)
print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
batch_response = client.batches.retrieve(batch_response.id)
if batch_response.status == "completed":
print("Batch job completed successfully!")
print(f"Request counts: {batch_response.request_counts}")
result_file_id = batch_response.output_file_id
file_response = client.files.content(result_file_id)
result_content = file_response.read().decode("utf-8")
results = [
json.loads(line) for line in result_content.split("\n") if line.strip() != ""
]
for result in results:
print_highlight(f"Request {result['custom_id']}:")
print_highlight(f"Response: {result['response']}")
print_highlight("Cleaning up files...")
# Only delete the result file ID since file_response is just content
client.files.delete(result_file_id)
else:
print_highlight(f"Batch job failed with status: {batch_response.status}")
if hasattr(batch_response, "errors"):
print_highlight(f"Errors: {batch_response.errors}")
It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.
- batches/{batch_id}: Retrieve the batch job status.
- batches/{batch_id}/cancel: Cancel the batch job.
Here is an example to check the batch job status.
import json
import time
from openai import OpenAI
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
requests = []
for i in range(20):
requests.append(
{
"custom_id": f"request-{i}",
"method": "POST",
"url": "/chat/completions",
"body": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "system",
"content": f"{i}: You are a helpful AI assistant",
},
{
"role": "user",
"content": "Write a detailed story about topic. Make it very long.",
},
],
"max_tokens": 64,
},
}
)
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
for req in requests:
f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
uploaded_file = client.files.create(file=f, purpose="batch")
batch_job = client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")
time.sleep(10)
max_checks = 5
for i in range(max_checks):
batch_details = client.batches.retrieve(batch_id=batch_job.id)
print_highlight(
f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
)
print_highlight(
f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
)
time.sleep(3)
Here is an example to cancel a batch job.
import json
import time
from openai import OpenAI
import os
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
requests = []
for i in range(5000):
requests.append(
{
"custom_id": f"request-{i}",
"method": "POST",
"url": "/chat/completions",
"body": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "system",
"content": f"{i}: You are a helpful AI assistant",
},
{
"role": "user",
"content": "Write a detailed story about topic. Make it very long.",
},
],
"max_tokens": 128,
},
}
)
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
for req in requests:
f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
uploaded_file = client.files.create(file=f, purpose="batch")
batch_job = client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")
time.sleep(10)
try:
cancelled_job = client.batches.cancel(batch_id=batch_job.id)
print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
assert cancelled_job.status == "cancelling"
# Monitor the cancellation process
while cancelled_job.status not in ["failed", "cancelled"]:
time.sleep(3)
cancelled_job = client.batches.retrieve(batch_job.id)
print_highlight(f"Current status: {cancelled_job.status}")
# Verify final status
assert cancelled_job.status == "cancelled"
print_highlight("Batch job successfully cancelled")
except Exception as e:
print_highlight(f"Error during cancellation: {e}")
raise e
finally:
try:
del_response = client.files.delete(uploaded_file.id)
if del_response.deleted:
print_highlight("Successfully cleaned up input file")
if os.path.exists(input_file_path):
os.remove(input_file_path)
print_highlight("Successfully deleted local batch_requests.jsonl file")
except Exception as e:
print_highlight(f"Error cleaning up: {e}")
raise e
terminate_process(server_process, port)
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.
This tutorial covers the embedding APIs for embedding models, such as Alibaba-NLP/gte-Qwen2-7B-instruct, which is used in the examples below.
Launch the server in your terminal and wait for it to initialize. Remember to add --is-embedding to the command.
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
embedding_process, port = launch_server_cmd(
"""
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
--host 0.0.0.0 --is-embedding
"""
)
wait_for_server(f"http://localhost:{port}")
import subprocess, json
text = "Once upon a time"
curl_text = f"""curl -s http://localhost:{port}/v1/embeddings \
-d '{{"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": "{text}"}}'"""
text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))["data"][0][
"embedding"
]
print_highlight(f"Text embedding (first 10): {text_embedding[:10]}")
import requests
text = "Once upon a time"
response = requests.post(
f"http://localhost:{port}/v1/embeddings",
json={"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": text},
)
text_embedding = response.json()["data"][0]["embedding"]
print_highlight(f"Text embedding (first 10): {text_embedding[:10]}")
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
# Text embedding example
response = client.embeddings.create(
model="Alibaba-NLP/gte-Qwen2-7B-instruct",
input=text,
)
embedding = response.data[0].embedding[:10]
print_highlight(f"Text embedding (first 10): {embedding}")
SGLang also supports input_ids as input to get the embedding.
import json
import os
from transformers import AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "false"
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")
input_ids = tokenizer.encode(text)
curl_ids = f"""curl -s http://localhost:{port}/v1/embeddings \
-d '{{"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "input": {json.dumps(input_ids)}}}'"""
input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))["data"][
0
]["embedding"]
print_highlight(f"Input IDs embedding (first 10): {input_ids_embedding[:10]}")
terminate_process(embedding_process, port)
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.
SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and Qwen2-VL:
- meta-llama/Llama-3.2-11B-Vision-Instruct
- lmms-lab/llava-onevision-qwen2-72b-ov-chat
- Qwen/Qwen2-VL-7B-Instruct
Launch the server in your terminal and wait for it to initialize.
Remember to add --chat-template llama_3_vision to specify the vision chat template; otherwise, the server only supports text and performance degradation may occur.
We need to specify --chat-template for vision language models because the chat template provided in the Hugging Face tokenizer only supports text.
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
vision_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--chat-template=llama_3_vision
"""
)
wait_for_server(f"http://localhost:{port}")
Once the server is up, you can send test requests using curl or requests.
import subprocess
curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
-d '{{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{{
"role": "user",
"content": [
{{
"type": "text",
"text": "What’s in this image?"
}},
{{
"type": "image_url",
"image_url": {{
"url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
}}
}}
]
}}
],
"max_tokens": 300
}}'
"""
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
import requests
url = f"http://localhost:{port}/v1/chat/completions"
data = {
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What’s in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
},
},
],
}
],
"max_tokens": 300,
}
response = requests.post(url, json=data)
print_highlight(response.text)
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-11B-Vision-Instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?",
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
},
},
],
}
],
max_tokens=300,
)
print_highlight(response.choices[0].message.content)
The server also supports multiple images and interleaved text and images if the model supports it.
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-11B-Vision-Instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
},
},
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
},
},
{
"type": "text",
"text": "I have two very different images. They are not related at all. "
"Please describe the first image in one sentence, and then describe the second image in another sentence.",
},
],
}
],
temperature=0,
)
print_highlight(response.choices[0].message.content)
terminate_process(vision_process, port)
As mentioned before, if you do not specify a vision model's --chat-template, the server uses Hugging Face's default template, which only supports text.
We list popular vision models with their chat templates:
- meta-llama/Llama-3.2-Vision uses llama_3_vision.
- Qwen/Qwen2-VL-7B-Instruct uses qwen2-vl.
- LLaVA-OneVision uses chatml-llava.
- LLaVA-NeXT uses chatml-llava.
- Llama3-LLaVA-NeXT uses llava_llama_3.
- LLaVA-v1.5 / 1.6 uses vicuna_v1.1.
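For example, to serve Qwen2-VL with its matching template from the list above, a launch command would look roughly like this (untested sketch):
python3 -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct \
  --chat-template qwen2-vl --host 0.0.0.0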
This notebook provides a quick-start guide to using SGLang for chat completions after installation.
- For Vision Language Models, see OpenAI APIs - Vision.
- For Embedding Models, see OpenAI APIs - Embedding and Encode (embedding model).
- For Reward Models, see Classify (reward model).
This code block is equivalent to executing
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0
in your terminal and waiting for the server to be ready. Once the server is running, you can send test requests using curl or requests. The server implements the OpenAI-compatible APIs.
from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server, print_highlight, terminate_process
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
server_process, port = launch_server_cmd(
"""
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0
"""
)
wait_for_server(f"http://localhost:{port}")
import subprocess, json
curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}'
"""
response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)
import requests
url = f"http://localhost:{port}/v1/chat/completions"
data = {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print_highlight(response.json())
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(response)
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
# Use stream=True for streaming responses
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
stream=True,
)
# Handle the streaming output
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
You can also use the native /generate endpoint with requests, which provides more flexibility. An API reference is available at Sampling Parameters.
import requests
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
},
)
print_highlight(response.json())
import requests, json
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
"stream": True,
},
stream=True,
)
prev = 0
for chunk in response.iter_lines(decode_unicode=False):
chunk = chunk.decode("utf-8")
if chunk and chunk.startswith("data:"):
if chunk == "data: [DONE]":
break
data = json.loads(chunk[5:].strip("\n"))
output = data["text"]
print(output[prev:], end="", flush=True)
prev = len(output)
terminate_process(server_process, port)
- To enable multi-GPU tensor parallelism, add --tp 2. If it reports the error "peer access is not supported between these two devices", add --enable-p2p-check to the server launch command.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
- To enable multi-GPU data parallelism, add --dp 2. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend SGLang Router for data parallelism.
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
- See the hyperparameter tuning guide for tips on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
- To enable torch.compile acceleration, add --enable-torch-compile. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add --torchao-config int4wo-128. It supports other quantization strategies (INT8/FP8) as well.
- To enable fp8 weight quantization, add --quantization fp8 on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add --kv-cache-dtype fp8_e5m2.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template.
- To run tensor parallelism on multiple nodes, add --nnodes 2. If you have two nodes with two GPUs on each node and want to run TP=4, let sgl-dev-0 be the hostname of the first node and 50000 be an available port; you can then use the following commands. If you encounter a deadlock, please try to add --disable-cuda-graph.
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
Please consult the documentation below to learn more about the parameters you may provide when launching a server.
- model_path: Path to the model that will be served.
- tokenizer_path: Defaults to the model_path.
- tokenizer_mode: By default auto; see here for the different modes.
- load_format: The format the weights are loaded in. Defaults to *.safetensors / *.bin.
- trust_remote_code: If True, will use locally cached config files; otherwise use remote configs from Hugging Face.
- dtype: Dtype used for the model, defaults to bfloat16.
- kv_cache_dtype: Dtype of the kv cache, defaults to the dtype.
- context_length: The number of tokens our model can process, including the input. Note that extending the default might lead to strange behavior.
- device: The device the model is placed on, defaults to cuda.
- chat_template: The chat template to use. Deviating from the default might lead to unexpected responses. For multi-modal chat templates, refer to here.
- is_embedding: Set to true to perform embedding / encode and reward tasks.
- revision: Adjust if a specific version of the model should be used.
- skip_tokenizer_init: Set to true to provide the tokens to the engine and get the output tokens directly, typically used in RLHF.
- json_model_override_args: Override the model config with the provided JSON.
- delete_ckpt_after_loading: Delete the model checkpoint after loading the model.
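As an illustration, several of these options map to CLI flags and can be combined when launching the server; the values below are placeholders, not recommendations:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 --context-length 8192 --host 0.0.0.0 --port 30000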
Important: Make sure the correct chat_template is passed, or performance degradation may occur.
- port and host: Set up the host and port for the HTTP server. By default host: str = "127.0.0.1" and port: int = 30000.
- api_key: Sets an API key for the server and the OpenAI-compatible API.
- file_storage_pth: Directory for storing uploaded or generated files from API calls.
- enable_cache_report: If set, includes detailed usage of cached tokens in the response usage.
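For example, assuming the CLI flags mirror the argument names above, a server with an API key and cache reporting could be launched like this (the key is a placeholder):
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000 --api-key sk-placeholder-key --enable-cache-report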
- tp_size: The number of GPUs the model weights get sharded over. Mainly for saving memory rather than for high throughput, see this blogpost.
- dp_size: Will be deprecated. The number of data-parallel copies of the model. SGLang Router is recommended instead of the current naive data parallelism.
- load_balance_method: Will be deprecated. Load balancing strategy for data parallel requests.
- ep_size: Distribute the experts onto multiple GPUs for MoE models. Remember to shard the model weights with tp_size=ep_size; for detailed benchmarking refer to this PR.
- mem_fraction_static: Fraction of the free GPU memory used for static memory like model weights and KV cache. If building the KV cache fails, it should be increased. If CUDA runs out of memory, it should be decreased.
- max_running_requests: The maximum number of requests to run concurrently.
- max_total_tokens: The maximum number of tokens that can be stored in the KV cache. Used mainly for debugging.
- chunked_prefill_size: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases VRAM consumption. If CUDA runs out of memory, it should be decreased.
- max_prefill_tokens: Token budget of how many tokens to accept in one prefill batch. The actual number is the max of this parameter and the context_length.
- schedule_policy: The scheduling policy to control the processing order of waiting prefill requests in a single engine.
- schedule_conservativeness: Can be used to decrease/increase the conservativeness of the server when taking new requests. Highly conservative behavior leads to starvation, but low conservativeness leads to slowed-down performance.
- cpu_offload_gb: Reserve this amount of RAM in GB for offloading model parameters to the CPU.
- prefill_only_one_req: When this flag is turned on, the engine prefills only one request at a time.
- stream_interval: Interval (in tokens) for streaming responses. Smaller values lead to smoother streaming, and larger values lead to better throughput.
- random_seed: Can be used to enforce more deterministic behavior.
- watchdog_timeout: Adjusts the watchdog thread's timeout before killing the server if batch generation takes too long.
- download_dir: Use to override the default Hugging Face cache directory for model weights.
- base_gpu_id: Use to adjust the first GPU used to distribute the model across available GPUs.
- allow_auto_truncate: Automatically truncate requests that exceed the maximum input length.
- log_level: Global log verbosity.
- log_level_http: Separate verbosity level for the HTTP server logs (if unset, defaults to log_level).
- log_requests: Logs the inputs and outputs of all requests for debugging.
- show_time_cost: Prints or logs detailed timing info for internal operations (helpful for performance tuning).
- enable_metrics: Exports Prometheus-like metrics for request usage and performance.
- decode_log_interval: How often (in tokens) to log decode progress.
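As an illustration, assuming the CLI flags mirror the argument names above, logging and metrics could be enabled like this:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --log-level info --log-requests --show-time-cost --enable-metrics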
- dist_init_addr: The TCP address used for initializing PyTorch's distributed backend (e.g. 192.168.0.2:25000).
- nnodes: Total number of nodes in the cluster. Refer to how to run the Llama 405B model.
- node_rank: Rank (ID) of this node among the nnodes in the distributed setup.
- lora_paths: You may provide a list of adapters to your model. Each batch element will get the model response with the corresponding LoRA adapter applied. Currently cuda_graph and radix_attention are not supported with this option, so you need to disable them manually (see the example command after this list). We are still working through these issues.
- max_loras_per_batch: Maximum number of LoRAs in a running batch, including the base model.
- lora_backend: The backend for running GEMM kernels for LoRA modules; can be one of triton or flashinfer. Defaults to triton.
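The example command referenced above is sketched here; the adapter path is a placeholder, and the flags assume the CLI names mirror the options described elsewhere on this page:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --lora-paths /path/to/your-lora-adapter \
  --disable-radix-cache --disable-cuda-graph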
- attention_backend: The backend for attention computation and KV cache management.
- sampling_backend: The backend for sampling.
- grammar_backend: The grammar backend for constrained decoding. Detailed usage can be found in this document.
- constrained_json_whitespace_pattern: Use with the Outlines grammar backend to allow JSON with syntactic newlines, tabs or multiple spaces. Details can be found here.
- speculative_draft_model_path: The draft model path for speculative decoding.
- speculative_algorithm: The algorithm for speculative decoding. Currently only EAGLE is supported. Note that the radix cache, chunked prefill, and overlap scheduler are disabled when using EAGLE speculative decoding.
- speculative_num_steps: How many draft passes we run before verifying.
- speculative_num_draft_tokens: The number of tokens proposed in a draft.
- speculative_eagle_topk: The number of top candidates we keep for verification at each step for EAGLE.
- enable_double_sparsity: Enables double sparsity, which increases throughput.
- ds_channel_config_path: The double sparsity config. For a guide on how to generate the config for your model, see this repo.
- ds_heavy_channel_num: Number of channel indices to keep for each layer.
- ds_heavy_token_num: Number of tokens used for attention during decode. Skip sparse decoding if min_seq_len in the batch is smaller than this number.
- ds_heavy_channel_type: The type of heavy channels. Either q, k or qk.
- ds_sparse_decode_threshold: Don't apply sparse decoding if max_seq_len in the batch is smaller than this threshold.
Note: We recommend staying with the defaults and only using these options for debugging, for the best possible performance.
- disable_radix_cache: Disable the radix backend for prefix caching.
- disable_jump_forward: Disable jump-forward for the Outlines grammar backend.
- disable_cuda_graph: Disable CUDA graph for the model forward pass. Use if encountering uncorrectable CUDA ECC errors.
- disable_cuda_graph_padding: Disable CUDA graph when padding is needed; otherwise still use CUDA graph.
- disable_outlines_disk_cache: Disable the disk cache for the Outlines grammar backend.
- disable_custom_all_reduce: Disable usage of the custom all-reduce kernel.
- disable_mla: Disable Multi-Head Latent Attention for DeepSeek models.
- disable_overlap_schedule: Disable the Overlap Scheduler.
- enable_nan_detection: Turning this on makes the sampler print a warning if the logits contain NaN.
- enable_p2p_check: Turns off the default of always allowing the p2p check when accessing the GPU.
- triton_attention_reduce_in_fp32: In Triton kernels, this will cast the intermediate attention result to float32.
Note: Some of these options are still in an experimental stage.
- enable_mixed_chunk: Enables mixing prefill and decode; see this discussion.
- enable_dp_attention: Enable Data Parallelism Attention for DeepSeek models. Note that you need to choose dp_size = tp_size for this.
- enable_ep_moe: Enables expert parallelism; see the description of ep_size.
- enable_torch_compile: Torch compile the model. This is an experimental feature.
- torch_compile_max_bs: The maximum batch size when using torch_compile.
- cuda_graph_max_bs: Adjust the maximum batch size when using CUDA graph. By default this is chosen for you based on GPU specifics.
- cuda_graph_bs: The batch sizes to capture by CudaGraphRunner. By default this is done for you.
- torchao_config: Experimental feature that optimizes the model with torchao. Possible choices are: int8dq, int8wo, int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-per_row.
- triton_attention_num_kv_splits: Use to adjust the number of KV splits in Triton kernels. Default is 8.
SGLang now provides an EAGLE-based speculative decoding option. The implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.
Note: Currently, Speculative Decoding in SGLang does not support radix cache.
- Official EAGLE code (SafeAILab/EAGLE): ~200 tokens/s
- Standard SGLang Decoding: ~156 tokens/s
- EAGLE Decoding in SGLang: ~297 tokens/s
- EAGLE Decoding in SGLang (w/ torch.compile): ~316 tokens/s
All benchmarks above were run on a single H100.
To enable EAGLE-based speculative decoding, specify the draft model (--speculative-draft) and the relevant EAGLE parameters:
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE \
--speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64
"""
)
wait_for_server(f"http://localhost:{port}")
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
terminate_process(server_process)
You can also enable torch.compile for further optimizations and optionally set --cuda-graph-max-bs:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE \
--speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
--enable-torch-compile --cuda-graph-max-bs 2
"""
)
wait_for_server(f"http://localhost:{port}")
import openai
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
terminate_process(server_process)
You can specify a JSON schema, regular expression or EBNF to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified for a request.
SGLang supports two grammar backends:
- Outlines (default): Supports JSON schema and regular expression constraints.
- XGrammar: Supports JSON schema, regular expression, and EBNF constraints.
We suggest using XGrammar for its better performance and utility. XGrammar currently uses the GGML BNF format. For more details, see XGrammar technical overview.
To use Xgrammar, simply add --grammar-backend
xgrammar when launching the server. If no backend is specified, Outlines will be used as the default.
For better output quality, it's advisable to explicitly include instructions in the prompt to guide the model to generate the desired format. For example, you can specify: 'Please generate the output in the following JSON format: ...'.
import openai
import os
from sglang.test.test_utils import is_in_ci
if is_in_ci():
from patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
os.environ["TOKENIZERS_PARALLELISM"] = "false"
server_process, port = launch_server_cmd(
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --grammar-backend xgrammar"
)
wait_for_server(f"http://localhost:{port}")
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
You can directly define a JSON schema or use Pydantic to define and validate the response.
Using Pydantic
from pydantic import BaseModel, Field
# Define the schema using Pydantic
class CapitalInfo(BaseModel):
name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
population: int = Field(..., description="Population of the capital city")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "Please generate the information of the capital of France in the JSON format.",
},
],
temperature=0,
max_tokens=128,
response_format={
"type": "json_schema",
"json_schema": {
"name": "foo",
# convert the pydantic model to json schema
"schema": CapitalInfo.model_json_schema(),
},
},
)
response_content = response.choices[0].message.content
# validate the JSON response by the pydantic model
capital_info = CapitalInfo.model_validate_json(response_content)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")
JSON Schema Directly
import json
json_schema = json.dumps(
{
"type": "object",
"properties": {
"name": {"type": "string", "pattern": "^[\\w]+$"},
"population": {"type": "integer"},
},
"required": ["name", "population"],
}
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "Give me the information of the capital of France in the JSON format.",
},
],
temperature=0,
max_tokens=128,
response_format={
"type": "json_schema",
"json_schema": {"name": "foo", "schema": json.loads(json_schema)},
},
)
print_highlight(response.choices[0].message.content)
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful geography bot."},
{
"role": "user",
"content": "Give me the information of the capital of France.",
},
],
temperature=0,
max_tokens=32,
extra_body={"ebnf": ebnf_grammar},
)
print_highlight(response.choices[0].message.content)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "What is the capital of France?"},
],
temperature=0,
max_tokens=128,
extra_body={"regex": "(Paris|London)"},
)
print_highlight(response.choices[0].message.content)
Using Pydantic
import requests
import json
from pydantic import BaseModel, Field
# Define the schema using Pydantic
class CapitalInfo(BaseModel):
name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
population: int = Field(..., description="Population of the capital city")
# Make API request
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "Here is the information of the capital of France in the JSON format.\n",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"json_schema": json.dumps(CapitalInfo.model_json_schema()),
},
},
)
print_highlight(response.json())
response_data = json.loads(response.json()["text"])
# validate the response by the pydantic model
capital_info = CapitalInfo.model_validate(response_data)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")
JSON Schema Directly
json_schema = json.dumps(
{
"type": "object",
"properties": {
"name": {"type": "string", "pattern": "^[\\w]+$"},
"population": {"type": "integer"},
},
"required": ["name", "population"],
}
)
# JSON
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "Here is the information of the capital of France in the JSON format.\n",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"json_schema": json_schema,
},
},
)
print_highlight(response.json())
import requests
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "Give me the information of the capital of France.",
"sampling_params": {
"max_new_tokens": 128,
"temperature": 0,
"n": 3,
"ebnf": (
"root ::= city | description\n"
'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
'description ::= city " is " status\n'
'status ::= "the capital of " country\n'
'country ::= "England" | "France" | "Germany" | "Italy"'
),
},
"stream": False,
"return_logprob": False,
},
)
print_highlight(response.json())
response = requests.post(
f"http://localhost:{port}/generate",
json={
"text": "Paris is the capital of",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"regex": "(France|England)",
},
},
)
print_highlight(response.json())
terminate_process(server_process, port)
import sglang as sgl
llm = sgl.Engine(
model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar"
)
Using Pydantic
import json
from pydantic import BaseModel, Field
prompts = [
"Give me the information of the capital of China in the JSON format.",
"Give me the information of the capital of France in the JSON format.",
"Give me the information of the capital of Ireland in the JSON format.",
]
# Define the schema using Pydantic
class CapitalInfo(BaseModel):
name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
population: int = Field(..., description="Population of the capital city")
sampling_params = {
"temperature": 0.1,
"top_p": 0.95,
"json_schema": json.dumps(CapitalInfo.model_json_schema()),
}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print_highlight("===============================")
print_highlight(f"Prompt: {prompt}") # validate the output by the pydantic model
capital_info = CapitalInfo.model_validate_json(output["text"])
print_highlight(f"Validated output: {capital_info.model_dump_json()}")
JSON Schema Directly
prompts = [
"Give me the information of the capital of China in the JSON format.",
"Give me the information of the capital of France in the JSON format.",
"Give me the information of the capital of Ireland in the JSON format.",
]
json_schema = json.dumps(
{
"type": "object",
"properties": {
"name": {"type": "string", "pattern": "^[\\w]+$"},
"population": {"type": "integer"},
},
"required": ["name", "population"],
}
)
sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print_highlight("===============================")
print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
prompts = [
"Give me the information of the capital of France.",
"Give me the information of the capital of Germany.",
"Give me the information of the capital of Italy.",
]
sampling_params = {
"temperature": 0.8,
"top_p": 0.95,
"ebnf": (
"root ::= city | description\n"
'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
'description ::= city " is " status\n'
'status ::= "the capital of " country\n'
'country ::= "England" | "France" | "Germany" | "Italy"'
),
}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print_highlight("===============================")
print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
prompts = [
"Please provide information about London as a major global city:",
"Please provide information about Paris as a major global city:",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print_highlight("===============================")
print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
llm.shutdown()
Download the VS Code CLI from https://code.visualstudio.com/docs/?dv=linux64cli
wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz
tar xf vscode_cli_alpine_x64_cli.tar.gz
# https://code.visualstudio.com/docs/remote/tunnels
./code tunnel
The following startup command is an example for internal development by the SGLang team. You can modify or add directory mappings as needed, especially for model weight downloads, to prevent repeated downloads by different Docker containers.
# Change the name to yours
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
# Change batch size, input, output and add `disable-cuda-graph` (for easier analysis)
# e.g. DeepSeek V3
nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph
# e.g. gsm8k 8 shot
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
Update the package version in python/pyproject.toml
and python/sglang/__init__.py
.
pip install build twine
cd python
bash upload_pypi.sh
Make a new release https://github.com/sgl-project/sglang/releases/new.
You can mount a folder for the shared huggingface model weights cache. The command below uses /tmp/huggingface
as an example.
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
# Nvidia
docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
# AMD
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.3-rocm630 /bin/bash
# AMD just the last 2 GPUs
docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.3-rocm630 /bin/bash
Run these commands inside the container.
apt update && apt install -y curl python3-pip git
export RUNNER_ALLOW_RUNASROOT=1
Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run config.sh
Notes
- You do not need to specify the runner group.
- Give it a name (e.g., test-sgl-gpu-0) and some labels (e.g., 1-gpu-runner). The labels can be edited later in GitHub settings.
- You do not need to change the work folder.
- Set up environment variables
export HF_HOME=/hf_home
export SGLANG_IS_IN_CI=true
export HF_TOKEN=hf_xxx
export OPENAI_API_KEY=sk-xxx
export CUDA_VISIBLE_DEVICES=0
- Run it forever
while true; do ./run.sh; echo "Restarting..."; sleep 2; done
This doc describes the choices methods supported by SGLang.
The optional choices_method
arg determines how options supplied to SGLang's choices
primitive are selected. Only the RuntimeEndpoint
backend supports the choices_method
arg. Other backends, such as OpenAI
, have bespoke selection implementations due to API limitations.
Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.
Usage example (alternatively, simply omit the choices_method
arg):
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.token_length_normalized,
)
)
This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are ["Paris", "Antidisestablishmentarianism"]
.
Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.
Usage example:
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.greedy_token_selection,
)
)
This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:
@sgl.function
def us_president_example(s):
s += sgl.user("Name a US president.")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["Donald Duck", "Millard Fillmore"],
choices_method=sgl.greedy_token_selection,
)
)
Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in this EleutherAI blogpost. This method incurs an additional LLM call to obtain the unconditional likelihoods.
Usage example:
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.unconditional_likelihood_normalized,
)
)
The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.
The example below shows how to use SGLang to answer a multi-turn question.
First, launch a server with
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
Then, connect to the server and answer a multi-turn question.
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint
@function
def multi_turn_question(s, question_1, question_2):
s += system("You are a helpful assistant.")
s += user(question_1)
s += assistant(gen("answer_1", max_tokens=256))
s += user(question_2)
s += assistant(gen("answer_2", max_tokens=256))
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = multi_turn_question.run(
question_1="What is the capital of the United States?",
question_2="List two local attractions.",
)
for m in state.messages():
print(m["role"], ":", m["content"])
print(state["answer_1"])
Set the OpenAI API Key
export OPENAI_API_KEY=sk-******
Then, answer a multi-turn question.
from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
@function
def multi_turn_question(s, question_1, question_2):
s += system("You are a helpful assistant.")
s += user(question_1)
s += assistant(gen("answer_1", max_tokens=256))
s += user(question_2)
s += assistant(gen("answer_2", max_tokens=256))
set_default_backend(OpenAI("gpt-3.5-turbo"))
state = multi_turn_question.run(
question_1="What is the capital of the United States?",
question_2="List two local attractions.",
)
for m in state.messages():
print(m["role"], ":", m["content"])
print(state["answer_1"])
Anthropic and VertexAI (Gemini) models are also supported. You can find more examples at examples/quick_start.
To begin with, import sglang.
import sglang as sgl
sglang provides some simple primitives such as gen, select, fork, image.
You can implement your prompt flow in a function decorated by sgl.function.
You can then invoke the function with run or run_batch.
The system will manage the state, chat template, parallelism and batching for you.
The complete code for the examples below can be found at readme_examples.py
You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
@sgl.function
def tool_use(s, question):
s += "To answer this question: " + question + ". "
s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "
if s["tool"] == "calculator":
s += "The math expression is" + sgl.gen("expression")
elif s["tool"] == "search engine":
s += "The key word to search is" + sgl.gen("word")
Use fork
to launch parallel prompts.
Because sgl.gen
is non-blocking, the for loop below issues two generation calls in parallel.
@sgl.function
def tip_suggestion(s):
s += (
"Here are two tips for staying healthy: "
"1. Balanced Diet. 2. Regular Exercise.\n\n"
)
forks = s.fork(2)
for i, f in enumerate(forks):
f += f"Now, expand tip {i+1} into a paragraph:\n"
f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
s += "In summary" + sgl.gen("summary")
Use sgl.image
to pass an image as input.
@sgl.function
def image_qa(s, image_file, question):
s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
See also local_example_llava_next.py.
Use regex
to specify a regular expression as a decoding constraint.
This is only supported for local models.
@sgl.function
def regular_expression_gen(s):
s += "Q: What is the IP address of the Google DNS servers?\n"
s += "A: " + sgl.gen(
"answer",
temperature=0,
regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
Use regex
to specify a JSON schema with a regular expression.
character_regex = (
r"""\{\n"""
+ r""" "name": "[\w\d\s]{1,16}",\n"""
+ r""" "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
+ r""" "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
+ r""" "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
+ r""" "wand": \{\n"""
+ r""" "wood": "[\w\d\s]{1,16}",\n"""
+ r""" "core": "[\w\d\s]{1,16}",\n"""
+ r""" "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
+ r""" \},\n"""
+ r""" "alive": "(Alive|Deceased)",\n"""
+ r""" "patronus": "[\w\d\s]{1,16}",\n"""
+ r""" "bogart": "[\w\d\s]{1,16}"\n"""
+ r"""\}"""
)
@sgl.function
def character_gen(s, name):
s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
See also json_decode.py for an additional example of specifying formats with Pydantic models.
Use run_batch
to run a batch of requests with continuous batching.
@sgl.function
def text_qa(s, question):
s += "Q: " + question + "\n"
s += "A:" + sgl.gen("answer", stop="\n")
states = text_qa.run_batch(
[
{"question": "What is the capital of the United Kingdom?"},
{"question": "What is the capital of France?"},
{"question": "What is the capital of Japan?"},
],
progress_bar=True
)
Add stream=True
to enable streaming.
@sgl.function
def text_qa(s, question):
s += "Q: " + question + "\n"
s += "A:" + sgl.gen("answer", stop="\n")
state = text_qa.run(
question="What is the capital of France?",
temperature=0.1,
stream=True
)
for out in state.text_iter():
print(out, end="", flush=True)
Use sgl.system, sgl.user and sgl.assistant to set roles when using Chat models. You can also define more complex role prompts using begin and end tokens.
@sgl.function
def chat_example(s):
s += sgl.system("You are a helpful assistant.")
# Same as: s += s.system("You are a helpful assistant.")
with s.user():
s += "Question: What is the capital of France?"
s += sgl.assistant_begin()
s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
s += sgl.assistant_end()
- The choices argument in sgl.gen is implemented by computing the token-length normalized log probabilities of all choices and selecting the one with the highest probability.
- The regex argument in sgl.gen is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with temperature=0 and temperature != 0.
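Conceptually, token-length normalized selection just picks the choice with the highest mean token logprob. Below is a minimal sketch in plain Python (a toy illustration, not SGLang internals):
# Toy illustration: pick the choice whose tokens have the highest average logprob.
def select_by_mean_logprob(choice_logprobs):
    # choice_logprobs maps each choice string to the list of logprobs of its tokens
    return max(choice_logprobs, key=lambda c: sum(choice_logprobs[c]) / len(choice_logprobs[c]))

print(select_by_mean_logprob({"Paris": [-0.2, -0.1], "London": [-2.3, -0.6, -0.4]}))  # Paris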
This guide shows how to evaluate model accuracy using SGLang's built-in benchmarks. Please include accuracy on crucial benchmarks in your PR if you make modifications on the model side, like the kernel and model architecture.
This is a reference workflow for the MMLU benchmark. For more details or other benchmarks, please refer to the README in each specific benchmark folder under sglang/benchmark.
# Step 1: Download the dataset
bash download_data.sh
# Step 2: Launch the server
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Math-1.5B-Instruct \ # Model selection
--port 30000 \ # Network configuration
--mem-fraction-static 0.8 # Memory optimization
# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10 # Test 10 subjects
# Step 4: Extract the accuracy
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'
Some benchmark implementations may differ from ours, causing accuracy discrepancies. To match Qwen2.5-Math's reported 76.8% GSM8K accuracy, some customization is required.
# The GSM8K benchmark script includes few shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
questions.append(get_one_example(lines, i, False))
labels.append(get_answer_value(lines[i]["answer"]))
@sgl.function
def few_shot_gsm8k(s, question):
# System prompt given in https://github.com/QwenLM/Qwen2.5-Math
s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.") # Include system prompt
s += few_shot_examples + question
# Stopwords given in evaluation/math_eval.py of the Qwen2.5-Math repo
s += sgl.gen(
"answer", max_tokens=2048, stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"]
)
These adjustments should return the desired accuracy.
- Contribute New Benchmarks
- Follow our contribution guidelines to add new test scripts
- Request Implementations
- Feel free to open an issue describing your evaluation needs
- Use Alternative Tools
This document describes how to set up an AMD-based environment for SGLang. If you encounter issues or have questions, please open an issue on the SGLang repository.
When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:
NOTE: We strongly recommend reading these docs in their entirety to fully utilize your system.
Below are a few key settings to confirm or enable:
In /etc/default/grub
, append the following to GRUB_CMDLINE_LINUX
:
pci=realloc=off iommu=pt
Afterward, run sudo update-grub
(or your distro’s equivalent) and reboot.
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
You can automate or verify this change using this helpful script.
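As a quick manual check (illustrative commands), you can confirm both settings took effect:
# Kernel command line should contain the GRUB additions
grep -Eo 'pci=realloc=off|iommu=pt' /proc/cmdline
# NUMA balancing should report 0
cat /proc/sys/kernel/numa_balancing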
Again, please go through the entire documentation to confirm your system is using the recommended configuration.
For general installation instructions, see the official SGLang Installation Docs. Below are the AMD-specific steps summarized for convenience.
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all_hip]"
- Build the docker image.
docker build -t sglang_image -f Dockerfile.rocm .
- Create a convenient alias.
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx \
-v /data:/data'
- Launch the server.
NOTE: Replace <secret>
below with your huggingface hub token.
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path NousResearch/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000
- To verify that the setup works, you can run a benchmark in another terminal or refer to other docs to send requests to the engine.
drun sglang_image \
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 4000 \
--random-input 128 \
--random-output 128
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.
The only difference in running DeepSeek-V3 is when starting the server. Here's an example command:
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \ # <- here
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
Running Llama3.1 is nearly identical. The only difference is in the model specified when starting the server, shown by the following example command:
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \ # <- here
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
When the server displays "The server is fired up and ready to roll!", it means the startup is successful.
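As a quick sanity check (illustrative; assumes the 30000 port mapping from the commands above), you can send a single request to the native /generate endpoint:
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'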
- Benchmark the latency of running a single static batch without a server. The arguments are the same as for
launch_server.py
. Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
- Benchmark offline processing. This script will start an offline engine and run the benchmark.
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
- Benchmark online serving. Please use
sglang.launch_server
to launch a server first and run the following command.
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
Pytorch Profiler is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.
- To profile a server
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
# send profiling request from client
python -m sglang.bench_serving --backend sglang --model-path meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
Please make sure that SGLANG_TORCH_PROFILER_DIR is set on both the server and client side; otherwise, the trace file cannot be generated correctly. A reliable way is to set SGLANG_TORCH_PROFILER_DIR in your shell's rc file (e.g., ~/.bashrc for bash shells).
- To profile offline
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
- View Traces
Trace files can be loaded and visualized from:
- https://ui.perfetto.dev/ (any browser)
- chrome://tracing (Chrome browser only)
If the browser cannot open the trace file due to its size, the client can generate a smaller trace file (<100 MB) by limiting the number of prompts and the lengths of the prompt outputs. For example, when profiling a server,
python -m sglang.bench_serving --backend sglang --model-path meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
sets the number of prompts to 2 with the --num-prompts argument and limits the length of output sequences to 100 with the --sharegpt-output-len argument, producing a trace file small enough for the browser to open smoothly.
Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.
- Prerequisite: install using apt, or run inside an NVIDIA Docker container or an SGLang Docker container.
# install nsys
# https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
- To profile a single batch, use
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512
- To profile a server, e.g.
# server
# set the delay and duration times according to needs
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512
- Use NVTX to annotate code regions, e.g. to see their execution time.
# install nvtx
pip install nvtx
# code snippets
import nvtx
with nvtx.annotate("description", color="color"):
# some critical code
- You can benchmark a model using dummy weights by only providing the config.json file. This allows for quick testing of model variants without training. To do so, add --load-format dummy to the above commands; then you only need a correct config.json under the checkpoint folder.
- You can benchmark a model with modified configs (e.g., fewer layers) by using --json-model-override-args. For example, you can benchmark a model with only 1 layer and 1 KV head using
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
- You can use --python-backtrace=cuda to see the Python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event-based timing.)
- For more arguments, please see https://docs.nvidia.com/nsight-systems/UserGuide/index.html.
Welcome to SGLang! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
Note: New contributors do not have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.
git clone https://github.com/<your_user_name>/sglang.git
Refer to Install SGLang from Source documentation for more details on setting up the necessary dependencies.
We use pre-commit to maintain consistent code style checks. Before pushing your changes, please run:
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
pre-commit run --all-files
manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks before creating a Pull Request.
- Do not commit directly to the main branch. Always create a new branch (e.g., feature/my-new-feature), push your changes, and open a PR from that branch.
SGLang uses Python's built-in unittest framework. For detailed instructions on running tests and adding them to CI, please refer to test/README.md.
We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to docs/README.md.
If you want to contribute but don’t have a specific idea in mind, pick issues labeled “good first issue” or “help wanted”. These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this code walk-through for a deeper look into SGLang’s workflow.
If you have any questions or want to start a discussion, please feel free to ask in our Slack channel.
Thank you for your interest in SGLang—happy coding!
NOTE: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at conversation.py). It is NOT related to the chat template used in the SGLang language frontend (defined at chat_template.py).
By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
You can load the JSON format, which is defined by conversation.py
.
{
"name": "my_model",
"system": "<|im_start|>system",
"user": "<|im_start|>user",
"assistant": "<|im_start|>assistant",
"sep_style": "CHATML",
"sep": "<|im_end|>",
"stop_str": ["<|im_end|>", "<|im_start|>"]
}
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
You can also use the Jinja template format, defined by Hugging Face transformers https://huggingface.co/docs/transformers/main/en/chat_templating
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
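For reference, a minimal Jinja chat template (an illustrative ChatML-style sketch mirroring the JSON template above; adapt the role markers and stop tokens to your model) could look like:
{% for message in messages %}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}
{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}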
SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for DeepSeek V3.
SGLang is recognized as one of the top engines for DeepSeek model inference.
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to the DeepSeek V3 official guide to download the weights.
Please refer to the example. Note that DeepSeek V3 is already in FP8, so we should not run it with any quantization arguments like --quantization fp8 --kv-cache-dtype fp8_e5m2. Also, --enable-dp-attention can be useful to improve DeepSeek V3/R1's throughput. Please refer to Data Parallelism Attention for details.
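For example, a launch command along these lines (an illustrative sketch; adjust the tensor-parallel size to your hardware) enables DP attention without adding any quantization flags:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code \
  --enable-dp-attention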
Description: MLA is an innovative attention mechanism introduced by the DeepSeek team, aimed at improving inference efficiency. SGLang has implemented specific optimizations for this, including:
- Weight Absorption: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
- Triton Decoding Kernel Optimization: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
- FP8 Quantization: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- CUDA Graph & Torch.compile: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
Overall, with these optimizations, we have achieved up to a 7x acceleration in output throughput compared to the previous version.
Usage: MLA optimization is enabled by default. To disable it, use --disable-mla.
Reference: Check Blog and Slides for more details.
Description: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer.
Usage: This optimization is aimed at improving throughput and should be used for scenarios with high QPS (Queries Per Second). Data Parallelism Attention optimization can be enabled by --enable-dp-attention
for DeepSeek Series Models.
Reference: Check Blog.
Description: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.
Usage: Check here for usage examples.
Description: SGLang implements block-wise FP8 quantization with two key optimizations:
- Activation: E4M3 format using per-token-per-128-channel sub-vector scales with online casting.
- Weight: Per-128x128-block quantization for better numerical stability.
Usage: turned on by default for DeepSeek V3 models.
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to the final nondeterministic outputs.
To achieve more deterministic outputs in the current code, you can add --disable-radix-cache
and send only one request at a time. The results will be mostly deterministic under this setting.
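For example (illustrative), launch the server with the radix cache disabled and then send requests one at a time:
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache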
We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
We have two issues to track our progress:
- The deterministic mode is tracked at sgl-project/sglang#1729.
- The per-request random seed is tracked at sgl-project/sglang#1335.
Achieving a large batch size is the most important thing for attaining high throughput.
When the server is running at full load, look for the following in the log:
Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317
#queue-req
indicates the number of requests in the queue. If you frequently see #queue-req == 0
, it suggests you are bottlenecked by the request submission speed.
A healthy range for #queue-req
is 50 - 500
.
On the other hand, do not make #queue-req
too large because it will also increase the scheduling overhead on the server, especially when using the default longest-prefix-match schedule policy (--schedule-policy lpm
).
token usage
indicates the KV cache memory utilization of the server. token usage > 0.9
means good utilization.
If you frequently see token usage < 0.9
and #queue-req > 0
, it means the server is too conservative about taking in new requests. You can decrease --schedule-conservativeness
to a value like 0.3.
The case of the server being too conservative can happen when users send many requests with a large max_new_tokens
but the requests stop very early due to EOS or stop strings.
On the other hand, if you see token usage
very high and you frequently see warnings like
decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000
, you can increase --schedule-conservativeness
to a value like 1.3.
If you see decode out of memory happened
occasionally but not frequently, it is okay.
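For example (illustrative values; pick one depending on what the log shows):
# Server too conservative (token usage < 0.9 and #queue-req > 0): admit requests more aggressively
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --schedule-conservativeness 0.3
# Frequent decode out-of-memory retractions: be more conservative
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --schedule-conservativeness 1.3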
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.
If you see out of memory (OOM) errors, you can try to tune the following parameters.
- If OOM happens during prefill, try to decrease --chunked-prefill-size to 4096 or 2048.
- If OOM happens during decoding, try to decrease --max-running-requests.
- You can also try to decrease --mem-fraction-static, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
- To enable torch.compile acceleration, add --enable-torch-compile. It accelerates small models on small batch sizes. This does not work for FP8 currently.
If the workload has many shared prefixes, use the default --schedule-policy lpm. lpm stands for longest prefix match.
When you have no shared prefixes at all or you always send the requests with the shared prefixes together, you can try --schedule-policy fcfs. fcfs stands for first come first serve. fcfs has a lower scheduling overhead.
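For example (illustrative):
# Requests share no prefixes: lower scheduling overhead with first-come-first-serve
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --schedule-policy fcfs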
You can find more blogs, slides, and videos about SGLang at https://github.com/sgl-project/sgl-learning-materials.
To use a model from ModelScope, set the environment variable SGLANG_USE_MODELSCOPE
.
export SGLANG_USE_MODELSCOPE=true
We take Qwen2-7B-Instruct as an example.
Launch the Server:
python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
Or start it by docker:
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
Note that ModelScope uses a different cache directory than Hugging Face. You may need to set it manually to avoid running out of disk space.
Run 405B (fp16) on Two Nodes
# replace 172.16.4.52:20000 with your own node ip address and port of the first node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0
# replace 172.18.45.52:20000 with your own node ip address and port of the second node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.18.45.52:20000 --nnodes 2 --node-rank 1
Note that Llama 405B (FP8) can also be launched on a single node.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
Please refer to the DeepSeek documents for reference.
This example showcases how to serve an SGLang server across multiple nodes with SLURM. Submit the following job to the SLURM cluster.
#!/bin/bash -l
#SBATCH -o SLURM_Logs/%x_%j_master.out
#SBATCH -e SLURM_Logs/%x_%j_master.err
#SBATCH -D ./
#SBATCH -J Llama-405B-Online-Inference-TP16-SGL
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1 # Ensure 1 task per node
#SBATCH --cpus-per-task=18
#SBATCH --mem=224GB
#SBATCH --partition="lmsys.org"
#SBATCH --gres=gpu:8
#SBATCH --time=12:00:00
echo "[INFO] Activating environment on node $SLURM_PROCID"
if ! source ENV_FOLDER/bin/activate; then
echo "[ERROR] Failed to activate environment" >&2
exit 1
fi
# Define parameters
model=MODEL_PATH
tp_size=16
echo "[INFO] Running inference"
echo "[INFO] Model: $model"
echo "[INFO] TP Size: $tp_size"
# Set NCCL initialization address using the hostname of the head node
HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
NCCL_INIT_ADDR="${HEAD_NODE}:8000"
echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR"
# Launch the model server on each node using SLURM
srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node$SLURM_NODEID.out" \
--error="SLURM_Logs/%x_%j_node$SLURM_NODEID.err" \
python3 -m sglang.launch_server \
--model-path "$model" \
--grammar-backend "xgrammar" \
--tp "$tp_size" \
--nccl-init-addr "$NCCL_INIT_ADDR" \
--nnodes 2 \
--node-rank "$SLURM_NODEID" &
# Wait for the NCCL server to be ready on port 30000
while ! nc -z "$HEAD_NODE" 30000; do
sleep 1
echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections"
done
echo "[INFO] $HEAD_NODE:30000 is ready to accept connections"
# Keep the script running until the SLURM job times out
wait
Then, you can test the server by sending requests following other documents.
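For example, an illustrative smoke test against the head node (port 30000, which the script above waits on; replace HEAD_NODE and MODEL_PATH with the values from the job script):
curl http://HEAD_NODE:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL_PATH", "messages": [{"role": "user", "content": "List 3 countries and their capitals."}], "max_tokens": 64}'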
Thanks to aflah02 for providing the example, which is based on his blog post.
Before starting, ensure the following:
- NVIDIA Jetson AGX Orin Devkit is set up with JetPack 6.1 or later.
- CUDA Toolkit and cuDNN are installed.
- Verify that the Jetson AGX Orin is in high-performance mode:
sudo nvpmodel -m 0
- A custom PyPI index hosted at https://pypi.jetson-ai-lab.dev/jp6/cu126, tailored for NVIDIA Jetson Orin platforms and CUDA 12.6.
To install torch from this index:
pip install torch --index-url https://pypi.jetson-ai-lab.dev/jp6/cu126
Please refer to Installation Guide to install FlashInfer and SGLang.
Launch the server:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
The quantization and limited context length (--dtype half --context-length 8192
) are due to the limited computational resources of the NVIDIA Jetson kit. A detailed explanation can be found in Server Arguments.
After launching the engine, refer to Chat completions to test the usability.
TorchAO is recommended for NVIDIA Jetson Orin.
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
This enables TorchAO's int4 weight-only quantization with a 128-group size. The usage of --torchao-config int4wo-128
is also for memory efficiency.
Please refer to the SGLang structured outputs documentation.
Thanks to the support from shahizat.
SGLang exposes the following metrics via Prometheus. The metrics are namespaced by $name
(the model name).
An example of the monitoring dashboard is available in examples/monitoring/grafana.json.
Here is an example of the metrics:
$ curl http://localhost:30000/metrics
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.30457592010498047
sglang:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.25",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="7.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="25.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.30521273612976074
sglang:e2e_request_latency_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="50.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0381757915019989
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.015",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
# HELP sglang:func_latency_seconds Function latency in seconds
# TYPE sglang:func_latency_seconds histogram
sglang:func_latency_seconds_sum{name="generate_request"} 0.3061351010110229
sglang:func_latency_seconds_bucket{le="0.05",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.07500000000000001",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.1125",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.16875",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.253125",name="generate_request"} 0.0
sglang:func_latency_seconds_bucket{le="0.3796875",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="0.56953125",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="0.8542968750000001",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="1.2814453125",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="1.9221679687500002",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="2.8832519531250003",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="4.3248779296875",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="6.487316894531251",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="9.730975341796876",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="14.596463012695313",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="21.89469451904297",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="32.84204177856446",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="49.26306266784668",name="generate_request"} 1.0
sglang:func_latency_seconds_bucket{le="+Inf",name="generate_request"} 1.0
sglang:func_latency_seconds_count{name="generate_request"} 1.0
# HELP sglang:num_running_reqs The number of running requests
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:num_used_tokens The number of used tokens
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:gen_throughput The generate throughput (token/s)
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:token_usage The token usage
# TYPE sglang:token_usage gauge
sglang:token_usage{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
To set up a monitoring dashboard, you can use the Docker Compose file at examples/monitoring/docker-compose.yaml.
This assumes you have an SGLang server running at localhost:30000.
To start the monitoring dashboard (Prometheus + Grafana), cd to examples/monitoring and run:
docker compose -f compose.yaml -p monitoring up
Then you can access the Grafana dashboard at http://localhost:3000.
To import the Grafana dashboard, click + -> Import -> Upload JSON file -> Upload and select grafana.json.
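Before wiring up Prometheus and Grafana, you can sanity-check the metrics endpoint directly. The short sketch below assumes the SGLang server exposes Prometheus-format metrics at http://localhost:30000/metrics (adjust the URL and port if your deployment differs) and simply filters a couple of the gauges shown above.
import requests

# Fetch the raw Prometheus text exposition from the SGLang server.
# Assumes metrics are exposed at /metrics on port 30000; adjust as needed.
metrics = requests.get("http://localhost:30000/metrics").text

# Print a couple of the gauges shown above to confirm scraping will work.
for line in metrics.splitlines():
    if line.startswith(("sglang:num_running_reqs", "sglang:gen_throughput")):
        print(line)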
SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
Offline quantization loads pre-quantized model weights directly during inference. This is useful for methods that require pre-computed statistics, such as AWQ, which collects activation statistics from a calibration set.
Online quantization computes scaling parameters, such as the maximum and minimum values of the model weights, dynamically at runtime. Similar to the delayed-scaling mechanism in NVIDIA FP8 training, online quantization calculates the appropriate scaling factors on the fly to convert high-precision weights into a lower-precision format.
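As a rough illustration of the idea (not SGLang's actual quantization kernels), the sketch below computes a per-tensor FP8 scale from a weight tensor at runtime and casts the weights to torch.float8_e4m3fn; the tensor shape and the per-tensor granularity are illustrative assumptions.
import torch

# Toy weight tensor standing in for one layer's high-precision weights.
weight = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Derive a per-tensor scale from the runtime max magnitude of the weights,
# mapping the observed range onto the representable FP8 (e4m3) range.
fp8_max = torch.finfo(torch.float8_e4m3fn).max
scale = weight.abs().max().float() / fp8_max

# Quantize on the fly; the scale is kept alongside the tensor for dequantization.
weight_fp8 = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)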
Note that offline quantization is recommended over online quantization for better performance, usability, and convenience. If you use a pre-quantized model, do not also add --quantization to enable online quantization at the same time. For popular pre-quantized models, please visit the neuralmagic collection on Hugging Face.
To load an already quantized model, simply load the model weights and config. Again, if the model has been quantized offline, there is no need to add the --quantization argument when starting the engine. The quantization method will be parsed from the downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant flags.
python3 -m sglang.launch_server \
--model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--port 30000 --host 0.0.0.0
To do offline quantization for your model, first install the llm-compressor library:
pip install llmcompressor
Here, we quantize meta-llama/Meta-Llama-3-8B-Instruct to FP8 as an example of how to do offline quantization.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Step 1: Load the original model.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Step 2: Perform offline quantization.
# Step 2.1: Configure the simple PTQ quantization.
recipe = QuantizationModifier(
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# Step 2.2: Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
# Step 3: Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
Then, you can directly use the quantized model with SGLang by running the following command:
python3 -m sglang.launch_server \
--model-path $PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic \
--port 30000 --host 0.0.0.0
To enable online quantization, you can simply specify --quantization in the command line. For example, you can launch the server with the following command to enable FP8 quantization for the model meta-llama/Meta-Llama-3.1-8B-Instruct:
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--port 30000 --host 0.0.0.0
Our team is working on supporting more online quantization methods. We will soon support methods including but not limited to ["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"].
We also support quantization methods based on torchao. You can simply specify --torchao-config in the command line to enable this feature. For example, if you want to enable int4wo-128 for the model meta-llama/Meta-Llama-3.1-8B-Instruct, you can launch the server with the following command:
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--torchao-config int4wo-128 \
--port 30000 --host 0.0.0.0
We support the following quantization methods based on torchao: ["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"].
Note: According to this issue, the "int8dq" method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using "int8dq". Namely, please use the following command:
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--torchao-config int8dq \
--disable-cuda-graph \
--port 30000 --host 0.0.0.0
This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime. If you want a high-level endpoint that can automatically handle chat templates, consider using the OpenAI Compatible API.
The /generate endpoint accepts the following arguments in JSON format.
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class GenerateReqInput:
# The input prompt. It can be a single prompt or a batch of prompts.
text: Optional[Union[List[str], str]] = None
# The token ids for text; one can specify either text or input_ids
input_ids: Optional[Union[List[List[int]], List[int]]] = None
# The embeddings for input_ids; one can specify either text or input_ids or input_embeds.
input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
# The image input. It can be a file name, a url, or base64 encoded string.
# See also python/sglang/srt/utils.py:load_image.
image_data: Optional[Union[List[str], str]] = None
# The sampling_params. See descriptions below.
sampling_params: Optional[Union[List[Dict], Dict]] = None
# The request id.
rid: Optional[Union[List[str], str]] = None
# Whether to return logprobs.
return_logprob: Optional[Union[List[bool], bool]] = None
# If return logprobs, the start location in the prompt for returning logprobs.
# By default, this value is "-1", which means it will only return logprobs for output tokens.
logprob_start_len: Optional[Union[List[int], int]] = None
# If return logprobs, the number of top logprobs to return at each position.
top_logprobs_num: Optional[Union[List[int], int]] = None
# Whether to detokenize tokens in text in the returned logprobs.
return_text_in_logprobs: bool = False
# Whether to stream output.
stream: bool = False
# Whether to log metrics for this request (e.g. health_generate calls do not log metrics)
log_metrics: bool = True
# The modalities of the image data [image, multi-images, video]
modalities: Optional[List[str]] = None
# LoRA related
lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None
# Session info for continual prompting
session_params: Optional[Union[List[Dict], Dict]] = None
# Custom logit processor for advanced sampling control. Must be a serialized instance
# of `CustomLogitProcessor` in python/sglang/srt/sampling/custom_logit_processor.py
# Use the processor's `to_str()` method to generate the serialized string.
custom_logit_processor: Optional[Union[List[Optional[str]], str]] = None
The sampling_params follows this format:
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization
spaces_between_special_tokens: bool = True,
# Do parallel sampling and return `n` outputs.
n: int = 1,
## Structured Outputs
# Only one of the below three can be set for a request.
# Constrain the output to follow a given JSON schema.
json_schema: Optional[str] = None,
# Constrain the output to follow a given regular expression.
regex: Optional[str] = None,
# Constrain the output to follow a given EBNF grammar.
ebnf: Optional[str] = None,
## Penalties.
# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing logits of tokenizer's
# EOS token and `stop_token_ids` to -inf, until the output token reaches given length.
# Note that any of the `stop` string can be generated before reaching `min_new_tokens`, as it is
# difficult to infer the correct token ID by given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
## Custom Parameters for Custom Logit Processor.
# A dictionary of custom parameters for the custom logit processor.
# The custom logit processor takes a list of dictionaries as input, where each
# dictionary is the custom parameters for one token in a batch of the input.
# See also python/sglang/srt/sampling/custom_logit_processor.py
custom_params: Optional[Dict[str, Any]] = None,
Launch a server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
Send a request
import requests
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
},
)
print(response.json())
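The logprob-related fields of GenerateReqInput can be combined with the same request. The sketch below asks for the top-2 candidate log probabilities at each output position, assuming the same server is still running; inspect the full returned JSON, as the exact layout of the logprob metadata may vary between versions.
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        # Ask the server to return log probabilities for the output tokens.
        "return_logprob": True,
        # Also return the top-2 candidate logprobs at each position.
        "top_logprobs_num": 2,
        # Detokenize tokens in the returned logprobs for readability.
        "return_text_in_logprobs": True,
    },
)
print(response.json())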
Send a request and stream the output
import requests, json
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
"stream": True,
},
stream=True,
)
prev = 0
for chunk in response.iter_lines(decode_unicode=False):
chunk = chunk.decode("utf-8")
if chunk and chunk.startswith("data:"):
if chunk == "data: [DONE]":
break
data = json.loads(chunk[5:].strip("\n"))
output = data["text"].strip()
print(output[prev:], end="", flush=True)
prev = len(output)
print("")
Launch a server
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
Download an image
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
Send a request
import requests
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
"<|im_start|>assistant\n",
"image_data": "example_image.png",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
},
)
print(response.json())
The image_data can be a file name, a URL, or a base64 encoded string. See also python/sglang/srt/utils.py:load_image.
Streaming is supported in a similar manner as above.
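For the base64 form, one option is to encode a local file yourself and pass the resulting string as image_data, as sketched below. This assumes load_image accepts a raw base64 payload; fall back to the file name or a URL if your version expects a different encoding.
import base64
import requests

# Encode the downloaded image as a base64 string.
with open("example_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
        "<|im_start|>assistant\n",
        "image_data": image_b64,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())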
You can specify a JSON schema, a regular expression, or an EBNF grammar to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified for a request.
SGLang supports two grammar backends:
- Outlines (default): Supports JSON schema and regular expression constraints.
- XGrammar: Supports JSON schema, regular expression, and EBNF constraints.
- XGrammar currently uses the GGML BNF format
Initialize the XGrammar backend using the --grammar-backend xgrammar flag:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)
import json
import requests
json_schema = json.dumps({
"type": "object",
"properties": {
"name": {"type": "string", "pattern": "^[\\w]+$"},
"population": {"type": "integer"},
},
"required": ["name", "population"],
})
# JSON (works with both Outlines and XGrammar)
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "Here is the information of the capital of France in the JSON format.\n",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"json_schema": json_schema,
},
},
)
print(response.json())
# Regular expression (Outlines backend only)
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "Paris is the capital of",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"regex": "(France|England)",
},
},
)
print(response.json())
# EBNF (XGrammar backend only)
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "Write a greeting.",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 64,
"ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
},
},
)
print(response.json())
Launch a server with the --enable-custom-logit-processor flag on.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
Define a custom logit processor that will always sample a specific token id.
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor


class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"
        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
Send a request
import requests
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "The capital of France is",
"custom_logit_processor": DeterministicLogitProcessor().to_str(),
"sampling_params": {
"temperature": 0.0,
"max_new_tokens": 32,
"custom_params": {"token_id": 5},
},
},
)
print(response.json())
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
- Mistral / Mixtral / Mistral NeMo / Mistral Small 3
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
- DeepSeek / DeepSeek 2 / DeepSeek 3
- OLMoE
- LLaVA-OneVision
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava
- Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
- LLaVA 1.5 / 1.6 / NeXT
python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava
- Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3 / MiniCPMV
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
- Phi-3 / Phi-4
- Phi-3-Small
- IBM Granite 3
- LlamaEmbeddingModel
- Mistral embedding models
- Qwen embedding models
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding
- LlamaForSequenceClassification
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --is-embedding
- Gemma2ForSequenceClassification
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Gemma-2-27B-v0.2 --is-embedding
- InternLM2ForRewardModel
python -m sglang.launch_server --model-path internlm/internlm2-7b-reward --is-embedding --trust-remote-code
To support a new model in SGLang, you only need to add a single file under SGLang Models Directory. You can learn from existing model implementations and create new files for the new models. For most models, you should be able to find a similar model to start with (e.g., starting from Llama).
To support a new vision-language model (vLM) in SGLang, you need to implement several key components in addition to the standard LLM support.
- Register your new model as multimodal: Extend is_multimodal_model in model_config.py to return True for your model.
- Process Images: Create a new ImageProcessor class that inherits from BaseImageProcessor and register this processor as your model's dedicated processor. See image_processor.py for more details.
- Handle Image Tokens: Implement a pad_input_ids function for your new model, in which image tokens in the prompt should be expanded and replaced with image-hashes, so that SGLang can recognize different images for RadixAttention.
- Replace the multi-headed Attention of ViT with SGLang's VisionAttention.
You can refer to Qwen2VL or other vLMs. These models demonstrate how to properly handle both visual and textual inputs.
For interactive debugging, you can compare the outputs of huggingface/transformers and SGLang. The following two commands should give the same text output and very similar prefill logits.
- Get the reference output by
python3 scripts/playground/reference_hf.py --model [new model]
- Get the SGLang output by
python3 -m sglang.bench_one_batch --correct --model [new model]
To make sure the new model is well maintained in the future, it is better to add it to the test suite.
You can add it to the ALL_OTHER_MODELS list in test_generation_models.py and run the following command to test it.
For example, if the model is Qwen/Qwen2-1.5B:
ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
Another valuable resource is the vLLM Models Directory. vLLM has extensive coverage of models, and SGLang reuses vLLM's interface and some layers to implement the models. This similarity makes it easy to port many models from vLLM to SGLang.
To port a model from vLLM to SGLang, you can compare these two files SGLang Llama Implementation and vLLM Llama Implementation. This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of Attention with RadixAttention. The other parts are almost identical. Specifically,
- Replace vLLM's Attention with RadixAttention. Note that you need to pass layer_id all the way to RadixAttention.
- Replace vLLM's LogitsProcessor with SGLang's LogitsProcessor.
- Replace the multi-headed Attention of ViT with SGLang's VisionAttention.
- Replace other vLLM layers with SGLang layers (e.g., RMSNorm, SiluAndMul).
- Remove Sample.
- Change the forward() functions, and add forward_batch.
- Add EntryClass at the end.
- Please ensure the new implementation uses only SGLang components and does not rely on any vLLM components.
In addition to the methods described above, you can also register your new model with the ModelRegistry before launching the server. This approach is useful if you want to integrate your model without needing to modify the source code.
Here is how you can do it:
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.entrypoints.http_server import launch_server
# for a single model, you can add it to the registry
ModelRegistry.models[model_name] = model_class
# for multiple models, you can imitate the import_model_classes() function in sglang/srt/models/registry.py
from functools import lru_cache
@lru_cache()
def import_new_model_classes():
model_arch_name_to_cls = {}
...
return model_arch_name_to_cls
ModelRegistry.models.update(import_new_model_classes())
launch_server(server_args)
SGLang uses the max-autotune-no-cudagraphs mode of torch.compile. The auto-tuning can be slow.
If you want to deploy a model on many different machines, you can ship the torch.compile cache to these machines and skip the compilation steps.
This is based on https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html
- Generate the cache by setting TORCHINDUCTOR_CACHE_DIR and running the model once.
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
- Copy the cache folder to other machines and launch the server with TORCHINDUCTOR_CACHE_DIR pointing at it.
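For example, shipping the cache could look like the following; the destination host and paths are placeholders, and any copy mechanism (scp, rsync, a shared volume) works.
# On the machine where the cache was generated: copy the Inductor cache to the target machine.
rsync -a /root/inductor_root_cache/ other-machine:/root/inductor_root_cache/

# On the target machine: point TORCHINDUCTOR_CACHE_DIR at the shipped cache to skip recompilation.
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile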
This page lists some common errors and tips for fixing them.
If you see out of memory (OOM) errors, you can try to tune the following parameters.
- If OOM happens during prefill, try to decrease --chunked-prefill-size to 4096 or 2048.
- If OOM happens during decoding, try to decrease --max-running-requests.
- You can also try to decrease --mem-fraction-static, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding. A command combining these flags is sketched after this list.
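For instance, a command combining these flags might look like the one below; the values for --max-running-requests and --mem-fraction-static are illustrative assumptions to tune for your GPU and workload, not recommendations from this doc.
# 4096 comes from the tip above; 32 and 0.8 are illustrative starting points to tune.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 32 \
  --mem-fraction-static 0.8 \
  --port 30000 --host 0.0.0.0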
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
- If it is out-of-memory, it may sometimes be reported as this error instead of "Out of memory." Please refer to the section above to avoid OOM.
Given multiple GPUs running multiple SGLang Runtimes, SGLang Router distributes the requests to different Runtimes with its unique cache-aware load-balancing algorithm.
The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API.
$ pip install sglang-router
Detailed usage of the router can be found in launch_router and launch_server. Also, you can directly run the following command to see the usage of the router.
$ python -m sglang_router.launch_server --help
$ python -m sglang_router.launch_router --help
The router supports two working modes:
- Co-launch Router and Runtimes
- Launch Runtimes and Router separately
This will be a drop-in replacement for the existing --dp-size argument of the SGLang Runtime. Under the hood, it uses multiple processes to launch multiple workers, waits for them to be ready, and then connects the router to all of them.
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.
import requests
url = "http://localhost:30000/generate"
data = {"text": "What is the capital of France?"}
response = requests.post(url, json=data)
print(response.json())
This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
We offer /add_worker and /remove_worker APIs to dynamically add or remove workers from the router.
/add_worker
Usage:
$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1
Example:
$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
Successfully added worker: http://127.0.0.1:30001
/remove_worker
Usage:
$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
Example:
$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
Successfully removed worker: http://127.0.0.1:30001
Note:
- For the cache-aware router, the worker will also be removed from the tree and the queues.
We provide retry-based failure tolerance.
- If a request to a worker fails max_worker_retries times, the router will remove that worker and move on to the next one.
- If the total number of retries exceeds max_total_retries, the router will return an error.
Note: max_worker_retries is 3 and max_total_retries is 6 by default.
The native router combines two strategies to optimize both cache utilization and request distribution:
- Cache-Aware Routing (Approximate Tree)
- Load-Balancing Routing (Shortest Queue with Balance Thresholds)
The router dynamically switches between these strategies based on load conditions:
- Uses load balancing when the system is imbalanced
- Uses cache-aware routing when the system is balanced
A system is considered imbalanced if both conditions are met:
- (max_load - min_load) > balance_abs_threshold
- max_load > balance_rel_threshold * min_load
Cache-Aware Routing (Approximate Tree)
When the workers are considered to be balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need for direct cache state queries on each worker. The tree stores raw text characters instead of token IDs to avoid tokenization overhead.
Process:
- For each request, find the worker with the highest prefix match.
  - If the match rate > cache_threshold, route the request to the worker with the highest match (it likely has the relevant data cached).
  - If the match rate ≤ cache_threshold, route the request to the worker with the smallest tree size (most available cache capacity).
- Background maintenance: periodically evict the least recently used leaf nodes of the approximate tree to prevent memory overflow.
Load-Balancing (Shortest Queue)
For unbalanced systems, this strategy tracks pending request counts per worker and routes new requests to the least busy worker. This helps maintain optimal load distribution across workers.
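The overall decision can be summarized with the simplified sketch below. It is not the router's actual implementation: the worker attributes (pending_requests, approx_tree_size) and the prefix_match_rate helper are assumptions standing in for state the router tracks internally.
def select_worker(request_text, workers,
                  cache_threshold, balance_abs_threshold, balance_rel_threshold):
    # Detect imbalance from per-worker pending request counts.
    loads = [w.pending_requests for w in workers]
    max_load, min_load = max(loads), min(loads)
    imbalanced = (max_load - min_load) > balance_abs_threshold and \
                 max_load > balance_rel_threshold * min_load

    if imbalanced:
        # Load-balancing mode: route to the worker with the shortest queue.
        return min(workers, key=lambda w: w.pending_requests)

    # Cache-aware mode: prefer the worker with the highest prefix match.
    best = max(workers, key=lambda w: w.prefix_match_rate(request_text))
    if best.prefix_match_rate(request_text) > cache_threshold:
        return best
    # Low match rate: pick the worker with the smallest approximate tree
    # (most available cache capacity).
    return min(workers, key=lambda w: w.approx_tree_size)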
- cache_threshold: (float, 0.0 to 1.0, default: 0.5)
  - Minimum prefix match ratio to use highest-match routing.
  - Below this threshold, the request will be routed to the worker with the most available cache space.
- balance_abs_threshold: (integer, default: 32)
  - Absolute difference threshold for load imbalance detection.
  - The system is potentially imbalanced if (max_load - min_load) > abs_threshold.
- balance_rel_threshold: (float, default: 1.0001)
  - Relative ratio threshold for load imbalance detection.
  - The system is potentially imbalanced if max_load > min_load * rel_threshold.
  - Used in conjunction with balance_abs_threshold to determine the final imbalance state.
- eviction_interval: (integer, default: 60)
  - Interval in seconds between LRU eviction cycles for the approximate trees.
  - A background thread periodically evicts least recently used nodes to maintain the tree size.
- max_tree_size: (integer, default: 16777216)
  - Maximum number of nodes in the approximate tree.
  - When exceeded, LRU leaf nodes are evicted during the next eviction cycle.
You can install SGLang using any of the methods below. For running DeepSeek V3/R1 with SGLang, refer to DeepSeek V3 Support. It is always recommended to use the latest release version and deploy it with Docker to avoid environment-related problems and issues that have already been fixed.
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version built for torch 2.5. If you want to install flashinfer separately, please refer to the FlashInfer installation doc. Please note that the package currently used by FlashInfer is named flashinfer-python, not flashinfer.
If you experience an error like OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root, please try either of the following solutions:
- Use export CUDA_HOME=/usr/local/cuda-<your-cuda-version> to set the CUDA_HOME environment variable.
- Follow the procedure described in the FlashInfer installation doc first, then install SGLang as described above.
# Use the last release branch
git clone -b v0.4.3 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version for torch 2.5. If you want to install flashinfer separately, please refer to FlashInfer installation doc.
If you want to work on development in SGLang, it is highly recommended that you use Docker. Please refer to setup docker container for guidance. The image used is lmsysorg/sglang:dev.
Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:
# Use the last release branch
git clone -b v0.4.3 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
The docker images are available on Docker Hub as lmsysorg/sglang, built from Dockerfile.
Replace <secret> below with your Hugging Face Hub token.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to use docker/Dockerfile.rocm to build the image. An example build and usage is shown below:
docker build --build-arg SGL_BRANCH=v0.4.3 -t v0.4.3-rocm630 -f Dockerfile.rocm .
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
--shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx -v /data:/data'
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
v0.4.3-rocm630 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Until the flashinfer backend is available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.4.3-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
This method is recommended if you plan to serve it as a service. A better approach is to use the k8s-sglang-service.yaml.
- Copy the compose.yml to your local machine.
- Execute the command docker compose up -d in your terminal.
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
- Install SkyPilot and set up a Kubernetes cluster or cloud access: see SkyPilot's documentation.
- Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
HF_TOKEN: null
resources:
image_id: docker:lmsysorg/sglang:latest
accelerators: A100
ports: 30000
run: |
conda deactivate
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
- To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
- FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding --attention-backend triton --sampling-backend pytorch and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using pip install "sglang[openai]".
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run pip install sglang, and for the backend, use pip install sglang[srt]. srt is the abbreviation of SGLang runtime.
- To reinstall flashinfer locally, use the following command: pip install "flashinfer-python>=0.2.1.post1" -i https://flashinfer.ai/whl/cu124/torch2.5 --force-reinstall --no-deps and then delete the cache with rm -rf ~/.cache/flashinfer.