vLLM Docs

README.md

vLLM documents

Build the docs

# Install dependencies.
pip install -r requirements-docs.txt

# Build the docs.
make clean
make html

Open the docs in your browser:

python -m http.server -d build/html/

Launch your browser and open localhost:8000.


source/api/engine/async_llm_engine.md

AsyncLLMEngine

.. autoclass:: vllm.AsyncLLMEngine
    :members:
    :show-inheritance:

source/api/engine/index.md

vLLM Engine

.. automodule:: vllm.engine
.. currentmodule:: vllm.engine
:caption: Engines
:maxdepth: 2

llm_engine
async_llm_engine

source/api/engine/llm_engine.md

LLMEngine

.. autoclass:: vllm.LLMEngine
    :members:
    :show-inheritance:

source/api/inference_params.md

Inference Parameters

Inference parameters for vLLM APIs.

(sampling-params)=

Sampling Parameters

.. autoclass:: vllm.SamplingParams
    :members:
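
For illustration, here is a minimal sketch of constructing a `SamplingParams` object and passing it to `LLM.generate` (the model name is only an example):

```python
from vllm import LLM, SamplingParams

# Low-temperature sampling with nucleus sampling and a cap on output length.
params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # example model
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```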

(pooling-params)=

Pooling Parameters

.. autoclass:: vllm.PoolingParams
    :members:

source/api/model/adapters.md

Model Adapters

Module Contents

.. automodule:: vllm.model_executor.models.adapters
    :members:
    :member-order: bysource

source/api/model/index.md

Model Development

Submodules

:maxdepth: 1

interfaces_base
interfaces
adapters

source/api/model/interfaces.md

Optional Interfaces

Module Contents

.. automodule:: vllm.model_executor.models.interfaces
    :members:
    :member-order: bysource

source/api/model/interfaces_base.md

Base Model Interfaces

Module Contents

.. automodule:: vllm.model_executor.models.interfaces_base
    :members:
    :member-order: bysource

source/api/multimodal/index.md

(multi-modality)=

Multi-Modality

vLLM provides experimental support for multi-modal models through the {mod}vllm.multimodal package.

Multi-modal inputs can be passed alongside text and token prompts to supported models via the multi_modal_data field in {class}vllm.inputs.PromptType.
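
As a rough illustration, the sketch below passes an image via the multi_modal_data field (the model name, prompt template, and image path are examples; the exact prompt format depends on the model):

```python
from PIL import Image
from vllm import LLM

# Example vision-language model and local image; adjust to your setup.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```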

Looking to add your own multi-modal model? Please follow the instructions listed here.

Module Contents

.. autodata:: vllm.multimodal.MULTIMODAL_REGISTRY

Submodules

:maxdepth: 1

inputs
parse
processing
profiling
registry

source/api/multimodal/inputs.md

Input Definitions

User-facing inputs

.. autodata:: vllm.multimodal.inputs.MultiModalDataDict

Internal data structures

.. autoclass:: vllm.multimodal.inputs.PlaceholderRange
    :members:
    :show-inheritance:
.. autodata:: vllm.multimodal.inputs.NestedTensors
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldElem
    :members:
    :show-inheritance:
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldConfig
    :members:
    :show-inheritance:
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargsItem
    :members:
    :show-inheritance:
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargs
    :members:
    :show-inheritance:
.. autoclass:: vllm.multimodal.inputs.MultiModalInputs
    :members:
    :show-inheritance:

source/api/multimodal/parse.md

Data Parsing

Module Contents

.. automodule:: vllm.multimodal.parse
    :members:
    :member-order: bysource

source/api/multimodal/processing.md

Data Processing

Module Contents

.. automodule:: vllm.multimodal.processing
    :members:
    :member-order: bysource

source/api/multimodal/profiling.md

Memory Profiling

Module Contents

.. automodule:: vllm.multimodal.profiling
    :members:
    :member-order: bysource

source/api/multimodal/registry.md

Registry

Module Contents

.. automodule:: vllm.multimodal.registry
    :members:
    :member-order: bysource

source/api/offline_inference/index.md

Offline Inference

:caption: Contents
:maxdepth: 1

llm
llm_inputs

source/api/offline_inference/llm.md

LLM Class

.. autoclass:: vllm.LLM
    :members:
    :show-inheritance:

source/api/offline_inference/llm_inputs.md

LLM Inputs

.. autodata:: vllm.inputs.PromptType
.. autoclass:: vllm.inputs.TextPrompt
    :show-inheritance:
    :members:
    :member-order: bysource
.. autoclass:: vllm.inputs.TokensPrompt
    :show-inheritance:
    :members:
    :member-order: bysource
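
As a rough illustration, both prompt types can be passed to LLM.generate as plain dictionaries (the model name and token IDs below are only examples):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(max_tokens=16)

# TextPrompt: the engine tokenizes the raw text for you.
text_outputs = llm.generate({"prompt": "Hello, my name is"}, params)

# TokensPrompt: pass pre-tokenized input as token IDs (illustrative IDs shown).
token_outputs = llm.generate({"prompt_token_ids": [101, 7592, 1010, 2026, 2171, 2003]}, params)

print(text_outputs[0].outputs[0].text)
print(token_outputs[0].outputs[0].text)
```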

source/community/meetups.md

(meetups)=

vLLM Meetups

We host regular meetups in the San Francisco Bay Area every two months, where we share project updates from the vLLM team and invite guest speakers from industry to share their experiences and insights. Please find the materials from our previous meetups below:

We are always looking for speakers and sponsors in the San Francisco Bay Area and potentially other locations. If you are interested in speaking or sponsoring, please contact us at [email protected].


source/community/sponsors.md

Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

Cash Donations:

  • a16z
  • Dropbox
  • Sequoia Capital
  • Skywork AI
  • ZhenFund

Compute Resources:

  • AMD
  • Anyscale
  • AWS
  • Crusoe Cloud
  • Databricks
  • DeepInfra
  • Google Cloud
  • Lambda Lab
  • Nebius
  • Novita AI
  • NVIDIA
  • Replicate
  • Roblox
  • RunPod
  • Trainy
  • UC Berkeley
  • UC San Diego

Slack Sponsor: Anyscale

We also have an official fundraising venue through OpenCollective. We plan to use the fund to support the development, maintenance, and adoption of vLLM.


source/deployment/docker.md

(deployment-docker)=

Using Docker

(deployment-docker-pre-built-image)=

Use vLLM's Official Docker Image

vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1

You can add any other engine arguments (project:#engine-args) you need after the image tag (vllm/vllm-openai:latest).

You can use either the `--ipc=host` flag or the `--shm-size` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.

(deployment-docker-build-image-from-source)=

Building vLLM's Docker Image from Source

You can build and run vLLM from source via the provided gh-file:Dockerfile. To build vLLM:

# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
By default, vLLM builds for all GPU types for the widest distribution. If you are only building for the
GPU type of the machine you are running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
so vLLM detects the current GPU type and builds for that.

If you are using Podman instead of Docker, you might need to disable SELinux labeling by
adding `--security-opt label=disable` when running the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).

Building for Arm64/aarch64

A Docker container can be built for aarch64 systems such as the NVIDIA Grace-Hopper. At the time of this writing, this requires the use of PyTorch Nightly and should be considered experimental. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.

Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` and `--build-arg nvcc_threads=`
flags to speed up the build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
Keep an eye on memory usage with parallel jobs, as it can be substantial (see the example below).
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
$ python3 use_existing_torch.py
$ DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --platform "linux/arm64" \
  -t vllm/vllm-gh200-openai:latest \
  --build-arg max_jobs=66 \
  --build-arg nvcc_threads=2 \
  --build-arg torch_cuda_arch_list="9.0+PTX" \
  --build-arg vllm_fa_cmake_gpu_arches="90-real"

Use the custom-built vLLM Docker image

To run vLLM with the custom-built Docker image:

$ docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    vllm/vllm-openai <args...>

The argument vllm/vllm-openai specifies the image to run, and should be replaced with the name of the custom-built image (the -t tag from the build command).

**For versions 0.4.1 and 0.4.2 only** - the vLLM Docker images under these versions must be run as the root user, because a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, is required to be loaded at runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all of its parent directories) to allow the user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.

source/deployment/frameworks/bentoml.md

(deployment-bentoml)=

BentoML

BentoML allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

For details, see the tutorial vLLM inference in the BentoML documentation.


source/deployment/frameworks/cerebrium.md

(deployment-cerebrium)=

Cerebrium

<p align="center">
    <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>

vLLM can be run on a cloud-based GPU machine with Cerebrium, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.

To install the Cerebrium client, run:

pip install cerebrium
cerebrium login

Next, to create your Cerebrium project, run:

cerebrium init vllm-project

Next, to install the required packages, add the following to your cerebrium.toml:

[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

[cerebrium.dependencies.pip]
vllm = "latest"

Next, let us add the code that handles inference for the LLM of your choice (mistralai/Mistral-7B-Instruct-v0.1 for this example). Add the following code to your main.py:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):

    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    results = []
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        results.append({"prompt": prompt, "generated_text": generated_text})

    return {"results": results}

Then, run the following code to deploy it to the cloud:

cerebrium deploy

If deployment is successful, you should be returned a curl command that you can use to call inference. Just remember to end the URL with the name of the function you are calling (in our case, /run):

curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
 -H 'Content-Type: application/json' \
 -H 'Authorization: <JWT TOKEN>' \
 --data '{
   "prompts": [
     "Hello, my name is",
     "The president of the United States is",
     "The capital of France is",
     "The future of AI is"
   ]
 }'

You should get a response like:

{
    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
    "result": {
        "result": [
            {
                "prompt": "Hello, my name is",
                "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
            },
            {
                "prompt": "The president of the United States is",
                "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
            },
            {
                "prompt": "The capital of France is",
                "generated_text": " Paris.\n"
            },
            {
                "prompt": "The future of AI is",
                "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
            }
        ]
    },
    "run_time_ms": 152.53663063049316
}

You now have an autoscaling endpoint where you only pay for the compute you use!


source/deployment/frameworks/dstack.md

(deployment-dstack)=

dstack

<p align="center">
    <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
</p>

vLLM can be run on a cloud-based GPU machine with dstack, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.

To install the dstack client, run:

pip install "dstack[all]
dstack server

Next, to configure your dstack project, run:

mkdir -p vllm-dstack
cd vllm-dstack
dstack init

Next, to provision a VM instance with the LLM of your choice (NousResearch/Llama-2-7b-chat-hf for this example), create the following serve.dstack.yml file for the dstack Service:

type: service

python: "3.11"
env:
    - MODEL=NousResearch/Llama-2-7b-chat-hf
port: 8000
resources:
    gpu: 24GB
commands:
    - pip install vllm
    - vllm serve $MODEL --port 8000
model:
    format: openai
    type: chat
    name: NousResearch/Llama-2-7b-chat-hf

Then, run the following CLI for provisioning:

$ dstack run . -f serve.dstack.yml

⠸ Getting run plan...
 Configuration  serve.dstack.yml
 Project        deep-diver-main
 User           deep-diver
 Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
 Max price      -
 Max duration   -
 Spot policy    auto
 Retry policy   no

 #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
 1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
 2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
 3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
    ...
 Shown 3 of 193 offers, $5.876 max

Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...

After the provisioning, you can interact with the model by using the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
)

completion = client.chat.completions.create(
    model="NousResearch/Llama-2-7b-chat-hf",
    messages=[
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        }
    ]
)

print(completion.choices[0].message.content)
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).

source/deployment/frameworks/helm.md

(deployment-helm)=

Helm

A Helm chart to deploy vLLM for Kubernetes

Helm is a package manager for Kubernetes. It helps you deploy vLLM on Kubernetes and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for `helm install`, and documentation on the architecture and the values file.

Prerequisites

Before you begin, ensure that you have the following:

Installing the chart

To install the chart with the release name test-vllm:

helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY

Uninstalling the Chart

To uninstall the test-vllm deployment:

helm uninstall test-vllm --namespace=ns-vllm

The command removes all the Kubernetes components associated with the chart, including persistent volumes, and deletes the release.

Architecture

Values

:widths: 25 25 25 25
:header-rows: 1

* - Key
  - Type
  - Default
  - Description
* - autoscaling
  - object
  - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
  - Autoscaling configuration
* - autoscaling.enabled
  - bool
  - false
  - Enable autoscaling
* - autoscaling.maxReplicas
  - int
  - 100
  - Maximum replicas
* - autoscaling.minReplicas
  - int
  - 1
  - Minimum replicas
* - autoscaling.targetCPUUtilizationPercentage
  - int
  - 80
  - Target CPU utilization for autoscaling
* - configs
  - object
  - {}
  - Configmap
* - containerPort
  - int
  - 8000
  - Container port
* - customObjects
  - list
  - []
  - Custom Objects configuration
* - deploymentStrategy
  - object
  - {}
  - Deployment strategy configuration
* - externalConfigs
  - list
  - []
  - External configuration
* - extraContainers
  - list
  - []
  - Additional containers configuration
* - extraInit
  - object
  - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
  - Additional configuration for the init container
* - extraInit.pvcStorage
  - string
  - "50Gi"
  - Storage size of the s3
* - extraInit.s3modelpath
  - string
  - "relative_s3_model_path/opt-125m"
  - Path of the model on the s3 which hosts model weights and config files
* - extraInit.awsEc2MetadataDisabled
  - boolean
  - true
  - Disables the use of the Amazon EC2 instance metadata service
* - extraPorts
  - list
  - []
  - Additional ports configuration
* - gpuModels
  - list
  - ["TYPE_GPU_USED"]
  - Type of gpu used
* - image
  - object
  - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
  - Image configuration
* - image.command
  - list
  - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
  - Container launch command
* - image.repository
  - string
  - "vllm/vllm-openai"
  - Image repository
* - image.tag
  - string
  - "latest"
  - Image tag
* - livenessProbe
  - object
  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
  - Liveness probe configuration
* - livenessProbe.failureThreshold
  - int
  - 3
  - Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not alive
* - livenessProbe.httpGet
  - object
  - {"path":"/health","port":8000}
  - Configuration of the Kubelet http request on the server
* - livenessProbe.httpGet.path
  - string
  - "/health"
  - Path to access on the HTTP server
* - livenessProbe.httpGet.port
  - int
  - 8000
  - Name or number of the port to access on the container, on which the server is listening
* - livenessProbe.initialDelaySeconds
  - int
  - 15
  - Number of seconds after the container has started before liveness probe is initiated
* - livenessProbe.periodSeconds
  - int
  - 10
  - How often (in seconds) to perform the liveness probe
* - maxUnavailablePodDisruptionBudget
  - string
  - ""
  - Disruption Budget Configuration
* - readinessProbe
  - object
  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
  - Readiness probe configuration
* - readinessProbe.failureThreshold
  - int
  - 3
  - Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not ready
* - readinessProbe.httpGet
  - object
  - {"path":"/health","port":8000}
  - Configuration of the Kubelet http request on the server
* - readinessProbe.httpGet.path
  - string
  - "/health"
  - Path to access on the HTTP server
* - readinessProbe.httpGet.port
  - int
  - 8000
  - Name or number of the port to access on the container, on which the server is listening
* - readinessProbe.initialDelaySeconds
  - int
  - 5
  - Number of seconds after the container has started before readiness probe is initiated
* - readinessProbe.periodSeconds
  - int
  - 5
  - How often (in seconds) to perform the readiness probe
* - replicaCount
  - int
  - 1
  - Number of replicas
* - resources
  - object
  - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
  - Resource configuration
* - resources.limits."nvidia.com/gpu"
  - int
  - 1
  - Number of gpus used
* - resources.limits.cpu
  - int
  - 4
  - Number of CPUs
* - resources.limits.memory
  - string
  - "16Gi"
  - CPU memory configuration
* - resources.requests."nvidia.com/gpu"
  - int
  - 1
  - Number of gpus used
* - resources.requests.cpu
  - int
  - 4
  - Number of CPUs
* - resources.requests.memory
  - string
  - "16Gi"
  - CPU memory configuration
* - secrets
  - object
  - {}
  - Secrets configuration
* - serviceName
  - string
  -
  - Service name
* - servicePort
  - int
  - 80
  - Service port
* - labels.environment
  - string
  - test
  - Environment name
* - labels.release
  - string
  - test
  - Release name

source/deployment/frameworks/index.md

Using other frameworks

:maxdepth: 1

bentoml
cerebrium
dstack
helm
lws
modal
skypilot
triton

source/deployment/frameworks/lws.md

(deployment-lws)=

LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference.

vLLM can be deployed with LWS on Kubernetes for distributed model serving.

Please see this guide for more details on deploying vLLM on Kubernetes using LWS.


source/deployment/frameworks/modal.md

(deployment-modal)=

Modal

vLLM can be run on cloud GPUs with Modal, a serverless computing platform designed for fast auto-scaling.

For details on how to deploy vLLM on Modal, see this tutorial in the Modal documentation.


source/deployment/frameworks/skypilot.md

(deployment-skypilot)=

SkyPilot

<p align="center">
  <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p>

vLLM can be run and scaled to multiple service replicas on clouds and Kubernetes with SkyPilot, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the SkyPilot AI gallery.

Prerequisites

  • Go to the HuggingFace model page and request access to the model meta-llama/Meta-Llama-3-8B-Instruct.
  • Check that you have installed SkyPilot (docs).
  • Check that sky check shows clouds or Kubernetes are enabled.
pip install skypilot-nightly
sky check

Run on a single instance

See the vLLM SkyPilot YAML for serving, serving.yaml.

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log &

  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.

(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

Optional: Serve the 70B model instead of the default 8B and use more GPUs:

HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct

Scale up to multiple replicas

SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing, and fault-tolerance. You can do so by adding a service section to the YAML file.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1
<details>
<summary>Click to see the full recipe YAML</summary>
service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log
</details>

Start serving the Llama-3 8B model on multiple replicas:

HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN

Wait until the service is ready:

watch -n10 sky serve status vllm
<details>
<summary>Example outputs:</summary>
Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
</details>

After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who are you?"
    }
    ],
    "stop_token_ids": [128009,  128001]
  }'

To enable autoscaling, you could replace the replicas with the following configs in service:

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2

This will autoscale the service between 2 and 4 replicas, scaling up when the QPS exceeds 2 per replica.

<details>
<summary>Click to see the full recipe YAML</summary>
service:
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log
</details>

To update the service with the new config:

HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN

To stop the service:

sky serve down vllm

Optional: Connect a GUI to the endpoint

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI will be load-balanced across replicas.

<details>
<summary>Click to see the full GUI YAML</summary>
envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

resources:
  cpus: 2

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate vllm
  export PATH=$PATH:/sbin

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 \
    --stop-token-ids 128009,128001 | tee ~/gradio.log
</details>
  1. Start the chat web UI:

    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
  2. Then, we can access the GUI at the returned gradio link:

    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

source/deployment/frameworks/triton.md

(deployment-triton)=

NVIDIA Triton

The Triton Inference Server hosts a tutorial demonstrating how to quickly deploy a simple facebook/opt-125m model using vLLM. Please see Deploying a vLLM model in Triton for more details.


source/deployment/integrations/index.md

External Integrations

:maxdepth: 1

kserve
kubeai
llamastack

source/deployment/integrations/kserve.md

(deployment-kserve)=

KServe

vLLM can be deployed with KServe on Kubernetes for highly scalable distributed model serving.

Please see this guide for more details on using vLLM with KServe.


source/deployment/integrations/kubeai.md

(deployment-kubeai)=

KubeAI

KubeAI is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

Please see the Installation Guides for environment-specific instructions:

Once you have KubeAI installed, you can configure text generation models using vLLM.


source/deployment/integrations/llamastack.md

(deployment-llamastack)=

Llama Stack

vLLM is also available via Llama Stack.

To install Llama Stack, run:

pip install llama-stack -q

Inference using OpenAI Compatible API

Then start the Llama Stack server, pointing it to your vLLM server, with the following configuration:

inference:
  - provider_id: vllm0
    provider_type: remote::vllm
    config:
      url: http://127.0.0.1:8000

Please refer to this guide for more details on this remote vLLM provider.

Inference via Embedded vLLM

An inline vLLM provider is also available. This is a sample configuration using that method:

inference:
  - provider_type: vllm
    config:
      model: Llama3.1-8B-Instruct
      tensor_parallel_size: 4

source/deployment/k8s.md

(deployment-k8s)=

Using Kubernetes

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

Prerequisites

Before you begin, ensure that you have the following:

  • A running Kubernetes cluster
  • NVIDIA Kubernetes Device Plugin (k8s-device-plugin): This can be found at https://github.com/NVIDIA/k8s-device-plugin/
  • Available GPU resources in your cluster

Deployment Steps

  1. Create a PVC, Secret and Deployment for vLLM

    The PVC is used to store the model cache and is optional; you can use hostPath or other storage options instead.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem

    The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: default
    type: Opaque
    stringData:
      token: "REPLACE_WITH_TOKEN"

    Next, create the deployment file for vLLM to run the model server. The following example deploys the Mistral-7B-Instruct-v0.3 model.

    Here are two examples, one for an NVIDIA GPU and one for an AMD GPU.

    NVIDIA GPU:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5

    AMD GPU:

    You can refer to the deployment.yaml below if you are using an AMD ROCm GPU such as the MI300X.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          # PVC
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "8Gi"
          hostNetwork: true
          hostIPC: true
          containers:
          - name: mistral-7b
            image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
            securityContext:
              seccompProfile:
                type: Unconfined
              runAsGroup: 44
              capabilities:
                add:
                - SYS_PTRACE
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                amd.com/gpu: "1"
              requests:
                cpu: "6"
                memory: 6G
                amd.com/gpu: "1"
            volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm

    You can get the full example with steps and sample yaml files from https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.

  2. Create a Kubernetes Service for vLLM

    Next, create a Kubernetes Service file to expose the mistral-7b deployment:

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      # The label selector should match the deployment labels & it is useful for prefix caching feature
      selector:
        app: mistral-7b
      sessionAffinity: None
      type: ClusterIP
  3. Deploy and Test

    Apply the deployment and service configurations using kubectl apply -f <filename>:

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml

    To test the deployment, run the following curl command:

    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'

    If the service is correctly deployed, you should receive a response from the vLLM model.

Conclusion

Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.


source/deployment/nginx.md

(nginxloadbalancer)=

Using Nginx

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

Table of contents:

  1. Build Nginx Container
  2. Create Simple Nginx Config file
  3. Build vLLM Container
  4. Create Docker Network
  5. Launch vLLM Containers
  6. Launch Nginx
  7. Verify That vLLM Servers Are Ready

(nginxloadbalancer-nginx-build)=

Build Nginx Container

This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.

export vllm_root=`pwd`

Create a file named Dockerfile.nginx:

FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

Build the container:

docker build . -f Dockerfile.nginx --tag nginx-lb

(nginxloadbalancer-nginx-conf)=

Create Simple Nginx Config file

Create a file named nginx_conf/nginx.conf. Note that you can add as many servers as you'd like. In the example below, we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.

upstream backend {
    least_conn;
    server vllm0:8000 max_fails=3 fail_timeout=10000s;
    server vllm1:8000 max_fails=3 fail_timeout=10000s;
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

(nginxloadbalancer-nginx-vllm-container)=

Build vLLM Container

cd $vllm_root
docker build -f Dockerfile . --tag vllm

If you are behind a proxy, you can pass the proxy settings to the docker build command as shown below:

cd $vllm_root
docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy

(nginxloadbalancer-nginx-docker-network)=

Create Docker Network

docker network create vllm_nginx

(nginxloadbalancer-nginx-launch-container)=

Launch vLLM Containers

Notes:

  • If you have your HuggingFace models cached somewhere else, update hf_cache_dir below.
  • If you don't have an existing HuggingFace cache, you will want to start vllm0 and wait for the model to finish downloading and the server to be ready. This ensures that vllm1 can leverage the model you just downloaded and won't have to download it again.
  • The example below assumes a GPU backend is used. If you are using the CPU backend, remove --gpus all and add the VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND environment variables to the docker run command.
  • Adjust the model name that you want to use in your vLLM servers if you don't want to use Llama-2-7b-chat-hf.
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
If you are behind a proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.

(nginxloadbalancer-nginx-launch-nginx)=

Launch Nginx

docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest

(nginxloadbalancer-nginx-verify-nginx)=

Verify That vLLM Servers Are Ready

docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn

Both outputs should look like this:

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

source/design/arch_overview.md

(arch-overview)=

Architecture Overview

This document provides an overview of the vLLM architecture.

:depth: 2
:local: true

Entrypoints

vLLM provides a number of entrypoints for interacting with the system. The following diagram shows the relationship between them.

:alt: Entrypoints Diagram

LLM Class

The LLM class provides the primary Python interface for doing offline inference, which is interacting with a model without using a separate model inference server.

Here is a sample of LLM class usage:

from vllm import LLM, SamplingParams

# Define a list of input prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The largest ocean is",
]

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM engine with the OPT-125M model
llm = LLM(model="facebook/opt-125m")

# Generate outputs for the input prompts
outputs = llm.generate(prompts, sampling_params)

# Print the generated outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

More API details can be found in the {doc}Offline Inference </api/offline_inference/index> section of the API docs.

The code for the LLM class can be found in gh-file:vllm/entrypoints/llm.py.

OpenAI-Compatible API Server

The second primary interface to vLLM is via its OpenAI-compatible API server. This server can be started using the vllm serve command.

vllm serve <model>

The code for the vllm CLI can be found in gh-file:vllm/scripts.py.

Sometimes you may see the API server entrypoint used directly instead of via the vllm CLI command. For example:

python -m vllm.entrypoints.openai.api_server --model <model>

That code can be found in gh-file:vllm/entrypoints/openai/api_server.py.

More details on the API server can be found in the OpenAI-Compatible Server document.
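
Once the server is running (by default on port 8000), it can be queried with the OpenAI Python client. The following is a minimal sketch; the base URL, API key, prompt, and `<model>` placeholder are examples:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="<model>",  # the same model name the server was started with
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)
```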

LLM Engine

The LLMEngine and AsyncLLMEngine classes are central to the functioning of the vLLM system, handling model inference and asynchronous request processing.

:alt: LLMEngine Diagram

LLMEngine

The LLMEngine class is the core component of the vLLM engine. It is responsible for receiving requests from clients and generating outputs from the model. The LLMEngine includes input processing, model execution (possibly distributed across multiple hosts and/or GPUs), scheduling, and output processing.

  • Input Processing: Handles tokenization of input text using the specified tokenizer.
  • Scheduling: Chooses which requests are processed in each step.
  • Model Execution: Manages the execution of the language model, including distributed execution across multiple GPUs.
  • Output Processing: Processes the outputs generated by the model, decoding the token IDs from a language model into human-readable text.

The code for LLMEngine can be found in gh-file:vllm/engine/llm_engine.py.
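
The following is a rough sketch of driving LLMEngine directly (the model name is an example, and exact method signatures may vary across vLLM versions; the higher-level LLM class wraps this loop for you):

```python
from vllm import SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request(
    request_id="request-0",
    prompt="Hello, my name is",
    params=SamplingParams(max_tokens=16),
)

# Each step schedules a batch of requests, runs the model, and returns outputs.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)
```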

AsyncLLMEngine

The AsyncLLMEngine class is an asynchronous wrapper for the LLMEngine class. It uses asyncio to create a background loop that continuously processes incoming requests. The AsyncLLMEngine is designed for online serving, where it can handle multiple concurrent requests and stream outputs to clients.

The OpenAI-compatible API server uses the AsyncLLMEngine. There is also a demo API server that serves as a simpler example in gh-file:vllm/entrypoints/api_server.py.

The code for AsyncLLMEngine can be found in gh-file:vllm/engine/async_llm_engine.py.
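
A rough sketch of streaming generation with AsyncLLMEngine is shown below (the model name is an example, and exact signatures may vary across vLLM versions):

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main():
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=32)

    # generate() returns an async generator that yields progressively longer outputs.
    async for request_output in engine.generate("Hello, my name is", params, request_id="request-0"):
        print(request_output.outputs[0].text)


asyncio.run(main())
```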

Worker

A worker is a process that runs the model inference. vLLM follows the common practice of using one process to control one accelerator device, such as GPUs. For example, if we use tensor parallelism of size 2 and pipeline parallelism of size 2, we will have 4 workers in total. Workers are identified by their rank and local_rank. rank is used for global orchestration, while local_rank is mainly used for assigning the accelerator device and accessing local resources such as the file system and shared memory.
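
As a conceptual illustration (not vLLM code; the exact rank-to-device mapping may differ), the snippet below shows how 4 workers arise with tensor parallelism of size 2 and pipeline parallelism of size 2 on a single node:

```python
tensor_parallel_size = 2
pipeline_parallel_size = 2
world_size = tensor_parallel_size * pipeline_parallel_size  # 4 workers in total

for rank in range(world_size):
    local_rank = rank  # on a single 4-GPU node, each worker gets its own device
    tp_rank = rank % tensor_parallel_size
    pp_rank = rank // tensor_parallel_size
    print(f"rank={rank} local_rank={local_rank} tp_rank={tp_rank} pp_rank={pp_rank}")
```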

Model Runner

Every worker has one model runner object, responsible for loading and running the model. Much of the model execution logic resides here, such as preparing input tensors and capturing cudagraphs.

Model

Every model runner object has one model object, which is the actual torch.nn.Module instance. See huggingface_integration for how various configurations affect the class we ultimately get.

Class Hierarchy

The following figure shows the class hierarchy of vLLM:

:align: center
:alt: query
:width: 100%

There are several important design choices behind this class hierarchy:

1. Extensibility: All classes in the hierarchy accept a configuration object containing all the necessary information. The VllmConfig class is the main configuration object that is passed around. The class hierarchy is quite deep, and every class needs to read the configuration it is interested in. By encapsulating all configurations in one object, we can easily pass the configuration object around and access the configuration we need. Suppose we want to add a new feature (this is often the case given how fast the field of LLM inference is evolving) that only touches the model runner. We will have to add a new configuration option in the VllmConfig class. Since we pass the whole config object around, we only need to add the configuration option to the VllmConfig class, and the model runner can access it directly. We don't need to change the constructor of the engine, worker, or model class to pass the new configuration option.

2. Uniformity: The model runner needs a unified interface to create and initialize the model. vLLM supports more than 50 types of popular open-source models. Each model has its own initialization logic. If the constructor signature varies with models, the model runner does not know how to call the constructor accordingly, without complicated and error-prone inspection logic. By making the constructor of the model class uniform, the model runner can easily create and initialize the model without knowing the specific model type. This is also useful for composing models. Vision-language models often consist of a vision model and a language model. By making the constructor uniform, we can easily create a vision model and a language model and compose them into a vision-language model.

To support this change, all vLLM models' signatures have been updated to:

```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
```

To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:

```python
class MyOldModel(nn.Module):
    def __init__(
        self,
        config,
        cache_config: Optional[CacheConfig] = None,
        quant_config: Optional[QuantizationConfig] = None,
        lora_config: Optional[LoRAConfig] = None,
        prefix: str = "",
    ) -> None:
        ...

from vllm.config import VllmConfig
class MyNewModel(MyOldModel):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        config = vllm_config.model_config.hf_config
        cache_config = vllm_config.cache_config
        quant_config = vllm_config.quant_config
        lora_config = vllm_config.lora_config
        super().__init__(config, cache_config, quant_config, lora_config, prefix)

if __version__ >= "0.6.4":
    MyModel = MyNewModel
else:
    MyModel = MyOldModel
```

This way, the model can work with both old and new versions of vLLM.

3. Sharding and Quantization at Initialization: Certain features require changing the model weights. For example, tensor parallelism needs to shard the model weights, and quantization needs to quantize the model weights. There are two possible ways to implement this feature. One way is to change the model weights after the model is initialized. The other way is to change the model weights during the model initialization. vLLM chooses the latter. The first approach is not scalable to large models. Suppose we want to run a 405B model (with roughly 810GB weights) with 16 H100 80GB GPUs. Ideally, every GPU should only load 50GB weights. If we change the model weights after the model is initialized, we need to load the full 810GB weights to every GPU and then shard the weights, leading to a huge memory overhead. Instead, if we shard the weights during the model initialization, every layer will only create a shard of the weights it needs, leading to a much smaller memory overhead. The same idea applies to quantization. Note that we also add an additional argument prefix to the model's constructor so that the model can initialize itself differently based on the prefix. This is useful for non-uniform quantization, where different parts of the model are quantized differently. The prefix is usually an empty string for the top-level model and a string like "vision" or "language" for the sub-models. In general, it matches the name of the module's state dict in the checkpoint file.

One disadvantage of this design is that it is hard to write unit tests for individual components in vLLM because every component needs to be initialized by a complete config object. We solve this problem by providing a default initialization function that creates a default config object with all fields set to None. If the component we want to test only cares about a few fields in the config object, we can create a default config object and set the fields we care about. This way, we can test the component in isolation. Note that many tests in vLLM are end-to-end tests that test the whole system, so this is not a big problem.

In summary, the complete config object VllmConfig can be treated as an engine-level global state that is shared among all vLLM classes.


source/design/automatic_prefix_caching.md

(design-automatic-prefix-caching)=

Automatic Prefix Caching

The core idea of PagedAttention is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

hash(prefix tokens + block tokens) <--> KV Block

With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
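
The following is a conceptual sketch of this hash-based mapping (not vLLM's actual implementation; the block size and hash function are illustrative):

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative)

def block_hash(prefix_tokens: tuple, block_tokens: tuple) -> int:
    # A KV block is identified by its own tokens plus all prefix tokens before it.
    return hash((prefix_tokens, block_tokens))

def block_hashes(token_ids: list) -> list:
    hashes = []
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(num_full_blocks):
        prefix = tuple(token_ids[: i * BLOCK_SIZE])
        block = tuple(token_ids[i * BLOCK_SIZE : (i + 1) * BLOCK_SIZE])
        hashes.append(block_hash(prefix, block))
    return hashes

# Two requests that share a prefix map their shared blocks to the same hash,
# so a global hash table can point both to the same physical KV block.
request_a = [1, 2, 3, 4, 5, 6, 7, 8]
request_b = [1, 2, 3, 4, 9, 10, 11, 12]
print(block_hashes(request_a)[0] == block_hashes(request_b)[0])  # True: shared first block
print(block_hashes(request_a)[1] == block_hashes(request_b)[1])  # False: divergent second block
```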

This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed individually, which enables us to manage the KV cache like ordinary caches in operating systems.

Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.

Managing the KV cache with a hash table allows us to implement flexible caching policies. As an example, in the current vLLM, we implement the following eviction policy (a simplified sketch follows the list):

  • When there are no free blocks left, we evict a KV block whose reference count (i.e., the number of current requests using the block) equals 0.
  • If there are multiple blocks with a reference count of 0, we prioritize evicting the least recently used (LRU) block.
  • If there are multiple blocks whose last access times are the same, we prioritize evicting the block at the end of the longest prefix (i.e., the one with the maximum number of blocks before it).
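A simplified sketch of this policy in plain Python (not vLLM's evictor; the field names are assumptions):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Block:
        ref_count: int      # number of current requests using the block
        last_access: float  # timestamp of the last access
        prefix_len: int     # number of blocks before this one in its sequence

    def pick_victim(blocks: List[Block]) -> Block:
        """Evict among ref_count == 0 blocks: least recently used first,
        breaking ties by the longest prefix (deepest block in its chain)."""
        candidates = [b for b in blocks if b.ref_count == 0]
        if not candidates:
            raise RuntimeError("no evictable block")
        return min(candidates, key=lambda b: (b.last_access, -b.prefix_len))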

Note that, when applied to models with full attention, this eviction policy effectively implements the same policy as RadixAttention, which prioritizes evicting leaf nodes in the prefix tree that have a reference count of zero and are least recently used.

However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and to implement eviction policies beyond the one above:

  • Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply include the LoRA ID of the request in the hash of each KV block to enable caching across all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
  • Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities, for example, perceptual hashing for images so that similar input images share cache entries.

source/design/huggingface_integration.md

(huggingface-integration)=

Integration with HuggingFace

This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run vllm serve.

Let's say we want to serve the popular Qwen model by running vllm serve Qwen/Qwen2-7B.

  1. The model argument is Qwen/Qwen2-7B. vLLM determines whether this model exists by checking for the corresponding config file config.json. See this code snippet for the implementation. Within this process:

    • If the model argument corresponds to an existing local path, vLLM will load the config file directly from this path.
    • If the model argument is a HuggingFace model ID consisting of a username and model name, vLLM will first try to use the config file from the HuggingFace local cache, using the model argument as the model name and the --revision argument as the revision. See their website for more information on how the HuggingFace cache works.
    • If the model argument is a HuggingFace model ID but it is not found in the cache, vLLM will download the config file from the HuggingFace model hub. Refer to this function for the implementation. The input arguments include the model argument as the model name, the --revision argument as the revision, and the environment variable HF_TOKEN as the token to access the model hub. In our case, vLLM will download the config.json file.
  2. After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this code snippet for the implementation.

  3. Next, vLLM inspects the model_type field in the config dictionary to generate the config object to use. There are some model_type values that vLLM directly supports; see here for the list. If the model_type is not in the list, vLLM will use AutoConfig.from_pretrained to load the config class, with model, --revision, and --trust_remote_code as the arguments (see the sketch after this list). Please note that:

    • HuggingFace also has its own logic to determine the config class to use. It will again use the model_type field to search for the class name in the transformers library; see here for the list of supported models. If the model_type is not found, HuggingFace will use the auto_map field from the config JSON file to determine the class name. Specifically, it is the AutoConfig field under auto_map. See DeepSeek for an example.
    • The AutoConfig field under auto_map points to a module path in the model's repository. To create the config class, HuggingFace will import the module and use the from_pretrained method to load the config class. This can lead to arbitrary code execution, so it is only done when --trust_remote_code is enabled.
  4. Subsequently, vLLM applies some historical patches to the config object. These are mostly related to RoPE configuration; see here for the implementation.

  5. Finally, vLLM can reach the model class we want to initialize. vLLM uses the architectures field in the config object to determine the model class to initialize, as it maintains the mapping from architecture name to model class in its registry. If the architecture name is not found in the registry, it means this model architecture is not supported by vLLM. For Qwen/Qwen2-7B, the architectures field is ["Qwen2ForCausalLM"], which corresponds to the Qwen2ForCausalLM class in vLLM's code. This class will initialize itself depending on various configs.
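The transformers calls involved look roughly like this (a sketch of the lookup behavior, not vLLM's exact code):

    from typing import Optional
    from transformers import AutoConfig

    def load_hf_config(model: str, revision: Optional[str] = None,
                       trust_remote_code: bool = False):
        # AutoConfig resolves `model_type` from config.json. With
        # trust_remote_code=True it may import code referenced by the
        # `auto_map.AutoConfig` entry in the model repository, which is
        # why vLLM gates this behind --trust-remote-code.
        return AutoConfig.from_pretrained(
            model, revision=revision, trust_remote_code=trust_remote_code
        )

    config = load_hf_config("Qwen/Qwen2-7B")
    print(config.architectures)  # ['Qwen2ForCausalLM']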

Beyond that, there are two more things vLLM depends on HuggingFace for.

  1. Tokenizer: vLLM uses the tokenizer from HuggingFace to tokenize the input text. The tokenizer is loaded using AutoTokenizer.from_pretrained with the model argument as the model name and the --revision argument as the revision (see the sketch after this list). It is also possible to use a tokenizer from another model by specifying the --tokenizer argument in the vllm serve command. Other relevant arguments are --tokenizer-revision and --tokenizer-mode. Please check HuggingFace's documentation for the meaning of these arguments. This part of the logic can be found in the get_tokenizer function. Notably, after obtaining the tokenizer, vLLM caches some of its expensive attributes in get_cached_tokenizer.

  2. Model weights: vLLM downloads the model weights from the HuggingFace model hub using the model argument as the model name and the --revision argument as the revision. vLLM provides the argument --load-format to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass --load-format dummy to skip downloading the weights.

    • It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the documentation for more information on the safetensors format. This part of the logic can be found here.
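For reference, the tokenizer loading step looks roughly like this (a sketch, not vLLM's get_tokenizer itself):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen2-7B",  # defaults to the model argument unless --tokenizer is set
        revision=None,    # --tokenizer-revision (or --revision for the model)
    )
    print(tokenizer("Hello, vLLM!").input_ids)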

This completes the integration between vLLM and HuggingFace.

In summary, vLLM reads the config file config.json, the tokenizer, and the model weights from the HuggingFace model hub or a local directory. It uses the config class from either vLLM or HuggingFace transformers, or loads the config class from the model's repository.


source/design/kernel/paged_attention.md

(design-paged-attention)=

vLLM Paged Attention

  • Currently, vLLM utilizes its own implementation of a multi-head query attention kernel (csrc/attention/attention_kernels.cu). This kernel is designed to be compatible with vLLM's paged KV caches, where the key and value cache are stored in separate blocks (note that this block concept differs from the GPU thread block; in the rest of this document, I will refer to the vLLM paged attention block as "block" and to the GPU thread block as "thread block").
  • To achieve high performance, this kernel relies on a specially designed memory layout and access method, specifically when threads read data from global memory to shared memory. The purpose of this document is to provide a high-level, step-by-step explanation of the kernel implementation, aiding those who wish to learn about the vLLM multi-head query attention kernel. After going through this document, readers will likely have a better understanding and find it easier to follow the actual implementation.
  • Please note that this document may not cover all details, such as how to calculate the correct index for the corresponding data or the dot multiplication implementation. However, after reading this document and becoming familiar with the high-level logic flow, it should be easier for you to read the actual code and understand the details.

Inputs

  • The kernel function takes a list of arguments for the current thread to perform its assigned work. The three most important arguments are the input pointers q, k_cache, and v_cache, which point to query, key, and value data on global memory that need to be read and processed. The output pointer out points to global memory where the result should be written. These four pointers actually refer to multi-dimensional arrays, but each thread only accesses the portion of data assigned to it. I have omitted all other runtime parameters here for simplicity.

    template<
    typename scalar_t,
    int HEAD_SIZE,
    int BLOCK_SIZE,
    int NUM_THREADS,
    int PARTITION_SIZE = 0>
    __device__ void paged_attention_kernel(
    ... // Other side args.
    scalar_t* __restrict__ out,             // [num_seqs, num_heads, max_num_partitions, head_size]
    const scalar_t* __restrict__ q,         // [num_seqs, num_heads, head_size]
    const scalar_t* __restrict__ k_cache,   // [num_blocks, num_kv_heads, head_size/x, block_size, x]
    const scalar_t* __restrict__ v_cache,   // [num_blocks, num_kv_heads, head_size, block_size]
    ... // Other side args.
    )
  • There is also a list of template arguments above the function signature that are determined at compile time. scalar_t represents the data type of the query, key, and value data elements, such as FP16. HEAD_SIZE indicates the number of elements in each head. BLOCK_SIZE refers to the number of tokens in each block. NUM_THREADS denotes the number of threads in each thread block. PARTITION_SIZE represents the number of tensor parallel GPUs (for simplicity, we assume this is 0 and tensor parallelism is disabled).

  • With these arguments, we need to perform a sequence of preparations. This includes calculating the current head index, block index, and other necessary variables. However, for now, we can ignore these preparations and proceed directly to the actual calculations. It will be easier to understand them once we grasp the entire flow.

Concepts

  • Just before we dive into the calculation flow, I want to describe a few concepts that are needed for later sections. However, you may skip this section and return later if you encounter any confusing terminologies.
  • Sequence: A sequence represents a client request. For example, the data pointed to by q has a shape of [num_seqs, num_heads, head_size], which means that q points to a total of num_seqs query sequences. Since this kernel is a single-query attention kernel, each sequence only has one query token. Hence, num_seqs equals the total number of tokens processed in the batch.
  • Context: The context consists of the generated tokens from the sequence. For instance, ["What", "is", "your"] are the context tokens, and the input query token is "name". The model might generate the token "?".
  • Vec: The vec is a list of elements that are fetched and calculated together. For query and key data, the vec size (VEC_SIZE) is determined so that each thread group can fetch and calculate 16 bytes of data at a time. For value data, the vec size (V_VEC_SIZE) is determined so that each thread can fetch and calculate 16 bytes of data at a time. For example, if the scalar_t is FP16 (2 bytes) and THREAD_GROUP_SIZE is 2, the VEC_SIZE will be 4, while the V_VEC_SIZE will be 8.
  • Thread group: The thread group is a small group of threads (THREAD_GROUP_SIZE) that fetches and calculates one query token and one key token at a time. Each thread handles only a portion of the token data. The total number of elements processed by one thread group is referred to as x. For example, if the thread group contains 2 threads and the head size is 8, then thread 0 handles the query and key elements at indices 0, 2, 4, 6, while thread 1 handles the elements at indices 1, 3, 5, 7.
  • Block: The key and value cache data in vLLM are split into blocks. Each block stores data for a fixed number (BLOCK_SIZE) of tokens at one head. Each block may contain only a portion of the whole context's tokens. For example, if the block size is 16 and the head size is 128, then for one head, one block can store 16 * 128 = 2048 elements.
  • Warp: A warp is a group of 32 threads (WARP_SIZE) that execute simultaneously on a streaming multiprocessor (SM). In this kernel, each warp processes the calculation between one query token and the key tokens of one entire block at a time (it may process multiple blocks in multiple iterations). For example, if there are 4 warps and 6 blocks for one context, the assignment would be: warp 0 handles the 0th and 4th blocks, warp 1 handles the 1st and 5th blocks, warp 2 handles the 2nd block, and warp 3 handles the 3rd block.
  • Thread block: A thread block is a group of threads (NUM_THREADS) that can access the same shared memory. Each thread block contains multiple warps (NUM_WARPS), and in this kernel, each thread block processes the calculation between one query token and the key tokens of a whole context.
  • Grid: A grid is a collection of thread blocks and defines the shape of the collection. In this kernel, the shape is (num_heads, num_seqs, max_num_partitions). Therefore, each thread block only handles the calculation for one head, one sequence, and one partition.

Query

  • This section will introduce how query data is stored in memory and fetched by each thread. As mentioned above, each thread group fetches one query token data, while each thread itself only handles a part of one query token data. Within each warp, every thread group will fetch the same query token data, but will multiply it with different key token data.

    const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
    (Figure: Query data of one token at one head)
    
  • Each thread defines its own q_ptr, which points to the assigned query token data on global memory. For example, if VEC_SIZE is 4 and HEAD_SIZE is 128, q_ptr points to data that contains a total of 128 elements divided into 128 / 4 = 32 vecs.

    (Figure: `q_vecs` for one thread group)
    
    __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
  • Next, we need to read the global memory data pointed to by q_ptr into shared memory as q_vecs. It is important to note that each vec is assigned to a different row. For example, if THREAD_GROUP_SIZE is 2, thread 0 handles the 0th row of vecs, while thread 1 handles the 1st row. By reading the query data in this way, neighboring threads like thread 0 and thread 1 read neighboring memory, achieving memory coalescing to improve performance.

Key

  • Similar to the "Query" section, this section introduces memory layout and assignment for keys. While each thread group only handle one query token one kernel run, it may handle multiple key tokens across multiple iterations. Meanwhile, each warp will process multiple blocks of key tokens in multiple iterations, ensuring that all context tokens are processed by the entire thread group after the kernel run. In this context, "handle" refers to performing the dot multiplication between query data and key data.

    const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
                        + kv_head_idx * kv_head_stride
                        + physical_block_offset * x;
  • Unlike q_ptr, the k_ptr in each thread points to a different key token at each iteration. As shown above, k_ptr points to key token data in k_cache at the assigned block, head, and token.

    (Figure: Key data of all context tokens at one head)
    
  • The diagram above illustrates the memory layout for key data. It assumes that the BLOCK_SIZE is 16, HEAD_SIZE is 128, x is 8, THREAD_GROUP_SIZE is 2, and there are a total of 4 warps. Each rectangle represents all the elements for one key token at one head, which will be processed by one thread group. The left half shows the total 16 blocks of key token data for warp 0, while the right half represents the remaining key token data for other warps or iterations. Inside each rectangle, there are a total of 32 vecs (128 elements for one token) that will be processed by 2 threads (one thread group) separately.

    (Figure: `k_vecs` for one thread)
    
    K_vec k_vecs[NUM_VECS_PER_THREAD]
  • Next, we need to read the key token data from k_ptr and store it in register memory as k_vecs. We use register memory for k_vecs because it will only be accessed by one thread once, whereas q_vecs will be accessed by multiple threads multiple times. Each k_vecs will contain multiple vectors for later calculation. Each vec will be set at each inner iteration. The assignment of vecs allows neighboring threads in a warp to read neighboring memory together, which again promotes memory coalescing. For instance, thread 0 will read vec 0, while thread 1 will read vec 1. In the next inner loop, thread 0 will read vec 2, while thread 1 will read vec 3, and so on.

  • You may still be a little confused about the overall flow. Don't worry, please keep reading the next "QK" section. It will illustrate the query and key calculation flow in a clearer and higher-level manner.

QK

  • As shown in the pseudocode below, before the entire for-loop block, we fetch the query data for one token and store it in q_vecs. Then, in the outer for loop, we iterate through different k_ptrs that point to different tokens and prepare the k_vecs in the inner for loop. Finally, we perform the dot multiplication between q_vecs and each k_vecs.

    q_vecs = ...
    for ... {
       k_ptr = ...
       for ... {
          k_vecs[i] = ...
       }
       ...
       float qk = scale * Qk_dot<scalar_t, THREAD_GROUP_SIZE>::dot(q_vecs[thread_group_offset], k_vecs);
    }
  • As mentioned before, each thread only fetches part of the query and key token data at a time. However, a cross-thread-group reduction happens inside Qk_dot<>::dot, so the qk returned here is not just the partial dot product computed by one thread, but the full dot product between the entire query and key token data.

  • For example, if the value of HEAD_SIZE is 128 and THREAD_GROUP_SIZE is 2, each thread's k_vecs will contain a total of 64 elements. However, the returned qk is actually the result of the dot multiplication between 128 query elements and 128 key elements. If you want to learn more about the details of the dot multiplication and reduction, you may refer to the implementation of Qk_dot<>::dot. For the sake of simplicity, I will not cover it in this document.

Softmax

  • Next, we need to calculate the normalized softmax for all qks, as shown in the formula below, where each $x$ represents a qk. To do this, we must obtain the reduced value of qk_max ($m(x)$) and the exp_sum ($\ell(x)$) of all qks. The reduction should be performed across the entire thread block, encompassing the results between the query token and all context key tokens.

    :nowrap: true
    
    \begin{gather*}
    m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
    \quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
    \end{gather*}
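
For reference, here is the same numerically stable softmax in plain Python (subtract the max before exponentiating, then normalize); the kernel computes exactly these quantities, but with the reductions split across thread groups, warps, and the whole thread block:

    import math
    from typing import List

    def stable_softmax(qks: List[float]) -> List[float]:
        qk_max = max(qks)                           # m(x)
        exps = [math.exp(v - qk_max) for v in qks]  # f(x)
        exp_sum = sum(exps)                         # l(x)
        return [v / exp_sum for v in exps]          # softmax(x)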
    

qk_max and logits

  • Right after we get the qk result, we can set the temporary logits result with qk (in the end, logits should store the normalized softmax result). We can also compare and collect the qk_max over all the qks calculated by the current thread group.

    if (thread_group_offset == 0) {
       const bool mask = token_idx >= context_len;
       logits[token_idx - start_token_idx] = mask ? 0.f : qk;
       qk_max = mask ? qk_max : fmaxf(qk_max, qk);
    }
  • Please note that logits here is in shared memory, so each thread group sets the entries for its own assigned context tokens. Overall, the size of logits should be the number of context tokens.

    for (int mask = WARP_SIZE / 2; mask >= THREAD_GROUP_SIZE; mask /= 2) {
        qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
    }
    
    if (lane == 0) {
       red_smem[warp_idx] = qk_max;
    }
  • Then we need to get the reduced qk_max within each warp. The main idea is to let the threads in a warp communicate with each other to obtain the final max qk.

    for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
        qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
    }
    qk_max = VLLM_SHFL_SYNC(qk_max, 0);
  • Finally, we can get the reduced qk_max for the whole thread block by comparing the qk_max from all warps in the thread block. Then we need to broadcast the final result to every thread.

exp_sum

  • Similar to qk_max, we need to get the reduced sum value from the entire thread block too.

    for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
        float val = __expf(logits[i] - qk_max);
        logits[i] = val;
        exp_sum += val;
    }
    ...
    exp_sum = block_sum<NUM_WARPS>(&red_smem[NUM_WARPS], exp_sum);
  • First, sum all the exp values from each thread group and, at the same time, convert each entry of logits from qk to exp(qk - qk_max). Note that the qk_max here is already the max qk across the whole thread block. Then we can perform the reduction of exp_sum across the whole thread block, just like for qk_max.

    const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f);
    for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
       logits[i] *= inv_sum;
    }
  • Finally, with the reduced qk_max and exp_sum, we can obtain the final normalized softmax result as logits. This logits variable will be used for dot multiplication with the value data in later steps. Now, it should store the normalized softmax result of qk for all assigned context tokens.

Value

(Figure: Value data of all context tokens at one head)

(Figure: `logits_vec` for one thread)

(Figure: List of `v_vec` for one thread)
  • Now we need to retrieve the value data and perform the dot multiplication with logits. Unlike query and key, there is no thread group concept for value data. As shown in the diagram, and different from the key token memory layout, elements from the same column correspond to the same value token. For one block of value data, there are HEAD_SIZE rows and BLOCK_SIZE columns, which are split into multiple v_vecs.

  • Each thread always fetches V_VEC_SIZE elements from the same V_VEC_SIZE tokens at a time. As a result, a single thread retrieves multiple v_vecs from different rows and the same columns through multiple inner iterations. For each v_vec, it needs to be dot-multiplied with the corresponding logits_vec, which is also V_VEC_SIZE elements from logits. Overall, with multiple inner iterations, each warp will process one block of value tokens, and with multiple outer iterations, all of the context value tokens are processed.

    float accs[NUM_ROWS_PER_THREAD];
    for ... { // Iteration over different blocks.
        logits_vec = ...
        for ... { // Iteration over different rows.
            v_vec = ...
            ...
            accs[i] += dot(logits_vec, v_vec);
        }
    }
  • As shown in the above pseudo code, in the outer loop, similar to k_ptr, logits_vec iterates over different blocks and reads V_VEC_SIZE elements from logits. In the inner loop, each thread reads V_VEC_SIZE elements from the same tokens as a v_vec and performs dot multiplication. It is important to note that in each inner iteration, the thread fetches different head position elements for the same tokens. The dot result is then accumulated in accs. Therefore, each entry of accs is mapped to a head position assigned to the current thread.

  • For example, if BLOCK_SIZE is 16 and V_VEC_SIZE is 8, each thread fetches 8 value elements for 8 tokens at a time. Each element is from a different token at the same head position. If HEAD_SIZE is 128 and WARP_SIZE is 32, for each inner loop, a warp needs to fetch WARP_SIZE * V_VEC_SIZE = 256 elements. This means there are a total of 128 * 16 / 256 = 8 inner iterations for a warp to handle a whole block of value tokens. The accs array in each thread contains 8 elements, accumulated at 8 different head positions. For thread 0, the accs variable will have 8 elements, which are the 0th, 32nd, ..., 224th elements of a value head, accumulated from all 8 assigned tokens.

LV

  • Now, we need to perform reduction for accs within each warp. This process allows each thread to accumulate the accs for the assigned head positions of all tokens in one block.

    for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
       float acc = accs[i];
       for (int mask = NUM_V_VECS_PER_ROW / 2; mask >= 1; mask /= 2) {
          acc += VLLM_SHFL_XOR_SYNC(acc, mask);
       }
       accs[i] = acc;
    }
  • Next, we perform reduction for accs across all warps, allowing each thread to have the accumulation of accs for the assigned head positions over all context tokens. Please note that each accs in every thread only stores the accumulation for a portion of the elements of the entire head for all context tokens. However, overall, all results for the output have been calculated; they are just stored in different threads' register memory.

    float* out_smem = reinterpret_cast<float*>(shared_mem);
    for (int i = NUM_WARPS; i > 1; i /= 2) {
        // Upper warps write to shared memory.
        ...
            float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE];
            for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
                    ...
            dst[row_idx] = accs[i];
        }
    
        // Lower warps update the output.
            const float* src = &out_smem[warp_idx * HEAD_SIZE];
        for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
                    ...
            accs[i] += src[row_idx];
        }
    
            // Write out the accs.
    }

Output

  • Now we can write all of the calculated results from local register memory to the final output in global memory.

    scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
                    + head_idx * max_num_partitions * HEAD_SIZE
                    + partition_idx * HEAD_SIZE;
  • First, we need to define the out_ptr variable, which points to the start address of the assigned sequence and assigned head.

    for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
        const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
        if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
            from_float(*(out_ptr + row_idx), accs[i]);
        }
    }
  • Finally, we need to iterate over different assigned head positions and write out the corresponding accumulated result based on the out_ptr.


source/design/mm_processing.md

(mm-processing)=

Multi-Modal Data Processing

To enable various optimizations in vLLM, such as chunked prefill and prefix caching, we use {class}~vllm.multimodal.processing.BaseMultiModalProcessor to provide the correspondence between placeholder feature tokens (e.g. <image>) and multi-modal inputs (e.g. the raw input image) based on the outputs of the HF processor.

Here are the main features of {class}~vllm.multimodal.processing.BaseMultiModalProcessor:

Prompt Replacement Detection

One of the main responsibilities of the HF processor is to replace input placeholder tokens (e.g. <image> for a single image) with feature placeholder tokens (e.g. <image><image>...<image>, the number of which equals the feature size). The information about which tokens have been replaced is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.

In vLLM, this information is specified using {class}~vllm.multimodal.processing.PromptReplacement in {meth}~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements. Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.
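As a toy illustration of what prompt replacement means (plain Python, not the actual PromptReplacement API): each input placeholder is expanded into as many feature placeholder tokens as its multi-modal item produces features.

    def expand_placeholders(prompt: str, feature_sizes: list[int],
                            placeholder: str = "<image>") -> str:
        """Expand the i-th placeholder into feature_sizes[i] copies."""
        parts = prompt.split(placeholder)
        assert len(parts) == len(feature_sizes) + 1
        out = parts[0]
        for size, part in zip(feature_sizes, parts[1:]):
            out += placeholder * size + part
        return out

    print(expand_placeholders("Describe <image> and <image>.", [3, 2]))
    # Describe <image><image><image> and <image><image>.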

Tokenized Prompt Inputs

To enable tokenization in a separate process, we support passing input token IDs alongside multi-modal data.

The problem

Consider that HF processors follow these main steps:

  1. Tokenize the text
  2. Process multi-modal inputs
  3. Perform prompt replacement

And we require that:

  • For text + multi-modal inputs, apply all steps 1--3.
  • For tokenized + multi-modal inputs, apply only steps 2--3.

How can we achieve this without rewriting HF processors? We can try to call the HF processor several times on different inputs:

  • For text + multi-modal inputs, simply call the HF processor directly.
  • For tokenized + multi-modal inputs, call the processor only on the multi-modal inputs.

While HF processors support text + multi-modal inputs natively, this is not so for tokenized + multi-modal inputs: an error is thrown if the number of input placeholder tokens does not correspond to the number of multi-modal inputs.

Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.

(mm-dummy-text)=

Dummy text

We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_processor_inputs. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.

(mm-automatic-prompt-replacement)=

Automatic prompt replacement

We address the second issue by implementing model-agnostic code in {meth}~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_replacements to automatically replace input placeholder tokens with feature placeholder tokens based on the specification outputted by {meth}~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements.

Summary

With the help of dummy text and automatic prompt replacement, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main.

Processor Output Caching

Some HF processors, such as the one for Qwen2-VL, are very slow. To alleviate this problem, we cache the multi-modal outputs of the HF processor to avoid processing the same multi-modal input (e.g. an image) again.

When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.

Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of multi-modal inputs, so they can't be passed alongside the text prompt to the HF processor. Therefore, we process the text and multi-modal inputs separately, using dummy text to avoid HF errors. Since this skips HF's prompt replacement code, we apply automatic prompt replacement afterwards to keep the output tokens and multi-modal data consistent with each other.
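A simplified sketch of the caching idea (not vLLM's implementation): key each multi-modal item by a content hash, run the HF processor only on the cache misses in one batch, then merge.

    import hashlib
    from typing import Any, Callable, Dict, List

    _cache: Dict[str, Any] = {}

    def _key(item: bytes) -> str:
        return hashlib.sha256(item).hexdigest()

    def process_mm_items(items: List[bytes],
                         hf_process: Callable[[List[bytes]], List[Any]]) -> List[Any]:
        missing = [it for it in items if _key(it) not in _cache]
        if missing:
            # One batched call to the (slow) HF processor for the misses only.
            for it, out in zip(missing, hf_process(missing)):
                _cache[_key(it)] = out
        return [_cache[_key(it)] for it in items]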


source/design/multiprocessing.md

Python Multiprocessing

Debugging

Please see the Troubleshooting page for information on known issues and how to solve them.

Introduction

The source code references are to the state of the code at the time of writing, in December 2024.

The use of Python multiprocessing in vLLM is complicated by:

  • The use of vLLM as a library, with no control over the code that uses vLLM
  • Varying levels of incompatibilities between multiprocessing methods and vLLM dependencies

This document describes how vLLM deals with these challenges.

Multiprocessing Methods

Python multiprocessing methods include:

  • spawn - spawn a new Python process. This will be the default as of Python 3.14.

  • fork - Use os.fork() to fork the Python interpreter. This is the default in Python versions prior to 3.14.

  • forkserver - Spawn a server process that will fork a new process on request.

Tradeoffs

fork is the fastest method, but is incompatible with dependencies that use threads.

spawn is more compatible with dependencies, but can be problematic when vLLM is used as a library. If the consuming code does not use a __main__ guard (if __name__ == "__main__":), the code will be inadvertently re-executed when vLLM spawns a new process. This can lead to infinite recursion, among other problems.
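Concretely, code that uses vLLM as a library should be structured like this, so that a spawned child re-importing the main module does not re-run the engine construction (the model name is chosen purely for illustration):

    from vllm import LLM, SamplingParams

    def main():
        llm = LLM(model="facebook/opt-125m")
        outputs = llm.generate(["Hello, my name is"],
                               sampling_params=SamplingParams(max_tokens=16))
        print(outputs[0].outputs[0].text)

    if __name__ == "__main__":
        # Without this guard, a spawned worker re-importing this module
        # would construct another LLM and recurse.
        main()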

forkserver will spawn a new server process that will fork new processes on demand. This unfortunately has the same problem as spawn when vLLM is used as a library. The server process is created as a spawned new process, which will re-execute code not protected by a __main__ guard.

For both spawn and forkserver, the process must not depend on inheriting any global state as would be the case with fork.

Compatibility with Dependencies

Multiple vLLM dependencies indicate either a preference or a requirement for using spawn.

It is perhaps more accurate to say that there are known problems with using fork after initializing these dependencies.

Current State (v0)

The environment variable VLLM_WORKER_MULTIPROC_METHOD can be used to control which method is used by vLLM. The current default is fork.

When we know we own the process because the vllm command was used, we use spawn because it's the most widely compatible.

The multiproc_xpu_executor forces the use of spawn.

There are other miscellaneous places that hard-code the use of spawn.

Related PRs:

  • gh-pr:8823

Prior State in v1

There was an environment variable to control whether multiprocessing is used in the v1 engine core, VLLM_ENABLE_V1_MULTIPROCESSING. This defaulted to off.

When it was enabled, the v1 LLMEngine would create a new process to run the engine core.

It was off by default for all the reasons mentioned above - compatibility with dependencies and code using vLLM as a library.

Changes Made in v1

There is no easy solution with Python's multiprocessing that will work everywhere. As a first step, we can get v1 into a state where it makes a "best effort" choice of multiprocessing method to maximize compatibility.

  • Default to fork.
  • Use spawn when we know we control the main process (vllm was executed).
  • If we detect cuda was previously initialized, force spawn and emit a warning. We know fork will break, so this is the best we can do.

The case that is known to still break in this scenario is code using vLLM as a library that initializes cuda before calling vLLM. The warning we emit should instruct users to either add a __main__ guard or to disable multiprocessing.

If that known-failure case occurs, the user will see two messages that explain what is happening. First, a log message from vLLM:

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
    https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
    for more information.

Second, Python itself will raise an exception with a nice explanation:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Alternatives Considered

Detect if a __main__ guard is present

It has been suggested that we could behave better if we could detect whether code using vLLM as a library has a __main__ guard in place. This post on stackoverflow was from a library author facing the same question.

It is possible to detect whether we are in the original __main__ process or in a subsequent spawned process. However, it does not appear to be straightforward to detect whether a __main__ guard is present in the code.

This option has been discarded as impractical.

Use forkserver

At first it appears that forkserver is a nice solution to the problem. However, the way it works presents the same challenges that spawn does when vLLM is used as a library.

Force spawn all the time

One way to clean this up is to just force the use of spawn all the time and document that the use of a __main__ guard is required when using vLLM as a library. This would unfortunately break existing code and make vLLM harder to use, violating the desire to make the LLM class as easy as possible to use.

Instead of pushing this on our users, we will retain the complexity to do our best to make things work.

Future Work

We may want to consider a different worker management approach in the future that works around these challenges.

  1. We could implement something forkserver-like, but have the process manager be something we launch ourselves at startup by running our own subprocess with a custom entrypoint for worker management (i.e. launch a vllm-manager process).

  2. We can explore other libraries that may better suit our needs.


source/design/plugin_system.md

(plugin-system)=

vLLM's Plugin System

The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.

How Plugins Work in vLLM

Plugins are user-registered code that vLLM executes. Given vLLM's architecture, multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the load_general_plugins function in the vllm.plugins module. This function is called for every process created by vLLM before it starts any work.

How vLLM Discovers Plugins

vLLM's plugin system uses the standard Python entry_points mechanism. This mechanism allows developers to register functions in their Python packages for use by other packages. An example of a plugin:

# inside `setup.py` file
from setuptools import setup

setup(name='vllm_add_dummy_model',
      version='0.1',
      packages=['vllm_add_dummy_model'],
      entry_points={
          'vllm.general_plugins':
          ["register_dummy_model = vllm_add_dummy_model:register"]
      })

# inside `vllm_add_dummy_model.py` file
def register():
    from vllm import ModelRegistry

    if "MyLlava" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model("MyLlava",
                                        "vllm_add_dummy_model.my_llava:MyLlava")

For more information on adding entry points to your package, please check the official documentation.

Every plugin has three parts:

  1. Plugin group: The name of the entry point group. vLLM uses the entry point group vllm.general_plugins to register general plugins. This is the key of entry_points in the setup.py file. Always use vllm.general_plugins for vLLM's general plugins.
  2. Plugin name: The name of the plugin. This is the part before the = sign in each entry of the entry_points list; in the example above, the plugin name is register_dummy_model. Plugins can be filtered by their names using the VLLM_PLUGINS environment variable. To load only a specific plugin, set VLLM_PLUGINS to the plugin name (see the sketch after this list).
  3. Plugin value: The fully qualified name of the function to register in the plugin system. In the example above, the plugin value is vllm_add_dummy_model:register, which refers to a function named register in the vllm_add_dummy_model module.
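For reference, this is roughly how entry points in that group can be discovered and loaded with the standard library (a sketch of the mechanism, not the exact code of load_general_plugins; requires Python 3.10+ for the group keyword):

    import os
    from importlib.metadata import entry_points

    allowed = os.environ.get("VLLM_PLUGINS")  # optional comma-separated filter
    for ep in entry_points(group="vllm.general_plugins"):
        if allowed is None or ep.name in allowed.split(","):
            register_fn = ep.load()  # e.g. vllm_add_dummy_model:register
            register_fn()            # must be re-entrant (see the guidelines below)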

Types of supported plugins

  • General plugins (with group name vllm.general_plugins): The primary use case for these plugins is to register custom, out-of-the-tree models into vLLM. This is done by calling ModelRegistry.register_model to register the model inside the plugin function.

  • Platform plugins (with group name vllm.platform_plugins): The primary use case for these plugins is to register custom, out-of-the-tree platforms into vLLM. The plugin function should return None when the platform is not supported in the current environment, or the platform class's fully qualified name when the platform is supported.

Guidelines for Writing Plugins

  • Being re-entrant: The function specified in the entry point should be re-entrant, meaning it can be called multiple times without causing issues. This is necessary because the function might be called multiple times in some processes.

Compatibility Guarantee

vLLM guarantees the interface of documented plugins, such as ModelRegistry.register_model, will always be available for plugins to register models. However, it is the responsibility of plugin developers to ensure their plugins are compatible with the version of vLLM they are targeting. For example, "vllm_add_dummy_model.my_llava:MyLlava" should be compatible with the version of vLLM that the plugin targets. The interface for the model may change during vLLM's development.


source/features/automatic_prefix_caching.md

(automatic-prefix-caching)=

Automatic Prefix Caching

Introduction

Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).

Enabling APC in vLLM

Set enable_prefix_caching=True in vLLM engine to enable APC. Here is an example:

import time
from vllm import LLM, SamplingParams


# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID  | Name          | Age | Occupation    | Country       | Email                  | Phone Number   | Address                       |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1   | John Doe      | 29  | Engineer      | USA           | [email protected]   | 555-1234       | 123 Elm St, Springfield, IL  |
| 2   | Jane Smith    | 34  | Doctor        | Canada        | [email protected] | 555-5678       | 456 Oak St, Toronto, ON      |
| 3   | Alice Johnson | 27  | Teacher       | UK            | [email protected]    | 555-8765       | 789 Pine St, London, UK      |
| 4   | Bob Brown     | 45  | Artist        | Australia     | [email protected]      | 555-4321       | 321 Maple St, Sydney, NSW    |
| 5   | Carol White   | 31  | Scientist     | New Zealand   | [email protected]    | 555-6789       | 654 Birch St, Wellington, NZ |
| 6   | Dave Green    | 28  | Lawyer        | Ireland       | [email protected]     | 555-3456       | 987 Cedar St, Dublin, IE     |
| 7   | Emma Black    | 40  | Musician      | USA           | [email protected]     | 555-1111       | 246 Ash St, New York, NY     |
| 8   | Frank Blue    | 37  | Chef          | Canada        | [email protected]    | 555-2222       | 135 Spruce St, Vancouver, BC |
| 9   | Grace Yellow  | 50  | Engineer      | UK            | [email protected]    | 555-3333       | 864 Fir St, Manchester, UK   |
| 10  | Henry Violet  | 32  | Artist        | Australia     | [email protected]    | 555-4444       | 753 Willow St, Melbourne, VIC|
| 11  | Irene Orange  | 26  | Scientist     | New Zealand   | [email protected]    | 555-5555       | 912 Poplar St, Auckland, NZ  |
| 12  | Jack Indigo   | 38  | Teacher       | Ireland       | [email protected]     | 555-6666       | 159 Elm St, Cork, IE         |
| 13  | Karen Red     | 41  | Lawyer        | USA           | [email protected]    | 555-7777       | 357 Cedar St, Boston, MA     |
| 14  | Leo Brown     | 30  | Chef          | Canada        | [email protected]      | 555-8888       | 246 Oak St, Calgary, AB      |
| 15  | Mia Green     | 33  | Musician      | UK            | [email protected]      | 555-9999       | 975 Pine St, Edinburgh, UK   |
| 16  | Noah Yellow   | 29  | Doctor        | Australia     | [email protected]     | 555-0000       | 864 Birch St, Brisbane, QLD  |
| 17  | Olivia Blue   | 35  | Engineer      | New Zealand   | [email protected]   | 555-1212       | 753 Maple St, Hamilton, NZ   |
| 18  | Peter Black   | 42  | Artist        | Ireland       | [email protected]    | 555-3434       | 912 Fir St, Limerick, IE     |
| 19  | Quinn White   | 28  | Scientist     | USA           | [email protected]    | 555-5656       | 159 Willow St, Seattle, WA   |
| 20  | Rachel Red    | 31  | Teacher       | Canada        | [email protected]   | 555-7878       | 357 Poplar St, Ottawa, ON    |
| 21  | Steve Green   | 44  | Lawyer        | UK            | [email protected]    | 555-9090       | 753 Elm St, Birmingham, UK   |
| 22  | Tina Blue     | 36  | Musician      | Australia     | [email protected]     | 555-1213       | 864 Cedar St, Perth, WA      |
| 23  | Umar Black    | 39  | Chef          | New Zealand   | [email protected]     | 555-3435       | 975 Spruce St, Christchurch, NZ|
| 24  | Victor Yellow | 43  | Engineer      | Ireland       | [email protected]   | 555-5657       | 246 Willow St, Galway, IE    |
| 25  | Wendy Orange  | 27  | Artist        | USA           | [email protected]    | 555-7879       | 135 Elm St, Denver, CO       |
| 26  | Xavier Green  | 34  | Scientist     | Canada        | [email protected]   | 555-9091       | 357 Oak St, Montreal, QC     |
| 27  | Yara Red      | 41  | Teacher       | UK            | [email protected]     | 555-1214       | 975 Pine St, Leeds, UK       |
| 28  | Zack Blue     | 30  | Lawyer        | Australia     | [email protected]     | 555-3436       | 135 Birch St, Adelaide, SA   |
| 29  | Amy White     | 33  | Musician      | New Zealand   | [email protected]      | 555-5658       | 159 Maple St, Wellington, NZ |
| 30  | Ben Black     | 38  | Chef          | Ireland       | [email protected]      | 555-7870       | 246 Fir St, Waterford, IE    |
"""


def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")


# set enable_prefix_caching=True to enable APC
llm = LLM(
    model='lmsys/longchat-13b-16k',
    enable_prefix_caching=True
)

sampling_params = SamplingParams(temperature=0, max_tokens=100)

# Querying the age of John Doe
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)

# Querying the age of Zack Blue
# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)

Example workloads

We describe two example workloads where APC can provide a huge performance benefit:

  • Long document query, where the user repeatedly queries the same long document (e.g. software manual or annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document only once, and all future requests can avoid recomputing this long document by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
  • Multi-round conversation, where the user may chat with the application multiple times in the same chatting session. In this case, instead of processing the whole chatting history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.

Limits

APC in general does not reduce the performance of vLLM. That said, APC only reduces the time of processing the queries (the prefill phase) and does not reduce the time of generating new tokens (the decoding phase). So APC does not bring a performance gain when vLLM spends most of its time generating answers to the queries (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so the computation cannot be reused).


source/features/compatibility_matrix.md

(compatibility-matrix)=

Compatibility Matrix

The tables below show mutually exclusive features and their support on some hardware.

Check the ✗ entries with links to see the tracking issues for unsupported feature/hardware combinations.

Feature x Feature

<style>
  /* Make smaller to try to improve readability  */
  td {
    font-size: 0.8rem;
    text-align: center;
  }

  th {
    text-align: center;
    font-size: 0.8rem;
  }
</style>
   :header-rows: 1
   :stub-columns: 1
   :widths: auto

   * - Feature
     - [CP](#chunked-prefill)
     - [APC](#automatic-prefix-caching)
     - [LoRA](#lora-adapter)
     - <abbr title="Prompt Adapter">prmpt adptr</abbr>
     - [SD](#spec_decode)
     - CUDA graph
     - <abbr title="Pooling Models">pooling</abbr>
     - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
     - <abbr title="Logprobs">logP</abbr>
     - <abbr title="Prompt Logprobs">prmpt logP</abbr>
     - <abbr title="Async Output Processing">async output</abbr>
     - multi-step
     - <abbr title="Multimodal Inputs">mm</abbr>
     - best-of
     - beam-search
     - <abbr title="Guided Decoding">guided dec</abbr>
   * - [CP](#chunked-prefill)
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - [APC](#automatic-prefix-caching)
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - [LoRA](#lora-adapter)
     - [✗](gh-pr:9057)
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Prompt Adapter">prmpt adptr</abbr>
     - ✅
     - ✅
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - [SD](#spec_decode)
     - ✅
     - ✅
     - ✗
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - CUDA graph
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Pooling Models">pooling</abbr>
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
     -
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
     - ✗
     - [✗](gh-issue:7366)
     - ✗
     - ✗
     - [✗](gh-issue:7366)
     - ✅
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Logprobs">logP</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✗
     - ✅
     -
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Prompt Logprobs">prmpt logP</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - [✗](gh-pr:8199)
     - ✅
     - ✗
     - ✅
     - ✅
     -
     -
     -
     -
     -
     -
     -
   * - <abbr title="Async Output Processing">async output</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✗
     - ✅
     - ✗
     - ✗
     - ✅
     - ✅
     -
     -
     -
     -
     -
     -
   * - multi-step
     - ✗
     - ✅
     - ✗
     - ✅
     - ✗
     - ✅
     - ✗
     - ✗
     - ✅
     - [✗](gh-issue:8198)
     - ✅
     -
     -
     -
     -
     -
   * - <abbr title="Multimodal Inputs">mm</abbr>
     - ✅
     -  [✗](gh-pr:8348)
     -  [✗](gh-pr:7199)
     - ?
     - ?
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ?
     -
     -
     -
     -
   * - best-of
     - ✅
     - ✅
     - ✅
     - ✅
     - [✗](gh-issue:6137)
     - ✅
     - ✗
     - ✅
     - ✅
     - ✅
     - ?
     - [✗](gh-issue:7968)
     - ✅
     -
     -
     -
   * - beam-search
     - ✅
     - ✅
     - ✅
     - ✅
     - [✗](gh-issue:6137)
     - ✅
     - ✗
     - ✅
     - ✅
     - ✅
     - ?
     - [✗](gh-issue:7968)
     - ?
     - ✅
     -
     -
   * - <abbr title="Guided Decoding">guided dec</abbr>
     - ✅
     - ✅
     - ?
     - ?
     - ✅
     - ✅
     - ✗
     - ?
     - ✅
     - ✅
     - ✅
     - [✗](gh-issue:9893)
     - ?
     - ✅
     - ✅
     -

(feature-x-hardware)=

Feature x Hardware

   :header-rows: 1
   :stub-columns: 1
   :widths: auto

   * - Feature
     - Volta
     - Turing
     - Ampere
     - Ada
     - Hopper
     - CPU
     - AMD
   * - [CP](#chunked-prefill)
     - [✗](gh-issue:2729)
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - [APC](#automatic-prefix-caching)
     - [✗](gh-issue:3687)
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - [LoRA](#lora-adapter)
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - <abbr title="Prompt Adapter">prmpt adptr</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - [✗](gh-issue:8475)
     - ✅
   * - [SD](#spec_decode)
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - CUDA graph
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✗
     - ✅
   * - <abbr title="Pooling Models">pooling</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ?
   * - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✗
   * - <abbr title="Multimodal Inputs">mm</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - <abbr title="Logprobs">logP</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - <abbr title="Prompt Logprobs">prmpt logP</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - <abbr title="Async Output Processing">async output</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✗
     - ✗
   * - multi-step
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - [✗](gh-issue:8477)
     - ✅
   * - best-of
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - beam-search
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - <abbr title="Guided Decoding">guided dec</abbr>
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅

source/features/disagg_prefill.md

(disagg-prefill)=

Disaggregated Prefilling (experimental)

This page introduces the disaggregated prefilling feature in vLLM.

This feature is experimental and subject to change.

Why disaggregated prefilling?

Two main reasons:

  • Tuning time-to-first-token (TTFT) and inter-token latency (ITL) separately. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. tp and pp) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
  • Controlling tail ITL. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it is hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
Disaggregated prefill DOES NOT improve throughput.

Usage example

Please refer to examples/online_serving/disaggregated_prefill.sh for the example usage of disaggregated prefilling.

Benchmarks

Please refer to benchmarks/disagg_benchmarks/ for disaggregated prefilling benchmarks.

Development

We implement disaggregated prefilling by running two vLLM instances: one for prefill (the prefill instance) and one for decode (the decode instance). A connector then transfers the prefill KV caches and results from the prefill instance to the decode instance.

All disaggregated prefilling implementation is under vllm/distributed/kv_transfer.

Key abstractions for disaggregated prefilling:

  • Connector: allows the KV consumer to retrieve the KV caches of a batch of requests from the KV producer.
  • LookupBuffer: provides two APIs: insert KV cache and drop_select KV cache. The semantics of insert and drop_select are similar to SQL: insert puts a KV cache into the buffer, while drop_select returns the KV cache that matches the given condition and drops it from the buffer.
  • Pipe: a single-direction FIFO pipe for tensor transmission. It supports send_tensor and recv_tensor. A minimal sketch of these interfaces is shown after the note below.
`insert` is a non-blocking operation, while `drop_select` is a blocking operation.
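
To make these roles concrete, here is a minimal sketch of how the three abstractions could be expressed as interfaces. The class and method signatures below are illustrative only, not the actual vLLM definitions, which live under vllm/distributed/kv_transfer.

from abc import ABC, abstractmethod
from typing import Optional

import torch


class Pipe(ABC):
    """Single-direction FIFO pipe for tensor transmission (illustrative)."""

    @abstractmethod
    def send_tensor(self, tensor: Optional[torch.Tensor]) -> None:
        ...

    @abstractmethod
    def recv_tensor(self) -> Optional[torch.Tensor]:
        ...


class LookupBuffer(ABC):
    """Holds KV caches produced by the prefill instance (illustrative)."""

    @abstractmethod
    def insert(self, input_tokens: torch.Tensor, kv_cache: torch.Tensor) -> None:
        """Non-blocking: store the KV cache for a batch of requests."""
        ...

    @abstractmethod
    def drop_select(self, input_tokens: torch.Tensor) -> Optional[torch.Tensor]:
        """Blocking: return the KV cache matching the query and drop it from the buffer."""
        ...


class Connector(ABC):
    """Lets the KV consumer retrieve KV caches from the KV producer (illustrative)."""

    @abstractmethod
    def send_kv_caches(self, input_tokens: torch.Tensor, kv_cache: torch.Tensor) -> None:
        ...

    @abstractmethod
    def recv_kv_caches(self, input_tokens: torch.Tensor) -> Optional[torch.Tensor]:
        ...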

Here is a figure illustrating how the above 3 abstractions are organized:

Figure: Disaggregated prefilling abstractions

The workflow of disaggregated prefilling is as follows:

Figure: Disaggregated prefilling workflow

The buffer corresponds to the insert API in LookupBuffer, and drop_select corresponds to the drop_select API in LookupBuffer.

Third-party contributions

Disaggregated prefilling is highly related to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).

We recommend three implementation approaches:

  • Fully-customized connector: Implement your own Connector, and call third-party libraries to send and receive KV caches, and more (for example, editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
  • Database-like connector: Implement your own LookupBuffer and support the insert and drop_select APIs just like SQL.
  • Distributed P2P connector: Implement your own Pipe and support the send_tensor and recv_tensor APIs, just like torch.distributed.

source/features/lora.md

(lora-adapter)=

LoRA Adapters

This document shows you how to use LoRA adapters with vLLM on top of a base model.

LoRA adapters can be used with any vLLM model that implements {class}~vllm.model_executor.models.interfaces.SupportsLoRA.

Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save them locally with

from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

Then we instantiate the base model and pass in the enable_lora=True flag:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

We can now submit the prompts and call llm.generate with the lora_request parameter. The first parameter of LoRARequest is a human-identifiable name, the second is a globally unique ID for the adapter, and the third is the path to the LoRA adapter.

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
     "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
     "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)

Check out gh-file:examples/offline_inference/multilora_inference.py for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.

Serving LoRA Adapters

LoRA adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use --lora-modules {name}={path} {name}={path} to specify each LoRA module when we start the server:

vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.

The server entrypoint accepts all other LoRA configuration parameters (max_loras, max_lora_rank, max_cpu_loras, etc.), which will apply to all forthcoming requests. Upon querying the /models endpoint, we should see our LoRA along with its base model:

curl localhost:8000/v1/models | jq .
{
    "object": "list",
    "data": [
        {
            "id": "meta-llama/Llama-2-7b-hf",
            "object": "model",
            ...
        },
        {
            "id": "sql-lora",
            "object": "model",
            ...
        }
    ]
}

Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other LoRA adapter requests if they were provided and max_loras is set high enough).

The following is an example request:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sql-lora",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
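
The same request can also be made with the OpenAI Python client; the sketch below assumes the server started above is listening on localhost:8000:

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Select the LoRA adapter by passing its registered name as the model.
completion = client.completions.create(
    model="sql-lora",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)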

Dynamically serving LoRA Adapters

In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility to change models on-the-fly is needed.

Note: Enabling this feature in production environments is risky, as users may then participate in model adapter management.

To enable dynamic LoRA loading and unloading, ensure that the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING is set to True. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

Loading a LoRA Adapter:

To dynamically load a LoRA adapter, send a POST request to the /v1/load_lora_adapter endpoint with the necessary details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.

Example request to load a LoRA adapter:

curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
    "lora_name": "sql_adapter",
    "lora_path": "/path/to/sql-lora-adapter"
}'

Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter cannot be found or loaded, an appropriate error message will be returned.

Unloading a LoRA Adapter:

To unload a LoRA adapter that has been previously loaded, send a POST request to the /v1/unload_lora_adapter endpoint with the name or ID of the adapter to be unloaded.

Example request to unload a LoRA adapter:

curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
    "lora_name": "sql_adapter"
}'

New format for --lora-modules

In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:

--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/

This would only include the name and path for each LoRA module, but did not provide a way to specify a base_model_name. Now, you can specify a base_model_name alongside the name and path using JSON format. For example:

--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'

For backward compatibility, you can still use the old key-value format (name=path), but the base_model_name will remain unspecified in that case.

LoRA model lineage in model card

The new format of --lora-modules is mainly to support the display of parent model information in the model card. Here's an explanation of how the response below supports this:

  • The parent field of LoRA model sql-lora now links to its base model meta-llama/Llama-2-7b-hf. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
  • The root field points to the artifact location of the LoRA adapter.
$ curl http://localhost:8000/v1/models

{
    "object": "list",
    "data": [
        {
        "id": "meta-llama/Llama-2-7b-hf",
        "object": "model",
        "created": 1715644056,
        "owned_by": "vllm",
        "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
        "parent": null,
        "permission": [
            {
            .....
            }
        ]
        },
        {
        "id": "sql-lora",
        "object": "model",
        "created": 1715644056,
        "owned_by": "vllm",
        "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
        "parent": meta-llama/Llama-2-7b-hf,
        "permission": [
            {
            ....
            }
        ]
        }
    ]
}

source/features/quantization/auto_awq.md

(auto-awq)=

AutoAWQ

Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.

To create a new 4-bit quantized model, you can leverage AutoAWQ. Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%. The main benefits are lower latency and memory usage.

You can quantize your own models by installing AutoAWQ or picking one of the 400+ models on Huggingface.

pip install autoawq

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize mistralai/Mistral-7B-Instruct-v0.2:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

To run an AWQ model with vLLM, you can use TheBloke/Llama-2-7b-Chat-AWQ with the following command:

python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq

AWQ models are also supported directly through the LLM entrypoint:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

source/features/quantization/bnb.md

(bits-and-bytes)=

BitsAndBytes

vLLM now supports BitsAndBytes for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy. Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.

Below are the steps to utilize BitsAndBytes with vLLM.

pip install bitsandbytes>=0.45.0

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes quantized models on https://huggingface.co/models?other=bitsandbytes. Usually, these repositories include a config.json file with a quantization_config section.

Read quantized checkpoint

from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

Inflight quantization: load as 4bit quantization

from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

OpenAI Compatible Server

Append the following to your model arguments when serving a 4-bit quantized model:

--quantization bitsandbytes --load-format bitsandbytes

source/features/quantization/fp8.md

(fp8)=

FP8 W8A8

vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels. Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

Please visit the HF collection of quantized FP8 checkpoints of popular LLMs ready to use with vLLM.

The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:

  • E4M3: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and nan.
  • E5M2: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- inf, and nan. The tradeoff for the increased dynamic range is lower precision of the stored values.
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
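
You can confirm these numeric ranges directly with PyTorch. The quick check below assumes a PyTorch build that includes the FP8 dtypes:

import torch

# Inspect the representable range of each FP8 variant.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, min={info.min}")

# Expected output (approximately):
# torch.float8_e4m3fn: max=448.0, min=-448.0
# torch.float8_e5m2: max=57344.0, min=-57344.0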

Quick Start with Online Dynamic Quantization

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying --quantization="fp8" in the command line or setting quantization="fp8" in the LLM constructor.

In this mode, all Linear modules (except for the final lm_head) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.

from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.

Installation

To produce performant FP8 quantized models with vLLM, you'll need to install the llm-compressor library:

pip install llmcompressor

Quantization Process

The quantization process involves three main steps:

  1. Loading the model
  2. Applying quantization
  3. Evaluating accuracy in vLLM

1. Loading the Model

Load your model and tokenizer using the standard transformers AutoModel classes:

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

2. Applying Quantization

For FP8 quantization, we can recover accuracy with simple RTN quantization. We recommend targeting all Linear layers using the FP8_DYNAMIC scheme, which uses:

  • Static, per-channel quantization on the weights
  • Dynamic, per-token quantization on the activations

Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

3. Evaluating Accuracy

Install vllm and lm-evaluation-harness:

pip install vllm lm-eval==0.4.4

Load and run the model in vllm:

from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")

Evaluate accuracy with lm_eval (for example on 250 samples of gsm8k):

Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True \
  --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250

Here's an example of the resulting scores:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
|     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|

Troubleshooting and Support

If you encounter any issues or have feature requests, please open an issue on the vllm-project/llm-compressor GitHub repository.

Deprecated Flow

The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.

For static per-tensor offline quantization to FP8, please install the AutoFP8 library.

git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8

This package introduces the AutoFP8ForCausalLM and BaseQuantizeConfig objects for managing how your model will be compressed.

Offline Quantization with Static Activation Scaling Factors

You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the activation_scheme="static" argument.

from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

Your model checkpoint with quantized weights and activations should be available at Meta-Llama-3-8B-Instruct-FP8/. Finally, you can load the quantized model checkpoint directly in vLLM.

from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")

source/features/quantization/fp8_e4m3_kvcache.md

(fp8-e4m3-kvcache)=

FP8 E4M3 KV Cache

Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache, improving throughput. OCP (Open Compute Project www.opencompute.org) specifies two common 8-bit floating point data formats: E5M2 (5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened as E4M3. One benefit of the E4M3 format over E5M2 is that floating point numbers are represented in higher precision. However, the small dynamic range of FP8 E4M3 (±240.0 can be represented) typically necessitates the use of a higher-precision (typically FP32) scaling factor alongside each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling factors of a finer granularity (e.g. per-channel).

These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).

To install AMMO (AlgorithMic Model Optimization):

pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon offerings, e.g. AMD MI300 and NVIDIA Hopper or later, support native hardware conversion to and from fp32, fp16, bf16, etc. Thus, LLM inference is greatly accelerated with minimal accuracy loss.

Here is an example of how to enable this feature:

# two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
# https://github.com/vllm-project/vllm/blob/main/examples/other/fp8/README.md to generate kv_cache_scales.json of your own.

from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)

# output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
# output w/o scaling factors:  England, located in the southeastern part of the country. It is known

source/features/quantization/fp8_e5m2_kvcache.md

(fp8-kv-cache)=

FP8 E5M2 KV Cache

The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU memory benefits. The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each other.

Here is an example of how to enable this feature:

from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

source/features/quantization/gguf.md

(gguf)=

GGUF

Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.

To run a GGUF model with vLLM, you can download and use the local GGUF model from TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF with the following command:

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from the base model to avoid a time-consuming and potentially buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0

You can also add --tensor-parallel-size 2 to enable tensor parallelism inference with 2 GPUs:

# We recommend using the tokenizer from the base model to avoid a time-consuming and potentially buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.

You can also use the GGUF model directly through the LLM entrypoint:

from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
   {
      "role": "system",
      "content": "You are a helpful assistant"
   },
   {
      "role": "user",
      "content": "Hello"
   },
   {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
   },
   {
      "role": "user",
      "content": "Write an essay about the importance of higher education.",
   },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
         tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)

# Print the outputs.
for output in outputs:
   prompt = output.prompt
   generated_text = output.outputs[0].text
   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

source/features/quantization/index.md

(quantization-index)=

Quantization

Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

:caption: Contents
:maxdepth: 1

supported_hardware
auto_awq
bnb
gguf
int8
fp8
fp8_e5m2_kvcache
fp8_e4m3_kvcache

source/features/quantization/int8.md

(int8)=

INT8 W8A8

vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size while maintaining good performance.

Please visit the HF collection of quantized INT8 checkpoints of popular LLMs ready to use with vLLM.

INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).

Prerequisites

To use INT8 quantization with vLLM, you'll need to install the llm-compressor library:

pip install llmcompressor

Quantization Process

The quantization process involves four main steps:

  1. Loading the model
  2. Preparing calibration data
  3. Applying quantization
  4. Evaluating accuracy in vLLM

1. Loading the Model

Load your model and tokenizer using the standard transformers AutoModel classes:

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

2. Preparing Calibration Data

When quantizing activations to INT8, you need sample data to estimate the activation scales. It's best to use calibration data that closely matches your deployment data. For a general-purpose instruction-tuned model, you can use a dataset like ultrachat:

from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

3. Applying Quantization

Now, apply the quantization algorithms:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

This process creates a W8A8 model with weights and activations quantized to 8-bit integers.

4. Evaluating Accuracy

After quantization, you can load and run the model in vLLM:

from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

To evaluate accuracy, you can use lm_eval:

$ lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.

Best Practices

  • Start with 512 samples for calibration data (increase if accuracy drops)
  • Use a sequence length of 2048 as a starting point
  • Employ the chat template or instruction template that the model was trained with
  • If you've fine-tuned a model, consider using a sample of your training data for calibration

Troubleshooting and Support

If you encounter any issues or have feature requests, please open an issue on the vllm-project/llm-compressor GitHub repository.


source/features/quantization/supported_hardware.md

(quantization-supported-hardware)=

Supported Hardware

The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8

* - Implementation
  - Volta
  - Turing
  - Ampere
  - Ada
  - Hopper
  - AMD GPU
  - Intel GPU
  - x86 CPU
  - AWS Inferentia
  - Google TPU
* - AWQ
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - GPTQ
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - Marlin (GPTQ/AWQ/FP8)
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - INT8 (W8A8)
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✅︎
  - ✗
  - ✗
* - FP8 (W8A8)
  - ✗
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
* - AQLM
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - bitsandbytes
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - DeepSpeedFP
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - GGUF
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  • Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
  • "✅︎" indicates that the quantization method is supported on the specified hardware.
  • "✗" indicates that the quantization method is not supported on the specified hardware.
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.

source/features/spec_decode.md

(spec-decode)=

Speculative Decoding

Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.

This document shows how to use Speculative Decoding with vLLM. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.

Speculating with a draft model

The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

To perform the same in online mode, launch the server:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
    --num_speculative_tokens 5 --gpu_memory_utilization 0.8

Then use a client:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="The future of AI is",
    echo=False,
    n=1,
    stream=stream,
)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

Speculating by matching n-grams in the prompt

The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt. For more information read this thread.

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Speculating using MLP speculators

The following code configures vLLM to use speculative decoding where proposals are generated by draft models that condition draft predictions on both context vectors and sampled tokens. For more information see this blog or this technical report.

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="ibm-fms/llama3-70b-accelerator",
    speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that these speculative models currently need to be run without tensor parallelism, although it is possible to run the main model using tensor parallelism (see example above). Since the speculative models are relatively small, we still see significant speedups. However, this limitation will be fixed in a future release.

A variety of speculative models of this type are available on HF hub:

Speculating using EAGLE based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by an EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) based draft model.

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_model="path/to/modified/eagle/model",
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A few important things to consider when using the EAGLE based draft models:

  1. The EAGLE draft models available in the EAGLE HF repository cannot be used directly with vLLM due to differences in the expected layer names and model definition. To use these models with vLLM, use the following script to convert them. Note that this script does not modify the model's weights.

    In the above example, use the script to first convert the yuhuili/EAGLE-LLaMA3-Instruct-8B model and then use the converted checkpoint as the draft model in vLLM.

  2. The EAGLE based draft models need to be run without tensor parallelism (i.e. speculative_draft_tensor_parallel_size is set to 1), although it is possible to run the main model using tensor parallelism (see example above).

  3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is reported in the reference implementation here. This issue is under investigation and tracked here: vllm-project/vllm#9565.

A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
|---|---|---|
| Vicuna-7B-v1.3 | yuhuili/EAGLE-Vicuna-7B-v1.3 | 0.24B |
| Vicuna-13B-v1.3 | yuhuili/EAGLE-Vicuna-13B-v1.3 | 0.37B |
| Vicuna-33B-v1.3 | yuhuili/EAGLE-Vicuna-33B-v1.3 | 0.56B |
| LLaMA2-Chat 7B | yuhuili/EAGLE-llama2-chat-7B | 0.24B |
| LLaMA2-Chat 13B | yuhuili/EAGLE-llama2-chat-13B | 0.37B |
| LLaMA2-Chat 70B | yuhuili/EAGLE-llama2-chat-70B | 0.99B |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B |
| LLaMA3-Instruct 8B | yuhuili/EAGLE-LLaMA3-Instruct-8B | 0.25B |
| LLaMA3-Instruct 70B | yuhuili/EAGLE-LLaMA3-Instruct-70B | 0.99B |
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |

Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of speculative decoding, breaking down the guarantees into three key areas:

  1. Theoretical Losslessness - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might cause slight variations in output distributions, as discussed in Accelerating Large Language Model Decoding with Speculative Sampling.

  2. Algorithmic Losslessness - vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

    • Rejection Sampler Convergence: Ensures that samples from vLLM’s rejection sampler align with the target distribution. View Test Code
    • Greedy Sampling Equality: Confirms that greedy sampling with speculative decoding matches greedy sampling without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, provides a lossless guarantee. Almost all of the tests in gh-dir:tests/spec_decode/e2e verify this property using this assertion implementation.
  3. vLLM Logprob Stability - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the same request across runs. For more details, see the FAQ section titled Can the output of a prompt vary across runs in vLLM? in the FAQs.

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding can occur due to the following factors:

  • Floating-Point Precision: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
  • Batch Size and Numerical Stability: Changes in batch size may cause variations in logprobs and output probabilities, potentially due to non-deterministic behavior in batched operations or numerical instability.

For mitigation strategies, please refer to the FAQ entry Can the output of a prompt vary across runs in vLLM? in the FAQs.

Resources for vLLM contributors


source/features/structured_outputs.md

(structured-outputs)=

Structured Outputs

vLLM supports the generation of structured outputs using outlines, lm-format-enforcer, or xgrammar as backends for the guided decoding. This document shows you some examples of the different options that are available to generate structured outputs.

Online Serving (OpenAI API)

You can generate structured outputs using OpenAI's Completions and Chat API.

The following parameters are supported, which must be added as extra parameters:

  • guided_choice: the output will be exactly one of the choices.
  • guided_regex: the output will follow the regex pattern.
  • guided_json: the output will follow the JSON schema.
  • guided_grammar: the output will follow the context free grammar.
  • guided_whitespace_pattern: used to override the default whitespace pattern for guided json decoding.
  • guided_decoding_backend: used to select the guided decoding backend to use.

You can see the complete list of supported parameters on the OpenAI-Compatible Server page.

Now let's see an example for each of the cases, starting with guided_choice, as it's the easiest one:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)

The next example shows how to use the guided_regex. The idea is to generate an email address, given a simple regex template:

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: [email protected]\n",
        }
    ],
    extra_body={"guided_regex": "\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)

One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:

  • Using directly a JSON Schema
  • Defining a Pydantic model and then extracting the JSON Schema from it (which is normally an easier option).

The next example shows how to use the guided_json parameter with a Pydantic model:

from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
While not strictly necessary, it's usually better to indicate in the prompt that a JSON needs to be generated, and to specify which fields the LLM should fill and how.
This can improve the results notably in most cases.
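
If you prefer not to use Pydantic, you can also pass a hand-written JSON Schema to guided_json directly; the schema and prompt below are purely illustrative:

# A hand-written JSON Schema describing the desired output.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON describing a fictional person with a name and an age.",
        }
    ],
    extra_body={"guided_json": person_schema},
)
print(completion.choices[0].message.content)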

Finally we have guided_grammar, which is probably the most difficult one to use, but it's really powerful, as it allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:

simplified_sql_grammar = """
    ?start: select_statement

    ?select_statement: "SELECT " column_list " FROM " table_name

    ?column_list: column_name ("," column_name)*

    ?table_name: identifier

    ?column_name: identifier

    ?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
        }
    ],
    extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)

Full example: gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py

Experimental Automatic Parsing (OpenAI API)

This section covers the OpenAI beta wrapper over the client.chat.completions.create() method that provides richer integrations with Python specific types.

At the time of writing (openai==1.54.4), this is a "beta" feature in the OpenAI client library. Code reference can be found here.

For the following examples, vLLM was set up using vllm serve meta-llama/Llama-3.1-8B-Instruct

Here is a simple example demonstrating how to get structured output using Pydantic models:

from pydantic import BaseModel
from openai import OpenAI


class Info(BaseModel):
    name: str
    age: int


client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
    ],
    response_format=Info,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)

Output:

ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28

Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:

from typing import List
from pydantic import BaseModel
from openai import OpenAI


class Step(BaseModel):
    explanation: str
    output: str


class MathResponse(BaseModel):
    steps: List[Step]
    final_answer: str


client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful expert math tutor."},
        {"role": "user", "content": "Solve 8x + 31 = 2."},
    ],
    response_format=MathResponse,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
    print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)

Output:

ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8

Offline Inference

Offline inference allows for the same types of guided decoding. To use it, we'll need to configure the guided decoding using the class GuidedDecodingParams inside SamplingParams. The main available options inside GuidedDecodingParams are:

  • json
  • regex
  • choice
  • grammar
  • backend
  • whitespace_pattern

These parameters can be used in the same way as the parameters from the Online Serving examples above. One example of the usage of the choice parameter is shown below:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
    prompts="Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
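
The other options work the same way. For instance, here is a sketch of JSON-guided offline generation; it reuses the llm instance and imports from the example above, and the schema and prompt are illustrative:

# An illustrative JSON Schema for the guided output.
sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["Positive", "Negative"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

guided_decoding_params = GuidedDecodingParams(json=sentiment_schema)
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
    prompts="Classify this sentiment and reply in JSON: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)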

Full example: gh-file:examples/offline_inference/structured_outputs.py


source/features/tool_calling.md

Tool Calling

vLLM currently supports named function calling, as well as the auto and none options for the tool_choice field in the chat completion API. The tool_choice option required is not yet supported, but it is on the roadmap.

Quickstart

Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3 tool calling chat template from the vLLM examples directory:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template examples/tool_chat_template_llama3.1_json.jinja

Next, make a request to the model that should result in it using the available tools:

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."
tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")

Example output:

Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit...

This example demonstrates:

  • Setting up the server with tool calling enabled
  • Defining an actual function to handle tool calls
  • Making a request with tool_choice="auto"
  • Handling the structured response and executing the corresponding function

You can also specify a particular function using named function calling by setting tool_choice={"type": "function", "function": {"name": "get_weather"}}. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.

Remember that it's the caller's responsibility to:

  1. Define appropriate tools in the request
  2. Include relevant context in the chat messages
  3. Handle the tool calls in your application logic

For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.

Named Function Calling

vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a high-quality one.

vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the tools parameter. For best results, we recommend ensuring that the expected output format / schema is specified in the prompt to ensure that the model's intended generation is aligned with the schema that it's being forced to generate by the guided decoding backend.

To use a named function, you need to define the functions in the tools parameter of the chat completion request, and specify the name of one of the tools in the tool_choice parameter of the chat completion request.
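
Building on the Quickstart example above (same server, client, and tools definition), a named-function-call request could look like the following sketch:

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    # Force the model to call get_weather instead of letting it choose.
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")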

Automatic Function Calling

To enable this feature, you should set the following flags:

  • --enable-auto-tool-choice -- mandatory for auto tool choice. Tells vLLM that you want to enable the model to generate its own tool calls when it deems appropriate.
  • --tool-call-parser -- selects the tool parser to use (listed below). Additional tool parsers will continue to be added in the future, and you can also register your own tool parsers via --tool-parser-plugin.
  • --tool-parser-plugin -- optional tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser name can then be specified in --tool-call-parser.
  • --chat-template -- optional for auto tool choice. the path to the chat template which handles tool-role messages and assistant-role messages that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their tokenizer_config.json files, but you can specify a custom template. This argument can be set to tool_use if your model has a tool use-specific chat template configured in the tokenizer_config.json. In this case, it will be used per the transformers specification. More on this here from HuggingFace; and you can find an example of this in a tokenizer_config.json here

If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!

Hermes Models (hermes)

All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.

  • NousResearch/Hermes-2-Pro-*
  • NousResearch/Hermes-2-Theta-*
  • NousResearch/Hermes-3-*

Note that the Hermes 2 Theta models are known to have degraded tool call quality & capabilities due to the merge step in their creation.

Flags: --tool-call-parser hermes

Mistral Models (mistral)

Supported models:

  • mistralai/Mistral-7B-Instruct-v0.3 (confirmed)
  • Additional mistral function-calling models are compatible as well.

Known issues:

  1. Mistral 7B struggles to generate parallel tool calls correctly.
  2. Mistral's tokenizer_config.json chat template requires tool call IDs that are exactly 9 digits, which is much shorter than what vLLM generates. Since an exception is thrown when this condition is not met, the following additional chat templates are provided:
  • examples/tool_chat_template_mistral.jinja - this is the "official" Mistral chat template, but tweaked so that it works with vLLM's tool call IDs (provided tool_call_id fields are truncated to the last 9 digits)
  • examples/tool_chat_template_mistral_parallel.jinja - this is a "better" version that adds a tool-use system prompt when tools are provided, which results in much better reliability when working with parallel tool calling.

Recommended flags: --tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja

Llama Models (llama3_json)

Supported models:

  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • meta-llama/Meta-Llama-3.1-405B-Instruct
  • meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Only JSON-based tool calling is supported. For pythonic tool calling in Llama-3.2 models, see the pythonic tool parser below. Other tool calling formats, such as the built-in python tool calling or custom tool calling, are not supported.

Known issues:

  1. Parallel tool calls are not supported.
  2. The model can generate parameters in the wrong format, such as an array serialized as a string instead of an actual array.

The tool_chat_template_llama3_json.jinja file contains the "official" Llama chat template, but tweaked so that it works better with vLLM.

Recommended flags: --tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3_json.jinja

IBM Granite

Supported models:

  • ibm-granite/granite-3.0-8b-instruct

Recommended flags: --tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja

examples/tool_chat_template_granite.jinja: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.

  • ibm-granite/granite-3.1-8b-instruct

Recommended flags: --tool-call-parser granite

The chat template from Huggingface can be used directly. Parallel function calls are supported.

  • ibm-granite/granite-20b-functioncalling

Recommended flags: --tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja

examples/tool_chat_template_granite_20b_fc.jinja: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from the paper. Parallel function calls are supported.

InternLM Models (internlm)

Supported models:

  • internlm/internlm2_5-7b-chat (confirmed)
  • Additional internlm2.5 function-calling models are compatible as well

Known issues:

  • Although this implementation also supports InternLM2, the tool call results are not stable when testing with the internlm/internlm2-chat-7b model.

Recommended flags: --tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja

Jamba Models (jamba)

AI21's Jamba-1.5 models are supported.

  • ai21labs/AI21-Jamba-1.5-Mini
  • ai21labs/AI21-Jamba-1.5-Large

Flags: --tool-call-parser jamba

Models with Pythonic Tool Calls (pythonic)

A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The pythonic tool parser can support such models.

As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:

[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
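
For intuition, output in this format can be parsed with Python's standard ast module; the snippet below is a simplified illustration of the idea behind the pythonic parser, not vLLM's actual implementation:

```python
import ast

model_output = ("[get_weather(city='San Francisco', metric='celsius'), "
                "get_weather(city='Seattle', metric='celsius')]")

# Parse the pythonic tool-call list without executing anything.
tree = ast.parse(model_output, mode="eval").body  # an ast.List of ast.Call nodes
for call in tree.elts:
    name = call.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    print(name, kwargs)
# get_weather {'city': 'San Francisco', 'metric': 'celsius'}
# get_weather {'city': 'Seattle', 'metric': 'celsius'}
```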

Limitations:

  • The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
  • Llama's smaller models struggle to use tools effectively.

Example supported models:

  • meta-llama/Llama-3.2-1B-Instruct* (use with examples/tool_chat_template_llama3.2_pythonic.jinja)
  • meta-llama/Llama-3.2-3B-Instruct* (use with examples/tool_chat_template_llama3.2_pythonic.jinja)
  • Team-ACE/ToolACE-8B (use with examples/tool_chat_template_toolace.jinja)
  • fixie-ai/ultravox-v0_4-ToolACE-8B (use with examples/tool_chat_template_toolace.jinja)

Flags: --tool-call-parser pythonic --chat-template {see_above}


WARNING Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary.


How to write a tool parser plugin

A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the Hermes2ProToolParser in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.

Here is a summary of a plugin file:

# import the required packages
# (import paths below may vary slightly between vLLM versions)
from typing import Sequence, Union

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage,
                                              ExtractedToolCallInformation)
from vllm.entrypoints.openai.tool_parsers import ToolParser, ToolParserManager
from vllm.transformers_utils.tokenizer import AnyTokenizer

# define a tool parser and register it to vllm
# the name list in register_module can be used
# in --tool-call-parser. you can define as many
# tool parsers as you want here.
@ToolParserManager.register_module(["example"])
class ExampleToolParser(ToolParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # adjust the request, e.g. set skip_special_tokens to False
    # so that tool call tokens are preserved in the output
    def adjust_request(
            self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        return request

    # implement tool call parsing for streaming calls
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> Union[DeltaMessage, None]:
        # return a DeltaMessage describing the tool call fragment parsed
        # from this delta, or None if it contains no tool call content
        return None

    # implement tool call parsing for non-streaming calls
    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        # parse model_output here; this stub reports no tool calls and
        # returns the raw model output as regular content
        return ExtractedToolCallInformation(tools_called=False,
                                            tool_calls=[],
                                            content=model_output)

Then you can use this plugin on the command line like this:

    --enable-auto-tool-choice \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser example \
    --chat-template <your chat template>

source/getting_started/faq.md

(faq)=

Frequently Asked Questions

Q: How can I serve multiple models on a single port using the OpenAI API?

A: Assuming you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer to route incoming requests to the correct server.


Q: Which model to use for offline inference embedding?

A: You can try e5-mistral-7b-instruct and BAAI/bge-base-en-v1.5; more are listed here.

By extracting hidden states, vLLM can automatically convert text generation models like Llama-3-8B and Mistral-7B-Instruct-v0.3 into embedding models, but they are expected to be inferior to models that are specifically trained on embedding tasks.
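
A minimal offline sketch (this assumes intfloat/e5-mistral-7b-instruct as the checkpoint and that your vLLM version exposes the embed task and LLM.embed API):

```python
from vllm import LLM

# Load the model for the embedding (pooling) task instead of generation.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

outputs = llm.embed(["Hello, my name is", "The capital of France is"])
for output in outputs:
    print(len(output.outputs.embedding))  # dimensionality of each embedding vector
```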


Q: Can the output of a prompt vary across runs in vLLM?

A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details, see the Numerical Accuracy section.

In vLLM, the same requests might be batched differently due to factors such as other concurrent requests, changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations, can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in different tokens being sampled. Once a different token is sampled, further divergence is likely.

Mitigation Strategies

  • For improved stability and reduced variance, use float32. Note that this will require more memory.
  • If using bfloat16, switching to float16 can also help.
  • Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur (see the sketch below for an example combining these options).
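
A minimal offline-inference sketch combining these mitigations (the model name is only a placeholder):

```python
from vllm import LLM, SamplingParams

# float32 improves numerical stability at the cost of extra memory;
# a fixed per-request seed reduces run-to-run variation when temperature > 0.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float32")

params = SamplingParams(temperature=0.8, seed=42, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```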

source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md

Installation

This tab provides instructions on running vLLM with Intel Gaudi devices.

Requirements

  • OS: Ubuntu 22.04 LTS
  • Python: 3.10
  • Intel Gaudi accelerator
  • Intel Gaudi software version 1.18.0

Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Configure a new environment

Environment verification

To verify that the Intel Gaudi software was correctly installed, run:

hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
pip list | grep neural # verify that neural_compressor is installed

Refer to Intel Gaudi Software Stack Verification for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image:

docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

Set up using Python

Pre-built wheels

Currently, there are no pre-built Intel Gaudi wheels.

Build wheel from source

To build and install vLLM from source, run:

git clone https://github.com/vllm-project/vllm.git
cd vllm
python setup.py develop

Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and we periodically upstream them to vLLM main repo. To install latest HabanaAI/vLLM-fork, run the following:

git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
python setup.py develop

Set up using Docker

Pre-built images

Currently, there are no pre-built Intel Gaudi images.

Build image from source

docker build -f Dockerfile.hpu -t vllm-hpu-env  .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.

Extra information

Supported features

  • Offline inference
  • Online serving via OpenAI-Compatible Server
  • HPU autodetection - no need to manually select device within vLLM
  • Paged KV cache with algorithms enabled for Intel Gaudi accelerators
  • Custom Intel Gaudi implementations of Paged Attention, KV cache ops, prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding
  • Tensor parallelism support for multi-card inference
  • Inference with HPU Graphs for accelerating low-batch latency and throughput
  • Attention with Linear Biases (ALiBi)

Unsupported features

  • Beam search
  • LoRA adapters
  • Quantization
  • Prefill chunking (mixed-batch inferencing)

Supported configurations

The following configurations have been validated to function with Gaudi2 devices. Configurations that are not listed may or may not work.

Performance tuning

Execution modes

Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (via the PT_HPU_LAZY_MODE environment variable) and the --enforce-eager flag.

:widths: 25 25 50
:header-rows: 1

* - `PT_HPU_LAZY_MODE`
  - `enforce_eager`
  - execution mode
* - 0
  - 0
  - torch.compile
* - 0
  - 1
  - PyTorch eager mode
* - 1
  - 0
  - HPU Graphs
* - 1
  - 1
  - PyTorch lazy mode
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in the next releases. For the best performance in 1.18.0, please use HPU Graphs or PyTorch lazy mode.

(gaudi-bucketing-mechanism)=

Bucketing mechanism

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. Intel Gaudi Graph Compiler is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution. In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - batch_size and sequence_length.

Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.

Bucketing ranges are determined with 3 parameters - min, step and max. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:

INFO 08-01 21:37:59 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-01 21:37:59 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]

min determines the lowest value of the bucket, step determines the interval between buckets, and max determines the upper bound of the bucket. Furthermore, the interval between min and step has special handling: min is multiplied by consecutive powers of two until step is reached. We call this the ramp-up phase; it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.

Example (with ramp-up)

min = 2, step = 32, max = 64
=> ramp_up = (2, 4, 8, 16)
=> stable = (32, 64)
=> buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)

Example (without ramp-up)

min = 128, step = 128, max = 512
=> ramp_up = ()
=> stable = (128, 256, 384, 512)
=> buckets = ramp_up + stable => (128, 256, 384, 512)
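
The rule above can be reproduced with a short helper; this is a sketch of the documented behaviour, not vLLM's internal implementation:

```python
def generate_buckets(bmin: int, bstep: int, bmax: int) -> list[int]:
    """Ramp up from `bmin` by powers of two until `bstep` is reached,
    then step linearly in increments of `bstep` up to `bmax`."""
    ramp_up = []
    value = bmin
    while value < bstep and value <= bmax:
        ramp_up.append(value)
        value *= 2
    stable = list(range(bstep, bmax + 1, bstep))
    return ramp_up + stable

print(generate_buckets(2, 32, 64))      # [2, 4, 8, 16, 32, 64]
print(generate_buckets(128, 128, 512))  # [128, 256, 384, 512]
```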

In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.

If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.

As an example, if a request with 3 sequences and a maximum sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a (4, 512) prefill bucket: batch_size (number of sequences) will be padded to 4 (the closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (the closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a (4, 512) decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a (2, 512) bucket, or the context length increases above 512 tokens, in which case it will become a (4, 640) bucket.
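
For intuition, the padding targets in this example can be computed with a tiny helper (an illustration of the described rounding, not vLLM's internal code):

```python
def find_bucket(value: int, buckets: list[int]) -> int:
    """Return the smallest bucket boundary that can hold `value`."""
    return min(b for b in buckets if b >= value)

batch_buckets = [1, 2, 4]                                  # from bs:[1, 32, 4]
seq_buckets = [128, 256, 384, 512, 640, 768, 896, 1024]    # from seq:[128, 128, 1024]

# The request with 3 sequences and max sequence length 412 from the example above:
print(find_bucket(3, batch_buckets), find_bucket(412, seq_buckets))  # 4 512
```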

Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.

Warmup

Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
...
INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
...
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB

This example uses the same buckets as in the Bucketing Mechanism section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later, skipping further graph compilations.

Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do so, you may face graph compilations when executing a given bucket for the first time. Disabling warmup is fine for development, but it's highly recommended to enable it in deployment.

HPU Graph capture

HPU Graphs are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.

When HPU Graphs are being used, they share the common memory pool ("usable memory") as KV cache, determined by gpu_memory_utilization flag (0.9 by default). Before KV cache gets allocated, model weights are loaded onto the device, and a forward pass of the model is executed on dummy data, to estimate memory usage. Only after that, gpu_memory_utilization flag is utilized - at its default value, will mark 90% of free device memory at that point as usable. Next, KV cache gets allocated, model is warmed up, and HPU Graphs are captured. Environment variable VLLM_GRAPH_RESERVED_MEM defines the ratio of memory reserved for HPU Graphs capture. With its default value (VLLM_GRAPH_RESERVED_MEM=0.1), 10% of usable memory will be reserved for graph capture (later referred to as "usable graph memory"), and the remaining 90% will be utilized for KV cache. Environment variable VLLM_GRAPH_PROMPT_RATIO determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (VLLM_GRAPH_PROMPT_RATIO=0.3), both stages have equal memory constraints. Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. VLLM_GRAPH_PROMPT_RATIO=0.2 will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.

`gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
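
As a back-of-the-envelope illustration of how these knobs interact, using the 50 GiB example above (a sketch of the described accounting, not vLLM's actual bookkeeping code):

```python
# Illustrative arithmetic only.
free_after_profiling_gib = 50.0   # free device memory after loading weights + profile run
gpu_memory_utilization = 0.9      # fraction of that free memory marked as usable
graph_reserved_mem = 0.1          # VLLM_GRAPH_RESERVED_MEM
graph_prompt_ratio = 0.3          # VLLM_GRAPH_PROMPT_RATIO

usable = free_after_profiling_gib * gpu_memory_utilization  # 45.0 GiB usable memory
graph_mem = usable * graph_reserved_mem                      # 4.5 GiB for HPU Graph capture
kv_cache_mem = usable - graph_mem                            # 40.5 GiB for KV cache
prompt_graph_mem = graph_mem * graph_prompt_ratio            # 1.35 GiB for prefill graphs
decode_graph_mem = graph_mem - prompt_graph_mem              # 3.15 GiB for decode graphs

print(usable, graph_mem, kv_cache_mem, prompt_graph_mem, decode_graph_mem)
```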

Users can also configure the strategy for capturing HPU Graphs separately for the prompt and decode stages. The strategy affects the order in which graphs are captured. There are two strategies implemented:

  • max_bs - the graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. (64, 128), (64, 256), (32, 128), (32, 256), (1, 128), (1, 256)); this is the default strategy for decode.
  • min_tokens - the graph capture queue is sorted in ascending order by the number of tokens each graph processes (batch_size*sequence_length); this is the default strategy for prompt.

When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by max_bs strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in min_tokens strategy.

`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs; next, it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism can be observed in the example below.

Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):

INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
...
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
...
INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
...
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)

Recommended vLLM Parameters

  • We recommend running inference on Gaudi 2 with block_size of 128 for BF16 data type. Using default values (16, 32) might lead to sub-optimal performance due to Matrix Multiplication Engine under-utilization (see Gaudi Architecture).
  • For max throughput on Llama 7B, we recommend running with batch size of 128 or 256 and max context length of 2048 with HPU Graphs enabled. If you encounter out-of-memory issues, see troubleshooting section.

Environment variables

Diagnostic and profiling knobs:

  • VLLM_PROFILER_ENABLED: if true, high level profiler will be enabled. Resulting JSON traces can be viewed in perfetto.habana.ai. Disabled by default.
  • VLLM_HPU_LOG_STEP_GRAPH_COMPILATION: if true, will log graph compilations per each vLLM engine step, only when there was any - highly recommended to use alongside PT_HPU_METRICS_GC_DETAILS=1. Disabled by default.
  • VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL: if true, will log graph compilations per each vLLM engine step, always, even if there were none. Disabled by default.
  • VLLM_HPU_LOG_STEP_CPU_FALLBACKS: if true, will log cpu fallbacks per each vLLM engine step, only when there was any. Disabled by default.
  • VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL: if true, will log cpu fallbacks per each vLLM engine step, always, even if there were none. Disabled by default.

Performance tuning knobs:

  • VLLM_SKIP_WARMUP: if true, warmup will be skipped, false by default

  • VLLM_GRAPH_RESERVED_MEM: percentage of memory dedicated for HPUGraph capture, 0.1 by default

  • VLLM_GRAPH_PROMPT_RATIO: percentage of reserved graph memory dedicated for prompt graphs, 0.3 by default

  • VLLM_GRAPH_PROMPT_STRATEGY: strategy determining order of prompt graph capture, min_tokens or max_bs, min_tokens by default

  • VLLM_GRAPH_DECODE_STRATEGY: strategy determining order of decode graph capture, min_tokens or max_bs, max_bs by default

  • VLLM_{phase}_{dim}_BUCKET_{param} - collection of 12 environment variables configuring the ranges of the bucketing mechanism (an example of setting them follows this list)

    • {phase} is either PROMPT or DECODE

    • {dim} is either BS, SEQ or BLOCK

    • {param} is either MIN, STEP or MAX

    • Default values:

      • Prompt:
        • batch size min (VLLM_PROMPT_BS_BUCKET_MIN): 1
        • batch size step (VLLM_PROMPT_BS_BUCKET_STEP): min(max_num_seqs, 32)
        • batch size max (VLLM_PROMPT_BS_BUCKET_MAX): min(max_num_seqs, 64)
        • sequence length min (VLLM_PROMPT_SEQ_BUCKET_MIN): block_size
        • sequence length step (VLLM_PROMPT_SEQ_BUCKET_STEP): block_size
        • sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len
      • Decode:
        • batch size min (VLLM_DECODE_BS_BUCKET_MIN): 1
        • batch size step (VLLM_DECODE_BS_BUCKET_STEP): min(max_num_seqs, 32)
        • batch size max (VLLM_DECODE_BS_BUCKET_MAX): max_num_seqs
        • sequence length min (VLLM_DECODE_BLOCK_BUCKET_MIN): block_size
        • sequence length step (VLLM_DECODE_BLOCK_BUCKET_STEP): block_size
        • sequence length max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*max_model_len)/block_size)
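
For example, to widen the decode bucketing ranges from Python before vLLM starts (the values here are arbitrary examples; when launching the server from a shell, export the same variables there instead):

```python
import os

# Must be set before vLLM initializes. Prompt buckets use the BS and SEQ
# dimensions, while decode buckets use BS and BLOCK, as in the defaults above.
os.environ["VLLM_DECODE_BS_BUCKET_MIN"] = "1"
os.environ["VLLM_DECODE_BS_BUCKET_STEP"] = "32"
os.environ["VLLM_DECODE_BS_BUCKET_MAX"] = "128"
os.environ["VLLM_DECODE_BLOCK_BUCKET_MIN"] = "128"
os.environ["VLLM_DECODE_BLOCK_BUCKET_STEP"] = "128"
os.environ["VLLM_DECODE_BLOCK_BUCKET_MAX"] = "2048"
```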

Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:

  • PT_HPU_LAZY_MODE: if 0, the PyTorch Eager backend for Gaudi will be used; if 1, the PyTorch Lazy backend for Gaudi will be used. The default is 1.
  • PT_HPU_ENABLE_LAZY_COLLECTIVES: required to be true for tensor parallel inference with HPU Graphs

Troubleshooting: tweaking HPU graphs

If you experience device out-of-memory issues or want to attempt inference at higher batch sizes, try tweaking HPU Graphs as follows:

  • Tweak the gpu_memory_utilization knob. This decreases the allocation of KV cache, leaving some headroom for capturing graphs with larger batch sizes. By default gpu_memory_utilization is set to 0.9, which attempts to allocate ~90% of the HBM left for KV cache after a short profiling run. Note that decreasing it reduces the number of KV cache blocks you have available, and therefore reduces the effective maximum number of tokens you can handle at a given time.
  • If this method is not sufficient, you can disable HPU Graphs completely. With HPU Graphs disabled, you are trading latency and throughput at lower batch sizes for potentially higher throughput at higher batch sizes. You can do that by adding the --enforce-eager flag to the server (for online serving), or by passing enforce_eager=True to the LLM constructor (for offline inference).

source/getting_started/installation/ai_accelerator/index.md

Other AI accelerators

vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "# Installation"
:end-before: "## Requirements"

:::

::::

Requirements

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "## Requirements"
:end-before: "## Configure a new environment"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "## Requirements"
:end-before: "## Configure a new environment"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "## Requirements"
:end-before: "## Configure a new environment"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

::::

Configure a new environment

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"

:::

:::{tab-item} OpenVINO :sync: openvino

:::

::::

Set up using Python

Pre-built wheels

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

::::

Build wheel from source

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

::::

Set up using Docker

Pre-built images

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

::::

Build image from source

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "### Build image from source"
:end-before: "## Extra information"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "### Build image from source"
:end-before: "## Extra information"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "### Build image from source"
:end-before: "## Extra information"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "### Build image from source"
:end-before: "## Extra information"

:::

::::

Extra information

::::{tab-set} :sync-group: device

:::{tab-item} TPU :sync: tpu

:start-after: "## Extra information"

:::

:::{tab-item} Intel Gaudi :sync: hpu-gaudi

:start-after: "## Extra information"

:::

:::{tab-item} Neuron :sync: neuron

:start-after: "## Extra information"

:::

:::{tab-item} OpenVINO :sync: openvino

:start-after: "## Extra information"

:::

::::


source/getting_started/installation/ai_accelerator/neuron.inc.md

Installation

vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching. Paged Attention and Chunked Prefill are currently in development and will be available soon. Data types currently supported in Neuron SDK are FP16 and BF16.

Requirements

  • OS: Linux
  • Python: 3.9 -- 3.11
  • Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
  • Pytorch 2.0.1/2.1.1
  • AWS Neuron SDK 2.16/2.17 (Verified on python 3.8)

Configure a new environment

Launch Trn1/Inf2 instances

Here are the steps to launch trn1/inf2 instances in order to install PyTorch Neuron ("torch-neuronx") on Ubuntu 22.04 LTS.

  • Please follow the instructions at launch an Amazon EC2 Instance to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
  • To get more information about instances sizes and pricing see: Trn1 web page, Inf2 web page
  • Select the Ubuntu Server 22.04 LTS AMI
  • When launching a Trn1/Inf2, please adjust your primary EBS volume size to a minimum of 512GB.
  • After launching the instance, follow the instructions in Connect to your instance to connect to the instance

Install drivers and tools

Installing drivers and tools is not necessary if the Deep Learning AMI Neuron is used. If the drivers and tools are not installed on the operating system, follow the steps below:

# Configure Linux for Neuron repository updates
. /etc/os-release
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

# Update OS packages
sudo apt-get update -y

# Install OS headers
sudo apt-get install linux-headers-$(uname -r) -y

# Install git
sudo apt-get install git -y

# install Neuron Driver
sudo apt-get install aws-neuronx-dkms=2.* -y

# Install Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y

# Install Neuron Tools
sudo apt-get install aws-neuronx-tools=2.* -y

# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH

Set up using Python

Pre-built wheels

Currently, there are no pre-built Neuron wheels.

Build wheel from source

The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.

The following instructions are applicable to Neuron SDK 2.16 and beyond.

Install transformers-neuronx and its dependencies

transformers-neuronx will be the backend to support inference on trn1/inf2 instances. Follow the steps below to install the transformers-neuronx package and its dependencies.

# Install Python venv
sudo apt-get install -y python3.10-venv g++

# Create Python venv
python3.10 -m venv aws_neuron_venv_pytorch

# Activate Python venv
source aws_neuron_venv_pytorch/bin/activate

# Install Jupyter notebook kernel
pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Install wget, awscli
python -m pip install wget
python -m pip install awscli

# Update Neuron Compiler and Framework
python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx

Install vLLM from source

Once the neuronx-cc and transformers-neuronx packages are installed, you can install vLLM as follows:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install .

If neuron packages are detected correctly in the installation process, vllm-0.3.0+neuron212 will be installed.

Set up using Docker

Pre-built images

Currently, there are no pre-built Neuron images.

Build image from source

See project:#deployment-docker-build-image-from-source for instructions on building the Docker image.

Make sure to use gh-file:Dockerfile.neuron in place of the default Dockerfile.

Extra information

There is no extra information for this device.


source/getting_started/installation/ai_accelerator/openvino.inc.md

Installation

vLLM powered by OpenVINO supports all LLM models from the vLLM supported models list and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs (the list of supported GPUs).

Requirements

  • OS: Linux
  • Instruction set architecture (ISA) requirement: at least AVX2.

Set up using Python

Pre-built wheels

Currently, there are no pre-built OpenVINO wheels.

Build wheel from source

First, install Python. For example, on Ubuntu 22.04, you can run:

sudo apt-get update  -y
sudo apt-get install python3

Second, install the prerequisites for the vLLM OpenVINO backend:

pip install --upgrade pip
pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu

Finally, install vLLM with OpenVINO backend:

PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v .

:::{tip} To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html. :::

Set up using Docker

Pre-built images

Currently, there are no pre-built OpenVINO images.

Build image from source

docker build -f Dockerfile.openvino -t vllm-openvino-env .
docker run -it --rm vllm-openvino-env

Extra information

Supported features

OpenVINO vLLM backend supports the following advanced vLLM features:

  • Prefix caching (--enable-prefix-caching)
  • Chunked prefill (--enable-chunked-prefill)

Performance tips

vLLM OpenVINO backend environment variables

  • VLLM_OPENVINO_DEVICE to specify which device to use for inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g., VLLM_OPENVINO_DEVICE=GPU.1). If the value is not specified, the CPU device is used by default.
  • VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weights compression during the model loading stage. By default, compression is turned off. You can also export the model with different compression techniques using optimum-cli and pass the exported folder as <model_id>

CPU performance tips

CPU uses the following environment variables to control behavior:

  • VLLM_OPENVINO_KVCACHE_SPACE to specify the KV cache size (e.g., VLLM_OPENVINO_KVCACHE_SPACE=40 means 40 GB of space for the KV cache); a larger setting will allow vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
  • VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 to control KV cache precision. By default, FP16 / BF16 is used depending on platform.

To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (--enable-chunked-prefill). Based on the experiments, the recommended batch size is 256 (--max-num-batched-tokens)

OpenVINO best known configuration for CPU is:

$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256

GPU performance tips

GPU device implements the logic for automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking into account gpu_memory_utilization option). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache using VLLM_OPENVINO_KVCACHE_SPACE environment variable (e.g, VLLM_OPENVINO_KVCACHE_SPACE=8 means 8 GB space for KV cache).

Currently, the best performance using GPU can be achieved with the default vLLM execution parameters for models with quantized weights (8 and 4-bit integer data types are supported) and preemption-mode=swap.

OpenVINO best known configuration for GPU is:

$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json

Limitations

  • LoRA serving is not supported.
  • Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
  • Tensor and pipeline parallelism are not currently enabled in vLLM integration.

source/getting_started/installation/ai_accelerator/tpu.inc.md

Installation

Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are available in different versions each with different hardware specifications. For more information about TPUs, see TPU System Architecture. For more information on the TPU versions supported with vLLM, see:

These TPU versions allow you to configure the physical arrangements of the TPU chips. This can improve throughput and networking performance. For more information see:

In order for you to use Cloud TPUs, you need to have TPU quota granted to your Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a GCP project and are specified in terms of TPU version, the number of TPUs you want to use, and quota type. For more information, see TPU quota.

For TPU pricing information, see Cloud TPU pricing.

You may need additional persistent storage for your TPU VMs. For more information, see Storage options for Cloud TPU data.

Requirements

  • Google Cloud TPU VM
  • TPU versions: v6e, v5e, v5p, v4
  • Python: 3.10 or newer

Provision Cloud TPUs

You can provision Cloud TPUs using the Cloud TPU API or the queued resources API. This section shows how to create TPUs using the queued resource API. For more information about using the Cloud TPU API, see Create a Cloud TPU using the Create Node API. Queued resources enable you to request Cloud TPU resources in a queued manner. When you request queued resources, the request is added to a queue maintained by the Cloud TPU service. When the requested resource becomes available, it's assigned to your Google Cloud project for your immediate exclusive use.

In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.

Provision Cloud TPUs with GKE

For more information about using TPUs with GKE, see:

Configure a new environment

Provision a Cloud TPU with the queued resource API

Create a TPU v5e with 4 TPU chips:

gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \
--project PROJECT_ID \
--zone ZONE \
--accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT
:header-rows: 1

* - Parameter name
  - Description
* - QUEUED_RESOURCE_ID
  - The user-assigned ID of the queued resource request.
* - TPU_NAME
  - The user-assigned name of the TPU which is created when the queued
    resource request is allocated.
* - PROJECT_ID
  - Your Google Cloud project
* - ZONE
  - The GCP zone where you want to create your Cloud TPU. The value you use
    depends on the version of TPUs you are using. For more information, see
    `TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
* - ACCELERATOR_TYPE
  - The TPU version you want to use. Specify the TPU version, for example
    `v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
    see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
  - The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
  - The email address for your service account. You can find it in the IAM
    Cloud Console under *Service Accounts*. For example:
    `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`

Connect to your TPU using SSH:

gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE

Set up using Python

Pre-built wheels

Currently, there are no pre-built TPU wheels.

Build wheel from source

Install Miniconda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

Create and activate a Conda environment for vLLM:

conda create -n vllm python=3.10 -y
conda activate vllm

Clone the vLLM repository and go to the vLLM directory:

git clone https://github.com/vllm-project/vllm.git && cd vllm

Uninstall the existing torch and torch_xla packages:

pip uninstall torch torch-xla -y

Install build dependencies:

pip install -r requirements-tpu.txt
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev

Run the setup script:

VLLM_TARGET_DEVICE="tpu" python setup.py develop

Set up using Docker

Pre-built images

See project:#deployment-docker-pre-built-image for instructions on using the official Docker image, making sure to substitute the image name vllm/vllm-openai with vllm/vllm-tpu.

Build image from source

You can use gh-file:Dockerfile.tpu to build a Docker image with TPU support.

docker build -f Dockerfile.tpu -t vllm-tpu .

Run the Docker image with the following command:

# Make sure to add `--privileged --net host --shm-size=16G`.
docker run --privileged --net host --shm-size=16G -it vllm-tpu
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
If you encounter the following error:

```console
from torch._C import *  # noqa: F403
ImportError: libopenblas.so.0: cannot open shared object file: No such
file or directory
```

Install OpenBLAS with the following command:

```console
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```

Extra information

There is no extra information for this device.


source/getting_started/installation/cpu/apple.inc.md

Installation

vLLM has experimental support for macOS with Apple silicon. For now, users should build vLLM from source to run it natively on macOS.

Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.

Requirements

  • OS: macOS Sonoma or later
  • SDK: XCode 15.4 or later with Command Line Tools
  • Compiler: Apple Clang >= 15.0.0

Set up using Python

Pre-built wheels

Build wheel from source

After installing XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-cpu.txt
pip install -e . 
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.

Troubleshooting

If the build fails with errors like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your Command Line Tools for Xcode.

[...] fatal error: 'map' file not found
          1 | #include <map>
            |          ^~~~~
      1 error generated.
      [2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o

[...] fatal error: 'cstddef' file not found
         10 | #include <cstddef>
            |          ^~~~~~~~~
      1 error generated.

Set up using Docker

Pre-built images

Build image from source

Extra information


source/getting_started/installation/cpu/arm.inc.md

Installation

vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.

ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.

Requirements

  • OS: Linux
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)
  • Instruction Set Architecture (ISA): NEON support is required

Set up using Python

Pre-built wheels

Build wheel from source

:::{include} build.inc.md :::

Testing has been conducted on AWS Graviton3 instances for compatibility.

Set up using Docker

Pre-built images

Build image from source

Extra information


source/getting_started/installation/cpu/build.inc.md

First, install the recommended compiler. We recommend using gcc/g++ >= 12.3.0 as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

Second, install Python packages for vLLM CPU backend building:

pip install --upgrade pip
pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

Finally, build and install vLLM CPU backend:

VLLM_TARGET_DEVICE=cpu python setup.py install

source/getting_started/installation/cpu/index.md

CPU

vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:

::::{tab-set} :sync-group: device

:::{tab-item} x86 :sync: x86

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} ARM :sync: arm

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} Apple silicon :sync: apple

:start-after: "# Installation"
:end-before: "## Requirements"

:::

::::

Requirements

  • Python: 3.9 -- 3.12

::::{tab-set} :sync-group: device

:::{tab-item} x86 :sync: x86

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

:::{tab-item} ARM :sync: arm

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

:::{tab-item} Apple silicon :sync: apple

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

::::

Set up using Python

Create a new Python environment

Pre-built wheels

Currently, there are no pre-built CPU wheels.

Build wheel from source

::::{tab-set} :sync-group: device

:::{tab-item} x86 :sync: x86

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} ARM :sync: arm

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} Apple silicon :sync: apple

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

::::

Set up using Docker

Pre-built images

Currently, there are no pre-built CPU images.

Build image from source

$ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
$ docker run -it \
             --rm \
             --network=host \
             --cpuset-cpus=<cpu-id-list, optional> \
             --cpuset-mems=<memory-node, optional> \
             vllm-cpu-env

:::{tip} For ARM or Apple silicon, use Dockerfile.arm :::

Supported features

vLLM CPU backend supports the following vLLM features:

  • Tensor Parallel
  • Model Quantization (INT8 W8A8, AWQ, GPTQ)
  • Chunked-prefill
  • Prefix-caching
  • FP8-E5M2 KV-Caching (TODO)

Related runtime environment variables

  • VLLM_CPU_KVCACHE_SPACE: specifies the KV cache size (e.g., VLLM_CPU_KVCACHE_SPACE=40 means 40 GB of space for the KV cache). A larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the memory management pattern of users.
  • VLLM_CPU_OMP_THREADS_BIND: specifies the CPU cores dedicated to the OpenMP threads. For example, VLLM_CPU_OMP_THREADS_BIND=0-31 means 32 OpenMP threads bound to CPU cores 0-31. VLLM_CPU_OMP_THREADS_BIND=0-31|32-63 means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63.

Performance tips

  • We highly recommend using TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:
sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
find / -name "*libtcmalloc*" # find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
python examples/offline_inference/basic.py # run vLLM
  • When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
  • If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using VLLM_CPU_OMP_THREADS_BIND. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

# On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/offline_inference/basic.py
  • If using the vLLM CPU backend on a multi-socket machine with NUMA, take care to set the CPU cores via VLLM_CPU_OMP_THREADS_BIND to avoid cross-NUMA-node memory access.

Other considerations

  • The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.

  • Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.

  • On a CPU-based setup with NUMA enabled, memory access performance may be largely impacted by the topology. For NUMA architectures, two optimizations are recommended: Tensor Parallel or Data Parallel.

    • Using Tensor Parallel for a latency-constrained deployment: following the GPU backend design, Megatron-LM's parallel algorithm is used to shard the model based on the number of NUMA nodes (e.g. TP = 2 for a two-NUMA-node system). With the TP feature on CPU merged, Tensor Parallel is supported for both serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example script to enable Tensor Parallel = 2 for serving:

      VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
    • Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like Nginx or HAProxy are recommended. The Anyscale Ray project provides this feature for LLM serving; here is the example to set up scalable LLM serving with Ray Serve. A minimal per-node launch sketch is shown after this list.
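A minimal sketch of the Data Parallel idea, launching one endpoint per NUMA node on different ports; the core ranges, ports, and use of `numactl` are illustrative assumptions rather than a prescribed setup:

```console
# NUMA node 0
VLLM_CPU_OMP_THREADS_BIND=0-31 numactl --cpunodebind=0 --membind=0 \
    vllm serve facebook/opt-125m --port 8000 &

# NUMA node 1
VLLM_CPU_OMP_THREADS_BIND=32-63 numactl --cpunodebind=1 --membind=1 \
    vllm serve facebook/opt-125m --port 8001 &
```

A load balancer such as Nginx or HAProxy would then distribute incoming requests across ports 8000 and 8001.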


source/getting_started/installation/cpu/x86.inc.md

Installation

vLLM initially supports basic model inference and serving on the x86 CPU platform, with FP32, FP16 and BF16 data types.

Requirements

  • OS: Linux
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)
  • Instruction Set Architecture (ISA): AVX512 (optional, recommended)

Set up using Python

Pre-built wheels

Build wheel from source

:::{include} build.inc.md :::

- AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building (see the sketch below).
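A minimal sketch of such a build, reusing the source-build command from this guide:

```console
VLLM_CPU_AVX512BF16=1 VLLM_TARGET_DEVICE=cpu python setup.py install
```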

Set up using Docker

Pre-built images

Build image from source

Extra information

Intel Extension for PyTorch


source/getting_started/installation/device.template.md

Installation

Requirements

Set up using Python

Pre-built wheels

Build wheel from source

Set up using Docker

Pre-built images

Build image from source

Extra information


source/getting_started/installation/gpu/cuda.inc.md

Installation

vLLM contains pre-compiled C++ and CUDA (12.1) binaries.

Requirements

  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

Set up using Python

Create a new Python environment

PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.

In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different build configurations.

Therefore, it is recommended to install vLLM in a fresh new environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See below for more details.

Pre-built wheels

You can install vLLM using either pip or uv pip:

# Install vLLM with CUDA 12.1.
pip install vllm # If you are using pip.
uv pip install vllm # If you are using uv.

As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:

# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

(install-the-latest-code)=

Install the latest code

LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on an x86 platform with CUDA 12 for every commit since v0.5.3.

Install the latest code using pip
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

--pre is required for pip to consider pre-release versions.

If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), due to a limitation of pip, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:

export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

Note that the wheels are built with Python 3.8 ABI (see PEP 425 for more details about ABI), so they are compatible with Python 3.8 and later. The version string in the wheel file name (1.0.0.dev) is just a placeholder to have a unified URL for the wheels; the actual versions are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 anymore (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.

Install the latest code using uv

Another way to install the latest code is to use uv:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:

export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}

The uv approach works for vLLM v0.6.6 and later and offers an easy-to-remember command. A unique feature of uv is that packages in --extra-index-url have higher priority than the default index. If the latest public release is v0.6.6.post1, uv's behavior allows installing a commit before v0.6.6.post1 by specifying the --extra-index-url. In contrast, pip combines packages from --extra-index-url and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.

Build wheel from source

Set up using Python-only build (without compilation)

If you only need to change Python code, you can build and install vLLM without compilation. Using pip's --editable flag, changes you make to the code will be reflected when you run vLLM:

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

This will download the latest nightly wheel and use the compiled libraries from there in the installation.

The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable can be used instead of VLLM_USE_PRECOMPILED to specify a custom path or URL to the wheel file. For example, to use the 0.6.3.post1 PyPI wheel:

export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/4a/4c/ee65ba33467a4c0de350ce29fbae39b9d0e7fcd887cc756fa993654d1228/vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .

You can find more information about vLLM's wheels in project:#install-the-latest-code.

There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel.

Full build (with compilation)

If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.

For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.

[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
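For example, a build that reuses the remote cache configuration mentioned above might look like this (a minimal sketch; it assumes the bucket is reachable from your build environment):

```console
export SCCACHE_BUCKET=vllm-build-sccache
export SCCACHE_REGION=us-west-2
export SCCACHE_S3_NO_CREDENTIALS=1
export SCCACHE_IDLE_TIMEOUT=0
pip install -e .
```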
Use an existing PyTorch installation

There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:

  • Building vLLM with PyTorch nightly or a custom PyTorch build.
  • Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 to install PyTorch nightly, and then build vLLM on top of it.

To build vLLM using an existing PyTorch installation:

git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -e . --no-build-isolation
Use the local cutlass for compilation

Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead. To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
Troubleshooting

To avoid your system being overloaded, you can limit the number of compilation jobs to be run simultaneously, via the environment variable MAX_JOBS. For example:

export MAX_JOBS=6
pip install -e .

This is especially useful when you are building on less powerful machines. For example, when you use WSL it only assigns 50% of the total memory by default, so using export MAX_JOBS=1 can avoid compiling multiple files simultaneously and running out of memory. A side effect is a much slower build process.

Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.

# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from the official website. After installation, set the environment variable CUDA_HOME to the installation path of CUDA Toolkit, and make sure that the nvcc compiler is in your PATH, e.g.:

export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"

Here is a sanity check to verify that the CUDA Toolkit is correctly installed:

nvcc --version # verify that nvcc is in your PATH
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME

Unsupported OS build

vLLM can fully run only on Linux but for development purposes, you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.

Simply set the VLLM_TARGET_DEVICE environment variable to `empty` before installing:

export VLLM_TARGET_DEVICE=empty
pip install -e .

Set up using Docker

Pre-built images

See project:#deployment-docker-pre-built-image for instructions on using the official Docker image.

Another way to access the latest code is to use the docker images:

export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}

These docker images are used for CI and testing only, and they are not intended for production use. They expire after several days.

The latest code can contain bugs and may not be stable. Please use it with caution.

Build image from source

See project:#deployment-docker-build-image-from-source for instructions on building the Docker image.

Supported features

See project:#feature-x-hardware compatibility matrix for feature support information.


source/getting_started/installation/gpu/index.md

GPU

vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "# Installation"
:end-before: "## Requirements"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "# Installation"
:end-before: "## Requirements"

:::

::::

Requirements

  • OS: Linux
  • Python: 3.9 -- 3.12

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "## Requirements"
:end-before: "## Set up using Python"

:::

::::

Set up using Python

Create a new Python environment

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels"

:::

:::{tab-item} ROCm :sync: rocm

There is no extra information on creating a new Python environment for this device.

:::

:::{tab-item} XPU :sync: xpu

There is no extra information on creating a new Python environment for this device.

:::

::::

Pre-built wheels

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"

:::

::::

(build-from-source)=

Build wheel from source

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"

:::

::::

Set up using Docker

Pre-built images

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "### Pre-built images"
:end-before: "### Build image from source"

:::

::::

Build image from source

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "### Build image from source"
:end-before: "## Supported features"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "### Build image from source"
:end-before: "## Supported features"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "### Build image from source"
:end-before: "## Supported features"

:::

::::

Supported features

::::{tab-set} :sync-group: device

:::{tab-item} CUDA :sync: cuda

:start-after: "## Supported features"

:::

:::{tab-item} ROCm :sync: rocm

:start-after: "## Supported features"

:::

:::{tab-item} XPU :sync: xpu

:start-after: "## Supported features"

:::

::::


source/getting_started/installation/gpu/rocm.inc.md

Installation

vLLM supports AMD GPUs with ROCm 6.2.

Requirements

  • GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
  • ROCm 6.2

Set up using Python

Pre-built wheels

Currently, there are no pre-built ROCm wheels.

However, the AMD Infinity hub for vLLM offers a prebuilt, optimized docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.

Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image.

Build wheel from source

  1. Install prerequisites (skip if you are already in an environment/docker with the following installed):
  • ROCm

  • PyTorch

    For installing PyTorch, you can start from a fresh docker image, e.g., rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0 or rocm/pytorch-nightly.

    Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch Getting Started

  2. Install Triton flash attention for ROCm

    Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from ROCm/triton

    python3 -m pip install ninja cmake wheel pybind11
    pip uninstall -y triton
    git clone https://github.com/OpenAI/triton.git
    cd triton
    git checkout e192dba
    cd python
    pip3 install .
    cd ../..
    - If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
    
  3. Optionally, if you choose to use CK flash attention, you can install flash attention for ROCm

    Install ROCm's flash attention (v2.5.9.post1) following the instructions from ROCm/flash-attention. Alternatively, wheels intended for vLLM use can be accessed under the releases.

    For example, for ROCm 6.2, suppose your gfx arch is gfx90a. To get your gfx architecture, run rocminfo |grep gfx.

    git clone https://github.com/ROCm/flash-attention.git
    cd flash-attention
    git checkout 3cea2fb
    git submodule update --init
    GPU_ARCHS="gfx90a" python3 setup.py install
    cd ..
    - You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
    
  4. Build vLLM. For example, vLLM on ROCm 6.2 can be built with the following steps:

    $ pip install --upgrade pip
    
    # Install PyTorch
    $ pip uninstall torch -y
    $ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/rocm6.2
    
    # Build & install AMD SMI
    $ pip install /opt/rocm/share/amd_smi
    
    # Install dependencies
    $ pip install --upgrade numba scipy huggingface-hub[cli]
    $ pip install "numpy<2"
    $ pip install -r requirements-rocm.txt
    
    # Build vLLM for MI210/MI250/MI300.
    $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
    $ python3 setup.py develop

    This may take 5-10 minutes. Currently, pip install . does not work for ROCm installation.

    - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
    - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
    - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
    - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
    
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
  For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).

Set up using Docker

Pre-built images

Currently, there are no pre-built ROCm images.

Build image from source

Building the Docker image from source is the recommended way to use vLLM with ROCm.

First, build a docker image from gh-file:Dockerfile.rocm and launch a docker container from the image. It is important to kick off the docker build using BuildKit: either set DOCKER_BUILDKIT=1 as an environment variable when calling the docker build command, or enable BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:

{
    "features": {
        "buildkit": true
    }
}

gh-file:Dockerfile.rocm uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches. It provides flexibility to customize the build of docker image using the following arguments:

  • BASE_IMAGE: specifies the base image used when running docker build. The default value rocm/vllm-dev:base is an image published and maintained by AMD. It is being built using gh-file:Dockerfile.rocm_base
  • USE_CYTHON: An option to run cython compilation on a subset of python files upon docker build
  • BUILD_RPD: Include RocmProfileData profiling tool in the image
  • ARG_PYTORCH_ROCM_ARCH: Allows overriding the gfx architecture values from the base docker image

Their values can be passed in when running docker build with --build-arg options.

To build vllm on ROCm 6.2 for MI200 and MI300 series, you can use the default:

DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .

To build vllm on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:

DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f Dockerfile.rocm -t vllm-rocm .

To run the above docker image vllm-rocm, use the below command:

docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v <path/to/model>:/app/model \
   vllm-rocm \
   bash

Here, <path/to/model> is the location where the model is stored, for example, the weights for Llama 2 or Llama 3 models.

Supported features

See project:#feature-x-hardware compatibility matrix for feature support information.


source/getting_started/installation/gpu/xpu.inc.md

Installation

vLLM initially supports basic model inference and serving on the Intel GPU platform.

Requirements

  • Supported Hardware: Intel Data Center GPU, Intel ARC GPU
  • OneAPI requirements: oneAPI 2024.2

Set up using Python

Pre-built wheels

Currently, there are no pre-built XPU wheels.

Build wheel from source

  • First, install the required driver and Intel oneAPI 2024.2 or later.
  • Second, install Python packages for vLLM XPU backend building:
source /opt/intel/oneapi/setvars.sh
pip install --upgrade pip
pip install -v -r requirements-xpu.txt
  • Finally, build and install vLLM XPU backend:
VLLM_TARGET_DEVICE=xpu python setup.py install
- FP16 is the default data type in the current XPU backend. The BF16 data
  type will be supported in the future.

Set up using Docker

Pre-built images

Currently, there are no pre-built XPU images.

Build image from source

$ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
$ docker run -it \
             --rm \
             --network=host \
             --device /dev/dri \
             -v /dev/dri/by-path:/dev/dri/by-path \
             vllm-xpu-env

Supported features

The XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. It requires Ray as the distributed runtime backend. For example, a reference execution looks like the following:

python -m vllm.entrypoints.openai.api_server \
     --model=facebook/opt-13b \
     --dtype=bfloat16 \
     --device=xpu \
     --max_model_len=1024 \
     --distributed-executor-backend=ray \
     --pipeline-parallel-size=2 \
     -tp=8

By default, a Ray instance will be launched automatically if no existing one is detected in the system, with num-gpus equal to parallel_config.world_size. We recommend properly starting a Ray cluster before execution; refer to the gh-file:examples/online_serving/run_cluster.sh helper script.


source/getting_started/installation/index.md

(installation-index)=

Installation

vLLM supports the following hardware platforms:

:maxdepth: 1

gpu/index
cpu/index
ai_accelerator/index

source/getting_started/installation/python_env_setup.inc.md

You can create a new Python environment using conda:

# (Recommended) Create a new conda environment.
conda create -n myenv python=3.12 -y
conda activate myenv
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.

Or you can create a new Python environment using uv, a very fast Python environment manager. Please follow the documentation to install uv. After installing uv, you can create a new Python environment using the following command:

# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

source/getting_started/quickstart.md

(quickstart)=

Quickstart

This guide will help you quickly get started with vLLM to perform:

  • Offline batched inference
  • Online serving using an OpenAI-compatible server

Prerequisites

  • OS: Linux
  • Python: 3.9 -- 3.12

Installation

If you are using NVIDIA GPUs, you can install vLLM using pip directly.

It's recommended to use uv, a very fast Python environment manager, to create and manage Python environments. Please follow the documentation to install uv. After installing uv, you can create a new Python environment and install vLLM using the following commands:

uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm

You can also use conda to create and manage Python environments.

conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.

(quickstart-offline)=

Offline Batched Inference

With vLLM installed, you can start generating text for a list of input prompts (i.e. offline batch inference). See the example script: gh-file:examples/offline_inference/basic.py

The first line of this example imports the classes {class}~vllm.LLM and {class}~vllm.SamplingParams:

  • {class}~vllm.LLM is the main class for running offline inference with vLLM engine.
  • {class}~vllm.SamplingParams specifies the parameters for the sampling process.
from vllm import LLM, SamplingParams

The next section defines a list of input prompts and sampling parameters for text generation. The sampling temperature is set to 0.8 and the nucleus sampling probability is set to 0.95. You can find more information about the sampling parameters here.

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

The {class}~vllm.LLM class initializes vLLM's engine and the OPT-125M model for offline inference. The list of supported models can be found here.

llm = LLM(model="facebook/opt-125m")
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
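For example (a minimal sketch; `True` is the conventional value used to enable the switch):

```console
export VLLM_USE_MODELSCOPE=True
```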

Now, the fun part! The outputs are generated using llm.generate. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of RequestOutput objects, which include all of the output tokens.

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

(quickstart-online)=

OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments. The server currently hosts one model at a time and implements endpoints such as list models, create chat completion, and create completion.

Run the following command to start the vLLM server with the Qwen2.5-1.5B-Instruct model:

vllm serve Qwen/Qwen2.5-1.5B-Instruct
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).

This server can be queried in the same format as OpenAI API. For example, to list the models:

curl http://localhost:8000/v1/models

You can pass the argument --api-key or set the environment variable VLLM_API_KEY to have the server check for an API key in the request header.
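For example (a minimal sketch; `token-abc123` is a placeholder key):

```console
# Start the server with an API key.
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key token-abc123

# Clients must then send the key in the Authorization header.
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123"
```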

OpenAI Completions API with vLLM

Once your server is started, you can query the model with input prompts:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the openai Python package:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="San Francisco is a")
print("Completion result:", completion)

A more detailed client example can be found here: gh-file:examples/online_serving/openai_completion_client.py

OpenAI Chat Completions API with vLLM

vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the create chat completion endpoint to interact with the model:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Alternatively, you can use the openai Python package:

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

source/getting_started/troubleshooting.md

(troubleshooting)=

Troubleshooting

This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please search existing issues first to see if it has already been reported. If not, please file a new issue, providing as much relevant information as possible.

Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.

Hangs downloading a model

If the model isn't already downloaded to disk, vLLM will download it from the internet, which can take time depending on your internet connection. It's recommended to first download the model using the huggingface-cli and then pass the local path to the model to vLLM. This way, you can isolate the issue.
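For example (a minimal sketch using a small model; substitute the model and local directory you actually need):

```console
# Download the model to a local directory first...
huggingface-cli download facebook/opt-125m --local-dir ./opt-125m

# ...then point vLLM at the local path.
vllm serve ./opt-125m
```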

Hangs loading a model from disk

If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow. It'd be better to store the model on a local disk. Additionally, have a look at the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.

To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
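For example (a minimal sketch):

```console
vllm serve facebook/opt-125m --load-format dummy
```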

Model is too large

If the model is too large to fit in a single GPU, you might want to consider tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using gh-file:examples/offline_inference/save_sharded_state.py. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.

Enable more logging

If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:

  • export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.
  • export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.
  • export NCCL_DEBUG=TRACE to turn on more logging for NCCL.
  • export VLLM_TRACE_FUNCTION=1 to record all function calls for inspection in the log files to tell which function crashes or hangs.

Incorrect network setup

The vLLM instance cannot get the correct IP address if you have a complicated network config. You can find a log such as DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl and the IP address should be the correct one. If it's not, override the IP address using the environment variable export VLLM_HOST_IP=<your_ip_address>.

You might also need to set export NCCL_SOCKET_IFNAME=<your_network_interface> and export GLOO_SOCKET_IFNAME=<your_network_interface> to specify the network interface for the IP address.
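For example (a minimal sketch; the address and interface name are placeholders for your actual setup):

```console
export VLLM_HOST_IP=192.168.0.2
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
```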

Error near self.graph.replay()

If vLLM crashes and the error trace captures it somewhere around self.graph.replay() in vllm/worker/model_runner.py, it is a CUDA error inside CUDAGraph. To identify the particular CUDA operation that causes the error, you can add --enforce-eager to the command line, or enforce_eager=True to the {class}~vllm.LLM class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
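For example, in offline inference (a minimal sketch):

```python
from vllm import LLM

# Disable CUDAGraph capture so the failing CUDA operation surfaces directly.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
```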

(troubleshooting-incorrect-hardware-driver)=

Incorrect hardware/driver

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.

# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# this line is kept for backward compatibility, because people
# on older versions may still follow the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    pynccl.all_reduce(data, stream=s)
    value = data.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()

If you are testing with a single node, adjust --nproc-per-node to the number of GPUs you want to use:

NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py

If you are testing with multi-nodes, adjust --nproc-per-node and --nnodes according to your setup and set MASTER_ADDR to the correct IP address of the master node, reachable from all nodes. Then, run:

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py

If the script runs successfully, you should see the success messages printed above, ending with `vLLM NCCL with cuda graph is successful!`.

If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as export NCCL_P2P_DISABLE=1 to see if it helps. Please check their documentation for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.

A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:

- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.

Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.

(troubleshooting-python-multiprocessing)=

Python multiprocessing

RuntimeError Exception

If you have seen a warning in your logs like this:

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
    https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
    for more information.

or an error from Python that looks like this:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

then you must update your Python code to guard usage of vllm behind a if __name__ == '__main__': block. For example, instead of this:

import vllm

llm = vllm.LLM(...)

try this instead:

if __name__ == '__main__':
    import vllm

    llm = vllm.LLM(...)

Known Issues

  • In v0.5.2, v0.5.3, and v0.5.3.post1, there is a bug caused by zmq, which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of vLLM to include the fix.
  • To circumvent a NCCL bug, all vLLM processes will set the environment variable NCCL_CUMEM_ENABLE=0 to disable NCCL's cuMem allocator. Disabling it does not affect performance; the allocator only provides memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable (see the sketch after this list); otherwise, an inconsistent environment setup will cause NCCL to hang or crash, as observed in the RLHF integration and the discussion.
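A minimal sketch of what such an external process should set before initializing its NCCL connection to vLLM:

```console
export NCCL_CUMEM_ENABLE=0
```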

source/index.md

Welcome to vLLM

(Figure: vLLM logo)
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>

<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Supports NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support

For more information, check out the following:

Documentation

% How to start using vLLM?

:caption: Getting Started
:maxdepth: 1

getting_started/installation/index
getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq

% What does vLLM support?

:caption: Models
:maxdepth: 1

models/generative_models
models/pooling_models
models/supported_models
models/extensions/index

% Additional capabilities

:caption: Features
:maxdepth: 1

features/quantization/index
features/lora
features/tool_calling
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix

% Details about running vLLM

:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index

% Scaling up vLLM for production

:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index

% Making the most out of vLLM

:caption: Performance
:maxdepth: 1

performance/optimization
performance/benchmarks

% Explanation of vLLM internals

:caption: Design Documents
:maxdepth: 2

design/arch_overview
design/huggingface_integration
design/plugin_system
design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing

% How to contribute to the vLLM project

:caption: Developer Guide
:maxdepth: 2

contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management

% Technical API specifications

:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index

% Latest news and acknowledgements

:caption: Community
:maxdepth: 1

community/meetups
community/sponsors

Indices and tables

  • {ref}genindex
  • {ref}modindex

source/models/extensions/index.md

Built-in Extensions

:maxdepth: 1

runai_model_streamer
tensorizer

source/models/extensions/runai_model_streamer.md

(runai-model-streamer)=

Loading models with Run:ai Model Streamer

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

pip3 install vllm[runai]

To run it as an OpenAI-compatible server, add the --load-format runai_streamer flag:

vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer

To run a model from an AWS S3 object store, run:

vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer

To run a model from an S3-compatible object store, run:

RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer

Tunable parameters

You can tune parameters using --model-loader-extra-config:

You can tune `concurrency`, which controls the level of concurrency and the number of OS threads reading tensors from the file to the CPU buffer. For reading from S3, it is the number of client instances the host opens to the S3 server.

vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'

You can control and limit the size of the CPU memory buffer into which tensors are read from the file. You can read further about CPU buffer memory limiting here.

vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).

source/models/extensions/tensorizer.md

(tensorizer)=

Loading models with CoreWeave's Tensorizer

vLLM supports loading models with CoreWeave's Tensorizer. vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or an S3 endpoint can be deserialized at runtime extremely quickly, directly to the GPU, resulting in significantly shorter Pod startup times and lower CPU memory usage. Tensor encryption is also supported.

For more information on CoreWeave's Tensorizer, please refer to CoreWeave's Tensorizer documentation. For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see the vLLM example script.

Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.

source/models/generative_models.md

(generative-models)=

Generative Models

vLLM provides first-class support for generative models, which covers most LLMs.

In vLLM, generative models implement the {class}~vllm.model_executor.models.VllmModelForTextGeneration interface. Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through {class}~vllm.model_executor.layers.Sampler to obtain the final text.

For generative models, the only supported --task option is "generate". Usually, this is automatically inferred so you don't have to specify it.

Offline Inference

The {class}~vllm.LLM class provides various methods for offline inference. See Engine Arguments for a list of options when initializing the model.

LLM.generate

The {class}~vllm.LLM.generate method is available to all generative models in vLLM. It is similar to its counterpart in HF Transformers, except that tokenization and detokenization are also performed automatically.

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate("Hello, my name is")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

You can optionally control the language generation by passing {class}~vllm.SamplingParams. For example, you can use greedy sampling by setting temperature=0:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0)
outputs = llm.generate("Hello, my name is", params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found here: gh-file:examples/offline_inference/basic.py

LLM.beam_search

The {class}~vllm.LLM.beam_search method implements beam search on top of {class}~vllm.LLM.generate. For example, to search using 5 beams and output at most 50 tokens:

from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")
params = BeamSearchParams(beam_width=5, max_tokens=50)
outputs = llm.beam_search([{"prompt": "Hello, my name is"}], params)

for output in outputs:
    generated_text = output.sequences[0].text
    print(f"Generated text: {generated_text!r}")

LLM.chat

The {class}~vllm.LLM.chat method implements chat functionality on top of {class}~vllm.LLM.generate. In particular, it accepts input similar to OpenAI Chat Completions API and automatically applies the model's chat template to format the prompt.

In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to chat conversations.

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
outputs = llm.chat(conversation)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found here: gh-file:examples/offline_inference/chat.py

If the model doesn't have a chat template or you want to specify another one, you can explicitly pass a chat template:

from vllm.entrypoints.chat_utils import load_chat_template

# You can find a list of existing chat templates under `examples/`
custom_template = load_chat_template(chat_template="<path_to_template>")
print("Loaded chat template:", custom_template)

outputs = llm.chat(conversation, chat_template=custom_template)

Online Serving

Our OpenAI-Compatible Server provides endpoints that correspond to the offline APIs:
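For example, the Chat Completions endpoint corresponds to `LLM.chat`; a minimal sketch (the model name is just an illustration):

```bash
# Sketch: start the server, then call the OpenAI-compatible Chat Completions endpoint,
# which corresponds to the offline LLM.chat API.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
    }'
```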


source/models/pooling_models.md

(pooling-models)=

Pooling Models

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the {class}~vllm.model_executor.models.VllmModelForPooling interface. These models use a {class}~vllm.model_executor.layers.Pooler to extract the final hidden states of the input before returning them.

We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.

For pooling models, we support the following --task options. The selected option sets the default pooler used to extract the final hidden states:

:widths: 50 25 25 25
:header-rows: 1

* - Task
  - Pooling Type
  - Normalization
  - Softmax
* - Embedding (`embed`)
  - `LAST`
  - ✅︎
  - ✗
* - Classification (`classify`)
  - `LAST`
  - ✗
  - ✅︎
* - Sentence Pair Scoring (`score`)
  - \*
  - \*
  - \*
* - Reward Modeling (`reward`)
  - `ALL`
  - ✗
  - ✗

*The default pooler is always defined by the model.

If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.

When loading Sentence Transformers models, we attempt to override the default pooler based on its Sentence Transformers configuration file (modules.json).

You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
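As a sketch of the syntax (the model name is illustrative, and the `normalize` key is assumed to be accepted alongside `pooling_type`):

```bash
# Sketch: force MEAN pooling with normalization, overriding the model's default pooler.
vllm serve BAAI/bge-base-en-v1.5 --task embed \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": true}'
```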

Offline Inference

The {class}~vllm.LLM class provides various methods for offline inference. See Engine Arguments for a list of options when initializing the model.

LLM.encode

The {class}~vllm.LLM.encode method is available to all pooling models in vLLM. It returns the extracted hidden states directly, which is useful for reward models.

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")

LLM.embed

The {class}~vllm.LLM.embed method outputs an embedding vector for each prompt. It is primarily designed for embedding models.

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")

A code example can be found here: gh-file:examples/offline_inference/embedding.py

LLM.classify

The {class}~vllm.LLM.classify method outputs a probability vector for each prompt. It is primarily designed for classification models.

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")

A code example can be found here: gh-file:examples/offline_inference/classification.py

LLM.score

The {class}~vllm.LLM.score method outputs similarity scores between sentence pairs. It is primarily designed for cross-encoder models. These types of models serve as rerankers between candidate query-document pairs in RAG systems.

vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

score = output.outputs.score
print(f"Score: {score}")

A code example can be found here: gh-file:examples/offline_inference/scoring.py

Online Serving

Our OpenAI-Compatible Server provides endpoints that correspond to the offline APIs:
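For example, the Embeddings endpoint corresponds to `LLM.embed`; a minimal sketch:

```bash
# Sketch: serve an embedding model, then call the OpenAI-compatible Embeddings endpoint,
# which corresponds to the offline LLM.embed API.
vllm serve intfloat/e5-mistral-7b-instruct --task embed

curl http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "Hello, my name is"
    }'
```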


source/models/supported_models.md

(supported-models)=

List of Supported Models

vLLM supports generative and pooling models across various tasks. If a model supports more than one task, you can set the task via the --task argument.

For each task, we list the model architectures that have been implemented in vLLM. Alongside each architecture, we include some popular models that use it.

Loading a Model

HuggingFace Hub

By default, vLLM loads models from HuggingFace (HF) Hub.

To determine whether a given model is supported, you can check the config.json file inside the HF repository. If the "architectures" field contains a model architecture listed below, then it should be supported in theory.

The easiest way to check if your model is really supported at runtime is to run the program below:

```python
from vllm import LLM

# For generative models (task=generate) only
llm = LLM(model=..., task="generate")  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)

# For pooling models (task={embed,classify,reward,score}) only
llm = LLM(model=..., task="embed")  # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
```

If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.

Otherwise, please refer to Adding a New Model for instructions on how to implement your model in vLLM. Alternatively, you can open an issue on GitHub to request vLLM support.

ModelScope

To use models from ModelScope instead of HuggingFace Hub, set an environment variable:

export VLLM_USE_MODELSCOPE=True

Then, load the model with `trust_remote_code=True`:

from vllm import LLM

llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)

# For generative models (task=generate) only
output = llm.generate("Hello, my name is")
print(output)

# For pooling models (task={embed,classify,reward,score}) only
output = llm.encode("Hello, my name is")
print(output)

List of Text-only Language Models

Generative Models

See this page for more information on how to use generative models.

Text Generation (--task generate)

:widths: 25 25 50 5 5
:header-rows: 1

* - Architecture
  - Models
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `AquilaForCausalLM`
  - Aquila, Aquila2
  - `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.
  - ✅︎
  - ✅︎
* - `ArcticForCausalLM`
  - Arctic
  - `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc.
  -
  - ✅︎
* - `BaiChuanForCausalLM`
  - Baichuan2, Baichuan
  - `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
  - ✅︎
  - ✅︎
* - `BloomForCausalLM`
  - BLOOM, BLOOMZ, BLOOMChat
  - `bigscience/bloom`, `bigscience/bloomz`, etc.
  -
  - ✅︎
* - `BartForConditionalGeneration`
  - BART
  - `facebook/bart-base`, `facebook/bart-large-cnn`, etc.
  -
  -
* - `ChatGLMModel`
  - ChatGLM
  - `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.
  - ✅︎
  - ✅︎
* - `CohereForCausalLM`, `Cohere2ForCausalLM`
  - Command-R
  - `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc.
  - ✅︎
  - ✅︎
* - `DbrxForCausalLM`
  - DBRX
  - `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc.
  -
  - ✅︎
* - `DeciLMForCausalLM`
  - DeciLM
  - `Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.
  -
  - ✅︎
* - `DeepseekForCausalLM`
  - DeepSeek
  - `deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat` etc.
  -
  - ✅︎
* - `DeepseekV2ForCausalLM`
  - DeepSeek-V2
  - `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat` etc.
  -
  - ✅︎
* - `DeepseekV3ForCausalLM`
  - DeepSeek-V3
  - `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3` etc.
  -
  - ✅︎
* - `ExaoneForCausalLM`
  - EXAONE-3
  - `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
  - ✅︎
  - ✅︎
* - `FalconForCausalLM`
  - Falcon
  - `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.
  -
  - ✅︎
* - `FalconMambaForCausalLM`
  - FalconMamba
  - `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc.
  - ✅︎
  - ✅︎
* - `GemmaForCausalLM`
  - Gemma
  - `google/gemma-2b`, `google/gemma-7b`, etc.
  - ✅︎
  - ✅︎
* - `Gemma2ForCausalLM`
  - Gemma2
  - `google/gemma-2-9b`, `google/gemma-2-27b`, etc.
  - ✅︎
  - ✅︎
* - `GlmForCausalLM`
  - GLM-4
  - `THUDM/glm-4-9b-chat-hf`, etc.
  - ✅︎
  - ✅︎
* - `GPT2LMHeadModel`
  - GPT-2
  - `gpt2`, `gpt2-xl`, etc.
  -
  - ✅︎
* - `GPTBigCodeForCausalLM`
  - StarCoder, SantaCoder, WizardCoder
  - `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc.
  - ✅︎
  - ✅︎
* - `GPTJForCausalLM`
  - GPT-J
  - `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.
  -
  - ✅︎
* - `GPTNeoXForCausalLM`
  - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
  - `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.
  -
  - ✅︎
* - `GraniteForCausalLM`
  - Granite 3.0, Granite 3.1, PowerLM
  - `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc.
  - ✅︎
  - ✅︎
* - `GraniteMoeForCausalLM`
  - Granite 3.0 MoE, PowerMoE
  - `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc.
  - ✅︎
  - ✅︎
* - `GritLM`
  - GritLM
  - `parasail-ai/GritLM-7B-vllm`.
  - ✅︎
  - ✅︎
* - `InternLMForCausalLM`
  - InternLM
  - `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.
  - ✅︎
  - ✅︎
* - `InternLM2ForCausalLM`
  - InternLM2
  - `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.
  - ✅︎
  - ✅︎
* - `InternLM3ForCausalLM`
  - InternLM3
  - `internlm/internlm3-8b-instruct`, etc.
  - ✅︎
  - ✅︎
* - `JAISLMHeadModel`
  - Jais
  - `inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc.
  -
  - ✅︎
* - `JambaForCausalLM`
  - Jamba
  - `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc.
  - ✅︎
  - ✅︎
* - `LlamaForCausalLM`
  - Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
  - `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc.
  - ✅︎
  - ✅︎
* - `MambaForCausalLM`
  - Mamba
  - `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc.
  -
  - ✅︎
* - `MiniCPMForCausalLM`
  - MiniCPM
  - `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc.
  - ✅︎
  - ✅︎
* - `MiniCPM3ForCausalLM`
  - MiniCPM3
  - `openbmb/MiniCPM3-4B`, etc.
  - ✅︎
  - ✅︎
* - `MistralForCausalLM`
  - Mistral, Mistral-Instruct
  - `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.
  - ✅︎
  - ✅︎
* - `MixtralForCausalLM`
  - Mixtral-8x7B, Mixtral-8x7B-Instruct
  - `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.
  - ✅︎
  - ✅︎
* - `MPTForCausalLM`
  - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
  - `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc.
  -
  - ✅︎
* - `NemotronForCausalLM`
  - Nemotron-3, Nemotron-4, Minitron
  - `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
  - ✅︎
  - ✅︎
* - `OLMoForCausalLM`
  - OLMo
  - `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc.
  -
  - ✅︎
* - `OLMo2ForCausalLM`
  - OLMo2
  - `allenai/OLMo2-7B-1124`, etc.
  -
  - ✅︎
* - `OLMoEForCausalLM`
  - OLMoE
  - `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc.
  - ✅︎
  - ✅︎
* - `OPTForCausalLM`
  - OPT, OPT-IML
  - `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.
  -
  - ✅︎
* - `OrionForCausalLM`
  - Orion
  - `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc.
  -
  - ✅︎
* - `PhiForCausalLM`
  - Phi
  - `microsoft/phi-1_5`, `microsoft/phi-2`, etc.
  - ✅︎
  - ✅︎
* - `Phi3ForCausalLM`
  - Phi-3
  - `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.
  - ✅︎
  - ✅︎
* - `Phi3SmallForCausalLM`
  - Phi-3-Small
  - `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc.
  -
  - ✅︎
* - `PhiMoEForCausalLM`
  - Phi-3.5-MoE
  - `microsoft/Phi-3.5-MoE-instruct`, etc.
  - ✅︎
  - ✅︎
* - `PersimmonForCausalLM`
  - Persimmon
  - `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc.
  -
  - ✅︎
* - `QWenLMHeadModel`
  - Qwen
  - `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2ForCausalLM`
  - QwQ, Qwen2
  - `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2MoeForCausalLM`
  - Qwen2MoE
  - `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
  -
  - ✅︎
* - `StableLmForCausalLM`
  - StableLM
  - `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.
  -
  - ✅︎
* - `Starcoder2ForCausalLM`
  - Starcoder2
  - `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc.
  -
  - ✅︎
* - `SolarForCausalLM`
  - Solar Pro
  - `upstage/solar-pro-preview-instruct`, etc.
  - ✅︎
  - ✅︎
* - `TeleChat2ForCausalLM`
  - TeleChat2
  - `TeleAI/TeleChat2-3B`, `TeleAI/TeleChat2-7B`, `TeleAI/TeleChat2-35B`, etc.
  - ✅︎
  - ✅︎
* - `XverseForCausalLM`
  - XVERSE
  - `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
  - ✅︎
  - ✅︎
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

Pooling Models

See this page for more information on how to use pooling models.

Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (--task embed)

:widths: 25 25 50 5 5
:header-rows: 1

* - Architecture
  - Models
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `BertModel`
  - BERT-based
  - `BAAI/bge-base-en-v1.5`, etc.
  -
  -
* - `Gemma2Model`
  - Gemma2-based
  - `BAAI/bge-multilingual-gemma2`, etc.
  -
  - ✅︎
* - `GritLM`
  - GritLM
  - `parasail-ai/GritLM-7B-vllm`.
  - ✅︎
  - ✅︎
* - `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.
  - Llama-based
  - `intfloat/e5-mistral-7b-instruct`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2Model`, `Qwen2ForCausalLM`
  - Qwen2-based
  - `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
  - ✅︎
  - ✅︎
* - `RobertaModel`, `RobertaForMaskedLM`
  - RoBERTa-based
  - `sentence-transformers/all-roberta-large-v1`, etc.
  -
  -
* - `XLMRobertaModel`
  - XLM-RoBERTa-based
  - `intfloat/multilingual-e5-large`, etc.
  -
  -
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
You should manually set mean pooling by passing `--override-pooler-config '{"pooling_type": "MEAN"}'`.
Unlike base Qwen2, `Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention.
You can set `--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.

On the other hand, its 1.5B variant (`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card.

Regardless of the variant, you need to enable `--trust-remote-code` for the correct tokenizer to be
loaded. See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).

If your model is not in the above list, we will try to automatically convert the model using {func}~vllm.model_executor.models.adapters.as_embedding_model. By default, the embeddings of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
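For example (a sketch; the checkpoint is just an illustration of a generative model), the conversion is triggered simply by requesting the `embed` task:

```python
from vllm import LLM

# Sketch: a generative checkpoint is automatically wrapped via as_embedding_model
# when the embed task is requested; last-token embeddings are returned.
llm = LLM(model="gpt2", task="embed")
(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))
```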

Reward Modeling (--task reward)

:widths: 25 25 50 5 5
:header-rows: 1

* - Architecture
  - Models
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `InternLM2ForRewardModel`
  - InternLM2-based
  - `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc.
  - ✅︎
  - ✅︎
* - `LlamaForCausalLM`
  - Llama-based
  - `peiyi9979/math-shepherd-mistral-7b-prm`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2ForRewardModel`
  - Qwen2-based
  - `Qwen/Qwen2.5-Math-RM-72B`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2ForProcessRewardModel`
  - Qwen2-based
  - `Qwen/Qwen2.5-Math-PRM-7B`, `Qwen/Qwen2.5-Math-PRM-72B`, etc.
  - ✅︎
  - ✅︎

If your model is not in the above list, we will try to automatically convert the model using {func}~vllm.model_executor.models.adapters.as_reward_model. By default, we return the hidden states of each token directly.

For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.

Classification (--task classify)

:widths: 25 25 50 5 5
:header-rows: 1

* - Architecture
  - Models
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `JambaForSequenceClassification`
  - Jamba
  - `ai21labs/Jamba-tiny-reward-dev`, etc.
  - ✅︎
  - ✅︎
* - `Qwen2ForSequenceClassification`
  - Qwen2-based
  - `jason9693/Qwen2.5-1.5B-apeach`, etc.
  - ✅︎
  - ✅︎

If your model is not in the above list, we will try to automatically convert the model using {func}~vllm.model_executor.models.adapters.as_classification_model. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

Sentence Pair Scoring (--task score)

:widths: 25 25 50 5 5
:header-rows: 1

* - Architecture
  - Models
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `BertForSequenceClassification`
  - BERT-based
  - `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
  -
  -
* - `RobertaForSequenceClassification`
  - RoBERTa-based
  - `cross-encoder/quora-roberta-base`, etc.
  -
  -
* - `XLMRobertaForSequenceClassification`
  - XLM-RoBERTa-based
  - `BAAI/bge-reranker-v2-m3`, etc.
  -
  -

(supported-mm-models)=

List of Multimodal Language Models

The following modalities are supported depending on the model:

  • Text
  • Image
  • Video
  • Audio

Any combination of modalities joined by + are supported.

  • e.g.: T + I means that the model supports text-only, image-only, and text-with-image inputs.

On the other hand, modalities separated by / are mutually exclusive.

  • e.g.: T / I means that the model supports text-only and image-only inputs, but not text-with-image inputs.

See this page on how to pass multi-modal inputs to the model.

To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:

Offline inference:
```python
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)
```

Online serving:
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
```
vLLM currently only supports adding LoRA to the language backbone of multimodal models.

Generative Models

See this page for more information on how to use generative models.

Text Generation (--task generate)

:widths: 25 25 15 20 5 5 5
:header-rows: 1

* - Architecture
  - Models
  - Inputs
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
  - [V1](gh-issue:8779)
* - `AriaForConditionalGeneration`
  - Aria
  - T + I<sup>+</sup>
  - `rhymes-ai/Aria`
  -
  - ✅︎
  - ✅︎
* - `Blip2ForConditionalGeneration`
  - BLIP-2
  - T + I<sup>E</sup>
  - `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.
  -
  - ✅︎
  - ✅︎
* - `ChameleonForConditionalGeneration`
  - Chameleon
  - T + I
  - `facebook/chameleon-7b` etc.
  -
  - ✅︎
  - ✅︎
* - `DeepseekVLV2ForCausalLM`
  - DeepSeek-VL2
  - T + I<sup>+</sup>
  - `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note)
  -
  - ✅︎
  - ✅︎
* - `FuyuForCausalLM`
  - Fuyu
  - T + I
  - `adept/fuyu-8b` etc.
  -
  - ✅︎
  - ✅︎
* - `ChatGLMModel`
  - GLM-4V
  - T + I
  - `THUDM/glm-4v-9b` etc.
  - ✅︎
  - ✅︎
  -
* - `H2OVLChatModel`
  - H2OVL
  - T + I<sup>E+</sup>
  - `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
  -
  - ✅︎
  -
* - `Idefics3ForConditionalGeneration`
  - Idefics3
  - T + I
  - `HuggingFaceM4/Idefics3-8B-Llama3` etc.
  - ✅︎
  -
  -
* - `InternVLChatModel`
  - InternVL 2.5, Mono-InternVL, InternVL 2.0
  - T + I<sup>E+</sup>
  - `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
  -
  - ✅︎
  - ✅︎
* - `LlavaForConditionalGeneration`
  - LLaVA-1.5
  - T + I<sup>E+</sup>
  - `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
  -
  - ✅︎
  - ✅︎
* - `LlavaNextForConditionalGeneration`
  - LLaVA-NeXT
  - T + I<sup>E+</sup>
  - `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
  -
  - ✅︎
  - ✅︎
* - `LlavaNextVideoForConditionalGeneration`
  - LLaVA-NeXT-Video
  - T + V
  - `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
  -
  - ✅︎
  - ✅︎
* - `LlavaOnevisionForConditionalGeneration`
  - LLaVA-Onevision
  - T + I<sup>+</sup> + V<sup>+</sup>
  - `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
  -
  - ✅︎
  - ✅︎
* - `MiniCPMV`
  - MiniCPM-V
  - T + I<sup>E+</sup>
  - `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
  - ✅︎
  - ✅︎
  -
* - `MllamaForConditionalGeneration`
  - Llama 3.2
  - T + I<sup>+</sup>
  - `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
  -
  -
  -
* - `MolmoForCausalLM`
  - Molmo
  - T + I
  - `allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc.
  - ✅︎
  - ✅︎
  - ✅︎
* - `NVLM_D_Model`
  - NVLM-D 1.0
  - T + I<sup>E+</sup>
  - `nvidia/NVLM-D-72B`, etc.
  -
  - ✅︎
  - ✅︎
* - `PaliGemmaForConditionalGeneration`
  - PaliGemma, PaliGemma 2
  - T + I<sup>E</sup>
  - `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
  -
  - ✅︎
  -
* - `Phi3VForCausalLM`
  - Phi-3-Vision, Phi-3.5-Vision
  - T + I<sup>E+</sup>
  - `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.
  -
  - ✅︎
  - ✅︎
* - `PixtralForConditionalGeneration`
  - Pixtral
  - T + I<sup>+</sup>
  - `mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` (see note), etc.
  -
  - ✅︎
  - ✅︎
* - `QWenLMHeadModel`
  - Qwen-VL
  - T + I<sup>E+</sup>
  - `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
  - ✅︎
  - ✅︎
  -
* - `Qwen2AudioForConditionalGeneration`
  - Qwen2-Audio
  - T + A<sup>+</sup>
  - `Qwen/Qwen2-Audio-7B-Instruct`
  -
  - ✅︎
  - ✅︎
* - `Qwen2VLForConditionalGeneration`
  - QVQ, Qwen2-VL
  - T + I<sup>E+</sup> + V<sup>E+</sup>
  - `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
  - ✅︎
  - ✅︎
  - ✅︎
* - `UltravoxModel`
  - Ultravox
  - T + A<sup>E+</sup>
  - `fixie-ai/ultravox-v0_3`
  -
  - ✅︎
  - ✅︎

E Pre-computed embeddings can be inputted for this modality.
+ Multiple items can be inputted per text prompt for this modality.

To use `DeepSeek-VL2` series models, you have to pass `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` when running vLLM.
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
The chat template for Pixtral-HF is incorrect (see [discussion](https://huggingface.co/mistral-community/pixtral-12b/discussions/22)).
A corrected version is available at <gh-file:examples/template_pixtral_hf.jinja>.

Pooling Models

See this page for more information on how to use pooling models.

Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (--task embed)

Any text generation model can be converted into an embedding model by passing --task embed.

To get the best results, you should use pooling models that are specifically trained as such.

The following table lists those that are tested in vLLM.

:widths: 25 25 15 25 5 5
:header-rows: 1

* - Architecture
  - Models
  - Inputs
  - Example HF Models
  - [LoRA](#lora-adapter)
  - [PP](#distributed-serving)
* - `LlavaNextForConditionalGeneration`
  - LLaVA-NeXT-based
  - T / I
  - `royokong/e5-v`
  -
  - ✅︎
* - `Phi3VForCausalLM`
  - Phi-3-Vision-based
  - T + I
  - `TIGER-Lab/VLM2Vec-Full`
  - 🚧
  - ✅︎
* - `Qwen2VLForConditionalGeneration`
  - Qwen2-VL-based
  - T + I
  - `MrLight/dse-qwen2-2b-mrl-v1`
  -
  - ✅︎

Model Support Policy

At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:

  1. Community-Driven Support: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. Call for contribution: PRs coming directly from model vendors are greatly appreciated!

  2. Best-Effort Consistency: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.

    When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
    
  3. Issue Resolution and Model Updates: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.

  4. Monitoring and Updates: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.

  5. Selective Focus: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.

Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.

Note that, as an inference engine, vLLM does not introduce new models. Therefore, all models supported by vLLM are third-party models in this regard.

We have the following levels of testing for models:

  1. Strict Consistency: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to models tests for the models that have passed this test.
  2. Output Sensibility: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
  3. Runtime Functionality: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to functionality tests and examples for the models that have passed this test.
  4. Community Feedback: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.

source/performance/benchmarks.md

(benchmarks)=

Benchmark Suites

vLLM contains two sets of benchmarks:

(performance-benchmarks)=

Performance Benchmarks

The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the perf-benchmarks and ready labels, and when a PR is merged into vLLM.

The latest performance results are hosted on the public vLLM Performance Dashboard.

More information on the performance benchmarks and their parameters can be found here.

(nightly-benchmarks)=

Nightly Benchmarks

These compare vLLM's performance against alternatives (tgi, trt-llm, and lmdeploy) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the perf-benchmarks and nightly-benchmarks labels.

The latest nightly benchmark results are shared in major release blog posts such as vLLM v0.6.0.

More information on the nightly benchmarks and their parameters can be found here.


source/performance/optimization.md

(optimization-and-tuning)=

Optimization and Tuning

Preemption

Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests. vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes available again. When this occurs, the following warning is printed:

WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1

While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency. If you frequently encounter preemptions from the vLLM engine, consider the following actions:

  • Increase gpu_memory_utilization. vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
  • Decrease max_num_seqs or max_num_batched_tokens. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
  • Increase tensor_parallel_size. This approach shards model weights, so each GPU has more memory available for KV cache.

You can also monitor the number of preemption requests through the Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
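A rough sketch of the first two suggestions in offline mode (the values are placeholders, not recommendations):

```python
from vllm import LLM

# Sketch: give the KV cache more headroom and cap batch concurrency.
# Tune these placeholder values for your GPU and workload.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.95,  # default is 0.9
    max_num_seqs=128,
)
```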

(chunked-prefill)=

Chunked Prefill

vLLM supports an experimental feature called chunked prefill. Chunked prefill allows large prefills to be chunked into smaller chunks and batched together with decode requests.

You can enable the feature by specifying --enable-chunked-prefill in the command line or setting enable_chunked_prefill=True in the LLM constructor.

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
# Set max_num_batched_tokens to tune performance.
# NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)

By default, the vLLM scheduler prioritizes prefills and doesn't batch prefill and decode in the same batch. This policy optimizes the TTFT (time to first token), but incurs slower ITL (inter-token latency) and inefficient GPU utilization.

Once chunked prefill is enabled, the policy changes to prioritize decode requests. It batches all pending decode requests into the batch before scheduling any prefill. When there is available token budget (max_num_batched_tokens), it schedules pending prefills. If the last pending prefill request cannot fit into max_num_batched_tokens, it chunks it.

This policy has two benefits:

  • It improves ITL and token generation speed because decode requests are prioritized.
  • It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.

You can tune the performance by changing max_num_batched_tokens. By default, it is set to 2048. Smaller max_num_batched_tokens achieves better ITL because there are fewer prefills interrupting decodes. Higher max_num_batched_tokens achieves better TTFT as you can put more prefill tokens into the batch.

  • If max_num_batched_tokens is the same as max_model_len, that's almost equivalent to the default scheduling policy (except that it still prioritizes decodes).
  • Note that the default value (2048) of max_num_batched_tokens is optimized for ITL, and it may have lower throughput than the default scheduler.

We recommend you set max_num_batched_tokens > 2048 for throughput.
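When serving online, the same knobs are available as CLI flags; a sketch with an illustrative value:

```bash
# Sketch: enable chunked prefill and raise the token budget to favor throughput.
vllm serve meta-llama/Llama-2-7b-hf \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192
```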

See related papers for more details (https://arxiv.org/pdf/2401.08671 or https://arxiv.org/pdf/2308.16369).

Please try out this feature and let us know your feedback via GitHub issues!


source/serving/distributed_serving.md

(distributed-serving)=

Distributed Inference and Serving

How to decide the distributed inference strategy?

Before going into the details of distributed inference and serving, let's first clarify when to use distributed inference and what strategies are available. The common practice is:

  • Single GPU (no distributed inference): If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
  • Single-Node Multi-GPU (tensor parallel inference): If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
  • Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference): If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.

In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.

After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like # GPU blocks: 790. Multiply the number by 16 (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfactory, e.g. you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is enough.

There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
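A sketch of this edge case (3 GPUs on one node; `$MODEL` is a placeholder for a model that fits across them):

```bash
# Sketch: split the model by layers across 3 GPUs instead of sharding tensors.
vllm serve $MODEL \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 3
```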

Running vLLM on a single node

vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support Megatron-LM's tensor parallel algorithm. We manage the distributed runtime with either Ray or python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inferencing currently requires Ray.

Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured tensor_parallel_size, otherwise Ray will be used. This default can be overridden via the LLM class distributed_executor_backend argument or --distributed-executor-backend API server argument. Set it to mp for multiprocessing or ray for Ray. It's not required for Ray to be installed for the multiprocessing case.

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Franciso is a")

To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example, to run the API server on 4 GPUs:

 vllm serve facebook/opt-13b \
     --tensor-parallel-size 4

You can additionally specify --pipeline-parallel-size to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:

 vllm serve gpt2 \
     --tensor-parallel-size 4 \
     --pipeline-parallel-size 2

Running vLLM on multiple nodes

If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same docker configuration.

The first step is to start containers and organize them into a cluster. We have provided the helper script gh-file:examples/online_serving/run_cluster.sh to start the cluster. Please note that this script launches docker without the administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can add CAP_SYS_ADMIN to the docker container by using the --cap-add option in the docker run command.

Pick a node as the head node, and run the following command:

bash run_cluster.sh \
                vllm/vllm-openai \
                ip_of_head_node \
                --head \
                /path/to/the/huggingface/home/in/this/node

On the rest of the worker nodes, run the following command:

bash run_cluster.sh \
                vllm/vllm-openai \
                ip_of_head_node \
                --worker \
                /path/to/the/huggingface/home/in/this/node

Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster; any shell disconnect will terminate the cluster. In addition, please note that the argument ip_of_head_node should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of a worker node, which is not correct.

Then, on any node, use docker exec -it node /bin/bash to enter the container, and execute ray status to check the status of the Ray cluster. You should see the right number of nodes and GPUs.

After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:

 vllm serve /path/to/the/model/in/the/container \
     --tensor-parallel-size 8 \
     --pipeline-parallel-size 2

You can also use tensor parallelism without pipeline parallelism; just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:

vllm serve /path/to/the/model/in/the/container \
     --tensor-parallel-size 16

To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like --privileged -e NCCL_IB_HCA=mlx5 to the run_cluster.sh script. Please contact your system administrator for more information on how to set up the flags. One way to confirm whether InfiniBand is working is to run vLLM with the NCCL_DEBUG=TRACE environment variable set, e.g. NCCL_DEBUG=TRACE vllm serve ..., and check the logs for the NCCL version and the network used. If you find [send] via NET/Socket in the logs, it means NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find [send] via NET/IB/GDRDMA in the logs, it means NCCL uses InfiniBand with GPU-Direct RDMA, which is efficient.

After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes on the same node, not for the processes on the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.

Please make sure you downloaded the model to all the nodes (with the same path), or that the model is downloaded to some distributed file system that is accessible by all nodes.

When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.

source/serving/engine_args.md

(engine-args)=

Engine Arguments

Below, you can find an explanation of every engine argument for vLLM:

.. argparse::
    :module: vllm.engine.arg_utils
    :func: _engine_args_parser
    :prog: vllm serve
    :nodefaultconst:

Async Engine Arguments

Below are the additional arguments related to the asynchronous engine:

.. argparse::
    :module: vllm.engine.arg_utils
    :func: _async_engine_args_parser
    :prog: vllm serve
    :nodefaultconst:

source/serving/env_vars.md

Environment Variables

vLLM uses the following environment variables to configure the system:

Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition

source/serving/integrations/index.md

External Integrations

:maxdepth: 1

langchain
llamaindex

source/serving/integrations/langchain.md

(serving-langchain)=

LangChain

vLLM is also available via LangChain.

To install LangChain, run

pip install langchain langchain_community -q

To run inference on a single GPU or multiple GPUs, use the VLLM class from langchain.

from langchain_community.llms import VLLM

llm = VLLM(model="mosaicml/mpt-7b",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
           top_k=10,
           top_p=0.95,
           temperature=0.8,
           # tensor_parallel_size=... # for distributed inference
)

print(llm("What is the capital of France ?"))

Please refer to this Tutorial for more details.


source/serving/integrations/llamaindex.md

(serving-llamaindex)=

LlamaIndex

vLLM is also available via LlamaIndex.

To install LlamaIndex, run

pip install llama-index-llms-vllm -q

To run inference on a single GPU or multiple GPUs, use the Vllm class from llamaindex.

from llama_index.llms.vllm import Vllm

llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)

Please refer to this Tutorial for more details.


source/serving/metrics.md

Production Metrics

vLLM exposes a number of metrics that can be used to monitor the health of the system. These metrics are exposed via the /metrics endpoint on the vLLM OpenAI compatible API server.

You can start the server using Python, or using Docker:

vllm serve unsloth/Llama-3.2-1B-Instruct

Then query the endpoint to get the latest metrics from the server:

$ curl http://0.0.0.0:8000/metrics

# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...

The following metrics are exposed:

:end-before: end-metrics-definitions
:language: python
:start-after: begin-metrics-definitions

source/serving/multimodal_inputs.md

(multimodal-inputs)=

Multimodal Inputs

This page teaches you how to pass multi-modal inputs to multi-modal models in vLLM.

We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.

Offline Inference

To input multi-modal data, follow this schema in {class}vllm.inputs.PromptType:

  • prompt: The prompt should follow the format that is documented on HuggingFace.
  • multi_modal_data: This is a dictionary that follows the schema defined in {class}vllm.multimodal.inputs.MultiModalDataDict.

Image

You can pass a single image to the 'image' field of the multi-modal dictionary, as shown in the following examples:

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = PIL.Image.open(...)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
    [
        {
            "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_1},
        },
        {
            "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_2},
        }
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Full example: gh-file:examples/offline_inference/vision_language.py

To substitute multiple images inside the same text prompt, you can pass in a list of images instead:

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,  # Required to load Phi-3.5-vision
    max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
    limit_mm_per_prompt={"image": 2},  # The maximum number to accept
)

# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": [image1, image2]
    },
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Full example: gh-file:examples/offline_inference/vision_language_multi_image.py

Multi-image input can be extended to perform video captioning. We show this with Qwen2-VL as it supports videos:

# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
    ],
}
for i in range(len(video_frames)):
    base64_image = encode_image(video_frames[i]) # base64 encoding.
    new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
    message["content"].append(new_image)

# Perform inference and log output.
outputs = llm.chat([message])

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Video

You can pass a list of NumPy arrays directly to the 'video' field of the multi-modal dictionary instead of using multi-image input.

Full example: gh-file:examples/offline_inference/vision_language.py
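A minimal sketch of the shape of such a request, assuming the frames are stacked into a single NumPy array of shape (num_frames, height, width, 3); consult the full example above for the exact layout each model expects:

```python
import numpy as np
from vllm import LLM

llm = LLM("Qwen/Qwen2-VL-2B-Instruct")

# Refer to the HuggingFace repo for the correct prompt format
prompt = ...

# Placeholder: 8 dummy RGB frames; load real frames from a video decoder instead
video = np.zeros((8, 224, 224, 3), dtype=np.uint8)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"video": video},
})

for o in outputs:
    print(o.outputs[0].text)
```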

Audio

You can pass a tuple (array, sampling_rate) to the 'audio' field of the multi-modal dictionary.

Full example: gh-file:examples/offline_inference/audio_language.py
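A minimal sketch of the shape of such a request (the dummy one-second clip at 16 kHz is a placeholder, and the prompt format depends on the model):

```python
import numpy as np
from vllm import LLM

llm = LLM("Qwen/Qwen2-Audio-7B-Instruct")

# Refer to the HuggingFace repo for the correct prompt format
prompt = ...

# Placeholder: one second of silence at 16 kHz; load real audio (e.g. with librosa) instead
audio = np.zeros(16_000, dtype=np.float32)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"audio": (audio, 16_000)},
})

for o in outputs:
    print(o.outputs[0].text)
```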

Embedding

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, pass a tensor of shape (num_items, feature_size, hidden_size of LM) to the corresponding field of the multi-modal dictionary.

# Inference with image embeddings as input
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:

# Construct the prompt based on your model
prompt = ...

# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

# Qwen2-VL
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_grid_thw is needed to calculate positional encoding.
        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3),
    }
}

# MiniCPM-V
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_size_list is needed to calculate details of the sliced image.
        "image_size_list": [image.size for image in images],  # list of image sizes
    }
}

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": mm_data,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Online Serving

Our OpenAI-compatible server accepts multi-modal data via the Chat Completions API.

A chat template is **required** to use Chat Completions API.

Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>

Image

Image input is supported according to OpenAI Vision API. Here is a simple example using Phi-3.5-Vision.

First, launch the OpenAI-compatible server:

vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

Then, you can use the OpenAI client as follows:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What’s in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)

# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the animals in these images?"},
            {"type": "image_url", "image_url": {"url": image_url_duck}},
            {"type": "image_url", "image_url": {"url": image_url_lion}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)

Full example: gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py

Loading images from local file paths is also supported: specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, and pass the file path as `url` in the API request.

There is no need to place image placeholders in the text content of the API request, since they are already represented by the image content. In fact, you can interleave text and image content so that the images appear in the middle of the text.

By default, the timeout for fetching images through an HTTP URL is `5` seconds. You can override this by setting the environment variable:

```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```

Video

Instead of image_url, you can pass a video file via video_url. Here is a simple example using LLaVA-OneVision.

First, launch the OpenAI-compatible server:

vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192

Then, you can use the OpenAI client as follows:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"

# Must match the model name passed to `vllm serve` above
model = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"

## Use video url in the payload
chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role":
        "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this video?"
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": video_url
                },
            },
        ],
    }],
    model=model,
    max_completion_tokens=64,
)

result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from image url:", result)

Full example: gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py

By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable:

```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
```

Audio

Audio input is supported according to the OpenAI Audio API. Here is a simple example using Ultravox-v0.3.

First, launch the OpenAI-compatible server:

vllm serve fixie-ai/ultravox-v0_3

Then, you can use the OpenAI client as follows:

import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset

def encode_base64_content_from_url(content_url: str) -> str:
    """Encode a content retrieved from a remote url to base64 format."""

    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')

    return result

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Must match the model name passed to `vllm serve` above
model = "fixie-ai/ultravox-v0_3"

# Any audio format supported by librosa can be used
audio_url = AudioAsset("winning_call").url
audio_base64 = encode_base64_content_from_url(audio_url)

chat_completion_from_base64 = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this audio?"
            },
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio_base64,
                    "format": "wav"
                },
            },
        ],
    }],
    model=model,
    max_completion_tokens=64,
)

result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from input audio:", result)

Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url`:

chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this audio?"
            },
            {
                "type": "audio_url",
                "audio_url": {
                    "url": audio_url
                },
            },
        ],
    }],
    model=model,
    max_completion_tokens=64,
)

result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)

Full example: gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py

By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable:

```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```

Embedding

vLLM's Embeddings API is a superset of OpenAI's Embeddings API, where a list of chat messages can be passed instead of batched inputs. This enables multi-modal inputs to be passed to embedding models.

The schema of `messages` is exactly the same as in the Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.

Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images. Refer to the examples below for illustration.

Here is an end-to-end example using VLM2Vec. To serve the model:

vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
  --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed` to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>

Since this request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])

Below is another example, this time using the MrLight/dse-qwen2-2b-mrl-v1 model.

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
  --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja

Like with VLM2Vec, we have to explicitly pass `--task embed`.

Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>

Importantly, `MrLight/dse-qwen2-2b-mrl-v1` also requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.

Full example: gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py


source/serving/offline_inference.md

(offline-inference)=

Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the {class}`~vllm.LLM` class. To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM using the default configuration.

llm = LLM(model="facebook/opt-125m")

After initializing the LLM instance, you can perform model inference using various APIs. The available APIs depend on the type of model that is being run (e.g. `generate` and `chat` for generative models, `embed` and `score` for pooling models).

Please refer to the [API Reference](/api/offline_inference/index) for more details about each API.
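
For example, here is a minimal sketch of the `generate` API; the prompts and sampling settings are arbitrary.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Generate completions for a small batch of prompts.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    sampling_params,
)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```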

Configuration Options

This section lists the most common options for running the vLLM engine. For a full list, refer to the Engine Arguments page.

Model resolution

vLLM loads HuggingFace-compatible models by inspecting the architectures field in config.json of the model repository and finding the corresponding implementation that is registered to vLLM. Nevertheless, our model resolution may fail for the following reasons:

  • The config.json of the model repository lacks the architectures field.
  • Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
  • The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

In those cases, vLLM may throw an error like:

Traceback (most recent call last):
...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

or:

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]

:::{note} The above error is distinct from the following similar but different error:

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.

This error means that vLLM failed to import the model file. Usually, it is related to missing dependencies or outdated binaries in the vLLM build. Please read the logs carefully to determine the real cause of the error. :::

To fix this, explicitly specify the model architecture by passing config.json overrides to the hf_overrides option. For example:

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)

Our list of supported models shows the model architectures that are recognized by vLLM.

Reducing memory usage

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

Tensor Parallelism (TP)

Tensor parallelism (tensor_parallel_size option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)

To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`) before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
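
For example, to restrict vLLM to the first two GPUs (the script name below is just a placeholder):

```console
$ export CUDA_VISIBLE_DEVICES=0,1
$ python your_inference_script.py
```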

Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at Neural Magic) and used directly without extra configuration.

Dynamic quantization is also supported via the quantization option -- see here for more details.
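
For illustration, here is a hedged sketch of both approaches; the model names and quantization method are assumptions, and dynamic quantization requires hardware that supports the chosen method.

```python
from vllm import LLM

# Load a statically quantized (AWQ) checkpoint directly -- no extra configuration needed.
llm_awq = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ")

# Or request dynamic (online) quantization of an unquantized model via the `quantization` option.
llm_fp8 = LLM(model="facebook/opt-125m", quantization="fp8")
```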

Context length and batch size

You can further reduce memory usage by limiting the context length of the model (max_model_len option) and the maximum batch size (max_num_seqs option).

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)

Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options. Please refer to this guide for more details.


source/serving/openai_compatible_server.md

(openai-compatible-server)=

OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more!

You can start the server via the vllm serve command, or through Docker:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123

To call the server, you can use the official OpenAI Python client, or any other HTTP client.

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

Supported APIs

We currently support the following OpenAI APIs:

In addition, we have the following custom APIs:

(chat-template)=

Chat Template

In order for the language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can manually specify their chat template via the --chat-template parameter, using either the file path to the chat template or the template in string form. Without a chat template, the server will not be able to process chat, and all chat requests will error.

vllm serve <model> --chat-template ./path-to-chat-template.jinja

The vLLM community provides a set of chat templates for popular models. You can find them under the gh-dir:examples directory.

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies both a type and a text field. An example is provided below:

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
  ]
)

Most chat templates for LLMs expect the content field to be a string, but some newer models like meta-llama/Llama-Guard-3-1B expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support to detect this automatically; the result is logged as a string like "Detected the chat template content format to be...", and incoming requests are internally converted to match the detected format, which can be one of:

  • "string": A string.
    • Example: "Hello world"
  • "openai": A list of dictionaries, similar to OpenAI schema.
    • Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.
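
For example, to force the OpenAI-style content format for the model mentioned above:

```console
$ vllm serve meta-llama/Llama-Guard-3-1B --chat-template-content-format openai
```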

Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. In order to use them, you can pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are calling the HTTP API directly.

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
  ],
  extra_body={
    "guided_choice": ["positive", "negative"]
  }
)

Extra HTTP Headers

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

Note that enabling these headers can significantly impact performance at high QPS rates. For this reason, we recommend implementing HTTP headers at the router level (e.g. via Istio) rather than within the vLLM layer. See this PR for more details.

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
  ],
  extra_headers={
    "x-request-id": "sentiment-classification-00001",
  }
)
print(completion._request_id)

completion = client.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  prompt="A robot may not injure a human being",
  extra_headers={
    "x-request-id": "completion-test",
  }
)
print(completion._request_id)

CLI Reference

(vllm-serve)=

vllm serve

The vllm serve command is used to launch the OpenAI-compatible server.

:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve

Configuration file

You can load CLI arguments via a YAML config file. The argument names must be the long form of those outlined above.

For example:

# config.yaml

host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"

To use the above config file:

vllm serve SOME_MODEL --config config.yaml

If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
The order of priorities is `command line > config file values > defaults`.

API Reference

(completions-api)=

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: gh-file:examples/online_serving/openai_completion_client.py
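
For instance, a minimal request looks like this (assuming the server from the earlier example is running and serving this model):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    max_tokens=32,
)
print(completion.choices[0].text)
```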

Extra parameters

The following sampling parameters are supported.

:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params

The following extra parameters are supported:

:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params

(chat-api)=

Chat API

Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.

  • Note: image_url.detail parameter is not supported.

Code example: gh-file:examples/online_serving/openai_chat_completion_client.py

Extra parameters

The following sampling parameters are supported.

:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params

The following extra parameters are supported:

:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params

(embeddings-api)=

Embeddings API

Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.

If the model has a chat template, you can replace inputs with a list of messages (same schema as Chat API) which will be treated as a single prompt to the model.

This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.

Code example: gh-file:examples/online_serving/openai_embedding_client.py
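
As a minimal sketch (the model name here is an assumption; serve an embedding model with `--task embed` first):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

responses = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["Hello my name is", "The capital of France is Paris."],
)

for data in responses.data:
    print(len(data.embedding))  # dimensionality of each embedding vector
```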

Extra parameters

The following pooling parameters are supported.

:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params

The following extra parameters are supported by default:

:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params

For chat-like input (i.e. if messages is passed), these extra parameters are supported instead:

:language: python
:start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params

(tokenizer-api)=

Tokenizer API

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

  • /tokenize corresponds to calling tokenizer.encode().
  • /detokenize corresponds to calling tokenizer.decode().
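
A hedged sketch of calling these endpoints with the `requests` library; the request and response field names shown here are assumptions, so consult the API reference for the authoritative schema.

```python
import requests

base_url = "http://localhost:8000"
model = "NousResearch/Meta-Llama-3-8B-Instruct"  # assumed to be the served model

# /tokenize: text in, token IDs out.
tokenize = requests.post(f"{base_url}/tokenize",
                         json={"model": model, "prompt": "Hello, world!"})
tokenize.raise_for_status()
tokens = tokenize.json()["tokens"]
print("Token IDs:", tokens)

# /detokenize: token IDs in, text out.
detokenize = requests.post(f"{base_url}/detokenize",
                           json={"model": model, "tokens": tokens})
detokenize.raise_for_status()
print("Decoded text:", detokenize.json()["prompt"])
```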

(pooling-api)=

Pooling API

Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.

The input format is the same as Embeddings API, but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

Code example: gh-file:examples/online_serving/openai_pooling_client.py
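
A hedged sketch using the `requests` library; the endpoint path, model name, and response layout are assumptions, so refer to the code example above for authoritative usage.

```python
import requests

response = requests.post(
    "http://localhost:8000/pooling",  # assumed endpoint path
    json={
        "model": "intfloat/e5-mistral-7b-instruct",  # assumed: any served pooling model
        "input": "vLLM is great!",
    },
)
response.raise_for_status()
# Each item's data may be an arbitrary nested list rather than a flat list of floats.
print(response.json()["data"][0])
```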

(score-api)=

Score API

Our Score API applies a cross-encoder model to predict scores for sentence pairs. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.

You can find the documentation for these kinds of models at sbert.net.

Code example: gh-file:examples/online_serving/openai_cross_encoder_score.py

Single inference

You can pass a string to both text_1 and text_2, forming a single sentence pair.

Request:

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

Response:

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

Batch inference

You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs where each pair is built from text_1 and a string in text_2. The total number of pairs is len(text_2).

Request:

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response:

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

You can pass a list to both text_1 and text_2, forming multiple sentence pairs where each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).

Request:

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response:

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

Extra parameters

The following pooling parameters are supported.

:language: python
:start-after: begin-score-pooling-params
:end-before: end-score-pooling-params

The following extra parameters are supported:

:language: python
:start-after: begin-score-extra-params
:end-before: end-score-extra-params

source/serving/usage_stats.md

Usage Stats Collection

vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information, and will be publicly released for the community's benefit.

What data is collected?

The list of data collected by the latest version of vLLM can be found here: gh-file:vllm/usage/usage_lib.py

Here is an example as of v0.4.0:

{
  "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
  "provider": "GCP",
  "num_cpu": 24,
  "cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
  "cpu_family_model_stepping": "6,85,7",
  "total_memory": 101261135872,
  "architecture": "x86_64",
  "platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
  "gpu_count": 2,
  "gpu_type": "NVIDIA L4",
  "gpu_memory_per_device": 23580639232,
  "model_architecture": "OPTForCausalLM",
  "vllm_version": "0.3.2+cu123",
  "context": "LLM_CLASS",
  "log_time": 1711663373492490000,
  "source": "production",
  "dtype": "torch.float16",
  "tensor_parallel_size": 1,
  "block_size": 16,
  "gpu_memory_utilization": 0.9,
  "quantization": null,
  "kv_cache_dtype": "auto",
  "enable_lora": false,
  "enable_prefix_caching": false,
  "enforce_eager": false,
  "disable_custom_all_reduce": true
}

You can preview the collected data by running the following command:

tail ~/.config/vllm/usage_stats.json

Opting out

You can opt out of usage stats collection by setting the VLLM_NO_USAGE_STATS or DO_NOT_TRACK environment variable, or by creating a ~/.config/vllm/do_not_track file:

# Any of the following methods can disable usage stats collection
export VLLM_NO_USAGE_STATS=1
export DO_NOT_TRACK=1
mkdir -p ~/.config/vllm && touch ~/.config/vllm/do_not_track
