Building vLLM + PyTorch + Torchvision from source

Run vLLM in Distributed Mode with Ray

Prerequisites

A docker image with the vLLM server installed.

export DOCKER_IMAGE=docker.io/fxnlabs/vllm-openai
# or point at the upstream image for the latest release
# export DOCKER_IMAGE=docker.io/vllm/vllm-openai
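
Optionally, a quick pull confirms the tag resolves and the registry is reachable before wiring up the cluster:

docker pull $DOCKER_IMAGE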

Start Ray Head Node

# Pin the head to GPU 0; replace 192.168.196.97 with the head node's IP address.
docker run \
    --entrypoint /bin/bash \
    --network host \
    --ipc=host \
    --name vllm-ray-head \
    --gpus device=0 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c 'ray start --head --port 6379 --node-ip-address 192.168.196.97 --block --num-gpus 1'
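
Before starting workers, it is worth confirming the head container is up; the container name comes from the command above:

docker logs -f vllm-ray-head
# look for output indicating the Ray runtime has started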

Start Ray Worker Node

# Pin the worker to GPU 1; --address must point at the head node started above.
docker run \
    --entrypoint /bin/bash \
    --network host \
    --ipc=host \
    --name vllm-ray-worker \
    --gpus device=1 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c "ray start --address='192.168.196.97:6379' --block"

Run vLLM Server

docker exec vllm-ray-head bash -c 'python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8080 \
  --pipeline-parallel-size 2 \
  --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8' > server.log 2>&1
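
Once the model has loaded, the OpenAI-compatible API answers on port 8080. A minimal smoke test against the standard endpoints:

curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3-8B-Instruct-FP8", "prompt": "Hello,", "max_tokens": 16}'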

Debug vLLM Server

docker run -it --rm \
    --runtime=nvidia \
    --network host \
    --ipc=host \
    --gpus device=0 \
    --entrypoint /bin/bash \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c 'python3 -m vllm.entrypoints.openai.api_server \
          --host 0.0.0.0 \
          --port 8080 \
          --pipeline-parallel-size 2 \
          --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8' > server.log 2>&1
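
While debugging, follow the log in another terminal; the server's /health endpoint returns 200 once the engine is ready:

tail -f server.log
curl -i http://localhost:8080/health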

Build vLLM + PyTorch + Torchvision

Build PyTorch

Build the PyTorch wheel with Docker (run from the root of a PyTorch source checkout)

export BUILD_TYPE=wheel-builder
export USE_BUILDX=1
export CUDA_VERSION=12.4.1  # Or whatever CUDA version you want
export CUDA_VERSION_SHORT=12.4  # Or whatever CUDA version you want
export PYTHON_VERSION=3.12.8
make -f docker.Makefile runtime-image
# Note the docker image
# e.g. docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime
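
To confirm the runtime image carries the expected build, print the torch and CUDA versions from inside it (substitute the image tag printed by your build; the tag below is the example above):

docker run --rm --entrypoint python3 \
  docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime \
  -c 'import torch; print(torch.__version__, torch.version.cuda)'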

Copy PyTorch wheel out of docker

docker run \
  -v ./dist:/dist \
  --entrypoint /bin/bash \
  docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime \
  -c 'cp ./dist/*.whl /dist'

Install PyTorch wheel

pip install dist/*.whl
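
A quick import check confirms the custom wheel is the one in use and that it sees a GPU (requires an NVIDIA driver on the host):

python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'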

Build Torchvision

# from the root of a torchvision source checkout, with the PyTorch wheel above installed
python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38

Install Torchvision wheel

pip install dist/*.whl
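
As with PyTorch, a one-liner confirms the torchvision wheel imports cleanly against the custom torch:

python3 -c 'import torchvision, torch; print(torchvision.__version__, torch.__version__)'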

Build vLLM

Copy wheels into ./wheels directory

# from the root of a vLLM source checkout
mkdir -p ./wheels
cp ../vision/dist/*.whl ./wheels
cp ../pytorch/dist/*.whl ./wheels
# ensure only one wheel of each package is in the wheels directory
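
Listing the directory makes it easy to verify there is exactly one wheel per package before building the image:

ls -1 ./wheels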

Build vLLM docker image

docker buildx build . \
  --target vllm-openai \
  --build-arg torch_cuda_arch_list="" \
  --build-arg max_jobs=16 \
  --build-arg nvcc_threads=2 \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag docker.io/fxnlabs/vllm-openai
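
As a final check, the freshly built image should import the bundled vLLM and the custom torch (tag taken from the build command above):

docker run --rm --entrypoint python3 docker.io/fxnlabs/vllm-openai \
  -c 'import vllm, torch; print(vllm.__version__, torch.__version__)'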