Building vLLM + PyTorch + Torchvision from source

Run vLLM in Distributed Mode with Ray

Prerequisites

A docker image with the vLLM server installed.

export DOCKER_IMAGE=docker.io/fxnlabs/vllm-openai
# or point at the upstream image for the latest release
# export DOCKER_IMAGE=docker.io/vllm/vllm-openai
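
Optionally, a quick pull confirms the tag resolves and the registry is reachable before wiring up the cluster:

docker pull $DOCKER_IMAGE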

Start Ray Head Node

# Pin the head to GPU 0; replace 192.168.196.97 with the head node's IP address.
docker run \
    --entrypoint /bin/bash \
    --network host \
    --ipc=host \
    --name vllm-ray-head \
    --gpus device=0 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c 'ray start --head --port 6379 --node-ip-address 192.168.196.97 --block --num-gpus 1'
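
Before starting workers, it is worth confirming the head container is up; the container name comes from the command above:

docker logs -f vllm-ray-head
# look for output indicating the Ray runtime has started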

Start Ray Worker Node

# Pin the worker to GPU 1; --address must point at the head node started above.
docker run \
    --entrypoint /bin/bash \
    --network host \
    --ipc=host \
    --name vllm-ray-worker \
    --gpus device=1 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c "ray start --address='192.168.196.97:6379' --block"

Run vLLM Server

docker exec vllm-ray-head bash -c 'python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8080 \
  --pipeline-parallel-size 2 \
  --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8' > server.log 2>&1
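
Once the model has loaded, the OpenAI-compatible API answers on port 8080. A minimal smoke test against the standard endpoints:

curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3-8B-Instruct-FP8", "prompt": "Hello,", "max_tokens": 16}'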

Debug vLLM Server

docker run -it --rm \
    --runtime=nvidia \
    --network host \
    --ipc=host \
    --gpus device=0 \
    --entrypoint /bin/bash \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    $DOCKER_IMAGE \
    -c 'python3 -m vllm.entrypoints.openai.api_server \
          --host 0.0.0.0 \
          --port 8080 \
          --pipeline-parallel-size 2 \
          --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8' > server.log 2>&1
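
While debugging, follow the log in another terminal; the server's /health endpoint returns 200 once the engine is ready:

tail -f server.log
curl -i http://localhost:8080/health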

Build vLLM + PyTorch + Torchvision

Build PyTorch

Build the PyTorch wheel with Docker (run from the root of a PyTorch source checkout)

export BUILD_TYPE=wheel-builder
export USE_BUILDX=1
export CUDA_VERSION=12.4.1  # Or whatever CUDA version you want
export CUDA_VERSION_SHORT=12.4  # Or whatever CUDA version you want
export PYTHON_VERSION=3.12.8
make -f docker.Makefile runtime-image
# Note the docker image
# e.g. docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime
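
To confirm the runtime image carries the expected build, print the torch and CUDA versions from inside it (substitute the image tag printed by your build; the tag below is the example above):

docker run --rm --entrypoint python3 \
  docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime \
  -c 'import torch; print(torch.__version__, torch.version.cuda)'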

Copy PyTorch wheel out of docker

docker run \
  -v ./dist:/dist \
  --entrypoint /bin/bash \
  docker.io/ae/pytorch:v2.5.1-1-g061a464786e-cuda12.4-cudnn9-runtime \
  -c 'cp ./dist/*.whl /dist'

Install PyTorch wheel

pip install dist/*.whl
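
A quick import check confirms the custom wheel is the one in use and that it sees a GPU (requires an NVIDIA driver on the host):

python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'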

Build Torchvision

# from the root of a torchvision source checkout, with the PyTorch wheel above installed
python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38

Install Torchvision wheel

pip install dist/*.whl
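
As with PyTorch, a one-liner confirms the torchvision wheel imports cleanly against the custom torch:

python3 -c 'import torchvision, torch; print(torchvision.__version__, torch.__version__)'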

Build vLLM

Copy wheels into ./wheels directory

# from the root of a vLLM source checkout
mkdir -p ./wheels
cp ../vision/dist/*.whl ./wheels
cp ../pytorch/dist/*.whl ./wheels
# ensure only one wheel of each package is in the wheels directory
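
Listing the directory makes it easy to verify there is exactly one wheel per package before building the image:

ls -1 ./wheels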

Build vLLM docker image

docker buildx build . \
  --target vllm-openai \
  --build-arg torch_cuda_arch_list="" \
  --build-arg max_jobs=16 \
  --build-arg nvcc_threads=2 \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag docker.io/fxnlabs/vllm-openai
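
As a final check, the freshly built image should import the bundled vLLM and the custom torch (tag taken from the build command above):

docker run --rm --entrypoint python3 docker.io/fxnlabs/vllm-openai \
  -c 'import vllm, torch; print(vllm.__version__, torch.__version__)'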