@romilbhardwaj
Last active December 21, 2023 02:56
Using local GPUs with SkyPilot + Kubernetes

This is a guide to using the GPUs on your local machine with SkyPilot. It sets up a local Kubernetes cluster (using kind) so you can run SkyPilot jobs through its Kubernetes support.

Inspired by Klueska's comment and Sam Stoelinga's blog post.

Prerequisites

Install the NVIDIA container toolkit

Follow the official install docs:

  1. Configure the repository:

     ```bash
     curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
       && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
     ```

  2. Install the NVIDIA Container Toolkit packages:

     ```bash
     sudo apt-get update
     sudo apt-get install -y nvidia-container-toolkit
     ```

  3. Configure NVIDIA as the default runtime for Docker:

     ```bash
     sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
     sudo systemctl restart docker
     ```
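Before moving on, it can help to confirm that Docker now defaults to the NVIDIA runtime. The sketch below greps an illustrative copy of `daemon.json`; the `/tmp` path and sample contents are assumptions for illustration only, so on a real host inspect `/etc/docker/daemon.json` directly:

```shell
# Sketch: after `nvidia-ctk runtime configure --set-as-default`,
# /etc/docker/daemon.json should contain "default-runtime": "nvidia".
# The file written here is an illustrative sample, not your real config.
cat > /tmp/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF

# On a real host, point this grep at /etc/docker/daemon.json instead:
grep -q '"default-runtime": "nvidia"' /tmp/daemon.json && echo "nvidia is the default runtime"
```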

Creating your local Kubernetes cluster with GPUs

  1. Set `accept-nvidia-visible-devices-as-volume-mounts = true` in /etc/nvidia-container-runtime/config.toml:

     ```bash
     sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
     ```
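To see what that sed edit does, here is the same command applied to a one-line sample file (the `/tmp` path and sample line are illustrative; the real command targets `/etc/nvidia-container-runtime/config.toml`):

```shell
# Sketch: the sed `c\` command replaces any line mentioning the key
# (commented out or set to false) with the enabled form.
printf '#accept-nvidia-visible-devices-as-volume-mounts = false\n' > /tmp/config-sample.toml
sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /tmp/config-sample.toml
cat /tmp/config-sample.toml
# -> accept-nvidia-visible-devices-as-volume-mounts = true
```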
  2. Create a kind cluster:

     ```bash
     kind create cluster --name skypilot --config - <<EOF
     apiVersion: kind.x-k8s.io/v1alpha4
     kind: Cluster
     nodes:
     - role: control-plane
       image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
       # required for GPU workaround
       extraMounts:
         - hostPath: /dev/null
           containerPath: /var/run/nvidia-container-devices/all
     EOF
     ```

  3. Run a patch for the missing ldconfig.real:

     ```bash
     # https://github.com/NVIDIA/nvidia-docker/issues/614#issuecomment-423991632
     docker exec -ti skypilot-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real
     ```

  4. Install the NVIDIA GPU operator:

     ```bash
     helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
     helm repo update
     helm install --wait --generate-name \
          -n gpu-operator --create-namespace \
          nvidia/gpu-operator --set driver.enabled=false
     ```
  5. Wait for the GPU operator to finish installing. Check status with `kubectl get pods -A` and make sure all pods in the gpu-operator namespace are running. This may take a couple of minutes.

  6. Verify the GPU operator is installed correctly by running `kubectl describe nodes | grep nvidia.com/gpu` and checking that the output is similar to the following:

     ```
     nvidia.com/gpu:  1
     nvidia.com/gpu:  1
     ```

  7. Run the SkyPilot GPU labeling script to label nodes with GPUs:

     ```bash
     python -m sky.utils.kubernetes.gpu_labeler
     ```
  8. Wait for the labeling jobs to complete. To check their status, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`.

  9. Run `sky check`. This should show `Kubernetes: enabled` without any warnings.

  10. You're ready to go! Run `sky show-gpus --cloud kubernetes` to see the GPUs available on your local machine:

```
(base) gcpuser@ray-test-2ea4-head-fcdc6cbf-compute:~$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1, 2
```

You should then be able to run SkyPilot commands as usual, e.g.:

```bash
sky launch -c test --cloud kubernetes --gpus T4:1 -- nvidia-smi
```
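For longer jobs, the same resources can be declared in a SkyPilot task YAML. A minimal sketch (the filename `task.yaml` is illustrative):

```yaml
# task.yaml -- minimal task pinned to the local Kubernetes cluster
resources:
  cloud: kubernetes
  accelerators: T4:1

run: |
  nvidia-smi
```

Launch it with `sky launch -c test task.yaml`.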