This is a guide to using GPUs on your local machine with SkyPilot. It sets up a local Kubernetes cluster with kind (Kubernetes in Docker) so that SkyPilot's Kubernetes support can run GPU workloads on your machine.
Inspired by Klueska's comment and Sam Stoelinga's blog post.
Prerequisites:

- Docker
- SkyPilot
- NVIDIA Container Toolkit. If not installed, follow the guide below.
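A quick sanity check before starting (each command just prints a version; `nvidia-smi` additionally assumes the host NVIDIA driver is installed):

```bash
docker --version   # Docker installed
sky --version      # SkyPilot installed
nvidia-smi         # NVIDIA driver working on the host
```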
Follow the official install docs:
- Configure the repository:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
- Install the NVIDIA Container Toolkit packages:
```bash
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```
- Configure NVIDIA to be the default runtime for docker:
```bash
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
```
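At this point you can optionally verify that Docker can reach the GPU. A minimal sketch; the CUDA image tag is an assumption, so substitute any CUDA base image you have:

```bash
# Confirm nvidia is now the default runtime.
docker info | grep -i 'default runtime'

# Run nvidia-smi in a container; the image tag is an assumption.
docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```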
- Set `accept-nvidia-visible-devices-as-volume-mounts = true` in `/etc/nvidia-container-runtime/config.toml`:

```bash
sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
```
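This setting lets the NVIDIA runtime inject GPUs based on volume mounts (the `/var/run/nvidia-container-devices/...` mount used in the kind config below) rather than the `NVIDIA_VISIBLE_DEVICES` environment variable. A quick check that the change took effect:

```bash
# Should print: accept-nvidia-visible-devices-as-volume-mounts = true
grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml
```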
- Create a kind cluster:
```bash
kind create cluster --name skypilot --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  # required for GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF
```
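Once kind finishes, a quick check that the cluster is up (assuming kubectl is installed; kind names the context `kind-skypilot`):

```bash
kubectl cluster-info --context kind-skypilot
# The control-plane node should eventually report STATUS Ready.
kubectl get nodes
```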
- Apply a patch for the missing `ldconfig.real`:

```bash
# https://github.com/NVIDIA/nvidia-docker/issues/614#issuecomment-423991632
docker exec -ti skypilot-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real
```
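An optional check that the symlink was created:

```bash
# Should print something like: /sbin/ldconfig.real -> /sbin/ldconfig
docker exec skypilot-control-plane ls -l /sbin/ldconfig.real
```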
- Install the NVIDIA GPU operator:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator --set driver.enabled=false
```
- Wait a bit for the GPU operator to get installed. Check the status with `kubectl get pods -A` and make sure all pods in the `gpu-operator` namespace are running; this may take a couple of minutes. A scripted wait is sketched below.
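If you prefer a scripted wait over polling, something like this can work (a sketch; the timeout is an arbitrary choice):

```bash
# Block until all gpu-operator pods report Ready (up to 10 minutes).
# Note: pods of completed Jobs (e.g. validators) never become Ready,
# so this may time out even on a healthy install; fall back to
# watching `kubectl get pods -n gpu-operator` if so.
kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=600s
```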
- Verify the GPU operator is installed correctly by running `kubectl describe nodes | grep nvidia.com/gpu` and make sure the output is similar to the following:

```
nvidia.com/gpu: 1
nvidia.com/gpu: 1
```
- Run the SkyPilot GPU labeling script to label nodes with GPUs:

```bash
python -m sky.utils.kubernetes.gpu_labeler
```
- Wait for the labeling jobs to complete. To check the status of the GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`; a scripted wait is sketched below.
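A scripted alternative to polling the job status (a sketch; the timeout is an arbitrary choice):

```bash
# Block until the GPU labeler jobs complete (up to 5 minutes).
kubectl wait --for=condition=complete jobs -n kube-system -l job=sky-gpu-labeler --timeout=300s
```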
- Run `sky check`. This should show `Kubernetes: enabled` without any warnings.
- You're ready to go! Run `sky show-gpus --cloud kubernetes` to see the GPUs available on your local machine:

```
(base) gcpuser@ray-test-2ea4-head-fcdc6cbf-compute:~$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1, 2
```
You should then be able to run SkyPilot commands as usual, e.g.:
```bash
sky launch -c test --cloud kubernetes --gpus T4:1 -- nvidia-smi
```
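The same launch can also be expressed as a task YAML, which is handy once jobs grow beyond a one-liner. A minimal sketch; the file name `task.yaml` is arbitrary, and the fields follow SkyPilot's standard task format:

```bash
# Hypothetical task.yaml equivalent to the CLI command above.
cat > task.yaml <<EOF
resources:
  cloud: kubernetes
  accelerators: T4:1

run: |
  nvidia-smi
EOF

sky launch -c test task.yaml
```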