@rampadc
Last active February 2, 2025 04:03

NVIDIA GPU Debugging in MicroK8s

Running microk8s enable gpu did not start the GPU operator correctly on my machine. The steps below are what I used to find the problem and get it working.

1. Check NVIDIA container runtime installation

dpkg -l | grep nvidia-container

2. Verify NVIDIA container runtime setup

nvidia-container-cli info

3. Check GPU status in Kubernetes

microk8s kubectl describe node | grep -i gpu

4. List all NVIDIA-related pods

microk8s kubectl get pods -A | grep nvidia

5. Get logs from failing NVIDIA pods

microk8s kubectl logs -n gpu-operator-resources nvidia-container-toolkit-daemonset-XXXXX --all-containers
microk8s kubectl logs -n gpu-operator-resources nvidia-operator-validator-XXXXX --all-containers
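
The XXXXX suffixes above stand in for generated pod names. As a rough sketch, the names can be picked up automatically by filtering the pod list for anything that is not Running or Completed and dumping its logs. The helper names failing_pods and dump_failing_logs are hypothetical, and the default namespace is the one used in the commands above; pass a different one if your operator lives elsewhere.

```shell
# failing_pods: hypothetical helper. Reads `kubectl get pods --no-headers`
# output on stdin (columns: NAME READY STATUS RESTARTS AGE) and prints the
# names of pods whose STATUS is neither Running nor Completed.
failing_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# dump_failing_logs: hypothetical helper. Prints the last 50 log lines of
# every container in each failing pod of the given namespace.
dump_failing_logs() {
  ns="${1:-gpu-operator-resources}"
  microk8s kubectl get pods -n "$ns" --no-headers | failing_pods |
    while read -r pod; do
      echo "=== $pod ==="
      microk8s kubectl logs -n "$ns" "$pod" --all-containers --tail=50
    done
}

# Usage:
#   dump_failing_logs
#   dump_failing_logs some-other-namespace
```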

6. Check if NVIDIA device nodes exist

ls -l /dev/nvidia*
ls -l /dev/char/ | grep nvidia

7. Manually create missing symlinks if needed

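# 195 is the fixed major number for the NVIDIA driver devices
sudo ln -s /dev/nvidiactl /dev/char/195:255
sudo ln -s /dev/nvidia-modeset /dev/char/195:254
sudo ln -s /dev/nvidia0 /dev/char/195:0
# The nvidia-uvm major number is assigned dynamically; confirm 235 with: ls -l /dev/nvidia-uvm*
sudo ln -s /dev/nvidia-uvm /dev/char/235:0
sudo ln -s /dev/nvidia-uvm-tools /dev/char/235:1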
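
The major:minor pairs in step 7 are the ones from my machine. The 195 major is fixed for the NVIDIA driver, but the nvidia-uvm major is allocated dynamically, so 235 may differ elsewhere. A sketch that reads the numbers from the device nodes instead of hard-coding them, assuming GNU stat (as shipped on Ubuntu); dev_char_path and link_nvidia_devices are hypothetical helper names:

```shell
# dev_char_path: hypothetical helper. Prints the /dev/char/MAJOR:MINOR path
# that corresponds to the device node given as $1.
dev_char_path() {
  # GNU stat: %t and %T are the device major and minor in hexadecimal.
  major_hex="$(stat -c '%t' "$1")" || return 1
  minor_hex="$(stat -c '%T' "$1")" || return 1
  # printf converts the 0x-prefixed hex values to decimal.
  printf '/dev/char/%d:%d\n' "0x$major_hex" "0x$minor_hex"
}

# link_nvidia_devices: hypothetical helper. Creates (or replaces) a
# /dev/char symlink for every NVIDIA character device that exists.
link_nvidia_devices() {
  for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue   # skip non-devices such as /dev/nvidia-caps
    sudo ln -sf "$dev" "$(dev_char_path "$dev")"
  done
}

# Usage:
#   dev_char_path /dev/nvidia0     # preview the target path first
#   link_nvidia_devices
```

Using ln -sf here means rerunning the helper replaces stale links rather than failing on ones that already exist.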

8. Patch the validator to disable its /dev/char symlink creation (works around the bug)

microk8s kubectl patch clusterpolicy cluster-policy --type merge -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'

9. Check for image pull errors

microk8s kubectl describe pod -n gpu-operator-resources nvidia-dcgm-exporter-XXXXX | grep -i "image:"

10. Try pulling the image manually

sudo ctr -n=k8s.io image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

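MicroK8s uses its own bundled containerd as the container runtime. If the system-wide ctr cannot reach it, run the same pull through the microk8s ctr wrapper instead.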

11. Ensure the user has permissions for containerd

sudo usermod -aG microk8s $USER
newgrp microk8s

12. Verify GPU availability inside a pod

microk8s kubectl run gpu-test --rm -it --restart=Never --image=nvidia/cuda:12.2.0-runtime-ubuntu22.04

Once the shell opens, run nvidia-smi inside the pod. It should list the GPU.
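
The kubectl run command above does not request a GPU resource, so whether the container sees the device depends on how the runtime is configured. A sketch of an explicit test pod that requests one GPU through the nvidia.com/gpu resource (the name advertised by the NVIDIA device plugin) and runs nvidia-smi directly. The function name gpu_test_manifest is hypothetical; the image is the one from step 12. Some setups may also need a runtimeClassName, which is not shown here.

```shell
# gpu_test_manifest: hypothetical helper. Prints a pod manifest that
# requests one GPU and runs nvidia-smi once, then exits.
gpu_test_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage:
#   gpu_test_manifest | microk8s kubectl apply -f -
#   microk8s kubectl logs gpu-test        # nvidia-smi output, once the pod has run
#   microk8s kubectl delete pod gpu-test
```

If the pod stays Pending, microk8s kubectl describe pod gpu-test shows whether the scheduler could find a node advertising nvidia.com/gpu.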
