NVIDIA GPU Debugging in MicroK8s

Running microk8s enable gpu did not start the GPU operator correctly on my machine. These are the steps I used to track down and fix the problem.

1. Check NVIDIA container runtime installation

dpkg -l | grep nvidia-container

2. Verify NVIDIA container runtime setup

nvidia-container-cli info

3. Check GPU status in Kubernetes

microk8s kubectl describe node | grep -i gpu
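
The grep above also matches every GPU-related label and annotation, which can bury the two lines that matter. As a rough sketch, the filter below keeps only the nvidia.com/gpu entries listed under Capacity and Allocatable. The function name gpu_resource_lines is made up for illustration, and the pattern assumes the usual kubectl describe node layout.

```shell
# Keep only the "nvidia.com/gpu: <count>" lines (Capacity and Allocatable)
# from `kubectl describe node` output read on stdin.
# gpu_resource_lines is a hypothetical helper name, not part of MicroK8s.
gpu_resource_lines() {
  grep -E '^[[:space:]]+nvidia\.com/gpu:[[:space:]]' |
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+/ /g'
}

# Only query the cluster when microk8s is actually installed.
if command -v microk8s >/dev/null 2>&1; then
  microk8s kubectl describe node | gpu_resource_lines
fi
```

If this prints nothing, the node is not advertising any nvidia.com/gpu resource yet, which usually points at the device plugin or one of the pods it depends on.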

4. List all NVIDIA-related pods

microk8s kubectl get pods -A | grep nvidia

5. Get logs from failing NVIDIA pods

microk8s kubectl logs -n gpu-operator-resources nvidia-container-toolkit-daemonset-XXXXX --all-containers
microk8s kubectl logs -n gpu-operator-resources nvidia-operator-validator-XXXXX --all-containers
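
Filling in the XXXXX suffixes by hand gets tedious when several pods are failing. A minimal sketch that collects logs from every pod in the namespace that is not Running or Completed; not_ready_pods is a hypothetical helper name, and the --tail=50 limit is an arbitrary choice.

```shell
# Namespace taken from the commands above.
NS=gpu-operator-resources

# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods whose STATUS column is neither Running nor Completed.
# not_ready_pods is a hypothetical helper name.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Only query the cluster when microk8s is actually installed.
if command -v microk8s >/dev/null 2>&1; then
  microk8s kubectl get pods -n "$NS" --no-headers | not_ready_pods |
  while read -r pod; do
    echo "=== $pod ==="
    # --tail keeps the output short enough to scan
    microk8s kubectl logs -n "$NS" "$pod" --all-containers --tail=50
  done
fi
```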

6. Check if NVIDIA device nodes exist

ls -l /dev/nvidia*
ls -l /dev/char/ | grep nvidia

7. Manually create missing symlinks if needed

sudo ln -s /dev/nvidiactl /dev/char/195:255
sudo ln -s /dev/nvidia-modeset /dev/char/195:254
sudo ln -s /dev/nvidia0 /dev/char/195:0
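# The nvidia-uvm major number (235 here) is assigned dynamically; confirm it with ls -l /dev/nvidia-uvm
sudo ln -s /dev/nvidia-uvm /dev/char/235:0
sudo ln -s /dev/nvidia-uvm-tools /dev/char/235:1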
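
Hardcoded device numbers may not match another machine, so it can be safer to read them from the device nodes themselves. The sketch below is a dry run: it only prints the ln commands for NVIDIA character devices whose /dev/char entry is missing. dev_char_path is a hypothetical helper, and it assumes GNU stat.

```shell
# Print the /dev/char/<major>:<minor> path that matches a character device.
# dev_char_path is a hypothetical helper; it relies on GNU stat (-c).
dev_char_path() {
  # %t and %T are the major and minor device numbers in hex
  maj=$(stat -c '%t' "$1")
  min=$(stat -c '%T' "$1")
  printf '/dev/char/%d:%d\n' "0x$maj" "0x$min"
}

# Dry run: print the ln command for every NVIDIA character device whose
# /dev/char entry is missing, without changing anything.
for dev in /dev/nvidia*; do
  [ -c "$dev" ] || continue          # skips directories and an unmatched glob
  link=$(dev_char_path "$dev")
  [ -e "$link" ] || echo "sudo ln -s $dev $link"
done
```

Review the printed commands before running them.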

8. Patch the validator to work around the symlink creation bug

microk8s kubectl patch clusterpolicy cluster-policy --type merge -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'

9. Check for image pull errors

microk8s kubectl describe pod -n gpu-operator-resources nvidia-dcgm-exporter-XXXXX | grep -i "image:"

10. Try pulling the image manually

sudo ctr -n=k8s.io image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

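Note: MicroK8s uses its own bundled containerd, so the pull may need to go through its ctr wrapper instead:

microk8s ctr image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04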

11. Ensure the user has permissions for containerd

sudo usermod -aG microk8s $USER
newgrp microk8s
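
newgrp only affects the shell it starts, so other open terminals keep their old group list until the next login. A small sketch for checking whether the current shell has picked up the group; shell_has_group is a hypothetical helper name.

```shell
# Succeed if the current shell's group list contains group $1.
# `id -nG` without a user name reports the groups of the running process,
# so it reflects what this shell can actually use.
# shell_has_group is a hypothetical helper name.
shell_has_group() {
  id -nG | tr ' ' '\n' | grep -qx "$1"
}

if shell_has_group microk8s; then
  echo "this shell is in the microk8s group"
else
  echo "this shell is NOT in the microk8s group yet (log out and back in, or run newgrp microk8s)"
fi
```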

12. Verify GPU availability inside a pod

microk8s kubectl run gpu-test --rm -it --restart=Never --image=nvidia/cuda:12.2.0-runtime-ubuntu22.04

Run nvidia-smi inside the container; it should list the GPU.
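
The kubectl run command above does not request a GPU from the scheduler. As a complementary check, here is a sketch of a one-shot pod that asks for one nvidia.com/gpu and runs nvidia-smi as its command. The pod name gpu-smi and the helper gpu_test_manifest are arbitrary; the image is the same one used above.

```shell
# Print a one-shot pod manifest that requests one GPU and runs nvidia-smi.
# The pod name gpu-smi is arbitrary; the image is the one used above.
gpu_test_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Only talk to the cluster when microk8s is actually installed.
if command -v microk8s >/dev/null 2>&1; then
  gpu_test_manifest | microk8s kubectl apply -f -
  # Once the pod has finished: microk8s kubectl logs gpu-smi
fi
```

If the pod stays Pending, the node is probably not advertising nvidia.com/gpu yet. Clean up afterwards with microk8s kubectl delete pod gpu-smi.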

rampadc/microk8s nvidia debugging.md