This is a short write-up explaining how to verify the fix for issue #200 reported against the NVIDIA device plugin: NVIDIA/k8s-device-plugin#200
The issue occurs when the NVIDIA device plugin is configured so that only privileged containers get access to all GPUs (rather than allowing unprivileged containers to access GPUs they did not request). A detailed write-up on this aspect is available here.
Issue #200 is specifically observed on IaaS clouds where VMs can be stopped and then restarted: any pods that had GPUs assigned may fail, because different physical GPUs can be attached to the VM after a restart. The device plugin previously enumerated GPUs to containers only by UUID (which is unique to each physical GPU), and those UUIDs change when a VM comes back with different GPUs attached. The fix adds a new option, deviceIDStrategy, that lets the plugin enumerate GPUs by device index (the same ids used with nvidia-smi -i) rather than by UUID.
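For reference, the mapping between a GPU's index and its UUID can be inspected directly with nvidia-smi (the query flags below are standard nvidia-smi options; indices are small stable ordinals such as 0, 1, ..., while UUIDs identify the physical GPU):
# list index and UUID for every GPU visible on the node
$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader
# equivalent listing in the "GPU <index>: <name> (UUID: ...)" form used below
$ nvidia-smi -L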
After setting up a K8s cluster, deploy the NVIDIA device plugin v0.8.2 using Helm.
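If the nvdp chart repository has not been added to the Helm client yet, it can be registered first (the repository URL is the one NVIDIA documents for the device plugin chart; the alias nvdp matches the chart reference used below):
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
Then install the 0.8.2 release of the plugin: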
$ helm install \
--version=0.8.2 \
--generate-name \
--set securityContext.privileged=true \
--set deviceListStrategy=volume-mounts \
nvdp/nvidia-device-plugin
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
nvidia-device-plugin-1614725384 default 1 2021-03-02 22:49:44.942016648 +0000 UTC deployed nvidia-device-plugin-0.8.2 0.8.2
Try running a simple container that runs an infinite loop calling nvidia-smi:
$ kubectl run nvidia-loop \
--image=dualvtable/nvidia-loop \
--limits=nvidia.com/gpu=1
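The exact contents of the dualvtable/nvidia-loop image are not shown here; a minimal equivalent entrypoint, assuming it simply keeps re-running nvidia-smi, might look like this:
#!/bin/sh
# Print the GPU(s) visible to the container, then keep the pod alive,
# periodically confirming that the GPU can still be queried.
nvidia-smi -L
while true; do
    nvidia-smi > /dev/null 2>&1 || exit 1   # let Kubernetes restart the pod on failure
    sleep 30
done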
After the VM has been stopped and restarted, we can see that the nvidia-loop pod has entered a CrashLoopBackOff state:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nvidia-loop 0/1 CrashLoopBackOff 1 5m3s
kube-system calico-kube-controllers-6949477b58-5r8q6 1/1 Running 3 50m
kube-system calico-node-jv5x9 1/1 Running 3 50m
$ kubectl describe po nvidia-loop
Name: nvidia-loop
Namespace: default
Priority: 0
Node: ip-172-31-30-232/172.31.30.232
Start Time: Tue, 02 Mar 2021 23:21:07 +0000
Labels: run=nvidia-loop
Annotations: cni.projectcalico.org/podIP: 192.168.228.219/32
cni.projectcalico.org/podIPs: 192.168.228.219/32
Status: Running
IP: 192.168.228.219
<snip>
Warning Failed 15s (x2 over 29s) kubelet Error: failed to start container "nvidia-loop": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-7d0b1464-cb0a-3394-b09d-af8ac70d175b: unknown device: unknown
The GPU allocated to the VM has changed since it was stopped and restarted:
$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-55ecfd38-cce7-e013-22a6-8d5e26b93b89)
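To make the mismatch explicit, the UUID the container was pinned to (taken from the kubelet error above) can be compared with the UUID of the GPU now attached to the VM (a small sketch using standard nvidia-smi query options):
# UUID recorded when the pod was first scheduled (from the error message above)
$ OLD=GPU-7d0b1464-cb0a-3394-b09d-af8ac70d175b
# UUID of the GPU attached after the restart
$ NEW=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)
$ [ "$OLD" = "$NEW" ] || echo "GPU UUID changed: $OLD -> $NEW"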
Now delete the 0.8.2 plugin deployment and install the latest release (v0.9.0), this time enumerating GPUs by device index rather than by UUID.
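The existing release can be removed with helm uninstall, using the release name reported by helm ls earlier (release names are generated, so yours will differ):
$ helm uninstall nvidia-device-plugin-1614725384
Then install v0.9.0, additionally setting deviceIDStrategy to index: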
$ helm install \
--version=0.9.0 \
--generate-name \
--set securityContext.privileged=true \
--set deviceListStrategy=volume-mounts \
--set deviceIDStrategy=index \
nvdp/nvidia-device-plugin
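To confirm which values the new release was installed with, helm get values can be run against the release name shown by helm ls (the name here corresponds to the nvidia-device-plugin-1614729844-* pod shown below and will differ per deployment); the output should list deviceIDStrategy: index among the user-supplied values:
$ helm get values nvidia-device-plugin-1614729844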
Stop and restart the VM so that a different GPU may be attached on boot.
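On AWS EC2, for example, the stop/start cycle can be driven from the CLI (a sketch; the instance ID is a placeholder for the VM backing the node):
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0
Once the VM is available again, all pods should return to a Running state, including the container that was assigned a GPU: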
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nvidia-loop 1/1 Running 0 4s
kube-system calico-kube-controllers-6949477b58-5r8q6 1/1 Running 3 90m
kube-system calico-node-jv5x9 1/1 Running 3 90m
kube-system coredns-74ff55c5b-bxrgh 1/1 Running 3 92m
kube-system coredns-74ff55c5b-zmcs2 1/1 Running 3 92m
kube-system etcd-ip-172-31-30-232 1/1 Running 4 92m
kube-system kube-apiserver-ip-172-31-30-232 1/1 Running 4 92m
kube-system kube-controller-manager-ip-172-31-30-232 1/1 Running 3 92m
kube-system kube-proxy-lqt42 1/1 Running 4 92m
kube-system kube-scheduler-ip-172-31-30-232 1/1 Running 3 92m
kube-system nvidia-device-plugin-1614729844-rgwvq 1/1 Running 0 2m9s
We can also verify that a different GPU was attached after the VM restarted:
$ kubectl logs nvidia-loop
GPU 0: Tesla T4 (UUID: GPU-bf1f9fb7-b55a-029d-c27d-7726529b207d)
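As a final check, the UUID reported inside the pod can be compared with the UUID of the GPU currently attached to the node (a small sketch; both commands are standard):
$ kubectl logs nvidia-loop | grep -o 'GPU-[0-9a-f-]*'
$ nvidia-smi --query-gpu=uuid --format=csv,noheader
Both should report the same UUID, confirming that the pod came back up against the newly attached GPU.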