This is a short write-up explaining how to verify the fix for issue #200 reported against the NVIDIA device plugin: NVIDIA/k8s-device-plugin#200
The issue occurs when the NVIDIA device plugin is configured so that only privileged containers get access to all GPUs (rather than allowing unprivileged containers to access GPUs they did not request). A detailed write-up on this aspect is available here.
Issue #200 is specifically observed on IaaS clouds where VMs can be stopped and then restarted: any pods that had GPUs assigned may fail, because different physical GPUs can be attached to the VM after a restart. The device plugin previously enumerated GPUs to containers only by UUID (which is unique to each physical GPU), and those UUIDs change when a VM comes back with different GPUs attached. The fix adds a new option, deviceIDStrategy, that lets the plugin enumerate GPUs by device index (the same ids used with nvidia-smi -i) rather than by UUID.
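For reference, the mapping between a GPU's index and its UUID can be inspected directly with nvidia-smi (the query flags below are standard nvidia-smi options; indices are small stable ordinals such as 0, 1, ..., while UUIDs identify the physical GPU):
# list index and UUID for every GPU visible on the node
$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader
# equivalent listing in the "GPU <index>: <name> (UUID: ...)" form used below
$ nvidia-smi -L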
After setting up a K8s cluster, deploy the NVIDIA device plugin v0.8.2 using Helm.
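If the nvdp chart repository has not been added to the Helm client yet, it can be registered first (the repository URL is the one NVIDIA documents for the device plugin chart; the alias nvdp matches the chart reference used below):
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
Then install the 0.8.2 release of the plugin: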
$ helm install \
--version=0.8.2 \
--generate-name \
--set securityContext.privileged=true \
--set deviceListStrategy=volume-mounts \
nvdp/nvidia-device-plugin
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
nvidia-device-plugin-1614725384 default 1 2021-03-02 22:49:44.942016648 +0000 UTC deployed nvidia-device-plugin-0.8.2 0.8.2
Try running a simple container that runs an infinite loop calling nvidia-smi:
$ kubectl run nvidia-loop \
--image=dualvtable/nvidia-loop \
--limits=nvidia.com/gpu=1
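The exact contents of the dualvtable/nvidia-loop image are not shown here; a minimal equivalent entrypoint, assuming it simply keeps re-running nvidia-smi, might look like this:
#!/bin/sh
# Print the GPU(s) visible to the container, then keep the pod alive,
# periodically confirming that the GPU can still be queried.
nvidia-smi -L
while true; do
    nvidia-smi > /dev/null 2>&1 || exit 1   # let Kubernetes restart the pod on failure
    sleep 30
done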
After the VM has been stopped and restarted, we can see that the nvidia-loop pod has entered a CrashLoopBackOff state:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nvidia-loop 0/1 CrashLoopBackOff 1 5m3s
kube-system calico-kube-controllers-6949477b58-5r8q6 1/1 Running 3 50m
kube-system calico-node-jv5x9 1/1 Running 3 50m
$ kubectl describe po nvidia-loop
Name: nvidia-loop
Namespace: default
Priority: 0
Node: ip-172-31-30-232/172.31.30.232
Start Time: Tue, 02 Mar 2021 23:21:07 +0000
Labels: run=nvidia-loop
Annotations: cni.projectcalico.org/podIP: 192.168.228.219/32
cni.projectcalico.org/podIPs: 192.168.228.219/32
Status: Running
IP: 192.168.228.219
<snip>
Warning Failed 15s (x2 over 29s) kubelet Error: failed to start container "nvidia-loop": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-7d0b1464-cb0a-3394-b09d-af8ac70d175b: unknown device: unknown
The GPU allocated to the VM has changed since it was stopped and restarted:
$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-55ecfd38-cce7-e013-22a6-8d5e26b93b89)
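To make the mismatch explicit, the UUID the container was pinned to (taken from the kubelet error above) can be compared with the UUID of the GPU now attached to the VM (a small sketch using standard nvidia-smi query options):
# UUID recorded when the pod was first scheduled (from the error message above)
$ OLD=GPU-7d0b1464-cb0a-3394-b09d-af8ac70d175b
# UUID of the GPU attached after the restart
$ NEW=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)
$ [ "$OLD" = "$NEW" ] || echo "GPU UUID changed: $OLD -> $NEW"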
Now delete the 0.8.2 plugin deployment and install the latest release (v0.9.0), this time enumerating GPUs by device index rather than by UUID.
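The existing release can be removed with helm uninstall, using the release name reported by helm ls earlier (release names are generated, so yours will differ):
$ helm uninstall nvidia-device-plugin-1614725384
Then install v0.9.0, additionally setting deviceIDStrategy to index: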
$ helm install \
--version=0.9.0 \
--generate-name \
--set securityContext.privileged=true \
--set deviceListStrategy=volume-mounts \
--set deviceIDStrategy=index \
nvdp/nvidia-device-plugin
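To confirm which values the new release was installed with, helm get values can be run against the release name shown by helm ls (the name here corresponds to the nvidia-device-plugin-1614729844-* pod shown below and will differ per deployment); the output should list deviceIDStrategy: index among the user-supplied values:
$ helm get values nvidia-device-plugin-1614729844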
Stop and restart the VM so that a different GPU may be attached on boot.
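On AWS EC2, for example, the stop/start cycle can be driven from the CLI (a sketch; the instance ID is a placeholder for the VM backing the node):
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0
Once the VM is available again, all pods should return to a Running state, including the container that was assigned a GPU: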
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nvidia-loop 1/1 Running 0 4s
kube-system calico-kube-controllers-6949477b58-5r8q6 1/1 Running 3 90m
kube-system calico-node-jv5x9 1/1 Running 3 90m
kube-system coredns-74ff55c5b-bxrgh 1/1 Running 3 92m
kube-system coredns-74ff55c5b-zmcs2 1/1 Running 3 92m
kube-system etcd-ip-172-31-30-232 1/1 Running 4 92m
kube-system kube-apiserver-ip-172-31-30-232 1/1 Running 4 92m
kube-system kube-controller-manager-ip-172-31-30-232 1/1 Running 3 92m
kube-system kube-proxy-lqt42 1/1 Running 4 92m
kube-system kube-scheduler-ip-172-31-30-232 1/1 Running 3 92m
kube-system nvidia-device-plugin-1614729844-rgwvq 1/1 Running 0 2m9s
We can also verify that a different GPU was attached after the VM restarted:
$ kubectl logs nvidia-loop
GPU 0: Tesla T4 (UUID: GPU-bf1f9fb7-b55a-029d-c27d-7726529b207d)
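As a final check, the UUID reported inside the pod can be compared with the UUID of the GPU currently attached to the node (a small sketch; both commands are standard):
$ kubectl logs nvidia-loop | grep -o 'GPU-[0-9a-f-]*'
$ nvidia-smi --query-gpu=uuid --format=csv,noheader
Both should report the same UUID, confirming that the pod came back up against the newly attached GPU.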