@allanlei
Created March 23, 2023 15:51
GKE NVIDIA driver installer with driver version control

Find the corresponding NVIDIA driver for COS

  1. gsutil ls gs://cos-tools-asia/${COS_VERSION}/extensions/gpu
  2. Pick an available version and set it as annotations.version in the DaemonSet's pod template
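
The lookup in step 1 can be wrapped in a small script. This is only a sketch: the helper name `cos_gpu_path` is made up for illustration, the default bucket is the `cos-tools-asia` bucket used above, and `COS_VERSION` is assumed to hold the COS build of your node image.

```shell
#!/usr/bin/env bash
# Sketch: list the GPU driver artifacts published for one COS build.

# Build the bucket path for a COS build.
#   $1 = COS build (the COS_VERSION placeholder used above)
#   $2 = tools bucket, defaulting to the one used in this gist
cos_gpu_path() {
  local cos_version="$1" bucket="${2:-cos-tools-asia}"
  printf 'gs://%s/%s/extensions/gpu\n' "$bucket" "$cos_version"
}

# Only query the bucket when gsutil is installed and COS_VERSION is set.
if command -v gsutil >/dev/null 2>&1 && [ -n "${COS_VERSION:-}" ]; then
  gsutil ls "$(cos_gpu_path "$COS_VERSION")"
fi
```

Run it with `COS_VERSION` exported, then copy one of the listed driver versions into `annotations.version`.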

Creating a Node pool with a specific driver

  1. Create a node pool with the taint nvidia.com/gpu-driver-version=DRIVER_VERSION_MAJOR (effect NoSchedule)
  2. Create the corresponding driver installer DaemonSet with a toleration for nvidia.com/gpu-driver-version=DRIVER_VERSION_MAJOR
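
As a rough sketch of step 1, the script below derives the major version from the full driver version and prints a `gcloud` command that creates a tainted node pool. The cluster name, pool name, and accelerator count are placeholders, and the helper functions are illustrative rather than part of any tool. The command is only echoed so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch: build the node pool taint from the chosen driver version.

# Full driver version, the same value as annotations.version.
DRIVER_VERSION="${DRIVER_VERSION:-525.60.13}"

# Major version = everything before the first dot (525.60.13 -> 525).
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Taint in gcloud's key=value:effect form.
driver_taint() {
  printf 'nvidia.com/gpu-driver-version=%s:NoSchedule\n' "$(driver_major "$1")"
}

# Placeholder names (assumptions, adjust to your project).
CLUSTER="${CLUSTER:-my-cluster}"
POOL="gpu-$(driver_major "$DRIVER_VERSION")"

# Dry run: print the command instead of executing it.
echo gcloud container node-pools create "$POOL" \
  --cluster "$CLUSTER" \
  --accelerator "type=nvidia-tesla-t4,count=1" \
  --node-taints "$(driver_taint "$DRIVER_VERSION")"
```

Remove the leading `echo` once the printed command looks right. The taint value has to match the `value` of the `nvidia.com/gpu-driver-version` toleration in the manifests.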
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: nvidia-driver-installer-cos-525
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer-cos-525
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "100%"
  template:
    metadata:
      labels:
        name: nvidia-driver-installer-cos-525
      annotations:
        version: "525.60.13"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-os-distribution
                operator: In
                values:
                - "cos"
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
      - key: nvidia.com/gpu-driver-version
        operator: Equal
        value: "525"
        effect: NoSchedule
      hostNetwork: true
      hostPID: true
      serviceAccountName: node-controller
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: vulkan-icd-mount
        hostPath:
          path: /home/kubernetes/bin/nvidia/vulkan/icd.d
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      - name: root-mount
        hostPath:
          path: /
      - name: cos-tools
        hostPath:
          path: /var/lib/cos-tools
      initContainers:
      - name: nvidia-driver-installer
        image: gcr.io/cos-cloud/cos-gpu-installer:v2.0.31
        args: ["install", "-version", "$(VERSION)"]
        resources:
          requests:
            cpu: 0m
        securityContext:
          privileged: true
        env:
        - name: VERSION
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['version']
        - name: NVIDIA_INSTALL_DIR_HOST
          value: /home/kubernetes/bin/nvidia
        - name: NVIDIA_INSTALL_DIR_CONTAINER
          value: /usr/local/nvidia
        - name: VULKAN_ICD_DIR_HOST
          value: /home/kubernetes/bin/nvidia/vulkan/icd.d
        - name: VULKAN_ICD_DIR_CONTAINER
          value: /etc/vulkan/icd.d
        - name: ROOT_MOUNT_DIR
          value: /root
        - name: COS_TOOLS_DIR_HOST
          value: /var/lib/cos-tools
        - name: COS_TOOLS_DIR_CONTAINER
          value: /build/cos-tools
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: vulkan-icd-mount
          mountPath: /etc/vulkan/icd.d
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
        - name: cos-tools
          mountPath: /build/cos-tools
      - name: label-node
        image: swaglive/kubectl:1.11
        args: ["label", "node", "--overwrite", "$(NODE_NAME)", "nvidia.com/gpu-driver-version-installed=$(VERSION)"]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: VERSION
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['version']
      containers:
      - image: gcr.io/google-containers/pause:2.0
        name: pause
        resources:
          requests:
            cpu: 0m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: encoder
spec:
  selector:
    matchLabels:
      app: encoder
  template:
    metadata:
      labels:
        app: encoder
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - "nvidia-tesla-t4"
              - key: nvidia.com/gpu-driver-version-installed
                operator: Exists
      tolerations:
      - key: cloud.google.com/gke-preemptible
        operator: "Exists"
      - key: nvidia.com/gpu
        operator: Exists
      - key: nvidia.com/gpu-driver-version
        operator: Equal
        value: "525"
        effect: NoSchedule
      containers:
      - image: gcr.io/google-containers/pause:2.0
        name: pause
        resources:
          limits:
            nvidia.com/gpu: 1
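
Once both manifests are applied, a quick check along these lines can confirm that the installer labelled the nodes and that the encoder pods were scheduled onto them. It is a sketch, not part of the gist: `matches_major` is a hypothetical helper that compares the full version stored in the node label with the major version used in the taint, and the `kubectl` calls are skipped when `kubectl` is not installed.

```shell
#!/usr/bin/env bash
# Sketch: verify the label written by the label-node init container.

LABEL_KEY="nvidia.com/gpu-driver-version-installed"

# Succeeds when a full driver version (the label value) belongs to the
# given major version (the taint value), e.g. 525.60.13 and 525.
matches_major() {
  [ "${1%%.*}" = "$2" ]
}

if command -v kubectl >/dev/null 2>&1; then
  # -l KEY selects nodes carrying the label; -L KEY prints its value as a column.
  kubectl --request-timeout=10s get nodes -l "$LABEL_KEY" -L "$LABEL_KEY" || true
  # The encoder pods should land only on labelled nodes.
  kubectl --request-timeout=10s -n default get pods -l app=encoder -o wide || true
fi
```

Nodes missing from the first listing have not finished the driver install yet, so the `Exists` match on `nvidia.com/gpu-driver-version-installed` keeps the encoder off them.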