π Add a MachineSet for GPU workers by exporting an existing worker's MachineSet, then changing the instance type and name and dropping read-only fields such as selfLink (a sketch of the workflow follows the table below). You have a few choices here depending on what you want to do.
- Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.
- Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.
- Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.
- Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.
They're not cheap, so check costs before picking one. I demo with the g4dn.4xlarge (currently $1.204/hr on-demand).
| Instance | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g4dn.4xlarge | 1 | 16 | 64 | 16 | 1 x 225 NVMe SSD | Up to 25 | 4.75 |
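A rough sketch of that MachineSet workflow (the <worker-machineset> name is a placeholder; substitute one of your cluster's worker MachineSets):

oc get machinesets -n openshift-machine-api
oc get machineset <worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# Edit gpu-machineset.yaml before creating it:
#   - give it a new metadata.name (and matching spec.selector / template labels)
#   - delete status, metadata.selfLink, metadata.uid, metadata.resourceVersion, metadata.creationTimestamp
#   - set spec.template.spec.providerSpec.value.instanceType to g4dn.4xlarge
oc create -f gpu-machineset.yaml -n openshift-machine-api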
The driver-toolkit imagestream should exist in 4.8 and later; note that some 4.8.z releases and 4.9.8 shipped a broken driver-toolkit, so avoid those versions. Check that the imagestream is present:
oc get -n openshift is/driver-toolkit
oc get pods -n openshift-nfd
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
EOF
π Install NodeFeatureDiscovery from OperatorHub into the new namespace
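If you'd rather install it from the CLI, an OperatorGroup plus Subscription along these lines should work (the channel name varies by OpenShift version, so treat "stable" as an assumption and check oc get packagemanifests -n openshift-marketplace nfd first):

cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF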
π Go to Compute -> Nodes -> look at labels
π Create an instance of NodeFeatureDiscovery from the installed operator
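The console pre-fills a default NodeFeatureDiscovery CR. From the CLI, a minimal sketch looks like the following, assuming the operator fills in sane defaults; if your version's CRD rejects an empty spec, start from the console-provided default instead:

cat << EOF | oc create -f -
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
EOF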
π Go to Compute -> Nodes -> look at labels (now there should be PCI labels)
(note that the PCI IDs are used to identify the hardware)
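For example, nodes with an NVIDIA GPU should now carry the label for PCI vendor ID 10de (NVIDIA), which you can check from the CLI:

oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true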
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF
π Install the NVIDIA GPU Operator from OperatorHub into that namespace
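CLI alternative, analogous to the NFD install above (package gpu-operator-certified from the certified-operators catalog; the channel is version-dependent, so confirm it with oc get packagemanifests -n openshift-marketplace gpu-operator-certified before applying):

cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF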
π Create an instance of ClusterPolicy from the installed operator
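In the console this is a one-click "Create ClusterPolicy" with the defaults. From the CLI, one approach (requires jq) is to pull the default ClusterPolicy out of the CSV's alm-examples annotation and apply it:

CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get -n nvidia-gpu-operator $CSV -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -n nvidia-gpu-operator -f clusterpolicy.json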
oc get pods,ds -n nvidia-gpu-operator
oc describe ns/nvidia-gpu-operator | grep cluster-monitoring
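If that label is missing, the namespace can be labeled so cluster monitoring scrapes the operator's DCGM metrics (label name per the GPU operator documentation; newer operator versions set it themselves):

oc label ns nvidia-gpu-operator openshift.io/cluster-monitoring=true --overwrite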
oc project nvidia-gpu-operator
oc get pods | grep nvidia-driver-daemonset
POD=$(oc get pods -o name | grep nvidia-driver-daemonset | head -n1)
oc exec -it $POD -- nvidia-smi
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc logs pod/cuda-vectoradd
Warning: the cheminformatics image is pretty big (about 6.5GB), so the pull takes a while.
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cheminformatics
  labels:
    app: cheminformatics
spec:
  restartPolicy: OnFailure
  containers:
  - name: cheminformatics
    image: "nvcr.io/nvidia/clara/cheminformatics_demo:0.1.2"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: cheminformatics
spec:
  selector:
    app: cheminformatics
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8888
EOF
oc expose svc cheminformatics
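The exposed route's hostname, where the demo UI should be reachable over plain HTTP on port 80:

oc get route cheminformatics -o jsonpath='{.spec.host}'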
Maybe try this too:
kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml