π Add a MachineSet for GPU workers by exporting an existing worker's MachineSet, then changing the instance type and name and dropping read-only fields such as selfLink (a sketch of the workflow follows the table below). You have a few choices here depending on what you want to do.
- Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.
- Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.
- Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.
- Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.
They're not cheap, so check costs before picking one. I demo with the g4dn.4xlarge (currently $1.204/hr on-demand).
| Instance | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g4dn.4xlarge | 1 | 16 | 64 | 16 | 1 x 225 NVMe SSD | Up to 25 | 4.75 |
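A rough sketch of that MachineSet workflow (the <worker-machineset> name is a placeholder; substitute one of your cluster's worker MachineSets):

oc get machinesets -n openshift-machine-api
oc get machineset <worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# Edit gpu-machineset.yaml before creating it:
#   - give it a new metadata.name (and matching spec.selector / template labels)
#   - delete status, metadata.selfLink, metadata.uid, metadata.resourceVersion, metadata.creationTimestamp
#   - set spec.template.spec.providerSpec.value.instanceType to g4dn.4xlarge
oc create -f gpu-machineset.yaml -n openshift-machine-api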
The driver-toolkit imagestream should exist in 4.8 and later; note that some 4.8.z releases and 4.9.8 shipped a broken driver-toolkit, so avoid those versions. Check that the imagestream is present:
oc get -n openshift is/driver-toolkit
oc get pods -n openshift-nfd
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
EOF
π Install NodeFeatureDiscovery from OperatorHub into the new namespace
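If you'd rather install it from the CLI, an OperatorGroup plus Subscription along these lines should work (the channel name varies by OpenShift version, so treat "stable" as an assumption and check oc get packagemanifests -n openshift-marketplace nfd first):

cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF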
π Go to Compute -> Nodes -> look at labels
π Create an instance of NodeFeatureDiscovery from the installed operator
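The console pre-fills a default NodeFeatureDiscovery CR. From the CLI, a minimal sketch looks like the following, assuming the operator fills in sane defaults; if your version's CRD rejects an empty spec, start from the console-provided default instead:

cat << EOF | oc create -f -
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
EOF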
π Go to Compute -> Nodes -> look at labels (now there should be PCI labels)
(note that the PCI IDs are used to identify the hardware)
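For example, nodes with an NVIDIA GPU should now carry the label for PCI vendor ID 10de (NVIDIA), which you can check from the CLI:

oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true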
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF
π Install the NVIDIA GPU Operator from OperatorHub into that namespace
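CLI alternative, analogous to the NFD install above (package gpu-operator-certified from the certified-operators catalog; the channel is version-dependent, so confirm it with oc get packagemanifests -n openshift-marketplace gpu-operator-certified before applying):

cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF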
π Create an instance of ClusterPolicy from the installed operator
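In the console this is a one-click "Create ClusterPolicy" with the defaults. From the CLI, one approach (requires jq) is to pull the default ClusterPolicy out of the CSV's alm-examples annotation and apply it:

CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get -n nvidia-gpu-operator $CSV -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -n nvidia-gpu-operator -f clusterpolicy.json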
oc get pods,ds -n nvidia-gpu-operator
oc describe ns/nvidia-gpu-operator | grep cluster-monitoring
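If that label is missing, the namespace can be labeled so cluster monitoring scrapes the operator's DCGM metrics (label name per the GPU operator documentation; newer operator versions set it themselves):

oc label ns nvidia-gpu-operator openshift.io/cluster-monitoring=true --overwrite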
oc project nvidia-gpu-operator
oc get pods | grep nvidia-driver-daemonset
POD=$(oc get pods -o name | grep nvidia-driver-daemonset | head -n1)
oc exec -it $POD -- nvidia-smi
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc logs pod/cuda-vectoradd
Warning: the cheminformatics image is pretty big (about 6.5GB), so the pull takes a while.
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cheminformatics
  labels:
    app: cheminformatics
spec:
  restartPolicy: OnFailure
  containers:
  - name: cheminformatics
    image: "nvcr.io/nvidia/clara/cheminformatics_demo:0.1.2"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: cheminformatics
spec:
  selector:
    app: cheminformatics
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8888
EOF
oc expose svc cheminformatics
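The exposed route's hostname, where the demo UI should be reachable over plain HTTP on port 80:

oc get route cheminformatics -o jsonpath='{.spec.host}'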
Maybe try this too:
kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml