Notes for setting up GPUs with OpenShift
Prerequisites:
- CentOS 7 or RHEL 7
- OpenShift 3.7+
- CUDA 9 and cuDNN 7
- Check that the GPU is visible on the PCI bus
lspci -nn | grep VGA
lspci | grep -i --color 'vga\|3d\|2d'
00:02.0 VGA compatible controller [0300]: Cirrus Logic GD 5446 [1013:00b8]
00:1e.0 VGA compatible controller [0300]: NVIDIA Corporation GM204GL [Tesla M60] [10de:13f2] (rev a1)
If lspci is not available, install it first: yum install pciutils
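The bracketed [vendor:device] ID in the `lspci -nn` output (10de is NVIDIA's vendor ID) can be extracted with a small sed filter. A sketch; `pci_id` is a hypothetical helper, not part of pciutils:

```shell
# Sketch: pull the [vendor:device] PCI ID out of an `lspci -nn` line
# (pci_id is a hypothetical helper, not part of pciutils)
pci_id() { sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'; }

echo "00:1e.0 VGA compatible controller [0300]: NVIDIA Corporation GM204GL [Tesla M60] [10de:13f2] (rev a1)" | pci_id
# -> 10de:13f2
```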
- Not all kernel versions are supported by the NVIDIA driver. Check the running kernel and the available boot entries:
uname -r
3.10.0-765.el7.x86_64
grep '^menuentry' /boot/grub2/grub.cfg
menuentry 'Red Hat Enterprise Linux Server (3.10.0-765.el7.x86_64) 7.4 (Maipo)' --class red --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-de4def96-ff72-4eb9-ad5e-0847257d1866' {
menuentry 'Red Hat Enterprise Linux Server (3.10.0-693.el7.x86_64) 7.4 (Maipo)' --class red --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-de4def96-ff72-4eb9-ad5e-0847257d1866' {
menuentry 'Red Hat Enterprise Linux Server (0-rescue-164021837e1e44d996ee35429af77920) 7.4 (Maipo)' --class red --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-164021837e1e44d996ee35429af77920-advanced-de4def96-ff72-4eb9-ad5e-0847257d1866' {
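The zero-based index that GRUB_DEFAULT expects can be computed from grub.cfg instead of counted by hand. A sketch; `grub_index` is a hypothetical helper, demonstrated here on a synthetic file (use /boot/grub2/grub.cfg for real):

```shell
# Hypothetical helper: print the zero-based menuentry index that
# GRUB_DEFAULT expects for a given kernel version string
grub_index() {
  grep '^menuentry' "$1" | grep -n "$2" | head -1 | cut -d: -f1 | awk '{print $1 - 1}'
}

# Demonstration against a synthetic grub.cfg:
cat > /tmp/grub-demo.cfg <<'EOF'
menuentry 'Red Hat Enterprise Linux Server (3.10.0-765.el7.x86_64) 7.4 (Maipo)' {
menuentry 'Red Hat Enterprise Linux Server (3.10.0-693.el7.x86_64) 7.4 (Maipo)' {
menuentry 'Red Hat Enterprise Linux Server (0-rescue) 7.4 (Maipo)' {
EOF
grub_index /tmp/grub-demo.cfg 3.10.0-693
# -> 1
```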
- For me 3.10.0-693 worked, so make it the default boot entry
vi /etc/default/grub
Modify "GRUB_DEFAULT=2" (the zero-based index of the desired menuentry on my system), then regenerate the GRUB config:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-765.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-765.el7.x86_64.img
Found linux image: /boot/vmlinuz-3.10.0-693.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-693.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-164021837e1e44d996ee35429af77920
Found initrd image: /boot/initramfs-0-rescue-164021837e1e44d996ee35429af77920.img
Found linux image: /boot/vmlinuz-0-rescue-f2732ce2113247b297338f3e384f122e
Found initrd image: /boot/initramfs-0-rescue-f2732ce2113247b297338f3e384f122e.img
done
reboot
uname -r
3.10.0-693.el7.x86_64
- Blacklist the nouveau driver: add "blacklist nouveau" to /etc/modprobe.d/blacklist.conf
vi /etc/modprobe.d/blacklist.conf
Then rebuild the initramfs and reboot:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
reboot
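The blacklist edit above can also be scripted so that reruns are no-ops. A sketch, demonstrated on a temp file; point `conf` at /etc/modprobe.d/blacklist.conf for real use:

```shell
# Sketch: append "blacklist nouveau" only if it is missing, so reruns are no-ops.
# Demonstrated on a temp file; use conf=/etc/modprobe.d/blacklist.conf for real.
conf=/tmp/blacklist-demo.conf
: > "$conf"   # start empty for the demo
grep -qxF 'blacklist nouveau' "$conf" || echo 'blacklist nouveau' >> "$conf"
grep -qxF 'blacklist nouveau' "$conf" || echo 'blacklist nouveau' >> "$conf"  # no-op
cat "$conf"
# -> blacklist nouveau
```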
- Install the matching kernel headers and build tools (needed to build the driver's kernel module)
yum install "kernel-devel-uname-r == $(uname -r)"
yum groupinstall "Development Tools"
- Make a note of the GPU product type. For me it is Tesla M60. Download the driver from http://www.nvidia.com/Download/index.aspx. For me the selection was:
- Tesla M-Class, Product M60, RHEL 7, CUDA 9.0
- Version: 384.81
- Release Date: 2017.9.25
- Operating System: Linux 64-bit
- CUDA Toolkit: 9.0
- Language: English (US)
- This leads to http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/384.81/NVIDIA-Linux-x86_64-384.81.run&lang=us&type=Tesla
wget http://us.download.nvidia.com/tesla/384.81/NVIDIA-Linux-x86_64-384.81.run
chmod +x NVIDIA-Linux-x86_64-384.81.run
./NVIDIA-Linux-x86_64-384.81.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81................................................................
Verify the driver is loaded:
nvidia-smi
Mon Nov 27 21:11:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 34C P8 13W / 150W | 11MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Install EPEL and the CUDA repository, then CUDA 9.0
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
yum --showduplicates list cuda
yum install -y cuda-9.0.176-1
echo -e "$CUDA_HOME \n $CUDA_PATH \n $LD_LIBRARY_PATH"
ls -l /usr/local/cuda
export CUDA_HOME="/usr/local/cuda"
export CUDA_PATH="${CUDA_HOME}"
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}";
echo -e "$CUDA_HOME \n $CUDA_PATH \n $LD_LIBRARY_PATH"
- Install cuDNN 7 for CUDA 9 (the official download at https://developer.nvidia.com/cudnn requires an NVIDIA developer account; the URL below is an internal mirror)
wget http://perf1.perf.lab.eng.bos.redhat.com/jeder/NVIDIA/cuDNN/CUDA9/cuDNN_v7.0.3/cudnn-9.0-linux-x64-v7.tgz
tar -C /usr/local -xvf cudnn-9.0-linux-x64-v7.tgz
ls -l /usr/local/cuda/lib64/libcudnn.so.7
ls -l /usr/local/cuda/lib64/libcudnn_static.a
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/
- Find GPU name
nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
For me it was "Tesla-M60"
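The sed in the query above just replaces spaces with dashes so the name is a valid label value. As a tiny sketch (`gpu_label` is a hypothetical wrapper):

```shell
# Hypothetical wrapper: make a GPU product name label-safe (spaces -> dashes),
# the same transformation as the sed in the nvidia-smi query above
gpu_label() { printf '%s' "$1" | sed -e 's/ /-/g'; }

gpu_label "Tesla M60"
# -> Tesla-M60
```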
- Label the node with the GPU name (here it is Tesla-M60)
oc label node ip-172-31-1-39.us-west-2.compute.internal alpha.kubernetes.io/nvidia-gpu-name='Tesla-M60' --overwrite
- Modify the node config and restart the node service
Open the node config file:
vi /etc/origin/node/node-config.yaml
add the lines below:
kubeletArguments:
  feature-gates:
  - Accelerators=true
then restart:
systemctl restart atomic-openshift-node
- Check for changes in node
journalctl -u atomic-openshift-node --since="5 minutes ago" | grep feature
Aug 22 14:57:31 ip-172-31-4-10.us-west-2.compute.internal atomic-openshift-node[27395]: I0822 14:57:31.486944 27395 feature_gate.go:144] feature gates: map[Accelerators:true]
- Check for capacity changes on the node. You can see that my system reports a capacity of one GPU.
oc describe node ip-172-31-1-39.us-west-2.compute.internal | egrep -B1 'Name:|gpu:'
Name: ip-172-31-1-39.us-west-2.compute.internal
--
Capacity:
alpha.kubernetes.io/nvidia-gpu: 1
--
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 1
- Add node affinity and a GPU resource limit (resources/limits) to the DeploymentConfig. Note that affinity belongs in the pod template spec, and the affinity value must match the node label exactly.
{
  "kind": "DeploymentConfig",
  "apiVersion": "v1",
  "metadata": {
    "name": "${APPLICATION_NAME}",
    "labels": {
      "appid": "tf-app-server-${APPLICATION_NAME}",
      "appName": "${APPLICATION_NAME}"
    }
  },
  "spec": {
    "strategy": {
      "type": "Rolling"
    },
    "triggers": [
      {
        "type": "ConfigChange"
      },
      {
        "type": "ImageChange",
        "imageChangeParams": {
          "automatic": true,
          "containerNames": [
            "${APPLICATION_NAME}"
          ],
          "from": {
            "kind": "ImageStreamTag",
            "name": "${APPLICATION_NAME}:latest"
          }
        }
      }
    ],
    "replicas": 1,
    "selector": {
      "deploymentconfig": "${APPLICATION_NAME}"
    },
    "template": {
      "metadata": {
        "labels": {
          "appid": "tf-app-server-${APPLICATION_NAME}",
          "deploymentconfig": "${APPLICATION_NAME}",
          "appName": "${APPLICATION_NAME}"
        }
      },
      "spec": {
        "affinity": {
          "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
              "nodeSelectorTerms": [
                {
                  "matchExpressions": [
                    {
                      "key": "alpha.kubernetes.io/nvidia-gpu-name",
                      "operator": "In",
                      "values": [
                        "Tesla-M60"
                      ]
                    }
                  ]
                }
              ]
            }
          }
        },
        "containers": [
          {
            "env": [
              {
                "name": "MODEL_NAME",
                "value": "${MODEL_NAME}"
              },
              {
                "name": "MODEL_PATH",
                "value": "${SOURCE_DIRECTORY}"
              }
            ],
            "name": "${APPLICATION_NAME}",
            "image": "${APPLICATION_NAME}:latest",
            "resources": {
              "limits": {
                "alpha.kubernetes.io/nvidia-gpu": "1"
              }
            },
            "ports": [
              {
                "containerPort": 6006,
                "protocol": "TCP"
              }
            ]
          }
        ]
      }
    }
  }
},
- Deploy the template
oc create -f template.yaml
- Verify that the pods are scheduled only on the GPU node
oadm manage-node ip-172-31-1-39.us-west-2.compute.internal --list-pods
OR
oc get pod <tf-pods> --template={{.spec.nodeName}}