GPU testing on OpenShift 3.10:

STATUS: TensorFlow 1.11.0 can access the GPUs with CUDA 9.0 (as of Oct 5).

https://gist.github.com/sub-mod/18c23839ccbac660de08ba5f6033defd

Dockerfile

#FROM centos:7
FROM nvidia/cuda:9.0-cudnn7-devel-centos7

MAINTAINER Subin Modeel <[email protected]>

USER root


ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
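# The required CUDA version depends on the host driver.
# (Dockerfile comments must be on their own line; a trailing "# ..." becomes part of the ENV value.)
ENV NVIDIA_REQUIRE_CUDA "cuda>=9.0"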
ENV TESTCMD "/bin/nvidia-smi"
ENV BASHWAITCMD "trap : TERM INT; $TESTCMD && sleep infinity & wait"

CMD exec /bin/bash -c "$BASHWAITCMD"

Build image

docker build -t submod/test-gpu-310 -f Dockerfile .
docker run -it submod/test-gpu-310
docker push submod/test-gpu-310

docker run --privileged -i -t submod/test-gpu-310
docker run --privileged -i -t docker-registry.default.svc:5000/test/cuda-tf-runtime-36-redhat:rhel7-1-56-10.0-cudnn7-devel-rhel7 /bin/bash
docker run --privileged -i -t nvidia/cuda:9.0-base nvidia-smi

create a user in the nvidia project

oc login -u test-gpu -p test-gpu
oc policy add-role-to-user edit test-gpu -n nvidia

find the service account (SA) that has the nvidia SCC

oc login -u system:admin
oc get scc | grep nvidia
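
`oc get scc | grep nvidia` only gives the SCC name. As a rough sketch, the accounts attached to that SCC can be read from its `users` field, where service accounts show up as `system:serviceaccount:<namespace>:<name>`. The helper name `scc_users` and the SCC name in the usage line are placeholders, not taken from this cluster.

```shell
# Hypothetical helper: print the users and service accounts listed on an SCC.
# $1 is the SCC name reported by `oc get scc | grep nvidia`.
scc_users() {
  oc get scc "$1" -o jsonpath='{.users}'
}

# Usage (run as system:admin; the SCC name below is a placeholder):
#   scc_users nvidia-deviceplugin
```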

use the SA in yaml

apiVersion: v1
kind: Pod
metadata:
 name: test-gpu-310
spec:
 restartPolicy: OnFailure
 serviceAccountName: nvidia-deviceplugin
 containers:
   - name: test-gpu-310
     image: "submod/test-gpu-310"
     env:
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: "compute,utility"
       - name: NVIDIA_REQUIRE_CUDA
         value: "cuda>=9.0"
     securityContext:
       privileged: true
     resources:
       limits:
         nvidia.com/gpu: 1 # requesting 1 GPU

same pod as JSON

{
	"apiVersion": "v1",
	"kind": "Pod",
	"metadata": {
		"name": "test-gpu-310"
	},
	"spec": {
		"restartPolicy": "OnFailure",
		"serviceAccountName": "nvidia-deviceplugin",
		"containers": [
			{
				"name": "test-gpu-310",
				"image": "submod/test-gpu-310",
				"env": [
					{
						"name": "NVIDIA_VISIBLE_DEVICES",
						"value": "all"
					},
					{
						"name": "NVIDIA_DRIVER_CAPABILITIES",
						"value": "compute,utility"
					},
					{
						"name": "NVIDIA_REQUIRE_CUDA",
						"value": "cuda>=9.0"
					}
				],
				"securityContext": {
					"privileged": true
				},
				"resources": {
					"limits": {
						"nvidia.com/gpu": 1
					}
				}
			}
		]
	}
}
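
To try other GPU counts without editing the file each time, the manifest can be generated from a small shell function. This is only a sketch: `render_gpu_pod` is a made-up helper name, and the manifest body is copied from the YAML above with the GPU limit as the single parameter.

```shell
# Hypothetical helper: print the pod manifest with a configurable GPU limit.
# $1 is the number of GPUs to request (defaults to 1, as in the manifest above).
render_gpu_pod() {
  local gpus="${1:-1}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-310
spec:
  restartPolicy: OnFailure
  serviceAccountName: nvidia-deviceplugin
  containers:
    - name: test-gpu-310
      image: "submod/test-gpu-310"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=9.0"
      securityContext:
        privileged: true
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}

# Usage: pipe the manifest to oc, then read the nvidia-smi output from the pod.
#   render_gpu_pod 1 | oc create -f -
#   oc logs test-gpu-310
```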

error without scc SA

# NOTE: without scc SA I get an error
# oc create -f test-gpu-centos7.yaml 
Error from server (Forbidden): error when creating "test-gpu-centos7.yaml": pods "test-gpu-310" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

user is NOT a cluster-admin

oc adm policy remove-cluster-role-from-user cluster-admin smodeel

# without scc SA
# oc create -f test-gpu-centos7_2.yaml 
Error from server (Forbidden): error when creating "test-gpu-centos7_2.yaml": pods "test-gpu-310" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

# with scc SA
pod "test-gpu-310" created
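
To confirm which SCC actually admitted the pod, OpenShift records it on the pod in the `openshift.io/scc` annotation. A minimal sketch for reading it back; `pod_scc` is a hypothetical helper name.

```shell
# Hypothetical helper: print the SCC that admitted a pod, read from the
# openshift.io/scc annotation. $1 is the pod name.
pod_scc() {
  # The dot inside the annotation key has to be escaped for jsonpath.
  oc get pod "$1" -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
}

# Usage:
#   pod_scc test-gpu-310
```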

user is a cluster-admin (obviously this works)

oc adm policy add-cluster-role-to-user cluster-admin smodeel

# works both without and with the SA
pod "test-gpu-310" created

Useful links

https://blog.openshift.com/how-to-use-gpus-with-deviceplugin-in-openshift-3-10/
https://blog.openshift.com/understanding-service-accounts-sccs/

logs

sh-4.2# curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1604k  100 1604k    0     0  1933k      0 --:--:-- --:--:-- --:--:-- 1932k
sh-4.2# python get-pip.py
Collecting pip
  Downloading https://files.pythonhosted.org/packages/5f/25/e52d3f31441505a5f3af41213346e5b6c221c9e086a166f3703d2ddaf940/pip-18.0-py2.py3-none-any.whl (1.3MB)
    100% |################################| 1.3MB 6.7MB/s
Collecting setuptools
  Downloading https://files.pythonhosted.org/packages/96/06/c8ee69628191285ddddffb277bd5abdf769166e7a14b867c2a172f0175b1/setuptools-40.4.3-py2.py3-none-any.whl (569kB)
    100% |################################| 573kB 7.2MB/s
Collecting wheel
  Downloading https://files.pythonhosted.org/packages/fc/e9/05316a1eec70c2bfc1c823a259546475bd7636ba6d27ec80575da523bc34/wheel-0.32.1-py2.py3-none-any.whl
Installing collected packages: pip, setuptools, wheel
Successfully installed pip-18.0 setuptools-40.4.3 wheel-0.32.1
sh-4.2# pip install tensorflow-gpu
Collecting tensorflow-gpu
  Downloading https://files.pythonhosted.org/packages/ce/5c/33d3fc212cd392ceb4396d6b5280fdb426856b8d9702a710444370c50e8c/tensorflow_gpu-1.11.0-cp27-cp27mu-manylinux1_x86_64.whl (258.9MB)
    100% |################################| 258.9MB 161kB/s
Collecting setuptools<=39.1.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/8c/10/79282747f9169f21c053c562a0baa21815a8c7879be97abd930dbcf862e8/setuptools-39.1.0-py2.py3-none-any.whl (566kB)
    100% |################################| 573kB 21.5MB/s
Collecting astor>=0.6.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/35/6b/11530768cac581a12952a2aad00e1526b89d242d0b9f59534ef6e6a1752f/astor-0.7.1-py2.py3-none-any.whl
Collecting enum34>=1.1.6 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl
Collecting gast>=0.2.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz
Collecting keras-preprocessing>=1.0.3 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/fc/94/74e0fa783d3fc07e41715973435dd051ca89c550881b3454233c39c73e69/Keras_Preprocessing-1.0.5-py2.py3-none-any.whl
Collecting keras-applications>=1.0.5 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/3f/c4/2ff40221029f7098d58f8d7fb99b97e8100f3293f9856f0fb5834bef100b/Keras_Applications-1.0.6-py2.py3-none-any.whl (44kB)
    100% |################################| 51kB 30.5MB/s
Requirement already satisfied: wheel in /usr/lib/python2.7/site-packages (from tensorflow-gpu) (0.32.1)
Collecting absl-py>=0.1.6 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/16/db/cce5331638138c178dd1d5fb69f3f55eb3787a12efd9177177ae203e847f/absl-py-0.5.0.tar.gz (90kB)
    100% |################################| 92kB 45.3MB/s
Collecting backports.weakref>=1.0rc1 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/88/ec/f598b633c3d5ffe267aaada57d961c94fdfa183c5c3ebda2b6d151943db6/backports.weakref-1.0.post1-py2.py3-none-any.whl
Collecting tensorboard<1.12.0,>=1.11.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/76/f9/e62022d00940e4df9a629d6bfe42eb28907bb35808db62bb9e8b69ea5ef3/tensorboard-1.11.0-py2-none-any.whl (3.0MB)
    100% |################################| 3.0MB 9.3MB/s
Collecting six>=1.10.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting mock>=2.0.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/e6/35/f187bdf23be87092bd0f1200d43d23076cee4d0dec109f195173fd3ebc79/mock-2.0.0-py2.py3-none-any.whl (56kB)
    100% |################################| 61kB 37.3MB/s
Collecting termcolor>=1.1.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting numpy>=1.13.3 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/40/c5/f1ed15dd931d6667b40f1ab1c2fe1f26805fc2b6c3e25e45664f838de9d0/numpy-1.15.2-cp27-cp27mu-manylinux1_x86_64.whl (13.8MB)
    100% |################################| 13.8MB 3.4MB/s
Collecting grpcio>=1.8.6 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/3d/15/b34114198a2bc9c9bb73b21e2b559468a1a68bb28b373d21da6e51d6204f/grpcio-1.15.0-cp27-cp27mu-manylinux1_x86_64.whl (9.4MB)
    100% |################################| 9.4MB 4.7MB/s
Collecting protobuf>=3.6.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/b8/c2/b7f587c0aaf8bf2201405e8162323037fe8d17aa21d3c7dda811b8d01469/protobuf-3.6.1-cp27-cp27mu-manylinux1_x86_64.whl (1.1MB)
    100% |################################| 1.1MB 30.1MB/s
Collecting h5py (from keras-applications>=1.0.5->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/33/0c/1c5dfa85e05052aa5f50969d87c67a2128dc39a6f8ce459a503717e56bd0/h5py-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl (2.7MB)
    100% |################################| 2.7MB 18.6MB/s
Collecting werkzeug>=0.11.10 (from tensorboard<1.12.0,>=1.11.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl (322kB)
    100% |################################| 327kB 51.1MB/s
Collecting futures>=3.1.1; python_version < "3" (from tensorboard<1.12.0,>=1.11.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/2d/99/b2c4e9d5a30f6471e410a146232b4118e697fa3ffc06d6a65efde84debd0/futures-3.2.0-py2-none-any.whl
Collecting markdown>=2.6.8 (from tensorboard<1.12.0,>=1.11.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/7a/6b/5600647404ba15545ec37d2f7f58844d690baf2f81f3a60b862e48f29287/Markdown-3.0.1-py2.py3-none-any.whl (89kB)
    100% |################################| 92kB 49.5MB/s
Collecting pbr>=0.11 (from mock>=2.0.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/01/0a/1e81639e7ed6aa51554ab05827984d07885d6873e612a97268ab3d80c73f/pbr-4.3.0-py2.py3-none-any.whl (106kB)
    100% |################################| 112kB 57.4MB/s
Collecting funcsigs>=1; python_version < "3.3" (from mock>=2.0.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
Building wheels for collected packages: gast, absl-py, termcolor
  Running setup.py bdist_wheel for gast ... done
  Stored in directory: /root/.cache/pip/wheels/9a/1f/0e/3cde98113222b853e98fc0a8e9924480a3e25f1b4008cedb4f
  Running setup.py bdist_wheel for absl-py ... done
  Stored in directory: /root/.cache/pip/wheels/3c/33/ae/db8cd618e62f87594c13a5483f96e618044f9b01596efd013f
  Running setup.py bdist_wheel for termcolor ... done
  Stored in directory: /root/.cache/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6
Successfully built gast absl-py termcolor
Installing collected packages: setuptools, astor, enum34, gast, numpy, six, keras-preprocessing, h5py, keras-applications, absl-py, backports.weakref, werkzeug,futures, markdown, protobuf, grpcio, tensorboard, pbr, funcsigs, mock, termcolor, tensorflow-gpu
  Found existing installation: setuptools 40.4.3
    Uninstalling setuptools-40.4.3:
      Successfully uninstalled setuptools-40.4.3
Successfully installed absl-py-0.5.0 astor-0.7.1 backports.weakref-1.0.post1 enum34-1.1.6 funcsigs-1.0.2 futures-3.2.0 gast-0.2.0 grpcio-1.15.0 h5py-2.8.0 keras-applications-1.0.6 keras-preprocessing-1.0.5 markdown-3.0.1 mock-2.0.0 numpy-1.15.2 pbr-4.3.0 protobuf-3.6.1 setuptools-39.1.0 six-1.11.0 tensorboard-1.11.0 tensorflow-gpu-1.11.0 termcolor-1.1.0 werkzeug-0.14.1
2018-10-05 03:36:07.959887: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-10-05 03:36:08.250225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:3b:00.0
totalMemory: 11.91GiB freeMemory: 11.63GiB
2018-10-05 03:36:08.456621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 1 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:af:00.0
totalMemory: 11.91GiB freeMemory: 11.63GiB
2018-10-05 03:36:08.673869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 2 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:d8:00.0
totalMemory: 11.91GiB freeMemory: 11.63GiB
2018-10-05 03:36:08.677474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1, 2
2018-10-05 03:36:09.671222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-05 03:36:09.671304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 1 2
2018-10-05 03:36:09.671314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N Y Y
2018-10-05 03:36:09.671322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1:   Y N Y
2018-10-05 03:36:09.671330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 2:   Y Y N
2018-10-05 03:36:09.672093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11247 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:3b:00.0, compute capability: 6.0)
2018-10-05 03:36:09.776775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11247 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:af:00.0, compute capability: 6.0)
2018-10-05 03:36:09.881314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11247 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-12GB, pci bus id: 0000:d8:00.0, compute capability: 6.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:af:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla P100-PCIE-12GB, pci bus id: 0000:d8:00.0, compute capability: 6.0
2018-10-05 03:36:09.993977: I tensorflow/core/common_runtime/direct_session.cc:291] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:af:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla P100-PCIE-12GB, pci bus id: 0000:d8:00.0, compute capability: 6.0

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-05 03:36:09.996259: I tensorflow/core/common_runtime/placer.cc:922] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-05 03:36:09.996312: I tensorflow/core/common_runtime/placer.cc:922] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-05 03:36:09.996325: I tensorflow/core/common_runtime/placer.cc:922] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
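
A quick way to check how many GPUs TensorFlow registered is to count the "Created TensorFlow device" lines in a saved copy of the log. This is just a sketch: `count_tf_gpus` and the file name `tf.log` are placeholders. For the log above it should report 3, one per Tesla P100.

```shell
# Hypothetical helper: count the GPUs TensorFlow registered,
# given a captured TensorFlow log on stdin.
count_tf_gpus() {
  grep -c 'Created TensorFlow device'
}

# Usage (tf.log is a placeholder file name):
#   count_tf_gpus < tf.log
```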
