Last active
May 14, 2018 20:41
Host a RHEL7.4 Guest VM on Ubuntu 18.04 with PCI passthrough for NVIDIA GPU for Deep Learning on nvidia-docker and Openshift
Links :
https://medium.com/@calerogers/gpu-virtualization-with-kvm-qemu-63ca98a6a172
http://vfio.blogspot.co.uk/2015/05/vfio-gpu-how-to-series-part-1-hardware.html

#####HOST#######
Ubuntu 18.04 with a functioning NVIDIA GeForce GTX 1050 Ti.
nvidia-docker and GPU-based TensorFlow both work well.
justin@justin-ubuntu:~$ lspci
00:00.0 Host bridge: Intel Corporation Intel Kaby Lake Host Bridge (rev 05)
00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH Thermal Subsystem
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #1 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
00:1c.6 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #7 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (Z270)
00:1f.2 Memory controller: Intel Corporation 200 Series PCH PMC
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series PCH SMBus Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
03:00.0 USB controller: ASMedia Technology Inc. Device 2142
04:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8192EE PCIe Wireless Network Adapter
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
justin@justin-ubuntu:~$ nvidia-smi
Sat May 12 09:11:46 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   38C    P8    N/A /  90W |   3814MiB /  4038MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1726      G   /usr/lib/xorg/Xorg                            24MiB |
|    0      1821      G   /usr/bin/gnome-shell                          48MiB |
|    0      2939      G   /usr/lib/xorg/Xorg                           203MiB |
|    0      3083      G   /usr/bin/gnome-shell                         177MiB |
|    0      3568      G   ...-token=03EC8ADDFA5A61CA5607DDD3A8C603D3    63MiB |
|    0      4607      G   gnome-control-center                           1MiB |
|    0     23089      C   /usr/bin/python                             3281MiB |
+-----------------------------------------------------------------------------+
justin@justin-ubuntu:~$
root@justin-ubuntu:/etc/initramfs-tools# lspci -nnk | grep -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)

CHANGES
root@justin-ubuntu:/etc/default# diff grub.backup grub
12c12
< GRUB_CMDLINE_LINUX=""
---
> GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt rd.driver.pre=vfio-pci"
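
Once the host has rebooted with intel_iommu=on, it is worth confirming that the IOMMU is actually active and seeing which group the GPU landed in, since passthrough generally needs every device in that group detached from host drivers. A minimal sketch, not part of the original notes: list_iommu_groups is a hypothetical helper name, and it assumes the usual sysfs layout under /sys/kernel/iommu_groups.

```shell
# Print "<group><TAB><pci address>" for every device in every IOMMU group.
# The root directory is a parameter only so the function can be tried on a
# test tree; the default is the standard sysfs location (an assumption).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # If the glob matched nothing, the IOMMU is off or unsupported.
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<address>"
        group="${group##*/}"        # keep only the group number
        printf '%s\t%s\n' "$group" "${dev##*/}"
    done
}

# Usage on the host: list_iommu_groups | grep 01:00
```

No output at all suggests the kernel parameters did not take effect or VT-d is disabled in the firmware.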
root@justin-ubuntu:/etc/initramfs-tools# diff modules.backup modules
11a12,17
>
> # Added for PCI passthrough for NVidia card
> vfio
> vfio_iommu_type1
> vfio_pci
> vfio_virqfd
New file :
root@justin-ubuntu:/etc/modprobe.d# cat /etc/modprobe.d/local.conf
options vfio-pci ids=10de:1c82,10de:0fb9
options vfio-pci disable_vga=1
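
The values in ids= are the [vendor:device] pairs that lspci -nn prints for the GPU and its audio function. A rough sketch of deriving that line rather than typing it by hand; vfio_ids_line is a hypothetical helper, and it assumes every device with the given vendor ID (default 10de, NVIDIA) should be passed through, which would be wrong on a host that keeps a second NVIDIA card for itself.

```shell
# Read `lspci -nn` output on stdin and print an "options vfio-pci ids=..."
# line covering every device of the given vendor (default: 10de, NVIDIA).
vfio_ids_line() {
    vendor="${1:-10de}"
    # Extract each [vvvv:dddd] tag for the vendor, drop the brackets,
    # remove duplicates while keeping order, and join with commas.
    ids=$(grep -o "\[${vendor}:[0-9a-f]\{4\}\]" \
        | tr -d '[]' \
        | awk '!seen[$0]++' \
        | paste -sd, -)
    [ -n "$ids" ] && printf 'options vfio-pci ids=%s\n' "$ids"
}

# Usage on the host: lspci -nn | vfio_ids_line
```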
Apply the changes and reboot :
update-grub
update-initramfs -u
reboot
After the reboot both functions of the card should be bound to vfio-pci :
lspci -nnk
...
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] GP107 [GeForce GTX 1050 Ti] [1462:3351]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] GP107GL High Definition Audio Controller [1462:3351]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
...
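
The thing to look for in that listing is "Kernel driver in use: vfio-pci" on both 01:00.0 and 01:00.1. A small sketch for checking it without reading the whole output; driver_in_use is a hypothetical helper that parses lspci -nnk text from stdin, and the slot addresses are the ones from this machine.

```shell
# Print the kernel driver bound to one PCI slot, given `lspci -nnk` on stdin.
# Prints nothing if the slot has no driver bound.
driver_in_use() {
    slot="$1"
    awk -v slot="$slot" '
        # Device header lines start with bus:device.function, e.g. "01:00.0".
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7]/ { current = $1 }
        current == slot && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use:[ \t]*/, "")
            print
        }
    '
}

# Usage on the host:
#   lspci -nnk | driver_in_use 01:00.0
#   lspci -nnk | driver_in_use 01:00.1
```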

#######GUEST : RHEL7.4 VM config with virt-manager#######
From "Add Hardware" : add a PCI Host Device for the GPU and another for its associated audio device.
Boot the VM.
Before attempting to install the NVIDIA driver, prevent the driver from discovering that it is running on KVM.
On the host, run :
virsh edit <vm_name>
and add
<kvm>
  <hidden state='on'/>
</kvm>
inside <features>, so that the section reads :
<features>
  <acpi/>
  <apic/>
  <kvm>
    <hidden state='on'/>
  </kvm>
  <vmport state='off'/>
</features>
The change takes effect the next time the VM is started from a powered-off state.
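
To confirm the edit was saved, the domain XML can be dumped and searched for the hidden element. A minimal sketch; kvm_hidden_enabled is a hypothetical helper, and it assumes libvirt writes the element on one line, as in the snippet above.

```shell
# Succeed (exit 0) if the domain XML on stdin contains <hidden state='on'/>.
# Accepts single or double quotes around the attribute value.
kvm_hidden_enabled() {
    grep -Eq "<hidden[[:space:]]+state=['\"]on['\"][[:space:]]*/>"
}

# Usage on the host:
#   virsh dumpxml <vm_name> | kvm_hidden_enabled && echo "KVM signature hidden"
```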
In the guest, subscribe, then set up some repos for later :
subscription-manager repos --enable=rhel-7-server-extras-rpms
subscription-manager repos --disable=rhel-7-server-htb-rpms
Update everything :
yum -y update
Install build tools and utilities (lspci comes from pciutils) :
yum install gcc kernel-devel wget pciutils yum-utils
Enable EPEL :
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
rpm -Uvh epel-release*rpm
reboot
To see PCI devices :
lspci -nnk
Initially you get the default NVIDIA driver, nouveau :
...
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel driver in use: nouveau
        Kernel modules: nouveau
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
...

Remove nouveau driver & NVIDIA driver installation
Follow this :
https://access.redhat.com/solutions/1155663
or, perhaps better, this :
https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
Edit /etc/default/grub and add the following to the GRUB_CMDLINE_LINUX line :
modprobe.blacklist=nouveau
Then regenerate the grub config and reboot :
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot
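
The grub edit can also be scripted so that it is safe to run more than once. A minimal sketch, assuming a simple parameter without sed metacharacters and a GRUB_CMDLINE_LINUX value in double quotes; add_grub_param is a hypothetical helper, and grub2-mkconfig still has to be run afterwards.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX in a grub defaults file
# unless it is already there. Keeps a copy of the old file as <file>.bak.
add_grub_param() {
    file="$1"
    param="$2"
    # Nothing to do if the parameter is already on the line.
    if grep "^GRUB_CMDLINE_LINUX=" "$file" | grep -qF -- "$param"; then
        return 0
    fi
    cp "$file" "$file.bak"
    # Re-emit the line with the parameter added before the closing quote.
    sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" \
        "$file.bak" > "$file"
}

# Usage in the guest, as root:
#   add_grub_param /etc/default/grub modprobe.blacklist=nouveau
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```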
After removing nouveau :
...
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel modules: nouveau
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
...
After driver build, install and reboot :
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
[justin@localhost ~]$ nvidia-smi
Sat May 12 14:33:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:00:08.0 Off |                  N/A |
|  0%   41C    P0    N/A /  90W |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Doing useful ML work with the GPU, two options :

####1. Install nvidia-docker
nvidia-docker requires docker-ce :
https://stackoverflow.com/questions/42981114/install-docker-ce-17-03-on-rhel7
docker-ce requires pigz, which comes from EPEL :
https://centos.pkgs.org/7/epel-x86_64/pigz-2.3.4-1.el7.x86_64.rpm.html
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
rpm -Uvh epel-release*rpm
yum install pigz
yum install yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
subscription-manager repos --enable=rhel-7-server-extras-rpms
yum install docker-ce
Then nvidia-docker itself :
https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
systemctl enable docker
systemctl start docker
Check that a container can see the GPU :
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
###Tensorflow
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker
Run :
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
This gives you a Jupyter notebook on port 8888.
To use tflearn, run this in the first cell :
import sys
!{sys.executable} -m pip install tflearn

####2. Openshift with NVIDIA device plugin
For GPU-enabled Openshift, follow this :
https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
Deploying the caffe2 pod failed with :
Error from server (Forbidden): error when creating "caffe2.yaml": pods "caffe2" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 8888: Host ports are not allowed to be used]
Solution :
https://adam.younglogic.com/2017/06/creating-a-privileged-container-in-openshift/
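
The error says the pod asks for things the restricted SCC forbids (privileged, hostNetwork, hostPort). One common way around it, offered here as an assumption rather than a quote from the linked post, is to grant the privileged SCC to the service account that runs the pod. grant_privileged_scc and the project name are hypothetical; setting OC=echo previews the command instead of running it.

```shell
# Grant the "privileged" SCC to a service account in a project.
# $1 = project (namespace), $2 = service account (default: "default").
# Requires cluster-admin rights; OC can be set to another binary or to
# "echo" for a dry run.
grant_privileged_scc() {
    project="$1"
    sa="${2:-default}"
    ${OC:-oc} adm policy add-scc-to-user privileged -z "$sa" -n "$project"
}

# Usage: grant_privileged_scc <project>
```

This opens up that service account considerably, so it is best kept to a dedicated project for GPU workloads.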
Deploying Jupyter/Tensorflow/GPU :
https://github.com/justindav1s/openshift-ansible-on-openstack/tree/master/nvidia