Using TensorFlow 1.15 with CUDA 10.0 on RHEL 8.3 is not easy, and upgrading the NVIDIA driver can break an existing TensorFlow/CUDA setup. This post sets up RHEL 8.3 with the NVIDIA driver, CUDA, and cuDNN for both TensorFlow 2.x and TensorFlow 1.15.x.
These steps were tested on RHEL 8.3:
- Uninstall current driver and cuda
- Install latest NVIDIA Driver
- Setup CUDA for TensorFlow 2.4.0
- Setup CUDA for TensorFlow 1.15.4
My current driver version on RHEL-8 is 440.31, and I wanted to update to the latest version, 460.32.03.
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31 Driver Version: 440.31 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 24% 35C P0 28W / 257W | 0MiB / 11016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# ./NVIDIA-Linux-x86_64-440.31.run --uninstall
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 440.31................
Note: this step will most likely break your existing TensorFlow setup.
# rpm -qa | grep cuda
cuda-repo-rhel8-10.2.89-1.x86_64
cuda-nsight-compute-11-1-11.1.1-1.x86_64
cuda-repo-rhel8-10-2-local-10.2.89-440.33.01-1.0-1.x86_64
cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64
cuda-repo-rhel8-10-1-local-10.1.243-418.87.00-1.0-1.x86_64
# rpm -qa | grep nvidia
# rpm -e <package-name>
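For example, to remove the repo packages listed above (these are the names from my system; substitute whatever rpm -qa printed on yours):
# sudo rpm -e cuda-repo-rhel8-10.2.89-1.x86_64
# sudo rpm -e cuda-nsight-compute-11-1-11.1.1-1.x86_64
# sudo rpm -e cuda-repo-rhel8-10-2-local-10.2.89-440.33.01-1.0-1.x86_64
# sudo rpm -e cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64
# sudo rpm -e cuda-repo-rhel8-10-1-local-10.1.243-418.87.00-1.0-1.x86_64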
Nvidia drivers are available from https://www.nvidia.com/download/index.aspx?lang=en-us
As of Jan-20-2021, the latest driver for my GPU was NVIDIA-Linux-x86_64-460.32.03.run
# chmod 777 NVIDIA-Linux-x86_64-460.32.03.run
Check the system for GPU and gcc.
# uname -r
4.18.0-240.10.1.el8_3.x86_64
# lspci | grep -i nvidia
65:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
65:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
65:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
65:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
# gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Let's update the RHEL-8 system.
# sudo yum update -y
# sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# sudo ./NVIDIA-Linux-x86_64-460.32.03.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03.. ..
# nvidia-smi
Wed Jan 20 16:51:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 24% 35C P0 28W / 257W | 0MiB / 11016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
IMPORTANT NOTE: If you look at the nvidia-smi output, you will notice that the CUDA version reported for driver 460.32.03 is 11.2 (this is the newest CUDA runtime the driver supports, not what is installed), but TensorFlow 2.4.0 only supports CUDA 11.0. So we are not going to install CUDA 11.2; instead we will install CUDA 11.0.
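If you want to double-check which CUDA and cuDNN versions a particular TensorFlow wheel was built against, recent 2.x releases expose it through tf.sysconfig.get_build_info(); run this inside the Python environment where TensorFlow is installed (for TensorFlow 2.4.0 it should report CUDA 11.0 and cuDNN 8):
# python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"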
Download the CUDA 11.0 rpm for RHEL-8 from https://developer.nvidia.com/cuda-toolkit-archive
# wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm
IMPORTANT NOTE: If you look at the rhel8-11-0 rpm, the driver version bundled with it is 450.51.06, not 460.32.03.
# sudo rpm -i cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm
# sudo dnf clean all
# dnf install cuda
Updating Subscription Management repositories.
cuda-rhel8-11-0-local 103 MB/s | 105 kB 00:00
Extra Packages for Enterprise Linux Modular 8 - x86_64 256 kB/s | 537 kB 00:02
Extra Packages for Enterprise Linux 8 - x86_64 2.9 MB/s | 8.8 MB 00:02
Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs) 14 MB/s | 27 MB 00:01
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs) 10 MB/s | 25 MB 00:02
Dependencies resolved.
==================================================================================================================================================================================================================
Package Architecture Version Repository Size
==================================================================================================================================================================================================================
Installing:
cuda x86_64 11.0.3-1 cuda-rhel8-11-0-local 2.7 k
Installing dependencies:
cuda-11-0 x86_64 11.0.3-1 cuda-rhel8-11-0-local 2.8 k
cuda-drivers x86_64 450.51.06-1
# nvidia-smi
Wed Jan 20 18:07:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 24% 36C P0 27W / 257W | 0MiB / 11016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
IMPORTANT NOTE: If you look at the nvidia-smi output, you will notice that the driver version is now 450.51.06, not the 460.32.03 we installed earlier; the cuda rpm pulled in its own bundled 450.51.06 driver as a dependency (see the cuda-drivers package in the dnf transaction above).
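The cuda rpm installs the toolkit under /usr/local/cuda-11.0 but does not put it on your PATH or library search path. A minimal sketch of the environment setup the TensorFlow logs below assume (add it to ~/.bashrc to make it persistent):
export PATH=/usr/local/cuda-11.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}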
If you don't install cuDNN, TensorFlow will not find the GPU device because it cannot load the dynamic library libcudnn.so.8:
>>> print("TF version: ", tf.__version__)
TF version: 2.4.0
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 17:57:11.951188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-20 17:57:11.951205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-20 17:57:11.951213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-20 17:57:11.951221: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-20 17:57:11.951228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-20 17:57:11.951236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-20 17:57:11.951243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-20 18:04:48.881243: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.0/lib64
2021-01-20 18:04:48.881249: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
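Before reaching for cuDNN, you can confirm the library really is missing rather than just off the search path (one possible check; if both commands come back empty, cuDNN is not installed yet):
# ldconfig -p | grep libcudnn
# ls /usr/local/cuda-11.0/lib64 | grep libcudnn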
# sudo yum -y install kernel-devel-`uname -r` kernel-headers-`uname -r`
# sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# sudo yum -y install dkms
cuDNN 8.0.4 for CUDA 11.0 is available from https://developer.nvidia.com/rdp/cudnn-archive. Download the tarball and extract it:
# tar -xvf cudnn-11.0-linux-x64-v8.0.4.30.tgz
cuda/include/cudnn.h
cuda/include/cudnn_adv_infer.h
cuda/include/cudnn_adv_train.h
cuda/include/cudnn_backend.h
cuda/include/cudnn_cnn_infer.h
cuda/include/cudnn_cnn_train.h
cuda/include/cudnn_ops_infer.h
cuda/include/cudnn_ops_train.h
cuda/include/cudnn_version.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.8
cuda/lib64/libcudnn.so.8.0.4
cuda/lib64/libcudnn_adv_infer.so
cuda/lib64/libcudnn_adv_infer.so.8
cuda/lib64/libcudnn_adv_infer.so.8.0.4
cuda/lib64/libcudnn_adv_train.so
cuda/lib64/libcudnn_adv_train.so.8
cuda/lib64/libcudnn_adv_train.so.8.0.4
cuda/lib64/libcudnn_cnn_infer.so
cuda/lib64/libcudnn_cnn_infer.so.8
cuda/lib64/libcudnn_cnn_infer.so.8.0.4
cuda/lib64/libcudnn_cnn_train.so
cuda/lib64/libcudnn_cnn_train.so.8
cuda/lib64/libcudnn_cnn_train.so.8.0.4
cuda/lib64/libcudnn_ops_infer.so
cuda/lib64/libcudnn_ops_infer.so.8
cuda/lib64/libcudnn_ops_infer.so.8.0.4
cuda/lib64/libcudnn_ops_train.so
cuda/lib64/libcudnn_ops_train.so.8
cuda/lib64/libcudnn_ops_train.so.8.0.4
cuda/lib64/libcudnn_static.a
# sudo cp cuda/include/cudnn*.h /usr/local/cuda-11.0/include
# sudo cp cuda/lib64/libcudnn* /usr/local/cuda-11.0/lib64
# sudo chmod a+r /usr/local/cuda-11.0/include/cudnn*.h /usr/local/cuda-11.0/lib64/libcudnn*
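As a quick sanity check that the headers landed where TensorFlow expects them, read the version macros out of cudnn_version.h; for this tarball it should report major 8, minor 0, patchlevel 4:
# grep "define CUDNN_MAJOR" -A 2 /usr/local/cuda-11.0/include/cudnn_version.h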
>>> print("TF version: ", tf.__version__)
TF version: 2.4.0
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 18:12:34.375773: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-20 18:12:34.377637: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-20 18:12:34.377684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-20 18:12:34.378409: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-20 18:12:34.378572: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-20 18:12:34.380425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-20 18:12:34.380856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-20 18:12:34.380928: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-20 18:12:34.381675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
Num GPUs Available: 1
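Since both TensorFlow versions will live on the same machine, the easiest arrangement is one Python virtual environment per version; a minimal sketch (environment names are arbitrary, and note that TF 1.15 needs Python 3.7 or older, so the stock RHEL-8 python3 at 3.6 is fine):
# python3 -m venv ~/venv-tf24
# source ~/venv-tf24/bin/activate
# pip install --upgrade pip
# pip install tensorflow==2.4.0
# deactivate
# python3 -m venv ~/venv-tf115
# source ~/venv-tf115/bin/activate
# pip install --upgrade pip
# pip install tensorflow-gpu==1.15.4
# deactivate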
>>> print("TF version: ", tf.__version__)
TF version: 1.15.4
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 15:30:28.794748: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794783: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794812: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794841: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794869: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794899: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794927: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794933: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
As you can see, without CUDA 10.0 it is not possible to use TensorFlow 1.15.4 on the GPU.
IMPORTANT NOTE: TensorFlow 1.15.x only supports CUDA 10.0, so we need to install CUDA 10.0 on RHEL-8. However, https://developer.nvidia.com/cuda-toolkit-archive does not list an rhel8 repo for CUDA 10.0 as of this writing, so we will download the cuda-repo-rhel7-10-0 rpm from the same archive and use it to install CUDA 10.0.
# sudo rpm -ivh cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64.rpm
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:cuda-repo-rhel7-10-0-local-10.0.1################################# [100%]
We are still using gcc 8.3.1, the stock RHEL-8 compiler. CUDA 10.0 officially targets older gcc releases, but that only matters if you compile CUDA code with nvcc; the prebuilt TensorFlow wheels do not need it.
# yum install cuda-10-0
Updating Subscription Management repositories.
Last metadata expiration check: 0:00:17 ago on Wed 20 Jan 2021 06:09:26 PM EST.
Dependencies resolved.
==================================================================================================================================================================================================================
Package Architecture Version Repository Size
==================================================================================================================================================================================================================
Installing:
cuda-10-0 x86_64 10.0.130-1 cuda-10-0-local-10.0.130-410.48 6.1 k
Installing dependencies:
cuda-driver-dev-10-0 x86_64 10.0.130-1 cuda-10-0-local-10.0.130-410.48 20 k
# nvidia-smi
Wed Jan 20 19:28:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 24% 32C P8 27W / 257W | 158MiB / 11016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 182307 C ...-battship/env/bin/python3 155MiB |
+-----------------------------------------------------------------------------+
Note: the driver version has not changed after installing cuda-10-0; the 450.51.06 driver is new enough to run both cuda-11.0 and cuda-10.0, so the two toolkits can live side by side under /usr/local.
cuDNN 7.6.5 for CUDA 10.0 is available from https://developer.nvidia.com/rdp/cudnn-archive. Download the tarball and extract it:
# tar -xvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.6.5
cuda/lib64/libcudnn_static.a
# sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.0/include
# sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
# sudo chmod a+r /usr/local/cuda-10.0/include/cudnn*.h /usr/local/cuda-10.0/lib64/libcudnn*
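For the TensorFlow 1.15.4 check below, the CUDA 10.0 libraries have to be reachable on the library path. Because the two toolkits use different library sonames (libcudart.so.10.0 vs libcudart.so.11.0), it is safe to simply list both directories, e.g.:
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda-11.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}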
>>> print("TF version: ", tf.__version__)
TF version: 1.15.4
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 18:20:01.798000: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-20 18:20:01.798009: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-20 18:20:01.798016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-20 18:20:01.798024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-20 18:20:01.798032: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-20 18:20:01.798039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-20 18:20:01.798047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-20 18:20:01.798729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-01-20 18:20:01.798749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-20 18:20:01.798755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2021-01-20 18:20:01.798759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2021-01-20 18:20:01.799487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 , pci bus id: 0000:65:00.0, compute capability: 7.5)
Finally, here is the small script behind the version and GPU checks above; it runs unchanged under both TF 2.4.0 and TF 1.15.4.
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow.python.client import device_lib


def get_available_devices():
    # List every device TensorFlow can see (CPU and GPU) and return their names.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]


print("TF version: ", tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print("Devices Available: ", get_available_devices())