date | title | tags
---|---|---
2020-02-29 | Proper CUDA and cuDNN installation |
You're here, so you're probably already hurting from CUDA and cuDNN compatibility problems, and I won't have to motivate you or explain why you'd want standalone CUDA and cuDNN installations if you're going to develop with TensorFlow in the long term.
Check out the TF compatibility matrix. Then hit NVIDIA's CUDA toolkit archive and grab the local runfiles for each version you need:
```shell
wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda_11.5.1_495.29.05_linux.run
wget https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda_11.6.0_510.39.01_linux.run
```
```shell
sudo bash cuda_10.1.243_418.87.00_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-10.1.243 --librarypath=/usr/local/cuda-10.1.243
sudo bash cuda_10.2.89_440.33.01_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-10.2.89 --librarypath=/usr/local/cuda-10.2.89
sudo bash cuda_11.0.3_450.51.06_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-11.0.3 --librarypath=/usr/local/cuda-11.0.3
sudo bash cuda_11.2.2_460.32.03_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-11.2.2 --librarypath=/usr/local/cuda-11.2.2
sudo bash cuda_11.5.1_495.29.05_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-11.5.1 --librarypath=/usr/local/cuda-11.5.1
sudo bash cuda_11.6.0_510.39.01_linux.run --no-man-page --override --silent \
  --toolkit --toolkitpath=/usr/local/cuda-11.6.0 --librarypath=/usr/local/cuda-11.6.0
```
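If you want to sanity-check that the side-by-side toolkits actually landed where you pointed them, a quick helper over the install prefixes does it. This is my own little sketch, assuming the `--toolkitpath` convention used above:

```shell
# List every side-by-side CUDA toolkit under a prefix (default /usr/local)
# and report the compiler version each one ships.
list_cuda_toolkits() {
  local root="${1:-/usr/local}" dir
  for dir in "$root"/cuda-*; do
    [ -x "$dir/bin/nvcc" ] || continue   # not a toolkit install, skip it
    printf '%s: ' "$dir"
    "$dir/bin/nvcc" --version | tail -n 1
  done
}

# e.g.: list_cuda_toolkits            # scans /usr/local
```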
```shell
wget https://developer.download.nvidia.com/compute/redist/cudnn/v7.6.5/cudnn-10.1-linux-x64-v7.6.5.32.tgz
wget https://developer.download.nvidia.com/compute/redist/cudnn/v7.6.5/cudnn-10.2-linux-x64-v7.6.5.32.tgz
wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.0.5/cudnn-11.0-linux-x64-v8.0.5.39.tgz
wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.1.1/cudnn-11.2-linux-x64-v8.1.1.33.tgz
wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz
```
```shell
sudo mkdir /usr/local/cudnn-10.1-7.6.5.32
sudo tar -xzf cudnn-10.1-linux-x64-v7.6.5.32.tgz -C /usr/local/cudnn-10.1-7.6.5.32 --strip 1
sudo mkdir /usr/local/cudnn-10.2-7.6.5.32
sudo tar -xzf cudnn-10.2-linux-x64-v7.6.5.32.tgz -C /usr/local/cudnn-10.2-7.6.5.32 --strip 1
sudo mkdir /usr/local/cudnn-11.0-8.0.5.39
sudo tar -xzf cudnn-11.0-linux-x64-v8.0.5.39.tgz -C /usr/local/cudnn-11.0-8.0.5.39 --strip 1
sudo mkdir /usr/local/cudnn-11.2-8.1.1.33
sudo tar -xzf cudnn-11.2-linux-x64-v8.1.1.33.tgz -C /usr/local/cudnn-11.2-8.1.1.33 --strip 1
sudo mkdir /usr/local/cudnn-11.5-8.3.2.44
sudo tar -xf cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz -C /usr/local/cudnn-11.5-8.3.2.44 --strip 1
```

Note that the 11.5 archive uses NVIDIA's newer packaging: it's a `.tar.xz`, and its libraries land under `lib/` rather than `lib64/`, so adjust your `LD_LIBRARY_PATH` entries for it accordingly.
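To confirm each extracted tree really is what its directory name claims, you can read the version macros straight out of the headers. cuDNN 7 keeps them in `cudnn.h`, while cuDNN 8 moved them to `cudnn_version.h`; this little helper (my own, not NVIDIA's) checks both:

```shell
# Print the cuDNN version recorded in a tree's headers.
# cuDNN 7 defines CUDNN_MAJOR/MINOR/PATCHLEVEL in include/cudnn.h;
# cuDNN 8 moved them to include/cudnn_version.h.
cudnn_version() {
  local dir="$1" hdr
  for hdr in "$dir/include/cudnn_version.h" "$dir/include/cudnn.h"; do
    [ -f "$hdr" ] || continue
    awk '$1 == "#define" && $2 ~ /^CUDNN_(MAJOR|MINOR|PATCHLEVEL)$/ { v[$2] = $3 }
         END { print v["CUDNN_MAJOR"] "." v["CUDNN_MINOR"] "." v["CUDNN_PATCHLEVEL"] }' "$hdr"
    return 0
  done
  echo "no cuDNN headers under $dir" >&2
  return 1
}

# e.g.: cudnn_version /usr/local/cudnn-10.1-7.6.5.32
```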
Your TensorFlow-using application will load the TF language-support `.so`, which will load `libtensorflow.so`, which in turn dynamically loads the various CUDA and cuDNN libraries. We'll use the `LD_LIBRARY_PATH` environment variable to tell the dynamic loader where to look for the shared library files first. That way we force TensorFlow to load the very specific CUDA and cuDNN libraries that are compatible with it.
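If you want to verify which shared objects the loader actually resolves, and from where, glibc's `LD_DEBUG` tracing comes in handy. A sketch, using the 10.1 paths from this post (substitute your own):

```shell
# Ask the glibc dynamic loader to trace library resolution while TensorFlow
# imports, then keep only the CUDA/cuDNN hits. Useful for confirming that
# your LD_LIBRARY_PATH entries win over any system-wide copies.
LD_DEBUG=libs \
LD_LIBRARY_PATH=/usr/local/cuda-10.1.243/lib64:/usr/local/cudnn-10.1-7.6.5.32/lib64 \
  python -c 'import tensorflow' 2>&1 | grep -E 'libcu(dart|dnn)'
```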
In addition, the TensorFlow developers seem to like hardcoding paths and values, such as the path to the `ptxas` binary, so you'll encounter this error:

```
2020-03-01 13:19:42.121134: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
```

Then your program will hang for a good minute or so. So nice.
As @DawyD informs us in his comment on #33375, the currently accepted workaround is to symbolically link the CUDA version's `bin` directory into the current working directory from which you execute your TensorFlow application.
One more clarification, even assuming you've matched everything against the TF compatibility matrix: if you hit

```
2020-03-04 18:49:06.955169: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
```

you need to work around that TensorFlow issue by setting this environment variable:

```shell
export TF_FORCE_GPU_ALLOW_GROWTH=true
```
Alternatively, you can enable memory growth from your Python code:

```python
import tensorflow as tf

# Let GPU memory allocation grow on demand instead of grabbing it all up front.
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
```
Putting it all together, here's how you run against each CUDA/cuDNN pair:

```shell
$ ln -snf /usr/local/cuda-10.0.130/bin bin
$ LD_LIBRARY_PATH=/usr/local/cuda-10.0.130/lib64:/usr/local/cudnn-10.0-7.4.2.24/lib64 python -c 'import tensorflow'
$ ln -snf /usr/local/cuda-10.1.243/bin bin
$ LD_LIBRARY_PATH=/usr/local/cuda-10.1.243/lib64:/usr/local/cuda-10.1.243/extras/CUPTI/lib64:/usr/local/cudnn-10.1-7.6.5.32/lib64 python -c 'import tensorflow'
$ ln -snf /usr/local/cuda-10.2.89/bin bin
$ LD_LIBRARY_PATH=/usr/local/cuda-10.2.89/lib64:/usr/local/cuda-10.2.89/extras/CUPTI/lib64:/usr/local/cudnn-10.2-7.6.5.32/lib64 python -c 'import tensorflow'
$ ln -snf /usr/local/cuda-11.0.3/bin bin
$ LD_LIBRARY_PATH=/usr/local/cuda-11.0.3/lib64:/usr/local/cuda-11.0.3/extras/CUPTI/lib64:/usr/local/cudnn-11.0-8.0.5.39/lib64 python -c 'import tensorflow'
$ ln -snf /usr/local/cuda-11.2.2/bin bin
$ LD_LIBRARY_PATH=/usr/local/cuda-11.2.2/lib64:/usr/local/cuda-11.2.2/extras/CUPTI/lib64:/usr/local/cudnn-11.2-8.1.1.33/lib64 python -c 'import tensorflow'
```
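Since typing those exports and symlinks gets old fast, I'd wrap the pattern in a small shell function. The name `use_cuda` and its two-argument convention are my own sketch, not anything official; the arguments follow the directory names created earlier:

```shell
# Point the current shell (and working directory) at one CUDA/cuDNN pair.
# Usage: use_cuda <cuda-version> <cudnn-dir-suffix>
#   e.g.: use_cuda 11.2.2 11.2-8.1.1.33
use_cuda() {
  local cuda="$1" cudnn="$2"
  export LD_LIBRARY_PATH="/usr/local/cuda-${cuda}/lib64:/usr/local/cuda-${cuda}/extras/CUPTI/lib64:/usr/local/cudnn-${cudnn}/lib64"
  # the ptxas workaround: TF looks for ./bin/ptxas relative to the cwd
  ln -snf "/usr/local/cuda-${cuda}/bin" bin
}
```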