State as of 2017-07-31.
Recently I was installing CUDA on an Azure NC6 VM with a Tesla K80, and later the same day I also upgraded my personal machine with a GTX 980 Ti from Ubuntu 15.10 to 16.04 and completely reinstalled CUDA.
I had NVIDIA driver 361, CUDA 7.5 and cuDNN 4 or 5, and wanted CUDA 8.0 for the new TensorFlow 1.2.1, so I also had to upgrade Ubuntu.
- remove old NVIDIA drivers, CUDA toolkit, cuDNN and related packages
- install NVIDIA drivers, CUDA toolkit, cuDNN
- verify it works OK
Target version:
- NVIDIA driver: 375.82 (long-lived stable branch)
- latest: 384.59 (possibly OK, but I took the stable branch to be safe)
- CUDA toolkit: 8.0
- cuDNN: 5.1
- latest: 6.0 (not supported by latest TensorFlow 1.2.1, only by 1.3.0-rc1)
One change after upgrading from Ubuntu 15.10: Ubuntu 16.04 uses the Secure Boot mechanism, which requires kernel modules to be digitally signed (originally intended to protect Windows from boot-time malware). Unfortunately this doesn't work well with third-party binary drivers such as NVIDIA's. The only working option for me was to disable Secure Boot entirely; otherwise Ubuntu is not able to load the driver. The complication is that this cannot be done over SSH, only at boot time in the UEFI BIOS settings! Since my computer is across the world, I had to call somebody to modify the BIOS settings for me.
Example procedure for my motherboard ASUS Z170-E (video):
- restart and press F8 during boot to enter the BIOS setup
- Boot -> Secure Boot (down) -> Key Management -> Clear Secure Boot Keys
- Boot -> Secure Boot -> Secure Boot state - should be disabled
- Exit -> Save changes & Reset
- restart
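Before calling anyone, you can at least check the current state over SSH. A minimal sketch (assuming the `mokutil` package is available; on a legacy non-UEFI boot the efivars directory is simply absent):

```shell
# Check boot mode and Secure Boot state without touching the BIOS.
# mokutil comes from the "mokutil" package.
if [ -d /sys/firmware/efi ]; then
    mokutil --sb-state || echo "mokutil not installed"
else
    echo "Legacy BIOS boot - Secure Boot not applicable"
fi
```

On a machine with Secure Boot enabled this prints "SecureBoot enabled", which tells you in advance that the driver install will fail.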
I recommend doing this even before upgrading Ubuntu.
PC with ASUS Z170-E motherboard and GTX 980 Ti GPU.
$ lspci | grep -i NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)
The Ubuntu upgrade itself was quite easy:
sudo do-release-upgrade
and then resolve a few conflicts in /etc
files. Had I been able to disable Secure Boot first, it would have been easier.
The whole upgrade took about an hour.
If NVIDIA driver was installed via deb package:
sudo apt-get remove --purge 'nvidia*' 'cuda*' 'libcuda1*' 'libcudnn5*' libxnvctrl0
If NVIDIA driver was installed via the *.run file:
sudo nvidia-uninstall
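If you are not sure which way the old driver got installed, a quick sketch to tell the two cases apart (the package name pattern `nvidia-` and the `/usr/bin/nvidia-uninstall` path are the usual conventions, not guaranteed on every setup):

```shell
# Detect how the existing NVIDIA driver was installed, since a deb/apt
# install and a .run install must be removed differently.
if command -v dpkg >/dev/null 2>&1 && dpkg -l 2>/dev/null | grep -q '^ii  nvidia-'; then
    echo "deb install - purge with apt-get"
elif [ -x /usr/bin/nvidia-uninstall ]; then
    echo ".run install - use nvidia-uninstall"
else
    echo "no NVIDIA driver detected"
fi
```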
There are two options:
- official installer:
NVIDIA-Linux-x86_64-<version>.run
- http://www.nvidia.com/Download/index.aspx
- http://www.nvidia.com/download/driverResults.aspx/120917/en-us
- only 384.59 is offered there now, but 375.82 can still be downloaded directly
- unofficial debian package (PPA):
nvidia-375
Although NVIDIA recommends only the official installer, it's a hassle to work with and it didn't work well for me. The Debian package (not recommended by NVIDIA) worked like a charm and without tons of questions, so my recommendation is to use the Debian package.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-375
Although I don't recommend it, here are the steps for the NVIDIA installer:
NV_VERSION=375.82
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/${NV_VERSION}/NVIDIA-Linux-x86_64-${NV_VERSION}.run
chmod +x NVIDIA-Linux-x86_64-${NV_VERSION}.run
If you have X11 server (GUI) running, you need to stop it during the installation.
## ERROR: You appear to be running an X server; please exit X before installing.
sudo service lightdm stop # display manager
sudo systemctl stop vncserver@1 # also stop VNC server if present
sudo init 3 # run level without GUI
sudo ./NVIDIA-Linux-x86_64-${NV_VERSION}.run
sudo reboot
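Whichever route you took, after the reboot it's worth confirming that the kernel module actually loaded before moving on; if Secure Boot is still enabled, this is exactly where the install fails silently:

```shell
# Check that the nvidia kernel module is loaded; the /proc file is only
# present when the driver is active.
if lsmod 2>/dev/null | grep -q '^nvidia'; then
    cat /proc/driver/nvidia/version
else
    echo "nvidia module not loaded - check Secure Boot state and dmesg"
fi
```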
https://developer.nvidia.com/cuda-downloads
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
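The toolkit lands in /usr/local/cuda-8.0 with a /usr/local/cuda symlink, and nvcc is not on PATH by default. A quick sketch to check the compiler is there:

```shell
# nvcc is not on PATH by default; extend PATH and print the version.
export PATH=/usr/local/cuda/bin:$PATH
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found - toolkit install incomplete?"
fi
```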
TensorFlow 1.2.1 needs cuDNN 5.1 (not 6.0).
It has to be downloaded via a registered NVIDIA account: https://developer.nvidia.com/rdp/cudnn-download
Download it in a browser, copy it to the target machine via SCP, and install:
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb
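To confirm the cuDNN shared library is now visible to the dynamic loader (the deb drops libcudnn.so.5 under /usr/lib/x86_64-linux-gnu):

```shell
# List loader-visible cuDNN libraries; prints a fallback message if none.
ldconfig -p | grep libcudnn || echo "libcudnn not registered with the loader"
```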
Add to ~/.profile:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
. ~/.profile
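A quick sanity check that the export took effect in the current shell:

```shell
# Re-export and verify the CUDA lib dir is actually on the search path.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH" | grep -q '/usr/local/cuda/lib64' && echo "LD_LIBRARY_PATH OK"
```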
In summary, the whole procedure:

sudo apt-get remove --purge 'nvidia*' 'cuda*' 'libcuda1*' 'libcudnn5*' libxnvctrl0
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb
sudo apt-get update
sudo apt-get install nvidia-375 cuda
sudo reboot
We should see the GPU information:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82 Driver Version: 375.82 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 0000:01:00.0 On | N/A |
| 22% 39C P8 19W / 250W | 92MiB / 6075MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1096 G /usr/lib/xorg/Xorg 53MiB |
| 0 2388 G /usr/bin/gnome-shell 37MiB |
+-----------------------------------------------------------------------------+
Let's run a simple "hello world" MNIST MLP in Keras/TensorFlow:
pip install tensorflow-gpu==1.2.1 keras==2.0.6
wget https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_mlp.py
python mnist_mlp.py
We should see that it uses the GPU and trains properly:
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
That's it. Happy training!