# WARNING: These steps no longer seem to work!
#!/bin/bash
# Purge existing CUDA first
sudo apt --purge remove "cublas*" "cuda*"
sudo apt --purge remove "nvidia*"
# Install CUDA Toolkit 10
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && sudo apt update
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt update
sudo apt install -y cuda
# Install CuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install -y libcudnn7 libcudnn7-dev libnccl2 libc-ares-dev
sudo apt autoremove
sudo apt upgrade
# Link libraries to standard locations
sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
echo 'If everything worked fine, reboot now.'
The fix is to run this before installing cuda: mkdir -p /usr/share/man/man1
(ref: geerlingguy/ansible-role-java#64)
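For concreteness, a minimal sketch of where that workaround would sit relative to the script above (the directory is one that some package postinstall scripts expect to exist):
# Create the man directory that some postinstall scripts assume is present
sudo mkdir -p /usr/share/man/man1
# ...then proceed with the CUDA install step
sudo apt install -y cuda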
This did not work for me. How do I undo all of these steps?
sudo apt --fix-broken install
sudo dpkg --configure -a
sudo apt-get clean
dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 sudo dpkg --purge
df -h
sudo apt-get purge nvidia*
sudo apt-get -f install
sudo apt autoremove
sudo apt-get --purge remove "cublas" "cuda*"
sudo apt-get --purge remove "nvidia"
I ran this code and restarted my PC. Now Ubuntu isn't loading the display at all; it stops right after the login screen. What should I do?
When we run:
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
We get:
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
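That error comes from a path mismatch in the script: it creates /usr/local/cuda-10.0/nccl/lib but then links into /usr/local/cuda/nccl/lib, which only exists if the /usr/local/cuda symlink was set up by the cuda package. A hedged workaround is to link into the versioned directory the script actually creates:
# Link NCCL into the versioned CUDA directory created earlier in the script
sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda-10.0/nccl/lib/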
These steps don't work because Nvidia's repo forces you to install nvidia-driver-418, which does not support Turing cards, which are probably the reason you are trying to upgrade CUDA in the first place.
Edit: to avoid getting the 418 driver, install one of the toolkit packages, e.g. cuda-toolkit-10-1, instead of the cuda package. Then you can keep your existing, working driver version. This avoids the black screen when your card is too new for 418.
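As a concrete sketch of that suggestion (package name taken from the comment above; pick whichever toolkit version you actually need):
# Toolkit-only metapackage; does not force a particular nvidia-driver-* package
sudo apt install -y cuda-toolkit-10-1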
@bogdan-kulynych
Don't we need an NVIDIA driver first so that nvidia-smi shows something?
So I ran:
sudo ubuntu-drivers autoinstall
before starting these instructions, and now I get:
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
Hi,
if I run the code I get 10.1 installed and not 10.0. Any idea why?
Same thing here..
@ifaino, you can try the run file at this link. Rename the file and run the installer:
mv cuda_10.0.130_410.48_linux cuda_10.0.130_410.48_linux.run
sudo sh cuda_10.0.130_410.48_linux.run
Same here!!
The run file did not work for me!
Here is a solution specifically for CUDA 10.0 that worked for me: https://gist.github.com/Mahedi-61/2a2f1579d4271717d421065168ce6a73#file-cuda_10-0_installation_on_ubuntu_18-04
This worked for me, but it installs CUDA 10.1 instead of 10.0. Any fix for that?
@istiaq28 Just replace
sudo apt install -y cuda
with
sudo apt install -y cuda-10-0
I was having the same issue: apt update and apt upgrade will replace cuda-10.0 with cuda-10.1. To resolve this I had to run:
sudo apt install cuda=10.0.130-1
Don't run the upgrade command; if you do, add cuda-10.1 to the blocked packages or rerun the above command after removing cuda.
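One hedged way to "block" the newer package so a later apt upgrade cannot pull it in (assuming the Deb-repo packages above) is apt's hold mechanism:
# Pin the cuda metapackage so upgrades won't replace 10.0 with 10.1
sudo apt-mark hold cuda
# To undo the pin later:
sudo apt-mark unhold cuda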
And if you want to see the available versions:
apt-cache showpkg cuda
It might be useful to add this as the very first line:
sudo apt list > ~/apt_list_backup.txt
...in case people need to undo some of the changes.
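A slight variation on that idea (a sketch; the file names are arbitrary) that also makes the changes easy to review afterwards:
# Record installed packages before touching anything
apt list --installed > ~/apt_list_before.txt
# ...and after the CUDA install, diff to see exactly what changed
apt list --installed > ~/apt_list_after.txt
diff ~/apt_list_before.txt ~/apt_list_after.txt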
Finally, to verify the installation, check
nvidia-smi
nvcc -V
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
At what point is nvcc installed?
I did "sudo apt-get cuda toolkit" before this script and then it worked out.
Very useful! Thanks
I needed to downgrade CUDA from 10.2 to 10.0 because PyTorch 1.5.1 does not support Tesla K40 GPUs...
I reinstalled Pytorch 1.2.0 with:
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
Important steps before the PyTorch installation:
- Use sudo apt install cuda=10.0.130-1 instead of sudo apt install cuda
- Don't use sudo apt upgrade
- Include this line in ~/.bashrc: export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
- Warning: nvidia-smi shows 'CUDA Version 11.0', but v10.0 is really what is installed and working correctly... I don't know why it shows a different version...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K40c On | 00000000:03:00.0 Off | 0 |
| 28% 57C P0 66W / 235W | 1058MiB / 11441MiB | 0% Default |
| | | N/A |
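For what it's worth, the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, not the toolkit that is actually installed. A hedged way to check both (assuming the Deb install above put the toolkit under /usr/local/cuda-10.0):
# Driver-supported CUDA version (what nvidia-smi prints in its header)
nvidia-smi
# Actually installed toolkit version
nvcc --version
# or, if nvcc is not on PATH yet:
cat /usr/local/cuda-10.0/version.txt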
Worked thanks!!!
Here is the version that works for me. Credits to @jpison and @bogdan-kulynych.
#!/bin/bash
# Purge existing CUDA first
sudo apt --purge remove "cublas*" "cuda*"
sudo apt --purge remove "nvidia*"
# Install CUDA Toolkit 10
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && sudo apt update
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt update
sudo apt install -y cuda=10.0.130-1
# Install CuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install -y libcudnn7 libcudnn7-dev libnccl2 libc-ares-dev
sudo apt autoremove
# sudo apt upgrade
# Link libraries to standard locations
# sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
# sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
# sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
echo 'export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'If everything worked fine, reboot now.'
- Since we are using the Deb installation method, we don't need to change the LD_LIBRARY_PATH variable (reference).
- After running this script, you need to reboot the system.
- Type nvidia-smi and nvcc --version to verify your installation.
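One small note on the PATH line: it is appended to ~/.bashrc, so it only takes effect in new shells. To pick it up in the current shell right away:
# Reload ~/.bashrc so the cuda-10.0 bin directory is on PATH immediately
source ~/.bashrc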
I had been looking all day for a way to install CUDA 10.0, but every method I found ended with the message "cuda : Depends: cuda-10-0 (>= 10.0.130) but it is not going to be installed".
I am quite sure the above procedure is the only way to install the CUDA 10.0 that tensorflow-gpu==1.14 requires on "Ubuntu 18.04.5 LTS".
The really important thing is to never install an "nvidia-driver-***" package yourself.
The required NVIDIA drivers are installed while running sudo apt install -y cuda=10.0.130-1
In addition, the following commands were not necessary for me:
# It seems no need following:
sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
Here you can see my GPU running, about twice as fast as TensorFlow on the CPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M On | 00000000:01:00.0 Off | N/A |
| N/A 55C P0 N/A / N/A | 3972MiB / 4046MiB | 66% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1133 G /usr/lib/xorg/Xorg 139MiB |
| 0 N/A N/A 1388 G /usr/bin/gnome-shell 120MiB |
| 0 N/A N/A 17390 C python 3706MiB |
+-----------------------------------------------------------------------------+
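As a hedged sanity check that tensorflow-gpu==1.14 is actually using the GPU (assuming it is installed in the active Python environment):
# Should print True and log the GTX 960M being picked up
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"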
I have a Tesla K80 GPU (Linux) and I want to use TF 1.14.
On my dev GPU machine I have installed driver 435.21 and CUDA 10; I can see the PIDs and the memory spike there. However, on a different k8s pod which comes with nvidia-driver 450 preinstalled, I am not able to see any processes when running nvidia-smi (CUDA 10).
I am using the same Docker image to install CUDA, cuDNN, and tensorflow_gpu==1.14.0, which worked with driver version 435.21.
Does anyone have any idea what is going wrong?
This crashed my server..
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
At what point is nvcc installed?
You can get nvcc by installing the NVIDIA CUDA toolkit: sudo apt install nvidia-cuda-toolkit. Now you can check the directory again. Good luck!
While running sudo apt install -y cuda=10.0.130-1 I ran into an error: "cuda : Depends: cuda-10-0 (>= 10.0.130) but it is not going to be installed". Running sudo dpkg --force-all -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb fixed it for me. Note, though, that this will overwrite any existing versions.
Did you get any error message during installation?
A possible reason for the error is having multiple versions of CUDA installed; it is always advisable to remove all existing CUDA versions before installing a new one.
Try removing all CUDA versions by running the following commands:
sudo apt --purge remove "cublas*" "cuda*"
sudo apt --purge remove "nvidia*"
then run the script again.