# WARNING: These steps seem to not work anymore!
#!/bin/bash
# Purge existing CUDA first
sudo apt --purge remove "cublas*" "cuda*"
sudo apt --purge remove "nvidia*"
# Install CUDA Toolkit 10
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && sudo apt update
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt update
sudo apt install -y cuda
# Install cuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install -y libcudnn7 libcudnn7-dev libnccl2 libc-ares-dev
sudo apt autoremove
sudo apt upgrade
# Link libraries to standard locations
sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
echo 'If everything worked fine, reboot now.'
Finally, to verify the installation, check `nvidia-smi` and `nvcc -V`.
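For a slightly fuller check (a minimal sketch; the nvcc path assumes the default Deb install location under /usr/local/cuda-10.0):
nvidia-smi                               # driver loads and sees the GPU
/usr/local/cuda-10.0/bin/nvcc -V         # toolkit should report release 10.0
dpkg -l | grep -E "libcudnn7|libnccl2"   # cuDNN 7 and NCCL 2 packages are installed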
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
At what point is nvcc installed?
I did "sudo apt-get cuda toolkit" before this script and then it worked out.
Very useful! Thanks
I needed to downgrade CUDA from 10.2 to 10.0 because PyTorch 1.5.1 does not support Tesla K40 GPUs...
I reinstalled Pytorch 1.2.0 with:
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
Important steps before the PyTorch installation (a quick check that PyTorch actually sees CUDA 10.0 is sketched after the nvidia-smi output below):
- Use `sudo apt install cuda=10.0.130-1` instead of `sudo apt install cuda`
- Don't use `sudo apt upgrade`
- Include this line in ~/.bashrc: `export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}`
- Warning: `nvidia-smi` shows 'CUDA Version 11.0' but v10.0 is really installed and working correctly (it seems `nvidia-smi` reports the highest CUDA version the driver supports, not the toolkit version that is actually installed).
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K40c On | 00000000:03:00.0 Off | 0 |
| 28% 57C P0 66W / 235W | 1058MiB / 11441MiB | 0% Default |
| | | N/A |
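To confirm that PyTorch actually picks up CUDA 10.0 (a minimal sketch, assuming the conda environment created with the `conda install` command above is active):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Expected output along the lines of: 1.2.0 10.0.130 True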
Worked thanks!!!
Here is the version that works for me. Credits to @jpison and @bogdan-kulynych.
#!/bin/bash
# Purge existing CUDA first
sudo apt --purge remove "cublas*" "cuda*"
sudo apt --purge remove "nvidia*"
# Install CUDA Toolkit 10
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && sudo apt update
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt update
sudo apt install -y cuda=10.0.130-1
# Install CuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install -y libcudnn7 libcudnn7-dev libnccl2 libc-ares-dev
sudo apt autoremove
# sudo apt upgrade
# Link libraries to standard locations
# sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
# sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
# sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
echo 'export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'If everything worked fine, reboot now.'
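Since `sudo apt upgrade` is deliberately commented out so that apt does not pull in a newer CUDA, one option (my own addition, not part of the script above) is to hold the relevant packages so a later upgrade cannot replace them:
sudo apt-mark hold cuda cuda-10-0 libcudnn7 libcudnn7-dev libnccl2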
- Since we are using the Deb installation method, we don't need to change the `LD_LIBRARY_PATH` variable (reference); a sketch of what you would otherwise export is shown after this list.
- After running this script, you need to reboot the system.
- Type `nvidia-smi` and `nvcc --version` to verify your installation.
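For reference only (this export is not in the original comment and is not needed with the Deb packages; it is roughly what a runfile-style install would require in ~/.bashrc alongside the PATH line):
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}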
I had been looking all day for a way to install CUDA 10.0, but every method I found ended up with the message "cuda : Depends: cuda-10-0 (>= 10.0.130) but it is not going to be installed".
I am quite sure the above procedure is the only way to install the CUDA 10.0 version that tensorflow-gpu==1.14 requires on "Ubuntu 18.04.5 LTS".
The very important thing is to never install an "nvidia-driver-***" driver by yourself.
The required NVIDIA drivers are installed while doing sudo apt install -y cuda=10.0.130-1
In addition, for me the following commands were not necessary.
# It seems no need following:
sudo mkdir -p /usr/local/cuda-10.0/nccl/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.0/lib64/
Here you can see my GPU running about twice as fast as TensorFlow on the CPU (a quick TensorFlow GPU check is sketched after the nvidia-smi output below).
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M On | 00000000:01:00.0 Off | N/A |
| N/A 55C P0 N/A / N/A | 3972MiB / 4046MiB | 66% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1133 G /usr/lib/xorg/Xorg 139MiB |
| 0 N/A N/A 1388 G /usr/bin/gnome-shell 120MiB |
| 0 N/A N/A 17390 C python 3706MiB |
+-----------------------------------------------------------------------------+
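A quick way to confirm that tensorflow-gpu==1.14 sees the GPU (a minimal sketch, assuming the environment with tensorflow-gpu==1.14.0 is active):
python -c "import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())"
# Expected output ends with: 1.14.0 True (after the CUDA/cuDNN library load messages)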
I have a Tesla K80 (Linux) GPU and I want to use TF 1.14.
I have a dev GPU on which I have installed driver 435.21 and CUDA 10; I can see the PIDs and the memory spike there. However, on a different Kubernetes pod, which comes with nvidia-driver 450 preinstalled, I am not able to see any processes when running nvidia-smi (CUDA 10).
I am using the same Docker image to install CUDA, cuDNN, and tensorflow_gpu==1.14.0, which worked with driver version 435.21.
Does anyone have any idea what is going wrong?
This crashed my server.
sudo ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/nccl/lib/
ln: target '/usr/local/cuda/nccl/lib/' is not a directory: No such file or directory
At what point is nvcc installed?
You can get nvcc by installing the NVIDIA CUDA toolkit: "sudo apt install nvidia-cuda-toolkit". Now you can check the directory again. Good luck!
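Note that the `cuda` package from NVIDIA's repo should also put nvcc under /usr/local/cuda-10.0/bin; it is just not on PATH until the export line above is added. A quick check (assuming the default install path):
ls /usr/local/cuda-10.0/bin/nvcc
/usr/local/cuda-10.0/bin/nvcc --version   # should report release 10.0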
While doing `sudo apt install -y cuda=10.0.130-1` I ran into an error: `cuda : Depends: cuda-10-0 (>= 10.0.130) but it is not going to be installed`. Running `sudo dpkg --force-all -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb` fixed it for me. Note, though, that this will overwrite any existing versions.
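Before reaching for --force-all, it can also help to see which candidate versions apt is considering (a minimal diagnostic sketch, not from the original comment):
apt-cache policy cuda cuda-10-0        # show available/candidate versions per repository
apt-get install -s cuda=10.0.130-1     # simulate the install to see the full dependency chain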
It might be useful to add this to the very first line:
...in case people need to undo some of the changes.