The following notes were generated with ChatGPT and edited while being dumped here.
- GPU driver: acts as the interface between your operating system and the GPU hardware. It ensures your OS can communicate with and utilize the GPU for tasks.
- CUDA Toolkit: required for developing and running GPU-accelerated applications. It includes libraries (like cuBLAS), compilers, and tools for building and optimizing GPU programs.
- nvcc: part of the CUDA Toolkit; it compiles CUDA programs written in C/C++ to run on the GPU. Necessary if you're building custom CUDA kernels.
- cuDNN: a separately distributed library that provides optimized deep-learning routines for frameworks like TensorFlow and PyTorch.
- Compatibility: frameworks like TensorFlow and PyTorch require specific versions of CUDA and cuDNN to use the GPU.
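The compatibility point above can be checked mechanically: the maximum CUDA version the driver supports (shown in the top-right of `nvidia-smi` output) must be at least the toolkit version `nvcc` reports. A minimal sketch using `sort -V`; the function name and version numbers are illustrative, not from any NVIDIA tool:

```shell
# Compare the driver's max supported CUDA version against the toolkit version.
# Prints "ok" when the driver can run binaries built with this toolkit.
check_cuda_compat() {
    driver_max="$1"   # e.g. taken from: nvidia-smi ("CUDA Version: 12.6")
    toolkit="$2"      # e.g. taken from: nvcc -V  ("release 12.6")
    # sort -V orders version strings numerically; the toolkit must not
    # sort after the driver's maximum.
    if [ "$(printf '%s\n%s\n' "$toolkit" "$driver_max" | sort -V | tail -n1)" = "$driver_max" ]; then
        echo "ok: toolkit $toolkit <= driver max $driver_max"
    else
        echo "mismatch: toolkit $toolkit is newer than driver max $driver_max"
    fi
}

check_cuda_compat 12.6 12.6   # equal versions are fine
check_cuda_compat 12.2 12.6   # toolkit too new for this driver
```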
- Check GPU model:
nvidia-smi
This command lists your GPU model and current driver version. If it prints nothing (or the command is not found), the GPU driver is not installed; in that case, follow the next steps.
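The check above can be wrapped so it degrades gracefully on a machine without a driver. A small sketch; the `--query-gpu` flags are standard `nvidia-smi` options, and the wrapper function name is my own:

```shell
# Report GPU model and driver version, or a clear message when the
# driver is not installed (so scripts don't fail hard on CPU-only boxes).
check_driver() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # --query-gpu prints machine-readable fields, one GPU per line
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found: NVIDIA driver is not installed"
    fi
}

check_driver
```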
- Download the driver: visit NVIDIA Driver Downloads to get the appropriate driver.
- Install the driver: follow the installation instructions on the NVIDIA website. For Linux: use the `.run` file or a package manager like `apt` or `yum`.
- Check Compatibility: Check which version of CUDA is supported by your framework (e.g., TensorFlow, PyTorch).
- Download CUDA Toolkit: Visit CUDA Toolkit Downloads.
- Install CUDA: follow the instructions for your OS (Linux, Windows, macOS). Example for Linux (runfile), selecting:
```
Linux -> x86_64 -> Ubuntu -> 22.04 -> runfile (local)
```
wget https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run
sudo sh cuda_12.6.2_560.35.03_linux.run
Next, follow the installer prompts. After completion, it will suggest setting `PATH` and `LD_LIBRARY_PATH` in your environment; you need to do this. But first, check the existing CUDA installations if needed:
ls /usr/local/ # to check existing cuda installations
It may show:
..., cuda, cuda-12.0, cuda-11.0, cuda-12.6, ...
You can keep them all, or remove the versions you no longer need:
sudo rm /usr/local/cuda # [optional]
sudo rm -r /usr/local/cuda-12.0 # [optional]
Now, say your desired CUDA version is 12.6. You can point the `cuda` symlink at it:
sudo ln -s /usr/local/cuda-12.6 /usr/local/cuda
- Update `PATH` variables: open `~/.bashrc` or `~/.zshrc` using `nano` and add:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Press `Ctrl+O` to save and `Ctrl+X` to exit. Alternatively, append the lines from the terminal:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
- Source the File:
source ~/.bashrc
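After sourcing, you can confirm the CUDA directories actually landed in the environment. A quick sketch; here the variables are set inside the block for demonstration, whereas normally `~/.bashrc` has already done it:

```shell
# Set here only for demonstration; in a real shell this comes from ~/.bashrc.
PATH="/usr/local/cuda/bin:$PATH"
LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}"

# Check that each directory appears as a component of its variable.
case ":$PATH:" in
    *:/usr/local/cuda/bin:*) echo "PATH ok" ;;
    *) echo "PATH missing /usr/local/cuda/bin" ;;
esac
case ":$LD_LIBRARY_PATH:" in
    *:/usr/local/cuda/lib64:*) echo "LD_LIBRARY_PATH ok" ;;
    *) echo "LD_LIBRARY_PATH missing /usr/local/cuda/lib64" ;;
esac
```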
- Verify Installation:
nvcc -V
It should look like this (note the release matches the symlinked version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Cuda compilation tools, release 12.6, V12.6.xx
Build cuda_12.6.r12.6/...
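If a script needs just the release number (e.g. 12.6) rather than the full banner, it can be extracted from the `nvcc -V` output. A sketch exercised on a captured sample line, since `nvcc` itself may not be on this machine; the function name is my own:

```shell
# Extract the "release X.Y" number from nvcc's version banner.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Sample banner line; normally you would run: nvcc -V | nvcc_release
sample='Cuda compilation tools, release 12.6, V12.6.xx'
printf '%s\n' "$sample" | nvcc_release   # prints 12.6
```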
As shown, the reported release should match the version the `cuda` symlink (on `PATH`) points to. If we need to change the CUDA version, just repoint the symlink:
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-12.8 /usr/local/cuda
Now, running `nvcc -V` will report CUDA 12.8 (assuming `/usr/local/cuda-12.8` is installed).
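The symlink switch can be rehearsed without touching `/usr/local` by doing it in a temporary directory first. A sketch; the directory names are illustrative:

```shell
# Rehearse the CUDA version switch in a sandbox.
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/cuda-12.6" "$sandbox/cuda-12.8"

# Point "cuda" at 12.6, then switch it to 12.8.
ln -sfn "$sandbox/cuda-12.6" "$sandbox/cuda"
ln -sfn "$sandbox/cuda-12.8" "$sandbox/cuda"

readlink "$sandbox/cuda"   # now ends in cuda-12.8
rm -rf "$sandbox"
```

Note the `-n` flag: without it, `ln -sf` pointed at an existing symlink-to-directory would create the new link *inside* the target directory instead of replacing the link, which is why the notes above first `rm` the old symlink.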
- Download cuDNN: Visit cuDNN Download Page.
- Install cuDNN: extract the archive and copy its files into the CUDA directory. Example (Linux; `-P` preserves the library symlinks):
sudo cp -P lib/libcudnn* /usr/local/cuda/lib64/
sudo cp include/cudnn*.h /usr/local/cuda/include/
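After copying, you can confirm which cuDNN version the headers declare. Recent cuDNN releases keep the version macros in `cudnn_version.h`; here is a sketch run against a stand-in header, since the real one may not exist on this machine (the function name and version numbers are my own):

```shell
# Read the CUDNN_MAJOR/MINOR/PATCHLEVEL #defines out of a cuDNN header.
cudnn_version() {
    awk '/#define CUDNN_MAJOR/      {maj=$3}
         /#define CUDNN_MINOR/      {min=$3}
         /#define CUDNN_PATCHLEVEL/ {pat=$3}
         END {print maj "." min "." pat}' "$1"
}

# Stand-in header for demonstration; normally you would run:
#   cudnn_version /usr/local/cuda/include/cudnn_version.h
hdr="$(mktemp)"
printf '#define CUDNN_MAJOR 9\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 0\n' > "$hdr"
cudnn_version "$hdr"   # prints 9.1.0
rm -f "$hdr"
```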
- Run `nvidia-smi` to check GPU status.
- Test with a framework (PyTorch example):
import torch
print(torch.cuda.is_available())              # True if PyTorch can see the GPU
print(torch.distributed.is_nccl_available())  # True if the NCCL backend is available
print(torch.cuda.nccl.version())              # NCCL version bundled with PyTorch
print(torch.cuda.device_count())              # number of visible GPUs
print(torch.version.cuda)                     # CUDA version PyTorch was built against
- Driver Issues: Ensure driver and CUDA versions are compatible.
- Version Mismatch: Use the framework’s recommended CUDA version.
- CUDA Path Not Found: ensure `nvcc` and the CUDA libraries are correctly added to `PATH` and `LD_LIBRARY_PATH`.
- For multi-GPU computations, NCCL (NVIDIA Collective Communications Library) is mandatory. Get the installed NCCL version:
locate nccl | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
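The `sed` step above simply strips everything up to the final `.so.`, leaving the version suffix. Here it is exercised on a stand-in path, since `locate` may find nothing on this machine (the path, version, and wrapper name are illustrative):

```shell
# Strip a shared-library path down to its trailing version number,
# e.g. /usr/lib/.../libnccl.so.2.18.3 -> 2.18.3
nccl_version_from_path() {
    printf '%s\n' "$1" | sed -r 's/^.*\.so\.//'
}

nccl_version_from_path "/usr/lib/x86_64-linux-gnu/libnccl.so.2.18.3"   # prints 2.18.3
```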
Reason:
- The installed NCCL version might not be compatible with the CUDA Toolkit version.
Solution:
- Verify NCCL compatibility with your CUDA version (NCCL Compatibility Matrix).
- Update or downgrade the CUDA Toolkit or NCCL library as needed.
Misc
- GPU driver installation can fail due to missing kernel headers and source files. Identify your current kernel version: run `uname -r`. The output may look like `6.1.0-31-cloud-amd64` (my current system: Debian 12 (Bookworm)). We need to install the matching headers:
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
After that, we can verify the kernel headers installation:
ls -l /usr/src/linux-headers-$(uname -r)
If the directory exists and contains files, the headers are correctly installed. Now, try installing the GPU driver again, either directly or via the CUDA toolkit.
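The verification step above can be scripted so provisioning tools get a clean pass/fail. A sketch where the headers directory is a parameter, checked here against a stand-in directory (the function name is my own):

```shell
# Pass/fail check that a kernel headers directory exists and is non-empty.
headers_ok() {
    dir="$1"   # normally: /usr/src/linux-headers-$(uname -r)
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "headers ok: $dir"
    else
        echo "headers missing: $dir"
        return 1
    fi
}

# Stand-in directory for demonstration:
d="$(mktemp -d)"
touch "$d/Makefile"
headers_ok "$d"                         # prints "headers ok: ..."
headers_ok "$d/no-such-headers" || true # prints "headers missing: ..."
rm -rf "$d"
```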