The following notes were generated with ChatGPT and edited while being dumped here.
- GPU driver: acts as the interface between your operating system and the GPU hardware. It ensures your OS can communicate with and utilize the GPU for tasks.
- CUDA Toolkit: required for developing and running GPU-accelerated applications. It includes libraries (like cuBLAS), compilers, and tools for building and optimizing GPU programs.
- nvcc: part of the CUDA Toolkit; it compiles CUDA programs written in C/C++ to run on the GPU. Necessary if you're building custom CUDA kernels.
- cuDNN: a separately distributed library that provides optimized deep-learning routines for frameworks like TensorFlow and PyTorch.
- Compatibility: frameworks like TensorFlow and PyTorch require specific versions of CUDA and cuDNN to use the GPU.
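The compatibility point above can be checked mechanically: the maximum CUDA version the driver supports (shown in the top-right of `nvidia-smi` output) must be at least the toolkit version `nvcc` reports. A minimal sketch using `sort -V`; the function name and version numbers are illustrative, not from any NVIDIA tool:

```shell
# Compare the driver's max supported CUDA version against the toolkit version.
# Prints "ok" when the driver can run binaries built with this toolkit.
check_cuda_compat() {
    driver_max="$1"   # e.g. taken from: nvidia-smi ("CUDA Version: 12.6")
    toolkit="$2"      # e.g. taken from: nvcc -V  ("release 12.6")
    # sort -V orders version strings numerically; the toolkit must not
    # sort after the driver's maximum.
    if [ "$(printf '%s\n%s\n' "$toolkit" "$driver_max" | sort -V | tail -n1)" = "$driver_max" ]; then
        echo "ok: toolkit $toolkit <= driver max $driver_max"
    else
        echo "mismatch: toolkit $toolkit is newer than driver max $driver_max"
    fi
}

check_cuda_compat 12.6 12.6   # equal versions are fine
check_cuda_compat 12.2 12.6   # toolkit too new for this driver
```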
- Check GPU model:
nvidia-smi
This command lists your GPU model and current driver version. If it prints nothing (or the command is not found), the GPU driver is not installed; in that case, follow the next steps.
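The check above can be wrapped so it degrades gracefully on a machine without a driver. A small sketch; the `--query-gpu` flags are standard `nvidia-smi` options, and the wrapper function name is my own:

```shell
# Report GPU model and driver version, or a clear message when the
# driver is not installed (so scripts don't fail hard on CPU-only boxes).
check_driver() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # --query-gpu prints machine-readable fields, one GPU per line
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found: NVIDIA driver is not installed"
    fi
}

check_driver
```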
- Download the driver: visit NVIDIA Driver Downloads to get the appropriate driver.
- Install the driver: follow the installation instructions on the NVIDIA website. For Linux: use the `.run` file or a package manager like `apt` or `yum`.
- Check Compatibility: Check which version of CUDA is supported by your framework (e.g., TensorFlow, PyTorch).
- Download CUDA Toolkit: Visit CUDA Toolkit Downloads.
- Install CUDA: follow the instructions for your OS (Linux, Windows, macOS). Example for Linux (runfile), selecting:
```
Linux -> x86_64 -> Ubuntu -> 22.04 -> runfile (local)
```
wget https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run
sudo sh cuda_12.6.2_560.35.03_linux.run
Next, follow the installer prompts. After completion, it will suggest setting `PATH` and `LD_LIBRARY_PATH` in your environment; you need to do this. But first, check the existing CUDA installations if needed:
ls /usr/local/ # to check existing cuda installations
It may show:
..., cuda, cuda-12.0, cuda-11.0, cuda-12.6, ...
You can keep them all, or remove the versions you no longer need:
sudo rm /usr/local/cuda # [optional]
sudo rm -r /usr/local/cuda-12.0 # [optional]
Now, say your desired CUDA version is 12.6. You can point the `cuda` symlink at it:
sudo ln -s /usr/local/cuda-12.6 /usr/local/cuda
- Update `PATH` variables: open `~/.bashrc` or `~/.zshrc` using `nano` and add:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Press `Ctrl+O` to save and `Ctrl+X` to exit. Alternatively, append the lines from the terminal:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
- Source the File:
source ~/.bashrc
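After sourcing, you can confirm the CUDA directories actually landed in the environment. A quick sketch; here the variables are set inside the block for demonstration, whereas normally `~/.bashrc` has already done it:

```shell
# Set here only for demonstration; in a real shell this comes from ~/.bashrc.
PATH="/usr/local/cuda/bin:$PATH"
LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}"

# Check that each directory appears as a component of its variable.
case ":$PATH:" in
    *:/usr/local/cuda/bin:*) echo "PATH ok" ;;
    *) echo "PATH missing /usr/local/cuda/bin" ;;
esac
case ":$LD_LIBRARY_PATH:" in
    *:/usr/local/cuda/lib64:*) echo "LD_LIBRARY_PATH ok" ;;
    *) echo "LD_LIBRARY_PATH missing /usr/local/cuda/lib64" ;;
esac
```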
- Verify Installation:
nvcc -V
It should look like this (note the release matches the symlinked version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Cuda compilation tools, release 12.6, V12.6.xx
Build cuda_12.6.r12.6/...
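If a script needs just the release number (e.g. 12.6) rather than the full banner, it can be extracted from the `nvcc -V` output. A sketch exercised on a captured sample line, since `nvcc` itself may not be on this machine; the function name is my own:

```shell
# Extract the "release X.Y" number from nvcc's version banner.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Sample banner line; normally you would run: nvcc -V | nvcc_release
sample='Cuda compilation tools, release 12.6, V12.6.xx'
printf '%s\n' "$sample" | nvcc_release   # prints 12.6
```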
As shown, the reported release should match the version the `cuda` symlink (on `PATH`) points to. If we need to change the CUDA version, just repoint the symlink:
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-12.8 /usr/local/cuda
Now, running `nvcc -V` will report CUDA 12.8 (assuming `/usr/local/cuda-12.8` is installed).
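The symlink switch can be rehearsed without touching `/usr/local` by doing it in a temporary directory first. A sketch; the directory names are illustrative:

```shell
# Rehearse the CUDA version switch in a sandbox.
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/cuda-12.6" "$sandbox/cuda-12.8"

# Point "cuda" at 12.6, then switch it to 12.8.
ln -sfn "$sandbox/cuda-12.6" "$sandbox/cuda"
ln -sfn "$sandbox/cuda-12.8" "$sandbox/cuda"

readlink "$sandbox/cuda"   # now ends in cuda-12.8
rm -rf "$sandbox"
```

Note the `-n` flag: without it, `ln -sf` pointed at an existing symlink-to-directory would create the new link *inside* the target directory instead of replacing the link, which is why the notes above first `rm` the old symlink.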
- Download cuDNN: Visit cuDNN Download Page.
- Install cuDNN: extract the archive and copy its files into the CUDA directory. Example (Linux; `-P` preserves the library symlinks):
sudo cp -P lib/libcudnn* /usr/local/cuda/lib64/
sudo cp include/cudnn*.h /usr/local/cuda/include/
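After copying, you can confirm which cuDNN version the headers declare. Recent cuDNN releases keep the version macros in `cudnn_version.h`; here is a sketch run against a stand-in header, since the real one may not exist on this machine (the function name and version numbers are my own):

```shell
# Read the CUDNN_MAJOR/MINOR/PATCHLEVEL #defines out of a cuDNN header.
cudnn_version() {
    awk '/#define CUDNN_MAJOR/      {maj=$3}
         /#define CUDNN_MINOR/      {min=$3}
         /#define CUDNN_PATCHLEVEL/ {pat=$3}
         END {print maj "." min "." pat}' "$1"
}

# Stand-in header for demonstration; normally you would run:
#   cudnn_version /usr/local/cuda/include/cudnn_version.h
hdr="$(mktemp)"
printf '#define CUDNN_MAJOR 9\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 0\n' > "$hdr"
cudnn_version "$hdr"   # prints 9.1.0
rm -f "$hdr"
```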
- Run `nvidia-smi` to check GPU status.
- Test with a framework (PyTorch example):
import torch
print(torch.cuda.is_available())              # True if PyTorch can see the GPU
print(torch.distributed.is_nccl_available())  # True if the NCCL backend is available
print(torch.cuda.nccl.version())              # NCCL version bundled with PyTorch
print(torch.cuda.device_count())              # number of visible GPUs
print(torch.version.cuda)                     # CUDA version PyTorch was built against
- Driver Issues: Ensure driver and CUDA versions are compatible.
- Version Mismatch: Use the framework’s recommended CUDA version.
- CUDA Path Not Found: ensure `nvcc` and the CUDA libraries are correctly added to `PATH` and `LD_LIBRARY_PATH`.
- For multi-GPU computations, NCCL (NVIDIA Collective Communications Library) is mandatory. Get the installed NCCL version:
locate nccl | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
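The `sed` step above simply strips everything up to the final `.so.`, leaving the version suffix. Here it is exercised on a stand-in path, since `locate` may find nothing on this machine (the path, version, and wrapper name are illustrative):

```shell
# Strip a shared-library path down to its trailing version number,
# e.g. /usr/lib/.../libnccl.so.2.18.3 -> 2.18.3
nccl_version_from_path() {
    printf '%s\n' "$1" | sed -r 's/^.*\.so\.//'
}

nccl_version_from_path "/usr/lib/x86_64-linux-gnu/libnccl.so.2.18.3"   # prints 2.18.3
```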
Reason:
- The installed NCCL version might not be compatible with the CUDA Toolkit version.
Solution:
- Verify NCCL compatibility with your CUDA version (NCCL Compatibility Matrix).
- Update or downgrade the CUDA Toolkit or NCCL library as needed.
Misc
- GPU driver installation can fail due to missing kernel headers and source files. Identify your current kernel version: run `uname -r`. The output may look like `6.1.0-31-cloud-amd64` (my current system: Debian 12 (Bookworm)). We need to install the matching headers:
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
After that, we can verify the kernel headers installation:
ls -l /usr/src/linux-headers-$(uname -r)
If the directory exists and contains files, the headers are correctly installed. Now, try installing the GPU driver again, either directly or via the CUDA toolkit.
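The verification step above can be scripted so provisioning tools get a clean pass/fail. A sketch where the headers directory is a parameter, checked here against a stand-in directory (the function name is my own):

```shell
# Pass/fail check that a kernel headers directory exists and is non-empty.
headers_ok() {
    dir="$1"   # normally: /usr/src/linux-headers-$(uname -r)
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "headers ok: $dir"
    else
        echo "headers missing: $dir"
        return 1
    fi
}

# Stand-in directory for demonstration:
d="$(mktemp -d)"
touch "$d/Makefile"
headers_ok "$d"                         # prints "headers ok: ..."
headers_ok "$d/no-such-headers" || true # prints "headers missing: ..."
rm -rf "$d"
```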