This guide covers the setup process for NVIDIA enterprise GPUs (A100, H100, H200) on Linux systems, focusing on Ubuntu LTS distributions.
- Prerequisites
- Driver Installation
- CUDA Toolkit Installation
- Optional Components
- Verification
- Troubleshooting
- Supported NVIDIA GPU (A100, H100, H200)
- PCIe Gen4 x16 slot or newer (recommended; H100 and H200 support PCIe Gen5)
- Adequate power supply and cooling
- Server-grade motherboard with proper bifurcation support
- Ubuntu LTS (20.04, 22.04, or 24.04)
- Linux kernel headers
- Build tools
Install basic requirements:
sudo apt-get update
sudo apt-get install -y \
build-essential \
linux-headers-$(uname -r) \
software-properties-common \
gnupg
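The driver's kernel module is built against the headers of the running kernel, so it can help to confirm they are on disk before continuing. The sketch below assumes Ubuntu's usual layout (/usr/src/linux-headers-<release>). The function name has_kernel_headers and its optional root argument are illustrative and not part of any NVIDIA tool.

```shell
# has_kernel_headers RELEASE [ROOT]
# Succeeds when the header tree for kernel RELEASE exists under ROOT
# (default: the real filesystem root).
# Assumption: headers live in /usr/src/linux-headers-<release>, as on Ubuntu.
has_kernel_headers() {
    release=$1
    root=${2:-}
    [ -d "$root/usr/src/linux-headers-$release" ]
}

# Example call (left as a comment so the block only defines the function):
#   has_kernel_headers "$(uname -r)" || echo "kernel headers missing" >&2
```

The optional root argument exists only so the check can be pointed at a chroot or a test directory.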
- Remove any existing NVIDIA drivers:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove
- Remove outdated signing key (if present):
sudo apt-key del 7fa2af80
- Add NVIDIA repository and GPG key:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
- Add pin file for repository priority:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//').pin
sudo mv cuda-ubuntu*.pin /etc/apt/preferences.d/cuda-repository-pin-600
- Update package lists:
sudo apt-get update
- Install NVIDIA drivers (cuda-drivers is the repository's meta-package for the latest driver):
sudo apt-get install -y cuda-drivers
- Reboot the system:
sudo reboot
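The repository URLs in the steps above embed a directory name derived from the Ubuntu release, for example ubuntu2204 for 22.04. A minimal sketch of that derivation follows. The helper names cuda_repo_slug and cuda_keyring_url are hypothetical, and the x86_64 architecture and keyring version 1.1-1 are simply carried over from the commands above.

```shell
# cuda_repo_slug VERSION_ID
# Turns an Ubuntu VERSION_ID such as "22.04" into the directory name used
# under .../compute/cuda/repos/, e.g. "ubuntu2204".
cuda_repo_slug() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# cuda_keyring_url VERSION_ID
# Prints the cuda-keyring download URL for that release.
# Assumptions: x86_64 host, keyring package version 1.1-1.
cuda_keyring_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-keyring_1.1-1_all.deb\n' \
        "$(cuda_repo_slug "$1")"
}

# Example call (comment only):
#   wget "$(cuda_keyring_url "$(. /etc/os-release; echo "$VERSION_ID")")"
```

Printing the URL before downloading makes it easy to spot an unsupported release or a typo in the release number.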
- Install CUDA toolkit and development tools:
sudo apt-get install -y cuda-toolkit
- Add CUDA paths to your environment (add to ~/.bashrc):
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
- Apply changes:
source ~/.bashrc
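The ${PATH:+:${PATH}} form in the exports above adds the ":" separator only when the variable already has a value, which avoids an empty entry (interpreted as the current directory). The sketch below reproduces that behavior in a small function and also skips the directory if it is already present, so that re-sourcing ~/.bashrc does not keep growing the variable. The name prepend_path is illustrative.

```shell
# prepend_path DIR CURRENT
# Prints DIR placed in front of CURRENT, mirroring ${VAR:+:${VAR}}:
# the separator is added only when CURRENT is non-empty.
# If CURRENT already contains DIR, it is printed unchanged.
prepend_path() {
    dir=$1
    current=$2
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;
        *)          printf '%s\n' "$dir${current:+:$current}" ;;
    esac
}

# Example calls (comments only):
#   PATH=$(prepend_path /usr/local/cuda/bin "$PATH"); export PATH
#   LD_LIBRARY_PATH=$(prepend_path /usr/local/cuda/lib64 "$LD_LIBRARY_PATH"); export LD_LIBRARY_PATH
```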
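- NVIDIA Container Toolkit (GPU access from Docker containers):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker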
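- GPUDirect Storage:
sudo apt-get install -y nvidia-gds
- Fabric Manager (NVSwitch-based systems only; on Ubuntu the package name carries the driver branch, which must match the installed driver, 535 in this example):
sudo apt-get install -y nvidia-fabricmanager-535
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager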
- Check driver installation:
nvidia-smi
Expected output should show your GPU and driver version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xxx.xx Driver Version: 535.xxx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
...
- Verify CUDA toolkit:
nvcc --version
- Test CUDA functionality:
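# Create and run a basic CUDA program
cat > cuda_test.cu << 'EOF'
#include <stdio.h>
__global__ void kernel() { }
int main() {
    kernel<<<1,1>>>();
    // Launch errors are reported by cudaGetLastError, execution errors by the sync
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}
EOF
nvcc cuda_test.cu -o cuda_test
./cuda_test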
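For scripted verification, the versions reported by these tools can be extracted rather than read by eye. The sketch below assumes the usual nvcc banner line of the form "Cuda compilation tools, release 12.4, V12.4.131" and a dotted driver version such as 535.129.03. The function names nvcc_release and driver_branch are hypothetical helpers.

```shell
# nvcc_release
# Reads `nvcc --version` output on stdin and prints the toolkit release
# (major.minor) taken from the "release X.Y" part of the banner.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# driver_branch VERSION
# Prints the branch (leading number) of a dotted driver version string.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Example calls (comments only, they need the toolkit and driver installed):
#   nvcc --version | nvcc_release
#   driver_branch "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```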
- nvidia-smi not found
- Check if driver is installed:
dpkg -l | grep nvidia-driver
- Verify kernel modules:
lsmod | grep nvidia
- Check system logs:
dmesg | grep nvidia
- CUDA not found
- Verify PATH and LD_LIBRARY_PATH settings
- Check CUDA installation:
ls /usr/local/cuda
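- Run cuda-install-samples-*.sh and test the samples (recent toolkits no longer ship this script; the samples are published at https://github.com/NVIDIA/cuda-samples)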
- Performance Issues
- Check PCIe link:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
- Monitor power/thermal:
nvidia-smi -q -d POWER,TEMPERATURE
- Verify compute mode:
nvidia-smi -q | grep "Compute Mode"
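A PCIe link that trained below its maximum generation or width is a common cause of poor transfer performance. The sketch below flags such GPUs from nvidia-smi's CSV query output. The function name check_pcie_links is illustrative, and the input is assumed to have the five columns named in the comment. The current generation can read lower while a GPU is idle, so the check is more meaningful under load.

```shell
# check_pcie_links
# Reads lines of "index, gen.current, gen.max, width.current, width.max"
# on stdin and reports every GPU running below its maximum link.
# Exit status is 1 if any GPU is reported, 0 otherwise.
check_pcie_links() {
    awk -F', *' '
        $2 < $3 || $4 < $5 {
            printf "GPU %s: link Gen%s x%s, maximum Gen%s x%s\n", $1, $2, $4, $3, $5
            degraded = 1
        }
        END { exit degraded }
    '
}

# Example call (comment only, needs a GPU):
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#       --format=csv,noheader,nounits | check_pcie_links
```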
- Always check system requirements and compatibility before installation
- Use official NVIDIA drivers for enterprise GPUs
- Keep drivers and CUDA toolkit up to date
- Monitor GPU temperature and power usage
- Consider using NVIDIA Data Center GPU Manager (DCGM) for production environments
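Temperature and power monitoring can be scripted with nvidia-smi's CSV query mode until a fuller tool such as DCGM is in place. The sketch below filters that output for GPUs at or above a temperature limit. The function name flag_hot_gpus and the 85 degree limit in the example call are assumptions for illustration, not NVIDIA recommendations.

```shell
# flag_hot_gpus LIMIT_C
# Reads "index, temperature, power" CSV lines on stdin and prints the
# GPUs whose temperature is at or above LIMIT_C (degrees Celsius).
flag_hot_gpus() {
    awk -F', *' -v limit="$1" '
        $2 + 0 >= limit + 0 { printf "GPU %s: %s C, %s W\n", $1, $2, $3 }
    '
}

# Example call (comment only, needs a GPU; 85 is an arbitrary example limit):
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw \
#       --format=csv,noheader,nounits | flag_hot_gpus 85
```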
Add NVIDIA's Repository for Latest Drivers
If the required NVIDIA driver version is not available from your currently configured repositories, add NVIDIA's official repository:
Remove existing repository entries for NVIDIA:
sudo rm -f /etc/apt/sources.list.d/cuda*
sudo rm -f /etc/apt/sources.list.d/nvidia-ml*
Add NVIDIA’s repository for your distribution:
distribution=ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
sudo apt-key add 3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/ /"
sudo apt-get update
Install the Latest NVIDIA Driver
Install the latest driver, choosing the open or server-open variant as needed:
sudo apt-get install nvidia-driver-565-open
If 565 is unavailable, install the latest driver the repository offers through the cuda-drivers meta-package:
sudo apt-get install cuda-drivers
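When a specific branch is missing, the branches the configured repositories actually offer can be listed and the newest open variant picked from them. The sketch below assumes package names of the form nvidia-driver-<branch>-open, as in the command above. The function name newest_open_driver is hypothetical.

```shell
# newest_open_driver
# Reads package names on stdin (one per line) and prints the
# highest-numbered nvidia-driver-<branch>-open package, if any.
newest_open_driver() {
    grep -E '^nvidia-driver-[0-9]+-open$' | sort -t- -k3,3n | tail -n 1
}

# Example call (comment only, needs the apt package cache):
#   pkg=$(apt-cache pkgnames nvidia-driver- | newest_open_driver)
#   [ -n "$pkg" ] && sudo apt-get install "$pkg"
```

Sorting numerically on the branch field avoids the lexical ordering trap where, for example, 90 would sort above 565.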
Install CUDA After Updating the Driver
After successfully installing the updated driver, attempt to install CUDA again:
sudo apt-get install cuda-toolkit
Verify Installation
Reboot your system to apply changes:
sudo reboot
After rebooting, check the installed driver and CUDA version:
nvidia-smi
nvcc --version
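The two commands report different things: the "CUDA Version" in the nvidia-smi header is the newest CUDA release the driver supports, while nvcc reports the installed toolkit. A rough consistency check is that the toolkit release should not exceed the driver's figure. The sketch below implements that comparison with hypothetical helpers smi_cuda_version and version_le. A newer toolkit can still work under CUDA's minor-version compatibility, so a failed check is a prompt to look closer rather than a hard error.

```shell
# smi_cuda_version
# Reads nvidia-smi output on stdin and prints the "CUDA Version" (major.minor)
# from its header line.
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# version_le A B
# Succeeds when version A is less than or equal to B.
# Only the major and minor components are compared.
version_le() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 < y[1] + 0)
        exit !(x[2] + 0 <= y[2] + 0)
    }'
}

# Example call (comment only, needs driver and toolkit installed):
#   toolkit=$(nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
#   version_le "$toolkit" "$(nvidia-smi | smi_cuda_version)" || echo "toolkit newer than driver supports" >&2
```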