@Gyarbij
Last active February 2, 2025 20:31
NVIDIA Enterprise GPU Setup Guide

This guide covers the setup process for NVIDIA enterprise GPUs (A100, H100, H200) on Linux systems, focusing on Ubuntu LTS distributions.

Prerequisites

Hardware Requirements

  • Supported NVIDIA GPU (A100, H100, H200)
  • PCIe Gen4 x16 slot (recommended)
  • Adequate power supply and cooling
  • Server-grade motherboard with proper bifurcation support

System Requirements

  • Ubuntu LTS (20.04, 22.04, or 24.04)
  • Linux kernel headers
  • Build tools

Install basic requirements:

sudo apt-get update
sudo apt-get install -y \
    build-essential \
    linux-headers-$(uname -r) \
    software-properties-common \
    gnupg
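
The driver's kernel module is built against the headers of the running kernel, so it can help to confirm they are present before continuing. The sketch below assumes Ubuntu's usual layout under /usr/src; the function name have_kernel_headers and its optional root-prefix argument are illustrative, not part of any NVIDIA tooling.

```shell
# Returns 0 when a headers tree for the given kernel release exists.
# $1: kernel release (for example the output of `uname -r`)
# $2: optional root prefix, only useful for testing (assumption of this sketch)
have_kernel_headers() {
    local release="$1" root="${2:-}"
    [ -d "${root}/usr/src/linux-headers-${release}" ]
}

if have_kernel_headers "$(uname -r)"; then
    echo "Kernel headers found for $(uname -r)"
else
    echo "Kernel headers missing for $(uname -r); the NVIDIA module build may fail" >&2
fi
```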

Driver Installation

  1. Remove any existing NVIDIA drivers:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove
  2. Remove outdated signing key (if present):
sudo apt-key del 7fa2af80
  3. Add NVIDIA repository and GPG key:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
  4. Add pin file for repository priority:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//').pin
sudo mv cuda-ubuntu*.pin /etc/apt/preferences.d/cuda-repository-pin-600
  5. Update package lists:
sudo apt-get update
  6. Install NVIDIA drivers (the cuda-drivers metapackage tracks the newest driver in the repository):
sudo apt-get install -y cuda-drivers
  7. Reboot the system:
sudo reboot
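
The repository URLs in the steps above embed a tag such as ubuntu2204, built from the distribution ID and the version with its dot removed. As a minimal sketch, that derivation can be put in one helper so the tag can be inspected before anything is downloaded. The function name cuda_repo_tag is hypothetical.

```shell
# Print the CUDA repository tag (for example "ubuntu2204") for an os-release file.
# $1: path to the os-release file (defaults to /etc/os-release)
cuda_repo_tag() {
    local os_release="${1:-/etc/os-release}"
    # Source in a subshell so ID and VERSION_ID do not leak into the caller.
    ( . "$os_release" && printf '%s%s\n' "$ID" "${VERSION_ID//./}" )
}

if [ -r /etc/os-release ]; then
    tag=$(cuda_repo_tag)
    echo "https://developer.download.nvidia.com/compute/cuda/repos/${tag}/x86_64/"
fi
```

If the printed URL does not open in a browser, the release is probably not one NVIDIA publishes a repository for.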

CUDA Toolkit Installation

  1. Install the CUDA toolkit from the NVIDIA repository. Do not also install Ubuntu's nvidia-cuda-toolkit package, which ships a separate, usually older CUDA release and can shadow this one:
sudo apt-get install -y cuda-toolkit
  2. Add CUDA paths to your environment (add to ~/.bashrc):
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
  3. Apply changes:
source ~/.bashrc
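
The ${PATH:+:${PATH}} form adds the colon separator only when the variable is already set, but it still prepends the CUDA directory again each time ~/.bashrc is sourced. One possible variant, sketched below, skips the prepend when the directory is already listed. The helper name prepend_once is illustrative.

```shell
# Prepend a directory to a colon-separated list unless it is already present.
# $1: current list (may be empty); $2: directory to add. Prints the new list.
prepend_once() {
    local list="$1" dir="$2"
    case ":${list}:" in
        *":${dir}:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "${dir}${list:+:${list}}" ;;
    esac
}

PATH=$(prepend_once "$PATH" /usr/local/cuda/bin)
LD_LIBRARY_PATH=$(prepend_once "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64)
export PATH LD_LIBRARY_PATH
```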

Optional Components

NVIDIA Container Toolkit (for Docker)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
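
After installation, Docker only exposes GPUs to containers if its daemon configuration registers the NVIDIA runtime. The sketch below is a rough text check of daemon.json, not a JSON parse. The function name has_nvidia_runtime is hypothetical, and the default path assumes a standard Docker install.

```shell
# Returns 0 when a Docker daemon.json mentions an "nvidia" runtime entry.
# $1: path to daemon.json (defaults to /etc/docker/daemon.json)
has_nvidia_runtime() {
    local cfg="${1:-/etc/docker/daemon.json}"
    [ -r "$cfg" ] && grep -q '"nvidia"[[:space:]]*:' "$cfg"
}

if has_nvidia_runtime; then
    echo "Docker is configured with the NVIDIA runtime"
else
    echo "No NVIDIA runtime in daemon.json; try: sudo nvidia-ctk runtime configure --runtime=docker" >&2
fi
```

For an end-to-end check on a machine with a GPU, running sudo docker run --rm --gpus all ubuntu nvidia-smi should print the same table as nvidia-smi on the host.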

NVIDIA GDS (GPUDirect Storage)

sudo apt-get install -y nvidia-gds

NVIDIA Fabric Manager (for NVLink/NVSwitch)

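# On Ubuntu the package is typically named nvidia-fabricmanager-<branch>; the branch must match the installed driver
DRIVER_BRANCH=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 | cut -d. -f1)
sudo apt-get install -y nvidia-fabricmanager-${DRIVER_BRANCH}
sudo systemctl enable --now nvidia-fabricmanager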
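
Fabric Manager is expected to run at the same version as the driver, and a mismatch is a common reason for the service failing to start. The sketch below compares the two, assuming the package version is the driver version followed by a Debian revision such as "-1". The function name fm_matches_driver, the package-name glob, and the sample version strings are illustrative.

```shell
# Returns 0 when a Fabric Manager package version matches the driver version.
# $1: driver version (for example "535.129.03")
# $2: package version (for example "535.129.03-1")
fm_matches_driver() {
    local driver="$1" pkg="$2"
    # Drop the Debian revision suffix ("-1") before comparing.
    [ "${pkg%%-*}" = "$driver" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # Assumes the package name starts with "nvidia-fabricmanager".
    pkg=$(dpkg-query -W -f='${Version}\n' 'nvidia-fabricmanager*' 2>/dev/null | head -n 1)
    if ! fm_matches_driver "$driver" "$pkg"; then
        echo "Fabric Manager '${pkg}' does not match driver '${driver}'" >&2
    fi
fi
```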

Verification

  1. Check driver installation:
nvidia-smi

Expected output should show your GPU and driver version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xxx.xx              Driver Version: 535.xxx.xx   CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
...
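  2. Verify CUDA toolkit:
nvcc --version
  3. Test CUDA functionality:
# Create and run a basic CUDA program
cat > cuda_test.cu << 'EOF'
#include <stdio.h>
__global__ void kernel() { }
int main() {
    kernel<<<1,1>>>();
    // Check both the launch and its completion so failures are not reported as success
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}
EOF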

nvcc cuda_test.cu -o cuda_test
./cuda_test
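
The test above compiles for nvcc's default target. For the GPUs this guide covers, it may be preferable to build for the native architecture: A100 is compute capability 8.0, and H100 and H200 are 9.0. The sketch below maps the product name reported by nvidia-smi to an -arch value. The function name arch_for_gpu is hypothetical, and the name matching is a simple substring check.

```shell
# Map a GPU product name to an nvcc -arch value for the GPUs in this guide.
# Prints the value and returns 0, or returns 1 for an unrecognized name.
arch_for_gpu() {
    case "$1" in
        *A100*) echo "sm_80" ;;
        *H100*|*H200*) echo "sm_90" ;;
        *) return 1 ;;
    esac
}

if command -v nvidia-smi >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1 && [ -f cuda_test.cu ]; then
    name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
    if arch=$(arch_for_gpu "$name"); then
        nvcc -arch="$arch" cuda_test.cu -o cuda_test && ./cuda_test
    fi
fi
```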

Troubleshooting

Common Issues

  1. nvidia-smi not found

    • Check if driver is installed: dpkg -l | grep nvidia-driver
    • Verify kernel modules: lsmod | grep nvidia
    • Check system logs: sudo dmesg | grep -i nvidia
  2. CUDA not found

    • Verify PATH and LD_LIBRARY_PATH settings
    • Check CUDA installation: ls /usr/local/cuda
    • Build and run the samples from the NVIDIA/cuda-samples repository on GitHub (recent toolkits no longer ship cuda-install-samples-*.sh)
  3. Performance Issues

    • Check PCIe link: nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
    • Monitor power/thermal: nvidia-smi -q -d POWER,TEMPERATURE
    • Verify compute mode: nvidia-smi -q | grep "Compute Mode"
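
For the PCIe check, a link that negotiated fewer lanes or a lower generation than the card supports is a frequent cause of poor transfer performance. The sketch below flags such GPUs from nvidia-smi's CSV output. The function name report_degraded_links is illustrative. A GPU may report a lower current generation while idle, so the check is more meaningful under load.

```shell
# Read CSV lines "index, gen.current, gen.max, width.current, width.max" on stdin
# and print one line for each GPU whose current link is below its maximum.
report_degraded_links() {
    awk -F', *' '
        $2 < $3 || $4 < $5 {
            printf "GPU %s: PCIe Gen%s x%s (max Gen%s x%s)\n", $1, $2, $4, $3, $5
        }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
        --format=csv,noheader | report_degraded_links
fi
```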

Important Notes

  • Always check system requirements and compatibility before installation
  • Use official NVIDIA drivers for enterprise GPUs
  • Keep drivers and CUDA toolkit up to date
  • Monitor GPU temperature and power usage
  • Consider using NVIDIA Data Center GPU Manager (DCGM) for production environments
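
As a lightweight stand-in for the temperature and power monitoring mentioned above, the sketch below filters nvidia-smi's CSV output for GPUs at or above a temperature limit. The function name report_hot_gpus and the default limit of 85 degrees C are assumptions of this example, not NVIDIA recommendations; check the limits published for your GPU model.

```shell
# Read CSV lines "index, temperature_C, power_W" on stdin and print GPUs
# at or above the temperature limit.
# $1: limit in degrees C (the default of 85 is an arbitrary example value)
report_hot_gpus() {
    local limit="${1:-85}"
    awk -F', *' -v limit="$limit" '
        $2 + 0 >= limit + 0 { printf "GPU %s: %s C, %s W\n", $1, $2, $3 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu,power.draw \
        --format=csv,noheader,nounits | report_hot_gpus 85
fi
```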

Latest Drivers

Add NVIDIA’s Repository for Latest Drivers

If the required NVIDIA driver version is not available from your currently configured package sources, add NVIDIA’s official repository.

Remove existing repository entries for NVIDIA:

sudo rm -f /etc/apt/sources.list.d/cuda*
sudo rm -f /etc/apt/sources.list.d/nvidia-ml*

Add NVIDIA’s repository for your distribution:

# The repository path uses a tag such as ubuntu2204, not the release codename
distribution=ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

Install the Latest NVIDIA Driver

Install the latest driver, choosing the open or server-open variant as needed:

sudo apt-get install nvidia-driver-565-open

If 565 is unavailable, install the newest open-module driver the repository offers through the nvidia-open metapackage:

sudo apt-get install nvidia-open
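
To see which open-module driver branches the configured repositories actually offer, the package names can be listed and the highest branch number picked out. This is a minimal sketch; the function name latest_open_branch is hypothetical, and it assumes packages follow the nvidia-driver-<branch>-open naming used above.

```shell
# Read package names on stdin and print the highest <N> among
# names of the form nvidia-driver-<N>-open.
latest_open_branch() {
    sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-open$/\1/p' | sort -n | tail -n 1
}

if command -v apt-cache >/dev/null 2>&1; then
    branch=$(apt-cache pkgnames nvidia-driver- | latest_open_branch)
    if [ -n "$branch" ]; then
        # Print the install command rather than running it.
        echo "sudo apt-get install nvidia-driver-${branch}-open"
    fi
fi
```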

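Install CUDA After Updating the Driver

After successfully installing the updated driver, install the CUDA toolkit again. The cuda-toolkit package installs only the toolkit, whereas the cuda metapackage also pulls in a driver that may replace the one you just installed:

sudo apt-get install cuda-toolkit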

Verify Installation

Reboot your system to apply changes:

sudo reboot

After rebooting, check the installed driver and CUDA version:

nvidia-smi
nvcc --version
