WARNING: I don't really care about desktop, so this might fuck it up - make sure you can still ssh into the machine after a reboot.
If your installation is really messed up or you've kind of mangled it by mashing commands, you should wipe everything CUDA-related and start over - the docs provide a neat removal section that will get rid of most, if not all, traces
(https://gist.github.com/hitorilabs/3fed1a6e5dd500edb5ad7568562c064d)
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#removing-cuda-toolkit-and-driver
As an overview, here's a laundry list of things you probably want to check:
- At Installation:
  - compiler errors (`gcc` vs. `clang`) - read the warnings and errors, copy paste relevant bits into chatgpt (just take the head of logs when the errors begin)
  - linker error
    - where are your libraries installed? check `/usr/local/*`, `/usr/lib*`, `/opt/*`
    - where did your apt packages go? `sudo dpkg -L <package_name>`
    - what are your shared libraries linked to? `ldd /path/to/<shared_library>`
- GPU driver vs. CUDA driver (https://stackoverflow.com/questions/53422407/different-cuda-versions-shown-by-nvcc-and-nvidia-smi)
- old/default driver installations (probably just want to wipe these)
- Post-installation related setup: `PATH` + `LD_LIBRARY_PATH` (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#post-installation-actions)
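For the post-installation item, here is a minimal sketch of the `PATH` / `LD_LIBRARY_PATH` setup in the style of the linked NVIDIA guide. It assumes the toolkit lives under `/usr/local/cuda` (commonly a symlink to a versioned directory). `CUDA_HOME` is just a variable name used here for illustration, so point it at wherever `dpkg -L` says your files actually went.

```shell
# Assumption: the toolkit is installed under /usr/local/cuda.
# Override CUDA_HOME before running this if your install landed elsewhere.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

# Prepend the toolkit's bin and lib64 directories. The ${VAR:+:${VAR}} form
# only adds the ':' separator when the variable already had a value.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Sanity check: show which nvcc (if any) the shell now resolves.
command -v nvcc || echo "nvcc not found (looked in ${CUDA_HOME}/bin and the rest of PATH)"
```

To make this stick across logins, the two `export` lines would typically go in your shell's startup file.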
I started with installing cuda-toolkit, which seemed to be totally fine - but I noticed that nvidia-smi was reporting a different version from nvcc. My nvidia drivers were installed by default from the Ubuntu installation.
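A version mismatch here is not automatically a problem: `nvidia-smi` reports the highest CUDA version the installed driver supports, while `nvcc` reports the version of the toolkit itself. A rough sketch for printing both side by side follows. The two `parse_*` helpers are hypothetical names made up for this example, and they assume the usual output layout of each tool.

```shell
# Extract the toolkit release (e.g. "12.3") from `nvcc --version` output on stdin.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p'
}

# Extract the "CUDA Version" field from the `nvidia-smi` header on stdin.
parse_smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Only query tools that are actually installed.
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit version (nvcc):       $(nvcc --version | parse_nvcc_release)"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "max CUDA supported by driver: $(nvidia-smi | parse_smi_cuda_version)"
    echo "driver version:               $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)"
fi
```

If the toolkit version is newer than what the driver supports, that is the case where updating the driver is worth looking at.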
Then I tried to install the rest of the cuda-drivers and it gave me this error:
Building initial module for 6.5.0-10-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-545.0.crash'
Error! Bad return status for module build on kernel: 6.5.0-10-generic (x86_64)
Consult /var/lib/dkms/nvidia/545.23.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-545 (--configure):
installed nvidia-dkms-545 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-545:
cuda-drivers-545 depends on nvidia-dkms-545 (>= 545.23.08); however:
Package nvidia-dkms-545 is not configured yet.
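The error points at a `make.log`, and those logs can be long. Here is a small sketch for pulling out just the spot where the errors begin. `first_error_context` is a hypothetical helper written for this note, and matching on the word "error" is a crude heuristic, not something DKMS provides.

```shell
# Print the first line containing "error" (case-insensitive) in a log,
# plus a few lines of context before and after it.
# Usage: first_error_context <logfile> [context_lines]
first_error_context() {
    log="$1"
    ctx="${2:-5}"
    # Line number of the first match; empty if there is none.
    first=$(grep -n -i -m1 'error' "$log" | cut -d: -f1)
    if [ -z "$first" ]; then
        echo "no 'error' lines found in $log" >&2
        return 1
    fi
    start=$((first - ctx))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    sed -n "${start},$((first + ctx))p" "$log"
}

# Path taken from the error message above; adjust for your driver version.
LOG=/var/lib/dkms/nvidia/545.23.08/build/make.log
if [ -r "$LOG" ]; then
    first_error_context "$LOG"
fi
```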
At the top of /var/crash/nvidia-dkms-545.0.crash, I saw this:
DKMSBuildLog:
DKMS make.log for nvidia-545.23.08 for kernel 6.5.0-10-generic (x86_64) Thu Nov 16 08:18:11 AM EST 2023
make[1]: Entering directory '/usr/src/linux-headers-6.5.0-10-generic'
make --no-print-directory -C /usr/src/linux-headers-6.5.0-10-generic \ -f /usr/src/linux-headers-6.5.0-10-generic/Makefile modules
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
You are using: Ubuntu clang version 17.0.4
I didn't realize that setting `CC` as an environment variable isn't always respected by the build process. At the system level I had the `cc` symlink set to clang, so I fixed it by switching it back to gcc with `sudo update-alternatives --config cc`.
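To catch this kind of mismatch before a module build fails, you can compare the compiler the running kernel was built with against whatever `cc` currently resolves to. The following is only a sketch: `compiler_family` is a hypothetical helper, and it classifies by looking for "clang" or "gcc" in a string, which is a heuristic.

```shell
# Classify a string on stdin (version banner or binary path) as gcc, clang, or unknown.
compiler_family() {
    text=$(cat)
    case "$text" in
        *clang*) echo clang ;;
        *gcc*|*GCC*) echo gcc ;;
        *) echo unknown ;;
    esac
}

# What the running kernel was built with, according to /proc/version.
if [ -r /proc/version ]; then
    echo "kernel built with: $(compiler_family < /proc/version)"
fi

# What `cc` resolves to after following the alternatives symlinks.
if command -v cc >/dev/null 2>&1; then
    cc_real=$(readlink -f "$(command -v cc)")
    echo "cc resolves to:    ${cc_real} ($(echo "$cc_real" | compiler_family))"
fi
```

If the two disagree, `update-alternatives --display cc` lists the registered candidates, and `sudo update-alternatives --config cc` lets you pick one.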
On reflection, I was surprised that `cuda-toolkit` was totally fine being compiled using `clang`. This just meant that the other half of my `cuda` installation was compiled using `gcc` - but in nearly all situations they interoperated just fine.