hiwonjoon/Install NVIDIA Driver and CUDA.md

Forked from zhanwenchen/Install NVIDIA Driver and CUDA.md

Created September 7, 2018 19:54

Install NVIDIA CUDA 9.0 on Ubuntu 16.04.4 LTS

Updated 4/11/2018

Here's my experience of installing the NVIDIA CUDA Toolkit 9.0 on a fresh install of Ubuntu Desktop 16.04.4 LTS.

Install NVIDIA Graphics Driver via apt-get
Install CUDA
Install cuDNN

Table of contents generated with markdown-toc

1. Install NVIDIA Graphics Driver via apt-get

Do not use the CUDA run file to install your driver. Use apt-get instead. This way you do not need to worry about the Nouveau stuff you read about on StackOverflow.

As of 04/11/2018, the latest NVIDIA driver version for Ubuntu 16.04.4 LTS is 384. To install the driver, execute

sudo apt-get install nvidia-384 nvidia-modprobe

During the installation you will be prompted to disable Secure Boot. Select Disable.

Reboot the machine and enter the BIOS setup to disable Secure Boot. Typically you can enter the BIOS by hitting F12 rapidly as soon as the system restarts (the exact key varies by manufacturer).

Afterwards, you can check the installation with the nvidia-smi command, which reports all CUDA-capable devices in the system.
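As a minimal sketch of that check, the helper below wraps nvidia-smi so that a missing driver produces a clear message rather than a bare "command not found". The function name check_nvidia_driver is made up for illustration; --query-gpu and --format are standard nvidia-smi options.

```shell
# Hypothetical helper: report the GPUs if the driver's nvidia-smi tool is available.
check_nvidia_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver is probably not installed" >&2
        return 1
    fi
    # Print one CSV row per GPU: name, driver version, total memory.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
}

check_nvidia_driver || echo "driver check failed"
```

If the function returns a non-zero status after a reboot, the Secure Boot issue described under Common Errors and Solutions is a likely first thing to check.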

Common Errors and Solutions

ERROR: Unable to load the 'nvidia-drm' kernel module.

One probable cause is that the system boots via UEFI with the Secure Boot option turned on in the BIOS settings. Turn it off and the problem should be solved.

Additional Notes

nvidia-smi -pm 1 enables persistence mode, which saves some of the time otherwise spent loading the driver. It has a significant effect on machines with more than 4 GPUs.

nvidia-smi -e 0 disables ECC on Tesla products, which provides about 1/15 more video memory. A reboot is required for the change to take effect. nvidia-smi -e 1 enables ECC again.

nvidia-smi -pl <some power value> increases or decreases the power (TDP) limit of the GPU. Increasing it encourages higher GPU Boost frequencies, but is somewhat DANGEROUS and HARMFUL to the GPU. Decreasing it saves some power, which is useful for machines whose power supply is insufficient and which shut down unexpectedly when all GPUs are pulled to their maximum load.

-i <GPUID> can be added after the above commands to target an individual GPU.

These commands can be added to /etc/rc.local so that they are executed at system boot.

2. Install CUDA 9.0

Installing CUDA from the runfile is much simpler and smoother than installing the NVIDIA driver. It just involves copying files to system directories and has nothing to do with the system kernel or online compilation. Removing CUDA is simply a matter of removing the installation directory. So I personally do not recommend adding NVIDIA's repositories and installing CUDA via apt-get or other package managers: it does not reduce the complexity of installation or uninstallation, but it does increase the risk of messing up the repository configuration.

The CUDA runfile installer can be downloaded from NVIDIA's website, or with wget in case you can't find it easily there:

cd
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run

What you download is a package containing the following three components:

1. an NVIDIA driver installer, usually of a stale version;
2. the actual CUDA installer;
3. the CUDA samples installer.

I suggest extracting the above three components and executing 2 and 3 separately (remember we installed the driver ourselves already). To extract them, execute the runfile installer with the --extract option:

cd
chmod +x cuda_9.0.176_384.81_linux-run
./cuda_9.0.176_384.81_linux-run --extract=$HOME

You should have unpacked three components: NVIDIA-Linux-x86_64-384.81.run (1. NVIDIA driver that we ignore), cuda-linux.9.0.176-22781540.run (2. CUDA 9.0 installer), and cuda-samples.9.0.176-22781540-linux.run (3. CUDA 9.0 Samples).

Execute the second one to install the CUDA Toolkit 9.0:

sudo ./cuda-linux.9.0.176-22781540.run

You now have to accept the license by scrolling down to the bottom (hit the "d" key on your keyboard) and entering "accept". Then accept the defaults.

To verify our CUDA installation, install the sample tests with

sudo ./cuda-samples.9.0.176-22781540-linux.run

After the installation finishes, configure the runtime library.

sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig

It is also recommended for Ubuntu users to append the string /usr/local/cuda/bin to the system file /etc/environment so that nvcc is included in $PATH. This takes effect after a reboot. To do that, open the file with

sudo vim /etc/environment

and then add :/usr/local/cuda/bin (including the ":") at the end of the PATH="/blah:/blah/blah" string (inside the quotes).
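The same edit can be scripted. The sketch below is one possible approach and not part of the original procedure: append_to_env_path is a hypothetical helper that appends a directory to the PATH="..." line of a file using sed (GNU sed's -i option is assumed, as shipped with Ubuntu). The demonstration works on a temporary file so that nothing on the system is modified.

```shell
# Hypothetical helper: append a directory to the PATH="..." line of a file,
# unless that directory is already listed there.
append_to_env_path() {
    file=$1
    dir=$2
    if grep -q "^PATH=.*$dir" "$file"; then
        return 0    # already present, nothing to do
    fi
    # Re-emit the quoted value with ":<dir>" added before the closing quote.
    sed -i "s|^PATH=\"\(.*\)\"|PATH=\"\1:$dir\"|" "$file"
}

# Demonstration on a temporary file that mimics /etc/environment.
tmp=$(mktemp)
printf 'PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"\n' > "$tmp"
append_to_env_path "$tmp" /usr/local/cuda/bin
cat "$tmp"
# prints: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin"
```

To apply it to the real file, the function would have to run as root against /etc/environment, ideally after making a backup copy.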

After a reboot, let's test our installation by making and invoking our tests:

cd /usr/local/cuda-9.0/samples
sudo make

It's a long process with many irrelevant warnings about deprecated architectures (sm_20 and similarly ancient GPUs). After it completes, run deviceQuery and p2pBandwidthLatencyTest:

cd /usr/local/cuda/samples/bin/x86_64/linux/release
./deviceQuery
./p2pBandwidthLatencyTest

The result of running deviceQuery should look something like this:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6073 MBytes (6367739904 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Cleanup: if ./deviceQuery works, remember to rm the 4 files (1 downloaded and 3 extracted).

Install cuDNN 7.0

The recommended way to install cuDNN is as follows:

1. Download the "cuDNN v7.0.5 Library for Linux" tgz file (you need to register for an NVIDIA account).
2. sudo mv the downloaded archive to /usr/local. This might seem silly at first, but when you extract it next you will see that the contents end up in various folders under /usr/local/cuda and would be messy to move otherwise.
3. Then cd /usr/local and extract the tgz with

sudo tar -xvzf cudnn-9.0-linux-x64-v7.tgz

4. Finally, execute sudo ldconfig to update the shared library cache.
5. Clean up now or later with sudo rm cudnn-9.0-linux-x64-v7.tgz
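To confirm which cuDNN version ended up under /usr/local/cuda, one option is to read the version macros from the installed header. The sketch below assumes the cuDNN 7 layout, in which cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL; the helper name cudnn_version is made up for illustration.

```shell
# Hypothetical helper: print the cuDNN version recorded in a cudnn.h header.
cudnn_version() {
    header=${1:-/usr/local/cuda/include/cudnn.h}
    # Each macro is defined on a line such as "#define CUDNN_MAJOR 7".
    major=$(awk '$1 == "#define" && $2 == "CUDNN_MAJOR" {print $3}' "$header")
    minor=$(awk '$1 == "#define" && $2 == "CUDNN_MINOR" {print $3}' "$header")
    patch=$(awk '$1 == "#define" && $2 == "CUDNN_PATCHLEVEL" {print $3}' "$header")
    echo "$major.$minor.$patch"
}

# Only query the default location if the header is actually there.
if [ -f /usr/local/cuda/include/cudnn.h ]; then
    cudnn_version
fi
```

For the archive used above, the expected result would be 7.0.5, provided the header follows that layout.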

hiwonjoon commented Aug 5, 2019

OpenGL Troubleshooting

Long story

The explanation below could contain inaccurate information.

There are three OpenGL rendering backends: GLFW, EGL, and OSMesa. The main problem is GLFW, since it does not support headless rendering. The most elegant solution is to support EGL, but the most commonly used OpenGL library in Python, namely pyglet, only supports GLFW, so it becomes painful. There has been a discussion, but it is unclear when the update will be made. (Luckily, the mujoco_py and dm_control libraries support EGL or OSMesa, so I have not run into this problem so far.)

One workaround is to use a virtual framebuffer, such as xvfb, as noted in many places.

However, there are known problems between the NVIDIA driver and xvfb, since xvfb is entirely CPU-based (ref1, ref2). It does not know how to interact with the GPU-based OpenGL implementation provided by the driver. Therefore, people suggest installing the driver and the CUDA library without the OpenGL libraries, using the --no-opengl-libs option (ref1, ref2). That could be one solution, but it is not satisfactory since it abandons hardware-accelerated rendering.

If you can run an X server, a (possibly more elegant?) solution is simply to run one to generate a virtual monitor. It can be done easily with

# (maybe not required)
sudo apt-get install -y xserver-xorg mesa-utils
sudo nvidia-xconfig --busid=PCI:0:30:0 --use-display-device=none --virtual=1280x1024
sudo Xorg :1

and then specifying the virtual display in front of a command, for instance DISPLAY=:1. You can get the bus ID with the command nvidia-xconfig --query-gpu-info.

Tips if you have trouble running nvidia-xconfig

Actually, you don't need to run nvidia-xconfig. All you need is a properly set xorg.conf file and the PCI bus ID.

First, you can just generate a default xorg.conf file by running the following command (note that there are no options at all).

sudo nvidia-xconfig

Then, change a few sections related to the screen and server flags. Copy /etc/X11/xorg.conf to your home directory or whatever directory you want, then change the sections.

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "UseDisplayDevice" "None"
    SubSection "Display"
        Virtual 1280 1024
        Depth 24
    EndSubSection
EndSection

Section "ServerFlags"
    Option "AllowMouseOpenFail" "true"
    Option "AutoAddGPU" "false"
    Option "ProbeAllGpus" "false"
EndSection

You can get the PCI ID with a different command (actually, nvidia-smi includes this information, too).

lspci -vnn | grep VGA

You will get a result like the one below.

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])

Then, your PCI IDs are PCI:2:0:0, PCI:3:0:0, PCI:129:0:0, and PCI:130:0:0 (note that 81 and 82 are hexadecimal, so 8*16+1=129 and 8*16+2=130).
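The hexadecimal-to-decimal conversion can be sketched as a small shell function. The name lspci_to_busid is made up for illustration, and the function assumes the short bus:device.function address form shown in the lspci output above (no PCI domain prefix).

```shell
# Hypothetical helper: convert an lspci address such as "81:00.0"
# (hexadecimal bus:device.function) into the decimal
# "PCI:bus:device:function" form used in xorg.conf and by Xorg.
lspci_to_busid() {
    addr=$1
    bus=${addr%%:*}     # text before the first ":"
    rest=${addr#*:}     # remaining "device.function" part
    dev=${rest%%.*}     # text before the "."
    fn=${rest#*.}       # text after the "."
    # printf interprets the "0x" prefix as hexadecimal and prints decimal.
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

lspci_to_busid 81:00.0
# prints: PCI:129:0:0
```

The first column of each lspci line can be passed to the function as-is, for example the 02:00.0 address of the first GPU listed above.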

Finally, you can run the X server with the following command:

sudo Xorg -noreset -sharevts -novtswitch -isolateDevice "<PCI-ID>" -config <your xorg.conf file> :<display id, such as 0,1> vt1 &

Don't forget the trailing &, because the server should keep running in the background like a daemon.

You can check your X server by monitoring nvidia-smi while running the glxgears command, or with glxinfo.

DISPLAY=:0 glxinfo | grep OpenGL
glxgears -display :0

Solutions

If you can connect a monitor directly to the machine

Just connect it; then run the app with DISPLAY=:0 set.

Connecting a monitor is impossible, but you have `sudo` access.

Run an X server to generate a virtual monitor, as described in the Long story section above.

You don't have a monitor, and you don't have `sudo` access.

You can work around the OpenGL problem with a library preloading trick: LD_PRELOAD.

First, install the OSMesa OpenGL library. (Or, since all you need is the mesa/libGL.so file, you can just copy it from your local computer to the server.)

sudo apt-get install libglu1-mesa libgl1-mesa-dev

It is usually installed under /usr/lib/x86_64-linux-gnu/. Check whether mesa/libGL.so, which we will preload, exists.

❯❯❯ ls -all /usr/lib/x86_64-linux-gnu/ | grep GL
lrwxrwxrwx    13 root 14 Jun  2018 libGL.so -> mesa/libGL.so

Note that the installed OSMesa library is not included in the ldconfig cache, so it won't be loaded (unsure..) unless we specify it with LD_PRELOAD.

❯❯❯ ldconfig -p | grep libGL
        libGLdispatch.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLdispatch.so.0
        libGLdispatch.so.0 (libc6) => /usr/lib32/nvidia-415/libGLdispatch.so.0
        libGLX_nvidia.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLX_nvidia.so.0
        libGLX_nvidia.so.0 (libc6) => /usr/lib32/nvidia-415/libGLX_nvidia.so.0
        libGLX.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLX.so.0
        libGLX.so.0 (libc6) => /usr/lib32/nvidia-415/libGLX.so.0
        libGLX.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLX.so
        libGLX.so (libc6) => /usr/lib32/nvidia-415/libGLX.so
        libGLU.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLU.so.1
        libGLEWmx.so.1.13 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLEWmx.so.1.13
        libGLEW.so.1.13 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLEW.so.1.13
        libGLESv2_nvidia.so.2 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2_nvidia.so.2
        libGLESv2_nvidia.so.2 (libc6) => /usr/lib32/nvidia-415/libGLESv2_nvidia.so.2
        libGLESv2.so.2 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2.so.2
        libGLESv2.so.2 (libc6) => /usr/lib32/nvidia-415/libGLESv2.so.2
        libGLESv2.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2.so
        libGLESv2.so (libc6) => /usr/lib32/nvidia-415/libGLESv2.so
        libGLESv1_CM_nvidia.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM_nvidia.so.1
        libGLESv1_CM_nvidia.so.1 (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM_nvidia.so.1
        libGLESv1_CM.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM.so.1
        libGLESv1_CM.so.1 (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM.so.1
        libGLESv1_CM.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM.so
        libGLESv1_CM.so (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM.so
        libGL.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGL.so.1
        libGL.so.1 (libc6) => /usr/lib32/nvidia-415/libGL.so.1
        libGL.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGL.so
        libGL.so (libc6,x86-64) => /usr/lib/nvidia-415/libGL.so
        libGL.so (libc6) => /usr/lib32/nvidia-415/libGL.so

Anyway, now you can easily prevent /usr/lib32/nvidia-415/libGL.so from loading by specifying the OSMesa OpenGL library.

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGL.so xvfb-run --auto-servernum -s "-screen 0 640x480x24" glxinfo | grep OpenGL

Here are some illustrative runs. The loaded library for OpenGL is now different.

Some useful commands & links

nvidia-xconfig --query-gpu-info
glxinfo or DISPLAY=:0 glxinfo | grep OpenGL
lsof -p <process_id>; list the shared libraries in use by the process
dpkg-query -L <package-name>; list the files installed by the package
ldd <binary>; list the libraries the binary is linked against
ldconfig -p; print the library cache

openai/gym#366

https://bitbucket.org/pyglet/pyglet/issues/219/egl-support-headless-rendering

hiwonjoon commented Jan 30, 2020

Version check when upgrading nvidia-drivers

apt list --installed | grep cuda
apt list --installed | grep nvidia

hiwonjoon commented Nov 18, 2020

nvidia-smi -q: shows detailed information about GPUs.
