Installing NVIDIA Driver & CUDA inside an LXC container running Ubuntu 16.04 on a neuroscience computing server.
Introduction: I was trying to run some neuroscience image processing commands that use the NVIDIA GPU. The challenge is that most of our computation will be run inside an LXC container running Ubuntu 16.04 (the host runs Ubuntu 16.04 as well). Installing the NVIDIA driver on the host is not so hard, but doing it inside the LXC container is much more challenging.
I already have an unprivileged container running, so I will not repeat the steps to create an LXC container here.
Our graphics card is NVIDIA GeForce GTX 1080 Ti.
Here are the main steps:
- Install NVIDIA driver on the host
- Install NVIDIA driver in the container. The driver version in the container has to be exactly the same as the one on the host.
- Install CUDA & other GPU-related libraries in the container.
I found this page
https://blog.nelsonliu.me/2017/04/29/installing-and-updating-gtx-1080-ti-cuda-drivers-on-ubuntu/
which in turn mostly follows the instructions from another guide (see Section 4, Runfile Installation).
Here is what I actually did (mostly following the steps in Section 4.2 from the link above):
- Install gcc and other essential packages on the host:
sudo apt install build-essential software-properties-common
- Download the CUDA Toolkit runfile cuda_9.0.176_384.81_linux-run
- Follow the instructions in Section 4.3.5 to blacklist the nouveau driver (a sketch of the blacklist setup is right below).
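For reference, the blacklisting boils down to roughly the following (a minimal sketch from memory of Section 4.3.5; any file name under /etc/modprobe.d/ should work):
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u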
- Reboot athena into text-only mode; I found this page helpful: https://askubuntu.com/questions/870221/booting-into-text-mode-in-16-04/870226 (From here on, I use the KVM environment.)
- I ran into this error when launching the installer:
huangk04@athena:~/Downloads$ sudo sh cuda_9.0.176_384.81_linux-run
[sudo] password for huangk04:
Sorry, user huangk04 is not allowed to execute '/bin/sh cuda_9.0.176_384.81_linux-run' as root on athena.mssm.edu.
- I got past that error by typing sudo su first, and then the installer runs.
- The root partition /dev/mapper/vg01-lv.root is too small for CUDA, so I installed CUDA in /data/cuda-9.0 and made symbolic links to /usr/local/cuda-9.0; I also installed the CUDA samples at /data/cuda-9.0/samples/. The temporary file directory needs to be moved for the same disk space reason, so here is the command I executed:
sh cuda_9.0.176_384.81_linux-run --tmpdir=/data/tmp
(Mmm... I probably did not need to install CUDA Toolkit on the host...)
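For the record, the symbolic links mentioned above amount to something like this (a sketch; the paths are just what I chose for our disk layout, and the second link is only needed if the installer didn't create /usr/local/cuda itself):
sudo ln -s /data/cuda-9.0 /usr/local/cuda-9.0
sudo ln -s /usr/local/cuda-9.0 /usr/local/cuda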
- Add the graphics driver PPA (I verified that driver version 384.98 is supported on Ubuntu 16.04), update, and then install driver version 384:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-384
- Reboot, and then (still on the host) type nvidia-smi to confirm that the driver version is indeed 384.98.
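Since the driver version in the container has to match this exactly (see below), another quick way to read the exact host driver version is:
cat /proc/driver/nvidia/version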
To set up the GPU driver in the container, Prantik forwarded this tutorial to me: https://medium.com/@MARatsimbazafy/journey-to-deep-learning-nvidia-gpu-passthrough-to-lxc-container-97d0bc474957
- On the host, edit the file /etc/modules-load.d/modules.conf and add the following lines (not sure if this is necessary):
nvidia
nvidia_uvm
- Update initramfs:
sudo update-initramfs -u
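After the next reboot, one can check whether these modules actually got loaded (which would also tell me whether the modules-load.d entries were necessary):
lsmod | grep nvidia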
- Set the login runlevel back to graphical.target (another reboot is required):
sudo systemctl set-default graphical.target
- Edit the file /home/huangk04/.local/share/lxc/athena_box/config and add the following lines to it:
# GPU Passthrough config
lxc.cgroup.devices.allow = c 195:* rwm
lxc.cgroup.devices.allow = c 243:* rwm
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
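A note on the two lxc.cgroup.devices.allow lines: 195 and 243 are the character device major numbers of the NVIDIA devices on our host. On another machine they may differ; they can be read off the host's device files (the major number is the first of the two numbers printed before the date) or from /proc/devices:
ls -l /dev/nvidia*
grep nvidia /proc/devices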
However, passing the GPU device in the LXC config file led to the following error when I tried to start the container:
huangk04@athena:~$ lxc-start -n athena_box -d
lxc-start: tools/lxc_start.c: main: 366 The container failed to start.
lxc-start: tools/lxc_start.c: main: 368 To get more details, run the container in foreground mode.
lxc-start: tools/lxc_start.c: main: 370 Additional information can be obtained by setting the --logfile and --logpriority options.
I found that I can't start my container if I specify anything in the config file that tries to modify the cgroup settings, like trying to get access to the /dev/nvidia* devices on the host. It looks like a cgroups issue with LXC on Ubuntu 16.04 (maybe somehow related to systemd, but I don't really understand what that means). What's more confusing is that Ubuntu has a package cgmanager that manages cgroups (by being a wrapper that sends calls over dbus?), but when I tried to install it by typing
sudo apt update
sudo apt install cgmanager
it showed that I installed version 0.39-2ubuntu5. But the version of cgm I got is
huangk04@athena:~$ cgm --version
0.29
Seems like a bug in cgmanager. Anyway, I found some instructions (e.g., https://www.berthon.eu/2015/lxc-unprivileged-containers-on-ubuntu-14-04-lts/) saying that I could move all processes in my current shell into a specific cgroup with access to the devices that I need, and then I might be able to start the container. So here is what I tried:
sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)
sudo cgm movepid all $USER $$
The second and third commands actually threw errors, but they did have an effect on what I see in /proc/self/cgroup. Before these commands, it looks like this:
huangk04@athena:~$ cat /proc/self/cgroup
11:cpuset:/
10:net_cls,net_prio:/
9:cpu,cpuacct:/user.slice
8:perf_event:/
7:memory:/user/huangk04/0
6:devices:/user.slice
5:freezer:/user/huangk04/0
4:hugetlb:/
3:blkio:/user.slice
2:pids:/user.slice/user-10354.slice
1:name=systemd:/user.slice/user-10354.slice/session-6.scope
and after the three commands above (the first one probably only needs to be run once), I see
huangk04@athena:~$ cat /proc/self/cgroup
11:cpuset:/huangk04
10:net_cls,net_prio:/huangk04
9:cpu,cpuacct:/user.slice/huangk04
8:perf_event:/huangk04
7:memory:/user/huangk04/0
6:devices:/user.slice/huangk04
5:freezer:/user/huangk04/0
4:hugetlb:/huangk04
3:blkio:/user.slice/huangk04
2:pids:/user.slice/user-10354.slice/huangk04
1:name=systemd:/user.slice/user-10354.slice/session-6.scope
and now the container starts. I suspect that a bug in cgmanager is throwing the errors even though the commands worked, which could be related to the inconsistent version numbers I see when checking them in different ways. Also, the sudo cgm chown and sudo cgm movepid steps are not persistent, meaning that I need to rerun these commands whenever I restart the container (most likely from a different shell).
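Since these steps need repeating, I may keep them in a small wrapper that I run before starting the container. A rough sketch (the container name athena_box is as above; the cgm errors are ignored on purpose since the commands seem to work anyway, and I haven't fully verified that running this as a script behaves the same as typing the commands in an interactive shell):
#!/bin/bash
# move this shell's process into my own cgroup, then start the container
sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g) || true
sudo cgm movepid all $USER $$ || true
lxc-start -n athena_box -d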
- Install the NVIDIA driver in the container as well (so we have the nvidia-smi command in the container): First, download the driver runfile NVIDIA-Linux-x86_64-384.98.run (again, the version in the container must match the version on the host, 384.98). Then do the following (courtesy of this website: https://qiita.com/yanoshi/items/75b0fc6b65df49fc2263):
cd ~/Downloads  # or wherever the runfile is
chmod a+x NVIDIA-Linux-x86_64-384.98.run
sudo ./NVIDIA-Linux-x86_64-384.98.run --no-kernel-module
And then follow the prompts to install the driver. After that, I can see the GPU info in the container by typing nvidia-smi:
root@xenial:/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery# nvidia-smi
Tue Nov 21 02:35:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98 Driver Version: 384.98 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 38C P0 59W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Install CUDA in the container, following the same steps 2 to 7 as on the host (a sketch of the container-side runfile invocation is below).
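Inside the container the toolkit install is the same runfile invocation; the one thing to watch for (as I understand it) is to decline the driver portion, since the kernel module lives on the host:
sudo sh cuda_9.0.176_384.81_linux-run
# accept the EULA; answer "no" to the driver, "yes" to the toolkit (and the samples if wanted)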
- Run a CUDA test: go to the directory /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery, type make to compile the executable, and run ./deviceQuery, which produced the following output:
root@xenial:/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11172 MBytes (11715084288 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
- It is best to remove the graphics driver PPA so that a future apt upgrade won't move the driver to a newer but incompatible version:
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
Should do this both on the host and in the container.
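Alternatively (or in addition), the driver package itself can be pinned so that apt leaves it alone; a sketch, assuming the package name nvidia-384 as installed above:
sudo apt-mark hold nvidia-384
# later, to allow upgrades again:
sudo apt-mark unhold nvidia-384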
To be added: run more tests using the GPU
Here is the page where one can download the CUDA- and OpenMP-enabled versions of eddy from FSL, in case I forget:
https://fsl.fmrib.ox.ac.uk/fsldownloads/patches/eddy-patch-fsl-5.0.9/centos6/
Also, here are nice instructions on how to install CUDA 7.5 on Ubuntu 16.04: http://www.xgerrmann.com/uncategorized/building-cuda-7-5-on-ubuntu-16-04/
In the end, eddy_cuda7.5 still doesn't run properly inside the container. I wonder if it's because I installed CUDA 9.0 before installing CUDA 7.5, even though I created an environment module file for CUDA 7.5 and loaded it before running eddy_cuda7.5 (and eddy_cuda7.5 seems to be able to find the correct libraries). I'll need to experiment more with this later.
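For reference, loading that module is essentially equivalent to the following shell exports (a sketch, assuming CUDA 7.5 ended up under /usr/local/cuda-7.5; checking which libraries the binary actually resolves is the first thing I'd do when debugging this):
export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
# confirm which CUDA libraries eddy_cuda7.5 actually resolves to
ldd $(which eddy_cuda7.5) | grep -i cuda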