@lacek
Created July 15, 2019 04:21
Nvidia Docker in Docker

Nvidia Docker-in-Docker

Problem

Q: Can an Nvidia GPU be used by a Docker daemon that itself runs inside a Docker container?

A: According to the nvidia-docker wiki, this is untested.

Testbed

  • Ubuntu 14.04.1 LTS (Kernel: 3.13.0-170-generic)
  • Docker CE 18.06.1-ce
  • nvidia-docker2 2.0.3
  • Nvidia driver 430.34
  • i7-3930K CPU @ 3.20GHz
  • GTX 680

Test

Test Outer Docker

docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi

Looks good:

Fri Jul 12 09:30:46 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 00000000:01:00.0 N/A |                  N/A |
| 30%   37C    P0    N/A /  N/A |      0MiB /  4035MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Build Image

docker build -t docker:cuda .

Test Inner Docker

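# On the host: export the image and start the Docker-in-Docker container
docker save nvidia/cuda:10.1-base > cuda_10.1-base.tar
docker run -it --rm --runtime=nvidia --privileged -v $PWD:/workdir -w /workdir docker:cuda bash
# Inside the docker:cuda container: load the image into the inner daemon and run it
docker load < cuda_10.1-base.tar
docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi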
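
The gist does not reproduce the output of the inner run. One way to compare it with the outer run would be to extract the version banner from each `nvidia-smi` output and check that they match. The following is a minimal sketch under that assumption; `parse_smi_header` and the regular expression are illustrative helpers, not part of the gist, and the sample line is copied from the outer output shown earlier.

```python
import re

# Matches the banner line printed at the top of nvidia-smi's table output.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(output):
    """Return the versions found in the nvidia-smi banner, or None if absent."""
    match = HEADER_RE.search(output)
    return match.groupdict() if match else None


# Banner line taken from the outer-Docker output above.
sample = "| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |"
print(parse_smi_header(sample))
```

Capturing the text of both runs and comparing the two dictionaries would show whether the inner container reports the same driver and CUDA versions as the outer one.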

Benchmark with TensorFlow

Building TensorFlow from source

The GTX 680 is fairly old and has a compute capability of 3.0, which the current prebuilt TensorFlow binaries do not support. Following the official documentation, a wheel package was built with the image tensorflow/tensorflow:devel-gpu-py3 so that it includes compute capability 3.0.
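
To check which compute capability TensorFlow reports for a card, the device description can be inspected. This is a sketch only: it assumes the `physical_device_desc` string format used by TensorFlow 1.x (ending in `compute capability: X.Y`), and the function names and sample string are illustrative rather than taken from the gist.

```python
import re


def compute_capability(physical_device_desc):
    """Extract (major, minor) from a TensorFlow device description string."""
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", physical_device_desc)
    return (int(match.group(1)), int(match.group(2))) if match else None


def local_gpu_capabilities():
    """List (description, capability) for every GPU TensorFlow can see.

    TensorFlow is imported lazily so the parser above works without it.
    """
    from tensorflow.python.client import device_lib

    return [
        (d.physical_device_desc, compute_capability(d.physical_device_desc))
        for d in device_lib.list_local_devices()
        if d.device_type == "GPU"
    ]


# Hypothetical description string in the assumed TensorFlow 1.x format.
example = "device: 0, name: GeForce GTX 680, pci bus id: 0000:01:00.0, compute capability: 3.0"
print(compute_capability(example))
```

If `local_gpu_capabilities()` returns an empty list inside the container, the installed wheel is probably not using the GPU at all.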

Benchmark Outer Docker

docker run -it --runtime=nvidia --rm -v $PWD:/workdir -w /workdir tensorflow/tensorflow:1.14.0-gpu-py3 bash
pip install tensorflow-1.14.0-cp36-cp36m-linux_x86_64.whl
python matmul.py gpu 10000
# 1st: Time taken: 0:00:02.589672
# 2nd: Time taken: 0:00:02.506626
# 3rd: Time taken: 0:00:02.570411
# Average: 2.56s

python matmul.py cpu 10000
# 1st: Time taken: 0:00:12.053325
# 2nd: Time taken: 0:00:12.104460
# 3rd: Time taken: 0:00:12.023136
# Average: 12.06s

# CPU:GPU ~ 4.71

Benchmark Inner Docker

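# On the host: export the image and start the Docker-in-Docker container
docker save tensorflow/tensorflow:1.14.0-gpu-py3 > tensorflow_1.14.0-gpu-py3.tar
docker run -it --rm --runtime=nvidia --privileged -v $PWD:/workdir -w /workdir docker:cuda bash
# Inside the docker:cuda container: load the image and start the TensorFlow container
docker load < tensorflow_1.14.0-gpu-py3.tar
docker run -it --runtime=nvidia --rm -v $PWD:/workdir -w /workdir tensorflow/tensorflow:1.14.0-gpu-py3 bash
# Inside the inner TensorFlow container: install the custom wheel
pip install tensorflow-1.14.0-cp36-cp36m-linux_x86_64.whl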
python matmul.py gpu 10000
# 1st: Time taken: 0:00:02.594973
# 2nd: Time taken: 0:00:02.574000
# 3rd: Time taken: 0:00:02.540780
# Average: 2.57s
# 0.39% increase

python matmul.py cpu 10000
# 1st: Time taken: 0:00:12.861318
# 2nd: Time taken: 0:00:12.685900
# 3rd: Time taken: 0:00:12.373268
# Average: 12.64s
# 4.81% increase

# CPU:GPU ~ 4.92
# 4.46% increase
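
The averages and percentage increases quoted above can be re-derived from the raw timings. The sketch below does so in Python; the helper names are illustrative and not part of the gist. It assumes the averages and ratios were rounded to two decimals before being compared, which is what reproduces the quoted figures.

```python
def mean_seconds(samples):
    """Arithmetic mean of a list of timings in seconds."""
    return sum(samples) / len(samples)


def pct_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100.0


# Raw timings copied from the benchmark runs above.
timings = {
    "outer_gpu": [2.589672, 2.506626, 2.570411],
    "outer_cpu": [12.053325, 12.104460, 12.023136],
    "inner_gpu": [2.594973, 2.574000, 2.540780],
    "inner_cpu": [12.861318, 12.685900, 12.373268],
}

# Round each average to two decimals first (assumed rounding order).
avg = {name: round(mean_seconds(s), 2) for name, s in timings.items()}

outer_ratio = round(avg["outer_cpu"] / avg["outer_gpu"], 2)
inner_ratio = round(avg["inner_cpu"] / avg["inner_gpu"], 2)

print("averages:", avg)
print("GPU increase: %.2f%%" % pct_increase(avg["outer_gpu"], avg["inner_gpu"]))
print("CPU increase: %.2f%%" % pct_increase(avg["outer_cpu"], avg["inner_cpu"]))
print("CPU:GPU outer %.2f, inner %.2f" % (outer_ratio, inner_ratio))
print("ratio increase: %.2f%%" % pct_increase(outer_ratio, inner_ratio))
```

With only three runs per configuration, differences of this size are within what run-to-run variation could plausibly produce, so the percentages are best read as rough indications.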

Dockerfile

FROM nvidia/cuda:10.1-base-ubuntu18.04
RUN sed -i -E -e 's@//(archive|security).ubuntu@//hk.archive.ubuntu@' /etc/apt/sources.list \
    && apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg2 \
        iptables \
    && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - \
    && echo "deb https://download.docker.com/linux/ubuntu bionic stable" | tee /etc/apt/sources.list.d/docker.list \
    && curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add - \
    && curl -fsSL https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list -o /etc/apt/sources.list.d/nvidia-docker.list \
    && apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        docker-ce=5:18.09.7~3-0~ubuntu-bionic \
        nvidia-docker2=2.0.3+docker18.09.7-3 \
        nvidia-container-runtime=2.0.0+docker18.09.7-3 \
    && rm -rf /var/lib/apt/lists/*
COPY entrypoint.sh /sbin/entrypoint.sh
VOLUME /var/lib/docker
EXPOSE 2375
ENTRYPOINT ["/sbin/entrypoint.sh"]

entrypoint.sh

#!/bin/sh
# adapted from https://github.com/moby/moby/blob/3b5fac462d21ca164b3778647420016315289034/hack/dind

# apparmor sucks and Docker needs to know that it's in a container (c) @tianon
export container=docker

# as of docker 1.8, cgroups will be mounted in the container
if ! mountpoint -q /sys/fs/cgroup; then

    # First, make sure that cgroups are mounted correctly.
    CGROUP=/cgroup

    mkdir -p "$CGROUP"

    if ! mountpoint -q "$CGROUP"; then
        mount -n -t tmpfs -o uid=0,gid=0,mode=0755 cgroup $CGROUP || {
            echo >&2 'Could not make a tmpfs mount. Did you use --privileged?'
            exit 1
        }
    fi

    # Mount the cgroup hierarchies exactly as they are in the parent system.
    for HIER in $(cut -d: -f2 /proc/1/cgroup); do

        # If the cgroup hierarchy is named (mounted with "-o name=foo") we
        # need to mount it in $CGROUP/foo to create exactly the same
        # directories as on the host. Otherwise we need to mount it as is, e.g.
        # "subsys1,subsys2" if it has two subsystems.

        # Named, control-less cgroups are mounted with "-o name=foo"
        # (and appear as such under /proc/<pid>/cgroup) but are usually
        # mounted on a directory named "foo" (without the "name=" prefix).
        # Systemd and OpenRC (and possibly others) both create such a
        # cgroup. So just mount them on directory $CGROUP/foo.

        OHIER=$HIER
        HIER="${HIER#*name=}"

        mkdir -p "$CGROUP/$HIER"

        if ! mountpoint -q "$CGROUP/$HIER"; then
            mount -n -t cgroup -o "$OHIER" cgroup "$CGROUP/$HIER"
        fi

        # Likewise, on at least one system, it has been reported that
        # systemd would mount the CPU and CPU accounting controllers
        # (respectively "cpu" and "cpuacct") with "-o cpuacct,cpu"
        # but on a directory called "cpu,cpuacct" (note the inversion
        # in the order of the groups). This tries to work around it.

        if [ "$HIER" = 'cpuacct,cpu' ]; then
            ln -s "$HIER" "$CGROUP/cpu,cpuacct"
        fi

        # If a hierarchy has multiple subsystems, in /proc/<pid>/cgroup
        # we will see a ":subsys1,subsys2,subsys3,name=foo:" substring.
        # We need to mount it to "$CGROUP/foo", or, if there is no
        # name, to "$CGROUP/subsys1,subsys2,subsys3", so we must create
        # symlinks for the docker daemon to find these subsystems:
        # ln -s $CGROUP/foo $CGROUP/subsys1
        # ln -s $CGROUP/subsys1,subsys2,subsys3 $CGROUP/subsys1
        OSUBSYSTEMS="${HIER%name=*}"
        SUBSYSTEMS="$(echo $OSUBSYSTEMS | tr , ' ')"
        if [ "$SUBSYSTEMS" != "$OSUBSYSTEMS" ]; then
            for SUBSYS in $SUBSYSTEMS
            do
                ln -s "$CGROUP/$HIER" "$CGROUP/$SUBSYS"
            done
        fi
    done
fi

if [ -d /sys/kernel/security ] && ! mountpoint -q /sys/kernel/security; then
    mount -t securityfs none /sys/kernel/security || {
        echo >&2 'Could not mount /sys/kernel/security.'
        echo >&2 'AppArmor detection and --privileged mode might break.'
    }
fi

# Mount /tmp (conditionally)
if ! mountpoint -q /tmp; then
    mount -t tmpfs none /tmp
fi

# Start docker service and poll
service docker start
TIMEOUT=$(( $(date +%s) + 30 ))
until docker info >/dev/null 2>&1; do
    if [ $(date +%s) -ge $TIMEOUT ]; then
        echo >&2 'Timed out trying to connect to internal docker host.'
        break
    fi
    sleep 1
done

if [ $# -gt 0 ]; then
    exec "$@"
fi
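
The cgroup loop in the script above derives two values per hierarchy: the mount option (the second field of each `/proc/1/cgroup` line) and the directory name (the same value with any `name=` prefix stripped). The Python sketch below mirrors just that derivation so the mapping is easier to see. It is an illustration rather than a replacement for the script; `cgroup_mounts` and the sample text are hypothetical.

```python
def cgroup_mounts(proc_cgroup_text):
    """Map each hierarchy in /proc/<pid>/cgroup text to (mount option, directory)."""
    mounts = []
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":")
        if len(fields) < 3 or not fields[1]:
            # Skip malformed lines and entries with an empty controller list.
            continue
        option = fields[1]  # what `cut -d: -f2` yields for this line
        # Same effect as ${HIER#*name=}: drop everything up to the first "name=".
        directory = option.split("name=", 1)[1] if "name=" in option else option
        mounts.append((option, directory))
    return mounts


# Hypothetical contents of /proc/1/cgroup inside a container.
sample = """\
11:memory:/docker/abc
4:cpuacct,cpu:/docker/abc
1:name=systemd:/docker/abc
"""

for option, directory in cgroup_mounts(sample):
    print("mount -t cgroup -o %s cgroup /cgroup/%s" % (option, directory))
```

The `cpuacct,cpu` entry in the sample is the case for which the script adds the extra `cpu,cpuacct` symlink.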

matmul.py

# copied from https://databricks.com/tensorflow/using-a-gpu
import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

# It can be hard to see the results on the terminal with lots of output -- add some newlines to improve readability.
print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)
print("\n" * 5)

lacek commented Oct 24, 2019

Native GPU support has been available since Docker CE 19.03: https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(Native-GPU-Support)

However, Docker dropped support for Ubuntu 14.04 starting with version 18.09, so a newer version cannot be installed there through the official method.
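
As an illustration of the difference, the sketch below builds a `docker run` argument list that uses `--gpus all` on Docker 19.03 or later and falls back to `--runtime=nvidia` (the nvidia-docker2 runtime used throughout this gist) on older versions. The function names are hypothetical, and the sketch assumes the NVIDIA container packages needed by each option are already installed.

```python
import re


def docker_version_tuple(version):
    """Turn a version string such as '18.06.1-ce' into (18, 6)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if not match:
        raise ValueError("unrecognised Docker version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def gpu_run_args(version, image, command):
    """Build a `docker run` argument list with the GPU option that fits the version."""
    if docker_version_tuple(version) >= (19, 3):
        gpu_flags = ["--gpus", "all"]      # native GPU support
    else:
        gpu_flags = ["--runtime=nvidia"]   # nvidia-docker2 runtime
    return ["docker", "run", "--rm"] + gpu_flags + [image] + list(command)


print(" ".join(gpu_run_args("18.06.1-ce", "nvidia/cuda:10.1-base", ["nvidia-smi"])))
print(" ".join(gpu_run_args("19.03.4", "nvidia/cuda:10.1-base", ["nvidia-smi"])))
```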