|
# Reproducible Research using Docker and R |
|
|
|
# Challenges of reproducibility |
|
|
|
- dependencies |
|
- isolation and transparency |
|
- portability of computational environment |
|
- extensibility and reuse |
|
- ease of use |
|
|
|
# Virtual Machines vs Containers |
|
|
|
- Docker uses resource isolation features of the Linux kernel, such as cgroups and kernel namespaces |
|
- allows independent "containers" to run within a single Linux instance |
|
- packages executable dependencies in a way that is more transparent than a VM and more robust than a README |
|
- limitations of VMs: |
|
- Size: VMs are very large which makes them impractical to store and transfer. |
|
- Performance: running VMs consumes significant CPU and memory, which makes them impractical in many scenarios, for example local development of multi-tier applications and large-scale deployment of CPU- and memory-intensive applications across many machines. |
|
- Portability: competing VM environments don't play well with each other. Although conversion tools do exist, they are limited and add even more overhead. |
|
- Hardware-centric: VMs were designed with machine operators in mind, not software developers. As a result, they offer very limited tooling for what developers need most: building, testing and running their software. For example, VMs offer no facilities for application versioning, monitoring, configuration, logging or service discovery. |
|
|
|
# What Docker is |
|
|
|
- a shipping container for the online universe: hardware-agnostic and platform-agnostic |
|
- a tool that lets developers neatly package software and move it from machine to machine. |
|
- released as open source in March 2013, a big deal on GitHub: 18.6k stars, 3.8k forks |
|
- dockerfiles: plain-text instructions to automatically make images |
|
- containers: the active, running parts of Docker that do something |
|
- images: pre-built environments and instructions that tell a container what to do. |
|
- registry: open online repository of images (https://registry.hub.docker.com/), including many ['trusted builds'](http://dockerfile.github.io/) |
|
|
|
# Limitations |
|
|
|
- Security: an image hosted on the registry could have been written with malicious intent |
|
- Limited to 64-bit host machines, making it impossible to run on older hardware |
|
- Does not provide complete virtualization but relies on the Linux kernel provided by the host |
|
- On OSX and Windows this means a VM must be present ([boot2docker](http://boot2docker.io/) installs [VirtualBox](https://www.virtualbox.org/) for this) |
|
|
|
# Getting started on OSX & Windows |
|
|
|
- Install [boot2docker](http://boot2docker.io/) |
|
- `docker pull <username>/<image_name>` to download an existing image from the registry |
|
- eg `docker pull ubuntu` notice there's no username here, because this is an 'official repo' |
|
- after `pull` then `run` |
|
- or simply `run`, which will `pull`, `create` and `run` in one step |
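The naming rule above (official repos have no username, everything else is `<username>/<image_name>`) can be sketched in a few lines of shell. The `image_ref` helper and the variable names are hypothetical, made up for illustration; the script only prints the `docker` commands so they can be inspected before being typed in.

```shell
#!/bin/sh
# Hypothetical helper: build an image reference from an optional
# username and an image name.
image_ref() {
  if [ -n "$1" ]; then
    echo "$1/$2"
  else
    echo "$2"
  fi
}

OFFICIAL="$(image_ref "" ubuntu)"          # official repo: no username
USER_IMAGE="$(image_ref rocker rstudio)"   # user or organisation repo

# Two explicit steps: download the image, then start a container from it.
echo "docker pull $OFFICIAL"
echo "docker run -it $OFFICIAL"

# One step: `run` pulls the image first if it is not available locally.
echo "docker run -dp 8787:8787 $USER_IMAGE"
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without Docker.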
|
|
|
# `docker run` and common [flags](https://docs.docker.com/reference/run/): |
|
|
|
-i Interactive (usually used with -t) |
|
-t TTY: Allocate a pseudo-TTY (basically a terminal interface for a CLI) |
|
-p Publish Ports: -p <host port>:<container port> |
|
-d Detached mode: run the container in the background (opposite of -i -t) |
|
-v Volume: mount a host directory (or a volume) into your container: -v <host path>:<container path>; a VOLUME instruction in the Dockerfile marks the intended mount point |
|
--rm=true remove your container from the host when it stops running (in older Docker versions this cannot be combined with -d) |
|
|
|
- eg `docker run -it ubuntu` # gets ubuntu and gives us a terminal for interaction |
|
- eg `docker run -dp 8787:8787 rocker/rstudio` # gets R & RStudio and opens port 8787 for using RStudio server in a web browser at localhost:8787 (linux) or 192.168.59.103:8787 (Windows, OSX) |
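As a rough sketch of how the `-d` and `-p` flags fit together, the script below assembles the RStudio command from a few settings and prints it with the matching browser address. The variable names are assumptions made for illustration, and the URL assumes a Linux host, where the container is reachable at `localhost`.

```shell
#!/bin/sh
# Sketch: build a `docker run` command for RStudio Server from a few settings.
HOST_PORT=8787          # port on the host (hypothetical variable name)
CONTAINER_PORT=8787     # port RStudio Server listens on inside the container
IMAGE="rocker/rstudio"

# -d runs the container in the background;
# -p maps <host port>:<container port>.
CMD="docker run -d -p ${HOST_PORT}:${CONTAINER_PORT} ${IMAGE}"
echo "$CMD"

# Where to point the browser once the container is up (Linux host assumed).
URL="http://localhost:${HOST_PORT}"
echo "$URL"

# Uncomment to actually start the container:
# $CMD
```

Changing only `HOST_PORT` is enough to run a second RStudio container alongside the first, because the container port stays the same.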
|
|
|
# [Interacting with docker at the command line](https://docs.docker.com/reference/commandline/cli/) |
|
|
|
docker ps # list all the running containers on the host |
|
docker ps -a # list all the containers on the host, including those that have stopped |
|
docker exec -it <container-id> bash # opens bash shell for a currently running container |
|
docker stop <container-id> # stop a running container |
|
docker kill <container-id> # similar to docker stop, but it's more forceful, sending a SIGKILL to the command the container is running |
|
docker rm <container-id> # removes (deletes) a container. |
|
docker rmi <image-id> # removes (deletes) an image. |
|
docker rm -f $(docker ps -a -q) # remove all current containers |
|
docker rmi -f $(docker images -q) # remove all images |
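The commands above can be strung together into one container's life cycle. This is a minimal sketch, not a definitive workflow: the `run` helper, the `DRY_RUN` switch and the container name `rstudio-demo` are all hypothetical, and the `--name` flag is used so that a fixed name can stand in for the container ID. By default the script only prints each command; set `DRY_RUN=0` to execute them against a real Docker daemon.

```shell
#!/bin/sh
# Sketch: one container's life cycle, from start to removal.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

NAME="rstudio-demo"   # hypothetical name, usable wherever an ID is expected

run docker run -d --name "$NAME" -p 8787:8787 rocker/rstudio  # start in background
run docker ps                         # the new container should be listed
run docker exec "$NAME" R --version   # run a one-off command inside it
run docker stop "$NAME"               # stop it gracefully
run docker ps -a                      # stopped containers still show up here
run docker rm "$NAME"                 # delete the stopped container
```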
|
|
|
# [Writing a dockerfile](https://docs.docker.com/articles/dockerfile_best-practices/) |
|
|
|
- it is possible to use `docker commit <container>` to commit a container's file changes or settings into a new image, but it is better to use Dockerfiles & git to manage your images in a documented and maintainable way |
|
- A Dockerfile is a short plain text file that is a recipe for making a docker image |
|
|
|
# Dockerfile elements |
|
|
|
- FROM instruction specifies which base image your image is built on (for the rocker images, ultimately back to Debian) |
|
- MAINTAINER instruction specifies who created and maintains the image. |
|
- CMD, specifies the command to run immediately when a container is started from this image, unless you specify a different command. |
|
- ADD instruction will copy new files from a source and add them to the container's filesystem at the given path |
|
- RUN instruction does just that: it runs a command while the image is being built (eg. `apt-get`) |
|
- EXPOSE instruction tells Docker that the container will listen on the specified port when it starts |
|
- VOLUME instruction will create a mount point with the specified name and tell Docker that the volume may be mounted by the host |
|
|
|
- Moderately complex example: https://github.com/rocker-org/hadleyverse/blob/master/Dockerfile |
|
- To build an image from a dockerfile: `docker build --rm -t <username>/<image_name> <path to the directory containing the Dockerfile>` |
|
- To send an image to the registry: `docker push <username>/<image_name>` # need to be registered at https://hub.docker.com/ |
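To make the instructions above concrete, here is a minimal sketch that writes a small Dockerfile into a temporary build directory and prints the matching build command. Everything specific in it is an assumption made for illustration: the image tag `username/my-analysis`, the maintainer details, the script name `analysis.R`, the choice of `pandoc` as a system dependency, and the use of `rocker/r-base` as the base image.

```shell
#!/bin/sh
# Sketch: write a small Dockerfile into a fresh build directory.
BUILD_DIR="$(mktemp -d)"             # build context (temporary directory)
IMAGE_TAG="username/my-analysis"     # hypothetical <username>/<image_name>

cat > "$BUILD_DIR/Dockerfile" <<'EOF'
# Base image: R on Debian from the rocker project
FROM rocker/r-base
MAINTAINER Your Name <you@example.com>
# Install a system dependency at build time
RUN apt-get update && apt-get install -y --no-install-recommends pandoc
# Copy a (hypothetical) analysis script into the image
ADD analysis.R /home/analysis/analysis.R
# Mount point that may be attached to a host folder
VOLUME /home/analysis/output
# Default command when a container starts from this image
CMD ["Rscript", "/home/analysis/analysis.R"]
EOF

# A placeholder script so that ADD has something to copy.
printf '%s\n' 'cat("hello from the container\n")' > "$BUILD_DIR/analysis.R"

# The last argument is the build context directory, not the Dockerfile itself.
echo "docker build --rm -t $IMAGE_TAG $BUILD_DIR"
```

Keeping the Dockerfile and the files it adds in one directory under git is what makes the image rebuildable from scratch.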
|
|
|
# [Automated Docker image build testing](https://circleci.com/) |
|
|
|
- Automated image build testing on a new commit to the Dockerfile |
|
- Analogous to the Travis CI service, and also provides a status badge (shield) |
|
- Requires a `circle.yml` file in the GitHub repo, eg. https://github.com/benmarwick/1989-excavation-report-Madjebebe/blob/master/circle.yml |
|
- Pushes new image to hub on successful completion of test |
|
|
|
# Doing research with RStudio and Docker |
|
|
|
- The [rocker project](https://github.com/rocker-org/) provides images that include R, key packages and other dependencies (RStudio, pandoc, LaTeX, etc.), and has excellent documentation on the github wiki (https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image) |
|
- run RStudio server in the browser, with host folder as volume |
|
- eg `docker run -dp 8787:8787 -v /c/Users/marwick/docker:/home/rstudio/ -e ROOT=TRUE rocker/hadleyverse` |
|
- `-dp 8787:8787` gives me a port for the web browser to access RStudio |
|
- `-v /c/Users/marwick/docker:/home/rstudio/` gives me read and write access both ways between Windows (C:/Users/marwick/docker) and RStudio |
|
- `-e ROOT=TRUE` sets an environment variable to enable root access for me so I can manage dependencies |
|
- I can access the docker (Debian) shell via RStudio for file manipulation, etc. (or `docker exec -it <container-id> bash`) |
|
- I store scripts on the host volume because version control is simpler this way, but do development and analysis in the container for isolation |
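The same host-folder setup can also be used without the browser, to rerun an analysis script in a throwaway container. The sketch below only prints the command. It rests on several assumptions: the script name `analysis.R` is hypothetical, the `-w` flag (working directory) is not used elsewhere in these notes, the image is assumed to allow its default command to be replaced by `Rscript`, and the host path is assumed to contain no spaces.

```shell
#!/bin/sh
# Sketch: run an R script from a host folder inside a throwaway container.
HOST_DIR="${HOST_DIR:-$PWD}"     # host folder holding the scripts
CONTAINER_DIR="/home/rstudio"    # where that folder appears in the container
IMAGE="rocker/hadleyverse"
SCRIPT="analysis.R"              # hypothetical script name

# --rm removes the container when the script finishes;
# -v shares the host folder; -w makes it the working directory.
CMD="docker run --rm -v ${HOST_DIR}:${CONTAINER_DIR} -w ${CONTAINER_DIR} ${IMAGE} Rscript ${SCRIPT}"
echo "$CMD"
```

Because the scripts and outputs stay on the host, the container can be deleted after every run while the results remain under version control.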
|
|
|
# ...and IPython |
|
|
|
- Choose your favourite from the registry: https://registry.hub.docker.com/search?q=ipython&s=downloads |
|
- the IPython project has a few images, and there are many user-contributed ones |
|
|
|
# Cloud computing with docker is widely supported |
|
|
|
- Amazon EC2 Container Service: docker clusters in the cloud (no registry) |
|
- Google Compute Engine: has container-optimized VMs |
|
- Google container registry: secure private docker image storage on google cloud platform |
|
- Microsoft Azure supports docker containers (docker hub is integrated) |
|
|
|
# References & further reading |
|
|
|
- http://arxiv-web3.library.cornell.edu/pdf/1410.0846v1.pdf |
|
- http://sites.duke.edu/researchcomputing/tag/docker/ |
|
- https://rc.duke.edu/duke-docker-day-was-great/ |
|
- https://github.com/LinuxAtDuke/Intro-To-Docker |
|
- http://reproducible-research.github.io/scipy-tutorial-2014/environment/docker/ |
|
- http://ropensci.org/blog/2014/10/23/introducing-rocker/ |
|
- https://github.com/wsargent/docker-cheat-sheet |