Docker is a tool for bundling applications and their dependencies into images that can then be run as containers on many different types of computers. This is useful to scientists because:
- It greatly simplifies the distribution and installation of complex workflows that have many interacting components.
- It reduces the friction of moving analyses around. You may, for example, want to explore a tool or dataset on your own laptop and then move that analysis environment to a cluster. Rather than spend time configuring your laptop and then spend time configuring your server, you can run the analyses within Docker on both.
- It is a convenient way to share your analyses and to help make them transparent and reproducible.
- It makes a lot of technical advances and investments made in industry accessible to scientists.
- Docker containers launch very quickly and Docker itself has almost no overhead. This makes it a very convenient way to create clean analysis environments.
There are some potential downsides:
- You have to learn a bit about Docker, in addition to everything else you are already doing.
- Docker wasn't designed to run on a large shared computer with many users, like an academic research cluster. Security concerns have led most academic research computing centers not to support running Docker containers on their clusters (in fact, I'm not aware of any that do). But you can run it on your own computer and on a variety of cloud computing services. If you like containers but Docker is not a great fit for your research computing environment, you should also consider Singularity.
- Docker creates another layer of abstraction between you and your analyses. It requires, for example, extra steps to get your files in and out of the analysis environment.
There are already a variety of great Docker tutorials out there. The purpose of this document is to present a streamlined introduction to common tasks in our lab, and to serve as a quick reference for those tasks so we know where to find them and can execute them consistently.
- A Docker Image is the snapshot that containers are launched from. It is roughly equivalent to a virtual machine image. Images are read-only files.
- A Docker Container is a running instance based on one or more images. It contains an application and everything the application needs to run. Containers are ephemeral; all changes are lost when they are terminated.
- A Docker Registry is where images are stored. A registry can be public (e.g. DockerHub) or your own. One registry can contain multiple repositories.
- Docker has a server/client architecture. The `docker` command is the command-line client; Kitematic is the GUI client. Servers and clients can be on the same or different machines.
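For example, `docker version` is a quick way to see this architecture in action: it reports version information for both the client and the server, and it will complain if the server is not reachable:

```
docker version
```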
Head to http://docker.com, click on "Get Docker", and follow the instructions. Once it is running, go to the Docker menu > Preferences... > Advanced and adjust the computing resources dedicated to Docker.
To test your Docker installation, run:

```
docker run hello-world
```
You can run a Docker container (a virtual container) on an Amazon EC2 instance (a virtual computer). Why run your analyses in a virtual container on a virtual computer rather than directly on the virtual computer? Because it is more portable. You can play with a Docker container on your local computer, then launch an identical container on Amazon from the same image when you need to scale up your analyses. There is no need to reconfigure and reinstall in a new environment.
A good resource for running Docker on AWS (and the template for what I present here) is http://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html.
Create an EC2 instance running Amazon Linux. If you need anything more than ssh access, add a custom rule to open a port (e.g. 8787 for the RStudio tutorial below).
Log in to the running EC2 instance using the URL provided in the console and the key you configured, e.g.:

```
ssh -i biolite.pem ec2-user@ec2-x-x-x.compute-1.amazonaws.com
```
Run the following on the EC2 instance:

```
sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user
```
Log out and then log back in. You can now run Docker within the running EC2 instance, for example:
```
docker run hello-world
```
Docker is even simpler to set up on Digital Ocean. Here are the basics:
- Sign into https://www.digitalocean.com
- Click Create a droplet
- Select One-click apps
- Select a Docker image
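Once the droplet is up, you can confirm Docker is working by connecting over ssh and running the hello-world test. The IP address below is a placeholder; use your droplet's address from the DigitalOcean console, and the default `root` login unless you configured another user:

```
ssh root@203.0.113.10
docker run hello-world
```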
Here are some of the commands you will use most often with Docker. `docker run` is the most important command; basic use cases and options are below, and a worked example tying several commands together follows the list.

- `docker run [image]` creates a container from the specified image and runs it. If the image is not already cached locally, it will be downloaded from DockerHub. By default, the container runs and then kicks back out to the shell when it is done.
- If you want to interact with the container, you need to launch it with `docker run -it [image]`. Depending on the configuration of the container, you may also need to specify that you want to run a shell, e.g. `docker run -it [image] bash`.
- A container only runs as long as the `docker run` process runs. Once you exit from a container, it is gone along with all its changes. `docker run -d [image]` detaches the container, i.e. puts it in the background when it is launched.
- `docker ps` shows running containers. Each container can be referred to by name or ID.
- `docker logs` shows the output of Docker containers.
- `docker stats` shows the resource utilization (CPU, memory, etc.) of running containers.
- `docker stop` stops a running container. For example, `docker stop 8a6f2b255457`, where `8a6f2b255457` is the container ID you got from running `docker ps`.
- `ctrl+p` followed by `ctrl+q` allows you to get out of a container without exiting it.
- `docker exec -it [id] bash` opens an interactive shell on a running container.
- `docker commit -m "message about commit..." [id]` creates an image from a running container. You can then move this image to another Docker host to run it somewhere else, use it as a checkpoint in a long analysis if you think you might want to return to this point, or share the image with others.
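As a worked example tying these together (a sketch, using the `ubuntu` image and `sleep 600` as a stand-in long-running process):

```
# Launch a detached container with a long-running process
docker run -d ubuntu sleep 600

# Get the container ID, then inspect its output and resource use
docker ps
docker logs [container_id]
docker stats [container_id]

# Open an interactive shell in the running container, then exit
docker exec -it [container_id] bash

# Stop the container when done
docker stop [container_id]
```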
A Docker container has its own file system. By default, it cannot read or write host files or access host services (like databases). There are a few ways to get data in and out of a container (a combined example follows the list):

- To mount host files within the Docker container so that they can be modified and read, run the container with `docker run -v`.
- To open ports on the Docker container for incoming web, database, and other access, run the container with `docker run -p`. This allows you, for example, to serve web sites from the container.
- Use the `docker cp` command on the host to get files in and out of the container. First run `docker ps` to get the Container ID, then run e.g. `docker cp a14d91a64fea:/root/tmp Downloads/` to copy the `/root/tmp` directory from the container to the `Downloads/` directory of the host.
- From within the container, use standard network protocols and tools such as `git`, `wget`, and `sftp` to move files in and out of the container.
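For example, here is a hypothetical `docker run` call that combines `-v` and `-p` (the `/data` mount point and the port are arbitrary choices):

```
# Mount the host's current directory at /data in the container,
# and map port 8787 in the container to port 8787 on the host
docker run -it -v "$(pwd)":/data -p 8787:8787 ubuntu bash
```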
There are two general ways to distribute images. You can distribute a Dockerfile that has all the information needed to build an image. Dockerfiles provide explicit provenance and are very small. You can also distribute actual images. These can be huge (e.g. several gigabytes) and it isn't always clear what is in them, but they are ready to go.
There is excellent documentation on Dockerfiles in the Dockerfile reference, and the Dockerfile Best Practices document is a great way to learn how to put this information to use.
You create a Docker image from a Dockerfile with the `docker build` command. It takes the path to the directory where the Dockerfile is. Note that the Dockerfile itself always needs to be called `Dockerfile`. For example, you could `cd` to the directory where your Dockerfile is and then run:

```
docker build .
```
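As a minimal sketch of the full cycle (the image name `mytest` and the `wget` install are arbitrary examples), you could write a tiny Dockerfile, then build and tag an image from it; the `-t` option gives the image a convenient name:

```
# Create a minimal Dockerfile in the current directory
cat > Dockerfile <<'EOF'
FROM ubuntu
RUN apt-get update && apt-get install -y wget
EOF

# Build the image and name it with -t
docker build -t mytest .

# Launch a container from the new image
docker run -it mytest bash
```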
You can also provide a URL to the Dockerfile, e.g.:

```
docker build https://bitbucket.org/caseywdunn/agalma/raw/master/dev/docker-ubuntu-git/Dockerfile
```
You can also provide a GitHub repo:

```
docker build https://github.com/caseywdunn/comparative_expression_2017.git#revision:docker
```
Here `revision` is the branch and `docker` is the folder within the repo that contains the `Dockerfile`. This last approach is a great way to distribute images without people even needing to clone your repository; they just need a single line to get and build the image.
This section explains how to use Docker for a few example tasks.
It is often convenient to test analyses and tools on a clean Linux installation. You could make a fresh installation of Linux on a computer (which can take an hour or more and requires an available computer), launch a fresh Linux virtual machine on your own computer or a cloud service (this often takes a few minutes), or launch a Docker container (which takes seconds).
This command launches a clean interactive Ubuntu container:

```
docker run -it ubuntu
```
Note that this is a stripped-down image that has a couple of differences from a default Ubuntu installation. In particular, it runs with root access, so there is no `sudo` command.
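Since you are root in the container, you can install tools directly with `apt-get` (the packages below are just examples). Note that the stripped-down image also ships without cached package lists, so run `apt-get update` first:

```
# No sudo needed; you are already root
apt-get update
apt-get install -y git wget
```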
A Docker image is available for our phylogenetic and gene expression tool Agalma. Take a look at the README for more information and instructions.
Docker is a convenient way to run RStudio if you want a clean analysis environment or need more computing power than is available on your local computer.
Launch an RStudio container with:

```
docker run -dp 8787:8787 rocker/tidyverse
```
Point your browser to xxx:8787, where xxx is the IP address of your instance. It will be `localhost` if you are running Docker locally on a Mac. If you are running Docker on a cloud instance, you will get the address from the cloud console. Enter `rstudio` for both the username and password. You now have access to an RStudio instance running in a container!
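One caveat: files created inside the container vanish when it is destroyed. A sketch of one way around this, assuming the rocker convention that the RStudio user's home directory is `/home/rstudio`, is to mount a host directory when launching:

```
# Mount the current host directory into the container so files persist
docker run -d -p 8787:8787 -v "$(pwd)":/home/rstudio/project rocker/tidyverse
```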
You can use the git integration features of RStudio to get code and analyses in and out of the container.
R has great package managers, but as projects grow in complexity it can be difficult to install and manage all the packages needed for an analysis. Docker can be a great way to provision R analysis environments in an explicit, reproducible, and customizable way. In our lab we often include a `Dockerfile` with our R projects that builds an image with all the dependencies needed to run the analyses.
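Such a `Dockerfile` often just extends a rocker image and installs the packages the project needs. A minimal sketch (the package names here are placeholders, not actual project dependencies):

```
FROM rocker/tidyverse
# Install the R packages the analyses depend on
RUN R -e "install.packages(c('ape', 'phytools'), repos = 'https://cloud.r-project.org')"
```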
If your analysis is in a private repository, clone the repository to the machine where you will build the image to run the analyses. Then `cd` to the directory with the `Dockerfile`, and run:

```
docker build .
```
If your Dockerfile is in a public git repository, you can specify it with a URL without cloning it, as described above.
Once you have built an image from the Dockerfile and have an image ID, you can run a container based on the image and use git within the container to push your work back to the remote repository as you work.
As an example, consider the R analysis at https://github.com/caseywdunn/executable_paper.
Execute the following to build an image based on the `Dockerfile` in the `docker` folder on the `master` branch of the repository:

```
docker build https://github.com/caseywdunn/executable_paper.git#master:docker
```
If all goes well, this will finish with the line `Successfully built [image_id]`, where `[image_id]` is the image ID for the image you just built.
Run a container based on the image you built above (substituting the ID you got above for `[image_id]`):

```
docker run -d -p 8787:8787 [image_id]
```
The `-d` specifies that the container should detach, giving you back the command line on the host machine while the container runs in the background. The `-p 8787:8787` maps port `8787` in the container to port `8787` on the host, which we will use to connect to the RStudio GUI with a browser in a bit.
Run the following to get the `container_id` for the running container (this is different from the `image_id` that the container is based on):

```
docker ps
```
To get ready to use `git` in the container, you need to configure it with your email address and name. This is done by opening a shell on the container, switching to the `rstudio` user with `su`, and entering some `git` commands (where the `John Doe` bits are replaced with your name and email address):

```
docker exec -it [container_id] bash
chsh -s /bin/bash rstudio # Change the shell to bash so autocomplete, arrows, etc. work
su rstudio
git config --global user.name "John Doe"
git config --global user.email "[email protected]"
```
Next, point your browser to port 8787 at the appropriate IP address (e.g., http://localhost:8787 if running Docker on your local machine, or http://ec2-x-x-x.compute-1.amazonaws.com:8787 if running Docker on an Amazon EC2 instance, where `ec2-x-x-x.compute-1.amazonaws.com` is the public DNS). Sign in with username `rstudio` and password `rstudio`. This will pull up a full RStudio interface in your web browser, powered by R in the container you just made.
Select File > New Project... > Version control > Git, and enter `https://github.com/caseywdunn/executable_paper.git` for the Repository URL. Then hit the "Create" button.
There will now be a Git menu tab on the right side of the RStudio display in the browser window, and you can select the cloned files you want to work on in the lower right side of the page. Click the `mymanuscript.rmd` file to open it.
Now you can work on your project. You change files by saving edits, running code, or knitting the file. In this case, change some text in `mymanuscript.rmd` and save the changes.
Now you can commit your changes. Open the Git tab, click Commit, select the `Staged` checkbox next to the files you have modified, enter a commit message, and click Commit. Then close the commit window that pops up.
Be sure to push your changes back to the remote repository before you destroy the container. This can also be done from the Git window. Click the Push button. You will then be prompted for your username and password. You cannot actually push changes in this example, since you don't have push permission on my example repository.
RStudio is a great tool, but as your analyses grow you may find it is unstable. In particular, it sometimes does not work well when your projects include parallel code (e.g. `mclapply()` calls). In that case, you can still run your analyses at the command line in a Docker container, even when that container is based on a full RStudio image.
First build the image and get an image ID as described above. Then run a container interactively with the following command on the Docker host:

```
docker run -it [image_id] bash
```
From within the container, clone the repository and `cd` into it. For example:

```
git clone https://github.com/caseywdunn/comparative_expression_2017.git
cd comparative_expression_2017
```
To knit the manuscript:

```
git checkout revision # change branches
nohup Rscript -e "library(rmarkdown); render('manuscript.rmd')" &
```
Or, to run all the code at the R console (so you can dig into particular variables, for example):

```
R
```

Then, at the R console:

```
library(knitr)
purl("manuscript.rmd")
source("manuscript.R")
```