To use this technology, we need Singularity installed in the target environment by root, which is possibly the largest obstacle for us.
Here is a container for analysing 16S rRNA data in MicrobiomeDB, which requires the R package DADA2. DADA2 is released through Bioconductor, but we want the ability to load an arbitrary commit for development.
We can build on somebody's container with `r-devtools`, and add our libraries:
Bootstrap: docker
From: zamora/r-devtools
%post
R --version
R --slave -e 'install.packages("data.table",repos="https://cran.rstudio.com/")'
R --slave -e 'install.packages("optparse",repos="https://cran.rstudio.com/")'
R --slave -e 'library("devtools"); devtools::install_github("benjjneb/dada2", ref="v1.14")'
%test
R --slave -e 'packageVersion("dada2")'
This is how the container gets built and used:
sudo singularity build ./our-container.simg <the file above>
singularity exec ./our-container.simg R --slave -e 'packageVersion("dada2")'
singularity exec ./our-container.simg Rscript test.R
It works well enough for testing. The container is 574 MB, so one can build it on a laptop, send it off to a cluster, and run it.
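For example, shipping the image to the cluster is just a file copy; the host name and destination directory below are placeholders, not our actual setup:
# copy the image to the cluster (host and destination are placeholders)
scp ./our-container.simg user@cluster:/path/on/cluster/
# then run it there exactly as above, e.g.
# singularity exec /path/on/cluster/our-container.simg Rscript test.R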
This isn't quite enterprise-ready:
- what version of R is it? This actually depends on when `zamora` built their `r-devtools` container, and happens to be 3.6.0 at the time of writing.
- where do I keep this file?
- where do I keep the container? Ideally it should be made once and then available to everyone who wants to use it.
SingularityHub is a public resource that can build containers for us if we add the singularity files to GitHub.
SingularityHub has a naming convention (https://singularityhub.github.io/singularityhub-docs/docs/getting-started/naming). Our group also has a convention: we keep all code that runs in distributed environments in https://github.com/VEuPathDB/DJob.
This suggests the above file should go somewhere like https://github.com/VEuPathDB/DJob/tree/master/DistribJobTasks/lib/containers.
The name of the file needs to start with `Singularity`.
The name is going to be how our pipelines refer to the container, so it should probably include:
- name of the VEuPathDB project, if it's for a single project
- name of the analysis the container is used for
- something about what the container provides (here, R)
Perhaps: `Singularity.MicrobiomeDB-16srRNA-R`.
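Putting the repository location and the naming rule together, the recipe would sit at a path like this (the exact directory is only a suggestion):
# hypothetical location within the VEuPathDB/DJob repository
DistribJobTasks/lib/containers/Singularity.MicrobiomeDB-16srRNA-R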
We can integrate SingularityHub with our repositories, so that pushing the file to GitHub will build a container for us. Then we could use our containers like we can already use public containers - compare with the Rocker project's base container:
singularity pull shub://r-base:3.6.2
singularity exec ./r-base-3.6.2.simg R # A quick R session
singularity pull shub://VEuPathDB:MicrobiomeDB-16srRNA-R
singularity exec ./VEuPathDB-MicrobiomeDB-16srRNA-R R # an R session including our libraries
We already use the cluster from our workflows elsewhere - our workflow software knows how to connect to the cluster, copy files to and from it, and orchestrate jobs. It requires an environment that is prepared as follows:
- make a user account, install SSH keys, etc.
- make sure the `PATH` of our user includes `/project/eupathdblab/workflow-software/bin` and source code locations like `GUS_HOME`
- install third party software
- copy source code to the cluster
`/project/eupathdblab/workflow-software` is managed by the whole project. Our code uses the tools there by assuming they are on `PATH`.
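For concreteness, a minimal sketch of that per-user environment, assuming `GUS_HOME` points at a checkout in the home directory (the real value is site-specific):
# sketch of the environment assumed above; the GUS_HOME value is an assumption
export GUS_HOME=$HOME/gus_home
export PATH=/project/eupathdblab/workflow-software/bin:$GUS_HOME/bin:$PATH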
If we want our code to call a container through Singularity, we need to satisfy some assumptions:
- Singularity is installed and on `PATH`
- The right container is somewhere on the cluster, and its location can be known
There is a program called `sregistry`, a registry for Singularity containers: https://singularityhub.github.io/sregistry-cli/
We can install it in `/project/eupathdblab/workflow-software` and make sure `sregistry` is on `PATH`.
This would possibly be the last program we need to install. :)
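A sketch of what that installation could look like, assuming a virtualenv inside the shared software area (the layout and tooling here are assumptions):
# sketch: install sregistry into the shared software area via a virtualenv
python3 -m venv /project/eupathdblab/workflow-software/sregistry-env
/project/eupathdblab/workflow-software/sregistry-env/bin/pip install sregistry
# expose the client on the bin directory everyone already has on PATH
ln -s /project/eupathdblab/workflow-software/sregistry-env/bin/sregistry \
      /project/eupathdblab/workflow-software/bin/sregistry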
# Get a container corresponding to a Singularity file we added to GitHub, and add to the registry
sregistry pull shub://VEuPathDB:MicrobiomeDB-16srRNA-R
# Add a local image to the registry
sregistry add --name VEuPathDB-MicrobiomeDB-16srRNA-R dev-container.simg
# From the pipeline code
singularity exec $(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R) Rscript dada2-filterAndTrimFastqs.R $workflowDir/$projectName/input/fastqs
In summary:
- we write container files, and publish them to GitHub
- we configure SingularityHub, which builds the containers for us as a service
- we keep a registry on the cluster, which keeps track of which container names correspond to which image files
- when we want the code to run a new or different container image, we interact with the registry through pull/add
- the code we write refers to container names
Making a container for a project has some immediate benefits even if the project can't yet be deployed that way - it keeps us on top of what needs to get installed and where, and so on.
If we containerise everything, our commitment to any particular cluster environment can go away completely - switching to a different cluster or cloud provider would be as simple as installing `singularity` and `sregistry`, and fetching the containers we need.
Perl modules on CPAN and elsewhere frequently have a `cpanfile` listing modules that need to be installed for the project to work. If we were to containerise our workflows completely, listing the Perl modules that need to be installed for each one would be a prerequisite.
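As a sketch of how such a list would be consumed during a container build, assuming we adopt cpanfiles and install with `cpanm` (the tool choice and paths are assumptions, not something we do today):
# inside a %post section: install the Perl dependencies a workflow declares in its cpanfile
cpanm --installdeps /path/to/workflow   # reads the cpanfile (or META files) in that directory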
A Snakemake rule can call any script, so it can also call `singularity exec $(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R) Rscript buildErrorModelsFromFastqsFolderUsingDADA2.R $workflowDir/$projectName/input/fastqs` or something similar. There's a bit of support built in to simplify the syntax:
https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers
I'm not sure how this can work with `sregistry`, or how Snakemake runs the containers. We do want to stay in charge of which container images are used, and we want to minimise the number of container pulls: pulls are slow, and SingularityHub has a limit - so if Snakemake pulls the containers before running the rule, this wouldn't work for us.
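One way to keep that control, sketched under the assumption that we resolve images ourselves and only ever hand the rule a local path:
# fetch once, outside any workflow run, so SingularityHub's pull limit is only hit deliberately
sregistry pull shub://VEuPathDB:MicrobiomeDB-16srRNA-R
# at run time, resolve the name to a local image path and use that in the rule's shell command
IMAGE=$(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R)
singularity exec "$IMAGE" Rscript buildErrorModelsFromFastqsFolderUsingDADA2.R "$workflowDir/$projectName/input/fastqs"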