Here are the links to various official Snakemake things.
- Main site: https://snakemake.github.io/
- Paper: https://f1000research.com/articles/10-33/v2
- Tutorial: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html
- Full documentation: https://snakemake.readthedocs.io/en/stable/
The tutorial is pretty good, but I'll try to brain dump some information below that might help.
In programming and bioinformatics one of the situations we always find ourselves in is:
I have file X, I want to run tool T on it to produce file Y.
This happens constantly, all the time, especially when you're working in a Linux/UNIX-style environment. The tool you use and how you need to call it can vary each time, and sometimes there's more than one input or output file, but the general pattern of "input file(s) → tool → output file(s)" is something you have to do over and over and over again, e.g.:
- You have photo-1234.jpg and want to resize it to create a small preview image for it:
convert photo-1234.jpg -resize 100x100 photo-1234-preview.jpg
- You have a FASTQ file and want to get a report on quality metrics:
fastqc -o SRA12345_R1_Report/ SRA12345_R1.fastq.gz
- You have a pair of FASTQ files representing paired-end reads for a bacterial sample and want to de novo assemble a genome for it:
spades.py --pe1-1 SRA12345_R1.fastq.gz --pe1-2 SRA12345_R2.fastq.gz -o SRA12345_R2_assembly/
- You have a python (or R or whatever) script that takes an assembly, performs some analysis on it, and generates a plot:
python perform_analysis.py --assembly SRA12345_R2_assembly/contigs.fasta --output plot.pdf
If you just have one or a couple of tasks like this to perform, doing it manually at the command line is fine and lets you get your result and move on with your life.
If you have a lot of these tasks to perform, it can be easier to make a script to do it for you instead of typing out the commands by hand:
#!/usr/bin/env bash
set -euo pipefail
convert photo-001.jpg -resize 100x100 photo-001-preview.jpg
convert photo-002.jpg -resize 100x100 photo-002-preview.jpg
convert photo-003.jpg -resize 100x100 photo-003-preview.jpg
convert photo-004.jpg -resize 100x100 photo-004-preview.jpg
convert photo-005.jpg -resize 100x100 photo-001-preview.jpg
convert photo-006.jpg -resize 100x100 photo-006-preview.jpg
convert photo-007.jpg -resize 100x100 photo-007-preview.jpg
This is probably easier than typing it by hand, but is still kind of error prone (notice the typo in one of the commands). So maybe you make a slightly fancier bash script:
#!/usr/bin/env bash
set -euo pipefail
function make_preview {
    convert "${1}.jpg" -resize 100x100 "${1}-preview.jpg"
}
make_preview photo-001
make_preview photo-002
make_preview photo-003
make_preview photo-004
make_preview photo-005
make_preview photo-006
make_preview photo-007
That's a little better, but there are still some things we might want to improve. If we edit one of the photos and rerun the script, it'll recreate all the preview images even though we've only changed one of them. We could hack something into the script to try to handle that:
#!/usr/bin/env bash
set -euo pipefail
function make_preview {
    IN="${1}.jpg"
    OUT="${1}-preview.jpg"
    if test "$IN" -nt "$OUT"; then
        # Only do the conversion if the input is newer than the output (-nt
        # means "newer than").
        convert "$IN" -resize 100x100 "$OUT"
    fi
}
make_preview photo-001
make_preview photo-002
make_preview photo-003
make_preview photo-004
make_preview photo-005
make_preview photo-006
make_preview photo-007
So that works, and is kind of okay, but hacking this together for every single tool one-by-one is a pain, especially if a particular step has multiple input files (now you have to check if any input is newer than the output), multiple output files (is the input newer than any output file), or both.
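For example, a hand-rolled check for a step with two inputs might look something like this (a rough sketch reusing the SPAdes command from above; the file and directory names are made up for illustration):

#!/usr/bin/env bash
set -euo pipefail

R1="SRA12345_R1.fastq.gz"
R2="SRA12345_R2.fastq.gz"
CONTIGS="SRA12345_assembly/contigs.fasta"

# Re-run the assembly if either input FASTQ is newer than the existing
# contigs file (-nt is also true when the contigs file doesn't exist yet).
if test "$R1" -nt "$CONTIGS" || test "$R2" -nt "$CONTIGS"; then
    spades.py --pe1-1 "$R1" --pe1-2 "$R2" -o SRA12345_assembly/
fi

And that's still just one step with two inputs and one output; writing this kind of bookkeeping for a whole pipeline gets old fast.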
Another pattern you often have to deal with is when the output of one tool becomes the input for another, e.g.:
Raw FASTQ Files
↓
[Trimming program (e.g. trimmomatic)]
↓
Trimmed FASTQ Files
↓
[Assembler (e.g. Spades)]
↓
Assembled FASTA file
Often it gets even more complicated, where it's not a single linear list of X→Y→Z. The flow for our project will probably look something like this:
Like before, we only want to rebuild an output if the input changed. But now that we have a graph, this isn't necessarily limited to a single file. If we change an input file, we only need to rebuild the output files that depend (transitively) on that file. In our example, if we update the reference there's no need to waste time trimming the FASTQs again.
This problem has been around for as long as UNIX has. The classic example is compiling a program written in C: you compile each file individually, then link the resulting object files into a program, then copy the program somewhere to install it, etc.
The steps of the entire process together form a graph, specifically a Directed Acyclic Graph (DAG) — it's directed because input vs output is directional, and it's acyclic because you can't require something to already exist in order to build itself. DAGs come up in a lot of places in computer science (e.g. the Git commit history is a DAG), so they're worth getting comfortable with for their own sake.
Back to the problem, we have a DAG of tasks we need to do:
Doing all of this by hand would be tedious and error prone, and programmers are often lazy and want to make the computer do the work for them. In the mid 1970s some people at Bell Labs created a tool called make, which lets you define the DAG of your input and output files and handles creating the outputs whenever they don't exist and/or whenever their inputs change. make's syntax is infamously bad, but let's take a look just for the hell of it:
%-preview.jpg: %.jpg
	convert $< -resize 100x100 $@
This says "in order to make a file named whatever-preview.jpg
, you need to
have a file named whatever.jpg
, and then you can create it by running the
following convert …
command". So now we can run make foo-preview.jpg
and
the computer will look for foo.jpg
and use it to build the preview (or spit
out an error if the input file is missing).
make is used a lot in the Linux programming world to compile software, but it's not super user friendly. Snakemake is a tool that's trying to solve the same basic problem as make (take a DAG of inputs/outputs and (re)build any outputs that need it), but is a little friendlier (especially for bioinformatics pipelines). The syntax is Python with a few extra bits on top (hence the "snake" in "Snakemake").
In Snakemake, that make rule from before looks like:
rule preview_photo:
    input:
        "{name}.jpg"
    output:
        "{name}-preview.jpg"
    shell:
        "convert {input} -resize 100x100 {output}"
And we'd run it similarly to how we ran make before: snakemake whatever-preview.jpg.
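Just to give a bit more of a flavor, here's a rough sketch of what the trimming → assembly chain from earlier could look like as two Snakemake rules. The file naming scheme is made up, the trimming command is a placeholder rather than a real trimmomatic invocation, and the SPAdes flags are the ones from the example near the top, so treat this as an illustration rather than copy-paste-ready code:

rule trim:
    input:
        r1="{sample}_R1.fastq.gz",
        r2="{sample}_R2.fastq.gz"
    output:
        r1="{sample}_R1.trimmed.fastq.gz",
        r2="{sample}_R2.trimmed.fastq.gz"
    # Placeholder command; swap in your real trimming tool and options here.
    shell:
        "some_trimmer --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2}"

rule assemble:
    input:
        r1="{sample}_R1.trimmed.fastq.gz",
        r2="{sample}_R2.trimmed.fastq.gz"
    output:
        "{sample}_assembly/contigs.fasta"
    # Inside shell commands, wildcards are referenced as {wildcards.sample}.
    shell:
        "spades.py --pe1-1 {input.r1} --pe1-2 {input.r2} -o {wildcards.sample}_assembly/"

Asking for snakemake SRA12345_assembly/contigs.fasta would then make Snakemake notice that it first needs the trimmed FASTQs, which in turn come from the raw FASTQs, and it runs the two rules in the right order.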
There's something important that's easy to miss when you're skimming over the examples, but is really important to wrap your head around when working with Snakemake: rules and invocations of Snakemake are always based on the output, not the input.
For example: our rule above takes a photo as input and generates a preview. To call that rule we tell Snakemake "please build whatever-preview.jpg" and Snakemake will figure out the input it needs itself. There's no way to say "please take the input file whatever.jpg and make stuff from it"; you always start at the outputs and work backwards from there. It takes some time to get used to working in that direction, but it's not too bad once you get some practice.
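Since you always ask for outputs, the usual trick is to add a rule (conventionally named all) at the top of the Snakefile whose inputs are the final files you want, so you don't have to type output filenames on the command line every time. A minimal sketch, assuming the preview_photo rule from above lives in the same Snakefile and that these photos actually exist:

# Hypothetical list of photos we want previews for.
PHOTOS = ["photo-001", "photo-002", "photo-003"]

# The first rule in the file is the default target, so running plain
# snakemake --cores 1 asks for every file listed here.
rule all:
    input:
        # expand() fills in the {name} placeholder once per photo in the list.
        expand("{name}-preview.jpg", name=PHOTOS)

Snakemake then works backwards from that list and only rebuilds the previews that are missing or older than their photos. (Recent Snakemake versions require the --cores flag so it knows how many jobs it's allowed to run at once.)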
Defining all the steps in your workflow in Snakemake takes a bit of time and practice, but once you've done it you get a lot of stuff for free:
- You can rebuild everything with a single command, without any risk of making typos along the way.
- Snakemake is smart enough to only rebuild the necessary parts of the DAG when something changes, instead of building everything every time.
- Snakemake can plug into Slurm on Great Lakes, so it can run individual tasks on separate cluster nodes. So if you need to run 48 assemblies, it can run them all in parallel on 48 different nodes to finish faster.
- You can get some basic statistics (e.g. run time, maximum memory used, etc.) automatically with a line or two of Snakemake config (there's a sketch of this just after this list).
- Snakemake can handle running tools from Singularity containers if you want it to, which can help solve the installation/dependency hell that a lot of bioinformatics tools (especially Python-based ones) end up in.
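To make those last two bullets a bit more concrete, here's a rough sketch of the assembly rule from earlier with a benchmark file and a container added. The benchmark: and container: directives are the real Snakemake features; the image URI and file names are made up for illustration:

rule assemble:
    input:
        r1="{sample}_R1.trimmed.fastq.gz",
        r2="{sample}_R2.trimmed.fastq.gz"
    output:
        "{sample}_assembly/contigs.fasta"
    # Snakemake writes run time, maximum memory used, etc. to this file each
    # time the rule runs.
    benchmark:
        "benchmarks/{sample}_assembly.txt"
    # Hypothetical image URI; pass --use-singularity when running Snakemake so
    # the command actually runs inside the container.
    container:
        "docker://example/spades:latest"
    shell:
        "spades.py --pe1-1 {input.r1} --pe1-2 {input.r2} -o {wildcards.sample}_assembly/"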
It's worth going through the Snakemake tutorial if you've never used it before to get an idea of how it works: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html
Note: the cluster admins recently added Snakemake as a module on Great Lakes, so you don't have to mess around installing it yourself if you want to give it a try; you can just module load snakemake and you're all set.