First make sure you have a sufficiently modern Java (version 11 or higher). You can check with:
java -version
If your Java is too old you can try installing a newer one using sdkman:
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 11.0.2-open
sdk default java 11.0.2-open
After running these commands once, you will probably want to ensure that your new Java is always used by default. You can do this by adding the following line to your ~/.bash_profile:
source "$HOME/.sdkman/bin/sdkman-init.sh"
Once Java is installed you can install Nextflow. Start by following the instructions on the Nextflow website to create a nextflow executable in your working directory.
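At the time of writing the Nextflow website suggests a one-liner like the following (check the site in case the current command differs):
curl -s https://get.nextflow.io | bash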
When this is done you should move the nextflow executable into a sensible directory and make sure that directory is on your PATH. My suggestion is:
mkdir ~/bin
mv nextflow ~/bin
echo 'export PATH=${PATH}:${HOME}/bin' >> ~/.bash_profile
These commands make a new bin directory in your home directory, move nextflow into it, and then permanently add this bin directory to your PATH.
Test that everything is working by logging out, logging back in again, and typing nextflow. When you do this you should see the Nextflow help instructions, indicating that Nextflow ran successfully. You should now be able to run the nextflow command from anywhere on your system.
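If you prefer a quicker check that doesn't print the full help text, asking for the version should also work:
nextflow -version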
All of the software for the marine omics pipelines is captured within self-contained machine images. The first time you run a workflow it will need to download the appropriate machine image and store it somewhere on your system. These images are moderately large (~1-2 GB) so the download will take a few minutes. In addition to downloading, your system will need to translate the image from docker format into singularity format, which takes longer (sometimes around 10 minutes).
Because these images take so long to download and build, Nextflow will try to store them in a cache. You control the location of cached images with the NXF_SINGULARITY_CACHEDIR environment variable.
A special complication on JCU systems is that genomics1 and genomics2 (the MCB servers) suffer very poor IO performance when writing to your home directory. When working on these systems it is therefore important not to put the singularity cache in your home directory.
The instructions below provide solutions for various situations depending on which machines you intend to use. If you only intend to use zodiac (which does not suffer the home-directory IO problem) it is sufficient to keep the cache in your home directory:
mkdir -p "${HOME}/.nxf/singularity_cache"
echo 'export NXF_SINGULARITY_CACHEDIR=${HOME}/.nxf/singularity_cache' >> ~/.bash_profile
Use this option if you intend to use the marine omics pipelines on all machines (i.e. zodiac and genomics[12]).
First create the required singularity directories (do this while logged in to genomics1 or genomics2; you will need to do it on each machine separately).
Note that jcxxx stands for your JC number.
mkdir -p "${HOME}/.nxf/singularity_cache"
mkdir -p /fast/jcxxx/.nxf/singularity_cache
mkdir -p /fast/jcxxx/tmp
mkdir -p /fast/jcxxx/.singularity
Add the following lines to your ~/.bash_profile:
if [[ ! $HOSTNAME =~ genomics[12] ]]; then
    export NXF_SINGULARITY_CACHEDIR="${HOME}/.nxf/singularity_cache"
else
    export SINGULARITY_CACHEDIR="/fast/jcxxx/.singularity"
    export SINGULARITY_TMPDIR="/fast/jcxxx/tmp"
    export APPTAINER_CACHEDIR="/fast/jcxxx/.singularity"
    export APPTAINER_TMPDIR="/fast/jcxxx/tmp"
    export NXF_SINGULARITY_CACHEDIR="/fast/jcxxx/.nxf/singularity_cache"
fi
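After logging out and back in (or running source ~/.bash_profile) you can verify that the right cache location is in effect on each machine:
echo "$NXF_SINGULARITY_CACHEDIR"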
A good way to test things is to run one of the pipelines that come with built-in tests. To run a test of movp on genomics[12] try
nextflow run marine-omics/movp -latest -profile genomics,test -r main
Or if you want to test on zodiac try
nextflow run marine-omics/movp -latest -profile zodiac,test -r main
Although the default settings are designed to capture common use cases, you might sometimes find that you need to customise resource requests or set custom arguments for some workflow processes. These can be accommodated by creating a local.config file within the directory where you launch your nextflow job.
The example below shows the overall structure of the file. Enclose everything within a single process directive; individual tasks can then be addressed using their names.
Within each block you can set custom values for cpus, memory, queue and ext.args. The first three are probably self-explanatory. The last setting, ext.args, allows you to provide command-line arguments directly to the tool.
process {
    withName: 'gatk_mark_duplicates' {
        cpus = 12
        memory = 30.GB
    }
    withName: 'fastp' {
        ext.args = '--trim_front1 28 --poly_g_min_len 5'
    }
    withName: 'freebayes' {
        queue = 'long'
        ext.args = '-0 --use-duplicate-reads --genotype-qualities --strict-vcf'
        cpus = 64
    }
}
After creating this file you can tell nextflow to use these settings by providing -c local.config at the command line.
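For example, combining this with the test command from earlier (substitute your own pipeline and profile as appropriate):
nextflow run marine-omics/movp -latest -profile genomics,test -r main -c local.config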
One of the very nice features of Nextflow is that it integrates with a monitoring service called Tower. To set this up:
- Create a Tower account and sign in. An easy way to do this is to sign in with your GitHub profile by going to tower.nf
- Follow the instructions in the Tower documentation to create an access token
- Add your token permanently by adding the following lines to your .bash_profile (edit as appropriate for your specific token)
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
export NXF_VER=20.10.0
- Run your pipeline with the -with-tower option
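For example (again using the test command from earlier):
nextflow run marine-omics/movp -latest -profile genomics,test -r main -with-tower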
Now you can visit tower.nf and you should be able to see your running pipeline there.