@seandavi
Last active May 31, 2024 15:46
A small tutorial/workflow for using Nextflow on GCP from the CU Anschutz campus

Nextflow on Google Cloud at CU Anschutz

Nextflow is a powerful workflow management system designed for creating scalable and reproducible scientific workflows. It enables you to write workflows in a declarative language, making it easy to define complex pipelines that can be executed on various platforms, including local machines, clusters, and cloud environments like Google Cloud.

This short tutorial is meant for informatics users who are comfortable with a command line interface. It also assumes that you are familiar with Nextflow and have run it on a local computer or HPC system.

Roughly, this document will walk through:

  1. reviewing prerequisites
  2. setting up the Google Cloud command line interface, gcloud
  3. setting up Nextflow
  4. walking through an example RNA-seq Nextflow run on Google Cloud Batch [total compute cost: about $0.10]
  5. reviewing the output of the workflow

Prerequisites

  • Git command line application
  • Terminal interface (linux/mac is best)
  • rclone [optional, brew install rclone]
  • An existing google cloud project (or have OIT create one in the next step)
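
A quick way to check these prerequisites from the terminal (a convenience sketch, not part of the official setup; it only reports what is present, and never installs anything):

```shell
# Report which of the required command line tools are on PATH.
# Purely informational; nothing is installed or changed.
missing=""
for tool in git gcloud rclone; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
    missing="$missing $tool"
  fi
done
# $missing now lists anything you still need to install.
```

rclone is optional, so a "missing" line for it is not a blocker.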

Have OIT create a Google Cloud project for you

You will need a speedtype to pay for the cloud usage. For this example, the costs will be less than $1, but further usage can add up.

Fill out the form at: https://forms.office.com/r/vzEpYpfhX2

This will trigger OIT to create a Google Cloud project (you'll need its ID below), add users to the project, and set up the project for use with Nextflow.

Google Cloud gcloud setup

  1. Follow the instructions at https://cloud.google.com/sdk/docs/install to install the gcloud CLI.

  2. Initialize the gcloud CLI:

    gcloud init
  3. Create a google cloud configuration (so that you can use multiple configurations in the future)

    gcloud config configurations create bbsr
  4. Set up the project for your newly created configuration. We now attach the GCP project that was created for us to the configuration; this only changes a local config file.

    export GCP_PROJECT=gac-som-dbmi-bbsrinf-app-9i7
    gcloud config set project $GCP_PROJECT
  5. Authenticate to Google. You now have a configuration that is connected to a project, and that project has your email address associated with it, so you can log in:

    gcloud auth login

    You'll be redirected to your browser. Use your normal CU Anschutz login process and follow the prompts, allowing Google Cloud Platform to connect to your account.

    At this point, you will have access to GCP.

  6. Test your access. There is already a Google Cloud Storage bucket available; its contents are only visible to users who are authorized on the project. Let's test access:

    gcloud storage ls --recursive gs://bbsrinformatics/
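
With setup done, a couple of read-only commands can confirm everything is wired up. A sketch (safe to run at any time; the guard makes it a no-op on machines where gcloud is not installed):

```shell
# Read-only sanity checks for the gcloud setup above.
if command -v gcloud >/dev/null 2>&1; then
  # Should echo the project you set earlier (gac-som-dbmi-bbsrinf-app-9i7).
  project="$(gcloud config get-value project 2>/dev/null || true)"
  echo "active project: ${project:-<none set>}"
  # Shows which account is currently authenticated.
  gcloud auth list 2>/dev/null || true
else
  project=""
  echo "gcloud not found on PATH"
fi
```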

Nextflow

Setup

Install the nextflow executable:

export NF_VERSION=v23.04.1
curl -s -L "https://github.com/nextflow-io/nextflow/releases/download/${NF_VERSION}/nextflow" | bash

You should see output like:

N E X T F L O W
version 23.04.1 build 5866
created 15-04-2023 06:51 UTC
cite doi:10.1038/nbt.3820
http://nextflow.io

Nextflow installation completed. Please note:
- the executable file `nextflow` has been created in the folder: ...
- you may complete the installation by moving it to a directory in your $PATH
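
As the installer output suggests, you may want to move the launcher into a directory on your PATH. A minimal sketch (using ~/bin, which is an assumption; any directory on your PATH works):

```shell
# Optional: put the downloaded `nextflow` launcher on your PATH.
# ~/bin is one common choice; adjust install_dir to taste.
install_dir="$HOME/bin"
mkdir -p "$install_dir"
if [ -f nextflow ]; then
  chmod +x nextflow
  mv nextflow "$install_dir/"
fi
echo "launcher directory: $install_dir"
```

If you skip this step, just call the launcher by its relative path (as the example run below does with `../nextflow`).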

Example run

  1. Clone the sample pipeline repository:

    git clone https://github.com/nextflow-io/rnaseq-nf.git
  2. Go to the rnaseq-nf folder:

    cd rnaseq-nf
    
  3. Open the nextflow.config file using an editor. It should look something like:

    gcb {
      params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
      params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
      params.multiqc = 'gs://rnaseq-nf/multiqc'
      process.executor = 'google-batch'
      process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
      workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
      google.region  = 'us-central1'
    }
    

    Note how the params reference gs://... URIs. Nextflow will use the files in those locations in the same way that it would if you used local filenames. If you have a workflow that references local files, you can adapt it for a cloud run by copying those files to your cloud bucket and then using the gs://... URIs to specify them rather than local file paths.

    In this case, Google and Nextflow have placed example data in the rnaseq-nf bucket and made it available for us to use, so don't change those params for now.

    However, the workDir variable needs to be changed.

    To do so, edit the nextflow.config file line to be:

      workDir = "gs://bbsrinformatics/bbsr_test/${USER}/work"
    

    Note the use of double quotes that will result in the appropriate environment variable expansion.

    Add the lines inside the gcb block:

      google.project = "gac-som-dbmi-bbsrinf-app-9i7"
      google.batch.serviceAccountEmail = '[email protected]'
      google.batch.network = 'projects/gac-som-dbmi-bbsrinf-app-9i7/global/networks/batch'
      google.batch.subnetwork = 'projects/gac-som-dbmi-bbsrinf-app-9i7/regions/us-central1/subnetworks/batch'
    

    Save your edits.

    The entire gcb block should now look like:

    gcb {
      params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
      params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
      params.multiqc = 'gs://rnaseq-nf/multiqc'
      process.executor = 'google-batch'
      process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
      
      /* customizations to the nextflow.config */
      
      workDir = "gs://bbsrinformatics/bbsr_test/${USER}/work"
      google.project = "gac-som-dbmi-bbsrinf-app-9i7"
      google.region  = 'us-central1'
      google.batch.serviceAccountEmail = '[email protected]'
      google.batch.network = 'projects/gac-som-dbmi-bbsrinf-app-9i7/global/networks/batch'
      google.batch.subnetwork = 'projects/gac-som-dbmi-bbsrinf-app-9i7/regions/us-central1/subnetworks/batch'
    }
    
  4. Run the workflow.

    ../nextflow run nextflow-io/rnaseq-nf -profile gcb -resume --outdir "gs://bbsrinformatics/bbsr_test/${USER}/results"

    You should see the normal Nextflow status logs. At the end of the run, you'll see something like:

    N E X T F L O W  ~  version 23.04.1
    Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
     R N A S E Q - N F   P I P E L I N E
     ===================================
     transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
     reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
     outdir       : results
    
    Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
    executor >  google-batch (4)
    [67/71b856] process > RNASEQ:INDEX (transcript)     [100%] 1 of 1 ✔
    [0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
    [a9/571723] process > RNASEQ:QUANT (gut)            [100%] 1 of 1 ✔
    [9a/1f0dd4] process > MULTIQC                       [100%] 1 of 1 ✔
    
    Done! Open the following report in your browser --> results/multiqc_report.html
    
    Completed at: 20-Apr-2023 15:44:55
    Duration    : 10m 13s
    CPU hours   : (a few seconds)
    Succeeded   : 4
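
The same profile pattern carries over to your own pipelines. A minimal nextflow.config sketch, with placeholders (YOUR_BUCKET, YOUR_PROJECT_ID, and YOUR_CONTAINER_IMAGE are not real names; substitute your own values):

```groovy
profiles {
    gcb {
        process.executor  = 'google-batch'
        process.container = 'YOUR_CONTAINER_IMAGE'   // placeholder image
        workDir = "gs://YOUR_BUCKET/${USER}/work"    // placeholder bucket
        google.project = 'YOUR_PROJECT_ID'           // placeholder project
        google.region  = 'us-central1'
    }
}
```

As in the tutorial config above, the double quotes around workDir allow the ${USER} environment variable to expand.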

Examining output

To see the outputs:

gcloud storage ls --recursive "gs://bbsrinformatics/bbsr_test/${USER}/results"

To examine a specific file:

gcloud storage cat gs://...
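
For example, to pull the MultiQC report down for local viewing (a sketch; the guard makes it a no-op on machines where gcloud is unavailable):

```shell
# Download the MultiQC report from the results prefix used in this tutorial.
report="gs://bbsrinformatics/bbsr_test/${USER}/results/multiqc_report.html"
if command -v gcloud >/dev/null 2>&1; then
  gcloud storage cp "$report" . || true
fi
echo "requested: $report"
```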

You can also use rclone to create a configuration and then interact with Google Cloud Storage. rclone can sometimes be more convenient than `gcloud storage`, but the two are functionally equivalent for most purposes.
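
A hedged sketch of the rclone route, assuming you have already created a remote named "gcs" via `rclone config` (the remote name is an assumption; use whatever name you chose, with the Google Cloud Storage backend):

```shell
# List the tutorial results via a pre-configured rclone remote.
# "gcs" is a hypothetical remote name created beforehand with `rclone config`.
remote="gcs:bbsrinformatics/bbsr_test/${USER}/results"
if command -v rclone >/dev/null 2>&1; then
  rclone ls "$remote" || true
else
  echo "rclone not installed"
fi
```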
