Nextflow is a powerful workflow management system designed for creating scalable and reproducible scientific workflows. It enables you to write workflows in a declarative language, making it easy to define complex pipelines that can be executed on various platforms, including local machines, clusters, and cloud environments like Google Cloud.
This short tutorial is meant for informatics users who are comfortable with a command line interface. It also assumes that the user is familiar with and has run nextflow on a local computer or HPC system.
Roughly, this document will walk through:
- reviewing prerequisites
- setting up the google cloud command line interface, gcloud
- setting up nextflow
- walking through an example rnaseq nextflow run on Google Cloud Batch [total compute costs $0.10]
- reviewing the output of the workflow
Prerequisites:
- Git command line application
- Terminal interface (linux/mac is best)
- rclone [optional, brew install rclone]
- An existing google cloud project (or have OIT create one in the next step)
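A quick way to confirm the prerequisites is to check which tools are on your PATH. The sketch below is only an illustration: the `check_tools` function is a hypothetical helper, and `curl` and `java` are included on the assumption that you will use the curl-based installer shown later and that Nextflow needs a Java runtime to run.

```shell
# Hypothetical helper: report which of the named tools are on the PATH.
# Returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# git is listed above; curl and java are assumptions (see the lead-in).
check_tools git curl java || echo "Install the missing tools before continuing."
```

Add `rclone` to the argument list if you plan to use it.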
You will need a speedtype to pay for the cloud usage. For this example, the costs will be less than $1, but further usage can add up.
Fill out the form at: https://forms.office.com/r/vzEpYpfhX2
This will trigger OIT to create a google cloud project (you'll need the ID below), add users to the project, and set up the project for use with nextflow.
- Follow the instructions at https://cloud.google.com/sdk/docs/install to install the gcloud CLI.
- Initialize the gcloud CLI:
    gcloud init
- Create a google cloud configuration (so that you can use multiple configurations in the future):
    gcloud config configurations create bbsr
- Set up the project for your newly created configuration. We now attach the GCP project that was created for us to the configuration. This just makes changes to a local config file.
    export GCP_PROJECT=gac-som-dbmi-bbsrinf-app-9i7
    gcloud config set project "$GCP_PROJECT"
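To confirm that the project really is attached to the active configuration, you can read the value back. This is a minimal sketch: `check_project` is a hypothetical helper, and the `gcloud` call is guarded so the snippet does nothing harmful on a machine where `gcloud` is not installed yet.

```shell
EXPECTED_PROJECT="gac-som-dbmi-bbsrinf-app-9i7"

# Hypothetical helper: compare the active project (first argument)
# with the project this tutorial expects.
check_project() {
  if [ "$1" = "$EXPECTED_PROJECT" ]; then
    echo "OK: active project is $1"
  else
    echo "Mismatch: active project is '${1:-unset}', expected $EXPECTED_PROJECT"
    return 1
  fi
}

if command -v gcloud >/dev/null 2>&1; then
  # `gcloud config get-value project` prints the project of the active configuration.
  check_project "$(gcloud config get-value project 2>/dev/null)" || true
else
  echo "gcloud is not installed yet"
fi
```

`gcloud config configurations list` shows every configuration and marks which one is active, which helps if you later switch between several.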
- Authenticate to Google. You now have a configuration that is connected to a project. That project has your email address associated with it, so you can log in with:
    gcloud auth login
  You'll be redirected to your browser. Use your normal CU Anschutz login process and follow the prompts, allowing Google Cloud Platform to connect to your account.
  At this point, you will have access to GCP.
- Test your access. There is already a google cloud storage bucket available. Its contents are only visible to users who are authenticated to use the project. Let's test the access:
    gcloud storage ls --recursive gs://bbsrinformatics/
Install the nextflow executable:
export NF_VERSION=v23.04.1
curl -s -L "https://github.com/nextflow-io/nextflow/releases/download/${NF_VERSION}/nextflow" | bash
You should see output like:
N E X T F L O W
version 23.04.1 build 5866
created 15-04-2023 06:51 UTC
cite doi:10.1038/nbt.3820
http://nextflow.io
Nextflow installation completed. Please note:
- the executable file `nextflow` has been created in the folder: ...
- you may complete the installation by moving it to a directory in your $PATH
- Clone the sample pipeline repository:
    git clone https://github.com/nextflow-io/rnaseq-nf.git
- Go to the rnaseq-nf folder:
    cd rnaseq-nf
- Open the nextflow.config file using an editor. It should look something like:
    gcb {
        params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
        params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
        params.multiqc = 'gs://rnaseq-nf/multiqc'
        process.executor = 'google-batch'
        process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
        workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
        google.region = 'us-central1'
    }
Note how the params reference gs://... URIs. Nextflow will use the files in those locations in the same way that it would if you used the filenames locally. If you have a workflow that references files locally, you can convert those files for use in a nextflow pipeline by copying them to your cloud bucket and then using the gs://... URI to specify each file rather than just the local file path.
In this case, google and nextflow have placed example data in the rnaseq-nf bucket and made it available for us to use, so don't change those for now.
However, the workDir variable needs to be changed.
To do so, edit the workDir line in the nextflow.config file to be:
    workDir = "gs://bbsrinformatics/bbsr_test/${USER}/work"
Note the use of double quotes, which allow the ${USER} environment variable to be expanded to your username.
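If you want to preview what a double-quoted path containing `${USER}` would expand to on your machine, the shell follows the same quoting convention: double quotes expand the variable, single quotes keep the text literal. This is only an analogy run in the shell, under the assumption that Nextflow resolves `${USER}` from the same environment; the variable names `work_dir` and `literal` are made up for the illustration.

```shell
# Fall back to `id -un` in environments where USER is not set.
USER="${USER:-$(id -un)}"

# Double quotes: ${USER} is replaced by its value.
work_dir="gs://bbsrinformatics/bbsr_test/${USER}/work"

# Single quotes: the text ${USER} is kept as-is, with no expansion.
literal='gs://bbsrinformatics/bbsr_test/${USER}/work'

echo "expanded: ${work_dir}"
echo "literal:  ${literal}"
```

If the expanded line does not show your username in the path, check the value of `USER` before running the workflow.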
Add the following lines inside the gcb block:
    google.project = "gac-som-dbmi-bbsrinf-app-9i7"
    google.batch.serviceAccountEmail = '[email protected]'
    google.batch.network = 'projects/gac-som-dbmi-bbsrinf-app-9i7/global/networks/batch'
    google.batch.subnetwork = 'projects/gac-som-dbmi-bbsrinf-app-9i7/regions/us-central1/subnetworks/batch'
Save your edits.
The entire gcb block should now look like:
    gcb {
        params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
        params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
        params.multiqc = 'gs://rnaseq-nf/multiqc'
        process.executor = 'google-batch'
        process.container = 'quay.io/nextflow/rnaseq-nf:v1.2.1'
        /* customizations to the nextflow.config */
        workDir = "gs://bbsrinformatics/bbsr_test/${USER}/work"
        google.project = "gac-som-dbmi-bbsrinf-app-9i7"
        google.region = 'us-central1'
        google.batch.serviceAccountEmail = '[email protected]'
        google.batch.network = 'projects/gac-som-dbmi-bbsrinf-app-9i7/global/networks/batch'
        google.batch.subnetwork = 'projects/gac-som-dbmi-bbsrinf-app-9i7/regions/us-central1/subnetworks/batch'
    }
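The earlier note about converting a workflow that reads local files can be sketched as follows. The file names `sample_1.fq` and `sample_2.fq`, the `data` prefix, and the `gs_uri_for` helper are all hypothetical; the loop only prints the copy commands so you can inspect them before uploading anything.

```shell
# Fall back to `id -un` in environments where USER is not set.
USER="${USER:-$(id -un)}"

# Hypothetical destination prefix inside the project bucket.
DEST="gs://bbsrinformatics/bbsr_test/${USER}/data"

# Hypothetical helper: map a local file path to the gs:// URI
# it will have after upload (same file name, under DEST).
gs_uri_for() {
  printf '%s/%s\n' "${DEST%/}" "$(basename "$1")"
}

# Dry run: print the copy commands. Remove `echo` to actually upload.
for f in sample_1.fq sample_2.fq; do
  echo gcloud storage cp "$f" "$(gs_uri_for "$f")"
done
```

After uploading, you would point the pipeline at the new location, for example by setting params.reads in the config to the matching gs://... pattern instead of a local path.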
- Run the workflow:
    ../nextflow run nextflow-io/rnaseq-nf -profile gcb -resume --outdir "gs://bbsrinformatics/bbsr_test/${USER}/results"
You should see the normal nextflow status logs. At the end of the run, you'll see something like:
N E X T F L O W ~ version 23.04.1
Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir : results
Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
executor > google-batch (4)
[67/71b856] process > RNASEQ:INDEX (transcript) [100%] 1 of 1 ✔
[0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[a9/571723] process > RNASEQ:QUANT (gut) [100%] 1 of 1 ✔
[9a/1f0dd4] process > MULTIQC [100%] 1 of 1 ✔
Done! Open the following report in your browser --> results/multiqc_report.html
Completed at: 20-Apr-2023 15:44:55
Duration : 10m 13s
CPU hours : (a few seconds)
Succeeded : 4
To see the outputs:
gcloud storage ls --recursive "gs://bbsrinformatics/bbsr_test/${USER}/results"
To examine a specific file:
gcloud storage cat gs://...
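For a report that is meant to be opened in a browser, copying the file to your machine is usually more practical than printing it with `cat`. The sketch below assumes the MultiQC report lands at `multiqc_report.html` directly under the results prefix used in the run command, as the log message suggests; check the listing above for the actual location. It prints the command rather than running it.

```shell
# Fall back to `id -un` in environments where USER is not set.
USER="${USER:-$(id -un)}"

RESULTS="gs://bbsrinformatics/bbsr_test/${USER}/results"

# Assumed report location; verify it against the output of `gcloud storage ls`.
REPORT="${RESULTS}/multiqc_report.html"

# Dry run: print the download command. Remove `echo` to copy the report
# into the current directory.
echo gcloud storage cp "$REPORT" .
```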
You can also use rclone to create a configuration and then interact with the google cloud storage. rclone can sometimes be more convenient than gcloud storage, but they are functionally equivalent.