NOTE: this whole idea is the brainchild of Jürgen Hench. He got it up and running and posted about it here. I am merely wrapping the idea in a hopefully easy-to-follow set of instructions for people to test themselves.
This notebook describes processing of Nanopore sequencing data (fast5 files) in a Google Colab interactive notebook environment. This is made possible by utilising the GPU-enabled runtime that is available via Colab.
Before we get started there are some important things to note:
- you will need an ONT community forum account to download Guppy, so make sure you have one and can access the downloads section
- this is a cloud-based approach, meaning all data will be located in some cloud instance somewhere (I use Google Drive in this example). This may not be appropriate for the data you have, so consider this carefully before uploading anything.
- this is currently a free service; it may well be removed at any stage, completely at the discretion of Google
- as part of this, Google has the right to monitor usage and may throttle or deny allocation of resources to users that are running constantly
- the amount and type of allocated resources can and likely will change. The current GPU instances use GPUs that work with Guppy, and the available disk is about 64 GB, but this can change (see the quick resource check after this list)
- runtime disconnection is a thing; if the notebook is idle too long you'll be disconnected
- it is possible to run out of memory/RAM
- there is no guarantee that GPU hardware will be available when you want to use it
- the GPU that you get allocated might not be compatible with Guppy. For example, in one instance I was assigned a Tesla K80. This is a Kepler-based card and doesn't meet the requirement of CUDA compute >=6.0. This is the error that I received:
[guppy/error] *common::LoadModuleFromFatbin: Loading fatbin file shared.fatbin failed with: CUDA error at /builds/ofan/ont_core_cpp/ont_core/common/cuda_common.cpp:54: CUDA_ERROR_NO_BINARY_FOR_GPU
- neither I nor ONT take any responsibility, you're on your own! :)
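If you want to see what a given session has actually been allocated before uploading anything, a quick look with standard Linux tools is easy enough. Nothing here is Guppy-specific, and the numbers will vary between sessions:

%%shell
# quick look at what this session was allocated - values vary per session
df -h /      # available disk on the root filesystem
free -h      # total and available RAM
nproc        # number of CPU cores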
A note of interest from the Google Colab FAQ:
"The types of GPUs that are available in Colab vary over time. This is necessary for Colab to be able to provide access to these resources for free. The GPUs available in Colab often include Nvidia K80s, T4s, P4s and P100s. There is no way to choose what type of GPU you can connect to in Colab at any given time. Users who are interested in more reliable access to Colab’s fastest GPUs may be interested in Colab Pro."
So there are 4 different GPUs on offer, and it's essentially a 'lottery' as to which you get assigned - though it will likely be one of the less powerful options. Here is an overview of these GPUs with respect to which "work" with Guppy:
- Nvidia K80 - not compatible with Guppy
  - Kepler 2.0 microarchitecture
  - Year of release = 2014
  - CUDA Compute = 3.7
  - 2496 x2 CUDA cores (essentially a dual GPU)
- Nvidia P4 - compatible with Guppy
  - Pascal microarchitecture
  - Year of release = 2016
  - CUDA Compute = 6.1
  - 2560 CUDA cores
- Nvidia P100 - compatible with Guppy
  - Pascal microarchitecture
  - Year of release = 2016
  - CUDA Compute = 6.0
  - 3584 CUDA cores
- Nvidia T4 - compatible with Guppy
  - Turing microarchitecture
  - Year of release = 2018
  - CUDA Compute = 7.5
  - 2560 CUDA cores
So of the 4 types of GPU currently available via the free tier of Google Colab, the Nvidia K80 is the only one which will not work with Guppy as it is currently implemented. If you end up with an instance with a K80 then there is no point continuing, and you can try again later. If you sign up for the Pro version of Google Colab (9.99 USD per month) then you get priority access to better GPUs - food for thought.
The first thing is to make sure the runtime is set to use a GPU. To do this is pretty simple:
- go to the Runtime menu
- select the Change runtime type option
- make sure the Hardware accelerator is set to GPU
Once the above is set up you should be able to run the below code block. If successful you should see something like /device:GPU:0 as the output. This means that the GPU is available for use.
import tensorflow as tf
tf.test.gpu_device_name() # this will tell you device number (should be 0 with a single GPU)
import torch
torch.cuda.get_device_name(0) # this will tell you the name/model of the GPU
'Tesla T4'
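You can also query the GPU directly from the shell with nvidia-smi, which reports the model, driver version and current memory use (and if you prefer to stay in Python, torch.cuda.get_device_capability(0) will return the CUDA compute capability as a (major, minor) tuple, which you can compare against the >=6.0 requirement above):

%%shell
# report the allocated GPU model, driver version and memory usage
nvidia-smi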
You will need an account for the ONT community forum here to be able to access the download section and grab a copy of Guppy.
Once you have access and can navigate to the 'Software Downloads' section of the ONT community forum you will see a listing for Guppy. I recommend grabbing the pre-compiled binaries, i.e. the version listed as Linux 64-bit GPU; it should have a file name similar to ont-guppy_X.X.X_linux64.tar.gz, where the X's denote the version number. Copy the link to this download and paste it into the code block below, i.e. replace the section [paste_guppy_link_here]. Run the code block and Guppy will be downloaded.
%%shell
# paste the download link copied from the ONT community forum below,
# replacing the placeholder (keep the quotes)
GuppyBinary="[paste_guppy_link_here]"
wget "$GuppyBinary"
...
...
...
Resolving americas.oxfordnanoportal.com (americas.oxfordnanoportal.com)... 96.126.99.215
Connecting to americas.oxfordnanoportal.com (americas.oxfordnanoportal.com)|96.126.99.215|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 637723012 (608M) [application/x-tar]
Saving to: ‘ont-guppy_4.5.3_linux64.tar.gz’
ont-guppy_4.5.3_lin 100%[===================>] 608.18M 44.7MB/s in 14s
2021-04-14 10:29:27 (42.0 MB/s) - ‘ont-guppy_4.5.3_linux64.tar.gz’ saved [637723012/637723012]
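As an optional step before unpacking, you can list the archive contents without extracting anything; tar's -t flag does exactly this:

%%shell
# list the first few entries in the archive without extracting anything
tar -tzf ont-guppy_4.5.3_linux64.tar.gz | head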
Before we can use the Guppy binaries we need to extract the file we downloaded. Replace the X's in the below code block with the version you downloaded and then run the code block. If we use version 4.5.3 as an example:
%%shell
tar -xzvf ont-guppy_4.5.3_linux64.tar.gz
ont-guppy/bin/
ont-guppy/bin/guppy_basecall_client
ont-guppy/bin/guppy_basecall_server
ont-guppy/bin/guppy_basecaller
ont-guppy/bin/guppy_basecaller_supervisor
ont-guppy/data/
ont-guppy/data/YHR174W.fasta
ont-guppy/data/adapter_scaling_dna_r10.3_min.jsn
ont-guppy/data/adapter_scaling_dna_r10.3_prom.jsn
ont-guppy/data/adapter_scaling_dna_r9.4.1_min.jsn
ont-guppy/data/adapter_scaling_dna_r9.4.1_prom.jsn
ont-guppy/data/certs-bundle.crt
ont-guppy/data/dna_r10.3_450bps_fast.cfg
ont-guppy/data/dna_r10.3_450bps_fast_prom.cfg
ont-guppy/data/dna_r10.3_450bps_hac.cfg
ont-guppy/data/dna_r10.3_450bps_hac_prom.cfg
ont-guppy/data/dna_r10.3_450bps_modbases_5mc_hac_prom.cfg
ont-guppy/data/dna_r10_450bps_fast.cfg
ont-guppy/data/dna_r10_450bps_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac_prom_fw205.cfg
ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_sketch.cfg
ont-guppy/data/dna_r9.5_450bps.cfg
ont-guppy/data/lambda_3.6kb.fasta
ont-guppy/data/lampore_analysis-2.0.0-py3-none-any.whl
ont-guppy/data/mismatch_matrix.txt
ont-guppy/data/rna_r9.4.1_70bps_fast.cfg
ont-guppy/data/rna_r9.4.1_70bps_fast_prom.cfg
ont-guppy/data/rna_r9.4.1_70bps_hac.cfg
ont-guppy/data/rna_r9.4.1_70bps_hac_prom.cfg
ont-guppy/data/template_r10.3_450bps_fast.jsn
ont-guppy/data/template_r10.3_450bps_fast_prom.jsn
ont-guppy/data/template_r10.3_450bps_hac.jsn
ont-guppy/data/template_r10.3_450bps_hac_prom.jsn
ont-guppy/data/template_r10.3_450bps_modbases_5mc_hac_prom.jsn
ont-guppy/data/template_r10_450bps_fast.jsn
ont-guppy/data/template_r10_450bps_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_fast.jsn
ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_hac_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_hac_prom_fw205.jsn
ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_sketch.jsn
ont-guppy/data/template_r9.5_450bps_5mer_raw.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_fast.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_fast_prom.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_hac.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_hac_prom.jsn
ont-guppy/bin/
ont-guppy/bin/guppy_aligner
ont-guppy/bin/minimap2
ont-guppy/lib/
ont-guppy/lib/MINIMAP2_LICENSE
ont-guppy/lib/libont_minimap2.so.2
ont-guppy/lib/libont_minimap2.so.2.17.2
ont-guppy/bin/
ont-guppy/bin/Nanopore Product Terms and Conditions (28 November 2018).pdf
ont-guppy/bin/THIRD_PARTY_LICENSES
ont-guppy/bin/
ont-guppy/bin/guppy_barcoder
ont-guppy/data/
ont-guppy/data/barcoding/
ont-guppy/data/barcoding/4x4_mismatch_matrix.txt
ont-guppy/data/barcoding/5x5_mismatch_matrix.txt
ont-guppy/data/barcoding/5x5_mismatch_matrix_simple.txt
ont-guppy/data/barcoding/barcode_arrs_16s.cfg
ont-guppy/data/barcoding/barcode_arrs_dual_nb24_pcr96.cfg
ont-guppy/data/barcoding/barcode_arrs_lwb.cfg
ont-guppy/data/barcoding/barcode_arrs_multivirus1.cfg
ont-guppy/data/barcoding/barcode_arrs_multivirus8.cfg
ont-guppy/data/barcoding/barcode_arrs_nb12.cfg
ont-guppy/data/barcoding/barcode_arrs_nb13-24.cfg
ont-guppy/data/barcoding/barcode_arrs_nb24.cfg
ont-guppy/data/barcoding/barcode_arrs_nb96.cfg
ont-guppy/data/barcoding/barcode_arrs_ncov8.cfg
ont-guppy/data/barcoding/barcode_arrs_ncov96.cfg
ont-guppy/data/barcoding/barcode_arrs_pcr12.cfg
ont-guppy/data/barcoding/barcode_arrs_pcr96.cfg
ont-guppy/data/barcoding/barcode_arrs_rab.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk096.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk4.cfg
ont-guppy/data/barcoding/barcode_arrs_rlb.cfg
ont-guppy/data/barcoding/barcode_arrs_vmk.cfg
ont-guppy/data/barcoding/barcode_arrs_vmk2.cfg
ont-guppy/data/barcoding/barcode_score_vs_classification.png
ont-guppy/data/barcoding/barcodes_masked.fasta
ont-guppy/data/barcoding/configuration.cfg
ont-guppy/data/barcoding/configuration_dual.cfg
ont-guppy/data/barcoding/multivirus_targets.fasta
ont-guppy/data/barcoding/ncov_targets.fasta
ont-guppy/data/barcoding/nw_barcoding_grid.png
ont-guppy/lib/
ont-guppy/lib/libvbz_hdf_plugin.so
ont-guppy/lib/libvbz_hdf_plugin.so.1
ont-guppy/lib/libvbz_hdf_plugin.so.1.0.0
We should now be able to run the Guppy binaries we downloaded. They are located in ./ont-guppy/bin. The below code block should run guppy_basecaller and report the version of the software.
%%shell
./ont-guppy/bin/guppy_basecaller --version
: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 4.5.3+0ab5ebb
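One Colab quirk worth knowing: each %%shell cell starts a fresh shell, so exporting PATH in one cell won't carry over to the next. If you'd rather not type ./ont-guppy/bin/ every time, an optional workaround is to symlink the binaries into a directory that is already on the PATH. This is purely a convenience; all the code blocks below use the full path:

%%shell
# optional: symlink the Guppy binaries into /usr/local/bin so they are
# available by name in every subsequent cell
ln -sf $PWD/ont-guppy/bin/guppy_* /usr/local/bin/
guppy_basecaller --version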
By mounting your Google Drive you will be able to upload fast5 files for processing, with the output written back to the same location within Drive.
The below chunk performs the mounting. You will be asked to authenticate; just follow the instructions and things should go pretty smoothly.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Mounted at /content/gdrive
For this example I created a directory within My Drive called ONT, and then within this folder another directory called example_data. I then uploaded a few fast5 files to this location.
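If you prefer to create this structure from the notebook rather than through the Drive web interface, a simple mkdir works once the drive is mounted (the ONT/example_data naming is just what I used for this example):

%%shell
# create the example directory structure inside the mounted Drive
mkdir -p gdrive/MyDrive/ONT/example_data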
We can check that the mounted drive and files are identified in the notebook environment below.
%%shell
ls gdrive/MyDrive/ONT/example_data
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_0.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_10.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_11.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_12.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_13.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_14.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_15.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_16.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_17.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_18.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_19.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_1.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_20.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_2.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_3.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_4.fast5
Looks good! We can see a list of fast5 files.
Now for the fun part!
With all the above working we can now basecall our data. First we will set a few variables. The below code block creates shell variables for the input and output locations, the Guppy binary (basecaller) and several model configuration files for basecalling (i.e. fast, hac and modified bases).
Once we're happy with these variables we can put together the Guppy command to start basecalling. The below is a fairly simple run using the fast model, with the parameters adjusted slightly for the compute environment.
Run this block and hopefully you'll see basecalling kick off. If so, that's all there is to it. :)
%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg # base modification calling
$guppy_bc -i $inputPath -s $outputPath \
--recursive \
--config $guppy_cfg_fast \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file: ./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /content/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: gdrive/MyDrive/ONT/example_data
save path: gdrive/MyDrive/ONT/example_data
chunk size: 2000
chunks per runner: 160
minimum qscore: 7
records per file: 4000
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 16
Found 16 fast5 files to process.
Init time: 696 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 101781 ms, Samples called: 1912424322, samples/s: 1.87896e+07
Finishing up any open output files.
Basecalling completed successfully.
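As a quick sanity check, Guppy writes the fastq files plus a sequencing_summary.txt (a header line followed by one row per read) into the save path, so you can confirm the output landed in Drive and count the called reads (paths assume the example locations used above):

%%shell
# confirm the output files were written back to Drive and count the reads
ls gdrive/MyDrive/ONT/example_data
wc -l gdrive/MyDrive/ONT/example_data/sequencing_summary.txt  # header + one line per read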
The next basecalling run performs high accuracy (hac) calling. I was actually very surprised by the speed of the GPU that generated this output (Nvidia T4); I feel it would be a decent option if you wanted to turn around a small amount of data using the hac model.
The below code block will perform hac calling:
%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg # base modification calling
$guppy_bc -i $inputPath -s $outputPath \
--recursive \
--config $guppy_cfg_hac \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file: ./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /content/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: gdrive/MyDrive/ONT/example_data
save path: gdrive/MyDrive/ONT/example_data
chunk size: 2000
chunks per runner: 512
minimum qscore: 9
records per file: 4000
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 16
Found 16 fast5 files to process.
Init time: 1864 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 409705 ms, Samples called: 1904252640, samples/s: 4.64786e+06
Finishing up any open output files.
Basecalling completed successfully.
Comparing the two runs, the fast model called ~1.9e7 samples/s against ~4.6e6 samples/s for hac - roughly 4x slower, but still under 7 minutes for these 16 fast5 files. If you are interested in exploring base modifications then you can provide the appropriate model configuration file and let it run. Again I was quite surprised by the speed in this cloud instance using an Nvidia T4 - good stuff.
Run the below code block for base modification enabled calling:
%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg # base modification calling
$guppy_bc -i $inputPath -s $outputPath \
--recursive \
--config $guppy_cfg_mod \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file: ./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg
model file: /content/ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac.jsn
input path: gdrive/MyDrive/ONT/example_data
save path: gdrive/MyDrive/ONT/example_data
chunk size: 2000
chunks per runner: 512
minimum qscore: 9
records per file: 4000
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 16
Found 16 fast5 files to process.
Init time: 1820 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 406350 ms, Samples called: 1904252640, samples/s: 4.68624e+06
Finishing up any open output files.
Basecalling completed successfully.
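One caveat worth flagging for the modified base run: to my understanding, in Guppy 4.x the per-base modification probabilities are written into output fast5 files rather than the fastq, so you may need to add the --fast5_out flag to retain them - check the Guppy documentation for your version. The sketch below is the same command with that single flag added:

%%shell
# same modified-base command as above, with --fast5_out added so the
# per-base modification probabilities are written to output fast5 files
# (to my understanding this is needed in Guppy 4.x - check the docs)
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
./ont-guppy/bin/guppy_basecaller -i $inputPath -s $outputPath \
    --recursive \
    --config ./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg \
    --gpu_runners_per_device 16 \
    --cpu_threads_per_caller 2 \
    --device cuda:0 \
    --fast5_out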
Well, that is really all there is to it - cloud-based, GPU-accelerated basecalling using the free tier of Google Colab is not just possible, it's actually quite usable! Again, a massive thanks to Jürgen Hench, who put in all the hard work and created the initial post explaining that this was a possibility.
Moving forward it would be interesting to see how the paid tiers perform; the Pro version of Google Colab is only 9.99 USD per month and can be cancelled anytime. I might clock up a month or two and try to do a little benchmarking. It would also be very useful to examine other cloud-based options, e.g. AWS with GPU-enabled instances. The price of instances with decent GPUs is dropping rather quickly, which is quite exciting.
Happy GPU basecalling everyone!