The following code snippets demonstrate an approach to substantially speed up BLAST searches of large query files (whole transcriptomes or genomes) against the NCBI nr/nt/refseq databases, by running many small jobs, each annotating a subset of the input sequences against a subset of the database.
The method is based on "divide and conquer" approaches [1–2] that split the search query and the database into multiple small jobs requiring modest resources and time; such jobs can utilise high-priority queues and are therefore ideal for an HPC setting. The exact number of jobs to submit depends on the specifics of the HPC cluster, primarily the number of available nodes, queue limits and system load, and therefore needs to be determined experimentally for optimal results.
The suggested script is tailored to run on the QRIS Awoonga HPC cluster, using a Conda environment to install the required software (without requiring sudo rights). The functionality to limit the search space to particular taxonomic groups relies on the new version of the BLAST databases (version 5, see quick reference guide) and on BLAST v2.10+.
Conda is highly recommended for installing the required software, but the tools can also be installed manually; either way they must be accessible from the command line (in the user $PATH variable). A quick availability check is shown after the list:
- GNU parallel[3] - can be downloaded from the official website https://www.gnu.org/software/parallel/, or installed using a distribution package manager (rpm, apt-get, brew, etc.)
- awk, sed, grep - these Unix tools should be available with every *NIX distribution (including MacOS)
- NCBI BLAST v2.10+ suite[4] - can be downloaded and installed from the NCBI ftp repository
- NCBI Entrez Direct E-utilities[5] - these utilities are required to retrieve TaxId information and can be downloaded from the NCBI website
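A quick way to confirm that the tools are reachable once installed (a minimal sketch; the executable names are the ones shipped with each package):
# each line should print a path; a "MISSING" line means the tool is not on the $PATH
for tool in parallel gawk sed grep blastx blastdbcmd blastdb_aliastool \
            update_blastdb.pl get_species_taxids.sh esearch; do
  command -v "$tool" || echo "MISSING: $tool"
done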
1. Set up a Conda environment with the required tools (see installation instructions)
# if conda is not installed
# Download conda and follow the prompts to initiate it in the shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh
# make sure conda-forge and bioconda channels are available and prioritised
conda config --add channels conda-forge
conda config --append channels bioconda
conda config --append channels anaconda
# install basic packages that are recommended
conda install libgcc gnutls libuuid pandoc qt scipy \
ncurses readline git libgfortran pigz parallel gawk # fonts-continuum
# create a specialised environment with the requested packages
conda create --clone root -n blast-tools biopython entrez-direct blast
conda activate blast-tools
conda update -c hcc blast # this is required until BLAST v2.10+ is available on bioconda
# Fix all perl executables to use the perl version in conda
find $CONDA_PREFIX -name "*.pl" | parallel -q sed -i.bak 's|!/usr/bin/perl|!/usr/bin/env perl|' {}
# Fix get_species_taxids.sh to work with conda installation
sed -i.bak 's|export PATH=|# export PATH=|' `which get_species_taxids.sh`
JOB_NAME=miniblastx_array
DATE=`date +%d_%m_%Y`
mkdir -p ~/90days/${JOB_NAME}_${DATE} && cd ~/90days/${JOB_NAME}_${DATE}
cp query.fasta ./
get_species_taxids.sh -n Streptophyta
Taxid: 35493
rank: phylum
division: green plants
scientific name: Streptophyta
common name:
1 matche(s) found.
# save taxids of all species under Streptophyta to a file
get_species_taxids.sh -t 35493 > my_tax.txids
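# sanity check: count how many taxids were retrieved
wc -l my_tax.txids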
# create a folder for the databases (in a `scratch` folder)
mkdir ~/30days/db
# setup an NCBI configuration file (needed for taxonomy identification)
printf "; Start the section for BLAST configuration\n[BLAST]\n; Specifies the path where BLAST databases are installed\nBLASTDB=$HOME/30days/db\n" > $HOME/.ncbirc
DB="nr"
DB_SUB="${DB}_green_plants" # note the braces; $DB__green_plants would expand a different (empty) variable
# submit the download job to the cluster
UPDATE_ID=$( echo "source $HOME/.bashrc; conda activate blast-tools; cd ~/30days/db ; \
update_blastdb.pl --blastdb_version 5 --num_threads \$NCPUS --source GCS taxdb $DB" |
qsub -A qris-gu -N update_blast -l select=1:ncpus=6:mem=8GB,walltime=2:30:00 | grep -E -o "^[0-9]+" )
# verify that the job finished successfully
qstat -fx $UPDATE_ID
grep "ExitStatus" *.e$UPDATE_ID.*
# extract accessions matching the TaxIDs (takes some time and memory)
blastdbcmd -db $DB -taxidlist my_tax.txids -outfmt "%a" -target_only > my_tax.pacc
# convert accession list to binary format
# blastdb_aliastool -seqid_file_in my_tax.pacc
# subset the whole database
blastdb_aliastool -db $DB -taxidlist my_tax.txids -dbtype prot -out $DB_SUB -title "Streptophyta (green plants) subset of the $DB db"
# get information of the database (number of sequences and size)
DB_INFO=( $(blastdbcmd -info -db $DB_SUB | gawk 'NR==2{gsub(",", ""); print $1, $3}') )
DB_SIZE=${DB_INFO[1]}
SEQ_NUM=${DB_INFO[0]}
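# report the subset db size (number of sequences and total residues)
echo "$DB_SUB: $SEQ_NUM sequences, $DB_SIZE residues"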
Determining the optimal numbers of chunks into which to split the query and the db is a crucial step; it is still under investigation and obviously depends on the size of the original input fasta file and the extent of the db. It seems that each fasta sub-query should optimally be within the 50-250kb range to run efficiently.
Currently the user can set the total number of jobs to run with the environment variable NJOBS, as well as the number of parts into which to split the input fasta file (QUERY_CHUNKS). The script then calculates into how many parts to split the db sequences ($NJOBS/$QUERY_CHUNKS). This number of db chunks must be an integer; it is set as CHUNK_MULTIPLIER and used by GNU Parallel to split the input file and db sequences. Two examples (a rough sizing sketch follows them):
- The input fasta file is 100Mb, so it is split into 500 chunks (QUERY_CHUNKS=500) to stay within the 50-250kb range per chunk. A total of 2,000 jobs is set (NJOBS=2000), so the db will be split into 4 parts containing roughly equal numbers of sequences.
- The input fasta file is 500Mb, so we can split it into 2,000 chunks (QUERY_CHUNKS=2000) and set the total number of jobs to 4,000 (NJOBS=4000), so the db sequences will be split into 2 parts.
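As a rough starting point (an illustrative sketch, not a tested rule), QUERY_CHUNKS can be derived from the size of the input file, aiming near the middle of the 50-250kb range:
# rough sizing sketch: aim at ~200kb per query chunk
QUERY_SIZE=$( stat -c %s query.fasta )   # input size in bytes (GNU stat; MacOS needs 'stat -f %z')
QUERY_CHUNKS=$(( QUERY_SIZE / 200000 ))  # number of ~200kb chunks
echo $QUERY_CHUNKS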
# define variables for splitting the input and database (accession list) files
NCORES=6 # number of cores for each job
NJOBS=2000 # how many sub-queries to create and submit
QUERY_CHUNKS=500 # into how many chunks to split the query file (NJOBS must be a whole multiple of QUERY_CHUNKS)
CHUNK_MULTIPLIER=$(( $NJOBS/$QUERY_CHUNKS ))
# create empty folders
RESLOC=results_split # name of output folder
DBTMPLOC=pacc_split # name of folder of accession sub-lists
QUERYTMPLOC=query_split # folder for query subset fasta files
mkdir $DBTMPLOC $QUERYTMPLOC $RESLOC
# split input files (this can be sent to the cluster if the input and db files are VERY VERY big)
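# note: with --pipepart, a negative --block value (-N) creates N chunks per jobslot,
# i.e. N x the -j value chunks in total (see the GNU Parallel manual)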
parallel --pipepart -a query.fasta --block -$(( $QUERY_CHUNKS/$CHUNK_MULTIPLIER )) -j$CHUNK_MULTIPLIER --recstart '>' "cat > $QUERYTMPLOC/query_{#}"
parallel --pipepart -a my_tax.pacc -j$CHUNK_MULTIPLIER --block -$(( $NJOBS/$QUERY_CHUNKS/$CHUNK_MULTIPLIER )) "cat > $DBTMPLOC/seqids_{#}"
# the 5 commented lines below are not needed currently (experimental)
# DB_CHUNK=500000 # how many sequences to include in each subset of the db
# QUERY_CHUNK=100k # what file size for each subset of the query (will not break fasta records)
# cat my_tax.pacc | parallel -j 12 -kN$DB_CHUNK --pipe "cat > $DBTMPLOC/seqids_{#}"
# cat query.fasta | parallel -j 12 --block $QUERY_CHUNK \
# --recstart '>' --pipe "cat > $QUERYTMPLOC/query_{#}"
# verify the number of parts
QUERY_PARTS=$( find $QUERYTMPLOC -type f -name "query_*" ! -name "*.*" | wc -l)
DB_PARTS=$( find $DBTMPLOC -type f -name "seqids_*" ! -name "*.*" | wc -l)
JOBS_NUM=$(( $QUERY_PARTS * $DB_PARTS ))
echo $JOBS_NUM # make sure this number does not exceed 2000
# convert the accession list files to binary format
parallel blastdb_aliastool -seqid_file_in ::: `find $DBTMPLOC -type f -name "seqids_*" ! -name "*.*"`
# Create the jobs for all combinations of query and database parts (the --dryrun flag is essential!!)
parallel --dryrun "blastx -db $DB_SUB -query $QUERYTMPLOC/query_{1} -outfmt '6 std stitle ssciname scomname' -evalue 1e-10 -max_target_seqs 10 -num_threads $NCORES -seqidlist $DBTMPLOC/seqids_{2}.bsl -out $RESLOC/results.query{1}.seqids{2}" ::: `seq 1 $QUERY_PARTS` ::: `seq 1 $DB_PARTS` > $JOB_NAME.cmds
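# inspect the generated commands and confirm that their number matches $JOBS_NUM
head -n 2 $JOB_NAME.cmds
wc -l < $JOB_NAME.cmds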
# create a template PBS script for each job in the array
echo '#!/bin/bash
#PBS -l' "select=1:ncpus=$NCORES:mem=24GB,walltime=5:00:00
#PBS -A qris-gu
cd \$PBS_O_WORKDIR
source ~/.bashrc
conda activate blast-tools
awk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \${CMDS_FILE} | bash" > $JOB_NAME.pbs
# send all jobs to the cluster as a job array
ARRAY_ID=$( qsub -J1-$(cat $JOB_NAME.cmds | wc -l) -N $JOB_NAME -v CMDS_FILE=$JOB_NAME.cmds $JOB_NAME.pbs | grep -E -o "^[0-9]+" )
# check that the array finished all jobs
qstat -fx "$ARRAY_ID[]"
# Check that all jobs finished successfully
find . -name "*.e$ARRAY_ID.*" | xargs grep "ExitStatus"
# Check if any jobs did NOT finish successfully
find . -name "*.e$ARRAY_ID.*" | xargs grep "ExitStatus:[^0]"
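# optional sketch: resubmit only the failed jobs by extracting their array indices
# from the names of the PBS error files (<job_name>.e<ARRAY_ID>.<index>)
grep -l "ExitStatus:[^0]" *.e$ARRAY_ID.* | grep -E -o "[0-9]+$" > failed.idx
# pull the matching command lines into a new cmds file and submit it as a fresh array
awk 'NR==FNR{idx[$1]; next} (FNR in idx)' failed.idx $JOB_NAME.cmds > $JOB_NAME.retry.cmds
qsub -J1-$(wc -l < $JOB_NAME.retry.cmds) -N ${JOB_NAME}_retry -v CMDS_FILE=$JOB_NAME.retry.cmds $JOB_NAME.pbs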
# merge output files
find $RESLOC -name "results.query*" -exec cat {} \; > results.blastx.outfmt6
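# optional: keep only the best hit per query (sort by query id, then by bit-score, column 12, descending)
sort -k1,1 -k12,12gr results.blastx.outfmt6 | awk '!seen[$1]++' > results.blastx.best.outfmt6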
# find and remove empty files
find . -size 0 -exec rm {} +
# copy results to permanent location
cp results.blastx.outfmt6 ~/data/
cd ~/data
# verify that the new file is there and looks complete
ls -lh ~/data/results.blastx.outfmt6
rm -r ~/90days/${JOB_NAME}_${DATE}
rm -r ~/30days/db/* # only if not needed in the very near future!
- incremental-BLAST[6] can be useful if a recurring BLAST job is performed against a database that keeps changing (additional sequences or taxonomy groups).
- DIAMOND[7] is very fast (500x-20,000x faster than BLAST), but is useful only when annotating against protein databases (BLASTp and BLASTx). It requires its own db format, so the relevant sequences must be extracted from the NCBI database as a fasta file and reformatted, as sketched below.
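A minimal sketch of that route (filenames are illustrative and flags may differ between DIAMOND versions):
# extract the Streptophyta subset as fasta and format it as a DIAMOND db
blastdbcmd -db $DB -taxidlist my_tax.txids -target_only > streptophyta.faa
diamond makedb --in streptophyta.faa --db streptophyta # creates streptophyta.dmnd
# search with the same query and tabular (outfmt 6) output
diamond blastx --db streptophyta --query query.fasta --outfmt 6 --evalue 1e-10 \
    --max-target-seqs 10 --threads $NCORES --out results.diamond.outfmt6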
1. Mikailov M, Luo F-J, Barkley S, Valleru L, Whitney S, Liu Z, et al. Scaling bioinformatics applications on HPC. BMC Bioinformatics. 2017;18 Suppl 14. doi:10.1186/s12859-017-1902-7.
2. Yim WC, Cushman JC. Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments. PeerJ. 2017;5. doi:10.7717/peerj.3486.
3. Tange O. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine. 2011;36:42–7.
4. Camacho C, Madden T, Ma N, Tao T, Agarwala R, Morgulis A. BLAST Command Line Applications User Manual. 2013. http://www.ncbi.nlm.nih.gov/books/NBK1763/.
5. Kans J. Entrez Direct: E-utilities on the UNIX Command Line. National Center for Biotechnology Information (US); 2020. https://www.ncbi.nlm.nih.gov/books/NBK179288/. Accessed 27 May 2020.
6. Dash S, Rahman S, Hines HM, Feng W. Incremental BLAST: incremental addition of new sequence databases through e-value correction. bioRxiv. 2018;476218.
7. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12:59–60.