The following code snippets demonstrate an approach to substantially speed up BLAST searches of large query files (whole transcriptomes or genomes) against the NCBI nr/nt/refseq databases, by running many small jobs, each annotating a subset of the input sequences against a subset of the database.
The method is based on "divide and conquer" approaches [1–2] that split the search query and the database into multiple small jobs requiring modest resources and time; such jobs can utilise high-priority queues and are therefore ideal for an HPC setting. The exact number of jobs to submit depends on the specifics of the HPC cluster, primarily the number of available nodes, queue limits and system load, and therefore needs to be determined experimentally for optimal results.
The suggested script is tailored to run on the QRIS Awoonga HPC cluster, using a Conda environment to install the required software (without requiring sudo rights). The functionality to limit the search space to particular taxonomic groups relies on the new version of the BLAST databases (version 5, see quick reference guide) and on BLAST v2.10+.
Conda is highly recommended for installing the required software, but the tools can also be installed manually; either way they must be accessible from the command line (in the user $PATH variable). A quick availability check is shown after the list:
- GNU parallel[3] - can be downloaded from the official website https://www.gnu.org/software/parallel/, or installed using a distribution package manager (rpm, apt-get, brew, etc.)
- awk, sed, grep - these Unix tools should be available with every *NIX distribution (including MacOS)
- NCBI BLAST v2.10+ suite[4] - can be downloaded and installed from the NCBI ftp repository
- NCBI Entrez Direct E-utilities[5] - these utilities are required to retrieve TaxId information and can be downloaded from the NCBI website
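A quick way to confirm that the tools are reachable once installed (a minimal sketch; the executable names are the ones shipped with each package):
# each line should print a path; a "MISSING" line means the tool is not on the $PATH
for tool in parallel gawk sed grep blastx blastdbcmd blastdb_aliastool \
            update_blastdb.pl get_species_taxids.sh esearch; do
  command -v "$tool" || echo "MISSING: $tool"
done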
1. Set up a Conda environment with the required tools (see installation instructions)
# if conda is not installed
# Download conda and follow the prompts to initiate it in the shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh
# make sure conda-forge and bioconda channels are available and prioritised
conda config --add channels conda-forge
conda config --append channels bioconda
conda config --append channels anaconda
# install basic packages that are recommended
conda install libgcc gnutls libuuid pandoc qt scipy \
ncurses readline git libgfortran pigz parallel gawk # fonts-continuum
# create a specialised environment with the requested packages
conda create --clone root -n blast-tools biopython entrez-direct blast
conda activate blast-tools
conda update -c hcc blast # this is required until BLAST v2.10+ is available on bioconda
# Fix all perl executables to use the perl version in conda
find $CONDA_PREFIX -name "*.pl" | parallel -q sed -i.bak 's|!/usr/bin/perl|!/usr/bin/env perl|' {}
# Fix get_species_taxids.sh to work with conda installation
sed -i.bak 's|export PATH=|# export PATH=|' `which get_species_taxids.sh`
JOB_NAME=miniblastx_array
DATE=`date +%d_%m_%Y`
mkdir -p ~/90days/${JOB_NAME}_${DATE} && cd ~/90days/${JOB_NAME}_${DATE}
cp query.fasta ./
get_species_taxids.sh -n Streptophyta
Taxid: 35493
rank: phylum
division: green plants
scientific name: Streptophyta
common name:
1 matche(s) found.
# save taxids of all species under Streptophyta to a file
get_species_taxids.sh -t 35493 > my_tax.txids
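# sanity check: count how many taxids were retrieved
wc -l my_tax.txids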
# create a folder for the databases (in a `scratch` folder)
mkdir ~/30days/db
# setup an NCBI configuration file (needed for taxonomy identification)
printf "; Start the section for BLAST configuration\n[BLAST]\n; Specifies the path where BLAST databases are installed\nBLASTDB=$HOME/30days/db\n" > $HOME/.ncbirc
DB="nr"
DB_SUB="${DB}_green_plants" # note the braces; $DB__green_plants would expand a different (empty) variable
# submit the download job to the cluster
UPDATE_ID=$( echo "source $HOME/.bashrc; conda activate blast-tools; cd ~/30days/db ; \
update_blastdb.pl --blastdb_version 5 --num_threads \$NCPUS --source GCS taxdb $DB" |
qsub -A qris-gu -N update_blast -l select=1:ncpus=6:mem=8GB,walltime=2:30:00 | grep -E -o "^[0-9]+" )
# verify that the job finished successfully
qstat -fx $UPDATE_ID
grep "ExitStatus" *.e$UPDATE_ID.*
# extract accessions matching the TaxIDs (takes some time and memory)
blastdbcmd -db $DB -taxidlist my_tax.txids -outfmt "%a" -target_only > my_tax.pacc
# convert accession list to binary format
# blastdb_aliastool -seqid_file_in my_tax.pacc
# subset the whole database
blastdb_aliastool -db $DB -taxidlist my_tax.txids -dbtype prot -out $DB_SUB -title "Streptophyta (green plants) subset of the $DB db"
# get information of the database (number of sequences and size)
DB_INFO=( $(blastdbcmd -info -db $DB_SUB | gawk 'NR==2{gsub(",", ""); print $1, $3}') )
DB_SIZE=${DB_INFO[1]}
SEQ_NUM=${DB_INFO[0]}
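# report the subset db size (number of sequences and total residues)
echo "$DB_SUB: $SEQ_NUM sequences, $DB_SIZE residues"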
Determining the optimal numbers of chunks into which to split the query and the db is a crucial step; it is still under investigation and obviously depends on the size of the original input fasta file and the extent of the db. It seems that each fasta sub-query should optimally be within the 50-250kb range to run efficiently.
Currently the user can set the total number of jobs to run with the environment variable NJOBS, as well as the number of parts into which to split the input fasta file (QUERY_CHUNKS). The script then calculates into how many parts to split the db sequences ($NJOBS/$QUERY_CHUNKS). This number of db chunks must be an integer; it is set as CHUNK_MULTIPLIER and used by GNU Parallel to split the input file and db sequences. Two examples (a rough sizing sketch follows them):
- The input fasta file is 100Mb, so it is split into 500 chunks (QUERY_CHUNKS=500) to stay within the 50-250kb range per chunk. A total of 2,000 jobs is set (NJOBS=2000), so the db will be split into 4 parts containing roughly equal numbers of sequences.
- The input fasta file is 500Mb, so we can split it into 2,000 chunks (QUERY_CHUNKS=2000) and set the total number of jobs to 4,000 (NJOBS=4000), so the db sequences will be split into 2 parts.
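As a rough starting point (an illustrative sketch, not a tested rule), QUERY_CHUNKS can be derived from the size of the input file, aiming near the middle of the 50-250kb range:
# rough sizing sketch: aim at ~200kb per query chunk
QUERY_SIZE=$( stat -c %s query.fasta )   # input size in bytes (GNU stat; MacOS needs 'stat -f %z')
QUERY_CHUNKS=$(( QUERY_SIZE / 200000 ))  # number of ~200kb chunks
echo $QUERY_CHUNKS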
# define variables for splitting the input and database (accession list) files
NCORES=6 # number of cores for each job
NJOBS=2000 # how many sub-queries to create and submit
QUERY_CHUNKS=500 # into how many chunks to split the query file (NJOBS must be a whole multiple of QUERY_CHUNKS)
CHUNK_MULTIPLIER=$(( $NJOBS/$QUERY_CHUNKS ))
# create empty folders
RESLOC=results_split # name of output folder
DBTMPLOC=pacc_split # name of folder of accession sub-lists
QUERYTMPLOC=query_split # folder for query subset fasta files
mkdir $DBTMPLOC $QUERYTMPLOC $RESLOC
# split input files (this can be sent to the cluster if the input and db files are VERY VERY big)
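# note: with --pipepart, a negative --block value (-N) creates N chunks per jobslot,
# i.e. N x the -j value chunks in total (see the GNU Parallel manual)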
parallel --pipepart -a query.fasta --block -$(( $QUERY_CHUNKS/$CHUNK_MULTIPLIER )) -j$CHUNK_MULTIPLIER --recstart '>' "cat > $QUERYTMPLOC/query_{#}"
parallel --pipepart -a my_tax.pacc -j$CHUNK_MULTIPLIER --block -$(( $NJOBS/$QUERY_CHUNKS/$CHUNK_MULTIPLIER )) "cat > $DBTMPLOC/seqids_{#}"
# the 5 commented lines below are not needed currently (experimental)
# DB_CHUNK=500000 # how many sequences to include in each subset of the db
# QUERY_CHUNK=100k # what file size for each subset of the query (will not break fasta records)
# cat my_tax.pacc | parallel -j 12 -kN$DB_CHUNK --pipe "cat > $DBTMPLOC/seqids_{#}"
# cat query.fasta | parallel -j 12 --block $QUERY_CHUNK \
# --recstart '>' --pipe "cat > $QUERYTMPLOC/query_{#}"
# verify the number of parts
QUERY_PARTS=$( find $QUERYTMPLOC -type f -name "query_*" ! -name "*.*" | wc -l)
DB_PARTS=$( find $DBTMPLOC -type f -name "seqids_*" ! -name "*.*" | wc -l)
JOBS_NUM=$(( $QUERY_PARTS * $DB_PARTS ))
echo $JOBS_NUM # make sure this number does not exceed 2000
# convert the accession list files to binary format
parallel blastdb_aliastool -seqid_file_in ::: `find $DBTMPLOC -type f -name "seqids_*" ! -name "*.*"`
# Create the jobs for all combinations of query and database parts (the --dryrun flag is essential!!)
parallel --dryrun "blastx -db $DB_SUB -query $QUERYTMPLOC/query_{1} -outfmt '6 std stitle ssciname scomname' -evalue 1e-10 -max_target_seqs 10 -num_threads $NCORES -seqidlist $DBTMPLOC/seqids_{2}.bsl -out $RESLOC/results.query{1}.seqids{2}" ::: `seq 1 $QUERY_PARTS` ::: `seq 1 $DB_PARTS` > $JOB_NAME.cmds
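# inspect the generated commands and confirm that their number matches $JOBS_NUM
head -n 2 $JOB_NAME.cmds
wc -l < $JOB_NAME.cmds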
# create a template PBS script for each job in the array
echo '#!/bin/bash
#PBS -l' "select=1:ncpus=$NCORES:mem=24GB,walltime=5:00:00
#PBS -A qris-gu
cd \$PBS_O_WORKDIR
source ~/.bashrc
conda activate blast-tools
awk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \${CMDS_FILE} | bash" > $JOB_NAME.pbs
# send all jobs to the cluster as a job array
ARRAY_ID=$( qsub -J1-$(cat $JOB_NAME.cmds | wc -l) -N $JOB_NAME -v CMDS_FILE=$JOB_NAME.cmds $JOB_NAME.pbs | grep -E -o "^[0-9]+" )
# check that the array finished all jobs
qstat -fx "$ARRAY_ID[]"
# Check that all jobs finished successfully
find . -name "*.e$ARRAY_ID.*" | xargs grep "ExitStatus"
# Check if any jobs did NOT finish successfully
find . -name "*.e$ARRAY_ID.*" | xargs grep "ExitStatus:[^0]"
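# optional sketch: resubmit only the failed jobs by extracting their array indices
# from the names of the PBS error files (<job_name>.e<ARRAY_ID>.<index>)
grep -l "ExitStatus:[^0]" *.e$ARRAY_ID.* | grep -E -o "[0-9]+$" > failed.idx
# pull the matching command lines into a new cmds file and submit it as a fresh array
awk 'NR==FNR{idx[$1]; next} (FNR in idx)' failed.idx $JOB_NAME.cmds > $JOB_NAME.retry.cmds
qsub -J1-$(wc -l < $JOB_NAME.retry.cmds) -N ${JOB_NAME}_retry -v CMDS_FILE=$JOB_NAME.retry.cmds $JOB_NAME.pbs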
# merge output files
find $RESLOC -name "results.query*" -exec cat {} \; > results.blastx.outfmt6
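# optional: keep only the best hit per query (sort by query id, then by bit-score, column 12, descending)
sort -k1,1 -k12,12gr results.blastx.outfmt6 | awk '!seen[$1]++' > results.blastx.best.outfmt6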
# find and remove empty files
find . -size 0 -exec rm {} +
# copy results to permanent location
cp results.blastx.outfmt6 ~/data/
cd ~/data
# verify that the new file is there and looks complete
ls -lh ~/data/results.blastx.outfmt6
rm -r ~/90days/${JOB_NAME}_${DATE}
rm -r ~/30days/db/* # only if not needed in the very near future!
- incremental-BLAST[6] can be useful if a recurring BLAST job is performed against a database that keeps changing (additional sequences or taxonomy groups).
- DIAMOND[7] is very fast (500x-20,000x faster than BLAST), but is useful only when annotating against protein databases (BLASTp and BLASTx). It requires its own db format, so the relevant sequences must be extracted from the NCBI database as a fasta file and reformatted, as sketched below.
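A minimal sketch of that route (filenames are illustrative and flags may differ between DIAMOND versions):
# extract the Streptophyta subset as fasta and format it as a DIAMOND db
blastdbcmd -db $DB -taxidlist my_tax.txids -target_only > streptophyta.faa
diamond makedb --in streptophyta.faa --db streptophyta # creates streptophyta.dmnd
# search with the same query and tabular (outfmt 6) output
diamond blastx --db streptophyta --query query.fasta --outfmt 6 --evalue 1e-10 \
    --max-target-seqs 10 --threads $NCORES --out results.diamond.outfmt6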
1. Mikailov M, Luo F-J, Barkley S, Valleru L, Whitney S, Liu Z, et al. Scaling bioinformatics applications on HPC. BMC Bioinformatics. 2017;18 Suppl 14. doi:10.1186/s12859-017-1902-7.
2. Yim WC, Cushman JC. Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments. PeerJ. 2017;5. doi:10.7717/peerj.3486.
3. Tange O. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine. 2011;36:42–7.
4. Camacho C, Madden T, Ma N, Tao T, Agarwala R, Morgulis A. BLAST Command Line Applications User Manual. 2013. http://www.ncbi.nlm.nih.gov/books/NBK1763/.
5. Kans J. Entrez Direct: E-utilities on the UNIX Command Line. National Center for Biotechnology Information (US); 2020. https://www.ncbi.nlm.nih.gov/books/NBK179288/. Accessed 27 May 2020.
6. Dash S, Rahman S, Hines HM, Feng W. Incremental BLAST: incremental addition of new sequence databases through e-value correction. bioRxiv. 2018;476218.
7. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12:59–60.