NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr BLAST database (dated 25-Nov-2015) using NCBI BLAST+ versions 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it was something to do with my local executable or BLAST database, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015).
To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
Enter the following FASTA sequence in the query text box:
>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD
Database: nr (default)
Leave other fields blank.
Check that the algorithm parameters say:
Max target sequences: 100 (default)
Expect threshold: 1e-5 (instead of default 100)
Leave the rest of the parameters as default.
Click BLAST:
- The first 20 hits are all eukaryotic (top hit Trichuris trichiura, 8e-36). There are no bacterial hits in the results at all.
If you now change the Max target sequences to 500, and rerun, you see:
- The top hit is bacterial (Burkholderia kururiensis, 2e-40)
This is reproducible on the command line (using the versions of blastp mentioned above):
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt
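One way to check the two output files against each other is a small script like the sketch below. It assumes the default -outfmt 6 column layout, where the subject ID is the second tab-separated column, and it counts each subject only once because a subject with several HSPs appears on several lines. The function names top_subjects and compare_top_hits are my own and not part of BLAST.

```python
def top_subjects(lines, n=20):
    """Return the first n distinct subject IDs (column 2), in file order."""
    seen = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not line or line.startswith("#"):
            continue
        sseqid = line.split("\t")[1]
        if sseqid not in seen:
            seen.append(sseqid)
            if len(seen) == n:
                break
    return seen


def compare_top_hits(lines_a, lines_b, n=20):
    """Compare the top n distinct subjects of two tabular BLAST outputs."""
    a = top_subjects(lines_a, n)
    b = top_subjects(lines_b, n)
    return {
        "same_order": a == b,
        "only_in_a": [s for s in a if s not in b],
        "only_in_b": [s for s in b if s not in a],
    }
```

To use it on the files above, open out.1e-5.max100.txt and out.1e-5.max500.txt and pass the two file objects to compare_top_hits. If max_target_seqs only changed the number of matches returned, same_order would be True for the first 20 hits.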
Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/
Screenshots attached:
Nidhi Shah has shared a test case at https://github.com/shahnidhi/BLAST_maxtargetseq_analysis (they got in touch via a comment on my blog).
Blog part three looks at the test case from Nidhi Shah https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html
Blog part four looks at the internal alignment number limit in the context of nucleotide databases (where composition-based statistics are not used) https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html