NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr blast database (dated 25-Nov-2015) using NCBI 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it was something to do with my local exe/blastdb, but the same problem is also apparent on the NCBI blastp web interface (as of 30-Nov-2015)
To test online, go to http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
Enter the following FASTA sequence in the query text box:
>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD
Database nr (default) Leave other fields blank Check that the algorithm parameters say: Max target sequences 100 (default) Expect threshold: 1e-5 (instead of default 100) Leave rest of parameters as default. Click BLAST:
- The first 20 hits are all eukaryotic (top hit Trichuris trichura 8e-36). No bacterial hits in results at all.
If you now change the Max target sequences to 500, and rerun, you see:
- The top hit is Bacteria (Burkholderia kururiensis 2e-40)
This is reproducible on the command line (using the versions of blastp mentioned above):
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 100 -evalue 1e-5 >out.1e-5.max100.txt
blastp -query input.fasta -db nr -outfmt 6 -max_target_seqs 500 -evalue 1e-5 >out.1e-5.max500.txt
Can someone else confirm that they have seen this bug? It's possible I am doing something silly, but if not, then this is a serious bug, because max_target_seqs is only supposed to change the number of matches returned, not the TOP hits. See http://www.ncbi.nlm.nih.gov/books/NBK279682/
Screenshots attached:
Issue resolved thanks to prompt reply from @ncbi [email protected]
Hello,
Thank you for the report. We don't consider this a bug, but I agree that we should document this possibility better. This can happen because limits, including max target sequences, are applied in an early ungapped phase of the algorithm, as well as later. In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits. In your case, relaxing the limit to 200 appears to have allowed hits that would have been excluded in the ungapped phase at 100 max target sequences to rise.