Created
October 3, 2016 17:16
-
-
Save MikeTrizna/95658087d7e4f9ef6166a351a517ba38 to your computer and use it in GitHub Desktop.
Demo of a full BLAST creation workflow
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I'm going to start out with a slightly different query than yours (to limit the number of sequences in the database). Here's the URL of the search, which as of today gives 10,550 results: https://www.ncbi.nlm.nih.gov/nuccore/?term=rbcL%5BGENE%5D+AND+barcode%5Bkeyword%5D." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"%%bash\n", | |
"esearch -db nuccore -query \"rbcL[GENE] AND barcode[keyword]\" | \\\n", | |
"efetch -format fasta | \\\n", | |
"grep \".\" > rbcl.fasta" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Next I'm going to create the BLAST database. The one thing I will change from the original example is to omit the \"title\" parameter. I see it as redundant with the \"name\" parameter, and often get them confused in examples. Maybe that was the problem you encountered?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\n", | |
"\n", | |
"Building a new DB, current time: 10/03/2016 13:13:17\n", | |
"New DB name: /Users/miketrizna/Dropbox (Smithsonian)/custom_blast/answer_question/rbcl\n", | |
"New DB title: rbcl.fasta\n", | |
"Sequence type: Nucleotide\n", | |
"Keep MBits: T\n", | |
"Maximum file size: 1000000000B\n", | |
"Adding sequences from FASTA; added 10550 sequences in 0.544506 seconds.\n" | |
] | |
} | |
], | |
"source": [ | |
"%%bash\n", | |
"makeblastdb -in rbcl.fasta -parse_seqids -dbtype nucl -out rbcl" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now I'll run an \"ls\" command to show you the current contents of the directory. All of those files (other than rbcl.fasta and this .ipynb file) make up the BLAST database." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Answering BLAST question.ipynb rbcl.nni\r\n", | |
"rbcl.fasta rbcl.nog\r\n", | |
"rbcl.nhr rbcl.nsd\r\n", | |
"rbcl.nin rbcl.nsi\r\n", | |
"rbcl.nnd rbcl.nsq\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now I'll grab some rbcL genes to use as a test query. Here I make sure that they don't have the BARCODE keyword (to ensure they're not in the database), and I also filter down to the last 10 days. This gives me 132 matching sequences." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%bash\n", | |
"esearch -db nuccore -query \"(rbcL[GENE] AND 0:1000[SLEN]) NOT barcode[keyword]\" | \\\n", | |
"efilter -days 10 -datetype PDAT | \\\n", | |
"efetch -format fasta | \\\n", | |
"grep \".\" > rbcl_query.fasta" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"132\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!grep -c \"^>\" rbcl_query.fasta" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Finally, I'll run the BLAST search -- making sure to use the same name in the \"-db\" parameter as the \"-out\" parameter I used above." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"\n", | |
"real\t0m20.531s\n", | |
"user\t0m20.376s\n", | |
"sys\t0m0.072s\n" | |
] | |
} | |
], | |
"source": [ | |
"%%bash\n", | |
"time \\\n", | |
"blastn -query rbcl_query.fasta -db rbcl -out blast_results.tsv \\\n", | |
" -max_target_seqs 5 \\\n", | |
" -outfmt \"6 qseqid sacc stitle pident length\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And finally, as a bonus, here's the same BLAST search with the addition of the \"num_threads\" parameter. You can see the runtime drops significantly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"\n", | |
"real\t0m4.951s\n", | |
"user\t0m34.484s\n", | |
"sys\t0m0.178s\n" | |
] | |
} | |
], | |
"source": [ | |
"%%bash\n", | |
"time \\\n", | |
"blastn -query rbcl_query.fasta -db rbcl -out blast_results.tsv \\\n", | |
" -max_target_seqs 5 \\\n", | |
" -outfmt \"6 qseqid sacc stitle pident length\" \\\n", | |
" -num_threads 8" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"anaconda-cloud": {}, | |
"kernelspec": { | |
"display_name": "Python [conda root]", | |
"language": "python", | |
"name": "conda-root-py" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment