Last active
January 2, 2016 12:59
-
-
Save BenLangmead/8307011 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"metadata": { | |
"name": "" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"FASTA\n", | |
"=====\n", | |
"\n", | |
"This notebook briefly explores the [FASTA] format, a very common format for storing DNA sequences. FASTA is the preferred format for storing *reference genomes*.\n", | |
"\n", | |
"FASTA and FASTQ are rather similar, but FASTQ is almost always used for storing *sequencing reads* (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequencines (without associated quality values).\n", | |
"\n", | |
"Before delving into the format, I should mention that there are great tools and libraries for parsing and manipulating FASTA, e.g. [SAMtools], [FASTX], and [BioPython]'s [SeqIO] module. If your needs are relatively simple, you might try using these tools and libraries and skip reading this document.\n", | |
"\n", | |
"[FASTA]: http://en.wikipedia.org/wiki/FASTA_format\n", | |
"[BioPython]: http://biopython.org/wiki/Main_Page\n", | |
"[SeqIO]: http://biopython.org/wiki/SeqIO\n", | |
"[SAMtools]: http://samtools.sourceforge.net/\n", | |
"[FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Basic format\n", | |
"\n", | |
"Here is the basic format:\n", | |
"\n", | |
" >sequence1_short_name with optional additional info after whitespace\n", | |
" ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA\n", | |
" GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC\n", | |
" AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGAT\n", | |
" >sequence2_short_name with optional additional info after whitespace\n", | |
" GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCG\n", | |
" ATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATA\n", | |
" ATATAG" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A line starting with a `>` (greater-than) sign indicates the beginning of a new sequence and specifies its name. Take the first line above. Everything after the `>` up to and excluding the first whitespace character (`sequence1_short_name`), is the \"short name.\" Everything after the `>` up to the end of the *line* (`sequence1_short_name with optional additional info after whitespace`) is the \"long name.\" We usually use the short name when referring to FASTA sequences.\n", | |
"\n", | |
"The next three lines contain nucleotides. This FASTA file has a \"width\" of 70; that is, 70 nucleotides are allowed per line. If the sequence is longer than the width, the sequence \"wraps\" (spills over) onto subsequent lines. Not every FASTA file has the same width; it might be 60 or 50 instead of 70. But a *given* FASTA file must use the *same width throughout*. Only the final line of the sequence is allowed to have fewer characters than the width, as you see above.\n", | |
"\n", | |
"The sequences above are made up. Here's a real-world reference sequence (the [human mitochondrial genome]) in FASTA format:\n", | |
"\n", | |
"[human mitochondrial genome]: http://en.wikipedia.org/wiki/Human_mitochondrial_genetics" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from gzip import GzipFile\n", | |
"from StringIO import StringIO\n", | |
"from urllib import urlopen\n", | |
"fagz = urlopen('ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/non-nuclear/assembled_chromosomes/FASTA/chrMT.fa.gz').read()\n", | |
"print GzipFile(fileobj=StringIO(fagz)).read()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
">gi|113200490|gb|J01415.2|HUMMTCG Homo sapiens mitochondrion, complete genome\n", | |
"GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG\n", | |
"GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC\n", | |
"CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA\n", | |
"ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC\n", | |
"ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA\n", | |
"AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC\n", | |
"TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA\n", | |
"CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC\n", | |
"AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC\n", | |
"ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA\n", | |
"GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC\n", | |
"AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA\n", | |
"ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGA\n", | |
"TTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACT\n", | |
"CACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACAC\n", | |
"ACAATAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATC\n", | |
"AACAAAACTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATC\n", | |
"CCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA\n", | |
"CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTC\n", | |
"AAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAGAAAACTACGATAGCCCTTAT\n", | |
"GAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGA\n", | |
"AGCGCGTACACACCGCCCGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCA\n", | |
"TTTATATAGAGGAGACAAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAACCAGAGTGTA\n", | |
"GCTTAACACAAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA\n", | |
"GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCG\n", | |
"ATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATA\n", | |
"ATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAATTAACTAGAAATAACTTTGCAAGGAGAGCC\n", | |
"AAAGCTAAGACCCCCGAAACCAGACGAGCTACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCA\n", | |
"AAATAGTGGGAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGAT\n", | |
"AGAATCTTAGTTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTC\n", | |
"CAAAGAGGAACAGCTCTTTGGACACTAGGAAAAAACCTTGTAGAGAGAGTAAAAAATTTAACACCCATAG\n", | |
"TAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCACTACCTAAAAAATCCCAAAC\n", | |
"ATATAACTGAACTCCTCACACCCAATTGGACCAATCTATCACCCTATAGAAGAACTAATGTTAGTATAAG\n", | |
"TAACATGAAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTAAAACACTGAACTGACAATTAACAGCCC\n", | |
"AATATCTACAATCAACCAACAAGTCATTATTACCCTCACTGTCAACCCAACACAGGCATGCTCATAAGGA\n", | |
"AAGGTTAAAAAAAGTAAAAGGAACTCGGCAAATCTTACCCCGCCTGTTTACCAAAAACATCACCTCTAGC\n", | |
"ATCACCAGTATTAGAGGCACCGCCTGCCCAGTGACACATGTTTAACGGCCGCGGTACCCTAACCGTGCAA\n", | |
"AGGTAGCATAATCACTTGTTCCTTAAATAGGGACCTGTATGAATGGCTCCACGAGGGTTCAGCTGTCTCT\n", | |
"TACTTTTAACCAGTGAAATTGACCTGCCCGTGAAGAGGCGGGCATAACACAGCAAGACGAGAAGACCCTA\n", | |
"TGGAGCTTTAATTTATTAATGCAAACAGTACCTAACAAACCCACAGGTCCTAAACTACCAAACCTGCATT\n", | |
"AAAAATTTCGGTTGGGGCGACCTCGGAGCAGAACCCAACCTCCGAGCAGTACATGCTAAGACTTCACCAG\n", | |
"TCAAAGCGAACTACTATACTCAATTGATCCAATAACTTGACCAACGGAACAAGTTACCCTAGGGATAACA\n", | |
"GCGCAATCCTATTCTAGAGTCCATATCAACAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCG\n", | |
"ATGGTGCAGCCGCTATTAAAGGTTCGTTTGTTCAACGATTAAAGTCCTACGTGATCTGAGTTCAGACCGG\n", | |
"AGTAATCCAGGTCGGTTTCTATCTACNTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCT\n", | |
"ACTTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTATTATACCCACACCCACCCAAGA\n", | |
"ACAGGGTTTGTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTCAGAGGTTCAAT\n", | |
"TCCTCTTCTTAACAACATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCA\n", | |
"TTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCC\n", | |
"CCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCAC\n", | |
"ATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCC\n", | |
"CTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAG\n", | |
"CCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGC\n", | |
"AGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGC\n", | |
"TCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGG\n", | |
"CCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTC\n", | |
"CGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATAC\n", | |
"ACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCC\n", | |
"CTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAAC\n", | |
"AGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTA\n", | |
"GCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAAGAAATAT\n", | |
"GTCTGATAAAAGAGTTACTTTGATAGAGTAAATAATAGGAGCTTAAACCCCCTTATTTCTAGGACTATGA\n", | |
"GAATCGAACCCATCCCTGAGAATCCAAAATTCTCCGTGCCACCTATCACACCCCATCCTAAAGTAAGGTC\n", | |
"AGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATACCCTTCCCGTACTAATTAATCCCCT\n", | |
"GGCCCAACCCGTCATCTACTCTACCATCTTTGCAGGCACACTCATCACAGCGCTAAGCTCGCACTGATTT\n", | |
"TTTACCTGAGTAGGCCTAGAAATAAACATGCTAGCTTTTATTCCAGTTCTAACCAAAAAAATAAACCCTC\n", | |
"GTTCCACAGAAGCTGCCATCAAGTATTTCCTCACGCAAGCAACCGCATCCATAATCCTTCTAATAGCTAT\n", | |
"CCTCTTCAACAATATACTCTCCGGACAATGAACCATAACCAATACTACCAATCAATACTCATCATTAATA\n", | |
"ATCATAATAGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAG\n", | |
"GCACCCCTCTGACATCCGGCCTGCTTCTTCTCACATGACAAAAACTAGCCCCCATCTCAATCATATACCA\n", | |
"AATCTCTCCCTCACTAAACGTAAGCCTTCTCCTCACTCTCTCAATCTTATCCATCATAGCAGGCAGTTGA\n", | |
"GGTGGATTAAACCAAACCCAGCTACGCAAAATCTTAGCATACTCCTCAATTACCCACATAGGATGAATAA\n", | |
"TAGCAGTTCTACCGTACAACCCTAACATAACCATTCTTAATTTAACTATTTATATTATCCTAACTACTAC\n", | |
"CGCATTCCTACTACTCAACTTAAACTCCAGCACCACGACCCTACTACTATCTCGCACCTGAAACAAGCTA\n", | |
"ACATGACTAACACCCTTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTTT\n", | |
"TGCCCAAATGGGCCATTATCGAAGAATTCACAAAAAACAATAGCCTCATCATCCCCACCATCATAGCCAC\n", | |
"CATCACCCTCCTTAACCTCTACTTCTACCTACGCCTAATCTACTCCACCTCAATCACACTACTCCCCATA\n", | |
"TCTAACAACGTAAAAATAAAATGACAGTTTGAACATACAAAACCCACCCCATTCCTCCCCACACTCATCG\n", | |
"CCCTTACCACGCTACTCCTACCTATCTCCCCTTTTATACTAATAATCTTATAGAAATTTAGGTTAAATAC\n", | |
"AGACCAAGAGCCTTCAAAGCCCTCAGTAAGTTGCAATACTTAATTTCTGTAACAGCTAAGGACTGCAAAA\n", | |
"CCCCACTCTGCATCAACTGAACGCAAATCAGCCACTTTAATTAAGCTAAGCCCTTACTAGACCAATGGGA\n", | |
"CTTAAACCCACAAACACTTAGTTAACAGCTAAGCACCCTAATCAACTGGCTTCAATCTACTTCTCCCGCC\n", | |
"GCCGGGAAAAAAGGCGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTTGCAATTCAATATGA\n", | |
"AAATCACCTCGGAGCTGGTAAAAAGAGGCCTAACCCCTGTCTTTAGATTTACAGTCCAATGCTTCACTCA\n", | |
"GCCATTTTACCTCACCCCCACTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGG\n", | |
"AACACTATACCTATTATTCGGCGCATGAGCTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCC\n", | |
"GAGCTGGGCCAGCCAGGCAACCTTCTAGGTAACGACCACATCTACAACGTTATCGTCACAGCCCATGCAT\n", | |
"TTGTAATAATCTTCTTCATAGTAATACCCATCATAATCGGAGGCTTTGGCAACTGACTAGTTCCCCTAAT\n", | |
"AATCGGTGCCCCCGATATGGCGTTTCCCCGCATAAACAACATAAGCTTCTGACTCTTACCTCCCTCTCTC\n", | |
"CTACTCCTGCTCGCATCTGCTATAGTGGAGGCCGGAGCAGGAACAGGTTGAACAGTCTACCCTCCCTTAG\n", | |
"CAGGGAACTACTCCCACCCTGGAGCCTCCGTAGACCTAACCATCTTCTCCTTACACCTAGCAGGTGTCTC\n", | |
"CTCTATCTTAGGGGCCATCAATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATAC\n", | |
"CAAACGCCCCTCTTCGTCTGATCCGTCCTAATCACAGCAGTCCTACTTCTCCTATCTCTCCCAGTCCTAG\n", | |
"CTGCTGGCATCACTATACTACTAACAGACCGCAACCTCAACACCACCTTCTTCGACCCCGCCGGAGGAGG\n", | |
"AGACCCCATTCTATACCAACACCTATTCTGATTTTTCGGTCACCCTGAAGTTTATATTCTTATCCTACCA\n", | |
"GGCTTCGGAATAATCTCCCATATTGTAACTTACTACTCCGGAAAAAAAGAACCATTTGGATACATAGGTA\n", | |
"TGGTCTGAGCTATGATATCAATTGGCTTCCTAGGGTTTATCGTGTGAGCACACCATATATTTACAGTAGG\n", | |
"AATAGACGTAGACACACGAGCATATTTCACCTCCGCTACCATAATCATCGCTATCCCCACCGGCGTCAAA\n", | |
"GTATTTAGCTGACTCGCCACACTCCACGGAAGCAATATGAAATGATCTGCTGCAGTGCTCTGAGCCCTAG\n", | |
"GATTCATCTTTCTTTTCACCGTAGGTGGCCTGACTGGCATTGTATTAGCAAACTCATCACTAGACATCGT\n", | |
"ACTACACGACACGTACTACGTTGTAGCCCACTTCCACTATGTCCTATCAATAGGAGCTGTATTTGCCATC\n", | |
"ATAGGAGGCTTCATTCACTGATTTCCCCTATTCTCAGGCTACACCCTAGACCAAACCTACGCCAAAATCC\n", | |
"ATTTCACTATCATATTCATCGGCGTAAATCTAACTTTCTTCCCACAACACTTTCTCGGCCTATCCGGAAT\n", | |
"GCCCCGACGTTACTCGGACTACCCCGATGCATACACCACATGAAACATCCTATCATCTGTAGGCTCATTC\n", | |
"ATTTCTCTAACAGCAGTAATATTAATAATTTTCATGATTTGAGAAGCCTTCGCTTCGAAGCGAAAAGTCC\n", | |
"TAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCACCCTACCACACATTCGA\n", | |
"AGAACCCGTATACATAAAATCTAGACAAAAAAGGAAGGAATCGAACCCCCCAAAGCTGGTTTCAAGCCAA\n", | |
"CCCCATGGCCTCCATGACTTTTTCAAAAAGGTATTAGAAAAACCATTTCATAACTTTGTCAAAGTTAAAT\n", | |
"TATAGGCTAAATCCTATATATCTTAATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCT\n", | |
"ATCATAGAAGAGCTTATCACCTTTCATGATCACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCC\n", | |
"TGTATGCCCTTTTCCTAACACTCACAACAAAACTAACTAATACTAACATCTCAGACGCTCAGGAAATAGA\n", | |
"AACCGTCTGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGCCCTCCCATCCCTACGCATCCTTTAC\n", | |
"ATAACAGACGAGGTCAACGATCCCTCCCTTACCATCAAATCAATTGGCCACCAATGGTACTGAACCTACG\n", | |
"AGTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCATTATTCCTAGAACCAGGCGA\n", | |
"CCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACA\n", | |
"TCACAAGACGTCTTGCACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTC\n", | |
"TAAACCAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGTGGAGC\n", | |
"AAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAAATAGGGCCCGTATTT\n", | |
"ACCCTATAGCACCCCCTCTACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAACCTTTTAAGTT\n", | |
"AAAGATTAAGAGAACCAACACCTCTTTACAGTGAAATGCCCCAACTAAATACTACCGTATGGCCCACCAT\n", | |
"AATTACCCCCATACTCCTTACACTATTCCTCATCACCCAACTAAAAATATTAAACACAAACTACCACCTA\n", | |
"CCTCCCTCACCAAAGCCCATAAAAATAAAAAATTATAACAAACCCTGAGAACCAAAATGAACGAAAATCT\n", | |
"GTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCAGTACTGATCATTCTATTTCCCCCT\n", | |
"CTATTGATCCCCACCTCCAAATATCTCATCAACAACCGACTAATCACCACCCAACAATGACTAATCAAAC\n", | |
"TAACCTCAAAACAAATGATAACCATACACAACACTAAAGGACGAACCTGATCTCTTATACTAGTATCCTT\n", | |
"AATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTA\n", | |
"TCTATAAACCTAGCCATGGCCATCCCCTTATGAGCGGGCACAGTGATTATAGGCTTTCGCTCTAAGATTA\n", | |
"AAAATGCCCTAGCCCACTTCTTACCACAAGGCACACCTACACCCCTTATCCCCATACTAGTTATTATCGA\n", | |
"AACCATCAGCCTACTCATTCAACCAATAGCCCTGGCCGTACGCCTAACCGCTAACATTACTGCAGGCCAC\n", | |
"CTACTCATGCACCTAATTGGAAGCGCCACCCTAGCAATATCAACCATTAACCTTCCCTCTACACTTATCA\n", | |
"TCTTCACAATTCTAATTCTACTGACTATCCTAGAAATCGCTGTCGCCTTAATCCAAGCCTACGTTTTCAC\n", | |
"ACTTCTAGTAAGCCTCTACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAA\n", | |
"AACCCAGCCCATGACCCCTAACAGGGGCCCTCTCAGCCCTCCTAATGACCTCCGGCCTAGCCATGTGATT\n", | |
"TCACTTCCACTCCATAACGCTCCTCATACTAGGCCTACTAACCAACACACTAACCATATACCAATGATGG\n", | |
"CGCGATGTAACACGAGAAAGCACATACCAAGGCCACCACACACCACCTGTCCAAAAAGGCCTTCGATACG\n", | |
"GGATAATCCTATTTATTACCTCAGAAGTTTTTTTCTTCGCAGGATTTTTCTGAGCCTTTTACCACTCCAG\n", | |
"CCTAGCCCCTACCCCCCAATTAGGAGGGCACTGGCCCCCAACAGGCATCACCCCGCTAAATCCCCTAGAA\n", | |
"GTCCCACTCCTAAACACATCCGTATTACTCGCATCAGGAGTATCAATCACCTGAGCTCACCATAGTCTAA\n", | |
"TAGAAAACAACCGAAACCAAATAATTCAAGCACTGCTTATTACAATTTTACTGGGTCTCTATTTTACCCT\n", | |
"CCTACAAGCCTCAGAGTACTTCGAGTCTCCCTTCACCATTTCCGACGGCATCTACGGCTCAACATTTTTT\n", | |
"GTAGCCACAGGCTTCCACGGACTTCACGTCATTATTGGCTCAACTTTCCTCACTATCTGCTTCATCCGCC\n", | |
"AACTAATATTTCACTTTACATCCAAACATCACTTTGGCTTCGAAGCCGCCGCCTGATACTGGCATTTTGT\n", | |
"AGATGTGGTTTGACTATTTCTGTATGTCTCCATCTATTGATGAGGGTCTTACTCTTTTAGTATAAATAGT\n", | |
"ACCGTTAACTTCCAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAA\n", | |
"TAATCAACACCCTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACAT\n", | |
"AGAAAAATCCACCCCTTACGAGTGCGGCTTCGACCCTATATCCCCCGCCCGCGTCCCTTTCTCCATAAAA\n", | |
"TTCTTCTTAGTAGCTATTACCTTCTTATTATTTGATCTAGAAATTGCCCTCCTTTTACCCCTACCATGAG\n", | |
"CCCTACAAACAACTAACCTGCCACTAATAGTTATGTCATCCCTCTTATTAATCATCATCCTAGCCCTAAG\n", | |
"TCTGGCCTATGAGTGACTACAAAAAGGATTAGACTGAACCGAATTGGTATATAGTTTAAACAAAACGAAT\n", | |
"GATTTCGACTCATTAAATTATGATAATCATATTTACCAAATGCCCCTCATTTACATAAATATTATACTAG\n", | |
"CATTTACCATCTCACTTCTAGGAATACTAGTATATCGCTCACACCTCATATCCTCCCTACTATGCCTAGA\n", | |
"AGGAATAATACTATCGCTGTTCATTATAGCTACTCTCATAACCCTCAACACCCACTCCCTCTTAGCCAAT\n", | |
"ATTGTGCCTATTGCCATACTAGTCTTTGCCGCCTGCGAAGCAGCGGTGGGCCTAGCCCTACTAGTCTCAA\n", | |
"TCTCCAACACATATGGCCTAGACTACGTACATAACCTAAACCTACTCCAATGCTAAAACTAATCGTCCCA\n", | |
"ACAATTATATTACTACCACTGACATGACTTTCCAAAAAACACATAATTTGAATCAACACAACCACCCACA\n", | |
"GCCTAATTATTAGCATCATCCCTCTACTATTTTTTAACCAAATCAACAACAACCTATTTAGCTGTTCCCC\n", | |
"AACCTTTTCCTCCGACCCCCTAACAACCCCCCTCCTAATACTAACTACCTGACTCCTACCCCTCACAATC\n", | |
"ATGGCAAGCCAACGCCACTTATCCAGTGAACCACTATCACGAAAAAAACTCTACCTCTCTATACTAATCT\n", | |
"CCCTACAAATCTCCTTAATTATAACATTCACAGCCACAGAACTAATCATATTTTATATCTTCTTCGAAAC\n", | |
"CACACTTATCCCCACCTTGGCTATCATCACCCGATGAGGCAACCAGCCAGAACGCCTGAACGCAGGCACA\n", | |
"TACTTCCTATTCTACACCCTAGTAGGCTCCCTTCCCCTACTCATCGCACTAATTTACACTCACAACACCC\n", | |
"TAGGCTCACTAAACATTCTACTACTCACTCTCACTGCCCAAGAACTATCAAACTCCTGAGCCAACAACTT\n", | |
"AATATGACTAGCTTACACAATAGCTTTTATAGTAAAGATACCTCTTTACGGACTCCACTTATGACTCCCT\n", | |
"AAAGCCCATGTCGAAGCCCCCATCGCTGGGTCAATAGTACTTGCCGCAGTACTCTTAAAACTAGGCGGCT\n", | |
"ATGGTATAATACGCCTCACACTCATTCTCAACCCCCTGACAAAACACATAGCCTACCCCTTCCTTGTACT\n", | |
"ATCCCTATGAGGCATAATTATAACAAGCTCCATCTGCCTACGACAAACAGACCTAAAATCGCTCATTGCA\n", | |
"TACTCTTCAATCAGCCACATAGCCCTCGTAGTAACAGCCATTCTCATCCAAACCCCCTGAAGCTTCACCG\n", | |
"GCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCTCATTACTATTCTGCCTAGCAAACTCAAACTA\n", | |
"CGAACGCACTCACAGTCGCATCATAATCCTCTCTCAAGGACTTCAAACTCTACTCCCACTAATAGCTTTT\n", | |
"TGATGACTTCTAGCAAGCCTCGCTAACCTCGCCTTACCCCCCACTATTAACCTACTGGGAGAACTCTCTG\n", | |
"TGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGACTCAACATACTAGTCACAGC\n", | |
"CCTATACTCCCTCTACATATTTACCACAACACAATGGGGCTCACTCACCCACCACATTAACAACATAAAA\n", | |
"CCCTCATTCACACGAGAAAACACCCTCATGTTCATACACCTATCCCCCATTCTCCTCCTATCCCTCAACC\n", | |
"CCGACATCATTACCGGGTTTTCCTCTTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAA\n", | |
"CAGAGGCTTACGACCCCTTATTTACCGAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAAC\n", | |
"AACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCA\n", | |
"ACTCCAAATAAAAGTAATAACCATGCACACTACTATAACCACCCTAACCCTGACTTCCCTAATTCCCCCC\n", | |
"ATCCTTACCACCCTCGTTAACCCTAACAAAAAAAACTCATACCCCCATTATGTAAAATCCATTGTCGCAT\n", | |
"CCACCTTTATTATCAGTCTCTTCCCCACAACAATATTCATGTGCCTAGACCAAGAAGTTATTATCTCGAA\n", | |
"CTGACACTGAGCCACAACCCAAACAACCCAGCTCTCCCTAAGCTTCAAACTAGACTACTTCTCCATAATA\n", | |
"TTCATCCCTGTAGCATTGTTCGTTACATGGTCCATCATAGAATTCTCACTGTGATATATAAACTCAGACC\n", | |
"CAAACATTAATCAGTTCTTCAAATATCTACTCATCTTCCTAATTACCATACTAATCTTAGTTACCGCTAA\n", | |
"CAACCTATTCCAACTGTTCATCGGCTGAGAGGGCGTAGGAATTATATCCTTCTTGCTCATCAGTTGATGA\n", | |
"TACGCCCGAGCAGATGCCAACACAGCAGCCATTCAAGCAATCCTATACAACCGTATCGGCGATATCGGTT\n", | |
"TCATCCTCGCCTTAGCATGATTTATCCTACACTCCAACTCATGAGACCCACAACAAATAGCCCTTCTAAA\n", | |
"CGCTAATCCAAGCCTCACCCCACTACTAGGCCTCCTCCTAGCAGCAGCAGGCAAATCAGCCCAATTAGGT\n", | |
"CTCCACCCCTGACTCCCCTCAGCCATAGAAGGCCCCACCCCAGTCTCAGCCCTACTCCACTCAAGCACTA\n", | |
"TAGTTGTAGCAGGAATCTTCTTACTCATCCGCTTCCACCCCCTAGCAGAAAATAGCCCACTAATCCAAAC\n", | |
"TCTAACACTATGCTTAGGCGCTATCACCACTCTGTTCGCAGCAGTCTGCGCCCTTACACAAAATGACATC\n", | |
"AAAAAAATCGTAGCCTTCTCCACTTCAAGTCAACTAGGACTCATAATAGTTACAATCGGCATCAACCAAC\n", | |
"CACACCTAGCATTCCTGCACATCTGTACCCACGCCTTCTTCAAAGCCATACTATTTATGTGCTCCGGGTC\n", | |
"CATCATCCACAACCTTAACAATGAACAAGATATTCGAAAAATAGGAGGACTACTCAAAACCATACCTCTC\n", | |
"ACTTCAACCTCCCTCACCATTGGCAGCCTAGCATTAGCAGGAATACCTTTCCTCACAGGTTTCTACTCCA\n", | |
"AAGACCACATCATCGAAACCGCAAACATATCATACACAAACGCCTGAGCCCTATCTATTACTCTCATCGC\n", | |
"TACCTCCCTGACAAGCGCCTATAGCACTCGAATAATTCTTCTCACCCTAACAGGTCAACCTCGCTTCCCC\n", | |
"ACCCTTACTAACATTAACGAAAATAACCCCACCCTACTAAACCCCATTAAACGCCTGGCAGCCGGAAGCC\n", | |
"TATTCGCAGGATTTCTCATTACTAACAACATTTCCCCCGCATCCCCCTTCCAAACAACAATCCCCCTCTA\n", | |
"CCTAAAACTCACAGCCCTCGCTGTCACTTTCCTAGGACTTCTAACAGCCCTAGACCTCAACTACCTAACC\n", | |
"AACAAACTTAAAATAAAATCCCCACTATGCACATTTTATTTCTCCAACATACTCGGATTCTACCCTAGCA\n", | |
"TCACACACCGCACAATCCCCTATCTAGGCCTTCTTACGAGCCAAAACCTGCCCCTACTCCTCCTAGACCT\n", | |
"AACCTGACTAGAAAAGCTATTACCTAAAACAATTTCACAGCACCAAATCTCCACCTCCATCATCACCTCA\n", | |
"ACCCAAAAAGGCATAATTAAACTTTACTTCCTCTCTTTCTTCTTCCCACTCATCCTAACCCTACTCCTAA\n", | |
"TCACATAACCTATTCCCCCGAGCAATCTCAATTACAATATATACACCAACAAACAATGTTCAACCAGTAA\n", | |
"CTACTACTAATCAACGCCCATAATCATACAAAGCCCCCGCACCAATAGGATCCTCCCGAATCAACCCTGA\n", | |
"CCCCTCTCCTTCATAAATTATTCAGCTTCCTACACTATTAAAGTTTACCACAACCACCACCCCATCATAC\n", | |
"TCTTTCACCCACAGCACCAATCCTACCTCCATCGCTAACCCCACTAAAACACTCACCAAGACCTCAACCC\n", | |
"CTGACCCCCATGCCTCAGGATACTCCTCAATAGCCATCGCTGTAGTATATCCAAAGACAACCATCATTCC\n", | |
"CCCTAAATAAATTAAAAAAACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAACACACCCG\n", | |
"ACCACACCGCTAACAATCAATACTAAACCCCCATAAATAGGAGAAGGCTTAGAAGAAAACCCCACAAACC\n", | |
"CCATTACTAAACCCACACTCAACAGAAACAAAGCATACATCATTATTCTCGCACGGACTACAACCACGAC\n", | |
"CAATGATATGAAAAACCATCGTTGTATTTCAACTACAAGAACACCAATGACCCCAATACGCAAAACTAAC\n", | |
"CCCCTAATAAAATTAATTAACCACTCATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAA\n", | |
"ACTTCGGCTCACTCCTTGGCGCCTGCCTGATCCTCCAAATCACCACAGGACTATTCCTAGCCATGCACTA\n", | |
"CTCACCAGACGCCTCAACCGCCTTTTCATCAATCGCCCACATCACTCGAGACGTAAATTATGGCTGAATC\n", | |
"ATCCGCTACCTTCACGCCAATGGCGCCTCAATATTCTTTATCTGCCTCTTCCTACACATCGGGCGAGGCC\n", | |
"TATATTACGGATCATTTCTCTACTCAGAAACCTGAAACATCGGCATTATCCTCCTGCTTGCAACTATAGC\n", | |
"AACAGCCTTCATAGGCTATGTCCTCCCGTGAGGCCAAATATCATTCTGAGGGGCCACAGTAATTACAAAC\n", | |
"TTACTATCCGCCATCCCATACATTGGGACAGACCTAGTTCAATGAATCTGAGGAGGCTACTCAGTAGACA\n", | |
"GTCCCACCCTCACACGATTCTTTACCTTTCACTTCATCTTGCCCTTCATTATTGCAGCCCTAGCAACACT\n", | |
"CCACCTCCTATTCTTGCACGAAACGGGATCAAACAACCCCCTAGGAATCACCTCCCATTCCGATAAAATC\n", | |
"ACCTTCCACCCTTACTACACAATCAAAGACGCCCTCGGCTTACTTCTCTTCCTTCTCTCCTTAATGACAT\n", | |
"TAACACTATTCTCACCAGACCTCCTAGGCGACCCAGACAATTATACCCTAGCCAACCCCTTAAACACCCC\n", | |
"TCCCCACATCAAGCCCGAATGATATTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTA\n", | |
"GGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCTAGCAATAATCCCCATCCTCCATATATCCAAAC\n", | |
"AACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCT\n", | |
"AACCTGAATCGGAGGACAACCAGTAAGCTACCCTTTTACCATCATTGGACAAGTAGCATCCGTACTATAC\n", | |
"TTCACAACAATCCTAATCCTAATACCAACTATCTCCCTAATTGAAAACAAAATACTCAAATGGGCCTGTC\n", | |
"CTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACAAATCAGA\n", | |
"GAAAAAGTCTTTAACTCCACCATTAGCACCCAAAGCTAAGATTCTAATTTAAACTATTCTCTGTTCTTTC\n", | |
"ATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCCATCAACAACCGCTATGTATTTCGTACA\n", | |
"TTACTGCCAGCCACCATGAATATTGTACGGTACCATAAATACTTGACCACCTGTAGTACATAAAAACCCA\n", | |
"ATCCACATCAAAACCCCCTCCCCATGCTTACAAGCAAGTACAGCAATCAACCCTCAACTATCACACATCA\n", | |
"ACTGCAACTCCAAAGCCACCCCTCACCCACTAGGATACCAACAAACCTACCCACCCTTAACAGTACATAG\n", | |
"TACATAAAGCCATTTACCGTACATAGCACATTACAGTCAAATCCCTTCTCGTCCCCATGGATGACCCCCC\n", | |
"TCAGATAGGGGTCCCTTGACCACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCG\n", | |
"CTCCGGGCCCATAACACTTGGGGGTAGCTAAAGTGAACTGTATCCGACATCTGGTTCCTACTTCAGGGTC\n", | |
"ATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATG\n", | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This FASTA file shown above has just one sequence in it. As we saw in the first example above, it's also possible for one FASTA file to contain multiple sequences. These are sometimes called multi-FASTA files. When you write code to interpret FASTA files, it's a good idea to *always* allow for the possibility that the FASTA file might contain multiple sequences.\n", | |
"\n", | |
"FASTA files are often stored with the `.fa` file name extension, but this is not a rule. `.fasta` is another popular extenson. You may also see `.fas`, `.fna`, `.mfa` (for multi-FASTA), and others." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Parsing FASTA\n", | |
"\n", | |
"Here is a simple function for parsing a FASTA file into a Python dictionary. The dictionary maps short names to corresponding nucleotide strings (with whitespace removed)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"def parse_fasta(fh):\n", | |
" fa = {}\n", | |
" current_short_name = None\n", | |
" # Part 1: compile list of lines per sequence\n", | |
" for ln in fh:\n", | |
" if ln[0] == '>':\n", | |
" # new name line; remember current sequence's short name\n", | |
" long_name = ln[1:].rstrip()\n", | |
" current_short_name = long_name.split()[0]\n", | |
" fa[current_short_name] = []\n", | |
" else:\n", | |
" # append nucleotides to current sequence\n", | |
" fa[current_short_name].append(ln.rstrip())\n", | |
" # Part 2: join lists into strings\n", | |
" for short_name, nuc_list in fa.iteritems():\n", | |
" # join this sequence's lines into one long string\n", | |
" fa[short_name] = ''.join(nuc_list)\n", | |
" return fa" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 2 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The first part accumulates a list of strings (one per line) for each sequence. The second part joins those lines together so that we end up with one long string per sequence. Why divide it up this way? Mainly to avoid the [poor performance](http://www.skymind.com/~ocrow/python_string/) of repeatedly concatenating (immutable) Python strings.\n", | |
"\n", | |
"I'll test it by running it on the simple multi-FASTA file we saw before:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"fasta_example = StringIO(\n", | |
"'''>sequence1_short_name with optional additional info after whitespace\n", | |
"ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA\n", | |
"GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC\n", | |
"AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGAT\n", | |
">sequence2_short_name with optional additional info after whitespace\n", | |
"GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCG\n", | |
"ATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATA\n", | |
"ATATAG''')\n", | |
"parsed_fa = parse_fasta(fasta_example)\n", | |
"parsed_fa" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 3, | |
"text": [ | |
"{'sequence1_short_name': 'ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGAT',\n", | |
" 'sequence2_short_name': 'GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATAATATAG'}" | |
] | |
} | |
], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note that only the short names survive. This is usually fine, but it's not hard to modify the function so that information relating short names to long names is also retained." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Indexed FASTA\n", | |
"\n", | |
"Say you have one or more *big* FASTA files (e.g. the entire human reference genome) and you'd like to access those files \"randomly,\" peeking at substrings here and there without any regular access pattern. Maybe you're mimicking a sequencing machine, reading snippets of DNA here and there.\n", | |
"\n", | |
"You could start by using the `parse_fasta` function defined above to parse the FASTA files. Then, to access a substring, do as follows:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"parsed_fa['sequence2_short_name'][100:130]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 4, | |
"text": [ | |
"'AGTACCGCAAGGGAAAGATGAAAAATTATA'" | |
] | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Accessing a substring in this way is very fast and simple. The downside is that you've stored all of the sequences in memory. If the FASTA files are really big, this takes lots of valuable memory. This may or may not be a good trade.\n", | |
"\n", | |
"An alternative is to load only the *portions* of the FASTA files that you need, when you need them. For this to be practical, we have to have a way of \"jumping\" to the specific part of the specific FASTA file that you're intersted in.\n", | |
"\n", | |
"Fortunately, there is a standard way of indexing a FASTA file, popularized by the `faidx` tool in [SAMtools]. When you have such an index, it's easy to calculate exactly where to jump to when you want to extract a specific substring. Here is some Python to create such an index:\n", | |
"\n", | |
"[SAMtools]: http://samtools.sourceforge.net" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"def index_fasta(fh):\n", | |
" index = []\n", | |
" current_short_name = None\n", | |
" current_byte_offset, running_seq_length, running_byte_offset = 0, 0, 0\n", | |
" line_length_including_ws, line_length_excluding_ws = 0, 0\n", | |
" for ln in fh:\n", | |
" ln_stripped = ln.rstrip()\n", | |
" running_byte_offset += len(ln)\n", | |
" if ln[0] == '>':\n", | |
" if current_short_name is not None:\n", | |
" index.append((current_short_name, running_seq_length,\n", | |
" current_byte_offset, line_length_excluding_ws,\n", | |
" line_length_including_ws))\n", | |
" long_name = ln_stripped[1:]\n", | |
" current_short_name = long_name.split()[0]\n", | |
" current_byte_offset = running_byte_offset\n", | |
" running_seq_length = 0\n", | |
" else:\n", | |
" line_length_including_ws = max(line_length_including_ws, len(ln))\n", | |
" line_length_excluding_ws = max(line_length_excluding_ws, len(ln_stripped))\n", | |
" running_seq_length += len(ln_stripped)\n", | |
" if current_short_name is not None:\n", | |
" index.append((current_short_name, running_seq_length,\n", | |
" current_byte_offset, line_length_excluding_ws,\n", | |
" line_length_including_ws))\n", | |
" return index" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here we use it to index a small multi-FASTA file. We print out the index at the end." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"fasta_example = StringIO(\n", | |
"'''>sequence1_short_name with optional additional info after whitespace\n", | |
"ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA\n", | |
"GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC\n", | |
"AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGAT\n", | |
">sequence2_short_name with optional additional info after whitespace\n", | |
"GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCG\n", | |
"ATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATA\n", | |
"ATATAG''')\n", | |
"idx = index_fasta(fasta_example)\n", | |
"idx" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 6, | |
"text": [ | |
"[('sequence1_short_name', 194, 69, 70, 71),\n", | |
" ('sequence2_short_name', 146, 335, 70, 71)]" | |
] | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"What do the fields in those two records mean? Take the first record: `('sequence1_short_name', 194, 69, 70, 71)`. The fields from left to right are (1) the short name, (2) the length (in nucleotides), (3) the byte offset in the FASTA file of the first nucleotide of the sequence, (4) the maximum number of nucleotides per line, and (5) the maximum number of *bytes* per line, including whitespace. It's not hard to convince yourself that, if you know all these things, it's not hard to figure out the byte offset of any position in any of the sequences. (This is what the `get` member of the `FastaIndexed` class defined below does.)\n", | |
"\n", | |
"A typical way to build a FASTA index like this is to use [SAMtools], specifically the `samtools faidx` command. This and all the other `samtools` commands are documented in [its manual](http://samtools.sourceforge.net/samtools.shtml).\n", | |
"\n", | |
"[SAMtools]: http://samtools.sourceforge.net\n", | |
"\n", | |
"When you use a tool like this to index a FASTA file, a new file containing the index is written with an additional `.fai` extension. E.g. if the FASTA file is named `hg19.fa`, then running `samtools faidx hg19.fa` will create a new file `hg19.fa.fai` containing the index.\n", | |
"\n", | |
"The following Python class shows how you might use the FASTA file together with its index to extract arbitrary substrings without loading all of the sequences into memory:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import re\n", | |
"\n", | |
"class FastaOOB(Exception):\n", | |
" \"\"\" Out-of-bounds exception for FASTA sequences \"\"\"\n", | |
" \n", | |
" def __init__(self, value):\n", | |
" self.value = value\n", | |
" \n", | |
" def __str__(self):\n", | |
" return repr(self.value)\n", | |
"\n", | |
"class FastaIndexed(object):\n", | |
" \"\"\" Encapsulates a set of indexed FASTA files. Does not load the FASTA\n", | |
" files into memory but still allows the user to extract arbitrary\n", | |
" substrings, with the help of the index. \"\"\"\n", | |
" \n", | |
" __removeWs = re.compile(r'\\s+')\n", | |
" \n", | |
" def __init__(self, fafns):\n", | |
" self.fafhs = {}\n", | |
" self.faidxs = {}\n", | |
" self.chr2fh = {}\n", | |
" self.offset = {}\n", | |
" self.lens = {}\n", | |
" self.charsPerLine = {}\n", | |
" self.bytesPerLine = {}\n", | |
" \n", | |
" for fafn in fafns:\n", | |
" # Open FASTA file\n", | |
" self.fafhs[fafn] = fh = open(fafn, 'r')\n", | |
" # Parse corresponding .fai file\n", | |
" with open(fafn + '.fai') as idxfh:\n", | |
" for ln in idxfh:\n", | |
" toks = ln.rstrip().split()\n", | |
" if len(toks) == 0:\n", | |
" continue\n", | |
" assert len(toks) == 5\n", | |
" # Parse and save the index line\n", | |
" chr, ln, offset, charsPerLine, bytesPerLine = toks\n", | |
" self.chr2fh[chr] = fh\n", | |
" self.offset[chr] = int(offset) # 0-based\n", | |
" self.lens[chr] = int(ln)\n", | |
" self.charsPerLine[chr] = int(charsPerLine)\n", | |
" self.bytesPerLine[chr] = int(bytesPerLine)\n", | |
" \n", | |
" def __enter__(self):\n", | |
" return self\n", | |
" \n", | |
" def __exit__(self, type, value, traceback):\n", | |
" # Close all the open FASTA files\n", | |
" for fafh in self.fafhs.itervalues():\n", | |
" fafh.close()\n", | |
" \n", | |
" def has_name(self, refid):\n", | |
" return refid in self.offset\n", | |
" \n", | |
" def name_iter(self):\n", | |
" return self.offset.iterkeys()\n", | |
" \n", | |
" def length_of_ref(self, refid):\n", | |
" return self.lens[refid]\n", | |
" \n", | |
" def get(self, refid, start, ln):\n", | |
" ''' Return the specified substring of the reference. '''\n", | |
" assert refid in self.offset\n", | |
" if start + ln > self.lens[refid]:\n", | |
" raise ReferenceOOB('\"%s\" has length %d; tried to get [%d, %d)' % (refid, self.lens[refid], start, start + ln))\n", | |
" fh, offset, charsPerLine, bytesPerLine = \\\n", | |
" self.chr2fh[refid], self.offset[refid], \\\n", | |
" self.charsPerLine[refid], self.bytesPerLine[refid]\n", | |
" byteOff = offset\n", | |
" byteOff += (start // charsPerLine) * bytesPerLine\n", | |
" into = start % charsPerLine\n", | |
" byteOff += into\n", | |
" fh.seek(byteOff)\n", | |
" left = charsPerLine - into\n", | |
" # Count the number of line breaks interrupting the rest of the\n", | |
" # string we're trying to read\n", | |
" if ln < left:\n", | |
" return fh.read(ln)\n", | |
" else:\n", | |
" nbreaks = 1 + (ln - left) // charsPerLine\n", | |
" res = fh.read(ln + nbreaks * (bytesPerLine - charsPerLine))\n", | |
" res = re.sub(self.__removeWs, '', res)\n", | |
" return res\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here's an example of how to use the class defined above." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# first we'll write a new FASTA file\n", | |
"with open('tmp.fa', 'w') as fh:\n", | |
" fh.write('''>sequence1_short_name with optional additional info after whitespace\n", | |
"ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA\n", | |
"GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC\n", | |
"AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGAT\n", | |
">sequence2_short_name with optional additional info after whitespace\n", | |
"GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCG\n", | |
"ATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATA\n", | |
"ATATAG''')\n", | |
"with open('tmp.fa') as fh:\n", | |
" idx = index_fasta(fh)\n", | |
"with open('tmp.fa.fai', 'w') as fh:\n", | |
" fh.write('\\n'.join(['\\t'.join(map(str, x)) for x in idx]))\n", | |
"with FastaIndexed(['tmp.fa']) as fa_idx:\n", | |
" print fa_idx.get('sequence2_short_name', 100, 30)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"AGTACCGCAAGGGAAAGATGAAAAATTATA\n" | |
] | |
} | |
], | |
"prompt_number": 8 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Other resources\n", | |
"\n", | |
"* [Wikipedia page for FASTA format](http://en.wikipedia.org/wiki/Fasta_format)\n", | |
"* The [original FASTA paper](http://www.sciencedirect.com/science/article/pii/007668799083007V) by Bill Pearson. This is the software tool that made the format popular.\n", | |
"* [BioPython], which has [its own ways of parsing FASTA](http://biopython.org/wiki/SeqIO)\n", | |
"* [Many](https://github.com/brentp/pyfasta) [other](https://github.com/lh3/seqtk) [libraries](https://github.com/mdshw5/pyfaidx) and [tools](http://hannonlab.cshl.edu/fastx_toolkit/)\n", | |
"\n", | |
"[BioPython]: http://biopython.org/wiki/Main_Page\n", | |
"[SeqIO]: http://biopython.org/wiki/SeqIO\n", | |
"[SAMtools]: http://samtools.sourceforge.net/\n", | |
"[FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/\n", | |
"\n", | |
"\u00a9 Copyright Ben Langmead 2014" | |
] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment