Last active
December 17, 2015 10:58
Sequence utils
#!/usr/bin/python
'''
Reads a FASTA file and rewrites it, keeping only the first record for each unique ID.
Important: we assume that if two proteins have the same ID, they have the same sequence.
'''
import sys
from Bio import SeqIO

if len(sys.argv) < 3:
    print('Usage: removeCopies INPUT OUTPUT')
    sys.exit(1)
fname, out = sys.argv[1:3]

def unique(records):
    '''Yield each record whose ID has not been seen before.'''
    count = 0
    already = set()
    for r in records:
        count += 1
        if r.id not in already:
            already.add(r.id)
            yield r
    # Runs once the input is exhausted, i.e. after the last record is written
    print("%i saved sequences, %i inputs" % (len(already), count))

SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
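The script relies on the assumption that records sharing an ID also share a sequence. Below is a minimal sketch of how that assumption could be checked. It is not part of the original gist: `Record` is a hypothetical stand-in for Biopython's `SeqRecord` (only `id` and `seq` are modelled), and `unique_checked` is an illustrative helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for a Biopython SeqRecord: only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def unique_checked(records, conflicts):
    """Yield the first record seen for each ID.

    IDs that reappear with a different sequence are appended to
    `conflicts`, so a violated assumption is reported, not hidden.
    """
    seen = {}  # ID -> sequence of the first record with that ID
    for r in records:
        seq = str(r.seq)
        if r.id not in seen:
            seen[r.id] = seq
            yield r
        elif seen[r.id] != seq:
            conflicts.append(r.id)


records = [
    Record("P1", "MKV"),
    Record("P2", "MAA"),
    Record("P1", "MKV"),  # exact duplicate: dropped silently
    Record("P2", "MTT"),  # same ID, different sequence: dropped and reported
]
conflicts = []
kept = list(unique_checked(records, conflicts))
print([r.id for r in kept])  # ['P1', 'P2']
print(conflicts)             # ['P2']
```

This variant keeps every first-seen sequence in memory, which may be too costly for large files. Storing a hash of each sequence would be one way to reduce that cost.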
A print statement inside the function is just perfect, thanks!
I've noticed that PyPy 1.9 takes twice as long to run this as Python 2.7; I expected it to be faster...