@mgalardini
Last active December 17, 2015 10:58
Sequence utils
#!/usr/bin/python
'''
Reads a FASTA file and rewrites it, keeping only the first record for each ID.
Important: we assume that if two proteins have the same ID, they have the same sequence
'''
import sys

from Bio import SeqIO

if len(sys.argv) < 3:
    print('Usage: removeCopies INPUT OUTPUT')
    sys.exit(1)

fname, out = sys.argv[1:3]


def unique(records):
    '''Yield each record whose ID has not been seen yet, then print a summary.'''
    count = 0
    already = set()
    for r in records:
        count += 1
        if r.id not in already:
            already.add(r.id)
            yield r
    # Runs once the generator is exhausted, i.e. after SeqIO.write has consumed it
    print("%i saved sequences, %i inputs" % (len(already), count))


SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
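
The script depends on the assumption stated in its docstring: records that share an ID also share a sequence. Below is a minimal sketch of how that assumption could be checked before deduplicating. It uses only the Python 3 standard library. The helper names `read_fasta` and `conflicting_ids` are hypothetical and are not part of the gist or of Biopython. The ID is assumed to be the first whitespace-delimited token of the header line, which is how Biopython's FASTA parser normally sets `record.id`.

```python
import io


def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited token.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def conflicting_ids(records):
    """Return the sorted IDs that appear with more than one distinct sequence."""
    first_seen = {}
    conflicts = set()
    for seq_id, seq in records:
        if seq_id not in first_seen:
            first_seen[seq_id] = seq
        elif first_seen[seq_id] != seq:
            conflicts.add(seq_id)
    return sorted(conflicts)


# Made-up example data: p1 is duplicated with the same sequence (wrapped
# differently), while p2 is duplicated with a different sequence.
example = io.StringIO(
    ">p1 first copy\nMKV\nLA\n"
    ">p2\nMTT\n"
    ">p1 second copy\nMKVLA\n"
    ">p2 altered\nMTA\n"
)
print(conflicting_ids(read_fasta(example)))  # prints ['p2']
```

An empty list means the deduplication above loses no sequence information. Any ID reported here would have its later, differing copies silently dropped.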
@mgalardini
Author

A print statement inside the function is just perfect, thanks!

I've noticed that PyPy 1.9 takes twice as long to run this as Python 2.7; I expected it to be faster...
