#!/usr/bin/python
'''
Reads a FASTA file and rewrites it keeping only the unique IDs
Important: we assume that if two proteins have the same ID, they have the same sequence
'''
import sys
from Bio import SeqIO

if len(sys.argv) < 3:
    print 'Usage: removeCopies INPUT OUTPUT'
    sys.exit(1)
fname, out = sys.argv[1:3]

def unique(records):
    '''Yield only the records whose identifier has not been seen before.'''
    count = 0        # total records read
    already = set()  # identifiers already written
    for r in records:
        count += 1
        if r.id not in already:
            already.add(r.id)
            yield r
    print "%i saved sequences, %i inputs" % (len(already), count)

SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
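For reference, and assuming the script is saved as removeCopies.py, it is invoked as the usage message above suggests (input.fasta and output.fasta are placeholder filenames):

    python removeCopies.py input.fasta output.fasta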
Yes, I was wondering if there was a more efficient way to do that, so thanks a lot for this snippet.
Since the file is opened just once it is roughly 2X faster on a 3.3 GB file (using Python 2.7).
The only missing "feature" is the count of duplicated entries, but I guess we could get that information by creating a class instead of a function (see the sketch below).
Thanks a lot, I'll update the script once that is done...
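A minimal sketch of what that class-based version might look like (untested; UniqueFilter is a hypothetical name, not code from this gist):

import sys
from Bio import SeqIO

class UniqueFilter(object):  # hypothetical name, for illustration only
    '''Yield records with unseen identifiers, keeping the counts around.'''
    def __init__(self):
        self.count = 0        # total records read
        self.already = set()  # identifiers already written

    def filter(self, records):
        for r in records:
            self.count += 1
            if r.id not in self.already:
                self.already.add(r.id)
                yield r

fname, out = sys.argv[1:3]
f = UniqueFilter()
SeqIO.write(f.filter(SeqIO.parse(fname, "fasta")), out, "fasta")
print "%i duplicated entries out of %i records" % (f.count - len(f.already), f.count)

Since SeqIO.write exhausts the generator before returning, the counts are complete by the time they are printed.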
In keeping with a quick-n-dirty script, you could just track the read count in a global variable... crude but effective. Or, you could put the print statement inside the filtering function?
...
def unique(records):
    count = 0
    already = set()
    for r in records:
        count += 1
        if r.id not in already:
            already.add(r.id)
            yield r
    # len(already), not len(unique) - the function object has no length
    print "%i unique identifiers from %i input records" % (len(already), count)

SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
A print statement inside the function is just perfect, thanks!
I've noticed that PyPy 1.9 takes twice the time to run compared to Python 2.7: I expected it to be faster...
You tweeted this so I figured you'd like feedback...
Because you never close the handles explicitly you may have problems with them if you run this on PyPy or Jython - it would be cleaner to open the output handle once, write each batch to it, and then close it (see the sketch below).
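A minimal sketch of that explicit handle management (my illustration; batches stands in for however the records are currently grouped):

from Bio import SeqIO

out_handle = open(out, "w")
for batch in batches:  # hypothetical iterable of lists of SeqRecords
    SeqIO.write(batch, out_handle, "fasta")
out_handle.close()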
I see you're batching the records for output to avoid having too many in memory at once, but by refactoring this as a generator function you could change it to only ever have one record in memory at any one time. You'd still need to keep the already seen identifiers in memory, though. I'm thinking of something like this (untested):