Skip to content

Instantly share code, notes, and snippets.

@mgalardini
Last active December 17, 2015 10:58
Show Gist options
  • Save mgalardini/5598454 to your computer and use it in GitHub Desktop.
Save mgalardini/5598454 to your computer and use it in GitHub Desktop.
Sequence utils
#!/usr/bin/python
'''
Reads a FASTA files and rewrites it keeping only the unique IDs
Important: we assume that if two proteins have the same ID, they have the same sequence
'''
import sys
from Bio import SeqIO
if len(sys.argv) < 3:
print 'Usage: removeCopies INPUT OUTPUT'
sys.exit(1)
fname, out = sys.argv[1:3]
def unique(records):
count = 0
already = set()
for r in records:
count += 1
if r.id not in already:
already.add(r.id)
yield r
print "%i saved sequences, %i inputs" % (len(already), count)
SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
@peterjc
Copy link

peterjc commented May 17, 2013

You tweeted this so I figured you'd like feedback...

Because you never close the handles explicitly you may have problems with the handles if you run this on PyPy or Jython - it would be cleaner to output the output handle once, write to it for each batch, and then close it.

I see you're batching the records for output to avoid having too many in memory at once, but refactoring this as a generator function you could change it to only ever have one record in memory at any one time. You'd still need to have the already seen identifiers in memory though.

I'm thinking of something like this (untested):

...
def unique(records):
    already = set()
    for r in records:
        if r.id not in already:
            already.add(r.id)
            yield r
count = SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")
print "Saved %i sequences" % count

@mgalardini
Copy link
Author

Yes, i was wondering if there was a more efficient way to do that, so thanks a lot for this snippet.
Since the file is opened just once it is roughly 2X faster on a 3.3Gb file (using python 2.7).
The only missing "feature" is the count of duplicated entries, but i guess that we can get that information by creating a class instead of a function.

Thanks a lot, I'll update the script when it will be done...

@peterjc
Copy link

peterjc commented May 20, 2013

In keeping with a quick-n-dirty script, you could just track the read count in a global variable... crude but effective. Or, you could put the print statement inside the filtering function?

...
def unique(records):
    count = 0
    already = set()
    for r in records:
        count += 1
        if r.id not in already:
            already.add(r.id)
            yield r
    print "%i unique identifiers from %i input records" % (len(unique), count)
SeqIO.write(unique(SeqIO.parse(fname, "fasta")), out, "fasta")

@mgalardini
Copy link
Author

A print statement inside the function is just perfect, thanks!

I've noticed that pypy 1.9 takes twice the time to run as compared to python 2.7: I expected it to be faster...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment