Skip to content

Instantly share code, notes, and snippets.

@dinovski
Last active June 21, 2022 00:25
Show Gist options
  • Save dinovski/2bcdcc770d5388c6fcc8a656e5dbe53c to your computer and use it in GitHub Desktop.
Save dinovski/2bcdcc770d5388c6fcc8a656e5dbe53c to your computer and use it in GitHub Desktop.
calculate N50 from fasta/contigs file
#!/usr/bin/env python
## calculate N50 from fasta file
## N50 = contig length such that half of the contigs are longer and 1/2 of contigs are shorter
import commands
import sys
import os
from itertools import groupby
import numpy
# from Bio import SeqIO
lengths = []
with open(sys.argv[1]) as fasta:
## parse each sequence by header: groupby(data, key)
faiter = (x[1] for x in groupby(fasta, lambda line: line[0] == ">"))
for record in faiter:
## join sequence lines
seq = "".join(s.strip() for s in faiter.next())
lengths.append(len(seq))
## sort contigs longest>shortest
all_len=sorted(lengths, reverse=True)
csum=numpy.cumsum(all_len)
print "N: %d" % int(sum(lengths))
n2=int(sum(lengths)/2)
# get index for cumsum >= N/2
csumn2=min(csum[csum >= n2])
ind=numpy.where(csum == csumn2)
n50 = all_len[ind[0]]
print "N50: %s" % n50
## N90
nx90=int(sum(lengths)*0.90)
## index for csumsum >= 0.9*N
csumn90=min(csum[csum >= nx90])
ind90=numpy.where(csum == csumn90)
n90 = all_len[ind90[0]]
print "N90: %s" % n90
## write lengths to file
with open('all_lengths.txt', 'w') as handle:
handle.write('\n'.join(str(i) for i in lengths))
@ViriatoII
Copy link

Thanks! I did get an error:
n50 = all_len[ind50[0]]
TypeError: only integer scalar arrays can be converted to a scalar index

That was easily solved by applying int() to ind50[0].

Have a great day!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment