Skip to content

Instantly share code, notes, and snippets.

@samuell
Created September 6, 2014 01:01
Show Gist options
  • Save samuell/dbea9f759dced60a7d0b to your computer and use it in GitHub Desktop.
Save samuell/dbea9f759dced60a7d0b to your computer and use it in GitHub Desktop.
defmodule ATGCCount do
def count(sequence), do: cnt(String.to_char_list(sequence),0,0)
def cnt([65|t],at,gc), do: cnt(t,at+1,gc)
def cnt([84|t],at,gc), do: cnt(t,at+1,gc)
def cnt([71|t],at,gc), do: cnt(t,at,gc+1)
def cnt([67|t],at,gc), do: cnt(t,at,gc+1)
def cnt([62|_],at,gc), do: {at,gc}
def cnt([],at,gc), do: {at,gc}
def cnt(_,0,0), do: {0,0}
def cnt([_|t], at, gc), do: cnt(t,at,gc)
end
defmodule GCCount do
def process do
filename = "chry.fa"
if File.exists?(filename) do
stream = File.stream!(filename, [:read_ahead, :raw], :line)
Enum.reduce stream, {0, 0}, fn(line, {at_acc, gc_acc}) ->
{at, gc} = ATGCCount.count(line)
{at_acc + at, gc_acc + gc}
end
end
end
def gc_ratio do
{at, gc} = process
case {gc, at} do
{0, 0} -> 0
{at, gc} -> gc/(gc+at)
{_,_} -> 0
end
end
end
@behe
Copy link

behe commented Oct 3, 2014

I got this faster primarily by changing to reading bigger chunks of the file at a time, eg:

File.stream!(filename, [:read_ahead, :raw], @read_size)

The second slight increase that can only be observed when reading bigger chunks is by skipping the list conversion and doing a binary matching:

def count(sequence), do: cnt(sequence,0,0)
def cnt(<<?A, t :: binary>>,at,gc), do: cnt(t,at+1,gc)
def cnt(<<?T, t :: binary>>,at,gc), do: cnt(t,at+1,gc)
def cnt(<<?G, t :: binary>>,at,gc), do: cnt(t,at,gc+1)
def cnt(<<?C, t :: binary>>,at,gc), do: cnt(t,at,gc+1)
def cnt(<<?>, _ :: binary>>,at,gc), do: {at,gc}
def cnt("",at,gc), do: {at,gc}
def cnt(_,0,0), do: {0,0}
def cnt(<<_, t :: binary>>, at, gc), do: cnt(t,at,gc)

Thirdly I can cut down the total time by doing this asynchronously, see: https://gist.github.com/behe/b741c6ae7ff567b2d307

Lastly this seems wrong:

def cnt(_,0,0), do: {0,0}

I believe this is not needed and will incorrectly return a line starting with N as a {0,0} result even though it might include ATGC afterwards.

And then of course in the gc_ratio you have switched order of at & gc by mistake.

Thanks for a fun Elixir excercise!

@samuell
Copy link
Author

samuell commented Jun 25, 2015

@behe Cool! In my experiment on saml.rilspace.org/moar-languagez-gc-content-in-python-d-fpc-c-and-c I put on the restriction to only read one line at a time (except for the second test in the end). Would be very interesting to see what performance we get without that optimization!

@samuell
Copy link
Author

samuell commented Jun 25, 2015

All: For reference, the data file used here is available at: http://bit.ly/ychromo
Also see the discussion about these tests on the meetup page: http://www.meetup.com/stockholm-elixir/events/203279212/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment