@radaniba
Created November 29, 2012 16:56
Parse Genbank with BioRuby
#You can parse GenBank files with BioRuby the standard way, but there is a hidden problem: if the file ends with blank lines after the GenBank record terminator (two forward slashes, //), BioRuby reads them as an additional, empty record. You can work around this by trimming the trailing whitespace before handing the data to the parser.
require 'bio'

puts "Parsing seqs ..."
Bio::FlatFile.auto("foo.genbank").each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."
This prints the ID of every sequence in the file. However, if the file ends with blank lines after the GenBank terminator (two forward slashes, //), BioRuby reads them as an additional, empty record:
Parsing seqs ...
Sequence 'CY011043'
Sequence ''
Finished.
Nice. You can route around this by trimming the blank lines before handing it to the parser:
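require 'bio'
require 'stringio'

puts "Parsing seqs ..."
data = File.open("foo.genbank", "rb") { |f| f.read }
# Use rstrip rather than rstrip!: the bang form returns nil when there is nothing to strip.
buffer = StringIO.new(data.rstrip, 'rb')
Bio::FlatFile.auto(buffer).each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."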
to give:
Parsing seqs ...
Sequence 'CY011043'
Finished.
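To see why trimming helps, here is a minimal sketch that needs only the Ruby standard library. The helper names `trimmed_io` and `chunks` are hypothetical and not part of BioRuby, and `chunks` is only a simplified stand-in for record splitting (it assumes records end at a `//` line), not BioRuby's actual parser code.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper (not part of BioRuby): read a flat file and return a
# StringIO over its contents with trailing whitespace removed.
def trimmed_io(path)
  data = File.open(path, 'rb') { |f| f.read }
  # String#rstrip always returns a String. String#rstrip! returns nil when
  # there is nothing to strip, which would break StringIO.new.
  StringIO.new(data.rstrip)
end

# Simplified model of record splitting: read chunks up to each "\n//\n"
# terminator. This illustrates the effect only; it is an assumption about
# how a terminator-based splitter behaves, not BioRuby's implementation.
def chunks(io)
  out = []
  while (chunk = io.gets("\n//\n"))
    out << chunk
  end
  out
end

Tempfile.create('demo') do |f|
  # One record followed by two blank lines after the terminator.
  f.write("LOCUS       DEMO\n//\n\n\n")
  f.flush

  raw = File.open(f.path, 'rb') { |io| chunks(io) }
  trimmed = chunks(trimmed_io(f.path))

  puts raw.length      # => 2 (the second chunk is only blank lines)
  puts trimmed.length  # => 1
end
```

In real use, the `StringIO` returned by `trimmed_io` would be passed to `Bio::FlatFile.auto` in place of the file name, as in the workaround above.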
# The standard BioRuby parse for GenBank
require 'bio'

puts "Parsing seqs ..."
Bio::FlatFile.auto("foo.genbank").each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."
# If the file has blank lines after the final //, a "blank" record produces this output
# Parsing seqs ...
# Sequence 'CY011043'
# Sequence ''
# Finished.
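# Fix it this way: strip the trailing whitespace before parsing.
# Use rstrip, not rstrip!, which returns nil when nothing is stripped.
require 'stringio'

puts "Parsing seqs ..."
data = File.open("foo.genbank", "rb") { |f| f.read }
buffer = StringIO.new(data.rstrip, 'rb')
Bio::FlatFile.auto(buffer).each_entry { |gb|
  puts "Sequence '#{gb.to_biosequence.entry_id}'"
}
puts "Finished."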
# This gives
# Parsing seqs ...
# Sequence 'CY011043'
# Finished.