Created
November 29, 2012 16:56
Parse Genbank with BioRuby
| # You can parse GenBank files with BioRuby the standard way, but there is a hidden problem: if the file ends with blank lines, i.e. there are empty lines after the GenBank terminator (two forward slashes, //), BioRuby reads them as an additional, empty record. You can work around this by trimming the blank lines before handing the data to the parser. | |
| puts "Parsing seqs ..." | |
| Bio::FlatFile.auto("foo.genbank").each_entry { |gb| | |
| puts "Sequence '#{gb.to_biosequence.entry_id}'" | |
| } | |
| puts "Finished." | |
| which will print the ID of every sequence in the file. However, if the file ends with blank lines, i.e. there are empty lines after the GenBank terminator (two forward slashes, //), BioRuby reads them as an additional, empty record: | |
| Parsing seqs ... | |
| Sequence 'CY011043' | |
| Sequence '' | |
| Finished. | |
| Nice. You can route around this by trimming the blank lines before handing it to the parser: | |
| puts "Parsing seqs ..." | |
| data = File.open("foo.genbank", "rb") { |f| f.read } | |
| # Use rstrip, not rstrip!: rstrip! returns nil when there is nothing to strip | |
| buffer = StringIO.new(data.rstrip, 'rb') | |
| Bio::FlatFile.auto(buffer).each_entry { |gb| | |
| puts "Sequence '#{gb.to_biosequence.entry_id}'" | |
| } | |
| puts "Finished." | |
| to give: | |
| Parsing seqs ... | |
| Sequence 'CY011043' | |
| Finished. | |
| require 'bio' | |
| # The standard BioRuby parse for GenBank | |
| puts "Parsing seqs ..." | |
| Bio::FlatFile.auto("foo.genbank").each_entry { |gb| | |
| puts "Sequence '#{gb.to_biosequence.entry_id}'" | |
| } | |
| puts "Finished." | |
| # If the file has blank lines after the final // terminator, this prints an extra, empty record: | |
| # Parsing seqs ... | |
| # Sequence 'CY011043' | |
| # Sequence '' | |
| # Finished. | |
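| # Fix it by stripping the trailing blank lines before parsing | |
| require 'stringio' | |
| puts "Parsing seqs ..." | |
| data = File.open("foo.genbank", "rb") { |f| f.read } | |
| # Use rstrip, not rstrip!: rstrip! returns nil when there is nothing to strip | |
| buffer = StringIO.new(data.rstrip, 'rb') | |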
| Bio::FlatFile.auto(buffer).each_entry { |gb| | |
| puts "Sequence '#{gb.to_biosequence.entry_id}'" | |
| } | |
| puts "Finished." | |
| # This gives | |
| # Parsing seqs ... | |
| # Sequence 'CY011043' | |
| # Finished. |
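
The trimming step can be isolated and checked without BioRuby installed. The following is a minimal sketch using only the Ruby standard library; `trimmed_buffer` is a hypothetical helper name (not part of BioRuby), and the two sample strings are made-up stand-ins for real GenBank text. It also shows why `rstrip` is the safer call here than `rstrip!`.

```ruby
require 'stringio'

# Hypothetical helper, not part of BioRuby: wrap flat-file text in a StringIO
# after removing trailing whitespace, including any blank lines that follow
# the final // terminator.
def trimmed_buffer(text)
  # String#rstrip always returns a String. String#rstrip! returns nil when
  # there is nothing to strip, and StringIO.new would reject nil.
  StringIO.new(text.rstrip)
end

# Made-up sample data standing in for the contents of a GenBank file.
with_blanks    = "LOCUS       CY011043\n//\n\n\n"
without_blanks = "LOCUS       CY011043\n//"

puts trimmed_buffer(with_blanks).string.end_with?("//")      # prints true
puts trimmed_buffer(without_blanks).string == without_blanks # prints true
puts without_blanks.dup.rstrip!.inspect                      # prints nil
```

The object returned by `trimmed_buffer` is a `StringIO`, so it could be handed to `Bio::FlatFile.auto` in the same way as `buffer` in the script above.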