@brendano
Created June 22, 2012 03:23
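# Handle the Wikipedia XML dump format.
module WikiDump
  # Yields each <page>...</page> block of the dump as a single string.
  # Assumes the opening and closing tags each sit alone on their own line.
  def self.yield_page_strings(stream)
    buf = nil # nil while we are outside a <page> element
    stream.each do |line|
      buf = "" if line =~ /^\s* <page> \s*$/x
      next if buf.nil? # skip lines outside any page (e.g. the <siteinfo> header)
      buf << line
      if line =~ /^\s* <\/page> \s*$/x
        yield buf
        # Drop the reference so later lines never mutate a string already handed out.
        buf = nil
      end
    end
  end

  # Yields one WikiPage per page; WikiPage is expected to be defined elsewhere.
  def self.yield_pages(stream)
    yield_page_strings(stream) do |pagestr|
      yield WikiPage.new(pagestr)
    end
  end
end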
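
A minimal usage sketch of the line-based page splitting, run against a tiny in-memory dump. Everything here is an assumption made for illustration: `each_page_string` is a hypothetical stand-in that repeats the same logic so the example runs on its own, `page_title` is a hypothetical regex helper (not the `WikiPage` class, which is not shown in this gist), and the sample XML is made up.

```ruby
require "stringio"

# Hypothetical stand-in for WikiDump.yield_page_strings, repeated here so the
# sketch is self-contained. Returns an Enumerator when no block is given.
def each_page_string(stream)
  return enum_for(:each_page_string, stream) unless block_given?

  buf = nil # nil while outside a <page> element
  stream.each_line do |line|
    buf = +"" if line =~ /^\s*<page>\s*$/
    next if buf.nil?
    buf << line
    if line =~ /^\s*<\/page>\s*$/
      yield buf
      buf = nil
    end
  end
end

# Hypothetical helper: pull the title out of one page string with a regex.
# A real consumer would more likely hand the string to an XML parser.
def page_title(page_string)
  page_string[/<title>(.*?)<\/title>/m, 1]
end

# Made-up sample shaped like a MediaWiki XML export.
sample = <<~XML
  <mediawiki>
    <siteinfo>
      <sitename>Example</sitename>
    </siteinfo>
    <page>
      <title>Apple</title>
      <text>A fruit.</text>
    </page>
    <page>
      <title>Banana</title>
      <text>Another fruit.</text>
    </page>
  </mediawiki>
XML

each_page_string(StringIO.new(sample)) do |page|
  puts page_title(page)
end
```

Because the stream is read one line at a time and only the current page is buffered, the same loop should work on a real dump passed as an open `File`, without loading the whole file into memory.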