@brenes
Created January 15, 2013 22:08
Wukong XML Parser
#!/usr/bin/env ruby
# Based on http://thedatachef.blogspot.com/2011/01/processing-xml-records-with-hadoop-and.html, although I can only find it through the Google cache:
# http://webcache.googleusercontent.com/search?q=cache:VuIRvlkYpjcJ:thedatachef.blogspot.com/2011/01/processing-xml-records-with-hadoop-and.html+&cd=1&hl=es&ct=clnk&gl=es
require 'rubygems'
require 'wukong'
require 'wukong/encoding'
require 'crack'

class HackernewsComment < Struct.new(:username, :url, :title, :text, :timestamp, :comment_id, :points, :comment_count, :type)
  # Build a record from one line of XML; returns nil if the line has no <row> element.
  def self.parse raw
    raw_hash = Crack::XML.parse(raw.strip)
    return unless raw_hash
    return unless raw_hash["row"]
    raw_hash = raw_hash["row"]
    raw_hash[:username]      = raw_hash["Username"].wukong_encode if raw_hash["Username"]
    raw_hash[:url]           = raw_hash["Url"].wukong_encode      if raw_hash["Url"]
    raw_hash[:title]         = raw_hash["Title"].wukong_encode    if raw_hash["Title"]
    raw_hash[:text]          = raw_hash["Text"].wukong_encode     if raw_hash["Text"]
    # Stored under :comment_id to match the struct member (there is no :feed_id member).
    raw_hash[:comment_id]    = raw_hash["ID"].to_i                if raw_hash["ID"]
    raw_hash[:points]        = raw_hash["Points"].to_i            if raw_hash["Points"]
    raw_hash[:comment_count] = raw_hash["CommentCount"].to_i      if raw_hash["CommentCount"]
    raw_hash[:type]          = raw_hash["Type"].to_i              if raw_hash["Type"]
    # E.g. maps '2010-10-26T19:29:59.717' to the easier-to-work-with '20101027002959'
    raw_hash[:timestamp]     = Time.parse_and_flatten(raw_hash["Timestamp"]) if raw_hash["Timestamp"]
    # Second argument tells from_hash that the keys are symbols.
    self.from_hash(raw_hash, true)
  end
end
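
class XMLParser < Wukong::Streamer::LineStreamer
  # Emit one HackernewsComment for each input line that starts a <row> record.
  def process line
    return unless line =~ /^<row/
    record = HackernewsComment.parse(line)
    # parse returns nil for rows it cannot read; skip those instead of emitting nil.
    yield record if record
  end
end

Wukong::Script.new(XMLParser, nil).run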
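
The timestamp comment in `HackernewsComment.parse` can be checked without Wukong. The sketch below is only an approximation of what `Time.parse_and_flatten` appears to do, assuming it parses the string, converts to UTC, and formats the result as `YYYYMMDDHHMMSS`. The helper name `flatten_timestamp` is hypothetical and not part of Wukong.

```ruby
require 'time'

# Hypothetical helper (not Wukong's API): parse a timestamp, convert it to
# UTC, and format it as a compact, sortable YYYYMMDDHHMMSS string.
def flatten_timestamp(str)
  Time.parse(str).utc.strftime('%Y%m%d%H%M%S')
end

# An explicit -05:00 offset is added here so the result does not depend on
# the local time zone of the machine running the example.
puts flatten_timestamp('2010-10-26T19:29:59.717-05:00')
```

When the input carries no offset, as in the comment's own example, `Time.parse` interprets it in the local time zone, so the flattened value depends on where the job runs.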