Created
January 15, 2013 22:08
Wukong XML Parser
| #!/usr/bin/env ruby | |
| # Based on http://thedatachef.blogspot.com/2011/01/processing-xml-records-with-hadoop-and.html, which I can only find through Google Cache: | |
| # http://webcache.googleusercontent.com/search?q=cache:VuIRvlkYpjcJ:thedatachef.blogspot.com/2011/01/processing-xml-records-with-hadoop-and.html+&cd=1&hl=es&ct=clnk&gl=es | |
| require 'rubygems' | |
| require 'wukong' | |
| require 'wukong/encoding' | |
| require 'crack' | |
| class HackernewsComment < Struct.new(:username, :url, :title, :text, :timestamp, :comment_id, :points, :comment_count, :type) | |
| def self.parse raw | |
| raw_hash = Crack::XML.parse(raw.strip) | |
| return unless raw_hash | |
| return unless raw_hash["row"] | |
| raw_hash = raw_hash["row"] | |
| raw_hash[:username] = raw_hash["Username"].wukong_encode if raw_hash["Username"] | |
| raw_hash[:url] = raw_hash["Url"].wukong_encode if raw_hash["Url"] | |
| raw_hash[:title] = raw_hash["Title"].wukong_encode if raw_hash["Title"] | |
| raw_hash[:text] = raw_hash["Text"].wukong_encode if raw_hash["Text"] | |
| raw_hash[:comment_id] = raw_hash["ID"].to_i if raw_hash["ID"] | |
| raw_hash[:points] = raw_hash["Points"].to_i if raw_hash["Points"] | |
| raw_hash[:comment_count] = raw_hash["CommentCount"].to_i if raw_hash["CommentCount"] | |
| raw_hash[:type] = raw_hash["Type"].to_i if raw_hash["Type"] | |
| # E.g. map '2010-10-26T19:29:59.717' to the easier-to-work-with '20101027002959' | |
| raw_hash[:timestamp] = Time.parse_and_flatten(raw_hash["Timestamp"]) if raw_hash["Timestamp"] | |
| # | |
| self.from_hash(raw_hash, true) | |
| end | |
| end | |
| class XMLParser < Wukong::Streamer::LineStreamer | |
| def process line | |
| return unless line =~ /^\<row/ | |
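| # parse returns nil when the line does not yield a "row" hash, so skip those | |
| record = HackernewsComment.parse(line) | |
| yield record if record | |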
| end | |
| end | |
| Wukong::Script.new(XMLParser, nil).run |
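The filter-and-parse flow above can be tried locally without Hadoop, Wukong or Crack. The following is only a minimal sketch under several assumptions: `Comment`, `parse_row` and `flatten_timestamp` are hypothetical names that do not come from the gist, the attribute extraction is a naive regex rather than a real XML parser, the sample row is invented, and `flatten_timestamp` is a guess at what Wukong's `Time.parse_and_flatten` does (it may behave differently). Because the timestamps carry no zone, the UTC offset is passed in explicitly.

```ruby
require 'time'

# Hypothetical stand-in for the Struct in the script, with the same fields.
Comment = Struct.new(:username, :url, :title, :text, :timestamp,
                     :comment_id, :points, :comment_count, :type)

# Assumption: the raw timestamp has no zone, so the caller supplies the offset.
# Returns a flat UTC string such as "20101027002959".
def flatten_timestamp(raw, utc_offset = "+00:00")
  Time.iso8601("#{raw}#{utc_offset}").utc.strftime("%Y%m%d%H%M%S")
end

# Naive attribute extraction, good enough for a demo only:
# double-quoted attributes, no entity decoding.
def parse_row(line)
  return unless line =~ /^\s*<row\b/
  attrs = line.scan(/(\w+)="([^"]*)"/).to_h
  Comment.new(
    attrs["Username"], attrs["Url"], attrs["Title"], attrs["Text"],
    attrs["Timestamp"] && flatten_timestamp(attrs["Timestamp"]),
    attrs["ID"]&.to_i, attrs["Points"]&.to_i,
    attrs["CommentCount"]&.to_i, attrs["Type"]&.to_i
  )
end

# Invented sample row using the attribute names the script reads.
sample = '<row ID="42" Username="pg" Points="7" Timestamp="2010-10-26T19:29:59.717" />'
comment = parse_row(sample)
puts comment.to_a.join("\t") if comment
```

With an offset of `"-05:00"`, `flatten_timestamp("2010-10-26T19:29:59.717", "-05:00")` gives the `'20101027002959'` value quoted in the script's comment, which suggests the original example was converted from a zone five hours behind UTC.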