@zudochkin
Created January 30, 2012 20:09
books.ru xml parser
# coding: utf-8
require 'hpricot'
require 'net/http'

# The books.ru feed is encoded in Windows-1251; transcode it to UTF-8 before parsing.
xml = File.read('9001274.xml', encoding: 'Windows-1251').encode('UTF-8')
doc = Hpricot.XML(xml)
=begin
(doc/'//category').each do |item|
  #title = (item/:id)
  print item.inner_html
  print "\n"
  print "id = #{item[:id]}, parentId = #{item[:parentId]}"
  print "\n"
end
=end
(doc/'//offer').each do |book|
  # Download the cover image. The request itself is what can raise
  # SocketError, so it has to sit inside the begin/rescue block.
  begin
    resp = Net::HTTP.get(URI.parse(book.at('picture').to_plain_text))
    File.open("./book-images/#{book[:id]}.jpg", 'wb') { |file| file.write(resp) }
  rescue SocketError
    # Host unreachable: skip this image and carry on with the text fields.
  end

  # Print each field of the offer, followed by a blank line.
  ['url', 'price', 'categoryId', 'picture', 'author', 'name', 'description', 'year', 'ISBN'].each do |el|
    #puts "#{book[:id]}"
    puts book.at(el).to_plain_text
    #puts "#{el}: #{book.at(el).to_plain_text}"
    puts "\n"
  end
end