@mvcaaa
Last active December 16, 2015 07:39
Simple tourdom parser
# Simple CLI tourdom.ru crawler/parser
# To exec: ruby index.rb
# (C) Astashov Andrey <[email protected]>
# require 'rubygems'
require 'restclient'
require 'nokogiri'
require 'digest/md5'
require 'time'
# Initial Setup
# URL to fetch from
REQUEST_URL = "http://www.tourdom.ru/birga/go2/index.php"
# Cache File TTL in seconds
CACHE_FILE_TTL = 1200
# Output filter string
OUTPUT_FILTER_STR = 'Бангкок - Москва'
items = {}
# Refetch when the cache file is missing or older than CACHE_FILE_TTL
cache_stale = !File.exist?('cache_file') ||
              Time.now - File.mtime('cache_file') > CACHE_FILE_TTL
if cache_stale
  page = RestClient::Request.execute(
    :method => :post,
    :url => REQUEST_URL,
    :headers => {
      'Accept-Charset' => 'utf-8',
      'User-Agent' => 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31'
    },
    :payload => 'Checkbox1=on&Checkbox11=1&City=%C1%E0%ED%E3%EA%EE%EA'
  )
  # binwrite truncates the file, so a shorter response leaves no stale tail
  File.binwrite('cache_file', page.body)
  doc = Nokogiri::HTML(page.body)
else
  # The block form closes the cache file once parsing is done
  doc = File.open('cache_file') { |f| Nokogiri::HTML(f) }
end
# Number of results reported by the page; 0 when the counter is missing
count_node = doc.at_css('span.commenttxt strong')
foundcnt = count_node ? count_node.text.to_i : 0
if foundcnt > 0
  elements = doc.xpath("//table[@class='tblbirgaresult']/tr")
  elements.each do |e|
    # Clean up some garbage
    e.css('a.advwork').remove
    e.css('span.flgnew').remove
    # Filling data variables
    # NOTE: the date is read from td.direction and the route from td.date.
    # The class names look swapped; they are kept as in the original, so
    # verify them against the page markup before changing anything.
    date1 = e.css('td.direction').text.strip
    direction1 = e.css('td.date').text.strip
    text1 = e.css('td.adv').text.strip
    # The MD5 of the description is the key, so duplicate ads collapse
    text1hash = Digest::MD5.hexdigest(text1)
    if !date1.empty? && direction1 == OUTPUT_FILTER_STR
      # puts "Hash sum: #{ text1hash }\r\nDate: #{ date1 }\r\nRoute: #{ direction1 }\r\nDescription:#{ text1 }\r\n\r\n"
      items[text1hash] = {:date => Time.parse(date1), :direction => direction1, :text => text1}
    end
  end
  # Simple data output
  puts "Found #{ items.count } tickets"
  items.each do |k, item|
    puts "Hash: #{ k }\r\nDate: #{ item[:date] }\r\nRoute: #{ item[:direction] }\r\nDescription: #{ item[:text] }\r\n\r\n"
  end
else
  puts 'Not found'
end
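
The hard-coded `City=%C1%E0%ED%E3%EA%EE%EA` in the request payload is the word "Бангкок" percent-encoded as Windows-1251 bytes rather than UTF-8. The sketch below shows one way such a payload could be built for another city. The helper names `cp1251_form_value` and `build_payload` are hypothetical and not part of the script above, and it assumes the site expects every form value in Windows-1251.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): percent-encode a UTF-8
# string as Windows-1251 bytes for a form-encoded POST body.
def cp1251_form_value(text)
  # Transcode first; encode_www_form_component then escapes the raw bytes.
  URI.encode_www_form_component(text.encode('Windows-1251'))
end

# Hypothetical helper: build the payload string used by the script for a
# given city name. The two checkbox fields are copied from the script.
def build_payload(city)
  "Checkbox1=on&Checkbox11=1&City=#{cp1251_form_value(city)}"
end

puts build_payload('Бангкок')
# prints "Checkbox1=on&Checkbox11=1&City=%C1%E0%ED%E3%EA%EE%EA"
```

Characters that have no Windows-1251 representation make `String#encode` raise `Encoding::UndefinedConversionError`, so input from outside the script may need a rescue or a replacement option.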