@ybenjo
Created August 1, 2012 06:25
Script that crawls editorials from 社説比較くん4.0
# -*- coding: utf-8 -*-
require 'nokogiri'
require 'open-uri'
require 'logger'
$log = Logger.new('./crawl.log')
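# Fetch the editorials listed for event number +num+ and return them as an
# array of hashes with :company, :date and :body keys (empty on error).
def get(num)
  ret = []
  url = "http://shasetsu.ps.land.to/index.cgi/event/#{num}/"
  $log.info("get #{url}")
  begin
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    doc = Nokogiri::HTML(URI.open(url).read)
    (doc/'table'/'tr'/'td').each do |elem|
      each_doc = {}
      # Skip cells whose class attribute is the string 'nil'.
      next if elem['class'] == 'nil'
      each_doc[:company] = elem['class']
      # Date, as "YYYY/M/D"
      each_doc[:date] = (elem/'p.cite').inner_text.scan(/(\d{4})年(\d{2})月(\d{2})日/).first.map(&:to_i).join('/')
      # Body: drop newlines and tabs, then convert full-width digits and
      # letters to their half-width (ASCII) forms.
      each_doc[:body] = (elem/'blockquote').inner_text.gsub(/\n|\t/, '').tr('０-９', '0-9').tr('ａ-ｚＡ-Ｚ', 'a-zA-Z')
      ret.push each_doc
    end
  rescue => e
    $log.error("Error in #{num}: #{e.class}: #{e.message}")
    $log.error(e.backtrace.join("\n"))
  end
  ret
end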
# Return the newest event number, read from the first link on the top page.
def get_latest_num
  top = Nokogiri::HTML(URI.open('http://shasetsu.ps.land.to/').read)
  (top/'dd'/'a').first['href'].scan(/\d{4}/).first.to_i
end
if __FILE__ == $0
  puts get(1)
  # p get_latest_num
end
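
The script above only fetches event 1 when run directly. A full crawl would presumably walk every event number up to `get_latest_num`; the following is a minimal sketch of such a driver, not part of the original gist. The name `crawl_range`, its `fetch:` and `interval:` parameters, the added `:event` key, and the JSON Lines output format are all assumptions made for illustration.

```ruby
require 'json'

# Hypothetical driver (not in the original gist): walks event numbers
# first..last, calls `fetch` for each one, and writes every returned hash
# to `io` as one JSON object per line. `fetch` is any callable that takes
# an event number and returns an array of hashes, e.g. method(:get).
# Returns the number of records written.
def crawl_range(first, last, io, fetch:, interval: 1.0)
  count = 0
  (first..last).each do |num|
    fetch.call(num).each do |doc|
      # Record which event the editorial came from (an added, assumed key).
      io.puts(JSON.generate(doc.merge(event: num)))
      count += 1
    end
    # Pause between requests so the crawl does not hammer the server.
    sleep(interval) if interval > 0 && num < last
  end
  count
end

# Possible usage together with the functions defined in the script:
#
#   File.open('editorials.jsonl', 'w') do |f|
#     crawl_range(1, get_latest_num, f, fetch: method(:get))
#   end
```

Passing the fetcher in as a callable keeps the loop independent of the network, so it can be exercised with a stub before pointing it at the real site.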