Last active
December 10, 2015 08:38
How to scrape Rightmove agents. I'll probably need fax numbers and website addresses. Plan B
# encoding: UTF-8
# MIT LICENSE (c)2012 Nelson Kelem [nelson at ilinkoln.org]
# Date: 30/12/2012
# How to scrape Rightmove agents
# I'll probably need fax numbers and website addresses
# Plan B
# TODO: Write data to JSON, YAML or CSV
require 'nokogiri'
require 'open-uri'
require "yaml/store"
# require "capybara"
# Set the YAML engine you like ['psych', 'syck']
# YAML::ENGINE.yamler = 'psych'
#
# I don't think this is the most efficient way to loop, but it works.
# Show me some refactors and let's see how it turns out.
# CITY = %w[ London Greater-London Manchester Greater-Manchester ]
#
@url = "http://www.rightmove.co.uk/estate-agents/London.html" # Maybe use CITY.each {}
# Kernel#open no longer opens URLs on Ruby 3.0+, so call URI.open (from open-uri) explicitly.
@data = Nokogiri::HTML(URI.open(@url))
@last_page = @data.at_css('div#sliderBottom ul.items li:last-child a').text.to_i
File.open("RightMoveLondonAgents.yml", "w:UTF-8") do |io|
  io.puts "---\nagencies:"
  @last_page.times do |page|
    # The first page is already loaded; each later page is offset by 20 results.
    data = page.zero? ? @data : Nokogiri::HTML(URI.open("#{@url}?index=#{page * 20}"))
    data.css('ol#summaries li').each do |li|
      li.css('div.photos').each do |agent|
        # YAML does not allow tabs for indentation, so indent with spaces.
        io.puts "  - agent:"
        # Logo image URL
        io.puts "      logo_url: \"#{agent.at_css('a.photo img')['src']}\""
        # Agent details
        li.css('.details').each do |detail|
          io.puts "      branch_name: \"#{detail.at_css('h2.branchname a').content}\""
          # Quoted so that leading zeros survive when the YAML is loaded
          io.puts "      telephone: \"#{detail.at_css('p.telephone').text.gsub(/[^0-9]/, '')}\""
          # squeeze(' ') collapses repeated spaces only; a bare squeeze would also collapse double letters
          io.puts "      description: \"#{detail.at_css('p.description').text.strip.gsub(/\n+/, ' ').squeeze(' ')}\""
        end
        # Type of business specialisation: Sales or Lettings
        agent.css('div.channels').each do |sp|
          if sp.at_css('div.onechannel')
            io.puts "      specialising_in: \"#{sp.text.strip}\""
          else
            io.puts "      specialising_in: \"#{sp.at_css('div:first-child').text.strip} & #{sp.at_css('div:last-child').text.strip}\""
          end
        end
        io.puts "\n"
      end
    end
  end
end # end io
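
The TODO at the top of the script mentions writing the data to JSON, YAML or CSV. Below is a minimal sketch of the YAML and JSON cases, assuming the scraped values are first collected into an array of hashes instead of being printed line by line. `serialise_agents` and the sample keys are hypothetical names for illustration, not part of the original script. Letting the standard library do the serialising avoids hand-escaping quotes and indentation.

```ruby
require 'yaml'
require 'json'

# Hypothetical helper: turn an array of agent hashes (string keys)
# into a YAML or JSON document with a top-level "agencies" key.
def serialise_agents(agents, format)
  document = { 'agencies' => agents }
  case format
  when :yaml then document.to_yaml               # Psych handles quoting and indentation
  when :json then JSON.pretty_generate(document)
  else raise ArgumentError, "unsupported format: #{format}"
  end
end
```

Inside the scraping loop this would mean pushing one hash per agent onto an array (for example `agents << { 'branch_name' => ..., 'telephone' => ... }`) and calling `File.write("RightMoveLondonAgents.yml", serialise_agents(agents, :yaml))` once at the end.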
The problem with this approach is that Rightmove only lets you fetch about 9 pages at the rate the script queries their servers; after that, requests time out. I think it would be trivial to make the script sleep for 2 minutes and then continue whenever there are more than 9 pages.
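
A rough sketch of that pause, assuming the limit really is about 9 requests and that a 2 minute wait is enough (both figures come from the observation above, not from any documented Rightmove policy). `fetch_throttled` and its parameters are hypothetical names; the fetcher and sleeper are passed in so the pacing can be tried without hitting the network or actually waiting.

```ruby
require 'open-uri'

# Hypothetical helper: fetch each URL in order, pausing for `pause`
# seconds after every `batch_size` requests.
def fetch_throttled(urls, batch_size: 9, pause: 120,
                    fetcher: ->(url) { URI.open(url).read },
                    sleeper: ->(seconds) { sleep(seconds) })
  urls.each_with_index.map do |url, index|
    # Pause before the 10th, 19th, ... request, never before the first.
    sleeper.call(pause) if index.positive? && (index % batch_size).zero?
    fetcher.call(url)
  end
end
```

In the script, the page URLs (`@url` plus the `?index=` offsets) could be built up front and passed to this helper, with each returned body handed to `Nokogiri::HTML`.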