@safarista
Last active December 10, 2015 08:38
How to scrape Rightmove agents. I'll probably need fax numbers and website addresses. Planning B
# encoding: UTF-8
# MIT LICENSE (c)2012 Nelson Kelem [nelson at ilinkoln.org]
# Date: 30/12/2012
# How to scrape Rightmove agents
# I'll probably need fax numbers and website addresses
# Planning B
# TODO: Write data to JSON, YAML or CSV
require 'nokogiri'
require 'open-uri'
require "yaml/store"
# require "capybara"
# Set the Yaml engine you like ['psych', 'syck']
# YAML::ENGINE.yamler = 'psych'
#
# I don't think this is an efficient way to loop, but it works.
# Show me some refactors and let's see how it turns out.
# CITY = %w[ London Greater-London Manchester Greater-Manchester ]
#
@url = "http://www.rightmove.co.uk/estate-agents/London.html" # Maybe use CITY.each {}
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
@data = Nokogiri::HTML(URI.open(@url))
@last_page = @data.at_css('div#sliderBottom ul.items li:last-child a').text.to_i
File.open("RightMoveLondonAgents.yml", "w:UTF-8") do |io|
  # YAML forbids tabs for indentation, so indent with spaces and emit a list
  io.puts "---\nagencies:"
  @last_page.times do |i|
    # The first page is already in @data; later pages use an offset that grows by 20
    data = i.zero? ? @data : Nokogiri::HTML(URI.open("#{@url}?index=#{i * 20}"))
    data.css('ol#summaries li').each do |li|
      li.css('div.photos').each do |agent|
        io.puts "  - agent:"
        # Logo image URL
        io.puts "      logo_url: \"#{agent.at_css('a.photo img')['src']}\""
        # Agent details
        li.css('.details').each do |detail|
          io.puts "      branch_name: \"#{detail.at_css('h2.branchname a').content}\""
          # Quoted so YAML keeps the leading zero instead of parsing a number
          io.puts "      telephone: \"#{detail.at_css('p.telephone').text.gsub(/[^0-9]/, '')}\""
          # squeeze(' ') collapses runs of spaces only; a bare squeeze also collapses doubled letters
          io.puts "      description: \"#{detail.at_css('p.description').text.strip.gsub(/\n+/, ' ').squeeze(' ')}\""
        end
        # Type of business specialisation: Sales or Lettings
        agent.css('div.channels').each do |sp|
          if sp.at_css('div.onechannel')
            io.puts "      specialising_in: \"#{sp.text.strip}\""
          else
            io.puts "      specialising_in: \"#{sp.at_css('div:first-child').text.strip} & #{sp.at_css('div:last-child').text.strip}\""
          end
        end
        io.puts "\n"
      end
    end
  end
end # end io
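
The TODO at the top of the script mentions writing the data to JSON, YAML or CSV. One possible approach, sketched here rather than taken from the original script, is to collect each agent into a hash and let the standard `yaml` library do the serialising, so that quoting, indentation and leading zeros in phone numbers are handled for you. The sample record and its values below are made up for illustration; only the key names come from the script.

```ruby
require 'yaml'

# Hypothetical sample record; the key names mirror the ones the script prints.
agencies = [
  {
    'logo_url'        => 'http://example.com/logo.png',
    'branch_name'     => 'Example & Co, London',
    'telephone'       => '02071234567',
    'description'     => 'A "quoted" description: with a colon',
    'specialising_in' => 'Sales & Lettings'
  }
]

# YAML.dump takes care of quoting and indentation.
yaml_text = YAML.dump({ 'agencies' => agencies })
puts yaml_text
```

In the scraper itself, the hashes would be built inside the `each` loops and dumped once at the end instead of calling `io.puts` per field.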
@safarista
Author

The problem with this approach is that Rightmove only lets you fetch about 9 pages in a row at the speed the script queries their servers. After that, the requests time out. I think it would be trivial to make the script sleep for 2 minutes and then continue when there are more than 9 pages.
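
The pause-and-continue idea above could be sketched roughly like this. The helper name `fetch_with_pause` and its `retries`, `wait`, `fetcher` and `sleeper` parameters are hypothetical and not part of the gist; the 120-second default is the two-minute pause suggested in the comment, and the list of rescued errors is a guess at what a throttled request might raise.

```ruby
require 'open-uri'
require 'timeout'

# Hypothetical helper (not in the original gist): fetch a URL and, if the
# request times out or is refused, pause and try again.
# `fetcher` and `sleeper` are injectable so the logic can be exercised offline.
def fetch_with_pause(url, retries: 3, wait: 120,
                     fetcher: ->(u) { URI.open(u, &:read) },
                     sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    fetcher.call(url)
  rescue Timeout::Error, OpenURI::HTTPError, SystemCallError
    # Give up and re-raise once the retry budget is spent
    raise if attempts > retries
    sleeper.call(wait)
    retry
  end
end
```

In the script, the page fetch would then become something like `Nokogiri::HTML(fetch_with_pause(@url + '?index=' + z.to_s))`. Pausing only after a failure keeps the first few pages fast; a fixed short `sleep` between every request would be an alternative if the site throttles by request rate.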
