@tmocellin
Created December 16, 2015 22:32
Simple scraping test with Nokogiri
require 'open-uri'
require 'nokogiri'

# Extract the price, title and URL of every ad listed on a results page.
# Returns an array of hashes, one per ad.
def getPageResult(html_data)
  html_data.css('.list-lbc a').map do |ad|
    # Collapse runs of whitespace in the title into single spaces.
    title = ad.css('.detail .title').text.gsub(/\s+/, ' ')
    # Remove whitespace and non-breaking spaces, then spell out the euro sign.
    price = ad.css('.detail .price').text.gsub(/\s+/, '').gsub(/\u00a0/, '').gsub(/\u20AC/, ' euros')
    { price: price, title: title, url: ad['href'] }
  end
end

# Download a URL and parse it into a Nokogiri document.
# URI.open is used because Kernel#open no longer accepts URLs in Ruby 3.
def fetch_page(url)
  Nokogiri::HTML(URI.open(url, &:read))
end

search_url = "http://www.leboncoin.fr/annonces/offres/rhone_alpes/occasions/?o=1&q=guitar"
page_results = []

# Scrape the first results page.
html_data = fetch_page(search_url)
page_results.concat(getPageResult(html_data))

# Follow each numbered pagination link, skipping the "next page" and ">>" shortcuts.
html_data.css('#paging a').each do |page_link|
  next if ["Page suivante", ">>"].include?(page_link.text)

  puts "page #{page_link.text}"
  page_results.concat(getPageResult(fetch_page(page_link['href'])))
  sleep(2.5) # pause between requests
end

puts page_results
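
The script above passes each pagination `href` straight to the downloader and cleans prices with a chain of `gsub` calls. The following is a minimal sketch, not part of the original gist, of how those two steps could be isolated and tried without any network access. The helper names `normalize_price` and `absolute_url` and the sample inputs are hypothetical, and the sketch assumes that pagination links may be relative or protocol-relative, which the original script does not state.

```ruby
require 'uri'

# Hypothetical helper: strip whitespace (including non-breaking spaces)
# and spell out the euro sign, mirroring the price cleanup in the script.
def normalize_price(raw)
  raw.gsub(/[\s\u00a0]+/, '').gsub("\u20AC", ' euros')
end

# Hypothetical helper: resolve an href against the search URL so that
# absolute-path and protocol-relative links become full URLs.
def absolute_url(base, href)
  URI.join(base, href).to_s
end

base = "http://www.leboncoin.fr/annonces/offres/rhone_alpes/occasions/?o=1&q=guitar"

# Made-up sample inputs, for illustration only.
puts normalize_price(" 1\u00a0200 \u20AC\n")
# => 1200 euros
puts absolute_url(base, "/annonces/offres/rhone_alpes/occasions/?o=2&q=guitar")
# => http://www.leboncoin.fr/annonces/offres/rhone_alpes/occasions/?o=2&q=guitar
puts absolute_url(base, "//www.leboncoin.fr/annonces/offres/rhone_alpes/occasions/?o=3&q=guitar")
```

Keeping these steps in small functions makes it possible to check them against sample strings before running the scraper on the live site.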