Last active
December 28, 2015 22:58
# Written for Ruby 2.0.0; probably works on 1.9.x as well, may not work on 1.8.x.
require 'mechanize'

mech = Mechanize.new
page = mech.get("http://www.europarl.europa.eu/meps/en/full-list.html?filter=all&leg=")
mep_links = []
page.links.each do |x| # a Mechanize::Page object gives us all the links on the downloaded page
  link = x.href
  unless mep_links.include? link # avoid duplicates, as there is one link in the name and another in the photo
    mep_links << link if link =~ /\/meps\/\w+\/\d+\// # an MEP page link looks like /meps/de/34234238/..etc..
  end
end
mep_links.each_with_index do |url, index| # visit every URL in our list
  page = mech.get(url)
  # We'll use XPath expressions to extract the exact text we need from the pages. Mechanize exposes them through
  # search(), which wraps Nokogiri's query methods, so we don't have to instantiate a new
  # Nokogiri object on our own.
  # The name is held in the li tag with the class "mep_name": text()[1] is the text before the line break
  # and text()[2] is the text after it. Technically mep_name should be an id instead of a class,
  # since it's unique and non-repeating.
  first_name = page.search("//li[@class='mep_name']/text()[1]").text
  last_name = page.search("//li[@class='mep_name']/text()[2]").text
  hometown = page.search("//*[@id='zone_before_content_global']/div/div[1]/ul/span[2]").text # I think this is the hometown?
  hometown = hometown.match(/(?<=, ).+/) # everything after the comma is the hometown
  puts "#{first_name}\t#{last_name}\t#{hometown}" # tab-separated, so we can be lazy and copy the results from the console into Excel
  break if index == 2 # stop after the first three MEPs instead of running on all 766 results
end
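
The link filter and the hometown regex can be tried without hitting the site. The following is a minimal sketch, not part of the original script: `MEP_LINK`, `filter_mep_links`, `text_after_comma`, and the sample strings are all hypothetical names and data made up for illustration. It assumes the hrefs have already been pulled out of `page.links`.

```ruby
# Hypothetical helpers for illustration; the names and sample data are assumptions.
MEP_LINK = %r{/meps/\w+/\d+/}

# Keep only hrefs that look like MEP profile links, dropping nils and duplicates.
def filter_mep_links(hrefs)
  hrefs.compact.select { |href| href =~ MEP_LINK }.uniq
end

# Return everything after the first ", " in the text, or nil if there is none.
def text_after_comma(text)
  text[/(?<=, ).+/]
end

sample = [
  "/meps/en/12345/JANE_DOE_home.html", # made-up link on the name
  "/meps/en/12345/JANE_DOE_home.html", # same made-up link on the photo
  "/meps/en/full-list.html",           # not a profile link
  nil                                  # anchor without an href
]

p filter_mep_links(sample)
p text_after_comma("Born on 1 January 1970, Sampletown")
```

Using `String#[]` with a regex returns a plain string (or `nil`) instead of a `MatchData` object, which is slightly easier to reuse than the result of `match`.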