Created January 22, 2014 15:31
Download all SISU selected candidates pages. Parse them.
# Usage:
#   ruby get.rb lower upper output-dir
# Example:
#   ruby get.rb 65000 85000 out
require 'open-uri'

class String
  def red; "\033[31m#{self}\033[0m" end
  def green; "\033[32m#{self}\033[0m" end
end

# Convert the bounds to integers so the range is numeric and bad input fails early.
(Integer(ARGV[0])..Integer(ARGV[1])).each do |n|
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  source = URI.open("http://sisu.mec.gov.br/selecionados?co_oferta=#{n}").read
  # The marker text is Portuguese for "The page you tried to access is unavailable".
  if source.include? 'A página que você tentou acessar está indisponível'
    puts "Skipping #{n}".red
  else
    puts "Saving #{n}".green
    # The block form closes the file even if the write raises.
    File.open("#{ARGV[2]}/#{n}.html", 'w') { |file| file.puts source }
  end
end
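The download script takes its bounds straight from the command line. As a minimal sketch (the `parse_offer_range` helper is hypothetical and not part of the original gist), the two arguments could be validated before any request is made, so that a typo or swapped bounds fails with a clear message instead of partway through a long download:

```ruby
# Hypothetical helper, not part of the original gist: turn the two
# command-line strings into a numeric range, rejecting bad input.
def parse_offer_range(lower, upper)
  first = Integer(lower, 10) # raises ArgumentError on non-numeric input
  last = Integer(upper, 10)
  raise ArgumentError, "lower (#{first}) must not exceed upper (#{last})" if first > last

  first..last
end
```

It would be called as `parse_offer_range(ARGV[0], ARGV[1]).each do |n| ... end` in place of the bare range.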
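# Parses the pages that get.rb saved into out/.
require 'nokogiri'
# titleize adds String#titleize; this snippet never calls it.
require 'titleize'

class String
  # Trim both ends and collapse runs of two or more whitespace characters into one space.
  def clean
    strip.gsub(/\s{2,}/, ' ')
  end
end

Dir.glob('out/*').each do |filename|
  # The block form closes the file handle once Nokogiri has read it.
  page = File.open(filename) { |file| Nokogiri::HTML(file) }
  ies = page.css('.nome_ies_p').text.clean
  campus = page.css('.nome_campus_p').text.clean
  curso = page.css('.nome_curso_p').text.clean
  turno = page.css('.grau_turno_p').text.clean
  # Skip the first .no_candidato match (presumably a header) and keep the rest.
  candidatos = page.css('.no_candidato').map { |cand| cand.text.clean }[1..-1]
  # Now do something with this info! Save to DB, write to a file...
end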
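The parser leaves the last step open ("Save to DB, write to a file..."). One possible way to fill it in, sketched here with Ruby's `csv` library, is to emit one CSV row per candidate. The helper names `candidate_rows` and `rows_to_csv`, the column headers, and the sample values are all assumptions made for illustration; none of them come from the original gist.

```ruby
require 'csv'

# Hypothetical helper: one row per candidate, repeating the course fields.
def candidate_rows(ies, campus, curso, turno, candidatos)
  candidatos.map { |nome| [ies, campus, curso, turno, nome] }
end

# Hypothetical helper: render the rows as CSV text with a header line.
def rows_to_csv(rows)
  CSV.generate do |csv|
    csv << %w[ies campus curso turno candidato]
    rows.each { |row| csv << row }
  end
end

# Placeholder values, not real SISU data.
rows = candidate_rows('IES Exemplo', 'Campus Exemplo', 'Curso Exemplo',
                      'Integral', ['CANDIDATO UM', 'CANDIDATO DOIS'])
puts rows_to_csv(rows)
```

Inside the `Dir.glob` loop, the rows for each page could be collected into a single array and written once at the end with `File.write`, which keeps the header line from being repeated per page.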