@tuttinator
Created May 3, 2016 02:24
Auckland Council committee minutes scraper

Council Attendance scraper

To use:

Clone the git repo

Create a DocumentCloud account on documentcloud.org and create a project.

Copy sample.env to .env and fill in your DocumentCloud email address, password and project ID.

Run bundle install
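Before running the scraper, it can help to confirm that the three settings from sample.env are actually present. The following is a minimal sketch, not part of the original repo: `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names, and the check only looks for blank values, so placeholder text such as YOUR_EMAIL would still pass.

```ruby
# Hypothetical helper (not in the original repo): lists the required
# settings that are missing or blank in the given environment.
REQUIRED_SETTINGS = %w[
  DOCUMENTCLOUD_EMAIL
  DOCUMENTCLOUD_PASSWORD
  DOCUMENTCLOUD_PROJECT_ID
].freeze

def missing_settings(env = ENV)
  REQUIRED_SETTINGS.select { |key| env[key].to_s.strip.empty? }
end

# Warn on stderr if anything is missing from the current environment.
missing = missing_settings
warn "Missing settings in .env: #{missing.join(', ')}" unless missing.empty?
```

If this is run after `Dotenv.load`, the values from .env are already in `ENV`, so the default argument is enough.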

To scrape the PDFs you are interested in:

Edit run.rb to set the committees and the year you are interested in.

Run ruby run.rb

This downloads the matching minutes PDFs into the pdfs directory and uploads each one to your DocumentCloud project.
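Instead of deleting entries from run.rb by hand, the committees list can be narrowed in code. This is a sketch under the assumption that you only want local boards; the shortened array and the `local_boards` variable are illustrative and not part of the original script.

```ruby
# A shortened copy of the committees array from run.rb, for illustration.
committees = [
  "Governing Body",
  "Finance and Performance Committee",
  "Albert-Eden Local Board",
  "Waiheke Local Board"
]

# Keep only the entries whose name ends in "Local Board".
local_boards = committees.select { |name| name.end_with?("Local Board") }

puts local_boards

# The filtered list can then be passed to the scraper as in run.rb:
# CouncilMinutesScraper.new({ committees: local_boards, year: 2015 }).scrape!
```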

# Gemfile
source 'https://rubygems.org'
gem 'capybara'
gem 'capybara-webkit'
gem 'documentcloud', github: 'nzherald/documentcloud', branch: 'develop'
gem 'dotenv'
gem 'pry'
#!/usr/bin/env ruby
# run.rb
# require_relative loads scraper.rb from this directory; a plain
# require 'scraper' fails unless the directory is on the load path.
require_relative 'scraper'

committees = [
  "Governing Body",
  "Auckland Development Committee",
  "Finance and Performance Committee",
  "Regional Strategy and Policy Committee",
  "Arts, Culture and Events Committee",
  "Community Development and Safety Committee",
  "Economic Development Committee",
  "Environment, Climate Change and Natural Heritage Committee",
  "Infrastructure Committee",
  "Parks, Recreation and Sport Committee",
  "Tenders and Procurement Committee",
  "Unitary Plan Committee",
  "Audit and Risk Committee",
  "Council Controlled Organisations Governance and Monitoring Committee",
  "Chief Executive Officer Review Committee",
  "Civil Defence and Emergency Management Group Committee",
  "Hearings Committee",
  "Regulatory and Bylaws Committee",
  "Auckland Domain Committee",
  "Albert-Eden Local Board",
  "Devonport-Takapuna Local Board",
  "Franklin Local Board",
  "Great Barrier Local Board",
  "Henderson-Massey Local Board",
  "Hibiscus and Bays Local Board",
  "Howick Local Board",
  "Kaipātiki Local Board",
  "Māngere-Ōtāhuhu Local Board",
  "Ōrākei Local Board",
  "Ōtara-Papatoetoe Local Board",
  "Papakura Local Board",
  "Puketāpapa Local Board",
  "Rodney Local Board",
  "Upper Harbour Local Board",
  "Waiheke Local Board",
  "Waitākere Ranges Local Board",
  "Waitematā Local Board",
  "Whau Local Board"
]

scraper = CouncilMinutesScraper.new({ committees: committees, year: 2015 })
scraper.scrape!
# sample.env
DOCUMENTCLOUD_EMAIL=YOUR_EMAIL
DOCUMENTCLOUD_PASSWORD=YOUR_PASSWORD
DOCUMENTCLOUD_PROJECT_ID=YOUR_PROJECT_ID_NUMBER
# scraper.rb
require 'bundler'
Bundler.require
require 'fileutils'
require 'open-uri'

Dotenv.load

class CouncilMinutesScraper
  include Capybara::DSL

  attr_reader :options, :documentcloud_project_id

  def initialize(options)
    @options = options
    DocumentCloud.configure do |config|
      config.email = ENV['DOCUMENTCLOUD_EMAIL']
      config.password = ENV['DOCUMENTCLOUD_PASSWORD']
    end
    @documentcloud_project_id = ENV['DOCUMENTCLOUD_PROJECT_ID']
    FileUtils.mkdir_p 'pdfs'
    Capybara.app = self
    Capybara.current_driver = :webkit
    Capybara.run_server = false
    Capybara.app_host = 'http://infocouncil.aucklandcouncil.govt.nz'
  end

  # options: hash
  #
  #   committees: array of committee names, as they appear in the ddlCommittee dropdown
  #   year: integer or string, as it appears in the ddlYear dropdown
  #
  def scrape!
    visit '/'
    # each rather than map: the block is run for its side effects only
    options.fetch(:committees).each do |committee|
      puts "Fetching #{committee}"
      select(committee, from: 'ddlCommittee')
      # Capybara matches option text, so pass the year as a string
      select(options.fetch(:year).to_s, from: 'ddlYear')
      click_on('btnView')
      sleep 10 # fixed pause while the results load
      upload_to_document_cloud(get_result, committee)
    end
  end

  def upload_to_document_cloud(docs, committee)
    docs.each do |url|
      # Title: hyphenated committee name plus the second
      # underscore-separated part of the PDF file name
      title = committee.gsub(' ', '-') + "-" + url.split('/').last.split('_')[1]
      path = "pdfs/#{title}.pdf"
      puts "Downloading #{url} locally"
      File.open(path, 'wb') do |file|
        # URI.open needs Ruby 2.5 or later; older Rubies use open(url) via open-uri
        file << URI.open(url).read
      end
      puts "Uploading #{title}.pdf to DocumentCloud"
      DocumentCloud.upload(File.new(path, 'rb'), title, { project: documentcloud_project_id, source: committee })
    end
  end

  # Returns direct URLs for the minutes PDFs (file names containing _MIN_)
  # listed on the current results page.
  def get_result
    pdf_links = all('.bpsGridPDFLink')
      .map { |a| a[:href] }
      .select { |href| href.include?('_MIN_') }
      .map { |href| href.sub(/RedirectToDoc\.aspx\?URL=/, 'http://infocouncil.aucklandcouncil.govt.nz/') }
    puts pdf_links
    pdf_links
  end
end
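The link filtering in get_result and the title building in upload_to_document_cloud can be tried without a browser or a DocumentCloud account. The sketch below pulls both steps into plain functions. `minutes_urls`, `document_title` and `BASE_URL` are hypothetical names, and the sample hrefs are invented for illustration, so the real link format on the site may differ.

```ruby
BASE_URL = 'http://infocouncil.aucklandcouncil.govt.nz/'

# Keep only minutes links and rewrite the redirect prefix to a direct URL,
# mirroring the select/sub chain in get_result.
def minutes_urls(hrefs)
  hrefs
    .select { |href| href.include?('_MIN_') }
    .map { |href| href.sub(/RedirectToDoc\.aspx\?URL=/, BASE_URL) }
end

# Build the document title the same way upload_to_document_cloud does:
# hyphenated committee name plus the second underscore-separated part
# of the file name.
def document_title(committee, url)
  committee.gsub(' ', '-') + '-' + url.split('/').last.split('_')[1]
end

# Invented sample hrefs (assumption): one minutes file, one agenda file.
hrefs = [
  'RedirectToDoc.aspx?URL=Open/2015/02/GB_20150226_MIN_5612.PDF',
  'RedirectToDoc.aspx?URL=Open/2015/02/GB_20150226_AGN_5612_AT.PDF'
]

urls = minutes_urls(hrefs)
puts urls
puts document_title('Governing Body', urls.first)
```

With these sample inputs only the first href survives the filter, and the title combines the committee name with the date-like segment of the file name.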