Last active
October 24, 2017 21:40
for eg
tmp/**
# frozen_string_literal: true

source "https://rubygems.org"

gem 'nokogiri'
gem 'open-uri-cached'
gem 'activesupport'
GEM
  remote: https://rubygems.org/
  specs:
    activesupport (5.1.4)
      concurrent-ruby (~> 1.0, >= 1.0.2)
      i18n (~> 0.7)
      minitest (~> 5.1)
      tzinfo (~> 1.1)
    concurrent-ruby (1.0.5)
    i18n (0.9.0)
      concurrent-ruby (~> 1.0)
    mini_portile2 (2.3.0)
    minitest (5.10.3)
    nokogiri (1.8.1)
      mini_portile2 (~> 2.3.0)
    open-uri-cached (0.0.5)
    thread_safe (0.3.6)
    tzinfo (1.2.3)
      thread_safe (~> 0.1)

PLATFORMS
  ruby

DEPENDENCIES
  activesupport
  nokogiri
  open-uri-cached

BUNDLED WITH
   1.14.3
require 'rubygems'
require 'bundler/setup'
require 'nokogiri'
require 'open-uri/cached'
require 'fileutils'
OpenURI::Cache.cache_path = "#{File.dirname(__FILE__)}/tmp/cache"
FileUtils.mkdir_p OpenURI::Cache.cache_path
require 'active_support/core_ext/string'
require 'csv'

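# Fetch and parse a page, pausing briefly before any request that is not cached.
def fetch_doc(url)
  unless OpenURI::Cache.get(url)
    sleep(0.1) # don't hammer the server
  end
  # Kernel#open on a URL only works on Ruby < 3.0; newer Rubies need URI.open.
  Nokogiri::HTML(open(url))
end
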
rest_period = 1 # second (unused; fetch_doc sleeps for 0.1 s)
index_url = 'https://www.globaldownsyndrome.org/research-medical-care/medical-care-providers/'
index_doc = fetch_doc(index_url)
records = []

index_doc.css('a[href^="/research-medical-care/medical-care-providers/"]').each do |node|
  state_url = URI.join(index_url, node['href']).to_s
  state_doc = fetch_doc(state_url)
  warn "Processing #{state_url}"
  # FIXME: there isn't a great way to find this
  state_doc.css('#content table h2').each do |title|
    table = title.ancestors('table').first
    id = table['id']
    data = {
      source_url: state_url,
      globaldownsyndrome_org_url: ("#{state_url}##{id}" if id),
      title: title.text
    }
    table.css('tr').each do |tr|
      # to_s guards against rows that have no <td> at all
      label = tr.at_css('td')&.text.to_s.strip
      # real labels look like "foo:" or "foo?"
      next unless label =~ /[?:]\z/
      label = label.sub(/[?:]\z/, '')
      # "Services provided" works like a section header
      value = if label == 'Services provided'
        table.css('.services input[checked]').map { |i| i.next_sibling.text.strip }.join("\n")
      else
        tr.at_css('td + td')&.text
      end.to_s.strip # to_s guards against a row with no second cell
      data[label] = value if label.present? && value.present?
    end
    records << data
  end
end

# Header row is the union of every key seen across all records.
headers = records.flat_map(&:keys).uniq
puts headers.to_csv
records.each do |rec|
  puts headers.map { |h| rec[h] }.to_csv
end
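
The last step of the script writes records whose keys differ from one provider to the next. A minimal sketch of that header-union idea in isolation, using made-up sample records (the clinic names, phone number, and example.org URL are placeholders, not scraped data) and `CSV.generate` rather than `puts`:

```ruby
require 'csv'

# Hypothetical sample records; keys differ between rows, as they do
# when each provider table lists a different set of fields.
records = [
  { title: 'Clinic A', 'Phone' => '555-0100' },
  { title: 'Clinic B', 'Website' => 'https://example.org' }
]

# The union of all keys, in first-seen order, becomes the header row.
headers = records.flat_map(&:keys).uniq

csv_text = CSV.generate do |csv|
  csv << headers
  # values_at yields nil for a missing key, which CSV writes as an empty field.
  records.each { |rec| csv << rec.values_at(*headers) }
end

puts csv_text
```

Every row ends up with the same number of columns, so a spreadsheet import lines the fields up even when a provider is missing some of them.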
Run with
ruby globaldownsyndrome.org.rb > ~/Google\ Drive/gds.org.csv
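
The script treats the first cell of a table row as a field label only when it ends in `:` or `?`, and strips that trailing character before using it as a column name. A small sketch of that rule on its own, where `clean_label` and `LABEL_SUFFIX` are hypothetical names that do not appear in the script:

```ruby
# Trailing ":" or "?" marks a cell as a field label.
LABEL_SUFFIX = /[?:]\z/

# Hypothetical helper: returns the label without its trailing marker,
# or nil when the cell text does not look like a label.
def clean_label(raw)
  text = raw.to_s.strip
  return nil unless text.match?(LABEL_SUFFIX)

  text.sub(LABEL_SUFFIX, '')
end

['Phone:', ' Accepting new patients? ', '123 Main St'].each do |cell|
  puts "#{cell.inspect} => #{clean_label(cell).inspect}"
end
```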