@amichal
Last active October 24, 2017 21:40
for eg
# frozen_string_literal: true
source "https://rubygems.org"
gem 'nokogiri'
gem 'open-uri-cached'
gem 'activesupport'
GEM
  remote: https://rubygems.org/
  specs:
    activesupport (5.1.4)
      concurrent-ruby (~> 1.0, >= 1.0.2)
      i18n (~> 0.7)
      minitest (~> 5.1)
      tzinfo (~> 1.1)
    concurrent-ruby (1.0.5)
    i18n (0.9.0)
      concurrent-ruby (~> 1.0)
    mini_portile2 (2.3.0)
    minitest (5.10.3)
    nokogiri (1.8.1)
      mini_portile2 (~> 2.3.0)
    open-uri-cached (0.0.5)
    thread_safe (0.3.6)
    tzinfo (1.2.3)
      thread_safe (~> 0.1)

PLATFORMS
  ruby

DEPENDENCIES
  activesupport
  nokogiri
  open-uri-cached

BUNDLED WITH
   1.14.3
require 'rubygems'
require 'bundler/setup'
require 'nokogiri'
require 'open-uri/cached'
require 'fileutils'
OpenURI::Cache.cache_path = "#{File.dirname(__FILE__)}/tmp/cache"
FileUtils.mkdir_p OpenURI::Cache.cache_path
require 'active_support/core_ext/string'
require 'csv'
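REST_PERIOD = 0.1 # seconds to pause before each uncached request

# Fetch and parse a page, pausing first when it is not already in the local cache.
def fetch_doc(url)
  sleep(REST_PERIOD) unless OpenURI::Cache.get(url) # don't hammer their server
  # Kernel#open only handles URLs on Ruby < 3.0; newer Rubies need URI.open.
  Nokogiri::HTML(open(url))
end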
index_url = 'https://www.globaldownsyndrome.org/research-medical-care/medical-care-providers/'
index_doc = fetch_doc(index_url)
records = []
index_doc.css('a[href^="/research-medical-care/medical-care-providers/"]').each do |node|
  state_url = URI.join(index_url, node['href']).to_s
  state_doc = fetch_doc(state_url)
  warn "Processing #{state_url}"
  # FIXME: there isn't a great way to find the provider tables
  state_doc.css('#content table h2').each do |title|
    table = title.ancestors('table').first
    id = table['id']
    data = {
      source_url: state_url,
      globaldownsyndrome_org_url: ("#{state_url}##{id}" if id),
      title: title.text
    }
    table.css('tr').each do |tr|
      cell = tr.at_css('td')
      next unless cell # skip rows with no <td> at all

      label = cell.text.strip
      # Real labels end in ":" or "?", e.g. "foo:" or "foo?"
      next unless label =~ /[?:]\z/

      label = label.sub(/[?:]\z/, '')
      # "Services provided" works like a section header: its value is the list of checked boxes
      value = if label == 'Services provided'
                table.css('.services input[checked]').map { |i| i.next_sibling.text.strip }.join("\n")
              else
                tr.at_css('td + td')&.text || '' # a missing value cell becomes an empty string
              end.strip
      data[label] = value if label.present? && value.present?
    end
    records << data
  end
end
headers = records.flat_map(&:keys).uniq
puts headers.to_csv
records.each do |rec|
  puts headers.map { |h| rec[h] }.to_csv
end
amichal commented Oct 24, 2017

Run with ruby globaldownsyndrome.org.rb > ~/Google\ Drive/gds.org.csv
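
The CSV step at the end of the script can be tried without touching the network. The following is only a minimal sketch of that step: `SAMPLE_RECORDS` and `records_to_csv` are hypothetical names made up for illustration, and the sample rows are invented stand-ins for the scraped provider hashes. It shows how taking the union of every record's keys gives one header row, with empty cells wherever a record lacks a field.

```ruby
require 'csv'

# Hypothetical sample data standing in for the scraped provider hashes.
SAMPLE_RECORDS = [
  { title: 'Clinic A', 'Phone' => '555-0100' },
  { title: 'Clinic B', 'Ages served' => 'All ages' }
].freeze

# Build CSV text whose header row is the union of every record's keys,
# in first-seen order. A record without a given key gets an empty cell.
def records_to_csv(records)
  headers = records.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << headers
    records.each { |rec| csv << headers.map { |h| rec[h] } }
  end
end

puts records_to_csv(SAMPLE_RECORDS)
```

Because the header list is computed after all pages are scraped, a label that appears on only one provider page still gets its own column.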
