@46bit
Created April 15, 2013 11:36
This is a mass downloader of all http://www.police.uk/data datasets. I'm tidying the data into JSON files, one per year, so that we can dive into the full data quickly.
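require "httparty"

# Fetch the police.uk data index page, which links to every dataset archive.
data_index_url = "http://www.police.uk/data"
data_index_response = HTTParty.get data_index_url
if data_index_response.code != 200
  # Use `raise` to signal an error; `throw` is only for jumps paired with `catch`.
  raise "Fetching URL '#{data_index_url}' failed: '#{data_index_response.code}', '#{data_index_response.message}'"
end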
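# Capture year, month, force and dataset type from each archive link.
# Literal dots are escaped so they cannot match arbitrary characters.
zip_match_regex = /\/\/policeuk\.s3\.amazonaws\.com\/frontend\/crime-data\/([0-9]+)-([0-9]+)\/[0-9]+-[0-9]+-([a-z-]+)-([a-z]+)\.zip/
data_index = data_index_response.body
zip_matches = data_index.scan zip_match_regex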
# Index datasets by handle so that repeated links on the page collapse into one entry.
datasets = {}
zip_matches.each do |zip_match|
  # Each match holds exactly four captures: year, month, force, type.
  year, month, force, type = zip_match
  handle = "#{year}-#{month}-#{force}-#{type}"
  url = "http://policeuk.s3.amazonaws.com/frontend/crime-data/#{year}-#{month}/#{handle}.zip"
  datasets[handle] = {
    :url => url,
    :handle => handle,
    :year => year,
    :month => month,
    :force => force,
    :type => type
  }
end
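# Download each archive into the current directory with wget.
datasets.each_pair do |key, dataset|
  print "DATASET: #{dataset[:url]} \n\n"
  # Pass the URL as a separate argument so it is never interpreted by a shell.
  system "wget", dataset[:url]
end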
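
The description mentions tidying the data into one JSON file per year, but the script stops at downloading the archives. As a rough sketch of the grouping step only, the dataset records built above could be grouped by `:year` and written out as one JSON index per year. The helper names `group_by_year` and `write_yearly_indexes` and the sample records are assumptions made for illustration. The sketch writes only the dataset metadata, not the crime records inside each archive.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: group dataset records (hashes with a :year key) by year.
def group_by_year(datasets)
  datasets.values.group_by { |dataset| dataset[:year] }
end

# Hypothetical helper: write one "<year>.json" file per year into out_dir
# and return the list of paths written.
def write_yearly_indexes(datasets, out_dir)
  group_by_year(datasets).map do |year, entries|
    path = File.join(out_dir, "#{year}.json")
    File.write(path, JSON.pretty_generate(entries))
    path
  end
end

# Illustrative records shaped like the entries in `datasets` above.
sample_datasets = {}
[
  ["2011", "12", "avon-and-somerset", "street"],
  ["2012", "01", "avon-and-somerset", "street"],
  ["2012", "01", "kent", "neighbourhood"]
].each do |year, month, force, type|
  handle = "#{year}-#{month}-#{force}-#{type}"
  sample_datasets[handle] = {
    :url => "http://policeuk.s3.amazonaws.com/frontend/crime-data/#{year}-#{month}/#{handle}.zip",
    :handle => handle,
    :year => year,
    :month => month,
    :force => force,
    :type => type
  }
end

# Write the per-year files into a temporary directory and list them.
Dir.mktmpdir do |dir|
  write_yearly_indexes(sample_datasets, dir).each { |path| puts path }
end
```

In the real script, `datasets` would be passed in place of `sample_datasets`, with a permanent output directory instead of the temporary one.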