Skip to content

Instantly share code, notes, and snippets.

@pdbartsch
Last active December 20, 2015 00:09
Show Gist options
  • Save pdbartsch/6039457 to your computer and use it in GitHub Desktop.
Save pdbartsch/6039457 to your computer and use it in GitHub Desktop.
Using the Data Science Toolkit with OpenRefine to make your data more interesting. Geocode. Coordinates to statistics.
##Data Science Toolkit & OpenRefine Tutorial
1. Download, Setup, and start up [OpenRefine][OpenRefine]
2. Get some data. I recommend starting with a small dataset. I'll be using [this sample data][sampledata]. Extract the zip file. You now have 500.csv.
3. Load the data into OpenRefine by clicking 'Create Project' and then 'Choose Files', navigate to 500.csv. Then click 'Next', stay with the defaults and then 'Create Project'.
4. We'll first need to concatenate the address data down into a single field. Click the dropdown arrow at the top of the existing 'Address' field, then select 'Edit column' > 'Add column based on this column'.
* Enter a name for your new column. I'll go with 'AddressCombined' in the 'New column name' box
* In the expression box enter the following:
cells['Address'].value + ', ' + cells['City'].value + ', ' + cells['State'].value + ', ' + cells['ZIP'].value
* If the preview looks ok, press 'OK'
5. Now on your new 'AddressCombined' column do the following:
* Edit column > Add column by fetching URL's
* Name your new column 'DST_returned'
* Set the throttle delay to 0
* In the expression box enter the following (including the outside quotes):
'http://www.datasciencetoolkit.org/maps/api/geocode/json?sensor=false&address=' + escape(value, 'url')'
* If the preview looks ok, press 'OK' and wait for the process to complete
* You'll end up with a JSON blob in the new 'DST_returned' column
6. Now on your new 'DST_returned' column do the following:
* Edit column > Add column based on this column
* Name your new column 'LatLon'
* In the expression box enter the following:
with(value.parseJson().results[0].geometry.location, pair, pair.lat +"," + pair.lng)
* If the preview looks ok, press 'OK'
7. You can now optionally delete or collapse the 'DST_returned' field if it's in your way
8. Now we'll get to the fun part and grab some outside data for each Latitude,Longitude pair from the Data Science Toolkit's [coordinates2statistics tool][c2s]. Perform the following on the 'LatLon' field:
* Edit column > Add column by fetching URL's
* Name your new column 'DST_stats_return'
* Set the throttle delay to 0
* In the expression box enter the following:
'http://www.datasciencetoolkit.org/coordinates2statistics/'+ value
* If the preview looks ok, press 'OK' and wait for the process to complete
* Grab a beverage
* You'll end up with a JSON blob in the new 'DST_stats_return' column. Read through this info and take a look at all of the data that was just returned to you. There are currently 39 fields which are a part of the DST Statistics. You'll find each of these fields listed along with an explanation of their sources and units [here][c2s].
* We grabbed all of the fields at once which is easier on the DST servers than running the above data grab a bunch of times for many fields, we'll be parsing some of this data into fields in the following steps. If you only wanted a single statistic like Populations Density, you could have run something like this:
'http://www.datasciencetoolkit.org/coordinates2statistics/'+ value+'?statistics=population_density'
9. Now on your new 'DST_stats_return' column do the following:
* Edit column > Add column based on this column
* Name your new column 'PopulationDensity'
* In the expression box enter the following:
with(value.parseJson()[0], pair, pair.statistics.population_density.value)
* If the preview looks ok, press 'OK'
10. Here's another example almost identical to #9. You'll notice that I'm only changing the statistics field name. After this step you'll be able to parse any of the 39 fields listed [here][c2s]!
* Edit column > Add column based on this column
* Name your new column 'LandCover'
* In the expression box enter the following:
with(value.parseJson()[0], pair, pair.statistics.land_cover.value)
* Note that some data is unavailable in some areas so if you're preview seems to return no value, scroll down. If you see values returned for any of the records then the expression worked.
* Press 'OK'
11. Go hunt down some more open data api's to use your new OpenRefine skills on!
[sampledata]:http://www.briandunning.com/sample-data/500.zip
[OpenRefine]:http://openrefine.org/
[c2s]:http://www.datasciencetoolkit.org/developerdocs#coordinates2statistics
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment