@hydrosquall
Last active April 8, 2017 18:27
Data Rescue event notes @ Yale on April 8th, 2017

Intro

Data "at risk" has been an issue ever since the internet was created, and at every change of administration. This transition has been especially alarming for environmental data, hence the birth of this DataRescue project.

Today's workshop was led by Joshua Dull and Kayleigh Bohemier.

Not just saving it - saving it in a way that can be used!

Key Links

Roles

  • Seeder

    • Determine whether data is crawlable
    • Internet Archive crawls 3 layers deep.
    • Use the Chrome extension to tag pages and decide whether each page can be auto-harvested or would take human intervention
  • Harvesters / Researchers

    • Index of Pages to Process
      • Lets you review all the URLs and examine the metadata about each page
      • Some of the metadata is categorical; some is hand-typed recommendations
  • Baggers / Describers

    • These roles are available, but there is not much work for them yet.

URLs analyzed for the research process

GSA Government Agency Directory

https://gsa.gov/portal/staffDirectory/searchStaffDirectory

Learned that the number of ZIP codes changes every month.

Check out https://github.com/hemanklamba/Grab_Crawler. Still not complete, but hopefully it will be!

The app limits queries to 250 results at a time. Iterating through each of the states won't work because California alone has over 250 government employees. Instead, try taking a list of all the ZIP codes in the USA and querying for the results in each ZIP code. I would recommend flagging any ZIP code that returns exactly 250 employees as questionable, since that likely means the result set was truncated at the cap.

Note that the full list of ZIP codes is updated monthly, and you have to buy a product from the USPS to get the most current codes: https://ribbs.usps.gov/index.cfm?page=address_info_systems . There are over 43,000 ZIP codes!

Another gotcha: the records returned on the HTML page do not all carry the same metadata (e.g. Anderson, Thomas has a job title, whereas other people don't), so the content parser will have to be flexible.
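The ZIP-code strategy above can be sketched as follows. The fetch function is injected because the real gsa.gov query parameters and response format would first need to be inspected in the network tab; everything here besides the 250-result cap is an assumption for illustration.

```python
RESULT_CAP = 250  # the staff directory app returns at most 250 results per query


def classify_zip(zip_code, num_results):
    """Classify one ZIP code's result count against the 250-result cap.

    A count of exactly 250 almost certainly means the result set was
    truncated, so that ZIP code should be re-checked by hand.
    """
    if num_results == RESULT_CAP:
        return (zip_code, "questionable")
    return (zip_code, "ok")


def harvest(zip_codes, fetch):
    """Query every ZIP code and collect records plus questionable ZIPs.

    `fetch(zip_code)` must return a list of employee records; it is a
    hypothetical hook standing in for the real HTTP request + HTML parse.
    """
    records, flagged = [], []
    for z in zip_codes:
        rows = fetch(z)
        if classify_zip(z, len(rows))[1] == "questionable":
            flagged.append(z)
        records.extend(rows)
    return records, flagged
```

The injected `fetch` also makes the cap logic easy to test offline before pointing it at the live site.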

Cornell EAB GDD Maps

http://www.nrcc.cornell.edu/industry/eab/

In your web browser, open the network inspector tab. All of the visualizations are powered by POST form requests to the http://data.rcc-acis.org/GridData endpoint.

Data on the map appears to be returned over a 24-day rolling window. The JSON fields are not named; they are just given integer keys.

It appears that if you lengthen the rolling window in the request parameters, you can get more than 24 days of data at once. For example, use the body below to get over a year of data from the Bath station. Beware that this takes quite a long time to return, so requesting one month at a time may be better.

{"loc":"-77.31139, 42.36679","stn_name":"Bath","edate":"2017-4-08","sdate":"2016-03-15","grid":"3","elems":[{"name":"gdd","interval":[0,0,1],"duration":"std","season_start":"03-15","reduce":"sum","maxmissing":0}]}

I recommend using the Postman Interceptor Chrome extension to help with this process.
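A minimal Python sketch of that request, assuming only what the captured body shows: the payload fields are copied verbatim from the intercepted request (not from any documented API), with the `sdate`/`edate` window parameterized so you can widen or narrow it.

```python
import json
import urllib.request

GRIDDATA_URL = "http://data.rcc-acis.org/GridData"


def build_gdd_payload(sdate, edate):
    """Reproduce the captured GridData body for the Bath station,
    with the rolling window set to [sdate, edate]."""
    return {
        "loc": "-77.31139, 42.36679",
        "stn_name": "Bath",
        "sdate": sdate,
        "edate": edate,
        "grid": "3",
        "elems": [{
            "name": "gdd",
            "interval": [0, 0, 1],
            "duration": "std",
            "season_start": "03-15",
            "reduce": "sum",
            "maxmissing": 0,
        }],
    }


def fetch_gdd(sdate, edate):
    """POST the payload as JSON. The response shape is not documented
    here — inspect it in the network tab (fields are integer-keyed)."""
    body = json.dumps(build_gdd_payload(sdate, edate)).encode()
    req = urllib.request.Request(
        GRIDDATA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Requesting a month at a time would just mean calling `fetch_gdd` in a loop with successive date windows.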

Plants Map

https://plants.usda.gov/java/

The PLANTS database includes data that is directly downloadable by URL:

  • Plant Species List: https://plants.usda.gov/java/downloadData?fileName=plantlst.txt&static=true
  • Plant Lists By State: https://plants.usda.gov/java/stateDownload?statefips=[ENTER STATE FIPS CODE HERE]
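Expanding the state-download URL template is straightforward. The FIPS mapping below is deliberately partial, just for illustration; a full harvest would need the complete Census state FIPS table.

```python
STATE_DOWNLOAD_URL = "https://plants.usda.gov/java/stateDownload?statefips={fips}"

# A few state FIPS codes for illustration (the full table covers all states).
STATE_FIPS = {"CT": "09", "NY": "36", "CA": "06"}


def state_list_urls(fips_by_state):
    """Expand the stateDownload URL template for each state FIPS code."""
    return {
        state: STATE_DOWNLOAD_URL.format(fips=fips)
        for state, fips in fips_by_state.items()
    }
```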

The PLANTS database also has extensive imagery that is important for plant identification. There are 50,000+ images that link to the Plant Profiles for each species. It seems as though the images and individual plant species profiles are only accessible by following the Search UI.

There is an interactive map on each plant profile page; this data is being accessed from an ArcGIS Server REST API. If the REST API URL can be determined, it may be possible to access the data directly.

The interactive map is unfortunately just a zoomable picture; it's not a D3 visualization with underlying JavaScript data structures from which the per-state values could be extracted. The URLs for each plant follow the same structure: just replace "ACARO2" with the value of the element matching the CSS selector 'input#vSymbol' on the page. Example provided below.

<input type="hidden" id="vSymbol" value="ACARO2" />

Sample URL:

https://plants.usda.gov/core/mapping?GisServer/export?dpi=96&transparent=true&format=png8&bbox=-17324530.59646223%2C760320.4983000085%2C-4605409.089812295%2C11326975.288439956&bboxSR=102100&imageSR=102100&size=650%2C540&layerDefs=2%3ASYMBOL%3D.ACARO2.%3B3%3ASYMBOL%3D.ACARO2.%3B4%3ASYMBOL%3D.ACARO2.%3B5%3ASYMBOL%3D.ACARO2.%3B6%3ASYMBOL%3D.ACARO2.&f=image
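Putting the two pieces together — pulling the symbol out of the hidden input and substituting it into the captured export URL — can be sketched like this. The URL template is copied from the sample above (layers 2 through 6 in `layerDefs`); the function names are hypothetical.

```python
import re

# Map-export URL captured from a Plant Profile page, with the plant symbol
# (ACARO2 in the sample) appearing once per layerDefs entry.
EXPORT_URL_TEMPLATE = (
    "https://plants.usda.gov/core/mapping?GisServer/export?dpi=96&transparent=true"
    "&format=png8&bbox=-17324530.59646223%2C760320.4983000085%2C"
    "-4605409.089812295%2C11326975.288439956&bboxSR=102100&imageSR=102100"
    "&size=650%2C540&layerDefs="
    + "%3B".join(f"{layer}%3ASYMBOL%3D.{{symbol}}." for layer in range(2, 7))
    + "&f=image"
)


def extract_symbol(html):
    """Pull the plant symbol out of the hidden input#vSymbol element."""
    match = re.search(r'id="vSymbol"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


def map_export_url(symbol):
    """Substitute the plant symbol into the captured export URL."""
    return EXPORT_URL_TEMPLATE.format(symbol=symbol)
```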

Surface Water Catalogue

Is this actually data?

https://swot.jpl.nasa.gov/gallery/ebrochure/

Found the iframe on the page and nominated that for scraping instead.
