@luigi
Created December 13, 2009 17:43

National Data Catalog - Catalog Scraper Specification

The National Data Catalog stores metadata about data sets and APIs published by all levels of government. It helps developers, researchers, and journalists search for, identify, and work with data sources that would otherwise take significant effort to track down.

Here are some resources:

For more background, read our kickoff blog post about the project.

Project leads:

Existing Catalogs

The biggest example of a current data catalog is Data.gov, which aims to cover the executive branch of the federal government. States and cities have followed its example by releasing data catalogs of their own.

The National Data Catalog should harness these catalogs to build an all-encompassing one. So, scrapers need to be built for the following:

Scraping Tools

It's highly likely that the catalog you choose to work on does not have a machine-friendly dataset of its own catalog. Instead, you'll need to do some screen scraping. Some available libraries are:

Note: your input is needed on scraper libraries for languages other than Python and Ruby.
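As a rough illustration of what screen scraping a catalog listing might look like, here is a minimal sketch using only the Python standard library. A real scraper would more likely use one of the libraries above; the markup structure and the `dataset` class name are hypothetical, not taken from any actual catalog.

```python
from html.parser import HTMLParser

class DatasetTitleParser(HTMLParser):
    """Collects the link text inside each hypothetical <div class="dataset">."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_dataset = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "dataset" in attrs.get("class", "").split():
            self._in_dataset = True
        if tag == "a" and self._in_dataset:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        if tag == "div":          # assumed: one <div> per data source
            self._in_dataset = False

    def handle_data(self, data):
        if self._in_link:
            self.titles.append(data.strip())

# Sample markup standing in for a fetched catalog page.
page = """
<div class="dataset"><a href="/data/1">Crime Reports 2009</a></div>
<div class="dataset"><a href="/data/2">Budget Summary</a></div>
"""
parser = DatasetTitleParser()
parser.feed(page)
print(parser.titles)
```

In practice you would fetch the page over HTTP and extract far more than titles, but the shape is the same: walk the markup, pull out the fields you need.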

Scraping Technique

We ask that catalog scrapers divide the operation into two stages.

1. Scrape the site and write the results to a set of static YAML or JSON files. Each file should represent one data source. The collection of files should be stored publicly in a distributed version control system like Git or Mercurial. This way, rollbacks and history will be easy. Also, we can run analytics on the history to see how a given catalog has changed over time.

2. Use the static files to load the data into the API. See the API documentation on GitHub. We have set up a Sandbox API at sandbox.nationaldatacatalog.com for your use and testing. Please contact David or Luigi for API keys and further instructions.
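The two stages above could be sketched as follows, assuming JSON output and Python. The record's field names, the sandbox endpoint path, and the `X-API-Key` header are all assumptions for illustration; consult the API documentation and the project leads for the real ones.

```python
import json
import urllib.request

# Stage 1: write each scraped data source to its own static JSON file,
# suitable for committing to a Git or Mercurial repository.
record = {
    "title": "Sample Data Source",           # hypothetical fields
    "url": "http://example.gov/data/123",
    "agency": "Example Agency",
}
with open("example-agency-sample-data-source.json", "w") as f:
    json.dump(record, f, indent=2)

# Stage 2: load the static file into the API. Here we only build the
# request; actually sending it requires an API key from David or Luigi.
with open("example-agency-sample-data-source.json") as f:
    payload = f.read().encode("utf-8")

req = urllib.request.Request(
    "http://sandbox.nationaldatacatalog.com/data_sources",  # assumed path
    data=payload,
    headers={"Content-Type": "application/json",
             "X-API-Key": "YOUR_KEY_HERE"},   # placeholder key
    method="POST",
)
# urllib.request.urlopen(req) would perform the POST once a real key is set.
```

Keeping the two stages separate means a failed API load never forces a re-scrape: the static files in version control remain the source of truth.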

Review

When you think you've built a successful scraper, let us know. We'll review its functionality and make sure it conforms to the guidelines above, and, if all goes well, we'll add it to our battery of existing scrapers.
