CKAN Data Quality

Background

We are working on a set of "data quality" tools, to check the quality of open data publications.

These tools are still very much a work in progress. They live here:

- Dashboard: https://github.com/okfn/data-quality-dashboard/tree/feature/refactor
  - Example: http://uk-25k.datadashboards.io
- CLI: https://github.com/okfn/data-quality-cli/tree/feature/refactor
- Dataset (UK 25k spend data): https://github.com/okfn/data-quality-uk-25k-spend/tree/feature/refactor

You do not need to learn all these tools. They just provide context for what we are doing, which is:

- Check sets of data that are published, and make a list of the data sources
- Assess the quality of the data published
- Show the results of our quality assessment in a dashboard app

Task

The dataset repository above contains a script (id_data.py) to build out two files: publishers.csv and sources.csv. It does this in a very particular way due to the type of data we are collecting there.

We want to create a different, generic script that does something similar: it builds publishers.csv and sources.csv from any CKAN instance.

So, the key differences from the above script are:

- Get all resources from a CKAN instance, not just those of a particular type (unlike the 25k spend data above).
- Write a generic script, class, or similar that takes only the URL of the CKAN instance and builds publishers.csv from the instance's Organizations and sources.csv from the instance's resources.
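As a rough illustration of the two points above, the sketch below walks a CKAN instance through its Action API (`organization_list` and `package_search`). The function names, the `fetch` parameter, and the page size are assumptions made for this example, not part of the existing tools. Depending on the CKAN version, `organization_list` may cap the number of results when `all_fields` is set, so a real script may need to page that call as well.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def _http_get_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def call_action(base_url, action, params=None, fetch=None):
    """Call one CKAN Action API endpoint and return its 'result' payload.

    `fetch` is a hypothetical hook (any callable taking a URL and returning
    a dict) so the code can be exercised without network access.
    """
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    if params:
        url += "?" + urlencode(params)
    body = (fetch or _http_get_json)(url)
    if not body.get("success"):
        raise RuntimeError(f"CKAN action {action} failed: {body.get('error')}")
    return body["result"]


def iter_organizations(base_url, fetch=None):
    """Yield organization dicts; each one is a candidate publishers.csv row."""
    yield from call_action(
        base_url, "organization_list", {"all_fields": "true"}, fetch
    )


def iter_resources(base_url, fetch=None, page_size=100):
    """Yield (package, resource) pairs for every resource on the instance."""
    start = 0
    while True:
        result = call_action(
            base_url, "package_search", {"rows": page_size, "start": start}, fetch
        )
        packages = result["results"]
        if not packages:
            break
        for package in packages:
            for resource in package.get("resources", []):
                yield package, resource
        start += len(packages)
```

Calling `iter_resources("https://data.qld.gov.au/")` would then page through every dataset on that instance, regardless of resource type.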

Spec

Given a CKAN instance, use its API to...
- Build a list of publishers following this schema ( https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/publishers.csv )
- Build a list of sources following this schema ( https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/sources.csv )

Some fields in the schemas above may not make sense in this generic implementation. That is fine; just note them.
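A minimal sketch of the mapping step follows, assuming organization and package dicts shaped like those returned by the CKAN Action API. The column names are placeholders chosen for illustration only; the real headers should be copied from the publishers.csv and sources.csv files linked above. The `homepage` column stands in for a schema field with no obvious CKAN equivalent, which is left blank and noted.

```python
import csv
import io

# Placeholder column names (assumptions): replace with the headers from the
# linked publishers.csv and sources.csv files.
PUBLISHER_FIELDS = ["id", "title", "homepage"]
SOURCE_FIELDS = ["id", "publisher_id", "title", "data", "format"]


def publisher_row(organization):
    """Map one CKAN organization dict to a publishers.csv row."""
    return {
        "id": organization.get("name", ""),
        "title": organization.get("title", ""),
        "homepage": "",  # hypothetical field with no CKAN source; left blank
    }


def source_row(package, resource):
    """Map one CKAN resource (and its parent package) to a sources.csv row."""
    # Datasets without an owning organization carry `organization: None`.
    organization = package.get("organization") or {}
    return {
        "id": resource.get("id", ""),
        "publisher_id": organization.get("name", ""),
        # Fall back to the dataset title when the resource is unnamed.
        "title": resource.get("name") or package.get("title", ""),
        "data": resource.get("url", ""),
        "format": (resource.get("format") or "").lower(),
    }


def write_csv(rows, fieldnames, stream):
    """Write dict rows to an open text stream with a header line."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(
        [publisher_row({"name": "dept-a", "title": "Dept A"})],
        PUBLISHER_FIELDS,
        buffer,
    )
    print(buffer.getvalue())
```

Keeping the row mapping separate from the API calls makes it easy to adjust the columns once the real schemas are checked, and to list the fields that had to be left empty.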

Expected output

The output should be:

A repository on GitHub with:
- The script/module that implements the "CSV database"
- An example database using the Queensland government CKAN instance: https://data.qld.gov.au/
- The standard Open Knowledge license for such apps ( https://github.com/okfn/data-quality-dashboard/blob/master/LICENSE )
- A README to describe the usage of the script/module

We should then be able to take this code, and build the publishers.csv and sources.csv tables for any other CKAN instance.
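One possible shape for the command-line entry point is sketched below, so that pointing the code at another CKAN instance only means changing one argument. The `--output-dir` option and the helper names are assumptions for illustration; the fetching and writing steps are left out.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build the argument parser for the hypothetical CLI."""
    parser = argparse.ArgumentParser(
        description="Build publishers.csv and sources.csv from a CKAN instance."
    )
    parser.add_argument("ckan_url", help="Base URL of the CKAN instance")
    parser.add_argument(
        "--output-dir",
        default="data",
        help="Directory that will receive the two CSV files (default: data)",
    )
    return parser


def output_paths(output_dir):
    """Return the (publishers.csv, sources.csv) paths inside output_dir."""
    base = Path(output_dir)
    return base / "publishers.csv", base / "sources.csv"


def main(argv=None):
    args = build_parser().parse_args(argv)
    publishers_path, sources_path = output_paths(args.output_dir)
    # A real script would query the CKAN API here and write both files.
    print(f"Would read {args.ckan_url}")
    print(f"Would write {publishers_path} and {sources_path}")


if __name__ == "__main__":
    main()
```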

pwalsh commented Mar 27, 2016