We are working on a set of "data quality" tools, to check the quality of open data publications.
These tools are currently very much WIP, here:
- Dashboard: https://github.com/okfn/data-quality-dashboard/tree/feature/refactor
- Example: http://uk-25k.datadashboards.io
- CLI: https://github.com/okfn/data-quality-cli/tree/feature/refactor
- Dataset (UK 25k spend data): https://github.com/okfn/data-quality-uk-25k-spend/tree/feature/refactor
You do not need to learn all these tools. They just provide context for what we are doing, which is:
- Check sets of data that are published, and make a list of the data sources
- Assess the quality of the data published
- Show the results of our quality assessment in a dashboard app
The dataset repository above contains a script (id_data.py) to build out two files: publishers.csv and sources.csv. It does this in a very particular way due to the type of data we are collecting there.
We want to create a different, generic script that does something similar: it builds publishers.csv and sources.csv from any CKAN instance.
So, the key differences from the above script are:
- Get all resources from a CKAN instance, not just a particular type (so, unlike the 25k spend data above)
- Write a generic script/class/whatever that just takes the URL of the CKAN instance and builds out publishers.csv from the instance's Organizations and sources.csv from the instance's resources.
- Given a CKAN instance, use its API to...
  - Build a list of publishers following this schema ( https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/publishers.csv )
  - Build a list of sources following this schema ( https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/sources.csv )
If some fields from the schemas above do not make sense in this generic implementation, that is fine; just note them.
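The two API-driven steps listed above could be sketched roughly as follows. This is only a minimal sketch and not how id_data.py works: it assumes the CKAN Action API (version 3) actions organization_list and package_search, and every helper name (call_action, iter_organizations, iter_packages, publisher_row, source_rows) as well as the output column names are hypothetical placeholders that would need to be aligned with the real publishers.csv and sources.csv schemas linked above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def call_action(base_url, action, params=None):
    """Call one CKAN Action API endpoint and return its 'result' payload."""
    url = "{}/api/3/action/{}".format(base_url.rstrip("/"), action)
    if params:
        url += "?" + urlencode(params)
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    if not payload.get("success"):
        raise RuntimeError("CKAN action failed: " + action)
    return payload["result"]


def iter_organizations(base_url, page_size=25, call=call_action):
    """Yield every organization, paging with limit/offset until a page is empty."""
    offset = 0
    while True:
        batch = call(base_url, "organization_list",
                     {"all_fields": "true", "limit": page_size, "offset": offset})
        if not batch:
            break
        yield from batch
        offset += len(batch)


def iter_packages(base_url, page_size=100, call=call_action):
    """Yield every dataset, paging through package_search with rows/start."""
    start = 0
    while True:
        result = call(base_url, "package_search",
                      {"rows": page_size, "start": start})
        batch = result["results"]
        if not batch:
            break
        yield from batch
        start += len(batch)


def publisher_row(org):
    """Map a CKAN organization to a publishers.csv row (assumed columns)."""
    return {"id": org.get("name", ""), "title": org.get("title", "")}


def source_rows(package):
    """Yield one sources.csv row per resource of a dataset (assumed columns)."""
    org = package.get("organization") or {}
    for resource in package.get("resources", []):
        yield {
            "id": resource.get("id", ""),
            "publisher_id": org.get("name", ""),
            # Fall back to the dataset title when the resource has no name.
            "title": resource.get("name") or package.get("title", ""),
            "data": resource.get("url", ""),
            "format": (resource.get("format") or "").lower(),
            "last_modified": resource.get("last_modified")
            or resource.get("created") or "",
        }
```

The `call` parameter is there so the paging logic can be exercised with a stub instead of a live instance. Paging continues until an empty page comes back, since some instances may cap the page size below what was requested.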
The output should be:
- A repository on GitHub with:
  - The script/module that implements the "CSV database"
  - An example database using the Queensland government CKAN instance: https://data.qld.gov.au/
  - Using the standard Open Knowledge license for such apps ( https://github.com/okfn/data-quality-dashboard/blob/master/LICENSE )
  - A README to describe the usage of the script/module
We should then be able to take this code and build the publishers.csv and sources.csv tables for any other CKAN instance.
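Writing the two tables could look roughly like this. It is a small sketch using Python's standard csv module; the function name write_table and the idea of passing the column list in explicitly are assumptions made for illustration, and the actual column lists should come from the schemas linked above.

```python
import csv


def write_table(path, fieldnames, rows):
    """Write an iterable of dict rows to a CSV file with a fixed header.

    Keys missing from a row become empty cells, and keys that are not in
    `fieldnames` are dropped, so sparse CKAN records do not break the output.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row.get(name, "") for name in fieldnames})
    return path
```

Because the header is fixed up front, the same function can serve both publishers.csv and sources.csv regardless of which CKAN instance the rows came from.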
Q: What if the suggested test instance does not have extras (or, other...) data that is required for the schema?
A: You can't handle data that does not exist. However, since the script needs to be generic, it might be a good idea to try it against 4-5 CKAN instances. Some examples, in addition to the QLD instance, could be: http://data.gov.au ; http://data.gov.uk ; http://www.data.gov ; http://opendata.aragon.es ; http://daten.berlin.de
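One way to tolerate instances that lack extras is to treat every optional field as possibly absent and fall back to an empty value. The sketch below assumes that CKAN returns extras as a list of key/value dictionaries; the helper names extras_to_dict and pick, and any key names passed to them, are hypothetical.

```python
def extras_to_dict(record):
    """Flatten a CKAN 'extras' list into a plain dict; missing extras give {}."""
    extras = record.get("extras") or []
    return {
        item["key"]: item.get("value", "")
        for item in extras
        if isinstance(item, dict) and "key" in item
    }


def pick(record, names, default=""):
    """Return the first non-empty value found under any of `names`.

    Top-level fields are checked before extras, and `default` is returned
    when the instance simply does not publish that information.
    """
    extras = extras_to_dict(record)
    for name in names:
        value = record.get(name) or extras.get(name)
        if value:
            return value
    return default
```

Running the script against several of the instances listed above would then show which columns stay empty everywhere, and those are the fields worth noting as not applicable.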