@lawlesst
Last active December 30, 2015 03:59

Google / Open Refine Reconciliation

The latest release is Google Refine 2.5. An OpenRefine beta candidate is available, and going forward the project is independent of Google.

Documentation and downloads are available from the OpenRefine website.

A useful tool in the ongoing work to move from strings to things.

Overview

OpenRefine (ex-Google Refine) is a powerful tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Interactive Data Transformation Tool (IDT)

Like Silk Workbench or Karma, OpenRefine is an interactive tool for working with data. The Free Your Metadata group coined this term; see doi:10.3789/isqv24n2-3.2012.04, "Joining the Linked Data Cloud in a Cost-Effective Manner".

What works well, in our experience at Brown, is to write a script to massage the data, then load it into Refine and use its built-in tools such as faceting and text-processing algorithms. Users also get a graphical interface to further process the data and verify matches.

Reconciliation

Reconciliation in Refine is the process of converting text names (strings) to database identifiers (things). This was originally created to assist with loading data into Freebase.

Refine defines an API for creating reconciliation services against datasets you are interested in.
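The core of such a service can be sketched as a function that accepts the JSON `queries` parameter Refine sends and returns the response structure the reconciliation API expects. The journal lookup table and exact-match scoring below are illustrative assumptions; a real service would wrap this in a small web application and search an actual dataset.

```python
# Sketch of the matching core of a reconciliation service.
# The JOURNALS table and scoring are placeholders for a real dataset.
import json

JOURNALS = {"Nature": "0028-0836", "The Lancet": "0140-6736"}

def reconcile(queries_json):
    """Map each query key to a list of candidate matches
    in the reconciliation API response shape."""
    results = {}
    for key, query in json.loads(queries_json).items():
        q = query.get("q", "").strip().lower()
        candidates = [
            {"id": issn, "name": name, "score": 100, "match": True,
             "type": [{"id": "journal", "name": "Journal"}]}
            for name, issn in JOURNALS.items()
            if name.lower() == q
        ]
        results[key] = {"result": candidates}
    return results

# Refine batches queries, e.g. {"q0": {"q": "Nature"}, "q1": {"q": "Cell"}}
print(reconcile('{"q0": {"q": "Nature"}, "q1": {"q": "Cell"}}'))
```

Unmatched queries simply come back with an empty `result` list, which Refine surfaces to the user as "no match".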

Extensions

A longer list of extensions is available at: http://openrefine.org/download.html.

There seem to be many extensions related to RDF and/or named entity recognition which should be of interest to the VIVO community.

  • Allows for reconciling against SPARQL endpoints or RDF dumps.
  • Import data in any Refine supported format and export as RDF.
  • Good tutorials and documentation.

VIVO and Refine

Each VIVO instance has a built-in reconciliation service available at /reconcile. This allows you to reconcile your data against any public VIVO. There is also a VIVO Refine extension that could offer further workflow improvements, but I haven't experimented with it.
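Hitting that endpoint from a script looks roughly like the following sketch. The VIVO URL is a placeholder; the request shape is the standard reconciliation API batch of `{"q": ...}` query objects sent as a `queries` form parameter.

```python
# Sketch of calling a VIVO instance's built-in /reconcile endpoint.
import json
import urllib.parse
import urllib.request

VIVO_RECONCILE = "https://vivo.example.edu/reconcile"  # placeholder URL

def build_queries(names):
    """Build the batched 'queries' payload that reconciliation
    services expect: {"q0": {"q": name}, "q1": ...}."""
    return {f"q{i}": {"q": name} for i, name in enumerate(names)}

def reconcile_names(names):
    """POST a batch of name queries and return candidates per name."""
    payload = urllib.parse.urlencode(
        {"queries": json.dumps(build_queries(names))}).encode()
    with urllib.request.urlopen(VIVO_RECONCILE, data=payload) as resp:
        return json.loads(resp.read().decode())

# Example payload for two person names:
print(build_queries(["Smith, Jane", "Doe, John"]))
```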

Demo

  • Database of publications for medical school faculty. Very few identifiers (fewer than 1%).
  • Use Refine to take names of academic journals and resolve them to authoritative names and ISSNs using the JournalTOCs service.
  • Created a basic web service that converts responses from the JournalTOCs API to the Refine reconciliation API format.
  • Users can then evaluate possible matches and establish matches from journal name strings to ISSNs.
  • Data can then be exported as RDF and loaded into VIVO or used in ingest processes.
  • We will use this data with the CrossRef OpenURL web service to obtain DOIs for articles in this data set. We've found that querying with date of publication, starting page, and ISSN of the publication venue returns DOIs in a large percentage of cases. After reconciling the journal names, we have this data in our local database for processing and loading into VIVO.
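The DOI lookup step above can be sketched as building a CrossRef OpenURL query from the reconciled data (ISSN, publication year, starting page). The parameter names follow the historical CrossRef OpenURL interface; `pid` must be an email address registered with CrossRef (a placeholder here), and the response is metadata XML from which the DOI is parsed (not shown).

```python
# Sketch of building a CrossRef OpenURL query URL for DOI lookup.
import urllib.parse

CROSSREF_OPENURL = "http://www.crossref.org/openurl"

def doi_query_url(issn, year, start_page, pid="you@example.edu"):
    """Build an OpenURL query from ISSN, year, and starting page."""
    params = {
        "pid": pid,            # registered CrossRef query email (placeholder)
        "issn": issn,
        "date": str(year),
        "spage": str(start_page),
        "redirect": "false",   # return metadata XML instead of redirecting
    }
    return CROSSREF_OPENURL + "?" + urllib.parse.urlencode(params)

print(doi_query_url("0028-0836", 2012, 491))
```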

Code

Papers

Maali, F., & Cyganiak, R. (2011). Re-using Cool URIs: Entity Reconciliation Against LOD Hubs.

Van Hooland, S., Verborgh, R., & Van de Walle, R. (2012). Joining the Linked Data Cloud in a Cost-Effective Manner. Information Standards Quarterly, 24(2/3), 24. doi:10.3789/isqv24n2-3.2012.04
