Skip to content

Instantly share code, notes, and snippets.

@dr5hn
Forked from VladimirAlexiev/README.md
Created March 25, 2023 14:16
Show Gist options
  • Save dr5hn/a6a35a79530e2b23fb6eb9662ce34e6e to your computer and use it in GitHub Desktop.
Save dr5hn/a6a35a79530e2b23fb6eb9662ce34e6e to your computer and use it in GitHub Desktop.
How to use Google Sheets to Manage Wikidata Coreferencing

How to use Google Sheets to Manage Wikidata Coreferencing

A previous post How to Add Museum IDs to Wikidata explained how to use SPARQL to find missing data on Wikidata (Getty Museum IDs), how to create such values (from museum webpage URLs) and how to format them properly for QuickStatements.

Here I explain how to use Google sheets to manage a more advanced task. The sheet AAT-Wikidata matches about 3k AAT concepts to Wikipedia, WordNet30 and BabelNet (it restored an old mapping to Wordnet, retrieved it from BabelNet, mapped to Wikipedia).

  • For each row, it uses the following Google sheet formula (column C) to query the Wikipedia API and get the corresponding Wikidata ID (wikibase_item); split on two lines for readability:
=ImportXml(concat("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&
ppprop=wikibase_item&redirects=1&format=xml&titles=",G1),"//@wikibase_item")
  • This formula asks for results in XML format, and the part //@wikibase_item is a XPath that fetches the WD ID from the resulting XML.
  • Making 3k API calls is slow, so Google sheet initially shows "Loading…" for all rows, and gradually "materializes" the WD IDs (Qnnnn) as they come in. I have periodically sorted the column and used "Edit> Paste special> Values only" for the "materialized" IDs in order to fix them and not cause re-fetching next time when I open the google sheet.
  • Columns C,D,E are specially formatted to produce the required QuickStatements tab-delimited format. E.g. for row 62 AAT "patrol_wagons" corresponding to WP "Police_van": Q1023646 P1014 "300212831"

The benefit of Google sheets is that they allow easy addition of columns and convenient facilities for manual tasks:

  • Collaborative editing by several people at once.
  • Column A for tracking which rows are checked, which are already inserted to WD, etc
  • Using filters to find rows of interest. E.g. check=1 means rows that are manually checked and ready for insertion to QuickStatements. After insertion, I change it to check=2 to mark it as already inserted.
  • Column B for tracking already existing WD IDs
  • Conditional formatting to colour existing WD IDs (column B) that differ from my idea what is the matching WD ID (column C) and therefore must be checked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment