@dbreunig
Created August 5, 2024 15:52
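Preparing Wikidata JSON extracts for processing. Wikidata ships each JSON extract as a single file. Helpfully, every item sits on its own line; unhelpfully, all items are wrapped in one JSON array, so the file opens with [, closes with ], and every item line except the last ends in a comma. This one-liner strips the trailing commas and splits the stream into chunks of 100,000 lines, gzipping each chunk separately, without ever holding the whole file in memory. It relies on GNU split for the --filter option.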
zcat ../latest-all.json.gz | sed -e 's/,$//' -e '/^[][]$/d' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
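
To try the idea without downloading the full dump, here is a minimal sketch that runs the same kind of pipeline on a made-up three-item file. The file name sample.json.gz, the prefix demo_items_, and the chunk size of 2 are assumptions chosen for illustration, and gzip -dc stands in for zcat. The sed step also deletes the lone [ and ] lines so that every output line is a single JSON object. It assumes GNU split and GNU sed.

```shell
# Work in a throwaway directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical miniature stand-in for latest-all.json.gz: one item per line,
# wrapped in [ ] with trailing commas.
printf '%s\n' '[' '{"id":"Q1"},' '{"id":"Q2"},' '{"id":"Q3"}' ']' \
  | gzip > sample.json.gz

# Strip trailing commas, drop the bracket-only lines, then split into
# chunks of 2 lines, compressing each chunk as split writes it.
gzip -dc sample.json.gz \
  | sed -e 's/,$//' -e '/^[][]$/d' \
  | split -l 2 --filter='gzip > $FILE.gz' - demo_items_

# List the chunks and peek at the first one.
ls demo_items_*.gz
gzip -dc demo_items_aa.gz
```

Each resulting chunk is gzipped newline-delimited JSON, so a downstream tool can parse it one line at a time.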