Created August 5, 2024 15:52
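Preparing Wikidata JSON extracts for processing. Wikidata ships its JSON extract as a single file. Helpfully, each item sits on its own line; unhelpfully, all items are wrapped as one JSON array, with enclosing brackets and trailing commas. This one-liner strips the trailing commas and splits the stream into chunks of 100,000 lines, written with the `wd_items_cw` prefix and gzipped individually as they are produced. Because everything is streamed through pipes, it is memory-friendly: the full file is never loaded into memory or written out uncompressed. Any line that holds only a bracket passes through unchanged.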
zcat ../latest-all.json.gz | sed 's/,$//' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
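To see what the pipeline does without downloading the full dump, here is a minimal sketch that runs it on a three-item stand-in. It makes a few assumptions that are not part of the original one-liner: the file name `sample.json.gz`, the `sample_items_` prefix, and the chunk size of 2 are made up for illustration, and the extra `sed` expressions that delete the bracket lines assume the array's `[` and `]` each sit alone on their own line. It also assumes GNU `split`, since `--filter` is a GNU coreutils option.

```shell
# Work in a scratch directory so the demo leaves nothing behind in the cwd.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in for the dump: one item per line, wrapped in [ ... ],
# with a trailing comma after every item except the last.
printf '%s\n' '[' '{"id":"Q1"},' '{"id":"Q2"},' '{"id":"Q3"}' ']' | gzip > sample.json.gz

# Same shape as the one-liner, plus two sed expressions (an addition, not in
# the original) that drop lines consisting only of "[" or "]".
# split runs the --filter command once per chunk, with $FILE set to the
# chunk's name; the single quotes keep the current shell from expanding it.
zcat sample.json.gz \
  | sed -e '/^\[$/d' -e '/^]$/d' -e 's/,$//' \
  | split -l 2 - sample_items_ --filter='gzip > $FILE.gz'

# Each line of each chunk is now a standalone JSON object.
zcat sample_items_aa.gz sample_items_ab.gz
```

With a chunk size of 2, the three items land in two files, `sample_items_aa.gz` and `sample_items_ab.gz`. Dropping the two bracket-deleting expressions reproduces the original behaviour, where those lines are carried into the first and last chunk.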