@dkam
Created November 26, 2025 20:55
Convert Wikidata to Parquet
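# Requires GNU split (for --filter) and the duckdb CLI.
# Create the output directory up front so COPY has somewhere to write.
mkdir -p chunky-parquet
# sed: drop the opening "[" and closing "]" lines, strip each line's trailing comma.
# split: pipe every 100,000 lines to duckdb on stdin; $FILE is the chunk name split assigns.
bzcat latest-all.json.bz2 | sed '1d;$d;s/,$//' | split -l 100000 - --filter="
duckdb -c \"COPY (
SELECT * FROM read_json('/dev/stdin', union_by_name=true)
) TO 'chunky-parquet/\$FILE.parquet' (
FORMAT PARQUET,
COMPRESSION ZSTD
)\"
" chunk_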
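
To see what the sed and split stages do before running them on the full dump, here is a minimal sketch on a toy input. The three entities are made up for illustration, the chunk size is 2 instead of 100000, and `cat` stands in for duckdb as the filter command. It assumes GNU split, since `--filter` is not available in every split implementation.

```shell
# Work in a scratch directory so nothing is left behind in the current one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy stand-in for the dump: a JSON array with one entity per line
# (hypothetical contents, same line layout the sed script expects).
printf '%s\n' '[' '{"id":"Q1"},' '{"id":"Q2"},' '{"id":"Q3"}' ']' > sample.json

# Same sed script as the pipeline: delete the first line, delete the last
# line, and strip the trailing comma from every remaining line.
sed '1d;$d;s/,$//' sample.json > sample.ndjson

# Same split pattern, with `cat` in place of duckdb. GNU split runs the
# filter once per chunk and exports the chunk name to it as $FILE.
split -l 2 --filter='cat > "$FILE.ndjson"' sample.ndjson chunk_

cat sample.ndjson
ls chunk_*
```

With the default two-letter suffixes, the chunks are named chunk_aa, chunk_ab, and so on, which is where the Parquet file names in chunky-parquet/ come from.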