Last active
January 22, 2024 05:54
Convert ISNI_persons.jsonld.gz into a JSONL file using command line tools sed and jq.
# https://isni.org/page/linked-data/
# https://isni.oclc.org:2443/isni/public_export/ISNI_persons.jsonld.gz
wget https://isni.oclc.org:2443/isni/public_export/ISNI_persons.jsonld.gz
# Decompress the download so the steps below can read ISNI_persons.jsonld
gunzip ISNI_persons.jsonld.gz
# The file I downloaded was full of the 0x1E character (the ASCII record separator, shown as ^^ in caret notation).
# This strips it. Note that the \x1E escape is a GNU sed extension.
sed 's/\x1E//g' ISNI_persons.jsonld > cleaned_ISNI_persons.jsonld
# Then use jq to convert the file into the far saner JSONL format. By default, jq tries to read the whole file into
# memory, so you will need the streaming version, which I found at:
# https://stackoverflow.com/questions/49808581/using-jq-how-can-i-split-a-very-large-json-file-into-multiple-files-each-a-spec
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' cleaned_ISNI_persons.jsonld > ISNI_persons.jsonl
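
To see what the two steps do without downloading the full export, here is a small sketch on a made-up sample. It assumes the real file is a single top-level JSON array with 0x1E bytes scattered through it, which is the shape the `1|truncate_stream` idiom is meant for; the sample records and the file names under `$tmp` are invented for illustration. It uses `tr` rather than `sed` for the stripping step, since `tr` accepts octal escapes on any POSIX system.

```shell
#!/bin/sh
# Toy stand-in for the real export: a top-level array of two records,
# each preceded by the 0x1E record separator (octal \036).
# The sample data and file names here are hypothetical.
tmp=$(mktemp -d)
printf '[\036{"id":1,"name":"A"},\036{"id":2,"name":"B"}]\n' > "$tmp/sample.jsonld"

# Step 1: delete every 0x1E byte.
tr -d '\036' < "$tmp/sample.jsonld" > "$tmp/cleaned.jsonld"

# Step 2: stream the array and write one compact object per line.
# --stream emits [path, leaf] events instead of loading the whole document;
# truncate_stream drops the leading array index from each path, and
# fromstream reassembles each element as its own value.
# Skipped when jq is not installed.
if command -v jq >/dev/null 2>&1; then
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))' "$tmp/cleaned.jsonld" > "$tmp/sample.jsonl"
  cat "$tmp/sample.jsonl"
fi
```

With jq installed, this should print each array element on its own line, which is the JSONL layout the full command produces for the real file.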