Skip to content

Instantly share code, notes, and snippets.

@sloanlance
Last active September 13, 2024 10:08
Show Gist options
  • Save sloanlance/6b648e51c3c2a69ae200c93c6a310cb6 to your computer and use it in GitHub Desktop.
Save sloanlance/6b648e51c3c2a69ae200c93c6a310cb6 to your computer and use it in GitHub Desktop.
jq: JSONL → TSV conversion

jq: JSONL → TSV conversion

What is TSV?

TSV means "tab-separated values". I prefer this format over CSV ("comma-separated values") because it doesn't require as much quoting. Many programs that can use CSV formatted data can also use TSV, although they may need to be explicitly told of the different format if it's not detected automatically.

However, in any of the jq scripts below, "@tsv" can usually be replaced with "@csv" to get CSV output instead.

  1. First Row Keys Only

    jq --slurp --raw-output '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]|tostring])[] | @tsv' input.jsonl > output.tsv

    Problems with nested structures are avoided by turning them all to strings with "tostring".

  2. Keys From All Rows

    jq --slurp --raw-output '(map(keys) | add | unique) as $keys | $keys, map([.[ $keys[] ]|tostring])[] | @tsv'

    This version builds a list of all keys from every row, which allows for variations in keys from row to row.

  3. Unique Path Strings To Each Key In Input

    The problem with the previous two is that they are shallow. They only work with the immediate child keys of each row. If the row contains nested structures (arrays or objects), they are expressed as strings.

    To allow working with nested structures, this filter is an attempt to make an array of all unique paths to keys in the input:

    jq --slurp --raw-output '[paths(scalars)[1:] | map(if type=="number" then "[\(.)]" else ".\(.)" end) | join("")] | unique'

    Maybe combining this with one of the others will let us produce TSV with column headings for all keys, no matter how complex the JSON?

    1. Unique Paths To Each Key In Input

      Like the parent of this item, except the paths are kept in the internal array structure, not converted to a selector string:

      jq --slurp --compact-output --raw-output '[paths(scalars)[1:] | tojson] | unique[] | fromjson'
    2. Using Paths To Get Values From Input

      Using the path arrays from the previous item, get the value found at that path:

      jq --slurp --raw-output '[["tar", 1, "max" ], ["tar", 2]] as $p | [($p[]|map(if type=="number" then "[\(.)]" else ".\(.)" end) | join(""))],(.[] | [getpath($p[])]) | @tsv'

      https://jqplay.org/s/Qk1TTRjqJd

      The values in the "$p" array are examples from the output of the previous item.

Solution?

This is close:

jq --slurp --raw-output '[[paths(scalars)[1:] | tojson] | unique[] | fromjson] as $p | [($p[]|map(if type=="number" then "[\(.)]" else ".\(.)" end) | join(""))],(.[] | [getpath($p[])]) | @tsv'

https://jqplay.org/s/rxyRtfjwqA

In the example above, fu is a number in most rows, but fu.x exists in the last row. When the rows are processed, the try…catch block prevents errors about fu.x on rows where fu is a number, but no other output appears for the row after that column, either. Also, output for the row containing fu.x also contains the value {"x":42} for the column fu.

Maybe two things are needed:

  1. When a value is not a simple type (is that scalar?), it should not give any output.
  2. When a row doesn't contain a nested key, there needs to be some way to process the remaining columns. Perhaps that can be accomplished by calling getpath() multiple times per row, rather than one time with all path names.

Inspiration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment