-
-
Save pkoppstein/a5abb4ebef3b0f72a6ed to your computer and use it in GitHub Desktop.
module { | |
"name": "schema", | |
"description": "JSON Schema Inference", | |
"version": "0.0.3.1", | |
"homepage": "https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed", | |
"license": "MIT", | |
"author": "pkoppstein at gmail dot com" | |
}; | |
# NEWS: | |
# 2022.10.16 | |
# - unionType need not be top-level | |
# - documentation, notably mention gojq | |
# 2021.01.18 | |
# - the extended type of [] is [], and [] U [T] => [T] | |
# 2021.01.16 | |
# - bug fix @2021 | |
# 2019.07.26 | |
# - every type is regarded as including null if and only if $ARGS.named.nullable (default: false) | |
# Requires: jq 1.5 or higher, or gojq | |
# Usage examples: | |
# jq --argjson nullable true 'include "schema"; schema' input.json | |
# jq .[] huge.json | jq -n 'include "schema" {search: "."}; schema(inputs)' | |
# gojq .[] huge.json | gojq -L . -nc 'include "schema"; schema(inputs)' | |
# Example of use with jm (http://github.com/pkoppstein/jm) | |
# jm HUGE.json | jq -n 'include "schema" {search: "."}; schema(inputs)' | |
# To check conformance with the structural schemas produced by this module, | |
# see the JESS repository at https://github.com/pkoppstein/JESS | |
# This module defines four filters: | |
# schema/0 returns the typeUnion of the extended-type values of the entities | |
# in the input array, if the input is an array; | |
# otherwise it simply returns the "typeof" value of its input. | |
# schema(stream) returns the schema of the items in the stream; | |
# typeof/0 returns the extended-type of its input; | |
# typeUnion(a;b) returns b if a == null, otherwise it returns | |
# the union of the two specified extended-type values; | |
# All extended-type values are themselves JSON entities, e.g. | |
# the JSON value null has extended type `"null"`. | |
# Each extended type can be thought of as a set of JSON entities, | |
# e.g. "number" for the set consisting of null and the JSON numbers, | |
# and ["number"] for the set of arrays whose elements are in "number". | |
# Note that the JSON value [] can be thought of as belonging to all | |
# extended types of the form [T] as well as [] and "JSON". | |
# If $nullable, then `null` is regarded as an element of all types; | |
# otherwise, null is only regarded as being a member of "null", | |
# "scalar", "JSON", and any type of the form ["+", "null", X]; and all | |
# other types, T, are extended by ["+", "null", T]. | |
# The extended types are: | |
# "null", "boolean", "string", "number"; | |
# "scalar" for any combination of scalars (null, true, false, strings, or numbers); | |
# []; | |
# [T] where T is an extended type; | |
# an object all of whose values are extended types; | |
# a union type: ["+", "null", T] where T is an extended type not of the same form; | |
# "JSON" signifying that no other extended-type value is applicable. | |
# The extended-type of a JSON entity is defined recursively. | |
# The extended-type of a scalar value is its jq type ("null", "boolean", "string", "number"). | |
# The extended-type of [] is []. | |
# The extended-type of a non-empty array of values all of which have the | |
# same extended-type, T, is [T]. | |
# The extended-type of an object is an object with the same keys, but the | |
# values of which are the extended-types of the corresponding values. | |
# The extended-type of an array with values of different extended-types is defined | |
# using typeUnion(a;b) as defined below. | |
# typeUnion(a;b) returns the least extended-type value that subsumes both a and b. | |
# For example: | |
# typeUnion("number"; "string") yields "scalar"; | |
# typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"}; | |
# if $nullable, then typeUnion("null", t) yields t for any valid extended type, t; | |
# otherwise, if t is an extended type, then typeUnion("null"; t) yields t if t already includes | |
# null, or else ["+", "null", t]. | |
def typeUnion($a; $b): | |
def unionType: type == "array" and .[0] == "+" ; | |
def nullable: $ARGS.named.nullable; | |
def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar"; | |
def addNull: | |
if nullable then . | |
elif unionType or . == "null" or . == "scalar" or . == "JSON" then . | |
else ["+", "null", .] | |
end; | |
def addNull($x;$y): typeUnion($x;$y) | addNull; | |
if $a == null or $a == $b then $b | |
elif $b == null then $a # @2021 | |
elif $a == "null" then $b | addNull | |
elif $b == "null" then $a | addNull | |
elif ($a|unionType) and ($b|unionType) then addNull($a[2];$b[2]) | |
elif $a|unionType then addNull($a[2]; $b) | |
elif $b|unionType then addNull($a; $b[2]) | |
elif ($a | scalarp) and ($b | scalarp) then "scalar" | |
elif $a == "JSON" or $b == "JSON" then "JSON" | |
elif $a == [] then typeUnion($b; $a) # @2021 | |
elif $b == [] and ($a|type) == "array" then $a # @2021 | |
elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ] | |
elif ($a|type) == "object" and ($b|type) == "object" then | |
((($a|keys) + ($b|keys)) | unique) as $keys | |
| reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) ) | |
else "JSON" | |
end ; | |
def typeof: | |
def typeofArray: | |
if length == 0 then [] # @2021 | |
else [reduce .[] as $item (null; typeUnion(.; $item|typeof))] | |
end ; | |
def typeofObject: | |
reduce keys[] as $key ( . ; .[$key] |= typeof ) ; | |
. as $in | |
| type | |
| if . == "null" then "null" | |
elif . == "string" or . == "number" or . == "boolean" then . | |
elif . == "object" then $in | typeofObject | |
else $in | typeofArray | |
end; | |
def schema(stream): | |
reduce stream as $x (null; typeUnion(.; $x|typeof)); | |
# Omit the outermost [] for an array | |
def schema: | |
if type == "array" then schema(.[]) | |
else typeof | |
end ; | |
## Example programs: | |
# typeof | |
# .. | if type == "object" and has("country") then schema else empty end | |
# schema | |
# schema(inputs) |
@deric4 - Thanks for pointing out the mismatch. I've updated the Experimental Benchmarks page to reflect the updated version of schema.jq (https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed). In brief, schema.jq was updated to pay attention to the "nullable" flag, as documented in the gist. What was lost in the wash is that the default behavior changed. Omitting the nullable flag now produces slightly different results but takes considerably longer to execute.
Awesome thanks @pkoppstein !
@pkoppstein In trying to demonstrate some examples to show where I think the "JSON"
extended-type can be improved, I stumbled on this case:
jq -n "$(curl -s https://gist.githubusercontent.com/pkoppstein/a5abb4ebef3b0f72a6ed/raw/b1315e7fb23a1c90df94d6425cfdba0907a6a284/schema.jq)
schema(inputs)" <<EOM
{"a":1}
{"b":2}
EOM
Which produces:
{
"a": "JSON",
"b": "number"
}
when I would have expected:
{
"a": "number",
"b": "number"
}
Curious if you think a case might be missing on L84:
i.e.:
# [...]
if $a == null or $a == $b then $b
elif $b == null then $a # This?
# [...]
Thoughts?
@btiernay - You're absolutely correct. Fixed. Thanks.
Thanks for the updates!
Here are a couple of cases I'm wondering if we can improve upon:
Case 1
Input
{"a":[]}
{"a":[1]}
Actual Output
{
"a": "JSON"
}
Expected Output
{
"a": [
"number"
]
}
Case 2
Input
{"a":1}
{"a":{}}
Actual Output
{
"a": "JSON"
}
Expected Output
{
"a": [
"+",
"number",
"object"
]
}
The above introduces the notion of type unions which are the general mechanism for loss-less schema inferencing. This is especially insightful for more complex array and object types.
@btiernay - Thanks for the two excellent examples.
I am inclined to change schema.jq in accordance with your first example:
-- the new extended-type for [] will be []
-- [] U [ T ] => [ T ]
I will have to see what changes are needed to JESS (the schema-checker) as
we obviously do not want the schema {"a": []}
to imply that .a must be the empty array.
The second example, as you say, involves a significant extension of
schema.jq's extended-types. JESS should already handles these,
but I'm concerned about the performance implications for schema inference.
If these were supported as an option, what would you suggest as the option name?
UPDATE:
The update regarding [] is in V 0.0.3 at
https://gist.github.com/pkoppstein/70b7c053cf7757e1cf9665a0971d7383
Please let me know if it passes your tests; I'll then update the "gist" above.
The gist has been updated in accordance with @btiernay's "Case 1"
above, so that now (that is, with Version 0.0.3.1):
schema( [], [1] ) == ["number"] # formerly "JSON"
schema( {a:[]}, {a:[1]} ) == {a:["number"]} # formerly {"a":"JSON"}
For reference, the prior version of this page (with Version 0.0.2.3 of schema.jq) can be found at
https://web.archive.org/web/20221017022540/https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed
Hi @pkoppstein !
Following up on an observation I noticed in the significant time difference between the versions of
schema.jq
from 2017 and its current implementation ( jqlang/jq#2386 (comment))Interested in trying to figure out if the additional conditional checks that have been implemented in the latest version are what's responsible, or something that rosetta 2 emulation on M1 was at fault for and was wondering if you might have any suggestions on valid data sets to perform a few different benchmarks.
Thanks!