-
-
Save pkoppstein/a5abb4ebef3b0f72a6ed to your computer and use it in GitHub Desktop.
module { | |
"name": "schema", | |
"description": "JSON Schema Inference", | |
"version": "0.0.3.1", | |
"homepage": "https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed", | |
"license": "MIT", | |
"author": "pkoppstein at gmail dot com" | |
}; | |
# NEWS: | |
# 2022.10.16 | |
# - unionType need not be top-level | |
# - documentation, notably mention gojq | |
# 2021.01.18 | |
# - the extended type of [] is [], and [] U [T] => [T] | |
# 2021.01.16 | |
# - bug fix @2021 | |
# 2019.07.26 | |
# - every type is regarded as including null if and only if $ARGS.named.nullable (default: false) | |
# Requires: jq 1.5 or higher, or gojq | |
# Usage examples: | |
# jq --argjson nullable true 'include "schema"; schema' input.json | |
# jq .[] huge.json | jq -n 'include "schema" {search: "."}; schema(inputs)' | |
# gojq .[] huge.json | gojq -L . -nc 'include "schema"; schema(inputs)' | |
# Example of use with jm (http://github.com/pkoppstein/jm) | |
# jm HUGE.json | jq -n 'include "schema" {search: "."}; schema(inputs)' | |
# To check conformance with the structural schemas produced by this module, | |
# see the JESS repository at https://github.com/pkoppstein/JESS | |
# This module defines four filters: | |
# schema/0 returns the typeUnion of the extended-type values of the entities | |
# in the input array, if the input is an array; | |
# otherwise it simply returns the "typeof" value of its input. | |
# schema(stream) returns the schema of the items in the stream; | |
# typeof/0 returns the extended-type of its input; | |
# typeUnion(a;b) returns b if a == null, otherwise it returns | |
# the union of the two specified extended-type values; | |
# All extended-type values are themselves JSON entities, e.g. | |
# the JSON value null has extended type `"null"`. | |
# Each extended type can be thought of as a set of JSON entities, | |
# e.g. "number" for the set consisting of null and the JSON numbers, | |
# and ["number"] for the set of arrays whose elements are in "number". | |
# Note that the JSON value [] can be thought of as belonging to all | |
# extended types of the form [T] as well as [] and "JSON". | |
# If $nullable, then `null` is regarded as an element of all types; | |
# otherwise, null is only regarded as being a member of "null", | |
# "scalar", "JSON", and any type of the form ["+", "null", X]; and all | |
# other types, T, are extended by ["+", "null", T]. | |
# The extended types are: | |
# "null", "boolean", "string", "number"; | |
# "scalar" for any combination of scalars (null, true, false, strings, or numbers); | |
# []; | |
# [T] where T is an extended type; | |
# an object all of whose values are extended types; | |
# a union type: ["+", "null", T] where T is an extended type not of the same form; | |
# "JSON" signifying that no other extended-type value is applicable. | |
# The extended-type of a JSON entity is defined recursively. | |
# The extended-type of a scalar value is its jq type ("null", "boolean", "string", "number"). | |
# The extended-type of [] is []. | |
# The extended-type of a non-empty array of values all of which have the | |
# same extended-type, T, is [T]. | |
# The extended-type of an object is an object with the same keys, but the | |
# values of which are the extended-types of the corresponding values. | |
# The extended-type of an array with values of different extended-types is defined | |
# using typeUnion(a;b) as defined below. | |
# typeUnion(a;b) returns the least extended-type value that subsumes both a and b. | |
# For example: | |
# typeUnion("number"; "string") yields "scalar"; | |
# typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"}; | |
# if $nullable, then typeUnion("null", t) yields t for any valid extended type, t; | |
# otherwise, if t is an extended type, then typeUnion("null"; t) yields t if t already includes | |
# null, or else ["+", "null", t]. | |
def typeUnion($a; $b): | |
def unionType: type == "array" and .[0] == "+" ; | |
def nullable: $ARGS.named.nullable; | |
def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar"; | |
def addNull: | |
if nullable then . | |
elif unionType or . == "null" or . == "scalar" or . == "JSON" then . | |
else ["+", "null", .] | |
end; | |
def addNull($x;$y): typeUnion($x;$y) | addNull; | |
if $a == null or $a == $b then $b | |
elif $b == null then $a # @2021 | |
elif $a == "null" then $b | addNull | |
elif $b == "null" then $a | addNull | |
elif ($a|unionType) and ($b|unionType) then addNull($a[2];$b[2]) | |
elif $a|unionType then addNull($a[2]; $b) | |
elif $b|unionType then addNull($a; $b[2]) | |
elif ($a | scalarp) and ($b | scalarp) then "scalar" | |
elif $a == "JSON" or $b == "JSON" then "JSON" | |
elif $a == [] then typeUnion($b; $a) # @2021 | |
elif $b == [] and ($a|type) == "array" then $a # @2021 | |
elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ] | |
elif ($a|type) == "object" and ($b|type) == "object" then | |
((($a|keys) + ($b|keys)) | unique) as $keys | |
| reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) ) | |
else "JSON" | |
end ; | |
def typeof: | |
def typeofArray: | |
if length == 0 then [] # @2021 | |
else [reduce .[] as $item (null; typeUnion(.; $item|typeof))] | |
end ; | |
def typeofObject: | |
reduce keys[] as $key ( . ; .[$key] |= typeof ) ; | |
. as $in | |
| type | |
| if . == "null" then "null" | |
elif . == "string" or . == "number" or . == "boolean" then . | |
elif . == "object" then $in | typeofObject | |
else $in | typeofArray | |
end; | |
def schema(stream): | |
reduce stream as $x (null; typeUnion(.; $x|typeof)); | |
# Omit the outermost [] for an array | |
def schema: | |
if type == "array" then schema(.[]) | |
else typeof | |
end ; | |
## Example programs: | |
# typeof | |
# .. | if type == "object" and has("country") then schema else empty end | |
# schema | |
# schema(inputs) |
@btiernay - Thanks for your comments and the pointer to the article.
Would you be able to provide a small but concrete
example of a set of JSON texts and the inferred schema
you have in mind?
(My working hypothesis that it is not possible to add inference rules
to schema.jq without adding new types or breaking its commitment to
what I would call "universality". Of course specific applications can
be expected to require more constrained schemas, which is why I
created JESS (https://github.com/pkoppstein/JESS), which allows very
complex constraints to be specified. The main idea is that the schema
produced by schema.jq could be refined into a JESS schema.
Unfortunately, the JESS schema language is correspondingly complex as
well :-)
By the way, I was impressed by your "JSON like a boss"
presentation. The version that I saw is quite old but it's
still on the web, so if you're open to making revisions,
please let me know, perhaps by sending an email to
me as per the "author" tag above.
Thanks again!
Thanks for the feedback and complement! 🙇
Before giving some examples, can you please expand on “universality”? In the literature it is generally accepted that there is a trade off between conciseness and precision. It sounds like you are highly favouring conciseness in your approach?
@btiernay - Perhaps "universality" is subsumed here by the
constraint that there be no new types but
I used the term "univserality" to express
the idea that additional schema inference rules
would require some special assumptions.
For example, consider JSON objects of the form:
{"i": 1}
{"i": 2}
To infer the schema {"i": "integer"}
(meaning that .i must be an integer)
is to go well beyond the JSON type system,
and thus would best be handled by some
domain-specific or application-specific rule.
A similar case would be JSON arrays of the form:
["string", "number"]
i.e. where length ==2 and (.[0]|type=="string") and (.[1]|type=="number")
.
Hi @pkoppstein !
Following up on an observation I noticed in the significant time difference between the versions of schema.jq
from 2017 and its current implementation ( jqlang/jq#2386 (comment))
Interested in trying to figure out if the additional conditional checks that have been implemented in the latest version are what's responsible, or something that rosetta 2 emulation on M1 was at fault for and was wondering if you might have any suggestions on valid data sets to perform a few different benchmarks.
Thanks!
@deric4 - Thanks for pointing out the mismatch. I've updated the Experimental Benchmarks page to reflect the updated version of schema.jq (https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed). In brief, schema.jq was updated to pay attention to the "nullable" flag, as documented in the gist. What was lost in the wash is that the default behavior changed. Omitting the nullable flag now produces slightly different results but takes considerably longer to execute.
Awesome thanks @pkoppstein !
@pkoppstein In trying to demonstrate some examples to show where I think the "JSON"
extended-type can be improved, I stumbled on this case:
jq -n "$(curl -s https://gist.githubusercontent.com/pkoppstein/a5abb4ebef3b0f72a6ed/raw/b1315e7fb23a1c90df94d6425cfdba0907a6a284/schema.jq)
schema(inputs)" <<EOM
{"a":1}
{"b":2}
EOM
Which produces:
{
"a": "JSON",
"b": "number"
}
when I would have expected:
{
"a": "number",
"b": "number"
}
Curious if you think a case might be missing on L84:
i.e.:
# [...]
if $a == null or $a == $b then $b
elif $b == null then $a # This?
# [...]
Thoughts?
@btiernay - You're absolutely correct. Fixed. Thanks.
Thanks for the updates!
Here are a couple of cases I'm wondering if we can improve upon:
Case 1
Input
{"a":[]}
{"a":[1]}
Actual Output
{
"a": "JSON"
}
Expected Output
{
"a": [
"number"
]
}
Case 2
Input
{"a":1}
{"a":{}}
Actual Output
{
"a": "JSON"
}
Expected Output
{
"a": [
"+",
"number",
"object"
]
}
The above introduces the notion of type unions which are the general mechanism for loss-less schema inferencing. This is especially insightful for more complex array and object types.
@btiernay - Thanks for the two excellent examples.
I am inclined to change schema.jq in accordance with your first example:
-- the new extended-type for [] will be []
-- [] U [ T ] => [ T ]
I will have to see what changes are needed to JESS (the schema-checker) as
we obviously do not want the schema {"a": []}
to imply that .a must be the empty array.
The second example, as you say, involves a significant extension of
schema.jq's extended-types. JESS should already handles these,
but I'm concerned about the performance implications for schema inference.
If these were supported as an option, what would you suggest as the option name?
UPDATE:
The update regarding [] is in V 0.0.3 at
https://gist.github.com/pkoppstein/70b7c053cf7757e1cf9665a0971d7383
Please let me know if it passes your tests; I'll then update the "gist" above.
The gist has been updated in accordance with @btiernay's "Case 1"
above, so that now (that is, with Version 0.0.3.1):
schema( [], [1] ) == ["number"] # formerly "JSON"
schema( {a:[]}, {a:[1]} ) == {a:["number"]} # formerly {"a":"JSON"}
For reference, the prior version of this page (with Version 0.0.2.3 of schema.jq) can be found at
https://web.archive.org/web/20221017022540/https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed
The above is a great first-order approximation inferencing algorithm. Thanks for sharing!
I'm curious if explored improving the
"JSON"
handling by using the "fusion" technique described in Mohamed-Amine Baazizi el al's Schema Inference for Massive JSON Datasets. An example Scala implementation may be found here from Ivan Veinhardt Lattak's "Schema Inference for NoSQL Databases" master thesis which is a great summary of the various approaches.