Created
December 31, 2022 22:50
-
-
Save tomsing1/304762fce66353310b254e8b22094a6e to your computer and use it in GitHub Desktop.
Experimenting with JSON fields in a SQLite database
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Experimenting with SQLite" | |
editor_options: | |
chunk_output_type: inline | |
--- | |
```{r} | |
library(glue) | |
library(RSQLite) | |
library(jsonlite) | |
``` | |
The steps below are inspired by | |
[this tutorial](https://tirkarthi.github.io/programming/2022/02/26/sqlite-json-improvements.html) | |
```{r} | |
con <- dbConnect(RSQLite::SQLite(), ":memory:") | |
``` | |
```{r} | |
if (dbExistsTable(con, "user")) { | |
dbRemoveTable(con, "user") | |
} | |
sql <- glue_sql( | |
"CREATE TABLE IF NOT EXISTS user ( | |
id integer PRIMARY KEY, | |
name text, | |
interests json, | |
created timestamp default current_timestamp not null | |
);", .con = con) | |
DBI::dbExecute(con, sql) | |
``` | |
```{r} | |
interests <- list( | |
John = list( | |
likes = c("skating", "reading", "swimming"), | |
dislikes = c("cooking") | |
), | |
Kate = list( | |
likes = c("reading", "swimming"), | |
dislikes = c("skating") | |
), | |
Jim = list( | |
likes = c("reading", "swimming"), | |
dislikes = c("cooking") | |
) | |
) | |
for (id in names(interests)) { | |
dbWithTransaction(con, { | |
payload <- toJSON(interests[[id]], autounbox = TRUE) | |
sql <- glue_sql( | |
"INSERT INTO user (id, name, interests) VALUES(null, {id}, {payload});", | |
.con = con) | |
DBI::dbExecute(con, sql) | |
}) | |
} | |
``` | |
We can use SQLite's | |
[-> and ->> JSON operators](https://www.sqlite.org/json1.html#jptr) | |
for extracting subcomponents of a JSON column. | |
### Left operands | |
Both operators take a JSON string as their *left* operand. | |
### Right operands | |
On the right side, you can use ither a | |
[PATH expression](https://dev.mysql.com/worklog/task/?id=8607) | |
(similar to MySQL) or an object field label or array index (like Postgres). | |
### Return values | |
- `->`: returns a JSON representation of the selected subcomponent or `NULL` if | |
that subcomponent does not exist. | |
- `->>`: returns an SQL TEXT, INTEGER, REAL, or NULL value that represents the | |
selected subcomponent, or NULL if the subcomponent does not exist. | |
Because `->` always returns a JSON representation, with can use it to chain | |
operations. | |
In a first example using `PATH expression` syntax, we specify the `likes` | |
attribute in the `interests` JSON column with the `$.likes` path expression and | |
chain it to `->>` to return the **first** element of the array. (We could also | |
use negative indexing, e.g. `$[#-1]` to access the last element of the array.) | |
```{sql connection=con} | |
select id, name from user | |
where interests->'$.likes'->>'$[0]' = 'skating'; | |
``` | |
Alternatively, we can execute the same query using a Postgres-like syntax [^1]. | |
```{sql connection=con} | |
select id, name from user | |
where interests->'likes'->> 0 = 'skating'; | |
``` | |
## Filtering JSON columns | |
::: {.callout-note} | |
All of the methods shown below (e.g. using `json_each` and `json_tree`) scan | |
through the full table, e.g. they don't usee an index to make the search more efficient. | |
::: | |
If we have information about the structure of the JSON object, then we can | |
unpack its elements and to focus on specific fields. (See | |
[this tutorial](https://database.guide/sqlite-json_tree/) | |
for some nice examples.) | |
For example, we might want to return all users like `skating`, e.g. whose | |
`likes` JSON field *contains* this value, e.g. regardless of the position of the value within the array. | |
### json_each | |
The `json_each` *table-valued* function walks along the specified JSON field and returns a new table with row for each value. The returned table includes e.g. | |
`key` and `value` columns we can use to filter. | |
```{sql connection=con} | |
SELECT DISTINCT user.id, name, interests->'likes' as likes | |
FROM user, json_each(interests, '$.likes') | |
WHERE json_each.value LIKE '%swimming%' | |
``` | |
### json_tree | |
The `json_tree` function steps through the full JSON object, e.g. it also | |
descends into the two sub-arrays `likes` and `dislikes`. | |
```{sql connection=con} | |
SELECT name, key, value | |
FROM user, json_tree(interests) | |
``` | |
We can define the root for the tree by providing a second argument, e.g. to | |
focus only on the `likes` sub-array: | |
```{sql connection=con} | |
SELECT name, key, value | |
FROM user, json_tree(interests, '$.likes') | |
``` | |
To focus only on the node leafs of the tree, we can retain only `atomic` fields: | |
```{sql connection=con} | |
SELECT name, key, value | |
FROM user, json_tree(interests, '$.likes') | |
WHERE json_tree.atom IS NOT NULL | |
``` | |
Now that we have row for each value of the `likes` sub-array, we can filter the | |
table: | |
```{sql connection=con} | |
SELECT name, key, value | |
FROM user, json_tree(interests, '$.likes') | |
WHERE json_tree.atom IS NOT NULL AND json_tree.value LIKE '%skating%' | |
``` | |
::: {.callout-warning} | |
Both the `json_each` and `json_tree` function scan the entire `interests` | |
column of the table, e.g. they don't use an index. I have not been able to | |
find documentation on how to add an (expression) index to speed up the search, | |
so for large datasets the queries shown above might be slow. | |
::: | |
## Casting JSON to text | |
Another way to filter JSON columns is to simply cast them into text, e.g. a JSON | |
string, and perform a text comparison. (Note that we are including the double | |
quotes in the search term to avoid unexpected partial matches.) | |
```{sql connection=con} | |
SELECT id, name, CAST(interests->'likes' AS TEXT) | |
FROM user | |
WHERE CAST(interests->'likes' AS TEXT) LIKE '%"skating"%' | |
``` | |
Alternatively, we can also extract the TEXT representation of the `likes` | |
sub-array with the `json_extract` function: | |
```{sql connection=con} | |
SELECT id, name, json_extract(interests, '$.likes') as likes | |
FROM user | |
WHERE json_extract(interests, '$.likes') LIKE '%"skating"%' | |
``` | |
### Indexing | |
::: {.callout-warning} | |
Using an index for fuzzy matching with the `LIKE` operator is only supported | |
for patterns of the format `XXX%`, e.g. a leading constant value before any | |
wildcards (as outlined | |
[here](https://stackoverflow.com/questions/57329285/how-to-properly-implement-indexing-for-use-in-variable-like-statement-with-sqlit)). | |
It is also necessary to make sure the search is case insensitive by defining | |
the casted TEXT explicitely as such with the `COLLATE NOCASE` clause. | |
::: | |
```{sql connection=con, include = FALSE} | |
DROP INDEX IF EXISTS dislikes_idx; | |
``` | |
Without an index, the table is scanned entirely: | |
```{sql connection=con} | |
EXPLAIN QUERY PLAN | |
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes | |
FROM user | |
WHERE dislikes = '["cooking"]' | |
``` | |
To speed up future queries, we can add an index based on the same | |
*expression* we are planning to use in the query: | |
```{sql connection=con} | |
CREATE INDEX dislikes_idx | |
ON user(CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE); | |
``` | |
When the equality (`=`) filter is returned using the `likes_idx` index, speeding | |
up the search. | |
```{sql connection=con} | |
EXPLAIN QUERY PLAN | |
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes | |
FROM user | |
WHERE dislikes = '["cooking"]' | |
``` | |
Similarly, when we use a pattern with a wildcard at the end, the index is used | |
as well: | |
```{sql connection=con} | |
EXPLAIN QUERY PLAN | |
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes | |
FROM user | |
WHERE dislikes LIKE 'cooking"%' | |
``` | |
But a wildcard at the start of the term triggers a full scan of the table. | |
```{sql connection=con} | |
EXPLAIN QUERY PLAN | |
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes | |
FROM user | |
WHERE dislikes LIKE '%cooking' | |
``` | |
That's a severe limitation because it precludes matching patterns that are _not_ | |
at the start of the TEXT string. | |
There might be other ways to implement a | |
[full text search](https://www.sqlite.org/fts5.html) | |
in SQLite, but I haven't discovered these features, yet. | |
## Cleanup | |
As always, we disconnect from the database when we are done. | |
```{r} | |
dbDisconnect(con) | |
``` | |
[^1]: If the right operand is a text label `X`, then it is interpreted as the | |
JSON path `$.X`. If the right operand is an integer value `N`, then it is | |
interpreted as the JSON path `$[N]`. | |
<details> | |
<summary> | |
SessionInfo | |
</summary> | |
```{r} | |
#| echo: false | |
sessioninfo::session_info() | |
``` | |
</details> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment