informal spec for a shitty datalog db implemented on a filesystem:
- a directory represents a table
- a file in the directory represents a row
- the filename is a collision-resistant hash of the content
- the file/directory extension is the format of the content (e.g. .ssii: two strings, two integers)
- row format could be binary or single-line csv
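a minimal sketch of an insert under these conventions, assuming blake2b with a 16-byte digest as the hash, utf-8 single-line csv as the row encoding, and a write-to-temp-then-rename dance for atomicity (none of which the spec mandates):

```python
# sketch only: blake2b, the csv encoding, and the .tmp rename are assumptions
import csv
import hashlib
import io
import os

def insert_row(table_dir: str, values: tuple, ext: str) -> str:
    """Encode values as one csv line and store it content-addressed."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    content = buf.getvalue().encode("utf-8")
    name = hashlib.blake2b(content, digest_size=16).hexdigest() + "." + ext
    path = os.path.join(table_dir, name)
    if not os.path.exists(path):  # duplicate content -> same name: dedup for free
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(content)
        os.rename(tmp, path)  # atomic publish: readers never see a partial row
    return path
```

e.g. insert_row("users.ssii", ("ada", "lovelace", 1815, 1852), "ssii") lands the row in users.ssii/ under its own hash.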
some nice properties:
- needs no container format; leverages existing filesystem technology for high-throughput caching and buffering as well as shared network access
- separates the raw, unindexed data from any indices built over it
- create/drop tables by just adding/removing directories
- scan a table just by listing its directory
- add/remove rows by just writing/removing files
- rows are content-addressed, therefore:
  - rows are immutable
  - storage deduplicates itself, since duplicate content -> identical filename
  - corruption can be detected by checking that the content still hashes to the filename (see the verification sketch after this list)
- each row gets a creation time and a last-access time for free from file metadata (permitting LRU drops, though atime is often coarse or disabled by noatime/relatime mounts)
- alter columns by transitioning file extensions; since each row is rewritten independently, the migration can be safely resumed after interruption (see the migration sketch after this list)
- online indices can update in response to filesystem change notifications (e.g. inotify)
- bonus: permission flags could do something interesting
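the corruption-detection property is cheap to exercise; a sketch, reusing the blake2b naming assumed in the insert sketch:

```python
# sketch only: assumes the blake2b-hexdigest naming from the insert sketch
import hashlib
import os

def fsck_table(table_dir: str) -> list[str]:
    """Return row files whose content no longer hashes to their filename."""
    corrupt = []
    for name in os.listdir(table_dir):
        path = os.path.join(table_dir, name)
        if not os.path.isfile(path):
            continue  # subdirectories (e.g. hash shards) are checked separately
        stem = name.partition(".")[0]
        with open(path, "rb") as f:
            digest = hashlib.blake2b(f.read(), digest_size=16).hexdigest()
        if digest != stem:
            corrupt.append(path)  # bit rot, or a .tmp left by an interrupted write
    return corrupt
```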
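and a sketch of the resumable column alteration: each row is rewritten independently from the old extension to the new one, so a crash midway just leaves a mix of old and new rows and the same loop can be rerun (the per-row convert transform is a hypothetical stand-in):

```python
# sketch only: `convert` is a hypothetical per-row transform; naming as above
import hashlib
import os
from typing import Callable

def migrate_table(table_dir: str, old_ext: str, new_ext: str,
                  convert: Callable[[bytes], bytes]) -> None:
    for name in os.listdir(table_dir):
        if not name.endswith("." + old_ext):
            continue  # already migrated or unrelated: skipping makes reruns safe
        old_path = os.path.join(table_dir, name)
        with open(old_path, "rb") as f:
            new_content = convert(f.read())
        new_name = (hashlib.blake2b(new_content, digest_size=16).hexdigest()
                    + "." + new_ext)
        with open(os.path.join(table_dir, new_name), "wb") as f:
            f.write(new_content)  # write the new row first...
        os.remove(old_path)  # ...then drop the old; a crash in between
                             # duplicates a row, never loses one
```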
drawbacks:
- directories degrade with very many entries, and some filesystems hard-cap them (e.g. FAT32 at ~64K); past that a HAMT-like structure is needed, i.e. shard rows into subdirectories by the first few hex digits of the hash (see the sharding sketch after this list)
- possibly too taxing for SSDs (one file per row means many small writes and metadata churn)
- incomplete: applications still need to build indices over the data
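a sketch of the subdirectory sharding from the first drawback, routing each row through nested directories named by the leading hex digits of its hash (the two-level, two-digit fan-out, 256 dirs of 256 dirs, is an arbitrary choice):

```python
# sketch only: the two-level, two-hex-digit fan-out is an arbitrary choice
import os

def shard_path(table_dir: str, hashed_name: str,
               levels: int = 2, digits: int = 2) -> str:
    """Map e.g. '3fa9...c2.ssii' to table_dir/3f/a9/3fa9...c2.ssii."""
    parts = [hashed_name[i * digits:(i + 1) * digits] for i in range(levels)]
    subdir = os.path.join(table_dir, *parts)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, hashed_name)
```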