@paniq
Last active November 5, 2024 09:41

informal spec for shitty datalog db implemented on a filesystem:

  • a directory represents a table
  • a file in the directory represents a row
  • the filename is a universal hash of the content
  • the file/directory extension encodes the column format of the content (e.g. .ssii: two strings, two integers)
  • row format could be binary or single-line csv
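
a minimal sketch of the spec above, in Python. Assumptions not in the original: SHA-256 stands in for the content hash, rows are single-line csv encoded as UTF-8, and `insert_row` / `scan` are hypothetical helper names. The write-then-rename step is an added precaution so readers never see a half-written row.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def insert_row(table_dir: Path, row: str) -> Path:
    """Write one row as a file named after the hash of its content."""
    data = row.encode("utf-8")
    # Assumption: SHA-256 is used as the content hash.
    name = hashlib.sha256(data).hexdigest()
    path = table_dir / name
    if path.exists():
        return path  # identical content is already stored
    # Write to a temporary file, then rename it into place.
    fd, tmp = tempfile.mkstemp(dir=table_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return path


def scan(table_dir: Path) -> list[str]:
    """Return every row of a table by listing its directory."""
    return sorted(
        p.read_text(encoding="utf-8")
        for p in table_dir.iterdir()
        if not p.name.startswith(".tmp-")
    )
```

usage: create a directory such as `edge.ssii`, then call `insert_row(table, "alice,bob,1,2")`. Inserting the same row twice leaves a single file behind.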

some nice properties:

  • needs no container format; leverages existing file system technology for high-throughput caching and buffering as well as shared network access
  • separates unindexed data from index
  • create/drop tables by just adding/removing directories
  • look up a table just by listing its directory
  • add/remove rows by just writing/removing files
  • rows are content-addressed, therefore:
    • rows are immutable
    • deduplicates, since identical content -> identical filename
    • corruption can be detected by checking if content matches filename
  • each row has a creation date and time, as well as a last access time, via the file's timestamps (permitting LRU drops)
  • alter columns by transitioning file extensions; can be safely resumed after interruption
  • online indices can update after file system notifications
  • bonus: permission flags could do something interesting
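
two of the properties above, sketched in Python: corruption detection by re-hashing, and LRU drops by access time. Assumptions not in the original: SHA-256 as the hash, and the hypothetical helper names `corrupt_rows` and `drop_least_recently_used`. Access times are only as reliable as the mount options allow (e.g. `noatime` disables updates), so treat the LRU part as illustrative.

```python
import hashlib
from pathlib import Path


def corrupt_rows(table_dir: Path) -> list[Path]:
    """Return row files whose content no longer hashes to their name."""
    bad = []
    for path in table_dir.iterdir():
        if not path.is_file():
            continue
        # Assumption: filenames are SHA-256 hex digests of the content.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != path.name:
            bad.append(path)
    return sorted(bad)


def drop_least_recently_used(table_dir: Path, keep: int) -> list[Path]:
    """Delete all but the `keep` most recently accessed rows."""
    rows = sorted(
        (p for p in table_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_atime,
        reverse=True,  # most recently accessed first
    )
    victims = rows[keep:]
    for p in victims:
        p.unlink()
    return victims
```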

drawbacks:

  • max 64K rows per table; beyond that, a HAMT-like structure is needed, i.e. rows are sorted into subdirectories by the first few digits of the hash
  • possibly too taxing for SSDs
  • incomplete: applications still need to build indices over the data
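
the subdirectory scheme from the first drawback could look roughly like this in Python. Assumptions not in the original: hex digests, the hypothetical helper name `sharded_path`, and its `depth` / `width` parameters. The full digest is kept as the filename so a row can still be checked against its name.

```python
from pathlib import Path


def sharded_path(table_dir: Path, digest: str, depth: int = 1, width: int = 2) -> Path:
    """Map a hex digest to a nested path keyed by its leading digits."""
    # One subdirectory level per `width` hex digits, `depth` levels deep.
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return table_dir.joinpath(*parts, digest)
```

usage: `sharded_path(Path("edge.ssii"), "abcdef0123", depth=2)` gives `edge.ssii/ab/cd/abcdef0123`. With two hex digits per level, each level multiplies the number of leaf directories by 256.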