Dataframe Oriented Programming: https://csvbase.com/blog/1
Accessing up-to-date data quickly and easily (even on a phone), and constantly pulling decision-making information out of it, should be effortless.
Something like the stories of the old APL mainframe environment.
- Named tables that are always up to date.
- Not needing to worry about RAM, or where compute sits relative to the data, etc.
- Hierarchical but also tagged organisation of datasets.
- Easy job scheduling (like systemd-timers)
Example datasets to pull in:
- market data
- commodities
- currencies
- commodity shipments
- country productions
- country investments (all the SAI global database stuff)
- energy data: opennem.org.au, plus global equivalents
- tables from Wikipedia
- national statistics bureaus: the US ones, and equivalents for every country
It should be almost effortless to add a new dataset scraper: a pandas pd.read_html(...) call and a simple schedule entry. Datasets should keep version history, so that monitoring change over time is easy and useful.
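A minimal sketch of what such a scraper might look like, assuming pandas, lxml, pyarrow, and s3fs are installed; the URL, bucket, and table names are hypothetical:

```python
# scrape-table-from-site.py -- sketch of a scheduled dataset scraper.
from datetime import datetime, timezone

import pandas as pd

SOURCE_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_oil_production"
BUCKET = "s3://my-data-lake"            # any S3-compatible store (S3, R2, ...)
TABLE = "commodities/oil-production"    # hierarchical dataset name

def main() -> None:
    # pd.read_html returns every <table> on the page; take the first.
    df = pd.read_html(SOURCE_URL)[0]
    # Version history: write each run to a timestamped file instead of
    # overwriting, so change over time can be diffed later.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    df.to_parquet(f"{BUCKET}/{TABLE}/{stamp}.parquet")

if __name__ == "__main__":
    main()
```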
- Ripgrep/fzf-style full-text searching of everything.
- Splunk-style web interface for event streaming.
- csvbase-style interface for tables.
- AG Grid dataframe viewing at the click of a button.
- Web-based code editing and a REPL; Jupyter is terrible but hints at the possibility.
- I think kdb+ has a "something Studio" app that might have some useful ideas.
- github.dev (the VS Code web version) is not it.
- Docs should also be effortless and live with the code; Jupyter notebooks aren't quite it, but they hint at it.
What functionality is required for the above?
- Long-lived state (independent of compute nodes): a filesystem, or S3-compatible object storage?
- Mostly thinking about read-mostly Parquet files here.
- Authn/authz to protect the long-lived state.
- Job scheduling: cron/systemd/etc., but it needs to work with transient nodes.
- DAG job dependencies (like Bank Python's Dagger); a sketch follows this list.
- Webserver nodes. Or can we get away with static bucket hosting?
- Transient compute nodes: a mix of home/office desktops and cloud VMs. Lots of (most?) datasets are small enough that a home PC could run a scheduled task like scrape-table-from-site.py and push the result to S3 or similar.
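A toy sketch of the DAG-dependency idea using only the standard library; this is not Dagger's actual API, and the job names are made up:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical job graph: each job maps to the set of jobs it depends on.
jobs = {
    "scrape_oil_production": set(),
    "scrape_currencies": set(),
    "build_commodity_report": {"scrape_oil_production", "scrape_currencies"},
}

def run(job: str) -> None:
    print(f"running {job}")  # placeholder for dispatching to a compute node

# static_order() yields jobs with all dependencies first; independent
# jobs could be fanned out to different transient nodes in parallel.
for job in TopologicalSorter(jobs).static_order():
    run(job)
```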
K.I.S.S.
- S3 (or R2) full of Parquet files (see the sketch after this list)
- Guix for language tooling (or Nix, but Guix is nicer in my experience)
- Tailscale (or Headscale)
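With the storage layer being nothing but Parquet in a bucket, "querying the database" could be a one-liner. A sketch, with a hypothetical bucket layout and assuming s3fs/pyarrow:

```python
import pandas as pd

# Read a dataset straight from object storage; pyarrow handles a
# directory of Parquet files, and s3fs works against any S3-compatible
# endpoint (including R2).
df = pd.read_parquet("s3://my-data-lake/commodities/oil-production/")
print(df.tail())
```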
It should be extremely cost effective.
Home PCs should participate in scheduled jobs transparently (a naive worker sketch follows).
Spot cloud nodes should be usable temporarily when needed, just as transparently.
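One pull-based way to get that transparency: every node, home PC or spot VM, runs the same polling worker. A naive sketch with an invented queue layout; a real version would need to handle claim races (e.g. with conditional writes) rather than the blind move below:

```python
import time

import s3fs  # pip install s3fs

fs = s3fs.S3FileSystem()
PENDING = "my-data-lake/jobs/pending"   # hypothetical queue prefixes
CLAIMED = "my-data-lake/jobs/claimed"

def run_job(path: str) -> None:
    print(f"would run {path}")  # placeholder: fetch the script, run it, push results

while True:
    for path in fs.ls(PENDING):
        # Naive "claim": move the job file out of the pending prefix.
        # Two workers can race here; fine for a sketch only.
        claimed = f"{CLAIMED}/{path.rsplit('/', 1)[-1]}"
        fs.mv(path, claimed)
        run_job(claimed)
    time.sleep(60)  # poll once a minute
```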
It should also be multi-user capable, at least for a single org.
Publishing "notebooks" should be as easy as gist.github.com.
Things I'm avoiding:
- Kubernetes: I'm convinced this is way overcomplicated for the vast majority of organisations.
- Docker: just zip up binaries into a blob, what could go wrong?
- Serverless: ridiculous vendor lock-in doesn't seem like the sort of thing we'll build the Starship Enterprise with. Maybe fly.io is worth digging into to test out the ideas, though?
Very interesting: https://calpaterson.com/bank-python.html