Build a single-page “Moltbook Analysis Console” that lets an operator:
- Understand what’s happening (global trends, bursts, dominant spaces/topics).
- Understand where it’s happening (which submolts, which clusters).
- Understand who drives it (authors, communities, cross-posting patterns).
- Understand how it’s received (upvotes/downvotes/comments; controversy).
- Drill down from any aggregate visualization into the exact posts/submolt descriptions that explain the pattern.
The core UX principle: coordinated multiple views (filters in one view instantly update all others) + a persistent Details/Reader panel.
- `all_posts.merged.json` → `posts` → `posts_filtered` (list of post dicts)
- `all_submolts.filtered.json` → `submolts` → `submolts_filtered` (list of submolt dicts)
Even if you keep the raw JSON, the console should treat the data as tables:
- Posts table
  - Fields: `post_id`, `created_at`, `title`, `content`, `url`, `author_id`, `author_name`, `submolt_id`, `submolt_name`, `submolt_display_name`, `upvotes`, `downvotes`, `comment_count`
  - Derived: `score = upvotes - downvotes`, `total_votes = upvotes + downvotes`
  - Derived: `controversy = f(upvotes, downvotes, comment_count)` (define later)
  - Derived: `content_len`, `has_url`, `is_empty_title`, `lang` (optional)
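The derived fields can be computed in one pass during the build step. A minimal TypeScript sketch — the `RawPost` shape and function name are illustrative, not part of any fixed contract:

```typescript
// Hypothetical raw post shape, mirroring the Posts table fields above.
interface RawPost {
  post_id: string;
  title: string;
  content: string;
  url?: string | null;
  upvotes: number;
  downvotes: number;
  comment_count: number;
}

interface DerivedFields {
  score: number;       // upvotes - downvotes
  total_votes: number; // upvotes + downvotes
  content_len: number;
  has_url: boolean;
  is_empty_title: boolean;
}

function deriveFields(p: RawPost): DerivedFields {
  return {
    score: p.upvotes - p.downvotes,
    total_votes: p.upvotes + p.downvotes,
    content_len: p.content.length,
    has_url: Boolean(p.url && p.url.length > 0),
    is_empty_title: p.title.trim().length === 0,
  };
}
```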
- Submolts table
  - Fields: `submolt_id`, `submolt_name`, `display_name`, `description`, `subscriber_count`, `created_at`, `last_activity_at`, `created_by_id`, `created_by_name`
- Tags table (from `predicted_tags`)
  - One row per `(post_id, tag)`
  - Derived: `tag_namespace` (e.g. `emotion`, `style`, `intent`)
- Class notes table (from `class_notes`)
  - One row per `(post_id, class_note)`
- Optional computed tables
  - Submolt × tag frequency
  - Author × submolt frequency
  - Tag co-occurrence edges
  - Submolt similarity edges (cosine over tag distributions)
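The tag co-occurrence edges fall straight out of the per-post tag lists. A sketch — the `PostTag`/`Edge` row shapes are assumptions matching the tables above:

```typescript
interface PostTag { post_id: string; tag: string; }
interface Edge { tag_a: string; tag_b: string; w: number; }

// For every unordered tag pair, count how many posts carry both tags.
function tagCooccurrence(rows: PostTag[]): Edge[] {
  // Group tags by post.
  const byPost = new Map<string, string[]>();
  for (const r of rows) {
    const tags = byPost.get(r.post_id) ?? [];
    tags.push(r.tag);
    byPost.set(r.post_id, tags);
  }
  // Accumulate pair weights; sort so (a, b) and (b, a) share one key.
  const weights = new Map<string, number>();
  for (const tags of byPost.values()) {
    const uniq = [...new Set(tags)].sort();
    for (let i = 0; i < uniq.length; i++) {
      for (let j = i + 1; j < uniq.length; j++) {
        const key = `${uniq[i]}\u0000${uniq[j]}`;
        weights.set(key, (weights.get(key) ?? 0) + 1);
      }
    }
  }
  return [...weights.entries()].map(([key, w]) => {
    const [tag_a, tag_b] = key.split("\u0000");
    return { tag_a, tag_b, w };
  });
}
```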
- Option A: preprocess once into columnar files:
- JSON → Parquet/Arrow (posts, submolts, tags, class_notes)
- Load into browser using:
- DuckDB-WASM for SQL queries + aggregations
- Web Workers for non-blocking queries
- Benefits:
- Fast slice/dice, no backend required
- Easy to implement drill-down with SQL: “show me the posts behind this bar”
- Option B: precompute all aggregates offline (Python notebook) → ship as JSON bundles
- Browser does only filtering on precomputed structures
- Benefits:
- Simpler stack
- Risks:
- Harder to add ad-hoc operator questions (less flexible)
If the goal is “operator exploration” (unknown questions), Option A is usually worth it.
Think of the console as a set of models/views that are interchangeable but synchronized by a common filter state.
The operator can always filter by:
- Time range (brush on timeline)
- Submolt(s)
- Author(s)
- Tag(s) / class_note(s)
- Engagement ranges (upvotes/downvotes/comment_count)
- Text search (title/content; optional regex)
- Language (if detected)
All charts and tables update to reflect the current filter state.
- Search box (posts, authors, submolts, tags)
- Active filter chips (click to remove)
- Saved views/bookmarks (“Spike on Jan-31”, “Crypto shilling cluster”, etc.)
- Tabs: Overview, Spaces, Networks, Embeddings, Integrity
- Shows selected post/submolt/author
- Renders markdown-like content (safe sanitizer)
- Context section:
- “More posts by this author”
- “More posts in this submolt”
- “Similar posts” (embedding or tag similarity)
This right panel is the drill-down anchor: every visualization must be able to populate it.
Purpose: quickly answer “what is going on overall?”
Views:
- Timeline:
- Posts/day (line or area)
- Unique authors/day (overlay)
- Optional: stacked area by `class_note`
- Interaction: brush selects time range; click spike → auto drill-down list
- Engagement distributions:
- Histograms (log-scale) for upvotes/downvotes/comment_count
- Interaction: drag range filter
- Top lists:
- Top submolts by posts, top authors by posts
- Top tags / class_notes
- Interaction: click item → adds filter; shift-click to compare multiple
Insights enabled:
- bursts/events; inequality; dominant categories; baseline health.
Purpose: understand submolts as communities with different content/ecosystems.
Views:
- Submolt map (2D projection):
- Each submolt is a point.
- Similarity computed from tag distributions / class_note distributions.
- Use UMAP on the submolt vectors offline; ship coordinates.
- Interaction: lasso selection → filters; click point → show submolt description + example posts.
- Submolt profile panel (within Details panel when submolt selected):
- Description, subscriber_count, last_activity
- “What’s distinctive here?”:
- tags overrepresented vs global baseline (lift)
- top authors within submolt
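"Overrepresented vs global baseline (lift)" is the ratio p(tag | submolt) / p(tag | global). A hedged sketch — the count-map inputs are an assumed shape:

```typescript
// counts: tag -> number of tagged posts carrying that tag,
// once within the submolt and once globally.
function tagLift(
  submoltCounts: Map<string, number>,
  globalCounts: Map<string, number>,
): Map<string, number> {
  const submoltTotal = [...submoltCounts.values()].reduce((a, b) => a + b, 0);
  const globalTotal = [...globalCounts.values()].reduce((a, b) => a + b, 0);
  const lift = new Map<string, number>();
  for (const [tag, n] of submoltCounts) {
    const g = globalCounts.get(tag) ?? 0;
    if (g === 0 || submoltTotal === 0) continue; // unseen globally: skip
    lift.set(tag, (n / submoltTotal) / (g / globalTotal));
  }
  return lift;
}
```

Lift > 1 means the tag is more common here than sitewide; sort descending and show the top few as "distinctive tags". For rare tags, consider a minimum-count floor so noise doesn't dominate.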
Insights enabled:
- niche clusters (technical vs philosophical vs crypto); “nearby” submolts; genre neighborhoods.
Purpose: see emergent structure not visible in rankings.
Provide 2–3 switchable network types:
- Tag co-occurrence graph
- Nodes: tags (colored by namespace)
- Edges: co-occurrence weight
- Interaction: select community → filter posts; click tag → open “tag page” + example posts
- Submolt similarity graph
- Nodes: submolts
- Edges: similarity above threshold
- Community detection (Louvain/Leiden offline) → color communities
- Author–submolt bipartite / projected graph
- Identify “bridges” (authors connecting disparate submolt communities)
- Interaction: click author → show their submolt portfolio and posts over time
Implementation notes:
- Use a WebGL-based renderer if needed (e.g., Sigma.js / Graphology) for smooth interaction.
Insights enabled:
- bridging authors; polarization clusters; tag communities; “topic neighborhoods”.
Purpose: operator wants to browse content beyond predefined tags.
Pipeline:
- Compute embeddings offline for `title + content` (or content only).
- Reduce to 2D (UMAP) and ship `(post_id, x, y)` plus minimal metadata for tooltips.
UI:
- Scatterplot (canvas/WebGL) of posts in embedding space.
- Color by class_note or dominant tag namespace.
- Interaction:
- lasso region → filters posts
- click point → open post in Reader
- “Find similar” button (kNN by embedding) → list of neighbors
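"Find similar" can start as brute-force cosine kNN over the shipped embeddings (fine at ~25k posts; precompute `post_knn.parquet` offline if it gets slow). A sketch with hypothetical names:

```typescript
interface EmbeddedPost { post_id: string; vec: number[]; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Top-k nearest neighbors of `query` among `posts`, excluding itself.
function kNearest(query: EmbeddedPost, posts: EmbeddedPost[], k: number): EmbeddedPost[] {
  return posts
    .filter((p) => p.post_id !== query.post_id)
    .map((p) => ({ p, sim: cosine(query.vec, p.vec) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k)
    .map((x) => x.p);
}
```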
Insights enabled:
- emergent genres; mislabeled clusters; novel subtopics; weird pockets (spam, memes, manifestos).
Purpose: prevent bad conclusions; spot artifacts/spam.
Views:
- Missingness dashboard (which fields absent, by submolt/time)
- Duplicate detection (same title/content repeated)
- Outlier dashboard (extreme votes / extreme length)
- Spam heuristics:
- high URL density, repetitive phrases, crypto tickers
- show flagged clusters in embedding space or as a table
Insights enabled:
- trust calibration; artifact discovery; cleaning priorities.
Every chart selection should be able to produce:
- A query definition (SQL or filter predicate)
- A result list (posts/submolts/authors)
- A Reader view for the selected item
Examples:
- Click a stacked area segment (class_note “manifestos_political” on Jan-31) →
- Filter: time range + class_note
- Show top posts in that slice
- Click a node in tag network (“intent:debate”) →
- Filter: posts with that tag
- Show example posts + co-occurring tags
- Lasso a cluster in embedding map →
- Filter: selected post_ids
- Show summary terms + top authors + representative posts
Provide toggles for "rank posts by":
- Most discussed: `comment_count` (with time filter)
- Most upvoted: `upvotes`
- Most downvoted: `downvotes`
- Most controversial: candidate formulas to experiment with:
  - `2 * min(upvotes, downvotes)` (classic "both sides" signal)
  - `total_votes * (1 - |upvotes - downvotes| / total_votes)`
  - incorporate comments: `controversy * log(1 + comment_count)`
Important: surface the formula in UI so the operator knows what “controversy” means.
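The candidates as pure functions (names are illustrative). Worth noting: the second formula algebraically reduces to the first (`total - |up - down| = 2 * min(up, down)`), so they only diverge once you change the weighting:

```typescript
// Classic "both sides" signal: high only when up AND down votes are both large.
function controversyMin(up: number, down: number): number {
  return 2 * Math.min(up, down);
}

// Balance-weighted volume: total_votes * (1 - |up - down| / total_votes).
// Note: this simplifies to total_votes - |up - down| = 2 * min(up, down),
// i.e. it equals controversyMin; it becomes distinct once reweighted.
function controversyBalance(up: number, down: number): number {
  const total = up + down;
  if (total === 0) return 0; // guard: zero votes -> zero controversy
  return total * (1 - Math.abs(up - down) / total);
}

// Optionally incorporate discussion volume.
function controversyWithComments(up: number, down: number, comments: number): number {
  return controversyBalance(up, down) * Math.log(1 + comments);
}
```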
Create entity pages:
- Submolt page: description, activity timeline, top tags, representative posts
- Author page: posting timeline, submolt portfolio, top tags, representative posts
- Tag page: frequency over time, which submolts use it, representative posts
Add full-text search:
- Simple in-browser index (FlexSearch / Lunr) over `title + content` for fast retrieval
- Search results always drill to Reader and can be "converted to filters".
Target: interactions feel instant.
Practical constraints/approach:
- Precompute heavy things offline (embeddings, UMAP, network communities).
- Use DuckDB-WASM for aggregations; cache query results keyed by filter signature.
- Render big scatterplots via canvas/WebGL.
- Use virtualized tables for drill-down lists (10k+ rows).
Ship a folder like:
- `viz/` (static web app)
- `viz/data/` (Parquet/Arrow/JSON assets produced by a build step)
- `viz/scripts/` (offline preprocessing pipeline)
The project should be runnable as a static site (GitHub Pages / local file server).
- Data engine: DuckDB-WASM (SQL) vs “precomputed-only”.
- If the operator needs to ask new questions, choose DuckDB-WASM.
- Data format: prefer Parquet + Arrow for load speed and memory.
- Where heavy computation lives:
- Offline: embeddings, UMAP coords, graph communities, similarity edges.
- Online: filtering, aggregations, drill-down lists.
- Filter state: single canonical filter object + URL sync.
- Drill-down invariant: every view must be able to produce a query + a list + a selected item for Reader.
Create a reproducible build that takes raw JSON and emits viz/data/*:
- Normalize and validate
  - Read `all_posts.merged.json`, `all_submolts.filtered.json`.
  - Build `posts_submolts`, `submolts_filtered`, `posts_filtered`.
  - Normalize into flat tables: `posts.parquet`, `submolts.parquet`, `post_tags.parquet`, `post_class_notes.parquet`
  - Add derived fields: `created_at_day` (date), `content_len`, `total_votes`, `score`, optional `lang`
- Precompute "model assets"
  - Submolt vectors (tag distribution, class distribution)
  - Submolt map coords: UMAP → `submolt_umap.parquet` (`submolt_name`, `x`, `y`)
  - Tag co-occurrence edges: `tag_edges.parquet` (`tag_a`, `tag_b`, `w`)
  - Graph communities (offline): `tag_communities.parquet`, `submolt_communities.parquet`
  - Post embeddings + 2D projection: `post_embedding_meta.parquet` (at minimum `post_id`, `x`, `y`)
  - Optional: `post_knn.parquet` (top-k neighbors per `post_id`) for fast "similar posts"
- Publish a tiny data dictionary
  - `viz/data/schema.json` with field names/types + build timestamp + row counts.
This build step should run from the repo root and be deterministic.
- Data layer
  - `DataCatalog`: knows which assets exist and their versions
  - `DuckDBService` (if using DuckDB-WASM): loads Parquet, exposes `query(sql, params)` (async); caches results by `(sql + params)` signature
  - `SearchIndexService`: builds/loads the full-text index (FlexSearch) for title/content
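The result cache keyed by `(sql + params)` can be a thin memo wrapper around `query`. A sketch, assuming `query(sql, params)` returns a promise (the signature above):

```typescript
type QueryFn = (sql: string, params: unknown[]) => Promise<unknown>;

// Wrap an async query function so identical (sql, params) pairs
// hit the underlying engine only once; repeat calls share the promise.
function withQueryCache(query: QueryFn): QueryFn {
  const cache = new Map<string, Promise<unknown>>();
  return (sql, params) => {
    const key = JSON.stringify([sql, params]); // the "filter signature"
    let hit = cache.get(key);
    if (!hit) {
      hit = query(sql, params);
      cache.set(key, hit);
    }
    return hit;
  };
}
```

Caching the promise (not the resolved value) also deduplicates concurrent in-flight requests; add an eviction policy if memory becomes a concern.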
- State layer
  - `FilterState`: canonical filter object (time range, selected submolts, tags, etc.)
  - `SelectionState`: what is currently highlighted (chart hover/selection)
  - `ReaderState`: currently opened entity (post/submolt/author/tag)
  - URL serialization: `encodeStateToUrl`, `decodeStateFromUrl`
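URL sync can ride on `URLSearchParams`. A minimal sketch for a subset of the filter fields (the `FilterState` shape and param names are illustrative; comma-joining assumes names contain no commas — escape otherwise):

```typescript
interface FilterState {
  timeStart?: string; // ISO date
  timeEnd?: string;
  submolts: string[];
  tags: string[];
  search?: string;
}

function encodeStateToUrl(f: FilterState): string {
  const p = new URLSearchParams();
  if (f.timeStart) p.set("t0", f.timeStart);
  if (f.timeEnd) p.set("t1", f.timeEnd);
  if (f.submolts.length) p.set("sm", f.submolts.join(","));
  if (f.tags.length) p.set("tg", f.tags.join(","));
  if (f.search) p.set("q", f.search);
  return p.toString();
}

function decodeStateFromUrl(qs: string): FilterState {
  const p = new URLSearchParams(qs);
  return {
    timeStart: p.get("t0") ?? undefined,
    timeEnd: p.get("t1") ?? undefined,
    submolts: p.get("sm")?.split(",") ?? [],
    tags: p.get("tg")?.split(",") ?? [],
    search: p.get("q") ?? undefined,
  };
}
```

Because absent params decode to empty arrays/`undefined`, newly added filter fields stay backward compatible with old bookmarked URLs.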
- Query layer
  - "Query builders" that translate filter state into SQL fragments: `whereClauseFromFilters(filters)`, `topAuthorsQuery(filters, limit)`, `timelineQuery(filters, grain)`
  - This is where you enforce consistency so every module interprets filters identically.
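A sketch of `whereClauseFromFilters` (the quoting helper here is naive; real code should prefer parameter binding):

```typescript
interface Filters {
  timeStart?: string;
  timeEnd?: string;
  submolts?: string[];
  tags?: string[];
}

// Naive SQL string literal; use bound parameters in production.
const lit = (s: string) => `'${s.replace(/'/g, "''")}'`;

// Single choke point: every view builds its WHERE clause here,
// so all modules interpret the filter state identically.
function whereClauseFromFilters(f: Filters): string {
  const preds: string[] = [];
  if (f.timeStart) preds.push(`created_at >= ${lit(f.timeStart)}`);
  if (f.timeEnd) preds.push(`created_at < ${lit(f.timeEnd)}`);
  if (f.submolts?.length)
    preds.push(`submolt_name IN (${f.submolts.map(lit).join(", ")})`);
  if (f.tags?.length)
    preds.push(
      `post_id IN (SELECT post_id FROM post_tags WHERE tag IN (${f.tags.map(lit).join(", ")}))`,
    );
  return preds.length ? preds.join(" AND ") : "TRUE"; // no filter -> unfiltered
}
```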
- UI components
  - `Navigator` (search + filters + saved views)
  - `CanvasTabs` (Overview/Spaces/Networks/Embeddings/Integrity)
  - `ReaderPanel` (post/submolt/author/tag "pages")
  - `ResultsTable` (virtualized drill-down list)
- User interacts with a chart (brush, click, lasso).
- That interaction emits:
  - `filtersDelta` (add/remove/change)
  - `selection` (optional, for hover highlights)
  - `drilldownQuery` (optional shortcut for "show underlying posts")
- Global state updates → query layer recomputes → all views re-render.
- When a row/point is selected, Reader loads the full text and context.
- Reader panel
- Markdown-ish rendering with strict sanitization (no raw HTML execution).
- Highlight search terms; show extracted URLs; show “copy id/link” buttons.
- Virtualized lists
- All drill-down lists must be virtualized (10k+ rows possible).
- Provide “load next 200” style pagination for heavy queries.
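"Load next 200" can be plain LIMIT/OFFSET appended to any drill-down query (keyset pagination scales better for deep pages, but offsets are fine at this data size). A sketch, assuming the base query carries no ORDER BY of its own:

```typescript
// Append deterministic ordering plus a page window to a drill-down query.
// The ORDER BY tie-breaker (post_id) matters: without a total order,
// successive pages can overlap or skip rows.
function paginate(sql: string, page: number, pageSize = 200): string {
  const offset = page * pageSize;
  return `${sql} ORDER BY created_at DESC, post_id LIMIT ${pageSize} OFFSET ${offset}`;
}
```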
- Large scatterplots
- Use canvas/WebGL; avoid SVG for 25k points with lasso.
- Decouple rendering from React reconciliation (draw directly to canvas).
- Network views
- Start with filtered subgraphs only (top N nodes/edges) to keep it navigable.
- Provide thresholds / sliders for edge weight.
- Freeze schema for posts/submolts/tags/class_notes.
- Canonicalize the `predicted_tags` key (and handle typos gracefully in the build step).
- Produce `viz/data/schema.json` and a simple row-count report.
- Vite SPA + layout: Navigator / Tabs / Reader.
- Load a tiny sample (e.g., 500 posts) to validate UX and drill-down flow.
- URL state + saved views.
- DuckDB-WASM wiring (or precomputed aggregates).
- Overview tab:
- timeline brush
- top submolts/authors
- engagement histograms
- Reader “post page” + “submolt page” with representative posts.
- Submolt UMAP map (lasso, profile, “distinctive tags” lift).
- Tag co-occurrence network (threshold slider + drill-down).
- Author↔submolt bipartite summary (even if not full network yet).
- Post embedding scatter (lasso + “similar posts”).
- Integrity dashboard:
- missingness, duplicates, vote outliers
- spam heuristics and review workflow
- Export slice as CSV/JSON.
- IndexedDB annotations (flag, note, label).
- Keyboard shortcuts (next/prev in list, open reader, pin filters).
- Shareable “view links” (URL encodes state).
Goal: ensure the web app is never fed inconsistent/broken data.
- Schema tests
- Assert required fields exist in outputs (posts/submolts/tags/class_notes).
- Assert types (timestamps parse, ints non-negative where expected).
- Row count invariants
  - `len(posts_filtered)` equals the posts table row count.
  - `distinct(submolt_name in posts_filtered)` equals `len(submolts_filtered)` (or explain exceptions).
- Key canonicalization
  - If raw data contains `preducted_tags`, it must be mapped to `predicted_tags` (and logged).
- Dedup checks
  - `post_id` uniqueness; `submolt_id` uniqueness.
- Build reproducibility
- Same inputs → same output checksums (or same row counts + stable ordering).
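Reproducibility can be checked by hashing stably ordered rows; a sketch using Node's `crypto` (assumes rows are constructed with a consistent field order, since `JSON.stringify` preserves insertion order):

```typescript
import { createHash } from "node:crypto";

// Stable checksum of a table: sort rows by primary key, hash JSON lines.
// Two builds with identical content yield identical digests regardless
// of the order rows were produced in.
function tableChecksum(rows: Record<string, unknown>[], key: string): string {
  const sorted = [...rows].sort((a, b) => {
    const ka = String(a[key]);
    const kb = String(b[key]);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
  const h = createHash("sha256");
  for (const r of sorted) h.update(JSON.stringify(r) + "\n");
  return h.digest("hex");
}
```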
These can run as a Python test suite (pytest) or as a node script with assertions; either is fine as long as it runs in CI.
Goal: correctness of state and query logic without running a browser.
- Filter serialization
  - `decode(encode(filters))` round-trips.
  - Backward compatibility when adding new filter fields.
- Query builders
- Given a filter state, generated SQL contains the right WHERE predicates.
- “No filter” state returns unfiltered query.
- Scoring functions
  - `controversy` formulas behave sensibly on edge cases: zero votes, only upvotes, only downvotes, huge numbers.
Use a JS test runner (e.g., Vitest) and test pure functions.
Goal: coordinated views stay consistent.
With a small fixed dataset fixture:
- Brushing the timeline reduces counts in:
- top submolts list
- histograms
- drill-down table row count
- Clicking a tag in the tag view updates:
- filter chips
- Reader “tag page” content
- the underlying posts list
These can be written as component tests (Playwright component tests or Testing Library + jsdom), depending on your preferred stack.
Goal: the operator-critical flows never break.
E2E scenarios:
- Load + baseline render
- App loads, shows Overview with non-empty charts.
- Drill-down loop
- Apply a filter (time brush) → open drill-down list → click a post → Reader renders full content.
- Cross-view coordination
- Select a submolt on the map → Overview updates → open submolt page shows correct description and representative posts.
- Search
- Search for a keyword → results list → open post → keyword highlighted.
- Embeddings
- Lasso selection → results count changes → “similar posts” returns non-empty list.
Goal: keep interactions “instant” as data grows.
- Performance budgets
- Cold load to interactive (TTI) under a target on a typical machine.
- Timeline brush updates within a target (e.g., <200ms perceived).
- Query time logging
- Instrument DuckDB query durations; flag slow queries.
- Visual regression (optional but valuable)
- Screenshot key states (Overview, Map, Network, Embeddings) on a fixed dataset fixture.
- Diff screenshots in CI to catch accidental UI breakage.
Goal: ensure charts match known expected numbers.
For a fixed small dataset fixture, store golden values:
- total posts, total unique authors, total unique submolts
- top-5 submolts by count
- tag frequency counts for a known slice
Run these as part of CI so refactors don’t silently change semantics.
- Build/tooling: Vite + TypeScript
- Data:
- DuckDB-WASM + Parquet (Option A), or JSON + precomputed aggregates (Option B)
- Arrow for efficient transfer
- Charts:
- Observable Plot for quick “analysis-grade” charts
- D3 where custom interaction is needed
- Networks: Sigma.js (Graphology) or similar WebGL renderer
- Embedding scatter: deck.gl ScatterplotLayer or custom canvas plot
- Search: FlexSearch
- State: URL-synced state (filters encoded into query params) + local saved views
- No comment graph: can’t do thread analysis → emphasize content + counts + co-occurrence.
- Vote counts may be noisy: treat engagement as “signals” not ground truth; add Integrity module and confidence notes.
- Label quality (`predicted_tags`, `class_notes`): add quick qualitative sampling workflows and "mismatch cluster" detection via embeddings.