Interactive visualization plan (HTML + JS) for posts_filtered

Goal (operator workflow)

Build a single-page “Moltbook Analysis Console” that lets an operator:

  • Understand what’s happening (global trends, bursts, dominant spaces/topics).
  • Understand where it’s happening (which submolts, which clusters).
  • Understand who drives it (authors, communities, cross-posting patterns).
  • Understand how it’s received (upvotes/downvotes/comments; controversy).
  • Drill down from any aggregate visualization into the exact posts/submolt descriptions that explain the pattern.

The core UX principle: coordinated multiple views (filters in one view instantly update all others) + a persistent Details/Reader panel.

Data inputs & core entities

Inputs

  • all_posts.merged.json → posts → posts_filtered (list of post dicts)
  • all_submolts.filtered.json → submolts → submolts_filtered (list of submolt dicts)

Entity tables (normalized, for fast querying)

Even if you keep the raw JSON, the console should treat the data as tables:

  • Posts table

    • post_id, created_at, title, content, url
    • author_id, author_name
    • submolt_id, submolt_name, submolt_display_name
    • upvotes, downvotes, comment_count
    • Derived: score = upvotes - downvotes, total_votes = upvotes + downvotes
    • Derived: controversy = f(upvotes, downvotes, comment_count) (define later)
    • Derived: content_len, has_url, is_empty_title, lang (optional)
  • Submolts table

    • submolt_id, submolt_name, display_name, description
    • subscriber_count, created_at, last_activity_at, created_by_id, created_by_name
  • Tags table (from predicted_tags)

    • One row per (post_id, tag)
    • Derived: tag_namespace (e.g. emotion, style, intent)
  • Class notes table (from class_notes)

    • One row per (post_id, class_note)
  • Optional computed tables

    • Submolt × tag frequency
    • Author × submolt frequency
    • Tag co-occurrence edges
    • Submolt similarity edges (cosine over tag distributions)
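
As a minimal TypeScript sketch of the posts table and the derived fields listed above (field names follow the list; the optional lang detection is omitted, and the raw input shape is an assumption):

```ts
// Sketch of the normalized posts row, using the field names listed above.
interface PostRow {
  post_id: string;
  created_at: string;            // ISO timestamp
  title: string;
  content: string;
  url: string | null;
  author_id: string;
  author_name: string;
  submolt_id: string;
  submolt_name: string;
  submolt_display_name: string;
  upvotes: number;
  downvotes: number;
  comment_count: number;
  // Derived fields, computed once in the build step:
  score: number;                 // upvotes - downvotes
  total_votes: number;           // upvotes + downvotes
  content_len: number;
  has_url: boolean;
  is_empty_title: boolean;
}

type RawPost = Omit<PostRow, 'score' | 'total_votes' | 'content_len' | 'has_url' | 'is_empty_title'>;

// Derivation applied during normalization.
function derivePost(raw: RawPost): PostRow {
  return {
    ...raw,
    score: raw.upvotes - raw.downvotes,
    total_votes: raw.upvotes + raw.downvotes,
    content_len: raw.content.length,
    has_url: Boolean(raw.url) || /https?:\/\//.test(raw.content),
    is_empty_title: raw.title.trim().length === 0,
  };
}
```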

Architecture options (two viable paths)

Option A (recommended): “Static site + in-browser analytics engine”

  • Preprocess once into columnar files:
    • JSON → Parquet/Arrow (posts, submolts, tags, class_notes)
  • Load into browser using:
    • DuckDB-WASM for SQL queries + aggregations
    • Web Workers for non-blocking queries
  • Benefits:
    • Fast slice/dice, no backend required
    • Easy to implement drill-down with SQL: “show me the posts behind this bar”

Option B: “Static site + precomputed aggregates”

  • Precompute all aggregates offline (Python notebook) → ship as JSON bundles
  • Browser does only filtering on precomputed structures
  • Benefits:
    • Simpler stack
  • Risks:
    • Harder to add ad-hoc operator questions (less flexible)

If the goal is “operator exploration” (unknown questions), Option A is usually worth it.
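
A minimal sketch of the Option A engine, assuming the published @duckdb/duckdb-wasm bundle-selection API and a Parquet asset served alongside the static site (the viz/data/posts.parquet path is illustrative):

```ts
import * as duckdb from '@duckdb/duckdb-wasm';

// Standard duckdb-wasm bootstrap: pick a bundle, spin up the worker, connect.
export async function initEngine() {
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  const worker = new Worker(bundle.mainWorker!);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
  const conn = await db.connect();

  // Load the prebuilt Parquet asset into an in-memory table.
  // (Registering the file by URL first is another option; exact wiring depends on the duckdb-wasm version.)
  const assetUrl = new URL('data/posts.parquet', location.href).toString();
  await conn.query(`CREATE TABLE posts AS SELECT * FROM read_parquet('${assetUrl}')`);
  return conn;
}

// Example aggregation behind "show me the posts behind this bar":
// await conn.query(`SELECT submolt_name, count(*) AS n FROM posts GROUP BY 1 ORDER BY n DESC LIMIT 20`);
```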

“Several different models” = multiple coordinated exploration modules

Think of the console as a set of models/views that are interchangeable but synchronized by a common filter state.

Global filter state (shared across all modules)

The operator can always filter by:

  • Time range (brush on timeline)
  • Submolt(s)
  • Author(s)
  • Tag(s) / class_note(s)
  • Engagement ranges (upvotes/downvotes/comment_count)
  • Text search (title/content; optional regex)
  • Language (if detected)

All charts and tables update to reflect the current filter state.
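
A sketch of that shared filter object with a naive URL encoding (field names are assumptions; the point is one canonical state that every module consumes and that survives a page reload):

```ts
// Canonical filter state shared by all modules (names are illustrative).
export interface FilterState {
  timeRange?: { from: string; to: string };   // ISO dates from the timeline brush
  submolts: string[];
  authors: string[];
  tags: string[];
  classNotes: string[];
  engagement?: { minUpvotes?: number; maxUpvotes?: number; minComments?: number };
  textQuery?: string;
}

export const EMPTY_FILTERS: FilterState = { submolts: [], authors: [], tags: [], classNotes: [] };

// Naive URL sync: JSON in a single query parameter keeps round-tripping trivial.
export function encodeStateToUrl(f: FilterState): string {
  return new URLSearchParams({ f: JSON.stringify(f) }).toString();
}

export function decodeStateFromUrl(qs: string): FilterState {
  const raw = new URLSearchParams(qs).get('f');
  return raw ? { ...EMPTY_FILTERS, ...JSON.parse(raw) } : EMPTY_FILTERS;
}
```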

Core UI layout (high-level)

Left: “Navigator”

  • Search box (posts, authors, submolts, tags)
  • Active filter chips (click to remove)
  • Saved views/bookmarks (“Spike on Jan-31”, “Crypto shilling cluster”, etc.)

Center: “Canvas” (module area)

  • Tabs: Overview, Spaces, Networks, Embeddings, Integrity

Right: “Details / Reader”

  • Shows selected post/submolt/author
  • Renders markdown-like content (safe sanitizer)
  • Context section:
    • “More posts by this author”
    • “More posts in this submolt”
    • “Similar posts” (embedding or tag similarity)

This right panel is the drill-down anchor: every visualization must be able to populate it.


Modules (what each “model” provides)

1) Overview model (time + engagement + composition)

Purpose: quickly answer “what is going on overall?”

Views:

  • Timeline:
    • Posts/day (line or area)
    • Unique authors/day (overlay)
    • Optional: stacked area by class_note
    • Interaction: brush selects time range; click spike → auto drill-down list
  • Engagement distributions:
    • Histograms (log-scale) for upvotes/downvotes/comment_count
    • Interaction: drag range filter
  • Top lists:
    • Top submolts by posts, top authors by posts
    • Top tags / class_notes
    • Interaction: click item → adds filter; shift-click to compare multiple

Insights enabled:

  • bursts/events; inequality; dominant categories; baseline health.
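
As an illustration of the timeline view, a hedged DuckDB query the Overview module could run per filter state (table and column names follow the entity tables above; assumes created_at is loaded as a timestamp):

```ts
// Posts/day and unique authors/day for the current filter state.
// `where` comes from the shared query builder described later in this plan.
export function timelineQuery(where: string, grain: 'day' | 'week' = 'day'): string {
  return `
    SELECT date_trunc('${grain}', created_at) AS bucket,
           count(*)                           AS posts,
           count(DISTINCT author_id)          AS unique_authors
    FROM posts
    ${where}
    GROUP BY bucket
    ORDER BY bucket
  `;
}
```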

2) Spaces model (submolts as “places”)

Purpose: understand submolts as communities with different content/ecosystems.

Views:

  • Submolt map (2D projection):
    • Each submolt is a point.
    • Similarity computed from tag distributions / class_note distributions.
    • Use UMAP on the submolt vectors offline; ship coordinates.
    • Interaction: lasso selection → filters; click point → show submolt description + example posts.
  • Submolt profile panel (within Details panel when submolt selected):
    • Description, subscriber_count, last_activity
    • “What’s distinctive here?”:
      • tags overrepresented vs global baseline (lift)
      • top authors within submolt

Insights enabled:

  • niche clusters (technical vs philosophical vs crypto); “nearby” submolts; genre neighborhoods.
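
One way to compute "what's distinctive here" is tag lift: the tag's share inside the submolt divided by its global share. A sketch in plain TypeScript over the per-post tag rows (the minimum-count threshold is arbitrary and should be tuned):

```ts
interface TagRow { post_id: string; tag: string; submolt_name: string }

// lift(tag, submolt) = P(tag | submolt) / P(tag); values well above 1 are "distinctive".
export function distinctiveTags(rows: TagRow[], submolt: string, minCount = 5) {
  const globalCounts = new Map<string, number>();
  const localCounts = new Map<string, number>();
  let globalTotal = 0;
  let localTotal = 0;
  for (const r of rows) {
    globalCounts.set(r.tag, (globalCounts.get(r.tag) ?? 0) + 1);
    globalTotal++;
    if (r.submolt_name === submolt) {
      localCounts.set(r.tag, (localCounts.get(r.tag) ?? 0) + 1);
      localTotal++;
    }
  }
  return [...localCounts.entries()]
    .filter(([, n]) => n >= minCount)
    .map(([tag, n]) => ({
      tag,
      count: n,
      lift: (n / localTotal) / (globalCounts.get(tag)! / globalTotal),
    }))
    .sort((a, b) => b.lift - a.lift);
}
```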

3) Networks model (relationships)

Purpose: see emergent structure not visible in rankings.

Provide 2–3 switchable network types:

  1. Tag co-occurrence graph

    • Nodes: tags (colored by namespace)
    • Edges: co-occurrence weight
    • Interaction: select community → filter posts; click tag → open “tag page” + example posts
  2. Submolt similarity graph

    • Nodes: submolts
    • Edges: similarity above threshold
    • Community detection (Louvain/Leiden offline) → color communities
  3. Author–submolt bipartite / projected graph

    • Identify “bridges” (authors connecting disparate submolt communities)
    • Interaction: click author → show their submolt portfolio and posts over time

Implementation notes:

  • Use a WebGL-based renderer if needed (e.g., Sigma.js / Graphology) for smooth interaction.

Insights enabled:

  • bridging authors; polarization clusters; tag communities; “topic neighborhoods”.
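
A sketch of building the tag co-occurrence edges from per-post tag lists (this runs offline in the build step per the plan; shown in TypeScript for consistency with the other examples):

```ts
// Count how often two tags appear on the same post; keep edges above a weight threshold.
export function tagCooccurrence(postTags: Map<string, string[]>, minWeight = 3) {
  const weights = new Map<string, number>();
  for (const tags of postTags.values()) {
    const uniq = [...new Set(tags)].sort();
    for (let i = 0; i < uniq.length; i++) {
      for (let j = i + 1; j < uniq.length; j++) {
        const key = `${uniq[i]}\u0000${uniq[j]}`;
        weights.set(key, (weights.get(key) ?? 0) + 1);
      }
    }
  }
  return [...weights.entries()]
    .filter(([, w]) => w >= minWeight)
    .map(([key, w]) => {
      const [tag_a, tag_b] = key.split('\u0000');
      return { tag_a, tag_b, w };
    });
}
```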

4) Embeddings model (semantic exploration + “similar posts”)

Purpose: operator wants to browse content beyond predefined tags.

Pipeline:

  • Offline compute embeddings for title + content (or content only).
  • Reduce to 2D (UMAP) and ship:
    • (post_id, x, y) plus minimal metadata for tooltip.

UI:

  • Scatterplot (canvas/WebGL) of posts in embedding space.
  • Color by class_note or dominant tag namespace.
  • Interaction:
    • lasso region → filters posts
    • click point → open post in Reader
    • “Find similar” button (kNN by embedding) → list of neighbors

Insights enabled:

  • emergent genres; mislabeled clusters; novel subtopics; weird pockets (spam, memes, manifestos).
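
A sketch of the "Find similar" action: brute-force cosine kNN over precomputed embeddings, which is fine at ~25k posts; the optional precomputed post_knn table can replace it if latency becomes an issue:

```ts
// Brute-force cosine kNN over in-memory embeddings (one Float32Array per post).
export function findSimilar(
  embeddings: Map<string, Float32Array>,
  queryId: string,
  k = 20
): { post_id: string; score: number }[] {
  const q = embeddings.get(queryId);
  if (!q) return [];
  const norm = (v: Float32Array) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const qn = norm(q);
  const scored: { post_id: string; score: number }[] = [];
  for (const [id, v] of embeddings) {
    if (id === queryId) continue;
    let dot = 0;
    for (let i = 0; i < v.length; i++) dot += v[i] * q[i];
    scored.push({ post_id: id, score: dot / (norm(v) * qn) });
  }
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}
```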

5) Integrity model (data quality & suspicious patterns)

Purpose: prevent bad conclusions; spot artifacts/spam.

Views:

  • Missingness dashboard (which fields absent, by submolt/time)
  • Duplicate detection (same title/content repeated)
  • Outlier dashboard (extreme votes / extreme length)
  • Spam heuristics:
    • high URL density, repetitive phrases, crypto tickers
    • show flagged clusters in embedding space or as a table

Insights enabled:

  • trust calibration; artifact discovery; cleaning priorities.
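
A sketch of the spam heuristics as simple, explainable flags (the thresholds and the ticker pattern are placeholders to tune against the data; each flag is surfaced separately so the operator can audit it):

```ts
// Cheap, auditable spam signals computed per post.
export function spamFlags(post: { title: string; content: string }) {
  const text = `${post.title} ${post.content}`;
  const urls = text.match(/https?:\/\/\S+/g) ?? [];
  const words = text.split(/\s+/).filter(Boolean);
  const counts = new Map<string, number>();
  for (const w of words) counts.set(w.toLowerCase(), (counts.get(w.toLowerCase()) ?? 0) + 1);
  const maxRepeat = Math.max(0, ...counts.values());
  return {
    high_url_density: words.length > 0 && urls.length / words.length > 0.2,   // placeholder threshold
    repetitive_phrases: words.length > 30 && maxRepeat / words.length > 0.15, // placeholder threshold
    crypto_tickers: /\$[A-Z]{2,6}\b/.test(text),                              // crude ticker pattern
  };
}
```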

Drill-down mechanics (critical)

Every chart selection should be able to produce:

  • A query definition (SQL or filter predicate)
  • A result list (posts/submolts/authors)
  • A Reader view for the selected item

Examples:

  • Click a stacked area segment (class_note “manifestos_political” on Jan-31) →
    • Filter: time range + class_note
    • Show top posts in that slice
  • Click a node in tag network (“intent:debate”) →
    • Filter: posts with that tag
    • Show example posts + co-occurring tags
  • Lasso a cluster in embedding map →
    • Filter: selected post_ids
    • Show summary terms + top authors + representative posts

Controversy / “interestingness” scoring (operator-facing)

Provide toggles for “rank posts by”:

  • Most discussed: comment_count (with time filter)
  • Most upvoted: upvotes
  • Most downvoted: downvotes
  • Most controversial: candidate formulas to experiment with:
    • 2 * min(upvotes, downvotes) (classic “both sides” signal)
    • total_votes * (1 - |upvotes - downvotes| / total_votes)
    • incorporate comments: controversy * log(1 + comment_count)

Important: surface the formula in UI so the operator knows what “controversy” means.
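
A sketch of the candidate scorers, kept as named functions so the UI can display exactly which formula is active:

```ts
// Candidate "controversial" rankings; the active formula name is shown in the UI.
export const controversyScorers = {
  // Classic "both sides voted" signal.
  min_both_sides: (up: number, down: number) => 2 * Math.min(up, down),

  // Total volume, discounted by how lopsided the vote is.
  balanced_volume: (up: number, down: number) => {
    const total = up + down;
    return total === 0 ? 0 : total * (1 - Math.abs(up - down) / total);
  },

  // Either base score, boosted by discussion volume.
  with_comments: (up: number, down: number, comments: number) =>
    controversyScorers.balanced_volume(up, down) * Math.log1p(comments),
};
```

Note that zero votes and purely one-sided votes score 0 under both base formulas, which is the behavior the unit tests later in this plan should pin down.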

Text search and “pages” (Wikipedia-style exploration)

Create entity pages:

  • Submolt page: description, activity timeline, top tags, representative posts
  • Author page: posting timeline, submolt portfolio, top tags, representative posts
  • Tag page: frequency over time, which submolts use it, representative posts

Add full-text search:

  • Simple in-browser index (FlexSearch / Lunr) over title+content for fast retrieval
  • Search results always drill to Reader and can be “converted to filters”.
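
A sketch of the in-browser index using FlexSearch's document index (the configuration shape follows FlexSearch's documented Document options; tokenization settings would need tuning):

```ts
import { Document } from 'flexsearch';

// Full-text index over title + content; ids map back to rows in the posts table.
const index = new Document({
  document: { id: 'post_id', index: ['title', 'content'] },
  tokenize: 'forward',
});

export function buildIndex(posts: { post_id: string; title: string; content: string }[]) {
  for (const p of posts) index.add(p);
}

// Returns matching post_ids; FlexSearch groups results per indexed field.
export function searchPosts(query: string, limit = 50): string[] {
  const grouped = index.search(query, { limit });
  const ids = new Set<string>();
  for (const g of grouped) for (const id of g.result) ids.add(String(id));
  return [...ids];
}
```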

Performance targets (given ~25k posts)

Target: interactions feel instant.

Practical constraints/approach:

  • Precompute heavy things offline (embeddings, UMAP, network communities).
  • Use DuckDB-WASM for aggregations; cache query results keyed by filter signature.
  • Render big scatterplots via canvas/WebGL.
  • Use virtualized tables for drill-down lists (10k+ rows).

Implementation plan (expanded)

Deliverable shape

Ship a folder like:

  • viz/ (static web app)
  • viz/data/ (Parquet/Arrow/JSON assets produced by a build step)
  • viz/scripts/ (offline preprocessing pipeline)

The project should be runnable as a static site (GitHub Pages / local file server).

Key design choices to lock early

  • Data engine: DuckDB-WASM (SQL) vs “precomputed-only”.
    • If the operator needs to ask new questions, choose DuckDB-WASM.
  • Data format: prefer Parquet + Arrow for load speed and memory.
  • Where heavy computation lives:
    • Offline: embeddings, UMAP coords, graph communities, similarity edges.
    • Online: filtering, aggregations, drill-down lists.
  • Filter state: single canonical filter object + URL sync.
  • Drill-down invariant: every view must be able to produce a query + a list + a selected item for Reader.

Data pipeline (offline build step)

Create a reproducible build that takes raw JSON and emits viz/data/*:

  1. Normalize and validate

    • Read all_posts.merged.json, all_submolts.filtered.json.
    • Build posts_submolts, submolts_filtered, posts_filtered.
    • Normalize into flat tables:
      • posts.parquet
      • submolts.parquet
      • post_tags.parquet
      • post_class_notes.parquet
    • Add derived fields:
      • created_at_day (date)
      • content_len
      • total_votes, score
      • optional lang
  2. Precompute “model assets”

    • Submolt vectors (tag distribution, class distribution)
    • Submolt map coords: UMAP → submolt_umap.parquet (submolt_name, x, y)
    • Tag co-occurrence edges: tag_edges.parquet (tag_a, tag_b, w)
    • Graph communities (offline): tag_communities.parquet, submolt_communities.parquet
    • Post embeddings + 2D projection:
      • post_embedding_meta.parquet (at minimum post_id, x, y)
      • optional post_knn.parquet (top-k neighbors per post_id) for fast “similar posts”
  3. Publish a tiny data dictionary

    • viz/data/schema.json with field names/types + build timestamp + row counts.

This build step should run from the repo root and be deterministic.

App architecture (web)

Core modules (by responsibility)

  • Data layer

    • DataCatalog: knows which assets exist and their versions
    • DuckDBService (if using DuckDB-WASM): loads Parquet, exposes:
      • query(sql, params) (async)
      • cached results by (sql + params) signature
    • SearchIndexService: builds/loads full-text index (FlexSearch) for title/content
  • State layer

    • FilterState: canonical filter object (time range, selected submolts, tags, etc.)
    • SelectionState: what is currently highlighted (chart hover/selection)
    • ReaderState: currently opened entity (post/submolt/author/tag)
    • URL serialization: encodeStateToUrl, decodeStateFromUrl
  • Query layer

    • “Query builders” that translate filter state into SQL fragments:
      • whereClauseFromFilters(filters)
      • topAuthorsQuery(filters, limit)
      • timelineQuery(filters, grain)
    • This is where you enforce consistency so every module interprets filters identically (see the sketch after this list).
  • UI components

    • Navigator (search + filters + saved views)
    • CanvasTabs (Overview/Spaces/Networks/Embeddings/Integrity)
    • ReaderPanel (post/submolt/author/tag “pages”)
    • ResultsTable (virtualized drill-down list)
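
A sketch of the query-builder layer: one function turns the canonical FilterState into a WHERE clause, and every module's query is built on top of it (parameter handling is simplified here; in practice escape or bind values properly):

```ts
import type { FilterState } from './filters'; // the FilterState sketched earlier; path is illustrative

const quote = (s: string) => `'${s.replace(/'/g, "''")}'`;

export function whereClauseFromFilters(f: FilterState): string {
  const preds: string[] = [];
  if (f.timeRange) preds.push(`created_at BETWEEN ${quote(f.timeRange.from)} AND ${quote(f.timeRange.to)}`);
  if (f.submolts.length) preds.push(`submolt_name IN (${f.submolts.map(quote).join(', ')})`);
  if (f.authors.length) preds.push(`author_name IN (${f.authors.map(quote).join(', ')})`);
  if (f.engagement?.minComments != null) preds.push(`comment_count >= ${f.engagement.minComments}`);
  return preds.length ? `WHERE ${preds.join(' AND ')}` : '';
}

export function topAuthorsQuery(f: FilterState, limit = 20): string {
  return `SELECT author_name, count(*) AS n FROM posts ${whereClauseFromFilters(f)}
          GROUP BY author_name ORDER BY n DESC LIMIT ${limit}`;
}
```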

Crossfilter mechanics (how coordination happens)

  1. User interacts with a chart (brush, click, lasso).
  2. That interaction emits:
    • filtersDelta (add/remove/change)
    • selection (optional, for hover highlights)
    • drilldownQuery (optional shortcut for “show underlying posts”)
  3. Global state updates → query layer recomputes → all views re-render.
  4. When a row/point is selected, Reader loads the full text and context.

UI implementation details that matter

  • Reader panel

    • Markdown-ish rendering with strict sanitization (no raw HTML execution).
    • Highlight search terms; show extracted URLs; show “copy id/link” buttons.
  • Virtualized lists

    • All drill-down lists must be virtualized (10k+ rows possible).
    • Provide “load next 200” style pagination for heavy queries.
  • Large scatterplots

    • Use canvas/WebGL; avoid SVG for 25k points with lasso.
    • Decouple rendering from React reconciliation (draw directly to canvas).
  • Network views

    • Start with filtered subgraphs only (top N nodes/edges) to keep it navigable.
    • Provide thresholds / sliders for edge weight.
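
A sketch of the decoupled scatter renderer using a plain 2D canvas (swap in WebGL/deck.gl if 25k points with lasso and hover gets sluggish):

```ts
// Draw all points in one pass, outside the framework's render cycle.
export function drawScatter(
  canvas: HTMLCanvasElement,
  points: { x: number; y: number; selected: boolean }[]
) {
  const ctx = canvas.getContext('2d');
  if (!ctx) return;
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Assumes x/y are already scaled to pixel coordinates by the caller.
  for (const p of points) {
    ctx.fillStyle = p.selected ? '#e4572e' : 'rgba(60, 60, 60, 0.5)';
    ctx.beginPath();
    ctx.arc(p.x, p.y, p.selected ? 3 : 1.5, 0, Math.PI * 2);
    ctx.fill();
  }
}
// Call from requestAnimationFrame when points or selection change, not on every React render.
```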

Phased roadmap (milestones)

Phase 0 — data contract (1–2 days)

  • Freeze schema for posts/submolts/tags/class_notes.
  • Canonicalize predicted_tags key (and handle typos gracefully in build step).
  • Produce viz/data/schema.json and simple row-count report.

Phase 1 — skeleton console (2–4 days)

  • Vite SPA + layout: Navigator / Tabs / Reader.
  • Load a tiny sample (e.g., 500 posts) to validate UX and drill-down flow.
  • URL state + saved views.

Phase 2 — core analysis loop (3–7 days)

  • DuckDB-WASM wiring (or precomputed aggregates).
  • Overview tab:
    • timeline brush
    • top submolts/authors
    • engagement histograms
  • Reader “post page” + “submolt page” with representative posts.

Phase 3 — submolt map + networks (4–10 days)

  • Submolt UMAP map (lasso, profile, “distinctive tags” lift).
  • Tag co-occurrence network (threshold slider + drill-down).
  • Author↔submolt bipartite summary (even if not full network yet).

Phase 4 — embeddings + integrity (4–12 days)

  • Post embedding scatter (lasso + “similar posts”).
  • Integrity dashboard:
    • missingness, duplicates, vote outliers
    • spam heuristics and review workflow

Phase 5 — polish & operator ergonomics (ongoing)

  • Export slice as CSV/JSON.
  • IndexedDB annotations (flag, note, label).
  • Keyboard shortcuts (next/prev in list, open reader, pin filters).
  • Shareable “view links” (URL encodes state).

Test plan (what to test, and how)

1) Data pipeline tests (offline)

Goal: ensure the web app is never fed inconsistent/broken data.

  • Schema tests
    • Assert required fields exist in outputs (posts/submolts/tags/class_notes).
    • Assert types (timestamps parse, ints non-negative where expected).
  • Row count invariants
    • len(posts_filtered) equals posts table row count.
    • distinct(submolt_name in posts_filtered) equals len(submolts_filtered) (or explain exceptions).
  • Key canonicalization
    • If raw data contains preducted_tags, it must be mapped to predicted_tags (and logged).
  • Dedup checks
    • post_id uniqueness; submolt_id uniqueness.
  • Build reproducibility
    • Same inputs → same output checksums (or same row counts + stable ordering).

These can run as a Python test suite (pytest) or as a node script with assertions; either is fine as long as it runs in CI.

2) Unit tests (web app)

Goal: correctness of state and query logic without running a browser.

  • Filter serialization
    • decode(encode(filters)) round-trips.
    • Backward compatibility when adding new filter fields.
  • Query builders
    • Given a filter state, generated SQL contains the right WHERE predicates.
    • “No filter” state returns unfiltered query.
  • Scoring functions
    • controversy formulas behave sensibly on edge cases:
      • zero votes, only upvotes, only downvotes, huge numbers.

Use a JS test runner (e.g., Vitest) and test pure functions.
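
A sketch of these pure-function tests with Vitest (module paths and fixture values are illustrative):

```ts
import { describe, it, expect } from 'vitest';
import { encodeStateToUrl, decodeStateFromUrl, EMPTY_FILTERS } from './filters'; // sketched earlier; paths illustrative
import { controversyScorers } from './scoring';

describe('filter serialization', () => {
  it('round-trips through the URL encoding', () => {
    const filters = { ...EMPTY_FILTERS, submolts: ['philosophy'], textQuery: 'agents' };
    expect(decodeStateFromUrl(encodeStateToUrl(filters))).toEqual(filters);
  });
});

describe('controversy scorers', () => {
  it('returns 0 for zero votes and one-sided votes', () => {
    expect(controversyScorers.balanced_volume(0, 0)).toBe(0);
    expect(controversyScorers.balanced_volume(100, 0)).toBe(0);
    expect(controversyScorers.min_both_sides(50, 0)).toBe(0);
  });
});
```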

3) Integration tests (web app, headless but app-level)

Goal: coordinated views stay consistent.

With a small fixed dataset fixture:

  • Brushing the timeline reduces counts in:
    • top submolts list
    • histograms
    • drill-down table row count
  • Clicking a tag in the tag view updates:
    • filter chips
    • Reader “tag page” content
    • the underlying posts list

These can be written as component tests (Playwright component tests or Testing Library + jsdom), depending on your preferred stack.

4) End-to-end tests (Playwright)

Goal: the operator-critical flows never break.

E2E scenarios:

  • Load + baseline render
    • App loads, shows Overview with non-empty charts.
  • Drill-down loop
    • Apply a filter (time brush) → open drill-down list → click a post → Reader renders full content.
  • Cross-view coordination
    • Select a submolt on the map → Overview updates → open submolt page shows correct description and representative posts.
  • Search
    • Search for a keyword → results list → open post → keyword highlighted.
  • Embeddings
    • Lasso selection → results count changes → “similar posts” returns non-empty list.
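
A sketch of one Playwright flow for the drill-down loop (selectors are placeholders; wire them to real test ids once the components exist):

```ts
import { test, expect } from '@playwright/test';

test('drill-down loop: filter → list → reader', async ({ page }) => {
  await page.goto('/'); // static build served locally
  await expect(page.getByTestId('overview-timeline')).toBeVisible();

  // Brushing is awkward to script precisely; a preset range button is a pragmatic stand-in.
  await page.getByTestId('time-filter-last-week').click();

  const firstRow = page.getByTestId('drilldown-row').first();
  await firstRow.click();
  await expect(page.getByTestId('reader-content')).not.toBeEmpty();
});
```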

5) Performance and regression tests

Goal: keep interactions “instant” as data grows.

  • Performance budgets
    • Cold load to interactive (TTI) under a target on a typical machine.
    • Timeline brush updates within a target (e.g., <200ms perceived).
  • Query time logging
    • Instrument DuckDB query durations; flag slow queries.
  • Visual regression (optional but valuable)
    • Screenshot key states (Overview, Map, Network, Embeddings) on a fixed dataset fixture.
    • Diff screenshots in CI to catch accidental UI breakage.

6) “Analyst sanity” tests (golden aggregates)

Goal: ensure charts match known expected numbers.

For a fixed small dataset fixture, store golden values:

  • total posts, total unique authors, total unique submolts
  • top-5 submolts by count
  • tag frequency counts for a known slice

Run these as part of CI so refactors don’t silently change semantics.

Tech stack suggestion (concrete)

  • Build/tooling: Vite + TypeScript
  • Data:
    • DuckDB-WASM + Parquet (Option A), or JSON + precomputed aggregates (Option B)
    • Arrow for efficient transfer
  • Charts:
    • Observable Plot for quick “analysis-grade” charts
    • D3 where custom interaction is needed
  • Networks: Sigma.js (Graphology) or similar WebGL renderer
  • Embedding scatter: deck.gl ScatterplotLayer or custom canvas plot
  • Search: FlexSearch
  • State: URL-synced state (filters encoded into query params) + local saved views

Risks & mitigations

  • No comment graph: can’t do thread analysis → emphasize content + counts + co-occurrence.
  • Vote counts may be noisy: treat engagement as “signals” not ground truth; add Integrity module and confidence notes.
  • Label quality (predicted_tags, class_notes): add quick qualitative sampling workflows and “mismatch cluster” detection via embeddings.