Interactive visualization plan (HTML + JS) for posts_filtered

Goal (operator workflow)

Build a single-page “Moltbook Analysis Console” that lets an operator:

  • Understand what’s happening (global trends, bursts, dominant spaces/topics).
  • Understand where it’s happening (which submolts, which clusters).
  • Understand who drives it (authors, communities, cross-posting patterns).
  • Understand how it’s received (upvotes/downvotes/comments; controversy).
  • Drill down from any aggregate visualization into the exact posts/submolt descriptions that explain the pattern.

The core UX principle: coordinated multiple views (filters in one view instantly update all others) + a persistent Details/Reader panel.

Data inputs & core entities

Inputs

  • all_posts.merged.json → posts → posts_filtered (list of post dicts)
  • all_submolts.filtered.json → submolts → submolts_filtered (list of submolt dicts)

Entity tables (normalized, for fast querying)

Even if you keep the raw JSON, the console should treat the data as tables:

  • Posts table

    • post_id, created_at, title, content, url
    • author_id, author_name
    • submolt_id, submolt_name, submolt_display_name
    • upvotes, downvotes, comment_count
    • Derived: score = upvotes - downvotes, total_votes = upvotes + downvotes
    • Derived: controversy = f(upvotes, downvotes, comment_count) (define later)
    • Derived: content_len, has_url, is_empty_title, lang (optional)
  • Submolts table

    • submolt_id, submolt_name, display_name, description
    • subscriber_count, created_at, last_activity_at, created_by_id, created_by_name
  • Tags table (from predicted_tags)

    • One row per (post_id, tag)
    • Derived: tag_namespace (e.g. emotion, style, intent)
  • Class notes table (from class_notes)

    • One row per (post_id, class_note)
  • Optional computed tables

    • Submolt × tag frequency
    • Author × submolt frequency
    • Tag co-occurrence edges
    • Submolt similarity edges (cosine over tag distributions)
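
As a minimal TypeScript sketch of the posts table and the derived fields listed above (field names follow the list; the optional lang detection is omitted, and the raw input shape is an assumption):

```ts
// Sketch of the normalized posts row, using the field names listed above.
interface PostRow {
  post_id: string;
  created_at: string;            // ISO timestamp
  title: string;
  content: string;
  url: string | null;
  author_id: string;
  author_name: string;
  submolt_id: string;
  submolt_name: string;
  submolt_display_name: string;
  upvotes: number;
  downvotes: number;
  comment_count: number;
  // Derived fields, computed once in the build step:
  score: number;                 // upvotes - downvotes
  total_votes: number;           // upvotes + downvotes
  content_len: number;
  has_url: boolean;
  is_empty_title: boolean;
}

type RawPost = Omit<PostRow, 'score' | 'total_votes' | 'content_len' | 'has_url' | 'is_empty_title'>;

// Derivation applied during normalization.
function derivePost(raw: RawPost): PostRow {
  return {
    ...raw,
    score: raw.upvotes - raw.downvotes,
    total_votes: raw.upvotes + raw.downvotes,
    content_len: raw.content.length,
    has_url: Boolean(raw.url) || /https?:\/\//.test(raw.content),
    is_empty_title: raw.title.trim().length === 0,
  };
}
```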

Architecture options (two viable paths)

Option A (recommended): “Static site + in-browser analytics engine”

  • Preprocess once into columnar files:
    • JSON → Parquet/Arrow (posts, submolts, tags, class_notes)
  • Load into browser using:
    • DuckDB-WASM for SQL queries + aggregations
    • Web Workers for non-blocking queries
  • Benefits:
    • Fast slice/dice, no backend required
    • Easy to implement drill-down with SQL: “show me the posts behind this bar”

Option B: “Static site + precomputed aggregates”

  • Precompute all aggregates offline (Python notebook) → ship as JSON bundles
  • Browser does only filtering on precomputed structures
  • Benefits:
    • Simpler stack
  • Risks:
    • Harder to add ad-hoc operator questions (less flexible)

If the goal is “operator exploration” (unknown questions), Option A is usually worth it.
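
A minimal sketch of the Option A engine, assuming the published @duckdb/duckdb-wasm bundle-selection API and a Parquet asset served alongside the static site (the viz/data/posts.parquet path is illustrative):

```ts
import * as duckdb from '@duckdb/duckdb-wasm';

// Standard duckdb-wasm bootstrap: pick a bundle, spin up the worker, connect.
export async function initEngine() {
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  const worker = new Worker(bundle.mainWorker!);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
  const conn = await db.connect();

  // Load the prebuilt Parquet asset into an in-memory table.
  // (Registering the file by URL first is another option; exact wiring depends on the duckdb-wasm version.)
  const assetUrl = new URL('data/posts.parquet', location.href).toString();
  await conn.query(`CREATE TABLE posts AS SELECT * FROM read_parquet('${assetUrl}')`);
  return conn;
}

// Example aggregation behind "show me the posts behind this bar":
// await conn.query(`SELECT submolt_name, count(*) AS n FROM posts GROUP BY 1 ORDER BY n DESC LIMIT 20`);
```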

“Several different models” = multiple coordinated exploration modules

Think of the console as a set of models/views that are interchangeable but synchronized by a common filter state.

Global filter state (shared across all modules)

The operator can always filter by:

  • Time range (brush on timeline)
  • Submolt(s)
  • Author(s)
  • Tag(s) / class_note(s)
  • Engagement ranges (upvotes/downvotes/comment_count)
  • Text search (title/content; optional regex)
  • Language (if detected)

All charts and tables update to reflect the current filter state.
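
A sketch of that shared filter object with a naive URL encoding (field names are assumptions; the point is one canonical state that every module consumes and that survives a page reload):

```ts
// Canonical filter state shared by all modules (names are illustrative).
export interface FilterState {
  timeRange?: { from: string; to: string };   // ISO dates from the timeline brush
  submolts: string[];
  authors: string[];
  tags: string[];
  classNotes: string[];
  engagement?: { minUpvotes?: number; maxUpvotes?: number; minComments?: number };
  textQuery?: string;
}

export const EMPTY_FILTERS: FilterState = { submolts: [], authors: [], tags: [], classNotes: [] };

// Naive URL sync: JSON in a single query parameter keeps round-tripping trivial.
export function encodeStateToUrl(f: FilterState): string {
  return new URLSearchParams({ f: JSON.stringify(f) }).toString();
}

export function decodeStateFromUrl(qs: string): FilterState {
  const raw = new URLSearchParams(qs).get('f');
  return raw ? { ...EMPTY_FILTERS, ...JSON.parse(raw) } : EMPTY_FILTERS;
}
```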

Core UI layout (high-level)

Left: “Navigator”

  • Search box (posts, authors, submolts, tags)
  • Active filter chips (click to remove)
  • Saved views/bookmarks (“Spike on Jan-31”, “Crypto shilling cluster”, etc.)

Center: “Canvas” (module area)

  • Tabs: Overview, Spaces, Networks, Embeddings, Integrity

Right: “Details / Reader”

  • Shows selected post/submolt/author
  • Renders markdown-like content (safe sanitizer)
  • Context section:
    • “More posts by this author”
    • “More posts in this submolt”
    • “Similar posts” (embedding or tag similarity)

This right panel is the drill-down anchor: every visualization must be able to populate it.


Modules (what each “model” provides)

1) Overview model (time + engagement + composition)

Purpose: quickly answer “what is going on overall?”

Views:

  • Timeline:
    • Posts/day (line or area)
    • Unique authors/day (overlay)
    • Optional: stacked area by class_note
    • Interaction: brush selects time range; click spike → auto drill-down list
  • Engagement distributions:
    • Histograms (log-scale) for upvotes/downvotes/comment_count
    • Interaction: drag range filter
  • Top lists:
    • Top submolts by posts, top authors by posts
    • Top tags / class_notes
    • Interaction: click item → adds filter; shift-click to compare multiple

Insights enabled:

  • bursts/events; inequality; dominant categories; baseline health.
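
As an illustration of the timeline view, a hedged DuckDB query the Overview module could run per filter state (table and column names follow the entity tables above; assumes created_at is loaded as a timestamp):

```ts
// Posts/day and unique authors/day for the current filter state.
// `where` comes from the shared query builder described later in this plan.
export function timelineQuery(where: string, grain: 'day' | 'week' = 'day'): string {
  return `
    SELECT date_trunc('${grain}', created_at) AS bucket,
           count(*)                           AS posts,
           count(DISTINCT author_id)          AS unique_authors
    FROM posts
    ${where}
    GROUP BY bucket
    ORDER BY bucket
  `;
}
```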

2) Spaces model (submolts as “places”)

Purpose: understand submolts as communities with different content/ecosystems.

Views:

  • Submolt map (2D projection):
    • Each submolt is a point.
    • Similarity computed from tag distributions / class_note distributions.
    • Use UMAP on the submolt vectors offline; ship coordinates.
    • Interaction: lasso selection → filters; click point → show submolt description + example posts.
  • Submolt profile panel (within Details panel when submolt selected):
    • Description, subscriber_count, last_activity
    • “What’s distinctive here?”:
      • tags overrepresented vs global baseline (lift)
      • top authors within submolt

Insights enabled:

  • niche clusters (technical vs philosophical vs crypto); “nearby” submolts; genre neighborhoods.
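
One way to compute "what's distinctive here" is tag lift: the tag's share inside the submolt divided by its global share. A sketch in plain TypeScript over the per-post tag rows (the minimum-count threshold is arbitrary and should be tuned):

```ts
interface TagRow { post_id: string; tag: string; submolt_name: string }

// lift(tag, submolt) = P(tag | submolt) / P(tag); values well above 1 are "distinctive".
export function distinctiveTags(rows: TagRow[], submolt: string, minCount = 5) {
  const globalCounts = new Map<string, number>();
  const localCounts = new Map<string, number>();
  let globalTotal = 0;
  let localTotal = 0;
  for (const r of rows) {
    globalCounts.set(r.tag, (globalCounts.get(r.tag) ?? 0) + 1);
    globalTotal++;
    if (r.submolt_name === submolt) {
      localCounts.set(r.tag, (localCounts.get(r.tag) ?? 0) + 1);
      localTotal++;
    }
  }
  return [...localCounts.entries()]
    .filter(([, n]) => n >= minCount)
    .map(([tag, n]) => ({
      tag,
      count: n,
      lift: (n / localTotal) / (globalCounts.get(tag)! / globalTotal),
    }))
    .sort((a, b) => b.lift - a.lift);
}
```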

3) Networks model (relationships)

Purpose: see emergent structure not visible in rankings.

Provide 2–3 switchable network types:

  1. Tag co-occurrence graph

    • Nodes: tags (colored by namespace)
    • Edges: co-occurrence weight
    • Interaction: select community → filter posts; click tag → open “tag page” + example posts
  2. Submolt similarity graph

    • Nodes: submolts
    • Edges: similarity above threshold
    • Community detection (Louvain/Leiden offline) → color communities
  3. Author–submolt bipartite / projected graph

    • Identify “bridges” (authors connecting disparate submolt communities)
    • Interaction: click author → show their submolt portfolio and posts over time

Implementation notes:

  • Use a WebGL-based renderer if needed (e.g., Sigma.js / Graphology) for smooth interaction.

Insights enabled:

  • bridging authors; polarization clusters; tag communities; “topic neighborhoods”.
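
A sketch of building the tag co-occurrence edges from per-post tag lists (this runs offline in the build step per the plan; shown in TypeScript for consistency with the other examples):

```ts
// Count how often two tags appear on the same post; keep edges above a weight threshold.
export function tagCooccurrence(postTags: Map<string, string[]>, minWeight = 3) {
  const weights = new Map<string, number>();
  for (const tags of postTags.values()) {
    const uniq = [...new Set(tags)].sort();
    for (let i = 0; i < uniq.length; i++) {
      for (let j = i + 1; j < uniq.length; j++) {
        const key = `${uniq[i]}\u0000${uniq[j]}`;
        weights.set(key, (weights.get(key) ?? 0) + 1);
      }
    }
  }
  return [...weights.entries()]
    .filter(([, w]) => w >= minWeight)
    .map(([key, w]) => {
      const [tag_a, tag_b] = key.split('\u0000');
      return { tag_a, tag_b, w };
    });
}
```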

4) Embeddings model (semantic exploration + “similar posts”)

Purpose: operator wants to browse content beyond predefined tags.

Pipeline:

  • Offline compute embeddings for title + content (or content only).
  • Reduce to 2D (UMAP) and ship:
    • (post_id, x, y) plus minimal metadata for tooltip.

UI:

  • Scatterplot (canvas/WebGL) of posts in embedding space.
  • Color by class_note or dominant tag namespace.
  • Interaction:
    • lasso region → filters posts
    • click point → open post in Reader
    • “Find similar” button (kNN by embedding) → list of neighbors

Insights enabled:

  • emergent genres; mislabeled clusters; novel subtopics; weird pockets (spam, memes, manifestos).
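
A sketch of the "Find similar" action: brute-force cosine kNN over precomputed embeddings, which is fine at ~25k posts; the optional precomputed post_knn table can replace it if latency becomes an issue:

```ts
// Brute-force cosine kNN over in-memory embeddings (one Float32Array per post).
export function findSimilar(
  embeddings: Map<string, Float32Array>,
  queryId: string,
  k = 20
): { post_id: string; score: number }[] {
  const q = embeddings.get(queryId);
  if (!q) return [];
  const norm = (v: Float32Array) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const qn = norm(q);
  const scored: { post_id: string; score: number }[] = [];
  for (const [id, v] of embeddings) {
    if (id === queryId) continue;
    let dot = 0;
    for (let i = 0; i < v.length; i++) dot += v[i] * q[i];
    scored.push({ post_id: id, score: dot / (norm(v) * qn) });
  }
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}
```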

5) Integrity model (data quality & suspicious patterns)

Purpose: prevent bad conclusions; spot artifacts/spam.

Views:

  • Missingness dashboard (which fields absent, by submolt/time)
  • Duplicate detection (same title/content repeated)
  • Outlier dashboard (extreme votes / extreme length)
  • Spam heuristics:
    • high URL density, repetitive phrases, crypto tickers
    • show flagged clusters in embedding space or as a table

Insights enabled:

  • trust calibration; artifact discovery; cleaning priorities.
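
A sketch of the spam heuristics as simple, explainable flags (the thresholds and the ticker pattern are placeholders to tune against the data; each flag is surfaced separately so the operator can audit it):

```ts
// Cheap, auditable spam signals computed per post.
export function spamFlags(post: { title: string; content: string }) {
  const text = `${post.title} ${post.content}`;
  const urls = text.match(/https?:\/\/\S+/g) ?? [];
  const words = text.split(/\s+/).filter(Boolean);
  const counts = new Map<string, number>();
  for (const w of words) counts.set(w.toLowerCase(), (counts.get(w.toLowerCase()) ?? 0) + 1);
  const maxRepeat = Math.max(0, ...counts.values());
  return {
    high_url_density: words.length > 0 && urls.length / words.length > 0.2,   // placeholder threshold
    repetitive_phrases: words.length > 30 && maxRepeat / words.length > 0.15, // placeholder threshold
    crypto_tickers: /\$[A-Z]{2,6}\b/.test(text),                              // crude ticker pattern
  };
}
```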

Drill-down mechanics (critical)

Every chart selection should be able to produce:

  • A query definition (SQL or filter predicate)
  • A result list (posts/submolts/authors)
  • A Reader view for the selected item

Examples:

  • Click a stacked area segment (class_note “manifestos_political” on Jan-31) →
    • Filter: time range + class_note
    • Show top posts in that slice
  • Click a node in tag network (“intent:debate”) →
    • Filter: posts with that tag
    • Show example posts + co-occurring tags
  • Lasso a cluster in embedding map →
    • Filter: selected post_ids
    • Show summary terms + top authors + representative posts

Controversy / “interestingness” scoring (operator-facing)

Provide toggles for “rank posts by”:

  • Most discussed: comment_count (with time filter)
  • Most upvoted: upvotes
  • Most downvoted: downvotes
  • Most controversial: candidate formulas to experiment with:
    • 2 * min(upvotes, downvotes) (classic “both sides” signal)
    • total_votes * (1 - |upvotes - downvotes| / total_votes)
    • incorporate comments: controversy * log(1 + comment_count)

Important: surface the formula in UI so the operator knows what “controversy” means.
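
A sketch of the candidate scorers, kept as named functions so the UI can display exactly which formula is active:

```ts
// Candidate "controversial" rankings; the active formula name is shown in the UI.
export const controversyScorers = {
  // Classic "both sides voted" signal.
  min_both_sides: (up: number, down: number) => 2 * Math.min(up, down),

  // Total volume, discounted by how lopsided the vote is.
  balanced_volume: (up: number, down: number) => {
    const total = up + down;
    return total === 0 ? 0 : total * (1 - Math.abs(up - down) / total);
  },

  // Either base score, boosted by discussion volume.
  with_comments: (up: number, down: number, comments: number) =>
    controversyScorers.balanced_volume(up, down) * Math.log1p(comments),
};
```

Note that zero votes and purely one-sided votes score 0 under both base formulas, which is the behavior the unit tests later in this plan should pin down.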

Text search and “pages” (Wikipedia-style exploration)

Create entity pages:

  • Submolt page: description, activity timeline, top tags, representative posts
  • Author page: posting timeline, submolt portfolio, top tags, representative posts
  • Tag page: frequency over time, which submolts use it, representative posts

Add full-text search:

  • Simple in-browser index (FlexSearch / Lunr) over title+content for fast retrieval
  • Search results always drill to Reader and can be “converted to filters”.
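
A sketch of the in-browser index using FlexSearch's document index (the configuration shape follows FlexSearch's documented Document options; tokenization settings would need tuning):

```ts
import { Document } from 'flexsearch';

// Full-text index over title + content; ids map back to rows in the posts table.
const index = new Document({
  document: { id: 'post_id', index: ['title', 'content'] },
  tokenize: 'forward',
});

export function buildIndex(posts: { post_id: string; title: string; content: string }[]) {
  for (const p of posts) index.add(p);
}

// Returns matching post_ids; FlexSearch groups results per indexed field.
export function searchPosts(query: string, limit = 50): string[] {
  const grouped = index.search(query, { limit });
  const ids = new Set<string>();
  for (const g of grouped) for (const id of g.result) ids.add(String(id));
  return [...ids];
}
```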

Performance targets (given ~25k posts)

Target: interactions feel instant.

Practical constraints/approach:

  • Precompute heavy things offline (embeddings, UMAP, network communities).
  • Use DuckDB-WASM for aggregations; cache query results keyed by filter signature.
  • Render big scatterplots via canvas/WebGL.
  • Use virtualized tables for drill-down lists (10k+ rows).

Implementation plan (expanded)

Deliverable shape

Ship a folder like:

  • viz/ (static web app)
  • viz/data/ (Parquet/Arrow/JSON assets produced by a build step)
  • viz/scripts/ (offline preprocessing pipeline)

The project should be runnable as a static site (GitHub Pages / local file server).

Key design choices to lock early

  • Data engine: DuckDB-WASM (SQL) vs “precomputed-only”.
    • If the operator needs to ask new questions, choose DuckDB-WASM.
  • Data format: prefer Parquet + Arrow for load speed and memory.
  • Where heavy computation lives:
    • Offline: embeddings, UMAP coords, graph communities, similarity edges.
    • Online: filtering, aggregations, drill-down lists.
  • Filter state: single canonical filter object + URL sync.
  • Drill-down invariant: every view must be able to produce a query + a list + a selected item for Reader.

Data pipeline (offline build step)

Create a reproducible build that takes raw JSON and emits viz/data/*:

  1. Normalize and validate

    • Read all_posts.merged.json, all_submolts.filtered.json.
    • Build posts_submolts, submolts_filtered, posts_filtered.
    • Normalize into flat tables:
      • posts.parquet
      • submolts.parquet
      • post_tags.parquet
      • post_class_notes.parquet
    • Add derived fields:
      • created_at_day (date)
      • content_len
      • total_votes, score
      • optional lang
  2. Precompute “model assets”

    • Submolt vectors (tag distribution, class distribution)
    • Submolt map coords: UMAP → submolt_umap.parquet (submolt_name, x, y)
    • Tag co-occurrence edges: tag_edges.parquet (tag_a, tag_b, w)
    • Graph communities (offline): tag_communities.parquet, submolt_communities.parquet
    • Post embeddings + 2D projection:
      • post_embedding_meta.parquet (at minimum post_id, x, y)
      • optional post_knn.parquet (top-k neighbors per post_id) for fast “similar posts”
  3. Publish a tiny data dictionary

    • viz/data/schema.json with field names/types + build timestamp + row counts.

This build step should run from the repo root and be deterministic.

App architecture (web)

Core modules (by responsibility)

  • Data layer

    • DataCatalog: knows which assets exist and their versions
    • DuckDBService (if using DuckDB-WASM): loads Parquet, exposes:
      • query(sql, params) (async)
      • cached results by (sql + params) signature
    • SearchIndexService: builds/loads full-text index (FlexSearch) for title/content
  • State layer

    • FilterState: canonical filter object (time range, selected submolts, tags, etc.)
    • SelectionState: what is currently highlighted (chart hover/selection)
    • ReaderState: currently opened entity (post/submolt/author/tag)
    • URL serialization: encodeStateToUrl, decodeStateFromUrl
  • Query layer

    • “Query builders” that translate filter state into SQL fragments:
      • whereClauseFromFilters(filters)
      • topAuthorsQuery(filters, limit)
      • timelineQuery(filters, grain)
    • This is where you enforce consistency so every module interprets filters identically (see the sketch after this list).
  • UI components

    • Navigator (search + filters + saved views)
    • CanvasTabs (Overview/Spaces/Networks/Embeddings/Integrity)
    • ReaderPanel (post/submolt/author/tag “pages”)
    • ResultsTable (virtualized drill-down list)
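
A sketch of the query-builder layer: one function turns the canonical FilterState into a WHERE clause, and every module's query is built on top of it (parameter handling is simplified here; in practice escape or bind values properly):

```ts
import type { FilterState } from './filters'; // the FilterState sketched earlier; path is illustrative

const quote = (s: string) => `'${s.replace(/'/g, "''")}'`;

export function whereClauseFromFilters(f: FilterState): string {
  const preds: string[] = [];
  if (f.timeRange) preds.push(`created_at BETWEEN ${quote(f.timeRange.from)} AND ${quote(f.timeRange.to)}`);
  if (f.submolts.length) preds.push(`submolt_name IN (${f.submolts.map(quote).join(', ')})`);
  if (f.authors.length) preds.push(`author_name IN (${f.authors.map(quote).join(', ')})`);
  if (f.engagement?.minComments != null) preds.push(`comment_count >= ${f.engagement.minComments}`);
  return preds.length ? `WHERE ${preds.join(' AND ')}` : '';
}

export function topAuthorsQuery(f: FilterState, limit = 20): string {
  return `SELECT author_name, count(*) AS n FROM posts ${whereClauseFromFilters(f)}
          GROUP BY author_name ORDER BY n DESC LIMIT ${limit}`;
}
```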

Crossfilter mechanics (how coordination happens)

  1. User interacts with a chart (brush, click, lasso).
  2. That interaction emits:
    • filtersDelta (add/remove/change)
    • selection (optional, for hover highlights)
    • drilldownQuery (optional shortcut for “show underlying posts”)
  3. Global state updates → query layer recomputes → all views re-render.
  4. When a row/point is selected, Reader loads the full text and context.

UI implementation details that matter

  • Reader panel

    • Markdown-ish rendering with strict sanitization (no raw HTML execution).
    • Highlight search terms; show extracted URLs; show “copy id/link” buttons.
  • Virtualized lists

    • All drill-down lists must be virtualized (10k+ rows possible).
    • Provide “load next 200” style pagination for heavy queries.
  • Large scatterplots

    • Use canvas/WebGL; avoid SVG for 25k points with lasso.
    • Decouple rendering from React reconciliation (draw directly to canvas).
  • Network views

    • Start with filtered subgraphs only (top N nodes/edges) to keep it navigable.
    • Provide thresholds / sliders for edge weight.
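
A sketch of the decoupled scatter renderer using a plain 2D canvas (swap in WebGL/deck.gl if 25k points with lasso and hover gets sluggish):

```ts
// Draw all points in one pass, outside the framework's render cycle.
export function drawScatter(
  canvas: HTMLCanvasElement,
  points: { x: number; y: number; selected: boolean }[]
) {
  const ctx = canvas.getContext('2d');
  if (!ctx) return;
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Assumes x/y are already scaled to pixel coordinates by the caller.
  for (const p of points) {
    ctx.fillStyle = p.selected ? '#e4572e' : 'rgba(60, 60, 60, 0.5)';
    ctx.beginPath();
    ctx.arc(p.x, p.y, p.selected ? 3 : 1.5, 0, Math.PI * 2);
    ctx.fill();
  }
}
// Call from requestAnimationFrame when points or selection change, not on every React render.
```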

Phased roadmap (milestones)

Phase 0 — data contract (1–2 days)

  • Freeze schema for posts/submolts/tags/class_notes.
  • Canonicalize predicted_tags key (and handle typos gracefully in build step).
  • Produce viz/data/schema.json and simple row-count report.

Phase 1 — skeleton console (2–4 days)

  • Vite SPA + layout: Navigator / Tabs / Reader.
  • Load a tiny sample (e.g., 500 posts) to validate UX and drill-down flow.
  • URL state + saved views.

Phase 2 — core analysis loop (3–7 days)

  • DuckDB-WASM wiring (or precomputed aggregates).
  • Overview tab:
    • timeline brush
    • top submolts/authors
    • engagement histograms
  • Reader “post page” + “submolt page” with representative posts.

Phase 3 — submolt map + networks (4–10 days)

  • Submolt UMAP map (lasso, profile, “distinctive tags” lift).
  • Tag co-occurrence network (threshold slider + drill-down).
  • Author↔submolt bipartite summary (even if not full network yet).

Phase 4 — embeddings + integrity (4–12 days)

  • Post embedding scatter (lasso + “similar posts”).
  • Integrity dashboard:
    • missingness, duplicates, vote outliers
    • spam heuristics and review workflow

Phase 5 — polish & operator ergonomics (ongoing)

  • Export slice as CSV/JSON.
  • IndexedDB annotations (flag, note, label).
  • Keyboard shortcuts (next/prev in list, open reader, pin filters).
  • Shareable “view links” (URL encodes state).

Test plan (what to test, and how)

1) Data pipeline tests (offline)

Goal: ensure the web app is never fed inconsistent/broken data.

  • Schema tests
    • Assert required fields exist in outputs (posts/submolts/tags/class_notes).
    • Assert types (timestamps parse, ints non-negative where expected).
  • Row count invariants
    • len(posts_filtered) equals posts table row count.
    • distinct(submolt_name in posts_filtered) equals len(submolts_filtered) (or explain exceptions).
  • Key canonicalization
    • If raw data contains preducted_tags, it must be mapped to predicted_tags (and logged).
  • Dedup checks
    • post_id uniqueness; submolt_id uniqueness.
  • Build reproducibility
    • Same inputs → same output checksums (or same row counts + stable ordering).

These can run as a Python test suite (pytest) or as a node script with assertions; either is fine as long as it runs in CI.

2) Unit tests (web app)

Goal: correctness of state and query logic without running a browser.

  • Filter serialization
    • decode(encode(filters)) round-trips.
    • Backward compatibility when adding new filter fields.
  • Query builders
    • Given a filter state, generated SQL contains the right WHERE predicates.
    • “No filter” state returns unfiltered query.
  • Scoring functions
    • controversy formulas behave sensibly on edge cases:
      • zero votes, only upvotes, only downvotes, huge numbers.

Use a JS test runner (e.g., Vitest) and test pure functions.
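
A sketch of these pure-function tests with Vitest (module paths and fixture values are illustrative):

```ts
import { describe, it, expect } from 'vitest';
import { encodeStateToUrl, decodeStateFromUrl, EMPTY_FILTERS } from './filters'; // sketched earlier; paths illustrative
import { controversyScorers } from './scoring';

describe('filter serialization', () => {
  it('round-trips through the URL encoding', () => {
    const filters = { ...EMPTY_FILTERS, submolts: ['philosophy'], textQuery: 'agents' };
    expect(decodeStateFromUrl(encodeStateToUrl(filters))).toEqual(filters);
  });
});

describe('controversy scorers', () => {
  it('returns 0 for zero votes and one-sided votes', () => {
    expect(controversyScorers.balanced_volume(0, 0)).toBe(0);
    expect(controversyScorers.balanced_volume(100, 0)).toBe(0);
    expect(controversyScorers.min_both_sides(50, 0)).toBe(0);
  });
});
```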

3) Integration tests (web app, headless but app-level)

Goal: coordinated views stay consistent.

With a small fixed dataset fixture:

  • Brushing the timeline reduces counts in:
    • top submolts list
    • histograms
    • drill-down table row count
  • Clicking a tag in the tag view updates:
    • filter chips
    • Reader “tag page” content
    • the underlying posts list

These can be written as component tests (Playwright component tests or Testing Library + jsdom), depending on your preferred stack.

4) End-to-end tests (Playwright)

Goal: the operator-critical flows never break.

E2E scenarios:

  • Load + baseline render
    • App loads, shows Overview with non-empty charts.
  • Drill-down loop
    • Apply a filter (time brush) → open drill-down list → click a post → Reader renders full content.
  • Cross-view coordination
    • Select a submolt on the map → Overview updates → open submolt page shows correct description and representative posts.
  • Search
    • Search for a keyword → results list → open post → keyword highlighted.
  • Embeddings
    • Lasso selection → results count changes → “similar posts” returns non-empty list.
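
A sketch of one Playwright flow for the drill-down loop (selectors are placeholders; wire them to real test ids once the components exist):

```ts
import { test, expect } from '@playwright/test';

test('drill-down loop: filter → list → reader', async ({ page }) => {
  await page.goto('/'); // static build served locally
  await expect(page.getByTestId('overview-timeline')).toBeVisible();

  // Brushing is awkward to script precisely; a preset range button is a pragmatic stand-in.
  await page.getByTestId('time-filter-last-week').click();

  const firstRow = page.getByTestId('drilldown-row').first();
  await firstRow.click();
  await expect(page.getByTestId('reader-content')).not.toBeEmpty();
});
```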

5) Performance and regression tests

Goal: keep interactions “instant” as data grows.

  • Performance budgets
    • Cold load to interactive (TTI) under a target on a typical machine.
    • Timeline brush updates within a target (e.g., <200ms perceived).
  • Query time logging
    • Instrument DuckDB query durations; flag slow queries.
  • Visual regression (optional but valuable)
    • Screenshot key states (Overview, Map, Network, Embeddings) on a fixed dataset fixture.
    • Diff screenshots in CI to catch accidental UI breakage.

6) “Analyst sanity” tests (golden aggregates)

Goal: ensure charts match known expected numbers.

For a fixed small dataset fixture, store golden values:

  • total posts, total unique authors, total unique submolts
  • top-5 submolts by count
  • tag frequency counts for a known slice

Run these as part of CI so refactors don’t silently change semantics.

Tech stack suggestion (concrete)

  • Build/tooling: Vite + TypeScript
  • Data:
    • DuckDB-WASM + Parquet (Option A), or JSON + precomputed aggregates (Option B)
    • Arrow for efficient transfer
  • Charts:
    • Observable Plot for quick “analysis-grade” charts
    • D3 where custom interaction is needed
  • Networks: Sigma.js (Graphology) or similar WebGL renderer
  • Embedding scatter: deck.gl ScatterplotLayer or custom canvas plot
  • Search: FlexSearch
  • State: URL-synced state (filters encoded into query params) + local saved views

Risks & mitigations

  • No comment graph: can’t do thread analysis → emphasize content + counts + co-occurrence.
  • Vote counts may be noisy: treat engagement as “signals” not ground truth; add Integrity module and confidence notes.
  • Label quality (predicted_tags, class_notes): add quick qualitative sampling workflows and “mismatch cluster” detection via embeddings.