Iteration 1 spec: https://gist.github.com/aaronschachter/709c3d9a112a55efc96adc1a1d841137
Iteration 1 used a fan-out approach: query each collection separately via getCMSItems, merge server-side, return paginated results. This works for small sites but doesn't scale, and it gives the AI assistant nothing persistent to reason over.
Iteration 2 replaces fan-out with a PG-native content_index table living in the same CMS schema as the item data. A background job populates it. The table becomes both the data source for the Content Index UI and the knowledge base for a Content Agent.
CMS items already live in Postgres (Aurora) via the single-tenant PG storage layer. Each collection has:
- `s_{collectionId}` -- staged/draft items
- `p_{collectionId}` -- published/live items
The content_index table can be built with SQL queries across these existing tables, so there is no need to go through getCMSItems pagination (which caps at 100 items per request). Diffing staged vs. live is a join within the same schema.
cms_{storageLayerConfigName}_{databaseId}/
s_{collectionId_1} -- staged items (existing)
p_{collectionId_1} -- published items (existing)
s_{collectionId_2} -- staged items (existing)
p_{collectionId_2} -- published items (existing)
...
content_index -- NEW: denormalized status index
CREATE TABLE content_index (
item_id VARCHAR(24) NOT NULL,
collection_id VARCHAR(24) NOT NULL,
collection_name TEXT NOT NULL,
locale_id VARCHAR(24) NOT NULL DEFAULT '',
name TEXT NOT NULL DEFAULT '',
slug TEXT NOT NULL DEFAULT '',
status TEXT NOT NULL, -- 'published' | 'draft' | 'changes-in-draft' | 'queued-to-publish' | 'scheduled' | 'archived'
is_draft BOOLEAN NOT NULL DEFAULT FALSE,
is_archived BOOLEAN NOT NULL DEFAULT FALSE,
updated_on TIMESTAMPTZ,
published_on TIMESTAMPTZ,
scheduled_time TIMESTAMPTZ,
has_live_changes BOOLEAN NOT NULL DEFAULT FALSE,
changed_fields TEXT[] DEFAULT '{}', -- slugs of fields that differ between staged and live
built_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (item_id, collection_id, locale_id)
);
-- Primary query paths
CREATE INDEX idx_content_index_status ON content_index (status);
CREATE INDEX idx_content_index_collection ON content_index (collection_id, status);
CREATE INDEX idx_content_index_updated ON content_index (status, updated_on DESC);

Status is derived from staged item fields and cross-referenced with scheduled items and published tables:
| Status | Condition |
|---|---|
| `archived` | `_archived = true` |
| `scheduled` | Item ID in scheduled publish table, no error |
| `draft` | `_draft = true` |
| `changes-in-draft` | `published_on` exists AND `updated_on > published_on` |
| `published` | `published_on` exists AND `updated_on <= published_on` |
| `queued-to-publish` | Not draft, not published (or published with changes and `_draft = false`) |
Priority order matches ItemUtils.getStatus() from packages/shared/utils/ItemUtils.ts.
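As a rough illustration of the derivation table above (not the actual `ItemUtils.getStatus()` implementation; all names here are hypothetical), the priority order might look like:

```typescript
type ContentStatus =
  | 'published' | 'draft' | 'changes-in-draft'
  | 'queued-to-publish' | 'scheduled' | 'archived';

interface StagedItem {
  isArchived: boolean; // _archived column
  isDraft: boolean;    // _draft column
  updatedOn?: Date;    // updated_on
  publishedOn?: Date;  // published_on
}

// Checks run in priority order: the first matching condition wins.
function deriveStatus(item: StagedItem, isScheduled: boolean): ContentStatus {
  if (item.isArchived) return 'archived';
  if (isScheduled) return 'scheduled';
  if (item.isDraft) return 'draft';
  if (item.publishedOn && item.updatedOn && item.updatedOn > item.publishedOn) {
    return 'changes-in-draft';
  }
  if (item.publishedOn && item.updatedOn && item.updatedOn <= item.publishedOn) {
    return 'published';
  }
  // Not draft, not published: waiting for the next site publish.
  return 'queued-to-publish';
}
```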
For each staged item that has a corresponding row in `p_{collectionId}`:
- Join `s_{collectionId}` to `p_{collectionId}` on item ID
- Compare all user-defined columns (exclude internal columns `_id`, `id`, `l`, `p`, `_fileRefs`, `_mtok`)
- Collect slugs of columns where values differ into `changed_fields`
- Set `has_live_changes = true` if any differ
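A minimal sketch of the column comparison, assuming rows arrive as plain objects keyed by column slug; `diffFields` and its types are illustrative names, not the actual storage-layer code:

```typescript
// Internal columns excluded from the staged-vs-live comparison.
const INTERNAL_COLUMNS = new Set(['_id', 'id', 'l', 'p', '_fileRefs', '_mtok']);

type Row = Record<string, unknown>;

function diffFields(staged: Row, live: Row): { changedFields: string[]; hasLiveChanges: boolean } {
  const changedFields: string[] = [];
  // Walk the staged row's columns; this sketch does not handle columns
  // present only on the live side.
  for (const column of Object.keys(staged)) {
    if (INTERNAL_COLUMNS.has(column)) continue;
    // JSON stringify is a crude but sufficient equality check for CMS values.
    if (JSON.stringify(staged[column]) !== JSON.stringify(live[column])) {
      changedFields.push(column);
    }
  }
  return { changedFields, hasLiveChanges: changedFields.length > 0 };
}
```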
This powers:
- Content Index table: a badge or icon indicating "3 fields changed" on the row
- Item detail view: full field-by-field diff (queried on demand from the `s_` and `p_` tables, not stored in `content_index`)
The job runs when:
- The user lands on the Content Index panel and no `content_index` table exists (or `built_at` is stale)
- The user explicitly clicks "Rebuild Index" in the Content Index UI
- (Future) Automatically after CMS mutations via write-through hooks
1. Get CMSSchemaConnection for the database
2. List all s_* tables in the schema (these are staged collection tables)
3. For each collection:
a. Query all rows from s_{collectionId} (name, slug, _draft, _archived, updated-on, published-on)
b. Left join against p_{collectionId} to detect field-level diffs
c. Derive status per item
4. Fetch scheduled items via getScheduledSIPByDatabase (external call, not in PG yet)
5. Cross-reference: mark scheduled items with 'scheduled' status
6. TRUNCATE + INSERT into content_index (atomic via transaction)
7. Update built_at timestamp
- Each `s_{collectionId}` query is a sequential scan (no filter; we want all items). Fine for CMS scale (most collections are <10K items).
- The join for diff detection reads both tables but only compares column values, no complex transforms.
- Total job time for a site with 20 collections and 10K items each: estimated <10 seconds.
- The TRUNCATE + INSERT pattern avoids complex upsert logic. The table is fully rebuilt each time.
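The TRUNCATE + INSERT step could be sketched as follows, assuming a minimal client with a `query()` method (the real PG storage layer connection API may differ):

```typescript
interface PgClient {
  query(sql: string, params?: unknown[]): Promise<void>;
}

interface IndexRow {
  itemId: string;
  collectionId: string;
  localeId: string;
  status: string;
}

async function rebuildContentIndex(client: PgClient, rows: IndexRow[]): Promise<void> {
  await client.query('BEGIN');
  try {
    // Full rebuild: wipe and repopulate in one transaction so readers
    // never observe a half-built index.
    await client.query('TRUNCATE content_index');
    for (const row of rows) {
      await client.query(
        'INSERT INTO content_index (item_id, collection_id, locale_id, status) VALUES ($1, $2, $3, $4)',
        [row.itemId, row.collectionId, row.localeId, row.status],
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

A production version would batch the inserts (multi-row VALUES or COPY) rather than issuing one INSERT per item.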
The content index is kept in sync via application-level hooks in the PG storage layer operations. Each CMS mutation path updates the content_index table after the primary operation succeeds. Updates are non-blocking (fire-and-forget with error logging) so they don't add latency to the primary write path.
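A minimal sketch of the fire-and-forget pattern described above; `fireAndForgetIndexUpdate` and the logger shape are hypothetical, not existing code:

```typescript
function fireAndForgetIndexUpdate(
  updateIndexFn: () => Promise<void>,
  log: (msg: string, err: unknown) => void,
): void {
  // Intentionally not awaited by the caller: the primary CMS mutation has
  // already succeeded, and an index failure must not fail the request.
  updateIndexFn().catch((err) => {
    log('content_index update failed; index may be stale until next rebuild', err);
  });
}
```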
Hook point: CMSInsertOperations in the PG storage layer, after successful INSERT into s_{collectionId}.
Behavior: INSERT a new row into content_index. New items typically start as _draft: true, so status is draft. No live version exists yet, so has_live_changes = false and changed_fields = {}.
Hook point: CMSUpdateOperations in the PG storage layer, after successful UPDATE on s_{collectionId}.
Behavior: UPDATE the corresponding content_index row. Re-derive status (e.g., updated_on changing can flip published to changes-in-draft). Re-diff against p_{collectionId} to recompute changed_fields and has_live_changes. This is a single-row join, cheap.
Hook point: CMSRemoveOperations in the PG storage layer, after successful DELETE from s_{collectionId}.
Behavior: DELETE the corresponding row from content_index.
These are item updates that flip _draft or _archived. Same UPDATE path as item update, but status derivation changes accordingly.
Hook point: sitePublish.ts batch job, after publishSite() resolves and PUBLISH_FINISHED status is set (around line 592), before notifyMPSOfTaskStatusUpdate.
Behavior: Full rebuild of the content index. Site publish writes staged items to all p_* tables, potentially affecting every item's status and diff state. A full rebuild is the safest approach here, and publish already takes time so an extra few seconds is negligible. A targeted bulk UPDATE (SET status = 'published', has_live_changes = false, changed_fields = '{}') is faster but risks drift if status derivation logic changes.
The rebuild runs in the same job context (not forked), so it completes before the publish task status is broadcast to the designer.
Hook point: After publishSingleItemByScheduledId succeeds.
Behavior: Update the single row in content_index (set status to published, clear changed_fields, set has_live_changes = false, clear scheduled_time). Could also trigger a full rebuild since scheduled publishes are infrequent and a few seconds of extra work is acceptable.
All content index updates are non-critical. If an update fails:
- Log the error with context (item ID, collection ID, operation)
- Do NOT fail the primary CMS operation
- The index may become stale, but the next manual "Rebuild Index" or site publish will correct it
This follows the existing pattern in the codebase where non-critical side effects (e.g., component usage tracking) log errors but allow the primary operation to succeed.
- After individual mutations: Index is updated synchronously in the same request. If the update fails, the index is stale by one operation. Users can click "Rebuild Index" to correct.
- After site publish: Full rebuild guarantees consistency. The rebuild completes before publish status is broadcast.
- Staleness detection: The `built_at` column on each row tracks when it was last computed. The UI can show how fresh the index is and offer a rebuild option.
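A staleness check over `built_at` might look like this sketch (the one-hour default threshold is an assumed example, not a spec'd value):

```typescript
// True if the index was built longer ago than maxAgeMs.
function isIndexStale(builtAt: Date, now: Date, maxAgeMs = 60 * 60 * 1000): boolean {
  return now.getTime() - builtAt.getTime() > maxAgeMs;
}
```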
The content index is opt-in per database. It is only created when a user first visits the Content Index panel and clicks "Build Content Index" (or when the Content Agent is first used). There is no background cost for sites that don't use it.
If a user has no need for the Content Index views or Content Agent:
- The `content_index` table is never created in their CMS schema
- No write-through hooks fire (hooks check for table existence before attempting updates)
- No performance impact on CMS mutations
- The CMS panel behaves exactly as it does today (collection-scoped views)
The feature is gated behind isCmsContentIndexEnabled() (to be replaced with a Statsig flag ff-cms-content-index). When disabled:
- Content Index panel is not shown in the CMS panel hierarchy
- Write-through hooks are no-ops
- No table is created
Users who previously built the index and later opt out (or if the flag is disabled) retain the table but it goes stale. The table can be dropped as part of a cleanup pass, or simply ignored.
POST /api/v1/databases/:database_slug/content-index/build
Triggers the build job. Returns { status: 'building' }.
GET /api/v1/databases/:database_slug/content-index/status
Returns { status: 'ready' | 'building' | 'none', builtAt, itemCount }.
GET /api/v1/databases/:database_slug/content-index
Query params: status, collectionIds, limit, offset, sort
Reads from content_index table. Returns CMSContentIndexResult shape.
Fields: item_id, name, slug, collection_id, collection_name, status,
updated_on, published_on, scheduled_time, has_live_changes, changed_fields.
GET /api/v1/databases/:database_slug/content-index/items/:itemId/diff
Returns staged vs live field-by-field diff for a single item.
Queries s_{collectionId} and p_{collectionId} directly.
The existing GET /overview?status=... endpoint is preserved during transition but deprecated in favor of the content-index routes.
When the Content Index panel loads:
- Call `GET .../content-index/status`
- If `status === 'none'`: show empty state with a "Build Content Index" button
- If `status === 'building'`: show a progress indicator, poll every 2s
- If `status === 'ready'`: render the table, show the `builtAt` timestamp and item count
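The build/poll/ready lifecycle could be sketched as follows; `waitUntilReady` and `fetchStatus` are hypothetical helper names wrapping the status route:

```typescript
type IndexStatus = 'ready' | 'building' | 'none';

interface StatusResponse {
  status: IndexStatus;
  builtAt?: string;
  itemCount?: number;
}

// Poll the status endpoint until the index leaves the 'building' state.
async function waitUntilReady(
  fetchStatus: () => Promise<StatusResponse>,
  pollMs = 2000,
  maxAttempts = 150,
): Promise<StatusResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchStatus();
    if (res.status !== 'building') return res; // 'ready' or 'none'
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error('content index build timed out');
}
```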
- New column or badge: "Changes" showing the count from `changed_fields.length`
- Row click still navigates to the item edit panel
- (Future) Row click could open an item detail sheet showing the diff
The content_index table is exposed to the AI assistant via two tools:
Tool: get-content-index
Queries the content_index table with filters (status, collection, search by name)
Supports mode: 'items' (return rows) or 'summary' (return counts by status/collection)
Returns structured data the agent can reason over
Tool: get-item-diff
Given an item ID and collection ID, queries s_{collectionId} and p_{collectionId}
Returns field-by-field diff between staged and live versions
Example agent interactions:
- "What will change when I publish?" -> get-content-index with status='queued-to-publish' + has_live_changes, summarizes changed fields
- "Which collections have scheduled items?" -> get-content-index in summary mode, filtered to status='scheduled'
- "Show me stale content" -> get-content-index where updated_on is older than N months
- "What's different in the Blog Posts collection vs what's live?" -> get-content-index with has_live_changes=true for that collection, then get-item-diff for each
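As an illustration of summary mode, the tool might collapse index rows into counts by status per collection; the shape below is an assumption, not the actual tool contract:

```typescript
interface SummaryRow {
  collectionName: string;
  status: string;
}

// Group rows into { collectionName: { status: count } } for the agent.
function summarize(rows: SummaryRow[]): Record<string, Record<string, number>> {
  const summary: Record<string, Record<string, number>> = {};
  for (const row of rows) {
    const byStatus = (summary[row.collectionName] ??= {});
    byStatus[row.status] = (byStatus[row.status] ?? 0) + 1;
  }
  return summary;
}
```

In practice this aggregation would likely run as a GROUP BY in Postgres rather than in application code.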
| Thing | Name | Why |
|---|---|---|
| PG table | `content_index` | Describes what it is: an index of all content. Natural language for the build action ("Build Content Index", "Rebuild Index"). Works in the agent context ("query the content index"). |
| Status value | `changes-in-draft` | Matches what the designer UI shows. Internal code calls it `stagedChanges` but user-facing consistency wins. |
| UI panel | "Content Index" | Descriptive, implies completeness. Not "Overview" (too vague). |
| Job action | "Build Content Index" / "Rebuild Index" | Clear, intuitive. Users understand what "building an index" means. |
| Agent feature | "Content Agent" | The agent powered by the content index. |
| Iteration 1 | Iteration 2 |
|---|---|
| `item_overview` table | `content_index` table |
| CMS Overview panel | Content Index panel |
| `getCMSOverview` | `getContentIndex` |
| `buildItemOverview` job | `buildContentIndex` job |
| `get-cms-overview` tool | `get-content-index` tool |
| `CMSOverview.tsx` | `CMSContentIndex.tsx` |
| `useCMSOverview.ts` | `useContentIndex.ts` |
| `get_cms_summary` tool | Merged into `get-content-index` with `mode: 'summary'` |
- Add `content_index` table creation to the PG storage layer (alongside the existing `createTableAndIndexes`)
- Add a `buildContentIndex` job function
- Add new API routes (build, status, query, diff)
- Refactor existing overview endpoint to read from PG table instead of fan-out
- Rename frontend components (CMSOverview -> CMSContentIndex, etc.)
- Update frontend to handle job lifecycle (build/poll/ready)
- Register agent tools (get-content-index, get-item-diff)
- Remove fan-out query code
- Site Publishes: Expose publications to users. Each domain has an S3 `manifest.json` with a publication ID. Show which version of an item lives on which domain. Enable per-domain single-item publish. Call these "Site Publishes" with dates from the published database.
- Live version diffing in UI: Full JSON diff viewer in the item detail panel, showing exactly what fields changed between staged and each live domain.
- Discard draft changes: Revert staged item to match live version. Batch support.
- Real-time updates: WebSocket push when items change status.
- Collection health metrics: Broken references, empty required fields, SEO gaps, staleness.
- Per-item analytics: If Optimize or Google Analytics data becomes trackable per-item, `content_index` can accommodate additional columns.