@aaronschachter · Last active April 16, 2026

CMS Content Index - Iteration 2: PG-native Content Index + Content Agent

Iteration 1 spec: https://gist.github.com/aaronschachter/709c3d9a112a55efc96adc1a1d841137

What changed since iteration 1

Iteration 1 used a fan-out approach: query each collection separately via getCMSItems, merge server-side, return paginated results. This works for small sites but doesn't scale, and it gives the AI assistant nothing persistent to reason over.

Iteration 2 replaces fan-out with a PG-native content_index table living in the same CMS schema as the item data. A background job populates it. The table becomes both the data source for the Content Index UI and the knowledge base for a Content Agent.

Key insight

CMS items already live in Postgres (Aurora) via the single-tenant PG storage layer. Each collection has:

  • s_{collectionId} (staged/draft items)
  • p_{collectionId} (published/live items)

The content_index table can be built with SQL queries across these existing tables; there is no need to go through getCMSItems pagination (which caps at 100 items per request). Diffing staged vs. live is a join within the same schema.

Architecture

Schema location

cms_{storageLayerConfigName}_{databaseId}/
  s_{collectionId_1}          -- staged items (existing)
  p_{collectionId_1}          -- published items (existing)
  s_{collectionId_2}          -- staged items (existing)
  p_{collectionId_2}          -- published items (existing)
  ...
  content_index               -- NEW: denormalized status index

Table: content_index

CREATE TABLE content_index (
  item_id         VARCHAR(24) NOT NULL,
  collection_id   VARCHAR(24) NOT NULL,
  collection_name TEXT NOT NULL,
  locale_id       VARCHAR(24) NOT NULL DEFAULT '',
  name            TEXT NOT NULL DEFAULT '',
  slug            TEXT NOT NULL DEFAULT '',
  status          TEXT NOT NULL,  -- 'published' | 'draft' | 'changes-in-draft' | 'queued-to-publish' | 'scheduled' | 'archived'
  is_draft        BOOLEAN NOT NULL DEFAULT FALSE,
  is_archived     BOOLEAN NOT NULL DEFAULT FALSE,
  updated_on      TIMESTAMPTZ,
  published_on    TIMESTAMPTZ,
  scheduled_time  TIMESTAMPTZ,
  has_live_changes BOOLEAN NOT NULL DEFAULT FALSE,
  changed_fields  TEXT[] DEFAULT '{}',  -- slugs of fields that differ between staged and live
  built_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  PRIMARY KEY (item_id, collection_id, locale_id)
);

-- Primary query paths
CREATE INDEX idx_content_index_status ON content_index (status);
CREATE INDEX idx_content_index_collection ON content_index (collection_id, status);
CREATE INDEX idx_content_index_updated ON content_index (status, updated_on DESC);

Status derivation

Status is derived from staged item fields and cross-referenced with scheduled items and published tables:

Status             Condition
archived           _archived = true
scheduled          item ID in scheduled publish table, no error
draft              _draft = true
changes-in-draft   published_on exists AND updated_on > published_on
published          published_on exists AND updated_on <= published_on
queued-to-publish  not draft, not published (or published with changes and _draft = false)

Priority order matches ItemUtils.getStatus() from packages/shared/utils/ItemUtils.ts.
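The priority order above can be sketched as a pure function. This is illustrative, not the actual ItemUtils.getStatus() implementation; the names StagedItem, deriveStatus, and isScheduled are assumptions.

```typescript
type ItemStatus =
  | 'published'
  | 'draft'
  | 'changes-in-draft'
  | 'queued-to-publish'
  | 'scheduled'
  | 'archived';

interface StagedItem {
  archived: boolean;      // _archived on the staged row
  draft: boolean;         // _draft on the staged row
  updatedOn: Date | null;
  publishedOn: Date | null;
}

// Checks run in priority order; the first match wins.
function deriveStatus(item: StagedItem, isScheduled: boolean): ItemStatus {
  if (item.archived) return 'archived';
  if (isScheduled) return 'scheduled';
  if (item.draft) return 'draft';
  if (item.publishedOn && item.updatedOn && item.updatedOn > item.publishedOn) {
    return 'changes-in-draft';
  }
  if (item.publishedOn) return 'published';
  return 'queued-to-publish';
}
```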

Live diff detection

For each staged item that has a corresponding row in p_{collectionId}:

  1. Join s_{collectionId} to p_{collectionId} on item ID
  2. Compare all user-defined columns (exclude internal columns _id, id, l, p, _fileRefs, _mtok)
  3. Collect slugs of columns where values differ into changed_fields
  4. Set has_live_changes = true if any differ

This powers:

  • Content Index table: a badge or icon indicating "3 fields changed" on the row
  • Item detail view: full field-by-field diff (queried on demand from s_ and p_ tables, not stored in content_index)
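A minimal sketch of the column comparison in steps 2 to 4, assuming rows arrive as plain objects keyed by column name; diffRows and INTERNAL_COLUMNS are illustrative names.

```typescript
// Internal columns excluded from diffing, per the list above.
const INTERNAL_COLUMNS = new Set(['_id', 'id', 'l', 'p', '_fileRefs', '_mtok']);

function diffRows(
  staged: Record<string, unknown>,
  live: Record<string, unknown>,
): { hasLiveChanges: boolean; changedFields: string[] } {
  // Collect user-defined columns whose values differ between staged and live.
  const changedFields = Object.keys(staged)
    .filter((col) => !INTERNAL_COLUMNS.has(col))
    .filter((col) => JSON.stringify(staged[col]) !== JSON.stringify(live[col]));
  return { hasLiveChanges: changedFields.length > 0, changedFields };
}
```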

Build job

Trigger

The job runs when:

  1. User lands on Content Index panel and no content_index table exists (or built_at is stale)
  2. User explicitly clicks "Rebuild Index" in the Content Index UI
  3. (Future) Automatically after CMS mutations via write-through hooks

Job flow

1. Get CMSSchemaConnection for the database
2. List all s_* tables in the schema (these are staged collection tables)
3. For each collection:
   a. Query all rows from s_{collectionId} (name, slug, _draft, _archived, updated-on, published-on)
   b. Left join against p_{collectionId} to detect field-level diffs
   c. Derive status per item
4. Fetch scheduled items via getScheduledSIPByDatabase (external call, not in PG yet)
5. Cross-reference: mark scheduled items with 'scheduled' status
6. TRUNCATE + INSERT into content_index (atomic via transaction)
7. Update built_at timestamp
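The flow above might be orchestrated roughly as follows. The Query type and rebuildContentIndex name are assumptions, and the SELECT list is elided since it just maps staged/live columns onto the content_index schema shown earlier.

```typescript
type Query = (sql: string) => Promise<unknown>;

async function rebuildContentIndex(
  query: Query,
  collectionIds: string[],
): Promise<void> {
  await query('BEGIN');
  try {
    // Step 6: wipe and repopulate atomically inside one transaction.
    await query('TRUNCATE content_index');
    for (const id of collectionIds) {
      // Steps 3a-3b: copy staged rows, left-joined to live for diff state.
      await query(
        `INSERT INTO content_index /* column list elided */\n` +
          `SELECT /* derived columns elided */\n` +
          `FROM s_${id} s LEFT JOIN p_${id} p ON s._id = p._id`,
      );
    }
    await query('COMMIT');
  } catch (err) {
    await query('ROLLBACK');
    throw err;
  }
}
```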

Performance considerations

  • Each s_{collectionId} query is a sequential scan (no filter; we want every item). That's fine at CMS scale, where most collections have fewer than 10K items.
  • The join for diff detection reads both tables but only compares column values, no complex transforms.
  • Total job time for a site with 20 collections and 10K items each: estimated <10 seconds.
  • The TRUNCATE + INSERT pattern avoids complex upsert logic. The table is fully rebuilt each time.

Keeping the index in sync

The content index is kept in sync via application-level hooks in the PG storage layer operations. Each CMS mutation path updates the content_index table after the primary operation succeeds. Updates are non-blocking (fire-and-forget with error logging) so they don't add latency to the primary write path.
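A sketch of the fire-and-forget pattern described above; updateIndexNonBlocking, HookContext, and the logger signature are illustrative, not existing code.

```typescript
interface HookContext {
  itemId: string;
  collectionId: string;
  operation: 'create' | 'update' | 'delete';
}

// Fire-and-forget: the promise is intentionally not awaited by the primary
// write path, and a failure only produces a log line.
function updateIndexNonBlocking(
  update: () => Promise<void>,
  context: HookContext,
  logError: (message: string, context: HookContext) => void,
): void {
  update().catch((err) => {
    logError(`content_index update failed: ${String(err)}`, context);
  });
}
```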

Mutation paths

Item create

Hook point: CMSInsertOperations in the PG storage layer, after successful INSERT into s_{collectionId}.

Behavior: INSERT a new row into content_index. New items typically start as _draft: true, so status is draft. No live version exists yet, so has_live_changes = false and changed_fields = {}.

Item update

Hook point: CMSUpdateOperations in the PG storage layer, after successful UPDATE on s_{collectionId}.

Behavior: UPDATE the corresponding content_index row. Re-derive status (e.g., updated_on changing can flip published to changes-in-draft). Re-diff against p_{collectionId} to recompute changed_fields and has_live_changes. This is a single-row join, cheap.

Item delete

Hook point: CMSRemoveOperations in the PG storage layer, after successful DELETE from s_{collectionId}.

Behavior: DELETE the corresponding row from content_index.

Status change (draft/archive/unarchive)

These are item updates that flip _draft or _archived. Same UPDATE path as item update, but status derivation changes accordingly.

Site publish

Hook point: sitePublish.ts batch job, after publishSite() resolves and PUBLISH_FINISHED status is set (around line 592), before notifyMPSOfTaskStatusUpdate.

Behavior: Full rebuild of the content index. Site publish writes staged items to all p_* tables, potentially affecting every item's status and diff state. A full rebuild is the safest approach here, and publish already takes time so an extra few seconds is negligible. A targeted bulk UPDATE (SET status = 'published', has_live_changes = false, changed_fields = '{}') is faster but risks drift if status derivation logic changes.

The rebuild runs in the same job context (not forked), so it completes before the publish task status is broadcast to the designer.

Scheduled publish (single item)

Hook point: After publishSingleItemByScheduledId succeeds.

Behavior: Update the single row in content_index (set status to published, clear changed_fields, set has_live_changes = false, clear scheduled_time). Could also trigger a full rebuild since scheduled publishes are infrequent and a few seconds of extra work is acceptable.
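The single-row update might look like this; the statement is a sketch against the content_index schema above, with $1/$2 as the item and collection IDs.

```typescript
// Touches built_at so the UI can report row freshness.
const markItemPublishedSql = `
  UPDATE content_index
     SET status = 'published',
         has_live_changes = FALSE,
         changed_fields = '{}',
         scheduled_time = NULL,
         built_at = NOW()
   WHERE item_id = $1 AND collection_id = $2`;
```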

Error handling

All content index updates are non-critical. If an update fails:

  • Log the error with context (item ID, collection ID, operation)
  • Do NOT fail the primary CMS operation
  • The index may become stale, but the next manual "Rebuild Index" or site publish will correct it

This follows the existing pattern in the codebase where non-critical side effects (e.g., component usage tracking) log errors but allow the primary operation to succeed.

Consistency guarantees

  • After individual mutations: Index is updated synchronously in the same request. If the update fails, the index is stale by one operation. Users can click "Rebuild Index" to correct.
  • After site publish: Full rebuild guarantees consistency. The rebuild completes before publish status is broadcast.
  • Staleness detection: The built_at column on each row tracks when it was last computed. The UI can show how fresh the index is and offer a rebuild option.

Opt-out / opt-in

The content index is opt-in per database. It is only created when a user first visits the Content Index panel and clicks "Build Content Index" (or when the Content Agent is first used). There is no background cost for sites that don't use it.

If a user has no need for the Content Index views or Content Agent:

  • The content_index table is never created in their CMS schema
  • No write-through hooks fire (hooks check for table existence before attempting updates)
  • No performance impact on CMS mutations
  • The CMS panel behaves exactly as it does today (collection-scoped views)

The feature is gated behind isCmsContentIndexEnabled() (to be replaced with a Statsig flag ff-cms-content-index). When disabled:

  • Content Index panel is not shown in the CMS panel hierarchy
  • Write-through hooks are no-ops
  • No table is created

Users who previously built the index and later opt out (or whose flag is disabled) keep the table, but it goes stale. The table can be dropped as part of a cleanup pass, or simply ignored.

API changes

New routes

POST /api/v1/databases/:database_slug/content-index/build
  Triggers the build job. Returns { status: 'building' }.

GET /api/v1/databases/:database_slug/content-index/status
  Returns { status: 'ready' | 'building' | 'none', builtAt, itemCount }.

GET /api/v1/databases/:database_slug/content-index
  Query params: status, collectionIds, limit, offset, sort
  Reads from content_index table. Returns CMSContentIndexResult shape.
  Fields: item_id, name, slug, collection_id, collection_name, status,
          updated_on, published_on, scheduled_time, has_live_changes, changed_fields.

GET /api/v1/databases/:database_slug/content-index/items/:itemId/diff
  Returns staged vs live field-by-field diff for a single item.
  Queries s_{collectionId} and p_{collectionId} directly.
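Client code could build the query-route URL like this; contentIndexUrl and its parameters are illustrative, and only the path shape comes from the routes above.

```typescript
function contentIndexUrl(
  databaseSlug: string,
  query: Record<string, string>,
): string {
  // Query params per the spec: status, collectionIds, limit, offset, sort.
  const params = new URLSearchParams(query);
  return `/api/v1/databases/${databaseSlug}/content-index?${params.toString()}`;
}
```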

Backward compatibility

The existing GET /overview?status=... endpoint is preserved during transition but deprecated in favor of the content-index routes.

Frontend changes

Content Index landing state

When the Content Index panel loads:

  1. Call GET .../content-index/status
  2. If status === 'none': show empty state with "Build Content Index" button
  3. If status === 'building': show progress indicator, poll every 2s
  4. If status === 'ready': render the table, show builtAt timestamp and item count
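The branching above reduces to a small pure function; the names nextLandingState and LandingState are illustrative.

```typescript
type IndexStatus = 'ready' | 'building' | 'none';

type LandingState =
  | { view: 'empty' }                          // "Build Content Index" button
  | { view: 'progress'; pollAfterMs: number }  // progress indicator
  | { view: 'table' };                         // rows + builtAt + item count

function nextLandingState(status: IndexStatus): LandingState {
  switch (status) {
    case 'none':
      return { view: 'empty' };
    case 'building':
      return { view: 'progress', pollAfterMs: 2000 }; // poll every 2s
    case 'ready':
      return { view: 'table' };
  }
}
```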

Table enhancements

  • New column or badge: "Changes" showing count from changed_fields.length
  • Row click still navigates to item edit panel
  • (Future) Row click could open an item detail sheet showing the diff

Content Agent integration

The content_index table is exposed to the AI assistant through two tools:

Tool: get-content-index
  Queries the content_index table with filters (status, collection, search by name)
  Supports mode: 'items' (return rows) or 'summary' (return counts by status/collection)
  Returns structured data the agent can reason over

Tool: get-item-diff
  Given an item ID and collection ID, queries s_{collectionId} and p_{collectionId}
  Returns field-by-field diff between staged and live versions
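The two tools could be described with plain JSON-schema-style descriptors. The exact registration API is an assumption; parameter names follow the spec.

```typescript
const getContentIndexTool = {
  name: 'get-content-index',
  description:
    'Query the content index by status, collection, or name search; ' +
    "mode 'summary' returns counts instead of rows.",
  parameters: {
    type: 'object',
    properties: {
      status: { type: 'string' },
      collectionIds: { type: 'array', items: { type: 'string' } },
      search: { type: 'string' },
      mode: { type: 'string', enum: ['items', 'summary'] },
    },
  },
} as const;

const getItemDiffTool = {
  name: 'get-item-diff',
  description: 'Field-by-field diff between staged and live for one item.',
  parameters: {
    type: 'object',
    properties: {
      itemId: { type: 'string' },
      collectionId: { type: 'string' },
    },
    required: ['itemId', 'collectionId'],
  },
} as const;
```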

Example agent interactions:

  • "What will change when I publish?" -> get-content-index with status='queued-to-publish' + has_live_changes, summarizes changed fields
  • "Which collections have scheduled items?" -> get-content-index in summary mode, filtered to status='scheduled'
  • "Show me stale content" -> get-content-index where updated_on is older than N months
  • "What's different in the Blog Posts collection vs what's live?" -> get-content-index with has_live_changes=true for that collection, then get-item-diff for each

Naming decisions

  • PG table: content_index. Describes what it is: an index of all content. Reads naturally in the build action ("Build Content Index", "Rebuild Index") and in the agent context ("query the content index").
  • Status value: changes-in-draft. Matches what the designer UI shows. Internal code calls it stagedChanges, but user-facing consistency wins.
  • UI panel: "Content Index". Descriptive and implies completeness; not "Overview" (too vague).
  • Job action: "Build Content Index" / "Rebuild Index". Clear and intuitive; users understand what "building an index" means.
  • Agent feature: "Content Agent". The agent powered by the content index.

Rename map (iteration 1 -> iteration 2)

Iteration 1              Iteration 2
item_overview table      content_index table
CMS Overview panel       Content Index panel
getCMSOverview           getContentIndex
buildItemOverview job    buildContentIndex job
get-cms-overview tool    get-content-index tool
CMSOverview.tsx          CMSContentIndex.tsx
useCMSOverview.ts        useContentIndex.ts
get_cms_summary tool     merged into get-content-index with mode: 'summary'

Migration path from iteration 1

  1. Add content_index table creation to the PG storage layer (alongside existing createTableAndIndexes)
  2. Add buildContentIndex job function
  3. Add new API routes (build, status, query, diff)
  4. Refactor existing overview endpoint to read from PG table instead of fan-out
  5. Rename frontend components (CMSOverview -> CMSContentIndex, etc.)
  6. Update frontend to handle job lifecycle (build/poll/ready)
  7. Register agent tools (get-content-index, get-item-diff)
  8. Remove fan-out query code

Future considerations (not in scope)

  • Site Publishes: Expose publications to users. Each domain has an S3 manifest.json with a publication ID. Show which version of an item lives on which domain. Enable per-domain single-item publish. Call these "Site Publishes" with dates from the published database.
  • Live version diffing in UI: Full JSON diff viewer in the item detail panel, showing exactly what fields changed between staged and each live domain.
  • Discard draft changes: Revert staged item to match live version. Batch support.
  • Real-time updates: WebSocket push when items change status.
  • Collection health metrics: Broken references, empty required fields, SEO gaps, staleness.
  • Per-item analytics: If Optimize or Google Analytics data becomes trackable per-item, content_index can accommodate additional columns.