E2E Scenario: Missing Index Detection → Approval → Auto-Remediation

This is the parent issue. Sub-issues are the component-level work items.

Objective

Build the first real end-to-end DBAgent scenario: detect a missing index causing high CPU, present the finding to the user, get approval, create the index, and then demonstrate autonomy progression by auto-resolving the same class of problem when the user sets an "always allow" instruction.

This scenario is the vehicle to build the first floor of every DBAgent component — real code, real data flow, real infrastructure. Each component is scoped to exactly what this scenario needs, with clear boundaries on what's deferred.

The Scenario

A PostgreSQL Flexible Server is running a workload that causes high CPU due to sequential scans on an unindexed column. An alert fires (via Azure Monitor, an internal metrics service, or a custom scheduling system — the alert source is a design decision TBD). DBAgent investigates, identifies the missing index, proposes CREATE INDEX CONCURRENTLY, gets human approval, and executes the fix. Then we repeat the scenario with an "always allow" instruction — DBAgent resolves it autonomously.

Demo Script

Beat 1 — Baseline: Dashboard shows a Flex Server with DBAgent enabled. No active issues.

Beat 2 — Problem: A workload script starts hammering queries against an unindexed column. CPU spikes above the alert threshold.

Beat 3 — Detection: Alert fires on high CPU → Platform Services receives the alert → dispatches a Run via Service Bus → Runner starts in AKS.

Beat 4 — Investigation: Runner connects to the Flex Server, queries pg_stat_statements and pg_stat_user_tables, identifies the top query is a sequential scan on orders.customer_id with no supporting index. Creates an Issue with severity, evidence, and a ProposedAction: CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id).

Beat 5 — Approval + Always Allow: Dashboard shows the Issue with the proposed fix. Three options: Approve (one-time), Skip, or Always Allow. User clicks Always Allow — this launches a Copilot chat session that assesses the risk level of always-allowing index creation, walks the user through any configuration (e.g., table size limits, time-of-day constraints), and once confirmed, sets the instruction AND approves the current action in one flow. Runner resumes, creates the index. Issue transitions to Resolved. CPU drops.

Beat 6 — Autonomous Resolution: A new workload hits a different unindexed column. DBAgent detects, investigates, finds the missing index — and this time auto-creates it because of the always-allow instruction set in Beat 5. No human in the loop. Dashboard shows: detected → auto-resolved.

Beat 7 — Timeline: Dashboard timeline shows both incidents. First: human-approved. Second: autonomous. "This is autonomy progression in action."

System Flow

═══════════════════════════════════════════════════════
 FIRST INCIDENT — Human Approval + Always Allow Setup
═══════════════════════════════════════════════════════

Alert Service (CPU > 80%)
    │
    ▼
Platform Services
    │ receives alert, validates server, acquires resource lock
    ▼
Service Bus (trigger message)
    │
    ▼
Runner (AKS job)
    │ Copilot SDK session: investigate → find missing index → propose CREATE INDEX
    │ Write Issue + ProposedAction to Cosmos DB
    │ Check autonomy: no always-allow instruction → suspend for approval
    ▼
Service Bus (completion/suspended)
    │
    ▼
Platform Services (updates Run state, holds lock)

    ... meanwhile ...

API ← reads from Cosmos → Dashboard shows Issue + Approval Request
    │
    User clicks "Always Allow"
    │
    ▼
Copilot chat session (local, in VS Code)
    │ Assesses risk of always-allowing index creation
    │ Walks user through configuration (scope, constraints)
    │ User confirms
    │
    ▼
API → writes Instruction (always-allow: create-index) to Cosmos
    → writes ApprovalDecision (approved) to Cosmos
    → sends Approval message to Service Bus
    │
    ▼
Platform Services
    │ validates lock, dispatches resume trigger
    ▼
Service Bus (resume trigger)
    │
    ▼
Runner resumes
    │ executes CREATE INDEX CONCURRENTLY
    │ Issue → Resolved
    ▼
Service Bus (completion)
    │
    ▼
Platform Services (releases lock)

═══════════════════════════════════════════════════════
 SECOND INCIDENT — Autonomous Resolution
═══════════════════════════════════════════════════════

Alert Service (CPU > 80%, different workload)
    │
    ▼
Platform Services
    │ receives alert, validates server, acquires resource lock
    ▼
Service Bus (trigger message)
    │
    ▼
Runner (AKS job)
    │ Copilot SDK session: investigate → find missing index on different table
    │ Write Issue + ProposedAction to Cosmos DB
    │ Check autonomy: always-allow instruction matches → auto-approve
    │ Execute CREATE INDEX CONCURRENTLY immediately
    │ Issue → Resolved
    ▼
Service Bus (completion)
    │
    ▼
Platform Services (releases lock)

Dashboard: shows both incidents — first human-approved, second autonomous

Success Criteria

The demo is complete when we can do the following live, with no mocks:

An alert on high CPU triggers a real Run through Platform Services → Service Bus → Runner
Runner investigates a real Flex Server and correctly identifies the missing index
Issue and ProposedAction appear in the Dashboard via the API (real data from Cosmos)
Approving in the Dashboard resumes the Runner, which creates the index
Setting "always allow index creation" causes the next occurrence to auto-resolve
The Dashboard timeline shows both incidents with their resolution paths

What's In Scope

One scenario: missing index causing high CPU under load
One trigger type: CPU alert (source is a Platform Services design decision — Azure Monitor, internal metrics service, or custom polling)
One action type: CREATE INDEX CONCURRENTLY
One autonomy instruction: "always allow index creation"
Real Service Bus messaging between all components
Real Cosmos DB persistence for domain objects
Real AKS deployment for Runner
Real Flex Server (or dev-equivalent) as the target

What's Deferred

Multiple scenarios / playbook library
Additional alert sources / detection mechanisms beyond what Platform Services implements for this scenario
Multi-model review of proposed actions
Follow-up Runs and scheduled monitoring
Full autonomy framework (per-type, per-action granularity)
Duroxide dehydrate/rehydrate across process restarts (can keep the process alive for the demo)
KEDA auto-scaling (single Runner pod is fine)
EV2 rollout (manual deployment for now)
Context Services (Signals, Structure, Knowledge as separate services — Runner queries PG directly for the demo)
Memory Service abstraction (instructions stored directly in Cosmos)

Sub-Issues

Contracts: Communication Plane message schemas (trigger, completion, approval)
Platform Services: Alert-triggered dispatch + approval resume flow
Runner: Investigate CPU spike, find missing index, propose + execute fix
API: Serve issues, handle approvals, manage instructions
Dashboard: Wire real API to issue/approval/instruction views
Scenario Infrastructure: Flex Server + workload + alert configuration

lossyrob/00-parent-e2e-scenario.md

Select an option

No results found