This is the parent issue. Sub-issues are the component-level work items.
Build the first real end-to-end DBAgent scenario: detect a missing index causing high CPU, present the finding to the user, get approval, create the index, and then demonstrate autonomy progression by auto-resolving the same class of problem when the user sets an "always allow" instruction.
This scenario is the vehicle to build the first floor of every DBAgent component — real code, real data flow, real infrastructure. Each component is scoped to exactly what this scenario needs, with clear boundaries on what's deferred.
A PostgreSQL Flexible Server is running a workload that causes high CPU due to sequential scans on an unindexed column. An alert fires (via Azure Monitor, an internal metrics service, or a custom scheduling system — the alert source is a design decision TBD). DBAgent investigates, identifies the missing index, proposes CREATE INDEX CONCURRENTLY, gets human approval, and executes the fix. Then we repeat the scenario with an "always allow" instruction — DBAgent resolves it autonomously.
Beat 1 — Baseline: Dashboard shows a Flex Server with DBAgent enabled. No active issues.
Beat 2 — Problem: A workload script starts hammering queries against an unindexed column. CPU spikes above the alert threshold.
Beat 3 — Detection: Alert fires on high CPU → Platform Services receives the alert → dispatches a Run via Service Bus → Runner starts in AKS.
Beat 4 — Investigation: Runner connects to the Flex Server, queries pg_stat_statements and pg_stat_user_tables, identifies the top query is a sequential scan on orders.customer_id with no supporting index. Creates an Issue with severity, evidence, and a ProposedAction: CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id).
Beat 5 — Approval + Always Allow: Dashboard shows the Issue with the proposed fix and three options: Approve (one-time), Skip, or Always Allow. The user clicks Always Allow, which launches a Copilot chat session that assesses the risk of always-allowing index creation, walks the user through any configuration (e.g., table size limits, time-of-day constraints), and, once the user confirms, sets the instruction AND approves the current action in one flow. Runner resumes, creates the index. Issue transitions to Resolved. CPU drops.
Beat 6 — Autonomous Resolution: A new workload hits a different unindexed column. DBAgent detects, investigates, finds the missing index — and this time auto-creates it because of the always-allow instruction set in Beat 5. No human in the loop. Dashboard shows: detected → auto-resolved.
Beat 7 — Timeline: Dashboard timeline shows both incidents. First: human-approved. Second: autonomous. "This is autonomy progression in action."
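The Beat 4 heuristic can be sketched as a small check over stats rows shaped like pg_stat_user_tables output. The threshold, row shape, and function name here are illustrative assumptions, not the Runner's actual logic:

```python
# Hypothetical sketch of the Beat 4 check: flag tables that are being
# sequentially scanned heavily with no index usage, then build the
# proposed fix. The threshold and row shape are assumptions, not the
# Runner's real contract.
SEQ_SCAN_THRESHOLD = 1_000  # assumed cutoff for "hammered by seq scans"

def propose_missing_index(table_stats, hot_column):
    """table_stats: dicts mimicking pg_stat_user_tables rows
    (relname, seq_scan, idx_scan). hot_column: the filter column that
    the top query from pg_stat_statements touches."""
    proposals = []
    for row in table_stats:
        seq_heavy = row["seq_scan"] > SEQ_SCAN_THRESHOLD
        never_indexed = row["idx_scan"] == 0
        if seq_heavy and never_indexed:
            table = row["relname"]
            proposals.append(
                f"CREATE INDEX CONCURRENTLY idx_{table}_{hot_column} "
                f"ON {table}({hot_column})"
            )
    return proposals

stats = [
    {"relname": "orders", "seq_scan": 48_210, "idx_scan": 0},
    {"relname": "customers", "seq_scan": 12, "idx_scan": 9_800},
]
print(propose_missing_index(stats, "customer_id"))
# → ['CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id)']
```

In the real Runner the inputs would come from live pg_stat_statements / pg_stat_user_tables queries; this sketch only shows the shape of the decision.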
═══════════════════════════════════════════════════════
FIRST INCIDENT — Human Approval + Always Allow Setup
═══════════════════════════════════════════════════════
Alert Service (CPU > 80%)
│
▼
Platform Services
│ receives alert, validates server, acquires resource lock
▼
Service Bus (trigger message)
│
▼
Runner (AKS job)
│ Copilot SDK session: investigate → find missing index → propose CREATE INDEX
│ Write Issue + ProposedAction to Cosmos DB
│ Check autonomy: no always-allow instruction → suspend for approval
▼
Service Bus (completion/suspended)
│
▼
Platform Services (updates Run state, holds lock)
... meanwhile ...
API ← reads from Cosmos → Dashboard shows Issue + Approval Request
│
User clicks "Always Allow"
│
▼
Copilot chat session (local, in VS Code)
│ Assesses risk of always-allowing index creation
│ Walks user through configuration (scope, constraints)
│ User confirms
│
▼
API → writes Instruction (always-allow: create-index) to Cosmos
→ writes ApprovalDecision (approved) to Cosmos
→ sends Approval message to Service Bus
│
▼
Platform Services
│ validates lock, dispatches resume trigger
▼
Service Bus (resume trigger)
│
▼
Runner resumes
│ executes CREATE INDEX CONCURRENTLY
│ Issue → Resolved
▼
Service Bus (completion)
│
▼
Platform Services (releases lock)
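The "Check autonomy" step in both flows reduces to matching the proposed action against stored always-allow instructions. A minimal sketch, assuming illustrative field names (`mode`, `action_type`, `max_table_size_gb`) rather than the real Cosmos schema:

```python
# Hypothetical autonomy check: an action auto-approves only when a stored
# always-allow instruction matches its action type and passes the
# constraints configured during the Copilot chat flow (e.g. a table size
# limit). Field names are assumptions, not the real schema.
def should_auto_approve(action, instructions):
    """action: e.g. {'type': 'create-index', 'table_size_gb': 4}
    instructions: always-allow records written in Beat 5."""
    for inst in instructions:
        if inst["mode"] != "always-allow":
            continue
        if inst["action_type"] != action["type"]:
            continue
        max_gb = inst.get("max_table_size_gb")
        if max_gb is not None and action["table_size_gb"] > max_gb:
            continue  # constraint violated → still needs a human
        return True
    return False

instructions = [{"mode": "always-allow", "action_type": "create-index",
                 "max_table_size_gb": 50}]
print(should_auto_approve(
    {"type": "create-index", "table_size_gb": 4}, instructions))  # → True
```

In the first incident this returns False (no instruction exists yet), so the Runner suspends; in the second it returns True and the Runner proceeds.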
═══════════════════════════════════════════════════════
SECOND INCIDENT — Autonomous Resolution
═══════════════════════════════════════════════════════
Alert Service (CPU > 80%, different workload)
│
▼
Platform Services
│ receives alert, validates server, acquires resource lock
▼
Service Bus (trigger message)
│
▼
Runner (AKS job)
│ Copilot SDK session: investigate → find missing index on different table
│ Write Issue + ProposedAction to Cosmos DB
│ Check autonomy: always-allow instruction matches → auto-approve
│ Execute CREATE INDEX CONCURRENTLY immediately
│ Issue → Resolved
▼
Service Bus (completion)
│
▼
Platform Services (releases lock)
Dashboard: shows both incidents — first human-approved, second autonomous
The demo is complete when we can do the following live, with no mocks:
- An alert on high CPU triggers a real Run through Platform Services → Service Bus → Runner
- Runner investigates a real Flex Server and correctly identifies the missing index
- Issue and ProposedAction appear in the Dashboard via the API (real data from Cosmos)
- Approving in the Dashboard resumes the Runner, which creates the index
- Setting "always allow index creation" causes the next occurrence to auto-resolve
- The Dashboard timeline shows both incidents with their resolution paths
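The Run lifecycle the checklist exercises can be sketched as a small state machine covering both paths: suspend-for-approval in the first incident, auto-approve in the second. State and event names are assumptions, not the real Platform Services model:

```python
# Hypothetical Run state machine for both incidents. The first path
# suspends for approval; the second auto-approves. State/event names
# are illustrative only.
TRANSITIONS = {
    ("pending", "dispatch"): "investigating",
    ("investigating", "needs_approval"): "suspended",
    ("investigating", "auto_approved"): "executing",
    ("suspended", "approved"): "executing",
    ("executing", "done"): "resolved",
}

def advance(state, event):
    return TRANSITIONS[(state, event)]

# First incident: human in the loop.
path = ["pending"]
for ev in ["dispatch", "needs_approval", "approved", "done"]:
    path.append(advance(path[-1], ev))
print(path)  # → ['pending', 'investigating', 'suspended', 'executing', 'resolved']
```

The second incident would take the shorter `auto_approved` edge, skipping `suspended` entirely, which is the visible difference the dashboard timeline shows.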
In scope (the single vertical slice this scenario needs):
- One scenario: missing index causing high CPU under load
- One trigger type: CPU alert (source is a Platform Services design decision — Azure Monitor, internal metrics service, or custom polling)
- One action type: CREATE INDEX CONCURRENTLY
- One autonomy instruction: "always allow index creation"
- Real Service Bus messaging between all components
- Real Cosmos DB persistence for domain objects
- Real AKS deployment for Runner
- Real Flex Server (or dev-equivalent) as the target
Explicitly deferred (not needed to land this scenario):
- Multiple scenarios / playbook library
- Additional alert sources / detection mechanisms beyond what Platform Services implements for this scenario
- Multi-model review of proposed actions
- Follow-up Runs and scheduled monitoring
- Full autonomy framework (per-type, per-action granularity)
- Duroxide dehydrate/rehydrate across process restarts (can keep the process alive for the demo)
- KEDA auto-scaling (single Runner pod is fine)
- EV2 rollout (manual deployment for now)
- Context Services (Signals, Structure, Knowledge as separate services — Runner queries PG directly for the demo)
- Memory Service abstraction (instructions stored directly in Cosmos)
Component sub-issues (one per component, scoped to exactly what this scenario needs):
- Contracts: Communication Plane message schemas (trigger, completion, approval)
- Platform Services: Alert-triggered dispatch + approval resume flow
- Runner: Investigate CPU spike, find missing index, propose + execute fix
- API: Serve issues, handle approvals, manage instructions
- Dashboard: Wire real API to issue/approval/instruction views
- Scenario Infrastructure: Flex Server + workload + alert configuration
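For the Scenario Infrastructure item, the Beat 2 workload can be sketched as a generator of point lookups against the unindexed column, which forces sequential scans. Table name, column name, and id range are placeholders; a real driver would execute these against the Flex Server via a Postgres client such as psycopg:

```python
# Hypothetical Beat 2 workload generator: emit point lookups against an
# unindexed column so the planner is forced into sequential scans.
# Table/column names and the id range are placeholders.
import random

def seq_scan_workload(n_queries, seed=0):
    rng = random.Random(seed)  # seeded → reproducible demo runs
    return [
        f"SELECT count(*) FROM orders WHERE customer_id = {rng.randint(1, 100_000)}"
        for _ in range(n_queries)
    ]

for q in seq_scan_workload(3):
    print(q)
```

Run in a tight loop against a large enough `orders` table, queries like these push CPU past the alert threshold; after the index exists, the same workload becomes cheap index lookups and CPU drops, which is the before/after the demo relies on.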