Skip to content

Instantly share code, notes, and snippets.

@lossyrob
Created February 27, 2026 16:57
Show Gist options
  • Select an option

  • Save lossyrob/c8b7b8fda23e4dc8caed92e6ed31505f to your computer and use it in GitHub Desktop.

Select an option

Save lossyrob/c8b7b8fda23e4dc8caed92e6ed31505f to your computer and use it in GitHub Desktop.
DBAgent E2E Scenario: Missing Index Detection → Approval → Auto-Remediation

E2E Scenario: Missing Index Detection → Approval → Auto-Remediation

This is the parent issue. Sub-issues are the component-level work items.

Objective

Build the first real end-to-end DBAgent scenario: detect a missing index causing high CPU, present the finding to the user, get approval, create the index, and then demonstrate autonomy progression by auto-resolving the same class of problem when the user sets an "always allow" instruction.

This scenario is the vehicle to build the first floor of every DBAgent component — real code, real data flow, real infrastructure. Each component is scoped to exactly what this scenario needs, with clear boundaries on what's deferred.

The Scenario

A PostgreSQL Flexible Server is running a workload that causes high CPU due to sequential scans on an unindexed column. An alert fires (via Azure Monitor, an internal metrics service, or a custom scheduling system — the alert source is a design decision TBD). DBAgent investigates, identifies the missing index, proposes CREATE INDEX CONCURRENTLY, gets human approval, and executes the fix. Then we repeat the scenario with an "always allow" instruction — DBAgent resolves it autonomously.

Demo Script

Beat 1 — Baseline: Dashboard shows a Flex Server with DBAgent enabled. No active issues.

Beat 2 — Problem: A workload script starts hammering queries against an unindexed column. CPU spikes above the alert threshold.

Beat 3 — Detection: Alert fires on high CPU → Platform Services receives the alert → dispatches a Run via Service Bus → Runner starts in AKS.

Beat 4 — Investigation: Runner connects to the Flex Server, queries pg_stat_statements and pg_stat_user_tables, identifies the top query is a sequential scan on orders.customer_id with no supporting index. Creates an Issue with severity, evidence, and a ProposedAction: CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id).

Beat 5 — Approval + Always Allow: Dashboard shows the Issue with the proposed fix. Three options: Approve (one-time), Skip, or Always Allow. User clicks Always Allow — this launches a Copilot chat session that assesses the risk level of always-allowing index creation, walks the user through any configuration (e.g., table size limits, time-of-day constraints), and once confirmed, sets the instruction AND approves the current action in one flow. Runner resumes, creates the index. Issue transitions to Resolved. CPU drops.

Beat 6 — Autonomous Resolution: A new workload hits a different unindexed column. DBAgent detects, investigates, finds the missing index — and this time auto-creates it because of the always-allow instruction set in Beat 5. No human in the loop. Dashboard shows: detected → auto-resolved.

Beat 7 — Timeline: Dashboard timeline shows both incidents. First: human-approved. Second: autonomous. "This is autonomy progression in action."

System Flow

═══════════════════════════════════════════════════════
 FIRST INCIDENT — Human Approval + Always Allow Setup
═══════════════════════════════════════════════════════

Alert Service (CPU > 80%)
    │
    ▼
Platform Services
    │ receives alert, validates server, acquires resource lock
    ▼
Service Bus (trigger message)
    │
    ▼
Runner (AKS job)
    │ Copilot SDK session: investigate → find missing index → propose CREATE INDEX
    │ Write Issue + ProposedAction to Cosmos DB
    │ Check autonomy: no always-allow instruction → suspend for approval
    ▼
Service Bus (completion/suspended)
    │
    ▼
Platform Services (updates Run state, holds lock)

    ... meanwhile ...

API ← reads from Cosmos → Dashboard shows Issue + Approval Request
    │
    User clicks "Always Allow"
    │
    ▼
Copilot chat session (local, in VS Code)
    │ Assesses risk of always-allowing index creation
    │ Walks user through configuration (scope, constraints)
    │ User confirms
    │
    ▼
API → writes Instruction (always-allow: create-index) to Cosmos
    → writes ApprovalDecision (approved) to Cosmos
    → sends Approval message to Service Bus
    │
    ▼
Platform Services
    │ validates lock, dispatches resume trigger
    ▼
Service Bus (resume trigger)
    │
    ▼
Runner resumes
    │ executes CREATE INDEX CONCURRENTLY
    │ Issue → Resolved
    ▼
Service Bus (completion)
    │
    ▼
Platform Services (releases lock)

═══════════════════════════════════════════════════════
 SECOND INCIDENT — Autonomous Resolution
═══════════════════════════════════════════════════════

Alert Service (CPU > 80%, different workload)
    │
    ▼
Platform Services
    │ receives alert, validates server, acquires resource lock
    ▼
Service Bus (trigger message)
    │
    ▼
Runner (AKS job)
    │ Copilot SDK session: investigate → find missing index on different table
    │ Write Issue + ProposedAction to Cosmos DB
    │ Check autonomy: always-allow instruction matches → auto-approve
    │ Execute CREATE INDEX CONCURRENTLY immediately
    │ Issue → Resolved
    ▼
Service Bus (completion)
    │
    ▼
Platform Services (releases lock)

Dashboard: shows both incidents — first human-approved, second autonomous

Success Criteria

The demo is complete when we can do the following live, with no mocks:

  • An alert on high CPU triggers a real Run through Platform Services → Service Bus → Runner
  • Runner investigates a real Flex Server and correctly identifies the missing index
  • Issue and ProposedAction appear in the Dashboard via the API (real data from Cosmos)
  • Approving in the Dashboard resumes the Runner, which creates the index
  • Setting "always allow index creation" causes the next occurrence to auto-resolve
  • The Dashboard timeline shows both incidents with their resolution paths

What's In Scope

  • One scenario: missing index causing high CPU under load
  • One trigger type: CPU alert (source is a Platform Services design decision — Azure Monitor, internal metrics service, or custom polling)
  • One action type: CREATE INDEX CONCURRENTLY
  • One autonomy instruction: "always allow index creation"
  • Real Service Bus messaging between all components
  • Real Cosmos DB persistence for domain objects
  • Real AKS deployment for Runner
  • Real Flex Server (or dev-equivalent) as the target

What's Deferred

  • Multiple scenarios / playbook library
  • Additional alert sources / detection mechanisms beyond what Platform Services implements for this scenario
  • Multi-model review of proposed actions
  • Follow-up Runs and scheduled monitoring
  • Full autonomy framework (per-type, per-action granularity)
  • Duroxide dehydrate/rehydrate across process restarts (can keep the process alive for the demo)
  • KEDA auto-scaling (single Runner pod is fine)
  • EV2 rollout (manual deployment for now)
  • Context Services (Signals, Structure, Knowledge as separate services — Runner queries PG directly for the demo)
  • Memory Service abstraction (instructions stored directly in Cosmos)

Sub-Issues

  • Contracts: Communication Plane message schemas (trigger, completion, approval)
  • Platform Services: Alert-triggered dispatch + approval resume flow
  • Runner: Investigate CPU spike, find missing index, propose + execute fix
  • API: Serve issues, handle approvals, manage instructions
  • Dashboard: Wire real API to issue/approval/instruction views
  • Scenario Infrastructure: Flex Server + workload + alert configuration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment