
Predictive Service Availability and Capacity Planning

Project Proposal for Executive Review

Version 1.0
Author David Park
Date December 2024

Executive Summary

Our global service infrastructure operates across multiple work units (clusters) worldwide, each capable of generating millions of metrics and log lines per second. Today, critical decisions around capacity planning, traffic failover, and incident response rely heavily on human judgment and tribal knowledge. Operators manually interpret dashboards, estimate work unit capacity, and determine optimal traffic routing during incidents—processes that are time-sensitive, error-prone, and do not scale.

This project proposes building an intelligent system that leverages our existing observability infrastructure—time-series metrics, centralized logging, distributed tracing, our anomaly detection system, and our data warehouse—to provide predictive capacity planning, automated failover recommendations, and post-incident analysis. The system will use Claude (via Claude Code and the Claude API) as the reasoning layer to interpret signals, correlate data across sources, and generate actionable recommendations.

The approach is deliberately phased: we begin in shadow mode, where the system observes and recommends alongside human operators without taking action. Only after building confidence through validated recommendations do we consider progressive automation. The guiding principle throughout is "do no harm"—the system's failure mode is always to alert humans rather than take potentially harmful autonomous action.


Current State

Infrastructure Overview

Our service runs globally across multiple work units, each representing a complete deployment of our service stack in a specific region. Each work unit generates telemetry at scale:

  • Metrics: Collected via a metrics aggregation system that reduces cardinality by aggregating across hundreds of identical service instances, plus raw time-series metrics for select high-value signals. Metrics are shipped to a centralized metrics bucket work unit and eventually loaded into the Data Warehouse via ETL.

  • Logs: Ingested by local log indexers within each work unit, then shipped to centralized indexers for retention (90-day window).

  • Traces: Distributed tracing is available, though current coverage and sampling strategy require clarification.

  • Anomaly Detection: A bespoke internal system provides current-state and historical anomaly data via API.

  • Data Warehouse: A columnar analytics database that ETLs all telemetry into a common schema, providing up to 2 years of historical data for batch and ad-hoc analysis.

Current Operational Challenges

Capacity Planning: Work unit capacity limits are based on historical load tests and empirical observation from traffic migrations. There is no systematic, data-driven model of true capacity thresholds or the factors that determine them.

Failover Decisions: When a work unit shows signs of distress (typically observed as drops in network connection metrics), a human operator must:

  1. Recognize the problem from dashboard observations
  2. Decide to initiate failover
  3. Determine which work unit(s) should receive the traffic based on mental models of regional proximity and capacity
  4. Execute the failover manually

This process depends on operator experience and availability. The decision criteria are not codified, and there is no systematic way to validate that the destination work unit can absorb the additional load.

Traffic Patterns: Traffic follows predictable sinusoidal patterns tied to local time zones (peak usage around 9-10 PM local time). Work units experience "thundering herd" spikes when first receiving traffic, followed by steady-state operation once devices are tethered. These patterns are well-understood qualitatively but not modeled quantitatively.

Post-Incident Analysis: After a failover event, understanding root cause requires manually correlating metrics, logs, and traces across systems—a time-consuming process that often yields incomplete answers.


Project Goals

Primary Goals

  1. Capacity Planning and Traffic Forecasting: Build predictive models for traffic volume by region and time, enabling proactive capacity decisions. Understand and quantify the true capacity limits of each work unit based on historical data.

  2. Cost Modeling for Traffic Migration: Provide visibility into cloud provider cost implications when migrating traffic between regions, enabling cost-aware routing decisions.

  3. Load Shedding Analysis: Answer the question "if work unit X sheds load, can work units Y and Z absorb it?" with data-driven confidence, accounting for current utilization and capacity headroom.

  4. Near Real-Time Failover Recommendations: Monitor work unit health at 5-minute granularity (or finer) and generate failover recommendations that account for:

    • Current health signals and anomaly state
    • Regional proximity (minimize TTFB impact)
    • Destination capacity and current load
    • Thundering herd considerations for cold work units
  5. Post-Incident Analysis and Remediation: After a failover event, automatically correlate relevant metrics, logs, and traces to identify root cause and generate remediation recommendations.

  6. Claude-Powered Reasoning: Use Claude (via Claude Code for early adopters, Claude API for production agents) as the intelligent layer that interprets signals, answers ad-hoc questions, and generates human-readable recommendations.

Success Criteria

  • Shadow mode recommendations match human operator decisions ≥90% of the time
  • Time-to-decision for failover reduced from minutes to seconds
  • Capacity forecasts accurate within ±15% over 30-day horizons
  • Post-incident root cause identification completed within 15 minutes of incident close
  • Zero automated actions taken without explicit human approval until shadow mode validation complete

Team Requirements and Investment

Team Structure

| Role | Count | Responsibility |
|------|-------|----------------|
| Tech Lead / Architect | 1 | System design, cross-team coordination, technical decisions |
| Backend Engineers | 2-3 | Core platform, decision engine, agent framework |
| Data/ML Engineer | 1 | Forecasting models, capacity baselines, analytics pipelines |
| SRE (embedded) | 1 | Production readiness, observability, runbooks |

Total: 5-6 engineers for build phase

Ongoing Operations: Existing SRE team absorbs operational responsibility post-launch. System designed for minimal operational overhead with standard Kubernetes patterns, automated health checks, and self-healing agents. See Operations Handoff section below for details.

AI-Assisted Development Adoption

The AI-assisted estimates assume the team is proficient with Claude Code. We recommend a one-week bootcamp before project work begins, followed by continued learning while delivering:

Week 0: AI Development Bootcamp (dedicated time, before Phase 0)

| Day | Focus | Activities |
|-----|-------|------------|
| Day 1-2 | Setup & Fundamentals | Dev environment configuration, Claude Code installation, API access, basic prompting techniques, MCP server concepts |
| Day 3-4 | Capstone: MCP Server Prototype | Group builds a working MCP server that queries one data source (e.g., time-series metrics). Real Phase 0 deliverable: team practices AI-assisted development while producing shippable code, rotating through driver/navigator roles, reviewing each other's AI-generated PRs |
| Day 5 | Team Workflows | Refine PR review norms based on capstone experience, document shared prompt patterns that worked, establish team conventions, retrospective |

Weeks 1-4: Learning While Delivering (parallel with Phase 0)

| Milestone | What It Looks Like |
|-----------|--------------------|
| Week 1-2 | Applying bootcamp skills to real tasks; occasional check-ins with AI tooling champion |
| Week 3-4 | Developing personal patterns; full proficiency with autonomous sessions |

Key success factors:

  • Designate an AI tooling champion to establish team conventions and share learnings
  • Start with well-scoped, low-risk tasks during ramp-up (e.g., boilerplate, tests, documentation)
  • Establish code review norms for AI-generated code (same standards as human-written code)
  • Create shared prompt libraries and MCP configurations for common project tasks

Recommendation: Run Phase 0 with 1-2 engineers who have prior Claude Code experience. Use this phase to develop team-specific patterns before scaling to the full team in Phase 1.

Engineering Investment Estimate

The following estimates assume leveraging existing observability infrastructure and focus purely on engineering effort.

| Phase | Traditional Approach | AI-Assisted Approach | Notes |
|-------|----------------------|----------------------|-------|
| Phase 0: Foundation | 6-8 engineer-weeks | 3-4 engineer-weeks | Data access validation, MCP server scaffolding |
| Phase 1: Shadow Mode | 16-20 engineer-weeks | 8-10 engineer-weeks | Core decision engine, capacity models, health monitoring |
| Phase 2: Assisted | 12-16 engineer-weeks | 6-8 engineer-weeks | Operator tooling, post-incident automation, work unit agents |
| Phase 3: Progressive Automation | 8-12 engineer-weeks | 4-6 engineer-weeks | Automation runbooks, staged rollout |
| Total | 42-56 engineer-weeks | 21-28 engineer-weeks | |

AI-Assisted Approach: Assumes use of Claude Code for rapid prototyping, code generation, and iterative development. Estimates based on internal pilot projects using AI-assisted development; actual results will vary based on task complexity and team familiarity with AI tooling.

Return on Investment

Investment pays back through:

  • Reduced incident duration: Faster failover decisions prevent cascading failures and customer impact
  • Reduced operator toil: Automated analysis replaces hours of manual dashboard correlation
  • Prevented outages: Proactive capacity alerts catch issues before they become incidents
  • Operator scaling: System handles routine decisions, freeing SREs for higher-value work

Quantified ROI depends on current incident frequency and cost-per-incident metrics, which should be gathered during Phase 0.


Operational Concerns

Operations Handoff

Once the system reaches production, the following responsibilities transfer to the SRE team. Most routine operations are automated; human intervention is primarily required for approval gates and exception handling.

| Category | Automated | Human Intervention (Gating) |
|----------|-----------|-----------------------------|
| System Health | Self-healing pods; automated alerting; health dashboards | Escalation for alerts that don't self-resolve |
| Analytics Model Maintenance | Scheduled retraining pipelines; automated validation | Approve model updates before prod deployment |
| LLM Operations | API key rotation; usage/cost dashboards with alerts | Approve version upgrades; investigate anomalies |
| Access Control | SSO integration; automated provisioning | Approve elevated access; periodic reviews |
| Audit & Compliance | Automated report generation; log retention | Monthly audit review; compliance inquiries |

Estimated human effort: 4-8 hours/week, trending lower once stable.

Handoff deliverables:

  • Operational runbooks and architecture docs
  • Monitoring dashboards and alert definitions
  • On-call escalation procedures
  • Knowledge transfer sessions (2-3 sessions, 2 hours each)

Infrastructure Cost Estimate (AWS)

All components are Kubernetes-native, deployed via Helm. Estimates use on-demand pricing; Reserved Instances reduce costs 30-50%.

POC / Phase 0-1:

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Orchestrator + MCP | 2x m5.large (sidecar pattern) | $140 |
| Analytics (DuckDB) | 1x r6i.large + 100GB gp3 | $100 |
| Work Unit Pod (1 unit) | 1x g4dn.xlarge (agent + Llama 3 8B) | $380 |
| Networking | ALB + data transfer | $50 |
| Claude API | Low volume | $500 |
| POC Total | | ~$1,200/month |

Production (multi-AZ, Kubernetes-native HA):

Control Plane (3 replicas across AZs, K8s handles failover):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Orchestrator + MCP | 3x m5.large across 3 AZs | $210 |
| Analytics | 1x r6i.large + 200GB gp3 (with snapshots) | $120 |
| Networking | ALB (multi-AZ) + cross-region transfer | $150 |
| Claude API | Production volume with caching | $1,500-2,500 |
| Control Plane Total | | ~$2,000-3,000/month |

Per-Work-Unit (1 active pod, 1 standby in different AZ):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Work Unit Pod (active) | 1x g4dn.xlarge (agent + LLM) | $380 |
| Work Unit Pod (standby) | 1x g4dn.xlarge (stopped until failover) | $0* |
| Storage | 50GB gp3 (model + state) | $4 |
| Cross-region transfer | To central orchestrator | $20 |
| Per-Unit Total | | ~$400/month |

*Standby instances stopped by default; only pay for EBS. Start on failover via K8s scaling.

Scaling:

| Work Units | Work Unit Cost | Control Plane | Total Monthly |
|------------|----------------|---------------|---------------|
| 5 | $2,000 | $2,500 | ~$4,500 |
| 10 | $4,000 | $2,800 | ~$6,800 |
| 15 | $6,000 | $3,100 | ~$9,100 |
| 20 | $8,000 | $3,400 | ~$11,400 |

Non-Prod Environments (single AZ, no HA):

| Environment | Monthly Cost |
|-------------|--------------|
| Staging | $600 |
| Acceptance | $600 |
| Non-Prod Total | ~$1,200/month |

Deployment:

  • Helm charts with environment-specific values (values-prod.yaml, values-staging.yaml)
  • CI/CD: merge to main → staging, tagged release → prod

Cost optimization: Reserved Instances (30-50% off), right-size after POC, Claude API caching.


Proposed Architecture

Conceptual Layers

Layer 1: Data Sources

All existing telemetry systems feed into the architecture:

  • Time-series metrics system (per-work-unit, ~2-week retention, real-time)
  • Centralized logging system (logs, 90-day retention)
  • Distributed tracing system
  • Anomaly Detection System (current state + historical via API)
  • Data Warehouse (2-year historical, batch queries)

Layer 2: Pre-Computed Analytics

Rather than querying raw data at decision time, we build derived datasets optimized for each use case:

  • Capacity Baselines: Per-work-unit capacity models derived from historical load data. Implementation: Analyze Data Warehouse metrics from past peak events; identify which resource (CPU, memory, connections, network) saturates first; set threshold at 80% of observed max. Requires: Access to historical peak load data; validation that past peaks represent true limits.
  • Traffic Affinity Matrix: Region-to-work-unit mapping based on empirical TTFB data. Implementation: Daily batch job queries P50/P95 TTFB by source region and destination work unit; ranks work units per region. Requires: TTFB data accessible in Data Warehouse with region tagging.
  • Headroom Calculations: Current utilization vs. baseline capacity. Implementation: Scheduled query (every 5 min) computes (baseline - current) / baseline per work unit. Requires: Real-time metrics accessible from central location (a sketch of this calculation follows this list).
  • Cost Models: Cloud provider pricing by traffic path. Implementation: Static lookup table of region-to-region transfer costs; updated manually when pricing changes. Requires: Documented cloud pricing for relevant regions.
  • Traffic Forecasts: Predict load by work unit. Implementation: Prophet model trained on 90-day historical load data; captures daily/weekly seasonality. Requires investigation: Validate Prophet handles our traffic patterns; may need custom seasonality for regional holidays.
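
A minimal sketch of the capacity-baseline and headroom calculations described above, using DuckDB over a hypothetical Parquet extract of peak-load history; the column names and the choice of `connections` as the limiting resource are assumptions pending the Phase 0 data review.

```python
import duckdb

SAFETY_FACTOR = 0.80  # threshold at 80% of observed max, per the baseline definition above


def capacity_baselines(parquet_path: str) -> dict[str, float]:
    """Derive per-work-unit connection baselines from historical peak data.

    Assumes a hypothetical Parquet extract with columns
    (work_unit, ts, connections); real schemas come from the DW team in Phase 0.
    """
    rows = duckdb.sql(
        f"""
        SELECT work_unit, MAX(connections) AS observed_max
        FROM read_parquet('{parquet_path}')
        GROUP BY work_unit
        """
    ).fetchall()
    return {work_unit: observed_max * SAFETY_FACTOR for work_unit, observed_max in rows}


def headroom(baseline: float, current: float) -> float:
    """Headroom fraction as defined above: (baseline - current) / baseline."""
    return (baseline - current) / baseline


# Example: a work unit at 70k connections against a 100k-connection baseline
# has headroom(100_000, 70_000) == 0.3, i.e., 30% headroom.
```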

Layer 3: Decision Engine

Two operational modes:

Near Real-Time (Goals #4, #5):

  • Polls anomaly detection system API and time-series metrics every 5 minutes (interval configurable; faster polling requires validating API rate limits)
  • Compares current utilization against pre-computed capacity baselines; flags when headroom drops below threshold (e.g., <20%)
  • When anomaly detected OR headroom low: queries traffic affinity matrix for ranked failover targets; filters to targets with sufficient headroom; generates recommendation with reasoning
  • Outputs recommendation to human operators via alerting integration (shadow mode) or triggers automated workflow via traffic control API (future state, requires investigation: confirm traffic control API exists and supports programmatic failover)
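
A minimal sketch of this near real-time evaluation pass, assuming hypothetical helpers (`fetch_anomalies`, `fetch_utilization`, `ranked_targets`) that wrap the anomaly-detection API, the metrics system, and the affinity matrix; the real integrations are Phase 1 work.

```python
POLL_INTERVAL_S = 300       # 5-minute cadence; configurable, subject to API rate limits
HEADROOM_THRESHOLD = 0.20   # flag when headroom drops below 20%


def evaluate(work_unit, baselines, fetch_anomalies, fetch_utilization, ranked_targets):
    """Return a recommendation dict for one work unit, or None if healthy.

    fetch_anomalies / fetch_utilization / ranked_targets are hypothetical
    callables standing in for the anomaly API, metrics API, and affinity matrix.
    """
    anomalies = fetch_anomalies(work_unit)
    current = fetch_utilization(work_unit)
    baseline = baselines[work_unit]
    headroom = (baseline - current) / baseline

    if not anomalies and headroom >= HEADROOM_THRESHOLD:
        return None  # healthy: nothing to recommend

    # Candidate targets from the affinity matrix, filtered to those with spare headroom.
    candidates = [
        t for t in ranked_targets(work_unit)
        if (baselines[t] - fetch_utilization(t)) / baselines[t] >= HEADROOM_THRESHOLD
    ]
    return {
        "work_unit": work_unit,
        "signals": {"anomalies": anomalies, "headroom": round(headroom, 2)},
        "recommended_targets": candidates[:2],  # top-ranked viable destinations
        "mode": "shadow",  # shadow mode: logged and alerted, never executed
    }

# The outer loop simply sleeps POLL_INTERVAL_S between evaluation passes over all work units.
```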

Batch/Interactive (Goals #1, #2, #3):

  • Runs Prophet forecasting models against Data Warehouse historical data; outputs 30/60/90-day load predictions per work unit (see the sketch after this list)
  • Answers ad-hoc "what-if" queries via Claude: user asks natural language question → Claude translates to Data Warehouse query + capacity model lookup → returns computed answer with assumptions stated
  • Generates periodic capacity planning reports via scheduled jobs that query forecasts and current headroom, output to dashboards or email
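
A minimal sketch of the Prophet forecasting step referenced above, using Prophet's standard `ds`/`y` dataframe interface; assembling the hourly `history` frame from the Data Warehouse, and any custom seasonality for regional holidays, remain open items.

```python
import pandas as pd
from prophet import Prophet


def forecast_load(history: pd.DataFrame, horizon_days: int = 30) -> pd.DataFrame:
    """Fit a Prophet model on historical load and predict the next horizon_days.

    `history` must carry Prophet's expected columns: ds (timestamp) and y (load),
    here assumed to be hourly. Daily and weekly seasonality capture the
    sinusoidal, time-zone-driven traffic pattern described earlier.
    """
    model = Prophet(daily_seasonality=True, weekly_seasonality=True)
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_days * 24, freq="h")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```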

Layer 4: Agent Architecture

A multi-tier agent topology provides resilience and appropriate separation of concerns:

Central Orchestrator:

  • Runs as a Kubernetes pod in an admin cluster
  • Runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning and cross-work-unit coordination
  • Has visibility into all work units via aggregated data
  • Makes or recommends traffic-affecting decisions
  • Maintains audit log of all observations, recommendations, and actions

Work Unit Agents:

  • Deployed within each work unit
  • Run a lightweight local LLM for fast, local queries (target: sub-100ms for routine health checks; requires validation during POC with actual hardware and model size); an example local-inference call follows this list
  • Can operate independently if central orchestrator is unreachable
  • Provide local context to central orchestrator on request
  • Limited to "safe" autonomous actions (e.g., local alerting) without central coordination
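
As an illustration of the local-inference path, a work unit agent could call its co-located model over Ollama's local REST endpoint so routine queries never leave the work unit. A minimal sketch; the model tag, prompt, and timeout are placeholders to validate during the POC.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local REST endpoint


def summarize_health(metrics_snapshot: dict) -> str:
    """Ask the co-located Llama 3 model to summarize a local health snapshot.

    Stays entirely inside the work unit: no external API call, no customer data.
    """
    prompt = (
        "Summarize the health of this work unit in two sentences, "
        f"flagging any metric outside normal range: {metrics_snapshot}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3:8b", "prompt": prompt, "stream": False},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```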

Communication:

  • Agents communicate via authenticated, attested channels (see Security section)
  • Central orchestrator queries work unit agents; work unit agents do not initiate traffic-affecting actions independently
  • Graceful degradation: if central cannot reach a work unit agent, it flags the gap rather than assuming state
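
A minimal sketch of this query-and-degrade behavior on the orchestrator side; `query_agent` is a hypothetical RPC wrapper (for example, built on the mTLS channel described in the Security section).

```python
def collect_agent_states(agents, query_agent):
    """Poll every work unit agent; never assume state for unreachable agents.

    `agents` maps work-unit name -> endpoint; `query_agent` is a hypothetical
    RPC wrapper that raises on timeout or authentication failure.
    """
    states = {}
    for name, endpoint in agents.items():
        try:
            states[name] = query_agent(endpoint)      # local context from the agent
        except Exception as exc:                      # unreachable: flag, don't guess
            states[name] = {"status": "unknown", "reason": str(exc)}
    return states

# Downstream decision logic treats "unknown" as a reason to alert humans,
# never as evidence that the work unit is healthy or unhealthy.
```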

Data Flow Summary

```mermaid
flowchart TB
    subgraph DataSources["Data Sources (Layer 1)"]
        METRICS[Time-Series Metrics]
        LOGS[Centralized Logging]
        TRACES[Distributed Tracing]
        ANOMALY[Anomaly Detection]
        DW[Data Warehouse]
    end

    subgraph Analytics["Pre-Computed Analytics (Layer 2)"]
        REALTIME[Real-Time Analytics<br/>Headroom, Health Signals]
        BATCH[Batch Analytics<br/>Capacity Baselines, Affinity Matrix,<br/>Cost Models, Forecasts]
    end

    subgraph Decision["Decision Engine (Layer 3)"]
        ENGINE[Decision Engine<br/>Near Real-Time + Interactive]
    end

    subgraph Agents["Agent Architecture (Layer 4)"]
        CENTRAL[Central Orchestrator]
        WU_AGENT1[Work Unit Agent]
        WU_AGENT2[Work Unit Agent]
        WU_AGENT3[Work Unit Agent]
    end

    subgraph Outputs["Outputs"]
        HUMAN[Human Operators]
        AUTO[Automated Workflow]
        VIZ[Dashboards]
    end

    METRICS --> REALTIME
    LOGS --> REALTIME
    TRACES --> REALTIME
    ANOMALY --> REALTIME
    DW --> BATCH

    REALTIME --> ENGINE
    BATCH --> ENGINE

    ENGINE --> CENTRAL

    CENTRAL <--> WU_AGENT1
    CENTRAL <--> WU_AGENT2
    CENTRAL <--> WU_AGENT3

    CENTRAL --> HUMAN
    CENTRAL --> AUTO
    CENTRAL --> VIZ
```

Human Interaction Model

The system is designed to augment human decision-making, not replace it. This section describes how different roles interact with the system and the value each interaction provides.

Operator Experience During Incidents

When a work unit shows signs of distress, on-call operators currently must interpret multiple dashboards, recall tribal knowledge about capacity, and make high-stakes decisions under pressure.

With this system:

  • Operators receive a clear, prioritized recommendation via existing alerting channels (Slack, PagerDuty, etc.): "Recommend failover of Region-A traffic to Region-B. Region-B has 34% headroom and lowest TTFB for affected customers. Region-C is not recommended due to current 78% utilization."
  • The recommendation includes confidence level (high/medium/low based on signal agreement) and the signals that triggered it
  • Operators can ask follow-up questions in natural language via Chat/CLI interface: "What happens if we split traffic between B and C instead?" (requires: MCP server with traffic model access; Claude interprets question, queries model, returns computed answer)
  • One-click acknowledgment to log the decision, whether they follow the recommendation or override it (requires: simple UI or Slack workflow integration)

Value: Faster decisions, reduced cognitive load, consistent decision-making regardless of operator experience level.

Capacity Planning Workflows

Capacity planners currently rely on manual dashboard review, interpretation of historical trends, and periodic ad-hoc analysis to forecast needs.

With this system:

  • Interactive dashboards show 30/60/90-day traffic forecasts by region
  • "What-if" scenarios answer questions like: "If we decommission Work Unit X, can the remaining units handle the load during peak hours?"
  • Cost projections for traffic migration between regions
  • Proactive alerts when forecasted demand approaches capacity thresholds

Value: Data-driven capacity decisions, reduced over-provisioning costs, earlier identification of scaling needs.

Post-Incident Analysis

After an incident, understanding root cause requires hours of manual correlation across metrics, logs, and traces.

With this system:

  • Automated timeline generation: query audit logs + anomaly detection history + operator actions for the incident window; display as chronological event stream. Implementation: Straightforward log aggregation and sorting.
  • Correlated signals: for each event in timeline, fetch related metrics/logs/traces within a ±5-minute window; display in unified view. Implementation: Requires trace IDs or timestamps to correlate across systems; may require investigation if correlation keys aren't available (see the sketch after this list).
  • AI-generated root cause summary: Claude analyzes the timeline and correlated signals, generates hypothesis with cited evidence. Limitation: Quality depends on signal availability; novel failure modes may produce speculative answers that require human validation.
  • Suggested remediation: Claude compares against previous incidents stored in knowledge base. Requires: Building a structured incident history database during Phase 1-2; initially this feature will have limited data to draw from.
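
A minimal sketch of the timeline merge and ±5-minute correlation described above; the event sources and the `ts` field are assumed shapes, since the real correlation keys depend on what the logging and tracing systems expose.

```python
from datetime import timedelta

CORRELATION_WINDOW = timedelta(minutes=5)


def build_timeline(*event_sources):
    """Merge events from several sources into one chronological stream.

    Each source is an iterable of dicts with at least a 'ts' (datetime) key,
    e.g., audit-log entries, anomaly history, and operator actions.
    """
    events = [e for source in event_sources for e in source]
    return sorted(events, key=lambda e: e["ts"])


def correlate(event, signals):
    """Return signals (metric/log/trace records) within ±5 minutes of an event."""
    return [s for s in signals if abs(s["ts"] - event["ts"]) <= CORRELATION_WINDOW]
```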

Value: Faster post-incident reviews, more complete root cause analysis, institutional memory that doesn't depend on individual engineers.

Leadership and Planning Reviews

Executives and engineering leadership need visibility into infrastructure health and capacity trends without diving into operational details.

With this system:

  • Weekly/monthly capacity reports generated automatically
  • Trend analysis showing utilization patterns and headroom across all work units
  • Cost attribution for traffic routing decisions
  • Risk assessment highlighting work units approaching capacity limits

Value: Clear visibility into infrastructure investment needs, data-backed budget discussions, reduced surprise escalations.

System Administration

With AI-assisted operations, most system administration is fully automated. Humans intervene only when automation fails or decisions require approval.

Rolling Updates and Upgrades

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Dependabot/Renovate opens PRs for dependency updates (requires initial setup) | Approve PR if breaking changes detected |
| CI/CD runs tests, builds images, deploys to staging on PR merge | Review failed test results |
| Staging validation (health checks + smoke tests) gates prod promotion | Investigate if staging validation fails |
| K8s performs rolling restart with readiness probes; auto-rollback on failure | Post-mortem if rollback triggered |

Human touch: Only when tests fail, health checks fail, or breaking changes need review. Requires: CI/CD pipeline setup, container registry, Helm chart structure.

Debugging System Issues

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Standard K8s/Prometheus alerting flags error rate or latency anomalies | Investigate alerts that don't auto-resolve |
| Diagnostic dashboard aggregates recent logs, traces, API responses | Deep-dive when root cause unclear |
| Self-healing: K8s restarts failed pods; circuit breakers recover automatically | Exec into pod only for novel failure modes |
| Claude analyzes logs on-demand when operator requests assistance | Validate AI hypothesis before applying fix |

Human touch: Only for novel failures that self-healing can't resolve. Note: "AI-assisted log analysis" requires operator to invoke Claude; not fully autonomous.

Scaling and Resource Tuning

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| HPA scales replicas based on CPU/memory thresholds (requires HPA configuration during build) | Review if scaling events are excessive |
| Cluster Autoscaler provisions nodes as needed (requires autoscaler setup) | Investigate if node provisioning fails |
| Cost monitoring dashboards with threshold alerts | Approve infrastructure cost increases |
| Quarterly review: analyze usage data, propose right-sizing (manual analysis, not AI-driven initially) | Approve major instance type changes |

Human touch: Only for cost approvals or when auto-scaling behaves unexpectedly. Note: "AI recommends instance types" is a future enhancement, not Phase 1.

Configuration Changes

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Capacity baselines recalculated weekly via scheduled job | Review if baseline changes significantly (>10% drift) |
| Traffic affinity matrix refreshes daily via batch job | Investigate if any region loses viable failover targets |
| Threshold change proposals generated as draft PRs (requires building this workflow) | Approve PR (required gate) |
| CI/CD validates in staging, auto-promotes if tests pass | Review if staging validation fails |

Human touch: PR approval (intentional gate) and exception handling. Note: "Auto-generated PRs" requires building integration between analytics and Git; this is Phase 2 work.

Disaster Recovery

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Health checks detect pod/node failure; K8s reschedules automatically | Investigate if rescheduling repeatedly fails |
| Standby pods start via K8s scaling (requires pre-configured scaling policies) | Investigate if standby fails to start |
| Analytics data restored from scheduled EBS snapshots (requires snapshot configuration) | Validate data integrity post-restore |
| Work unit agents reconnect to central via standard K8s service discovery | Manual re-registration if agent state corrupted |

Human touch: Only when automated recovery fails. Note: Full regional DR (control plane failover to different region) is not in initial scope; would require additional infrastructure.

Bottom line: The system runs itself. Humans approve gates, investigate exceptions, and handle novel failures—not routine operations.

Interaction Channels

| Role | Primary Interface | Interaction Type |
|------|-------------------|------------------|
| On-Call Operator | Chat/CLI + existing alerting tools | Real-time recommendations, natural language queries |
| Capacity Planner | Dashboards + interactive analysis UI | Forecasts, what-if scenarios, reports |
| Incident Commander | Chat/CLI during incident | Situational awareness, decision support |
| Post-Incident Reviewer | Investigation UI | Timeline, correlation, root cause summary |
| Engineering Leadership | Scheduled reports + dashboards | Trends, forecasts, risk assessment |

Phased Implementation Approach

Phase 0: Foundation

Establish data access and validate assumptions.

  1. Confirm Data Warehouse interface requirements with DW team; define query contract for historical data access
  2. Validate time-series metrics accessibility from central location; confirm metrics availability and retention
  3. Document anomaly detection system API and available signals
  4. Clarify distributed tracing coverage and sampling strategy
  5. Set up Claude Code environment for early adopter experimentation
  6. Define initial MCP server interface for Claude Code integration

Exit Criteria: Documented data access patterns for all source systems; working Claude Code prototype that can query at least one data source.
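
As an illustration of that exit criterion, the Phase 0 prototype might expose a single metrics query as a Claude Code tool. A minimal sketch using the MCP Python SDK's FastMCP helper; the Prometheus endpoint, metric name, and tool shape are assumptions to be replaced with the real metrics API.

```python
# Minimal MCP server sketch (MCP Python SDK); exposes one Prometheus query as a tool.
import requests
from mcp.server.fastmcp import FastMCP

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # placeholder endpoint

mcp = FastMCP("capacity-metrics")


@mcp.tool()
def work_unit_utilization(work_unit: str) -> dict:
    """Return the current connection count for a work unit (hypothetical metric name)."""
    resp = requests.get(
        PROM_URL,
        params={"query": f'sum(active_connections{{work_unit="{work_unit}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; register in Claude Code's MCP config
```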

Phase 1: Shadow Mode - Observation and Recommendation

Build the core decision engine in observation-only mode.

  1. Implement capacity baseline models using historical Data Warehouse data
  2. Build traffic affinity matrix from empirical TTFB data
  3. Create near real-time health monitoring pipeline (time-series metrics + anomaly detection)
  4. Develop failover recommendation logic
  5. Deploy central orchestrator in shadow mode—generates recommendations, logs decisions, but takes no action
  6. Compare system recommendations against actual human operator decisions
  7. Iterate on models based on discrepancies

Exit Criteria: System generates failover recommendations that match human decisions ≥90% of the time over a 4-week observation period.

Phase 2: Assisted Decision-Making

Increase system involvement while maintaining human control.

  1. Surface recommendations to operators in real-time during incidents
  2. Provide "what-if" analysis tools for capacity planning via visualization platform integration or custom UI
  3. Implement post-incident analysis automation
  4. Deploy work unit agents for local context gathering
  5. Build cost modeling for traffic migration scenarios
  6. Extend audit logging to capture full decision context

Exit Criteria: Operators actively use system recommendations during incidents; post-incident analysis time reduced by 50%.

Phase 3: Progressive Automation

Gradually transfer decision authority to the system.

  1. Define automation runbook with explicit human approval gates
  2. Implement staged failover capability (if traffic control supports it)
  3. Enable automated alerting based on system recommendations
  4. Pilot automated failover for low-risk scenarios with immediate human notification
  5. Expand automation scope based on demonstrated reliability

Exit Criteria: Defined per-scenario based on risk tolerance and demonstrated system accuracy.


Security and Compliance

Security is a first-class concern, not an afterthought. The system will be designed with the following principles:

Authentication and Authorization

Service-to-Service (Agents):

  • All agent communication uses mutual TLS with certificate-based identity
  • Workload identity attestation ensures agents only accept requests from verified peers
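
A minimal sketch of the client side of this mutual-TLS requirement using gRPC's credentials API; the certificate file paths and generated stub are placeholders, and in practice identities would come from the workload-attestation system rather than static files.

```python
import grpc


def mtls_channel(target: str) -> grpc.Channel:
    """Open a mutually authenticated channel from the orchestrator to a work unit agent.

    Certificate paths are placeholders; real deployments would source these
    from the workload-identity system rather than files on disk.
    """
    with open("ca.pem", "rb") as ca, open("client.key", "rb") as key, open("client.pem", "rb") as cert:
        creds = grpc.ssl_channel_credentials(
            root_certificates=ca.read(),   # trust anchor for the agent's server cert
            private_key=key.read(),        # client identity presented to the agent
            certificate_chain=cert.read(),
        )
    return grpc.secure_channel(target, creds)

# channel = mtls_channel("work-unit-a.agents.internal:443")
# stub = agent_pb2_grpc.AgentStub(channel)  # generated stub, not shown here
```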

Human Access:

  • SSO integration with existing identity provider (SAML/OIDC) for all operator-facing interfaces
  • No local accounts; all access tied to corporate identity

Role-Based Access Control:

| Role | Permissions | Typical Users |
|------|-------------|---------------|
| Viewer | View dashboards, recommendations, and audit logs | All engineers, leadership |
| Operator | Acknowledge/override recommendations, trigger manual analysis, ask ad-hoc questions | On-call SREs, incident commanders |
| Admin | Modify capacity thresholds, update traffic affinity rules, manage access | Platform team leads |
| System Config | Deploy model updates, modify agent configuration, update MCP schemas | Build team, designated SREs |

Configuration Management:

  • System configuration (thresholds, affinity rules, alert definitions) managed via GitOps—changes require PR review and merge to deploy
  • Operational actions (acknowledge recommendation, trigger failover) performed directly in operator UI with audit logging
  • No direct production access; all changes flow through version-controlled pipelines or audited UI actions

Network Security

  • Central orchestrator runs in isolated admin cluster with restricted network policies
  • Work unit agents communicate only with central orchestrator via authenticated channels
  • No direct work-unit-to-work-unit agent communication without central coordination
  • All cross-network traffic encrypted in transit

Audit and Accountability

  • Every observation, recommendation, and action is logged with full context:
    • What signals were observed
    • What recommendation was generated
    • Who or what approved the action (human operator ID or automated policy reference)
    • What action was taken
    • Outcome of the action
  • Audit logs shipped to centralized logging system with append-only retention policies (no deletion or modification within retention window)
  • Regular audit review as part of operational process

Failure Modes and Safety

  • First directive: Do no harm. When uncertain, the system alerts humans rather than taking action.
  • Network partition handling: If central orchestrator loses connectivity to a work unit agent, it does not assume the work unit is healthy or unhealthy—it flags the uncertainty.
  • Agent crashes/restarts: Agents resume in safe state; no action taken based on stale data
  • Consensus requirements: For high-impact actions (full work unit failover), require corroborating signals from at least two independent sources (e.g., anomaly detection + metrics threshold breach, or metrics + operator report) before recommending
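
A minimal sketch of this two-source corroboration rule; the signal names are illustrative.

```python
def corroborated(signals: dict) -> bool:
    """Require at least two independent sources before recommending a full failover.

    `signals` maps source name -> bool, e.g.
    {"anomaly_detection": True, "metrics_threshold": True, "operator_report": False}.
    """
    independent_sources = ("anomaly_detection", "metrics_threshold", "operator_report")
    return sum(bool(signals.get(s)) for s in independent_sources) >= 2

# A single anomaly with no corroborating metrics breach stays a low-severity
# alert to humans, never a failover recommendation.
```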

Data Residency and Privacy

  • Customer data is never processed by the AI layer; only aggregated metrics and operational telemetry
  • Claude API calls do not include PII or customer-identifying information
  • All data processing respects existing data governance policies

Technology Approach

The system leverages existing infrastructure where possible and introduces new components only where necessary.

AI Layer:

  • Central orchestrator runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning (an example API call follows this list)
  • Work unit agents run local LLMs (Llama 3) for network partition resilience and API cost elimination
  • Claude Code with custom MCP servers for development and early adopter experimentation
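
For the production reasoning path, a minimal sketch of a single call through the Anthropic Python SDK; the model identifier, system prompt, and context payload are placeholders, and a real orchestrator would pass structured tool definitions rather than one free-form prompt.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def recommend_failover(context: dict) -> str:
    """Ask Claude to reason over pre-computed signals and draft a recommendation.

    `context` would carry headroom, anomaly state, and affinity rankings; only
    aggregated operational telemetry, never customer data, per the privacy section.
    """
    message = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=1024,
        system="You are a capacity-planning assistant. Recommend, never act.",
        messages=[{
            "role": "user",
            "content": f"Given these signals, recommend a failover plan with reasoning: {context}",
        }],
    )
    return message.content[0].text
```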

Data Infrastructure: Builds on existing time-series metrics, logging, and data warehouse investments. New components limited to pre-computed analytics stores for derived datasets.

Build vs. Buy: Custom development required for the decision engine, capacity models, and MCP integrations. Standard tooling for everything else.

See Appendix C: Detailed Technology Stack for component-level recommendations.


Open Questions for Stakeholders

The following questions require input from relevant teams before implementation can proceed:

Data Access (Data Warehouse, Observability)

  1. What is the interface for querying the Data Warehouse (REST API, SQL, other), and what are the latency characteristics for interactive vs. batch queries?
  2. What is the current distributed tracing sampling strategy, and can we access trace data programmatically for correlation?

Traffic Control (Traffic Engineering)

  1. What is the current mechanism for traffic failover (DNS, load balancer, application-level), and is there an API for automated control?
  2. Can traffic be shifted gradually (e.g., 10% increments), or is it all-or-nothing?
  3. What are the current documented runbooks for failover decisions?

Security and Compliance

  1. What is the standard for service-to-service authentication, and are there specific compliance requirements (SOC2, etc.) affecting audit logging?

Rollout

  1. Which work units or regions should be prioritized for initial rollout?

Appendix A: Glossary

| Term | Definition |
|------|------------|
| Work Unit | A complete deployment of the service stack in a specific region; also referred to as a cluster |
| TTFB | Time to First Byte; a measure of latency from user request to initial response |
| Thundering Herd | A spike in resource utilization when a large number of clients simultaneously connect to a newly activated work unit |
| Failover | The process of shifting traffic away from a degraded work unit to healthy alternatives |
| Shadow Mode | Operating mode where the system generates recommendations but takes no automated action |
| Traffic Affinity | The mapping of customer populations to optimal work units based on latency and capacity |
| Headroom | Available capacity on a work unit; the difference between current utilization and maximum safe capacity |

Appendix B: Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | Dec 2024 | D. Park | Initial draft for executive review |
| 1.0 | Dec 2024 | D. Park | Added team requirements, human interaction model, expanded tech details |

Appendix C: Detailed Technology Stack

Core Components

| Component | Recommendation | Rationale |
|-----------|----------------|-----------|
| AI Reasoning (Production) | Custom agent built on Claude SDK or LangChain with Opus 4.5 | Highest reasoning capability; custom agent enables fine-grained control over context management, token usage, and task execution; allows tailored tool definitions and guardrails specific to our operational domain |
| AI Reasoning (Development) | Claude Code with custom MCP server | Rapid iteration; direct tool access for early adopters |
| Work Unit Local Inference | Llama 3 (8B or 70B) via vLLM or Ollama | Required for network partition resilience; eliminates API costs at scale; sub-second local response times |
| Time-Series Metrics | Prometheus | Industry standard; powerful PromQL; native alerting |
| Time-Series Forecasting | Prophet or NeuralProphet | Handles sinusoidal patterns and missing data |
| Pre-Computed Analytics Store | DuckDB or Apache Iceberg on Parquet | Fast analytical queries on derived datasets |
| Centralized Logging | Splunk or OpenSearch | Enterprise features and compliance |
| Distributed Tracing | Jaeger or Zipkin | Kubernetes-native; OpenTelemetry support |
| Visualization Platform | Tableau or Grafana | See integration notes below |
| Container Orchestration | Kubernetes (EKS, GKE, or AKS) | Industry standard; strong ecosystem |
| Inter-Agent Communication | gRPC with mTLS | Strong typing, efficient serialization |
| Audit Logging | Splunk or ELK stack | Searchable, compliance-ready |

Visualization Platform Integration

| Use Case | Data Source | Refresh Rate | Dashboard Type |
|----------|-------------|--------------|----------------|
| Real-time work unit health | Time-series metrics | 1-5 min | Operational |
| Current headroom by region | Pre-computed analytics | 5-15 min | Operational |
| 30/60/90 day capacity forecast | Data Warehouse | Daily | Planning |
| "What-if" traffic migration | Data Warehouse + Cost models | On-demand | Interactive |
| Post-incident timeline | Logs + Traces + Metrics | On-demand | Investigation |

Tableau Integration:

| Connection Type | Use Case | How It Works |
|-----------------|----------|--------------|
| Live Connection | Real-time operational dashboards | Tableau queries the data source directly on each dashboard load or refresh. For Prometheus, this requires an intermediary (e.g., Prometheus → PostgreSQL via query exporter, or a REST API data source). Queries execute in real-time; no data cached in Tableau. |
| Extract (Hyper) | Historical analysis, capacity planning | Tableau pulls data on a schedule (hourly/daily), stores it in a compressed .hyper file. Faster queries for large datasets; enables offline analysis. Required for Data Warehouse queries that are too slow for interactive use. |
| Embedded Analytics | Operator tooling integration | Tableau views embedded in internal tools via Tableau Embedding API. SSO passthrough via SAML/JWT. Allows dashboards within existing operator workflows. |

Tableau Server or Tableau Cloud is required for scheduled extract refreshes, shared dashboards, and embedding. Desktop-only deployments cannot support team-wide operational use.

Grafana Integration:

| Data Source | Connection Method | Capabilities |
|-------------|-------------------|--------------|
| Prometheus | Native plugin (built-in) | PromQL queries, alerting rules, annotation support. Sub-second refresh for real-time dashboards. |
| OpenSearch/Elasticsearch | Native plugin | Log exploration, full-text search, aggregations for log-based dashboards. |
| PostgreSQL/DuckDB | SQL plugin | Query pre-computed analytics tables. Supports variables and templating for interactive filtering. |
| Data Warehouse | JDBC/ODBC or custom plugin | Historical queries via SQL interface. May require caching layer for acceptable latency. |

Grafana excels at operational dashboards with real-time metrics. For complex business analytics (cohort analysis, executive summaries with calculated fields), Tableau offers more flexibility.

Build vs. Buy

Build (Custom Development Required):

| Component | What It Does | Key Technical Work |
|-----------|--------------|--------------------|
| MCP Server for Claude Code | Exposes internal systems (metrics, logs, anomaly detection) as tools that Claude can invoke via the Model Context Protocol | Define tool schemas for each data source; implement authentication/authorization; handle rate limiting; translate natural language queries to PromQL/SQL |
| Traffic Affinity Matrix Pipeline | Generates and maintains the region-to-work-unit routing preferences based on latency data | Scheduled job that queries TTFB percentiles from Data Warehouse; computes ranked failover targets per region; outputs to analytics store; handles edge cases (new regions, decommissioned work units) |
| Capacity Baseline Modeling | Determines the maximum safe load for each work unit based on historical performance | Analyze historical metrics during peak load; identify limiting resources (CPU, memory, connections, network); compute per-work-unit capacity thresholds with safety margins; update baselines after infrastructure changes |
| Anomaly Detection Integration | Connects the decision engine to the existing anomaly detection system | API client for anomaly detection service; polling or webhook-based signal ingestion; normalization of anomaly scores to common format; historical anomaly retrieval for post-incident analysis |
| Failover Recommendation Engine | Core decision logic that combines signals into actionable recommendations | Rule engine or ML model that weighs health signals, capacity headroom, traffic affinity, and thundering herd risk; generates ranked recommendations with confidence scores; supports override rules for known scenarios |

Leverage Existing:

| System | What We Use | Integration Approach |
|--------|-------------|----------------------|
| Time-series metrics | Real-time health signals, utilization data | Query via existing API; no changes to collection pipeline |
| Centralized logging | Log correlation for post-incident analysis | Query via existing search API; may need new saved searches/alerts |
| Distributed tracing | Request path analysis during incidents | Query via existing API; requires adequate sampling for correlation |
| Data Warehouse | Historical analysis, forecasting input | New query patterns; may need additional ETL for derived tables |
| Authentication | SSO for human users, service identity for agents | SAML/OIDC integration for UIs; workload identity for service-to-service |

Evaluate Commercial Options:

| Option | Potential Benefit | Consideration |
|--------|-------------------|---------------|
| Unified observability platform (Datadog, New Relic, Dynatrace) | Single pane of glass; built-in anomaly detection; reduced integration work | May not integrate well with existing bespoke systems; licensing costs at scale; vendor lock-in |
| Commercial AIOps tools | Pre-built correlation and recommendation engines | Often require significant customization; may not understand domain-specific capacity constraints |
| Managed LLM inference | Reduce operational burden of running local LLMs | Adds network dependency; per-query costs; may not meet latency requirements for real-time decisions |

This document is intended for internal planning purposes. Distribution should be limited to stakeholders involved in project scoping and approval.

Powered by Claude.
