Project Proposal for Executive Review
| Version | 1.0 |
| Author | David Park |
| Date | December 2024 |
Our global service infrastructure operates across multiple work units (clusters) worldwide, each capable of generating millions of metrics and log lines per second. Today, critical decisions around capacity planning, traffic failover, and incident response rely heavily on human judgment and tribal knowledge. Operators manually interpret dashboards, estimate work unit capacity, and determine optimal traffic routing during incidents—processes that are time-sensitive, error-prone, and do not scale.
This project proposes building an intelligent system that leverages our existing observability infrastructure—time-series metrics, centralized logging, distributed tracing, our anomaly detection system, and our data warehouse—to provide predictive capacity planning, automated failover recommendations, and post-incident analysis. The system will use Claude (via Claude Code and the Claude API) as the reasoning layer to interpret signals, correlate data across sources, and generate actionable recommendations.
The approach is deliberately phased: we begin in shadow mode, where the system observes and recommends alongside human operators without taking action. Only after building confidence through validated recommendations do we consider progressive automation. The guiding principle throughout is "do no harm"—the system's failure mode is always to alert humans rather than take potentially harmful autonomous action.
Our service runs globally across multiple work units, each representing a complete deployment of our service stack in a specific region. Each work unit generates telemetry at scale:
- Metrics: Collected via a metrics aggregation system that reduces cardinality by aggregating across hundreds of identical service instances, plus raw time-series metrics for select high-value signals. Metrics are shipped to a centralized metrics bucket work unit and eventually loaded into the Data Warehouse.
- Logs: Ingested by local log indexers within each work unit, then shipped to centralized indexers for retention (90-day window).
- Traces: Distributed tracing is available, though current coverage and sampling strategy require clarification.
- Anomaly Detection: A bespoke internal system provides current-state and historical anomaly data via API.
- Data Warehouse: A columnar analytics database that ETLs all telemetry into a common schema, providing up to 2 years of historical data for batch and ad-hoc analysis.
Capacity Planning: Work unit capacity limits are based on historical load tests and empirical observation from traffic migrations. There is no systematic, data-driven model of true capacity thresholds or the factors that determine them.
Failover Decisions: When a work unit shows signs of distress (typically observed as drops in network connection metrics), a human operator must:
- Recognize the problem from dashboard observations
- Decide to initiate failover
- Determine which work unit(s) should receive the traffic based on mental models of regional proximity and capacity
- Execute the failover manually
This process depends on operator experience and availability. The decision criteria are not codified, and there is no systematic way to validate that the destination work unit can absorb the additional load.
Traffic Patterns: Traffic follows predictable sinusoidal patterns tied to local time zones (peak usage around 9-10 PM local time). Work units experience "thundering herd" spikes when first receiving traffic, followed by steady-state operation once devices are tethered. These patterns are well-understood qualitatively but not modeled quantitatively.
Post-Incident Analysis: After a failover event, understanding root cause requires manually correlating metrics, logs, and traces across systems—a time-consuming process that often yields incomplete answers.
1. Capacity Planning and Traffic Forecasting: Build predictive models for traffic volume by region and time, enabling proactive capacity decisions. Understand and quantify the true capacity limits of each work unit based on historical data.
2. Cost Modeling for Traffic Migration: Provide visibility into cloud provider cost implications when migrating traffic between regions, enabling cost-aware routing decisions.
3. Load Shedding Analysis: Answer the question "if work unit X sheds load, can work units Y and Z absorb it?" with data-driven confidence, accounting for current utilization and capacity headroom.
4. Near Real-Time Failover Recommendations: Monitor work unit health at 5-minute granularity (or finer) and generate failover recommendations that account for:
   - Current health signals and anomaly state
   - Regional proximity (minimize TTFB impact)
   - Destination capacity and current load
   - Thundering herd considerations for cold work units
5. Post-Incident Analysis and Remediation: After a failover event, automatically correlate relevant metrics, logs, and traces to identify root cause and generate remediation recommendations.
6. Claude-Powered Reasoning: Use Claude (via Claude Code for early adopters, Claude API for production agents) as the intelligent layer that interprets signals, answers ad-hoc questions, and generates human-readable recommendations.
- Shadow mode recommendations match human operator decisions ≥90% of the time
- Time-to-decision for failover reduced from minutes to seconds
- Capacity forecasts accurate within ±15% over 30-day horizons
- Post-incident root cause identification completed within 15 minutes of incident close
- Zero automated actions taken without explicit human approval until shadow mode validation complete
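To make the first criterion above concrete, here is a minimal sketch of how shadow-mode agreement could be measured from the audit log; it assumes each logged failover event records both the system's recommended target and the operator's actual choice, and the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ShadowRecord:
    """One failover event captured during shadow mode (hypothetical schema)."""
    incident_id: str
    recommended_target: str   # work unit the system recommended
    operator_target: str      # work unit the human operator actually chose

def agreement_rate(records: list[ShadowRecord]) -> float:
    """Fraction of shadow-mode recommendations that matched the operator's decision."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if r.recommended_target == r.operator_target)
    return matches / len(records)

# Example: evaluating the >=90% exit criterion over a 4-week observation window.
records = [
    ShadowRecord("INC-101", "region-b", "region-b"),
    ShadowRecord("INC-102", "region-c", "region-b"),
    ShadowRecord("INC-103", "region-b", "region-b"),
]
print(f"shadow-mode agreement: {agreement_rate(records):.0%}")  # 67% in this toy sample
```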
| Role | Count | Responsibility |
|---|---|---|
| Tech Lead / Architect | 1 | System design, cross-team coordination, technical decisions |
| Backend Engineers | 2-3 | Core platform, decision engine, agent framework |
| Data/ML Engineer | 1 | Forecasting models, capacity baselines, analytics pipelines |
| SRE (embedded) | 1 | Production readiness, observability, runbooks |
Total: 5-6 engineers for build phase
Ongoing Operations: Existing SRE team absorbs operational responsibility post-launch. System designed for minimal operational overhead with standard Kubernetes patterns, automated health checks, and self-healing agents. See Operations Handoff section below for details.
The AI-assisted estimates assume the team is proficient with Claude Code. We recommend a one-week bootcamp before project work begins, followed by continued learning while delivering:
Week 0: AI Development Bootcamp (dedicated time, before Phase 0)
| Day | Focus | Activities |
|---|---|---|
| Day 1-2 | Setup & Fundamentals | Dev environment configuration, Claude Code installation, API access, basic prompting techniques, MCP server concepts |
| Day 3-4 | Capstone: MCP Server Prototype | Group builds a working MCP server that queries one data source (e.g., time-series metrics). Real Phase 0 deliverable: team practices AI-assisted development while producing shippable code, rotating through driver/navigator roles, reviewing each other's AI-generated PRs |
| Day 5 | Team Workflows | Refine PR review norms based on capstone experience, document shared prompt patterns that worked, establish team conventions, retrospective |
Weeks 1-4: Learning While Delivering (parallel with Phase 0)
| Milestone | What It Looks Like |
|---|---|
| Week 1-2 | Applying bootcamp skills to real tasks; occasional check-ins with AI tooling champion |
| Week 3-4 | Developing personal patterns; full proficiency with autonomous sessions |
Key success factors:
- Designate an AI tooling champion to establish team conventions and share learnings
- Start with well-scoped, low-risk tasks during ramp-up (e.g., boilerplate, tests, documentation)
- Establish code review norms for AI-generated code (same standards as human-written code)
- Create shared prompt libraries and MCP configurations for common project tasks
Recommendation: Run Phase 0 with 1-2 engineers who have prior Claude Code experience. Use this phase to develop team-specific patterns before scaling to the full team in Phase 1.
The following estimates assume leveraging existing observability infrastructure and focus purely on engineering effort.
| Phase | Traditional Approach | AI-Assisted Approach | Notes |
|---|---|---|---|
| Phase 0: Foundation | 6-8 engineer-weeks | 3-4 engineer-weeks | Data access validation, MCP server scaffolding |
| Phase 1: Shadow Mode | 16-20 engineer-weeks | 8-10 engineer-weeks | Core decision engine, capacity models, health monitoring |
| Phase 2: Assisted | 12-16 engineer-weeks | 6-8 engineer-weeks | Operator tooling, post-incident automation, work unit agents |
| Phase 3: Progressive Automation | 8-12 engineer-weeks | 4-6 engineer-weeks | Automation runbooks, staged rollout |
| Total | 42-56 engineer-weeks | 21-28 engineer-weeks | |
AI-Assisted Approach: Assumes use of Claude Code for rapid prototyping, code generation, and iterative development. Estimates based on internal pilot projects using AI-assisted development; actual results will vary based on task complexity and team familiarity with AI tooling.
Investment pays back through:
- Reduced incident duration: Faster failover decisions prevent cascading failures and customer impact
- Reduced operator toil: Automated analysis replaces hours of manual dashboard correlation
- Prevented outages: Proactive capacity alerts catch issues before they become incidents
- Operator scaling: System handles routine decisions, freeing SREs for higher-value work
Quantified ROI depends on current incident frequency and cost-per-incident metrics, which should be gathered during Phase 0.
Once the system reaches production, the following responsibilities transfer to the SRE team. Most routine operations are automated; human intervention is primarily required for approval gates and exception handling.
| Category | Automated | Human Intervention (Gating) |
|---|---|---|
| System Health | Self-healing pods; automated alerting; health dashboards | Escalation for alerts that don't self-resolve |
| Analytics Model Maintenance | Scheduled retraining pipelines; automated validation | Approve model updates before prod deployment |
| LLM Operations | API key rotation; usage/cost dashboards with alerts | Approve version upgrades; investigate anomalies |
| Access Control | SSO integration; automated provisioning | Approve elevated access; periodic reviews |
| Audit & Compliance | Automated report generation; log retention | Monthly audit review; compliance inquiries |
Estimated human effort: 4-8 hours/week, trending lower once stable.
Handoff deliverables:
- Operational runbooks and architecture docs
- Monitoring dashboards and alert definitions
- On-call escalation procedures
- Knowledge transfer sessions (2-3 sessions, 2 hours each)
All components are Kubernetes-native, deployed via Helm. Estimates use on-demand pricing; Reserved Instances reduce costs 30-50%.
POC / Phase 0-1:
| Component | Spec | Monthly Cost |
|---|---|---|
| Orchestrator + MCP | 2x m5.large (sidecar pattern) | $140 |
| Analytics (DuckDB) | 1x r6i.large + 100GB gp3 | $100 |
| Work Unit Pod (1 unit) | 1x g4dn.xlarge (agent + Llama 3 8B) | $380 |
| Networking | ALB + data transfer | $50 |
| Claude API | Low volume | $500 |
| POC Total | | ~$1,200/month |
Production (multi-AZ, Kubernetes-native HA):
Control Plane (3 replicas across AZs, K8s handles failover):
| Component | Spec | Monthly Cost |
|---|---|---|
| Orchestrator + MCP | 3x m5.large across 3 AZs | $210 |
| Analytics | 1x r6i.large + 200GB gp3 (with snapshots) | $120 |
| Networking | ALB (multi-AZ) + cross-region transfer | $150 |
| Claude API | Production volume with caching | $1,500-2,500 |
| Control Plane Total | | ~$2,000-3,000/month |
Per-Work-Unit (1 active pod, 1 standby in different AZ):
| Component | Spec | Monthly Cost |
|---|---|---|
| Work Unit Pod (active) | 1x g4dn.xlarge (agent + LLM) | $380 |
| Work Unit Pod (standby) | 1x g4dn.xlarge (stopped until failover) | $0* |
| Storage | 50GB gp3 (model + state) | $4 |
| Cross-region transfer | To central orchestrator | $20 |
| Per-Unit Total | | ~$400/month |
*Standby instances stopped by default; only pay for EBS. Start on failover via K8s scaling.
Scaling:
| Work Units | Work Unit Cost | Control Plane | Total Monthly |
|---|---|---|---|
| 5 | $2,000 | $2,500 | ~$4,500 |
| 10 | $4,000 | $2,800 | ~$6,800 |
| 15 | $6,000 | $3,100 | ~$9,100 |
| 20 | $8,000 | $3,400 | ~$11,400 |
Non-Prod Environments (single AZ, no HA):
| Environment | Monthly Cost |
|---|---|
| Staging | $600 |
| Acceptance | $600 |
| Non-Prod Total | ~$1,200/month |
Deployment:
- Helm charts with environment-specific values (`values-prod.yaml`, `values-staging.yaml`)
- CI/CD: merge to main → staging, tagged release → prod
Cost optimization: Reserved Instances (30-50% off), right-size after POC, Claude API caching.
Layer 1: Data Sources
All existing telemetry systems feed into the architecture:
- Time-series metrics system (per-work-unit, ~2-week retention, real-time)
- Centralized logging system (logs, 90-day retention)
- Distributed tracing system
- Anomaly Detection System (current state + historical via API)
- Data Warehouse (2-year historical, batch queries)
Layer 2: Pre-Computed Analytics
Rather than querying raw data at decision time, we build derived datasets optimized for each use case:
- Capacity Baselines: Per-work-unit capacity models derived from historical load data. Implementation: Analyze Data Warehouse metrics from past peak events; identify which resource (CPU, memory, connections, network) saturates first; set threshold at 80% of observed max. Requires: Access to historical peak load data; validation that past peaks represent true limits.
- Traffic Affinity Matrix: Region-to-work-unit mapping based on empirical TTFB data. Implementation: Daily batch job queries P50/P95 TTFB by source region and destination work unit; ranks work units per region. Requires: TTFB data accessible in Data Warehouse with region tagging.
- Headroom Calculations: Current utilization vs. baseline capacity. Implementation: Scheduled query (every 5 min) computes `(baseline - current) / baseline` per work unit. Requires: Real-time metrics accessible from central location.
- Cost Models: Cloud provider pricing by traffic path. Implementation: Static lookup table of region-to-region transfer costs; updated manually when pricing changes. Requires: Documented cloud pricing for relevant regions.
- Traffic Forecasts: Predict load by work unit. Implementation: Prophet model trained on 90-day historical load data; captures daily/weekly seasonality. Requires investigation: Validate Prophet handles our traffic patterns; may need custom seasonality for regional holidays.
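As a starting point for the Prophet validation flagged above, a minimal forecasting sketch. It assumes 90 days of hourly per-work-unit load can be exported from the Data Warehouse into a two-column frame (`ds`, `y`); the file name and forecast horizon are illustrative.

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Assumed input: 90 days of hourly load for one work unit, exported from the
# Data Warehouse with columns "ds" (timestamp) and "y" (e.g., concurrent connections).
history = pd.read_parquet("work_unit_a_hourly_load.parquet")  # hypothetical export

# Daily + weekly seasonality captures the sinusoidal, time-zone-driven traffic pattern.
# Regional holidays could be layered on via model.add_country_holidays(...) if
# validation shows they matter.
model = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=False)
model.fit(history)

# 30-day forecast at hourly resolution; yhat_upper gives a conservative planning bound.
future = model.make_future_dataframe(periods=30 * 24, freq="h")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```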
Layer 3: Decision Engine
Two operational modes:
Near Real-Time (Goals #4, #5):
- Polls anomaly detection system API and time-series metrics every 5 minutes (interval configurable; faster polling requires validating API rate limits)
- Compares current utilization against pre-computed capacity baselines; flags when headroom drops below threshold (e.g., <20%)
- When anomaly detected OR headroom low: queries traffic affinity matrix for ranked failover targets; filters to targets with sufficient headroom; generates recommendation with reasoning
- Outputs recommendation to human operators via alerting integration (shadow mode) or triggers automated workflow via traffic control API (future state, requires investigation: confirm traffic control API exists and supports programmatic failover)
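A minimal sketch of this near real-time loop in shadow mode, assuming thin client wrappers over the metrics system, anomaly detection API, capacity baselines, affinity matrix, and alerting integration (all interfaces hypothetical); the only side effect is an alert to operators.

```python
import time

HEADROOM_THRESHOLD = 0.20      # flag when headroom drops below 20%
POLL_INTERVAL_SECONDS = 300    # 5-minute cadence; subject to source API rate limits

def headroom(current: float, baseline: float) -> float:
    """(baseline - current) / baseline, as defined for the Layer 2 headroom calculation."""
    return (baseline - current) / baseline

def evaluate(unit: str, metrics, anomalies, baselines, affinity) -> dict | None:
    """One shadow-mode evaluation for a single work unit; returns a recommendation or None."""
    current = metrics.current_load(unit)
    room = headroom(current, baselines.capacity(unit))
    anomalous = anomalies.is_anomalous(unit)
    if not anomalous and room >= HEADROOM_THRESHOLD:
        return None  # healthy: nothing to recommend

    # Rank candidate failover targets by TTFB affinity; keep those with spare headroom.
    candidates = [
        target for target in affinity.ranked_targets(unit)
        if headroom(metrics.current_load(target), baselines.capacity(target)) >= HEADROOM_THRESHOLD
    ]
    return {
        "source": unit,
        "targets": candidates[:2],
        "reason": f"headroom={room:.0%}, anomaly={anomalous}",
    }

def run_shadow_loop(units, metrics, anomalies, baselines, affinity, alerting):
    """Poll every 5 minutes; in shadow mode the only action is an operator alert."""
    while True:
        for unit in units:
            recommendation = evaluate(unit, metrics, anomalies, baselines, affinity)
            if recommendation:
                alerting.send(recommendation)   # existing Slack/PagerDuty integration
        time.sleep(POLL_INTERVAL_SECONDS)
```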
Batch/Interactive (Goals #1, #2, #3):
- Runs Prophet forecasting models against Data Warehouse historical data; outputs 30/60/90-day load predictions per work unit
- Answers ad-hoc "what-if" queries via Claude: user asks natural language question → Claude translates to Data Warehouse query + capacity model lookup → returns computed answer with assumptions stated
- Generates periodic capacity planning reports via scheduled jobs that query forecasts and current headroom, output to dashboards or email
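The "what-if" flow described above could be implemented with standard Claude tool use: Claude proposes a read-only warehouse query, the decision engine executes it, and Claude summarizes the result with its assumptions. Below is a hedged sketch using the Anthropic Python SDK; the tool name, schema, and model identifier are placeholders.

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"       # substitute the model ID actually deployed

# Hypothetical tool: lets Claude run a read-only SQL query against the Data Warehouse.
WAREHOUSE_TOOL = {
    "name": "run_warehouse_query",
    "description": "Run a read-only SQL query against the analytics warehouse and return rows.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string", "description": "SELECT statement to execute"}},
        "required": ["sql"],
    },
}

def answer_what_if(question: str, execute_sql) -> str:
    """Single round-trip: Claude proposes a query, we execute it, Claude answers."""
    messages = [{"role": "user", "content": question}]
    response = client.messages.create(
        model=MODEL, max_tokens=1024, tools=[WAREHOUSE_TOOL], messages=messages
    )
    for block in response.content:
        if block.type == "tool_use" and block.name == "run_warehouse_query":
            rows = execute_sql(block.input["sql"])  # caller-supplied, read-only executor
            messages += [
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [{
                    "type": "tool_result", "tool_use_id": block.id, "content": str(rows),
                }]},
            ]
            final = client.messages.create(
                model=MODEL, max_tokens=1024, tools=[WAREHOUSE_TOOL], messages=messages
            )
            return final.content[0].text
    return response.content[0].text  # Claude answered without needing the warehouse
```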
Layer 4: Agent Architecture
A multi-tier agent topology provides resilience and appropriate separation of concerns:
Central Orchestrator:
- Runs as a Kubernetes pod in an admin cluster
- Runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning and cross-work-unit coordination
- Has visibility into all work units via aggregated data
- Makes or recommends traffic-affecting decisions
- Maintains audit log of all observations, recommendations, and actions
Work Unit Agents:
- Deployed within each work unit
- Run a lightweight local LLM for fast, local queries (target: sub-100ms for routine health checks; requires validation during POC with actual hardware and model size)
- Can operate independently if central orchestrator is unreachable
- Provide local context to central orchestrator on request
- Limited to "safe" autonomous actions (e.g., local alerting) without central coordination
Communication:
- Agents communicate via authenticated, attested channels (see Security section)
- Central orchestrator queries work unit agents; work unit agents do not initiate traffic-affecting actions independently
- Graceful degradation: if central cannot reach a work unit agent, it flags the gap rather than assuming state
flowchart TB
subgraph DataSources["Data Sources (Layer 1)"]
METRICS[Time-Series Metrics]
LOGS[Centralized Logging]
TRACES[Distributed Tracing]
ANOMALY[Anomaly Detection]
DW[Data Warehouse]
end
subgraph Analytics["Pre-Computed Analytics (Layer 2)"]
REALTIME[Real-Time Analytics<br/>Headroom, Health Signals]
BATCH[Batch Analytics<br/>Capacity Baselines, Affinity Matrix,<br/>Cost Models, Forecasts]
end
subgraph Decision["Decision Engine (Layer 3)"]
ENGINE[Decision Engine<br/>Near Real-Time + Interactive]
end
subgraph Agents["Agent Architecture (Layer 4)"]
CENTRAL[Central Orchestrator]
WU_AGENT1[Work Unit Agent]
WU_AGENT2[Work Unit Agent]
WU_AGENT3[Work Unit Agent]
end
subgraph Outputs["Outputs"]
HUMAN[Human Operators]
AUTO[Automated Workflow]
VIZ[Dashboards]
end
METRICS --> REALTIME
LOGS --> REALTIME
TRACES --> REALTIME
ANOMALY --> REALTIME
DW --> BATCH
REALTIME --> ENGINE
BATCH --> ENGINE
ENGINE --> CENTRAL
CENTRAL <--> WU_AGENT1
CENTRAL <--> WU_AGENT2
CENTRAL <--> WU_AGENT3
CENTRAL --> HUMAN
CENTRAL --> AUTO
CENTRAL --> VIZ
The system is designed to augment human decision-making, not replace it. This section describes how different roles interact with the system and the value each interaction provides.
When a work unit shows signs of distress, on-call operators currently must interpret multiple dashboards, recall tribal knowledge about capacity, and make high-stakes decisions under pressure.
With this system:
- Operators receive a clear, prioritized recommendation via existing alerting channels (Slack, PagerDuty, etc.): "Recommend failover of Region-A traffic to Region-B. Region-B has 34% headroom and lowest TTFB for affected customers. Region-C is not recommended due to current 78% utilization."
- The recommendation includes confidence level (high/medium/low based on signal agreement) and the signals that triggered it
- Operators can ask follow-up questions in natural language via Chat/CLI interface: "What happens if we split traffic between B and C instead?" (requires: MCP server with traffic model access; Claude interprets question, queries model, returns computed answer)
- One-click acknowledgment to log the decision, whether they follow the recommendation or override it (requires: simple UI or Slack workflow integration)
Value: Faster decisions, reduced cognitive load, consistent decision-making regardless of operator experience level.
Capacity planners currently rely on manual dashboard review, interpretation of historical trends, and periodic ad-hoc analysis to forecast needs.
With this system:
- Interactive dashboards show 30/60/90-day traffic forecasts by region
- "What-if" scenarios answer questions like: "If we decommission Work Unit X, can the remaining units handle the load during peak hours?"
- Cost projections for traffic migration between regions
- Proactive alerts when forecasted demand approaches capacity thresholds
Value: Data-driven capacity decisions, reduced over-provisioning costs, earlier identification of scaling needs.
After an incident, understanding root cause requires hours of manual correlation across metrics, logs, and traces.
With this system:
- Automated timeline generation: query audit logs + anomaly detection history + operator actions for the incident window; display as chronological event stream. Implementation: Straightforward log aggregation and sorting.
- Correlated signals: for each event in timeline, fetch related metrics/logs/traces within ±5 minute window; display in unified view. Implementation: Requires trace IDs or timestamps to correlate across systems; may require investigation if correlation keys aren't available.
- AI-generated root cause summary: Claude analyzes the timeline and correlated signals, generates hypothesis with cited evidence. Limitation: Quality depends on signal availability; novel failure modes may produce speculative answers that require human validation.
- Suggested remediation: Claude compares against previous incidents stored in knowledge base. Requires: Building a structured incident history database during Phase 1-2; initially this feature will have limited data to draw from.
Value: Faster post-incident reviews, more complete root cause analysis, institutional memory that doesn't depend on individual engineers.
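The timeline generation step described above is plain aggregation; here is a minimal sketch, assuming thin wrappers over the audit log, anomaly detection history API, and operator action log (interfaces hypothetical).

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    timestamp: datetime
    source: str    # "audit", "anomaly", or "operator"
    summary: str

def build_timeline(incident_start, incident_end, audit_log, anomaly_api, operator_log):
    """Merge events from the three systems into one chronological incident timeline.

    Each client is a hypothetical wrapper over an existing system and yields
    (timestamp, text) pairs for the incident window.
    """
    events = []
    for ts, text in audit_log.entries(incident_start, incident_end):
        events.append(TimelineEvent(ts, "audit", text))
    for ts, text in anomaly_api.history(incident_start, incident_end):
        events.append(TimelineEvent(ts, "anomaly", text))
    for ts, text in operator_log.actions(incident_start, incident_end):
        events.append(TimelineEvent(ts, "operator", text))
    return sorted(events, key=lambda e: e.timestamp)
```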
Executives and engineering leadership need visibility into infrastructure health and capacity trends without diving into operational details.
With this system:
- Weekly/monthly capacity reports generated automatically
- Trend analysis showing utilization patterns and headroom across all work units
- Cost attribution for traffic routing decisions
- Risk assessment highlighting work units approaching capacity limits
Value: Clear visibility into infrastructure investment needs, data-backed budget discussions, reduced surprise escalations.
With AI-assisted operations, most system administration is fully automated. Humans intervene only when automation fails or decisions require approval.
Rolling Updates and Upgrades
| Automated | Human Intervention (Exception Only) |
|---|---|
| Dependabot/Renovate opens PRs for dependency updates (requires initial setup) | Approve PR if breaking changes detected |
| CI/CD runs tests, builds images, deploys to staging on PR merge | Review failed test results |
| Staging validation (health checks + smoke tests) gates prod promotion | Investigate if staging validation fails |
| K8s performs rolling restart with readiness probes; auto-rollback on failure | Post-mortem if rollback triggered |
Human touch: Only when tests fail, health checks fail, or breaking changes need review. Requires: CI/CD pipeline setup, container registry, Helm chart structure.
Debugging System Issues
| Automated | Human Intervention (Exception Only) |
|---|---|
| Standard K8s/Prometheus alerting flags error rate or latency anomalies | Investigate alerts that don't auto-resolve |
| Diagnostic dashboard aggregates recent logs, traces, API responses | Deep-dive when root cause unclear |
| Self-healing: K8s restarts failed pods; circuit breakers recover automatically | Exec into pod only for novel failure modes |
| Claude analyzes logs on-demand when operator requests assistance | Validate AI hypothesis before applying fix |
Human touch: Only for novel failures that self-healing can't resolve. Note: "AI-assisted log analysis" requires operator to invoke Claude; not fully autonomous.
Scaling and Resource Tuning
| Automated | Human Intervention (Exception Only) |
|---|---|
| HPA scales replicas based on CPU/memory thresholds (requires HPA configuration during build) | Review if scaling events are excessive |
| Cluster Autoscaler provisions nodes as needed (requires autoscaler setup) | Investigate if node provisioning fails |
| Cost monitoring dashboards with threshold alerts | Approve infrastructure cost increases |
| Quarterly review: analyze usage data, propose right-sizing (manual analysis, not AI-driven initially) | Approve major instance type changes |
Human touch: Only for cost approvals or when auto-scaling behaves unexpectedly. Note: "AI recommends instance types" is a future enhancement, not Phase 1.
Configuration Changes
| Automated | Human Intervention (Exception Only) |
|---|---|
| Capacity baselines recalculated weekly via scheduled job | Review if baseline changes significantly (>10% drift) |
| Traffic affinity matrix refreshes daily via batch job | Investigate if any region loses viable failover targets |
| Threshold change proposals generated as draft PRs (requires building this workflow) | Approve PR (required gate) |
| CI/CD validates in staging, auto-promotes if tests pass | Review if staging validation fails |
Human touch: PR approval (intentional gate) and exception handling. Note: "Auto-generated PRs" requires building integration between analytics and Git; this is Phase 2 work.
Disaster Recovery
| Automated | Human Intervention (Exception Only) |
|---|---|
| Health checks detect pod/node failure; K8s reschedules automatically | Investigate if rescheduling repeatedly fails |
| Standby pods start via K8s scaling (requires pre-configured scaling policies) | Investigate if standby fails to start |
| Analytics data restored from scheduled EBS snapshots (requires snapshot configuration) | Validate data integrity post-restore |
| Work unit agents reconnect to central via standard K8s service discovery | Manual re-registration if agent state corrupted |
Human touch: Only when automated recovery fails. Note: Full regional DR (control plane failover to different region) is not in initial scope; would require additional infrastructure.
Bottom line: The system runs itself. Humans approve gates, investigate exceptions, and handle novel failures—not routine operations.
| Role | Primary Interface | Interaction Type |
|---|---|---|
| On-Call Operator | Chat/CLI + existing alerting tools | Real-time recommendations, natural language queries |
| Capacity Planner | Dashboards + interactive analysis UI | Forecasts, what-if scenarios, reports |
| Incident Commander | Chat/CLI during incident | Situational awareness, decision support |
| Post-Incident Reviewer | Investigation UI | Timeline, correlation, root cause summary |
| Engineering Leadership | Scheduled reports + dashboards | Trends, forecasts, risk assessment |
Establish data access and validate assumptions.
- Confirm Data Warehouse interface requirements with DW team; define query contract for historical data access
- Validate time-series metrics accessibility from central location; confirm metrics availability and retention
- Document anomaly detection system API and available signals
- Clarify distributed tracing coverage and sampling strategy
- Set up Claude Code environment for early adopter experimentation
- Define initial MCP server interface for Claude Code integration
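As an illustration of what that initial MCP server interface might look like, a minimal server exposing a single PromQL query tool via the MCP Python SDK; the internal metrics endpoint URL is a placeholder.

```python
# pip install "mcp[cli]" requests
import requests
from mcp.server.fastmcp import FastMCP

PROMETHEUS_URL = "http://metrics.internal:9090"  # hypothetical internal endpoint

mcp = FastMCP("observability")

@mcp.tool()
def query_metrics(promql: str) -> str:
    """Run an instant PromQL query against the time-series metrics system."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    return resp.text  # raw JSON; Claude interprets the result

if __name__ == "__main__":
    # Runs over stdio so Claude Code can launch it as a local MCP server.
    mcp.run()
```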
Exit Criteria: Documented data access patterns for all source systems; working Claude Code prototype that can query at least one data source.
Build the core decision engine in observation-only mode.
- Implement capacity baseline models using historical Data Warehouse data
- Build traffic affinity matrix from empirical TTFB data
- Create near real-time health monitoring pipeline (time-series metrics + anomaly detection)
- Develop failover recommendation logic
- Deploy central orchestrator in shadow mode—generates recommendations, logs decisions, but takes no action
- Compare system recommendations against actual human operator decisions
- Iterate on models based on discrepancies
Exit Criteria: System generates failover recommendations that match human decisions ≥90% of the time over a 4-week observation period.
Increase system involvement while maintaining human control.
- Surface recommendations to operators in real-time during incidents
- Provide "what-if" analysis tools for capacity planning via visualization platform integration or custom UI
- Implement post-incident analysis automation
- Deploy work unit agents for local context gathering
- Build cost modeling for traffic migration scenarios
- Extend audit logging to capture full decision context
Exit Criteria: Operators actively use system recommendations during incidents; post-incident analysis time reduced by 50%.
Gradually transfer decision authority to the system.
- Define automation runbook with explicit human approval gates
- Implement staged failover capability (if traffic control supports it)
- Enable automated alerting based on system recommendations
- Pilot automated failover for low-risk scenarios with immediate human notification
- Expand automation scope based on demonstrated reliability
Exit Criteria: Defined per-scenario based on risk tolerance and demonstrated system accuracy.
Security is a first-class concern, not an afterthought. The system will be designed with the following principles:
Service-to-Service (Agents):
- All agent communication uses mutual TLS with certificate-based identity
- Workload identity attestation ensures agents only accept requests from verified peers
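For reference, a sketch of what mutual TLS between the orchestrator and a work unit agent looks like with gRPC in Python, omitting service registration; the certificate paths are placeholders and would in practice be supplied by the workload identity system rather than static files.

```python
# pip install grpcio
from concurrent import futures
import grpc

def read(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

# Placeholder certificate material (in practice issued via workload identity, e.g. SPIFFE).
CA, KEY, CERT = read("ca.pem"), read("agent.key"), read("agent.pem")

# Central orchestrator side: require a client certificate from every work unit agent.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
server_creds = grpc.ssl_server_credentials(
    [(KEY, CERT)], root_certificates=CA, require_client_auth=True
)
server.add_secure_port("[::]:8443", server_creds)

# Work unit agent side: present its own certificate and verify the orchestrator's.
channel_creds = grpc.ssl_channel_credentials(
    root_certificates=CA, private_key=KEY, certificate_chain=CERT
)
channel = grpc.secure_channel("orchestrator.internal:8443", channel_creds)
```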
Human Access:
- SSO integration with existing identity provider (SAML/OIDC) for all operator-facing interfaces
- No local accounts; all access tied to corporate identity
Role-Based Access Control:
| Role | Permissions | Typical Users |
|---|---|---|
| Viewer | View dashboards, recommendations, and audit logs | All engineers, leadership |
| Operator | Acknowledge/override recommendations, trigger manual analysis, ask ad-hoc questions | On-call SREs, incident commanders |
| Admin | Modify capacity thresholds, update traffic affinity rules, manage access | Platform team leads |
| System Config | Deploy model updates, modify agent configuration, update MCP schemas | Build team, designated SREs |
Configuration Management:
- System configuration (thresholds, affinity rules, alert definitions) managed via GitOps—changes require PR review and merge to deploy
- Operational actions (acknowledge recommendation, trigger failover) performed directly in operator UI with audit logging
- No direct production access; all changes flow through version-controlled pipelines or audited UI actions
- Central orchestrator runs in isolated admin cluster with restricted network policies
- Work unit agents communicate only with central orchestrator via authenticated channels
- No direct work-unit-to-work-unit agent communication without central coordination
- All cross-network traffic encrypted in transit
- Every observation, recommendation, and action is logged with full context:
- What signals were observed
- What recommendation was generated
- Who or what approved the action (human operator ID or automated policy reference)
- What action was taken
- Outcome of the action
- Audit logs shipped to centralized logging system with append-only retention policies (no deletion or modification within retention window)
- Regular audit review as part of operational process
- First directive: Do no harm. When uncertain, the system alerts humans rather than taking action.
- Network partition handling: If central orchestrator loses connectivity to a work unit agent, it does not assume the work unit is healthy or unhealthy—it flags the uncertainty.
- Agent crashes/restarts: Agents resume in safe state; no action taken based on stale data
- Consensus requirements: For high-impact actions (full work unit failover), require corroborating signals from at least two independent sources (e.g., anomaly detection + metrics threshold breach, or metrics + operator report) before recommending
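A minimal sketch of that corroboration gate, assuming each health signal is tagged with the system that produced it (the schema is illustrative).

```python
def corroborated(signals: list[dict], minimum_sources: int = 2) -> bool:
    """True only if at least `minimum_sources` independent systems indicate distress.

    Each signal is a dict like {"source": "anomaly-detection", "indicates_distress": True};
    the field names are illustrative.
    """
    distressed_sources = {s["source"] for s in signals if s.get("indicates_distress")}
    return len(distressed_sources) >= minimum_sources

# Example: anomaly detection plus a metrics threshold breach clears the gate;
# a single source, however loud, does not.
signals = [
    {"source": "anomaly-detection", "indicates_distress": True},
    {"source": "metrics-threshold", "indicates_distress": True},
    {"source": "operator-report", "indicates_distress": False},
]
assert corroborated(signals) is True
```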
- Customer data is never processed by the AI layer; only aggregated metrics and operational telemetry
- Claude API calls do not include PII or customer-identifying information
- All data processing respects existing data governance policies
The system leverages existing infrastructure where possible and introduces new components only where necessary.
AI Layer:
- Central orchestrator runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning
- Work unit agents run local LLMs (Llama 3) for network partition resilience and API cost elimination
- Claude Code with custom MCP servers for development and early adopter experimentation
Data Infrastructure: Builds on existing time-series metrics, logging, and data warehouse investments. New components limited to pre-computed analytics stores for derived datasets.
Build vs. Buy: Custom development required for the decision engine, capacity models, and MCP integrations. Standard tooling for everything else.
See Appendix C: Detailed Technology Stack for component-level recommendations.
The following questions require input from relevant teams before implementation can proceed:
- What is the interface for querying the Data Warehouse (REST API, SQL, other), and what are the latency characteristics for interactive vs. batch queries?
- What is the current distributed tracing sampling strategy, and can we access trace data programmatically for correlation?
- What is the current mechanism for traffic failover (DNS, load balancer, application-level), and is there an API for automated control?
- Can traffic be shifted gradually (e.g., 10% increments), or is it all-or-nothing?
- What are the current documented runbooks for failover decisions?
- What is the standard for service-to-service authentication, and are there specific compliance requirements (SOC2, etc.) affecting audit logging?
- Which work units or regions should be prioritized for initial rollout?
| Term | Definition |
|---|---|
| Work Unit | A complete deployment of the service stack in a specific region; also referred to as a cluster |
| TTFB | Time to First Byte; a measure of latency from user request to initial response |
| Thundering Herd | A spike in resource utilization when a large number of clients simultaneously connect to a newly-activated work unit |
| Failover | The process of shifting traffic away from a degraded work unit to healthy alternatives |
| Shadow Mode | Operating mode where the system generates recommendations but takes no automated action |
| Traffic Affinity | The mapping of customer populations to optimal work units based on latency and capacity |
| Headroom | Available capacity on a work unit; the difference between current utilization and maximum safe capacity |
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | Dec 2024 | D. Park | Initial draft for executive review |
| 1.0 | Dec 2024 | D. Park | Added team requirements, human interaction model, expanded tech details |
| Component | Recommendation | Rationale |
|---|---|---|
| AI Reasoning (Production) | Custom agent built on Claude SDK or LangChain with Opus 4.5 | Highest reasoning capability; custom agent enables fine-grained control over context management, token usage, and task execution; allows tailored tool definitions and guardrails specific to our operational domain |
| AI Reasoning (Development) | Claude Code with custom MCP server | Rapid iteration; direct tool access for early adopters |
| Work Unit Local Inference | Llama 3 (8B or 70B) via vLLM or Ollama | Required for network partition resilience; eliminates API costs at scale; sub-second local response times |
| Time-Series Metrics | Prometheus | Industry standard; powerful PromQL; native alerting |
| Time-Series Forecasting | Prophet or NeuralProphet | Handles sinusoidal patterns and missing data |
| Pre-Computed Analytics Store | DuckDB or Apache Iceberg on Parquet | Fast analytical queries on derived datasets |
| Centralized Logging | Splunk or OpenSearch | Enterprise features and compliance |
| Distributed Tracing | Jaeger or Zipkin | Kubernetes-native; OpenTelemetry support |
| Visualization Platform | Tableau or Grafana | See integration notes below |
| Container Orchestration | Kubernetes (EKS, GKE, or AKS) | Industry standard; strong ecosystem |
| Inter-Agent Communication | gRPC with mTLS | Strong typing, efficient serialization |
| Audit Logging | Splunk or ELK stack | Searchable, compliance-ready |
| Use Case | Data Source | Refresh Rate | Dashboard Type |
|---|---|---|---|
| Real-time work unit health | Time-series metrics | 1-5 min | Operational |
| Current headroom by region | Pre-computed analytics | 5-15 min | Operational |
| 30/60/90 day capacity forecast | Data Warehouse | Daily | Planning |
| "What-if" traffic migration | Data Warehouse + Cost models | On-demand | Interactive |
| Post-incident timeline | Logs + Traces + Metrics | On-demand | Investigation |
Tableau Integration:
| Connection Type | Use Case | How It Works |
|---|---|---|
| Live Connection | Real-time operational dashboards | Tableau queries the data source directly on each dashboard load or refresh. For Prometheus, this requires an intermediary (e.g., Prometheus → PostgreSQL via query exporter, or a REST API data source). Queries execute in real-time; no data cached in Tableau. |
| Extract (Hyper) | Historical analysis, capacity planning | Tableau pulls data on a schedule (hourly/daily), stores it in a compressed .hyper file. Faster queries for large datasets; enables offline analysis. Required for Data Warehouse queries that are too slow for interactive use. |
| Embedded Analytics | Operator tooling integration | Tableau views embedded in internal tools via Tableau Embedding API. SSO passthrough via SAML/JWT. Allows dashboards within existing operator workflows. |
Tableau Server or Tableau Cloud is required for scheduled extract refreshes, shared dashboards, and embedding. Desktop-only deployments cannot support team-wide operational use.
Grafana Integration:
| Data Source | Connection Method | Capabilities |
|---|---|---|
| Prometheus | Native plugin (built-in) | PromQL queries, alerting rules, annotation support. Sub-second refresh for real-time dashboards. |
| OpenSearch/Elasticsearch | Native plugin | Log exploration, full-text search, aggregations for log-based dashboards. |
| PostgreSQL/DuckDB | SQL plugin | Query pre-computed analytics tables. Supports variables and templating for interactive filtering. |
| Data Warehouse | JDBC/ODBC or custom plugin | Historical queries via SQL interface. May require caching layer for acceptable latency. |
Grafana excels at operational dashboards with real-time metrics. For complex business analytics (cohort analysis, executive summaries with calculated fields), Tableau offers more flexibility.
Build (Custom Development Required):
| Component | What It Does | Key Technical Work |
|---|---|---|
| MCP Server for Claude Code | Exposes internal systems (metrics, logs, anomaly detection) as tools that Claude can invoke via the Model Context Protocol | Define tool schemas for each data source; implement authentication/authorization; handle rate limiting; translate natural language queries to PromQL/SQL |
| Traffic Affinity Matrix Pipeline | Generates and maintains the region-to-work-unit routing preferences based on latency data | Scheduled job that queries TTFB percentiles from Data Warehouse; computes ranked failover targets per region; outputs to analytics store; handles edge cases (new regions, decommissioned work units) |
| Capacity Baseline Modeling | Determines the maximum safe load for each work unit based on historical performance | Analyze historical metrics during peak load; identify limiting resources (CPU, memory, connections, network); compute per-work-unit capacity thresholds with safety margins; update baselines after infrastructure changes |
| Anomaly Detection Integration | Connects the decision engine to the existing anomaly detection system | API client for anomaly detection service; polling or webhook-based signal ingestion; normalization of anomaly scores to common format; historical anomaly retrieval for post-incident analysis |
| Failover Recommendation Engine | Core decision logic that combines signals into actionable recommendations | Rule engine or ML model that weighs health signals, capacity headroom, traffic affinity, and thundering herd risk; generates ranked recommendations with confidence scores; supports override rules for known scenarios |
Leverage Existing:
| System | What We Use | Integration Approach |
|---|---|---|
| Time-series metrics | Real-time health signals, utilization data | Query via existing API; no changes to collection pipeline |
| Centralized logging | Log correlation for post-incident analysis | Query via existing search API; may need new saved searches/alerts |
| Distributed tracing | Request path analysis during incidents | Query via existing API; requires adequate sampling for correlation |
| Data Warehouse | Historical analysis, forecasting input | New query patterns; may need additional ETL for derived tables |
| Authentication | SSO for human users, service identity for agents | SAML/OIDC integration for UIs; workload identity for service-to-service |
Evaluate Commercial Options:
| Option | Potential Benefit | Consideration |
|---|---|---|
| Unified observability platform (Datadog, New Relic, Dynatrace) | Single pane of glass; built-in anomaly detection; reduced integration work | May not integrate well with existing bespoke systems; licensing costs at scale; vendor lock-in |
| Commercial AIOps tools | Pre-built correlation and recommendation engines | Often require significant customization; may not understand domain-specific capacity constraints |
| Managed LLM inference | Reduce operational burden of running local LLMs | Adds network dependency; per-query costs; may not meet latency requirements for real-time decisions |
This document is intended for internal planning purposes. Distribution should be limited to stakeholders involved in project scoping and approval.