
Predictive Service Availability and Capacity Planning

Project Proposal for Executive Review

Version 1.0
Author David Park
Date December 2024

Executive Summary

Our global service infrastructure operates across multiple work units (clusters) worldwide, each capable of generating millions of metrics and log lines per second. Today, critical decisions around capacity planning, traffic failover, and incident response rely heavily on human judgment and tribal knowledge. Operators manually interpret dashboards, estimate work unit capacity, and determine optimal traffic routing during incidents—processes that are time-sensitive, error-prone, and do not scale.

This project proposes building an intelligent system that leverages our existing observability infrastructure—time-series metrics, centralized logging, distributed tracing, our anomaly detection system, and our data warehouse—to provide predictive capacity planning, automated failover recommendations, and post-incident analysis. The system will use Claude (via Claude Code and the Claude API) as the reasoning layer to interpret signals, correlate data across sources, and generate actionable recommendations.

The approach is deliberately phased: we begin in shadow mode, where the system observes and recommends alongside human operators without taking action. Only after building confidence through validated recommendations do we consider progressive automation. The guiding principle throughout is "do no harm"—the system's failure mode is always to alert humans rather than take potentially harmful autonomous action.


Current State

Infrastructure Overview

Our service runs globally across multiple work units, each representing a complete deployment of our service stack in a specific region. Each work unit generates telemetry at scale:

  • Metrics: Collected via a metrics aggregation system that reduces cardinality by aggregating across hundreds of identical service instances, plus raw time-series metrics for select high-value signals. Metrics are shipped to a centralized metrics bucket work unit and eventually loaded into the Data Warehouse via ETL.

  • Logs: Ingested by local log indexers within each work unit, then shipped to centralized indexers for retention (90-day window).

  • Traces: Distributed tracing is available, though current coverage and sampling strategy require clarification.

  • Anomaly Detection: A bespoke internal system provides current-state and historical anomaly data via API.

  • Data Warehouse: A columnar analytics database that ETLs all telemetry into a common schema, providing up to 2 years of historical data for batch and ad-hoc analysis.

Current Operational Challenges

Capacity Planning: Work unit capacity limits are based on historical load tests and empirical observation from traffic migrations. There is no systematic, data-driven model of true capacity thresholds or the factors that determine them.

Failover Decisions: When a work unit shows signs of distress (typically observed as drops in network connection metrics), a human operator must:

  1. Recognize the problem from dashboard observations
  2. Decide to initiate failover
  3. Determine which work unit(s) should receive the traffic based on mental models of regional proximity and capacity
  4. Execute the failover manually

This process depends on operator experience and availability. The decision criteria are not codified, and there is no systematic way to validate that the destination work unit can absorb the additional load.

Traffic Patterns: Traffic follows predictable sinusoidal patterns tied to local time zones (peak usage around 9-10 PM local time). Work units experience "thundering herd" spikes when first receiving traffic, followed by steady-state operation once devices are tethered. These patterns are well-understood qualitatively but not modeled quantitatively.

Post-Incident Analysis: After a failover event, understanding root cause requires manually correlating metrics, logs, and traces across systems—a time-consuming process that often yields incomplete answers.


Project Goals

Primary Goals

  1. Capacity Planning and Traffic Forecasting: Build predictive models for traffic volume by region and time, enabling proactive capacity decisions. Understand and quantify the true capacity limits of each work unit based on historical data.

  2. Cost Modeling for Traffic Migration: Provide visibility into cloud provider cost implications when migrating traffic between regions, enabling cost-aware routing decisions.

  3. Load Shedding Analysis: Answer the question "if work unit X sheds load, can work units Y and Z absorb it?" with data-driven confidence, accounting for current utilization and capacity headroom.

  4. Near Real-Time Failover Recommendations: Monitor work unit health at 5-minute granularity (or finer) and generate failover recommendations that account for:

    • Current health signals and anomaly state
    • Regional proximity (minimize TTFB impact)
    • Destination capacity and current load
    • Thundering herd considerations for cold work units
  5. Post-Incident Analysis and Remediation: After a failover event, automatically correlate relevant metrics, logs, and traces to identify root cause and generate remediation recommendations.

  6. Claude-Powered Reasoning: Use Claude (via Claude Code for early adopters, Claude API for production agents) as the intelligent layer that interprets signals, answers ad-hoc questions, and generates human-readable recommendations.

Success Criteria

  • Shadow mode recommendations match human operator decisions ≥90% of the time
  • Time-to-decision for failover reduced from minutes to seconds
  • Capacity forecasts accurate within ±15% over 30-day horizons
  • Post-incident root cause identification completed within 15 minutes of incident close
  • Zero automated actions taken without explicit human approval until shadow mode validation complete

Team Requirements and Investment

Team Structure

| Role | Count | Responsibility |
|------|-------|----------------|
| Tech Lead / Architect | 1 | System design, cross-team coordination, technical decisions |
| Backend Engineers | 2-3 | Core platform, decision engine, agent framework |
| Data/ML Engineer | 1 | Forecasting models, capacity baselines, analytics pipelines |
| SRE (embedded) | 1 | Production readiness, observability, runbooks |

Total: 5-6 engineers for build phase

Ongoing Operations: Existing SRE team absorbs operational responsibility post-launch. System designed for minimal operational overhead with standard Kubernetes patterns, automated health checks, and self-healing agents. See Operations Handoff section below for details.

AI-Assisted Development Adoption

The AI-assisted estimates assume the team is proficient with Claude Code. We recommend a one-week bootcamp before project work begins, followed by continued learning while delivering:

Week 0: AI Development Bootcamp (dedicated time, before Phase 0)

| Day | Focus | Activities |
|-----|-------|------------|
| Day 1-2 | Setup & Fundamentals | Dev environment configuration, Claude Code installation, API access, basic prompting techniques, MCP server concepts |
| Day 3-4 | Capstone: MCP Server Prototype | Group builds a working MCP server that queries one data source (e.g., time-series metrics). Real Phase 0 deliverable: team practices AI-assisted development while producing shippable code, rotating through driver/navigator roles, reviewing each other's AI-generated PRs |
| Day 5 | Team Workflows | Refine PR review norms based on capstone experience, document shared prompt patterns that worked, establish team conventions, retrospective |

Weeks 1-4: Learning While Delivering (parallel with Phase 0)

| Milestone | What It Looks Like |
|-----------|--------------------|
| Week 1-2 | Applying bootcamp skills to real tasks; occasional check-ins with AI tooling champion |
| Week 3-4 | Developing personal patterns; full proficiency with autonomous sessions |

Key success factors:

  • Designate an AI tooling champion to establish team conventions and share learnings
  • Start with well-scoped, low-risk tasks during ramp-up (e.g., boilerplate, tests, documentation)
  • Establish code review norms for AI-generated code (same standards as human-written code)
  • Create shared prompt libraries and MCP configurations for common project tasks

Recommendation: Run Phase 0 with 1-2 engineers who have prior Claude Code experience. Use this phase to develop team-specific patterns before scaling to the full team in Phase 1.

Engineering Investment Estimate

The following estimates assume leveraging existing observability infrastructure and focus purely on engineering effort.

| Phase | Traditional Approach | AI-Assisted Approach | Notes |
|-------|----------------------|----------------------|-------|
| Phase 0: Foundation | 6-8 engineer-weeks | 3-4 engineer-weeks | Data access validation, MCP server scaffolding |
| Phase 1: Shadow Mode | 16-20 engineer-weeks | 8-10 engineer-weeks | Core decision engine, capacity models, health monitoring |
| Phase 2: Assisted | 12-16 engineer-weeks | 6-8 engineer-weeks | Operator tooling, post-incident automation, work unit agents |
| Phase 3: Progressive Automation | 8-12 engineer-weeks | 4-6 engineer-weeks | Automation runbooks, staged rollout |
| Total | 42-56 engineer-weeks | 21-28 engineer-weeks | |

AI-Assisted Approach: Assumes use of Claude Code for rapid prototyping, code generation, and iterative development. Estimates based on internal pilot projects using AI-assisted development; actual results will vary based on task complexity and team familiarity with AI tooling.

Return on Investment

Investment pays back through:

  • Reduced incident duration: Faster failover decisions prevent cascading failures and customer impact
  • Reduced operator toil: Automated analysis replaces hours of manual dashboard correlation
  • Prevented outages: Proactive capacity alerts catch issues before they become incidents
  • Operator scaling: System handles routine decisions, freeing SREs for higher-value work

Quantified ROI depends on current incident frequency and cost-per-incident metrics, which should be gathered during Phase 0.


Operational Concerns

Operations Handoff

Once the system reaches production, the following responsibilities transfer to the SRE team. Most routine operations are automated; human intervention is primarily required for approval gates and exception handling.

| Category | Automated | Human Intervention (Gating) |
|----------|-----------|-----------------------------|
| System Health | Self-healing pods; automated alerting; health dashboards | Escalation for alerts that don't self-resolve |
| Analytics Model Maintenance | Scheduled retraining pipelines; automated validation | Approve model updates before prod deployment |
| LLM Operations | API key rotation; usage/cost dashboards with alerts | Approve version upgrades; investigate anomalies |
| Access Control | SSO integration; automated provisioning | Approve elevated access; periodic reviews |
| Audit & Compliance | Automated report generation; log retention | Monthly audit review; compliance inquiries |

Estimated human effort: 4-8 hours/week, trending lower once stable.

Handoff deliverables:

  • Operational runbooks and architecture docs
  • Monitoring dashboards and alert definitions
  • On-call escalation procedures
  • Knowledge transfer sessions (2-3 sessions, 2 hours each)

Infrastructure Cost Estimate (AWS)

All components are Kubernetes-native, deployed via Helm. Estimates use on-demand pricing; Reserved Instances reduce costs 30-50%.

POC / Phase 0-1:

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Orchestrator + MCP | 2x m5.large (sidecar pattern) | $140 |
| Analytics (DuckDB) | 1x r6i.large + 100GB gp3 | $100 |
| Work Unit Pod (1 unit) | 1x g4dn.xlarge (agent + Llama 3 8B) | $380 |
| Networking | ALB + data transfer | $50 |
| Claude API | Low volume | $500 |
| POC Total | | ~$1,200/month |

Production (multi-AZ, Kubernetes-native HA):

Control Plane (3 replicas across AZs, K8s handles failover):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Orchestrator + MCP | 3x m5.large across 3 AZs | $210 |
| Analytics | 1x r6i.large + 200GB gp3 (with snapshots) | $120 |
| Networking | ALB (multi-AZ) + cross-region transfer | $150 |
| Claude API | Production volume with caching | $1,500-2,500 |
| Control Plane Total | | ~$2,000-3,000/month |

Per-Work-Unit (1 active pod, 1 standby in different AZ):

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| Work Unit Pod (active) | 1x g4dn.xlarge (agent + LLM) | $380 |
| Work Unit Pod (standby) | 1x g4dn.xlarge (stopped until failover) | $0* |
| Storage | 50GB gp3 (model + state) | $4 |
| Cross-region transfer | To central orchestrator | $20 |
| Per-Unit Total | | ~$400/month |

*Standby instances stopped by default; only pay for EBS. Start on failover via K8s scaling.

Scaling:

| Work Units | Work Unit Cost | Control Plane | Total Monthly |
|------------|----------------|---------------|---------------|
| 5 | $2,000 | $2,500 | ~$4,500 |
| 10 | $4,000 | $2,800 | ~$6,800 |
| 15 | $6,000 | $3,100 | ~$9,100 |
| 20 | $8,000 | $3,400 | ~$11,400 |

Non-Prod Environments (single AZ, no HA):

| Environment | Monthly Cost |
|-------------|--------------|
| Staging | $600 |
| Acceptance | $600 |
| Non-Prod Total | ~$1,200/month |

Deployment:

  • Helm charts with environment-specific values (values-prod.yaml, values-staging.yaml)
  • CI/CD: merge to main → staging, tagged release → prod

Cost optimization: Reserved Instances (30-50% off), right-size after POC, Claude API caching.


Proposed Architecture

Conceptual Layers

Layer 1: Data Sources

All existing telemetry systems feed into the architecture:

  • Time-series metrics system (per-work-unit, ~2-week retention, real-time)
  • Centralized logging system (logs, 90-day retention)
  • Distributed tracing system
  • Anomaly Detection System (current state + historical via API)
  • Data Warehouse (2-year historical, batch queries)

Layer 2: Pre-Computed Analytics

Rather than querying raw data at decision time, we build derived datasets optimized for each use case:

  • Capacity Baselines: Per-work-unit capacity models derived from historical load data. Implementation: Analyze Data Warehouse metrics from past peak events; identify which resource (CPU, memory, connections, network) saturates first; set threshold at 80% of observed max. Requires: Access to historical peak load data; validation that past peaks represent true limits.
  • Traffic Affinity Matrix: Region-to-work-unit mapping based on empirical TTFB data. Implementation: Daily batch job queries P50/P95 TTFB by source region and destination work unit; ranks work units per region. Requires: TTFB data accessible in Data Warehouse with region tagging.
  • Headroom Calculations: Current utilization vs. baseline capacity. Implementation: Scheduled query (every 5 min) computes (baseline - current) / baseline per work unit. Requires: Real-time metrics accessible from central location (a sketch of this calculation follows this list).
  • Cost Models: Cloud provider pricing by traffic path. Implementation: Static lookup table of region-to-region transfer costs; updated manually when pricing changes. Requires: Documented cloud pricing for relevant regions.
  • Traffic Forecasts: Predict load by work unit. Implementation: Prophet model trained on 90-day historical load data; captures daily/weekly seasonality. Requires investigation: Validate Prophet handles our traffic patterns; may need custom seasonality for regional holidays.
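
A minimal sketch of the capacity-baseline and headroom calculations described above, using DuckDB over a hypothetical Parquet extract of peak-load history; the column names and the choice of `connections` as the limiting resource are assumptions pending the Phase 0 data review.

```python
import duckdb

SAFETY_FACTOR = 0.80  # threshold at 80% of observed max, per the baseline definition above


def capacity_baselines(parquet_path: str) -> dict[str, float]:
    """Derive per-work-unit connection baselines from historical peak data.

    Assumes a hypothetical Parquet extract with columns
    (work_unit, ts, connections); real schemas come from the DW team in Phase 0.
    """
    rows = duckdb.sql(
        f"""
        SELECT work_unit, MAX(connections) AS observed_max
        FROM read_parquet('{parquet_path}')
        GROUP BY work_unit
        """
    ).fetchall()
    return {work_unit: observed_max * SAFETY_FACTOR for work_unit, observed_max in rows}


def headroom(baseline: float, current: float) -> float:
    """Headroom fraction as defined above: (baseline - current) / baseline."""
    return (baseline - current) / baseline


# Example: a work unit at 70k connections against a 100k-connection baseline
# has headroom(100_000, 70_000) == 0.3, i.e., 30% headroom.
```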

Layer 3: Decision Engine

Two operational modes:

Near Real-Time (Goals #4, #5):

  • Polls anomaly detection system API and time-series metrics every 5 minutes (interval configurable; faster polling requires validating API rate limits)
  • Compares current utilization against pre-computed capacity baselines; flags when headroom drops below threshold (e.g., <20%)
  • When anomaly detected OR headroom low: queries traffic affinity matrix for ranked failover targets; filters to targets with sufficient headroom; generates recommendation with reasoning
  • Outputs recommendation to human operators via alerting integration (shadow mode) or triggers automated workflow via traffic control API (future state, requires investigation: confirm traffic control API exists and supports programmatic failover)
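
A minimal sketch of this near real-time evaluation pass, assuming hypothetical helpers (`fetch_anomalies`, `fetch_utilization`, `ranked_targets`) that wrap the anomaly-detection API, the metrics system, and the affinity matrix; the real integrations are Phase 1 work.

```python
POLL_INTERVAL_S = 300       # 5-minute cadence; configurable, subject to API rate limits
HEADROOM_THRESHOLD = 0.20   # flag when headroom drops below 20%


def evaluate(work_unit, baselines, fetch_anomalies, fetch_utilization, ranked_targets):
    """Return a recommendation dict for one work unit, or None if healthy.

    fetch_anomalies / fetch_utilization / ranked_targets are hypothetical
    callables standing in for the anomaly API, metrics API, and affinity matrix.
    """
    anomalies = fetch_anomalies(work_unit)
    current = fetch_utilization(work_unit)
    baseline = baselines[work_unit]
    headroom = (baseline - current) / baseline

    if not anomalies and headroom >= HEADROOM_THRESHOLD:
        return None  # healthy: nothing to recommend

    # Candidate targets from the affinity matrix, filtered to those with spare headroom.
    candidates = [
        t for t in ranked_targets(work_unit)
        if (baselines[t] - fetch_utilization(t)) / baselines[t] >= HEADROOM_THRESHOLD
    ]
    return {
        "work_unit": work_unit,
        "signals": {"anomalies": anomalies, "headroom": round(headroom, 2)},
        "recommended_targets": candidates[:2],  # top-ranked viable destinations
        "mode": "shadow",  # shadow mode: logged and alerted, never executed
    }

# The outer loop simply sleeps POLL_INTERVAL_S between evaluation passes over all work units.
```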

Batch/Interactive (Goals #1, #2, #3):

  • Runs Prophet forecasting models against Data Warehouse historical data; outputs 30/60/90-day load predictions per work unit (see the sketch after this list)
  • Answers ad-hoc "what-if" queries via Claude: user asks natural language question → Claude translates to Data Warehouse query + capacity model lookup → returns computed answer with assumptions stated
  • Generates periodic capacity planning reports via scheduled jobs that query forecasts and current headroom, output to dashboards or email
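
A minimal sketch of the Prophet forecasting step referenced above, using Prophet's standard `ds`/`y` dataframe interface; assembling the hourly `history` frame from the Data Warehouse, and any custom seasonality for regional holidays, remain open items.

```python
import pandas as pd
from prophet import Prophet


def forecast_load(history: pd.DataFrame, horizon_days: int = 30) -> pd.DataFrame:
    """Fit a Prophet model on historical load and predict the next horizon_days.

    `history` must carry Prophet's expected columns: ds (timestamp) and y (load),
    here assumed to be hourly. Daily and weekly seasonality capture the
    sinusoidal, time-zone-driven traffic pattern described earlier.
    """
    model = Prophet(daily_seasonality=True, weekly_seasonality=True)
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_days * 24, freq="h")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```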

Layer 4: Agent Architecture

A multi-tier agent topology provides resilience and appropriate separation of concerns:

Central Orchestrator:

  • Runs as a Kubernetes pod in an admin cluster
  • Runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning and cross-work-unit coordination
  • Has visibility into all work units via aggregated data
  • Makes or recommends traffic-affecting decisions
  • Maintains audit log of all observations, recommendations, and actions

Work Unit Agents:

  • Deployed within each work unit
  • Run a lightweight local LLM for fast, local queries (target: sub-100ms for routine health checks; requires validation during POC with actual hardware and model size); an example local-inference call follows this list
  • Can operate independently if central orchestrator is unreachable
  • Provide local context to central orchestrator on request
  • Limited to "safe" autonomous actions (e.g., local alerting) without central coordination
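
As an illustration of the local-inference path, a work unit agent could call its co-located model over Ollama's local REST endpoint so routine queries never leave the work unit. A minimal sketch; the model tag, prompt, and timeout are placeholders to validate during the POC.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local REST endpoint


def summarize_health(metrics_snapshot: dict) -> str:
    """Ask the co-located Llama 3 model to summarize a local health snapshot.

    Stays entirely inside the work unit: no external API call, no customer data.
    """
    prompt = (
        "Summarize the health of this work unit in two sentences, "
        f"flagging any metric outside normal range: {metrics_snapshot}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3:8b", "prompt": prompt, "stream": False},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```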

Communication:

  • Agents communicate via authenticated, attested channels (see Security section)
  • Central orchestrator queries work unit agents; work unit agents do not initiate traffic-affecting actions independently
  • Graceful degradation: if central cannot reach a work unit agent, it flags the gap rather than assuming state
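
A minimal sketch of this query-and-degrade behavior on the orchestrator side; `query_agent` is a hypothetical RPC wrapper (for example, built on the mTLS channel described in the Security section).

```python
def collect_agent_states(agents, query_agent):
    """Poll every work unit agent; never assume state for unreachable agents.

    `agents` maps work-unit name -> endpoint; `query_agent` is a hypothetical
    RPC wrapper that raises on timeout or authentication failure.
    """
    states = {}
    for name, endpoint in agents.items():
        try:
            states[name] = query_agent(endpoint)      # local context from the agent
        except Exception as exc:                      # unreachable: flag, don't guess
            states[name] = {"status": "unknown", "reason": str(exc)}
    return states

# Downstream decision logic treats "unknown" as a reason to alert humans,
# never as evidence that the work unit is healthy or unhealthy.
```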

Data Flow Summary

```mermaid
flowchart TB
    subgraph DataSources["Data Sources (Layer 1)"]
        METRICS[Time-Series Metrics]
        LOGS[Centralized Logging]
        TRACES[Distributed Tracing]
        ANOMALY[Anomaly Detection]
        DW[Data Warehouse]
    end

    subgraph Analytics["Pre-Computed Analytics (Layer 2)"]
        REALTIME[Real-Time Analytics<br/>Headroom, Health Signals]
        BATCH[Batch Analytics<br/>Capacity Baselines, Affinity Matrix,<br/>Cost Models, Forecasts]
    end

    subgraph Decision["Decision Engine (Layer 3)"]
        ENGINE[Decision Engine<br/>Near Real-Time + Interactive]
    end

    subgraph Agents["Agent Architecture (Layer 4)"]
        CENTRAL[Central Orchestrator]
        WU_AGENT1[Work Unit Agent]
        WU_AGENT2[Work Unit Agent]
        WU_AGENT3[Work Unit Agent]
    end

    subgraph Outputs["Outputs"]
        HUMAN[Human Operators]
        AUTO[Automated Workflow]
        VIZ[Dashboards]
    end

    METRICS --> REALTIME
    LOGS --> REALTIME
    TRACES --> REALTIME
    ANOMALY --> REALTIME
    DW --> BATCH

    REALTIME --> ENGINE
    BATCH --> ENGINE

    ENGINE --> CENTRAL

    CENTRAL <--> WU_AGENT1
    CENTRAL <--> WU_AGENT2
    CENTRAL <--> WU_AGENT3

    CENTRAL --> HUMAN
    CENTRAL --> AUTO
    CENTRAL --> VIZ
```

Human Interaction Model

The system is designed to augment human decision-making, not replace it. This section describes how different roles interact with the system and the value each interaction provides.

Operator Experience During Incidents

When a work unit shows signs of distress, on-call operators currently must interpret multiple dashboards, recall tribal knowledge about capacity, and make high-stakes decisions under pressure.

With this system:

  • Operators receive a clear, prioritized recommendation via existing alerting channels (Slack, PagerDuty, etc.): "Recommend failover of Region-A traffic to Region-B. Region-B has 34% headroom and lowest TTFB for affected customers. Region-C is not recommended due to current 78% utilization."
  • The recommendation includes confidence level (high/medium/low based on signal agreement) and the signals that triggered it
  • Operators can ask follow-up questions in natural language via Chat/CLI interface: "What happens if we split traffic between B and C instead?" (requires: MCP server with traffic model access; Claude interprets question, queries model, returns computed answer)
  • One-click acknowledgment to log the decision, whether they follow the recommendation or override it (requires: simple UI or Slack workflow integration)

Value: Faster decisions, reduced cognitive load, consistent decision-making regardless of operator experience level.

Capacity Planning Workflows

Capacity planners currently rely on manual dashboard review, interpretation of historical trends, and periodic ad-hoc analysis to forecast needs.

With this system:

  • Interactive dashboards show 30/60/90-day traffic forecasts by region
  • "What-if" scenarios answer questions like: "If we decommission Work Unit X, can the remaining units handle the load during peak hours?"
  • Cost projections for traffic migration between regions
  • Proactive alerts when forecasted demand approaches capacity thresholds

Value: Data-driven capacity decisions, reduced over-provisioning costs, earlier identification of scaling needs.

Post-Incident Analysis

After an incident, understanding root cause requires hours of manual correlation across metrics, logs, and traces.

With this system:

  • Automated timeline generation: query audit logs + anomaly detection history + operator actions for the incident window; display as chronological event stream. Implementation: Straightforward log aggregation and sorting.
  • Correlated signals: for each event in timeline, fetch related metrics/logs/traces within a ±5-minute window; display in unified view. Implementation: Requires trace IDs or timestamps to correlate across systems; may require investigation if correlation keys aren't available (see the sketch after this list).
  • AI-generated root cause summary: Claude analyzes the timeline and correlated signals, generates hypothesis with cited evidence. Limitation: Quality depends on signal availability; novel failure modes may produce speculative answers that require human validation.
  • Suggested remediation: Claude compares against previous incidents stored in knowledge base. Requires: Building a structured incident history database during Phase 1-2; initially this feature will have limited data to draw from.
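
A minimal sketch of the timeline merge and ±5-minute correlation described above; the event sources and the `ts` field are assumed shapes, since the real correlation keys depend on what the logging and tracing systems expose.

```python
from datetime import timedelta

CORRELATION_WINDOW = timedelta(minutes=5)


def build_timeline(*event_sources):
    """Merge events from several sources into one chronological stream.

    Each source is an iterable of dicts with at least a 'ts' (datetime) key,
    e.g., audit-log entries, anomaly history, and operator actions.
    """
    events = [e for source in event_sources for e in source]
    return sorted(events, key=lambda e: e["ts"])


def correlate(event, signals):
    """Return signals (metric/log/trace records) within ±5 minutes of an event."""
    return [s for s in signals if abs(s["ts"] - event["ts"]) <= CORRELATION_WINDOW]
```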

Value: Faster post-incident reviews, more complete root cause analysis, institutional memory that doesn't depend on individual engineers.

Leadership and Planning Reviews

Executives and engineering leadership need visibility into infrastructure health and capacity trends without diving into operational details.

With this system:

  • Weekly/monthly capacity reports generated automatically
  • Trend analysis showing utilization patterns and headroom across all work units
  • Cost attribution for traffic routing decisions
  • Risk assessment highlighting work units approaching capacity limits

Value: Clear visibility into infrastructure investment needs, data-backed budget discussions, reduced surprise escalations.

System Administration

With AI-assisted operations, most system administration is fully automated. Humans intervene only when automation fails or decisions require approval.

Rolling Updates and Upgrades

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Dependabot/Renovate opens PRs for dependency updates (requires initial setup) | Approve PR if breaking changes detected |
| CI/CD runs tests, builds images, deploys to staging on PR merge | Review failed test results |
| Staging validation (health checks + smoke tests) gates prod promotion | Investigate if staging validation fails |
| K8s performs rolling restart with readiness probes; auto-rollback on failure | Post-mortem if rollback triggered |

Human touch: Only when tests fail, health checks fail, or breaking changes need review. Requires: CI/CD pipeline setup, container registry, Helm chart structure.

Debugging System Issues

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Standard K8s/Prometheus alerting flags error rate or latency anomalies | Investigate alerts that don't auto-resolve |
| Diagnostic dashboard aggregates recent logs, traces, API responses | Deep-dive when root cause unclear |
| Self-healing: K8s restarts failed pods; circuit breakers recover automatically | Exec into pod only for novel failure modes |
| Claude analyzes logs on-demand when operator requests assistance | Validate AI hypothesis before applying fix |

Human touch: Only for novel failures that self-healing can't resolve. Note: "AI-assisted log analysis" requires operator to invoke Claude; not fully autonomous.

Scaling and Resource Tuning

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| HPA scales replicas based on CPU/memory thresholds (requires HPA configuration during build) | Review if scaling events are excessive |
| Cluster Autoscaler provisions nodes as needed (requires autoscaler setup) | Investigate if node provisioning fails |
| Cost monitoring dashboards with threshold alerts | Approve infrastructure cost increases |
| Quarterly review: analyze usage data, propose right-sizing (manual analysis, not AI-driven initially) | Approve major instance type changes |

Human touch: Only for cost approvals or when auto-scaling behaves unexpectedly. Note: "AI recommends instance types" is a future enhancement, not Phase 1.

Configuration Changes

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Capacity baselines recalculated weekly via scheduled job | Review if baseline changes significantly (>10% drift) |
| Traffic affinity matrix refreshes daily via batch job | Investigate if any region loses viable failover targets |
| Threshold change proposals generated as draft PRs (requires building this workflow) | Approve PR (required gate) |
| CI/CD validates in staging, auto-promotes if tests pass | Review if staging validation fails |

Human touch: PR approval (intentional gate) and exception handling. Note: "Auto-generated PRs" requires building integration between analytics and Git; this is Phase 2 work.

Disaster Recovery

| Automated | Human Intervention (Exception Only) |
|-----------|-------------------------------------|
| Health checks detect pod/node failure; K8s reschedules automatically | Investigate if rescheduling repeatedly fails |
| Standby pods start via K8s scaling (requires pre-configured scaling policies) | Investigate if standby fails to start |
| Analytics data restored from scheduled EBS snapshots (requires snapshot configuration) | Validate data integrity post-restore |
| Work unit agents reconnect to central via standard K8s service discovery | Manual re-registration if agent state corrupted |

Human touch: Only when automated recovery fails. Note: Full regional DR (control plane failover to different region) is not in initial scope; would require additional infrastructure.

Bottom line: The system runs itself. Humans approve gates, investigate exceptions, and handle novel failures—not routine operations.

Interaction Channels

| Role | Primary Interface | Interaction Type |
|------|-------------------|------------------|
| On-Call Operator | Chat/CLI + existing alerting tools | Real-time recommendations, natural language queries |
| Capacity Planner | Dashboards + interactive analysis UI | Forecasts, what-if scenarios, reports |
| Incident Commander | Chat/CLI during incident | Situational awareness, decision support |
| Post-Incident Reviewer | Investigation UI | Timeline, correlation, root cause summary |
| Engineering Leadership | Scheduled reports + dashboards | Trends, forecasts, risk assessment |

Phased Implementation Approach

Phase 0: Foundation

Establish data access and validate assumptions.

  1. Confirm Data Warehouse interface requirements with DW team; define query contract for historical data access
  2. Validate time-series metrics accessibility from central location; confirm metrics availability and retention
  3. Document anomaly detection system API and available signals
  4. Clarify distributed tracing coverage and sampling strategy
  5. Set up Claude Code environment for early adopter experimentation
  6. Define initial MCP server interface for Claude Code integration

Exit Criteria: Documented data access patterns for all source systems; working Claude Code prototype that can query at least one data source.
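
As an illustration of that exit criterion, the Phase 0 prototype might expose a single metrics query as a Claude Code tool. A minimal sketch using the MCP Python SDK's FastMCP helper; the Prometheus endpoint, metric name, and tool shape are assumptions to be replaced with the real metrics API.

```python
# Minimal MCP server sketch (MCP Python SDK); exposes one Prometheus query as a tool.
import requests
from mcp.server.fastmcp import FastMCP

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # placeholder endpoint

mcp = FastMCP("capacity-metrics")


@mcp.tool()
def work_unit_utilization(work_unit: str) -> dict:
    """Return the current connection count for a work unit (hypothetical metric name)."""
    resp = requests.get(
        PROM_URL,
        params={"query": f'sum(active_connections{{work_unit="{work_unit}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; register in Claude Code's MCP config
```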

Phase 1: Shadow Mode - Observation and Recommendation

Build the core decision engine in observation-only mode.

  1. Implement capacity baseline models using historical Data Warehouse data
  2. Build traffic affinity matrix from empirical TTFB data
  3. Create near real-time health monitoring pipeline (time-series metrics + anomaly detection)
  4. Develop failover recommendation logic
  5. Deploy central orchestrator in shadow mode—generates recommendations, logs decisions, but takes no action
  6. Compare system recommendations against actual human operator decisions
  7. Iterate on models based on discrepancies

Exit Criteria: System generates failover recommendations that match human decisions ≥90% of the time over a 4-week observation period.

Phase 2: Assisted Decision-Making

Increase system involvement while maintaining human control.

  1. Surface recommendations to operators in real-time during incidents
  2. Provide "what-if" analysis tools for capacity planning via visualization platform integration or custom UI
  3. Implement post-incident analysis automation
  4. Deploy work unit agents for local context gathering
  5. Build cost modeling for traffic migration scenarios
  6. Extend audit logging to capture full decision context

Exit Criteria: Operators actively use system recommendations during incidents; post-incident analysis time reduced by 50%.

Phase 3: Progressive Automation

Gradually transfer decision authority to the system.

  1. Define automation runbook with explicit human approval gates
  2. Implement staged failover capability (if traffic control supports it)
  3. Enable automated alerting based on system recommendations
  4. Pilot automated failover for low-risk scenarios with immediate human notification
  5. Expand automation scope based on demonstrated reliability

Exit Criteria: Defined per-scenario based on risk tolerance and demonstrated system accuracy.


Security and Compliance

Security is a first-class concern, not an afterthought. The system will be designed with the following principles:

Authentication and Authorization

Service-to-Service (Agents):

  • All agent communication uses mutual TLS with certificate-based identity
  • Workload identity attestation ensures agents only accept requests from verified peers
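
A minimal sketch of the client side of this mutual-TLS requirement using gRPC's credentials API; the certificate file paths and generated stub are placeholders, and in practice identities would come from the workload-attestation system rather than static files.

```python
import grpc


def mtls_channel(target: str) -> grpc.Channel:
    """Open a mutually authenticated channel from the orchestrator to a work unit agent.

    Certificate paths are placeholders; real deployments would source these
    from the workload-identity system rather than files on disk.
    """
    with open("ca.pem", "rb") as ca, open("client.key", "rb") as key, open("client.pem", "rb") as cert:
        creds = grpc.ssl_channel_credentials(
            root_certificates=ca.read(),   # trust anchor for the agent's server cert
            private_key=key.read(),        # client identity presented to the agent
            certificate_chain=cert.read(),
        )
    return grpc.secure_channel(target, creds)

# channel = mtls_channel("work-unit-a.agents.internal:443")
# stub = agent_pb2_grpc.AgentStub(channel)  # generated stub, not shown here
```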

Human Access:

  • SSO integration with existing identity provider (SAML/OIDC) for all operator-facing interfaces
  • No local accounts; all access tied to corporate identity

Role-Based Access Control:

| Role | Permissions | Typical Users |
|------|-------------|---------------|
| Viewer | View dashboards, recommendations, and audit logs | All engineers, leadership |
| Operator | Acknowledge/override recommendations, trigger manual analysis, ask ad-hoc questions | On-call SREs, incident commanders |
| Admin | Modify capacity thresholds, update traffic affinity rules, manage access | Platform team leads |
| System Config | Deploy model updates, modify agent configuration, update MCP schemas | Build team, designated SREs |

Configuration Management:

  • System configuration (thresholds, affinity rules, alert definitions) managed via GitOps—changes require PR review and merge to deploy
  • Operational actions (acknowledge recommendation, trigger failover) performed directly in operator UI with audit logging
  • No direct production access; all changes flow through version-controlled pipelines or audited UI actions

Network Security

  • Central orchestrator runs in isolated admin cluster with restricted network policies
  • Work unit agents communicate only with central orchestrator via authenticated channels
  • No direct work-unit-to-work-unit agent communication without central coordination
  • All cross-network traffic encrypted in transit

Audit and Accountability

  • Every observation, recommendation, and action is logged with full context:
    • What signals were observed
    • What recommendation was generated
    • Who or what approved the action (human operator ID or automated policy reference)
    • What action was taken
    • Outcome of the action
  • Audit logs shipped to centralized logging system with append-only retention policies (no deletion or modification within retention window)
  • Regular audit review as part of operational process

Failure Modes and Safety

  • First directive: Do no harm. When uncertain, the system alerts humans rather than taking action.
  • Network partition handling: If central orchestrator loses connectivity to a work unit agent, it does not assume the work unit is healthy or unhealthy—it flags the uncertainty.
  • Agent crashes/restarts: Agents resume in safe state; no action taken based on stale data
  • Consensus requirements: For high-impact actions (full work unit failover), require corroborating signals from at least two independent sources (e.g., anomaly detection + metrics threshold breach, or metrics + operator report) before recommending
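
A minimal sketch of this two-source corroboration rule; the signal names are illustrative.

```python
def corroborated(signals: dict) -> bool:
    """Require at least two independent sources before recommending a full failover.

    `signals` maps source name -> bool, e.g.
    {"anomaly_detection": True, "metrics_threshold": True, "operator_report": False}.
    """
    independent_sources = ("anomaly_detection", "metrics_threshold", "operator_report")
    return sum(bool(signals.get(s)) for s in independent_sources) >= 2

# A single anomaly with no corroborating metrics breach stays a low-severity
# alert to humans, never a failover recommendation.
```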

Data Residency and Privacy

  • Customer data is never processed by the AI layer; only aggregated metrics and operational telemetry
  • Claude API calls do not include PII or customer-identifying information
  • All data processing respects existing data governance policies

Technology Approach

The system leverages existing infrastructure where possible and introduces new components only where necessary.

AI Layer:

  • Central orchestrator runs a custom agent built on Claude SDK or LangChain with Opus 4.5 for high-stakes reasoning (an example API call follows this list)
  • Work unit agents run local LLMs (Llama 3) for network partition resilience and API cost elimination
  • Claude Code with custom MCP servers for development and early adopter experimentation
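
For the production reasoning path, a minimal sketch of a single call through the Anthropic Python SDK; the model identifier, system prompt, and context payload are placeholders, and a real orchestrator would pass structured tool definitions rather than one free-form prompt.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def recommend_failover(context: dict) -> str:
    """Ask Claude to reason over pre-computed signals and draft a recommendation.

    `context` would carry headroom, anomaly state, and affinity rankings; only
    aggregated operational telemetry, never customer data, per the privacy section.
    """
    message = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=1024,
        system="You are a capacity-planning assistant. Recommend, never act.",
        messages=[{
            "role": "user",
            "content": f"Given these signals, recommend a failover plan with reasoning: {context}",
        }],
    )
    return message.content[0].text
```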

Data Infrastructure: Builds on existing time-series metrics, logging, and data warehouse investments. New components limited to pre-computed analytics stores for derived datasets.

Build vs. Buy: Custom development required for the decision engine, capacity models, and MCP integrations. Standard tooling for everything else.

See Appendix C: Detailed Technology Stack for component-level recommendations.


Open Questions for Stakeholders

The following questions require input from relevant teams before implementation can proceed:

Data Access (Data Warehouse, Observability)

  1. What is the interface for querying the Data Warehouse (REST API, SQL, other), and what are the latency characteristics for interactive vs. batch queries?
  2. What is the current distributed tracing sampling strategy, and can we access trace data programmatically for correlation?

Traffic Control (Traffic Engineering)

  1. What is the current mechanism for traffic failover (DNS, load balancer, application-level), and is there an API for automated control?
  2. Can traffic be shifted gradually (e.g., 10% increments), or is it all-or-nothing?
  3. What are the current documented runbooks for failover decisions?

Security and Compliance

  1. What is the standard for service-to-service authentication, and are there specific compliance requirements (SOC2, etc.) affecting audit logging?

Rollout

  1. Which work units or regions should be prioritized for initial rollout?

Appendix A: Glossary

| Term | Definition |
|------|------------|
| Work Unit | A complete deployment of the service stack in a specific region; also referred to as a cluster |
| TTFB | Time to First Byte; a measure of latency from user request to initial response |
| Thundering Herd | A spike in resource utilization when a large number of clients simultaneously connect to a newly activated work unit |
| Failover | The process of shifting traffic away from a degraded work unit to healthy alternatives |
| Shadow Mode | Operating mode where the system generates recommendations but takes no automated action |
| Traffic Affinity | The mapping of customer populations to optimal work units based on latency and capacity |
| Headroom | Available capacity on a work unit; the difference between current utilization and maximum safe capacity |

Appendix B: Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | Dec 2024 | D. Park | Initial draft for executive review |
| 1.0 | Dec 2024 | D. Park | Added team requirements, human interaction model, expanded tech details |

Appendix C: Detailed Technology Stack

Core Components

| Component | Recommendation | Rationale |
|-----------|----------------|-----------|
| AI Reasoning (Production) | Custom agent built on Claude SDK or LangChain with Opus 4.5 | Highest reasoning capability; custom agent enables fine-grained control over context management, token usage, and task execution; allows tailored tool definitions and guardrails specific to our operational domain |
| AI Reasoning (Development) | Claude Code with custom MCP server | Rapid iteration; direct tool access for early adopters |
| Work Unit Local Inference | Llama 3 (8B or 70B) via vLLM or Ollama | Required for network partition resilience; eliminates API costs at scale; sub-second local response times |
| Time-Series Metrics | Prometheus | Industry standard; powerful PromQL; native alerting |
| Time-Series Forecasting | Prophet or NeuralProphet | Handles sinusoidal patterns and missing data |
| Pre-Computed Analytics Store | DuckDB or Apache Iceberg on Parquet | Fast analytical queries on derived datasets |
| Centralized Logging | Splunk or OpenSearch | Enterprise features and compliance |
| Distributed Tracing | Jaeger or Zipkin | Kubernetes-native; OpenTelemetry support |
| Visualization Platform | Tableau or Grafana | See integration notes below |
| Container Orchestration | Kubernetes (EKS, GKE, or AKS) | Industry standard; strong ecosystem |
| Inter-Agent Communication | gRPC with mTLS | Strong typing, efficient serialization |
| Audit Logging | Splunk or ELK stack | Searchable, compliance-ready |

Visualization Platform Integration

| Use Case | Data Source | Refresh Rate | Dashboard Type |
|----------|-------------|--------------|----------------|
| Real-time work unit health | Time-series metrics | 1-5 min | Operational |
| Current headroom by region | Pre-computed analytics | 5-15 min | Operational |
| 30/60/90 day capacity forecast | Data Warehouse | Daily | Planning |
| "What-if" traffic migration | Data Warehouse + Cost models | On-demand | Interactive |
| Post-incident timeline | Logs + Traces + Metrics | On-demand | Investigation |

Tableau Integration:

| Connection Type | Use Case | How It Works |
|-----------------|----------|--------------|
| Live Connection | Real-time operational dashboards | Tableau queries the data source directly on each dashboard load or refresh. For Prometheus, this requires an intermediary (e.g., Prometheus → PostgreSQL via query exporter, or a REST API data source). Queries execute in real-time; no data cached in Tableau. |
| Extract (Hyper) | Historical analysis, capacity planning | Tableau pulls data on a schedule (hourly/daily), stores it in a compressed .hyper file. Faster queries for large datasets; enables offline analysis. Required for Data Warehouse queries that are too slow for interactive use. |
| Embedded Analytics | Operator tooling integration | Tableau views embedded in internal tools via Tableau Embedding API. SSO passthrough via SAML/JWT. Allows dashboards within existing operator workflows. |

Tableau Server or Tableau Cloud is required for scheduled extract refreshes, shared dashboards, and embedding. Desktop-only deployments cannot support team-wide operational use.

Grafana Integration:

| Data Source | Connection Method | Capabilities |
|-------------|-------------------|--------------|
| Prometheus | Native plugin (built-in) | PromQL queries, alerting rules, annotation support. Sub-second refresh for real-time dashboards. |
| OpenSearch/Elasticsearch | Native plugin | Log exploration, full-text search, aggregations for log-based dashboards. |
| PostgreSQL/DuckDB | SQL plugin | Query pre-computed analytics tables. Supports variables and templating for interactive filtering. |
| Data Warehouse | JDBC/ODBC or custom plugin | Historical queries via SQL interface. May require caching layer for acceptable latency. |

Grafana excels at operational dashboards with real-time metrics. For complex business analytics (cohort analysis, executive summaries with calculated fields), Tableau offers more flexibility.

Build vs. Buy

Build (Custom Development Required):

| Component | What It Does | Key Technical Work |
|-----------|--------------|--------------------|
| MCP Server for Claude Code | Exposes internal systems (metrics, logs, anomaly detection) as tools that Claude can invoke via the Model Context Protocol | Define tool schemas for each data source; implement authentication/authorization; handle rate limiting; translate natural language queries to PromQL/SQL |
| Traffic Affinity Matrix Pipeline | Generates and maintains the region-to-work-unit routing preferences based on latency data | Scheduled job that queries TTFB percentiles from Data Warehouse; computes ranked failover targets per region; outputs to analytics store; handles edge cases (new regions, decommissioned work units) |
| Capacity Baseline Modeling | Determines the maximum safe load for each work unit based on historical performance | Analyze historical metrics during peak load; identify limiting resources (CPU, memory, connections, network); compute per-work-unit capacity thresholds with safety margins; update baselines after infrastructure changes |
| Anomaly Detection Integration | Connects the decision engine to the existing anomaly detection system | API client for anomaly detection service; polling or webhook-based signal ingestion; normalization of anomaly scores to common format; historical anomaly retrieval for post-incident analysis |
| Failover Recommendation Engine | Core decision logic that combines signals into actionable recommendations | Rule engine or ML model that weighs health signals, capacity headroom, traffic affinity, and thundering herd risk; generates ranked recommendations with confidence scores; supports override rules for known scenarios |

Leverage Existing:

| System | What We Use | Integration Approach |
|--------|-------------|----------------------|
| Time-series metrics | Real-time health signals, utilization data | Query via existing API; no changes to collection pipeline |
| Centralized logging | Log correlation for post-incident analysis | Query via existing search API; may need new saved searches/alerts |
| Distributed tracing | Request path analysis during incidents | Query via existing API; requires adequate sampling for correlation |
| Data Warehouse | Historical analysis, forecasting input | New query patterns; may need additional ETL for derived tables |
| Authentication | SSO for human users, service identity for agents | SAML/OIDC integration for UIs; workload identity for service-to-service |

Evaluate Commercial Options:

| Option | Potential Benefit | Consideration |
|--------|-------------------|---------------|
| Unified observability platform (Datadog, New Relic, Dynatrace) | Single pane of glass; built-in anomaly detection; reduced integration work | May not integrate well with existing bespoke systems; licensing costs at scale; vendor lock-in |
| Commercial AIOps tools | Pre-built correlation and recommendation engines | Often require significant customization; may not understand domain-specific capacity constraints |
| Managed LLM inference | Reduce operational burden of running local LLMs | Adds network dependency; per-query costs; may not meet latency requirements for real-time decisions |

This document is intended for internal planning purposes. Distribution should be limited to stakeholders involved in project scoping and approval.

Powered by Claude.
