@initcron
Last active April 6, 2026 07:20

Step 1.2

This alarm indicates acute resource pressure on your catalog-api instance. Here's the immediate
  triage:

  Critical Status

  - Instance: i-0abc123def456001 in us-east-1
  - Metric: 92.3% CPU (crossed 90% threshold at 03:47 UTC on 2026-04-05)
  - Severity: CRITICAL (routed to ops-critical SNS topic)

  Immediate Actions (Priority Order)

  1. Verify the instance is still responsive (next 2 minutes)
  # Check if instance is reachable
  aws ec2 describe-instances --instance-ids i-0abc123def456001 --region us-east-1
  # Check recent CloudWatch metrics for context
  aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456001 --start-time 2026-04-05T03:17:00Z \
    --end-time 2026-04-05T04:00:00Z --period 60 --statistics Average,Maximum
  2. Identify what's consuming CPU (next 5 minutes)
    - SSH into the instance and run: top / htop / ps aux --sort=-%cpu to find the culprit process
    - Check application logs for errors or runaway loops
  3. Scale horizontally if catalog-api is load-balanceable
    - If behind an Auto Scaling Group: verify ASG scaling policies are active
    - If not: launch a second instance and update load balancer target group
    - Don't restart yet — you need to know why CPU spiked first
  4. Suppress cascading alerts (if appropriate)
    - If you're investigating, consider temporarily disabling the alarm to avoid alert fatigue
    - Re-enable once root cause is identified
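  If you do pause notifications while investigating, CloudWatch can disable an alarm's actions without deleting the alarm. A minimal sketch, assuming the alarm name matches the one described above:

```shell
# Alarm name assumed from this incident; substitute yours if it differs
ALARM="HighCPUUtilization-catalog-api"

# Pause notifications (the alarm still evaluates; SNS actions don't fire)
aws cloudwatch disable-alarm-actions --alarm-names "$ALARM" --region us-east-1

# Re-enable once root cause is identified
aws cloudwatch enable-alarm-actions --alarm-names "$ALARM" --region us-east-1
```

  This is preferable to deleting the alarm, since the alarm history is preserved for the post-incident review.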

  Questions to Answer Now

  ┌──────────────────────────────────┬────────────────────┬────────────────────────────────────┐
  │             Question             │        Why         │             What To Do             │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is this a one-time spike or      │ Transient vs.      │ Check CloudWatch graph for 1-hour  │
  │ sustained?                       │ structural problem │ history. Is it trending down?      │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is the instance oversized for    │ May need permanent │ Check memory & disk. Is CPU the    │
  │ the workload?                    │  resize            │ only bottleneck?                   │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Are there deployment/config      │ Common cause of    │ Check git log in your infra repo,  │
  │ changes in the last 1-2 hours?   │ sudden load        │ or deployment logs in your CI/CD   │
  │                                  │                    │ system                             │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is there a known data ingestion  │ Scheduled work may │ Check cron jobs: crontab -l        │
  │ job or batch process?            │  explain spike     │                                    │
  └──────────────────────────────────┴────────────────────┴────────────────────────────────────┘

  Next Steps (Based on Root Cause)

  - If load increased legitimately → Scale up (more instances or larger instance type)
  - If a process is runaway → Kill it, check application code/config
  - If misconfigured alarm → Adjust threshold (92% may be normal for this workload)
  - If instance is undersized → Migrate to larger instance type + adjust ASG

  ---
  Action needed from ops team: What's the current instance type (t3.medium, m5.large, etc.) and is
  catalog-api behind an ASG?

Step 1.3

⏺ SRE Analysis: HighCPUUtilization-catalog-api

  Severity Assessment: MEDIUM (Requires immediate investigation, not emergency)

  Customer Impact Potential: Moderate to High (if sustained, causes request queueing/timeouts)
  MTTR Estimate: 5–15 min (if traffic-driven); 30+ min (if application bug)
  Confidence Level: LOW — single metric, missing context

  ---
  What This Alarm Tells You

  ✅ What we know:
  - catalog-api hit 92% CPU in a 5-minute window
  - Routed to ops-critical (your alerting is configured correctly)

  ⚠️ Significant gaps:
  - One datapoint — is this a sustained issue or a 30-second spike within a 5-min period?
  - No request volume — is traffic actually up, or is a single query inefficient?
  - No error rates — are customers experiencing 5xx errors, or just slower responses?
  - No latency data — can't tell if catalog-api is timing out downstream calls
  - No memory usage — is this a memory leak manifesting as CPU (GC pressure)?

  ---
  Decision Tree: What's Happening?

  CPU @ 92% → Is error rate also elevated?
             ├─ YES → Application bug (infinite loop, inefficient query, memory leak)
             │        Action: Check recent deployments, trace slowest queries
             │        MTTR: 30 min (requires debugging + rollback or fix-forward)
             │
             └─ NO  → Check request volume vs. baseline
                      ├─ Traffic UP 2-3x → Traffic spike, normal scaling response
                      │  Action: Autoscaling should kick in; check ASG state
                      │  MTTR: 3-5 min (scale out)
                      │
                      └─ Traffic NORMAL → Undersized instance or unknown workload
                         Action: Check running processes; possible resource leak
                         MTTR: 5-10 min (restart) or 30+ min (investigation)
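  The branching above can be sketched as a small shell helper (the thresholds and labels are illustrative, not part of any standard tooling):

```shell
# Classify the incident from two observations: is the error rate elevated,
# and what is the request-volume ratio vs. baseline?
classify() {
  errors_elevated=$1   # "yes" or "no"
  traffic_ratio=$2     # current request rate / baseline, e.g. 2.5
  if [ "$errors_elevated" = "yes" ]; then
    echo "application-bug"        # check deployments, trace slow queries
  elif awk -v r="$traffic_ratio" 'BEGIN { exit !(r >= 2) }'; then
    echo "traffic-spike"          # autoscaling should handle it
  else
    echo "undersized-or-leak"     # check processes, possible resource leak
  fi
}

classify yes 1.0   # → application-bug
classify no  2.5   # → traffic-spike
```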

  ---
  Immediate Actions (Next 5 Minutes)

  1. Check real-time metrics (CloudWatch dashboard or CLI):
  # Pull last 10 minutes of data
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456001 \
    --statistics Average \
    --start-time 2026-04-05T03:37:00Z \
    --end-time 2026-04-05T03:47:00Z \
    --period 60

  # Check application metrics (get-metric-statistics accepts one metric per
  # call, so loop over them; the CatalogAPI custom namespace comes from your app)
  for metric in RequestCount ErrorRate Latency; do
    aws cloudwatch get-metric-statistics \
      --namespace CatalogAPI \
      --metric-name "$metric" \
      --statistics Average,Sum \
      --start-time 2026-04-05T03:37:00Z \
      --end-time 2026-04-05T03:47:00Z \
      --period 60
  done
  2. Assess customer impact:
    - Is catalog-api responding? curl -I https://api.example.com/catalog/health
    - Check error logs: grep ERROR /var/log/catalog-api/*.log | tail -20
    - Are downstream services timing out? (inventory, recommendations, cart)
  3. Rule out transient spike:
    - If it's the only high-CPU datapoint, and the service is responding normally → No immediate
  action. Monitor closely.
    - If CPU stayed high for 3+ consecutive periods → Escalate to triage phase (below).
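  The "3+ consecutive periods" rule can be checked mechanically once you have the datapoint averages (e.g. from the `get-metric-statistics` output above, parsed with `--output text`). A sketch, with illustrative values:

```shell
# Longest run of consecutive datapoints above a threshold
consecutive_above() {
  threshold=$1; shift
  echo "$@" | awk -v t="$threshold" '{
    run = 0; max = 0
    for (i = 1; i <= NF; i++) {
      if ($i + 0 > t + 0) { run++; if (run > max) max = run } else run = 0
    }
    print max
  }'
}

# Five 5-minute averages; a result of 3 or more means escalate to triage
consecutive_above 85 72.1 88.4 91.0 92.3 84.9   # → 3
```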

  ---
  Triage Phase (If CPU Remains >85% or Error Rate Rises)

  Test 1: Is it a traffic surge?

  # Compare current request rate to baseline
  # If RequestRate > baseline by 2x, autoscaling group should handle it
  # If ASG isn't scaling, check:
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids i-0abc123def456001
  # Are desired capacity and running instances in sync?

  If yes: Expected behavior. Autoscaling in progress. No action, monitor scale-out completion.
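  To compare desired capacity against in-service instances directly (the ASG name here is an assumption; substitute yours):

```shell
# ASG name assumed for illustration
ASG_NAME="catalog-api-asg"

# Prints [DesiredCapacity, in-service instance count]; a mismatch that
# persists for several minutes means scale-out is stuck
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].[DesiredCapacity, length(Instances[?LifecycleState==`InService`])]' \
  --output text
```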

  Test 2: Is it an application issue?

  # SSH to instance and check running processes
  ps aux --sort=-%cpu | head -10
  # Look for: Java process (GC), Python runaway, database client

  # If recent deployment: check app logs for exceptions
  tail -f /var/log/catalog-api/application.log | grep -i "exception\|error\|timeout"

  # Check database queries if catalog-api is query-heavy
  # (you'll need slow query logs or APM tracing)

  If yes: Likely a bug. Rollback last deployment (if deployed in last 30 min), or restart the
  service (if crash/memory leak suspected). Investigate root cause post-incident.

  Test 3: Is the instance undersized?

  # Check memory + disk I/O, not just CPU
  aws ec2 describe-instances --instance-ids i-0abc123def456001 \
    --query 'Reservations[0].Instances[0].[InstanceType,State.Name]'
  # If t3.small or t3.micro and traffic is normal, consider upsizing

  ---
  Recommended Next Steps

  ┌───────────────────────┬────────────────────────────────────────────────┬───────┬────────────┐
  │       Scenario        │                     Action                     │ MTTR  │ Confidence │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Single spike, service │ Wait & monitor for next 10 min. Set 5-min SNS  │ —     │ High       │
  │  healthy, no errors   │ silence if you need focus.                     │       │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Sustained CPU >85% +  │ Trigger autoscaling scale-out or restart the   │ 3–5   │ High       │
  │ error rate rising     │ instance. Parallel: check recent deployments.  │ min   │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Sustained CPU + no    │ Investigate running processes. Likely a        │ 30    │ Medium     │
  │ error rate change     │ resource leak or bad query. Possible rollback. │ min   │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ CPU spiking every 5   │ Check scheduled tasks (cron jobs, batch jobs,  │ 10    │ Medium     │
  │ min (pattern)         │ cache refresh). Tune or reschedule if not      │ min   │            │
  │                       │ critical.                                      │       │            │
  └───────────────────────┴────────────────────────────────────────────────┴───────┴────────────┘

  ---
  Questions to Clarify ASAP

  1. Is the service responding to requests? (Is this customer-facing yet?)
  2. Did we deploy in the last hour? (rollback candidate?)
  3. Is CPU still elevated now (2026-04-05T03:50+)? Or did it drop back?
  4. What's the baseline CPU for this instance? (Is 92% abnormal or cyclical?)
  5. Is request volume up? (Check request rate metric; if traffic is normal, this is suspicious.)

  ---
  Escalation Criteria

  Escalate to incident commander if:
  - CPU remains >90% for >5 consecutive periods (25 minutes sustained)
  - Error rate rises above 1% (catalog-api 5xx or timeout errors)
  - Customer complaints in support tickets reference slow catalog or 500 errors
  - Unable to determine root cause within 15 minutes

  This does NOT yet require a SEV-1 page. It's a SEV-3 (investigate) trending toward SEV-2 (degraded
  service) if unresolved in 10 minutes.

  ---
  Long-Term: Fix the Alert Fatigue

  Once resolved, tune this alarm:
  - Increase evaluation periods to 2 (requires 2 consecutive 5-min periods above 90%, filters
  spikes)
  - Add missing metrics (error rate, latency) as composite alarm conditions
  - Set up dashboard with CPU + RequestCount + Errors + Latency to spot the pattern next time

Step 1.4

⏺ SRE Incident Analysis: HighCPUUtilization-catalog-api

  Severity Assessment

  ┌────────────────┬────────────────────────────────────────────────────────────────────────────┐
  │   Dimension    │                                 Assessment                                 │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Incident Level │ 3 (Elevated) — ops-critical SNS topic triggered, but off-peak timing       │
  │                │ limits customer impact                                                     │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Customer       │ Low (current) — 03:47 UTC is 6+ hours before peak traffic window (09:00    │
  │ Impact         │ UTC)                                                                       │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Blast Radius   │ Single instance — catalog-api only; ALB has other backends; cache layer in │
  │                │  place                                                                     │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ MTTR Estimate  │ 15–45 min (investigation + fix) vs. 5 min (instance restart/scale)         │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Confidence     │ Medium — single datapoint spike; need current state verification           │
  │ Level          │                                                                            │
  └────────────────┴────────────────────────────────────────────────────────────────────────────┘

  ---
  Immediate Triage (Next 2 Minutes)

  Check these in parallel while on call:

  1. Is CPU sustained or transient?
    - Query CloudWatch for last 10 minutes of CPU (1-min granularity)
    - If spike already dropped → transient, monitor closely
    - If still >85% → investigate root cause immediately
  2. Is traffic elevated?
    - Check RequestCount metric on ALB target group
    - Compare to baseline for 03:00–04:00 UTC window
    - If requests are normal → it's a process issue, not traffic
  3. Are database connections healthy?
    - Query DatabaseConnections on db-catalog RDS instance
    - If at/near 100 limit → app is likely in connection retry loop
    - If normal (40–60) → not the issue
  4. Check ElastiCache cache hit rate:
    - Cache hit rate <85% → probable cache node failure or eviction storm
    - This would explain elevated backend load

  ---
  Root Cause Decision Tree

  ├─ CPU still >85%?
  │  ├─ YES, requests normal (healthy baseline)
  │  │  └─ Process issue: memory leak, zombie threads, batch job
  │  │     → SSH, run `top`, check running processes
  │  │     → Check cron jobs, background tasks, recent deploys
  │  │
  │  ├─ YES, requests >3x baseline
  │  │  └─ Traffic spike (unusual for 03:47 UTC)
  │  │     → Check CDN cache invalidation events
  │  │     → Check for bot traffic / scraping
  │  │     → Consider immediate scale-up if sustained
  │  │
  │  ├─ YES, DB connections maxed
  │  │  └─ Database bottleneck (connection pool exhaustion)
  │  │     → Check RDS slow query logs
  │  │     → Rolling restart of app instances (careful — one at a time)
  │  │
  │  └─ YES, cache hit rate <85%
  │     └─ Cache failure (node down, eviction storm)
  │        → Failover to Redis standby if available
  │        → Monitor backend load as cache recovers
  │
  └─ CPU dropped below 80%?
     └─ Transient spike → monitor for patterns
        → If spikes recur every hour → likely scheduled job
        → If random → thermal blip or GC pause, continue monitoring

  ---
  Recommended Actions (In Order)

  Phase 1: Verify Scope (0–2 min)

  - Pull fresh CloudWatch metrics for CPU, RequestCount, DatabaseConnections (last 15 min)
  - Check X-Ray traces for slow transactions or errors
  - Review CloudWatch Logs for errors at 03:40–03:50 UTC window
  - Check EC2 system logs for OS-level issues (OOM, kernel panic markers)
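  For the log review, CloudWatch Logs can be filtered to the alarm window from the CLI. The log group name below is an assumption; substitute your actual group:

```shell
LOG_GROUP="/catalog-api/application"   # hypothetical log group name

# filter-log-events takes epoch milliseconds, hence the trailing 000
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$(date -u -d '2026-04-05T03:40:00Z' +%s)000" \
  --end-time "$(date -u -d '2026-04-05T03:50:00Z' +%s)000" \
  --filter-pattern "ERROR" \
  --region us-east-1
```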

  Phase 2: Investigate Without Interrupting Service (2–10 min)

  - If CPU still >85% & requests normal: SSH to i-0abc123def456001, run:
  top -b -n 2 | head -20        # Top CPU-consuming processes
  ps aux | grep java             # If app is JVM-based
  docker stats                   # If containerized
  free -h                        # Memory and swap
  - If DB connections maxed: Check RDS slow query log for queries taking >5s
  - If cache hit rate <85%: Restart Redis cluster nodes in maintenance window

  Phase 3: Mitigation (Choose Based on Investigation)

  Option A: Process Investigation Confirms Runaway Process
  - Kill the offending process
  - Monitor CPU for 5 min to confirm recovery
  - Post-incident: Review logs, check for memory leaks or infinite loops

  Option B: No Obvious Process Issue
  - Safe option: Resize to t3.xlarge (4 vCPU) immediately (stop → modify instance type → start,
    or scale out via the ASG if one exists)
    - Roughly halves CPU (92.3% on 2 vCPU → ~46% on 4 vCPU)
    - Buys time for investigation
    - Cost impact: roughly $0.08/hour extra (on-demand, us-east-1), temporary
  - Continue investigation while scaled up
  - Scale back down once root cause is fixed
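  The headroom arithmetic in Option B generalizes: for the same absolute workload, CPU percentage scales inversely with vCPU count. A quick sketch:

```shell
# Projected CPU% after a resize, assuming the same absolute workload
projected_cpu() {
  awk -v cpu="$1" -v old="$2" -v new="$3" 'BEGIN { printf "%.1f\n", cpu * old / new }'
}

projected_cpu 92.3 2 4   # t3.large (2 vCPU) → t3.xlarge (4 vCPU): roughly 46%
```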

  Option C: Database Connection Exhaustion Confirmed
  - Implement connection pool draining: stop accepting new requests on one instance (ALB removes
  from target group)
  - Restart the instance after connections drain (5–10 min)
  - Repeat on other instances if needed

  ---
  Alert Tuning (Post-Incident)

  Current threshold (90%) is aggressive for this workload:
  - Baseline peak: 60–65%, so 90% threshold has ~25% headroom
  - Single datapoint (EvaluationPeriods=1) triggers on transient spikes
  - Recommendation after root cause is known:
    - Adjust to 85% threshold with 2 evaluation periods (10 min sustained)
    - This filters transient spikes, catches real issues earlier
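  Applied via the CLI, the tuning above would look roughly like this (the SNS account id is a placeholder, and the period/statistic are assumed to match the existing alarm):

```shell
# Re-create the alarm with a lower threshold and 2 evaluation periods
THRESHOLD=85
EVAL_PERIODS=2   # two consecutive 5-min breaches required

aws cloudwatch put-metric-alarm \
  --alarm-name HighCPUUtilization-catalog-api \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456001 \
  --statistic Average \
  --period 300 \
  --threshold "$THRESHOLD" \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods "$EVAL_PERIODS" \
  --alarm-actions "arn:aws:sns:us-east-1:<account-id>:ops-critical" \
  --region us-east-1
```

  `put-metric-alarm` overwrites an alarm of the same name in place, so no delete/recreate dance is needed.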

  ---
  Handoff Checklist for On-Call

  - Alarm fired: 2026-04-05 03:47 UTC (off-peak)
  - Confidence: Medium (single spike, need current state)
  - Time to action: 1 min (ops-critical SLA)
  - Safe to restart? Likely yes, but investigate first (5 min cost vs. risk)
  - Escalation trigger: If CPU remains >85% after 10 min + you find no root cause → page database
  team (may indicate RDS issue)

  What does your current CloudWatch dashboard show? Is CPU still elevated?

Step 1.5

 ALARM ANALYSIS: HighCPUUtilization-catalog-api

  Initial Assessment

  Severity: CRITICAL (ops-critical SNS firing)
  Confidence Level: MEDIUM (single 5-min datapoint — need more data)
  Estimated Customer Impact: POTENTIAL (depends on error rate — check NOW)
  Time-Critical? YES (off-peak CPU spike is abnormal — usually indicates runaway process)

  ---
  Immediate Context Red Flag

  Time of alarm: 2026-04-05 03:47 UTC

  This is OFF-PEAK (baseline should be 15-25% CPU). Peak hours are 09:00-21:00 UTC.

  → This is NOT a traffic surge. Something else is consuming cycles at an odd hour. This raises the
  probability of:
  1. Runaway process (memory leak, infinite loop, or runaway query) — HIGHEST PROBABILITY
  2. Recent deployment gone wrong
  3. Cache failure (every query cascades to database)

  ---
  FIRST DIAGNOSTIC COMMAND (run immediately)

  # Get process snapshot on the instance
  aws ssm send-command \
    --instance-ids i-0abc123def456001 \
    --document-name "AWS-RunShellScript" \
    --region us-east-1 \
    --parameters 'commands=["echo === TOP ===; top -bn1 -o %CPU | head -15; echo === MEMORY ===; free -h; echo === DISK I/O ===; iostat -x 1 2 | tail -10; echo === SWAP ===; grep -i swap /proc/meminfo"]'

  # Check deployment history
  aws deploy describe-deployments \
    --region us-east-1 \
    --query 'deployments[0:3].[deploymentId,status,creator,createTime]' \
    --output table

  Why this first?
  - top identifies if a single process is runaway (look for >50% CPU on one PID)
  - free / iostat / swap show if instance is in distress (OOM swapping kills performance)
  - Deployment check catches if we shipped broken code
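  Note that `send-command` is asynchronous; the diagnostic output has to be fetched separately. A sketch of the round trip (same instance id as above):

```shell
INSTANCE_ID="i-0abc123def456001"

# Capture the command id from send-command...
CMD_ID=$(aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --region us-east-1 \
  --parameters 'commands=["top -bn1 -o %CPU | head -15"]' \
  --query 'Command.CommandId' --output text)

# ...then poll for the output once the command has run
sleep 5
aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id "$INSTANCE_ID" \
  --region us-east-1 \
  --query 'StandardOutputContent' --output text
```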

  ---
  DECISION TREE (while waiting for diagnostics)

  If top shows a single process > 50% CPU:

  Process Name = java / python / node?
   ├─ YES (app process):
   │  ├─ Check logs for ERROR/EXCEPTION spike
   │  ├─ Run: kill -TERM <PID> (graceful) → monitor for 2 min
   │  ├─ If CPU normalizes: investigate code leak (recent PR?)
   │  └─ If CPU stays high: kill -9 <PID> → alert dev team
  │
  └─ NO (system process: postgres, mysql, nginx):
     └─ Database or reverse proxy is slammed
     └─ Check RDS connections + ElastiCache health next

  If top shows distributed CPU (many processes at 5-10% each):

  → Likely cache failure or database cascade
  → Run ElastiCache health check immediately

  If no obvious runaway AND deployment < 30 min old:

  → Suspect code regression from recent deploy
  → Decision: Rollback vs. investigate
     └─ If error rate is spiking: ROLLBACK NOW
     └─ If error rate is normal: investigate further

  ---
  PARALLEL DIAGNOSTICS (run within 2 minutes)

  # Cache health
  aws elasticache describe-cache-clusters \
    --cache-cluster-id catalog-cache-1 \
    --show-cache-node-info \
    --region us-east-1 \
    --query 'CacheClusters[0].[CacheNodeType,CacheClusterStatus,CacheNodes[0].CacheNodeStatus]'

  # RDS connection count (live connections come from CloudWatch, not the RDS API)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name DatabaseConnections \
    --dimensions Name=DBInstanceIdentifier,Value=db-catalog \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 --statistics Maximum \
    --region us-east-1

  # Check X-Ray for errors (TraceSummaries expose HasError/HasFault flags)
  aws xray get-trace-summaries \
    --start-time $(date -u -d '5 minutes ago' +%s) \
    --end-time $(date -u +%s) \
    --region us-east-1 \
    --query 'TraceSummaries[0:5].[Http.HttpStatus,HasError,HasFault]'

  # ALB request count (to rule out traffic surge); AWS/ApplicationELB metrics
  # need a LoadBalancer dimension (app/<name>/<id>), so substitute yours
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value=app/<your-alb>/<id> \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Sum \
    --region us-east-1

  ---
  RESPONSE ACTIONS (priority order)

  ┌───────────────────────────────┬───────────────────────────────────────────────────┬─────────┐
  │            Finding            │                      Action                       │  MTTR   │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Runaway process identified    │ kill -TERM <PID> → wait 2 min → kill -9 if needed │ 3-5 min │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Recent deploy is culprit      │ Rollback via CodeDeploy                           │ 5-10    │
  │                               │                                                   │ min     │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Cache degraded (status ≠      │ Restart affected node → verify hit rate recovery  │ 2-3 min │
  │ available)                    │                                                   │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ RDS connections > 90 (of 100  │ Check for connection leaks in app; restart app    │ 5 min   │
  │ max)                          │ gracefully                                        │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Error rate spiking in X-Ray   │ Correlation test: is error rate tied to CPU spike │ N/A     │
  │                               │  start time?                                      │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ No clear cause after 5 min    │ Page secondary on-call → prepare for instance     │ —       │
  │                               │ replacement                                       │         │
  └───────────────────────────────┴───────────────────────────────────────────────────┴─────────┘

  ---
  ESCALATION CRITERIA

  Page dev team lead if:
  - CPU remains > 90% for > 10 minutes AND root cause is unclear
  - Multiple alarms firing simultaneously (CPU + latency + DB connections)
  - X-Ray shows error rate spiking in correlation with CPU

  Declare minor incident if:
  - Customer error rate confirmed elevated
  - ETA for fix > 15 minutes

  ---
  YOUR CALL RIGHT NOW

  Recommended immediate action:

  1. Run the first diagnostic command NOW (SSM send-command for top/free/iostat)
  2. Queue the parallel diagnostics while you wait for top output (2 min SLA)
  3. Set a 5-minute timer: If CPU doesn't drop and no root cause found, page secondary on-call

  Why this approach?
  - Off-peak + 92% CPU = probable runaway process, not a traffic issue
  - t3.large with 2 vCPU can't sustain 92% indefinitely — something is actively broken
  - You have ~5 minutes to identify the cause before customer impact cascades (if error rate is
  rising)

  What does top show when you run that first command?