@initcron
Last active April 6, 2026 07:20

Step 1.2

This alarm indicates acute resource pressure on your catalog-api instance. Here's the immediate
  triage:

  Critical Status

  - Instance: i-0abc123def456001 in us-east-1
  - Metric: 92.3% CPU (crossed 90% threshold at 03:47 UTC on 2026-04-05)
  - Severity: CRITICAL (routed to ops-critical SNS topic)

  Immediate Actions (Priority Order)

  1. Verify the instance is still responsive (next 2 minutes)
  # Check if instance is reachable
  aws ec2 describe-instances --instance-ids i-0abc123def456001 --region us-east-1
  # Check recent CloudWatch metrics for context
  aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456001 --start-time 2026-04-05T03:17:00Z \
    --end-time 2026-04-05T04:00:00Z --period 60 --statistics Average,Maximum
  2. Identify what's consuming CPU (next 5 minutes)
    - SSH into the instance and run: top / htop / ps aux --sort=-%cpu to find the culprit process
    - Check application logs for errors or runaway loops
  3. Scale horizontally if catalog-api is load-balanceable
    - If behind an Auto Scaling Group: verify ASG scaling policies are active
    - If not: launch a second instance and update load balancer target group
    - Don't restart yet — you need to know why CPU spiked first
  4. Suppress cascading alerts (if appropriate)
    - If you're investigating, consider temporarily disabling the alarm to avoid alert fatigue
    - Re-enable once root cause is identified
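  If you do pause notifications while investigating, CloudWatch can disable an alarm's actions without deleting the alarm. A minimal sketch, assuming the alarm name matches the one described above:

```shell
# Alarm name assumed from this incident; substitute yours if it differs
ALARM="HighCPUUtilization-catalog-api"

# Pause notifications (the alarm still evaluates; SNS actions don't fire)
aws cloudwatch disable-alarm-actions --alarm-names "$ALARM" --region us-east-1

# Re-enable once root cause is identified
aws cloudwatch enable-alarm-actions --alarm-names "$ALARM" --region us-east-1
```

  This is preferable to deleting the alarm, since the alarm history is preserved for the post-incident review.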

  Questions to Answer Now

  ┌──────────────────────────────────┬────────────────────┬────────────────────────────────────┐
  │             Question             │        Why         │             What To Do             │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is this a one-time spike or      │ Transient vs.      │ Check CloudWatch graph for 1-hour  │
  │ sustained?                       │ structural problem │ history. Is it trending down?      │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is the instance oversized for    │ May need permanent │ Check memory & disk. Is CPU the    │
  │ the workload?                    │  resize            │ only bottleneck?                   │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Are there deployment/config      │ Common cause of    │ Check git log in your infra repo,  │
  │ changes in the last 1-2 hours?   │ sudden load        │ or deployment logs in your CI/CD   │
  │                                  │                    │ system                             │
  ├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
  │ Is there a known data ingestion  │ Scheduled work may │ Check cron jobs: crontab -l        │
  │ job or batch process?            │  explain spike     │                                    │
  └──────────────────────────────────┴────────────────────┴────────────────────────────────────┘

  Next Steps (Based on Root Cause)

  - If load increased legitimately → Scale up (more instances or larger instance type)
  - If a process is runaway → Kill it, check application code/config
  - If misconfigured alarm → Adjust threshold (92% may be normal for this workload)
  - If instance is undersized → Migrate to larger instance type + adjust ASG

  ---
  Action needed from ops team: What's the current instance type (t3.medium, m5.large, etc.) and is
  catalog-api behind an ASG?

Step 1.3

⏺ SRE Analysis: HighCPUUtilization-catalog-api

  Severity Assessment: MEDIUM (Requires immediate investigation, not emergency)

  Customer Impact Potential: Moderate to High (if sustained, causes request queueing/timeouts)
  MTTR Estimate: 5–15 min (if traffic-driven); 30+ min (if application bug)
  Confidence Level: LOW — single metric, missing context

  ---
  What This Alarm Tells You

  ✅ What we know:
  - catalog-api hit 92% CPU in a 5-minute window
  - Routed to ops-critical (your alerting is configured correctly)

  ⚠️ Significant gaps:
  - One datapoint — is this a sustained issue or a 30-second spike within a 5-min period?
  - No request volume — is traffic actually up, or is a single query inefficient?
  - No error rates — are customers experiencing 5xx errors, or just slower responses?
  - No latency data — can't tell if catalog-api is timing out downstream calls
  - No memory usage — is this a memory leak manifesting as CPU (GC pressure)?

  ---
  Decision Tree: What's Happening?

  CPU @ 92% → Is error rate also elevated?
             ├─ YES → Application bug (infinite loop, inefficient query, memory leak)
             │        Action: Check recent deployments, trace slowest queries
             │        MTTR: 30 min (requires debugging + rollback or fix-forward)
             │
             └─ NO  → Check request volume vs. baseline
                      ├─ Traffic UP 2-3x → Traffic spike, normal scaling response
                      │  Action: Autoscaling should kick in; check ASG state
                      │  MTTR: 3-5 min (scale out)
                      │
                      └─ Traffic NORMAL → Undersized instance or unknown workload
                         Action: Check running processes; possible resource leak
                         MTTR: 5-10 min (restart) or 30+ min (investigation)
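  The branching above can be sketched as a small shell helper (the thresholds and labels are illustrative, not part of any standard tooling):

```shell
# Classify the incident from two observations: is the error rate elevated,
# and what is the request-volume ratio vs. baseline?
classify() {
  errors_elevated=$1   # "yes" or "no"
  traffic_ratio=$2     # current request rate / baseline, e.g. 2.5
  if [ "$errors_elevated" = "yes" ]; then
    echo "application-bug"        # check deployments, trace slow queries
  elif awk -v r="$traffic_ratio" 'BEGIN { exit !(r >= 2) }'; then
    echo "traffic-spike"          # autoscaling should handle it
  else
    echo "undersized-or-leak"     # check processes, possible resource leak
  fi
}

classify yes 1.0   # → application-bug
classify no  2.5   # → traffic-spike
```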

  ---
  Immediate Actions (Next 5 Minutes)

  1. Check real-time metrics (CloudWatch dashboard or CLI):
  # Pull last 10 minutes of data
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456001 \
    --statistics Average \
    --start-time 2026-04-05T03:37:00Z \
    --end-time 2026-04-05T03:47:00Z \
    --period 60

  # Check application metrics (get-metric-statistics accepts one metric per
  # call, so loop over them; the CatalogAPI custom namespace comes from your app)
  for metric in RequestCount ErrorRate Latency; do
    aws cloudwatch get-metric-statistics \
      --namespace CatalogAPI \
      --metric-name "$metric" \
      --statistics Average,Sum \
      --start-time 2026-04-05T03:37:00Z \
      --end-time 2026-04-05T03:47:00Z \
      --period 60
  done
  2. Assess customer impact:
    - Is catalog-api responding? curl -I https://api.example.com/catalog/health
    - Check error logs: grep ERROR /var/log/catalog-api/*.log | tail -20
    - Are downstream services timing out? (inventory, recommendations, cart)
  3. Rule out transient spike:
    - If it's the only high-CPU datapoint, and the service is responding normally → No immediate
  action. Monitor closely.
    - If CPU stayed high for 3+ consecutive periods → Escalate to triage phase (below).
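  The "3+ consecutive periods" rule can be checked mechanically once you have the datapoint averages (e.g. from the `get-metric-statistics` output above, parsed with `--output text`). A sketch, with illustrative values:

```shell
# Longest run of consecutive datapoints above a threshold
consecutive_above() {
  threshold=$1; shift
  echo "$@" | awk -v t="$threshold" '{
    run = 0; max = 0
    for (i = 1; i <= NF; i++) {
      if ($i + 0 > t + 0) { run++; if (run > max) max = run } else run = 0
    }
    print max
  }'
}

# Five 5-minute averages; a result of 3 or more means escalate to triage
consecutive_above 85 72.1 88.4 91.0 92.3 84.9   # → 3
```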

  ---
  Triage Phase (If CPU Remains >85% or Error Rate Rises)

  Test 1: Is it a traffic surge?

  # Compare current request rate to baseline
  # If RequestRate > baseline by 2x, autoscaling group should handle it
  # If ASG isn't scaling, check:
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids i-0abc123def456001
  # Are desired capacity and running instances in sync?

  If yes: Expected behavior. Autoscaling in progress. No action, monitor scale-out completion.
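  To compare desired capacity against in-service instances directly (the ASG name here is an assumption; substitute yours):

```shell
# ASG name assumed for illustration
ASG_NAME="catalog-api-asg"

# Prints [DesiredCapacity, in-service instance count]; a mismatch that
# persists for several minutes means scale-out is stuck
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].[DesiredCapacity, length(Instances[?LifecycleState==`InService`])]' \
  --output text
```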

  Test 2: Is it an application issue?

  # SSH to instance and check running processes
  ps aux --sort=-%cpu | head -10
  # Look for: Java process (GC), Python runaway, database client

  # If recent deployment: check app logs for exceptions
  tail -f /var/log/catalog-api/application.log | grep -i "exception\|error\|timeout"

  # Check database queries if catalog-api is query-heavy
  # (you'll need slow query logs or APM tracing)

  If yes: Likely a bug. Rollback last deployment (if deployed in last 30 min), or restart the
  service (if crash/memory leak suspected). Investigate root cause post-incident.

  Test 3: Is the instance undersized?

  # Check memory + disk I/O, not just CPU
  aws ec2 describe-instances --instance-ids i-0abc123def456001 \
    --query 'Reservations[0].Instances[0].[InstanceType,State.Name]'
  # If t3.small or t3.micro and traffic is normal, consider upsizing

  ---
  Recommended Next Steps

  ┌───────────────────────┬────────────────────────────────────────────────┬───────┬────────────┐
  │       Scenario        │                     Action                     │ MTTR  │ Confidence │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Single spike, service │ Wait & monitor for next 10 min. Set 5-min SNS  │ —     │ High       │
  │  healthy, no errors   │ silence if you need focus.                     │       │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Sustained CPU >85% +  │ Trigger autoscaling scale-out or restart the   │ 3–5   │ High       │
  │ error rate rising     │ instance. Parallel: check recent deployments.  │ min   │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ Sustained CPU + no    │ Investigate running processes. Likely a        │ 30    │ Medium     │
  │ error rate change     │ resource leak or bad query. Possible rollback. │ min   │            │
  ├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
  │ CPU spiking every 5   │ Check scheduled tasks (cron jobs, batch jobs,  │ 10    │ Medium     │
  │ min (pattern)         │ cache refresh). Tune or reschedule if not      │ min   │            │
  │                       │ critical.                                      │       │            │
  └───────────────────────┴────────────────────────────────────────────────┴───────┴────────────┘

  ---
  Questions to Clarify ASAP

  1. Is the service responding to requests? (Is this customer-facing yet?)
  2. Did we deploy in the last hour? (rollback candidate?)
  3. Is CPU still elevated now (2026-04-05T03:50+)? Or did it drop back?
  4. What's the baseline CPU for this instance? (Is 92% abnormal or cyclical?)
  5. Is request volume up? (Check request rate metric; if traffic is normal, this is suspicious.)

  ---
  Escalation Criteria

  Escalate to incident commander if:
  - CPU remains >90% for >5 consecutive periods (25 minutes sustained)
  - Error rate rises above 1% (catalog-api 5xx or timeout errors)
  - Customer complaints in support tickets reference slow catalog or 500 errors
  - Unable to determine root cause within 15 minutes

  This does NOT yet require a SEV-1 page. It's a SEV-3 (investigate) trending toward SEV-2 (degraded
  service) if unresolved in 10 minutes.

  ---
  Long-Term: Fix the Alert Fatigue

  Once resolved, tune this alarm:
  - Increase evaluation periods to 2 (requires 2 consecutive 5-min periods above 90%, filters
  spikes)
  - Add missing metrics (error rate, latency) as composite alarm conditions
  - Set up dashboard with CPU + RequestCount + Errors + Latency to spot the pattern next time

Step 1.4

⏺ SRE Incident Analysis: HighCPUUtilization-catalog-api

  Severity Assessment

  ┌────────────────┬────────────────────────────────────────────────────────────────────────────┐
  │   Dimension    │                                 Assessment                                 │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Incident Level │ 3 (Elevated) — ops-critical SNS topic triggered, but off-peak timing       │
  │                │ limits customer impact                                                     │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Customer       │ Low (current) — 03:47 UTC is 6+ hours before peak traffic window (09:00    │
  │ Impact         │ UTC)                                                                       │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Blast Radius   │ Single instance — catalog-api only; ALB has other backends; cache layer in │
  │                │  place                                                                     │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ MTTR Estimate  │ 15–45 min (investigation + fix) vs. 5 min (instance restart/scale)         │
  ├────────────────┼────────────────────────────────────────────────────────────────────────────┤
  │ Confidence     │ Medium — single datapoint spike; need current state verification           │
  │ Level          │                                                                            │
  └────────────────┴────────────────────────────────────────────────────────────────────────────┘

  ---
  Immediate Triage (Next 2 Minutes)

  Check these in parallel while on call:

  1. Is CPU sustained or transient?
    - Query CloudWatch for last 10 minutes of CPU (1-min granularity)
    - If spike already dropped → transient, monitor closely
    - If still >85% → investigate root cause immediately
  2. Is traffic elevated?
    - Check RequestCount metric on ALB target group
    - Compare to baseline for 03:00–04:00 UTC window
    - If requests are normal → it's a process issue, not traffic
  3. Are database connections healthy?
    - Query DatabaseConnections on db-catalog RDS instance
    - If at/near 100 limit → app is likely in connection retry loop
    - If normal (40–60) → not the issue
  4. Check ElastiCache cache hit rate:
    - Cache hit rate <85% → probable cache node failure or eviction storm
    - This would explain elevated backend load

  ---
  Root Cause Decision Tree

  ├─ CPU still >85%?
  │  ├─ YES, requests normal (healthy baseline)
  │  │  └─ Process issue: memory leak, zombie threads, batch job
  │  │     → SSH, run `top`, check running processes
  │  │     → Check cron jobs, background tasks, recent deploys
  │  │
  │  ├─ YES, requests >3x baseline
  │  │  └─ Traffic spike (unusual for 03:47 UTC)
  │  │     → Check CDN cache invalidation events
  │  │     → Check for bot traffic / scraping
  │  │     → Consider immediate scale-up if sustained
  │  │
  │  ├─ YES, DB connections maxed
  │  │  └─ Database bottleneck (connection pool exhaustion)
  │  │     → Check RDS slow query logs
  │  │     → Rolling restart of app instances (careful — one at a time)
  │  │
  │  └─ YES, cache hit rate <85%
  │     └─ Cache failure (node down, eviction storm)
  │        → Failover to Redis standby if available
  │        → Monitor backend load as cache recovers
  │
  └─ CPU dropped below 80%?
     └─ Transient spike → monitor for patterns
        → If spikes recur every hour → likely scheduled job
        → If random → thermal blip or GC pause, continue monitoring

  ---
  Recommended Actions (In Order)

  Phase 1: Verify Scope (0–2 min)

  - Pull fresh CloudWatch metrics for CPU, RequestCount, DatabaseConnections (last 15 min)
  - Check X-Ray traces for slow transactions or errors
  - Review CloudWatch Logs for errors at 03:40–03:50 UTC window
  - Check EC2 system logs for OS-level issues (OOM, kernel panic markers)
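  For the log review, CloudWatch Logs can be filtered to the alarm window from the CLI. The log group name below is an assumption; substitute your actual group:

```shell
LOG_GROUP="/catalog-api/application"   # hypothetical log group name

# filter-log-events takes epoch milliseconds, hence the trailing 000
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$(date -u -d '2026-04-05T03:40:00Z' +%s)000" \
  --end-time "$(date -u -d '2026-04-05T03:50:00Z' +%s)000" \
  --filter-pattern "ERROR" \
  --region us-east-1
```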

  Phase 2: Investigate Without Interrupting Service (2–10 min)

  - If CPU still >85% & requests normal: SSH to i-0abc123def456001, run:
  top -b -n 2 | head -20        # Top CPU-consuming processes
  ps aux | grep java             # If app is JVM-based
  docker stats                   # If containerized
  free -h                        # Memory and swap
  - If DB connections maxed: Check RDS slow query log for queries taking >5s
  - If cache hit rate <85%: Restart Redis cluster nodes in maintenance window

  Phase 3: Mitigation (Choose Based on Investigation)

  Option A: Process Investigation Confirms Runaway Process
  - Kill the offending process
  - Monitor CPU for 5 min to confirm recovery
  - Post-incident: Review logs, check for memory leaks or infinite loops

  Option B: No Obvious Process Issue
  - Safe option: Resize to t3.xlarge (4 vCPU) immediately (stop → modify instance type → start,
    or scale out via the ASG if one exists)
    - Roughly halves CPU (92.3% on 2 vCPU → ~46% on 4 vCPU)
    - Buys time for investigation
    - Cost impact: roughly $0.08/hour extra (on-demand, us-east-1), temporary
  - Continue investigation while scaled up
  - Scale back down once root cause is fixed
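  The headroom arithmetic in Option B generalizes: for the same absolute workload, CPU percentage scales inversely with vCPU count. A quick sketch:

```shell
# Projected CPU% after a resize, assuming the same absolute workload
projected_cpu() {
  awk -v cpu="$1" -v old="$2" -v new="$3" 'BEGIN { printf "%.1f\n", cpu * old / new }'
}

projected_cpu 92.3 2 4   # t3.large (2 vCPU) → t3.xlarge (4 vCPU): roughly 46%
```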

  Option C: Database Connection Exhaustion Confirmed
  - Implement connection pool draining: stop accepting new requests on one instance (ALB removes
  from target group)
  - Restart the instance after connections drain (5–10 min)
  - Repeat on other instances if needed

  ---
  Alert Tuning (Post-Incident)

  Current threshold (90%) is aggressive for this workload:
  - Baseline peak: 60–65%, so 90% threshold has ~25% headroom
  - Single datapoint (EvaluationPeriods=1) triggers on transient spikes
  - Recommendation after root cause is known:
    - Adjust to 85% threshold with 2 evaluation periods (10 min sustained)
    - This filters transient spikes, catches real issues earlier
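  Applied via the CLI, the tuning above would look roughly like this (the SNS account id is a placeholder, and the period/statistic are assumed to match the existing alarm):

```shell
# Re-create the alarm with a lower threshold and 2 evaluation periods
THRESHOLD=85
EVAL_PERIODS=2   # two consecutive 5-min breaches required

aws cloudwatch put-metric-alarm \
  --alarm-name HighCPUUtilization-catalog-api \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456001 \
  --statistic Average \
  --period 300 \
  --threshold "$THRESHOLD" \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods "$EVAL_PERIODS" \
  --alarm-actions "arn:aws:sns:us-east-1:<account-id>:ops-critical" \
  --region us-east-1
```

  `put-metric-alarm` overwrites an alarm of the same name in place, so no delete/recreate dance is needed.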

  ---
  Handoff Checklist for On-Call

  - Alarm fired: 2026-04-05 03:47 UTC (off-peak)
  - Confidence: Medium (single spike, need current state)
  - Time to action: 1 min (ops-critical SLA)
  - Safe to restart? Likely yes, but investigate first (5 min cost vs. risk)
  - Escalation trigger: If CPU remains >85% after 10 min + you find no root cause → page database
  team (may indicate RDS issue)

  What does your current CloudWatch dashboard show? Is CPU still elevated?

Step 1.5

 ALARM ANALYSIS: HighCPUUtilization-catalog-api

  Initial Assessment

  Severity: CRITICAL (ops-critical SNS firing)
  Confidence Level: MEDIUM (single 5-min datapoint — need more data)
  Estimated Customer Impact: POTENTIAL (depends on error rate — check NOW)
  Time-Critical? YES (off-peak CPU spike is abnormal — usually indicates runaway process)

  ---
  Immediate Context Red Flag

  Time of alarm: 2026-04-05 03:47 UTC

  This is OFF-PEAK (baseline should be 15-25% CPU). Peak hours are 09:00-21:00 UTC.

  → This is NOT a traffic surge. Something else is consuming cycles at an odd hour. This raises the
  probability of:
  1. Runaway process (memory leak, infinite loop, or runaway query) — HIGHEST PROBABILITY
  2. Recent deployment gone wrong
  3. Cache failure (every query cascades to database)

  ---
  FIRST DIAGNOSTIC COMMAND (run immediately)

  # Get process snapshot on the instance
  aws ssm send-command \
    --instance-ids i-0abc123def456001 \
    --document-name "AWS-RunShellScript" \
    --region us-east-1 \
    --parameters 'commands=["echo === TOP ===; top -bn1 -o %CPU | head -15; echo === MEMORY ===; free -h; echo === DISK I/O ===; iostat -x 1 2 | tail -10; echo === SWAP ===; grep -i swap /proc/meminfo"]'

  # Check deployment history
  aws deploy describe-deployments \
    --region us-east-1 \
    --query 'deployments[0:3].[deploymentId,status,creator,createTime]' \
    --output table

  Why this first?
  - top identifies if a single process is runaway (look for >50% CPU on one PID)
  - free / iostat / swap show if instance is in distress (OOM swapping kills performance)
  - Deployment check catches if we shipped broken code
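  Note that `send-command` is asynchronous; the diagnostic output has to be fetched separately. A sketch of the round trip (same instance id as above):

```shell
INSTANCE_ID="i-0abc123def456001"

# Capture the command id from send-command...
CMD_ID=$(aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --region us-east-1 \
  --parameters 'commands=["top -bn1 -o %CPU | head -15"]' \
  --query 'Command.CommandId' --output text)

# ...then poll for the output once the command has run
sleep 5
aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id "$INSTANCE_ID" \
  --region us-east-1 \
  --query 'StandardOutputContent' --output text
```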

  ---
  DECISION TREE (while waiting for diagnostics)

  If top shows a single process > 50% CPU:

  Process Name = java / python / node?
   ├─ YES (app process):
   │  ├─ Check logs for ERROR/EXCEPTION spike
   │  ├─ Run: kill -TERM <PID> (graceful) → monitor for 2 min
   │  ├─ If CPU normalizes: investigate code leak (recent PR?)
   │  └─ If CPU stays high: kill -9 <PID> → alert dev team
  │
  └─ NO (system process: postgres, mysql, nginx):
     └─ Database or reverse proxy is slammed
     └─ Check RDS connections + ElastiCache health next

  If top shows distributed CPU (many processes at 5-10% each):

  → Likely cache failure or database cascade
  → Run ElastiCache health check immediately

  If no obvious runaway AND deployment < 30 min old:

  → Suspect code regression from recent deploy
  → Decision: Rollback vs. investigate
     └─ If error rate is spiking: ROLLBACK NOW
     └─ If error rate is normal: investigate further

  ---
  PARALLEL DIAGNOSTICS (run within 2 minutes)

  # Cache health
  aws elasticache describe-cache-clusters \
    --cache-cluster-id catalog-cache-1 \
    --show-cache-node-info \
    --region us-east-1 \
    --query 'CacheClusters[0].[CacheNodeType,CacheClusterStatus,CacheNodes[0].CacheNodeStatus]'

  # RDS connection count (live connections come from CloudWatch, not the RDS API)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name DatabaseConnections \
    --dimensions Name=DBInstanceIdentifier,Value=db-catalog \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 --statistics Maximum \
    --region us-east-1

  # Check X-Ray for errors (TraceSummaries expose HasError/HasFault flags)
  aws xray get-trace-summaries \
    --start-time $(date -u -d '5 minutes ago' +%s) \
    --end-time $(date -u +%s) \
    --region us-east-1 \
    --query 'TraceSummaries[0:5].[Http.HttpStatus,HasError,HasFault]'

  # ALB request count (to rule out traffic surge); AWS/ApplicationELB metrics
  # need a LoadBalancer dimension (app/<name>/<id>), so substitute yours
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value=app/<your-alb>/<id> \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Sum \
    --region us-east-1

  ---
  RESPONSE ACTIONS (priority order)

  ┌───────────────────────────────┬───────────────────────────────────────────────────┬─────────┐
  │            Finding            │                      Action                       │  MTTR   │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Runaway process identified    │ kill -TERM <PID> → wait 2 min → kill -9 if needed │ 3-5 min │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Recent deploy is culprit      │ Rollback via CodeDeploy                           │ 5-10    │
  │                               │                                                   │ min     │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Cache degraded (status ≠      │ Restart affected node → verify hit rate recovery  │ 2-3 min │
  │ available)                    │                                                   │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ RDS connections > 90 (of 100  │ Check for connection leaks in app; restart app    │ 5 min   │
  │ max)                          │ gracefully                                        │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ Error rate spiking in X-Ray   │ Correlation test: is error rate tied to CPU spike │ N/A     │
  │                               │  start time?                                      │         │
  ├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
  │ No clear cause after 5 min    │ Page secondary on-call → prepare for instance     │ —       │
  │                               │ replacement                                       │         │
  └───────────────────────────────┴───────────────────────────────────────────────────┴─────────┘

  ---
  ESCALATION CRITERIA

  Page dev team lead if:
  - CPU remains > 90% for > 10 minutes AND root cause is unclear
  - Multiple alarms firing simultaneously (CPU + latency + DB connections)
  - X-Ray shows error rate spiking in correlation with CPU

  Declare minor incident if:
  - Customer error rate confirmed elevated
  - ETA for fix > 15 minutes

  ---
  YOUR CALL RIGHT NOW

  Recommended immediate action:

  1. Run the first diagnostic command NOW (SSM send-command for top/free/iostat)
  2. Queue the parallel diagnostics while you wait for top output (2 min SLA)
  3. Set a 5-minute timer: If CPU doesn't drop and no root cause found, page secondary on-call

  Why this approach?
  - Off-peak + 92% CPU = probable runaway process, not a traffic issue
  - t3.large with 2 vCPU can't sustain 92% indefinitely — something is actively broken
  - You have ~5 minutes to identify the cause before customer impact cascades (if error rate is
  rising)

  What does top show when you run that first command?