This alarm indicates acute resource pressure on your catalog-api instance. Here's the immediate
triage:
Critical Status
- Instance: i-0abc123def456001 in us-east-1
- Metric: 92.3% CPU (crossed 90% threshold at 03:47 UTC on 2026-04-05)
- Severity: CRITICAL (routed to ops-critical SNS topic)
Immediate Actions (Priority Order)
1. Verify the instance is still responsive (next 2 minutes)
# Check if instance is reachable
aws ec2 describe-instances --instance-ids i-0abc123def456001 --region us-east-1
# Check recent CloudWatch metrics for context
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456001 --start-time 2026-04-05T03:17:00Z \
--end-time 2026-04-05T04:00:00Z --period 60 --statistics Average,Maximum
2. Identify what's consuming CPU (next 5 minutes)
- SSH into the instance and run: top / htop / ps aux --sort=-%cpu to find the culprit process
- Check application logs for errors or runaway loops
3. Scale horizontally if catalog-api is load-balanceable
- If behind an Auto Scaling Group: verify ASG scaling policies are active
- If not: launch a second instance and update load balancer target group
- Don't restart yet — you need to know why CPU spiked first
4. Suppress cascading alerts (if appropriate)
- If you're investigating, consider temporarily disabling the alarm to avoid alert fatigue
- Re-enable once root cause is identified
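If you do silence the alarm, disable its actions rather than deleting it, so the metric keeps being evaluated. A minimal sketch; the `${AWS:-aws}` indirection is only a dry-run hook (set `AWS=echo` to preview the call), not part of the AWS CLI:

```shell
# Silence (on) or restore (off) the alarm's SNS actions while investigating.
silence_alarm() {
  case "$2" in
    on)  verb=disable-alarm-actions ;;
    off) verb=enable-alarm-actions ;;
    *)   echo "usage: silence_alarm <alarm-name> on|off" >&2; return 1 ;;
  esac
  ${AWS:-aws} cloudwatch "$verb" --alarm-names "$1"
}
```

Usage: `silence_alarm HighCPUUtilization-catalog-api on`, then the same with `off` once the root cause is identified.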
Questions to Answer Now
┌──────────────────────────────────┬────────────────────┬────────────────────────────────────┐
│ Question │ Why │ What To Do │
├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
│ Is this a one-time spike or │ Transient vs. │ Check CloudWatch graph for 1-hour │
│ sustained? │ structural problem │ history. Is it trending down? │
├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
│ Is the instance oversized for │ May need permanent │ Check memory & disk. Is CPU the │
│ the workload? │ resize │ only bottleneck? │
├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
│ Are there deployment/config │ Common cause of │ Check git log in your infra repo, │
│ changes in the last 1-2 hours? │ sudden load │ or deployment logs in your CI/CD │
│ │ │ system │
├──────────────────────────────────┼────────────────────┼────────────────────────────────────┤
│ Is there a known data ingestion │ Scheduled work may │ Check cron jobs: crontab -l │
│ job or batch process? │ explain spike │ │
└──────────────────────────────────┴────────────────────┴────────────────────────────────────┘
Next Steps (Based on Root Cause)
- If load increased legitimately → Scale up (more instances or larger instance type)
- If a process is runaway → Kill it, check application code/config
- If misconfigured alarm → Adjust threshold (92% may be normal for this workload)
- If instance is undersized → Migrate to larger instance type + adjust ASG
---
Action needed from ops team: What's the current instance type (t3.medium, m5.large, etc.) and is
catalog-api behind an ASG?
⏺ SRE Analysis: HighCPUUtilization-catalog-api
Severity Assessment: MEDIUM (Requires immediate investigation, not emergency)
Customer Impact Potential: Moderate to High (if sustained, causes request queueing/timeouts)
MTTR Estimate: 5–15 min (if traffic-driven); 30+ min (if application bug)
Confidence Level: LOW — single metric, missing context
---
What This Alarm Tells You
✅ What we know:
- catalog-api hit 92% CPU in a 5-minute window
- Routed to ops-critical (your alerting is configured correctly)
⚠️ Significant gaps:
- One datapoint — is this a sustained issue or a 30-second spike within a 5-min period?
- No request volume — is traffic actually up, or is a single query inefficient?
- No error rates — are customers experiencing 5xx errors, or just slower responses?
- No latency data — can't tell if catalog-api is timing out downstream calls
- No memory usage — is this a memory leak manifesting as CPU (GC pressure)?
---
Decision Tree: What's Happening?
CPU @ 92% → Is error rate also elevated?
├─ YES → Application bug (infinite loop, inefficient query, memory leak)
│ Action: Check recent deployments, trace slowest queries
│ MTTR: 30 min (requires debugging + rollback or fix-forward)
│
└─ NO → Check request volume vs. baseline
├─ Traffic UP 2-3x → Traffic spike, normal scaling response
│ Action: Autoscaling should kick in; check ASG state
│ MTTR: 3-5 min (scale out)
│
└─ Traffic NORMAL → Undersized instance or unknown workload
Action: Check running processes; possible resource leak
MTTR: 5-10 min (restart) or 30+ min (investigation)
---
Immediate Actions (Next 5 Minutes)
1. Check real-time metrics (CloudWatch dashboard or CLI):
# Pull last 10 minutes of data
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456001 \
--statistics Average \
--start-time 2026-04-05T03:37:00Z \
--end-time 2026-04-05T03:47:00Z \
--period 60
# Check application metrics (the API takes one metric name per call)
for metric in RequestCount ErrorRate Latency; do
  aws cloudwatch get-metric-statistics \
    --namespace CatalogAPI \
    --metric-name "$metric" \
    --statistics Average,Sum \
    --start-time 2026-04-05T03:37:00Z \
    --end-time 2026-04-05T03:47:00Z \
    --period 60
done
2. Assess customer impact:
- Is catalog-api responding? curl -I https://api.example.com/catalog/health
- Check error logs: grep ERROR /var/log/catalog-api/*.log | tail -20
- Are downstream services timing out? (inventory, recommendations, cart)
3. Rule out transient spike:
- If it's the only high-CPU datapoint, and the service is responding normally → No immediate
action. Monitor closely.
- If CPU stayed high for 3+ consecutive periods → Escalate to triage phase (below).
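One concrete way to apply the consecutive-periods rule: count how many returned datapoints exceed the threshold. The fixture below is made-up data in the shape of `get-metric-statistics --output text` (with only `Average` requested, the value is the second column; the column order can differ if you request other statistics):

```shell
# Sample --output text lines; Average is in column 2
cat > /tmp/cpu.txt <<'EOF'
DATAPOINTS 91.2 2026-04-05T03:42:00Z Percent
DATAPOINTS 92.3 2026-04-05T03:43:00Z Percent
DATAPOINTS 61.0 2026-04-05T03:44:00Z Percent
EOF
# Count datapoints above 85% CPU
awk '$2 > 85 { n++ } END { print n+0 }' /tmp/cpu.txt
# prints 2
```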
---
Triage Phase (If CPU Remains >85% or Error Rate Rises)
Test 1: Is it a traffic surge?
# Compare current request rate to baseline
# If RequestRate > baseline by 2x, autoscaling group should handle it
# If ASG isn't scaling, check:
aws autoscaling describe-auto-scaling-instances \
--instance-ids i-0abc123def456001
# Are desired capacity and running instances in sync?
If yes: Expected behavior. Autoscaling in progress. No action, monitor scale-out completion.
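The desired-capacity-vs-running-instances check can be done in one call. A sketch; the ASG name is a placeholder you'd get from the describe-auto-scaling-instances output above, and `${AWS:-aws}` is only a dry-run hook:

```shell
# Print DesiredCapacity and the number of InService instances for an ASG.
# If the two numbers differ, a scale-out is pending (or stuck).
asg_in_sync() {
  ${AWS:-aws} autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].[DesiredCapacity,length(Instances[?LifecycleState==`InService`])]' \
    --output text
}
```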
Test 2: Is it an application issue?
# SSH to instance and check running processes
ps aux --sort=-%cpu | head -10
# Look for: Java process (GC), Python runaway, database client
# If recent deployment: check app logs for exceptions
tail -f /var/log/catalog-api/application.log | grep -i "exception\|error\|timeout"
# Check database queries if catalog-api is query-heavy
# (you'll need slow query logs or APM tracing)
If yes: Likely a bug. Roll back the last deployment (if deployed in the last 30 min), or restart
the service (if a crash or memory leak is suspected). Investigate root cause post-incident.
Test 3: Is the instance undersized?
# Confirm the instance type and state (memory and disk pressure aren't visible
# here; they need CloudWatch agent metrics or on-host tools like free/iostat)
aws ec2 describe-instances --instance-ids i-0abc123def456001 \
  --query 'Reservations[0].Instances[0].[InstanceType,State.Name]'
# If it's a small burstable type (t3.small/t3.micro) and traffic is normal, consider upsizing
---
Recommended Next Steps
┌───────────────────────┬────────────────────────────────────────────────┬───────┬────────────┐
│ Scenario │ Action │ MTTR │ Confidence │
├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
│ Single spike, service │ Wait & monitor for next 10 min. Set 5-min SNS │ — │ High │
│ healthy, no errors │ silence if you need focus. │ │ │
├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
│ Sustained CPU >85% + │ Trigger autoscaling scale-out or restart the │ 3–5 │ High │
│ error rate rising │ instance. Parallel: check recent deployments. │ min │ │
├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
│ Sustained CPU + no │ Investigate running processes. Likely a │ 30 │ Medium │
│ error rate change │ resource leak or bad query. Possible rollback. │ min │ │
├───────────────────────┼────────────────────────────────────────────────┼───────┼────────────┤
│ CPU spiking every 5 │ Check scheduled tasks (cron jobs, batch jobs, │ 10 │ │
│ min (pattern) │ cache refresh). Tune or reschedule if not │ min │ Medium │
│ │ critical. │ │ │
└───────────────────────┴────────────────────────────────────────────────┴───────┴────────────┘
---
Questions to Clarify ASAP
1. Is the service responding to requests? (Is this customer-facing yet?)
2. Did we deploy in the last hour? (rollback candidate?)
3. Is CPU still elevated now (2026-04-05T03:50+)? Or did it drop back?
4. What's the baseline CPU for this instance? (Is 92% abnormal or cyclical?)
5. Is request volume up? (Check request rate metric; if traffic is normal, this is suspicious.)
---
Escalation Criteria
Escalate to incident commander if:
- CPU remains >90% for >5 consecutive periods (25 minutes sustained)
- Error rate rises above 1% (catalog-api 5xx or timeout errors)
- Customer complaints in support tickets reference slow catalog or 500 errors
- Unable to determine root cause within 15 minutes
This does NOT yet require a SEV-1 page. Treat it as a SEV-3 (investigate), trending toward SEV-2
(degraded service) if unresolved in 10 minutes.
---
Long-Term: Fix the Alert Fatigue
Once resolved, tune this alarm:
- Increase evaluation periods to 2 (requires 2 consecutive 5-min periods above 90%, which filters
transient spikes)
- Add the missing signals (error rate, latency) as composite alarm conditions
- Set up a dashboard with CPU + RequestCount + Errors + Latency to spot the pattern next time
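The composite-alarm idea could be sketched like this. The child alarm name `HighErrorRate-catalog-api` is hypothetical and would need to exist as a regular alarm first; `${AWS:-aws}` is only a dry-run hook:

```shell
# Page only when high CPU coincides with elevated errors.
# $1 = SNS topic ARN for alarm actions (e.g. the existing ops-critical topic).
create_composite() {
  ${AWS:-aws} cloudwatch put-composite-alarm \
    --alarm-name catalog-api-cpu-and-errors \
    --alarm-rule "ALARM(HighCPUUtilization-catalog-api) AND ALARM(HighErrorRate-catalog-api)" \
    --alarm-actions "$1"
}
```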
⏺ SRE Incident Analysis: HighCPUUtilization-catalog-api
Severity Assessment
┌────────────────┬────────────────────────────────────────────────────────────────────────────┐
│ Dimension │ Assessment │
├────────────────┼────────────────────────────────────────────────────────────────────────────┤
│ Incident Level │ 3 (Elevated) — ops-critical SNS topic triggered, but off-peak timing │
│ │ limits customer impact │
├────────────────┼────────────────────────────────────────────────────────────────────────────┤
│ Customer │ Low (current) — 03:47 UTC is 6+ hours before peak traffic window (09:00 │
│ Impact │ UTC) │
├────────────────┼────────────────────────────────────────────────────────────────────────────┤
│ Blast Radius │ Single instance — catalog-api only; ALB has other backends; cache layer in │
│ │ place │
├────────────────┼────────────────────────────────────────────────────────────────────────────┤
│ MTTR Estimate │ 15–45 min (investigation + fix) vs. 5 min (instance restart/scale) │
├────────────────┼────────────────────────────────────────────────────────────────────────────┤
│ Confidence │ Medium — single datapoint spike; need current state verification │
│ Level │ │
└────────────────┴────────────────────────────────────────────────────────────────────────────┘
---
Immediate Triage (Next 2 Minutes)
Check these in parallel while on call:
1. Is CPU sustained or transient?
- Query CloudWatch for last 10 minutes of CPU (1-min granularity)
- If spike already dropped → transient, monitor closely
- If still >85% → investigate root cause immediately
2. Is traffic elevated?
- Check RequestCount metric on ALB target group
- Compare to baseline for 03:00–04:00 UTC window
- If requests are normal → it's a process issue, not traffic
3. Are database connections healthy?
- Query DatabaseConnections on db-catalog RDS instance
- If at/near 100 limit → app is likely in connection retry loop
- If normal (40–60) → not the issue
4. Check ElastiCache cache hit rate:
- Cache hit rate <85% → probable cache node failure or eviction storm
- This would explain elevated backend load
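For reference, the hit-rate math behind that 85% cutoff, with made-up counter values (ElastiCache exposes CacheHits and CacheMisses as CloudWatch metrics):

```shell
# Hit rate = CacheHits / (CacheHits + CacheMisses); the counts are illustrative.
awk 'BEGIN { hits = 452000; misses = 98000; printf "%.1f%%\n", 100 * hits / (hits + misses) }'
# prints 82.2% -- below the 85% line, i.e. this would point at the cache
```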
---
Root Cause Decision Tree
├─ CPU still >85%?
│ ├─ YES, requests normal (healthy baseline)
│ │ └─ Process issue: memory leak, zombie threads, batch job
│ │ → SSH, run `top`, check running processes
│ │ → Check cron jobs, background tasks, recent deploys
│ │
│ ├─ YES, requests >3x baseline
│ │ └─ Traffic spike (unusual for 03:47 UTC)
│ │ → Check CDN cache invalidation events
│ │ → Check for bot traffic / scraping
│ │ → Consider immediate scale-up if sustained
│ │
│ ├─ YES, DB connections maxed
│ │ └─ Database bottleneck (connection pool exhaustion)
│ │ → Check RDS slow query logs
│ │ → Rolling restart of app instances (careful — one at a time)
│ │
│ └─ YES, cache hit rate <85%
│ └─ Cache failure (node down, eviction storm)
│ → Failover to Redis standby if available
│ → Monitor backend load as cache recovers
│
└─ CPU dropped below 80%?
└─ Transient spike → monitor for patterns
→ If spikes recur every hour → likely scheduled job
→ If random → thermal blip or GC pause, continue monitoring
---
Recommended Actions (In Order)
Phase 1: Verify Scope (0–2 min)
- Pull fresh CloudWatch metrics for CPU, RequestCount, DatabaseConnections (last 15 min)
- Check X-Ray traces for slow transactions or errors
- Review CloudWatch Logs for errors at 03:40–03:50 UTC window
- Check EC2 system logs for OS-level issues (OOM, kernel panic markers)
Phase 2: Investigate Without Interrupting Service (2–10 min)
- If CPU still >85% & requests normal: SSH to i-0abc123def456001, run:
top -b -n 2 | head -20      # Top CPU-consuming processes
pgrep -af java              # If app is JVM-based (avoids matching the grep itself)
docker stats --no-stream    # If containerized (--no-stream exits after one sample)
free -h                     # Memory and swap
- If DB connections maxed: Check RDS slow query log for queries taking >5s
- If cache hit rate <85%: Restart Redis cluster nodes in maintenance window
Phase 3: Mitigation (Choose Based on Investigation)
Option A: Process Investigation Confirms Runaway Process
- Kill the offending process
- Monitor CPU for 5 min to confirm recovery
- Post-incident: Review logs, check for memory leaks or infinite loops
Option B: No Obvious Process Issue
- Safe option: resize to t3.xlarge (4 vCPU). A resize requires a stop/start, so prefer scaling
out if the service is behind a load balancer
  - Halves per-core pressure (~92% on 2 vCPU → ~46% on 4 vCPU, assuming the load parallelizes)
  - Buys time for investigation
  - Cost impact: roughly doubles the instance's hourly cost while scaled up
- Continue investigation while scaled up
- Scale back down once root cause is fixed
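The ~46% figure is just the observed spike spread over twice the cores, assuming the load parallelizes:

```shell
# 92.3% busy on 2 vCPUs = the same busy cycles on 4 vCPUs at half the utilization
awk 'BEGIN { print 92.3 * 2 / 4 }'
# prints 46.15
```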
Option C: Database Connection Exhaustion Confirmed
- Implement connection draining: deregister one instance from the ALB target group so it stops
receiving new requests
- Restart the instance after connections drain (5–10 min)
- Repeat on other instances if needed
---
Alert Tuning (Post-Incident)
Current threshold (90%) is aggressive for this workload:
- Baseline peak: 60–65%, so 90% threshold has ~25% headroom
- Single datapoint (EvaluationPeriods=1) triggers on transient spikes
- Recommendation after root cause is known:
- Adjust to 85% threshold with 2 evaluation periods (10 min sustained)
- This filters transient spikes, catches real issues earlier
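The tuned alarm could be sketched as a CLI call; the SNS topic ARN is passed in rather than guessed, and `${AWS:-aws}` is only a dry-run hook:

```shell
# Re-create the alarm with an 85% threshold and 2-of-2 evaluation periods.
# $1 = SNS topic ARN for alarm actions (e.g. the existing ops-critical topic).
tune_cpu_alarm() {
  ${AWS:-aws} cloudwatch put-metric-alarm \
    --alarm-name HighCPUUtilization-catalog-api \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456001 \
    --statistic Average \
    --period 300 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --datapoints-to-alarm 2 \
    --alarm-actions "$1"
}
```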
---
Handoff Checklist for On-Call
- Alarm fired: 2026-04-05 03:47 UTC (off-peak)
- Confidence: Medium (single spike, need current state)
- Time to action: 1 min (ops-critical SLA)
- Safe to restart? Likely yes, but investigate first (5 min cost vs. risk)
- Escalation trigger: If CPU remains >85% after 10 min + you find no root cause → page database
team (may indicate RDS issue)
What does your current CloudWatch dashboard show? Is CPU still elevated?
ALARM ANALYSIS: HighCPUUtilization-catalog-api
Initial Assessment
Severity: CRITICAL (ops-critical SNS firing)
Confidence Level: MEDIUM (single 5-min datapoint — need more data)
Estimated Customer Impact: POTENTIAL (depends on error rate — check NOW)
Time-Critical? YES (off-peak CPU spike is abnormal — usually indicates runaway process)
---
Immediate Context Red Flag
Time of alarm: 2026-04-05 03:47 UTC
This is OFF-PEAK (baseline should be 15-25% CPU). Peak hours are 09:00-21:00 UTC.
→ This is NOT a traffic surge. Something else is consuming cycles at an odd hour. This raises the
probability of:
1. Runaway process (memory leak, infinite loop, or pathological query) — HIGHEST PROBABILITY
2. Recent deployment gone wrong
3. Cache failure (every query cascades to database)
---
FIRST DIAGNOSTIC COMMAND (run immediately)
# Get a process/memory/disk snapshot on the instance via SSM
aws ssm send-command \
  --instance-ids i-0abc123def456001 \
  --document-name "AWS-RunShellScript" \
  --region us-east-1 \
  --parameters '{"commands":["top -bn1 -o %CPU | head -15","free -h","iostat -x 1 2 | tail -10","grep -i swap /proc/meminfo"]}'
# Check deployment history (list-deployments returns IDs; feed them to
# batch-get-deployments for status, creator, and createTime)
aws deploy list-deployments \
  --region us-east-1 \
  --query 'deployments[0:3]' \
  --output table
Why this first?
- top identifies if a single process is runaway (look for >50% CPU on one PID)
- free / iostat / swap show if instance is in distress (OOM swapping kills performance)
- Deployment check catches if we shipped broken code
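send-command only queues the script; its output has to be fetched separately once it finishes. A sketch, where the command ID comes from the send-command response's `Command.CommandId` field and `${AWS:-aws}` is only a dry-run hook:

```shell
# Fetch stdout of a finished SSM command on the affected instance.
fetch_ssm_output() {
  ${AWS:-aws} ssm get-command-invocation \
    --command-id "$1" \
    --instance-id i-0abc123def456001 \
    --region us-east-1 \
    --query StandardOutputContent \
    --output text
}
```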
---
DECISION TREE (while waiting for diagnostics)
If top shows a single process > 50% CPU:
Process Name = java / python / node?
├─ YES (app process):
│ ├─ Check logs for ERROR/EXCEPTION spike
│ ├─ Run: kill -TERM <PID> (graceful) → monitor for 2 min
│ └─ If CPU normalizes: investigate code leak (recent PR?)
│ └─ If CPU stays high: kill -9 <PID> → alert dev team
│
└─ NO (system process: postgres, mysql, nginx):
└─ Database or reverse proxy is slammed
└─ Check RDS connections + ElastiCache health next
If top shows distributed CPU (many processes at 5-10% each):
→ Likely cache failure or database cascade
→ Run ElastiCache health check immediately
If no obvious runaway AND deployment < 30 min old:
→ Suspect code regression from recent deploy
→ Decision: Rollback vs. investigate
└─ If error rate is spiking: ROLLBACK NOW
└─ If error rate is normal: investigate further
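The TERM-then-KILL escalation in the tree can be scripted. A sketch with a configurable grace period (default 120 s to match the 2-minute wait above):

```shell
# Graceful TERM first; escalate to KILL only if the PID survives the grace period.
kill_runaway() {
  pid=$1
  grace=${2:-120}   # seconds to wait before kill -9
  kill -TERM "$pid" 2>/dev/null
  i=0
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || { echo "exited"; return 0; }
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid"
  echo "force-killed"
}
```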
---
PARALLEL DIAGNOSTICS (run within 2 minutes)
# Cache health
aws elasticache describe-cache-clusters \
--cache-cluster-id catalog-cache-1 \
--show-cache-node-info \
--region us-east-1 \
--query 'CacheClusters[0].[CacheNodeType,CacheClusterStatus,CacheNodes[0].CacheNodeStatus]'
# RDS connection count (DatabaseConnections is a CloudWatch metric,
# not a field on describe-db-instances)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=db-catalog \
  --start-time $(date -u -d '10 minutes ago' +%s) \
  --end-time $(date -u +%s) \
  --period 60 \
  --statistics Average \
  --region us-east-1
# Check X-Ray for errors (summary fields: Http.HttpStatus, HasError, HasFault)
aws xray get-trace-summaries \
  --start-time $(date -u -d '5 minutes ago' +%s) \
  --end-time $(date -u +%s) \
  --region us-east-1 \
  --query 'TraceSummaries[0:5].[Http.HttpStatus,HasError,HasFault]'
# ALB request count (to rule out a traffic surge); RequestCount needs the
# LoadBalancer dimension, so substitute your ALB's full name
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/<your-alb-name>/<id> \
  --start-time $(date -u -d '10 minutes ago' +%s) \
  --end-time $(date -u +%s) \
  --period 300 \
  --statistics Sum \
  --region us-east-1
---
RESPONSE ACTIONS (priority order)
┌───────────────────────────────┬───────────────────────────────────────────────────┬─────────┐
│ Finding │ Action │ MTTR │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ Runaway process identified │ kill -TERM <PID> → wait 2 min → kill -9 if needed │ 3-5 min │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ Recent deploy is culprit │ Rollback via CodeDeploy │ 5-10 │
│ │ │ min │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ Cache degraded (status ≠ │ Restart affected node → verify hit rate recovery │ 2-3 min │
│ available) │ │ │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ RDS connections > 90 (of 100 │ Check for connection leaks in app; restart app │ 5 min │
│ max) │ gracefully │ │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ Error rate spiking in X-Ray │ Correlation test: is error rate tied to CPU spike │ N/A │
│ │ start time? │ │
├───────────────────────────────┼───────────────────────────────────────────────────┼─────────┤
│ No clear cause after 5 min │ Page secondary on-call → prepare for instance │ — │
│ │ replacement │ │
└───────────────────────────────┴───────────────────────────────────────────────────┴─────────┘
---
ESCALATION CRITERIA
Page dev team lead if:
- CPU remains > 90% for > 10 minutes AND root cause is unclear
- Multiple alarms firing simultaneously (CPU + latency + DB connections)
- X-Ray shows error rate spiking in correlation with CPU
Declare minor incident if:
- Customer error rate confirmed elevated
- ETA for fix > 15 minutes
---
YOUR CALL RIGHT NOW
Recommended immediate action:
1. Run the first diagnostic command NOW (SSM send-command for top/free/iostat)
2. Queue the parallel diagnostics while you wait for top output (2 min SLA)
3. Set a 5-minute timer: If CPU doesn't drop and no root cause found, page secondary on-call
Why this approach?
- Off-peak + 92% CPU = probable runaway process, not a traffic issue
- If this is a burstable t3-class instance (the type is still unconfirmed), sustained 92% CPU is
also draining CPU credits; something is actively broken
- You have ~5 minutes to identify the cause before customer impact cascades (if error rate is
rising)
What does top show when you run that first command?