@jmealo
Last active March 14, 2026 23:08
RCA: Intel-API AMQP Cascade Failure — CPU Limit Regression (2026-03-14)

Date: 2026-03-14
Environment: Production
Severity: Critical (P1)
Duration: ~23 hours (2026-03-14 00:15 UTC, ongoing)
Status: IN PROGRESS; fix identified, rolling restart underway

Summary

A 3x reduction in CPU limits for intel-api (1500m → 500m) caused progressive CPU throttling, which degraded AMQP publish performance. This cascaded into channel lock timeouts across the platform, causing 500 errors on intel-requests-api, consumer failures, and a 20k+ message backlog in RabbitMQ.

Timeline

| Time (UTC) | Event |
| --- | --- |
| 2026-03-09 00:47 | Commit dc5e9bef2 merged: "Standardize metrics configurations for backend services". Changed base values.yaml CPU limit from 1500m → 500m. Production override did NOT set a CPU limit, inheriting the new base value. |
| 2026-03-13 23:45 | Helm upgrade deployed to production. 150 intel-api pods rolled from ReplicaSet b6776cc7879f7695fbd. All pods now running with a 500m CPU limit (previously 1500m). |
| 2026-03-14 00:00 | Rollout complete. All 150 pods on the new ReplicaSet. |
| 2026-03-14 00:15 | First AMQP publish errors appear: 571 errors in 15 minutes. |
| 2026-03-14 00:15–04:00 | AMQP errors sustain at ~700/15 min with a brief dip at 04:00. |
| 2026-03-14 06:00–12:00 | Errors resume and escalate: 877 → 3,404/hr. |
| 2026-03-14 12:00–15:00 | Steady at ~3,200/hr. |
| 2026-03-14 15:00 | Inflection point: errors jump to 5,784/hr. Retry-storm feedback loop begins. |
| 2026-03-14 16:00–20:00 | Exponential growth: 10,558 → 18,763/hr. |
| 2026-03-14 ~22:45 | Investigation begins. Multiple critical alerts firing. |
| 2026-03-14 ~23:00 | Root cause identified. Rolling restart of intel-api and intel-requests-api initiated. CPU limit fix prepared. |

Root Cause

The Change

Commit dc5e9bef2 ("Standardize metrics configurations for backend services", 2026-03-09) modified applications/backend/helm/intel-api/values.yaml:

resources:
  limits:
-    cpu: 1500m
+    cpu: 500m
    memory: 2000Mi

The production environment override (environments/production/values.yaml) only set resources.requests — not resources.limits. This meant production inherited the base CPU limit, which dropped from 1500m to 500m.
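
The inheritance can be seen by placing the two files side by side. Helm deep-merges environment values files over the base at the key level, so any key absent from the override falls through to the base. A simplified sketch, abridged to the resources stanza (values taken from this RCA):

```yaml
# applications/backend/helm/intel-api/values.yaml (base, after dc5e9bef2)
resources:
  limits:
    cpu: 500m        # regressed from 1500m
    memory: 2000Mi

# environments/production/values.yaml (override, before the fix)
resources:
  requests:          # only requests are overridden...
    cpu: 75m
    memory: 832Mi
  # ...no limits key here, so limits.cpu falls through to the base 500m
```

Because the merge happens per key rather than per stanza, setting `resources.requests` in the override gives no protection against a change to `resources.limits` in the base.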

The Effect

  • 23.1% of CPU scheduling periods were throttled across 150 intel-api pods
  • 6.81 throttled-seconds per second accumulated across the fleet
  • While average CPU usage (~100m/pod) appeared low, burst workloads (AMQP publishes involving TLS, serialization, async lock acquisition) were hitting the 500m ceiling
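
The burst arithmetic can be made concrete. CFS enforces the limit per 100ms scheduling period, so a 500m limit grants 50ms of CPU per period; a burst needing more is stalled until the next period refills the quota. A rough model, assuming a single-threaded burst that starts at a period boundary (the 120ms burst size below is illustrative, not measured):

```python
CFS_PERIOD_MS = 100.0  # Linux CFS default scheduling period

def burst_wall_time_ms(cpu_needed_ms: float, limit_millicores: float) -> float:
    """Wall-clock time for a single-threaded CPU burst under a CFS quota.

    Simplified model: the burst uses its full quota each period, then
    stalls (is throttled) until the next period refills it.
    """
    quota_ms = limit_millicores / 1000.0 * CFS_PERIOD_MS  # CPU-ms per period
    if cpu_needed_ms <= quota_ms:
        return cpu_needed_ms  # fits within one period's quota: no throttling
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # the last period ends exactly at quota exhaustion: no final stall
        return (full_periods - 1) * CFS_PERIOD_MS + quota_ms
    return full_periods * CFS_PERIOD_MS + remainder

# A hypothetical 120ms publish burst (TLS + serialization + lock work):
print(burst_wall_time_ms(120, 1500))  # 120.0 ms under the old 1500m limit
print(burst_wall_time_ms(120, 500))   # 220.0 ms under the new 500m limit
```

Average utilization never shows this: the pod still reports ~100m average, while every burst nearly doubles in wall-clock time and holds locks for the duration.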

The Cascade

CPU limit 1500m → 500m
  → CPU throttling on burst workloads (23% of periods)
  → aiormq channel lock held longer during throttled publishes
  → Other publishes timeout (30s wait_for) → TimeoutError
  → intel-api: "AMQP publish error" on intel.from_cache events
  → Failed messages stored to DB (STORE_FAILED_AMQP_MESSAGES=1)
  → Stored messages replayed (PUBLISH_STORED_AMQP_MESSAGES=1) → retry storms
  → Retry storms cause publish rate spikes (up to 96k msg/sec)
  → More CPU pressure → more throttling → exponential growth
  → intel-requests-api: publish_lifecycle_event 30s timeouts → 500 on PATCH/DELETE
  → intel-searcher: AMQP connection death → ConsumerErrorSpike
  → outage-updater: ReadTimeout calling intel-api → queue backlog (20.4k messages)
  → notification-sender, message-amplifier: downstream failures
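
The channel-lock step in the chain above can be reproduced in miniature. This is a simplified stand-in, not aiormq's actual code: a shared asyncio.Lock models per-channel publish serialization, one artificially slow publish models a CPU-throttled one, and waiters give up via wait_for, as the production 30s timeout does (timings scaled down here):

```python
import asyncio

async def publish(lock: asyncio.Lock, write_time: float, timeout: float) -> str:
    """Model publish: serialize on the channel lock, fail if the wait is too long."""
    async def locked_write():
        async with lock:
            await asyncio.sleep(write_time)  # stands in for the frame write
            return "ok"
    try:
        return await asyncio.wait_for(locked_write(), timeout)
    except asyncio.TimeoutError:
        return "AMQP publish error"

async def main() -> list[str]:
    channel_lock = asyncio.Lock()
    # One throttled publish holds the lock for 0.3s; three normal publishes
    # queued behind it are only willing to wait 0.1s each.
    return await asyncio.gather(
        publish(channel_lock, 0.3, 1.0),
        *(publish(channel_lock, 0.01, 0.1) for _ in range(3)),
    )

print(asyncio.run(main()))
# ['ok', 'AMQP publish error', 'AMQP publish error', 'AMQP publish error']
```

Because every publish on a channel funnels through one lock, a single throttled writer converts directly into timeouts for everything queued behind it, which is why client-side errors spiked while RabbitMQ itself stayed healthy.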

Why It Took Time to Manifest

  1. The CPU limit regression was committed on Mar 9 but not deployed until Mar 13 23:45
  2. Initial error rate was moderate (~700/15min) — within existing alert thresholds
  3. The STORE_FAILED_AMQP_MESSAGES + PUBLISH_STORED_AMQP_MESSAGES feedback loop caused exponential growth — errors doubled every few hours
  4. By Mar 14 15:00, the retry storm overwhelmed the system
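
The store-and-replay dynamic can be sketched numerically. In this toy model (all parameters invented for illustration), each hour's failed publishes are stored and replayed the following hour; when replays per stored message times the per-attempt failure probability exceeds 1, the error rate grows geometrically instead of converging:

```python
def simulate_errors(hours: int, base_failures: float = 700.0,
                    replays_per_stored: float = 2.0,
                    fail_prob: float = 0.6) -> list[float]:
    """Hourly failed-publish counts under a store/replay feedback loop.

    Each stored message is replayed `replays_per_stored` times per hour;
    each attempt fails (and is re-stored) with probability `fail_prob`.
    Growth factor per hour = replays_per_stored * fail_prob.
    """
    stored = 0.0
    hourly_errors = []
    for _ in range(hours):
        attempts = base_failures + stored * replays_per_stored
        failed = attempts * fail_prob
        hourly_errors.append(failed)
        stored = failed  # failed messages are stored for the next replay pass
    return hourly_errors

growing = simulate_errors(12)                # factor 1.2 > 1: grows without bound
stable = simulate_errors(12, fail_prob=0.4)  # factor 0.8 < 1: settles near a ceiling
```

The same replay mechanism that is benign during a transient broker blip becomes an amplifier once the failure cause (CPU throttling) is persistent, which matches the moderate-then-exponential shape of the error timeline.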

Impact

Services Affected

| Service | Error Count | Error Type |
| --- | --- | --- |
| intel-api | 145,011 (since onset) | AMQP publish error: intel.from_cache |
| intel-requests-api | ~35,000 today | TimeoutError / RuntimeError on PATCH/DELETE |
| outage-updater | ~965 since 18:00 | ReadTimeout calling intel-api; Redis lock contention |
| intel-searcher (all variants) | Multiple pods | AMQP connection is None after maximum reconnect attempts |
| intel-searcher-retry | Multiple pods | AMQP reconnect task was cancelled |
| notification-sender | Elevated | ConsumerErrorSpike |
| message-amplifier | Elevated | ConsumerErrorSpike |

User Impact

  • PATCH and DELETE requests to intel-requests-api returning 500 errors (data update failures)
  • Lifecycle events not published → downstream consumers not processing intel request state changes
  • Outage update queue backlog of 20.4k messages (delayed outage processing)
  • Possible data staleness for end users viewing intel/outage data

RabbitMQ State During Incident

  • Cluster healthy (5 nodes, memory/disk OK, confirmations flowing)
  • Queue backlog: outage-updater 20.4k messages, intel-searcher-intel-request-retry 3.3k
  • Connections and channels stable (~1,350 connections, ~2,050 channels)
  • RabbitMQ was NOT the problem — the issue was client-side channel lock contention from CPU-throttled publish operations

Fix

Immediate (applied)

Restored CPU limit in applications/backend/helm/intel-api/environments/production/values.yaml:

resources:
  limits:
    cpu: 1500m    # Restored from base default of 500m
  requests:
    cpu: 75m
    memory: 832Mi

In Progress

  • Rolling restart of intel-api and intel-requests-api to clear dead AMQP connections
  • Deploy CPU limit fix

Post-Incident

  1. Audit all production services for CPU limit regressions from the dc5e9bef2 commit
  2. Add CPU throttling alert: Alert when container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 25% sustained for 15 minutes
  3. Review AMQP publish timeout: The 30s default in aiormq is too long for an API request path. A shorter timeout with faster failover would prevent cascade
  4. Review stored message replay: The STORE_FAILED_AMQP_MESSAGES + PUBLISH_STORED_AMQP_MESSAGES feedback loop turned a moderate issue into an exponential one
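
The throttling alert proposed in item 2 above could look roughly like the following Prometheus rule (rule and alert names are illustrative; the two metrics are the standard cAdvisor CFS counters named in the recommendation):

```yaml
groups:
  - name: cpu-throttling
    rules:
      - alert: ContainerCPUThrottledHigh   # illustrative name
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total[5m])
          )
          > 0.25
        for: 15m
        labels:
          severity: warning   # unlike the existing info-level CPUThrottlingHigh
        annotations:
          summary: "{{ $labels.container }} throttled >25% of CFS periods for 15m"
```

Using rate() over both counters keeps the ratio stable across counter resets, and the `for: 15m` clause encodes the "sustained for 15 minutes" requirement.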

Other Services with CPU Throttling

The same dc5e9bef2 commit may have affected other services. Current throttling observed:

| Service | Namespace | % Periods Throttled | Throttled sec/sec | Severity |
| --- | --- | --- | --- | --- |
| archive-tagger | default | 99.0% | 4.07 | Critical: near-complete throttle |
| db exporter | db | 90.2% | 0.21 | High: monitoring degraded |
| grafana-sc-dashboard | monitoring | 59.8% | low | Medium |
| snmp-heartbeat-sender | default | 50.0% | low | Medium |
| promitor-agent-scraper | monitoring | 33.8% | 0.13 | Medium |
| alertmanager | monitoring | 27.2% | 1.20 | Medium: alert delivery may be delayed |
| outage-validation-task-manager | default | 23.6% | 0.16 | Medium |
| kong proxy | kong | 22.5% | 1.22 | Medium: API gateway |
| byparr | infra | 21.6% | 0.12 | Low |

archive-tagger at 99% throttled is a separate issue that needs immediate attention.

Fixes Applied for Other Throttled Services

| Service | Old Limit | New Limit | Rationale |
| --- | --- | --- | --- |
| archive-tagger | 75m (base default, inherited) | 500m (production override) | Avg usage 70m, 99% throttled; limit was barely above average |
| pgcopydb exporter | 200m (base default, inherited) | 500m (production override) | Avg usage 50m, 90% throttled; spiky workload needs headroom |
| kong gateway | 250m (intentionally set) | Not changed | 22–33% throttled, but comment says "30d max 188m, 250m = 33% headroom". Deliberately sized; needs review. |

Detection Gap

This incident was not caught by alerts because:

  1. No alert exists for CPU throttling percentage
  2. The CPUThrottlingHigh alert was firing (85 instances) but is classified as info severity, so it was never escalated
  3. The AMQP error escalation was gradual enough to avoid tripping burst-based alerts until the exponential phase

Lessons Learned

  1. Base value changes affect all environments — CPU limit changes in base values.yaml propagate to any environment that doesn't explicitly override resources.limits
  2. Metric standardization commits should not change resource limits — the commit mixed unrelated concerns (metrics enable/disable + CPU limit changes)
  3. AMQP stored message replay can create feedback loops — failed publishes → store → replay → more failures
  4. CPU throttling needs production alerting — the CPUThrottlingHigh alert existed but was info-severity and ignored