@jmealo
Last active March 14, 2026 23:08
RCA: Intel-API AMQP Cascade Failure — CPU Limit Regression (2026-03-14)

Date: 2026-03-14
Environment: Production
Severity: Critical (P1)
Duration: ~23 hours (2026-03-14 00:15 UTC, ongoing)
Status: IN PROGRESS; fix identified, rolling restart underway

Summary

A 3x reduction in CPU limits for intel-api (1500m → 500m) caused progressive CPU throttling, which degraded AMQP publish performance. This cascaded into channel lock timeouts across the platform, causing 500 errors on intel-requests-api, consumer failures, and a 20k+ message backlog in RabbitMQ.

Timeline

| Time (UTC) | Event |
| --- | --- |
| 2026-03-09 00:47 | Commit dc5e9bef2 merged: "Standardize metrics configurations for backend services". Changed base values.yaml CPU limit from 1500m → 500m. Production override did NOT set a CPU limit, inheriting the new base value. |
| 2026-03-13 23:45 | Helm upgrade deployed to production. 150 intel-api pods rolled from ReplicaSet b6776cc7879f7695fbd. All pods now running with a 500m CPU limit (previously 1500m). |
| 2026-03-14 00:00 | Rollout complete. All 150 pods on the new ReplicaSet. |
| 2026-03-14 00:15 | First AMQP publish errors appear: 571 errors in 15 minutes. |
| 2026-03-14 00:15–04:00 | AMQP errors sustain at ~700/15 min with a brief dip at 04:00. |
| 2026-03-14 06:00–12:00 | Errors resume and escalate: 877 → 3,404/hr. |
| 2026-03-14 12:00–15:00 | Steady at ~3,200/hr. |
| 2026-03-14 15:00 | Inflection point: errors jump to 5,784/hr. Retry-storm feedback loop begins. |
| 2026-03-14 16:00–20:00 | Exponential growth: 10,558 → 18,763/hr. |
| 2026-03-14 ~22:45 | Investigation begins. Multiple critical alerts firing. |
| 2026-03-14 ~23:00 | Root cause identified. Rolling restart of intel-api and intel-requests-api initiated. CPU limit fix prepared. |

Root Cause

The Change

Commit dc5e9bef2 ("Standardize metrics configurations for backend services", 2026-03-09) modified applications/backend/helm/intel-api/values.yaml:

resources:
  limits:
-    cpu: 1500m
+    cpu: 500m
    memory: 2000Mi

The production environment override (environments/production/values.yaml) only set resources.requests — not resources.limits. This meant production inherited the base CPU limit, which dropped from 1500m to 500m.
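
The inheritance can be seen by placing the two files side by side. Helm deep-merges environment values files over the base at the key level, so any key absent from the override falls through to the base. A simplified sketch, abridged to the resources stanza (values taken from this RCA):

```yaml
# applications/backend/helm/intel-api/values.yaml (base, after dc5e9bef2)
resources:
  limits:
    cpu: 500m        # regressed from 1500m
    memory: 2000Mi

# environments/production/values.yaml (override, before the fix)
resources:
  requests:          # only requests are overridden...
    cpu: 75m
    memory: 832Mi
  # ...no limits key here, so limits.cpu falls through to the base 500m
```

Because the merge happens per key rather than per stanza, setting `resources.requests` in the override gives no protection against a change to `resources.limits` in the base.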

The Effect

  • 23.1% of CPU scheduling periods were throttled across 150 intel-api pods
  • 6.81 throttled-seconds per second accumulated across the fleet
  • While average CPU usage (~100m/pod) appeared low, burst workloads (AMQP publishes involving TLS, serialization, async lock acquisition) were hitting the 500m ceiling
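
The burst arithmetic can be made concrete. CFS enforces the limit per 100ms scheduling period, so a 500m limit grants 50ms of CPU per period; a burst needing more is stalled until the next period refills the quota. A rough model, assuming a single-threaded burst that starts at a period boundary (the 120ms burst size below is illustrative, not measured):

```python
CFS_PERIOD_MS = 100.0  # Linux CFS default scheduling period

def burst_wall_time_ms(cpu_needed_ms: float, limit_millicores: float) -> float:
    """Wall-clock time for a single-threaded CPU burst under a CFS quota.

    Simplified model: the burst uses its full quota each period, then
    stalls (is throttled) until the next period refills it.
    """
    quota_ms = limit_millicores / 1000.0 * CFS_PERIOD_MS  # CPU-ms per period
    if cpu_needed_ms <= quota_ms:
        return cpu_needed_ms  # fits within one period's quota: no throttling
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # the last period ends exactly at quota exhaustion: no final stall
        return (full_periods - 1) * CFS_PERIOD_MS + quota_ms
    return full_periods * CFS_PERIOD_MS + remainder

# A hypothetical 120ms publish burst (TLS + serialization + lock work):
print(burst_wall_time_ms(120, 1500))  # 120.0 ms under the old 1500m limit
print(burst_wall_time_ms(120, 500))   # 220.0 ms under the new 500m limit
```

Average utilization never shows this: the pod still reports ~100m average, while every burst nearly doubles in wall-clock time and holds locks for the duration.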

The Cascade

CPU limit 1500m → 500m
  → CPU throttling on burst workloads (23% of periods)
  → aiormq channel lock held longer during throttled publishes
  → Other publishes timeout (30s wait_for) → TimeoutError
  → intel-api: "AMQP publish error" on intel.from_cache events
  → Failed messages stored to DB (STORE_FAILED_AMQP_MESSAGES=1)
  → Stored messages replayed (PUBLISH_STORED_AMQP_MESSAGES=1) → retry storms
  → Retry storms cause publish rate spikes (up to 96k msg/sec)
  → More CPU pressure → more throttling → exponential growth
  → intel-requests-api: publish_lifecycle_event 30s timeouts → 500 on PATCH/DELETE
  → intel-searcher: AMQP connection death → ConsumerErrorSpike
  → outage-updater: ReadTimeout calling intel-api → queue backlog (20.4k messages)
  → notification-sender, message-amplifier: downstream failures
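
The channel-lock step in the chain above can be reproduced in miniature. This is a simplified stand-in, not aiormq's actual code: a shared asyncio.Lock models per-channel publish serialization, one artificially slow publish models a CPU-throttled one, and waiters give up via wait_for, as the production 30s timeout does (timings scaled down here):

```python
import asyncio

async def publish(lock: asyncio.Lock, write_time: float, timeout: float) -> str:
    """Model publish: serialize on the channel lock, fail if the wait is too long."""
    async def locked_write():
        async with lock:
            await asyncio.sleep(write_time)  # stands in for the frame write
            return "ok"
    try:
        return await asyncio.wait_for(locked_write(), timeout)
    except asyncio.TimeoutError:
        return "AMQP publish error"

async def main() -> list[str]:
    channel_lock = asyncio.Lock()
    # One throttled publish holds the lock for 0.3s; three normal publishes
    # queued behind it are only willing to wait 0.1s each.
    return await asyncio.gather(
        publish(channel_lock, 0.3, 1.0),
        *(publish(channel_lock, 0.01, 0.1) for _ in range(3)),
    )

print(asyncio.run(main()))
# ['ok', 'AMQP publish error', 'AMQP publish error', 'AMQP publish error']
```

Because every publish on a channel funnels through one lock, a single throttled writer converts directly into timeouts for everything queued behind it, which is why client-side errors spiked while RabbitMQ itself stayed healthy.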

Why It Took Time to Manifest

  1. The CPU limit regression was committed on Mar 9 but not deployed until Mar 13 23:45
  2. Initial error rate was moderate (~700/15min) — within existing alert thresholds
  3. The STORE_FAILED_AMQP_MESSAGES + PUBLISH_STORED_AMQP_MESSAGES feedback loop caused exponential growth — errors doubled every few hours
  4. By Mar 14 15:00, the retry storm overwhelmed the system
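
The store-and-replay dynamic can be sketched numerically. In this toy model (all parameters invented for illustration), each hour's failed publishes are stored and replayed the following hour; when replays per stored message times the per-attempt failure probability exceeds 1, the error rate grows geometrically instead of converging:

```python
def simulate_errors(hours: int, base_failures: float = 700.0,
                    replays_per_stored: float = 2.0,
                    fail_prob: float = 0.6) -> list[float]:
    """Hourly failed-publish counts under a store/replay feedback loop.

    Each stored message is replayed `replays_per_stored` times per hour;
    each attempt fails (and is re-stored) with probability `fail_prob`.
    Growth factor per hour = replays_per_stored * fail_prob.
    """
    stored = 0.0
    hourly_errors = []
    for _ in range(hours):
        attempts = base_failures + stored * replays_per_stored
        failed = attempts * fail_prob
        hourly_errors.append(failed)
        stored = failed  # failed messages are stored for the next replay pass
    return hourly_errors

growing = simulate_errors(12)                # factor 1.2 > 1: grows without bound
stable = simulate_errors(12, fail_prob=0.4)  # factor 0.8 < 1: settles near a ceiling
```

The same replay mechanism that is benign during a transient broker blip becomes an amplifier once the failure cause (CPU throttling) is persistent, which matches the moderate-then-exponential shape of the error timeline.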

Impact

Services Affected

| Service | Error Count | Error Type |
| --- | --- | --- |
| intel-api | 145,011 (since onset) | AMQP publish error: intel.from_cache |
| intel-requests-api | ~35,000 today | TimeoutError / RuntimeError on PATCH/DELETE |
| outage-updater | ~965 since 18:00 | ReadTimeout calling intel-api; Redis lock contention |
| intel-searcher (all variants) | Multiple pods | AMQP connection is None after maximum reconnect attempts |
| intel-searcher-retry | Multiple pods | AMQP reconnect task was cancelled |
| notification-sender | Elevated | ConsumerErrorSpike |
| message-amplifier | Elevated | ConsumerErrorSpike |

User Impact

  • PATCH and DELETE requests to intel-requests-api returning 500 errors (data update failures)
  • Lifecycle events not published → downstream consumers not processing intel request state changes
  • Outage update queue backlog of 20.4k messages (delayed outage processing)
  • Possible data staleness for end users viewing intel/outage data

RabbitMQ State During Incident

  • Cluster healthy (5 nodes, memory/disk OK, confirmations flowing)
  • Queue backlog: outage-updater 20.4k messages, intel-searcher-intel-request-retry 3.3k
  • Connections and channels stable (~1,350 connections, ~2,050 channels)
  • RabbitMQ was NOT the problem — the issue was client-side channel lock contention from CPU-throttled publish operations

Fix

Immediate (applied)

Restored CPU limit in applications/backend/helm/intel-api/environments/production/values.yaml:

resources:
  limits:
    cpu: 1500m    # Restored from base default of 500m
  requests:
    cpu: 75m
    memory: 832Mi

In Progress

  • Rolling restart of intel-api and intel-requests-api to clear dead AMQP connections
  • Deploy CPU limit fix

Post-Incident

  1. Audit all production services for CPU limit regressions from the dc5e9bef2 commit
  2. Add CPU throttling alert: Alert when container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 25% sustained for 15 minutes
  3. Review AMQP publish timeout: The 30s default in aiormq is too long for an API request path. A shorter timeout with faster failover would prevent cascade
  4. Review stored message replay: The STORE_FAILED_AMQP_MESSAGES + PUBLISH_STORED_AMQP_MESSAGES feedback loop turned a moderate issue into an exponential one
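
The throttling alert proposed in item 2 above could look roughly like the following Prometheus rule (rule and alert names are illustrative; the two metrics are the standard cAdvisor CFS counters named in the recommendation):

```yaml
groups:
  - name: cpu-throttling
    rules:
      - alert: ContainerCPUThrottledHigh   # illustrative name
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total[5m])
          )
          > 0.25
        for: 15m
        labels:
          severity: warning   # unlike the existing info-level CPUThrottlingHigh
        annotations:
          summary: "{{ $labels.container }} throttled >25% of CFS periods for 15m"
```

Using rate() over both counters keeps the ratio stable across counter resets, and the `for: 15m` clause encodes the "sustained for 15 minutes" requirement.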

Other Services with CPU Throttling

The same dc5e9bef2 commit may have affected other services. Current throttling observed:

| Service | Namespace | % Periods Throttled | Throttled sec/sec | Severity |
| --- | --- | --- | --- | --- |
| archive-tagger | default | 99.0% | 4.07 | Critical: near-complete throttle |
| db exporter | db | 90.2% | 0.21 | High: monitoring degraded |
| grafana-sc-dashboard | monitoring | 59.8% | low | Medium |
| snmp-heartbeat-sender | default | 50.0% | low | Medium |
| promitor-agent-scraper | monitoring | 33.8% | 0.13 | Medium |
| alertmanager | monitoring | 27.2% | 1.20 | Medium: alert delivery may be delayed |
| outage-validation-task-manager | default | 23.6% | 0.16 | Medium |
| kong proxy | kong | 22.5% | 1.22 | Medium: API gateway |
| byparr | infra | 21.6% | 0.12 | Low |

archive-tagger at 99% throttled is a separate issue that needs immediate attention.

Fixes Applied for Other Throttled Services

| Service | Old Limit | New Limit | Rationale |
| --- | --- | --- | --- |
| archive-tagger | 75m (base default, inherited) | 500m (production override) | Avg usage 70m, 99% throttled; limit was barely above average |
| pgcopydb exporter | 200m (base default, inherited) | 500m (production override) | Avg usage 50m, 90% throttled; spiky workload needs headroom |
| kong gateway | 250m (intentionally set) | Not changed | 22–33% throttled, but comment says "30d max 188m, 250m = 33% headroom". Deliberately sized; needs review. |

Detection Gap

This incident was not caught by alerts because:

  1. No alert exists for CPU throttling percentage
  2. The CPUThrottlingHigh alert was firing (85 instances) but is classified as info severity, so it was never escalated
  3. The AMQP error escalation was gradual enough to avoid tripping burst-based alerts until the exponential phase

Lessons Learned

  1. Base value changes affect all environments — CPU limit changes in base values.yaml propagate to any environment that doesn't explicitly override resources.limits
  2. Metric standardization commits should not change resource limits — the commit mixed unrelated concerns (metrics enable/disable + CPU limit changes)
  3. AMQP stored message replay can create feedback loops — failed publishes → store → replay → more failures
  4. CPU throttling needs production alerting — the CPUThrottlingHigh alert existed but was info-severity and ignored