- Date: 2026-03-14
- Environment: Production
- Severity: Critical (P1)
- Duration: ~23 hours (2026-03-14 00:15 UTC — ongoing)
- Status: IN PROGRESS — fix identified, rolling restart underway
A 3x reduction in CPU limits for intel-api (1500m → 500m) caused progressive CPU throttling, which degraded AMQP publish performance. This cascaded into channel lock timeouts across the platform, causing 500 errors on intel-requests-api, consumer failures, and a 20k+ message backlog in RabbitMQ.
| Time (UTC) | Event |
|---|---|
| 2026-03-09 00:47 | Commit dc5e9bef2 merged: "Standardize metrics configurations for backend services". Changed base values.yaml CPU limit from 1500m → 500m. Production override did NOT set a CPU limit, inheriting the new base value. |
| 2026-03-13 23:45 | Helm upgrade deployed to production. 150 intel-api pods rolled from ReplicaSet b6776cc78 → 79f7695fbd. All pods now running with 500m CPU limit (previously 1500m). |
| 2026-03-14 00:00 | Rollout complete. All 150 pods on new ReplicaSet. |
| 2026-03-14 00:15 | First AMQP publish errors appear — 571 errors in 15 minutes. |
| 2026-03-14 00:15–04:00 | AMQP errors sustain at ~700/15min with brief dip at 04:00. |
| 2026-03-14 06:00–12:00 | Errors resume and escalate: 877→3,404/hr. |
| 2026-03-14 12:00–15:00 | Steady at ~3,200/hr. |
| 2026-03-14 15:00 | Inflection point: errors jump to 5,784/hr. Retry storm feedback loop begins. |
| 2026-03-14 16:00–20:00 | Exponential growth: 10,558 → 18,763/hr. |
| 2026-03-14 ~22:45 | Investigation begins. Multiple critical alerts firing. |
| 2026-03-14 ~23:00 | Root cause identified. Rolling restart of intel-api and intel-requests-api initiated. CPU limit fix prepared. |
Commit dc5e9bef2 ("Standardize metrics configurations for backend services", 2026-03-09) modified `applications/backend/helm/intel-api/values.yaml`:

```diff
 resources:
   limits:
-    cpu: 1500m
+    cpu: 500m
     memory: 2000Mi
```

The production environment override (`environments/production/values.yaml`) only set `resources.requests`, not `resources.limits`. As a result, production inherited the base CPU limit, which dropped from 1500m to 500m.
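One way to close this gap is to pin `resources.limits` in the environment override so a base-chart change can no longer propagate silently. A sketch (the repo's actual override file layout may differ):

```yaml
# environments/production/values.yaml (sketch)
# Pinning limits here means a base values.yaml change cannot
# silently lower the production CPU ceiling again.
resources:
  requests:
    cpu: 75m
    memory: 832Mi
  limits:
    cpu: 1500m
    memory: 2000Mi
```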
- 23.1% of CPU scheduling periods were throttled across 150 intel-api pods
- 6.81 throttled-seconds per second accumulated across the fleet
- While average CPU usage (~100m/pod) appeared low, burst workloads (AMQP publishes involving TLS, serialization, async lock acquisition) were hitting the 500m ceiling
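For scale, the fleet-wide throttle figure works out to a small-sounding per-pod average, which is why mean CPU graphs looked healthy while bursts were being clamped. A quick back-of-envelope check using the numbers above:

```python
# Back-of-envelope check of the fleet-wide throttling numbers above:
# 150 intel-api pods accumulating 6.81 throttled-seconds per second.
pods = 150
throttled_sec_per_sec = 6.81

per_pod_ms = throttled_sec_per_sec / pods * 1000
print(f"~{per_pod_ms:.1f} ms of throttle per pod per second")  # ~45.4 ms
```

45 ms/sec per pod looks negligible as an average, but CFS throttling is concentrated into the 100 ms scheduling periods where a burst (TLS handshake, serialization, publish) exceeds the 500m quota, so individual publishes stall far longer than the average suggests.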
```
CPU limit 1500m → 500m
  → CPU throttling on burst workloads (23% of periods)
  → aiormq channel lock held longer during throttled publishes
  → Other publishes timeout (30s wait_for) → TimeoutError
  → intel-api: "AMQP publish error" on intel.from_cache events
  → Failed messages stored to DB (STORE_FAILED_AMQP_MESSAGES=1)
  → Stored messages replayed (PUBLISH_STORED_AMQP_MESSAGES=1) → retry storms
  → Retry storms cause publish rate spikes (up to 96k msg/sec)
  → More CPU pressure → more throttling → exponential growth
  → intel-requests-api: publish_lifecycle_event 30s timeouts → 500 on PATCH/DELETE
  → intel-searcher: AMQP connection death → ConsumerErrorSpike
  → outage-updater: ReadTimeout calling intel-api → queue backlog (20.4k messages)
  → notification-sender, message-amplifier: downstream failures
```
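The channel-lock starvation at the heart of the chain can be sketched in a few lines of asyncio. This is a minimal illustration, not aiormq's actual internals: the 0.2s hold stands in for a CPU-throttled publish, and the 0.05s timeout stands in for the 30s production `wait_for`:

```python
import asyncio

async def slow_publish(lock: asyncio.Lock) -> None:
    """One publish, slowed by CPU throttling, holds the shared channel lock."""
    async with lock:
        await asyncio.sleep(0.2)  # stand-in for throttled TLS/serialize/publish

async def waiting_publish(lock: asyncio.Lock) -> str:
    """A peer publish waits on the same lock with a timeout, aiormq-style."""
    try:
        await asyncio.wait_for(lock.acquire(), timeout=0.05)
        lock.release()
        return "published"
    except asyncio.TimeoutError:
        return "TimeoutError"

async def main() -> str:
    lock = asyncio.Lock()
    holder = asyncio.create_task(slow_publish(lock))
    await asyncio.sleep(0.01)  # let the slow publish grab the lock first
    result = await waiting_publish(lock)
    await holder
    return result

print(asyncio.run(main()))  # prints TimeoutError
```

The waiter fails even though the broker is healthy — the timeout is entirely client-side, which matches the later observation that RabbitMQ itself was never the problem.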
- The CPU limit regression was committed on Mar 9 but not deployed until Mar 13 23:45
- Initial error rate was moderate (~700/15min) — within existing alert thresholds
- The `STORE_FAILED_AMQP_MESSAGES` + `PUBLISH_STORED_AMQP_MESSAGES` feedback loop caused exponential growth — errors doubled every few hours
- By Mar 14 15:00, the retry storm overwhelmed the system
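The compounding effect of the replay loop can be illustrated with a toy model (illustrative numbers only, not measured values): replayed failures add to the next interval's publish load, and the failure rate itself rises with load as throttling worsens, so failures feed on themselves:

```python
def simulate(hours: int, fresh: int = 1000, capacity: int = 2000) -> list[int]:
    """Toy model of the store-and-replay loop: each hour, last hour's
    failures are replayed on top of fresh traffic, and the failure rate
    grows with total load (standing in for worsening CPU throttling)."""
    stored = 0
    failures_per_hour = []
    for _ in range(hours):
        attempts = fresh + stored                      # replay adds to load
        fail_rate = min(0.95, attempts / capacity)     # load-dependent failures
        failures = int(attempts * fail_rate)
        stored = failures                              # failures stored for replay
        failures_per_hour.append(failures)
    return failures_per_hour

print(simulate(6))  # failures climb hour over hour instead of draining
```

With a fixed failure rate the stored backlog would converge; it is the load-dependent failure rate (more load → more throttling → more failures) that turns replay into a runaway, matching the observed escalation from ~700/15min to 18k+/hr.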
| Service | Error Count | Error Type |
|---|---|---|
| intel-api | 145,011 AMQP publish errors (since onset) | AMQP publish error: intel.from_cache |
| intel-requests-api | ~35,000 errors today | TimeoutError / RuntimeError on PATCH/DELETE |
| outage-updater | ~965 errors since 18:00 | ReadTimeout calling intel-api, Redis lock contention |
| intel-searcher (all variants) | Multiple pods | AMQP connection is None after maximum reconnect attempts |
| intel-searcher-retry | Multiple pods | AMQP reconnect task was cancelled |
| notification-sender | Elevated | ConsumerErrorSpike |
| message-amplifier | Elevated | ConsumerErrorSpike |
- PATCH and DELETE requests to `intel-requests-api` returning 500 errors (data update failures)
- Lifecycle events not published → downstream consumers not processing intel request state changes
- Outage update queue backlog of 20.4k messages (delayed outage processing)
- Possible data staleness for end users viewing intel/outage data
- Cluster healthy (5 nodes, memory/disk OK, confirmations flowing)
- Queue backlog: `outage-updater` 20.4k messages, `intel-searcher-intel-request-retry` 3.3k
- Connections and channels stable (~1,350 connections, ~2,050 channels)
- RabbitMQ was NOT the problem — the issue was client-side channel lock contention from CPU-throttled publish operations
Restored CPU limit in `applications/backend/helm/intel-api/environments/production/values.yaml`:

```yaml
resources:
  limits:
    cpu: 1500m  # Restored from base default of 500m
  requests:
    cpu: 75m
    memory: 832Mi
```

- Rolling restart of `intel-api` and `intel-requests-api` to clear dead AMQP connections
- Deploy CPU limit fix
- Audit all production services for CPU limit regressions from the `dc5e9bef2` commit
- Add CPU throttling alert: alert when `container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 25%` sustained for 15 minutes
- Review AMQP publish timeout: the 30s default in `aiormq` is too long for an API request path; a shorter timeout with faster failover would prevent cascades
- Review stored message replay: the `STORE_FAILED_AMQP_MESSAGES` + `PUBLISH_STORED_AMQP_MESSAGES` feedback loop turned a moderate issue into an exponential one
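The proposed throttling alert could look roughly like this — a sketch assuming the kube-prometheus-stack `PrometheusRule` CRD; the metric names come from cAdvisor and the threshold/duration from the action item above:

```yaml
# Sketch of the proposed CPU throttling alert (PrometheusRule CRD assumed).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: CPUThrottlingSustained
          expr: |
            sum by (namespace, pod, container) (
              increase(container_cpu_cfs_throttled_periods_total[5m])
            )
            /
            sum by (namespace, pod, container) (
              increase(container_cpu_cfs_periods_total[5m])
            ) > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.container }} throttled on >25% of CPU periods for 15m"
```

Unlike the existing info-severity `CPUThrottlingHigh`, a warning-or-higher severity here would have paged on this incident within the first hour.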
The same dc5e9bef2 commit may have affected other services. Current throttling observed:
| Service | Namespace | % Periods Throttled | Throttled sec/sec | Severity |
|---|---|---|---|---|
| archive-tagger | default | 99.0% | 4.07 | Critical — near-complete throttle |
| db exporter | db | 90.2% | 0.21 | High — monitoring degraded |
| grafana-sc-dashboard | monitoring | 59.8% | low | Medium |
| snmp-heartbeat-sender | default | 50.0% | low | Medium |
| promitor-agent-scraper | monitoring | 33.8% | 0.13 | Medium |
| alertmanager | monitoring | 27.2% | 1.20 | Medium — alert delivery may be delayed |
| outage-validation-task-manager | default | 23.6% | 0.16 | Medium |
| kong proxy | kong | 22.5% | 1.22 | Medium — API gateway |
| byparr | infra | 21.6% | 0.12 | Low |
archive-tagger at 99% throttled is a separate issue that needs immediate attention.
| Service | Old Limit | New Limit | Rationale |
|---|---|---|---|
| archive-tagger | 75m (base default, inherited) | 500m (production override) | Avg usage 70m, 99% throttled — limit was barely above avg |
| pgcopydb exporter | 200m (base default, inherited) | 500m (production override) | Avg usage 50m, 90% throttled — spiky workload needs headroom |
| kong gateway | 250m (intentionally set) | Not changed | 22-33% throttled, but comment says "30d max 188m, 250m = 33% headroom". Needs review but was deliberately sized. |
This incident was not caught by alerts because:
- No alert exists for CPU throttling percentage
- The `CPUThrottlingHigh` alert was firing (85 instances) but is classified as `info` severity
- The AMQP error escalation was gradual enough to avoid tripping burst-based alerts until the exponential phase
- Base value changes affect all environments — CPU limit changes in the base `values.yaml` propagate to any environment that doesn't explicitly override `resources.limits`
- Metric standardization commits should not change resource limits — the commit mixed unrelated concerns (metrics enable/disable + CPU limit changes)
- AMQP stored message replay can create feedback loops — failed publishes → store → replay → more failures
- CPU throttling needs production alerting — the `CPUThrottlingHigh` alert existed but was info-severity and ignored