Date: 2026-04-04
Objective: Migrate pg-common primary to newly provisioned nodes across staging, demo, and production, then decommission old-node instances to reach a 3-replica cluster per environment.
| Instance | Role | Node | Node Age |
|---|---|---|---|
The prometheus_client Python library allocates a threading.Lock per label combination and takes it on every .labels().observe() call. This caused OOM crashes (700+ MiB → OOMKill) on aaa-api in staging. The fix replaces the prometheus_client backend with OpenTelemetry, toggled by a PROMETHEUS_BACKEND=otel env var in gisual-prometheus-clients.
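A minimal sketch of how such a toggle can look. The factory name, adapter shape, and meter name below are assumptions for illustration; only the PROMETHEUS_BACKEND=otel env var comes from the rollout described above:

```python
import os

def make_histogram(name, description, labelnames=()):
    """Hypothetical sketch of a backend toggle for gisual-prometheus-clients.

    Only the PROMETHEUS_BACKEND=otel env var is from this doc; everything
    else here is an assumed shape, not the actual library wiring.
    """
    if os.environ.get("PROMETHEUS_BACKEND") == "otel":
        from opentelemetry import metrics

        otel_hist = metrics.get_meter(
            "gisual-prometheus-clients"  # assumed meter name
        ).create_histogram(name, description=description)

        class _Bound:
            def __init__(self, attrs):
                self._attrs = attrs

            def observe(self, value):
                # OTel passes attributes per observation instead of keeping a
                # child object (with its own threading.Lock) per combination.
                otel_hist.record(value, attributes=self._attrs)

        class _Adapter:
            """Keeps existing .labels().observe() call sites unchanged."""
            def labels(self, **attrs):
                return _Bound(attrs)

        return _Adapter()

    from prometheus_client import Histogram
    return Histogram(name, description, labelnames=labelnames)
```

Call sites keep the existing hist.labels(...).observe(...) shape either way, so the backend can be flipped per environment without code changes.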
Validated in staging since 2026-03-22: aaa-api running at 162 MiB (under 368 MiB limit), zero restarts, zero 500 errors, all metrics present in /metrics output.
| Field | Value |
|---|---|
| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.
Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.
Source: Production logs, 2026-03-06 08:44 UTC
Charter org_id: 9de5d801-235a-4451-8c89-d2c3974c71e8
All queries use real IDs extracted from prod notification-sender logs.
WARNING: Run these inside a `BEGIN; ... ROLLBACK;` transaction on a read replica if possible. EXPLAIN ANALYZE actually executes the query.
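A minimal helper showing the safe pattern; psycopg2 and the function name are assumptions here, and any DB-API driver works the same way:

```python
import psycopg2  # assumed driver; use whatever connection factory you already have

def explain_analyze(conn, query: str, params=None) -> str:
    """Run EXPLAIN ANALYZE with its side effects discarded.

    EXPLAIN ANALYZE executes the query, so any writes it performs must be
    rolled back. psycopg2 opens a transaction implicitly on the first
    statement; the rollback in the finally block discards it either way.
    """
    try:
        with conn.cursor() as cur:
            # Wraps one of the queries from this doc; params carries the
            # real IDs (e.g. the Charter org_id above) via %s placeholders.
            cur.execute("EXPLAIN ANALYZE " + query, params)
            return "\n".join(row[0] for row in cur.fetchall())
    finally:
        conn.rollback()
```

Pointing the connection at a read replica adds a second layer of safety in case a rollback is ever missed.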
Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.