Jeff Mealo jmealo

OTel Metrics Rollout — Monday Checklist

Background

The prometheus_client Python library uses threading.Lock per label combination on every .labels().observe() call. This caused OOM crashes (700+ MiB → OOMKill) on aaa-api in staging. The fix replaces the prometheus_client backend with OpenTelemetry via a PROMETHEUS_BACKEND=otel env var toggle in gisual-prometheus-clients.

Validated in staging since 2026-03-22: aaa-api running at 162 MiB (under 368 MiB limit), zero restarts, zero 500 errors, all metrics present in /metrics output.

RCA: Missing intel_requests.created (ENG-3349) — Corrected


Status	PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Supersedes	Previous RCA
Incident date	2026-03-10
Analysis date	2026-03-16
Reverted	intel-requests-api !48 (batch_id change from !47)

RCA: Intel-API AMQP Cascade Failure — CPU Limit Regression

Date: 2026-03-14 Environment: Production Severity: Critical (P1) Duration: ~23 hours (2026-03-14 00:15 UTC — ongoing) Status: IN PROGRESS — fix identified, rolling restart underway

Summary

Staging Deploy Status — 2026-03-14

Summary

Deployed 25 backend services to staging to test metrics changes. Found and fixed 5 bugs introduced by library version mismatches. All services are now running healthy.

Libraries Published

Library	Version	Fix

RCA: search-retry-feeder CrashLoopBackOff (Production)

Date: 2026-03-11 Reported by: Diego via Slack/Datadog Investigated by: Jeff Mealo (with Claude Code) Severity: Low (self-recovering, no data loss) Status: Root cause identified, fix committed and pending deploy

Summary

42 Red Herrings: Notification-Sender Queue Backup (March 5-6, 2026)

Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.

Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.

March 5: 29 Red Herrings (5-hour investigation)

EXPLAIN ANALYZE Queries for Notification-Sender Hot Path

Source: Production logs, 2026-03-06 08:44 UTC Charter org_id: 9de5d801-235a-4451-8c89-d2c3974c71e8

All queries use real IDs extracted from prod notification-sender logs.

WARNING: Run these inside a BEGIN; ... ROLLBACK; transaction on a read replica if possible. EXPLAIN ANALYZE actually executes the query.

Notification-Sender Bottleneck Deep Dive — 2026-03-06

Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.

Notification-Sender Production Triage — 2026-03-06

Date: March 6, 2026 (~04:00–08:30 UTC) Environment: Production Service: notification-sender (v0.5.45) Severity: SLA breach — notifications delayed beyond 15-minute target

	[package]
	name = "wasm-h3-map"
	version = "0.1.0"
	edition = "2021"

	[lib]
	crate-type = ["cdylib"]

	[dependencies]
	wasm-bindgen = "0.2"

Jeff Mealo jmealo

OTel Metrics Rollout — Monday Checklist

Background

RCA: Missing intel_requests.created (ENG-3349) — Corrected

RCA: Intel-API AMQP Cascade Failure — CPU Limit Regression

Summary

Staging Deploy Status — 2026-03-14

Summary

Libraries Published

RCA: search-retry-feeder CrashLoopBackOff (Production)

Summary

42 Red Herrings: Notification-Sender Queue Backup (March 5-6, 2026)

March 5: 29 Red Herrings (5-hour investigation)

EXPLAIN ANALYZE Queries for Notification-Sender Hot Path

Notification-Sender Bottleneck Deep Dive — 2026-03-06

Table of Contents

Notification-Sender Production Triage — 2026-03-06

Executive Summary