RCA: Missing intel_requests.created messages (March 10, 2026)

RCA: Missing intel_requests.created — Corrected Analysis (ENG-3349)

Status: PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Date: 2026-03-10 incident; 2026-03-16 re-analysis
JIRA: ENG-3349
Previous RCA: Gist d81250648d (superseded by this document)


Summary

2,798 of 300,839 intel requests on March 10 never had facts created (~0.93%). Re-analysis of 6 sample batches (404 IRs) using correct identifiers shows:

  • event-recorder: 404/404 — 0% loss[^1]
  • AIPT: 404/404 — 0% loss[^2]
  • ASU: 388/404 — 3.96% loss, consistent across every batch[^3]

The loss is ASU-specific. The previous RCA incorrectly attributed loss to all consumers due to a search methodology error[^4].

What the Previous RCA Got Wrong

| Claim | Correction |
| --- | --- |
| "event-recorder missed 8,146" | 0% loss — received every message in all sampled batches[^1] |
| "AIPT received but ASU didn't" | AIPT also received 100%[^2] |
| "Independent consumer-side shutdown race" | Only ASU loses messages — a shared-library bug would affect all consumers equally |
| "MR 47: separate batch_id fixes lock contention" | Broke traceability, didn't address ASU loss — reverted[^5] |

The error: consumer logs index messages by correlation_id and batch_id, not intel_request_id. The original RCA searched by intel_request_id, found nothing, and concluded the messages were lost. In reality, the messages were there under different identifiers[^6].

Tracing Methodology

ID Mapping

Cody provided intel_request_id values. Consumer logs don't contain this field on processing lines[^7]. The mapping is:

intel_request_id  →  DB metadata column[^db-query]
  └→ correlation_id  (unique per location, logged as CID)
  └→ batch_id        (shared per API request, = original correlation_id)[^batch-id-code]

10 sample IRs mapped to 6 distinct batches, all org 9de5d801[^8].

Counting Method

Each processed message produces exactly 2 log lines (Processing... + Processed...) from the shared amqp-consumer library[^9]. Counting message_type:exact("intel_requests.created") lines per batch_id per service, divided by 2, gives IRs processed[^10].
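Applying the lines-÷-2 rule to the raw line counts reported in footnotes 1–3 reproduces the per-service totals (a trivial sketch, using only numbers stated in this document):

```python
# Each processed message emits exactly two log lines (Processing + Processed),
# so IRs processed = matching log lines // 2. Raw counts from footnotes 1-3.
raw_lines = {"event-recorder": 808, "AIPT": 808, "ASU": 776}
processed = {svc: n // 2 for svc, n in raw_lines.items()}
print(processed)  # {'event-recorder': 404, 'AIPT': 404, 'ASU': 388}
```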

Results

| batch_id | event-recorder | AIPT | ASU | ASU missing |
| --- | --- | --- | --- | --- |
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
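As a cross-check, the per-batch counts can be re-totaled in a few lines (all numbers copied from the results table):

```python
# Per-batch (expected, ASU-processed) counts copied from the results table
batches = {
    "c2a627b3": (174, 168),
    "215181db": (22, 21),
    "812f79b7": (68, 66),
    "d334c245": (64, 62),
    "2186cea6": (22, 20),
    "7438ef94": (54, 51),
}
expected = sum(e for e, _ in batches.values())  # 404
asu = sum(a for _, a in batches.values())       # 388
missing = expected - asu                        # 16
print(f"{missing}/{expected} = {missing / expected:.2%}")  # 16/404 = 3.96%
```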

Why intel_request_id Appears as a Structured Field (But Misleads)

Vector's VRL pipeline[^11] extracts IDs from unstructured log text into structured VictoriaLogs fields:

| Log pattern | Extracted field | VRL source |
| --- | --- | --- |
| (Intel Request {uuid}) | intel_request_id | [^12] |
| (CID {uuid}) | correlation_id | [^13] |
| (Batch ID {uuid}) | batch_id | [^14] |

ASU logs (Intel Request {id}): Received power_status.updated...[^15] on discard lines — Vector extracts this into intel_request_id. But the generic Processing/Processed message type: intel_requests.created lines[^9] don't contain that pattern, so intel_request_id is absent from the very lines that prove message delivery.

This creates a trap: searching intel_request_id:exact("56a219a6...") finds ASU's power_status.updated discard (hours later, different batch) but misses the intel_requests.created processing entirely.
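To make the trap concrete, compare the two query shapes (LogsQL syntax as used elsewhere in this document; the full UUIDs are the traced sample IR 56a219a6-7597-42c7-85ea-1636cdfa8e53 and its batch c2a627b3-3f7c-4c89-808c-4d79b19c69cb). The misleading search, which matches only the later power_status.updated discard line:

```
intel_request_id:exact("56a219a6-7597-42c7-85ea-1636cdfa8e53")
```

The correct search, which matches the Processing/Processed lines that prove delivery:

```
message_type:exact("intel_requests.created") AND batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
```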

Event-recorder never logs individual IR IDs at all[^16] — it only stores event_type, correlation_id, and organization_id as structured metadata.

What We Don't Know

The root cause of ASU's ~4% loss is unidentified. Theories (all unverified):

  1. Silent discard in ASU handler — an early return path that doesn't log[^15]
  2. Backpressure — ASU processing time (0.12-0.15s/msg) is 3x slower than event-recorder (0.05s/msg)
  3. Queue configuration — different prefetch/QoS or replica count

Next Steps

  1. Deploy _on_message logging (MR 17[^17]) on ASU — determines whether messages reach ASU's callback at all
  2. Audit ASU handler for silent drops — check all return paths in intel_requests.created processing
  3. Compare ASU queue config to event-recorder and AIPT (prefetch, replicas, resource limits)

Verification Query (post MR 17 deployment)

```
service_name:exact("asset-status-updater") AND "Received message" AND "intel_requests.created"
```

Compare count to event-recorder's count for the same time window. If equal → loss is post-delivery (handler bug). If lower → loss is pre-delivery (RabbitMQ/queue issue).
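The decision rule above can be written down directly (function and message strings are illustrative, not project code):

```python
def classify_loss(asu_received: int, event_recorder_received: int) -> str:
    """Apply the decision rule to post-MR-17 'Received message' counts."""
    if asu_received == event_recorder_received:
        return "post-delivery (handler bug): messages reach ASU's callback"
    if asu_received < event_recorder_received:
        return "pre-delivery (RabbitMQ/queue issue): messages never reach ASU"
    return "unexpected: ASU received more than event-recorder; recheck the query"

# Using the sampled totals (404 expected, 388 seen by ASU) as example input:
print(classify_loss(388, 404))  # pre-delivery branch
```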

Actions Taken

  • Reverted MR 47[^5] — restored correlation_id as batch_id (traceability fix)
  • MR 17[^17] still open — diagnostic logging, pending Diego's review of shutdown protection

Grounding

Database

VictoriaLogs Queries

Grafana Verification Links

Source Code

Vector VRL (Log Field Extraction)

MRs

Footnotes

[^1]: event-recorder raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total = 808 lines ÷ 2 = 404 IRs.

[^2]: AIPT raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total = 808 lines ÷ 2 = 404 IRs.

[^3]: ASU raw: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102. Total = 776 lines ÷ 2 = 388 IRs.

[^4]: The original RCA gist did not preserve the exact queries used; it states findings like "No log for IR 56a219a6" without query syntax. Based on field analysis, the search likely used intel_request_id or a free-text IR UUID, neither of which matches intel_requests.created processing lines.

[^5]: Revert MR — intel-requests-api !48 (merged 2026-03-16). Restores self.correlation_id as batch_id.

[^6]: intel_request_id is stored in the DB row; correlation_id and batch_id are in the metadata JSONB. Consumer logs contain only the latter two on processing lines.

[^7]: victorialogs_field_names for message_type:exact("intel_requests.created") per service, 2026-03-10T13:00–14:00Z. ASU: intel_request_id present (27,478 hits on co-stored lines, zero hits when filtered to intel_requests.created + a specific IR ID). event-recorder: field absent entirely.

[^8]: SELECT intel_request_id::text, (metadata->>'correlation_id')::text, (metadata->>'batch_id')::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN (...) on production-app-db-2-restore. 10 IRs → 6 batches, all org 9de5d801.

[^9]: amqp-consumer generic logging — gisual-amqp-consumer/app.py:657-664 (Processing) and 717-725 (Processed). Logs correlation_id, request_id, batch_id — never intel_request_id. Logger context fields are set at lines 632-642.

[^10]: victorialogs_stats_query with service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-...") OR ...) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines — run per service at time=2026-03-10T20:00:00Z.

[^11]: Vector VRL ID extraction pipeline — vector/vrl/07-extract-ids.vrl. All patterns use if !exists(...) guards (only extract from text when the field is absent from structured JSON).

[^12]: intel_request_id from (Intel Request {uuid}) — 07-extract-ids.vrl:198-203.

[^13]: correlation_id from (CID {uuid}) — 07-extract-ids.vrl:205-211.

[^14]: batch_id from (Batch ID {uuid}) — 07-extract-ids.vrl:213-219.

[^15]: ASU power_on discard logging — processor.py:248-252. Logs intel_request_id in the message text as (Intel Request %s).

[^16]: event-recorder Loki structured metadata — app.py:372-387. Only event_type, correlation_id, organization_id, organization_short_name.

[^17]: Diagnostic logging MR — gisual-amqp-consumer !17 (open, pending review).

---

RCA: Missing intel_requests.created Messages (March 10, 2026) — original analysis, superseded by the corrected analysis above

Summary

2,798 / 300,839 (0.93%) intel requests were created in the database but never processed by downstream consumers (ASU, AIPT). The loss is uniform across all organizations and message types, consistent with a systemic issue in the AMQP consumer library rather than a content-specific bug.

Architecture (the message path)

```
Incident Source
  → message-amplifier (consumes incidents.received from RPC exchange)
    → HTTP POST to intel-requests-api /bulk/create
      → intel-requests-api creates IRs in DB (committed)
      → intel-requests-api publishes intel_requests.created amplify to RPC exchange
    → message-amplifier (consumes intel_requests.created from RPC exchange)
      → publishes individual events to EVENT exchange
        → event-recorder, ASU, AIPT, asset-monitor, infrastructure-status-recorder consume
          (each has its own queue bound to the event exchange)
```

Evidence

Message counts across the pipeline (March 10)

| Stage | Count | Gap from total |
| --- | --- | --- |
| Total IRs created (DB) | 300,839 | — |
| message-amplifier intel_requests.created amplifications | 39,915 batches | — (all successful, 0 partial drops) |
| event-recorder processed intel_requests.created | 292,693 | 8,146 (2.7%) |
| ASU processed intel_requests.created | 282,131 | 18,708 (6.2%) |
| Missing IRs (no facts in bronze.facts) | 2,798 | 0.93% |

Key observations

  • The events ARE in the event store (confirmed by Cody) — event-recorder captured 292,693 of them
  • event-recorder and ASU have DIFFERENT gap sizes (8,146 vs 18,708) — each consumer independently loses messages from its own queue
  • Zero AMQP publish failures logged from intel-requests-api or message-amplifier
  • Zero partial drops in message-amplifier — all 39,915 intel_requests.created amplification batches completed fully
  • The 2,798 missing facts are a subset of the overall consumer-side loss

Publisher-side metrics

| Metric | Count | Source |
| --- | --- | --- |
| Successful intel_requests.created amplifications | 39,915 batches | VictoriaLogs |
| "Amplified 0/" for intel_requests.created | 0 | VictoriaLogs |
| AMQP publish failures (IR API) | 0 | VictoriaLogs |
| AMQP publish failures (message-amplifier) | 0 | VictoriaLogs |

Consumer-side metrics

| Metric | Count | Source |
| --- | --- | --- |
| repeated_location errors in message-amplifier | 2,729 | VictoriaLogs |
| Transient error exchange routing (intel_requests.created, 23:00–00:00 hour) | 602 | VictoriaLogs |
| Lock contention rate | ~30-35% of amplify messages | VictoriaLogs |
| ASU errors for intel_requests | 0 | VictoriaLogs |

Root Cause: Consumer-Side Message Loss in gisual-amqp-consumer

Why we know it's consumer-side

  1. The events exist in the event store — event-recorder consumed them, proving they were published to the event exchange
  2. Each consumer loses a different number of messages — event-recorder missed 8,146, ASU missed 18,708. If the publisher were the problem, all consumers would miss the same messages
  3. Zero publish failures in the entire pipeline

Primary mechanism: CancelledError during _process_messages shutdown

The gisual-amqp-consumer library has a race condition during pod shutdown:

  1. Message is delivered to the consumer via _on_message → appended to self.buffer (a plain Python list, no logging)
  2. _process_messages pops message from buffer, processes it, appends to processed list
  3. Pod receives SIGTERM → shutdown_coordinator.initiate_shutdown() → state = SHUTTING_DOWN
  4. Grace period elapses → _cancel_all_operations() cancels the task running _process_messages
  5. CancelledError (BaseException, NOT caught by except Exception) propagates into the finally block
  6. The finally block has multiple await points (postprocess, ack, nack) — if CancelledError interrupts these, messages are neither acked nor nacked
  7. RabbitMQ redelivers after TCP connection drops, but there's a timing window where the connection closes before redelivery can happen

Key code path (gisual_amqp_consumer/app.py:599-761):

```python
async def _process_messages(self, messages) -> None:   # abridged excerpt
    processed = []
    try:
        while messages:
            message = messages.pop(0)  # removed from buffer
            try:
                async with self.shutdown_coordinator.track_operation(operation_id):
                    await self._process_with_context(decoded)
                    processed.append(message)
            except Exception as error:  # does NOT catch CancelledError (a BaseException)
                await self._handle_error(message, error)
    finally:
        # Multiple await points - can be interrupted by CancelledError
        completed = await self.postprocess(processed)  # ← await
        for message in completed:
            await self.amqp.ack(message)  # ← await
        for message in failed:  # 'failed' is maintained elsewhere in the full method
            await self.amqp.nack(message, requeue=True)  # ← await
```

Secondary mechanism: Buffer loss during shutdown

Messages sitting in self.buffer (received via _on_message but not yet popped by _process_messages) should be requeued by _shutdown_amqp_consumption_requeue_buffer. However, if _cleanup_all_tasks cancels the task before _requeue_buffer completes, messages in the buffer are lost.

Why the loss rate is ~1% and uniform

  • The loss is timing-dependent — it only occurs during the narrow window of pod shutdown
  • Kubernetes rolling deployments cause periodic pod restarts across all consumer services
  • The probability of a message being in-flight during shutdown is relatively constant (~1%)
  • The loss is independent of message content, explaining the uniform distribution across organizations and types

Contributing Factor: Redis Lock Contention in message-amplifier

While NOT the root cause, Redis lock contention adds latency and retry overhead:

  • incidents.received and intel_requests.created amplify messages share the same batch_id
  • When processed by different message-amplifier pods simultaneously, they contend for the same Redis lock
  • ~30-35% of intel_requests.created amplify messages hit lock contention → routed to rpc-error exchange → retried after 5s
  • All retries succeed (zero partial drops), but the added latency increases the window for consumer-side loss

Key Code References

| File | Lines | What |
| --- | --- | --- |
| gisual-amqp-consumer/app.py | 467-468 | _on_message — buffer append with no logging |
| gisual-amqp-consumer/app.py | 538-540 | _execute_message_batch — not protected from cancellation |
| gisual-amqp-consumer/app.py | 599-761 | _process_messages — CancelledError can interrupt finally-block acks |
| gisual-amqp-consumer/app.py | 780-893 | Error handling — LockError treated as transient |
| gisual-service-utils/shutdown.py | 424-435 | _cancel_all_operations — cancels non-protected tasks |
| gisual-service-utils/shutdown.py | 483-522 | _cleanup_all_tasks — cancels ALL non-protected asyncio tasks |
| intel-requests-api/intel_requests.py | 1079-1111 | intel_requests.created amplify publish |
| message-amplifier/app.py | 159-195 | amplify_generic_message — idempotent publish with Redis lock |

Fixes (branches prepared)

MR 1: gisual-amqp-consumer — fix/on-message-logging-and-shutdown-safety

Change 1: Env-gated _on_message logging (diagnostic)

```python
async def _on_message(self, message) -> None:
    if self._log_on_message:
        self.logger.info(
            'Received message routing_key=%s correlation_id=%s',
            message.delivery.routing_key,
            message.header.properties.correlation_id,
        )
    self.buffer.append(message)
```

  • Gated on LOG_ON_MESSAGE env var (default off)
  • Confirms whether messages reach the consumer at all

Change 2: Protect _execute_message_batch from cancellation (fix)

```python
async def _execute_message_batch(self, messages):
    current_task = asyncio.current_task()
    if current_task:
        self.shutdown_coordinator.register_protected_task(current_task)
    try:
        async with self.amqp.process_lock:
            await self._process_messages(messages)
    finally:
        if current_task:
            self.shutdown_coordinator.unregister_protected_task(current_task)
```

  • Prevents _cleanup_all_tasks from cancelling the batch mid-flight
  • Ensures acks/nacks in the finally block complete before the task is cancelled
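register_protected_task is internal to gisual-service-utils; a standard-library analogue of the same protection is asyncio.shield, which keeps a later cancellation from reaching the ack path. This is an illustrative sketch under that analogy, not the actual MR code:

```python
import asyncio

acked = []

async def ack_messages():
    # stands in for the finally-block ack path
    await asyncio.sleep(0.05)
    acked.append("msg-1")

async def process_messages():
    try:
        await asyncio.sleep(10)  # in-flight processing
    finally:
        # shield the ack so an outer cancel() cannot interrupt it
        await asyncio.shield(ack_messages())

async def main():
    task = asyncio.create_task(process_messages())
    await asyncio.sleep(0.01)
    task.cancel()              # shutdown: interrupts processing, enters finally
    await asyncio.sleep(0.01)
    task.cancel()              # cleanup sweep: hits the shield, not the ack
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0.1)   # give the shielded ack time to finish

asyncio.run(main())
print("acked:", acked)         # acked: ['msg-1'] (ack survived both cancellations)
```

Unlike the unprotected version, the second cancellation lands on the shield wrapper while the inner ack task runs to completion.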

MR 2: intel-requests-api — fix/intel-requests-created-amplify-reliability

Change 1: Use separate batch_id (uuid4()) instead of self.correlation_id — eliminates Redis lock contention with incidents.received in message-amplifier

Change 2: Add before-publish logging with record type, count, and batch_id

Change 3: Raise Exception on publish failure instead of just logging — ensures caller retries

Further Investigation

Recommended next steps

  1. Deploy _on_message logging to one ASU pod in staging with LOG_ON_MESSAGE=true — compare received count vs processed count to quantify buffer-level loss
  2. Check pod restart frequency during March 10th for ASU and event-recorder — correlate restart windows with missing message timestamps
  3. Cross-reference specific missing IR IDs against event-recorder logs — if event-recorder has the Processing log for an IR but ASU doesn't, confirms independent consumer-side loss

Long-term improvements

  1. Transactional Outbox Pattern in intel-requests-api — write amplify message to DB outbox table in same transaction as IR creation, poll-and-publish separately
  2. Consumer-side deduplication — track processed message IDs to detect and recover from missed messages
  3. Dead letter monitoring — alert on messages accumulating in error queues
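The transactional-outbox idea in item 1 can be sketched as follows; the table names, column names, and in-memory SQLite store are illustrative stand-ins, not the intel-requests-api schema:

```python
import json
import sqlite3
import uuid

# Illustrative outbox: the amplify message is written in the SAME transaction
# as the intel request row, so a crash between DB commit and AMQP publish can
# never lose the event; a separate poller publishes pending rows later.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE intel_requests (id TEXT PRIMARY KEY, payload TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY, topic TEXT, body TEXT, published INTEGER DEFAULT 0
    );
""")

def create_ir_with_outbox(payload: dict) -> str:
    ir_id = str(uuid.uuid4())
    with db:  # one transaction covers both inserts
        db.execute("INSERT INTO intel_requests VALUES (?, ?)",
                   (ir_id, json.dumps(payload)))
        db.execute("INSERT INTO outbox (id, topic, body) VALUES (?, ?, ?)",
                   (str(uuid.uuid4()), "intel_requests.created",
                    json.dumps({"intel_request_id": ir_id})))
    return ir_id

def poll_and_publish(publish) -> int:
    """Publish unsent outbox rows; 'publish' would be an AMQP publish with confirms."""
    rows = db.execute("SELECT id, topic, body FROM outbox WHERE published = 0").fetchall()
    for oid, topic, body in rows:
        publish(topic, body)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (oid,))
    db.commit()
    return len(rows)

create_ir_with_outbox({"org": "9de5d801"})
sent = []
print(poll_and_publish(lambda topic, body: sent.append(topic)))  # 1
```

The poller gives at-least-once delivery, so downstream consumers still need the deduplication from item 2.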

Appendix: Traced Missing IR 56a219a6-7597-42c7-85ea-1636cdfa8e53

Batch: c2a627b3-3f7c-4c89-808c-4d79b19c69cb (174 IRs for org 9de5d801)

| Time | Service / Pod | Event |
| --- | --- | --- |
| 13:55:46.528-558 | intel-requests-api sz7lh | Created 174 IRs via BulkCreateHandler, HTTP 201 (2.4s) |
| 13:55:47.041 | message-amplifier tqsnn | "Created new intel request record 56a219a6..." (incidents.received handler) |
| 13:55:47.114 | message-amplifier ch4dq | Picks up intel_requests.created amplify → LockError → rpc-error |
| 13:55:52.327 | message-amplifier ltnmf | Retry from rpc-error → still locked → rpc-error |
| 13:55:57.517 | message-amplifier vpq88 | Another retry → still locked → rpc-error |
| 13:55:58.775 | message-amplifier tqsnn | "Amplified 174/174 incidents.received" (14.56s to process) |
| 13:56:02.700 | message-amplifier 9lhr9 | Retry successfully acquires lock for intel_requests.created |
| 13:56:13.924 | message-amplifier 9lhr9 | "Amplified 174/174 intel_requests.created messages" (11.22s) |
| 13:56:22-46 | multiple pods | Additional retries arrive → deduplicated by Redis idempotent SET |

Conclusion from trace

  1. intel-requests-api: Created the IRs and returned HTTP 201 ✅
  2. message-amplifier: Successfully amplified all 174 intel_requests.created events to the event exchange ✅
  3. event-recorder: No log for IR 56a219a6 — did not receive the event ❌
  4. ASU: Only processed a later power_status.updated for this IR, never intel_requests.created

The event was published to the event exchange but never consumed by event-recorder or ASU. This definitively confirms consumer-side loss in the gisual-amqp-consumer library.

Lock contention impact

The incidents.received amplification took 14.56 seconds (processing 174 locations with HTTP calls to intel-requests-api). During this time:

  • 3 separate intel_requests.created retry attempts failed due to lock contention
  • Each retry added 5 seconds of delay (via rpc-error-delay queue)
  • Total delay before successful amplification: ~26 seconds (13:55:47 → 13:56:13)

This delay means the 174 events hit the consumer queues in a burst ~26 seconds after creation, potentially during consumer pod restarts or when buffers were already full.
