RCA: Missing intel_requests.created messages (March 10, 2026)

RCA: Missing intel_requests.created — Corrected Analysis (ENG-3349)

Status: PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Date: 2026-03-10 incident; 2026-03-16 re-analysis
JIRA: ENG-3349
Previous RCA: Gist d81250648d (superseded by this document)


Summary

2,798 of 300,839 intel requests on March 10 never had facts created (~0.93%). Re-analysis of 6 sample batches (404 IRs) using correct identifiers shows:

  • event-recorder: 404/404 — 0% loss[^1]
  • AIPT: 404/404 — 0% loss[^2]
  • ASU: 388/404 — 3.96% loss, consistent across every batch[^3]

The loss is ASU-specific. The previous RCA incorrectly attributed loss to all consumers due to a search methodology error[^4].

What the Previous RCA Got Wrong

| Claim | Correction |
| --- | --- |
| "event-recorder missed 8,146" | 0% loss — received every message in all sampled batches[^1] |
| "AIPT received but ASU didn't" | AIPT also received 100%[^2] |
| "Independent consumer-side shutdown race" | Only ASU loses messages — a shared-library bug would affect all consumers equally |
| "MR 47: separate batch_id fixes lock contention" | Broke traceability, didn't address ASU loss — reverted[^5] |

The error: consumer logs index messages by correlation_id and batch_id, not intel_request_id. The original RCA searched by intel_request_id, found nothing, and concluded the messages were lost. In reality, the messages were there under different identifiers[^6].

Tracing Methodology

ID Mapping

Cody provided intel_request_id values. Consumer logs don't contain this field on processing lines[^7]. The mapping is:

intel_request_id  →  DB metadata column[^db-query]
  └→ correlation_id  (unique per location, logged as CID)
  └→ batch_id        (shared per API request, = original correlation_id)[^batch-id-code]

10 sample IRs mapped to 6 distinct batches, all org 9de5d801[^8].

Counting Method

Each processed message produces exactly 2 log lines (Processing... + Processed...) from the shared amqp-consumer library[^9]. Counting message_type:exact("intel_requests.created") lines per batch_id per service, divided by 2, gives IRs processed[^10].
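Applying the lines-÷-2 rule to the raw line counts reported in footnotes 1–3 reproduces the per-service totals (a trivial sketch, using only numbers stated in this document):

```python
# Each processed message emits exactly two log lines (Processing + Processed),
# so IRs processed = matching log lines // 2. Raw counts from footnotes 1-3.
raw_lines = {"event-recorder": 808, "AIPT": 808, "ASU": 776}
processed = {svc: n // 2 for svc, n in raw_lines.items()}
print(processed)  # {'event-recorder': 404, 'AIPT': 404, 'ASU': 388}
```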

Results

| batch_id | event-recorder | AIPT | ASU | ASU missing |
| --- | --- | --- | --- | --- |
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
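As a cross-check, the per-batch counts can be re-totaled in a few lines (all numbers copied from the results table):

```python
# Per-batch (expected, ASU-processed) counts copied from the results table
batches = {
    "c2a627b3": (174, 168),
    "215181db": (22, 21),
    "812f79b7": (68, 66),
    "d334c245": (64, 62),
    "2186cea6": (22, 20),
    "7438ef94": (54, 51),
}
expected = sum(e for e, _ in batches.values())  # 404
asu = sum(a for _, a in batches.values())       # 388
missing = expected - asu                        # 16
print(f"{missing}/{expected} = {missing / expected:.2%}")  # 16/404 = 3.96%
```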

Why intel_request_id Appears as a Structured Field (But Misleads)

Vector's VRL pipeline[^11] extracts IDs from unstructured log text into structured VictoriaLogs fields:

| Log pattern | Extracted field | VRL source |
| --- | --- | --- |
| (Intel Request {uuid}) | intel_request_id | [^12] |
| (CID {uuid}) | correlation_id | [^13] |
| (Batch ID {uuid}) | batch_id | [^14] |

ASU logs (Intel Request {id}): Received power_status.updated...[^15] on discard lines — Vector extracts this into intel_request_id. But the generic Processing/Processed message type: intel_requests.created lines[^9] don't contain that pattern, so intel_request_id is absent from the very lines that prove message delivery.

This creates a trap: searching intel_request_id:exact("56a219a6...") finds ASU's power_status.updated discard (hours later, different batch) but misses the intel_requests.created processing entirely.
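To make the trap concrete, compare the two query shapes (LogsQL syntax as used elsewhere in this document; the full UUIDs are the traced sample IR 56a219a6-7597-42c7-85ea-1636cdfa8e53 and its batch c2a627b3-3f7c-4c89-808c-4d79b19c69cb). The misleading search, which matches only the later power_status.updated discard line:

```
intel_request_id:exact("56a219a6-7597-42c7-85ea-1636cdfa8e53")
```

The correct search, which matches the Processing/Processed lines that prove delivery:

```
message_type:exact("intel_requests.created") AND batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
```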

Event-recorder never logs individual IR IDs at all[^16] — it only stores event_type, correlation_id, and organization_id as structured metadata.

What We Don't Know

The root cause of ASU's ~4% loss is unidentified. Theories (all unverified):

  1. Silent discard in ASU handler — an early return path that doesn't log[^15]
  2. Backpressure — ASU processing time (0.12-0.15s/msg) is 3x slower than event-recorder (0.05s/msg)
  3. Queue configuration — different prefetch/QoS or replica count

Next Steps

  1. Deploy _on_message logging (MR 17[^17]) on ASU — determines whether messages reach ASU's callback at all
  2. Audit ASU handler for silent drops — check all return paths in intel_requests.created processing
  3. Compare ASU queue config to event-recorder and AIPT (prefetch, replicas, resource limits)

Verification Query (post MR 17 deployment)

```
service_name:exact("asset-status-updater") AND "Received message" AND "intel_requests.created"
```

Compare count to event-recorder's count for the same time window. If equal → loss is post-delivery (handler bug). If lower → loss is pre-delivery (RabbitMQ/queue issue).
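The decision rule above can be written down directly (function and message strings are illustrative, not project code):

```python
def classify_loss(asu_received: int, event_recorder_received: int) -> str:
    """Apply the decision rule to post-MR-17 'Received message' counts."""
    if asu_received == event_recorder_received:
        return "post-delivery (handler bug): messages reach ASU's callback"
    if asu_received < event_recorder_received:
        return "pre-delivery (RabbitMQ/queue issue): messages never reach ASU"
    return "unexpected: ASU received more than event-recorder; recheck the query"

# Using the sampled totals (404 expected, 388 seen by ASU) as example input:
print(classify_loss(388, 404))  # pre-delivery branch
```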

Actions Taken

  • Reverted MR 47[^5] — restored correlation_id as batch_id (traceability fix)
  • MR 17[^17] still open — diagnostic logging, pending Diego's review of shutdown protection

Grounding

Database

VictoriaLogs Queries

Grafana Verification Links

Source Code

Vector VRL (Log Field Extraction)

MRs

Footnotes

[^1]: event-recorder raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total = 808 lines ÷ 2 = 404 IRs.

[^2]: AIPT raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total = 808 lines ÷ 2 = 404 IRs.

[^3]: ASU raw: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102. Total = 776 lines ÷ 2 = 388 IRs.

[^4]: The original RCA gist did not preserve the exact queries used; it states findings like "No log for IR 56a219a6" without query syntax. Based on field analysis, the search likely used intel_request_id or a free-text IR UUID, neither of which matches intel_requests.created processing lines.

[^5]: Revert MR — intel-requests-api !48 (merged 2026-03-16). Restores self.correlation_id as batch_id.

[^6]: intel_request_id is stored in the DB row; correlation_id and batch_id are in the metadata JSONB. Consumer logs contain only the latter two on processing lines.

[^7]: victorialogs_field_names for message_type:exact("intel_requests.created") per service, 2026-03-10T13:00–14:00Z. ASU: intel_request_id present (27,478 hits on co-stored lines, zero hits when filtered to intel_requests.created + a specific IR ID). event-recorder: field absent entirely.

[^8]: SELECT intel_request_id::text, (metadata->>'correlation_id')::text, (metadata->>'batch_id')::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN (...) on production-app-db-2-restore. 10 IRs → 6 batches, all org 9de5d801.

[^9]: amqp-consumer generic logging — gisual-amqp-consumer/app.py:657-664 (Processing) and 717-725 (Processed). Logs correlation_id, request_id, batch_id — never intel_request_id. Logger context fields are set at lines 632-642.

[^10]: victorialogs_stats_query with service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-...") OR ...) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines — run per service at time=2026-03-10T20:00:00Z.

[^11]: Vector VRL ID extraction pipeline — vector/vrl/07-extract-ids.vrl. All patterns use if !exists(...) guards (only extract from text when the field is absent from structured JSON).

[^12]: intel_request_id from (Intel Request {uuid}) — 07-extract-ids.vrl:198-203.

[^13]: correlation_id from (CID {uuid}) — 07-extract-ids.vrl:205-211.

[^14]: batch_id from (Batch ID {uuid}) — 07-extract-ids.vrl:213-219.

[^15]: ASU power_on discard logging — processor.py:248-252. Logs intel_request_id in the message text as (Intel Request %s).

[^16]: event-recorder Loki structured metadata — app.py:372-387. Only event_type, correlation_id, organization_id, organization_short_name.

[^17]: Diagnostic logging MR — gisual-amqp-consumer !17 (open, pending review).

---

RCA: Missing intel_requests.created Messages (March 10, 2026) — original analysis, superseded by the corrected analysis above

Summary

2,798 / 300,839 (0.93%) intel requests were created in the database but never processed by downstream consumers (ASU, AIPT). The loss is uniform across all organizations and message types, consistent with a systemic issue in the AMQP consumer library rather than a content-specific bug.

Architecture (the message path)

```
Incident Source
  → message-amplifier (consumes incidents.received from RPC exchange)
    → HTTP POST to intel-requests-api /bulk/create
      → intel-requests-api creates IRs in DB (committed)
      → intel-requests-api publishes intel_requests.created amplify to RPC exchange
    → message-amplifier (consumes intel_requests.created from RPC exchange)
      → publishes individual events to EVENT exchange
        → event-recorder, ASU, AIPT, asset-monitor, infrastructure-status-recorder consume
          (each has its own queue bound to the event exchange)
```

Evidence

Message counts across the pipeline (March 10)

| Stage | Count | Gap from total |
| --- | --- | --- |
| Total IRs created (DB) | 300,839 | — |
| message-amplifier intel_requests.created amplifications | 39,915 batches | — (all successful, 0 partial drops) |
| event-recorder processed intel_requests.created | 292,693 | 8,146 (2.7%) |
| ASU processed intel_requests.created | 282,131 | 18,708 (6.2%) |
| Missing IRs (no facts in bronze.facts) | 2,798 | 0.93% |

Key observations

  • The events ARE in the event store (confirmed by Cody) — event-recorder captured 292,693 of them
  • event-recorder and ASU have DIFFERENT gap sizes (8,146 vs 18,708) — each consumer independently loses messages from its own queue
  • Zero AMQP publish failures logged from intel-requests-api or message-amplifier
  • Zero partial drops in message-amplifier — all 39,915 intel_requests.created amplification batches completed fully
  • The 2,798 missing facts are a subset of the overall consumer-side loss

Publisher-side metrics

| Metric | Count | Source |
| --- | --- | --- |
| Successful intel_requests.created amplifications | 39,915 batches | VictoriaLogs |
| "Amplified 0/" for intel_requests.created | 0 | VictoriaLogs |
| AMQP publish failures (IR API) | 0 | VictoriaLogs |
| AMQP publish failures (message-amplifier) | 0 | VictoriaLogs |

Consumer-side metrics

| Metric | Count | Source |
| --- | --- | --- |
| repeated_location errors in message-amplifier | 2,729 | VictoriaLogs |
| Transient error exchange routing (intel_requests.created, 23:00–00:00 hour) | 602 | VictoriaLogs |
| Lock contention rate | ~30-35% of amplify messages | VictoriaLogs |
| ASU errors for intel_requests | 0 | VictoriaLogs |

Root Cause: Consumer-Side Message Loss in gisual-amqp-consumer

Why we know it's consumer-side

  1. The events exist in the event store — event-recorder consumed them, proving they were published to the event exchange
  2. Each consumer loses a different number of messages — event-recorder missed 8,146, ASU missed 18,708. If the publisher were the problem, all consumers would miss the same messages
  3. Zero publish failures in the entire pipeline

Primary mechanism: CancelledError during _process_messages shutdown

The gisual-amqp-consumer library has a race condition during pod shutdown:

  1. Message is delivered to the consumer via _on_message → appended to self.buffer (a plain Python list, no logging)
  2. _process_messages pops message from buffer, processes it, appends to processed list
  3. Pod receives SIGTERM → shutdown_coordinator.initiate_shutdown() → state = SHUTTING_DOWN
  4. Grace period elapses → _cancel_all_operations() cancels the task running _process_messages
  5. CancelledError (BaseException, NOT caught by except Exception) propagates into the finally block
  6. The finally block has multiple await points (postprocess, ack, nack) — if CancelledError interrupts these, messages are neither acked nor nacked
  7. RabbitMQ redelivers after TCP connection drops, but there's a timing window where the connection closes before redelivery can happen

Key code path (gisual_amqp_consumer/app.py:599-761):

```python
async def _process_messages(self, messages) -> None:   # abridged excerpt
    processed = []
    try:
        while messages:
            message = messages.pop(0)  # removed from buffer
            try:
                async with self.shutdown_coordinator.track_operation(operation_id):
                    await self._process_with_context(decoded)
                    processed.append(message)
            except Exception as error:  # does NOT catch CancelledError (a BaseException)
                await self._handle_error(message, error)
    finally:
        # Multiple await points - can be interrupted by CancelledError
        completed = await self.postprocess(processed)  # ← await
        for message in completed:
            await self.amqp.ack(message)  # ← await
        for message in failed:  # 'failed' is maintained elsewhere in the full method
            await self.amqp.nack(message, requeue=True)  # ← await
```

Secondary mechanism: Buffer loss during shutdown

Messages sitting in self.buffer (received via _on_message but not yet popped by _process_messages) should be requeued by _shutdown_amqp_consumption_requeue_buffer. However, if _cleanup_all_tasks cancels the task before _requeue_buffer completes, messages in the buffer are lost.

Why the loss rate is ~1% and uniform

  • The loss is timing-dependent — it only occurs during the narrow window of pod shutdown
  • Kubernetes rolling deployments cause periodic pod restarts across all consumer services
  • The probability of a message being in-flight during shutdown is relatively constant (~1%)
  • The loss is independent of message content, explaining the uniform distribution across organizations and types

Contributing Factor: Redis Lock Contention in message-amplifier

While NOT the root cause, Redis lock contention adds latency and retry overhead:

  • incidents.received and intel_requests.created amplify messages share the same batch_id
  • When processed by different message-amplifier pods simultaneously, they contend for the same Redis lock
  • ~30-35% of intel_requests.created amplify messages hit lock contention → routed to rpc-error exchange → retried after 5s
  • All retries succeed (zero partial drops), but the added latency increases the window for consumer-side loss

Key Code References

| File | Lines | What |
| --- | --- | --- |
| gisual-amqp-consumer/app.py | 467-468 | _on_message — buffer append with no logging |
| gisual-amqp-consumer/app.py | 538-540 | _execute_message_batch — not protected from cancellation |
| gisual-amqp-consumer/app.py | 599-761 | _process_messages — CancelledError can interrupt finally-block acks |
| gisual-amqp-consumer/app.py | 780-893 | Error handling — LockError treated as transient |
| gisual-service-utils/shutdown.py | 424-435 | _cancel_all_operations — cancels non-protected tasks |
| gisual-service-utils/shutdown.py | 483-522 | _cleanup_all_tasks — cancels ALL non-protected asyncio tasks |
| intel-requests-api/intel_requests.py | 1079-1111 | intel_requests.created amplify publish |
| message-amplifier/app.py | 159-195 | amplify_generic_message — idempotent publish with Redis lock |

Fixes (branches prepared)

MR 1: gisual-amqp-consumer — fix/on-message-logging-and-shutdown-safety

Change 1: Env-gated _on_message logging (diagnostic)

```python
async def _on_message(self, message) -> None:
    if self._log_on_message:
        self.logger.info(
            'Received message routing_key=%s correlation_id=%s',
            message.delivery.routing_key,
            message.header.properties.correlation_id,
        )
    self.buffer.append(message)
```

  • Gated on LOG_ON_MESSAGE env var (default off)
  • Confirms whether messages reach the consumer at all

Change 2: Protect _execute_message_batch from cancellation (fix)

```python
async def _execute_message_batch(self, messages):
    current_task = asyncio.current_task()
    if current_task:
        self.shutdown_coordinator.register_protected_task(current_task)
    try:
        async with self.amqp.process_lock:
            await self._process_messages(messages)
    finally:
        if current_task:
            self.shutdown_coordinator.unregister_protected_task(current_task)
```

  • Prevents _cleanup_all_tasks from cancelling the batch mid-flight
  • Ensures acks/nacks in the finally block complete before the task is cancelled
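register_protected_task is internal to gisual-service-utils; a standard-library analogue of the same protection is asyncio.shield, which keeps a later cancellation from reaching the ack path. This is an illustrative sketch under that analogy, not the actual MR code:

```python
import asyncio

acked = []

async def ack_messages():
    # stands in for the finally-block ack path
    await asyncio.sleep(0.05)
    acked.append("msg-1")

async def process_messages():
    try:
        await asyncio.sleep(10)  # in-flight processing
    finally:
        # shield the ack so an outer cancel() cannot interrupt it
        await asyncio.shield(ack_messages())

async def main():
    task = asyncio.create_task(process_messages())
    await asyncio.sleep(0.01)
    task.cancel()              # shutdown: interrupts processing, enters finally
    await asyncio.sleep(0.01)
    task.cancel()              # cleanup sweep: hits the shield, not the ack
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0.1)   # give the shielded ack time to finish

asyncio.run(main())
print("acked:", acked)         # acked: ['msg-1'] (ack survived both cancellations)
```

Unlike the unprotected version, the second cancellation lands on the shield wrapper while the inner ack task runs to completion.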

MR 2: intel-requests-api — fix/intel-requests-created-amplify-reliability

Change 1: Use separate batch_id (uuid4()) instead of self.correlation_id — eliminates Redis lock contention with incidents.received in message-amplifier

Change 2: Add before-publish logging with record type, count, and batch_id

Change 3: Raise Exception on publish failure instead of just logging — ensures caller retries

Further Investigation

Recommended next steps

  1. Deploy _on_message logging to one ASU pod in staging with LOG_ON_MESSAGE=true — compare received count vs processed count to quantify buffer-level loss
  2. Check pod restart frequency during March 10th for ASU and event-recorder — correlate restart windows with missing message timestamps
  3. Cross-reference specific missing IR IDs against event-recorder logs — if event-recorder has the Processing log for an IR but ASU doesn't, confirms independent consumer-side loss

Long-term improvements

  1. Transactional Outbox Pattern in intel-requests-api — write amplify message to DB outbox table in same transaction as IR creation, poll-and-publish separately
  2. Consumer-side deduplication — track processed message IDs to detect and recover from missed messages
  3. Dead letter monitoring — alert on messages accumulating in error queues
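The transactional-outbox idea in item 1 can be sketched as follows; the table names, column names, and in-memory SQLite store are illustrative stand-ins, not the intel-requests-api schema:

```python
import json
import sqlite3
import uuid

# Illustrative outbox: the amplify message is written in the SAME transaction
# as the intel request row, so a crash between DB commit and AMQP publish can
# never lose the event; a separate poller publishes pending rows later.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE intel_requests (id TEXT PRIMARY KEY, payload TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY, topic TEXT, body TEXT, published INTEGER DEFAULT 0
    );
""")

def create_ir_with_outbox(payload: dict) -> str:
    ir_id = str(uuid.uuid4())
    with db:  # one transaction covers both inserts
        db.execute("INSERT INTO intel_requests VALUES (?, ?)",
                   (ir_id, json.dumps(payload)))
        db.execute("INSERT INTO outbox (id, topic, body) VALUES (?, ?, ?)",
                   (str(uuid.uuid4()), "intel_requests.created",
                    json.dumps({"intel_request_id": ir_id})))
    return ir_id

def poll_and_publish(publish) -> int:
    """Publish unsent outbox rows; 'publish' would be an AMQP publish with confirms."""
    rows = db.execute("SELECT id, topic, body FROM outbox WHERE published = 0").fetchall()
    for oid, topic, body in rows:
        publish(topic, body)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (oid,))
    db.commit()
    return len(rows)

create_ir_with_outbox({"org": "9de5d801"})
sent = []
print(poll_and_publish(lambda topic, body: sent.append(topic)))  # 1
```

The poller gives at-least-once delivery, so downstream consumers still need the deduplication from item 2.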

Appendix: Traced Missing IR 56a219a6-7597-42c7-85ea-1636cdfa8e53

Batch: c2a627b3-3f7c-4c89-808c-4d79b19c69cb (174 IRs for org 9de5d801)

| Time | Service / Pod | Event |
| --- | --- | --- |
| 13:55:46.528-558 | intel-requests-api sz7lh | Created 174 IRs via BulkCreateHandler, HTTP 201 (2.4s) |
| 13:55:47.041 | message-amplifier tqsnn | "Created new intel request record 56a219a6..." (incidents.received handler) |
| 13:55:47.114 | message-amplifier ch4dq | Picks up intel_requests.created amplify → LockError → rpc-error |
| 13:55:52.327 | message-amplifier ltnmf | Retry from rpc-error → still locked → rpc-error |
| 13:55:57.517 | message-amplifier vpq88 | Another retry → still locked → rpc-error |
| 13:55:58.775 | message-amplifier tqsnn | "Amplified 174/174 incidents.received" (14.56s to process) |
| 13:56:02.700 | message-amplifier 9lhr9 | Retry successfully acquires lock for intel_requests.created |
| 13:56:13.924 | message-amplifier 9lhr9 | "Amplified 174/174 intel_requests.created messages" (11.22s) |
| 13:56:22-46 | multiple pods | Additional retries arrive → deduplicated by Redis idempotent SET |

Conclusion from trace

  1. intel-requests-api: Created the IRs and returned HTTP 201 ✅
  2. message-amplifier: Successfully amplified all 174 intel_requests.created events to the event exchange ✅
  3. event-recorder: No log for IR 56a219a6 — did not receive the event ❌
  4. ASU: Only processed a later power_status.updated for this IR, never intel_requests.created

The event was published to the event exchange but never consumed by event-recorder or ASU. This definitively confirms consumer-side loss in the gisual-amqp-consumer library.

Lock contention impact

The incidents.received amplification took 14.56 seconds (processing 174 locations with HTTP calls to intel-requests-api). During this time:

  • 3 separate intel_requests.created retry attempts failed due to lock contention
  • Each retry added 5 seconds of delay (via rpc-error-delay queue)
  • Total delay before successful amplification: ~26 seconds (13:55:47 → 13:56:13)

This delay means the 174 events hit the consumer queues in a burst ~26 seconds after creation, potentially during consumer pod restarts or when buffers were already full.
