Status: PROVISIONAL — ASU-specific loss confirmed, root cause unknown Date: 2026-03-10 incident, 2026-03-16 re-analysis JIRA: ENG-3349 Previous RCA: Gist d81250648d (superseded by this document)
2,798 of 300,839 intel requests on March 10 never had facts created (~0.93%). Re-analysis of 6 sample batches (404 IRs) using correct identifiers shows:
- event-recorder: 404/404 — 0% loss1
- AIPT: 404/404 — 0% loss2
- ASU: 388/404 — 3.96% loss, consistent across every batch3
The loss is ASU-specific. The previous RCA incorrectly attributed loss to all consumers due to a search methodology error4.
| Claim | Correction |
|---|---|
| "event-recorder missed 8,146" | 0% loss — received every message in all sampled batches1 |
| "AIPT received but ASU didn't" | AIPT also received 100%2 |
| "Independent consumer-side shutdown race" | Only ASU loses messages — shared library bug would affect all equally |
| MR 47: separate batch_id fixes lock contention | Broke traceability, didn't address ASU loss — reverted5 |
The error: consumer logs index messages by correlation_id and batch_id, not intel_request_id. The original searched by intel_request_id, found nothing, and concluded messages were lost. In reality, the messages were there under different identifiers6.
Cody provided intel_request_id values. Consumer logs don't contain this field on processing lines7. The mapping is:
intel_request_id → DB metadata column[^db-query]
└→ correlation_id (unique per location, logged as CID)
└→ batch_id (shared per API request, = original correlation_id)[^batch-id-code]
10 sample IRs mapped to 6 distinct batches, all org 9de5d8018.
Each processed message produces exactly 2 log lines (Processing... + Processed...) from the shared amqp-consumer library9. Counting message_type:exact("intel_requests.created") lines per batch_id per service, divided by 2, gives IRs processed10.
| batch_id | event-recorder | AIPT | ASU | ASU missing |
|---|---|---|---|---|
c2a627b3 |
174 | 174 | 168 | 6 (3.4%) |
215181db |
22 | 22 | 21 | 1 (4.5%) |
812f79b7 |
68 | 68 | 66 | 2 (2.9%) |
d334c245 |
64 | 64 | 62 | 2 (3.1%) |
2186cea6 |
22 | 22 | 20 | 2 (9.1%) |
7438ef94 |
54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Vector's VRL pipeline11 extracts IDs from unstructured log text into structured VictoriaLogs fields:
| Log pattern | Extracted field | VRL source |
|---|---|---|
(Intel Request {uuid}) |
intel_request_id |
12 |
(CID {uuid}) |
correlation_id |
13 |
(Batch ID {uuid}) |
batch_id |
14 |
ASU logs (Intel Request {id}): Received power_status.updated...15 on discard lines — Vector extracts this into intel_request_id. But the generic Processing/Processed message type: intel_requests.created lines9 don't contain that pattern, so intel_request_id is absent on the lines that prove message delivery.
This creates a trap: searching intel_request_id:exact("56a219a6...") finds ASU's power_status.updated discard (hours later, different batch) but misses the intel_requests.created processing entirely.
Event-recorder never logs individual IR IDs at all16 — it only stores event_type, correlation_id, and organization_id as structured metadata.
The root cause of ASU's ~4% loss is unidentified. Theories (all unverified):
- Silent discard in ASU handler — an early return path that doesn't log15
- Backpressure — ASU processing time (0.12-0.15s/msg) is 3x slower than event-recorder (0.05s/msg)
- Queue configuration — different prefetch/QoS or replica count
- Deploy
_on_messagelogging (MR 1717) on ASU — determines whether messages reach ASU's callback at all - Audit ASU handler for silent drops — check all return paths in
intel_requests.createdprocessing - Compare ASU queue config to event-recorder and AIPT (prefetch, replicas, resource limits)
service_name:exact("asset-status-updater") AND "Received message" AND "intel_requests.created"
Compare count to event-recorder's count for the same time window. If equal → loss is post-delivery (handler bug). If lower → loss is pre-delivery (RabbitMQ/queue issue).
- Reverted MR 475 — restored
correlation_idasbatch_id(traceability fix) - MR 1717 still open — diagnostic logging, pending Diego's review of shutdown protection
- ASU batch c2a627b3 — expect 336 lines
- event-recorder batch c2a627b3 — expect 348 lines
Footnotes
-
event-recorder raw:
c2a627b3=348,215181db=44,812f79b7=136,d334c245=128,2186cea6=44,7438ef94=108. Total=808 lines ÷2 = 404 IRs. ↩ ↩2 -
AIPT raw:
c2a627b3=348,215181db=44,812f79b7=136,d334c245=128,2186cea6=44,7438ef94=108. Total=808 lines ÷2 = 404 IRs. ↩ ↩2 -
ASU raw:
c2a627b3=336,215181db=42,812f79b7=132,d334c245=124,2186cea6=40,7438ef94=102. Total=776 lines ÷2 = 388 IRs. ↩ -
Original RCA gist did not preserve exact queries used. The gist states findings like "No log for IR 56a219a6" without query syntax. Based on field analysis, the search likely used
intel_request_idor free-text IR UUID, which doesn't matchintel_requests.createdprocessing lines. ↩ -
Revert MR — intel-requests-api !48 (merged 2026-03-16). Restores
self.correlation_idasbatch_id. ↩ ↩2 -
intel_request_idis stored in the DB row.correlation_idandbatch_idare inmetadataJSONB. Consumer logs only contain the latter two on processing lines. ↩ -
victorialogs_field_namesformessage_type:exact("intel_requests.created")per service, 2026-03-10T13:00-14:00Z. ASU:intel_request_idpresent (27,478 hits on co-stored lines, zero hits when filtered tointel_requests.created+ specific IR ID). event-recorder: field absent entirely. ↩ -
SELECT intel_request_id::text, (metadata->>'correlation_id')::text, (metadata->>'batch_id')::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN (...)onproduction-app-db-2-restore. 10 IRs → 6 batches, all org9de5d801. ↩ -
amqp-consumer generic logging —
gisual-amqp-consumer/app.py:657-664(Processing) andL717-725(Processed). Logscorrelation_id,request_id,batch_id— neverintel_request_id. Logger context fields set atL632-642. ↩ ↩2 -
victorialogs_stats_querywithservice_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-...") OR ...) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines— run per service attime=2026-03-10T20:00:00Z. ↩ -
Vector VRL ID extraction pipeline —
vector/vrl/07-extract-ids.vrl. All patterns useif !exists(...)guards (only extract from text when field absent from structured JSON). ↩ -
intel_request_idfrom(Intel Request {uuid})—07-extract-ids.vrl:198-203. ↩ -
correlation_idfrom(CID {uuid})—07-extract-ids.vrl:205-211. ↩ -
batch_idfrom(Batch ID {uuid})—07-extract-ids.vrl:213-219. ↩ -
ASU
power_ondiscard logging —processor.py:248-252. Logsintel_request_idin_msgtext as(Intel Request %s). ↩ ↩2 -
event-recorder Loki structured metadata —
app.py:372-387. Onlyevent_type,correlation_id,organization_id,organization_short_name. ↩ -
Diagnostic logging MR — gisual-amqp-consumer !17 (open, pending review). ↩ ↩2