@jmealo
Last active March 16, 2026 23:16
RCA: Missing intel_requests.created (ENG-3349) — Corrected

Status: PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Supersedes: previous RCA
Incident date: 2026-03-10
Analysis date: 2026-03-16
Reverted: intel-requests-api !48 (batch_id change from !47)

TL;DR

2,798/300,839 intel requests on March 10 never had facts created (~0.93%).

The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.


Findings

1. Only ASU drops messages

Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).

| batch_id | event-recorder | AIPT | ASU | ASU lost |
|----------|----------------|------|-----|----------|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |

Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
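The verification arithmetic is mechanical: a delivered message produces exactly two log lines, so expected lines = 2 × messages. A short sketch with the counts copied from the table above:

```python
# Messages per batch per consumer, copied from the table above. Each delivered
# message produces two log lines (Processing + Processed) in amqp-consumer.
ASU = {"c2a627b3": 168, "215181db": 21, "812f79b7": 66,
       "d334c245": 62, "2186cea6": 20, "7438ef94": 51}
EVENT_RECORDER = {"c2a627b3": 174, "215181db": 22, "812f79b7": 68,
                  "d334c245": 64, "2186cea6": 22, "7438ef94": 54}

def expected_lines(counts, batch=None):
    """Expected VictoriaLogs line count for one batch, or all batches."""
    msgs = counts[batch] if batch else sum(counts.values())
    return 2 * msgs

print(expected_lines(ASU, "c2a627b3"))             # 336
print(expected_lines(EVENT_RECORDER, "c2a627b3"))  # 348
print(expected_lines(ASU))                         # 776
print(sum(EVENT_RECORDER.values()) - sum(ASU.values()))  # 16 messages lost
```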

Evidence: Raw VictoriaLogs stats query + output

Query (run per service — replace {service} with service name):

{service_name="{service}"} message_type:"intel_requests.created"
AND batch_id:~"c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines

View raw logs: event-recorder | AIPT | ASU

| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---------|----------|----------|----------|----------|----------|----------|-------|-----|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |

2. The previous RCA searched by the wrong ID

Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:

intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
                 → metadata->>'batch_id'       (per-request, = original correlation_id)

The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.

Evidence: DB query and ID mapping results
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)
| intel_request_id | correlation_id | batch_id | created (UTC) |
|------------------|----------------|----------|---------------|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |

3. Logging is inconsistent across services — and Vector adds a layer

Why searching by intel_request_id is misleading:

| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---------|-----------------------------------|--------|----------------------------------|-------|
| ASU | Yes | Vector extracts from (Intel Request {uuid}) in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results; free-text finds only the discard line |
| AIPT | Yes | Python logs it on the fetch-prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | |

The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.

Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text: (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
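As a rough Python analogue of that VRL behavior (the pattern and guard are paraphrased from the description above, not copied from the pipeline):

```python
import re

# Matches "(CID <uuid>)" in unstructured message text, mirroring the VRL rule.
CID_RE = re.compile(r"\(CID ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\)")

def extract_cid(event: dict) -> dict:
    # Guard (the VRL `if !exists(...)`): only extract when no structured field
    # is present, so Vector never clobbers a correlation_id that the
    # application's JSON logger already set.
    if "correlation_id" not in event:
        m = CID_RE.search(event.get("_msg", ""))
        if m:
            event["correlation_id"] = m.group(1)
    return event
```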

Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).

4. What the previous RCA got right

  • Lock contention between incidents.received and intel_requests.created in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
  • The _on_message diagnostic logging in MR 17 is still the right next diagnostic step
  • Consumer-side loss IS happening — just only in ASU, not everywhere

5. Replay experiment independently confirms ASU-specific loss

A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
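The accounting the tool does reduces to the following (an illustrative sketch; the function names are ours, not the replay tool's API):

```python
from collections import Counter

def classify(status: int) -> str:
    if status == 409:
        return "duplicate"   # ASU already processed it: original delivery worked
    if 200 <= status < 300:
        return "new"         # ASU accepted it on replay: it was missed in production
    return "error"

def summarize(statuses):
    """Tally replay responses and compute the implied production loss rate."""
    tally = Counter(classify(s) for s in statuses)
    replayed = tally["duplicate"] + tally["new"]
    loss_rate = tally["new"] / replayed if replayed else 0.0
    return tally, loss_rate

# e.g. four replayed events, one of which ASU had never seen:
tally, rate = summarize([409, 409, 201, 409])  # rate = 0.25
```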

200 replay jobs ran across 4 dates, covering multiple days of production data:

| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|----------|------|--------------|---------------|-----------|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |

Verify: {service_name=~"replay-from-loki-hour-.*"} "TOTAL" over Jan 15 – Mar 16

The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).

Evidence: Sample replay output (hour 18, Dec 29 data)

Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC

REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
  Event Type                    New   409s   Skip
  asset_status.updated           13    788   2047
  incidents.deleted               0      5      7
  incidents.updated               0    532  22060
  intel_requests.created          1  41112     68
  power_status.updated            0  62941  55574
  search.expired                 90 767135     18
  TOTAL                         104 872513  79774

Verify: {service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL" on Jan 17


6. Two candidate loss mechanisms (not yet verified)

Acute — AMQP connection failures: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (CHANNEL_ERROR, PRECONDITION_FAILED - unknown delivery tag, Failed to ACK message ... 'NoneType'). This single hour accounts for 1,975 of ~2,000 missed intel_requests.created events on that day (VLogs).

Evidence: Per-hour intel_requests.created replay results (Jan 13 data)
| Hour (UTC) | New (missed) | 409s | Loss rate | Note |
|------------|--------------|------|-----------|------|
| 00–21 | 0–4 each | 5,000–16,000 | <0.05% | Steady state |
| 22 | 1,975 | 6,553 | 23.2% | AMQP connection death |
| 23 | 13 | 6,711 | 0.19% | Tail of recovery |

Verify: {service_name=~"replay-from-loki-hour-.*"} " intel_requests.created" on Jan 15

Chronic — asset_status.updated code path: The _process_asset_status_updated_message handler has a name-to-UUID resolution path (processor.py:554-634) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, _get_assets_by_name returns [] → zero API calls → event silently discarded. Other event types (power_status.updated, intel_requests.created) use stable UUIDs from the message body and never hit this cache — explaining why they have 0% loss at steady state.

| Event type | Asset identity | Cache dependency | Steady-state loss |
|------------|----------------|------------------|-------------------|
| power_status.updated | intel_request_id UUID in message | None | 0% |
| intel_requests.created | intel_request_id UUID in message | None | ~0% (only during AMQP failures) |
| asset_status.updated | Name→UUID via Redis cache (source) | 24h TTL | ~5% |
| search.expired | intel_request_id UUID in message | None | ~0% (sporadic spikes) |

Root Cause: Two Candidates (Not Yet Verified)

The following are correlated with the loss, not proven to cause it. The March 10 replay (next step) will confirm or rule them out.

  1. AMQP connection failures correlate with bulk loss during failure windows. On Jan 13 hour 22, assets-api 500s triggered PRECONDITION_FAILED and failed ACKs in the same hour that 1,975 intel_requests.created events went missing. But correlation ≠ causation — we haven't confirmed that ASU never received those specific messages (vs received but failed to process them).

  2. _get_assets_by_name Redis cache returning empty can silently discard asset_status.updated events — the code path exists (source) and the replay shows ~5% "new" for that event type. But "new on replay" could also mean the original POST failed for other reasons, not just an empty cache. No production logging currently captures when this code path returns [].

Next Steps

  1. Replay March 10 data with enhanced logging (MR 29) — the NEW: lines now include correlation_id and batch_id, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call)
  2. Deploy MR 17 on ASU with LOG_ON_MESSAGE=true — proves whether messages reach ASU's AMQP callback at all
  3. Investigate AMQP reconnection behavior — the PRECONDITION_FAILED - unknown delivery tag error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch
  4. Consider adding logging to _get_assets_by_name when it returns [] — currently this is a silent discard
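A toy model of why "unknown delivery tag" can lose an in-flight batch (this models generic AMQP 0-9-1 channel semantics, not our client library):

```python
class Channel:
    """Delivery tags are scoped to a single channel in AMQP 0-9-1."""
    def __init__(self):
        self._next_tag = 1
        self._unacked = {}

    def deliver(self, msg):
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = msg
        return tag

    def ack(self, tag):
        if tag not in self._unacked:
            # RabbitMQ surfaces this as: PRECONDITION_FAILED - unknown delivery tag
            raise RuntimeError("PRECONDITION_FAILED - unknown delivery tag")
        del self._unacked[tag]

old = Channel()
tag = old.deliver("intel_requests.created payload")
new = Channel()   # reconnect: the old channel, and every tag it issued, is gone
try:
    new.ack(tag)  # consumer still holds the stale tag from before the reconnect
except RuntimeError as err:
    failed = str(err)
```

Whether the broker requeues the unacked messages after the old channel dies depends on consumer/prefetch settings, which is exactly what step 3 needs to verify.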

Runbook: Replay March 10 Data

Prerequisites

  • MR 29 merged and image built (docker.gisual.net/backend/consumers/asset-status-updater:latest)
  • March 10 data still in Loki retention (verify with step 1)
  • kubectl access to production cluster

Step 1: Verify Loki has March 10 data

# 1773100800–1773104400 = 2026-03-10 00:00–01:00 UTC
kubectl run --rm -it loki-check --image=curlimages/curl --restart=Never -- \
  curl -s 'http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/query_range' \
  --data-urlencode 'query={source="events", name="intel_requests.created"}' \
  --data-urlencode 'start=1773100800' \
  --data-urlencode 'end=1773104400' \
  --data-urlencode 'limit=1' | head -c 200

If empty/error → data has aged out, stop here.
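The start/end values are Unix-epoch seconds; they can be derived for any replay window with plain stdlib Python rather than computed by hand:

```python
from datetime import datetime, timezone

def epoch_range(day: str, hour: int, hours: int = 1):
    """Return (start, end) epoch seconds for an hour window on a UTC date."""
    start = datetime.fromisoformat(day).replace(hour=hour, tzinfo=timezone.utc)
    return int(start.timestamp()), int(start.timestamp()) + hours * 3600

print(epoch_range("2026-03-10", 0))   # (1773100800, 1773104400)
print(epoch_range("2026-03-10", 13))  # 13:00 UTC, when the first batches ran
```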

Step 2: Run replay for March 10 (24 parallel hourly jobs)

cd ~/src/gisual/backend/consumers/asset-status-updater/scripts

# Dry run first — prints manifests without applying
./generate-parallel-jobs.sh 2026-03-10 --dry-run | head -50

# Apply (ensure IMAGE_TAG points to the build with MR 29)
IMAGE_TAG=latest ./generate-parallel-jobs.sh 2026-03-10

Step 3: Monitor progress

# Watch job completion
kubectl get jobs -l app.kubernetes.io/name=replay-from-loki -w

# Tail a specific hour's logs
kubectl logs -f job/replay-from-loki-hour-13

# Check for errors across all jobs
kubectl logs -l app.kubernetes.io/name=replay-from-loki --tail=5 | grep -E "^(ERROR|REPLAY SUMMARY)"

Step 4: Collect REPLAY SUMMARY from all jobs

Wait for all 24 jobs to complete, then:

# Dump all summaries
for hour in $(seq -w 0 23); do
  echo "=== Hour $hour ==="
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep -A 20 "REPLAY SUMMARY"
  echo
done > /tmp/replay-march10-summaries.txt

Or via VictoriaLogs (after logs are shipped):

{service_name=~"replay-from-loki-hour-.*"} "TOTAL"

Time range: from job start to +4 hours.

Step 5: Extract NEW event IDs

# Get all NEW lines (events ASU missed)
for hour in $(seq -w 0 23); do
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep "NEW:"
done > /tmp/replay-march10-new-events.txt

# Count by event type
grep -oP 'NEW: \S+' /tmp/replay-march10-new-events.txt | sort | uniq -c | sort -rn

Step 6: Clean up jobs

kubectl delete jobs -l app.kubernetes.io/name=replay-from-loki

Runbook: Cross-Reference NEW Events Against ASU Production Logs

Once the replay completes and NEW: lines are collected:

Step 1: Extract correlation_ids from NEW lines

# Parse correlation_ids from the enhanced log format
grep "NEW:" /tmp/replay-march10-new-events.txt \
  | grep -oP 'correlation_id=\K[a-f0-9-]{36}' \
  | sort -u > /tmp/new-correlation-ids.txt

wc -l /tmp/new-correlation-ids.txt  # How many missed events

Step 2: Check if ASU ever received these messages

For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID:

{service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing"

Time range: March 10 00:00–24:00 UTC (start=1773100800000&end=1773187200000).

Batch check (sample 10 CIDs):

head -10 /tmp/new-correlation-ids.txt | while read cid; do
  echo -n "$cid: "
  curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \
    --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:\"$cid\" AND \"Processing\"" \
    --data-urlencode 'start=2026-03-10T00:00:00Z' \
    --data-urlencode 'end=2026-03-11T00:00:00Z' \
    --data-urlencode 'limit=1' | wc -c
  # >2 bytes = found, 0-2 bytes = not found
done

Step 3: Interpret results

ASU logged "Processing" for CID? Meaning Root cause
No RabbitMQ never delivered the message to ASU AMQP connection failure or queue-level loss
Yes, and handler made API calls ASU processed it but the POST failed (transient error) and retries exhausted Retry/error-exchange exhaustion
Yes, but 0 API calls (DISCARDED in replay) _get_assets_by_name returned empty at original time Redis cache miss or asset not yet created
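The same table as a decision function, for scripting the triage over many CIDs (labels are the RCA's own categories):

```python
def diagnose(logged_processing: bool, made_api_calls: bool) -> str:
    """Map the two observable signals to the RCA's root-cause buckets."""
    if not logged_processing:
        return "delivery failure: AMQP connection or queue-level loss"
    if made_api_calls:
        return "processing failure: POST failed and retries exhausted"
    return "application discard: _get_assets_by_name returned empty"
```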

Step 4: Correlate with AMQP errors

Check if the missed events cluster around AMQP error windows:

{service_name="asset-status-updater"} level:"error" _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors

If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue.
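One way to script that check (a sketch; it assumes you have parsed event timestamps from the NEW lines and error-window starts from the stats query):

```python
from datetime import datetime, timezone

def window_15m(ts: datetime) -> datetime:
    # Floor a timestamp to its 15-minute bucket, matching the stats query.
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

def clustered_fraction(new_event_times, error_windows):
    """Fraction of missed events that fall inside an AMQP-error window."""
    if not new_event_times:
        return 0.0
    hits = sum(window_15m(t) in error_windows for t in new_event_times)
    return hits / len(new_event_times)

errors = {datetime(2026, 3, 10, 22, 0, tzinfo=timezone.utc)}
events = [datetime(2026, 3, 10, 22, 7, tzinfo=timezone.utc),
          datetime(2026, 3, 10, 3, 2, tzinfo=timezone.utc)]
frac = clustered_fraction(events, errors)  # 0.5: one of two events in a window
```

A fraction near 1.0 points at the delivery layer; near 0.0 points at the application layer.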

Step 5: Document findings

Update this RCA with:

  • Total NEW count for March 10 (compare to Cody's 2,798)
  • Breakdown: delivery-layer vs application-layer loss
  • Whether the pattern matches Jan 13 (hour-22 spike) or is different
