| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
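As a sanity check on the arithmetic, a short script (line counts copied from the tables in this RCA; the 2-lines-per-message rule is from the amqp-consumer library) reproduces the per-batch and total loss figures:

```python
# Each delivered message produces 2 log lines (Processing + Processed),
# so messages = log_lines / 2. event-recorder received 100%, so its
# count is the expected baseline.
line_counts = {
    # batch_id: (event-recorder, AIPT, ASU) log-line counts
    "c2a627b3": (348, 348, 336),
    "215181db": (44, 44, 42),
    "812f79b7": (136, 136, 132),
    "d334c245": (128, 128, 124),
    "2186cea6": (44, 44, 40),
    "7438ef94": (108, 108, 102),
}

total_expected = total_asu = 0
for batch, (er, aipt, asu) in line_counts.items():
    expected, got = er // 2, asu // 2
    total_expected += expected
    total_asu += got
    print(f"{batch}: expected {expected}, ASU {got}, lost {expected - got}")

lost = total_expected - total_asu
print(f"total lost: {lost}/{total_expected} = {lost / total_expected:.2%}")
# → total lost: 16/404 = 3.96%
```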
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):
{service_name="{service}"} message_type:"intel_requests.created"
AND batch_id:~"c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
SELECT intel_request_id::text,
(metadata->>'correlation_id')::text as correlation_id,
(metadata->>'batch_id')::text as batch_id,
created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from `(Intel Request {uuid})` in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
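The guard behavior can be illustrated with a hypothetical Python re-implementation (the real pipeline is VRL in 07-extract-ids.vrl; `extract_ids` and its exact patterns here are illustrative stand-ins):

```python
import re

# Pull IDs out of unstructured _msg text, but only when the structured
# field is absent — mirrors VRL's `if !exists(...)` guards so extracted
# values never clobber structured JSON fields.
PATTERNS = {
    "correlation_id": re.compile(r"\(CID ([0-9a-f-]{36})\)"),
    "batch_id": re.compile(r"\(Batch ID ([0-9a-f-]{36})\)"),
}

def extract_ids(event: dict) -> dict:
    msg = event.get("_msg", "")
    for field, pattern in PATTERNS.items():
        if field not in event:          # guard: structured field wins
            m = pattern.search(msg)
            if m:
                event[field] = m.group(1)
    return event

event = {"_msg": "Processing (CID 11c84002-0000-0000-0000-000000000000)"}
print(extract_ids(event)["correlation_id"])
# → 11c84002-0000-0000-0000-000000000000
```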
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — but only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
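The counting logic reduces to bucketing replay responses by status code. A minimal sketch of that logic (the real tool lives in the ASU image; `post_event` and the event shape here are stand-ins):

```python
from collections import Counter

def replay(events, post_event):
    """Replay events through ASU's processor and bucket the outcomes.

    409  -> duplicate: ASU already processed it the first time.
    2xx  -> new: ASU missed it originally (this is the loss signal).
    None -> discarded: handler made no API call at all.
    """
    counts = Counter()
    for event in events:
        status = post_event(event)
        if status is None:
            counts["discarded"] += 1
        elif status == 409:
            counts["duplicate"] += 1   # intentionally NOT in allowed_errors
        elif 200 <= status < 300:
            counts["new"] += 1
        else:
            counts["error"] += 1
    return counts

# Simulated run: 3 duplicates, 1 miss, 1 discard
statuses = iter([409, 409, 201, None, 409])
print(replay(range(5), lambda e: next(statuses)))
```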
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new): 104
Duplicates (409): 872,513
Discarded (no API call): 79,774
Errors: 0
By event type:
Event Type New 409s Skip
asset_status.updated 13 788 2047
incidents.deleted 0 5 7
incidents.updated 0 532 22060
intel_requests.created 1 41112 68
power_status.updated 0 62941 55574
search.expired 90 767135 18
TOTAL 104 872513 79774
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
Acute — AMQP connection failures: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (CHANNEL_ERROR, PRECONDITION_FAILED - unknown delivery tag, Failed to ACK message ... 'NoneType'). This single hour accounts for 1,975 of ~2,000 missed intel_requests.created events on that day (VLogs).
Evidence: Per-hour intel_requests.created replay results (Jan 13 data)
| Hour (UTC) | New (missed) | 409s | Loss rate | Note |
|---|---|---|---|---|
| 00-21 | 0-4 each | 5,000-16,000 | <0.05% | Steady state |
| 22 | 1,975 | 6,553 | 23.2% | AMQP connection death |
| 23 | 13 | 6,711 | 0.19% | Tail of recovery |
Verify: `{service_name=~"replay-from-loki-hour-.*"} " intel_requests.created"` on Jan 15
Chronic — asset_status.updated code path: The _process_asset_status_updated_message handler has a name-to-UUID resolution path (processor.py:554-634) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, _get_assets_by_name returns [] → zero API calls → event silently discarded. Other event types (power_status.updated, intel_requests.created) use stable UUIDs from the message body and never hit this cache — explaining why they have 0% loss at steady state.
| Event type | Asset identity | Cache dependency | Steady-state loss |
|---|---|---|---|
| power_status.updated | intel_request_id UUID in message | None | 0% |
| intel_requests.created | intel_request_id UUID in message | None | ~0% (only during AMQP failures) |
| asset_status.updated | Name→UUID via Redis cache (source) | 24h TTL | ~5% |
| search.expired | intel_request_id UUID in message | None | ~0% (sporadic spikes) |
The following are correlated with the loss, not proven to cause it. The March 10 replay (next step) will confirm or rule them out.
- AMQP connection failures correlate with bulk loss during failure windows. On Jan 13 hour 22, assets-api 500s triggered `PRECONDITION_FAILED` and failed ACKs in the same hour that 1,975 `intel_requests.created` events went missing. But correlation ≠ causation — we haven't confirmed that ASU never received those specific messages (vs received but failed to process them).
- `_get_assets_by_name` Redis cache returning empty can silently discard `asset_status.updated` events — the code path exists (source) and the replay shows ~5% "new" for that event type. But "new on replay" could also mean the original POST failed for other reasons, not just an empty cache. No production logging currently captures when this code path returns `[]`.
- Replay March 10 data with enhanced logging (MR 29) — the `NEW:` lines now include `correlation_id` and `batch_id`, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call)
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's AMQP callback at all
- Investigate AMQP reconnection behavior — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch
- Consider adding logging to `_get_assets_by_name` when it returns `[]` — currently this is a silent discard
- MR 29 merged and image built (`docker.gisual.net/backend/consumers/asset-status-updater:latest`)
- March 10 data still in Loki retention (verify with step 1)
- `kubectl` access to production cluster
kubectl run --rm -it loki-check --image=curlimages/curl --restart=Never -- \
curl -s 'http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/query_range' \
--data-urlencode 'query={source="events", name="intel_requests.created"}' \
--data-urlencode 'start=1741564800' \
--data-urlencode 'end=1741568400' \
  --data-urlencode 'limit=1' | head -c 200

If empty/error → data has aged out, stop here.
cd ~/src/gisual/backend/consumers/asset-status-updater/scripts
# Dry run first — prints manifests without applying
./generate-parallel-jobs.sh 2026-03-10 --dry-run | head -50
# Apply (ensure IMAGE_TAG points to the build with MR 29)
IMAGE_TAG=latest ./generate-parallel-jobs.sh 2026-03-10

# Watch job completion
kubectl get jobs -l app.kubernetes.io/name=replay-from-loki -w
# Tail a specific hour's logs
kubectl logs -f job/replay-from-loki-hour-13
# Check for errors across all jobs
kubectl logs -l app.kubernetes.io/name=replay-from-loki --tail=5 | grep -E "^(ERROR|REPLAY SUMMARY)"

Wait for all 24 jobs to complete, then:
# Dump all summaries
for hour in $(seq -w 0 23); do
echo "=== Hour $hour ==="
kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep -A 20 "REPLAY SUMMARY"
echo
done > /tmp/replay-march10-summaries.txt

Or via VictoriaLogs (after logs are shipped):
{service_name=~"replay-from-loki-hour-.*"} "TOTAL"
Time range: from job start to +4 hours.
# Get all NEW lines (events ASU missed)
for hour in $(seq -w 0 23); do
kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep "^.*NEW:"
done > /tmp/replay-march10-new-events.txt
# Count by event type
cat /tmp/replay-march10-new-events.txt | grep -oP 'NEW: \S+' | sort | uniq -c | sort -rn

Cleanup:

kubectl delete jobs -l app.kubernetes.io/name=replay-from-loki

Once the replay completes and NEW: lines are collected:
# Parse correlation_ids from the enhanced log format
grep "NEW:" /tmp/replay-march10-new-events.txt \
| grep -oP 'correlation_id=\K[a-f0-9-]{36}' \
| sort -u > /tmp/new-correlation-ids.txt
wc -l /tmp/new-correlation-ids.txt  # How many missed events

For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID:
{service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing"
Time range: March 10 00:00–24:00 UTC (start=1773100800000&end=1773187200000).
Batch check (sample 10 CIDs):
head -10 /tmp/new-correlation-ids.txt | while read cid; do
echo -n "$cid: "
curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \
--data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:\"$cid\" AND \"Processing\"" \
--data-urlencode 'start=2026-03-10T00:00:00Z' \
--data-urlencode 'end=2026-03-11T00:00:00Z' \
--data-urlencode 'limit=1' | wc -c
# >2 bytes = found, 0-2 bytes = not found
done

| ASU logged "Processing" for CID? | Meaning | Root cause |
|---|---|---|
| No | RabbitMQ never delivered the message to ASU | AMQP connection failure or queue-level loss |
| Yes, and handler made API calls | ASU processed it but the POST failed (transient error) and retries exhausted | Retry/error-exchange exhaustion |
| Yes, but 0 API calls (DISCARDED in replay) | `_get_assets_by_name` returned empty at original time | Redis cache miss or asset not yet created |
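For scripting the triage, the decision table above can be encoded as a small classifier (function name and bucket labels are mine, not from the codebase; inputs come from the CID search and the replay DISCARDED flag):

```python
def classify(cid_in_asu_logs: bool, made_api_calls: bool) -> str:
    """Map replay findings for one missed event to a root-cause bucket."""
    if not cid_in_asu_logs:
        # No "Processing" line for this CID: RabbitMQ never delivered it.
        return "delivery loss (AMQP/queue)"
    if made_api_calls:
        # Processed, but the POST failed and retries were exhausted.
        return "retry/error-exchange exhaustion"
    # Processed with zero API calls: the silent-discard path.
    return "silent discard (empty _get_assets_by_name)"

for args in [(False, False), (True, True), (True, False)]:
    print(args, "->", classify(*args))
```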
Check if the missed events cluster around AMQP error windows:
{service_name="asset-status-updater"} level:"error" _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors
If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue.
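The clustering check can also be done offline on the collected NEW timestamps; a sketch of the same 15-minute bucketing the `stats by (_time:15m)` query performs (timestamps below are made up):

```python
from collections import Counter
from datetime import datetime, timezone

def bucket_15m(timestamps):
    """Count events per 15-minute UTC window, like `stats by (_time:15m)`."""
    buckets = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        window = dt.replace(minute=dt.minute - dt.minute % 15,
                            second=0, microsecond=0)
        buckets[window] += 1
    return buckets

new_events = ["2026-03-10T22:03:11+00:00",
              "2026-03-10T22:07:45+00:00",
              "2026-03-10T14:30:02+00:00"]
for window, n in sorted(bucket_15m(new_events).items()):
    print(window.isoformat(), n)
```

Run it on both the NEW timestamps and the error-log timestamps, then compare which windows overlap: heavy overlap points at AMQP failure windows, no overlap points at the application layer.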
Update this RCA with:
- Total NEW count for March 10 (compare to Cody's 2,798)
- Breakdown: delivery-layer vs application-layer loss
- Whether the pattern matches Jan 13 (hour-22 spike) or is different