DSM monitor alert analysis — past 3 months

Date generated: 2026-04-29

Query and scope

Used pup against Datadog org2 to query Event Management for monitor alert events matching:

source:alert team:data-streams-monitoring status:(error OR warn)

Time window queried: 2026-01-29T13:59:01Z to 2026-04-29T13:59:01Z (~the past 90 days).
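
For anyone without pup access, roughly the same event set should be retrievable from the public Datadog Events search API. The sketch below is only an illustration of an equivalent query, not how the numbers in this report were produced; it assumes DD_API_KEY / DD_APP_KEY environment variables, the datadoghq.com site, and the standard cursor-based pagination shape.

```python
# Sketch: pull the same firing events via the public Events search API.
# Assumes DD_API_KEY / DD_APP_KEY env vars; this report was produced with pup.
import os
import requests

DD_SITE = "https://api.datadoghq.com"
QUERY = "source:alert team:data-streams-monitoring status:(error OR warn)"


def fetch_alert_events(frm: str, to: str) -> list[dict]:
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    }
    events, cursor = [], None
    while True:
        body = {
            "filter": {"query": QUERY, "from": frm, "to": to},
            "page": {"limit": 100},  # page size kept small; adjust as needed
        }
        if cursor:
            body["page"]["cursor"] = cursor
        resp = requests.post(f"{DD_SITE}/api/v2/events/search", json=body, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        events.extend(payload.get("data", []))
        # Cursor location follows the usual v2 pattern; may differ slightly by API version.
        cursor = payload.get("meta", {}).get("page", {}).get("after")
        if not cursor:
            return events


# Example: the 90-day window used above.
# events = fetch_alert_events("2026-01-29T13:59:01Z", "2026-04-29T13:59:01Z")
```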

Notes:

  • Recovery/OK events were excluded.
  • Both alert/error and warn states were counted as “fired”.
  • fired events counts all matching firing event records.
  • unique cycles deduplicates by Datadog alert cycle key where present; this is usually a better proxy for distinct incidents than raw event volume.
  • renotify is included separately because it can inflate raw event counts for long-running alerts (see the counting sketch after this list).
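
To make the counting rules concrete, here is a minimal aggregation sketch. The field names (monitor_id, status, alert_cycle_key, is_renotify) are illustrative placeholders, not the exact pup/Event Management payload shape.

```python
# Sketch of the counting rules above, over hypothetical per-event records.
from collections import defaultdict


def summarize(events: list[dict]) -> dict[int, dict]:
    by_monitor: dict[int, dict] = defaultdict(
        lambda: {"fired": 0, "cycles": set(), "error": 0, "warn": 0, "renotify": 0}
    )
    for ev in events:
        m = by_monitor[ev["monitor_id"]]
        m["fired"] += 1                      # every firing record counts as a fired event
        if ev.get("alert_cycle_key"):        # dedupe to approximate distinct incidents
            m["cycles"].add(ev["alert_cycle_key"])
        if ev["status"] in ("error", "alert"):
            m["error"] += 1
        elif ev["status"] == "warn":
            m["warn"] += 1
        if ev.get("is_renotify"):            # tracked separately; inflates raw volume
            m["renotify"] += 1
    return {mid: {**s, "unique_cycles": len(s["cycles"])} for mid, s in by_monitor.items()}


# Example:
# summarize([{"monitor_id": 258256336, "status": "warn", "alert_cycle_key": "c1"}])
```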

Overall summary

  • 97 monitors fired at least once.
  • 3,208 firing events total.
  • 2,054 unique alert cycles total.
  • Status split: 2,057 error/alert events, 1,151 warn events.
  • Renotification events: 261.

Monitors that fired the most — ranked by firing event count

| # | monitor_id | fired events | unique cycles | error | warn | renotify | monitor |
|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 951 | 401 | 285 | 666 | 0 | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 276191315 | 350 | 222 | 298 | 52 | 111 | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 3 | 272862707 | 300 | 264 | 18 | 282 | 0 | Demo env transaction tracking not processing transactions |
| 4 | 261328594 | 253 | 253 | 253 | 0 | 0 | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 5 | 258046692 | 181 | 103 | 105 | 76 | 0 | [dsm-tt-scheduler] High pod restart rate |
| 6 | 261343942 | 111 | 111 | 111 | 0 | 0 | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 7 | 246051592 | 110 | 110 | 110 | 0 | 0 | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 8 | 246072014 | 86 | 79 | 86 | 0 | 7 | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 60 | 49 | 60 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 59 | 48 | 59 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 11 | 236894144 | 46 | 3 | 46 | 0 | 23 | [transaction-tracking] Pods not ready in {{datacenter.name}} |
| 12 | 246072006 | 41 | 24 | 41 | 0 | 17 | [dsm-api] Low availability in {{datacenter.name}} |
| 13 | 261341891 | 40 | 23 | 40 | 0 | 17 | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 14 | 258046617 | 38 | 19 | 38 | 0 | 19 | [dsm-tt-scheduler] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 15 | 258256342 | 35 | 1 | 18 | 17 | 0 | [dsm-api] High memory usage in {{datacenter.name}} |
| 16 | 246049695 | 32 | 32 | 32 | 0 | 0 | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable |
| 17 | 265151345 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /alerts org2 synthetic |
| 18 | 265151347 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /service_summary org2 synthetic |
| 19 | 265151348 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /map org2 synthetic |
| 20 | 265151356 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /apm_streaming_services org2 synthetic |

Monitors that fired the most — ranked by unique alert cycles

| # | monitor_id | unique cycles | fired events | error | warn | renotify | monitor |
|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 401 | 951 | 285 | 666 | 0 | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 272862707 | 264 | 300 | 18 | 282 | 0 | Demo env transaction tracking not processing transactions |
| 3 | 261328594 | 253 | 253 | 253 | 0 | 0 | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 4 | 276191315 | 222 | 350 | 298 | 52 | 111 | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 5 | 261343942 | 111 | 111 | 111 | 0 | 0 | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 6 | 246051592 | 110 | 110 | 110 | 0 | 0 | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 7 | 258046692 | 103 | 181 | 105 | 76 | 0 | [dsm-tt-scheduler] High pod restart rate |
| 8 | 246072014 | 79 | 86 | 86 | 0 | 7 | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 49 | 60 | 60 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 48 | 59 | 59 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |

Service rollup — top services by firing event count

| # | service | fired events | unique cycles | monitors |
|---|---|---|---|---|
| 1 | dsm-api | 1276 | 537 | 31 |
| 2 | dsm-kafka-configs-sync | 381 | 356 | 5 |
| 3 | ephemera-data-streams-checkpoints-kv | 350 | 222 | 1 |
| 4 | undefined | 325 | 289 | 4 |
| 5 | dsm-tt-scheduler | 233 | 134 | 5 |
| 6 | dsm-dlq-metrics-generator | 144 | 137 | 5 |
| 7 | data-pipeline-edge | 123 | 123 | 2 |
| 8 | dsm-tt-processor | 85 | 62 | 8 |
| 9 | transaction-tracking | 59 | 6 | 2 |
| 10 | dsm-batch-metrics-processor | 58 | 52 | 5 |
| 11 | dsm-org-trial-sync | 49 | 29 | 5 |
| 12 | data-streams-lag-writer | 40 | 35 | 3 |
| 13 | dsm-tt-bucket-processor | 20 | 14 | 8 |
| 14 | rc-api | 16 | 16 | 1 |
| 15 | data-streams-resolver | 13 | 10 | 4 |

Key takeaways

  • The biggest source of firing volume was [dsm-api] 4xx rate in {{datacenter.name}} with 951 firing events and 401 unique cycles.
  • Build Horizon was #2 by raw event count (350) but had 111 renotifications; by unique cycles it ranked #4.
  • The highest-recurring non-dsm-api monitors were:
    • Demo env transaction tracking not processing transactions — 300 events, 264 cycles.
    • [dsm-kafka-configs-sync] [*] ShortBurn Availability — 253 events, 253 cycles.
    • [dsm-tt-scheduler] High pod restart rate — 181 events, 103 cycles.
    • [data-pipeline-edge][{{datacenter.name}}] Too many restarts — 110 events, 110 cycles.
  • dsm-api dominated service-level firing volume: 1,276 events across 31 monitors.

Monitors that paged the on-call rotation

I interpreted “page the on-call rotation” as events that actually notified @oncall-data-streams-monitoring. I checked the rendered event notification recipients, not just monitor message templates, because some monitors include conditional @oncall blocks.
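
In code terms, the check was along these lines: a minimal sketch that inspects the handles an event actually notified. The notified_handles field name is an assumption for illustration, not the exact event payload key.

```python
# Sketch of the "did this actually page?" check: look at who the rendered event
# notified, not just the monitor's message template.
ONCALL_HANDLE = "@oncall-data-streams-monitoring"


def paged_oncall(event: dict) -> bool:
    # e.g. event["notified_handles"] = ["@slack-dsm-alerts", "@oncall-data-streams-monitoring"]
    # "notified_handles" is a hypothetical key standing in for the rendered recipients.
    recipients = event.get("notified_handles", [])
    return ONCALL_HANDLE in recipients


def paging_monitors(events: list[dict]) -> set[int]:
    """Monitor IDs with at least one firing event that notified the on-call handle."""
    return {ev["monitor_id"] for ev in events if paged_oncall(ev)}
```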

Across the 97 monitors that fired, 9 actually paged the on-call rotation during the 90-day window.

| monitor_id | page events | page cycles | fired events | service | monitor | page transition types |
|---|---|---|---|---|---|---|
| 246049695 | 32 | 32 | 32 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable | no data: 32 |
| 246072006 | 17 | 17 | 41 | dsm-api | [dsm-api] Low availability in {{datacenter.name}} | renotify: 17 |
| 246047579 | 5 | 4 | 5 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Packet loss on Koutris | no data: 3, alert: 1, warn: 1 |
| 246047578 | 4 | 4 | 4 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Latency metric generation packet loss on Koutris | no data: 4 |
| 246049699 | 3 | 1 | 6 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Errors resolving primary tag | alert: 3 |
| 246047581 | 2 | 1 | 3 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Errors resolving primary tag | alert: 2 |
| 249461913 | 2 | 2 | 2 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 2 |
| 249461860 | 1 | 1 | 1 | data-observability-schema-writer | [data-observability-schema-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 1 |
| 249461914 | 1 | 1 | 1 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 1 |

Among the top 20 noisiest monitors, only these two paged on-call:

  1. 246072006: [dsm-api] Low availability in {{datacenter.name}}
  2. 246049695: [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable

The other high-volume monitors, including [dsm-api] 4xx rate, Build Horizon, demo transaction tracking, and the SLO burn monitors, did not actually notify @oncall-data-streams-monitoring in the firing events checked; they mostly notified Slack/non-on-call handles.

All monitors that fired

| # | monitor_id | fired events | unique cycles | error | warn | renotify | service | monitor |
|---|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 951 | 401 | 285 | 666 | 0 | dsm-api | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 276191315 | 350 | 222 | 298 | 52 | 111 | ephemera-data-streams-checkpoints-kv | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 3 | 272862707 | 300 | 264 | 18 | 282 | 0 | undefined | Demo env transaction tracking not processing transactions |
| 4 | 261328594 | 253 | 253 | 253 | 0 | 0 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 5 | 258046692 | 181 | 103 | 105 | 76 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] High pod restart rate |
| 6 | 261343942 | 111 | 111 | 111 | 0 | 0 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 7 | 246051592 | 110 | 110 | 110 | 0 | 0 | data-pipeline-edge | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 8 | 246072014 | 86 | 79 | 86 | 0 | 7 | dsm-api | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 60 | 49 | 60 | 0 | 11 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 59 | 48 | 59 | 0 | 11 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 11 | 236894144 | 46 | 3 | 46 | 0 | 23 | transaction-tracking | [transaction-tracking] Pods not ready in {{datacenter.name}} |
| 12 | 246072006 | 41 | 24 | 41 | 0 | 17 | dsm-api | [dsm-api] Low availability in {{datacenter.name}} |
| 13 | 261341891 | 40 | 23 | 40 | 0 | 17 | dsm-org-trial-sync | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 14 | 258046617 | 38 | 19 | 38 | 0 | 19 | dsm-tt-scheduler | [dsm-tt-scheduler] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 15 | 258256342 | 35 | 1 | 18 | 17 | 0 | dsm-api | [dsm-api] High memory usage in {{datacenter.name}} |
| 16 | 246049695 | 32 | 32 | 32 | 0 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable |
| 17 | 265151345 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /alerts org2 synthetic |
| 18 | 265151347 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /service_summary org2 synthetic |
| 19 | 265151348 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /map org2 synthetic |
| 20 | 265151356 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /apm_streaming_services org2 synthetic |
| 21 | 258035898 | 31 | 20 | 8 | 23 | 0 | dsm-tt-processor | [dsm-tt-processor] High stream consumer lag on {{stream_id}}/{{traffic_lane}} in {{datacenter.name}} |
| 22 | 258035896 | 29 | 21 | 10 | 19 | 0 | dsm-tt-processor | [dsm-tt-processor] High pod restart rate |
| 23 | 261342390 | 29 | 29 | 29 | 0 | 0 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] [*] ShortBurn Availability |
| 24 | 250778918 | 22 | 22 | 22 | 0 | 0 | undefined | [RobC Test] DLQ Metrics Failed to Generate |
| 25 | 261342391 | 17 | 17 | 17 | 0 | 0 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] [*] LongBurn Availability |
| 26 | 179623743 | 16 | 16 | 16 | 0 | 0 | rc-api | [rc-api][{{datacenter.name}}] Resource POST /api/ui/remote_config/products/dsm_live_messages has a high p95 latency |
| 27 | 261343943 | 14 | 14 | 14 | 0 | 0 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] [*] ShortBurn Availability |
| 28 | 236894141 | 13 | 3 | 13 | 0 | 11 | transaction-tracking | [transaction-tracking] Deployment is stale in {{datacenter.name}} |
| 29 | 258655308 | 13 | 13 | 13 | 0 | 0 | data-pipeline-edge | [data-pipeline-edge][{{datacenter.name}}] Significant drop in traffic |
| 30 | 249205034 | 12 | 12 | 12 | 0 | 0 | data-streams-ui | [data-streams-ui] Data streams application crashed |
| 31 | 254830712 | 10 | 5 | 10 | 0 | 5 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] High CPU utilization in {{datacenter.name}} |
| 32 | 258046711 | 10 | 10 | 10 | 0 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] No buckets submitted for {{org_id}} |
| 33 | 258035669 | 8 | 8 | 8 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-processor |
| 34 | 263810515 | 8 | 8 | 8 | 0 | 0 | dsm-kafka-configs-writer | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-kafka-configs-writer |
| 35 | 254959906 | 7 | 4 | 7 | 0 | 3 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] Pods not ready in {{datacenter.name}} |
| 36 | 258047025 | 7 | 5 | 2 | 5 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High stream consumer lag on {{stream_id}}/{{traffic_lane}} |
| 37 | 246049699 | 6 | 1 | 3 | 3 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Errors resolving primary tag |
| 38 | 246072001 | 6 | 1 | 6 | 0 | 5 | dsm-api | [dsm-api] Deployment is stale in {{datacenter.name}} |
| 39 | 258035674 | 6 | 3 | 6 | 0 | 3 | dsm-tt-processor | [dsm-tt-processor] Pods not ready in {{datacenter.name}} |
| 40 | 261343940 | 6 | 5 | 6 | 0 | 1 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 41 | 153173090 | 5 | 4 | 3 | 2 | 0 | unifiedkv-api | Too many CQL driver refresh failures (WIP) |
| 42 | 246047579 | 5 | 4 | 4 | 1 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Packet loss on Koutris |
| 43 | 180902702 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] Pods not ready in {{datacenter.name}} |
| 44 | 246047578 | 4 | 4 | 4 | 0 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Latency metric generation packet loss on Koutris |
| 45 | 254582741 | 4 | 2 | 4 | 0 | 2 | dsm-org-trial-sync | [dsm-org-trial-sync] High CPU utilization in {{datacenter.name}} |
| 46 | 258035679 | 4 | 4 | 4 | 0 | 0 | dsm-tt-processor | [dsm-tt-processor] High stash lag on stream {{stream_id.name}} |
| 47 | 258046947 | 4 | 2 | 4 | 0 | 2 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 48 | 261342387 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 49 | 261342388 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 50 | 266571506 | 4 | 4 | 4 | 0 | 0 | dsm-tt-trace-processor | [dsm-tt-trace-processor] High stash lag on stream {{stream_id.name}} |
| 51 | 154680615 | 3 | 1 | 1 | 2 | 0 | cassandra | [cassandra][k8s] Too many CQL driver (Cassandra) refresh errors |
| 52 | 174272983 | 3 | 2 | 3 | 0 | 1 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] Pods not ready in {{datacenter.name}} |
| 53 | 246047581 | 3 | 1 | 2 | 1 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Errors resolving primary tag |
| 54 | 246072002 | 3 | 2 | 3 | 0 | 1 | dsm-api | [dsm-api] Pods not ready in {{datacenter.name}} |
| 55 | 246072402 | 3 | 3 | 3 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /apm_streaming_services org2 synthetic |
| 56 | 246072423 | 3 | 3 | 3 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /alerts org2 synthetic |
| 57 | 258035897 | 3 | 2 | 2 | 1 | 0 | dsm-tt-processor | [dsm-tt-processor] High DLT stream consumer lag on {{stream_id}} in {{datacenter.name}} |
| 58 | 258046614 | 3 | 1 | 3 | 0 | 2 | dsm-tt-scheduler | [dsm-tt-scheduler] Deployment is stale in {{datacenter.name}} |
| 59 | 261341893 | 3 | 2 | 3 | 0 | 1 | dsm-org-trial-sync | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 60 | 147788614 | 2 | 2 | 2 | 0 | 0 | undefined | Per-team custom metric costs are significantly above historical baseline |
| 61 | 246072424 | 2 | 1 | 2 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /alerts org2 synthetic |
| 62 | 249461913 | 2 | 2 | 2 | 0 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 63 | 258035671 | 2 | 2 | 2 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Stream consumer is falling behind for dsm-tt-processor on a dead letter topic |
| 64 | 258035673 | 2 | 2 | 2 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Topic consumer of {{kafka_topic.name}} is close to retention limit |
| 65 | 258046929 | 2 | 1 | 2 | 0 | 1 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] Pods not ready in {{datacenter.name}} |
| 66 | 258046952 | 2 | 2 | 2 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-bucket-processor |
| 67 | 258047024 | 2 | 1 | 1 | 1 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High DLT stream consumer lag on {{stream_id}} |
| 68 | 261328593 | 2 | 2 | 2 | 0 | 0 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] [*] LongBurn Availability |
| 69 | 263810513 | 2 | 1 | 2 | 0 | 1 | dsm-kafka-configs-writer | [dsm-kafka-configs-writer] Pods not ready in {{datacenter.name}} |
| 70 | 246072401 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /map org2 synthetic |
| 71 | 246072403 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /service_summary org2 synthetic |
| 72 | 246072404 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /alerts org2 synthetic |
| 73 | 246072405 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /apm_streaming_services org2 synthetic |
| 74 | 246072406 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /apm_streaming_services org2 synthetic |
| 75 | 246072407 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /alerts org2 synthetic |
| 76 | 246072408 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /service_summary org2 synthetic |
| 77 | 246072410 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /service_summary org2 synthetic |
| 78 | 246072411 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /service_summary org2 synthetic |
| 79 | 246072412 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /alerts org2 synthetic |
| 80 | 246072413 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /apm_streaming_services org2 synthetic |
| 81 | 246072418 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /service_summary org2 synthetic |
| 82 | 246072419 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /map org2 synthetic |
| 83 | 249461860 | 1 | 1 | 1 | 0 | 0 | data-observability-schema-writer | [data-observability-schema-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 84 | 249461914 | 1 | 1 | 1 | 0 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 85 | 258046616 | 1 | 1 | 1 | 0 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] Pods not ready in {{datacenter.name}} |
| 86 | 258046930 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Topic consumer of {{kafka_topic.name}} is close to retention limit |
| 87 | 258046933 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High stash lag on stream {{stream_id.name}} |
| 88 | 258046950 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Stream consumer is falling behind for dsm-tt-bucket-processor on a dead letter topic |
| 89 | 258256343 | 1 | 1 | 1 | 0 | 0 | dsm-api | [dsm-api] Pod out-of-memory kill in {{datacenter.name}} |
| 90 | 261341894 | 1 | 1 | 1 | 0 | 0 | dsm-org-trial-sync | [dsm-org-trial-sync] [*] ShortBurn Availability |
| 91 | 261341895 | 1 | 1 | 1 | 0 | 0 | dsm-org-trial-sync | [dsm-org-trial-sync] [*] LongBurn Availability |
| 92 | 265151346 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /apm_streaming_services org2 synthetic |
| 93 | 265151349 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /map org2 synthetic |
| 94 | 265151350 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /alerts org2 synthetic |
| 95 | 265151355 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /service_summary org2 synthetic |
| 96 | 266571512 | 1 | 1 | 1 | 0 | 0 | dsm-tt-trace-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-trace-processor |
| 97 | 273420469 | 1 | 1 | 1 | 0 | 0 | undefined | release-agent-release-agent-121fe90e-baba-43f7-949a-61e5b786062c-sumallcost metric is queryable by team tag for Team Resources CloudCostGraph |