DSM monitor alert analysis — past 3 months

Date generated: 2026-04-29

Query and scope

Used pup against Datadog org2 to query Event Management for monitor alert events matching:

source:alert team:data-streams-monitoring status:(error OR warn)

Time window queried: 2026-01-29T13:59:01Z to 2026-04-29T13:59:01Z (~the past 90 days).
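
For anyone without pup access, roughly the same event set should be retrievable from the public Datadog Events search API. The sketch below is only an illustration of an equivalent query, not how the numbers in this report were produced; it assumes DD_API_KEY / DD_APP_KEY environment variables, the datadoghq.com site, and the standard cursor-based pagination shape.

```python
# Sketch: pull the same firing events via the public Events search API.
# Assumes DD_API_KEY / DD_APP_KEY env vars; this report was produced with pup.
import os
import requests

DD_SITE = "https://api.datadoghq.com"
QUERY = "source:alert team:data-streams-monitoring status:(error OR warn)"


def fetch_alert_events(frm: str, to: str) -> list[dict]:
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    }
    events, cursor = [], None
    while True:
        body = {
            "filter": {"query": QUERY, "from": frm, "to": to},
            "page": {"limit": 100},  # page size kept small; adjust as needed
        }
        if cursor:
            body["page"]["cursor"] = cursor
        resp = requests.post(f"{DD_SITE}/api/v2/events/search", json=body, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        events.extend(payload.get("data", []))
        # Cursor location follows the usual v2 pattern; may differ slightly by API version.
        cursor = payload.get("meta", {}).get("page", {}).get("after")
        if not cursor:
            return events


# Example: the 90-day window used above.
# events = fetch_alert_events("2026-01-29T13:59:01Z", "2026-04-29T13:59:01Z")
```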

Notes:

  • Recovery/OK events were excluded.
  • Both alert/error and warn states were counted as “fired”.
  • fired events counts all matching firing event records.
  • unique cycles deduplicates by Datadog alert cycle key where present; this is usually a better proxy for distinct incidents than raw event volume.
  • renotify is included separately because it can inflate raw event counts for long-running alerts (see the counting sketch after this list).
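
To make the counting rules concrete, here is a minimal aggregation sketch. The field names (monitor_id, status, alert_cycle_key, is_renotify) are illustrative placeholders, not the exact pup/Event Management payload shape.

```python
# Sketch of the counting rules above, over hypothetical per-event records.
from collections import defaultdict


def summarize(events: list[dict]) -> dict[int, dict]:
    by_monitor: dict[int, dict] = defaultdict(
        lambda: {"fired": 0, "cycles": set(), "error": 0, "warn": 0, "renotify": 0}
    )
    for ev in events:
        m = by_monitor[ev["monitor_id"]]
        m["fired"] += 1                      # every firing record counts as a fired event
        if ev.get("alert_cycle_key"):        # dedupe to approximate distinct incidents
            m["cycles"].add(ev["alert_cycle_key"])
        if ev["status"] in ("error", "alert"):
            m["error"] += 1
        elif ev["status"] == "warn":
            m["warn"] += 1
        if ev.get("is_renotify"):            # tracked separately; inflates raw volume
            m["renotify"] += 1
    return {mid: {**s, "unique_cycles": len(s["cycles"])} for mid, s in by_monitor.items()}


# Example:
# summarize([{"monitor_id": 258256336, "status": "warn", "alert_cycle_key": "c1"}])
```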

Overall summary

  • 97 monitors fired at least once.
  • 3,208 firing events total.
  • 2,054 unique alert cycles total.
  • Status split: 2,057 error/alert events, 1,151 warn events.
  • Renotification events: 261.

Monitors that fired the most — ranked by firing event count

| # | monitor_id | fired events | unique cycles | error | warn | renotify | monitor |
|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 951 | 401 | 285 | 666 | 0 | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 276191315 | 350 | 222 | 298 | 52 | 111 | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 3 | 272862707 | 300 | 264 | 18 | 282 | 0 | Demo env transaction tracking not processing transactions |
| 4 | 261328594 | 253 | 253 | 253 | 0 | 0 | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 5 | 258046692 | 181 | 103 | 105 | 76 | 0 | [dsm-tt-scheduler] High pod restart rate |
| 6 | 261343942 | 111 | 111 | 111 | 0 | 0 | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 7 | 246051592 | 110 | 110 | 110 | 0 | 0 | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 8 | 246072014 | 86 | 79 | 86 | 0 | 7 | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 60 | 49 | 60 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 59 | 48 | 59 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 11 | 236894144 | 46 | 3 | 46 | 0 | 23 | [transaction-tracking] Pods not ready in {{datacenter.name}} |
| 12 | 246072006 | 41 | 24 | 41 | 0 | 17 | [dsm-api] Low availability in {{datacenter.name}} |
| 13 | 261341891 | 40 | 23 | 40 | 0 | 17 | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 14 | 258046617 | 38 | 19 | 38 | 0 | 19 | [dsm-tt-scheduler] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 15 | 258256342 | 35 | 1 | 18 | 17 | 0 | [dsm-api] High memory usage in {{datacenter.name}} |
| 16 | 246049695 | 32 | 32 | 32 | 0 | 0 | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable |
| 17 | 265151345 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /alerts org2 synthetic |
| 18 | 265151347 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /service_summary org2 synthetic |
| 19 | 265151348 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /map org2 synthetic |
| 20 | 265151356 | 32 | 1 | 32 | 0 | 0 | [Synthetics] [dsm-api][prtest07.prod.dog] /apm_streaming_services org2 synthetic |

Monitors that fired the most — ranked by unique alert cycles

| # | monitor_id | unique cycles | fired events | error | warn | renotify | monitor |
|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 401 | 951 | 285 | 666 | 0 | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 272862707 | 264 | 300 | 18 | 282 | 0 | Demo env transaction tracking not processing transactions |
| 3 | 261328594 | 253 | 253 | 253 | 0 | 0 | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 4 | 276191315 | 222 | 350 | 298 | 52 | 111 | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 5 | 261343942 | 111 | 111 | 111 | 0 | 0 | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 6 | 246051592 | 110 | 110 | 110 | 0 | 0 | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 7 | 258046692 | 103 | 181 | 105 | 76 | 0 | [dsm-tt-scheduler] High pod restart rate |
| 8 | 246072014 | 79 | 86 | 86 | 0 | 7 | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 49 | 60 | 60 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 48 | 59 | 59 | 0 | 11 | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |

Service rollup — top services by firing event count

| # | service | fired events | unique cycles | monitors |
|---|---|---|---|---|
| 1 | dsm-api | 1276 | 537 | 31 |
| 2 | dsm-kafka-configs-sync | 381 | 356 | 5 |
| 3 | ephemera-data-streams-checkpoints-kv | 350 | 222 | 1 |
| 4 | undefined | 325 | 289 | 4 |
| 5 | dsm-tt-scheduler | 233 | 134 | 5 |
| 6 | dsm-dlq-metrics-generator | 144 | 137 | 5 |
| 7 | data-pipeline-edge | 123 | 123 | 2 |
| 8 | dsm-tt-processor | 85 | 62 | 8 |
| 9 | transaction-tracking | 59 | 6 | 2 |
| 10 | dsm-batch-metrics-processor | 58 | 52 | 5 |
| 11 | dsm-org-trial-sync | 49 | 29 | 5 |
| 12 | data-streams-lag-writer | 40 | 35 | 3 |
| 13 | dsm-tt-bucket-processor | 20 | 14 | 8 |
| 14 | rc-api | 16 | 16 | 1 |
| 15 | data-streams-resolver | 13 | 10 | 4 |

Key takeaways

  • The biggest source of firing volume was [dsm-api] 4xx rate in {{datacenter.name}} with 951 firing events and 401 unique cycles.
  • Build Horizon was #2 by raw event count (350) but had 111 renotifications; by unique cycles it ranked #4.
  • The highest-recurring non-dsm-api monitors were:
    • Demo env transaction tracking not processing transactions — 300 events, 264 cycles.
    • [dsm-kafka-configs-sync] [*] ShortBurn Availability — 253 events, 253 cycles.
    • [dsm-tt-scheduler] High pod restart rate — 181 events, 103 cycles.
    • [data-pipeline-edge][{{datacenter.name}}] Too many restarts — 110 events, 110 cycles.
  • dsm-api dominated service-level firing volume: 1,276 events across 31 monitors.

Monitors that paged the on-call rotation

I interpreted “page the on-call rotation” as events that actually notified @oncall-data-streams-monitoring. I checked the rendered event notification recipients, not just monitor message templates, because some monitors include conditional @oncall blocks.
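
In code terms, the check was along these lines: a minimal sketch that inspects the handles an event actually notified. The notified_handles field name is an assumption for illustration, not the exact event payload key.

```python
# Sketch of the "did this actually page?" check: look at who the rendered event
# notified, not just the monitor's message template.
ONCALL_HANDLE = "@oncall-data-streams-monitoring"


def paged_oncall(event: dict) -> bool:
    # e.g. event["notified_handles"] = ["@slack-dsm-alerts", "@oncall-data-streams-monitoring"]
    # "notified_handles" is a hypothetical key standing in for the rendered recipients.
    recipients = event.get("notified_handles", [])
    return ONCALL_HANDLE in recipients


def paging_monitors(events: list[dict]) -> set[int]:
    """Monitor IDs with at least one firing event that notified the on-call handle."""
    return {ev["monitor_id"] for ev in events if paged_oncall(ev)}
```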

Across the 97 monitors that fired, 9 actually paged the on-call rotation during the 90-day window.

| monitor_id | page events | page cycles | fired events | service | monitor | page transition types |
|---|---|---|---|---|---|---|
| 246049695 | 32 | 32 | 32 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable | no data: 32 |
| 246072006 | 17 | 17 | 41 | dsm-api | [dsm-api] Low availability in {{datacenter.name}} | renotify: 17 |
| 246047579 | 5 | 4 | 5 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Packet loss on Koutris | no data: 3, alert: 1, warn: 1 |
| 246047578 | 4 | 4 | 4 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Latency metric generation packet loss on Koutris | no data: 4 |
| 246049699 | 3 | 1 | 6 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Errors resolving primary tag | alert: 3 |
| 246047581 | 2 | 1 | 3 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Errors resolving primary tag | alert: 2 |
| 249461913 | 2 | 2 | 2 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 2 |
| 249461860 | 1 | 1 | 1 | data-observability-schema-writer | [data-observability-schema-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 1 |
| 249461914 | 1 | 1 | 1 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Lagging on stream {{stream_id.name}} | alert: 1 |

Among the top 20 noisiest monitors, only these two paged on-call:

  1. 246072006: [dsm-api] Low availability in {{datacenter.name}}
  2. 246049695: [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable

The other high-volume monitors, including [dsm-api] 4xx rate, Build Horizon, demo transaction tracking, and the SLO burn monitors, did not actually notify @oncall-data-streams-monitoring in the firing events checked; they mostly notified Slack/non-on-call handles.

All monitors that fired

| # | monitor_id | fired events | unique cycles | error | warn | renotify | service | monitor |
|---|---|---|---|---|---|---|---|---|
| 1 | 258256336 | 951 | 401 | 285 | 666 | 0 | dsm-api | [dsm-api] 4xx rate in {{datacenter.name}} |
| 2 | 276191315 | 350 | 222 | 298 | 52 | 111 | ephemera-data-streams-checkpoints-kv | [Build Horizon] Service nearing or exceeding the 28-day rebuild deadline |
| 3 | 272862707 | 300 | 264 | 18 | 282 | 0 | undefined | Demo env transaction tracking not processing transactions |
| 4 | 261328594 | 253 | 253 | 253 | 0 | 0 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] [*] ShortBurn Availability |
| 5 | 258046692 | 181 | 103 | 105 | 76 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] High pod restart rate |
| 6 | 261343942 | 111 | 111 | 111 | 0 | 0 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] [*] LongBurn Availability |
| 7 | 246051592 | 110 | 110 | 110 | 0 | 0 | data-pipeline-edge | [data-pipeline-edge][{{datacenter.name}}] Too many restarts |
| 8 | 246072014 | 86 | 79 | 86 | 0 | 7 | dsm-api | [dsm-api] High P95 latency in {{datacenter.name}} |
| 9 | 261328590 | 60 | 49 | 60 | 0 | 11 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 10 | 261328592 | 59 | 48 | 59 | 0 | 11 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 11 | 236894144 | 46 | 3 | 46 | 0 | 23 | transaction-tracking | [transaction-tracking] Pods not ready in {{datacenter.name}} |
| 12 | 246072006 | 41 | 24 | 41 | 0 | 17 | dsm-api | [dsm-api] Low availability in {{datacenter.name}} |
| 13 | 261341891 | 40 | 23 | 40 | 0 | 17 | dsm-org-trial-sync | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 14 | 258046617 | 38 | 19 | 38 | 0 | 19 | dsm-tt-scheduler | [dsm-tt-scheduler] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 15 | 258256342 | 35 | 1 | 18 | 17 | 0 | dsm-api | [dsm-api] High memory usage in {{datacenter.name}} |
| 16 | 246049695 | 32 | 32 | 32 | 0 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] data_streams.throughput_by_schema metric unavailable |
| 17 | 265151345 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /alerts org2 synthetic |
| 18 | 265151347 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /service_summary org2 synthetic |
| 19 | 265151348 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /map org2 synthetic |
| 20 | 265151356 | 32 | 1 | 32 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][prtest07.prod.dog] /apm_streaming_services org2 synthetic |
| 21 | 258035898 | 31 | 20 | 8 | 23 | 0 | dsm-tt-processor | [dsm-tt-processor] High stream consumer lag on {{stream_id}}/{{traffic_lane}} in {{datacenter.name}} |
| 22 | 258035896 | 29 | 21 | 10 | 19 | 0 | dsm-tt-processor | [dsm-tt-processor] High pod restart rate |
| 23 | 261342390 | 29 | 29 | 29 | 0 | 0 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] [*] ShortBurn Availability |
| 24 | 250778918 | 22 | 22 | 22 | 0 | 0 | undefined | [RobC Test] DLQ Metrics Failed to Generate |
| 25 | 261342391 | 17 | 17 | 17 | 0 | 0 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] [*] LongBurn Availability |
| 26 | 179623743 | 16 | 16 | 16 | 0 | 0 | rc-api | [rc-api][{{datacenter.name}}] Resource POST /api/ui/remote_config/products/dsm_live_messages has a high p95 latency |
| 27 | 261343943 | 14 | 14 | 14 | 0 | 0 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] [*] ShortBurn Availability |
| 28 | 236894141 | 13 | 3 | 13 | 0 | 11 | transaction-tracking | [transaction-tracking] Deployment is stale in {{datacenter.name}} |
| 29 | 258655308 | 13 | 13 | 13 | 0 | 0 | data-pipeline-edge | [data-pipeline-edge][{{datacenter.name}}] Significant drop in traffic |
| 30 | 249205034 | 12 | 12 | 12 | 0 | 0 | data-streams-ui | [data-streams-ui] Data streams application crashed |
| 31 | 254830712 | 10 | 5 | 10 | 0 | 5 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] High CPU utilization in {{datacenter.name}} |
| 32 | 258046711 | 10 | 10 | 10 | 0 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] No buckets submitted for {{org_id}} |
| 33 | 258035669 | 8 | 8 | 8 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-processor |
| 34 | 263810515 | 8 | 8 | 8 | 0 | 0 | dsm-kafka-configs-writer | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-kafka-configs-writer |
| 35 | 254959906 | 7 | 4 | 7 | 0 | 3 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] Pods not ready in {{datacenter.name}} |
| 36 | 258047025 | 7 | 5 | 2 | 5 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High stream consumer lag on {{stream_id}}/{{traffic_lane}} |
| 37 | 246049699 | 6 | 1 | 3 | 3 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Errors resolving primary tag |
| 38 | 246072001 | 6 | 1 | 6 | 0 | 5 | dsm-api | [dsm-api] Deployment is stale in {{datacenter.name}} |
| 39 | 258035674 | 6 | 3 | 6 | 0 | 3 | dsm-tt-processor | [dsm-tt-processor] Pods not ready in {{datacenter.name}} |
| 40 | 261343940 | 6 | 5 | 6 | 0 | 1 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 41 | 153173090 | 5 | 4 | 3 | 2 | 0 | unifiedkv-api | Too many CQL driver refresh failures (WIP) |
| 42 | 246047579 | 5 | 4 | 4 | 1 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Packet loss on Koutris |
| 43 | 180902702 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor] Pods not ready in {{datacenter.name}} |
| 44 | 246047578 | 4 | 4 | 4 | 0 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Latency metric generation packet loss on Koutris |
| 45 | 254582741 | 4 | 2 | 4 | 0 | 2 | dsm-org-trial-sync | [dsm-org-trial-sync] High CPU utilization in {{datacenter.name}} |
| 46 | 258035679 | 4 | 4 | 4 | 0 | 0 | dsm-tt-processor | [dsm-tt-processor] High stash lag on stream {{stream_id.name}} |
| 47 | 258046947 | 4 | 2 | 4 | 0 | 2 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High memory utilization in {{datacenter.name}} for {{display_container_name.name}} |
| 48 | 261342387 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor][{{datacenter.name}}] Task {{task.name}} failure rate is high |
| 49 | 261342388 | 4 | 2 | 4 | 0 | 2 | dsm-batch-metrics-processor | [dsm-batch-metrics-processor][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 50 | 266571506 | 4 | 4 | 4 | 0 | 0 | dsm-tt-trace-processor | [dsm-tt-trace-processor] High stash lag on stream {{stream_id.name}} |
| 51 | 154680615 | 3 | 1 | 1 | 2 | 0 | cassandra | [cassandra][k8s] Too many CQL driver (Cassandra) refresh errors |
| 52 | 174272983 | 3 | 2 | 3 | 0 | 1 | dsm-dlq-metrics-generator | [dsm-dlq-metrics-generator] Pods not ready in {{datacenter.name}} |
| 53 | 246047581 | 3 | 1 | 2 | 1 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Errors resolving primary tag |
| 54 | 246072002 | 3 | 2 | 3 | 0 | 1 | dsm-api | [dsm-api] Pods not ready in {{datacenter.name}} |
| 55 | 246072402 | 3 | 3 | 3 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /apm_streaming_services org2 synthetic |
| 56 | 246072423 | 3 | 3 | 3 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /alerts org2 synthetic |
| 57 | 258035897 | 3 | 2 | 2 | 1 | 0 | dsm-tt-processor | [dsm-tt-processor] High DLT stream consumer lag on {{stream_id}} in {{datacenter.name}} |
| 58 | 258046614 | 3 | 1 | 3 | 0 | 2 | dsm-tt-scheduler | [dsm-tt-scheduler] Deployment is stale in {{datacenter.name}} |
| 59 | 261341893 | 3 | 2 | 3 | 0 | 1 | dsm-org-trial-sync | [dsm-org-trial-sync][{{datacenter.name}}] Task {{task.name}} timeout rate is high |
| 60 | 147788614 | 2 | 2 | 2 | 0 | 0 | undefined | Per-team custom metric costs are significantly above historical baseline |
| 61 | 246072424 | 2 | 1 | 2 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /alerts org2 synthetic |
| 62 | 249461913 | 2 | 2 | 2 | 0 | 0 | data-streams-lag-writer | [data-streams-lag-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 63 | 258035671 | 2 | 2 | 2 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Stream consumer is falling behind for dsm-tt-processor on a dead letter topic |
| 64 | 258035673 | 2 | 2 | 2 | 0 | 0 | dsm-tt-processor | [{{datacenter.name}}] Topic consumer of {{kafka_topic.name}} is close to retention limit |
| 65 | 258046929 | 2 | 1 | 2 | 0 | 1 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] Pods not ready in {{datacenter.name}} |
| 66 | 258046952 | 2 | 2 | 2 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-bucket-processor |
| 67 | 258047024 | 2 | 1 | 1 | 1 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High DLT stream consumer lag on {{stream_id}} |
| 68 | 261328593 | 2 | 2 | 2 | 0 | 0 | dsm-kafka-configs-sync | [dsm-kafka-configs-sync] [*] LongBurn Availability |
| 69 | 263810513 | 2 | 1 | 2 | 0 | 1 | dsm-kafka-configs-writer | [dsm-kafka-configs-writer] Pods not ready in {{datacenter.name}} |
| 70 | 246072401 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /map org2 synthetic |
| 71 | 246072403 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us1.prod.dog] /service_summary org2 synthetic |
| 72 | 246072404 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /alerts org2 synthetic |
| 73 | 246072405 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /apm_streaming_services org2 synthetic |
| 74 | 246072406 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /apm_streaming_services org2 synthetic |
| 75 | 246072407 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /alerts org2 synthetic |
| 76 | 246072408 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us3.prod.dog] /service_summary org2 synthetic |
| 77 | 246072410 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /service_summary org2 synthetic |
| 78 | 246072411 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /service_summary org2 synthetic |
| 79 | 246072412 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][eu1.prod.dog] /alerts org2 synthetic |
| 80 | 246072413 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /apm_streaming_services org2 synthetic |
| 81 | 246072418 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap1.prod.dog] /service_summary org2 synthetic |
| 82 | 246072419 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][us5.prod.dog] /map org2 synthetic |
| 83 | 249461860 | 1 | 1 | 1 | 0 | 0 | data-observability-schema-writer | [data-observability-schema-writer][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 84 | 249461914 | 1 | 1 | 1 | 0 | 0 | data-streams-resolver | [data-streams-resolver][{{datacenter.name}}] Lagging on stream {{stream_id.name}} |
| 85 | 258046616 | 1 | 1 | 1 | 0 | 0 | dsm-tt-scheduler | [dsm-tt-scheduler] Pods not ready in {{datacenter.name}} |
| 86 | 258046930 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Topic consumer of {{kafka_topic.name}} is close to retention limit |
| 87 | 258046933 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [dsm-tt-bucket-processor] High stash lag on stream {{stream_id.name}} |
| 88 | 258046950 | 1 | 1 | 1 | 0 | 0 | dsm-tt-bucket-processor | [{{datacenter.name}}] Stream consumer is falling behind for dsm-tt-bucket-processor on a dead letter topic |
| 89 | 258256343 | 1 | 1 | 1 | 0 | 0 | dsm-api | [dsm-api] Pod out-of-memory kill in {{datacenter.name}} |
| 90 | 261341894 | 1 | 1 | 1 | 0 | 0 | dsm-org-trial-sync | [dsm-org-trial-sync] [*] ShortBurn Availability |
| 91 | 261341895 | 1 | 1 | 1 | 0 | 0 | dsm-org-trial-sync | [dsm-org-trial-sync] [*] LongBurn Availability |
| 92 | 265151346 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /apm_streaming_services org2 synthetic |
| 93 | 265151349 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /map org2 synthetic |
| 94 | 265151350 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /alerts org2 synthetic |
| 95 | 265151355 | 1 | 1 | 1 | 0 | 0 | dsm-api | [Synthetics] [dsm-api][ap2.prod.dog] /service_summary org2 synthetic |
| 96 | 266571512 | 1 | 1 | 1 | 0 | 0 | dsm-tt-trace-processor | [{{datacenter.name}}] Stream consumer is falling behind on the main lane for dsm-tt-trace-processor |
| 97 | 273420469 | 1 | 1 | 1 | 0 | 0 | undefined | release-agent-release-agent-121fe90e-baba-43f7-949a-61e5b786062c-sumallcost metric is queryable by team tag for Team Resources CloudCostGraph |