16 monitors | env=stg | not excluded | not validated
Generated: 2026-04-21
All use the same pattern: sum(<metric>) by (alert_team, stack, service, role, region) > threshold
| Alert Name | Metric | Threshold |
|---|---|---|
| infra-gs-latency-trino-execution-time-p50-1m | trino_execution_QueryManager_ExecutionTime_OneMinute_P50 | > 300,000 |
| infra-gs-latency-trino-execution-time-p50-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P50 | > 300,000 |
| infra-gs-latency-trino-execution-time-p75-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P75 | > 300,000 |
| infra-gs-latency-trino-execution-time-p95-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P95 | > 600,000 |
| infra-gs-latency-trino-queued-time-p75-1m | trino_execution_QueryManager_QueuedTime_OneMinute_P75 | > 1,000 |
Blocker: These trino_execution_QueryManager_* JMX metrics are exposed via a Prometheus JMX exporter but not collected by the Datadog JMX integration. Requires adding these MBeans to the DD JMX check config.
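As a rough illustration of what "adding these MBeans to the DD JMX check config" could look like, the sketch below derives one `include` entry per metric in the table. The keys `include`, `domain`, `bean`, `attribute`, `alias`, and `metric_type` are standard Datadog JMX check options. The split of each Prometheus name into domain, bean, and attribute is an assumption inferred from the metric names and must be verified against the live MBean tree. The alias names are hypothetical.

```python
import json
from typing import Dict, List

# Prometheus JMX exporter metric names taken from the table above.
PROM_METRICS: List[str] = [
    "trino_execution_QueryManager_ExecutionTime_OneMinute_P50",
    "trino_execution_QueryManager_ExecutionTime_FiveMinutes_P50",
    "trino_execution_QueryManager_ExecutionTime_FiveMinutes_P75",
    "trino_execution_QueryManager_ExecutionTime_FiveMinutes_P95",
    "trino_execution_QueryManager_QueuedTime_OneMinute_P75",
]


def to_jmx_include(prom_name: str) -> Dict:
    """Build one Datadog JMX `include` entry from an exporter metric name.

    ASSUMPTION: the name has the shape
    <domain part>_<domain part>_<bean name>_<attribute parts...>,
    e.g. domain "trino.execution", bean "QueryManager",
    attribute "ExecutionTime.OneMinute.P50". Verify before use.
    """
    parts = prom_name.split("_")
    domain = ".".join(parts[:2])
    bean_name = parts[2]
    attribute = ".".join(parts[3:])
    alias = prom_name.lower().replace("_", ".")  # hypothetical DD metric name
    return {
        "include": {
            "domain": domain,
            "bean": f"{domain}:name={bean_name}",
            "attribute": {attribute: {"alias": alias, "metric_type": "gauge"}},
        }
    }


conf = [to_jmx_include(name) for name in PROM_METRICS]
# Printed as JSON for inspection; the real check config is a YAML file.
print(json.dumps({"conf": conf}, indent=2))
```

The percentiles are treated as gauges here because they are point-in-time values, not monotonically increasing counters.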
All use the same pattern: sum(rate(<metric>[5m])) by (alert_team, stack, service, role, region) > threshold
| Alert Name | Metric | Threshold |
|---|---|---|
| infra-gs-errors-trino-gateway-proxy-request-failed | io_airlift_http_client_type_HttpClient_name_ForProxy_RequestFailed | > 10 |
| infra-gs-errors-trino-gateway-router-request-failed | io_airlift_http_client_type_HttpClient_name_ForRouter_RequestFailed | > 10 |
| infra-gs-errors-trino-gateway-router-5xx | io_airlift_http_client_type_HttpClient_name_ForRouter_5xxResponse | > 5 |
| infra-gs-errors-trino-gateway-router-4xx | io_airlift_http_client_type_HttpClient_name_ForRouter_4xxResponse | > 5 |
| infra-gs-errors-trino-gateway-router-3xx | io_airlift_http_client_type_HttpClient_name_ForRouter_3xxResponse | > 5 |
Blocker: The io_airlift_http_client_* metrics come from internal Trino/Starburst JMX MBeans that DD's JMX integration does not collect. The fix is the same as for Group 1 (the Trino percentile monitors): extend the JMX check config.
Alert: infra-gs-errors-trino-insufficient-resources
sum(rate(trino_execution_QueryManager_InsufficientResourcesFailures_TotalCount[5m]))
by (alert_team, stack, service, role, region) > 1
Blocker: This metric belongs to the same JMX family as Group 1 (the trino_execution_* MBeans), which is not collected by DD.
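For the `sum(rate(<metric>[5m]))` monitors, note that PromQL `rate()` returns a per-second average over the window, so thresholds such as `> 10` or `> 1` are per-second values. The sketch below is a simplified model of that calculation under stated assumptions: it handles counter resets but omits the extrapolation to the window boundaries that Prometheus applies. The function name and sample data are illustrative only.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Per-second increase of a counter across the given samples.

    Simplified model of PromQL rate(): a drop in value is treated as a
    counter reset, and no boundary extrapolation is performed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value
        # itself is the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Illustrative data: one sample per minute, with a reset before t=180.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(simple_rate(window))
```

Any Datadog replacement has to reproduce the per-second semantics, otherwise the existing thresholds no longer mean the same thing.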
| Alert Name | PromQL |
|---|---|
| infra-argocd-service-check | `time() - topk(1, timestamp(argocd_cluster_info)) >= 300` |
| infra-elasticsearch-snapshot-failure | `time() - max by (alert_team,name,stack) (elasticsearch_exporter_elasticsearch_snapshot_stats_snapshot_end_time_timestamp{stack!~"tdx"}) > 90000` |
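Both subtract a timestamp from time() to detect staleness: the ArgoCD alert uses the sample timestamp of argocd_cluster_info, and the Elasticsearch alert uses the metric value, which is itself a snapshot end timestamp. Two blockers: (a) the underlying metrics aren't in DD, and (b) time() arithmetic has no direct DD equivalent, so these need DD no-data monitors or custom metrics.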
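If the custom-metric route is taken, the staleness check reduces to simple arithmetic on a "last seen" timestamp. The sketch below mirrors the two PromQL conditions from the table, including their differing comparison operators. The function names are hypothetical, and how the timestamps would be obtained and submitted to DD is left open.

```python
ARGOCD_MAX_AGE_S = 300        # alert when age >= 300 s (5 minutes)
ES_SNAPSHOT_MAX_AGE_S = 90000  # alert when age > 90000 s (25 hours)


def argocd_is_stale(last_sample_ts: float, now: float) -> bool:
    """Mirror of: time() - timestamp(argocd_cluster_info) >= 300."""
    return now - last_sample_ts >= ARGOCD_MAX_AGE_S


def es_snapshot_is_stale(last_snapshot_end_ts: float, now: float) -> bool:
    """Mirror of: time() - <snapshot end timestamp> > 90000."""
    return now - last_snapshot_end_ts > ES_SNAPSHOT_MAX_AGE_S


# Illustrative values only.
now = 1_000_000.0
print(argocd_is_stale(now - 299, now))         # age 299 s
print(es_snapshot_is_stale(now - 90_001, now))  # age 90001 s
```

A custom check could emit the age in seconds as a gauge, which turns both alerts into ordinary threshold monitors.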
Alert: infra-gs-availability-kafka-broker-down
(max(kafka_brokers{}) by (region, role, alert_team, stack, service)
< max(kafka_brokers{} offset 10m) by (region, role, alert_team, stack, service))
or
(count(count by (broker_id, region, role, alert_team, stack, service)
(kafka_server_replica_fetcher_metrics_connection_count{})) by (region, role, alert_team, stack, service)
< count(count by (broker_id, region, role, alert_team, stack, service)
(kafka_server_replica_fetcher_metrics_connection_count{} offset 10m)) by (region, role, alert_team, stack, service))
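Uses an offset comparison (current value vs. 10 minutes ago) to detect broker loss. Blocker: DD has kafka.broker_offset but no kafka_brokers count metric, and the PromQL offset modifier has no direct counterpart, so the comparison would have to be rebuilt with DD change detection monitors.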
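The logic of the Kafka alert can be restated as a small predicate, which may help when rebuilding it outside PromQL. This is a sketch of the condition for a single label group, assuming the inputs have already been aggregated per (region, role, alert_team, stack, service). The function and parameter names are hypothetical.

```python
from typing import Iterable


def kafka_broker_down(
    brokers_now: int,
    brokers_10m_ago: int,
    fetcher_broker_ids_now: Iterable[str],
    fetcher_broker_ids_10m_ago: Iterable[str],
) -> bool:
    """True if either branch of the PromQL `or` would fire.

    Branch 1: max(kafka_brokers) dropped compared with 10 minutes ago.
    Branch 2: fewer distinct broker_id values report
              kafka_server_replica_fetcher_metrics_connection_count
              than 10 minutes ago.
    """
    count_dropped = brokers_now < brokers_10m_ago
    reporters_dropped = len(set(fetcher_broker_ids_now)) < len(
        set(fetcher_broker_ids_10m_ago)
    )
    return count_dropped or reporters_dropped


# Illustrative case: broker count unchanged, but broker "3" stopped reporting.
print(kafka_broker_down(3, 3, ["1", "2"], ["1", "2", "3"]))
```

Both branches compare a current value with its own past, which is why a plain threshold monitor cannot express the alert.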
Alert: infra-zookeeper-java-gc-collection-time
max(irate(zookeeper_java_garbage_collector_CollectionTime[5m]))
by (alert_team, instance, name, stack, service, role, region) > 2000
Blocker: zookeeper_java_garbage_collector_CollectionTime is from the ZK Prometheus exporter. DD's ZooKeeper integration doesn't expose JVM GC metrics.
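Unlike `rate()`, PromQL `irate()` looks only at the last two samples in the window, so it reacts to short spikes that a 5-minute average would smooth out. The sketch below is a simplified model of that behaviour with illustrative sample data. It is not a Datadog query, and any DD replacement would need to approximate this instantaneous behaviour explicitly.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_irate(samples: List[Sample]) -> Optional[float]:
    """Per-second increase between the last two samples only.

    Simplified model of PromQL irate(): a drop in value is treated as a
    counter reset, so the new value counts as the increase.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


# Illustrative data: a quiet interval followed by a GC-heavy interval.
gc_samples = [(0, 0), (30, 1000), (60, 91000)]
value = simple_irate(gc_samples)
print(value, value > 2000)
```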
Alert: voice-evolution-freeswitch-process-memory-growth
avg by (region) (delta(process_resident_memory_bytes{role="voice-evolution", service="freeswitch"}[15m]))
> 524288000
Uses delta() to detect resident-memory growth of more than 500 MiB (524,288,000 bytes) over 15 minutes. Blocker: process_resident_memory_bytes from node_exporter/process-exporter isn't collected by DD for FreeSWITCH hosts.
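The threshold and the growth check can be verified with a few lines of arithmetic. The sketch below models `avg by (region) (delta(...[15m]))` for one region as the mean of per-instance differences between the first and last sample in the window. It omits the boundary extrapolation Prometheus applies, and the names and sample values are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, bytes)

THRESHOLD_BYTES = 524288000
assert THRESHOLD_BYTES == 500 * 1024 * 1024  # 500 MiB


def simple_delta(samples: List[Sample]) -> float:
    """Last value minus first value in the window (gauge semantics)."""
    return samples[-1][1] - samples[0][1]


def region_memory_growth_alert(instances: List[List[Sample]]) -> bool:
    """Average the 15-minute deltas of all instances in one region."""
    deltas = [simple_delta(series) for series in instances]
    return sum(deltas) / len(deltas) > THRESHOLD_BYTES


MIB = 1024 * 1024
# Two instances in one region: one grows by 700 MiB, one by 400 MiB.
region = [
    [(0, 1000 * MIB), (900, 1700 * MIB)],
    [(0, 1000 * MIB), (900, 1400 * MIB)],
]
print(region_memory_growth_alert(region))  # average growth is 550 MiB
```

Because the alert averages per region, a single leaking instance can be masked by stable neighbours, which is worth keeping in mind when choosing the DD aggregation.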
| Pattern | Count | Groups | Difficulty |
|---|---|---|---|
| `sum(metric) by (...)`: simple aggregation | 5 | Trino percentiles | direct |
| `sum(rate(counter[5m])) by (...)`: rate of counter | 6 | io.airlift + trino insufficient | partial |
| `time() - timestamp(metric)`: staleness | 2 | ArgoCD, ES snapshots | custom |
| `max(metric) < max(metric offset Xm)`: change detection | 1 | Kafka broker down | custom |
| `max(irate(counter[5m]))`: instantaneous rate | 1 | ZK GC | partial |
| `avg by (...) (delta(gauge[Xm]))`: growth detection | 1 | FreeSWITCH memory | custom |
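For the pattern rated "direct", the translation could look like the sketch below, which formats a Datadog metric monitor query of the general shape `time_aggr(window):space_aggr:metric{scope} by {tags} > threshold`. The DD metric name is hypothetical because the metric is not yet collected. The `max` time aggregator and `last_5m` window are assumptions that are not taken from the source alerts. The `env:stg` scope comes from the report header.

```python
from typing import Sequence

GROUP_BY = ("alert_team", "stack", "service", "role", "region")


def dd_threshold_query(
    metric: str,
    threshold: int,
    group_by: Sequence[str] = GROUP_BY,
    scope: str = "env:stg",
    time_aggr: str = "max",   # assumption: not specified by the source
    window: str = "last_5m",  # assumption: not specified by the source
) -> str:
    """Format a Datadog metric monitor query for sum(<metric>) by (...)."""
    tags = ",".join(group_by)
    return (
        f"{time_aggr}({window}):sum:{metric}{{{scope}}} "
        f"by {{{tags}}} > {threshold}"
    )


# Hypothetical DD name for
# trino_execution_QueryManager_ExecutionTime_OneMinute_P50.
query = dd_threshold_query(
    "trino.execution.querymanager.executiontime.oneminute.p50", 300000
)
print(query)
```

The remaining patterns need more than string substitution, which is what the "partial" and "custom" ratings reflect.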
| Blocker | Monitors | Action Needed |
|---|---|---|
| JMX MBeans not in DD check config | 11 (Groups 1+2+3) | Add trino_execution_* and io_airlift_* to DD JMX integration |
| Metric absent from DD integration | 3 (ArgoCD, ZK GC, FreeSWITCH) | Enable/extend respective DD integrations or use custom metrics |
| No DD equivalent for PromQL pattern | 3 (time(), offset) | Use DD no-data monitors, change detection, or custom approach |
| Metric not collected at all | 1 (ES snapshots) | Custom metric or DD process check |
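11 of 16 (69%) are blocked by Trino JMX configuration alone, so enabling those MBeans in the DD JMX check would unblock the majority. The monitor counts in the blocker table add up to 18 rather than 16 because the two staleness monitors (ArgoCD and ES snapshots) each appear under two blockers.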