STG Unvalidated Monitors — PromQL Grouping Analysis

16 monitors | env=stg | not excluded | not validated
Generated: 2026-04-21

Group 1: Trino JMX Percentile Metrics (5 monitors) — `direct` difficulty

All use the same pattern: sum(<metric>) by (alert_team, stack, service, role, region) > threshold

| Alert Name | Metric | Threshold |
| --- | --- | --- |
| `infra-gs-latency-trino-execution-time-p50-1m` | `trino_execution_QueryManager_ExecutionTime_OneMinute_P50` | > 300,000 |
| `infra-gs-latency-trino-execution-time-p50-5m` | `trino_execution_QueryManager_ExecutionTime_FiveMinutes_P50` | > 300,000 |
| `infra-gs-latency-trino-execution-time-p75-5m` | `trino_execution_QueryManager_ExecutionTime_FiveMinutes_P75` | > 300,000 |
| `infra-gs-latency-trino-execution-time-p95-5m` | `trino_execution_QueryManager_ExecutionTime_FiveMinutes_P95` | > 600,000 |
| `infra-gs-latency-trino-queued-time-p75-1m` | `trino_execution_QueryManager_QueuedTime_OneMinute_P75` | > 1,000 |

Blocker: These trino_execution_QueryManager_* metrics come from JMX MBeans exposed through a Prometheus JMX exporter; the Datadog JMX integration does not collect them. Fix: add these MBeans to the DD JMX check config.
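Once the MBeans are collected, the Group 1 pattern should map fairly directly onto a Datadog metric-monitor query of the form `time_aggr(window):space_aggr:metric{scope} by {tags} > threshold`. The sketch below only builds that query string. The Datadog metric name is hypothetical (it depends on the alias chosen in the JMX check config), and the `max(last_5m)` time aggregation and the assumption that the Prometheus labels exist as Datadog tags are illustrative choices, not something taken from the monitors themselves.

```python
GROUP_BY = ["alert_team", "stack", "service", "role", "region"]


def dd_monitor_query(metric, threshold, group_by=GROUP_BY, scope="env:stg",
                     time_aggr="max", window="last_5m", space_aggr="sum"):
    """Build a Datadog metric-monitor query string for a simple sum-by pattern."""
    tags = ",".join(group_by)
    # Doubled braces emit literal "{" and "}" around the scope and the tag list.
    return (f"{time_aggr}({window}):{space_aggr}:{metric}"
            f"{{{scope}}} by {{{tags}}} > {threshold}")


# Hypothetical alias for trino_execution_QueryManager_ExecutionTime_OneMinute_P50.
query = dd_monitor_query("trino.execution.execution_time.one_minute.p50", 300000)
print(query)
```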

Group 2: Trino Gateway io.airlift HTTP Client Metrics (5 monitors) — `partial` difficulty

All use the same pattern: sum(rate(<metric>[5m])) by (alert_team, stack, service, role, region) > threshold

| Alert Name | Metric | Threshold |
| --- | --- | --- |
| `infra-gs-errors-trino-gateway-proxy-request-failed` | `io_airlift_http_client_type_HttpClient_name_ForProxy_RequestFailed` | > 10 |
| `infra-gs-errors-trino-gateway-router-request-failed` | `io_airlift_http_client_type_HttpClient_name_ForRouter_RequestFailed` | > 10 |
| `infra-gs-errors-trino-gateway-router-5xx` | `io_airlift_http_client_type_HttpClient_name_ForRouter_5xxResponse` | > 5 |
| `infra-gs-errors-trino-gateway-router-4xx` | `io_airlift_http_client_type_HttpClient_name_ForRouter_4xxResponse` | > 5 |
| `infra-gs-errors-trino-gateway-router-3xx` | `io_airlift_http_client_type_HttpClient_name_ForRouter_3xxResponse` | > 5 |

Blocker: The io_airlift_http_client_* metrics come from internal Trino/Starburst JMX MBeans that DD's JMX integration does not collect. Same fix as Group 1: extend the JMX check config.
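When translating these thresholds, it helps to remember that PromQL `rate()` returns a per-second value, so `> 10` means more than 10 failures per second, not 10 failures per window. The sketch below works through that arithmetic for the 5-minute window. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window edges, and the function names are illustrative.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range used by every Group 2 monitor


def simple_rate(first_value, last_value, window_seconds=WINDOW_SECONDS):
    """Per-second increase between the window edges (no reset handling)."""
    return (last_value - first_value) / window_seconds


def events_for_threshold(per_second_threshold, window_seconds=WINDOW_SECONDS):
    """Counter increase over the window that corresponds to a per-second threshold."""
    return per_second_threshold * window_seconds


# A counter that grows from 1000 to 4000 in 5 minutes sits exactly at 10/s.
print(simple_rate(1000, 4000))
print(events_for_threshold(10), events_for_threshold(5))
```

A Datadog monitor that evaluates a count over the window rather than a per-second rate would need the scaled threshold, not the original one.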

Group 3: Trino Execution Rate Counter (1 monitor) — `partial` difficulty

Alert: infra-gs-errors-trino-insufficient-resources

sum(rate(trino_execution_QueryManager_InsufficientResourcesFailures_TotalCount[5m]))
  by (alert_team, stack, service, role, region) > 1

Blocker: Same JMX family as Group 1 — trino_execution_* MBeans not in DD.

Group 4: `time()` Arithmetic Staleness Checks (2 monitors) — `custom` difficulty

| Alert Name | PromQL |
| --- | --- |
| `infra-argocd-service-check` | `time() - topk(1, timestamp(argocd_cluster_info)) >= 300` |
| `infra-elasticsearch-snapshot-failure` | `time() - max by (alert_team,name,stack) (elasticsearch_exporter_elasticsearch_snapshot_stats_snapshot_end_time_timestamp{stack!~"tdx"}) > 90000` |

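Both subtract a timestamp from time() to detect staleness: the ArgoCD check uses the sample timestamp of argocd_cluster_info (stale at 300 seconds), and the Elasticsearch check uses the snapshot end time carried in the metric value (stale after 90,000 seconds, i.e. 25 hours). Two blockers: (a) the underlying metrics aren't in DD; (b) time() arithmetic has no direct DD equivalent, so these need DD no-data monitors or custom metrics.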
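If the custom-metric route is taken, the staleness logic itself is small. The following is a minimal sketch of the comparison both monitors perform, assuming the last-seen timestamp is available as Unix seconds. The function and constant names are illustrative, and how the age would be submitted to DD is left out.

```python
import time

ARGOCD_MAX_AGE_S = 300          # infra-argocd-service-check uses >= 300
ES_SNAPSHOT_MAX_AGE_S = 90_000  # infra-elasticsearch-snapshot-failure uses > 90000 (25 h)


def is_stale(last_seen_ts, max_age_s, now=None, inclusive=False):
    """Return True when the age of last_seen_ts reaches or exceeds max_age_s."""
    now = time.time() if now is None else now
    age = now - last_seen_ts
    return age >= max_age_s if inclusive else age > max_age_s


now = 1_000_000
print(is_stale(now - 300, ARGOCD_MAX_AGE_S, now=now, inclusive=True))  # ArgoCD: boundary fires
print(is_stale(now - 90_000, ES_SNAPSHOT_MAX_AGE_S, now=now))          # ES: boundary does not
```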

Group 5: Kafka Broker Count Comparison (1 monitor) — `custom` difficulty

Alert: infra-gs-availability-kafka-broker-down

(max(kafka_brokers{}) by (region, role, alert_team, stack, service)
  < max(kafka_brokers{} offset 10m) by (region, role, alert_team, stack, service))
or
(count(count by (broker_id, region, role, alert_team, stack, service)
    (kafka_server_replica_fetcher_metrics_connection_count{})) by (region, role, alert_team, stack, service)
  < count(count by (broker_id, region, role, alert_team, stack, service)
    (kafka_server_replica_fetcher_metrics_connection_count{} offset 10m)) by (region, role, alert_team, stack, service))

Uses an offset comparison (current value vs. 10 minutes ago) to detect broker loss. Blockers: DD has kafka.broker_offset but no kafka_brokers count metric, and the offset comparison would have to be expressed as a DD change detection monitor.
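The two halves of the expression reduce to the same question asked of two data sources: is the count lower now than it was 10 minutes ago? A rough sketch of that logic for a single label group follows. The function name and inputs are illustrative, and PromQL's `or` operates on sets of series, so behaviour when one side has no data is not modelled here.

```python
def kafka_broker_down(brokers_now, brokers_10m_ago,
                      fetcher_broker_ids_now, fetcher_broker_ids_10m_ago):
    """Fire when either broker count dropped compared with 10 minutes ago."""
    # First half: max(kafka_brokers) < max(kafka_brokers offset 10m)
    reported_count_dropped = brokers_now < brokers_10m_ago
    # Second half: number of distinct broker_id values with fetcher metrics dropped
    distinct_now = len(set(fetcher_broker_ids_now))
    distinct_before = len(set(fetcher_broker_ids_10m_ago))
    return reported_count_dropped or distinct_now < distinct_before


print(kafka_broker_down(3, 3, [1, 2, 3], [1, 2, 3]))  # steady state
print(kafka_broker_down(3, 3, [1, 2], [1, 2, 3]))     # one broker stopped reporting
```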

Group 6: ZooKeeper JVM GC (1 monitor) — `partial` difficulty

Alert: infra-zookeeper-java-gc-collection-time

max(irate(zookeeper_java_garbage_collector_CollectionTime[5m]))
  by (alert_team, instance, name, stack, service, role, region) > 2000

Blocker: zookeeper_java_garbage_collector_CollectionTime is from the ZK Prometheus exporter. DD's ZooKeeper integration doesn't expose JVM GC metrics.
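Unlike `rate()`, `irate()` looks only at the last two samples inside the range, which matters when choosing a DD rollup that behaves similarly. A minimal sketch of that calculation, assuming samples arrive as `(timestamp_seconds, value)` pairs sorted by time (the function name and sample data are illustrative):

```python
def irate(samples):
    """Per-second rate from the last two samples, treating a decrease as a counter reset."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # After a reset the counter restarted from zero, so the last value is the increase.
    increase = v_last - v_prev if v_last >= v_prev else v_last
    return increase / (t_last - t_prev)


# Only the last two points count: (91000 - 31000) / 15 s.
print(irate([(0, 1000), (15, 31000), (30, 91000)]))
```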

Group 7: FreeSWITCH Process Memory (1 monitor) — `custom` difficulty

Alert: voice-evolution-freeswitch-process-memory-growth

avg by (region) (delta(process_resident_memory_bytes{role="voice-evolution", service="freeswitch"}[15m]))
  > 524288000

Uses delta() to alert when resident memory, averaged per region, grows by more than 524,288,000 bytes (500 MiB) in 15 minutes. Blocker: process_resident_memory_bytes from node_exporter/process-exporter isn't collected by DD for FreeSWITCH hosts.
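The threshold and the growth check can be verified with a few lines of arithmetic. This sketch assumes two RSS readings per instance taken 15 minutes apart and ignores the extrapolation that PromQL `delta()` applies. The instance names and values are made up for illustration.

```python
THRESHOLD_BYTES = 500 * 1024 * 1024  # 524288000, the literal used in the alert


def memory_growth_alert(rss_by_instance, threshold=THRESHOLD_BYTES):
    """rss_by_instance maps instance -> (rss_15m_ago, rss_now), both in bytes."""
    deltas = [now - before for before, now in rss_by_instance.values()]
    if not deltas:
        return False  # no series in the region, nothing to evaluate
    return sum(deltas) / len(deltas) > threshold


region = {
    "fs-1": (1_000_000_000, 1_600_000_000),  # grew by 600 MB
    "fs-2": (1_000_000_000, 1_500_000_000),  # grew by 500 MB
}
print(THRESHOLD_BYTES, memory_growth_alert(region))
```

Because the alert averages across instances in a region, a single leaking process can be masked by stable neighbours.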

Summary by PromQL Pattern

| Pattern | Count | Groups | Difficulty |
| --- | --- | --- | --- |
| `sum(metric) by (...)`: simple aggregation | 5 | Trino percentiles | direct |
| `sum(rate(counter[5m])) by (...)`: rate of counter | 6 | io.airlift + trino insufficient | partial |
| `time() - timestamp(metric)`: staleness | 2 | ArgoCD, ES snapshots | custom |
| `max(metric) < max(metric offset Xm)`: change detection | 1 | Kafka broker down | custom |
| `max(irate(counter[5m]))`: instantaneous rate | 1 | ZK GC | partial |
| `avg by (...) (delta(gauge[Xm]))`: growth detection | 1 | FreeSWITCH memory | custom |

Summary by Blocker Type

| Blocker | Monitors | Action Needed |
| --- | --- | --- |
| JMX MBeans not in DD check config | 11 (Groups 1+2+3) | Add the `trino_execution_*` and `io_airlift_*` MBeans to the DD JMX check config |
| Metric absent from DD integration | 3 (ArgoCD, ZK GC, FreeSWITCH) | Enable/extend respective DD integrations or use custom metrics |
| No DD equivalent for PromQL pattern | 3 (time(), offset) | Use DD no-data monitors, change detection, or custom approach |
| Metric not collected at all | 1 (ES snapshots) | Custom metric or DD process check |

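11 of 16 (69%) are blocked by Trino JMX configuration alone, so enabling those MBeans in the DD JMX check would unblock the majority. The Monitors column sums to 18 rather than 16 because the two Group 4 monitors (ArgoCD, ES snapshots) each appear under two blockers.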
