STG Unvalidated Monitors — PromQL Grouping Analysis (16 monitors)


16 monitors | env=stg | not excluded | not validated
Generated: 2026-04-21


Group 1: Trino JMX Percentile Metrics (5 monitors) — direct difficulty

All use the same pattern: sum(<metric>) by (alert_team, stack, service, role, region) > threshold

| Alert Name | Metric | Threshold |
|---|---|---|
| infra-gs-latency-trino-execution-time-p50-1m | trino_execution_QueryManager_ExecutionTime_OneMinute_P50 | > 300,000 |
| infra-gs-latency-trino-execution-time-p50-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P50 | > 300,000 |
| infra-gs-latency-trino-execution-time-p75-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P75 | > 300,000 |
| infra-gs-latency-trino-execution-time-p95-5m | trino_execution_QueryManager_ExecutionTime_FiveMinutes_P95 | > 600,000 |
| infra-gs-latency-trino-queued-time-p75-1m | trino_execution_QueryManager_QueuedTime_OneMinute_P75 | > 1,000 |

Blocker: These trino_execution_QueryManager_* JMX metrics are exposed via a Prometheus JMX exporter but are not collected by the Datadog (DD) JMX integration. Unblocking them requires adding these MBeans to the DD JMX check config.
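
For reference, the shared Group 1 template can be expanded into the full expression for any row of the table. This is a minimal sketch; the helper name `build_sum_query` is hypothetical and exists only for illustration.

```python
# Label set shared by all five Group 1 alerts (copied from the pattern above).
GROUP_BY = ("alert_team", "stack", "service", "role", "region")


def build_sum_query(metric: str, threshold: int) -> str:
    """Expand the Group 1 template into a full PromQL expression (hypothetical helper)."""
    labels = ", ".join(GROUP_BY)
    return f"sum({metric}) by ({labels}) > {threshold}"


# Example: the p95-5m alert from the table above.
query = build_sum_query(
    "trino_execution_QueryManager_ExecutionTime_FiveMinutes_P95", 600_000
)
print(query)
```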


Group 2: Trino Gateway io.airlift HTTP Client Metrics (5 monitors) — partial difficulty

All use the same pattern: sum(rate(<metric>[5m])) by (alert_team, stack, service, role, region) > threshold

| Alert Name | Metric | Threshold |
|---|---|---|
| infra-gs-errors-trino-gateway-proxy-request-failed | io_airlift_http_client_type_HttpClient_name_ForProxy_RequestFailed | > 10 |
| infra-gs-errors-trino-gateway-router-request-failed | io_airlift_http_client_type_HttpClient_name_ForRouter_RequestFailed | > 10 |
| infra-gs-errors-trino-gateway-router-5xx | io_airlift_http_client_type_HttpClient_name_ForRouter_5xxResponse | > 5 |
| infra-gs-errors-trino-gateway-router-4xx | io_airlift_http_client_type_HttpClient_name_ForRouter_4xxResponse | > 5 |
| infra-gs-errors-trino-gateway-router-3xx | io_airlift_http_client_type_HttpClient_name_ForRouter_3xxResponse | > 5 |

Blocker: The io_airlift_http_client_* metrics come from internal Trino/Starburst JMX MBeans that DD's JMX integration does not collect. The fix is the same as for Group 1: extend the JMX check config.
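
To make the rate(counter[5m]) part of the pattern concrete, here is a simplified sketch of a per-second counter rate. It is an approximation only: PromQL's rate() also extrapolates to the window boundaries, which this sketch skips. The function name `simple_rate` and the sample values are made up for illustration.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase across the samples, treating any drop as a counter reset."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Illustrative 5-minute window for one series: 3600 failures in 300 s.
window = [(0.0, 0.0), (150.0, 1500.0), (300.0, 3600.0)]
per_second = simple_rate(window)
firing = per_second > 10  # threshold used by the *-request-failed alerts
```

The real alerts then sum these per-series rates by (alert_team, stack, service, role, region) before comparing against the threshold.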


Group 3: Trino Execution Rate Counter (1 monitor) — partial difficulty

Alert: infra-gs-errors-trino-insufficient-resources

sum(rate(trino_execution_QueryManager_InsufficientResourcesFailures_TotalCount[5m]))
  by (alert_team, stack, service, role, region) > 1

Blocker: Same JMX family as Group 1 — trino_execution_* MBeans not in DD.


Group 4: time() Arithmetic Staleness Checks (2 monitors) — custom difficulty

| Alert Name | PromQL |
|---|---|
| infra-argocd-service-check | time() - topk(1, timestamp(argocd_cluster_info)) >= 300 |
| infra-elasticsearch-snapshot-failure | time() - max by (alert_team,name,stack) (elasticsearch_exporter_elasticsearch_snapshot_stats_snapshot_end_time_timestamp{stack!~"tdx"}) > 90000 |

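Both subtract a timestamp from time() to detect staleness. The ArgoCD check uses the sample timestamp of argocd_cluster_info and fires once the age reaches 300 s (5 minutes); the Elasticsearch check uses the metric's value, a snapshot end time, and fires once the age exceeds 90000 s (25 hours). Two blockers: (a) the underlying metrics aren't in DD, (b) time() arithmetic has no direct DD equivalent, so these need DD no-data monitors or custom metrics.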
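
The staleness logic is simple enough to restate outside PromQL, which is roughly what a custom metric or custom check would have to compute. A minimal sketch, assuming a hypothetical helper `age_seconds` and made-up timestamps:

```python
import time
from typing import Optional


def age_seconds(last_seen: float, now: Optional[float] = None) -> float:
    """Seconds elapsed since `last_seen`, mirroring `time() - <timestamp>`."""
    current = time.time() if now is None else now
    return current - last_seen


ARGOCD_MAX_AGE = 300          # fires at >= 300 s (5 minutes)
ES_SNAPSHOT_MAX_AGE = 90_000  # fires at > 90000 s (25 hours)

now = 1_700_000_000.0  # fixed "current time" so the example is deterministic

# Last argocd_cluster_info sample is 301 s old: the alert condition holds.
argocd_firing = age_seconds(now - 301, now) >= ARGOCD_MAX_AGE

# Last snapshot ended one hour ago: well inside the 25 h budget.
es_firing = age_seconds(now - 3600, now) > ES_SNAPSHOT_MAX_AGE
```

Note the two alerts use different comparators (>= versus >), so a migration should preserve that difference rather than share one check.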


Group 5: Kafka Broker Count Comparison (1 monitor) — custom difficulty

Alert: infra-gs-availability-kafka-broker-down

(max(kafka_brokers{}) by (region, role, alert_team, stack, service)
  < max(kafka_brokers{} offset 10m) by (region, role, alert_team, stack, service))
or
(count(count by (broker_id, region, role, alert_team, stack, service)
    (kafka_server_replica_fetcher_metrics_connection_count{})) by (region, role, alert_team, stack, service)
  < count(count by (broker_id, region, role, alert_team, stack, service)
    (kafka_server_replica_fetcher_metrics_connection_count{} offset 10m)) by (region, role, alert_team, stack, service))

Uses an offset comparison (current value vs. 10 minutes ago) to detect broker loss. Blocker: DD has kafka.broker_offset but no metric equivalent to the kafka_brokers count, and the PromQL offset modifier has to be expressed as a DD change detection monitor.
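
The two branches of the expression reduce to "fewer brokers now than 10 minutes ago", measured two ways. A rough sketch of that logic, with hypothetical function names and inputs; it ignores the per-label grouping and the case where a series is missing entirely:

```python
from typing import Iterable


def distinct_brokers(broker_ids: Iterable[str]) -> int:
    """Count distinct broker_id values, like count(count by (broker_id, ...))."""
    return len(set(broker_ids))


def broker_down(
    brokers_now: int,
    brokers_10m_ago: int,
    fetcher_ids_now: Iterable[str],
    fetcher_ids_10m_ago: Iterable[str],
) -> bool:
    """True if either branch of the PromQL `or` would return a result."""
    gauge_dropped = brokers_now < brokers_10m_ago
    fetchers_dropped = distinct_brokers(fetcher_ids_now) < distinct_brokers(
        fetcher_ids_10m_ago
    )
    return gauge_dropped or fetchers_dropped


# kafka_brokers is unchanged, but broker "3" no longer reports fetcher metrics.
alert = broker_down(3, 3, ["1", "2"], ["1", "2", "3"])
```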


Group 6: ZooKeeper JVM GC (1 monitor) — partial difficulty

Alert: infra-zookeeper-java-gc-collection-time

max(irate(zookeeper_java_garbage_collector_CollectionTime[5m]))
  by (alert_team, instance, name, stack, service, role, region) > 2000

Blocker: zookeeper_java_garbage_collector_CollectionTime is from the ZK Prometheus exporter. DD's ZooKeeper integration doesn't expose JVM GC metrics.
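
Unlike rate(), irate() looks only at the last two samples in the window, so it reacts to short spikes. A minimal sketch of that calculation; the name `instant_rate` and the sample values are illustrative assumptions:

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def instant_rate(samples: Sequence[Sample]) -> float:
    """Per-second rate from the last two samples only, as irate() does."""
    if len(samples) < 2:
        return 0.0
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return 0.0
    # A drop means the counter was reset; the last value is then the increase.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


# Quiet for most of the window, then a burst in the final 30 s.
gc_samples = [(0.0, 1000.0), (30.0, 4000.0), (60.0, 124000.0)]
gc_rate = instant_rate(gc_samples)
gc_firing = gc_rate > 2000  # threshold from the alert above
```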


Group 7: FreeSWITCH Process Memory (1 monitor) — custom difficulty

Alert: voice-evolution-freeswitch-process-memory-growth

avg by (region) (delta(process_resident_memory_bytes{role="voice-evolution", service="freeswitch"}[15m]))
  > 524288000

Uses delta() to alert when resident memory, averaged per region, grows by more than 500 MiB (524288000 bytes) over 15 minutes. Blocker: process_resident_memory_bytes from node_exporter/process-exporter isn't collected by DD for FreeSWITCH hosts.
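
The threshold arithmetic and the growth check can be sketched as follows. This simplifies delta() to "last value minus first value in the window" (PromQL also extrapolates to the window edges), and the helper names and byte values are made up for illustration:

```python
from statistics import mean
from typing import List, Tuple

THRESHOLD_BYTES = 500 * 1024 * 1024  # 524288000, the literal in the alert

# (first, last) resident memory in bytes over a 15-minute window, per instance.
Window = Tuple[int, int]


def growth(window: Window) -> int:
    """Simplified delta(): change between the first and last value."""
    first, last = window
    return last - first


def region_firing(windows: List[Window]) -> bool:
    """avg by (region) of the per-instance growth, compared to the threshold."""
    return mean([growth(w) for w in windows]) > THRESHOLD_BYTES


# Two FreeSWITCH instances in one region growing by 600 MB and 500 MB.
region = [(1_000_000_000, 1_600_000_000), (2_000_000_000, 2_500_000_000)]
firing = region_firing(region)
```

Because the alert averages per region, one leaking instance can be masked by stable neighbours, which is worth keeping in mind when choosing the DD aggregation.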


Summary by PromQL Pattern

| Pattern | Description | Count | Groups | Difficulty |
|---|---|---|---|---|
| sum(metric) by (...) | simple aggregation | 5 | Trino percentiles | direct |
| sum(rate(counter[5m])) by (...) | rate of counter | 6 | io.airlift + trino insufficient | partial |
| time() - timestamp(metric) | staleness | 2 | ArgoCD, ES snapshots | custom |
| max(metric) < max(metric offset Xm) | change detection | 1 | Kafka broker down | custom |
| max(irate(counter[5m])) | instantaneous rate | 1 | ZK GC | partial |
| avg by (...) (delta(gauge[Xm])) | growth detection | 1 | FreeSWITCH memory | custom |

Summary by Blocker Type

| Blocker | Monitors | Action Needed |
|---|---|---|
| JMX MBeans not in DD check config | 11 (Groups 1+2+3) | Add trino_execution_* and io_airlift_* to DD JMX integration |
| Metric absent from DD integration | 3 (ArgoCD, ZK GC, FreeSWITCH) | Enable/extend respective DD integrations or use custom metrics |
| No DD equivalent for PromQL pattern | 3 (time(), offset) | Use DD no-data monitors, change detection, or custom approach |
| Metric not collected at all | 1 (ES snapshots) | Custom metric or DD process check |

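The monitor counts above sum to 18 rather than 16 because the two staleness monitors (ArgoCD, ES snapshots) are each counted under a metric blocker and again under the PromQL pattern blocker.

11 of 16 (69%) are blocked by Trino JMX configuration alone; enabling those MBeans in the DD JMX check would unblock the majority.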
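
The headline figure can be rechecked from the group sizes listed in this analysis. A small tally, using only counts stated above:

```python
# Monitors per group, as listed in the group headings.
GROUP_SIZES = {1: 5, 2: 5, 3: 1, 4: 2, 5: 1, 6: 1, 7: 1}
TRINO_JMX_GROUPS = (1, 2, 3)

total = sum(GROUP_SIZES.values())
trino_blocked = sum(GROUP_SIZES[g] for g in TRINO_JMX_GROUPS)
share_percent = round(100 * trino_blocked / total)  # 68.75 rounds to 69

# Monitor counts from the blocker table; the Group 4 monitors appear twice.
BLOCKER_ROW_COUNTS = [11, 3, 3, 1]
double_counted = sum(BLOCKER_ROW_COUNTS) - total

print(f"{trino_blocked} of {total} ({share_percent}%) blocked by Trino JMX config")
```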
