Flexible per-alert silences in Grafana

Overview

Certain expected behaviors in the infrastructure can cause alerts to fire unnecessarily, generating noise and contributing to alert fatigue.

For example, database backups can push the number of disk operations above what is expected during business hours, but they don't cause issues that require human action because they usually run outside business hours.

If these events happen during specific time periods, the resulting alerts can be ignored programmatically.

Unfortunately, neither Grafana's mute timings nor its silences offer a way to silence a specific alert during flexible, recurring time periods (e.g. Thursdays between 1:00 AM and 1:30 AM). Silences only support continuous time periods, and mute timings can't filter specific alerts.

If either of these mechanisms supported both filtering by labels (to target a specific alert) and a schedule (to describe the time periods), this workaround wouldn't be necessary.

Implementation

Using a Prometheus-compatible backend, it's possible to implement per-alert silences in Grafana with the help of PromQL functions:

hour()
minute()
day_of_week()
day_of_month()
day_of_year()
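
To make the return values of these functions concrete, here is a minimal Python sketch that mirrors them for a given instant. The helper name `promql_time_parts` is hypothetical and exists only for illustration. It assumes the PromQL conventions that all values are computed in UTC and that day_of_week() returns 0 for Sunday through 6 for Saturday.

```python
from datetime import datetime, timezone


def promql_time_parts(ts: datetime) -> dict:
    """Approximate what the PromQL time functions return for a timezone-aware instant."""
    ts = ts.astimezone(timezone.utc)  # PromQL time functions work in UTC
    return {
        "hour": ts.hour,                        # 0-23
        "minute": ts.minute,                    # 0-59
        "day_of_week": ts.isoweekday() % 7,     # 0 = Sunday ... 6 = Saturday
        "day_of_month": ts.day,                 # 1-31
        "day_of_year": ts.timetuple().tm_yday,  # 1-365 (366 in leap years)
    }


# Thursday, 4 January 2024, 01:15 UTC
parts = promql_time_parts(datetime(2024, 1, 4, 1, 15, tzinfo=timezone.utc))
print(parts)
```

Keeping the UTC and Sunday-is-zero conventions in mind helps avoid off-by-one mistakes when translating a local schedule into the Math expression.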

One additional query is added to the alert for each PromQL function you need, and a final Math expression combines them.

Example 1

Do not fire alert if time is between 1:00 and 1:59 UTC (that is, if hour is equal to 1).

Query A: sum(traefik_entrypoint_open_connections) by (entrypoint) (This is our main query)
Query B: hour()
Reduce expression C: max(), input: A
Math expression D: $C > 100 && $B != 1

A and C depend on your original alert. B is a new addition, and D is either new or a modification of a Math expression your alert already had.
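
The logic of this example can be checked outside Grafana with a small Python sketch. It is only a model of the condition "value above 100 and hour not equal to 1", not something Grafana runs. The function name `should_fire` and the sample values are hypothetical.

```python
from datetime import datetime, timezone

THRESHOLD = 100    # threshold used in the Math expression
SILENCED_HOUR = 1  # hour() value during which the alert must stay quiet


def should_fire(open_connections: float, now: datetime) -> bool:
    """Model of the Math expression: value above threshold AND hour != 1 (UTC)."""
    hour = now.astimezone(timezone.utc).hour  # stands in for Query B: hour()
    return open_connections > THRESHOLD and hour != SILENCED_HOUR


# Same high value, evaluated inside and outside the silenced hour
fires_at_0130 = should_fire(250, datetime(2024, 1, 4, 1, 30, tzinfo=timezone.utc))
fires_at_0200 = should_fire(250, datetime(2024, 1, 4, 2, 0, tzinfo=timezone.utc))
print(fires_at_0130, fires_at_0200)
```

Because the time check is ANDed with the threshold check, the alert is suppressed for the whole hour even when the value stays above the threshold.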

Example 2

A more complex example: ignore alerts on Mondays and Wednesdays between 3:50 and 4:10 UTC.

Query A: sum(traefik_entrypoint_open_connections) by (entrypoint) (This is our main query)
Query B: day_of_week()
Query C: hour()
Query D: minute()
Reduce expression E: max(), input: A
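Math expression F: $E > 100 && !(($B == 1 || $B == 3) && (($C == 3 && $D >= 50) || ($C == 4 && $D < 10))) (notice the ! that negates the time period we just described; day_of_week() returns 1 for Monday and 3 for Wednesday, and the threshold is compared against the reduced value E, not the hour in C)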
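
Since a window that spans two hours is easy to get wrong, it can help to model it in Python first. This is a sketch only, with hypothetical function names. It assumes the window runs from 3:50 (inclusive) to 4:10 (exclusive) UTC, with Sunday counted as day 0 as in day_of_week().

```python
from datetime import datetime, timezone


def in_silence_window(now: datetime) -> bool:
    """True on Mondays and Wednesdays from 03:50 up to (not including) 04:10 UTC."""
    now = now.astimezone(timezone.utc)
    day = now.isoweekday() % 7  # 0 = Sunday, 1 = Monday, 3 = Wednesday
    on_silenced_day = day in (1, 3)
    # The window crosses an hour boundary, so each hour needs its own minute check
    in_time_range = (now.hour == 3 and now.minute >= 50) or (
        now.hour == 4 and now.minute < 10
    )
    return on_silenced_day and in_time_range


def should_fire(open_connections: float, now: datetime) -> bool:
    """Value above threshold AND NOT inside the silence window."""
    return open_connections > 100 and not in_silence_window(now)


monday_0355 = should_fire(250, datetime(2024, 1, 1, 3, 55, tzinfo=timezone.utc))
tuesday_0355 = should_fire(250, datetime(2024, 1, 2, 3, 55, tzinfo=timezone.utc))
print(monday_0355, tuesday_0355)
```

Checking a few timestamps at the edges of the window (3:49, 3:50, 4:09, 4:10) is a cheap way to confirm that the comparison operators match the schedule you intended.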

Conclusion

Hopefully Grafana will improve its alerting mechanism to allow these silences to be specified more dynamically.

It's also a good idea to label alerts that use this kind of silence, since they aren't managed in a central location like mute timings or regular silences. For example, adding a label such as silence_schedule=true lets you filter them easily when you need to edit them in the future.
