-
-
Save LittleNewton/7a0d007c9f986bf4edd385f839b0cf19 to your computer and use it in GitHub Desktop.
Prometheus alert rules for node exporter
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
groups: | |
- name: node_exporter_alerts | |
rules: | |
- alert: Node down | |
expr: up{job="monitoring-pi"} == 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
title: Node {{ $labels.instance }} is down | |
description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 2 minutes. Node seems down. | |
- alert: HostOutOfMemory | |
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host out of memory (instance {{ $labels.instance }}) | |
description: Node memory is filling up (< 10% left)\n VALUE = {{ $value }} | |
- alert: HostMemoryUnderMemoryPressure | |
expr: rate(node_vmstat_pgmajfault[1m]) > 1000 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host memory under memory pressure (instance {{ $labels.instance }}) | |
description: The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }} | |
- alert: HostUnusualNetworkThroughputIn | |
expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 | |
for: 5m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual network throughput in (instance {{ $labels.instance }}) | |
description: Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }} | |
- alert: HostUnusualNetworkThroughputOut | |
expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 | |
for: 5m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual network throughput out (instance {{ $labels.instance }}) | |
description: Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }} | |
- alert: HostUnusualDiskReadRate | |
expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50 | |
for: 5m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual disk read rate (instance {{ $labels.instance }}) | |
description: Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }} | |
- alert: HostUnusualDiskWriteRate | |
expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual disk write rate (instance {{ $labels.instance }}) | |
description: Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }} | |
# Please add ignored mountpoints in node_exporter parameters like | |
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)". | |
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users. | |
- alert: HostOutOfDiskSpace | |
expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host out of disk space (instance {{ $labels.instance }}) | |
description: Disk is almost full (< 10% left)\n VALUE = {{ $value }} | |
# Please add ignored mountpoints in node_exporter parameters like | |
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)". | |
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users. | |
- alert: HostDiskWillFillIn24Hours | |
expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host disk will fill in 24 hours (instance {{ $labels.instance }}) | |
description: Filesystem is predicted to run out of space within the next 24 hours at current write rate\n VALUE = {{ $value }} | |
- alert: HostOutOfInodes | |
expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host out of inodes (instance {{ $labels.instance }}) | |
description: Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }} | |
- alert: HostInodesWillFillIn24Hours | |
expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{mountpoint="/rootfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }}) | |
description: Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }} | |
- alert: HostUnusualDiskReadLatency | |
expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual disk read latency (instance {{ $labels.instance }}) | |
description: Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }} | |
- alert: HostUnusualDiskWriteLatency | |
expr: rate(node_disk_write_time_seconds_totali{device!~"mmcblk.+"}[1m]) / rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0.1 and rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host unusual disk write latency (instance {{ $labels.instance }}) | |
description: Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }} | |
- alert: HostHighCpuLoad | |
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host high CPU load (instance {{ $labels.instance }}) | |
description: CPU load is > 80%\n VALUE = {{ $value }} | |
- alert: HostCpuStealNoisyNeighbor | |
expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }}) | |
description: CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n VALUE = {{ $value }} | |
# 1000 context switches is an arbitrary number. | |
# Alert threshold depends on nature of application. | |
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58 | |
- alert: HostContextSwitching | |
expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host context switching (instance {{ $labels.instance }}) | |
description: Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }} | |
- alert: HostSwapIsFillingUp | |
expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host swap is filling up (instance {{ $labels.instance }}) | |
description: Swap is filling up (>80%)\n VALUE = {{ $value }} | |
- alert: HostSystemdServiceCrashed | |
expr: node_systemd_unit_state{state="failed"} == 1 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host SystemD service crashed (instance {{ $labels.instance }}) | |
description: SystemD service crashed\n VALUE = {{ $value }} | |
- alert: HostPhysicalComponentTooHot | |
expr: node_hwmon_temp_celsius > 75 | |
for: 5m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host physical component too hot (instance {{ $labels.instance }}) | |
description: Physical hardware component too hot\n VALUE = {{ $value }} | |
- alert: HostNodeOvertemperatureAlarm | |
expr: node_hwmon_temp_crit_alarm_celsius == 1 | |
for: 0m | |
labels: | |
severity: critical | |
annotations: | |
summary: Host node overtemperature alarm (instance {{ $labels.instance }}) | |
description: Physical node temperature alarm triggered\n VALUE = {{ $value }} | |
- alert: HostRaidArrayGotInactive | |
expr: node_md_state{state="inactive"} > 0 | |
for: 0m | |
labels: | |
severity: critical | |
annotations: | |
summary: Host RAID array got inactive (instance {{ $labels.instance }}) | |
description: RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n VALUE = {{ $value }} | |
- alert: HostRaidDiskFailure | |
expr: node_md_disks{state="failed"} > 0 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host RAID disk failure (instance {{ $labels.instance }}) | |
description: At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }} | |
- alert: HostKernelVersionDeviations | |
expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1 | |
for: 6h | |
labels: | |
severity: warning | |
annotations: | |
summary: Host kernel version deviations (instance {{ $labels.instance }}) | |
description: Different kernel versions are running\n VALUE = {{ $value }} | |
- alert: HostOomKillDetected | |
expr: increase(node_vmstat_oom_kill[1m]) > 0 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host OOM kill detected (instance {{ $labels.instance }}) | |
description: OOM kill detected\n VALUE = {{ $value }} | |
- alert: HostEdacCorrectableErrorsDetected | |
expr: increase(node_edac_correctable_errors_total[1m]) > 0 | |
for: 0m | |
labels: | |
severity: info | |
annotations: | |
summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }}) | |
description: Instance has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }} | |
- alert: HostEdacUncorrectableErrorsDetected | |
expr: node_edac_uncorrectable_errors_total > 0 | |
for: 0m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }}) | |
description: Instance has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }} | |
- alert: HostNetworkReceiveErrors | |
expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host Network Receive Errors (instance {{ $labels.instance }}:{{ $labels.device }}) | |
description: Instance interface has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }} | |
- alert: HostNetworkTransmitErrors | |
expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host Network Transmit Errors (instance {{ $labels.instance }}:{{ $labels.device }}) | |
description: Instance has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }} | |
- alert: HostNetworkInterfaceSaturated | |
expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 | |
for: 1m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host Network Interface Saturated (instance {{ $labels.instance }}:{{ $labels.interface }}) | |
description: The network interface is getting overloaded.\n VALUE = {{ $value }} | |
- alert: HostConntrackLimit | |
expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8 | |
for: 5m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host conntrack limit (instance {{ $labels.instance }}) | |
description: The number of conntrack is approching limit\n VALUE = {{ $value }} | |
- alert: HostClockSkew | |
expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0) | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host clock skew (instance {{ $labels.instance }}) | |
description: Clock skew detected. Clock is out of sync.\n VALUE = {{ $value }} | |
- alert: HostClockNotSynchronising | |
expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16 | |
for: 2m | |
labels: | |
severity: warning | |
annotations: | |
summary: Host clock not synchronising (instance {{ $labels.instance }}) | |
description: Clock not synchronising.\n VALUE = {{ $value }} | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment