@webchi
Created January 2, 2022 13:38
Ansible all-in-one Prometheus monitoring setup with node_exporter, blackbox and alertmanager
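The playbook pulls four roles from the cloudalchemy namespace, so they need to be present on the control machine before it runs. A minimal sketch of a Galaxy requirements file follows; the file name `requirements.yml` is an assumption, and no versions are pinned because the gist does not state any.

```yaml
# requirements.yml (hypothetical file name, not part of the original gist)
# Lists the roles referenced under `roles:` in the playbook.
roles:
  - name: cloudalchemy.prometheus
  - name: cloudalchemy.grafana
  - name: cloudalchemy.alertmanager
  - name: cloudalchemy.blackbox-exporter
```

With such a file in place, `ansible-galaxy install -r requirements.yml` should fetch the roles.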
---
- name: Setup monitoring server
  hosts: all
  become: true
  roles:
    - cloudalchemy.prometheus
    - cloudalchemy.grafana
    - cloudalchemy.alertmanager
    - cloudalchemy.blackbox-exporter
  vars:
    grafana_url: "https://grafana.example.com"
    grafana_address: "127.0.0.1"
    grafana_security:
      admin_user: admin
      admin_password: password  # placeholder, replace before use
    prometheus_skip_install: true
    prometheus_version: 2.22.0
    prometheus_web_listen_address: "127.0.0.1:9090"
    prometheus_scrape_configs:
      - job_name: "prometheus"
        metrics_path: "{{ prometheus_metrics_path }}"
        static_configs:
          - targets:
              - "127.0.0.1:9090"
      - job_name: "node"
        file_sd_configs:
          - files:
              - "{{ prometheus_config_dir }}/file_sd/node.yml"
      - job_name: "blackbox"
        metrics_path: /probe
        params:
          module: [https_2xx]  # Look for an HTTP 200 response.
        static_configs:
          - targets:
              - https://google.com
        relabel_configs:
          # Pass the target URL as the ?target= parameter, keep it as the
          # instance label, then send the scrape to the local blackbox exporter.
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115
    prometheus_targets:
      node:
        - targets:
            - local-vm-1:9100
            - local-vm-2:9100
    prometheus_alertmanager_config:
      - basic_auth:
          username: basic
          password: auth
        static_configs:
          - targets:
              - alerts.example.com
    prometheus_alert_rules:
      # The annotations use the instance label: with this scrape config the
      # node_exporter series carry no nodename label.
      # CPU
      - alert: CpuHugeLoad
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15s
        labels:
          severity: warning
        annotations:
          summary: "{% raw %}CPU load on {{ $labels.instance }}{% endraw %}"
          description: "{% raw %}CPU load (5m) is HUGE\n VALUE = {{ $value }}\n LABELS: {{ $labels }}{% endraw %}"
      # Memory
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{% raw %}Host out of memory {{ $labels.instance }}{% endraw %}"
          description: "{% raw %}Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}{% endraw %}"
      # Disk
      - alert: DiskOutOfSpace
        expr: node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{% raw %}Out of disk space on {{ $labels.instance }}{% endraw %}"
          description: "{% raw %}Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}{% endraw %}"
    alertmanager_web_listen_address: "127.0.0.1:9093"
    alertmanager_web_external_url: "https://alerts.example.com"
    alertmanager_route:
      group_by: ["alertname"]
      group_wait: 20s
      group_interval: 5m
      repeat_interval: 3h
      receiver: discord_webhook
    alertmanager_receivers:
      - name: "discord_webhook"
        webhook_configs:
          - url: "http://127.0.0.1:9094"
    blackbox_exporter_web_listen_address: "127.0.0.1:9115"
    blackbox_exporter_configuration_modules:
      https_2xx:
        prober: http
        timeout: 15s
        http:
          method: GET
          no_follow_redirects: false
          fail_if_ssl: false
          fail_if_not_ssl: true
          preferred_ip_protocol: "ipv4"
          valid_status_codes: [200]
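The "node" job above scrapes `local-vm-1:9100` and `local-vm-2:9100`, but the playbook itself does not install node_exporter on those machines. A separate play along the following lines might cover that. This is only a sketch: the inventory group `monitored`, the role name `cloudalchemy.node_exporter` and the variable `node_exporter_web_listen_address` are assumptions that do not appear in the gist and should be checked against the role's documentation.

```yaml
---
# Hypothetical companion play, not part of the original gist.
- name: Install node_exporter on scraped hosts
  hosts: monitored                # assumed inventory group holding local-vm-1 and local-vm-2
  become: true
  roles:
    - cloudalchemy.node_exporter  # assumed role name
  vars:
    # Must be reachable from the Prometheus server on the port used in prometheus_targets.
    node_exporter_web_listen_address: "0.0.0.0:9100"
```

Unlike the monitoring services, which bind to 127.0.0.1, node_exporter has to listen on an address the Prometheus host can reach, so access to port 9100 would normally be restricted with a firewall.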