Clone this repo:
git clone https://gist.github.com/08be6d6e7605a43fe52d1f201c2b47d8.git
cd 08be6d6e7605a43fe52d1f201c2b47d8
Start the docker stack:
docker-compose up -d
And visit http://localhost:8080 to check current alerts.
To simulate a downtime case, you can stop the monitored service (rabbitmq):
$ docker-compose scale rabbitmq=0
Then wait a few seconds and check for a new alert:
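To bring RabbitMQ back and let the alert clear on its own, scale the service up again:
$ docker-compose scale rabbitmq=1
After a few seconds the alert should disappear from the UI at http://localhost:8080.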
This is a small experiment to get familiar with alerting tools for Prometheus. AlertManager allows us to trigger alerts based on Prometheus metric values and deliver them to several destinations, such as Slack, email, PagerDuty, OpsGenie, or custom webhooks.
As shown in the diagram, each of our apps is monitored by Prometheus, which scrapes relevant metrics about the health of each service (such as uptime, CPU usage, number of requests, ...). These metrics can be displayed in a fancy Grafana dashboard, and they are also evaluated against alerting rules; when a rule matches, Prometheus fires an alert and hands it to AlertManager, which decides where to send it. These alerts can trigger email and Slack notifications, and can be concentrated in a single UI to track our incidents in real time.
The first file to look at is docker-compose.yml. It contains a small set of services, matching the previous diagram:
I chose RabbitMQ as the monitored service for no particular reason, other than it being easy to run and easy to monitor with Prometheus.
As we can see in the first two services, Prometheus fetches metrics from an exporter container, in this case called rabbitmq-exporter. This container pulls metrics from RabbitMQ (via the RABBIT_URL variable) and publishes them in a format that Prometheus understands:
# docker-compose.yml
rabbitmq:
  image: rabbitmq:3.7.8-management-alpine
  restart: always

rabbitmq-exporter:
  image: kbudde/rabbitmq-exporter:v0.29.0
  restart: always
  environment:
    RABBIT_URL: "http://rabbitmq:15672"
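If you are curious about what the exporter actually publishes, you can hit its /metrics endpoint from inside the Compose network, for example through the Prometheus container, whose busybox base image includes wget. This is just a quick sanity check, assuming the exporter listens on port 9090 as the scrape config below expects:
$ docker-compose exec prometheus wget -qO- http://rabbitmq-exporter:9090/metrics | grep rabbitmq_up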
Then, we have a prometheus container that reads the metrics exposed by rabbitmq-exporter:
# docker-compose.yml
prometheus:
  image: prom/prometheus:v2.6.0
  restart: always
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
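Once the stack is up, both mounted files can be validated with promtool, which ships inside the Prometheus image. This is an optional sanity check, using the container paths from the volumes above:
$ docker-compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
$ docker-compose exec prometheus promtool check rules /etc/prometheus/alerting_rules.yml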
The link between prometheus and rabbitmq-exporter is configured in the prometheus.yml file, which is mounted into the Prometheus container as a volume:
# prometheus.yml
scrape_configs:
  - job_name: 'rabbitmq-test'
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ['rabbitmq-exporter:9090']
This tells Prometheus to fetch metrics from the RabbitMQ exporter every second.
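The scrape job is only part of prometheus.yml. For the alerting pieces described next, the same file also has to load the rules and point Prometheus at AlertManager. That part is not shown in the post, but with the container names used in this stack it would look roughly like this (a sketch of the standard rule_files and alerting sections, not a copy of the repo's file):
# prometheus.yml (remaining sections, sketched)
rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']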
The next step is to configure the alerting rules. These are part of the Prometheus configuration: they are defined in the alerting_rules.yml file and mounted as a volume into the Prometheus container:
# alerting_rules.yml
groups:
  - name: alerting_rules
    interval: 1s
    rules:
      - alert: rabbitmqDown
        expr: rabbitmq_up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ is Down"
          description: "RabbitMQ is so dead"
This file defines a rule that checks whether the rabbitmq_up metric (exposed by rabbitmq-exporter) has stayed at 0 for 10 seconds. If so, a new critical alert called rabbitmqDown is fired, carrying the messages defined in the annotations section.
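If you want to watch the rule with your own eyes, one option (not part of the repo's compose file, just a debugging tweak) is to add a ports mapping to the prometheus service and browse http://localhost:9090/alerts, where rabbitmqDown shows up as inactive, pending or firing:
# docker-compose.yml (optional, for debugging only)
prometheus:
  ports:
    - 9090:9090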
Now that we have Prometheus generating alerts, it is time for AlertManager to come in:
# docker-compose.yml
alertmanager:
  image: prom/alertmanager:v0.15.3
  restart: always
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
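The official image also bundles amtool, so the configuration we are about to write can be validated from the running container. This is an optional check, using the path from the volume above:
$ docker-compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml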
Using the official AlertManager Docker image, we can quickly configure a container to receive alerts from Prometheus and trigger Slack notifications:
# alertmanager.yml
route:
  # default receiver, required on the root route
  receiver: 'tranque-slack-hook'
  group_by: ['instance', 'severity']
  routes:
    - match:
        alertname: rabbitmqDown
      receiver: 'tranque-slack-hook'

receivers:
  - name: 'tranque-slack-hook'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/your-slack-hook"
        title: "{{ .CommonAnnotations.summary }}"
        title_link: ""
        text: "RabbitMQ server has been down for 10 seconds"
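The api_url above is just a placeholder. If you want to confirm that your real Slack webhook works before wiring it into AlertManager, a plain curl call against it is enough (this uses Slack's standard incoming-webhook API, with the same placeholder URL):
$ curl -X POST -H 'Content-type: application/json' \
    --data '{"text": "Test message from the alerting stack"}' \
    https://hooks.slack.com/services/your-slack-hook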
When the rabbitmqDown alert (defined in the alerting_rules.yml file) fires, AlertManager routes it to the receiver declared as tranque-slack-hook, which triggers a Slack notification. That receiver configuration requires a Slack webhook URL, as well as the title and text of the message to be sent to our Slack channel. The result will look like this:
Finally, we can set up an Unsee container as a real-time alert UI that aggregates all current alerts received by AlertManager:
# docker-compose.yml
unsee:
  image: cloudflare/unsee:latest
  restart: always
  environment:
    ALERTMANAGER_URI: http://alertmanager:9093
  ports:
    - 8080:8080
With Unsee's port 8080 forwarded to our localhost, we can open an interface like this one, which shows all active alerts along with their descriptions and labels.
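Unsee gets this data from AlertManager's HTTP API; if you also publish port 9093 of the alertmanager container (not done in the compose file above), you can query the same alerts directly:
$ curl http://localhost:9093/api/v1/alerts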