These are notes for my Prometheus workshop. The follow-up workshop on Prometheus/Kubernetes can be found here.
- Technology: Time Series Database
- Approach: Black Box vs White Box
- Scope: Time Series (Prometheus) vs. Logfiles (ELK) vs. Tracing (Zipkin)
- Download, extract, run
./node_exporter
- Show example metrics (node_cpu, node_network_receive_bytes, node_filesystem_avail)
- What is a time series, what are labels?
- Run second instance using
./node_exporter -web.listen-address=":9101"
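To make the "what is a time series, what are labels" point concrete, here is a minimal Python sketch that splits lines of the text exposition format into name, labels and value. The sample values and the helper names `parse_sample` and `series_id` are made up for illustration. The parser is deliberately simplified: it ignores comments, timestamps and escaped quotes.

```python
import re

# Matches lines such as: node_network_receive_bytes{device="lo0"} 1234
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def series_id(name, labels):
    """A time series is identified by its name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical sample values, for illustration only.
lines = [
    'node_network_receive_bytes{device="lo0"} 1024',
    'node_network_receive_bytes{device="en0"} 4096',
]
parsed = [parse_sample(line) for line in lines]
ids = {series_id(name, labels) for name, labels, _ in parsed}
print(len(ids))  # same metric name, different labels: separate series
```

Each distinct combination of metric name and label values is its own series, which is why one metric on one node exporter already produces several series.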
- Download Prometheus 2.0.0-beta.2, extract, edit prometheus.yml and add a node1 and a node2 job to scrape_configs, then run
./prometheus
- Show Status -> Targets (scrape interval is 15s, node will be "up" after 15s)
- Push vs Pull, HA Prometheus
- Example Queries:
node_network_receive_bytes
-> Mention the instance label, which is added by the Prometheus server.
node_network_receive_bytes{device="lo0"}
-> Bonus Question: Why are the values for node1 and node2 different?
sum(node_network_receive_bytes)
sum(node_network_receive_bytes) by (instance)
sum without (instance) (node_network_receive_bytes)
node_network_receive_bytes[5m]
rate(node_network_receive_bytes[5m])
rate(node_network_receive_bytes[5m]) / 1024
sum(rate(node_network_receive_bytes[5m]) / 1024)
sum(rate(node_network_receive_bytes[5m]) / 1024) by (instance)
sum without (device) (rate(node_network_receive_bytes[5m]) / 1024)
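The queries above apply rate() to each series first and aggregate afterwards. The following Python sketch shows roughly what rate() does with the samples in a range, and why the order matters when a counter resets. It is a simplification, not Prometheus' implementation: the real rate() also extrapolates to the window boundaries. The function names and sample values are hypothetical.

```python
def increase(samples):
    """Total counter increase over (timestamp, value) samples, reset-aware."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted; count the new value from zero.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second increase between first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical (timestamp_seconds, counter_value) samples inside one 5m window.
node1 = [(0, 1000), (150, 4000), (300, 7000)]
node2 = [(0, 5000), (150, 8000), (300, 3000)]  # restarted before the last scrape

sum_of_rates = simple_rate(node1) + simple_rate(node2)

# Summing first hides the restart inside a larger number: the dip in the
# total is misread as a reset of the whole sum.
summed = [(t, a + b) for (t, a), (_, b) in zip(node1, node2)]
rate_of_sum = simple_rate(summed)

print(sum_of_rates, rate_of_sum)
```

In this sketch the per-series rates add up to 40 bytes per second, while the rate computed on the summed counters comes out noticeably higher, because the reset of one counter is treated as a reset of the total.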
- There doesn't seem to be a binary download for Mac anymore, so run Grafana in Docker:
docker run --rm -t -i -p 3000:3000 grafana/grafana
- Log in as admin/admin
- Add data source: Name prometheus, Type Prometheus, URL http://public-ip:9090, Access proxy. (localhost can't be used because Grafana runs inside the Docker container and the Prometheus server runs outside of it. Use the public IP address instead.)
- Import and show example dashboard
- Some things don't display correctly in the example dashboard, because the dashboard is for Prometheus 1.x and we run 2.x beta. Example fix: Edit Uptime, replace process_start_time_seconds with prometheus_config_last_reload_success_timestamp_seconds.
- New metric with example query:
sum without (device, job) (rate(node_network_receive_bytes[5m]))
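The difference between "sum by" and "sum without" can be sketched in Python as grouping on a label subset. This is only an illustration of the grouping logic: the helper names and the per-series rate values are hypothetical, and the metric name (which aggregation also drops) is not modelled.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Like sum by (...): group on the listed labels only."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        groups[key] += value
    return dict(groups)


def sum_without(samples, drop):
    """Like sum without (...): group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


# Hypothetical per-series receive rates in bytes per second.
samples = [
    ({"instance": "localhost:9100", "job": "node1", "device": "lo0"}, 10.0),
    ({"instance": "localhost:9100", "job": "node1", "device": "en0"}, 30.0),
    ({"instance": "localhost:9101", "job": "node2", "device": "lo0"}, 5.0),
    ({"instance": "localhost:9101", "job": "node2", "device": "en0"}, 15.0),
]

per_instance = sum_without(samples, {"device", "job"})
print(per_instance)
```

With only instance, job and device present, dropping device and job leaves the same groups as keeping instance. The two forms diverge as soon as a series carries an additional label: "without" preserves it, "by" drops it.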
- Show metric
rate(http_requests_total{job="node1"}[1m])
in the Prometheus test UI (including graph) -> average number of requests per second during the last minute
- Create file alerting.rules in old format:
ALERT MuchTraffic
IF rate(http_requests_total{job="node1"}[1m]) > 5
FOR 1m
ANNOTATIONS {
summary = "High request rate on {{ $labels.instance }}",
description = "{{ $labels.instance }} has a request rate above 5 requests / second (current value: {{ $value }} requests / second)",
}
- Use promtool to convert to new format:
./promtool update rules alerting.rules
(will create new file alerting.rules.yml)
- Show alerting.rules.yml and verify with
./promtool check rules alerting.rules.yml
- Add "alerting.rules.yml" as a list entry to the rule_files section in prometheus.yml, then restart Prometheus
- Go to "Alerts" tab to see that the alert is not active.
- Run
watch -n 0.1 wget -O- http://localhost:9100/metrics
to make the alert active (after 1 minute).
- During that minute, explain the watch command and mention that rules are used not only for alerting, but also for recording.
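The one-minute wait comes from the FOR clause: the alert is first pending and only fires once the condition has held for the whole duration. The Python sketch below models that behaviour under stated assumptions. The function name, the request rates and the 15s evaluation interval are hypothetical, and this is not how Prometheus is implemented internally.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return inactive/pending/firing for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition broke, the FOR timer starts over
            states.append("inactive")
    return states


# Hypothetical request rates, one value per evaluation (assumed every 15s).
rates = [2, 8, 9, 7, 8, 9, 3, 8]
states = alert_states(rates, threshold=5, for_seconds=60, interval_seconds=15)
print(states)
```

A single evaluation below the threshold resets the timer, so a short dip in traffic sends the alert back to inactive and the minute starts again.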
- Download, extract, run
./alertmanager -log.level debug -config.file simple.yml
- Add the following to prometheus.yml:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
- Show Alertmanager's simple.yml to explain some config options.
Question: How to model HTTP Server Response Times?
- Example Histogram:
http_request_duration_seconds_bucket{le="0.005"} 0
http_request_duration_seconds_bucket{le="0.01"} 0
http_request_duration_seconds_bucket{le="0.025"} 3
http_request_duration_seconds_bucket{le="0.05"} 10
http_request_duration_seconds_bucket{le="0.1"} 22
http_request_duration_seconds_bucket{le="0.25"} 40
http_request_duration_seconds_bucket{le="0.5"} 52
http_request_duration_seconds_bucket{le="1.0"} 59
http_request_duration_seconds_bucket{le="2.5"} 59
http_request_duration_seconds_bucket{le="5"} 60
http_request_duration_seconds_bucket{le="10"} 60
http_request_duration_seconds_bucket{le="+Inf"} 60
http_request_duration_seconds_count 60
- Expose example Histogram:
  - Save to file example-histogram.txt
  - Run
    python -m SimpleHTTPServer 9301
    (Python 2; with Python 3 use python3 -m http.server 9301)
  - Add to prometheus.yml:
    - job_name: 'example'
      metrics_path: '/example-histogram.txt'
      static_configs:
      - targets: ['localhost:9301']
  - Restart Prometheus
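Before querying the exposed histogram, it helps to point out that the bucket counts are cumulative: each le bucket includes everything in the smaller buckets. The Python sketch below checks this for the example numbers and recovers the per-bucket counts. The constant and function names are made up for illustration.

```python
# Bucket counts from the example histogram: upper bound (le) -> cumulative count.
BUCKETS = [
    (0.005, 0), (0.01, 0), (0.025, 3), (0.05, 10), (0.1, 22), (0.25, 40),
    (0.5, 52), (1.0, 59), (2.5, 59), (5.0, 60), (10.0, 60), (float("inf"), 60),
]
TOTAL_COUNT = 60  # http_request_duration_seconds_count


def is_cumulative(buckets):
    """Cumulative bucket counts must never decrease as le grows."""
    counts = [count for _, count in buckets]
    return all(a <= b for a, b in zip(counts, counts[1:]))


def per_bucket_counts(buckets):
    """Undo the accumulation: observations that fell into each bucket alone."""
    result, previous = [], 0
    for upper, count in buckets:
        result.append((upper, count - previous))
        previous = count
    return result


print(is_cumulative(BUCKETS))
print(per_bucket_counts(BUCKETS))
```

The +Inf bucket always equals the total count, so a file where those two numbers differ is not a valid histogram.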
- Example 1: Overall share of requests <= 250ms is 2/3 (= 40/60):
sum(http_request_duration_seconds_bucket{le="0.25"}) by (job) / sum(http_request_duration_seconds_count) by (job)
- Example 2: Share in the last 5m window:
sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)
Does not work with the static example, because the numbers did not increase in the last 5m. Edit example-histogram.txt and increase some numbers for the demo.
- Advanced question: Why sum(rate(...)) and not rate(sum(...))?
- Example summary
http_request_duration_seconds{quantile="0.5"} 0.25
http_request_duration_seconds{quantile="0.9"} 0.3
http_request_duration_seconds{quantile="0.99"} 2.7
http_request_duration_seconds_sum 30.0
http_request_duration_seconds_count 100.0
- Explain the histogram_quantile() function.
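As a talking aid for histogram_quantile(), here is a Python sketch of the estimation idea: find the bucket that contains the requested rank, then interpolate linearly inside it. It runs on the raw counts of the example histogram. In PromQL the function is normally applied to bucket rates, for example histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)). The Python function is an approximation for teaching, not Prometheus' code.

```python
import math

# Cumulative buckets from the example histogram: (upper bound le, count).
BUCKETS = [
    (0.005, 0), (0.01, 0), (0.025, 3), (0.05, 10), (0.1, 22), (0.25, 40),
    (0.5, 52), (1.0, 59), (2.5, 59), (5.0, 60), (10.0, 60), (float("inf"), 60),
]


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite bucket boundary.
                return lower_bound
            in_bucket = count - lower_count
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


median = histogram_quantile(0.5, BUCKETS)
p90 = histogram_quantile(0.9, BUCKETS)
print(median, p90)
```

For the example data the median lands between 0.1s and 0.25s, at roughly 0.167s. The result is only as precise as the bucket boundaries, which is the main trade-off against a summary, where quantiles are computed on the client but cannot be aggregated across instances.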