These are notes for my Prometheus workshop. The follow-up workshop on Prometheus/Kubernetes can be found here.
- Technology: Time Series Database
- Approach: Black Box vs White Box
- Scope: Time Series (Prometheus) vs. Logfiles (ELK) vs. Tracing (Zipkin)
- Download, extract, run `./node_exporter`
- Show example metrics (`node_cpu`, `node_network_receive_bytes`, `node_filesystem_avail`)
- What is a time series, what are labels?
- Run second instance using `./node_exporter -web.listen-address=":9101"`
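To make the "what is a time series, what are labels?" question concrete, here is a minimal Python sketch that splits one line of exporter output into metric name, labels, and value. The helper names `parse_sample` and `series_id` and the sample values are made up for illustration. The parser is deliberately simplified: it ignores optional timestamps and escaped quotes in label values, so it is not a full parser for the exposition format.

```python
import re

# One sample line looks like: node_network_receive_bytes{device="lo0"} 1024
# Simplification (assumption): no trailing timestamp, no escaped quotes in label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Split a sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))


def series_id(name, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical sample lines; the values are invented.
lines = [
    'node_network_receive_bytes{device="lo0"} 1024',
    'node_network_receive_bytes{device="en0"} 4096',
]
for line in lines:
    name, labels, value = parse_sample(line)
    print(series_id(name, labels), value)
```

The two lines share a metric name but differ in the `device` label, so they are two distinct time series.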
- Download Prometheus 2.0.0-beta.2, extract, edit `prometheus.yml` and add `node1` and `node2` jobs to `scrape_configs`, run `./prometheus`
- Show Status -> Targets (scrape interval is 15s, node will be "up" after 15s)
- Push vs Pull, HA Prometheus
- Example Queries:
  - `node_network_receive_bytes` -> Mention the `instance` label, which is added by the Prometheus server.
  - `node_network_receive_bytes{device="lo0"}` -> Bonus question: Why are the values for `node1` and `node2` different?
  - `sum(node_network_receive_bytes)`
  - `sum(node_network_receive_bytes) by (instance)`
  - `sum without (instance) (node_network_receive_bytes)`
  - `node_network_receive_bytes[5m]`
  - `rate(node_network_receive_bytes[5m])`
  - `rate(node_network_receive_bytes[5m]) / 1024`
  - `sum(rate(node_network_receive_bytes[5m]) / 1024)`
  - `sum(rate(node_network_receive_bytes[5m]) / 1024) by (instance)`
  - `sum without (device) (rate(node_network_receive_bytes[5m]) / 1024)`
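The queries above combine two ideas: `rate()` turns a window of counter samples into a per-second value, and `sum ... by/without` groups series by their labels. The following Python sketch imitates both on in-memory data. The names `simple_rate`, `sum_by`, and `sum_without` and all numbers are invented for illustration. Unlike Prometheus's real `rate()`, this sketch does not extrapolate to the window boundaries.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase over a window of (timestamp, counter_value) samples.

    A drop in the counter is treated as a reset: the value after the reset
    counts as a fresh increase. Assumption: no extrapolation to window edges.
    """
    if len(samples) < 2:
        return None
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / duration


def sum_by(vector, keep):
    """Like sum(...) by (keep): group on the listed labels only."""
    out = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


def sum_without(vector, drop):
    """Like sum without (drop) (...): group on all labels except the listed ones."""
    out = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


# Hypothetical instant vector: one value per series.
vector = [
    ({"instance": "localhost:9100", "job": "node1", "device": "lo0"}, 100.0),
    ({"instance": "localhost:9100", "job": "node1", "device": "en0"}, 300.0),
    ({"instance": "localhost:9101", "job": "node2", "device": "lo0"}, 50.0),
]
print(sum_by(vector, {"instance"}))
print(sum_without(vector, {"device"}))
print(simple_rate([(0, 100.0), (15, 250.0), (30, 400.0)]))
```

Dividing a byte rate by 1024, as in the queries, converts bytes per second into KiB per second.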
- There doesn't seem to be a binary download for Mac anymore, so run as follows: `docker run --rm -t -i -p 3000:3000 grafana/grafana`
- Log in as `admin`/`admin`
- Add data source: Name `prometheus`, Type `Prometheus`, URL `http://public-ip:9090`, Access `proxy`. (Can't use `localhost`, because Grafana runs inside the Docker container and the Prometheus server runs outside of it. Use the public IP address instead.)
- Import and show example dashboard
- Some things don't display correctly in the example dashboard, because the dashboard is for Prometheus 1.x and we run 2.x beta. Example fix: Edit `Uptime`, replace `process_start_time_seconds` with `prometheus_config_last_reload_success_timestamp_seconds`.
- New metric with example query: `sum without (device, job) (rate(node_network_receive_bytes[5m]))`
- Show metric `rate(http_requests_total{job="node1"}[1m])` in Prometheus test UI (including graph) -> average number of requests per second during the last minute
- Create file `alerting.rules` in old format:
ALERT MuchTraffic
IF rate(http_requests_total{job="node1"}[1m]) > 5
FOR 1m
ANNOTATIONS {
summary = "High request rate on {{ $labels.instance }}",
description = "{{ $labels.instance }} has a request rate above 5 requests / second (current value: {{ $value }} requests / second)",
}
- Use `promtool` to convert to new format: `./promtool update rules alerting.rules` (will create new file `alerting.rules.yml`)
- Show `alerting.rules.yml` and verify with `./promtool check rules alerting.rules.yml`
- Add to `rule_files` section in `prometheus.yml`: `- "alerting.rules.yml"`, then restart Prometheus
- Go to "Alerts" tab to see that the alert is not active.
- Run `watch -n 0.1 wget -O- http://localhost:9100/metrics` to make the alert active (after 1 minute).
- During that minute, explain the `watch` command and mention that rules are used not only for alerting, but also for recording.
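The one-minute wait comes from the `FOR 1m` clause: the condition has to hold at every evaluation for that long before the alert fires. The Python sketch below models this as a small state machine. The function name `alert_states` and the 15s evaluation interval in the example are assumptions for illustration, not taken from the workshop setup.

```python
def alert_states(evaluations, for_seconds):
    """Return a list of (timestamp, state) for a sequence of rule evaluations.

    evaluations: iterable of (timestamp_seconds, condition_is_true)
    for_seconds: how long the condition must hold before the alert fires
    """
    active_since = None  # time at which the condition first became true
    states = []
    for ts, condition in evaluations:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            state = "inactive"
        else:
            if active_since is None:
                active_since = ts
            state = "firing" if ts - active_since >= for_seconds else "pending"
        states.append((ts, state))
    return states


# Hypothetical run: evaluated every 15s, condition becomes true at t=30.
evaluations = [(t, t >= 30) for t in range(0, 105, 15)]
for ts, state in alert_states(evaluations, for_seconds=60):
    print(ts, state)
```

In this run the alert is pending from t=30 and fires at t=90, which matches what the "Alerts" tab shows while the `watch` command is generating traffic.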
- Download, extract, run `./alertmanager -log.level debug -config.file simple.yml`
- Add the following to `prometheus.yml`:

      alerting:
        alertmanagers:
        - static_configs:
          - targets: ['localhost:9093']

- Show alertmanager's `simple.yml` to explain some config options.
Question: How to model HTTP Server Response Times?
- Example Histogram:

      http_request_duration_seconds_bucket{le="0.005"} 0
      http_request_duration_seconds_bucket{le="0.01"} 0
      http_request_duration_seconds_bucket{le="0.025"} 3
      http_request_duration_seconds_bucket{le="0.05"} 10
      http_request_duration_seconds_bucket{le="0.1"} 22
      http_request_duration_seconds_bucket{le="0.25"} 40
      http_request_duration_seconds_bucket{le="0.5"} 52
      http_request_duration_seconds_bucket{le="1.0"} 59
      http_request_duration_seconds_bucket{le="2.5"} 59
      http_request_duration_seconds_bucket{le="5"} 60
      http_request_duration_seconds_bucket{le="10"} 60
      http_request_duration_seconds_bucket{le="+Inf"} 60
      http_request_duration_seconds_count 60

- Expose example Histogram:
  - Save to file `example-histogram.txt`
  - Run `python -m SimpleHTTPServer 9301` (Python 2; with Python 3 use `python3 -m http.server 9301`)
  - Add to `prometheus.yml`:

        - job_name: 'example'
          metrics_path: '/example-histogram.txt'
          static_configs:
          - targets: ['localhost:9301']

  - Restart Prometheus
- Example 1: Overall fraction of requests <= 250ms is 2/3 (= 40/60): `sum(http_request_duration_seconds_bucket{le="0.25"}) by (job) / sum(http_request_duration_seconds_count) by (job)`
- Example 2: Fraction in last 5m window: `sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)`. Does not work with the static example, because the numbers did not increase in the last 5m. Edit `example-histogram.txt` and increase some numbers for the demo.
- Advanced question: Why `sum(rate(...))` and not `rate(sum(...))`?
- Example summary:

      http_request_duration_seconds{quantile="0.5"} 0.25
      http_request_duration_seconds{quantile="0.9"} 0.3
      http_request_duration_seconds{quantile="0.99"} 2.7
      http_request_duration_seconds_sum 30.0
      http_request_duration_seconds_count 100.0

- Explain `histogram_quantile()` function.
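As a rough illustration of what `histogram_quantile()` does, the Python sketch below estimates a quantile from the cumulative buckets of the example histogram by linear interpolation inside the bucket that contains the requested rank. This is a simplified re-implementation written for these notes, not Prometheus's actual code. It assumes sorted, cumulative buckets ending in `+Inf` and a quantile in the range (0, 1].

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
             with the last entry being the +Inf bucket.
    Assumption: 0 < q <= 1 and at least two buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # No finite upper bound to interpolate to: fall back to the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Buckets from the example histogram above.
example = [
    (0.005, 0), (0.01, 0), (0.025, 3), (0.05, 10), (0.1, 22), (0.25, 40),
    (0.5, 52), (1.0, 59), (2.5, 59), (5.0, 60), (10.0, 60), (math.inf, 60),
]
print(histogram_quantile(0.5, example))
print(histogram_quantile(0.9, example))
```

For the median, the rank is 30 of 60, which falls into the bucket between 0.1s and 0.25s, so the estimate is 0.1 + 0.15 * 8/18, or about 0.167s. The estimate is only as precise as the bucket boundaries. A summary, by contrast, reports quantiles that were computed on the client.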