Prometheus Workshop Notes

These are notes for my Prometheus workshop. The follow-up workshop on Prometheus/Kubernetes can be found here.

Overview

Technology: Time Series Database
Approach: Black Box vs White Box
Scope: Time Series (Prometheus) vs. Logfiles (ELK), vs. Tracing (Zipkin)

node_exporter

Download, extract, run ./node_exporter
Show example metrics (node_cpu, node_network_receive_bytes, node_filesystem_avail)
What is a time series, what are labels?
Run second instance using ./node_exporter -web.listen-address=":9101"
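
To make "time series" and "labels" concrete: a rough sketch that splits one line of the /metrics text output into metric name, labels, and value. parse_sample and series_id are hypothetical helper names, not part of any Prometheus library, and the parsing is simplified (no escaped quotes, no timestamps, no comment lines). The metric name plus the full label set identifies one time series.

```python
import re

# Simplified patterns for one sample line, e.g.:
#   node_network_receive_bytes{device="lo0"} 1024
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    label_text = match.group("labels") or ""
    labels = {m.group("key"): m.group("val") for m in _LABEL.finditer(label_text)}
    return match.group("name"), labels, float(match.group("value"))


def series_id(name, labels):
    """A time series is identified by its name and its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


name, labels, value = parse_sample('node_network_receive_bytes{device="lo0"} 1024')
print(series_id(name, labels), value)
```

Two lines with the same name but different label values (for example another device) give different series ids, which is the point to make in the workshop.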

Prometheus Server

Download Prometheus 2.0.0-beta.2, extract, edit prometheus.yml to add node1 and node2 jobs to scrape_configs, then run ./prometheus
Show Status -> Targets (scrape interval is 15s, so each node shows as "up" after about 15s)
Push vs Pull, HA Prometheus
Example Queries:
- node_network_receive_bytes -> Mention instance label, which is added by the Prometheus Server.
- node_network_receive_bytes{device="lo0"} -> Bonus Question: Why are the values for node1 and node2 different?
- sum(node_network_receive_bytes)
- sum (node_network_receive_bytes) by(instance)
- sum without(instance) (node_network_receive_bytes)
- node_network_receive_bytes[5m]
- rate(node_network_receive_bytes[5m])
- rate(node_network_receive_bytes[5m]) / 1024
- sum(rate(node_network_receive_bytes[5m]) / 1024)
- sum(rate(node_network_receive_bytes[5m]) / 1024) by (instance)
- sum without (device) (rate(node_network_receive_bytes[5m]) / 1024)
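
For explaining the step from a range vector to rate() to sum ... by (instance), a simplified sketch. Assumptions: simple_rate and sum_by are hypothetical names, the sample values are made up, and the instance label values localhost:9100 / localhost:9101 are only guessed from the listen addresses above. The real rate() also handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Ignores counter resets and window extrapolation (unlike PromQL rate()).
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_by(rates, label):
    """Group (labels, value) pairs by one label and add up the values."""
    totals = defaultdict(float)
    for labels, value in rates:
        totals[labels[label]] += value
    return dict(totals)


# Made-up range vectors: one list of samples per series, scraped every 15s.
windows = [
    ({"instance": "localhost:9100", "device": "en0"}, [(0, 1000), (15, 2500), (30, 4000)]),
    ({"instance": "localhost:9100", "device": "lo0"}, [(0, 0), (30, 600)]),
    ({"instance": "localhost:9101", "device": "en0"}, [(0, 0), (30, 1500)]),
]

rates = [(labels, simple_rate(samples)) for labels, samples in windows]
print(sum_by(rates, "instance"))
```

Grouping by instance drops the device label, which is what sum without (device) does from the other direction.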

Grafana

There doesn't seem to be a binary download for Mac anymore, so run Grafana in Docker: docker run --rm -t -i -p 3000:3000 grafana/grafana.
Login as admin/admin
Add data source: Name prometheus, Type Prometheus, URL http://public-ip:9090, Access proxy. localhost does not work as the URL, because Grafana runs inside the Docker container and the Prometheus server runs outside of it; use the host's public IP address instead.
Import and show example dashboard
Some things don't display correctly in the example dashboard, because the dashboard is for Prometheus 1.x, and we run 2.x beta. Example Fix: Edit Uptime, replace process_start_time_seconds with prometheus_config_last_reload_success_timestamp_seconds.
New metric with example query: sum without (device, job) (rate(node_network_receive_bytes[5m]))

Alerts in Prometheus

Show metric rate(http_requests_total{job="node1"}[1m]) in Prometheus test UI (including graph) -> average number of requests per second during the last minute
Create file alerting.rules in old format:

ALERT MuchTraffic
    IF rate(http_requests_total{job="node1"}[1m]) > 5
    FOR 1m
    ANNOTATIONS {
        summary = "High request rate on {{ $labels.instance }}",
        description = "{{ $labels.instance }} has a request rate above 5 requests / second (current value: {{ $value }} requests / second)",
    }

Use promtool to convert to new format: ./promtool update rules alerting.rules (will create new file alerting.rules.yml)
Show alerting.rules.yml and verify with ./promtool check rules alerting.rules.yml
Add to rule_files section in prometheus.yml: - "alerting.rules.yml", then restart Prometheus
Go to "Alerts" tab to see that the alert is not active.
Run watch -n 0.1 wget -O- http://localhost:9100/metrics to make alert active (after 1 minute).
During that minute, explain the watch command and mention that rules are used not only for alerting, but also for recording.
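
To explain why the alert only becomes active after a minute, a small sketch of the FOR 1m behaviour: the alert is pending while the condition holds and fires once it has held for the full duration. alert_states is a hypothetical name, the 15s evaluation interval is an assumption, and the values are made up; this is a simplified model, not Prometheus's actual rule evaluation code.

```python
def alert_states(values, threshold, for_seconds, interval):
    """Return the alert state at each evaluation.

    values: result of the alert expression at each evaluation,
            evenly spaced by `interval` seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Condition no longer holds: the alert resets completely.
            active_since = None
            states.append("inactive")
    return states


# Made-up request rates, evaluated every 15s against "> 5 FOR 1m".
print(alert_states([1, 6, 7, 8, 9, 9, 2], threshold=5, for_seconds=60, interval=15))
```

With these values the alert is pending for four evaluations, fires on the fifth, and goes back to inactive as soon as the rate drops.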

Alertmanager

Download, extract, run ./alertmanager -log.level debug -config.file simple.yml

Add the following to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
      - targets: ['localhost:9093']

Show alertmanager's simple.yml to explain some config options.

Advanced Topics

Question: How to model HTTP Server Response Times?

Example Histogram:

http_request_duration_seconds_bucket{le="0.005"}  0
http_request_duration_seconds_bucket{le="0.01"}   0
http_request_duration_seconds_bucket{le="0.025"}  3
http_request_duration_seconds_bucket{le="0.05"}  10
http_request_duration_seconds_bucket{le="0.1"}   22
http_request_duration_seconds_bucket{le="0.25"}  40
http_request_duration_seconds_bucket{le="0.5"}   52
http_request_duration_seconds_bucket{le="1.0"}   59
http_request_duration_seconds_bucket{le="2.5"}   59
http_request_duration_seconds_bucket{le="5"}     60
http_request_duration_seconds_bucket{le="10"}    60
http_request_duration_seconds_bucket{le="+Inf"}  60
http_request_duration_seconds_count 60

Expose example Histogram:
- Save to file example-histogram.txt
- Run python -m SimpleHTTPServer 9301 (Python 2; with Python 3, run python3 -m http.server 9301)
- Add to prometheus.yml:
```
- job_name: 'example'
  metrics_path: '/example-histogram.txt'
  static_configs:
  - targets: ['localhost:9301']
```
- Restart Prometheus
Example 1: Overall rate <= 250ms is 2/3 (= 40/60): sum(http_request_duration_seconds_bucket{le="0.25"}) by (job) / sum (http_request_duration_seconds_count) by (job)
Example 2: Rate in last 5m window: sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job). This does not work with the static example, because the numbers did not increase in the last 5m. Edit example-histogram.txt and increase some numbers for the demo.
Advanced question: Why sum(rate(...)) and not rate(sum(...))?
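
One way to illustrate the answer: rate() has to see each counter on its own so it can detect resets. If the counters are summed first, a restart of one instance looks like a reset of the whole sum. The sketch below uses made-up counter values and a hypothetical helper increase_with_resets that mimics reset handling in a simplified way (any drop is treated as a restart from zero).

```python
def increase_with_resets(values):
    """Total increase over a window, treating any drop as a counter reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter restarted: everything counted since the reset is new.
            total += cur
    return total


# Made-up counters from two instances; node2 restarts after its second sample.
node1 = [100, 110, 120, 130]
node2 = [500, 510, 5, 15]

# "sum(rate(...))" style: handle resets per series, then add up.
sum_of_increases = increase_with_resets(node1) + increase_with_resets(node2)

# "rate(sum(...))" style: add up first, then look for resets in the sum.
summed_first = [a + b for a, b in zip(node1, node2)]
increase_of_sum = increase_with_resets(summed_first)

print(sum_of_increases)  # 55.0
print(increase_of_sum)   # 165.0
```

The per-series result (55) is the real increase; summing first turns node2's restart into a bogus jump of 125.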

Example Summary:

http_request_duration_seconds{quantile="0.5"} 0.25
http_request_duration_seconds{quantile="0.9"} 0.3
http_request_duration_seconds{quantile="0.99"} 2.7
http_request_duration_seconds_sum 30.0
http_request_duration_seconds_count 100.0

Explain histogram_quantile() function.
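
A simplified sketch of what histogram_quantile() estimates, applied to the bucket counts of the example histogram above: find the bucket that contains the requested rank and interpolate linearly inside it. This is an approximation of the documented behaviour written for the workshop, not the Prometheus source; the Python function name and the EXAMPLE_BUCKETS constant are assumptions, and edge cases such as missing +Inf buckets or non-monotonic counts are ignored.

```python
import math

# (upper bound "le", cumulative count) taken from the example histogram.
EXAMPLE_BUCKETS = [
    (0.005, 0), (0.01, 0), (0.025, 3), (0.05, 10), (0.1, 22), (0.25, 40),
    (0.5, 52), (1.0, 59), (2.5, 59), (5.0, 60), (10.0, 60), (math.inf, 60),
]


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
             with the +Inf bucket last.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


print(round(histogram_quantile(0.5, EXAMPLE_BUCKETS), 4))  # 0.1667
print(round(histogram_quantile(0.9, EXAMPLE_BUCKETS), 4))  # 0.6429
```

Good discussion point: the result depends on the bucket boundaries, whereas the quantiles in a summary are computed in the client and cannot be aggregated across instances.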
