I'm sure I've gotten multiple things wrong here: some of it flat-out wrong, some anti-patterns, some just sub-optimal. I'm new to prometheus, grafana and mtail, so please feel free to share corrections/suggestions.
There are a handful of custom nginx stats exporters.
Some tie into nginx's internal stats, like the official nginx exporter: nginx-prometheus-exporter.
Others are custom applications that tail/parse nginx access logs in order to generate more detailed or custom stats (for example, Martin Helmich's prometheus-nginxlog-exporter).
The problem with the first type is that you are limited to the nginx internal stats.
The problem with the second is that customization is limited to whatever the author happened to need. If you are happy with the stats the author decided on, great. For instance, with Martin Helmich's exporter the docs mention working with about 3 nginx variables in the access log...I think there are many more "standard" ones that the exporter expects, but the documentation wasn't clear to me. Including/customizing other metrics/labels does not seem to be possible with that exporter.
Since using any exporter means deploying a binary, setting up a systemd service to manage the process and doing a bit of configuration, I decided to just use mtail for this. The idea is that I will probably use mtail (a general purpose solution) for other things at some point, so it should net less infrastructure provisioning work in the future.
mtail is basically just a log tailer that tries to be smart about not losing data due to log rotation. It lets you write custom PATTERN { ACTION } logic to extract fields from log lines and combine them into custom metrics, which a single service then exports over http for prometheus and other monitoring systems.
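To make the PATTERN { ACTION } idea concrete, here is a minimal sketch of an mtail program. This is not the nginx.mtail included below; the log format, metric name and labels are just illustrative assumptions.

```
# Minimal illustration only: count requests by method and status from a
# combined-style access log line. The regex and field names are assumptions.
counter nginx_request_count by request_method, status

/"(?P<request_method>[A-Z]+) \S+ \S+" (?P<status>\d{3}) \d+/ {
  nginx_request_count[$request_method][$status]++
}
```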
mtail doesn't provide interval-based counters (e.g. application hits in the last 5 minutes). The counters grow forever (reset at restart of course), so you will generally use prometheus' rate()
function, which looks at the change in a counter over time. Prometheus is smart about bridging over restarts (ie where counters reset to 0).
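For example, assuming a counter named nginx_request_count with a status label (as in the sketch above), a query like this gives the per-second request rate over the last 5 minutes and handles counter resets:

```
sum(rate(nginx_request_count[5m])) by (status)
```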
I've included most of the relevant files here as an example configuration.
- install mtail binary on system
- create any custom users/groups
- create init script or systemd service. I provide a sample systemd unit below.
- Pick a location for a new/custom nginx access_log we will use for mtail.
- nginx will need write access and mtail read access.
- Beware of log rotation permission/ownership changes.
- permissions on the log file may not be enough, you may also need execute permissions on the parent directory.
- Configure nginx to write to the new/additional access_log (a sample nginx config is shown after this list)
- Create mtail program to parse the new nginx access_log
- I include nginx.mtail below as an example.
- Make sure appropriate firewall port is open, that nginx is reloaded/running and mtail service is running.
- tail the new access log and make sure you are getting data. If you are buffering it may take a while.
- on the server, run `curl http://localhost:<mtail-port>/metrics` and make sure you are getting prometheus metrics
- Point the prometheus server at the port mtail is running on in order to scrape the new metrics
- generally this means adding a new scrape config in prometheus.yml...but it will depend on how you configure prometheus (service discovery, etc). A sample scrape config is shown after this list.
- Add new prometheus alerts based on the new metrics (for instance when the rate of non-200 responses is greater than 10% of 200 responses). Something like the following (untested); a full alerting rule example is shown after this list.
  `sum(rate(nginx_request_count{nginx_status!="200"}[10m]) * 60) by (instance, nginx_host) / sum(rate(nginx_request_count{nginx_status="200"}[10m]) * 60) by (instance, nginx_host) > 0.1`
- Build grafana dashboards, alerts, etc.
- grafana.nginx.json is an example dashboard showing how you can use these metrics.
- screenshots at the bottom
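For the "configure nginx to write to the new access_log" step above, a hypothetical nginx fragment might look like the following. The format name, log path and variable list are assumptions, not the exact config from this gist; the variables just need to match whatever your mtail program parses.

```
# hypothetical fragment; log_format lives in the http {} context
log_format mtail '$host $server_port $request_method $uri $content_type $status '
                 '$request_length $bytes_sent $body_bytes_sent $request_time '
                 '$upstream_connect_time $upstream_header_time $upstream_response_time '
                 '$msec';

server {
    # ... existing server config ...
    # additional access_log just for mtail; nginx needs write access,
    # mtail needs read access (plus execute on the parent directory)
    access_log /var/log/nginx/mtail_access.log mtail;
}
```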
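For the "point prometheus at mtail" step, a minimal static scrape config might look like this. The job name, target hostname and port are assumptions (mtail listens on port 3903 by default):

```
# prometheus.yml fragment (hypothetical)
scrape_configs:
  - job_name: 'mtail_nginx'
    static_configs:
      - targets: ['your-nginx-host:3903']
```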
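And for the alerting step, the (untested) expression above could be wrapped in a prometheus alerting rule along these lines. The group/alert names, duration and severity are assumptions:

```
# rules file fragment (hypothetical), referenced via rule_files in prometheus.yml
groups:
  - name: nginx
    rules:
      - alert: NginxHighNon200Ratio
        expr: >
          sum(rate(nginx_request_count{nginx_status!="200"}[10m]) * 60) by (instance, nginx_host)
          / sum(rate(nginx_request_count{nginx_status="200"}[10m]) * 60) by (instance, nginx_host) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High ratio of non-200 responses on {{ $labels.nginx_host }}"
```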
- Read the mtail Language doc.
- Run `mtail -h` and read through the command-line parameters that are available.
- review the `/tmp/mtail*` log files if something isn't working, or tail syslog or `journalctl -f | grep mtail` as you are doing things. Watch for errors.
- make sure nginx and mtail are both running.
- if nginx and mtail are both running but nothing is showing up in the access_log, you probably have a permissions problem on the log file or directory where the logfile lives.
- think in terms of labels vs metrics. Some log fields will be used for filtering/grouping (ie labels), others are used to update counters. In our example (see the sketch after this list):
- Labels: `$host`, `$server_port`, `$request_method`, `$uri`, `$content_type`, `$status`
- Metrics: `$request_length`, `$bytes_sent`, `$body_bytes_sent`, `$request_time`, `$upstream_connect_time`, `$upstream_header_time`, `$upstream_response_time`
- `$msec` is neither a label nor a metric, but tells mtail the time of the log line, since there may be a write delay due to buffering.
- If you try to increment a counter with a string (e.g. `-`), there seems to be an issue with mtail where you will get an exported metric with no value, e.g. `nginx_request{...}` instead of `nginx_request{...} <value>`. Prometheus will report the target as "down" but you will see the http endpoint is up. If you look at prometheus UI > Targets you can see the specific parse error message giving you a hint.
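To illustrate the label-vs-metric split in mtail terms: labels become the dimensions a counter is declared "by" and indexed with, while numeric fields get added to the counter. A tiny sketch (field names and regex are assumptions, not the full nginx.mtail):

```
# labels ($host, $status) index the counter; the numeric capture ($bytes_sent)
# is what actually gets added to it
counter nginx_bytes_sent by host, status

/^(?P<host>\S+) .* (?P<status>\d{3}) (?P<bytes_sent>\d+)$/ {
  nginx_bytes_sent[$host][$status] += $bytes_sent
}
```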
I know it seems like a lot...but that is mostly because I am over-explaining. I started using prometheus/grafana/node_exporter for the first time less than a week ago. I figured out all the mtail stuff described here in an afternoon. The grafana dashboard only took an hour to build from scratch. So overall I'm quite happy with the mtail setup. The only gripe is the strange issue where mtail sometimes exports metrics without a value. A restart of mtail seems to resolve it...but it is annoying that I don't know exactly why it happens. Something to dig into later.