Web workers can use the Prometheus PHP client to track metrics. Such metrics are stored in memory and also copied to a shared memory segment, from which another process in the same SysV IPC namespace can read them. This design allows us to run a micro webserver in a sidecar container, from which we can export the Prometheus metrics collected in the worker's CLI app.
IPC namespace: when deployed on K8S, the containers of the same Pod run in the same IPC namespace, while two different Pods on the same host run in different IPC namespaces, unless they are explicitly configured to share the host's IPC namespace by running the Pods with hostIPC: true.
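As a sketch, a Pod spec for the sidecar setup could look like the following (container names, image names, and the metrics port are illustrative assumptions); note that hostIPC is only needed when containers from different Pods must share the host's IPC namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  # Containers in the same Pod already share the IPC namespace,
  # so the exporter can read the worker's SysV shared memory.
  containers:
    - name: worker             # the PHP CLI app tracking metrics
      image: my-worker:latest  # illustrative image name
    - name: metrics-exporter   # micro webserver exposing /metrics
      image: my-exporter:latest
      ports:
        - containerPort: 9100
  # hostIPC: true  # only needed to share IPC across Pods on the same node
```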
There are three options to export metrics to Prometheus from a CLI app:
- Run a sidecar webserver exposing the worker's metrics (current solution)
- Run a single webserver exposing all workers' metrics
- Push the metrics to Push Gateway
Why not the Push Gateway?
The main con of this approach is that the Push Gateway was designed to proxy metrics at the service level, not at the single instance/process level. This means that once you push an instance-level metric (i.e. the same metric pushed from multiple instances of the same worker running in parallel), the metric stays in the Push Gateway for its entire lifecycle, even after the worker instance has been terminated. The Push Gateway exposes an API to delete a group of metrics by label (i.e. by instance), but calling it reliably whenever a worker terminates is non-trivial (e.g. when the process crashes, or when the request fails during the worker's shutdown procedure).
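For reference, deleting an instance's metric group uses the Push Gateway's DELETE endpoint on the grouping-key path. A minimal sketch of a shutdown hook follows; the gateway address, job name, and instance name are illustrative assumptions, and the actual call is left commented out:

```shell
# Sketch of a worker shutdown hook that deletes this worker's metric
# group from the Push Gateway. All names below are assumptions.
PUSHGATEWAY="http://pushgateway:9091"
JOB="my-worker"
INSTANCE="worker-1"   # typically $(hostname) or the Pod name

# The grouping key (job + instance) selects which metric group to drop:
DELETE_URL="${PUSHGATEWAY}/metrics/job/${JOB}/instance/${INSTANCE}"
echo "DELETE ${DELETE_URL}"
# curl -X DELETE "${DELETE_URL}"  # the actual call; must also run on crashes
```

The hard part is not the API call itself but guaranteeing it runs on every termination path, which is exactly the reliability issue described above.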
Another con of the Push Gateway is that it's yet another single point of failure (SPoF) in the monitoring pipeline, which should be avoided unless strictly required.
Why not run a single webserver exposing all workers' metrics?
This solution is possible thanks to the Redis storage support of the Prometheus PHP client. When configured to store metrics in Redis, each worker tracks its own metrics in Redis, and then a single webserver instance can export all workers' metrics by reading them from Redis.
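As a sketch of this alternative (the Redis host, metric name, and labels are illustrative assumptions), the worker side and the exporter side could look like this, using the Redis storage adapter of the Prometheus PHP client:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

// Both workers and the exporter point at the same Redis instance.
Redis::setDefaultOptions(['host' => 'redis', 'port' => 6379]);
$registry = new CollectorRegistry(new Redis());

// Worker side: track a metric; it is stored in Redis, not in the process.
$counter = $registry->getOrRegisterCounter(
    'worker', 'jobs_processed_total', 'Number of processed jobs', ['queue']
);
$counter->inc(['default']);

// Exporter side (the single webserver): render every worker's metrics
// in the Prometheus text exposition format.
$renderer = new RenderTextFormat();
echo $renderer->render($registry->getMetricFamilySamples());
```

Because the registry is backed by Redis, the exporter does not need to share any memory (or IPC namespace) with the workers.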
Pros:
- No webserver overhead for each single worker
Cons:
- Adds an external dependency to monitoring. When Redis is down we lose web workers' monitoring too, and these are two concerns that shouldn't be coupled together.