Pepr v0.34.0 will ship Informer Metrics. This gist gives you a quick and dirty (non-production) way to try out the Prometheus Operator and scrape Pepr metrics.
To start off, you will need a fresh cluster. I am using k3d.
k3d cluster create
Next, you will need a Pepr Module to test with. Since I want to showcase the informer metrics, I am going to use one of our soak tests, which measures informer performance over time.
Since we are currently on Pepr v0.33.0, I am using a custom Pepr controller image called pepr:dev, which is built from the main branch, so I need to import it into the cluster.
> k3d image import pepr:dev -c k3s-default
INFO[0000] Importing image(s) into cluster 'k3s-default'
INFO[0000] Saving 1 image(s) from runtime...
INFO[0003] Importing images into nodes...
INFO[0003] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20240730090649.tar' into node 'k3d-k3s-default-server-0'...
INFO[0005] Removing the tarball(s) from image volume...
INFO[0006] Removing k3d-tools node...
INFO[0006] Successfully imported image(s)
INFO[0006] Successfully imported 1 image(s) into 1 cluster(s)
The Prometheus Operator GitHub repo shows a way to try out the operator directly inside a cluster; for simplicity we will use this method to get up and going fast.
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
Note - Kubernetes folks encourage using apply over create, but in this case we must go with create to avoid Error from server (Invalid)... is invalid: metadata.annotations: Too long: must have at most 262144 bytes
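If you would rather keep apply semantics, server-side apply is a reasonable workaround, since it does not write the kubectl.kubernetes.io/last-applied-configuration annotation that trips the size limit. This is just a sketch of the idea, not what the upstream README recommends:
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml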
Oddly enough, this installs the operator in the default namespace, but that will suffice for demo purposes.
At this point, if you issue kubectl get po -n default, you should see a single pod consisting of the Operator's controller.
> k get po --show-labels
NAME READY STATUS RESTARTS AGE LABELS
prometheus-operator-764bc4f5fd-ccljv 1/1 Running 0 5m9s app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=0.75.2,pod-template-hash=764bc4f5fd
Now, spoiler alert: the Prometheus Operator's service account does not have enough permissions to scrape what we need. Since this is a demo, I am going to keep it short and assign it the cluster-admin role. (Never a good idea in prod, though.)
> k create clusterrolebinding scrape-admin --serviceaccount=default:prometheus-operator --clusterrole=cluster-admin
clusterrolebinding.rbac.authorization.k8s.io/scrape-admin created
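If you want to avoid cluster-admin even in a throwaway cluster, what the Prometheus server actually needs for scraping is roughly the standard Prometheus RBAC. Here is a hedged sketch (the prometheus-scrape name is mine, and you would bind it to the default:prometheus-operator service account in place of the cluster-admin binding above):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-scrape   # illustrative name, not from the bundle
rules:
  # read access to the objects Prometheus discovers and scrapes
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
  # direct access to /metrics endpoints
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]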
I am going to clone excellent-examples and go to hello-pepr-soak. From here, run the build to generate the deployment artifacts. I am using -i to point at the custom pepr:dev image instead of the latest release.
npx pepr build -i pepr:dev
This will generate the typical manifests, but also being released in v0.34.0 is a feature that creates the ServiceMonitors from the Helm chart to scrape Pepr for you. (The ServiceMonitor tells Prometheus which Service to scrape and how.)
All you have to do is enable the serviceMonitors in the values.yaml:
# values.yaml
admission:
  serviceMonitor:
    enabled: true
watcher:
  serviceMonitor:
    enabled: true
To demonstrate, in the hello-pepr-soak/dist/6233c672-7fca-5603-8e90-771828dd30fa-chart folder, I will issue helm template . to see the deployment artifacts generated.
...
---
# Source: 6233c672-7fca-5603-8e90-771828dd30fa/templates/watcher-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: watcher
  annotations: {}
  labels: {}
spec:
  selector:
    matchLabels:
      pepr.dev/controller: watcher
  namespaceSelector:
    matchNames:
      - pepr-system
  endpoints:
    - targetPort: 3000
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
I actually want to use the Helm chart because I am testing out a PR, so let's install the module. Ignore the error; it will not affect the installation of the Helm chart (I just noticed it as I was writing this post ¯\_(ツ)_/¯).
> helm install --create-namespace pepr . -n pepr-system || true
Error: INSTALLATION FAILED: 1 error occurred:
	* namespaces "pepr-system" already exists
Make sure Pepr is running and the ServiceMonitor is deployed:
> k get po,servicemonitor -n pepr-system
NAME READY STATUS RESTARTS AGE
pod/pepr-6233c672-7fca-5603-8e90-771828dd30fa-watcher-587885c8hmcxb 1/1 Running 0 1m50s
NAME AGE
servicemonitor.monitoring.coreos.com/watcher 1m50s
Now we need to spin up a Prometheus instance.
k apply -f -<<EOF
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: prometheus
spec:
  replicas: 1
  serviceAccountName: prometheus-operator
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
  ruleNamespaceSelector: {}
  ruleSelector: {}
EOF
At this point, when the Prometheus pods come up, we can go ahead and check on our ServiceMonitor.
> k get po
NAME READY STATUS RESTARTS AGE
prometheus-operator-764bc4f5fd-8fcz2 1/1 Running 0 18m
prometheus-prometheus-0 2/2 Running 0 1m10s
Port-forward the prometheus-operated Service to http://localhost:9090.
k port-forward svc/prometheus-operated 9090
Query the graph to check out all the metrics that Pepr exposes:
{container="watcher"}
The new informer metrics are gauge metrics; they report counts over a time window.
# HELP pepr_Cache_Miss Number of cache misses per window
# TYPE pepr_Cache_Miss gauge
pepr_Cache_Miss{window="2024-07-25T11:54:33.897Z"} 18
pepr_Cache_Miss{window="2024-07-25T12:24:34.592Z"} 0
pepr_Cache_Miss{window="2024-07-25T13:14:33.450Z"} 22
pepr_Cache_Miss{window="2024-07-25T13:44:34.234Z"} 19
pepr_Cache_Miss{window="2024-07-25T14:14:34.961Z"} 0
# HELP pepr_resync_Failure_Count Number of retries per count
# TYPE pepr_resync_Failure_Count gauge
pepr_resync_Failure_Count{count="0"} 5
pepr_resync_Failure_Count{count="1"} 4
- A pepr_Cache_Miss is an indication that an element retrieved during a poll was not present in the cache, so the informer had to "manually" update the cache. The first pepr_Cache_Miss window hydrates the cache, so that number should be equivalent to everything you are watching in the cluster. (Do not be alarmed by cache misses; this is normal on a Kubernetes cluster due to the limitations of watch, and this is a well-tested pattern to optimize for failures that are inevitable.)
- A pepr_resync_Failure_Count means that the Operator did not receive a watch event within the lastSeenLimitSeconds and it went into a reconciliation to re-establish the URL. There is a resyncFailureMax setting on the watcher that gives fine-grained control over how often and how fast it reconciles.
We can query just the informer metrics with PromQL, but this view is suboptimal; we are really concerned with the counts and the windows.
pepr_Resync_Failure_Count{container="watcher"} or pepr_Cache_Miss{container="watcher"}
Let's optimize our PromQL query.
sum(pepr_Resync_Failure_Count) by (container, count) or sum(pepr_Cache_Miss) by (container, window)
This view shows the windows between polls, so you can see what the informer missed on a watch, and the number of times it hit each resync failure count. This can be tuned for your environment depending on your needs.
For instance, if you are getting many resync failures at count=2, then you should set resyncFailureMax to 2 to reconcile immediately after 2 failures instead of the default of 5. You can also adjust your cache hydration window, which is set to 30 minutes by default. It all depends on what you are trying to achieve, but there are tradeoffs in terms of network pressure (which is what causes the missed events in the first place).
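As a rough way to pick a resyncFailureMax, you can look at what share of resyncs needed two or more retries. This query is only a sketch; the regex over the count label is illustrative and the metric-name casing follows the queries used earlier in this post:
sum(pepr_Resync_Failure_Count{container="watcher", count=~"[2-9]"}) / sum(pepr_Resync_Failure_Count{container="watcher"})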
A smart way to act on these metrics is to configure Prometheus rules:
k apply -f -<<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: watch-failure-alert
  namespace: pepr-system
spec:
  groups:
    - name: pepr-resync-failure-rules
      rules:
        - alert: watcher_instance_down
          annotations:
            description: The pods churned
            summary: Watcher instance down
          expr: count(up{container="watcher"}) != 0
          labels:
            severity: page
        - alert: PeprResyncFailure
          expr: pepr_Resync_Failure_Count{container="watcher", count="2"} == 2
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pepr Resync Failure Count Alert"
            description: "The pepr_Resync_Failure_Count metric for container 'watcher' with count '2' has reached the value of 2."
EOF
This PrometheusRule alerts on watch failures: it will page when the watcher goes down, and it will send a critical alert when the pepr_Resync_Failure_Count series at count 2 reaches a value of 2.
While port-forwarding, look at the rules.
It may take a couple of minutes for them to come up.
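If you prefer the terminal to the Prometheus UI, the same information is available from the HTTP API. A quick sketch, assuming curl and jq are installed on your machine:
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name, state}'
Once something is firing, /api/v1/alerts shows the active alerts as well.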
Next, let's take down the watcher instance and catch an alert firing. Scale down the watcher deployment and wait a few moments before looking back at the alerts.
> k scale deploy -n pepr-system -l pepr.dev/controller=watcher --replicas=0
deployment.apps/pepr-6233c672-7fca-5603-8e90-771828dd30fa-watcher scaled
The interval at which Prometheus scrapes metrics is adjustable through the Prometheus CR's spec.scrapeInterval, but be careful: a shorter interval also causes more network pressure, so at the end of the day it is all tradeoffs depending on the size of your cluster and what you need.
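For example, here is a rough sketch of tightening the scrape interval on the demo Prometheus CR created earlier (15s is an arbitrary illustrative value):
k patch prometheus prometheus --type merge -p '{"spec":{"scrapeInterval":"15s"}}'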
Shout out to Rob F for adding ServiceMonitors to the Pepr Helm chart. Remember, this feature is not yet released but is set to come out next Wednesday, August 7.





