The Problem
The Plan
Set up a livenessProbe command that checks the pod's CPU usage. If usage stays
below a threshold value for a threshold time, kill the pod.
Status
I've tested the shell script in a pod inside our k8s cluster, and it reports sensible values.
The measurement is fairly coarse, but should be more than good enough for our purposes.
The 0.01 is the CPU usage threshold, as a percentage. If the usage over 5 seconds is below it, the script exits with status 1 (error); if it is above, the script exits with status 0 (success).
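A minimal sketch of such a check, not necessarily the script that was tested. It assumes cgroup v1 with the cpuacct controller mounted at /sys/fs/cgroup/cpuacct, and it assumes the percentage is measured relative to one core. The function names and the USAGE_FILE variable are hypothetical.

```shell
#!/bin/sh
# Sketch of a CPU-activity check for a livenessProbe exec command.
# Assumption: cgroup v1, cpuacct controller mounted at the path below.
USAGE_FILE="${USAGE_FILE:-/sys/fs/cgroup/cpuacct/cpuacct.usage}"

# usage_percent START_NS END_NS INTERVAL_S
# Prints CPU usage over the interval as a percentage of one core.
usage_percent() {
    awk -v s="$1" -v e="$2" -v t="$3" \
        'BEGIN { printf "%.4f\n", (e - s) / (t * 1e9) * 100 }'
}

# above_threshold PERCENT THRESHOLD
# Returns 0 (success) if PERCENT >= THRESHOLD, 1 (error) otherwise.
above_threshold() {
    awk -v p="$1" -v th="$2" 'BEGIN { exit !(p >= th) }'
}

# check_cpu THRESHOLD INTERVAL_S
# Samples the cumulative counter twice, INTERVAL_S seconds apart,
# and compares the resulting percentage with THRESHOLD.
check_cpu() {
    start=$(cat "$USAGE_FILE")
    sleep "$2"
    end=$(cat "$USAGE_FILE")
    above_threshold "$(usage_percent "$start" "$end" "$2")" "$1"
}
```

With a final line such as `check_cpu "$1" "$2"`, the script could be invoked with the arguments 0.01 and 5 to match the values described above.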
This check runs every 30s, starting 30s after the pod starts, and must fail 3 times in a row for the pod to be killed.
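These timings map onto the standard probe fields roughly as follows. This is a sketch only: the script path and its two arguments are hypothetical, and the timeoutSeconds value is my assumption, added because the default probe timeout is 1 second and the check itself samples for 5 seconds.

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - /scripts/check-cpu.sh   # hypothetical path to the check script
      - "0.01"                  # CPU usage threshold (percent)
      - "5"                     # sampling interval in seconds
  initialDelaySeconds: 30       # first check 30s after start
  periodSeconds: 30             # then every 30s
  failureThreshold: 3           # 3 consecutive failures trigger the kill
  timeoutSeconds: 10            # assumption: must exceed the 5s sampling interval
```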
Notes
See also:
Kind of based on https://unix.stackexchange.com/questions/450748/calculating-cpu-usage-of-a-cgroup-over-a-period-of-time
The value read from cpuacct.usage is the cgroup's cumulative CPU time in nanoseconds, so usage over a period is the difference between two readings.