A few commands to run to test what triggers PLEG.
When using Docker, all container statuses are compared and it needs to happen within 3 minutes. Else the following log will be shown:
Message:PLEG is not healthy: pleg was last seen active
To test for Docker responsiveness (or possibly a hanging container and/or pod stuck in Terminating), run the following command:
time docker ps --format "{{.ID}}\t{{.Names}}" | while read id name; do echo -n "${name} (${id})"; RESP=$(/usr/bin/time -f"%e" docker inspect $id 2>&1 > /dev/null); echo " -> ${RESP}"; done
This will print a per container runtime of docker inspect
on that container, plus the total time it took to complete.
If Docker is already slow, you can try to lookup child processes of kubelet that might hang/got stuck:
ps -A -o pid,ppid,comm | grep $(pidof kubelet)
The kubelet reports metrics on pleg (amongst others).
Add /opt/rke
to the certificate files if you are running RancherOS or Container Linux.
curl -sLk --cacert /etc/kubernetes/ssl/kube-ca.pem --cert /etc/kubernetes/ssl/kube-node.pem --key /etc/kubernetes/ssl/kube-node-key.pem https://127.0.0.1:10250/metrics | grep pleg
# HELP kubelet_pleg_relist_interval_microseconds Interval in microseconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_microseconds summary
kubelet_pleg_relist_interval_microseconds{quantile="0.5"} 1.079013e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.9"} 1.116376e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.99"} 1.447572e+06
kubelet_pleg_relist_interval_microseconds_sum 2.4932124052e+10
kubelet_pleg_relist_interval_microseconds_count 24601
# HELP kubelet_pleg_relist_latency_microseconds Latency in microseconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_latency_microseconds summary
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 78632
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 116205
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 446852
kubelet_pleg_relist_latency_microseconds_sum 3.2610447e+08
kubelet_pleg_relist_latency_microseconds_count 24601
The numbers shown are from a healthy node, with response times under or just above a second. These will show way higher on an overloaded system, possibly going into 5.xxxxxx or 6.xxxxxx, summing up on to 3+ minutes to make PLEG go unhealthy.