- Detect permanent node problems and set Node Conditions using the Node Problem Detector.
- Configure Draino to cordon and drain nodes when they exhibit the NPD's KernelDeadlock condition, or a variant of KernelDeadlock we call VolumeTaskHung.
- Let the Cluster Autoscaler scale down underutilised nodes, including the nodes Draino has drained.
Note: Draino logs nothing and exports no metrics until it actually drains a node.
Once the Descheduler supports descheduling pods based on taints, Draino could be replaced by the Descheduler running in combination with the scheduler's TaintNodesByCondition functionality.
See kubernetes-sigs/descheduler#131
On EKS (and other platforms running recent kernels), node-problem-detector must use the k8s.gcr.io/node-problem-detector:v0.6.2 image; see kubernetes/node-problem-detector#184.
| NodeCondition | Duration | Source | Draino Appropriate |
|---|---|---|---|
| KernelDeadlock | Permanent | node-problem-detector | ✅ |
| ReadonlyFilesystem | Permanent | node-problem-detector | ✅ |
| OutOfDisk | Permanent? | ❓ | ✅ |
| MemoryPressure | Temporary? | ❓ | ❌ ❓ |
| DiskPressure | Temporary? | ❓ | ❌ ❓ |
| PIDPressure | Temporary? | ❓ | ❌ ❓ |
| Ready | Temporary 😅 | N/A | ❌ |
I think the kubelet should try to relieve the MemoryPressure, DiskPressure, and PIDPressure conditions itself (see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/), though that depends on the eviction policy.
The idea behind permanent node conditions was that they are permanent. Put another way, if a condition could clear at some future time, I would not expect NPD to set a node condition in the first place, but rather to emit a node event indicating a temporary problem.
Draino, for example, assumes that once a node condition is set by NPD there are no possible remediations other than terminating and replacing the node.
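As a rough illustration of that assumption, the sketch below treats any configured condition whose status is "True" as grounds for draining. It is not Draino's actual code: the function name `conditions_triggering_drain` and the `DRAIN_CONDITIONS` set are hypothetical, and the set simply mirrors the rows marked ✅ in the table above. The input is assumed to be shaped like the `type`/`status` entries of a Node's `status.conditions`.

```python
# Hypothetical allow-list: the condition types marked as appropriate for
# Draino in the table above. OutOfDisk is included even though its
# permanence and source are marked as uncertain there.
DRAIN_CONDITIONS = frozenset({"KernelDeadlock", "ReadonlyFilesystem", "OutOfDisk"})


def conditions_triggering_drain(node_conditions, configured=DRAIN_CONDITIONS):
    """Return the configured condition types that are currently set on a node.

    node_conditions is assumed to be a list of dicts with "type" and
    "status" keys, where status is "True", "False", or "Unknown".
    """
    return sorted(
        c["type"]
        for c in node_conditions
        if c.get("status") == "True" and c.get("type") in configured
    )


example = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "KernelDeadlock", "status": "True"},
    {"type": "ReadonlyFilesystem", "status": "False"},
]

# Only KernelDeadlock is both configured and set, so it alone is reported.
print(conditions_triggering_drain(example))
```

Under this model a non-empty result means "cordon and drain"; there is deliberately no branch for waiting to see whether the condition clears, which is exactly why temporary conditions such as MemoryPressure are left out of the allow-list.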
The Problem API section of the README distinguishes between node conditions and node events in two ways. It mentions that conditions are for permanent issues and events are for temporary issues, but also says that conditions are for problems that make the node completely unavailable while events are for problems that have a limited impact on pods. This indicates that the decision between using a condition and an event is based not only on the permanence of the problem but also on its severity. It's quite possible that a problem could make a node completely unusable but not be permanent. Similarly, it's possible that a problem could be permanent but not make the node completely unusable.
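The two axes can be laid out as a small decision function. This is only a sketch of the reading given above, not anything NPD implements: the `Problem` class, the `suggested_report` function, and the example names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical description of a node problem along the two axes."""
    name: str
    permanent: bool          # will the problem persist until the node is replaced?
    node_unavailable: bool   # does it make the node completely unavailable?


def suggested_report(problem):
    """Return how the two README criteria would classify the problem."""
    if problem.permanent and problem.node_unavailable:
        return "condition"   # both criteria point to a node condition
    if not problem.permanent and not problem.node_unavailable:
        return "event"       # both criteria point to a node event
    return "unclear"         # the two criteria disagree


for p in (
    Problem("permanent-and-fatal", True, True),
    Problem("temporary-and-limited", False, False),
    Problem("temporary-but-fatal", False, True),
    Problem("permanent-but-limited", True, False),
):
    print(p.name, "->", suggested_report(p))
```

The last two cases are the ones discussed above: the README's wording does not say which criterion wins when permanence and severity disagree.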
No metrics are exposed until Draino actually encounters a configured node condition and acts on it, but the metrics endpoint should at least return a 200 OK:
kubectl -n kube-system exec -it $(kubectl get pod -n kube-system -l component=draino -o jsonpath="{.items[0].metadata.name}") -- apk add curl
kubectl -n kube-system exec -it $(kubectl get pod -n kube-system -l component=draino -o jsonpath="{.items[0].metadata.name}") -- curl -v http://localhost:10002/metrics