This was tested using a default OpenShift 4.11 IPI deployment to AWS. The worker nodes had 16GiB of memory.
First, we'll need a namespace to use for the experiments below.

```bash
oc new-project alloc
```
Before starting, we need to configure eviction thresholds.
Taken from here, this will create a kubelet configuration for both soft and hard eviction of Pods from the node(s).
First, we apply a label to the MCP so that we can use it as a selector with the `KubeletConfig`.
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    custom-kubelet: memory-eviction-threshold
  name: worker
```
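Alternatively, instead of editing the MachineConfigPool manifest, the same label can be applied in place with `oc label` (assuming the default pool name `worker`):

```bash
# Add the label the KubeletConfig selector will match
oc label machineconfigpool worker custom-kubelet=memory-eviction-threshold
```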
Once that's applied, configure the threshold.
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: memory-eviction-threshold
  kubeletConfig:
    evictionSoft:
      memory.available: "4Gi"
    evictionSoftGracePeriod:
      memory.available: "1m30s"
    evictionHard:
      memory.available: "1Gi"
    evictionPressureTransitionPeriod: 0s
```
Applying this configuration will trigger a reboot of the affected nodes.
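The rollout can be followed with standard commands while the workers reboot one at a time, for example:

```bash
# Wait for the worker pool to finish updating (UPDATED returns to True)
oc get machineconfigpool worker -w

# Confirm the workers return to Ready
oc get nodes -l node-role.kubernetes.io/worker
```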
Next, we'll create a pod that consumes memory over time, simulating a memory leak (or simply an app that rarely frees memory), in order to cause the oom_killer to take action. The manifest below was inspired by this blog post and uses the container image built from this repo, which also has the options described here. The value specified with `-x 15360` is the maximum amount of memory the app will consume, roughly 15 GiB. The nodes in a default AWS IPI deployment have 16 GiB of memory in total, so this will exceed the available memory and trigger an OOM kill.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-hard
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "15360", "-e", "1000", "-f", "1", "-p", "5"]
```
Another example of creating a pod to force an OOM kill is here.
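For comparison, here's a minimal sketch of the same idea using a generic stress image; the `polinux/stress` image and its arguments are an illustrative assumption, not the example linked above:

```yaml
# Hypothetical alternative: allocate ~15 GiB with the stress tool to provoke the oom_killer
apiVersion: v1
kind: Pod
metadata:
  name: oom-stress
  namespace: alloc
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "15G", "--vm-hang", "1"]
```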
Once the `alloc-hard` pod is created, run `oc get pod -o wide` to see which node it has been assigned to, then `oc adm top node <name>` to see that node's memory utilization in real time. The pod also reports its utilization in its logs; follow along with `oc logs -f alloc-hard`.
After a short time, depending on the amount of memory on the node, the pod will be terminated as a result of memory pressure. We can see when this happens via the pod events and the node logs.
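For example, both of these are standard `oc` commands (substitute the node name reported by `oc get pod -o wide`):

```bash
# Recent events in the namespace, including the OOM kill / eviction
oc get events -n alloc --sort-by=.lastTimestamp

# Kubelet logs from the node the pod was running on
oc adm node-logs <name> -u kubelet | grep -i -e oom -e evict
```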
Clean up with `oc delete pod alloc-hard`.
This time, we'll be more gentle and consume enough resources to push the node above the soft eviction threshold, but not to the point of triggering an OOM kill.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-soft
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "12288", "-e", "1000", "-f", "1", "-p", "5"]
```
As above, we can see via the pod logs, events, and node status (`oc describe node <name>`) that the node enters a memory pressure scenario. This will request that the pod be terminated gracefully rather than forcefully.
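A couple of ways to watch the node's `MemoryPressure` condition flip (plain `oc` invocations; substitute the node name):

```bash
# Status of the MemoryPressure condition for the node running the pod
oc get node <name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'

# Or describe the node and read the Conditions section
oc describe node <name> | grep -A 10 'Conditions:'
```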
Clean up with `oc delete pod alloc-soft`.
Follow the docs. We want to use the `LifecycleAndUtilization` profile, and we also want to specifically include our `alloc` namespace; a sketch of the resulting configuration follows the list below.
- Leave the thresholds at the default, 20% underutilized and 50% overutilized.
- Set the descheduling interval to 60 seconds.
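Putting those settings together, the `KubeDescheduler` resource ends up looking roughly like this. The field names follow the OpenShift descheduler operator documentation, but treat it as a sketch and double-check `profileCustomizations` against your cluster version:

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 60   # run the descheduler every 60 seconds
  profiles:
  - LifecycleAndUtilization         # includes the LowNodeUtilization strategy
  profileCustomizations:
    namespaces:
      included:                     # limit descheduling to our test namespace
      - alloc
```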
With three worker nodes in the cluster, we want to create a scenario where one of the nodes is overutilized, meaning it has more than 50% of its resources in use, but stays below the configured soft eviction threshold. Using the default AWS IPI config along with the thresholds configured above, this means we want memory utilization between 50% and 75%, or 8-12 GiB.
Before creating the pod, start a terminal that follows the logs of the descheduler pod in the `openshift-kube-descheduler-operator` namespace.
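The pod name is generated, so look it up first; these are plain `oc` commands:

```bash
# Find the descheduler pod
oc get pods -n openshift-kube-descheduler-operator

# Follow its logs while the experiment runs
oc logs -f -n openshift-kube-descheduler-operator <descheduler-pod-name>
```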
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-desched
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "9216", "-e", "1000", "-f", "1", "-p", "5"]
```
The above will create a pod that, after a short time, consumes 9 GiB of memory. Monitor the node's resource utilization using `oc adm top node <name>` (determine which node with `oc get pod -o wide`).
Assuming at least one of the other worker nodes is below 20% utilization, we should see the descheduler take action within 60 seconds and evict the pod above.
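Besides the descheduler logs, the eviction shows up as an event on the pod; watching events in the namespace is a quick way to catch it:

```bash
# Watch for the eviction event against alloc-desched
oc get events -n alloc -w
```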
Clean up with `oc delete pod alloc-desched`.
Descheduling with the `LowNodeUtilization` strategy is only effective when there are nodes below the underutilization threshold. We can show that the descheduler will not take action by raising memory utilization on all of the worker nodes above 20%.
This time we'll create a deployment with a pod anti-affinity rule to spread the pods across all of the workers.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alloc-low
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alloc-low
  template:
    metadata:
      name: alloc-low
      labels:
        app: alloc-low   # required so the Deployment selector and the anti-affinity rule match these pods
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - alloc-low
            topologyKey: kubernetes.io/hostname
      containers:
      - name: alloc-mem-2
        image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
        imagePullPolicy: Always
        command: ["dotnet"]
        args: ["AllocMem.dll", "-m", "100", "-x", "4096", "-e", "1000", "-f", "1", "-p", "5"]
```
Now we can create a pod that pushes one node above the descheduler's overutilization threshold and see what happens.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-desched
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "4096", "-e", "1000", "-f", "1", "-p", "5"]
```
Using `oc adm top node` we can see that at least one of the nodes exceeds the high utilization threshold. But the descheduler won't take any action, because all of the worker nodes are above the low utilization threshold, so none of them is considered underutilized.
- What happens when a pod grows over time and consistently pushes the node over the eviction threshold?
  - It will bounce around, being terminated on host after host, until it lands on one that has enough resources for it and/or its priority is increased so that other pods are evicted instead.
- The combination of these two technologies - the descheduler with the `LowNodeUtilization` strategy, plus eviction - offers the ability to do some resource balancing across the available nodes. However, we don't want to overthink it and micromanage either. As configured above, no action is taken when the nodes are utilized between 20% and 75%. The specific thresholds will be very dependent on your apps, how they behave (i.e. resource utilization), the cluster architecture, and other factors - not the least of which is the cluster autoscaling configuration.