This was tested using a default OpenShift 4.11 IPI deployment to AWS. The worker nodes had 16GiB of memory.
First, we'll need a namespace to use for the experiments below.

```bash
oc new-project alloc
```
Before starting, we need to configure eviction thresholds.
Taken from here, this will create a kubelet configuration for both soft and hard eviction of Pods from the node(s).
First, we apply a label to the MCP so that we can use it as a selector with the `KubeletConfig`.
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    custom-kubelet: memory-eviction-threshold
  name: worker
```
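Alternatively, instead of editing the MachineConfigPool manifest, the same label can be applied in place with `oc label` (assuming the default pool name `worker`):

```bash
# Add the label the KubeletConfig selector will match
oc label machineconfigpool worker custom-kubelet=memory-eviction-threshold
```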
Once that's applied, configure the threshold.
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: memory-eviction-threshold
  kubeletConfig:
    evictionSoft:
      memory.available: "4Gi"
    evictionSoftGracePeriod:
      memory.available: "1m30s"
    evictionHard:
      memory.available: "1Gi"
    evictionPressureTransitionPeriod: 0s
```
Applying this configuration will trigger a reboot of the affected nodes.
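The rollout can be followed with standard commands while the workers reboot one at a time, for example:

```bash
# Wait for the worker pool to finish updating (UPDATED returns to True)
oc get machineconfigpool worker -w

# Confirm the workers return to Ready
oc get nodes -l node-role.kubernetes.io/worker
```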
Next, we'll create a pod that consumes memory over time, simulating a memory leak (or simply an app that rarely frees memory), in order to cause the oom_killer to take action. The manifest below was inspired by this blog post and uses the container image built from this repo, which also has the options described here. The value specified with `-x 15360` is the maximum amount of memory the app will consume, roughly 15 GiB. The nodes in a default AWS IPI deployment have 16 GiB of memory in total, so this will exceed the available memory and trigger an OOM kill.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-hard
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "15360", "-e", "1000", "-f", "1", "-p", "5"]
```
Another example of creating a pod to force an OOM kill is here.
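For comparison, here's a minimal sketch of the same idea using a generic stress image; the `polinux/stress` image and its arguments are an illustrative assumption, not the example linked above:

```yaml
# Hypothetical alternative: allocate ~15 GiB with the stress tool to provoke the oom_killer
apiVersion: v1
kind: Pod
metadata:
  name: oom-stress
  namespace: alloc
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "15G", "--vm-hang", "1"]
```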
Once the `alloc-hard` pod is created, run `oc get pod -o wide` to see which node it has been assigned to, then `oc adm top node <name>` to see that node's memory utilization in real time. The pod also reports its utilization in its logs; follow along with `oc logs -f alloc-hard`.
After a short time, depending on the amount of memory on the node, the pod will be terminated as a result of memory pressure. We can see when this happens via the pod events and the node logs.
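For example, both of these are standard `oc` commands (substitute the node name reported by `oc get pod -o wide`):

```bash
# Recent events in the namespace, including the OOM kill / eviction
oc get events -n alloc --sort-by=.lastTimestamp

# Kubelet logs from the node the pod was running on
oc adm node-logs <name> -u kubelet | grep -i -e oom -e evict
```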
Clean up with `oc delete pod alloc-hard`.
This time, we'll be more gentle and consume enough resources to push the node above the soft eviction threshold, but not to the point of triggering an OOM kill.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-soft
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "12288", "-e", "1000", "-f", "1", "-p", "5"]
```
As above, we can see via the pod logs, events, and node status (`oc describe node <name>`) that the node enters a memory pressure scenario. This will request that the pod be terminated gracefully rather than forcefully.
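A couple of ways to watch the node's `MemoryPressure` condition flip (plain `oc` invocations; substitute the node name):

```bash
# Status of the MemoryPressure condition for the node running the pod
oc get node <name> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'

# Or describe the node and read the Conditions section
oc describe node <name> | grep -A 10 'Conditions:'
```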
Clean up with `oc delete pod alloc-soft`.
Follow the docs. We want to use the `LifecycleAndUtilization` profile, and we also want to specifically include our `alloc` namespace; a sketch of the resulting configuration follows the list below.
- Leave the thresholds at the default, 20% underutilized and 50% overutilized.
- Set the descheduling interval to 60 seconds.
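Putting those settings together, the `KubeDescheduler` resource ends up looking roughly like this. The field names follow the OpenShift descheduler operator documentation, but treat it as a sketch and double-check `profileCustomizations` against your cluster version:

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 60   # run the descheduler every 60 seconds
  profiles:
  - LifecycleAndUtilization         # includes the LowNodeUtilization strategy
  profileCustomizations:
    namespaces:
      included:                     # limit descheduling to our test namespace
      - alloc
```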
With three worker nodes in the cluster, we want to create a scenario where one of the nodes is overutilized, meaning it has more than 50% of its resources in use, but stays below the configured soft eviction threshold. Using the default AWS IPI config along with the thresholds configured above, this means we want memory utilization between 50% and 75%, or 8-12 GiB.
Before creating the pod, start a terminal that follows the logs of the descheduler pod in the `openshift-kube-descheduler-operator` namespace.
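The pod name is generated, so look it up first; these are plain `oc` commands:

```bash
# Find the descheduler pod
oc get pods -n openshift-kube-descheduler-operator

# Follow its logs while the experiment runs
oc logs -f -n openshift-kube-descheduler-operator <descheduler-pod-name>
```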
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-desched
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "9216", "-e", "1000", "-f", "1", "-p", "5"]
```
The above will create a pod that, after a short time, consumes 9 GiB of memory. Monitor the node's resource utilization using `oc adm top node <name>` (determine which node with `oc get pod -o wide`).
Assuming at least one of the other worker nodes is below 20% utilization, we should see the descheduler take action within 60 seconds and evict the pod above.
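Besides the descheduler logs, the eviction shows up as an event on the pod; watching events in the namespace is a quick way to catch it:

```bash
# Watch for the eviction event against alloc-desched
oc get events -n alloc -w
```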
Clean up with `oc delete pod alloc-desched`.
Descheduling with the `LowNodeUtilization` strategy is only effective when there are nodes below the underutilization threshold. We can show that the descheduler will not take action by raising memory utilization on all of the worker nodes above 20%.
This time we'll create a deployment with a pod anti-affinity rule to spread the pods across all of the workers.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alloc-low
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alloc-low
  template:
    metadata:
      name: alloc-low
      labels:
        app: alloc-low   # required so the Deployment selector and the anti-affinity rule match these pods
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - alloc-low
            topologyKey: kubernetes.io/hostname
      containers:
      - name: alloc-mem-2
        image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
        imagePullPolicy: Always
        command: ["dotnet"]
        args: ["AllocMem.dll", "-m", "100", "-x", "4096", "-e", "1000", "-f", "1", "-p", "5"]
```
Now we can create a pod that pushes one node above the descheduler's overutilization threshold and see what happens.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alloc-desched
  namespace: alloc
spec:
  containers:
  - name: alloc-mem-2
    image: luckerby/alloc-mem:net5-20GiB-HeapHardLimit
    imagePullPolicy: Always
    command: ["dotnet"]
    args: ["AllocMem.dll", "-m", "100", "-x", "4096", "-e", "1000", "-f", "1", "-p", "5"]
```
Using `oc adm top node` we can see that at least one of the nodes exceeds the high utilization threshold. But the descheduler won't take any action, because all of the worker nodes are above the low utilization threshold, so none of them is considered underutilized.
- What happens when a pod grows over time and consistently pushes the node over the eviction threshold?
  - It will bounce around, being terminated on host after host, until it lands on one that has enough resources for it and/or its priority is increased so that other pods are evicted instead.
- The combination of these two technologies - the descheduler with the `LowNodeUtilization` strategy, plus eviction - offers the ability to do some resource balancing across the available nodes. However, we don't want to overthink it and micromanage either. As configured above, no action is taken when the nodes are utilized between 20% and 75%. The specific thresholds will be very dependent on your apps, how they behave (i.e. resource utilization), the cluster architecture, and other factors - not the least of which is the cluster autoscaling configuration.