Host cgroup management

Introduction

Kata Containers is an OCI-compatible runtime, but the extra isolation layer (a VM) sometimes makes full compatibility difficult.

One feature that is hard to support transparently is host cgroups.

The runtime caller provides the cgroup requirements in config.json:

cgroupsPath (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
hierarchy for containers or to run a new process in an existing container
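
As a brief illustration (not Kata's actual code), a minimal sketch of how a runtime could read this field using the OCI runtime-spec Go types; the bundle directory used in main is a made-up example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// loadCgroupsPath reads the OCI config.json from a bundle directory and
// returns the cgroupsPath requested by the caller (empty if not set).
func loadCgroupsPath(bundleDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(bundleDir, "config.json"))
	if err != nil {
		return "", err
	}
	var spec specs.Spec
	if err := json.Unmarshal(data, &spec); err != nil {
		return "", err
	}
	if spec.Linux == nil {
		return "", nil
	}
	return spec.Linux.CgroupsPath, nil
}

func main() {
	// Hypothetical bundle directory, for illustration only.
	path, err := loadCgroupsPath("/run/containers/mybundle")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("requested cgroupsPath:", path)
}
```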

The path is also used to get stats about resource usage in a cgroup hierarchy on the host, so system administrators can create multiple containers with a hierarchy such as the one below.

Consider the following case, where the cgroup subsystem is memory:

  • Pod 1

    • Container 1: cgroupsPath=/kubepods/pod1/container1

    • Container 2: cgroupsPath=/kubepods/pod1/container2

  • Pod 2

    • Container 1: cgroupsPath=/kubepods/pod2/container1

    • Container 2: cgroupsPath=/kubepods/pod2/container2

Kata Containers provides two options to handle cgroups on the host. This document describes how each of them is handled.

SandboxCgroupOnly enabled (optional, pod overhead oriented)

+----------------------------------------------------------+
|    +---------------------------------------------------+ |
|    |   +---------------------------------------------+ | |
|    |   |   +--------------------------------------+  | | |
|    |   |   | shim1, shim2 , hypervisor, proxy     |  | | |
|    |   |   |                                      |  | | |
|    |   |   | kata-sandbox-<id>                    |  | | |
|    |   |   +--------------------------------------+  | | |
|    |   |                                             | | |
|    |   |Pod 1                                        | | |
|    |   +---------------------------------------------+ | |
|    |                                                   | |
|    |   +---------------------------------------------+ | |
|    |   |   +--------------------------------------+  | | |
|    |   |   | shim1, shim2 , hypervisor, proxy     |  | | |
|    |   |   |                                      |  | | |
|    |   |   | kata-sandbox-<id>                    |  | | |
|    |   |   +--------------------------------------+  | | |  
|    |   |Pod 2                                        | | |
|    |   +---------------------------------------------+ | |
|    |kubepods                                           | |
|    +---------------------------------------------------+ |
|                                                          |
|Node                                                      |
+----------------------------------------------------------+

What does this method do?

  1. Given a PodSandbox container creation, let:
     podCgroup=Parent(container.CgroupsPath)
     KataSandboxCgroup=${podCgroup}/kata-sandbox-<PodSandbox-id>

  2. Create the KataSandboxCgroup cgroup.

  3. Join the KataSandboxCgroup.

Any process created by the runtime will be created in the KataSandboxCgroup. The runtime will not limit this cgroup on the host, but the caller is free to set appropriate limits (considering the Kata overhead).
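
For illustration only, here is a minimal sketch of these steps that manipulates a cgroup v1 filesystem directly; the mount point, function name, and example values are assumptions, and the real runtime uses a cgroups library rather than raw file writes:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical cgroup v1 mount point; the real location is distro dependent.
const cgroupRoot = "/sys/fs/cgroup"

// createAndJoinSandboxCgroup derives the pod cgroup from the container's
// cgroupsPath, creates an unrestricted kata-sandbox-<id> child under it for
// one subsystem, and moves the current process into it.
func createAndJoinSandboxCgroup(subsystem, cgroupsPath, sandboxID string) error {
	// e.g. /kubepods/pod1/container1 -> /kubepods/pod1
	//      /docker/container-id      -> /docker
	podCgroup := filepath.Dir(cgroupsPath)
	kataSandboxCgroup := filepath.Join(podCgroup, "kata-sandbox-"+sandboxID)

	hostPath := filepath.Join(cgroupRoot, subsystem, kataSandboxCgroup)
	if err := os.MkdirAll(hostPath, 0o755); err != nil {
		return err
	}
	// No limits are written: the new cgroup inherits the pod cgroup limits.
	// Joining is done by writing our PID to cgroup.procs.
	pid := fmt.Sprintf("%d", os.Getpid())
	return os.WriteFile(filepath.Join(hostPath, "cgroup.procs"), []byte(pid), 0o644)
}

func main() {
	if err := createAndJoinSandboxCgroup("memory", "/kubepods/pod1/container1", "abc123"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```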

In the example above, the pod cgroups are /kubepods/pod1 and /kubepods/pod2.

Kata will create an unrestricted cgroup for all the host subsystems under the pod cgroup.

Why create a sub-cgroup instead of putting everything in the pod cgroup?

When a workload runs under Docker, where there is no notion of a Pod, Docker does not create a Pod cgroup directory; the runtime receives a request to create a PodSandbox with a cgroup path like /docker/container-id. The runtime then assumes that the pod cgroup is /docker. If multiple containers are created, all of their Kata assets would end up directly under /docker. To provide better organization, the Kata runtime creates a sub-cgroup under the pod cgroup. That cgroup is not limited, so it inherits the parent cgroup limits, and host implementers can still apply limits to, or monitor, the cgroup on the host.

Why?

Using one cgroup for the whole pod helps to account for and limit all Kata assets.

  • Accounting

If the Kata caller wants to know the resource usage on the host, it can get stats from the pod cgroup. All cgroup stats in the hierarchy will include the Kata overhead, which makes it possible to get stats at different levels. Based on the example above, the following stats can be queried (see the sketch after this list):

  • /kubepods: stats for all pods created under /kubepods

  • /kubepods/pod${n}: stats for a single pod created under /kubepods/pod${n}

  • Note that you cannot get per-container cgroup information from the filesystem hierarchy: the cgroup path /kubepods/podX/containerX will not be created on the host. To query per-container stats, the command kata-runtime stats <container-id> should be used.

  • Resource isolation: the Kata runtime caller is responsible for managing the pod sandbox cgroup; the runtime will not modify it. If the caller limits CPU, memory, or any other subsystem, all Kata components will be limited by that cgroup. This helps prevent the Kata components from using resources that are supposed to be available to other containers or system programs.
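
To make the accounting point above concrete, here is a rough sketch (not part of Kata) that reads the memory usage of a pod cgroup straight from an assumed cgroup v1 mount point; because the kata-sandbox-<id> cgroup lives under the pod cgroup, the value already includes the Kata overhead:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Hypothetical cgroup v1 memory controller mount point.
const memoryCgroupRoot = "/sys/fs/cgroup/memory"

// podMemoryUsage returns the current memory usage, in bytes, of a pod cgroup
// such as /kubepods/pod1, including everything created underneath it.
func podMemoryUsage(podCgroup string) (uint64, error) {
	path := filepath.Join(memoryCgroupRoot, podCgroup, "memory.usage_in_bytes")
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	usage, err := podMemoryUsage("/kubepods/pod1")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("pod1 memory usage (including Kata overhead): %d bytes\n", usage)
}
```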

Pros and Cons

  • Pros
    • Resource isolation
    • Host cgroup stats (including all Kata components)
  • Cons
    • IO performance degradation: when the caller applies a very tight CPU limit to the parent cgroup, IO performance may be degraded. The caller should consider an extra overhead for network and IO operations. Note that CPU-only operations will not be degraded.
    • The per-container cgroup is not created on the host. If a tool gathers stats through the cgroup filesystem, the container cgroup path will not exist; it is recommended to use kata-runtime events to collect container stats.

SandboxCgroupOnly disabled (default, legacy)

+----------------------------------------------------------+
|    +---------------------------------------------------+ |
|    |   +---------------------------------------------+ | |
|    |   |   +--------------------------------------+  | | |
|    |   |   |Container 1       |-|Container 2      |  | | |
|    |   |   |                  |-|                 |  | | |
|    |   |   | Shim+container1  |-| Shim+container2 |  | | |
|    |   |   +--------------------------------------+  | | |
|    |   |                                             | | |
|    |   |Pod 1                                        | | |
|    |   +---------------------------------------------+ | |
|    |                                                   | |
|    |   +---------------------------------------------+ | |
|    |   |   +--------------------------------------+  | | |
|    |   |   |Container 1       |-|Container 2      |  | | |
|    |   |   |                  |-|                 |  | | |
|    |   |   | Shim+container1  |-| Shim+container2 |  | | |
|    |   |   +--------------------------------------+  | | |
|    |   |                                             | | |
|    |   |Pod 2                                        | | |
|    |   +---------------------------------------------+ | |
|    |kubepods                                           | |
|    +---------------------------------------------------+ |
|    +---------------------------------------------------+ |
|    |  Hypervisor                                       | |
|    |Kata                                               | |
|    +---------------------------------------------------+ |
|                                                          |
|Node                                                      |
+----------------------------------------------------------+

What does this method do?

  1. Given a container creation, let:
     containerCgroupHost=container.CgroupsPath

  2. Rename the containerCgroupHost path to add a kata- prefix. For example:
     /kubepods/pod1/container1
     becomes
     /kubepods/pod1/kata-container1

  3. Let PodCgroupPath=PodSandboxContainerCgroup, where PodSandboxContainerCgroup is the cgroup of the container of type PodSandbox.

  4. Limit PodCgroupPath with the sum of all the container limits in the sandbox.

  5. Move only the vCPU threads of the hypervisor to PodCgroupPath.

  6. For each container, move its kata-shim to its own containerCgroupHost.

Note that the Kata runtime will not add the whole hypervisor to the requested path, only its vCPU threads.

The runtime also creates an unrestricted memory cgroup named /kata and moves the hypervisor into it, which mitigates the risk of an OOM kill. A similar approach is used for the hypervisor threads responsible for IO and network: those threads are not added to any cgroup and are free to use all the resources of the host.
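
For illustration only, a sketch of the renaming and the vCPU-only move against an assumed cgroup v1 mount point (cgroup v1 permits renaming a cgroup directory within the same parent); the function names and thread IDs are made up, and the real runtime uses a cgroups library:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical cgroup v1 cpu controller mount point.
const cpuCgroupRoot = "/sys/fs/cgroup/cpu"

// renameContainerCgroup renames e.g. /kubepods/pod1/container1 to
// /kubepods/pod1/kata-container1 for one subsystem on the host.
func renameContainerCgroup(containerCgroupHost string) (string, error) {
	parent := filepath.Dir(containerCgroupHost)
	renamed := filepath.Join(parent, "kata-"+filepath.Base(containerCgroupHost))
	oldPath := filepath.Join(cpuCgroupRoot, containerCgroupHost)
	newPath := filepath.Join(cpuCgroupRoot, renamed)
	return renamed, os.Rename(oldPath, newPath)
}

// moveVCPUThreads moves only the hypervisor vCPU threads (by TID) into the
// pod cgroup; the rest of the hypervisor stays outside of it.
func moveVCPUThreads(podCgroupPath string, vcpuTIDs []int) error {
	tasks := filepath.Join(cpuCgroupRoot, podCgroupPath, "tasks")
	for _, tid := range vcpuTIDs {
		// In cgroup v1, writing a TID to "tasks" moves that single thread.
		if err := os.WriteFile(tasks, []byte(fmt.Sprintf("%d", tid)), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if renamed, err := renameContainerCgroup("/kubepods/pod1/container1"); err == nil {
		fmt.Println("container cgroup renamed to", renamed)
	}
	// Hypothetical vCPU thread IDs obtained from the hypervisor.
	_ = moveVCPUThreads("/kubepods/pod1", []int{1234, 1235})
}
```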

Why?

  • The hypervisor is not a process of container1 or container2 in each pod.
  • The hypervisor is a shared resource of the whole pod.
  • Kubernetes sets a memory limit on kubepods/pod1 equal to the sum of the memory requested per container.
    • If the pod is very small, QEMU can be killed because the cgroup runs out of memory.
    • This is because running a VM adds an additional overhead, about 170 MiB for a standard Kata configuration (see the small example below).
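
A tiny worked example of this gap, using made-up container requests and the approximate 170 MiB overhead mentioned above:

```go
package main

import "fmt"

func main() {
	// Memory requested by the containers in the pod (MiB); arbitrary example values.
	containerRequests := []int{64, 64}

	// Kubernetes-style pod cgroup limit: the sum of the container requests.
	podLimit := 0
	for _, r := range containerRequests {
		podLimit += r
	}

	// Approximate VM overhead for a standard Kata configuration (from the text above).
	const vmOverhead = 170

	// Memory the cgroup would actually need to hold: workload plus the VM itself.
	needed := podLimit + vmOverhead

	fmt.Printf("pod cgroup limit: %d MiB, actually needed: %d MiB\n", podLimit, needed)
	if needed > podLimit {
		fmt.Println("the hypervisor would be OOM-killed if it were placed in the pod cgroup")
	}
}
```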

Pros and Cons

  • Pros:

    • Even if the pod overhead is not accounted for, the hypervisor is unlikely to be killed by the OOM killer.
    • Even if the pod overhead is not accounted for, network and IO performance is not degraded for pods with very tight CPU limits.
    • Stats for the containers inside the VM can be queried via kata-runtime events.
  • Cons:

    • Not great for host stats: if an administrator (or the system) wants to know the memory usage in /kubepods, /kubepods/pod1, etc., the stats will not be correct.
    • cgroup paths are not fully honored (a kata- prefix is added to the container cgroup), so the caller cannot query the container cgroup in the host filesystem. It is recommended to use kata-runtime events to collect container stats.
    • No full resource isolation: IO threads can be noisy neighbors for other containers.