https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent roughly the same value:
128974848, 129e6, 129M, 123Mi
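The rough equivalence of those four forms can be checked with a small parser. This is a hedged sketch handling only the suffixes listed above, not the actual Kubernetes `resource.Quantity` implementation:

```python
# Minimal sketch of Kubernetes-style quantity parsing (decimal and
# binary suffixes only; illustrative, not the real apimachinery code).
DECIMAL = {"E": 10**18, "P": 10**15, "T": 10**12, "G": 10**9, "M": 10**6, "K": 10**3}
BINARY = {"Ei": 2**60, "Pi": 2**50, "Ti": 2**40, "Gi": 2**30, "Mi": 2**20, "Ki": 2**10}

def parse_quantity(s: str) -> float:
    # Check binary suffixes first so "Mi" is never mistaken for "M".
    for suffix, mult in {**BINARY, **DECIMAL}.items():
        if s.endswith(suffix):
            return float(s[:-len(suffix)]) * mult
    if s.endswith("m"):            # milli -- sensible for CPU, not memory
        return float(s[:-1]) / 1000
    return float(s)                # plain or scientific notation, e.g. 129e6

for q in ("128974848", "129e6", "129M", "123Mi"):
    print(q, "->", int(parse_quantity(q)))
```

Note that `123Mi` is exactly 128974848 bytes, while `129M` and `129e6` are both 129000000, which is why the docs call these values "roughly the same".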
If someone uses a lowercase letter, for example `m`, the behavior is inconsistent between Windows & Linux. `m` is a valid suffix for CPU but not for memory.
Note: `docker` command line options do support `m`, `b`, `k` and `g`. See docker resource constraints.
On Linux, the `m` suffix is dropped if it's included as part of a memory resource. The resource allocation below will create an allocation of 800 bytes, not 800 Megabytes or Mebibytes. That's too small and container startup will fail. See full error logs below.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test2
spec:
  containers:
  - name: nginx
    image: library/nginx
    resources:
      limits:
        cpu: 1
        memory: 800m
    ports:
    - containerPort: 80
  nodeSelector:
    "beta.kubernetes.io/os": linux
On Windows, the `m` suffix is treated the same as `M`. The resource allocation below will be 800 Megabytes, and container startup will succeed.
apiVersion: v1
kind: Pod
metadata:
  name: iis-test2
spec:
  containers:
  - name: iis
    image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
    resources:
      limits:
        cpu: 1
        memory: 800m
    ports:
    - containerPort: 80
  nodeSelector:
    "beta.kubernetes.io/os": windows
This would have the same effect on resources as `docker run --memory 800m -d mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019` on Windows.
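To make the divergence concrete, here is a hedged back-of-the-envelope of how the two platforms end up reading the same `800m` string (my reading of the behavior reported above, not actual kubelet code):

```python
requested = "800m"
value = int(requested[:-1])      # strip the trailing "m"

# Linux (as reported): the suffix is simply dropped, leaving a raw
# byte count -- 800 bytes, far too small for any container to start.
linux_bytes = value

# Windows (as reported): "m" is read as the decimal "M" (megabyte)
# suffix -- 800,000,000 bytes, so the container starts normally.
windows_bytes = value * 10**6

print(linux_bytes, windows_bytes)
```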
There are a few different places this could be fixed. They're not mutually exclusive, so I'm asking SIG-Node and API reviewers to consider the impact of each.
Parsing cpu & memory limits separately is not feasible as the same string parser is shared for all resource types.
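The shared-parser constraint can be seen in miniature below. This is an illustrative sketch, not the real `resource.Quantity` code: because one parse routine serves every resource type, the milli suffix is always accepted, even where it makes no sense.

```python
def parse_with_milli(s: str) -> float:
    """One shared parser for all resource types: "m" is always accepted."""
    return float(s[:-1]) / 1000 if s.endswith("m") else float(s)

print(parse_with_milli("100m"))  # as cpu: 0.1 cores -- perfectly sensible
print(parse_with_milli("800m"))  # as memory: 0.8 bytes -- nonsense, but parses fine
```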
This would be the most consistent option, as the failure would happen immediately and identically on both Windows & Linux. Let's start with a strawman proposal of 10Mi as the container minimum. Windows actually needs more like 200Mi, but I think 10Mi would be enough to catch the most egregious mistakes such as "10m".
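The strawman could be sketched as an admission-time check. The following is illustrative Python with hypothetical names (`validate_memory_limit`, `MIN_CONTAINER_MEMORY` are mine, not actual apiserver code), assuming the 10Mi floor proposed above:

```python
MIN_CONTAINER_MEMORY = 10 * 2**20   # strawman floor of 10Mi, in bytes

def validate_memory_limit(limit_bytes: int) -> None:
    """Reject obviously-wrong limits such as "10m" before the pod is admitted."""
    if limit_bytes < MIN_CONTAINER_MEMORY:
        raise ValueError(
            f"memory limit of {limit_bytes} bytes is below the "
            f"{MIN_CONTAINER_MEMORY}-byte minimum; did you mean Mi or M?"
        )

validate_memory_limit(800 * 2**20)   # 800Mi: accepted
try:
    validate_memory_limit(800)       # what "800m" becomes once the suffix is dropped
except ValueError as e:
    print("rejected:", e)
```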
Source Locations:
This would fail later, but won't involve changes to the apiserver, common APIs or admission controller. The error messages can still be made consistent between Linux & Windows, but different minimums can be enforced in each side. This could be implemented in the code that creates the container configs before passing them to CRI.
Source locations:
Linux: https://github.com/kubernetes/kubernetes/blob/33081c1f07be128a89441a39c467b7ea2221b23d/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L38
Windows: https://github.com/kubernetes/kubernetes/blob/b39d8f4777b3a01359046d05b1de17298b818f2f/pkg/kubelet/kuberuntime/kuberuntime_container_windows.go#L34
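A kubelet-side variant could enforce a different floor per OS in the code that builds the container config, before handing it to the CRI runtime. This is a hypothetical sketch (names and values are mine); the real config-building code lives in the two files linked above:

```python
# Per-OS minimums enforced where the container config is built.
# Values are illustrative: 10Mi strawman for Linux, a larger floor
# for Windows, which needs considerably more memory per container.
OS_MIN_MEMORY = {
    "linux": 10 * 2**20,
    "windows": 200 * 2**20,
}

def check_container_memory(os_name: str, limit_bytes: int) -> str:
    floor = OS_MIN_MEMORY[os_name]
    if limit_bytes < floor:
        return f"error: limit {limit_bytes} bytes < {os_name} minimum {floor} bytes"
    return "ok"

print(check_container_memory("linux", 800))          # fails fast on the kubelet
print(check_container_memory("windows", 800 * 10**6))
```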
Here's what the status looks like for the failed pod:
$ kubectl get pod
NAME          READY   STATUS              RESTARTS   AGE
iis-test1     1/1     Running             0          10m
iis-test2     1/1     Running             0          10m
nginx-test1   1/1     Running             0          10m
nginx-test2   0/1     ContainerCreating   0          10m
$ kubectl describe pod nginx-test2
Name:               nginx-test2
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               aks-nodepool1-20293662-vmss000000/10.240.0.4
Start Time:         Tue, 09 Jul 2019 23:44:09 +0000
Labels:             <none>
Annotations:        <none>
Status:             Pending
IP:
Containers:
  nginx:
    Container ID:
    Image:          library/nginx
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  800m
    Requests:
      cpu:     100m
      memory:  300m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jrvwx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-jrvwx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jrvwx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/nginx-test2 to aks-nodepool1-20293662-vmss000000
Warning FailedCreatePodSandBox 10m kubelet, aks-nodepool1-20293662-vmss000000 Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:293: copying bootstrap data to pipe caused \"write init-p: broken pipe\"": unknown
Warning FailedCreatePodSandBox 10m kubelet, aks-nodepool1-20293662-vmss000000 Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:297: getting the final child's pid from pipe caused \"read init-p: connection reset by peer\"": unknown
Warning FailedCreatePodSandBox 5m52s (x230 over 10m) kubelet, aks-nodepool1-20293662-vmss000000 Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:297: getting the final child's pid from pipe caused \"EOF\"": unknown
Normal SandboxChanged 52s (x538 over 10m) kubelet, aks-nodepool1-20293662-vmss000000 Pod sandbox changed, it will be killed and re-created.