Brief walkthrough of the steps taken to debug an issue where kube-proxy was not started automatically after Docker was upgraded on SLES (SUSE).
Error seen was:

    starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:390: setting cgroup config for procHooks process caused \\\\\\\"failed to write a *:* rwm to devices.allow: write /sys/fs/cgroup/devices/docker/8103ad3afeece25eda0d0f7799c35ee9f7986ebf80b36d28dad4472c3542953a/devices.allow: invalid argument\\\\\\\"\\\"\": unknown"
These steps are not in any specific order; sometimes searching beats isolating because you get a hit on the exact issue, and sometimes you need some information first to be able to search more specifically.
In this case, the most specific part is `failed to write a *:* rwm to devices.allow`. The main issue with searching for errors is that the more generic the error, the more irrelevant hits you get, so you start with the most specific part and see if you can get a lead from there. If the results are too generic (for instance, when it's a kernel error that can happen in more situations), add search terms to narrow it down. In this case that would be `kube-proxy`, `kubernetes`, `SUSE`/`SLES`, or `docker`, although Docker is so widely used that it clobbers the results.
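If you don't have the exact error text handy, pulling it straight from the failed container or the daemon logs gives you the precise string to search for. A minimal sketch, assuming kube-proxy runs as a Docker container named `kube-proxy` and the Docker daemon logs to journald:

```bash
# Grab the exact error text to use as a search term.
# Assumes a container named "kube-proxy" and Docker logging to journald.
docker inspect kube-proxy --format '{{.State.Error}}'
sudo journalctl -u docker --since "1 hour ago" | grep -i "devices.allow"
```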
You might not hit the exact issue directly (or at all), so scan through the results and see if there is any background information that is useful. In this case, I found kubernetes/kubernetes#54804 (comment), for example, which explains what triggers this error, and then kubernetes/kubernetes#54967.
Isolating means eliminating variables that can cause the behavior. In this case, I tried to reproduce it using my default test setups, Ubuntu + upstream Docker and CentOS + upstream Docker, and ran the same scenario (upgrading Docker on a running cluster). Neither setup showed the behavior, which made it fairly certain the issue was in the SUSE/SLES packaging of Docker (possibly the pre/post install scripts).
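As a rough sketch of what such a reproduction attempt looks like on one of those setups (assuming an Ubuntu node with upstream docker-ce and a cluster where kube-proxy runs as a container named `kube-proxy`; the package names and container name are assumptions):

```bash
# Record the state before the upgrade.
docker version --format '{{.Server.Version}}'
docker ps --filter name=kube-proxy

# Upgrade the Docker engine in place while the cluster is running.
sudo apt-get update
sudo apt-get install --only-upgrade -y docker-ce docker-ce-cli containerd.io

# After the daemon restarts, check whether kube-proxy came back up;
# if not, inspect the container for the error shown at the top of this page.
docker ps -a --filter name=kube-proxy
docker inspect kube-proxy --format '{{.State.Status}} {{.State.Error}}'
```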
In general, always try to reproduce with the fewest steps possible. The search results pointed at kernel related issues, so check all the main components that can influence this. On cheap hosting providers or custom images, a custom built kernel is often used, which can cause issues.
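A few informational commands to capture those components (kernel, Docker build, cgroup setup); the package names are examples and differ per distribution:

```bash
uname -r                                         # kernel version, watch for custom/vendor builds
docker info --format '{{.ServerVersion}} {{.CgroupDriver}}'
mount | grep cgroup                              # confirm the devices cgroup controller is mounted
rpm -q docker 2>/dev/null || dpkg -l docker-ce   # which package/build of Docker is installed
```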
If it doesn't reproduce, get more info from the system where it's happening, e.g. using `rancher/logs-collector`.
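A hedged sketch of pulling that tool onto the affected node; the repository URL and layout are assumptions, so check its README for the actual invocation:

```bash
# Fetch the collector and review what it ships before running anything.
git clone https://github.com/rancher/logs-collector
ls logs-collector/    # see the README for which script to run on the node
```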
To check which behavior changed in certain components (in this case, it only happened to `kube-proxy`), you can search the Kubernetes issues with specific kube-proxy queries or search through the changelogs. In this case, the behavior of kube-proxy changed in 1.16 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md):
> ACTION REQUIRED: Removed deprecated flag --resource-container from kube-proxy. (#78294, @vllry)
>
> The deprecated --resource-container flag has been removed from kube-proxy, and specifying it will now cause an error. The behavior is now as if you specified --resource-container="". If you previously specified a non-empty --resource-container, you can no longer do so as of kubernetes 1.16.
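To tie the changelog entry back to the node, it helps to confirm which flags kube-proxy is actually started with; a small sketch, again assuming it runs as a Docker container named `kube-proxy`:

```bash
# Check whether --resource-container is being passed to kube-proxy.
docker inspect kube-proxy --format '{{json .Config.Cmd}}' | tr ',' '\n' | grep -i resource-container

# Same check against the running process, regardless of how it was started.
ps -ef | grep '[k]ube-proxy'
```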
Apply the changes found and see if they resolve the issue. If you haven't found any solid leads, my approach is to change parameters related to the issue (in this case, kernel/Docker/kube-proxy) one at a time and see whether the behavior is triggered or not.
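For example, on SLES you can vary just the Docker package version and re-run the check; the version string below is purely illustrative, not the one from this incident:

```bash
# Vary a single parameter (the Docker package version) and re-test.
zypper search -s docker                                             # list available versions
sudo zypper install --oldpackage 'docker=19.03.5_ce-lp151.2.6.1'    # example version string
docker ps -a --filter name=kube-proxy                               # re-check whether kube-proxy starts
```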