Brief walkthrough of the steps taken to debug an issue where kube-proxy was not started automatically after Docker was upgraded on SLES (SUSE).
Error seen was:

    starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:390: setting cgroup config for procHooks process caused \\\\\\\"failed to write a *:* rwm to devices.allow: write /sys/fs/cgroup/devices/docker/8103ad3afeece25eda0d0f7799c35ee9f7986ebf80b36d28dad4472c3542953a/devices.allow: invalid argument\\\\\\\"\\\"\": unknown"
These steps are not in any specific order; sometimes searching beats isolating because you get a hit on the exact issue, and sometimes you need some information first to be able to search more specifically.
In this case, the most specific part is `failed to write a *:* rwm to devices.allow`. The main issue with searching for errors is that the more generic the error, the more irrelevant hits you get, so you start with the most specific part and see if you can get a lead from there. If the results are too generic (for instance, when it's a kernel error that can happen in more situations), add search terms to narrow it down. In this case that would be `kube-proxy`, `kubernetes`, `SUSE`/`SLES`, or `docker`, although Docker is so widely used that it clobbers the results.
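If you don't have the exact error text handy, pulling it straight from the failed container or the daemon logs gives you the precise string to search for. A minimal sketch, assuming kube-proxy runs as a Docker container named `kube-proxy` and the Docker daemon logs to journald:

```bash
# Grab the exact error text to use as a search term.
# Assumes a container named "kube-proxy" and Docker logging to journald.
docker inspect kube-proxy --format '{{.State.Error}}'
sudo journalctl -u docker --since "1 hour ago" | grep -i "devices.allow"
```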
You might not hit the exact issue directly (or at all), so scan through the results and see if there is any background information that is useful. In this case, I found kubernetes/kubernetes#54804 (comment), for example, which explains what triggers this error, and then kubernetes/kubernetes#54967.
Isolating means eliminating variables that can cause the behavior. In this case, I tried to reproduce it using my default test setups, Ubuntu + upstream Docker and CentOS + upstream Docker, and ran the same scenario (upgrading Docker on a running cluster). Neither setup showed the behavior, which made it fairly certain the issue was in the SUSE/SLES packaging of Docker (possibly the pre/post install scripts).
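As a rough sketch of what such a reproduction attempt looks like on one of those setups (assuming an Ubuntu node with upstream docker-ce and a cluster where kube-proxy runs as a container named `kube-proxy`; the package names and container name are assumptions):

```bash
# Record the state before the upgrade.
docker version --format '{{.Server.Version}}'
docker ps --filter name=kube-proxy

# Upgrade the Docker engine in place while the cluster is running.
sudo apt-get update
sudo apt-get install --only-upgrade -y docker-ce docker-ce-cli containerd.io

# After the daemon restarts, check whether kube-proxy came back up;
# if not, inspect the container for the error shown at the top of this page.
docker ps -a --filter name=kube-proxy
docker inspect kube-proxy --format '{{.State.Status}} {{.State.Error}}'
```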
In general, always try to reproduce with the fewest steps possible. The search results pointed at kernel related issues, so check all the main components that can influence this. On cheap hosting providers or custom images, a custom built kernel is often used, which can cause issues.
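A few informational commands to capture those components (kernel, Docker build, cgroup setup); the package names are examples and differ per distribution:

```bash
uname -r                                         # kernel version, watch for custom/vendor builds
docker info --format '{{.ServerVersion}} {{.CgroupDriver}}'
mount | grep cgroup                              # confirm the devices cgroup controller is mounted
rpm -q docker 2>/dev/null || dpkg -l docker-ce   # which package/build of Docker is installed
```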
If it doesn't reproduce, get more info from the system where it's happening, e.g. using `rancher/logs-collector`.
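A hedged sketch of pulling that tool onto the affected node; the repository URL and layout are assumptions, so check its README for the actual invocation:

```bash
# Fetch the collector and review what it ships before running anything.
git clone https://github.com/rancher/logs-collector
ls logs-collector/    # see the README for which script to run on the node
```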
To check which behavior changed in certain components (in this case, it only happened to `kube-proxy`), you can search the Kubernetes issues with specific kube-proxy queries or search through the changelogs. In this case, the behavior of kube-proxy changed in 1.16 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md):
> ACTION REQUIRED: Removed deprecated flag --resource-container from kube-proxy. (#78294, @vllry)
>
> The deprecated --resource-container flag has been removed from kube-proxy, and specifying it will now cause an error. The behavior is now as if you specified --resource-container="". If you previously specified a non-empty --resource-container, you can no longer do so as of kubernetes 1.16.
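To tie the changelog entry back to the node, it helps to confirm which flags kube-proxy is actually started with; a small sketch, again assuming it runs as a Docker container named `kube-proxy`:

```bash
# Check whether --resource-container is being passed to kube-proxy.
docker inspect kube-proxy --format '{{json .Config.Cmd}}' | tr ',' '\n' | grep -i resource-container

# Same check against the running process, regardless of how it was started.
ps -ef | grep '[k]ube-proxy'
```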
Apply the changes found and see if they resolve the issue. If you haven't found any solid leads, my approach is to change parameters related to the issue (in this case, kernel/Docker/kube-proxy) one at a time and see whether the behavior is triggered or not.
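For example, on SLES you can vary just the Docker package version and re-run the check; the version string below is purely illustrative, not the one from this incident:

```bash
# Vary a single parameter (the Docker package version) and re-test.
zypper search -s docker                                             # list available versions
sudo zypper install --oldpackage 'docker=19.03.5_ce-lp151.2.6.1'    # example version string
docker ps -a --filter name=kube-proxy                               # re-check whether kube-proxy starts
```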