Post Mortem
Date: 2017-04-11
Impact:
- np.k8s.saltside.io internal service disrupted
- tiller unavailable
- unable to create new pods or recover failed pods
Timeline
I (AH) had finished some work to move the Helm chart secrets to git-crypt, and it seemed ready for a test run. I updated the code to deploy the sandbox chart to all markets on the non-production cluster in parallel. This took some time but eventually completed.
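Roughly what the parallel deploy looked like; the market list, release names, and chart path below are placeholders from memory, not the actual script:

    # Sketch of the parallel deploy (Helm 2 with tiller); market names,
    # release names, and chart path are placeholders, not the real values.
    for market in market-a market-b market-c; do
      helm upgrade --install "sandbox-${market}" ./sandbox \
        --namespace "${market}" &
    done
    wait  # every market deploys concurrently against the same cluster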
I continued to monitor the pods with watch. I noticed the pods were stuck in the Pending state, but wrote it off because I had a meeting to go to and figured it would resolve in time.
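The monitoring loop was nothing more than something along these lines:

    # Poll pod state across all namespaces every 2 seconds (watch's default).
    watch kubectl get pods --all-namespaces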
2 hours later the pods were still stuck in Pending, along with a new state I had not seen before: Unknown. I described some pods to spot check what was going on. There were errors about pulling images and errors from Docker itself. Again, I largely wrote it off because this was the non-production cluster running on t2.medium instances; I figured things had simply been overwhelmed by deploying everything at once. At this point I did not want to figure out why things had broken, so I started to delete things.
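The spot checks were roughly the following; pod and namespace names here are placeholders:

    # Inspect a stuck pod; the Events section showed the image pull and
    # container runtime errors. Names are placeholders.
    kubectl describe pod some-sandbox-pod --namespace market-a

    # Recent events for the namespace, most recent last.
    kubectl get events --namespace market-a --sort-by=.lastTimestamp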
I tried to delete the helm releases, but that timed out because tiller had stopped working. So I figured, OK, I can recover from that; let's go ahead and start cleaning everything out manually. This kind of worked. I needed to add --grace-period=0 and --force to forcibly remove pods stuck in Pending/Unknown/Terminating states. This seemed to delete things, but tiller was still unavailable.
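The cleanup looked roughly like this; release and pod names are placeholders:

    # Helm 2: remove the release and its history. This is what timed out,
    # since tiller itself was down.
    helm delete --purge sandbox-market-a

    # Fall back to deleting resources directly; pods stuck in
    # Pending/Unknown/Terminating needed the forced variant.
    kubectl delete pod some-sandbox-pod --namespace market-a \
      --grace-period=0 --force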
I checked the kube-system namespace pods to see what was going on there. Some tiller pods were in a new "DeadNode" state along with "Unknown" or "Pending". "DeadNode" got my attention, so I decided to check the nodes with kubectl get nodes. It turned out the internal subsystems on those nodes had completely shut down for various reasons: the container runtime was dead on one, the kubelet was down on others. At this point I figured I would make a note of this (in the post mortem) and that it would be easier to terminate the instances and let the ASG sort it out. I verified the worker instances had instance data that would bootstrap them, then terminated the instances in the AWS console.
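The checks on kube-system and the nodes were along these lines; the node name is a placeholder:

    # Pod status for the in-cluster system components (tiller lives here).
    kubectl get pods --namespace kube-system -o wide

    # Node status; this is where the workers showed up as unavailable.
    kubectl get nodes

    # Node conditions and events show why: dead container runtime on one
    # node, kubelet down on others.
    kubectl describe node ip-10-0-0-1.ec2.internal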
Service was restored after new instances were created.
Observations
- Node unavailability is not picked up by any monitoring. kubectl get nodes showed all 3 workers unavailable, but nothing alerted on it.
- No metrics/alarms on unavailable or otherwise curious pod states. This is especially useful for kube-system.
- Pre-pulling images may have mitigated this problem.
- The tiller pods failed to reschedule after killing other pods because of CPU limits. This was curious because, IIRC, nothing had declared CPU requests that would accumulate against node capacity. This may need investigation (see the sketch after this list).
- No (node) system utilization metrics in the k8s dashboard
- Cluster benchmarking could help gauge these limits. Installing N charts at once is a common use case.
- Disk pressure monitoring remains an open question.
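For the CPU question above, a starting point might be something like the following, comparing what each node thinks is already requested against what each pod actually declares. This is a suggestion for the investigation, not something that was run during the incident:

    # Per-node view of accumulated CPU/memory requests and limits.
    kubectl describe nodes | grep -A 5 "Allocated resources"

    # Per-pod declared CPU requests across all namespaces; a blank value
    # means the container declared no request.
    kubectl get pods --all-namespaces \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'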