In a perfect world, our clusters would never experience a complete and total failure where the data from all nodes is unrecoverable. Unfortunately, this scenario is entirely possible and has happened before. In this article I will outline how to best prepare your environment for recovery in situations like this.
Situation: Employee A accidentally deletes all of the VMs for a production cluster after testing his latest script. How do you recover?
Option A: Keep VM snapshots of all of the nodes so that you can just restore them if they are deleted.
Option B: Manually bootstrap a new controlplane and etcd node to match one of the original nodes that were deleted.
In this article, I'm going to focus on Option B. In order to bootstrap a controlplane/etcd node, you will need an etcd snapshot, the Kubernetes certificates, and the runlike commands from the core Kubernetes components. If you prepare ahead of time for something like this, you can save a lot of time when it comes to recovering the cluster.
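If you want to collect all three of those ahead of time, something along the lines of the script below can be run on one of the etcd nodes. Treat it as a sketch: it assumes an RKE-style cluster where etcd and the other core components run as Docker containers with those exact names, that the etcd container already has its ETCDCTL_* environment variables set, and that the certificates live under /etc/kubernetes/ssl. The backup directory and the assaflavie/runlike image are my own choices, not part of any official procedure.

#!/bin/bash
# Sketch: collect everything needed to rebuild a controlplane/etcd node later.
BACKUP_DIR=/opt/recovery/$(date +%F)
mkdir -p "$BACKUP_DIR"

# 1. etcd snapshot (assumes the etcd container ships etcdctl and already has
#    its ETCDCTL_* environment variables set, as RKE-built containers do).
docker exec etcd etcdctl snapshot save /tmp/snapshot.db
docker cp etcd:/tmp/snapshot.db "$BACKUP_DIR/snapshot.db"

# 2. Kubernetes certificates (default RKE location).
tar czf "$BACKUP_DIR/kube-ssl.tar.gz" /etc/kubernetes/ssl

# 3. runlike output for the core components, captured via the runlike image.
for c in etcd kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy; do
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    assaflavie/runlike "$c" > "$BACKUP_DIR/runlike-$c.txt"
done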
If your etcd logs start showing messages like the following, your storage might be too slow for etcd or the server might be doing too much for etcd to operate properly.
2019-08-11 23:27:04.344948 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:293" took too long (1.530802357s) to execute
If your storage is really slow, you will even see it throwing alerts in your monitoring system. What can you do to verify the performance of your storage? And if the storage is not performing correctly, how can you fix it? After researching this, I found an IBM article that went over it extensively. Their findings on how to test were very helpful. The biggest factor is your storage latency. If it is not well below 10ms in the 99th percentile, you will see warnings in the etcd logs. We can test this with a tool called fio, which I will outline below.
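For reference, a fio run along the lines of the one in the IBM article looks like this. It writes small blocks sequentially and calls fdatasync after every write, which mirrors how etcd persists its write-ahead log. The test directory is just an example; point it at the same disk that backs your etcd data directory, then read the fdatasync percentiles in the output, where the 99th percentile should be comfortably under 10ms.

# Run this on the disk that hosts the etcd data directory (path is an example).
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test \
    --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/etcd/fio-test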
As part of preparing ahead of time, the following script rebuilds the nodes: section of a cluster_recovery.yml file from the full-cluster-state configmap that RKE keeps in the kube-system namespace:

#!/bin/bash
echo "Building cluster_recovery.yml..."
echo "Working on Nodes..."
echo 'nodes:' > cluster_recovery.yml
kubectl --kubeconfig kube_config_cluster.yml -n kube-system get configmap full-cluster-state -o json | jq -r .data.\"full-cluster-state\" | jq -r .desiredState.rkeConfig.nodes | yq r - | sed 's/^/ /' | \
sed -e 's/internalAddress/internal_address/g' | \
sed -e 's/hostnameOverride/hostname_override/g' | \
sed -e 's/sshKeyPath/ssh_key_path/g' >> cluster_recovery.yml
echo "" >> cluster_recovery.yml