In a perfect world, our clusters would never experience a complete and total failure where data from all nodes is unrecoverable. Unfortunately, this scenario is entirely possible and has happened before. In this article, I will outline how best to prepare your environment for recovery in situations like this.
Situation: Employee A accidentally deletes all of the VMs for a production cluster after testing his latest script. How do you recover?
Option A: Keep VM snapshots of all of the nodes so that you can just restore them if they are deleted.
Option B: Manually bootstrap a new controlplane and etcd node to match one of the original nodes that were deleted.
In this article, I'm going to focus on Option B. In order to bootstrap a controlplane,etcd node, you will need an etcd snapshot, the Kubernetes certificates, and the runlike commands for the core Kubernetes components. If you prepare ahead of time for something like this, you can save a lot of time when it comes to bootstrapping the controlplane,etcd node.
To be prepared, the following steps should be performed before and after every upgrade of the Rancher server and downstream clusters.
- Create an offline backup of /var/lib/etcd and /etc/kubernetes if running a non RancherOS/CoreOS node; if running RancherOS/CoreOS, back up /opt/rke instead.
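For example, on a non RancherOS/CoreOS node, a minimal sketch of this backup (assuming /root/backups is where you want to stage the archive before copying it off the node):
# Archive the etcd data directory and the Kubernetes config/certificates, then
# copy the resulting tarball off the node (scp, rsync, object storage, etc.).
mkdir -p /root/backups
tar -czf /root/backups/kube-backup-$(date +%F).tar.gz /var/lib/etcd /etc/kubernetes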
- Create an offline backup of the runlike commands for the major Kubernetes components. It is best to store the output of all of these runlike commands in a single file so that you can execute it with bash at a later date; a sketch of how to do that follows the commands below.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike service-sidekick
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike etcd
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-apiserver
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-proxy
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kubelet
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-scheduler
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-controller-manager
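A minimal sketch of collecting all of the runlike output into one replayable file; the path /root/runlike-commands.sh is just an assumption, adjust it to taste:
#!/bin/bash
# Append the runlike output of each core component, in dependency order,
# to a single script that can be replayed later with bash.
OUT=/root/runlike-commands.sh
for c in service-sidekick etcd kube-apiserver kube-proxy kubelet kube-scheduler kube-controller-manager; do
    docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike $c >> $OUT
    echo >> $OUT
done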
- Keep offline backups of etcd snapshots for your downstream clusters and your local Rancher cluster.
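For RKE-provisioned clusters, recurring snapshots land in /opt/rke/etcd-snapshots on the etcd nodes by default. A sketch of shipping them off the node, where backup-host and the destination path are placeholders for your own backup target:
# Copy the local etcd snapshots to an offsite location.
rsync -av /opt/rke/etcd-snapshots/ backup-host:/backups/etcd-snapshots/$(hostname)/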
When disaster strikes and you need to recover, perform the following steps.
1. Create a new node with the same IP address as one of your old busted etcd,controlplane nodes.
2. Restore /etc/kubernetes and /var/lib/etcd if you are on non RancherOS/CoreOS nodes. If you are on RancherOS/CoreOS nodes, then restore /opt/rke.
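If you created the backup with the tar sketch from the preparation steps, restoring it is a one-liner (assuming the archive has been copied back to /root/backups on the new node; <date> is a placeholder):
# Extract the archive back to / so /var/lib/etcd and /etc/kubernetes land back in place.
tar -xzf /root/backups/kube-backup-<date>.tar.gz -C /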
3. Run the runlike commands generated in your preparation steps. Since service-sidekick and etcd are dependencies of most containers, you must run the commands in the same order as in the preparation steps.
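If you stored the commands in a single file as in the preparation sketch above, replaying them in order is simply:
# The file already lists the components in dependency order (service-sidekick and etcd first).
bash /root/runlike-commands.sh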
4. etcd will now be in a lost quorum / read-only state. To fix this, run the following command using the restore-etcd-single.sh script from https://github.com/patrick0057/etcd-tools:
bash ./restore-etcd-single.sh FORCE_NEW_CLUSTER
5. At this point, the cluster should be looking better in the web interface. Delete all etcd,controlplane nodes in the web interface except for the node that you bootstrapped back to life. Then perform an 'rke up' / cluster reconciliation by editing the cluster in the web interface: click "Edit as YAML", increment "addon_job_timeout", and save your changes.
6. Once the cluster finishes its 'rke up', check whether you can execute an 'etcdctl member list' in the etcd container. If you get 'context deadline exceeded', you need to run the FORCE_NEW_CLUSTER command one more time.
bash ./restore-etcd-single.sh FORCE_NEW_CLUSTER
7. Once you are able to see your etcd member list, check that the only member shows the same IP address three times. If it doesn't, you can fix this as shown below.
[root@ip-172-31-6-12 ~]# docker exec etcd etcdctl member list
d6288174b26650ce, started, etcd-13.58.167.183, https://172.31.6.11:2380, https://172.31.6.12:2379,https://172.31.6.12:4001
[root@ip-172-31-6-12 ~]# docker exec etcd etcdctl member update d6288174b26650ce --peer-urls="https://172.31.6.12:2380"
Member d6288174b26650ce updated in cluster 3fdab2a2a3114d0f
[root@ip-172-31-6-12 ~]# docker exec etcd etcdctl member list
d6288174b26650ce, started, etcd-13.58.167.183, https://172.31.6.12:2380, https://172.31.6.12:2379,https://172.31.6.12:4001
I don't anticipate this needing to be done very often as I've only seen it happen once before.
8. Create three new etcd,controlplane nodes in the web interface. Once they have finished provisioning, check the health of your etcd cluster again by running an 'etcdctl member list'.
9. Remove the node you bootstrapped from the cluster. While it is probably perfectly fine to use in the long run, I have not done extensive testing on it; I only know that it is good enough to restore a cluster back to a working state.
10. Add new worker nodes.
If you were not prepared in advance for a cluster failure but still have etcd snapshots, it may still be possible to bootstrap an etcd,controlplane node to recover the cluster. The steps below are untested.
- Create a new downstream cluster using the exact same Rancher version and Kubernetes version, and with the exact same settings (such as the pod and service CIDRs) as the cluster that has failed.
- Perform the runlike steps from the preparation section of this document on an etcd,controlplane node of the new cluster.
- Grab /var/lib/etcd (non RancherOS) or /opt/rke/var/lib/etcd (RancherOS), as in the example below.
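For example, on a non RancherOS/CoreOS node (the archive path is only an assumption; use /opt/rke/var/lib/etcd instead on RancherOS/CoreOS):
# Archive the etcd data directory from this node so it can be restored later.
tar -czf /root/new-cluster-etcd.tar.gz /var/lib/etcd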
- Grab the certificates from the crashed cluster's cluster secret (c-<CLUSTER-ID> in the cattle-system namespace) in your Rancher installation's local cluster.
kubectl get secret -n cattle-system c-<CLUSTER-ID> -o jsonpath='{ .data.cluster }' | base64 -d > cluster.spec
The following script.sh should be run, together with cluster.spec, on the node where you need to restore the certificates.
#!/bin/bash
# Restores certificates from a Rancher cluster state dump (cluster.spec).
# dump cluster.spec first, for example:
# kubectl get secret -n cattle-system c-c-h7j5n -o jsonpath='{ .data.cluster }' | base64 -d > cluster.spec

# Collect all certificate names present in the bundle.
certs=($(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq keys | jq -r .[]))

for i in ${certs[*]}
do
    # The kube-etcd-* names below come from the example cluster (dev-k8s-devlabs); adjust them
    # to match the etcd certificate names in your own bundle. For the entries in this branch,
    # only the certificate and key are written out.
    if [ "${i}" == "kube-admin" ] || [ "${i}" == "kube-apiserver" ] || [ "${i}" == "kube-ca" ] || [ "${i}" == "kube-etcd-controlplane-a-1-dev-k8s-devlabs" ] || [ "${i}" == "kube-etcd-controlplane-b-1-dev-k8s-devlabs" ] || [ "${i}" == "kube-etcd-controlplane-c-1-dev-k8s-devlabs" ] || [ "${i}" == "kube-service-account-token" ]; then
        getcert_path=$(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."path"')
        echo $getcert_path
        jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."certificatePEM"' > $getcert_path
        getcert_keyPath=$(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."keyPath"')
        echo $getcert_keyPath
        jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."keyPEM"' > $getcert_keyPath
    else
        # All other entries also have an associated kubeconfig, which is written to /etc/kubernetes/ssl.
        getcert_path=$(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."path"')
        echo $getcert_path
        jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."certificatePEM"' > $getcert_path
        getcert_keyPath=$(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."keyPath"')
        echo $getcert_keyPath
        jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."keyPEM"' > $getcert_keyPath
        getcert_configPath="/etc/kubernetes/ssl/$(jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."configPath"' | rev | cut -d'/' -f 1 | rev)"
        echo $getcert_configPath
        jq -r .metadata.fullState cluster.spec | jq -r .currentState.certificatesBundle | jq -r '."'$i'"."config"' > $getcert_configPath
    fi
done
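To use it, place cluster.spec in the same directory as the script on the node being restored and run:
bash ./script.sh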
Recreate /etc/kubernetes/kube-api-authn-webhook.yaml with the contents below if it is needed:
apiVersion: v1
kind: Config
clusters:
- name: Default
  cluster:
    insecure-skip-tls-verify: true
    server: http://127.0.0.1:6440/v1/authenticate
users:
- name: Default
  user:
    insecure-skip-tls-verify: true
current-context: webhook
contexts:
- name: webhook
  context:
    user: Default
    cluster: Default
- Prepare a new node that has the same IP as one of your crashed controlplane,etcd nodes.
- Restore /var/lib/etcd (non RancherOS) or /opt/rke/var/lib/etcd (RancherOS).
- Restore the certificates to /etc/kubernetes/ssl (non RancherOS) or /opt/rke/etc/kubernetes/ssl (RancherOS).
- Modify the runlike commands you retrieved in a previous step to match the IPs or hostnames of your crashed node. Take note that etcd might reference two different IP addresses; see the sketch below.
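A minimal sketch of that substitution, assuming the commands were saved to /root/runlike-commands.sh as in the preparation section; OLD_IP and NEW_IP are placeholders for your environment:
# Point the saved commands at the crashed node's address, then review the etcd
# command by hand since it may list a second IP that also needs updating.
sed -i 's/OLD_IP/NEW_IP/g' /root/runlike-commands.sh
grep -n etcd /root/runlike-commands.sh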
- Run your modified runlike commands.
- Restore one of your etcd snapshots to etcd.
curl -LO https://github.com/patrick0057/etcd-tools/raw/master/restore-etcd-single.sh
bash ./restore-etcd-single.sh </path/to/snapshot>
- The rest of the process should be the same as the normal recovery section above, starting at step 4.
Again, please keep in mind that the unprepared version of this recovery has not been tested yet and is much more labor intensive than it would have been if you had prepared in advance.