CFCR Certificate Rotation for Multi-Master Multi-AZ deployments

This approach rotates all certificates at once and then updates all etcd (master) nodes together. BOSH cannot update all master nodes at the same time when they are deployed across AZs, so the procedure reduces the cluster to a single master node and then scales back out while the new certificates are rolled out to every VM. This is faster than a staged rotation but somewhat riskier, because the cluster runs with only one master node for a few minutes.

An alternative is a more graceful approach: first roll out a new CA concatenated with the old CA, then regenerate the leaf certificates for the etcd servers, and finally remove the old CA. This requires 3 passes (cluster updates), so it is slower, but it is safer and lets BOSH update master nodes one at a time. This gist does not go into the details of that approach.

Get required CLIs

Download the CredHub CLI 2.2.1. The 2.x CLI is required for the export option, which the 1.x CLI does not have yet. Both CLIs are needed, so the 2.x binary is renamed to credhub2 below to coexist with the 1.x credhub binary.

curl -JOL https://github.com/cloudfoundry-incubator/credhub-cli/releases/download/2.2.1/credhub-linux-2.2.1.tgz
tar -xzvf credhub-linux-2.2.1.tgz
mv credhub credhub2

Install ruby 2.3+

Install jq
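
Before starting, it can help to confirm that every tool used in this procedure is reachable. The following is a minimal sketch, not part of the original procedure; the function name missing_tools is hypothetical, and it assumes credhub2 has been placed somewhere on PATH.

```shell
# Print the name of every listed command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For this procedure the call would be `missing_tools credhub credhub2 bosh ruby jq`; an empty result means nothing is missing.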

Login to credhub

The easiest option for CFCR users is the credhub_login script provided in the kubo-deployment repository. Run it like this:

${repo_base_directory}/kubo-deployment/bin/credhub_login ${kubo_env_path}

Export all certificates for a given cluster

credhub2 export -p /${BOSH_ENVIRONMENT}/${kubo_env_name} > ${kubo_env_name}_certs_old.yaml

Example:

credhub2 export -p /cfcr/cfcr > cfcr_certs_old.yaml
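
Since every later step reads from this file, a quick sanity check right after the export may be worthwhile. The sketch below assumes only what the jq filters in this gist already rely on, namely that the export has a top-level credentials key; the function name check_export is hypothetical.

```shell
# Return 0 only if the export file exists, is non-empty, and contains a
# top-level "credentials:" key. Otherwise print a reason and return 1.
check_export() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "export file missing or empty: $file" >&2
    return 1
  fi
  if ! grep -q '^credentials:' "$file"; then
    echo "no top-level credentials key in: $file" >&2
    return 1
  fi
}
```

For the example cluster this would be run as `check_export cfcr_certs_old.yaml`.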

Transform file to json

ruby -ryaml -rjson -e 'puts JSON.pretty_generate(YAML.load(ARGF))' < ${kubo_env_name}_certs_old.yaml > ${kubo_env_name}_certs_old.json

Example:

ruby -ryaml -rjson -e 'puts JSON.pretty_generate(YAML.load(ARGF))' < cfcr_certs_old.yaml > cfcr_certs_old.json

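Inspect the content to confirm the export worked. First list the CAs, which the filter identifies as certificates whose ca field equals their certificate field: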

cat ${kubo_env_name}_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca == .value.certificate)) | .[].name'

Example:

cat cfcr_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca == .value.certificate)) | .[].name'

Expected output:

/cfcr/cfcr/etcd_ca
/cfcr/cfcr/kubernetes-dashboard-ca
/cfcr/cfcr/kubo_ca
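
Because this CA list drives the regeneration step later on, it may be worth checking mechanically that no expected CA is missing rather than comparing by eye. This is a sketch under the assumption that the expected names and the jq output have each been saved to a file, one name per line; missing_names is a hypothetical helper.

```shell
# Print every name from the expected list that is absent from the actual list.
# Both arguments are files containing one credential name per line.
missing_names() {
  expected=$1
  actual=$2
  # -F: fixed strings, -x: whole-line match, -v: invert the match, so only
  # expected lines that equal no line of the actual list are printed.
  # grep exits non-zero when it prints nothing, hence the "|| true".
  grep -Fxv -f "$actual" "$expected" || true
}
```

An empty result means every expected name was exported. Extra names in the actual list are ignored, which suits CFCR versions that ship additional certificates.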

Now the leaf certificates:

cat ${kubo_env_name}_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca != .value.certificate)) | .[].name'

Example:

cat cfcr_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca != .value.certificate)) | .[].name'

Expected output: (This list may vary depending on the version of CFCR used)

/cfcr/cfcr/tls-etcdctl-flanneld
/cfcr/cfcr/tls-etcdctl-root
/cfcr/cfcr/tls-etcdctl-v0-29-0
/cfcr/cfcr/tls-etcd-v0-29-0
/cfcr/cfcr/tls-kubernetes-dashboard
/cfcr/cfcr/tls-influxdb
/cfcr/cfcr/tls-heapster
/cfcr/cfcr/tls-metrics-server
/cfcr/cfcr/tls-etcdctl
/cfcr/cfcr/tls-etcd-v0-17-0
/cfcr/cfcr/tls-kube-controller-manager
/cfcr/cfcr/tls-kubernetes
/cfcr/cfcr/tls-kubelet-client
/cfcr/cfcr/tls-kubelet

Reduce to one master

The easiest way is to create a BOSH ops file that sets the number of master instances to 1. Below is an example. Remember to change the master instance group name if your cluster uses a different one. Also update static_ips to the addresses used by your cluster, or remove the second replace operation if you don't use a vip network for the master nodes. Example ops-single-master.yml file:

- type: replace
  path: /instance_groups/name=cfcr-master/instances
  value: 1
- type: replace
  path: /instance_groups/name=cfcr-master/networks/name=vip/static_ips
  value: [100.71.29.64]
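
If the instance group name or static IP differs between clusters, the ops file can be generated instead of hand-edited. The following is a minimal sketch that reproduces the example above; the function name single_master_ops and its two arguments are hypothetical, and the second replace is emitted only when an IP is given.

```shell
# Emit an ops file that pins the given instance group to a single instance.
# $1: master instance group name (for example cfcr-master)
# $2: optional static IP to keep on the "vip" network; leave it out to
#     skip the second replace operation
single_master_ops() {
  group=$1
  static_ip=${2:-}
  cat <<EOF
- type: replace
  path: /instance_groups/name=${group}/instances
  value: 1
EOF
  if [ -n "$static_ip" ]; then
    cat <<EOF
- type: replace
  path: /instance_groups/name=${group}/networks/name=vip/static_ips
  value: [${static_ip}]
EOF
  fi
}
```

Running `single_master_ops cfcr-master 100.71.29.64 > ops-single-master.yml` yields the same file as the example.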

Update the cluster using ops-single-master.yml:

bosh deploy -d cfcr "${bosh_env}/kubo-manifest.yml" -o "${bosh_env}/ops-single-master.yml"

This pass should be fast: it deletes all master nodes but one and updates all VMs with the updated internal links.

Regenerate all certs

First the CAs, so that the leaf certificates regenerated afterwards are signed by the new CAs:

cat ${kubo_env_name}_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca == .value.certificate)) | .[].name' | xargs -n 1 -t credhub regenerate -n &> ${kubo_env_name}_ca_cert_regen.out

Example:

cat cfcr_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca == .value.certificate)) | .[].name' | xargs -n 1 -t credhub regenerate -n &> cfcr_ca_cert_regen.out

Now the leaf certificates:

cat ${kubo_env_name}_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca != .value.certificate)) | .[].name' | xargs -n 1 -t credhub regenerate -n &> ${kubo_env_name}_cert_regen.out

Example:

cat cfcr_certs_old.json | jq -r '.credentials | map(select(.type == "certificate")) | map(select(.value.ca != .value.certificate)) | .[].name' | xargs -n 1 -t credhub regenerate -n &> cfcr_cert_regen.out
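
Regeneration cannot be undone from the command line, so a dry run of the pipelines above may be useful first. This sketch swaps the real call for echo, so it only prints the commands that xargs would run; preview_regenerate is a hypothetical name.

```shell
# Read credential names on stdin and print, without executing it, the
# regenerate command that would be issued for each name.
preview_regenerate() {
  xargs -n 1 echo credhub regenerate -n
}
```

To preview a pass, pipe the output of either jq filter above into preview_regenerate instead of into xargs. Note that GNU xargs runs the command once even when the input is empty, so an empty name list still produces one line of output.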

Redeploy cluster

Redeploy the cluster without the ops-single-master.yml ops file. This rolls out the new certificates and scales the masters back to 3:

bosh deploy -d cfcr "${bosh_env}/kubo-manifest.yml" --skip-drain

If you are also rotating the BOSH certificates for the Agents because they have expired and the Agents are unresponsive, run this command instead. (This is a common scenario if the Director was created at the same time as the CFCR cluster.)

bosh deploy -d cfcr "${bosh_env}/kubo-manifest.yml" --recreate --fix --skip-drain

jaimegag/CFCR_cert_rotation_multi-master-multi-az.md
