How to use:
- Edit the input variables at the top
- Copy paste steps one at a time into a bash/zsh terminal shell. This gives better feedback and a better understanding of what's going on.
brew install kubectx
gcloud container clusters get-credentials cluster-1 --zone us-central1-c --project chrism-playground
gcloud container clusters get-credentials cluster-2 --zone us-central1-c --project chrism-playground
kubectx cluster1=gke_chrism-playground_us-central1-c_cluster-1
kubectx cluster2=gke_chrism-playground_us-central1-c_cluster-2
# ^-- renames the kubectl contexts to shorter names
- Copy paste these in a text file
- Edit them to reflect your environment
- Copy paste the edited values into a terminal
# v-- GCP project
export PROJECT=chrism-playground
export DOMAIN=neoakris.dev
# ^-- domain name with a public internet TLD you have access to
export ORIGINAL_IP_NAME="original-global-ip"
export NEW_IP_NAME="new-global-ip"
(Copy paste the following to verify the values)
echo $PROJECT
echo $DOMAIN
echo $ORIGINAL_IP_NAME
echo $NEW_IP_NAME
gcloud config set project $PROJECT
gcloud compute addresses create $ORIGINAL_IP_NAME --global --ip-version IPV4
export ORIGINAL_IP=$(gcloud compute addresses describe $ORIGINAL_IP_NAME --global | grep address: | cut -d ' ' -f 2)
echo $ORIGINAL_IP
Update public internet DNS so your domain name points to $ORIGINAL_IP.
Verify using the command dig $DOMAIN
or the website https://mxtoolbox.com/DnsLookup.aspx
Note: even if the above report looks fine, it's possible for a router or the host OS to have the old IP cached,
so you should also check at the host OS level.
ping $DOMAIN -c 1 | head -n 1
Make sure this command shows the updated IP.
macOS users can usually run sudo killall -HUP mDNSResponder
to clear outdated DNS entries cached at the host level.
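# v-- (Optional sketch) If your zone happens to be hosted in Google Cloud DNS, the A record update can be scripted.
#     This is an assumption/example only: "my-zone" is a placeholder for your managed zone name, and if an A record
#     already exists you'd also need a matching "gcloud dns record-sets transaction remove" of the old value before the add.
gcloud dns record-sets transaction start --zone=my-zone
gcloud dns record-sets transaction add $ORIGINAL_IP --name=$DOMAIN. --ttl=300 --type=A --zone=my-zone
gcloud dns record-sets transaction execute --zone=my-zone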
mkdir -p ~/guide
cd ~/guide
# v-- switch kubectl config context to cluster1
kubectx cluster1
# v-- Note: this YAML object will work for both original & new cluster
tee managedcert.yaml << EOF
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: managed-cert
spec:
  domains:
    - $DOMAIN
EOF
alias k=kubectl
tee original-ingress.yaml << EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: managed-cert-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: $ORIGINAL_IP_NAME
    networking.gke.io/managed-certificates: managed-cert
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: test
      port:
        number: 80 #comes from k get svc test
EOF
k create deployment test --image=nginx
k expose deploy/test --port=80 --name=test
k apply -f managedcert.yaml
k apply -f original-ingress.yaml
Note: it'll take 10-60 minutes for the ManagedCertificate to transition from STATUS Provisioning to STATUS Active
k get managedcertificate
# NAME AGE STATUS
# managed-cert 10m Provisioning
# Open a spare terminal
export DOMAIN=neoakris.dev
echo $DOMAIN
# v-- (copy paste following as a multi line command, CTRL+C will exit)
while :
do
kubectl get managedcertificate
curl --silent --fail https://$DOMAIN:443 -o '/dev/null' && echo "site is up" || echo "site is down"
sleep 1
done
# ^-- This will help you figure out when managed cert switches to status Active
k get managedcertificate
# NAME AGE STATUS
# managed-cert 17m Active
# ^-- Once I saw this I closed the spare terminal
For a pre-existing DNS entry, it's recommended to lower the TTL to a small value like 300 seconds (5 minutes). Wait the full length of your old TTL before moving on: if your old TTL was 1 day, wait 1 day so the old TTL expires everywhere and is replaced by the new TTL value. If using CloudFlare DNS, it's recommended to temporarily turn off the CloudFlare proxy for the DNS entry.
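# v-- you can check the TTL resolvers currently see for the record; the 2nd column of dig's answer section
#     is the remaining TTL in seconds
dig +noall +answer $DOMAIN A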
Phase 3 - Step 2: Use Let's Encrypt (free) to Provision an HTTPS Cert using a cloud-agnostic generic methodology
- You'll use an interactive shell in a docker container to provision an HTTPS cert
# [admin@workstation:~/guide]
mkdir -p ~/guide/cert
cd ~/guide/cert
docker run -it --entrypoint=/bin/sh --volume $HOME/guide/cert:/.lego/certificates docker.io/goacme/lego:latest
# [shell@dockerized-ACME-client:/]
# (Note: the leading slash in /lego is intentional; plain "lego" isn't in the PATH and will say "not found")
/lego --email "[email protected]" --domains="neoakris.dev" --dns "manual" run
# Press Y to Accept the Terms of Service
# It'll say something along the lines of
# lego: Please create the following TXT record in your neoakris.dev. zone:
# _acme-challenge.neoakris.dev. 120 IN TXT "i892WmqXTiIg_FjG8myTTi2OnVxLhbMjoA9_wttttsE"
# So manually create a new TXT record with a TTL of 120 seconds with the given hostname & data
# IMPORTANT: Don't replace any pre-existing records with a TXT record.
# Example: DON'T convert an A record to a TXT record.
# DO Make a new record that is a TXT record.
# Press Enter after DNS has been updated according to the steps above.
# 2022/09/18 05:15:32 [INFO] [neoakris.dev] acme: Validations succeeded; requesting certificates
# 2022/09/18 05:15:33 [INFO] [neoakris.dev] Server responded with a certificate.
ls /.lego/certificates
# neoakris.dev.crt neoakris.dev.issuer.crt neoakris.dev.json neoakris.dev.key
exit
# [admin@workstation:~/guide/cert]
ls
# neoakris.dev.crt neoakris.dev.issuer.crt neoakris.dev.json neoakris.dev.key
# Note: You can remove the DNS TXT record now
# [admin@workstation:~/guide/cert]
echo $DOMAIN
#^-- make sure the value still looks right
export CRT=$(base64 < $DOMAIN.crt | tr -d '\n')
export KEY=$(base64 < $DOMAIN.key | tr -d '\n')
# ^-- tr -d '\n' strips the line wrapping that some base64 implementations (e.g. GNU coreutils) add,
#     so the values land on a single line in the YAML below
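# v-- (Optional sanity check) decoding should print the PEM header back
#     (on some macOS versions the decode flag is -D instead of --decode)
echo "$CRT" | base64 --decode | head -n 1
# -----BEGIN CERTIFICATE-----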
tee temporary-downtime-prevention-https-cert.yaml << EOF
apiVersion: v1
kind: Secret
metadata:
  name: secret-tls
  namespace: default
type: kubernetes.io/tls
data:
  # tls.crt and tls.key values come from the CRT and KEY environment variables exported above
  tls.crt: $CRT
  tls.key: $KEY
EOF
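# v-- (Optional) client-side dry run to confirm the generated YAML parses; nothing is created on the cluster
kubectl apply --dry-run=client -f temporary-downtime-prevention-https-cert.yaml
# secret/secret-tls created (dry run)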
# [admin@workstation:~/guide/cert]
cd ~/guide
# [admin@workstation:~/guide]
# v-- switch kubectl config context to cluster2
kubectx cluster2
# v-- using the apache2 (httpd) docker image so it's easier to tell the two clusters apart
k create deployment test --image=httpd
k expose deploy/test --port=80 --name=test
gcloud compute addresses create $NEW_IP_NAME --global --ip-version IPV4
export NEW_IP=$(gcloud compute addresses describe $NEW_IP_NAME --global | grep address: | cut -d ' ' -f 2)
echo $NEW_IP
tee new-ingress.yaml << EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: managed-cert-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: $NEW_IP_NAME
    networking.gke.io/managed-certificates: managed-cert
    kubernetes.io/ingress.class: "gce"
spec:
  tls:
    - hosts:
        - $DOMAIN
      secretName: secret-tls
  defaultBackend:
    service:
      name: test
      port:
        number: 80 #comes from k get svc test
EOF
kubectl apply -f managedcert.yaml
kubectl apply -f $HOME/guide/cert/temporary-downtime-prevention-https-cert.yaml
kubectl apply -f new-ingress.yaml
- The new cluster currently has 1 manually imported HTTPS cert attached to the LB, plus 1 managed HTTPS cert that will stay stuck in Provisioning until DNS cuts over
- After DNS cuts over there will be 2 HTTPS certs for the same DNS name attached to the LB. That's fine in the short term, BUT only the managed HTTPS cert will auto-renew; the manually imported one won't. So we'll want to remove the manually imported HTTPS cert when we're done, to avoid the LB having an expired and an unexpired cert at the same time (which would happen after 3 months pass).
- You can verify the manually imported HTTPS cert is attached to the LB & working by doing the following test
#v-- This will show you the webpage of the original cluster
curl -v https://$DOMAIN:443 --resolve $DOMAIN:443:$ORIGINAL_IP
# Thank you for using nginx
#v-- This will show you the validity of the cert of the original cluster
echo QUIT | openssl s_client -connect $ORIGINAL_IP:443 -servername $DOMAIN -showcerts 2>/dev/null | openssl x509 -noout -text | grep Validity -A 2
# Validity
# Not Before: Sep 16 18:56:35 2022 GMT
# Not After : Dec 15 18:56:34 2022 GMT
IMPORTANT NOTE ABOUT THE NEXT STEP
(It might fail if you run it too soon.) It should work after 10-60 mins; don't move on until it works.
Why 10-60 minutes? / What's going on?
There's a Kubernetes / GCP controller reconciliation loop that reads the Kubernetes secret attached to the Ingress object. After the reconciliation loop finishes, the cert that exists as a kube TLS secret is auto-imported as an externally provisioned GCP cert and attached to the LB, as shown in the example below.
The URL will look similar to this:
https://console.cloud.google.com/kubernetes/ingress/us-central1-c/cluster-2/default/managed-cert-ingress/details?project=chrism-playground
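# v-- One hint (not the authoritative check) that the reconciliation loop has picked up the secret: the GKE ingress
#     controller normally records the names of the attached certs in an annotation on the Ingress object
kubectl get ingress managed-cert-ingress -o yaml | grep ssl-cert
# ingress.kubernetes.io/ssl-cert: <names of the certs currently attached to the LB>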
#v-- This will show you the webpage of the new cluster (may need to wait 10-60 mins for the reconciliation loop)
curl -v https://$DOMAIN:443 --resolve $DOMAIN:443:$NEW_IP
# It works!
# ^-- apache2 server's messaging
#v-- This will show you the validity of the cert of the new cluster (we can test before the dns cutover)
echo QUIT | openssl s_client -connect $NEW_IP:443 -servername $DOMAIN -showcerts 2>/dev/null | openssl x509 -noout -text | grep Validity -A 2
# Validity
# Not Before: Sep 17 22:10:46 2022 GMT
# Not After : Dec 16 22:10:45 2022 GMT
# ^-- The fact that the certs' Validity dates are different is sufficient proof
# that we're dealing with different certs & things are working as expected
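# v-- (Optional) you can also view both certs from GCP's side; exact output columns may vary by gcloud version
gcloud compute ssl-certificates list
# ^-- the manually imported cert appears as TYPE SELF_MANAGED, the ManagedCertificate's cert as TYPE MANAGED
#     (the MANAGED one stays in provisioning until after the DNS cutover)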
Phase 4 - Step 3: Prepare to Cut Over DNS by setting up a test-driven feedback loop to verify success
- Remember to wait for the original DNS entry's TTL to actually be 5 min everywhere: if you had to update it from 3 days -> 5 min, you should wait 3 days. After those 3 days have passed, all DNS servers on the internet will re-check for DNS updates on a 5 min basis.
- Setup this feedback loop
- Open 2 spare terminals side by side or use something like tmux
export DOMAIN=neoakris.dev
# v-- && means only run the next cmd if the previous command was successful, || only run next cmd if previous failed
curl --silent --fail https://$DOMAIN:443 -o '/dev/null' && echo "site is up" || echo "site is down"
# site is up
# v-- let's prove that
curl --silent --fail i.dont.exist.com -o '/dev/null' && echo "site is up" || echo "site is down"
# site is down
# v-- wrap it in an infinite loop that checks once a second (copy paste the following as a multi-line command, CTRL+C will exit)
while :
do
curl --silent --fail https://$DOMAIN:443 -o '/dev/null' && echo "site is up" || echo "site is down"
sleep 1
done
# ^-- This way you'll be able to verify zero downtime occurred
export DOMAIN=neoakris.dev
echo $DOMAIN
# v-- (copy paste following as a multi line command, CTRL+C will exit)
while :
do
dig $DOMAIN | grep ANSWER -A 1 | grep $DOMAIN
sleep 1
done
- Cut over DNS from the old IP to the new IP
- Spam the refresh button on your website. I noticed my browser reflected the change immediately, in < 1 second; I had zero downtime and my loops said everything was up.
- Pay attention to the left and right terminals' output while it happens and you'll be able to verify zero downtime occurred.
- Later, the dig loop showed the update too, meaning the site cut over for the rest of the internet as well.
- You can close the right terminal now, but wait until the very end before you close the left terminal.
Phase 4 - Step 5: Verify the managed cert is up for the new cluster, then remove the pre-provisioned HTTPS cert
kubectx cluster2
k get managedcertificate managed-cert
# STATUS = Provisioning
# v-- copy paste from while to done as a multi line command
while :
do
kubectl get managedcertificate managed-cert
sleep 1
done
# After about 10-60 minutes you'll see the managed-cert change from STATUS Provisioning to STATUS Active
# Although the managed cert isn't ready yet,
# the LB will be working perfectly using the manually imported HTTPS cert, which results in zero downtime
- We want to get rid of the 2nd HTTPS cert that we manually imported, to avoid problems when it expires after 3 months.
  (If you don't do this last step you can run into an issue where your site toggles between a valid cert and an expired cert.)
- IMPORTANT: wait until
kubectl get managedcertificate managed-cert
shows STATUS Active before moving on.
# [admin@workstation:~/guide]
cd ~/guide
cat new-ingress.yaml
# apiVersion: networking.k8s.io/v1
# kind: Ingress
# metadata:
#   name: managed-cert-ingress
#   annotations:
#     kubernetes.io/ingress.global-static-ip-name: new-global-ip
#     networking.gke.io/managed-certificates: managed-cert
#     kubernetes.io/ingress.class: "gce"
# spec:
#   tls:
#     - hosts:
#         - neoakris.dev
#       secretName: secret-tls
#   defaultBackend:
#     service:
#       name: test
#       port:
#         number: 80 #comes from k get svc test
tee new-ingress.yaml << EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: managed-cert-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: $NEW_IP_NAME
    networking.gke.io/managed-certificates: managed-cert
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: test
      port:
        number: 80 #comes from k get svc test
EOF
cat new-ingress.yaml
# Make sure all the environment variable substitutions look right (we basically just got rid of the spec.tls reference)
# Again, don't run the following apply until after cluster2's managed cert has switched
# from STATUS Provisioning to STATUS Active (which can take 10-60 minutes after the DNS cutover has occurred)
kubectx cluster2
k get managedcertificate
# NAME AGE STATUS
# managed-cert 68m Active <--took about 22 minutes after the DNS cutover to see this on cluster2
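# v-- (Optional) preview exactly what will change on the live Ingress before applying; kubectl diff is read-only
kubectl diff -f new-ingress.yaml
# ^-- should show only the spec.tls block being removed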
kubectl apply -f new-ingress.yaml
kubectl delete -f $HOME/guide/cert/temporary-downtime-prevention-https-cert.yaml
# ^-- it's also a good idea to delete the temporary TLS secret, to avoid confusing people.
# Even though we updated the ingress & deleted the TLS secret, our temporary
# left-terminal feedback loop still shows "site is up"
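# v-- (Optional) final sanity check: the temporary secret should be gone while the site stays up
kubectl get secret secret-tls
# Error from server (NotFound): secrets "secret-tls" not found
curl --silent --fail https://$DOMAIN:443 -o '/dev/null' && echo "site is up" || echo "site is down"
# site is up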