@albertcard
Created March 23, 2022 15:14

Convert a Worker Node to Master Node

  1. Put the broken "master" node in maintenance, drain it and shut it down. If you can't shut it down, take solace in knowing you officially broke it further with the 'drain' command
# oc adm cordon <masternode>
# oc adm drain <masternode> --force --ignore-daemonsets
# ssh <masternode> 'shutdown -h now'
  1. Look for the busted 'etcd' pod, then log in to a healthy 'etcd' pod on one of the remaining masters
# oc get pods -n openshift-etcd | grep -Ev 'Running|Completed'
# oc rsh -n openshift-etcd etcd-<healthy_masternode>.<clustername>.lab.upshift.rdu2.redhat.com
  1. In the pod, list the members, remove the broken host, then list again to confirm it is gone
# etcdctl member list -w table
# etcdctl member remove <id_of_host_to_remove>
# etcdctl member list -w table
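With a long member list, the ID can be picked out by name instead of by eye. The following is a minimal sketch, assuming the default (non-table) output of `etcdctl member list`, where the fields are comma-separated and the ID comes first, and assuming `awk` is available wherever the output is piped. `member_id_for` is a hypothetical helper name, not part of etcdctl.

```shell
# Hypothetical helper: print the etcd member ID whose name matches $1.
# Assumes the default "simple" output of `etcdctl member list`:
#   <id>, <status>, <name>, <peerURLs>, <clientURLs>, <isLearner>
member_id_for() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (not run here):
#   id=$(etcdctl member list | member_id_for "<masternode>")
#   etcdctl member remove "$id"
```

Check that exactly one ID came back before passing it to 'member remove'.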
  1. Delete the secrets for the broken "master"
# oc get secrets -n openshift-etcd | grep "<masternode>"
# oc delete secret -n openshift-etcd etcd-peer-<masternode>.<clustername>.lab.upshift.rdu2.redhat.com
# oc delete secret -n openshift-etcd etcd-serving-<masternode>.<clustername>.lab.upshift.rdu2.redhat.com
# oc delete secret -n openshift-etcd etcd-serving-metrics-<masternode>.<clustername>.lab.upshift.rdu2.redhat.com
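The three deletes differ only in the secret prefix, so they can be generated from the node name. A small sketch of that idea follows; `etcd_secret_names` is a hypothetical helper, and the three prefixes are simply the ones used in the commands above.

```shell
# Hypothetical helper: print the three etcd secret names for the node FQDN in $1.
etcd_secret_names() {
  for prefix in etcd-peer etcd-serving etcd-serving-metrics; do
    printf '%s-%s\n' "$prefix" "$1"
  done
}

# Usage (not run here): review the names first, then delete them one by one.
#   etcd_secret_names "<masternode>.<clustername>.lab.upshift.rdu2.redhat.com" |
#     xargs -n1 oc delete secret -n openshift-etcd
```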
  1. Promote a "worker" node to "master" by giving it the "master" role
# oc label nodes <workernode> node-role.kubernetes.io/master=""

or

# oc edit node <workernode>
apiVersion: v1
kind: Node
metadata:
  labels:
    node-role.kubernetes.io/master: ""
  1. Drain the new master, reboot it, and uncordon it once it is back up
# oc adm cordon <workernode>
# oc adm drain <workernode> --force --ignore-daemonsets
# ssh <workernode> 'shutdown -r now'
# oc adm uncordon <workernode>
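The uncordon only makes sense once the node has come back from the reboot. A sketch of gating it on the node's Ready condition with `oc wait` follows. `uncordon_when_ready` is a hypothetical helper and the 15 minute default timeout is an arbitrary assumption, not a value from this procedure.

```shell
# Hypothetical helper: uncordon a node only after it reports Ready.
# $1 = node name, $2 = timeout passed to `oc wait` (default 15m, an arbitrary choice).
# Call it after the node has actually gone down; right after the reboot command
# the Ready condition may still reflect the pre-reboot state.
uncordon_when_ready() {
  oc wait --for=condition=Ready "node/$1" --timeout="${2:-15m}" &&
    oc adm uncordon "$1"
}

# Usage (not run here):
#   uncordon_when_ready "<workernode>" 20m
```

If the wait times out, the node stays cordoned and the function returns non-zero.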
  1. Patch 'etcd' and 'network', then delete the pods in 'openshift-ingress', 'openshift-dns-operator' and 'openshift-multus'; this should start the recovery
# oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# oc patch network cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
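The nested quoting in the two patch commands is easy to get wrong when retyping. One way to keep it readable, sketched here, is to build the JSON body in a small function. `redeploy_patch` is a hypothetical helper; the default timestamp uses the same GNU `date --rfc-3339=ns` call as the commands above.

```shell
# Hypothetical helper: build the merge-patch body used above.
# $1 = reason suffix; defaults to the current time (GNU date syntax).
redeploy_patch() {
  printf '{"spec": {"forceRedeploymentReason": "recovery-%s"}}\n' \
    "${1:-$(date --rfc-3339=ns)}"
}

# Usage (not run here):
#   oc patch etcd cluster --type=merge -p "$(redeploy_patch)"
```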

Odd find, but the 'openshift-ingress' namespace contains the 'router-default-' pods, which can be the source of multiple degraded Cluster Operators such as 'authentication', 'console' and 'ingress'.

Deleting the pods below fixes 'network' and 'dns' as well as the stuck 'authentication' Cluster Operator
# oc delete pods --all -n openshift-ingress
# oc delete pods --all -n openshift-dns-operator
# oc delete pods --all -n openshift-multus
  1. Monitor the health of 'etcd' and the cluster (THIS CAN TAKE A WHILE)
# oc get pods -n openshift-etcd
# oc get co etcd
# oc get co
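Rather than rereading the whole operator table, the output can be filtered down to the operators that have not settled yet. This is a rough sketch, assuming the default `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) and that every row has a VERSION value; `unsettled_operators` is a hypothetical helper.

```shell
# Hypothetical helper: print the names of cluster operators that are not settled,
# i.e. anything other than AVAILABLE=True, PROGRESSING=False, DEGRADED=False.
# Assumes the default `oc get co` columns:
#   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE ...
unsettled_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage (not run here): poll every 30 seconds until nothing is listed.
# A failing `oc` also yields empty output, so treat a very early exit with suspicion.
#   while [ -n "$(oc get co | unsettled_operators)" ]; do sleep 30; done
```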
  1. Upgrade cluster (if applicable)
# oc adm upgrade
or
# oc adm upgrade --to=<version>
  1. Check health
# oc get co
# oc get clusterversion