@jmealo
Last active April 4, 2026 03:09
CNPG pg-common Primary Switchover Runbook — Staging, Demo, Production

CNPG pg-common Primary Switchover Runbook

Date: 2026-04-04

Objective: Migrate the pg-common primary to newly provisioned nodes across staging, demo, and production, then decommission the old-node instances to reach a 3-instance cluster (one primary, two standbys) per environment.

Current State

Staging (azure-staging / staging-pg-common)

| Instance | Role | Node | Node Age |
|---|---|---|---|
| staging-pg-common-6 | PRIMARY | vmss000000 | Mar 31 (old) |
| staging-pg-common-10 | Standby (sync) | vmss000002 | Mar 31 (old) |
| staging-pg-common-9 | Standby (sync) | vmss00001e | Mar 31 (old) |
| staging-pg-common-11 | Standby (sync) | vmss00001i | Apr 4 (new) |
| staging-pg-common-12 | Standby (sync) | vmss00001j | Apr 4 (new) |
| staging-pg-common-13 | Standby (sync) | vmss00001k | Apr 4 (new) |

  • Cluster: 6 instances, 6 ready, 181G, 0 replication lag
  • All new-node standbys fully synced

Demo (azure-demo / demo-pg-common)

| Instance | Role | Node | Node Age |
|---|---|---|---|
| demo-pg-common-13 | PRIMARY | vmss000000 | Mar 29 (old) |
| demo-pg-common-2 | Standby (async) | vmss000003 | Mar 29 (old) |
| demo-pg-common-15 | Standby (async) | vmss000002 | Mar 29 (old) |
| demo-pg-common-16 | Standby (async) | vmss000018 | Apr 4 (new) |
| demo-pg-common-17 | Standby (async) | vmss000019 | Apr 4 (new) |
| demo-pg-common-18 | Standby (async) | vmss00001a | Apr 4 (new) |

  • Cluster: 6 instances, 6 ready, 121G, 0 replication lag
  • All new-node standbys fully synced

Production (azure-production / production-pg-common)

| Instance | Role | Node | Node Age |
|---|---|---|---|
| production-pg-common-19 | PRIMARY | vmss00000d | Feb 3 (old) |
| production-pg-common-20 | Standby (sync) | vmss00000h | Feb 3 (old) |
| production-pg-common-21 | Standby (sync) | vmss00000g | Feb 3 (old) |
| production-pg-common-22 | Standby (sync) | vmss00000i | Apr 4 (new) |
| production-pg-common-23 | Standby (sync) | vmss00000j | Apr 4 (new) |
| production-pg-common-24 | Initializing | vmss00000k | Apr 4 (new) |

  • Cluster: 6 instances, 5 ready (instance-24 still initializing), 553G
  • Instances 22 and 23 fully synced (sync standbys, ~0.36s replay lag)
  • Instance 24 still joining — must be ready before starting production switchover

Procedure Overview

For each environment (staging → demo → production):

  1. Pre-flight checks — verify cluster health, replication lag, backups
  2. Promote — switchover primary to a new-node instance
  3. Cordon — prevent scheduling on old nodes (blocks CNPG recreation during destroy)
  4. Destroy — remove old-node instances one at a time
  5. Patch — set instances: 3 in the cluster spec (stops CNPG from recreating)
  6. Uncordon — release old nodes so cluster autoscaler can scale them down
  7. Post-flight checks — verify final cluster health

Why patch before uncordon? After destroying old instances, the cluster spec still asks for 6 instances, so CNPG keeps reconciling toward 6. If we uncordon first, CNPG could immediately recreate pods on the old nodes. Patching to instances: 3 first makes the desired state match reality (3 running). Only then is it safe to uncordon, which lets the autoscaler scale down the empty old nodes.
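The ordering rule above can be turned into a small guard that is run just before the uncordon step. This is a minimal sketch, not part of the original runbook: the function name safe_to_uncordon is hypothetical, it assumes CTX, CLUSTER and NS are set as in the pre-flight checklist, and it assumes the Cluster resource exposes the desired count at .spec.instances and the observed count at .status.instances.

```shell
# Hypothetical guard: succeed only when both the desired instance count
# (.spec.instances) and the observed count (.status.instances) equal the target.
# Assumes CTX, CLUSTER and NS are already set for the current environment.
safe_to_uncordon() {
  target="$1"
  spec=$(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    -o jsonpath='{.spec.instances}')
  observed=$(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    -o jsonpath='{.status.instances}')
  if [ "$spec" = "$target" ] && [ "$observed" = "$target" ]; then
    echo "safe: spec=$spec status=$observed"
  else
    echo "NOT safe: spec=$spec status=$observed (want $target)" >&2
    return 1
  fi
}
```

Usage would be along the lines of `safe_to_uncordon 3 && kubectl uncordon <node> --context $CTX`, so the uncordon is skipped while the spec still asks for 6 instances.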


Pre-flight Checklist (run before EACH environment)

# Set variables for the current environment
# STAGING:
CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
# DEMO:
# CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
# PRODUCTION:
# CTX=azure-production; CLUSTER=production-pg-common; NS=db

# 1. Cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# 2. Verify all instances are Ready and replication lag is under 1 second
#    STOP if any instance shows lag >= 1s or status != OK

# 3. Verify WAL archiving is working
#    Look for: "Working WAL archiving: OK"

# 4. Verify continuous backup recoverability
#    Look for: "First Point of Recoverability" is recent

GO/NO-GO decision point — do not proceed unless:

  • All instances show Status: OK
  • Replication lag is < 1 second on all standbys
  • WAL archiving is OK
  • Backup recoverability point is recent
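Part of the GO/NO-GO list can be checked mechanically from the Cluster resource. The following is a sketch under assumptions rather than a replacement for reading the status output: the helper names cluster_field and preflight_gate are hypothetical, and it assumes the Cluster status exposes phase, instances, readyInstances and a condition of type ContinuousArchiving. Replication lag and the recoverability point are not covered here and still need to be read from kubectl cnpg status.

```shell
# Hypothetical helper: read one field of the Cluster resource via jsonpath.
# Assumes CTX, CLUSTER and NS are set for the current environment.
cluster_field() {
  kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" -o "jsonpath=$1"
}

# Hypothetical gate: prints NO-GO reasons and returns non-zero if any API-level
# check fails. Field and condition names are assumptions about the Cluster status.
preflight_gate() {
  phase=$(cluster_field '{.status.phase}')
  total=$(cluster_field '{.status.instances}')
  ready=$(cluster_field '{.status.readyInstances}')
  archiving=$(cluster_field '{.status.conditions[?(@.type=="ContinuousArchiving")].status}')
  rc=0
  [ "$phase" = "Cluster in healthy state" ] || { echo "NO-GO: phase is '$phase'"; rc=1; }
  [ -n "$total" ] && [ "$ready" = "$total" ] || { echo "NO-GO: $ready of $total instances ready"; rc=1; }
  [ "$archiving" = "True" ] || { echo "NO-GO: ContinuousArchiving condition is '$archiving'"; rc=1; }
  [ "$rc" -eq 0 ] && echo "GO: API checks passed; still confirm lag and recoverability manually"
  return "$rc"
}
```

Running `preflight_gate || echo "stop here"` before each phase gives a quick first filter; a GO from this function covers only three of the four criteria.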

Phase 1: Staging

1.1 Pre-flight

CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

1.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER staging-pg-common-11 -n $NS --context $CTX

Wait for switchover to complete:

# Re-run until the primary changes (should take < 30 seconds)
kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:

  • Primary instance: staging-pg-common-11
  • Old primary (staging-pg-common-6) is now Standby
  • Cluster status returns to healthy

1.3 Cordon old nodes

kubectl cordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss00001e --context $CTX
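The node names above are hard-coded from the Current State table. As a cross-check, they can be derived from the pods themselves, since each CNPG instance runs in a pod of the same name. This is a sketch only; nodes_for_instances is a hypothetical helper name.

```shell
# Hypothetical helper: print the distinct node names hosting the given instances.
# Relies on each instance pod carrying the instance's name.
# Assumes CTX and NS are set for the current environment.
nodes_for_instances() {
  for inst in "$@"; do
    kubectl get pod "$inst" -n "$NS" --context "$CTX" \
      -o jsonpath='{.spec.nodeName}{"\n"}'
  done | sort -u
}

# Example (not executed here): compare against the nodes about to be cordoned.
#   nodes_for_instances staging-pg-common-6 staging-pg-common-10 staging-pg-common-9
```

If the printed names differ from the ones in the cordon commands, an instance has moved since the table was written and the commands need updating first.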

1.4 Destroy old-node instances (one at a time)

# Destroy old primary
kubectl cnpg destroy $CLUSTER staging-pg-common-6 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy second old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-10 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy third old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-9 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX
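Because destroy also removes the instance's PVC, a guard against destroying the current primary by mistake (for example after a rollback or a typo in the instance name) may be worth having. A minimal sketch, assuming the active primary is reported at .status.currentPrimary; destroy_if_standby is a hypothetical wrapper around the same destroy command used above.

```shell
# Hypothetical wrapper: refuse to destroy an instance that is the current primary,
# or when the current primary cannot be determined.
# Assumes CTX, CLUSTER and NS are set for the current environment.
destroy_if_standby() {
  inst="$1"
  primary=$(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    -o jsonpath='{.status.currentPrimary}')
  if [ -z "$primary" ]; then
    echo "refusing: could not read the current primary" >&2
    return 1
  fi
  if [ "$inst" = "$primary" ]; then
    echo "refusing: $inst is the current primary" >&2
    return 1
  fi
  kubectl cnpg destroy "$CLUSTER" "$inst" -n "$NS" --context "$CTX"
}
```

Calling `destroy_if_standby staging-pg-common-6` behaves like the plain destroy once the switchover has completed, and fails fast otherwise.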

1.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

1.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss00001e --context $CTX

1.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

  • 3 instances, 3 ready
  • Primary: staging-pg-common-11 on vmss00001i (new node)
  • Standbys: staging-pg-common-12, staging-pg-common-13 (new nodes)
  • Replication lag: 0
  • WAL archiving: OK
  • Cluster status: "Cluster in healthy state"
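The "new nodes only" part of this checklist can be verified by listing where the cluster's pods run. This is a sketch under assumptions: it selects pods by the cnpg.io/cluster label, which may also match short-lived job pods, and the helper names instances_on_nodes and assert_nodes_empty are hypothetical. Node names must be given in full, as in the cordon commands.

```shell
# Hypothetical helper: print "pod<TAB>node" for every pod of the cluster that
# runs on one of the node names passed as arguments.
# Assumes CTX, CLUSTER and NS are set for the current environment.
instances_on_nodes() {
  placement=$(kubectl get pods -n "$NS" --context "$CTX" \
    -l "cnpg.io/cluster=$CLUSTER" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}')
  for node in "$@"; do
    printf '%s\n' "$placement" | awk -F '\t' -v n="$node" '$2 == n'
  done
}

# Hypothetical check: succeed only when no cluster pod is left on the listed nodes.
assert_nodes_empty() {
  leftover=$(instances_on_nodes "$@")
  if [ -n "$leftover" ]; then
    echo "still on old nodes:" >&2
    printf '%s\n' "$leftover" >&2
    return 1
  fi
  echo "no cluster pods on: $*"
}
```

For staging, passing the three old node names to `assert_nodes_empty` should succeed once all old-node instances are gone.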

STOP and validate before proceeding to demo.


Phase 2: Demo

2.1 Pre-flight

CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

2.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER demo-pg-common-16 -n $NS --context $CTX

Wait and verify:

kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:

  • Primary instance: demo-pg-common-16
  • Old primary is now Standby
  • Cluster healthy

2.3 Cordon old nodes

kubectl cordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000003 --context $CTX

2.4 Destroy old-node instances (one at a time)

kubectl cnpg destroy $CLUSTER demo-pg-common-13 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX
kubectl cnpg destroy $CLUSTER demo-pg-common-2 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX
kubectl cnpg destroy $CLUSTER demo-pg-common-15 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

2.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

2.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000003 --context $CTX

2.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

  • 3 instances, 3 ready
  • Primary: demo-pg-common-16 on vmss000018 (new node)
  • Standbys: demo-pg-common-17, demo-pg-common-18 (new nodes)
  • Replication lag: 0
  • WAL archiving: OK
  • Cluster status: "Cluster in healthy state"

STOP and validate before proceeding to production.


Phase 3: Production

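⚠ PRODUCTION: extra caution required. Ensure instance-24 is fully ready before starting. The promote target below is instance-22, which is already synced (instance-23 is an equivalent alternative), so the promote itself does not depend on instance-24; if instance-24 is still not ready, adjust the remaining steps accordingly.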

3.1 Pre-flight

CTX=azure-production; CLUSTER=production-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

Additional production checks:

  • Instance-24 status is OK and streaming (if not, exclude it and plan for 2 new + promote)
  • Confirm no active maintenance windows or deployments
  • Notify on-call / relevant teams

3.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER production-pg-common-22 -n $NS --context $CTX

Wait and verify:

kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:

  • Primary instance: production-pg-common-22
  • Old primary (production-pg-common-19) is now Standby
  • All standbys streaming with 0 lag
  • Cluster healthy

3.3 Cordon old nodes

kubectl cordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000h --context $CTX

3.4 Destroy old-node instances (one at a time)

kubectl cnpg destroy $CLUSTER production-pg-common-19 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX
kubectl cnpg destroy $CLUSTER production-pg-common-20 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX
kubectl cnpg destroy $CLUSTER production-pg-common-21 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

3.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

3.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000h --context $CTX

3.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

  • 3 instances, 3 ready
  • Primary: production-pg-common-22 on vmss00000i (new node)
  • Standbys: production-pg-common-23, production-pg-common-24 (new nodes)
  • Replication lag: 0
  • WAL archiving: OK
  • Cluster status: "Cluster in healthy state"

Rollback Procedure

If a switchover causes issues at any point:

Immediate rollback (before destroy)

# Promote the old primary back
kubectl cnpg promote $CLUSTER <old-primary-instance> -n $NS --context $CTX

After partial destroy

If old instances have already been destroyed, the cluster is committed to the new nodes. Recovery options:

  1. Scale up instances to add more replicas on remaining nodes
  2. If the new primary is unhealthy, promote any healthy standby:
    kubectl cnpg promote $CLUSTER <healthy-standby> -n $NS --context $CTX

Emergency: Restore from backup

Last resort — use CNPG continuous backup to restore:

# Check first point of recoverability
kubectl cnpg status $CLUSTER -n $NS --context $CTX | grep "Point of Recoverability"

Post-Procedure Cleanup

After all three environments are complete:

  1. Drain and remove old nodes from the VMSS (via Azure CLI or portal)
  2. Verify pooler pods (-pooler-rw, -pooler-ro) reconnected to new primary
  3. Monitor for 24h: replication lag, WAL archiving, connection counts
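  4. Clean up completed snapshot-recovery pods. Run this once per environment with CTX set accordingly; note that the selector matches every Succeeded pod in the db namespace, not only snapshot-recovery pods:
    kubectl delete pods -n db --field-selector=status.phase==Succeeded --context $CTX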
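Since the cleanup has to run against all three contexts, a small loop with a preview mode can reduce the chance of deleting something unexpected. A minimal sketch; the function name cleanup_succeeded_pods and its preview/apply argument are hypothetical, while the context names and the selector are the ones used in this runbook.

```shell
# Hypothetical helper: list (default) or delete ("apply") Succeeded pods in the
# db namespace across the three environments named in this runbook.
cleanup_succeeded_pods() {
  mode="${1:-preview}"
  for ctx in azure-staging azure-demo azure-production; do
    if [ "$mode" = "apply" ]; then
      kubectl delete pods -n db --field-selector=status.phase==Succeeded --context "$ctx"
    else
      kubectl get pods -n db --field-selector=status.phase==Succeeded --context "$ctx"
    fi
  done
}
```

Running `cleanup_succeeded_pods` first shows what would be removed; `cleanup_succeeded_pods apply` then performs the deletion.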

Risk Assessment

| Risk | Mitigation |
|---|---|
| Switchover causes brief write unavailability | Expected: ~5-10s per switchover. PgBouncer poolers absorb connection retries. |
| CNPG recreates destroyed instance on old node | Cordoning old nodes before destroy prevents this. |
| New instance not caught up at promote time | Pre-flight checks verify 0 lag before proceeding. |
| Production instance-24 not ready | Use instance-22 as promote target (confirmed synced). Instance-23 and 24 become standbys. |
| Destroy removes PVC with data | Expected behavior: we no longer need old-node data. Backups provide a safety net. |