Date: 2026-04-04

Objective: Migrate the pg-common primary to newly provisioned nodes across staging, demo, and production, then decommission the old-node instances to reach a 3-instance cluster (one primary, two standbys) per environment.
| Instance | Role | Node | Node Age |
|---|---|---|---|
| staging-pg-common-6 | PRIMARY | vmss000000 | Mar 31 (old) |
| staging-pg-common-10 | Standby (sync) | vmss000002 | Mar 31 (old) |
| staging-pg-common-9 | Standby (sync) | vmss00001e | Mar 31 (old) |
| staging-pg-common-11 | Standby (sync) | vmss00001i | Apr 4 (new) |
| staging-pg-common-12 | Standby (sync) | vmss00001j | Apr 4 (new) |
| staging-pg-common-13 | Standby (sync) | vmss00001k | Apr 4 (new) |

- Cluster: 6 instances, 6 ready, 181G, 0 replication lag
- All new-node standbys fully synced
| Instance | Role | Node | Node Age |
|---|---|---|---|
| demo-pg-common-13 | PRIMARY | vmss000000 | Mar 29 (old) |
| demo-pg-common-2 | Standby (async) | vmss000003 | Mar 29 (old) |
| demo-pg-common-15 | Standby (async) | vmss000002 | Mar 29 (old) |
| demo-pg-common-16 | Standby (async) | vmss000018 | Apr 4 (new) |
| demo-pg-common-17 | Standby (async) | vmss000019 | Apr 4 (new) |
| demo-pg-common-18 | Standby (async) | vmss00001a | Apr 4 (new) |

- Cluster: 6 instances, 6 ready, 121G, 0 replication lag
- All new-node standbys fully synced
| Instance | Role | Node | Node Age |
|---|---|---|---|
| production-pg-common-19 | PRIMARY | vmss00000d | Feb 3 (old) |
| production-pg-common-20 | Standby (sync) | vmss00000h | Feb 3 (old) |
| production-pg-common-21 | Standby (sync) | vmss00000g | Feb 3 (old) |
| production-pg-common-22 | Standby (sync) | vmss00000i | Apr 4 (new) |
| production-pg-common-23 | Standby (sync) | vmss00000j | Apr 4 (new) |
| production-pg-common-24 | Initializing | vmss00000k | Apr 4 (new) |

- Cluster: 6 instances, 5 ready (instance-24 still initializing), 553G
- Instances 22 and 23 fully synced (sync standbys, ~0.36s replay lag)
- Instance 24 still joining — must be ready before starting production switchover
For each environment (staging → demo → production):
- Pre-flight checks: verify cluster health, replication lag, backups
- Promote: switch the primary over to a new-node instance
- Cordon: prevent scheduling on old nodes (blocks CNPG recreation during destroy)
- Destroy: remove old-node instances one at a time
- Patch: set `instances: 3` in the cluster spec (stops CNPG from recreating instances)
- Uncordon: release old nodes so the cluster autoscaler can scale them down
- Post-flight checks: verify final cluster health

Why patch before uncordon? After the old instances are destroyed, CNPG still wants 6 instances. If we uncordon first, CNPG could immediately recreate pods on the old nodes. Patching to `instances: 3` first tells CNPG that the desired state matches reality (3 running). Only then is it safe to uncordon, allowing the autoscaler to scale down the empty old nodes.

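The ordering described above can be sketched as a single shell function. This is a minimal sketch, not part of the original procedure: `migrate_env`, `run`, and the `DRY_RUN` switch are hypothetical names introduced for illustration, while the `kubectl` commands are the ones this runbook already uses. With `DRY_RUN=1` the function only prints the commands, which makes the sequence easy to review before anything is executed.

```sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Sketch of the per-environment order: promote, cordon, destroy, patch, uncordon.
# Usage: migrate_env <new-primary> "<old nodes>" "<old instances>"
# Expects CTX, CLUSTER and NS to be set as in the runbook.
migrate_env() {
  target=$1; old_nodes=$2; old_instances=$3

  run kubectl cnpg promote "$CLUSTER" "$target" -n "$NS" --context "$CTX"

  for node in $old_nodes; do
    run kubectl cordon "$node" --context "$CTX"
  done

  for inst in $old_instances; do
    run kubectl cnpg destroy "$CLUSTER" "$inst" -n "$NS" --context "$CTX"
    # The runbook inspects cluster health after each destroy.
    run kubectl cnpg status "$CLUSTER" -n "$NS" --context "$CTX"
  done

  # Patch before uncordon so CNPG no longer wants replacement instances.
  run kubectl patch cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    --type merge -p '{"spec":{"instances":3}}'

  for node in $old_nodes; do
    run kubectl uncordon "$node" --context "$CTX"
  done
}
```

For staging, a dry run might look like `DRY_RUN=1; migrate_env staging-pg-common-11 "aks-pgcommon-36771050-vmss000000 aks-pgcommon-36771050-vmss000002 aks-pgcommon-36771050-vmss00001e" "staging-pg-common-6 staging-pg-common-10 staging-pg-common-9"`. Outside dry-run mode the sketch does not pause for the manual health checks, so it is best treated as a review aid rather than a replacement for the step-by-step procedure.
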
# Set variables for the current environment
# STAGING:
CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
# DEMO:
# CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
# PRODUCTION:
# CTX=azure-production; CLUSTER=production-pg-common; NS=db
# 1. Cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX
# 2. Verify all instances are Ready and replication lag is 0
# STOP if any instance shows lag > 1s or status != OK
# 3. Verify WAL archiving is working
# Look for: "Working WAL archiving: OK"
# 4. Verify continuous backup recoverability
# Look for: "First Point of Recoverability" is recent

GO/NO-GO decision point. Do not proceed unless:
- All instances show Status: OK
- Replication lag is < 1 second on all standbys
- WAL archiving is OK
- Backup recoverability point is recent
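The instance-readiness part of this gate can be scripted; replication lag, WAL archiving, and the recoverability point still need to be read from the `kubectl cnpg status` output. The sketch below assumes the CNPG `Cluster` resource exposes `.status.instances` and `.status.readyInstances` (worth confirming against the installed CNPG version), and `cluster_all_ready` is a hypothetical helper name.

```sh
# Succeeds only when every instance of the cluster reports ready.
# Assumption: the Cluster resource exposes .status.instances and
# .status.readyInstances. Expects CTX, CLUSTER and NS to be set.
cluster_all_ready() {
  counts=$(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    -o jsonpath='{.status.instances} {.status.readyInstances}') || return 1
  total=${counts% *}   # text before the space
  ready=${counts#* }   # text after the space
  echo "instances=$total ready=$ready"
  # Treat a missing value as not ready.
  [ -n "$total" ] && [ "$total" = "$ready" ]
}
```

A possible use is `cluster_all_ready || echo "NO-GO"`, run right after the variables for the environment are set.
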
CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg promote $CLUSTER staging-pg-common-11 -n $NS --context $CTX

Wait for switchover to complete:
# Poll until primary changes (should take < 30 seconds)
kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:
- `Primary instance: staging-pg-common-11`
- Old primary (`staging-pg-common-6`) is now Standby
- Cluster status returns to healthy

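Instead of re-running the status command by hand, the wait for the switchover could be polled. This is a minimal sketch: `wait_for_primary` is a hypothetical helper, the 60-second default timeout and 5-second interval are arbitrary choices, and it assumes the CNPG `Cluster` resource reports the active primary in `.status.currentPrimary`.

```sh
# Poll the Cluster resource until the expected instance is the current primary.
# Assumption: .status.currentPrimary is populated by the CNPG operator.
# Usage: wait_for_primary <instance> [timeout-seconds] [poll-interval-seconds]
# Expects CTX, CLUSTER and NS to be set.
wait_for_primary() {
  expected=$1; timeout=${2:-60}; interval=${3:-5}
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    current=$(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
      -o jsonpath='{.status.currentPrimary}')
    if [ "$current" = "$expected" ]; then
      echo "primary is $current"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out; current primary is ${current:-unknown}" >&2
  return 1
}
```

For staging this would be called as `wait_for_primary staging-pg-common-11`. A successful return only confirms which instance is primary; the remaining items in the verify list still need a look at `kubectl cnpg status`.
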
kubectl cordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss00001e --context $CTX

# Destroy old primary
kubectl cnpg destroy $CLUSTER staging-pg-common-6 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy second old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-10 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy third old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-9 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

kubectl uncordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss00001e --context $CTX

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:
- 3 instances, 3 ready
- Primary: `staging-pg-common-11` on `vmss00001i` (new node)
- Standbys: `staging-pg-common-12`, `staging-pg-common-13` (new nodes)
- Replication lag: 0
- WAL archiving: OK
- Cluster status: "Cluster in healthy state"
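
The "new node" items in this list can be cross-checked by confirming that no cluster pod is still scheduled on an old node. The sketch below is illustrative only: `cluster_pod_nodes` and `assert_off_nodes` are hypothetical helpers, and the label selector `cnpg.io/cluster=<cluster name>` is an assumption about how CNPG labels its instance pods.

```sh
# List "<pod> <node>" pairs for the cluster's pods.
# Assumption: instance pods carry the label cnpg.io/cluster=<cluster name>.
# Expects CTX, CLUSTER and NS to be set.
cluster_pod_nodes() {
  kubectl get pods -n "$NS" --context "$CTX" -l "cnpg.io/cluster=$CLUSTER" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}'
}

# Fail if any cluster pod still runs on one of the given (old) nodes.
# Usage: assert_off_nodes <node>...
assert_off_nodes() {
  placement=$(cluster_pod_nodes) || return 1
  rc=0
  for node in "$@"; do
    # Match on the second column (node name) exactly.
    if printf '%s\n' "$placement" |
        awk -v n="$node" '$2 == n { found = 1 } END { exit !found }'; then
      echo "pod still scheduled on $node" >&2
      rc=1
    fi
  done
  return $rc
}
```

For staging, `assert_off_nodes aks-pgcommon-36771050-vmss000000 aks-pgcommon-36771050-vmss000002 aks-pgcommon-36771050-vmss00001e` should succeed once the migration is complete.
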
STOP and validate before proceeding to demo.
CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg promote $CLUSTER demo-pg-common-16 -n $NS --context $CTX

Wait and verify:
kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:
- `Primary instance: demo-pg-common-16`
- Old primary is now Standby
- Cluster healthy
kubectl cordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000003 --context $CTX

kubectl cnpg destroy $CLUSTER demo-pg-common-13 -n $NS --context $CTX
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER demo-pg-common-2 -n $NS --context $CTX
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER demo-pg-common-15 -n $NS --context $CTX
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

kubectl uncordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000003 --context $CTX

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:
- 3 instances, 3 ready
- Primary: `demo-pg-common-16` on `vmss000018` (new node)
- Standbys: `demo-pg-common-17`, `demo-pg-common-18` (new nodes)
- Replication lag: 0
- WAL archiving: OK
- Cluster status: "Cluster in healthy state"
STOP and validate before proceeding to production.
⚠ PRODUCTION — Extra caution required. Ensure instance-24 is fully ready before starting. If not, use instance-22 or instance-23 as promote target and adjust accordingly.
CTX=azure-production; CLUSTER=production-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

Additional production checks:
- Instance-24 status is OK and streaming (if not, exclude it and plan for 2 new + promote)
- Confirm no active maintenance windows or deployments
- Notify on-call / relevant teams
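
Waiting for instance-24 can be partly automated with `kubectl wait`. This is a sketch under assumptions: `wait_instance_ready` is a hypothetical helper, the 300-second default timeout is arbitrary, and pod readiness is only a proxy, so it does not prove the standby is streaming; that still has to be confirmed in the `kubectl cnpg status` output.

```sh
# Block until the given instance pod reports the Ready condition, or time out.
# Pod readiness is only a proxy: it does not prove the standby is streaming.
# Usage: wait_instance_ready <pod-name> [timeout]
# Expects CTX and NS to be set.
wait_instance_ready() {
  kubectl wait --for=condition=Ready "pod/$1" \
    -n "$NS" --context "$CTX" --timeout="${2:-300s}"
}
```

For example, `wait_instance_ready production-pg-common-24 600s` returns non-zero if the pod is not Ready within ten minutes, which is the cue to fall back to the plan with two new standbys.
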
kubectl cnpg promote $CLUSTER production-pg-common-22 -n $NS --context $CTX

Wait and verify:
kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:
- `Primary instance: production-pg-common-22`
- Old primary (`production-pg-common-19`) is now Standby
- All standbys streaming with 0 lag
- Cluster healthy
kubectl cordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000h --context $CTX

kubectl cnpg destroy $CLUSTER production-pg-common-19 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER production-pg-common-20 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER production-pg-common-21 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

kubectl uncordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000h --context $CTX

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:
- 3 instances, 3 ready
- Primary: `production-pg-common-22` on `vmss00000i` (new node)
- Standbys: `production-pg-common-23`, `production-pg-common-24` (new nodes)
- Replication lag: 0
- WAL archiving: OK
- Cluster status: "Cluster in healthy state"
If a switchover causes issues at any point:
# Promote the old primary back
kubectl cnpg promote $CLUSTER <old-primary-instance> -n $NS --context $CTX

If old instances have already been destroyed, the cluster is committed to the new nodes. Recovery options:
- Scale up `instances` to add more replicas on remaining nodes
- If the new primary is unhealthy, promote any healthy standby:
kubectl cnpg promote $CLUSTER <healthy-standby> -n $NS --context $CTX
Last resort — use CNPG continuous backup to restore:
# Check first point of recoverability
kubectl cnpg status $CLUSTER -n $NS --context $CTX | grep "Point of Recoverability"

After all three environments are complete:
- Drain and remove old nodes from the VMSS (via Azure CLI or portal)
- Verify pooler pods (`-pooler-rw`, `-pooler-ro`) reconnected to new primary
- Monitor for 24h: replication lag, WAL archiving, connection counts
- Clean up completed snapshot-recovery pods:
kubectl delete pods -n db --field-selector=status.phase==Succeeded --context $CTX
| Risk | Mitigation |
|---|---|
| Switchover causes brief write unavailability | Expected: ~5-10s per switchover. PgBouncer poolers absorb connection retry. |
| CNPG recreates destroyed instance on old node | Cordoning old nodes before destroy prevents this. |
| New instance not caught up at promote time | Pre-flight checks verify 0 lag before proceeding. |
| Production instance-24 not ready | Use instance-22 as promote target (confirmed synced). Instance-23 and 24 become standbys. |
| Destroy removes PVC with data | Expected behavior — we no longer need old-node data. Backups provide safety net. |
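
The second mitigation in the table relies on every old node being cordoned before the first destroy. A guard along the following lines could make that explicit. It is a sketch only: `all_cordoned` is a hypothetical helper, and it relies on `kubectl cordon` setting `spec.unschedulable` to `true` on the node.

```sh
# Succeeds only if every given node is cordoned (spec.unschedulable is true).
# Usage: all_cordoned <node>...
# Expects CTX to be set.
all_cordoned() {
  for node in "$@"; do
    flag=$(kubectl get node "$node" --context "$CTX" \
      -o jsonpath='{.spec.unschedulable}') || return 1
    # The field is empty on schedulable nodes.
    if [ "$flag" != "true" ]; then
      echo "$node is not cordoned" >&2
      return 1
    fi
  done
}
```

Running `all_cordoned aks-pgcommon-53760683-vmss00000d aks-pgcommon-53760683-vmss00000g aks-pgcommon-53760683-vmss00000h || echo "do not destroy yet"` before the first production destroy would catch a missed cordon.
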