CNPG pg-common Primary Switchover Runbook

Date: 2026-04-04

Objective: Migrate pg-common primary to newly provisioned nodes across staging, demo, and production, then decommission old-node instances to reach a 3-replica cluster per environment.

Current State

Staging (`azure-staging` / `staging-pg-common`)

Instance	Role	Node	Node Age
`staging-pg-common-6`	PRIMARY	`vmss000000`	Mar 31 (old)
`staging-pg-common-10`	Standby (sync)	`vmss000002`	Mar 31 (old)
`staging-pg-common-9`	Standby (sync)	`vmss00001e`	Mar 31 (old)
`staging-pg-common-11`	Standby (sync)	`vmss00001i`	Apr 4 (new)
`staging-pg-common-12`	Standby (sync)	`vmss00001j`	Apr 4 (new)
`staging-pg-common-13`	Standby (sync)	`vmss00001k`	Apr 4 (new)

Cluster: 6 instances, 6 ready, 181G, 0 replication lag
All new-node standbys fully synced

Demo (`azure-demo` / `demo-pg-common`)

Instance	Role	Node	Node Age
`demo-pg-common-13`	PRIMARY	`vmss000000`	Mar 29 (old)
`demo-pg-common-2`	Standby (async)	`vmss000003`	Mar 29 (old)
`demo-pg-common-15`	Standby (async)	`vmss000002`	Mar 29 (old)
`demo-pg-common-16`	Standby (async)	`vmss000018`	Apr 4 (new)
`demo-pg-common-17`	Standby (async)	`vmss000019`	Apr 4 (new)
`demo-pg-common-18`	Standby (async)	`vmss00001a`	Apr 4 (new)

Cluster: 6 instances, 6 ready, 121G, 0 replication lag
All new-node standbys fully synced

Production (`azure-production` / `production-pg-common`)

Instance	Role	Node	Node Age
`production-pg-common-19`	PRIMARY	`vmss00000d`	Feb 3 (old)
`production-pg-common-20`	Standby (sync)	`vmss00000h`	Feb 3 (old)
`production-pg-common-21`	Standby (sync)	`vmss00000g`	Feb 3 (old)
`production-pg-common-22`	Standby (sync)	`vmss00000i`	Apr 4 (new)
`production-pg-common-23`	Standby (sync)	`vmss00000j`	Apr 4 (new)
`production-pg-common-24`	Initializing	`vmss00000k`	Apr 4 (new)

Cluster: 6 instances, 5 ready (instance-24 still initializing), 553G
Instances 22 and 23 fully synced (sync standbys, ~0.36s replay lag)
Instance 24 still joining — must be ready before starting production switchover

Procedure Overview

For each environment (staging → demo → production):

1. Pre-flight checks: verify cluster health, replication lag, backups
2. Promote: switchover primary to a new-node instance
3. Cordon: prevent scheduling on old nodes (blocks CNPG recreation during destroy)
4. Destroy: remove old-node instances one at a time
5. Patch: set `instances: 3` in the cluster spec (stops CNPG from recreating)
6. Uncordon: release old nodes so cluster autoscaler can scale them down
7. Post-flight checks: verify final cluster health

Why patch before uncordon? After destroying old instances, CNPG still wants 6 replicas. If we uncordon first, CNPG could immediately recreate pods on the old nodes. Patching to instances: 3 first tells CNPG the desired state matches reality (3 running). Only then is it safe to uncordon, allowing the autoscaler to scale down the empty old nodes.
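The ordering rule above can be checked mechanically before uncordoning. The sketch below is a hypothetical helper (`instances_settled` is not part of CNPG or of this runbook). It assumes the Cluster resource exposes `.spec.instances` and `.status.readyInstances`, which should be confirmed against the CNPG version in use.

```shell
# Hypothetical helper, not part of CNPG: succeeds only when the desired
# instance count and the ready instance count both equal the expected value.
#   $1 = spec.instances   $2 = status.readyInstances   $3 = expected count
instances_settled() {
  [ "$1" -eq "$3" ] && [ "$2" -eq "$3" ]
}

# Example use against a live cluster (assumes the jsonpath fields exist):
#   set -- $(kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
#       -o jsonpath='{.spec.instances} {.status.readyInstances}')
#   instances_settled "$1" "$2" 3 && echo "safe to uncordon"
```

If the function fails, the cluster spec and the running state still disagree, and uncordoning would give CNPG room to schedule pods back onto old nodes.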

Pre-flight Checklist (run before EACH environment)

# Set variables for the current environment
# STAGING:
CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
# DEMO:
# CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
# PRODUCTION:
# CTX=azure-production; CLUSTER=production-pg-common; NS=db

# 1. Cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# 2. Verify all instances are Ready and replication lag is effectively 0
#    STOP if any instance shows lag >= 1s or status != OK

# 3. Verify WAL archiving is working
#    Look for: "Working WAL archiving: OK"

# 4. Verify continuous backup recoverability
#    Look for: "First Point of Recoverability" is recent
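Commenting and uncommenting variable lines is easy to get wrong under pressure. As one possible alternative, the sketch below wraps the three variable sets listed above in a hypothetical function (`use_env` is an illustrative name, not an existing tool). It relies on the naming pattern visible in this runbook: context `azure-<env>`, cluster `<env>-pg-common`, namespace `db`.

```shell
# Hypothetical helper: sets CTX, CLUSTER and NS for one environment,
# using the values listed in this runbook.
use_env() {
  case $1 in
    staging|demo|production)
      CTX="azure-$1"
      CLUSTER="$1-pg-common"
      NS=db
      ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# Example:
#   use_env staging && kubectl cnpg status "$CLUSTER" -n "$NS" --context "$CTX"
```

Rejecting unknown names means a typo stops the session instead of leaving the previous environment's variables in place.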

GO/NO-GO decision point — do not proceed unless:

All instances show Status: OK
Replication lag is < 1 second on all standbys
WAL archiving is OK
Backup recoverability point is recent
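Part of the GO/NO-GO list can be turned into a simple text gate. The sketch below is a hypothetical helper (`preflight_gate` is an illustrative name) that reads `kubectl cnpg status` output on stdin and looks only for the strings this runbook already names. It does not parse replication lag or judge how recent the recoverability point is; those still need a human check.

```shell
# Hypothetical helper: reads `kubectl cnpg status` output on stdin and prints
# GO only if the marker strings named in this runbook are all present.
# It does NOT check replication lag or the age of the recoverability point.
preflight_gate() {
  gate_out=$(cat)
  printf '%s\n' "$gate_out" | grep -q 'Cluster in healthy state' \
    || { echo "NO-GO: cluster not in healthy state"; return 1; }
  printf '%s\n' "$gate_out" | grep -Eq 'Working WAL archiving:[[:space:]]*OK' \
    || { echo "NO-GO: WAL archiving not OK"; return 1; }
  printf '%s\n' "$gate_out" | grep -q 'First Point of Recoverability' \
    || { echo "NO-GO: no recoverability point reported"; return 1; }
  echo "GO"
}

# Example:
#   kubectl cnpg status "$CLUSTER" -n "$NS" --context "$CTX" | preflight_gate
```

The exact wording of the plugin output may differ between CNPG versions, so treat a NO-GO from this gate as a prompt to read the full status rather than as a final verdict.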

Phase 1: Staging

1.1 Pre-flight

CTX=azure-staging; CLUSTER=staging-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

1.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER staging-pg-common-11 -n $NS --context $CTX

Wait for switchover to complete:

# Re-run until the primary changes (should take < 30 seconds)
kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20
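The status command above runs once, so waiting for the switchover means re-running it by hand. A minimal polling sketch follows, assuming the Cluster resource reports the current primary in `.status.currentPrimary` (verify this for your CNPG version). `get_primary` and `wait_for_primary` are hypothetical names introduced here.

```shell
# Assumption: the CNPG Cluster resource exposes .status.currentPrimary.
get_primary() {
  kubectl get cluster "$CLUSTER" -n "$NS" --context "$CTX" \
    -o jsonpath='{.status.currentPrimary}'
}

# Hypothetical helper: polls until the current primary equals $1.
#   $1 = expected primary   $2 = max attempts (default 30)   $3 = delay in seconds (default 2)
wait_for_primary() {
  want=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(get_primary)" = "$want" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example:
#   wait_for_primary staging-pg-common-11 || echo "primary did not change in time"
```

A non-zero exit after the attempts run out is the signal to stop and inspect the full status output before going any further.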

Verify:

Primary instance: staging-pg-common-11
Old primary (staging-pg-common-6) is now Standby
Cluster status returns to healthy

1.3 Cordon old nodes

kubectl cordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-36771050-vmss00001e --context $CTX
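The node names above are hard-coded from the Current State table. As a cross-check, the sketch below derives the node list from the old instances themselves. It assumes each CNPG instance runs in a pod with the same name as the instance; `get_node` and `nodes_for_instances` are hypothetical helpers.

```shell
# Assumption: each instance's pod carries the instance name.
get_node() {
  kubectl get pod "$1" -n "$NS" --context "$CTX" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: prints the unique node names hosting the given instances.
nodes_for_instances() {
  for inst in "$@"; do
    printf '%s\n' "$(get_node "$inst")"
  done | sort -u
}

# Example (prints the nodes; compare with the cordon commands above):
#   nodes_for_instances staging-pg-common-6 staging-pg-common-10 staging-pg-common-9
```

If the printed nodes differ from the hard-coded list, resolve the mismatch before cordoning anything.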

1.4 Destroy old-node instances (one at a time)

# Destroy old primary
kubectl cnpg destroy $CLUSTER staging-pg-common-6 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy second old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-10 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy third old instance
kubectl cnpg destroy $CLUSTER staging-pg-common-9 -n $NS --context $CTX

# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX
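The destroy, wait, verify sequence above repeats for each instance, so it can be previewed as a loop. The sketch below is a hypothetical wrapper (`destroy_in_order` is an illustrative name) that defaults to a dry run: with `RUN` unset it only prints the commands, and with `RUN=` (set but empty) it executes them.

```shell
# Hypothetical helper: destroys the given instances one at a time, pausing
# and showing status after each. Dry run by default: with RUN unset the
# commands are only printed; export RUN= (empty) to execute them.
destroy_in_order() {
  for inst in "$@"; do
    ${RUN-echo} kubectl cnpg destroy "$CLUSTER" "$inst" -n "$NS" --context "$CTX"
    ${RUN-echo} sleep 30
    ${RUN-echo} kubectl cnpg status "$CLUSTER" -n "$NS" --context "$CTX"
  done
}

# Example (dry run):
#   destroy_in_order staging-pg-common-6 staging-pg-common-10 staging-pg-common-9
```

The loop does not evaluate the status output itself, so in execute mode an operator still needs to watch each status report before the next destroy starts.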

1.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

1.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-36771050-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-36771050-vmss00001e --context $CTX

1.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

3 instances, 3 ready
Primary: staging-pg-common-11 on vmss00001i (new node)
Standbys: staging-pg-common-12, staging-pg-common-13 (new nodes)
Replication lag: 0
WAL archiving: OK
Cluster status: "Cluster in healthy state"

STOP and validate before proceeding to demo.

Phase 2: Demo

2.1 Pre-flight

CTX=azure-demo; CLUSTER=demo-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

2.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER demo-pg-common-16 -n $NS --context $CTX

Wait and verify:

kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:

Primary instance: demo-pg-common-16
Old primary is now Standby
Cluster healthy

2.3 Cordon old nodes

kubectl cordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl cordon aks-pgcommon-37672289-vmss000003 --context $CTX

2.4 Destroy old-node instances (one at a time)

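# Destroy old primary
kubectl cnpg destroy $CLUSTER demo-pg-common-13 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy second old instance
kubectl cnpg destroy $CLUSTER demo-pg-common-2 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX

# Destroy third old instance
kubectl cnpg destroy $CLUSTER demo-pg-common-15 -n $NS --context $CTX
# Wait ~30s, verify cluster health
kubectl cnpg status $CLUSTER -n $NS --context $CTX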

2.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

2.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-37672289-vmss000000 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000002 --context $CTX
kubectl uncordon aks-pgcommon-37672289-vmss000003 --context $CTX

2.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

3 instances, 3 ready
Primary: demo-pg-common-16 on vmss000018 (new node)
Standbys: demo-pg-common-17, demo-pg-common-18 (new nodes)
Replication lag: 0
WAL archiving: OK
Cluster status: "Cluster in healthy state"

STOP and validate before proceeding to production.

Phase 3: Production

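⚠ PRODUCTION: extra caution required. Ensure instance-24 is fully ready before starting. The promote target in 3.2 is instance-22; instance-23 is also confirmed synced and can serve as the target instead. If instance-24 is still not ready, do not promote it, and adjust the remaining steps accordingly.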

3.1 Pre-flight

CTX=azure-production; CLUSTER=production-pg-common; NS=db
kubectl cnpg status $CLUSTER -n $NS --context $CTX

Additional production checks:

Instance-24 status is OK and streaming (if not, exclude it and plan for 2 new + promote)
Confirm no active maintenance windows or deployments
Notify on-call / relevant teams

3.2 Promote new-node instance to primary

kubectl cnpg promote $CLUSTER production-pg-common-22 -n $NS --context $CTX

Wait and verify:

kubectl cnpg status $CLUSTER -n $NS --context $CTX | head -20

Verify:

Primary instance: production-pg-common-22
Old primary (production-pg-common-19) is now Standby
All standbys streaming with 0 lag
Cluster healthy

3.3 Cordon old nodes

kubectl cordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl cordon aks-pgcommon-53760683-vmss00000h --context $CTX

3.4 Destroy old-node instances (one at a time)

kubectl cnpg destroy $CLUSTER production-pg-common-19 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER production-pg-common-20 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

kubectl cnpg destroy $CLUSTER production-pg-common-21 -n $NS --context $CTX
# Wait ~30s
kubectl cnpg status $CLUSTER -n $NS --context $CTX

3.5 Patch cluster spec to 3 instances

kubectl patch cluster $CLUSTER -n $NS --context $CTX \
  --type merge -p '{"spec":{"instances":3}}'

3.6 Uncordon old nodes (allows autoscaler to scale them down)

kubectl uncordon aks-pgcommon-53760683-vmss00000d --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000g --context $CTX
kubectl uncordon aks-pgcommon-53760683-vmss00000h --context $CTX

3.7 Post-flight verification

kubectl cnpg status $CLUSTER -n $NS --context $CTX

Verify:

3 instances, 3 ready
Primary: production-pg-common-22 on vmss00000i (new node)
Standbys: production-pg-common-23, production-pg-common-24 (new nodes)
Replication lag: 0
WAL archiving: OK
Cluster status: "Cluster in healthy state"

Rollback Procedure

If a switchover causes issues at any point:

Immediate rollback (before destroy)

# Promote the old primary back
kubectl cnpg promote $CLUSTER <old-primary-instance> -n $NS --context $CTX

After partial destroy

If old instances have already been destroyed, the cluster is committed to the new nodes. Recovery options:

Scale up `instances` in the cluster spec so that CNPG adds replicas on the remaining schedulable nodes

If the new primary is unhealthy, promote any healthy standby:

kubectl cnpg promote $CLUSTER <healthy-standby> -n $NS --context $CTX
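The first recovery option above (scaling up) reuses the same merge patch as the Patch step, with a different count. The sketch below builds that patch body with a hypothetical helper (`instances_patch` is an illustrative name); the target count of 4 in the example is only a placeholder.

```shell
# Hypothetical helper: prints the merge-patch body for a given instance count.
instances_patch() {
  case $1 in
    ''|*[!0-9]*)
      echo "instances must be a whole number" >&2
      return 1
      ;;
  esac
  printf '{"spec":{"instances":%s}}\n' "$1"
}

# Example (4 is a placeholder count, not a recommendation):
#   kubectl patch cluster "$CLUSTER" -n "$NS" --context "$CTX" \
#     --type merge -p "$(instances_patch 4)"
```

New replicas can only land on nodes that are schedulable, so check which nodes are still cordoned before scaling up.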

Emergency: Restore from backup

Last resort — use CNPG continuous backup to restore:

# Check first point of recoverability
kubectl cnpg status $CLUSTER -n $NS --context $CTX | grep "Point of Recoverability"

Post-Procedure Cleanup

After all three environments are complete:

Drain and remove old nodes from the VMSS (via Azure CLI or portal)
Verify pooler pods (-pooler-rw, -pooler-ro) reconnected to new primary
Monitor for 24h: replication lag, WAL archiving, connection counts

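Clean up completed snapshot-recovery pods (run once per environment, with CTX set accordingly):

# Note: this selector matches every Succeeded pod in the db namespace,
# not only snapshot-recovery pods. Review the list first by running
# `kubectl get pods` with the same selector.
kubectl delete pods -n db --field-selector=status.phase==Succeeded --context $CTX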

Risk Assessment

Risk	Mitigation
Switchover causes brief write unavailability	Expected: ~5-10s per switchover. PgBouncer poolers absorb connection retry.
CNPG recreates destroyed instance on old node	Cordoning old nodes before destroy prevents this.
New instance not caught up at promote time	Pre-flight checks verify 0 lag before proceeding.
Production instance-24 not ready	Use instance-22 as promote target (confirmed synced). Instance-23 and 24 become standbys.
Destroy removes PVC with data	Expected behavior — we no longer need old-node data. Backups provide safety net.
