Failover plan for adding disk on boost01

Playbook/Checklist: adding disk space to boost01/mount

Failover to boost02

  • (Maybe/Probably) Suspend boost reporting job

  • Put site in maintenance mode(?)

  • Manual run of boost etl job

  • Switch Production k8s endpoints to boost02

    Note: switching endpoints before promoting the standby database seems like the best way to prevent any split-brain or data loss in the HA cluster, but it may result in some momentary ugly Airbrake errors. Obviously the goal is to switch/promote/stop the old master as close to simultaneously as we can manage.

    Q: Should this be a PR into infrastructure:production branch, or can we deploy a separate branch into Production (we do intend to switch back, after all)?
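
    A rough sketch of the endpoint switch, assuming these are selector-less Services backed by hand-managed Endpoints objects (the object names are the ones listed in the switch-back section; the IP and port are placeholders):

    ```sh
    # Repoint the Endpoints object at boost02; repeat for external-reporting-compute-database.
    kubectl patch endpoints external-reporting-database --type merge \
      -p '{"subsets":[{"addresses":[{"ip":"10.0.0.22"}],"ports":[{"port":5432}]}]}'
    ```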

  • Roll pgbouncer pods in k8s
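
    For example (the deployment name and label are assumptions):

    ```sh
    # Restart the pgbouncer pods so they re-resolve the new endpoints.
    kubectl rollout restart deployment/pgbouncer
    kubectl rollout status deployment/pgbouncer
    # On older kubectl without `rollout restart`, deleting the pods by label works too:
    # kubectl delete pod -l app=pgbouncer
    ```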

  • Switch over the boost01:5432 primary instance to boost02:5432 by running repmgr standby switchover on boost02
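
    Roughly, run as the postgres user on boost02 (the config file path is an assumption):

    ```sh
    # Dry-run first to surface SSH/permission problems, then perform the switchover.
    repmgr -f /etc/repmgr/10/repmgr.conf standby switchover --dry-run
    repmgr -f /etc/repmgr/10/repmgr.conf standby switchover
    repmgr -f /etc/repmgr/10/repmgr.conf cluster show
    ```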

  • Confirm logical replication subscription in session db (refresh or rebuild as needed)

    The logical subscription in benchprep_reporting_api_production is pointed at db02 and should, in theory, pick up where it left off when we promote the standby, but my confidence in that is limited.
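
    If it does not resume cleanly, a rough recovery sequence (the subscription name is a placeholder) would be:

    ```sh
    # Inspect current subscription state first.
    psql -d benchprep_reporting_api_production -c "SELECT subname, subenabled, subconninfo FROM pg_subscription;"
    # If replication has stalled, bounce the subscription and re-sync the table list.
    psql -d benchprep_reporting_api_production -c "ALTER SUBSCRIPTION reporting_sub DISABLE;"
    psql -d benchprep_reporting_api_production -c "ALTER SUBSCRIPTION reporting_sub ENABLE;"
    psql -d benchprep_reporting_api_production -c "ALTER SUBSCRIPTION reporting_sub REFRESH PUBLICATION;"
    ```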

Prepare Postgres on boost01

  • Order new 1.9TB SED SSD for boost01

  • Wait for IBM to install the disk

  • Stop postgres on boost01:6432

  • Copy /var/lib/pgsql/10/data/*.conf to /tmp/5432/

  • Copy /mount/pgsql/10/wmx_rails_api/*conf to /tmp/6432/
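
    A sketch of the stop-and-preserve steps above, run as root or the postgres user (the systemd unit name for the 6432 instance is a guess):

    ```sh
    # Stop the wmx_rails_api instance and stash both instances' config files.
    sudo systemctl stop postgresql-10-wmx_rails_api
    mkdir -p /tmp/5432 /tmp/6432
    cp /var/lib/pgsql/10/data/*.conf /tmp/5432/
    cp /mount/pgsql/10/wmx_rails_api/*conf /tmp/6432/
    ```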

  • Create new replica from base backup of db02 on /mount/pgsql/10/wmx_rails_api
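
    Something along these lines (the upstream host and replication user are assumptions):

    ```sh
    # The target directory must be empty; -R writes a recovery.conf so the new
    # instance comes up as a streaming standby of its upstream.
    sudo -u postgres rm -rf /mount/pgsql/10/wmx_rails_api/*
    sudo -u postgres pg_basebackup -h db02 -U replication \
      -D /mount/pgsql/10/wmx_rails_api -X stream -P -R
    ```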

  • Copy *.conf files from /tmp/6432/ to new data directory

  • Start postgres boost01:6432 as streaming replica
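
    Quick sanity check once it is up (the unit name is a guess):

    ```sh
    sudo systemctl start postgresql-10-wmx_rails_api
    psql -p 6432 -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn();"
    ```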

  • Create replica from base backup of boost02:5432 on /var/lib/pgsql/10/data

  • Copy *.conf files from /tmp/5432/ to /var/lib/pgsql/10/data

  • Start postgres boost01:5432 in standby mode
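
    Roughly the same dance for the default instance; on PostgreSQL 10, standby mode comes from the recovery.conf that pg_basebackup -R writes (a repmgr standby clone / standby register would be an equivalent route). Host, user, and unit name below are assumptions:

    ```sh
    sudo systemctl stop postgresql-10
    sudo -u postgres rm -rf /var/lib/pgsql/10/data/*
    sudo -u postgres pg_basebackup -h boost02 -p 5432 -U replication \
      -D /var/lib/pgsql/10/data -X stream -P -R
    sudo cp /tmp/5432/*.conf /var/lib/pgsql/10/data/
    sudo chown postgres: /var/lib/pgsql/10/data/*.conf
    sudo systemctl start postgresql-10
    psql -c "SELECT pg_is_in_recovery();"
    ```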

Failover back to boost01

  • Put site in maintenance mode(?)
  • Switch Production k8s endpoints to boost01
    • external-reporting-compute-database
    • external-reporting-database
  • Roll pgbouncer pods
  • Stop postgres on boost02:5432
  • Promote boost01:5432 from standby to master (see the sketch after this list)
  • Out of maintenance mode
  • Confirm FDW config and connection from boost01:5432/production_boost_reporting to wmx_rails_api_production on 6432 replica.
  • Refresh or rebuild logical replication subscription in session db
  • Re-enable boost etl cron job
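
A sketch of the promote step and FDW check referenced above (the repmgr config path and the foreign table used for the smoke test are assumptions):

```sh
# On boost01, once postgres on boost02:5432 is stopped, promote the standby.
repmgr -f /etc/repmgr/10/repmgr.conf standby promote
repmgr -f /etc/repmgr/10/repmgr.conf cluster show

# FDW smoke test from the reporting database; the foreign table name is a placeholder.
psql -d production_boost_reporting -c "SELECT srvname, srvoptions FROM pg_foreign_server;"
psql -d production_boost_reporting -c "SELECT count(*) FROM some_foreign_table;"
```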