ssrihari/postgres-clusters-nuances.org

Last active April 15, 2024 04:46

Star (3) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/ssrihari/6443dbc0d4da4a91ba48.js"></script>
Save ssrihari/6443dbc0d4da4a91ba48 to your computer and use it in GitHub Desktop.

Download ZIP

Postgres clusters and their nuances

Raw

postgres-clusters-nuances.org

Postgres clusters and their nuances

Why

Do you even need a cluster of postgreses? Why did we choose to create a cluster ourselves?
Why didn’t RDS work for us? Why it might not work for you.
What are other RDS like services out there?

Postgres side of things:

What you need to know within postgres so your data can be streamed down to multiple nodes quick, and efficiently.

Hardware

Can you do without an SSD? Are writing fast enough? How to find out?
Will your application’s I/O performance degrade when you’re writing WALs?
How fast do you need to ship WALs to another machine? Network I/O nitigrities.

WAL (Write Ahead Logging)

What is it? How does it work?
How to configure it for a cluster - hot_standby, wal_keep_segments, etc.

WAL Archives

Why do you need it when you have streaming replication?
How do you set the archive and streaming delays?

How does this impact normal workflows?

Large deletes or updates can blot WALs
Read-only standbys, large queries and timeouts

Repmgr side of things:

What does this magic replication manager do? What I learnt by digging into it’s C code.

What does a cluster of postgres machines look like?

What are ‘master’, ‘standbys’, ‘witnesses’, and ‘failed’ nodes
Cascading, and Synchronous replication
priorities, configurations

Into the guts of ‘how’ repmgr works

The consensus algorithm it uses
repl_nodes - a table of all the machines in the cluster, their configs, statuses
repl_status - how far is each machine behind the master?
repl_events - what’s happening in the cluster along a timeline

The things repmgr does for your cluster

well, it exists. not a single open tool (save pgpool) out there that’s as good.
help setup, configure the relationships between postgres boxes
‘Clone’, ‘Register’, ‘Follow’, ‘Promote’ commands
monitor for failure, elect a master, automate failover
How to tap into the promote and failover scenarios?

The things repmgr doesn’t

It doesn’t talk to your application. it doesn’t act as a load balancer. or a single entry point to your cluster.
Tell you when a failover happens. You need to rig that up yourself.
Give you extensive support or documentation. But, you have the source.

Failovers

How things can go wrong

Multi-master / split brain / STONITH
No master
Network disconnect, Disk full

What happens when a failover happens?

How long does it take? How do you know which is the new master?
How do we recover a failed DB? How long does it take?
How do we add a new machine to the cluster?
How do we remove stale node entries?

When things have gone wrong

What measures to take to ensure you’re equipped for bad scenarios.
How do you save data when you know you’re going to lose it?
Deletes/Truncates cascade. How do you free disk space?
When is your free disk space critically low?
How to recover from WAL archives? Logical backups? Filesystem backups?
Which backup/restore mechanism will work best for you?

Wrapping up the big picture

What does my application do when there’s a failover?
How do the db-connections move to the new master?
How can the application, and repmgr work together for maximum robustness?
Alerting, Monitoring, Maintenance

Raw

postgres-clusters-nuances.png

Raw

z-postgres-clusters-nuances-mind-map.pdf

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment