- Do you even need a cluster of postgreses? Why did we choose to create a cluster ourselves?
- Why didn’t RDS work for us? Why it might not work for you.
- What are other RDS like services out there?
- What you need to know within postgres so your data can be streamed down to multiple nodes quick, and efficiently.
- Can you do without an SSD? Are writing fast enough? How to find out?
- Will your application’s I/O performance degrade when you’re writing WALs?
- How fast do you need to ship WALs to another machine? Network I/O nitigrities.
- What is it? How does it work?
- How to configure it for a cluster - hot_standby, wal_keep_segments, etc.
- Why do you need it when you have streaming replication?
- How do you set the archive and streaming delays?
- Large deletes or updates can blot WALs
- Read-only standbys, large queries and timeouts
- What does this magic replication manager do? What I learnt by digging into it’s C code.
- What are ‘master’, ‘standbys’, ‘witnesses’, and ‘failed’ nodes
- Cascading, and Synchronous replication
- priorities, configurations
- The consensus algorithm it uses
- repl_nodes - a table of all the machines in the cluster, their configs, statuses
- repl_status - how far is each machine behind the master?
- repl_events - what’s happening in the cluster along a timeline
- well, it exists. not a single open tool (save pgpool) out there that’s as good.
- help setup, configure the relationships between postgres boxes
- ‘Clone’, ‘Register’, ‘Follow’, ‘Promote’ commands
- monitor for failure, elect a master, automate failover
- How to tap into the promote and failover scenarios?
- It doesn’t talk to your application. it doesn’t act as a load balancer. or a single entry point to your cluster.
- Tell you when a failover happens. You need to rig that up yourself.
- Give you extensive support or documentation. But, you have the source.
- Multi-master / split brain / STONITH
- No master
- Network disconnect, Disk full
- How long does it take? How do you know which is the new master?
- How do we recover a failed DB? How long does it take?
- How do we add a new machine to the cluster?
- How do we remove stale node entries?
- What measures to take to ensure you’re equipped for bad scenarios.
- How do you save data when you know you’re going to lose it?
- Deletes/Truncates cascade. How do you free disk space?
- When is your free disk space critically low?
- How to recover from WAL archives? Logical backups? Filesystem backups?
- Which backup/restore mechanism will work best for you?
- What does my application do when there’s a failover?
- How do the db-connections move to the new master?
- How can the application, and repmgr work together for maximum robustness?
- Alerting, Monitoring, Maintenance