- Documented replication topology
- Documented network topology
- Documented interface topology - including users, passwords, connection estimates, load balancers, connection proxies
- Documented procedure, schedule for failover and testing
- Documented procedure, schedule for disaster recovery and testing
- Documented procedure, schedule for maintenance, upgrades
- Documented procedure, schedule for data expiration
- Docuemnted procedure, schedule for backups and testing
- Documented schedule for system upgrades
- Automated maintenance
- Automated disaster recovery testing
- Automated backup testing
- Automated stage environment setup
- Automated data expiration
- Automated failover*
- Automated user management
- Configuration change management
- Monitoring of key Postgres performance indicators: query duration, query counts, IO utilization, cache hits, hot tables, locks, cancelled queries, vacuum frequency and duration, large shared buffer "churn" events, size of tables, size of database, bloat, tables with high sequential scan counts (indexing targets), network throughput
- Monitoring of key application performance indicators: query plans that scan a high % of partitioned tables, duration of stored procedures called regularly
- System monitoring: CPU, memory, IO, swap, DB connections
- Automated pg_hba.conf and user management through configuration management tool
- postgresql.conf managed through configuration management tool
- recovery.conf maintained with configuration management tool
- n+1 topology for replication/failover
- reporting system on 1 hr replication delay (to allow for extended BI queries)
- PITR archive for recovery from operator/developer error
- Backup testing completed weekly
- Failver, Disaster recovery testing completed once per quarter
- Uptime target for all systems defined and agreed to by users
Re: automation of failover -- Can be varying levels of automation depending on the environment. Most important bit is once the system or operator has decided failover is necessary, next steps should be automatic to avoid errors.