Here is a crisp, battle-tested playbook for running databases on Kubernetes: what to do, what to avoid, and how to keep them safe, fast, and recoverable.
- Default to managed DBs if possible (RDS/Aurora/Cloud SQL/AlloyDB/Atlas). Run on K8s only when you need: portability, custom extensions, tight sidecar/tooling, or cost control with commodity nodes.
- Use an Operator, not raw manifests. Prefer mature operators (e.g., Crunchy/Percona for Postgres & MySQL, Vitess for MySQL sharding, PXC/MongoDB Enterprise/StackGres, RabbitMQ Operator for queues). Operators give sane HA, backups, upgrades, and day-2 ops.
- StatefulSets + Headless Services for stable network identities and per-replica volumes.
- Single-purpose clusters (or tainted node pools) for stateful workloads to limit noisy neighbors and unpredictable autoscaling.
- Topology awareness: spread replicas across zones (volume topology plus `topologySpreadConstraints`) to survive AZ loss; a minimal sketch follows below.
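For illustration, a pod-template fragment that spreads replicas across zones; the `app: db` label is an assumption to match against your StatefulSet's pod labels:

```yaml
# Pod template fragment: spread DB replicas evenly across availability zones.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # hard requirement: refuse to pile replicas into one zone
    labelSelector:
      matchLabels:
        app: db                        # hypothetical label; use your StatefulSet's pod labels
```

Pair this with `volumeBindingMode: WaitForFirstConsumer` on the StorageClass so each volume is provisioned in the zone the scheduler picks.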
- Use CSI with high IOPS/low latency: e.g., EBS gp3/io2, GCE pd-ssd, Azure Premium SSD v2. Set explicit IOPS & throughput where supported.
- StorageClass per tier (prod vs. dev); enable volume expansion, and set `fsGroupChangePolicy: OnRootMismatch` in the pod spec when needed to avoid slow recursive permission changes on large volumes (example StorageClass below).
- One PVC per replica; avoid shared filesystems (NFS) for primary write paths.
- Filesystem: ext4 or xfs; disable atime (e.g., `mountOptions: ["noatime"]`).
- Snapshots: CSI volume snapshots plus tested restore workflows.
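A hedged EBS gp3 example; the class name and the IOPS/throughput numbers are placeholders to size against your own workload:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-db               # hypothetical name, referenced from volumeClaimTemplates
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"               # gp3 lets you set IOPS independently of volume size
  throughput: "500"          # MiB/s
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - noatime                  # skip access-time writes on the data volume
```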
- Requests/limits sized for DBs (reserve CPU; ensure memory > working set + cache; avoid CPU throttling).
- PodDisruptionBudget to prevent voluntary evictions of primaries (sketch after this group).
- Anti-affinity to keep replicas on different nodes/AZs.
- PriorityClass so DBs aren’t evicted before stateless apps.
- Taints/tolerations for dedicated “stateful” nodes.
- Node autoscaling: prefer scale-out only when replicas can tolerate restarts; cordon/drain with PDBs during maintenance.
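A minimal PDB and PriorityClass sketch; the names and the `app: db` selector are assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb               # hypothetical name
spec:
  maxUnavailable: 1          # never drain more than one replica at a time
  selector:
    matchLabels:
      app: db                # hypothetical label
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: stateful-critical    # hypothetical name; reference via priorityClassName in the pod spec
value: 1000000
globalDefault: false
description: "Databases are evicted after stateless workloads."
```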
- Readiness vs. liveness vs. startup probes: use engine-aware checks (e.g., "in primary state and accepting writes"), and never kill pods during long crash recovery; use a startupProbe (example after this group).
- Split Services: write (primary) vs read (replicas). Consider SessionAffinity or a read balancer sidecar.
- Ordered shutdown hooks for clean demotion/promotion (the operator handles this).
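A hedged Postgres probe sketch. `pg_isready` only proves the server accepts connections; mature operators add role-aware readiness (primary vs. replica), so treat this as a baseline:

```yaml
# Container fragment for a Postgres pod.
startupProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 10
  failureThreshold: 180      # allow up to ~30 min of crash recovery before giving up
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 5
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 30
  failureThreshold: 6        # be slow to kill: restarts are expensive for databases
```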
- Daily full backups plus continuous WAL/binlog archiving to object storage; test PITR (point-in-time recovery) regularly.
- Kopia/Velero for cluster-level restore, but treat DB backups separately via operator-native tooling (see the Schedule sketch below).
- DR runbooks: documented RTO/RPO, restore drills, cross-region bootstraps.
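For the cluster-level layer, a hedged Velero Schedule; the name, namespace, and cron expression are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: db-namespace-daily   # hypothetical name
  namespace: velero
spec:
  schedule: "0 3 * * *"      # daily at 03:00
  template:
    includedNamespaces:
      - databases            # hypothetical namespace holding the DB clusters
    snapshotVolumes: true    # CSI snapshots of the PVs
    ttl: 720h                # keep 30 days
```

Treat this as infrastructure recovery only; point-in-time recovery still comes from the engine's WAL/binlog archive.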
- Operator-driven rolling engine upgrades with canaries.
- Schema migration gates (e.g., gh-ost/pt-osc for MySQL; lock-aware DDL and maintenance windows for Postgres). Tie migrations to app deploys; one way to run gh-ost is sketched below.
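One way to gate a MySQL migration is to run gh-ost as a Kubernetes Job; the image, host, database, and credentials here are all placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alter-orders-add-col                      # hypothetical name
spec:
  backoffLimit: 0                                 # never auto-retry a half-finished migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gh-ost
          image: ghcr.io/example/gh-ost:latest    # hypothetical image
          args:
            - --host=mysql-primary.databases.svc  # hypothetical write Service
            - --database=shop
            - --table=orders
            - --alter=ADD COLUMN note TEXT
            - --user=migrator
            - --password=$(MYSQL_PASSWORD)
            - --allow-on-master
            - --execute
          env:
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: migrator-creds            # hypothetical Secret
                  key: password
```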
- Exporters: postgres_exporter/mysqld_exporter, etc. Alert on replication lag, buffer hit ratio, checkpoints, I/O latency, and P99 query latency (sample rule after this group).
- Logs: slow query logs shipped to Loki/ELK. Correlate with app traces.
- Capacity signals: IOPS headroom, WAL/redo rate, bloat, autovacuum (for Postgres).
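A hedged replication-lag alert, assuming the Prometheus Operator's PrometheusRule CRD and a `pg_replication_lag` gauge from postgres_exporter's query pack (metric names vary with exporter configuration):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: db-alerts                          # hypothetical name
spec:
  groups:
    - name: postgres
      rules:
        - alert: PostgresReplicationLagHigh
          expr: pg_replication_lag > 30    # seconds; assumed metric name
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Replica lag above 30s for 5 minutes"
```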
- Least privilege: NetworkPolicies to restrict clients (sketch after this group); separate namespaces; per-DB credentials.
- Secrets: External Secrets + KMS; rotate regularly.
- Pod security hardening: `runAsNonRoot`, `readOnlyRootFilesystem` where possible, `fsGroup` for volume permissions, `seccompProfile: RuntimeDefault`, and drop the NET_RAW capability.
- At-rest & in-transit encryption: storage encryption plus TLS between app↔DB and on replica links.
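A hedged NetworkPolicy locking the database to its app namespace; the labels, namespaces, and port are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app-only      # hypothetical name
  namespace: databases         # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: db                  # hypothetical label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: shop   # only the app's namespace may connect
      ports:
        - protocol: TCP
          port: 5432
```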
- Keep it warm: avoid frequent reschedules; pin primaries with anti-affinity & PDBs.
- IO budget: provision IOPS for peak checkpoints/recovery; watch queue depth.
- Postgres: tune `shared_buffers`, `effective_cache_size`, `checkpoint_timeout`/`checkpoint_completion_target`, `work_mem`, and autovacuum settings per table (sample values below).
- MySQL: `innodb_buffer_pool_size` (~60–70% of RAM), `innodb_flush_log_at_trx_commit`, redo/undo log sizing, doublewrite considerations.
- Separate disk class for primaries vs replicas; cheaper disks for analytics replicas.
- Use local NVMe only with careful replication (e.g., Patroni + sync replicas) and planned node failure handling.
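Illustrative starting points for the Postgres knobs above, shown as a plain ConfigMap; with an operator you would set these through its CRD, and every number here is a placeholder to benchmark, not a recommendation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-tuning                            # hypothetical name
data:
  tuning.conf: |
    shared_buffers = 8GB                     # ~25% of an assumed 32 GB node
    effective_cache_size = 24GB              # what the OS page cache can realistically hold
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9       # smear checkpoint I/O across the interval
    work_mem = 32MB                          # per sort/hash; multiply by concurrency before raising
    autovacuum_vacuum_scale_factor = 0.05    # vacuum hot tables sooner; override per table as needed
```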
Prefer managed services instead when:
- You need cross-region strong HA with minimal ops overhead.
- Hard compliance constraints where managed services simplify audits.
- Team lacks SRE bandwidth for day-2 (backups, tuning, incidents).
Reference architectures:
- Postgres: Crunchy/Percona/StackGres operator; 1 primary + 2 replicas; gp3/io2 SSD; WAL archived to S3/GCS; PDB with maxUnavailable 1; NetworkPolicy-locked; Prometheus + Grafana; PITR tested monthly (sketch after this list).
- MySQL: Percona XtraDB Cluster or Vitess for scale/sharding; split read/write services; exporter + slow logs; binlog to object storage.
- MongoDB: Vendor operator (or Percona); replica sets across AZs; block storage with provisioned IOPS.
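The Postgres line above might look roughly like this with Crunchy's PGO v5, heavily trimmed; the name, bucket, and sizes are placeholders, and the exact CRD fields should be checked against the operator's docs:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: prod-pg                             # hypothetical name
spec:
  postgresVersion: 16
  instances:
    - name: main
      replicas: 3                           # 1 primary + 2 replicas, managed by the operator
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-db            # the StorageClass sketched earlier
        resources:
          requests:
            storage: 200Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          s3:
            bucket: prod-pg-wal             # hypothetical bucket for full backups + WAL archive
            endpoint: s3.us-east-1.amazonaws.com
            region: us-east-1
```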
Quick checklist:
- Operator chosen & tested
- StorageClass with IOPS/throughput set; snapshots enabled
- StatefulSet + headless Service; anti-affinity + topology spread
- PDBs, PriorityClass, taints/tolerations
- Engine-aware probes (startup/readiness)
- Split read/write Services
- Backups + PITR drill documented
- Exporters, slow logs, SLOs & alerts
- TLS everywhere; KMS-backed secrets; NetworkPolicies
- Upgrade/migration playbooks; chaos test for failover