Here is a crisp, battle-tested playbook for running databases on Kubernetes: what to do, what to avoid, and how to keep them safe, fast, and recoverable.
- Default to managed DBs if possible (RDS/Aurora/Cloud SQL/AlloyDB/Atlas). Run on K8s only when you need: portability, custom extensions, tight sidecar/tooling, or cost control with commodity nodes.
- Use an Operator, not raw manifests. Prefer mature operators (e.g., Crunchy/Percona for Postgres & MySQL, Vitess for MySQL sharding, PXC/MongoDB Enterprise/StackGres, RabbitMQ Operator for queues). Operators give sane HA, backups, upgrades, and day-2 ops.
- StatefulSets + Headless Services for stable network identities and per-replica volumes.
- Single-purpose clusters (or tainted node pools) for stateful workloads to limit noisy neighbors and unpredictable autoscaling.
- Topology awareness: spread replicas across zones (volume topology plus `topologySpreadConstraints`) to survive AZ loss; a minimal sketch follows below.
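For illustration, a pod-template fragment that spreads replicas across zones; the `app: db` label is an assumption to match against your StatefulSet's pod labels:

```yaml
# Pod template fragment: spread DB replicas evenly across availability zones.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # hard requirement: refuse to pile replicas into one zone
    labelSelector:
      matchLabels:
        app: db                        # hypothetical label; use your StatefulSet's pod labels
```

Pair this with `volumeBindingMode: WaitForFirstConsumer` on the StorageClass so each volume is provisioned in the zone the scheduler picks.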
- Use CSI with high IOPS/low latency: e.g., EBS gp3/io2, GCE pd-ssd, Azure Premium SSD v2. Set explicit IOPS & throughput where supported.
- StorageClass per tier (prod vs. dev); enable volume expansion, and set `fsGroupChangePolicy: OnRootMismatch` in the pod spec when needed to avoid slow recursive permission changes on large volumes (example StorageClass below).
- One PVC per replica; avoid shared filesystems (NFS) for primary write paths.
- Filesystem: ext4 or xfs; disable atime (e.g., `mountOptions: ["noatime"]`).
- Snapshots: CSI volume snapshots plus tested restore workflows.
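A hedged EBS gp3 example; the class name and the IOPS/throughput numbers are placeholders to size against your own workload:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-db               # hypothetical name, referenced from volumeClaimTemplates
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"               # gp3 lets you set IOPS independently of volume size
  throughput: "500"          # MiB/s
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - noatime                  # skip access-time writes on the data volume
```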
- Requests/limits sized for DBs (reserve CPU; ensure memory > working set + cache; avoid CPU throttling).
- PodDisruptionBudget to prevent voluntary evictions of primaries (sketch after this group).
- Anti-affinity to keep replicas on different nodes/AZs.
- PriorityClass so DBs aren’t evicted before stateless apps.
- Taints/tolerations for dedicated “stateful” nodes.
- Node autoscaling: prefer scale-out only when replicas can tolerate restarts; cordon/drain with PDBs during maintenance.
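A minimal PDB and PriorityClass sketch; the names and the `app: db` selector are assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb               # hypothetical name
spec:
  maxUnavailable: 1          # never drain more than one replica at a time
  selector:
    matchLabels:
      app: db                # hypothetical label
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: stateful-critical    # hypothetical name; reference via priorityClassName in the pod spec
value: 1000000
globalDefault: false
description: "Databases are evicted after stateless workloads."
```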
- Readiness vs. liveness vs. startup probes: use engine-aware checks (e.g., "in primary state and accepting writes"), and never kill pods during long crash recovery; use a startupProbe (example after this group).
- Split Services: write (primary) vs read (replicas). Consider SessionAffinity or a read balancer sidecar.
- Ordered shutdown hooks for clean demotion/promotion (the operator handles this).
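A hedged Postgres probe sketch. `pg_isready` only proves the server accepts connections; mature operators add role-aware readiness (primary vs. replica), so treat this as a baseline:

```yaml
# Container fragment for a Postgres pod.
startupProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 10
  failureThreshold: 180      # allow up to ~30 min of crash recovery before giving up
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 5
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 30
  failureThreshold: 6        # be slow to kill: restarts are expensive for databases
```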
- Daily full backups plus continuous WAL/binlog archiving to object storage; test PITR (point-in-time recovery) regularly.
- Kopia/Velero for cluster-level restore, but treat DB backups separately via operator-native tooling (see the Schedule sketch below).
- DR runbooks: documented RTO/RPO, restore drills, cross-region bootstraps.
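For the cluster-level layer, a hedged Velero Schedule; the name, namespace, and cron expression are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: db-namespace-daily   # hypothetical name
  namespace: velero
spec:
  schedule: "0 3 * * *"      # daily at 03:00
  template:
    includedNamespaces:
      - databases            # hypothetical namespace holding the DB clusters
    snapshotVolumes: true    # CSI snapshots of the PVs
    ttl: 720h                # keep 30 days
```

Treat this as infrastructure recovery only; point-in-time recovery still comes from the engine's WAL/binlog archive.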
- Operator-driven rolling engine upgrades with canaries.
- Schema migration gates (e.g., gh-ost/pt-osc for MySQL; lock-aware DDL and maintenance windows for Postgres). Tie migrations to app deploys; one way to run gh-ost is sketched below.
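One way to gate a MySQL migration is to run gh-ost as a Kubernetes Job; the image, host, database, and credentials here are all placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alter-orders-add-col                      # hypothetical name
spec:
  backoffLimit: 0                                 # never auto-retry a half-finished migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gh-ost
          image: ghcr.io/example/gh-ost:latest    # hypothetical image
          args:
            - --host=mysql-primary.databases.svc  # hypothetical write Service
            - --database=shop
            - --table=orders
            - --alter=ADD COLUMN note TEXT
            - --user=migrator
            - --password=$(MYSQL_PASSWORD)
            - --allow-on-master
            - --execute
          env:
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: migrator-creds            # hypothetical Secret
                  key: password
```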
- Exporters: postgres_exporter/mysqld_exporter, etc. Alert on replication lag, buffer hit ratio, checkpoints, I/O latency, and P99 query latency (sample rule after this group).
- Logs: slow query logs shipped to Loki/ELK. Correlate with app traces.
- Capacity signals: IOPS headroom, WAL/redo rate, bloat, autovacuum (for Postgres).
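A hedged replication-lag alert, assuming the Prometheus Operator's PrometheusRule CRD and a `pg_replication_lag` gauge from postgres_exporter's query pack (metric names vary with exporter configuration):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: db-alerts                          # hypothetical name
spec:
  groups:
    - name: postgres
      rules:
        - alert: PostgresReplicationLagHigh
          expr: pg_replication_lag > 30    # seconds; assumed metric name
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Replica lag above 30s for 5 minutes"
```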
- Least privilege: NetworkPolicies to restrict clients (sketch after this group); separate namespaces; per-DB credentials.
- Secrets: External Secrets + KMS; rotate regularly.
- Pod security hardening: `runAsNonRoot`, `readOnlyRootFilesystem` where possible, `fsGroup` for volume permissions, `seccompProfile: RuntimeDefault`, and drop the NET_RAW capability.
- At-rest & in-transit encryption: storage encryption plus TLS between app↔DB and on replica links.
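A hedged NetworkPolicy locking the database to its app namespace; the labels, namespaces, and port are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app-only      # hypothetical name
  namespace: databases         # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: db                  # hypothetical label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: shop   # only the app's namespace may connect
      ports:
        - protocol: TCP
          port: 5432
```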
- Keep it warm: avoid frequent reschedules; pin primaries with anti-affinity & PDBs.
- IO budget: provision IOPS for peak checkpoints/recovery; watch queue depth.
- Postgres: tune `shared_buffers`, `effective_cache_size`, `checkpoint_timeout`/`checkpoint_completion_target`, `work_mem`, and autovacuum settings per table (sample values below).
- MySQL: `innodb_buffer_pool_size` (~60–70% of RAM), `innodb_flush_log_at_trx_commit`, redo/undo log sizing, doublewrite considerations.
- Separate disk class for primaries vs replicas; cheaper disks for analytics replicas.
- Use local NVMe only with careful replication (e.g., Patroni + sync replicas) and planned node failure handling.
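Illustrative starting points for the Postgres knobs above, shown as a plain ConfigMap; with an operator you would set these through its CRD, and every number here is a placeholder to benchmark, not a recommendation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-tuning                            # hypothetical name
data:
  tuning.conf: |
    shared_buffers = 8GB                     # ~25% of an assumed 32 GB node
    effective_cache_size = 24GB              # what the OS page cache can realistically hold
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9       # smear checkpoint I/O across the interval
    work_mem = 32MB                          # per sort/hash; multiply by concurrency before raising
    autovacuum_vacuum_scale_factor = 0.05    # vacuum hot tables sooner; override per table as needed
```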
Prefer managed services instead when:
- You need cross-region strong HA with minimal ops overhead.
- Hard compliance constraints where managed services simplify audits.
- Team lacks SRE bandwidth for day-2 (backups, tuning, incidents).
Reference architectures:
- Postgres: Crunchy/Percona/StackGres operator; 1 primary + 2 replicas; gp3/io2 SSD; WAL archived to S3/GCS; PDB with maxUnavailable 1; NetworkPolicy-locked; Prometheus + Grafana; PITR tested monthly (sketch after this list).
- MySQL: Percona XtraDB Cluster or Vitess for scale/sharding; split read/write services; exporter + slow logs; binlog to object storage.
- MongoDB: Vendor operator (or Percona); replica sets across AZs; block storage with provisioned IOPS.
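The Postgres line above might look roughly like this with Crunchy's PGO v5, heavily trimmed; the name, bucket, and sizes are placeholders, and the exact CRD fields should be checked against the operator's docs:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: prod-pg                             # hypothetical name
spec:
  postgresVersion: 16
  instances:
    - name: main
      replicas: 3                           # 1 primary + 2 replicas, managed by the operator
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-db            # the StorageClass sketched earlier
        resources:
          requests:
            storage: 200Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          s3:
            bucket: prod-pg-wal             # hypothetical bucket for full backups + WAL archive
            endpoint: s3.us-east-1.amazonaws.com
            region: us-east-1
```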
Quick checklist:
- Operator chosen & tested
- StorageClass with IOPS/throughput set; snapshots enabled
- StatefulSet + headless Service; anti-affinity + topology spread
- PDBs, PriorityClass, taints/tolerations
- Engine-aware probes (startup/readiness)
- Split read/write Services
- Backups + PITR drill documented
- Exporters, slow logs, SLOs & alerts
- TLS everywhere; KMS-backed secrets; NetworkPolicies
- Upgrade/migration playbooks; chaos test for failover