@initcron
Created October 29, 2025 07:30
Best Practices for Running Databases on Kubernetes

The following is a crisp, battle-tested playbook for running databases on Kubernetes: what to do, what to avoid, and how to keep them safe, fast, and recoverable.

Before you start

  • Default to managed DBs where possible (RDS/Aurora/Cloud SQL/AlloyDB/Atlas). Run on K8s only when you need portability, custom extensions, tight sidecar/tooling integration, or cost control with commodity nodes.
  • Use an Operator, not raw manifests. Prefer mature operators (e.g., Crunchy or StackGres for Postgres, Percona operators for MySQL/PXC and MongoDB, Vitess for MySQL sharding, MongoDB Enterprise, RabbitMQ Cluster Operator for queues). Operators give you sane HA, backups, upgrades, and day-2 ops.

Core architecture

  • StatefulSets + Headless Services for stable identities and volumes.
  • Single-purpose clusters (or tainted node pools) for stateful workloads to limit noisy neighbors and unpredictable autoscaling.
  • Topology awareness: spread replicas across zones (volume topology + topologySpreadConstraints) to survive an AZ loss; see the sketch after this list.
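
A minimal sketch of this pattern, assuming a plain Postgres container managed directly rather than by an operator (operators generate equivalent objects for you); the names, image, and sizes are placeholders:

```yaml
# Headless Service gives each replica a stable DNS identity (db-0.db, db-1.db, ...).
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None                 # headless
  selector:
    app: db
  ports:
    - name: sql
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                 # binds pod DNS to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      # Spread replicas across zones so losing one AZ leaves a quorum.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: db
      containers:
        - name: postgres
          image: postgres:16      # placeholder; real clusters use operator-built images
          env:
            - name: PGDATA        # keep data in a subdirectory so initdb tolerates lost+found
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials   # hypothetical Secret; operators manage credentials
                  key: password
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # one PVC per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: db-prod # hypothetical class; see the Storage section below
        resources:
          requests:
            storage: 100Gi
```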

Storage (make or break)

  • Use CSI with high IOPS/low latency: e.g., EBS gp3/io2, GCE pd-ssd, Azure Premium SSD v2. Set explicit IOPS & throughput where supported.
  • StorageClass per tier (prod vs. dev); enable volume expansion, and set fsGroupChangePolicy: OnRootMismatch (a pod securityContext field) when needed.
  • One PVC per replica; avoid shared filesystems (NFS) for primary write paths.
  • Filesystem: ext4 or xfs; disable atime; consider mountOptions: ["noatime"].
  • Snapshots: CSI VolumeSnapshots plus tested restore workflows (see the StorageClass sketch after this list).
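
For example, a gp3-backed StorageClass on AWS with explicit IOPS and throughput, expansion enabled, and noatime; the parameter names are specific to the EBS CSI driver, so adjust them for your cloud:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-prod                           # hypothetical name, referenced by the DB PVCs
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"                            # explicit IOPS where the driver supports it
  throughput: "250"                       # MiB/s
  encrypted: "true"
  csi.storage.k8s.io/fstype: xfs
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # provisions volumes in the pod's zone
reclaimPolicy: Retain                     # keep data if a PVC is deleted by mistake
mountOptions:
  - noatime
```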

Scheduling & HA

  • Requests/limits sized for DBs (reserve CPU; ensure memory > working set + cache; avoid CPU throttling).
  • PodDisruptionBudget to prevent voluntary evictions of primaries.
  • Anti-affinity to keep replicas on different nodes/AZs.
  • PriorityClass so DBs aren’t evicted before stateless apps.
  • Taints/tolerations for dedicated “stateful” nodes.
  • Node autoscaling: prefer scale-out only when replicas can tolerate restarts; cordon/drain with PDBs in place during maintenance (PDB and PriorityClass sketch below).
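
A sketch of the supporting objects, reusing the `app: db` labels from the StatefulSet above; the thresholds and priority value are illustrative:

```yaml
# Allow at most one voluntary disruption at a time across a 3-member cluster.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: db
---
# Higher priority than stateless workloads, so they are evicted before the DB.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: database-critical
value: 100000
globalDefault: false
description: "Databases and other stateful workloads"
```

In the StatefulSet pod template, reference these with priorityClassName: database-critical, add tolerations for the dedicated stateful node taint, and add required podAntiAffinity on kubernetes.io/hostname so no two replicas share a node.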

Health, failover & traffic

  • Readiness vs. liveness vs. startup probes: use engine-aware checks (e.g., “in primary state and accepting writes”), and never kill pods during long crash recovery; give the startupProbe a generous failureThreshold (sketch after this list).
  • Split Services: write (primary) vs read (replicas). Consider SessionAffinity or a read balancer sidecar.
  • Ordered shutdown hooks for clean demotion/promotion (the operator handles this).
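
If you are not relying on an operator's built-in probes, a hedged sketch for a Postgres container could look like this; pg_isready only checks that the server accepts connections, and the thresholds are illustrative:

```yaml
# Container-level probes for a Postgres pod (mature operators ship their own).
startupProbe:                    # tolerate long crash recovery / WAL replay at boot
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 10
  failureThreshold: 180          # up to ~30 minutes before the kubelet gives up
readinessProbe:                  # receive traffic only once connections are accepted
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:                   # restart only a truly wedged process
  exec:
    command: ["pg_isready", "-U", "postgres"]
  periodSeconds: 30
  failureThreshold: 6
```

Primary vs. replica routing belongs in the split Services' selectors (e.g., an operator-managed role label), not in readiness checks: marking replicas "not ready" to keep them out of the write Service would also remove them from the read Service.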

Backups & DR

  • Daily fulls plus frequent WAL/binlog archiving to object storage; test PITR (point-in-time recovery) regularly.
  • Kopia/Velero for cluster-level restore, but treat DB backups separately via the operator's native tooling (see the Velero schedule sketch after this list).
  • DR runbooks: documented RTO/RPO, restore drills, cross-region bootstraps.
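
A Velero schedule can complement (not replace) the operator's DB-consistent backups by capturing the namespace's Kubernetes objects; the namespace name and retention below are assumptions:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: databases-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"            # 02:00 daily
  template:
    includedNamespaces:
      - databases                  # hypothetical namespace holding the DB clusters
    snapshotVolumes: false         # PITR comes from WAL/binlog archiving, not volume snapshots
    ttl: 720h                      # keep 30 days
```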

Upgrades & schema changes

  • Operator-driven rolling engine upgrades with canaries.
  • Schema migration gates (e.g., gh-ost/pt-osc for online changes in MySQL; careful lock management and maintenance windows for Postgres). Tie migrations to app deploys.

Observability

  • Exporters: postgres_exporter/mysqld_exporter, etc. Alert on lag, buffer hit ratio, checkpoints, I/O latency, P99 query latency.
  • Logs: slow query logs shipped to Loki/ELK. Correlate with app traces.
  • Capacity signals: IOPS headroom, WAL/redo rate, table bloat, and autovacuum health (for Postgres); see the alert rule sketch after this list.
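
With the Prometheus Operator, lag alerting can be expressed as a PrometheusRule; the metric name below is an assumption and depends on the queries your exporter is configured with:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
  namespace: databases
spec:
  groups:
    - name: postgres.rules
      rules:
        - alert: PostgresReplicationLagHigh
          # pg_replication_lag_seconds is a placeholder; use the lag metric your
          # postgres_exporter setup actually exposes.
          expr: pg_replication_lag_seconds > 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Replica {{ $labels.pod }} is more than 30s behind the primary"
```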

Security & compliance

  • Least privilege: NetworkPolicies to restrict clients (sketch after this list); separate namespaces; per-DB credentials.
  • Secrets: External Secrets + KMS; rotate regularly.
  • Pod security hardening: runAsNonRoot, readOnlyRootFilesystem where possible, fsGroup for volume permissions, seccompProfile: RuntimeDefault, and drop capabilities such as NET_RAW.
  • At-rest & in-transit encryption: storage encryption + TLS between app↔DB and replica links.
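
A minimal NetworkPolicy that only admits the application's pods and the cluster's own replication traffic on the database port; the namespace and label selectors are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app-only
  namespace: databases
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes: ["Ingress"]
  ingress:
    # Application clients
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: my-app   # hypothetical client namespace
          podSelector:
            matchLabels:
              app: backend                          # hypothetical client pods
      ports:
        - protocol: TCP
          port: 5432
    # Replication between cluster members
    - from:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 5432
```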

Performance tuning (engine-agnostic hints)

  • Keep it warm: avoid frequent reschedules; pin primaries with anti-affinity & PDBs.
  • IO budget: provision IOPS for peak checkpoints/recovery; watch queue depth.
  • Postgres: tune shared_buffers, effective_cache_size, checkpoint_timeout/checkpoint_completion_target, work_mem, and autovacuum settings per table (illustrative values in the sketch after this list).
  • MySQL: innodb_buffer_pool_size (~60–70% RAM), innodb_flush_log_at_trx_commit, redo/undo sizing, doublewrite considerations.
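
Illustrative Postgres values for a pod with roughly 16 GiB of RAM dedicated to the database; exact numbers depend on the workload, and most operators expose these settings through their own CRD fields rather than a raw ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-tuning            # hypothetical; wire in via your operator's config mechanism
  namespace: databases
data:
  postgresql.conf: |
    shared_buffers = 4GB                    # ~25% of RAM
    effective_cache_size = 12GB             # ~75% of RAM; hints the planner about OS cache
    work_mem = 32MB                         # per sort/hash node, so watch concurrency
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9      # spread checkpoint I/O over the interval
    autovacuum_vacuum_scale_factor = 0.05   # more aggressive than the 0.2 default
```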

Cost & reliability levers

  • Separate disk class for primaries vs replicas; cheaper disks for analytics replicas.
  • Use local NVMe only with careful replication (e.g., Patroni + sync replicas) and planned node failure handling.

When not to run DBs on K8s

  • You need cross-region strong HA with minimal ops overhead.
  • Hard compliance constraints where managed services simplify audits.
  • Team lacks SRE bandwidth for day-2 (backups, tuning, incidents).

Minimal reference setups

  • Postgres: Crunchy/Percona/StackGres operator; 1 primary + 2 replicas on gp3/io2 SSD; WAL archived to S3/GCS; PDB=1; NetworkPolicy-locked; Prometheus + Grafana; PITR tested monthly (see the PGO sketch after this list).
  • MySQL: Percona XtraDB Cluster or Vitess for scale/sharding; split read/write services; exporter + slow logs; binlog to object storage.
  • MongoDB: Vendor operator (or Percona); replica sets across AZs; block storage with provisioned IOPS.
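
For the Postgres reference setup, a Crunchy PGO (v5) cluster spec looks roughly like the sketch below; the field names follow the PGO quickstart, so verify them against the operator version you install, and the namespace, storage class, sizes, and bucket are placeholders:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: app-db
  namespace: databases
spec:
  postgresVersion: 16
  instances:
    - name: instance1
      replicas: 3                      # 1 primary + 2 replicas
      dataVolumeClaimSpec:
        storageClassName: db-prod      # hypothetical gp3/io2-backed class
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          s3:                          # WAL archive + backups to object storage
            bucket: app-db-backups     # placeholder bucket/region/endpoint
            region: us-east-1
            endpoint: s3.us-east-1.amazonaws.com
            # S3 credentials are supplied via a separate pgBackRest configuration Secret (omitted)
```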

Quick checklist (print-worthy)

  • Operator chosen & tested
  • StorageClass with IOPS/throughput set; snapshots enabled
  • StatefulSet + headless Service; anti-affinity + topology spread
  • PDBs, PriorityClass, taints/tolerations
  • Engine-aware probes (startup/readiness)
  • Split read/write Services
  • Backups + PITR drill documented
  • Exporters, slow logs, SLOs & alerts
  • TLS everywhere; KMS-backed secrets; NetworkPolicies
  • Upgrade/migration playbooks; chaos test for failover