Disclaimer: ChatGPT-generated document.
High Availability (HA): A Definitive, Comprehensive, End-to-End Guide
High Availability (HA) refers to the architectural, operational, and organizational practices that ensure a system remains accessible, functional, and resilient even when components fail. It is a discipline, not a single feature.
High Availability is the ability of a system to remain operational for long periods of time, with minimal downtime—planned or unplanned.
A highly available system:
- survives hardware failures
- survives network failures
- minimizes (or eliminates) maintenance downtime
- recovers quickly and predictably
HA does not mean: ❌ zero downtime ❌ zero failures ❌ infinite redundancy
It does mean: ✔ failures do not significantly impact users ✔ systems continue operating despite disruptions ✔ recovery is automatic and extremely fast
Availability is expressed as a percentage:
Availability = Uptime / (Uptime + Downtime)
| “Nines” | Availability | Max Downtime / Year |
|---|---|---|
| 2 nines | 99% | 3 days, 15 hours |
| 3 nines | 99.9% | 8 hours, 45 minutes |
| 4 nines | 99.99% | 52 minutes |
| 5 nines | 99.999% | 5 minutes |
| 6 nines | 99.9999% | 31 seconds |
High availability generally means 99.99% or above, though the exact target depends on the industry; the sketch below shows how these percentages translate into downtime.
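As a quick way to reproduce the downtime column above, the budget falls straight out of the availability formula. A minimal Python sketch (assuming a 365-day year, ignoring leap days):

```python
# Annual downtime budget implied by an availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

def downtime_per_year(availability_pct: float) -> float:
    """Maximum downtime, in seconds per year, allowed by the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    secs = downtime_per_year(pct)
    print(f"{pct}% -> {secs / 3600:6.2f} h/year ({secs:,.0f} s)")
```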
HA systems must excel in three core areas: redundancy, failover, and fault detection/monitoring.
Redundancy means multiple independent components, so that one failure does not bring down the whole system.
Includes:
- redundant servers
- redundant networks
- redundant disks (RAID)
- redundant availability zones
- redundant power sources
- redundant load balancers
- redundant DNS providers
Failover: if a component fails, traffic is moved to a healthy one with minimal disruption.
Types:
- active-active
- active-passive
- cold standby
Failover can be:
- DNS-based
- load-balancer-based
- cluster-based
- application-level
Fault detection and monitoring: HA systems rely on:
- constant health checks
- automatic fault detection
- automatic node eviction
- automatic traffic rerouting
- automated repair or restart
Monitoring isn’t an optional accessory—it's the heart of HA.
Load balancing distributes the workload across multiple instances (selection algorithms are sketched after this list):
- Layer 4 (TCP) load balancing
- Layer 7 (HTTP) load balancing
- Round robin
- Least connections
- Weighted distribution
The load balancer itself must be highly available (clustered, multi-zone).
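To make the selection algorithms concrete, here is a minimal, illustrative sketch of round-robin and least-connections backend selection (the `Backend` class and backend names are hypothetical; real balancers such as HAProxy or Envoy implement this internally and also factor in health checks):

```python
import itertools
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    active_connections: int = 0  # maintained by the balancer as requests start/finish

backends = [Backend("app-1"), Backend("app-2"), Backend("app-3")]

# Round robin: hand out backends in a fixed rotating order.
_rotation = itertools.cycle(backends)
def pick_round_robin() -> Backend:
    return next(_rotation)

# Least connections: pick the backend currently doing the least work.
def pick_least_connections() -> Backend:
    return min(backends, key=lambda b: b.active_connections)
```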
Horizontal scaling: more nodes → more fault tolerance. Stateless services scale easiest.
Vertical scaling: bigger hardware helps performance, but not availability on its own.
Multi-AZ (Availability Zone):
- protects against power/network failures in a datacenter
- required for “four nines” or higher
Multi-region:
- protects against natural disasters
- protects against regional outages
- enables disaster recovery & zero RPO (for advanced setups)
Clustering is used heavily in databases and stateful services.
Cluster types:
- master–slave
- multi-master
- sharded clusters
- synchronous / asynchronous replication
Provides:
- failover
- read scaling
- reduced data loss
Tradeoffs include consistency issues and write latency.
Stateless services:
- restart instantly
- scale easily
- fail independently
Stateful systems must use:
- distributed storage
- sticky sessions (preferably avoided)
- replicated databases
- leader election algorithms (Raft, Paxos)
HA is more about process than technology.
Blue-green deployments: deploy the new version → switch traffic → roll back instantly if needed.
Rolling deployments: update servers gradually without downtime.
Canary releases: send 1–5% of traffic to the new release (see the sketch below).
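A canary is often implemented as a weighted traffic split at the router or load balancer. A minimal sketch of that routing decision (the 5% weight and release labels are assumptions for illustration):

```python
import random

CANARY_WEIGHT = 0.05  # assumption: send ~5% of requests to the new release

def route_request() -> str:
    """Decide which release serves a given request."""
    return "v2-canary" if random.random() < CANARY_WEIGHT else "v1-stable"

# Rough check of the split over simulated traffic.
sample = [route_request() for _ in range(10_000)]
print(sample.count("v2-canary") / len(sample))  # ~0.05
```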
Chaos testing: test failure modes intentionally:
- shut down servers
- kill network connections
- simulate region failure
- test degraded latency
If you don’t test failures, you don’t have HA.
Incident response: downtime increases dramatically if:
- alerts are slow
- engineers respond slowly
- fixes require manual steps
Automation and runbooks are mandatory.
Observability: you need:
- logs
- metrics
- distributed tracing
- health dashboards
- anomaly detection
High availability (HA). Goal: minimize downtime. Approach: redundancy + fast failover. Acceptable: brief interruptions.
Fault tolerance. Goal: no interruptions at all. Approach: real-time hardware/software duplication. Examples: aircraft control systems, space systems. Very expensive.
Disaster recovery (DR). Goal: recover from catastrophic failure. Approach: off-site backups, delayed replication. Recovery time measured in minutes to hours.
HA ≠ DR, but HA systems often include DR.
High availability systems must defend against:
- disk failures
- network card failures
- RAM corruption
- power supply failure
- memory leaks
- thread deadlocks
- unhandled exceptions
- OS kernel crashes
- packet loss
- routing instability
- DDoS attacks
- link failure
Human error is the #1 cause of outages, with software defects close behind.
Overload is another major risk: when one component throttles, the entire system can collapse.
Hardware redundancy:
- dual power supplies
- dual network cards
- RAID arrays
- multiple servers
Geographic and data redundancy: multi-AZ / multi-region replication.
Load balancers, for example:
- HAProxy
- Nginx
- Envoy
- AWS ALB/ELB
- Google Cloud Load Balancer
Redundant, distributed storage and data options include:
- Ceph
- GlusterFS
- Amazon S3
- Google Cloud Storage
- Distributed SQL (CockroachDB)
Auto-scaling: expand or shrink capacity automatically based on demand.
Resilience patterns prevent cascading failures (see the sketch after the list).
Examples:
- retry policies
- exponential backoff
- circuit breakers (Hystrix, Resilience4j)
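A minimal, illustrative sketch of two of those patterns, retry with exponential backoff and a simple circuit breaker (thresholds, delays, and timeouts are arbitrary assumptions; production systems usually lean on a dedicated resilience library):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `operation` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds

class CircuitBreaker:
    """Fail fast after repeated failures so a struggling dependency can recover."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```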
Cloud HA differs from on-prem.
Cloud offers managed:
- load balancers
- multi-AZ replication
- storage redundancy
- automatic failover
- auto-scaling
Misconfigurations still destroy availability:
- single-AZ deployments
- stateful services without replication
- manual failover setup
- non-redundant networking
Achieving high availability is expensive.
| Level | Typical Cost Multiplier |
|---|---|
| 99% → 99.9% | 2–5× |
| 99.9% → 99.99% | 5–10× |
| 99.99% → 99.999% | 10–100× |
Why?
- more servers
- more zones
- more regions
- more engineers
- more monitoring systems
- more automation
Cost grows exponentially as downtime approaches zero.
Highly available web applications typically use:
- load balancers
- stateless containers
- session storage in Redis
- multi-AZ databases
- CDN caching
- auto-scaling groups
Highly available databases require:
- synchronous replication
- automatic failover
- consensus protocols (Raft, Paxos)
- WAL replication
DNS must be multi-provider:
- Cloudflare + Route 53
- NS1 + Google Cloud DNS
DNS itself must be globally redundant.
Without automated failover between providers, multi-provider DNS is useless.
A few cautions: the cloud alone does not make you highly available; it gives you tools, not HA configurations. Genuinely high availability remains extremely difficult and costly. Monitoring tells you about a failure after it happens, not before. And manual processes are too slow for ultra-high availability.
- Eliminate single points of failure
- Use multi-zone deployments
- Prefer stateless designs
- Replicate stateful services
- Use health checks everywhere
- Automate failover
- Use immutable infrastructure
- Implement rolling deployments
- Use chaos testing
- Monitor everything
- Prepare for human errors
- Prefer active-active designs
- Always test disaster scenarios
- Maintain runbooks and automation
| Concept | Meaning |
|---|---|
| HA | System stays up |
| Scalability | System handles growth |
| Reliability | System behaves correctly |
Example: A system can be scalable but not highly available (scales well but crashes often). Or highly available but not scalable (always up but slow).
High Availability is the practice of designing and operating systems that:
- tolerate failure
- recover automatically
- minimize downtime
- monitor themselves continuously
- run across multiple zones or regions
- use redundancy at every layer
- avoid single points of failure
- rely on automation more than humans
It requires:
- architecture choices
- operational excellence
- monitoring sophistication
- deployment discipline
- continuous testing
- significant cost investment
HA is one of the core disciplines of resilient system design, and achieving it is a combination of engineering, process, and culture—not just hardware.
“Availability” is the percentage of time a system is operational and accessible as intended. 99.999% availability means the system is designed to be available all but 0.001% of the time per year.
| Availability | Max Downtime Per Year | Per Month | Per Week | Per Day |
|---|---|---|---|---|
| 99.999% | ~5.26 minutes | ~25.9 seconds | ~6.05 seconds | ~0.864 seconds |
This is extremely stringent and is considered "carrier-grade" or “mission-critical” reliability.
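The per-period figures follow directly from the 0.001% unavailability allowance; a quick arithmetic sketch (a 30-day month is assumed):

```python
UNAVAILABILITY = 1 - 0.99999  # 0.001% of the time

PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,  # assumption: 30-day month
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}

for period, seconds in PERIOD_SECONDS.items():
    print(f"per {period}: {seconds * UNAVAILABILITY:.2f} s of downtime allowed")
# year ~315.36 s (~5.26 min), month ~25.92 s, week ~6.05 s, day ~0.86 s
```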
Systems designed for five nines usually support:
- Telecom infrastructure
- Financial transactions
- Medical devices
- Power grid & utilities
- Cloud services (e.g., managed load balancers, DNS)
- Industrial control systems
- High-end enterprise solutions
Achieving 5 nines has huge implications for design, engineering, cost, testing, and operations.
SLA (Service Level Agreement): a contractual guarantee of uptime.
SLO (Service Level Objective): a target availability set internally by engineering.
SLI (Service Level Indicator): the actual measurement of availability.
SLA availability is not equal to real availability. Companies sometimes advertise 5 nines for specific components, not entire systems.
Availability (A):
A = MTBF / (MTBF + MTTR)
Where:
- MTBF = Mean time between failures
- MTTR = Mean time to repair
For 5 nines:
- MTTR must be extremely tiny (minutes or seconds)
- Failures must be extremely rare
Even a single 10-minute outage kills your SLA for the entire year.
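To see why, plug a single 10-minute outage into the availability formula (equivalently, uptime / (uptime + downtime) over a one-year window, which is an assumption of this sketch):

```python
MINUTES_PER_YEAR = 365 * 24 * 60              # 525,600
downtime_minutes = 10                         # one 10-minute outage in the whole year

availability = (MINUTES_PER_YEAR - downtime_minutes) / MINUTES_PER_YEAR
print(f"achieved: {availability:.6%}")        # ~99.99810%, below the 99.999% target
print(f"five-nines budget: {MINUTES_PER_YEAR * 1e-5:.2f} minutes/year")  # ~5.26
```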
This is where things get serious. “Five nines” is not possible without aggressive redundancy, automation, and extremely fast recovery.
To tolerate component failures, you need:
| Layer | Redundancy Required |
|---|---|
| Hardware | N+1 or N+2 servers, redundant NICs, RAID |
| Networking | Active-active routers, redundant uplinks |
| Compute | Clustering, failover instances |
| Storage | Replication, synchronous writes |
| Databases | Multi-AZ or multi-region |
| Services | Load balancing across zones |
| Power | Dual power feeds, UPS, diesel generators |
A single point of failure = no 99.999%.
Let’s dive into the core patterns.
You need components deployed across:
- Multiple availability zones (AZs) in the same region
- Possibly multiple regions for disaster tolerance
Active-active is almost always required for five nines:
- Both nodes handle traffic simultaneously
- Failover is instantaneous
- Passive failover is usually too slow
State kills availability. Stateless servers can restart or be replaced instantly.
More state → more downtime.
Caching is essential because you cannot afford slow systems:
- Redis cluster
- Memcached sharding
- Region-local caches
Every system requires:
- frequent health checks (1s–5s interval)
- automated node eviction
- automated node replacement
- automated traffic rerouting
Downtime during deployments must be zero.
This requires:
- gradual traffic migration
- canary deployments
- full rollback automation
Deep observability: you need:
- Distributed tracing
- Metrics and alerting with <1m detection
- Log aggregation
- Synthetic checks from multiple regions (sketched below)
- SLO dashboards
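As one concrete building block, a synthetic check is simply a scripted probe that exercises the service the way a user would and records success and latency. A minimal sketch using only the standard library (the URL and timeout are placeholders):

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 2.0) -> dict:
    """Probe `url` once and report success plus latency, like a monitoring agent."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except Exception:
        ok = False  # timeouts, DNS failures, and HTTP errors all count as failures
    return {"ok": ok, "latency_s": round(time.monotonic() - start, 3)}

# Run the same probe from several regions and alert when any of them fails or slows down.
print(synthetic_check("https://example.com/healthz"))  # placeholder endpoint
```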
Incident response affects availability.
To maintain five nines:
- Alerts must fire instantly
- Engineers must respond within 1 minute
- Automated mitigation is often necessary
Manual-only operations → impossible.
You must test faults before they happen:
- kill nodes
- kill networks
- kill processes
- inject latency
- simulate AZ failure
No testing → unknown recovery behavior → downtime.
Human error: the #1 cause of outages.
Examples:
- Wrong config push
- Faulty database migrations
- Accidental power cycling
- Bad deploy
Mitigation: automation + reviews + feature flags.
Network partitions: the worst enemy of distributed systems.
Database faults: deadlocks, corrupted indexes, replication lag.
Overload: an unexpected spike → queue buildup → cascading failure.
If you rely on:
- third-party APIs
- payment processors
- cloud-hosted DNS
Your availability = their availability * your internal availability.
Deployments matter because the downtime budget is so small:
If you deploy 20 times per day:
- Each deploy must be <100ms visible impact
- 500ms glitch × 20 = 10 seconds of downtime per day
- That alone destroys 5 nines
Even a single 2-second failover consumes a noticeable share of the ~26-second monthly downtime budget (the arithmetic is spelled out below).
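Spelling out that arithmetic against a five-nines budget (20 deploys per day, a 30-day month, and the glitch durations above are the assumptions):

```python
YEARLY_BUDGET_S = 365 * 24 * 3600 * 1e-5    # ~315 s/year at 99.999%
MONTHLY_BUDGET_S = 30 * 24 * 3600 * 1e-5    # ~26 s/month

deploys_per_day = 20
glitch_s = 0.5                               # 500 ms of visible impact per deploy

yearly_impact = glitch_s * deploys_per_day * 365
print(f"{yearly_impact:.0f} s/year of deploy impact vs {YEARLY_BUDGET_S:.0f} s budget")

failover_s = 2.0                             # a single 2-second failover
print(f"one failover uses {failover_s / MONTHLY_BUDGET_S:.0%} of the monthly budget")
```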
Costs scale nonlinearly.
| Availability | Typical Cost Increase |
|---|---|
| 99% → 99.9% | 2–5× |
| 99.9% → 99.99% | 5–10× |
| 99.99% → 99.999% | 10–100× |
This includes:
- More servers
- More redundancy
- More monitoring
- More SRE staff
- More data centers
- More automated systems
- More sophisticated deployment pipelines
Almost no business needs 5 nines. Many think they do; few actually do.
Five nines does not mean zero outages; it means very short ones (under about 5 minutes per year).
Nor do cloud components deliver it out of the box: individual components rarely exceed 99.99%.
And component availability is not system availability.
System availability = product of all component availabilities.
If your system has 5 independent components:
- Component availability: 0.99999
- System availability: 0.99999⁵ ≈ 0.99995 (~99.995%)
That already falls short of five nines, even though every individual component meets it (see the sketch below).
Systems degrade fast as dependencies increase.
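The composition rule is easy to verify, and it shows how quickly serial dependencies erode the budget (the component counts below are illustrative):

```python
def serial_availability(component_availability: float, n_components: int) -> float:
    """Availability of a chain of independent components that must all be up."""
    return component_availability ** n_components

for n in (1, 5, 10, 20):
    print(f"{n:>2} five-nines components in series -> {serial_availability(0.99999, n):.4%}")
# 5 components  -> ~99.9950%
# 20 components -> ~99.9800%, which is already below four nines
```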
True 5-nines systems:
- Telephone switching infrastructure
- Some financial trading platforms
- Nuclear monitoring systems
- Aircraft avionics
- Certain embedded medical systems
- Some power-grid control systems
These are usually:
- fully redundant
- real-time
- safety-critical
- extremely expensive
- heavily certified
Common cloud components:
| Service | Availability |
|---|---|
| AWS S3 | 99.99% (four nines) |
| AWS EC2 Region | 99.99% |
| AWS Route53 | 100% SLA |
| Google Cloud Load Balancer | 99.99% |
| Cloudflare CDN | 100% SLA regionally |
The overall system rarely hits 5 nines without major architectural investment.
99.999% availability is an extremely strict requirement that:
- allows only 5 minutes of downtime per year
- requires redundancy everywhere
- mandates automatic failover
- demands fast incident response (<1 minute)
- requires multi-zone deployments
- forbids manual recoveries
- significantly increases cost and complexity
- cannot tolerate human error
- requires rigorous testing, monitoring, and chaos engineering
Achieving five nines is possible, but extremely difficult, expensive, and usually unnecessary unless human lives, money, or infrastructure depend on it.
Failover is the process by which a system automatically transfers control or functionality from a failing or inactive component to a redundant or standby component in order to maintain availability and continuity of service.
In simpler terms:
Failover means switching to backup systems when something goes wrong, ideally without user impact.
It is a core mechanism for ensuring high availability (HA) and resilience.
Systems fail. Failover ensures:
- Business continuity
- Continuity of user experience
- Protection against infrastructure outages
- Fully automated recovery
- Reduced MTTR (Mean Time to Repair)
Failover can respond to:
- hardware failures
- software crashes
- network interruptions
- power issues
- overload conditions
- misconfigurations
- human error
There are four primary models, ordered by availability performance:
| Failover Type | Description | Recovery Speed | Cost | Complexity | Typical Availability |
|---|---|---|---|---|---|
| Cold Standby | Backup inactive until failure occurs. Requires manual restart. | Minutes–hours | Low | Low | ~99% |
| Warm Standby | Backup partially active, pre-configured. Limited sync. | Seconds–minutes | Medium | Medium | 99.9–99.99% |
| Hot Standby (Active-Passive) | Fully redundant running instance waiting. Auto failover. | <5 seconds | High | High | 99.99–99.999% |
| Active-Active | All nodes active simultaneously. Load shared. Immediate takeover. | <1 second | Highest | Highest | 99.999%+ |
For mission-critical and carrier-grade systems → Active-Active is most recommended.
Failover requires fast and accurate failure detection.
- Health checks (ping, TCP probe, HTTP test)
- Heartbeat messages
- Cluster membership monitoring
- Error rate or response time thresholds
- Node liveness checks
- Availability zone failure detection
Detection frequency is critical: too slow, and users feel the failure for longer before failover kicks in; too fast, and you risk false positives.
Typical health check frequency: 1s–5s. Failover timeout: 5s–30s, depending on configuration (a minimal heartbeat-detector sketch follows).
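A minimal sketch of a heartbeat-based failure detector in the spirit of those intervals (the 2 s heartbeat and 10 s timeout are assumptions within the ranges above; real clusters add quorum and hysteresis to avoid false positives):

```python
import time

HEARTBEAT_INTERVAL_S = 2.0   # how often healthy nodes report in (assumed)
FAILURE_TIMEOUT_S = 10.0     # silence longer than this marks a node as failed (assumed)

last_heartbeat: dict[str, float] = {}

def record_heartbeat(node: str) -> None:
    """Called whenever a heartbeat message arrives from `node`."""
    last_heartbeat[node] = time.monotonic()

def failed_nodes() -> list[str]:
    """Nodes whose most recent heartbeat is older than the failure timeout."""
    now = time.monotonic()
    return [node for node, seen in last_heartbeat.items()
            if now - seen > FAILURE_TIMEOUT_S]

# A monitor loop would call failed_nodes() every few seconds and trigger
# failover (promote a standby, reroute traffic) for anything it returns.
```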
| Component | Role |
|---|---|
| Monitor | Detects failure |
| Failover Manager | Decides on promoting standby |
| Load Balancer | Reroutes traffic |
| State Replication System | Shares data between nodes |
| Cluster Coordinator | Elects new primary (if needed) |
| Recovery Automation | Restores failing components |
At the hardware and infrastructure level:
- Redundant power supplies
- Dual NICs
- Redundant routers
- Multi-AZ deployment
At the network level:
- VRRP (Virtual Router Redundancy Protocol)
- BGP failover
- Software-defined networking (SDN)
- Floating IP failover
At the traffic and load-balancing level:
- L4 load balancers (HAProxy, LVS)
- L7 load balancers (Envoy, Nginx, ALB)
- DNS-based failover
- IP Anycast routing
At the application and cluster level:
- Active-active clusters
- Kubernetes with readiness & liveness probes
- Stateless microservices
- Heartbeat systems
At the database level:
- Leader election (Raft, Paxos)
- Primary/Replica failover
- Write-ahead logs (WAL)
- Synchronous vs asynchronous replication tradeoffs
| Phase | Time Impact |
|---|---|
| Failure detection | ~5–10s |
| Failover decision | <1s |
| New instance promotion | 1–3s |
| Traffic rerouting | 0–2s |
| Overall failover | ~6–15 seconds (typical HA setup) |
For five nines availability:
- Failover must complete in under a few seconds
- Manual intervention is unacceptable
Split-brain occurs when both primary and backup believe they are the active node.
Consequences:
- Conflicting writes
- Data corruption
- Service instability
Prevention relies on (see the quorum sketch after this list):
- quorum-based consensus
- fencing mechanisms
- STONITH ("Shoot The Other Node In The Head")
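A minimal sketch of the quorum rule behind those mechanisms: a node may act as primary only while it can see a strict majority of the cluster (the cluster size and reachability count are placeholders; real systems delegate this to Raft/Paxos plus fencing):

```python
CLUSTER_SIZE = 3  # odd sizes make majorities unambiguous

def has_quorum(reachable_peers: int) -> bool:
    """True if this node plus the peers it can reach form a strict majority."""
    return (reachable_peers + 1) > CLUSTER_SIZE // 2

def may_act_as_primary(is_elected_leader: bool, reachable_peers: int) -> bool:
    # A leader that loses quorum must step down (or be fenced) so the two
    # sides of a network partition can never both accept writes.
    return is_elected_leader and has_quorum(reachable_peers)

assert may_act_as_primary(True, reachable_peers=1)       # sees 2 of 3: keeps the role
assert not may_act_as_primary(True, reachable_peers=0)   # isolated: must step down
```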
Cascading failure: failover shifts extra load onto the secondary, overwhelming it and triggering further failures.
False positives: overly sensitive detection leads to unnecessary failovers and flapping.
Missing state: systems restart but cannot process requests because their state was not replicated.
Stale routing: compute fails over, but DNS or the load balancer still points to the old node.
Undetected failure: the system fails, but monitoring never notices, so no failover is initiated.
To build robust failover solutions:
Tune failure detection: balance sensitivity against stability.
Automation is required, but must avoid recursive failure loops.
Prefer active-active: best HA performance, minimal recovery time.
Avoid split-brain with group consensus.
Prevent overload during failover.
Failover must be tested deliberately.
👉 Techniques:
- kill nodes
- cut network links
- inject latency
- crash processes
If automation fails, humans need fallbacks.
Failover is switching away from primary.
Failback is switching back to primary once restored.
Two approaches:
| Method | Description | Simplicity | Risk |
|---|---|---|---|
| Manual Failback | Operator choice | Medium | Low |
| Automatic Failback | Self-triggered return | High | Risk of oscillation |
Best practice:
- Use manual failback unless primary degradation was temporary and proven fixed
| Cloud Provider | HA Mechanism |
|---|---|
| AWS | ELB/ALB, RDS Multi-AZ, Route53, ASG, DynamoDB Global Tables |
| GCP | Global Load Balancing, Cloud SQL HA, MIG auto-heal |
| Azure | Traffic Manager, Availability Sets, Zone Redundancy |
| Kubernetes | Pod auto-restart, rescheduling, health probes |
| On-Prem | Pacemaker/Corosync, VRRP, VMware HA |
        ┌────────────────┐
        │   Global DNS   │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Load Balancer  │
        └───────┬────────┘
                │
   ┌────────────┼─────────────┐
   │            │             │
┌──▼──────┐ ┌───▼──────┐ ┌────▼─────┐
│ Node A  │ │ Node B   │ │ Node C   │
│ Primary │ │ Standby  │ │ Standby  │
└──┬──────┘ └──────────┘ └──────────┘
   │
   ├─► Replicated Database Cluster
   │       └── Synchronous Failover
   │
   └─► Shared Distributed Storage
✔ Redundant nodes (at least 2, preferably 3+ for quorum)
✔ Automated health checks
✔ Load-balanced entry points
✔ Automated switch-over
✔ Fast state synchronization
✔ No human intervention required
✔ Split-brain protection (quorum/fencing)
✔ Tested via failure injection
✔ Documented manual override
✔ Monitoring and alert workflows
Failover is:
- the automated switching to standby systems when failures occur
- critical for high availability
- implemented using redundancy + health checks + automated orchestration
- categorized into cold, warm, hot, and active-active
- deeply interconnected with clustering, load balancing, state replication, and monitoring
Achieving high-quality failover requires:
- architecture with no single points of failure
- automated recovery mechanisms
- carefully tuned detection systems
- chaos and resilience testing
- thorough planning and operational discipline
When implemented properly, failover enables systems to survive failures without user impact, maintaining high availability.
