
High Availability (HA)

CC0

Disclaimer: ChatGPT generated document.

A Definitive, Comprehensive, End-to-End Guide

High Availability (HA) refers to the architectural, operational, and organizational practices that ensure a system remains accessible, functional, and resilient even when components fail. It is a discipline, not a single feature.


1. What High Availability Actually Means

1.1 Core Definition

High Availability is the ability of a system to remain operational for long periods of time, with minimal downtime—planned or unplanned.

A highly available system:

  • survives hardware failures
  • survives network failures
  • minimizes (or eliminates) maintenance downtime
  • recovers quickly and predictably

HA does not mean:

  ❌ zero downtime
  ❌ zero failures
  ❌ infinite redundancy

It does mean:

  ✔ failures do not significantly impact users
  ✔ systems continue operating despite disruptions
  ✔ recovery is automatic and extremely fast


2. Availability Metrics

Availability is expressed as a percentage:

Availability = Uptime / (Uptime + Downtime)
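As a quick sanity check, the formula can be turned into a few lines of Python that convert an availability target into a downtime budget. This is a minimal sketch; the helper names are illustrative, not from any library.

```python
# Minimal sketch: convert an availability target into a downtime budget.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def availability(uptime_s: float, downtime_s: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime_s / (uptime_s + downtime_s)

def downtime_budget_per_year(target: float) -> float:
    """Maximum allowed downtime (seconds per year) for a given availability target."""
    return (1.0 - target) * SECONDS_PER_YEAR

for label, target in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    print(f"{label}: {downtime_budget_per_year(target) / 60:.1f} minutes/year")
# 3 nines: ~526 minutes, 4 nines: ~52.6 minutes, 5 nines: ~5.3 minutes
```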

2.1 Levels of Availability (“Nines” Table)

| “Nines” | Availability | Max Downtime / Year |
| ------- | ------------ | ------------------- |
| 2 nines | 99%          | 3 days, 15 hours    |
| 3 nines | 99.9%        | 8 hours, 45 minutes |
| 4 nines | 99.99%       | 52 minutes          |
| 5 nines | 99.999%      | 5 minutes           |
| 6 nines | 99.9999%     | 31 seconds          |

High availability generally means 99.99% or better, though the exact target depends on the industry.


3. The Pillars of High Availability

HA systems must excel in three core areas:

3.1 Redundancy

Multiple independent components so one failure does not bring down the whole system.

Includes:

  • redundant servers
  • redundant networks
  • redundant disks (RAID)
  • redundant availability zones
  • redundant power sources
  • redundant load balancers
  • redundant DNS providers

3.2 Failover

If a component fails, traffic is moved to a healthy one with minimal disruption.

Types:

  • active-active
  • active-passive
  • cold standby

Failover can be:

  • DNS-based
  • load-balancer-based
  • cluster-based
  • application-level

3.3 Monitoring & Recovery

HA systems rely on:

  • constant health checks
  • automatic fault detection
  • automatic node eviction
  • automatic traffic rerouting
  • automated repair or restart

Monitoring isn’t an optional accessory—it's the heart of HA.


4. Architecture Patterns for High Availability

4.1 Load Balancing

Distribute workload across multiple instances:

  • Layer 4 (TCP) load balancing
  • Layer 7 (HTTP) load balancing
  • Round robin
  • Least connections
  • Weighted distribution

The load balancer itself must be highly available (clustered, multi-zone).
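To make the round-robin and least-connections strategies above concrete, here is a minimal, illustrative Python sketch. The `Backend` class and selection functions are assumptions for this example, not any real load balancer's API.

```python
import itertools
from dataclasses import dataclass

# Illustrative sketch of two common selection strategies; not a real load balancer.

@dataclass
class Backend:
    name: str
    active_connections: int = 0
    healthy: bool = True

backends = [Backend("app-1"), Backend("app-2"), Backend("app-3")]

# Round robin: cycle through backends in order, skipping unhealthy ones.
_rr = itertools.cycle(range(len(backends)))

def round_robin() -> Backend:
    for _ in range(len(backends)):
        b = backends[next(_rr)]
        if b.healthy:
            return b
    raise RuntimeError("no healthy backends")

# Least connections: pick the healthy backend with the fewest in-flight requests.
def least_connections() -> Backend:
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends")
    return min(healthy, key=lambda b: b.active_connections)
```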

4.2 Horizontal Scaling

More nodes → more fault tolerance.

Stateless services scale easiest.

4.3 Vertical Scaling

Bigger hardware; helps performance but not availability on its own.

4.4 Multi-AZ / Multi-Region Deployments

Multi-AZ (Availability Zone):

  • protects against power/network failures in a datacenter
  • required for “four nines” or higher

Multi-region:

  • protects against natural disasters
  • protects against regional outages
  • enables disaster recovery and near-zero RPO (in advanced setups with synchronous cross-region replication)

4.5 Clustering

Used heavily in databases and stateful services.

Cluster types:

  • master–slave
  • multi-master
  • sharded clusters
  • synchronous / asynchronous replication

4.6 Database Replication

Provides:

  • failover
  • read scaling
  • reduced data loss

Tradeoffs include consistency issues and write latency.

4.7 Stateless Microservices

Stateless services:

  • restart instantly
  • scale easily
  • fail independently

Stateful systems must use:

  • distributed storage
  • sticky sessions (preferably avoided)
  • replicated databases
  • leader election algorithms (Raft, Paxos)

5. Operational Practices Required for HA

HA is more about process than technology.

5.1 Blue–Green Deployments

Deploy new version → switch traffic → rollback instantly if needed.

5.2 Rolling Deployments

Update servers gradually without downtime.

5.3 Canary Deployments

Send 1–5% of traffic to the new release, watch error rates and latency, then ramp up or roll back.
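A canary split is essentially weighted routing. The sketch below is illustrative (not tied to any specific router) and uses hash-based bucketing so roughly 5% of requests land on the canary release while a given key stays sticky to one version.

```python
import hashlib

CANARY_FRACTION = 0.05  # send ~5% of traffic to the canary release (assumed value)

def choose_version(request_key: str) -> str:
    """Hash-based bucketing: the same key is consistently routed to the same version."""
    digest = hashlib.md5(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99, roughly uniform
    return "canary" if bucket < CANARY_FRACTION * 100 else "stable"

# Example: count how traffic splits over 10,000 simulated requests.
counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[choose_version(f"user-{i}")] += 1
print(counts)  # roughly 9,500 stable / 500 canary
```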

5.4 Chaos Engineering

Test failure modes intentionally:

  • shut down servers
  • kill network connections
  • simulate region failure
  • test degraded latency

If you don’t test failures, you don’t have HA.

5.5 Incident Response

Downtime increases dramatically if:

  • alerts are slow
  • engineers respond slowly
  • fixes require manual steps

Automation and runbooks are mandatory.

5.6 Observability

You need:

  • logs
  • metrics
  • distributed tracing
  • health dashboards
  • anomaly detection

6. High Availability vs. Fault Tolerance vs. Disaster Recovery

6.1 High Availability (HA)

Goal: minimize downtime
Approach: redundancy + fast failover
Acceptable: brief interruptions

6.2 Fault Tolerance

Goal: no interruptions at all
Approach: real-time hardware/software duplication
Example: aircraft control systems, space systems
Cost: very expensive

6.3 Disaster Recovery (DR)

Goal: recover from catastrophic failure
Approach: off-site backups, delayed replication
Recovery time: measured in minutes to hours

HA ≠ DR, but HA systems often include DR.


7. Common Causes of Downtime

High availability systems must defend against:

7.1 Hardware Failures

  • disk failures
  • network card failures
  • RAM corruption
  • power supply failure

7.2 Software Failures

  • memory leaks
  • thread deadlocks
  • unhandled exceptions
  • OS kernel crashes

7.3 Network Failures

  • packet loss
  • routing instability
  • DDoS attacks
  • link failure

7.4 Human Error

#1 cause of outages.

7.5 Deployments

#2 cause of outages.

7.6 Cascading Failures

When one component slows down or fails, the load it sheds can overwhelm its neighbors until the entire system collapses.


8. Components of High Availability Design

8.1 Redundant Hardware

  • dual power supplies
  • dual network cards
  • RAID arrays
  • multiple servers

8.2 Geographic Redundancy

Multi-AZ / multi-region replication.

8.3 Load Balancers and Reverse Proxies

Examples:

  • HAProxy
  • Nginx
  • Envoy
  • AWS ALB/ELB
  • Google Cloud Load Balancer

8.4 Distributed Storage

Options include:

  • Ceph
  • GlusterFS
  • Amazon S3
  • Google Cloud Storage
  • Distributed SQL (CockroachDB)

8.5 Auto-scaling

Expand or shrink capacity automatically based on demand.

8.6 Circuit Breakers & Rate Limiters

Prevent cascading failures.

Examples (a combined sketch follows this list):

  • retry policies
  • exponential backoff
  • circuit breakers (Hystrix, Resilience4j)
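Below is a simplified, illustrative Python sketch of both patterns: retry with exponential backoff plus jitter, and a basic circuit breaker. It is not the Hystrix or Resilience4j API, and the thresholds and timings are made-up values.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with exponential backoff and jitter (illustrative defaults)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds

class CircuitBreaker:
    """Open the circuit after repeated failures; probe again after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```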

9. High Availability in Cloud Environments

Cloud HA differs from on-prem.

9.1 Cloud Makes HA Easier

Cloud offers managed:

  • load balancers
  • multi-AZ replication
  • storage redundancy
  • automatic failover
  • auto-scaling

9.2 Cloud Does Not Guarantee HA

Misconfigurations still destroy availability:

  • single-AZ deployments
  • stateful services without replication
  • manual failover setup
  • non-redundant networking

10. High Availability Costs

Achieving high availability is expensive.

| Level            | Typical Cost Multiplier |
| ---------------- | ----------------------- |
| 99% → 99.9%      | 2–5×                    |
| 99.9% → 99.99%   | 5–10×                   |
| 99.99% → 99.999% | 10–100×                 |

Why?

  • more servers
  • more zones
  • more regions
  • more engineers
  • more monitoring systems
  • more automation

Cost grows exponentially as downtime approaches zero.


11. Real-World High Availability Architectures

11.1 Web Applications

Typically use:

  • load balancers
  • stateless containers
  • session storage in Redis
  • multi-AZ databases
  • CDN caching
  • auto-scaling groups

11.2 Database HA

Requires:

  • synchronous replication
  • automatic failover
  • consensus protocols (Raft, Paxos)
  • WAL replication

11.3 DNS HA

DNS must be multi-provider:

  • Cloudflare + Route 53
  • NS1 + Google Cloud DNS

DNS itself must be globally redundant.


12. Misconceptions About High Availability

12.1 "Just add more servers."

Without automated failover → useless.

12.2 "Cloud = HA automatically."

Nope. Cloud gives tools, not HA configurations.

12.3 "99.999% uptime is easy."

It is extremely difficult and costly.

12.4 "Monitoring equals HA."

Monitoring tells you that a failure happened; it does not keep the system up. You still need redundancy and automated failover.

12.5 "Active-passive is fine for 5 nines."

Too slow for ultra-high availability.


13. High Availability Design Principles (Master List)

  • Eliminate single points of failure
  • Use multi-zone deployments
  • Prefer stateless designs
  • Replicate stateful services
  • Use health checks everywhere
  • Automate failover
  • Use immutable infrastructure
  • Implement rolling deployments
  • Use chaos testing
  • Monitor everything
  • Prepare for human errors
  • Prefer active-active designs
  • Always test disaster scenarios
  • Maintain runbooks and automation

14. High Availability vs Scalability vs Reliability

| Concept     | Meaning                  |
| ----------- | ------------------------ |
| HA          | System stays up          |
| Scalability | System handles growth    |
| Reliability | System behaves correctly |

Example: A system can be scalable but not highly available (scales well but crashes often). Or highly available but not scalable (always up but slow).


15. Summary

High Availability is the practice of designing and operating systems that:

  • tolerate failure
  • recover automatically
  • minimize downtime
  • monitor themselves continuously
  • run across multiple zones or regions
  • use redundancy at every layer
  • avoid single points of failure
  • rely on automation more than humans

It requires:

  • architecture choices
  • operational excellence
  • monitoring sophistication
  • deployment discipline
  • continuous testing
  • significant cost investment

HA is one of the core disciplines of resilient system design, and achieving it is a combination of engineering, process, and culture—not just hardware.




Five Nines

1. What 99.999% Availability Means

“Availability” is the percentage of time a system is operational and accessible as intended. 99.999% availability means the system is designed to be available all but 0.001% of the time per year.

1.1 Translation to Downtime

| Availability | Max Downtime / Year | Max Downtime / Month | Max Downtime / Week | Max Downtime / Day |
| ------------ | ------------------- | -------------------- | ------------------- | ------------------ |
| 99.999%      | ~5.26 minutes       | ~25.9 seconds        | ~6.05 seconds       | ~0.864 seconds     |

This is extremely stringent and is considered "carrier-grade" or “mission-critical” reliability.


2. Why Five Nines Matters

Systems designed for five nines usually support:

  • Telecom infrastructure
  • Financial transactions
  • Medical devices
  • Power grid & utilities
  • Cloud services (e.g., managed load balancers, DNS)
  • Industrial control systems
  • High-end enterprise solutions

Achieving 5 nines has huge implications for design, engineering, cost, testing, and operations.


3. SLA vs. SLO vs. SLI

3.1 Service-Level Agreement (SLA)

A contractual guarantee of uptime.

3.2 Service-Level Objective (SLO)

A target availability internally set by engineering.

3.3 Service-Level Indicator (SLI)

The actual measurement of availability.

Important:

SLA availability is not equal to real availability. Companies sometimes advertise 5 nines for specific components, not entire systems.


4. The Mathematical Basis of Availability

Availability (A):

A = MTBF / (MTBF + MTTR)

Where:

  • MTBF = Mean time between failures
  • MTTR = Mean time to repair

For 5 nines:

  • MTTR must be extremely short (minutes or seconds)
  • Failures must be extremely rare

Even a single 10-minute outage kills your SLA for the entire year.
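Plugging numbers into the formula shows how unforgiving five nines is. This is a quick illustrative calculation with assumed MTBF/MTTR values, not data from any real system.

```python
# Illustrative: availability from MTBF/MTTR (both in hours), with assumed values.

def availability(mtbf_h: float, mttr_h: float) -> float:
    return mtbf_h / (mtbf_h + mttr_h)

# One failure per month (MTBF ~730 h):
print(availability(730, 1.0))       # ~0.99863  -> not even three nines with a 1-hour MTTR
print(availability(730, 1 / 60))    # ~0.99998  -> about four nines with a 1-minute MTTR
# To reach 99.999% with a 1-minute MTTR, MTBF must exceed roughly 1,667 hours (~70 days):
print(availability(1666.7, 1 / 60)) # ~0.99999
```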


5. What You Actually Have to Do to Achieve 99.999%

This is where things get serious. “Five nines” is not possible without aggressive redundancy, automation, and extremely fast recovery.

5.1 Redundancy Everywhere

To tolerate component failures, you need:

| Layer      | Redundancy Required                      |
| ---------- | ---------------------------------------- |
| Hardware   | N+1 or N+2 servers, redundant NICs, RAID |
| Networking | Active-active routers, redundant uplinks |
| Compute    | Clustering, failover instances           |
| Storage    | Replication, synchronous writes          |
| Databases  | Multi-AZ or multi-region                 |
| Services   | Load balancing across zones              |
| Power      | Dual power feeds, UPS, diesel generators |

A single point of failure = no 99.999%.


6. The Architecture Patterns Required

Let’s dive into the core patterns.

6.1 Multi-Zone High Availability

You need components deployed across:

  • Multiple availability zones (AZs) in the same region
  • Possibly multiple regions for disaster tolerance

6.2 Active-Active vs. Active-Passive

Active-active is almost always required for five nines:

  • Both nodes handle traffic simultaneously
  • Failover is instantaneous
  • Passive failover is usually too slow

6.3 Stateless microservices

State kills availability. Stateless servers can restart or be replaced instantly.

More state → more downtime.

6.4 Distributed Caching

Because you cannot afford slow systems.

  • Redis cluster
  • Memcached sharding
  • Region-local caches

6.5 Health checks + automatic failover

Every system requires:

  • frequent health checks (1s–5s interval)
  • automated node eviction
  • automated node replacement
  • automated traffic rerouting

6.6 Blue-Green or Rolling Deployments

Downtime during deployments must be zero.

This requires:

  • gradual traffic migration
  • canary deployments
  • full rollback automation

7. Operational Requirements for 99.999%

7.1 Monitoring & Observability

You need:

  • Distributed tracing
  • Metrics and alerting with <1m detection
  • Log aggregation
  • Synthetic checks from multiple regions
  • SLO dashboards

7.2 On-call Response Time

Incident response affects availability.

To maintain five nines:

  • Alerts must fire instantly
  • Engineers must respond within 1 minute
  • Automated mitigation is often necessary

Manual-only operations → impossible.

7.3 Chaos Engineering

You must test faults before they happen:

  • kill nodes
  • kill networks
  • kill processes
  • inject latency
  • simulate AZ failure

No testing → unknown recovery behavior → downtime.


8. Common Threats to Five Nines

8.1 Human Error

The #1 cause of outages.

Examples:

  • Wrong config push
  • Faulty database migrations
  • Accidental power cycling
  • Bad deploy

Mitigation: automation + reviews + feature flags.

8.2 Network Partitioning

Worst enemy of distributed systems.

8.3 Database Lockups

Deadlocks, corrupted indexes, replication lag.

8.4 Capacity Surges

Unexpected spike → queue buildup → cascading failure.

8.5 Dependency Outages

If you rely on:

  • third-party APIs
  • payment processors
  • cloud-hosted DNS

Your availability = their availability * your internal availability.


9. Acceptable Downtime Budget Examples

Because downtime budget is small:

Example: Deployments

If you deploy 20 times per day:

  • A 500 ms glitch per deploy × 20 deploys = 10 seconds of downtime per day
  • The five-nines budget is only ~0.864 seconds per day, so that alone destroys the target
  • In practice, each deploy must have essentially zero visible impact

Example: Database maintenance

Even a 2-second failover consumes a large slice of the ~26-second monthly downtime budget; a handful of such maintenance windows exhausts it.
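The arithmetic behind these examples is worth making explicit. The numbers below are illustrative.

```python
# Five-nines downtime budgets, and how quickly deployments consume them.
UNAVAILABILITY = 1 - 0.99999

daily_budget_s   = UNAVAILABILITY * 24 * 3600       # ~0.864 s per day
monthly_budget_s = UNAVAILABILITY * 30 * 24 * 3600  # ~25.9 s per 30-day month

deploys_per_day = 20
glitch_s = 0.5  # a 500 ms blip per deploy (assumed)
print(deploys_per_day * glitch_s)  # 10.0 s/day, versus a ~0.864 s/day budget
print(monthly_budget_s / 2.0)      # a 2 s failover fits only ~13 times in a month's budget
```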


10. Cost of Achieving 99.999%

Costs scale nonlinearly.

| Availability     | Typical Cost Increase |
| ---------------- | --------------------- |
| 99% → 99.9%      | 2–5×                  |
| 99.9% → 99.99%   | 5–10×                 |
| 99.99% → 99.999% | 10–100×               |

This includes:

  • More servers
  • More redundancy
  • More monitoring
  • More SRE staff
  • More data centers
  • More automated systems
  • More sophisticated deployment pipelines

Almost no business needs 5 nines. Many think they do; few actually do.


11. Misconceptions About Five Nines

11.1 “Five nines means no outages.”

Wrong. It means very short outages (<5 min/year).

11.2 “Cloud providers guarantee five nines.”

Mostly false. Individual components rarely exceed 99.99%.

11.3 “If one component has 99.999%, the system has 99.999%.”

Wrong.

System availability = product of all component availabilities.


12. Mathematical Composition of Availability

If your system has 5 independent components:

  • Component availability: 0.99999
  • System availability: 0.99999⁵ = 0.99995 (~99.995%)

That is roughly 99.995%: short of five nines, even though every individual component meets the five-nines bar.

Systems degrade fast as dependencies increase.
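The same composition rule can be checked numerically. Serial dependencies multiply availabilities, while redundant (parallel) replicas improve them; the sketch below is illustrative and assumes independent failures and perfect failover.

```python
from math import prod

def serial(*availabilities: float) -> float:
    """All components are required: availabilities multiply."""
    return prod(availabilities)

def parallel(a: float, replicas: int) -> float:
    """Service is up if at least one replica is up (ignores failover time)."""
    return 1 - (1 - a) ** replicas

print(serial(*[0.99999] * 5))  # ~0.99995  -> five 5-nines parts in series fall below five nines
print(serial(*[0.9999] * 5))   # ~0.9995   -> five 4-nines parts in series give roughly 3.5 nines
print(parallel(0.999, 2))      # 0.999999  -> two redundant 3-nines replicas approach six nines
```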


13. What Actually Delivers Five Nines in Real Life

True 5-nines systems:

  • Telephone switching infrastructure
  • Some financial trading platforms
  • Nuclear monitoring systems
  • Aircraft avionics
  • Certain embedded medical systems
  • Some power-grid control systems

These are usually:

  • fully redundant
  • real-time
  • safety-critical
  • extremely expensive
  • heavily certified

14. Five Nines in Cloud Computing

Common cloud components (approximate published figures):

| Service                    | Availability        |
| -------------------------- | ------------------- |
| AWS S3                     | 99.99% (four nines) |
| AWS EC2 (region)           | 99.99%              |
| AWS Route 53               | 100% SLA            |
| Google Cloud Load Balancer | 99.99%              |
| Cloudflare CDN             | 100% SLA (regional) |

The overall system rarely hits 5 nines without major architectural investment.


15. Summary

99.999% availability is an extremely strict requirement that:

  • allows only 5 minutes of downtime per year
  • requires redundancy everywhere
  • mandates automatic failover
  • demands fast incident response (<1 minute)
  • requires multi-zone deployments
  • forbids manual recoveries
  • significantly increases cost and complexity
  • cannot tolerate human error
  • requires rigorous testing, monitoring, and chaos engineering

Achieving five nines is possible, but extremely difficult, expensive, and usually unnecessary unless human lives, money, or infrastructure depend on it.




Failover

1. What Is Failover?

1.1 Definition

Failover is the process by which a system automatically transfers control or functionality from a failing or inactive component to a redundant or standby component in order to maintain availability and continuity of service.

In simpler terms:

Failover means switching to backup systems when something goes wrong, ideally without user impact.

It is a core mechanism for ensuring high availability (HA) and resilience.


2. Why Failover Is Necessary

Systems fail. Failover ensures:

  • Business continuity
  • Persistent user experience
  • Protection against infrastructure outages
  • Fully automated recovery
  • Reduced MTTR (Mean Time to Repair)

Failover can respond to:

  • hardware failures
  • software crashes
  • network interruptions
  • power issues
  • overload conditions
  • misconfigurations
  • human error

3. Types of Failover Mechanisms

There are four primary models, ordered by availability performance:

| Failover Type                | Description                                                        | Recovery Speed  | Cost    | Complexity | Typical Availability |
| ---------------------------- | ------------------------------------------------------------------ | --------------- | ------- | ---------- | -------------------- |
| Cold Standby                 | Backup inactive until failure occurs; requires manual restart.     | Minutes–hours   | Low     | Low        | ~99%                 |
| Warm Standby                 | Backup partially active, pre-configured; limited sync.             | Seconds–minutes | Medium  | Medium     | 99.9–99.99%          |
| Hot Standby (Active-Passive) | Fully redundant running instance waiting; automatic failover.      | <5 seconds      | High    | High       | 99.99–99.999%        |
| Active-Active                | All nodes active simultaneously; load shared; immediate takeover.  | <1 second       | Highest | Highest    | 99.999%+             |

For mission-critical and carrier-grade systems → Active-Active is most recommended.


4. Failover Detection and Triggering

Failover requires fast and accurate failure detection.

4.1 Detection mechanisms

  • Health checks (ping, TCP probe, HTTP test)
  • Heartbeat messages
  • Cluster membership monitoring
  • Error rate or response time thresholds
  • Node liveness checks
  • Availability zone failure detection

Detection frequency is critical: too slow, and users feel the failure long before failover kicks in; too fast, and you risk false positives (flapping).

Typical health check frequency: 1–5 s
Typical failover timeout: 5–30 s, depending on configuration
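A minimal Python sketch of the detection loop described above, using a consecutive-failure threshold to balance sensitivity against false positives. The intervals, threshold, and endpoint are illustrative assumptions.

```python
import time
import urllib.request

# Illustrative health-check loop: mark a node unhealthy only after several
# consecutive failures, to avoid flapping on a single slow response.

CHECK_INTERVAL_S = 2    # typical range: 1–5 s
FAILURE_THRESHOLD = 3   # e.g. 3 misses * 2 s interval ~= 6 s detection time
TIMEOUT_S = 1

def is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def monitor(url: str, on_failover):
    consecutive_failures = 0
    while True:
        if is_healthy(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                on_failover(url)  # e.g. evict the node and reroute traffic
                consecutive_failures = 0
        time.sleep(CHECK_INTERVAL_S)

# Example (hypothetical endpoint):
# monitor("http://node-a.internal/healthz", on_failover=lambda u: print("failover!", u))
```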


5. Failover Architecture Components

| Component                | Role                             |
| ------------------------ | -------------------------------- |
| Monitor                  | Detects failure                  |
| Failover Manager         | Decides on promoting the standby |
| Load Balancer            | Reroutes traffic                 |
| State Replication System | Shares data between nodes        |
| Cluster Coordinator      | Elects a new primary (if needed) |
| Recovery Automation      | Restores failing components      |

6. Failover Strategies by Layer

6.1 Infrastructure Layer

  • Redundant power supplies
  • Dual NICs
  • Redundant routers
  • Multi-AZ deployment

6.2 Network Layer

  • VRRP (Virtual Router Redundancy Protocol)
  • BGP failover
  • Software-defined networking (SDN)
  • Floating IP failover

6.3 Load Balancing Layer

  • L4 load balancers (HAProxy, LVS)
  • L7 load balancers (Envoy, Nginx, ALB)
  • DNS-based failover
  • IP Anycast routing

6.4 Compute/Application Layer

  • Active-active clusters
  • Kubernetes with readiness & liveness probes
  • Stateless microservices
  • Heartbeat systems

6.5 Database Layer

  • Leader election (Raft, Paxos)
  • Primary/Replica failover
  • Write-ahead logs (WAL)
  • Synchronous vs asynchronous replication tradeoffs

7. Failover Timing Considerations

| Phase                  | Time Impact       |
| ---------------------- | ----------------- |
| Failure detection      | ~5–10 s           |
| Failover decision      | <1 s              |
| New instance promotion | 1–3 s             |
| Traffic rerouting      | 0–2 s             |
| Overall failover       | 3–15 seconds (HA) |

For five nines availability:

  • Failover must complete in under a few seconds
  • Manual intervention is unacceptable

8. Failover Pitfalls & Risks

8.1 Split-Brain Scenario

Occurs when both primary and backup believe they are the active node.

Consequences:

  • Conflicting writes
  • Data corruption
  • Service instability

Prevention via the mechanisms below; a quorum-check sketch follows the list:

  • quorum-based consensus
  • fencing mechanisms
  • STONITH ("Shoot The Other Node In The Head")

8.2 Cascading Failures

Failover shifts load onto the surviving nodes; if they cannot absorb it, they fail in turn, turning one failure into many.

8.3 Overly aggressive health checks

Leads to unnecessary failovers, flapping.

8.4 Incomplete state replication

The standby takes over but cannot serve requests correctly because state was not fully replicated.

8.5 Uncoordinated failover across layers

Compute fails over, but DNS or the load balancer still points to the old node.

8.6 Silent failures

The system fails, but monitoring does not detect it, so no failover is initiated.


9. Optimizing Failover Systems

To build robust failover solutions:

9.1 Don’t Failover Too Fast

Balance between sensitivity and stability.

9.2 Automate, but Respect Safety

Automation is required, but must avoid recursive failure loops.

9.3 Prefer Active-Active

Best HA performance, minimal recovery time.

9.4 Use Quorum-based Decision Making

Avoid split-brain with group consensus.

9.5 Implement Circuit Breakers

Prevent overload during failover.

9.6 Test with Chaos Engineering

Failover must be tested deliberately.

👉 Techniques:

  • kill nodes
  • cut network links
  • inject latency
  • crash processes

9.7 Document & Script Runbooks

If automation fails, humans need fallbacks.


10. Failback (Returning to Original State)

Failover is switching away from primary.

Failback is switching back to primary once restored.

Two approaches:

| Method             | Description           | Simplicity | Risk                |
| ------------------ | --------------------- | ---------- | ------------------- |
| Manual Failback    | Operator choice       | Medium     | Low                 |
| Automatic Failback | Self-triggered return | High       | Risk of oscillation |

Best practice:

  • Use manual failback unless the primary's degradation was temporary and is proven fixed (a sketch of a guarded automatic failback follows below)
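If automatic failback is used at all, it should be guarded against oscillation. The Python sketch below only fails back after the primary has stayed healthy for a sustained window; the class, method, and timings are illustrative assumptions.

```python
import time
from typing import Optional

# Illustrative guard against failback oscillation: only return traffic to the
# primary after it has been continuously healthy for a stability window.

STABILITY_WINDOW_S = 300  # primary must be healthy for 5 minutes straight (assumed)

class FailbackGuard:
    def __init__(self):
        self.healthy_since = None

    def record_primary_health(self, healthy: bool, now: Optional[float] = None) -> bool:
        """Return True when it is safe to fail back to the primary."""
        now = time.monotonic() if now is None else now
        if not healthy:
            self.healthy_since = None  # any blip resets the clock
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        return now - self.healthy_since >= STABILITY_WINDOW_S

# Example with injected timestamps:
guard = FailbackGuard()
print(guard.record_primary_health(True, now=0))    # False (clock just started)
print(guard.record_primary_health(True, now=301))  # True  (stable for more than 5 minutes)
```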

11. Failover in Major Cloud Environments

| Platform   | HA Mechanism                                                 |
| ---------- | ------------------------------------------------------------ |
| AWS        | ELB/ALB, RDS Multi-AZ, Route 53, ASG, DynamoDB Global Tables |
| GCP        | Global Load Balancing, Cloud SQL HA, MIG auto-heal           |
| Azure      | Traffic Manager, Availability Sets, Zone Redundancy          |
| Kubernetes | Pod auto-restart, rescheduling, health probes                |
| On-Prem    | Pacemaker/Corosync, VRRP, VMware HA                          |

12. Example Failover Architecture (Simplified)

           ┌────────────────┐
           │   Global DNS   │
           └───────┬────────┘
                   │
           ┌───────▼────────┐
           │ Load Balancer  │
           └───────┬────────┘
      ┌────────────┼────────────┐
      │            │            │
 ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
 │ Node A  │  │ Node B  │  │ Node C  │
 │ Primary │  │ Standby │  │ Standby │
 └────┬────┘  └─────────┘  └─────────┘
      │
      ├─► Replicated Database Cluster
      │    └── Synchronous Failover
      │
      └─► Shared Distributed Storage

13. Checklist: Production-Ready Failover

✔ Redundant nodes (at least 2, preferably 3+ for quorum)
✔ Automated health checks
✔ Load-balanced entry points
✔ Automated switch-over
✔ Fast state synchronization
✔ No human intervention required
✔ Split-brain protection (quorum/fencing)
✔ Tested via failure injection
✔ Documented manual override
✔ Monitoring and alert workflows


14. Final Summary

Failover is:

  • the automated switching to standby systems when failures occur
  • critical for high availability
  • implemented using redundancy + health checks + automated orchestration
  • categorized into cold, warm, hot, and active-active
  • deeply interconnected with clustering, load balancing, state replication, and monitoring

Achieving high-quality failover requires:

  • architecture with no single points of failure
  • automated recovery mechanisms
  • carefully tuned detection systems
  • chaos and resilience testing
  • thorough planning and operational discipline

When implemented properly, failover enables systems to survive failures without user impact, maintaining high availability.


