This document is intended to facilitate discussion and promote industry best practices for making cloud-based services resilient, that is, highly available. A highly resilient posture means being prepared for failures of all kinds, such as natural disasters, security breaches, network failures, software bugs, high traffic loads, and unexpected user behavior, and being able to handle the unexpected gracefully and quickly. High availability is the product of coordinated effort across people, processes, and technical strategy, including but not limited to:
- Disaster Recovery strategy (ranging from backups to full active/active multi-cloud deployments)
- Continuous deployment (infrastructure-as-code, automated unit/integration/load testing, staggered deployments to multiple regions with bake-time and rollback alarms)
- Observability (covering KPIs, health indicators, and assumptions with metric emissions; service alarms, metric alarms, sanity alarms, golden path service monitoring, and security alarms such as administrative permission monitoring)
- Runbooks for possible failures and gamedays (outage simulations and team preparedness)
- Security best practices (adherence to principles such as least privilege, ensuring only necessary access is granted to minimize risk)
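
The staggered deployment item in the list above (regions deployed one at a time, with a bake time and a rollback alarm) can be sketched roughly as follows. This is a minimal illustration, not a reference to any particular deployment tool: `staggered_rollout`, the `deploy`, `rollback`, and `alarm_firing` callbacks, and the region names in the test are all hypothetical, and rolling back every region touched so far is an assumed policy.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RolloutResult:
    deployed: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    succeeded: bool = True


def staggered_rollout(
    regions: List[str],
    deploy: Callable[[str], None],         # hypothetical: pushes the release to one region
    rollback: Callable[[str], None],       # hypothetical: restores the previous release
    alarm_firing: Callable[[str], bool],   # hypothetical: True if a rollback alarm is active
    bake_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RolloutResult:
    """Deploy region by region; stop and roll back if an alarm fires after the bake time."""
    result = RolloutResult()
    for region in regions:
        deploy(region)
        result.deployed.append(region)
        sleep(bake_seconds)  # bake time: let the new release take real traffic
        if alarm_firing(region):
            # Assumed policy: undo every region touched so far, newest first.
            for done in reversed(result.deployed):
                rollback(done)
                result.rolled_back.append(done)
            result.succeeded = False
            break
    return result
```
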
To determine the best path to resilience, the requirements must first be locked in. Here are some things to consider:
- Will downtime potentially cause irreversible brand damage or extensive revenue loss?

| Strategy | Recovery Time Objective (RTO) |
|---|---|
| Backup & Restore | Hours |
| Pilot Light | Tens of Minutes |
| Warm Standby | Minutes |
| Active/Active | Real-time |
| Active/Active (multi-cloud) | Real-time |
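
Once an RTO requirement is agreed, the table can be used as a simple lookup. The sketch below encodes the table's qualitative RTO values as ordered tiers and returns every strategy that meets a required tier. The names `RtoTier`, `STRATEGIES`, and `candidate_strategies` are illustrative assumptions. The tiers are deliberately not mapped to numeric durations, because the table gives none.

```python
from enum import IntEnum
from typing import List, Tuple


class RtoTier(IntEnum):
    """Qualitative RTO tiers from the table; a higher value means faster recovery."""
    HOURS = 0
    TENS_OF_MINUTES = 1
    MINUTES = 2
    REAL_TIME = 3


# Strategy names and tiers copied from the table, in table order.
STRATEGIES: List[Tuple[str, RtoTier]] = [
    ("Backup & Restore", RtoTier.HOURS),
    ("Pilot Light", RtoTier.TENS_OF_MINUTES),
    ("Warm Standby", RtoTier.MINUTES),
    ("Active/Active", RtoTier.REAL_TIME),
    ("Active/Active (multi-cloud)", RtoTier.REAL_TIME),
]


def candidate_strategies(required: RtoTier) -> List[str]:
    """Return the strategies whose RTO tier is at least as fast as `required`."""
    return [name for name, tier in STRATEGIES if tier >= required]
```

For example, `candidate_strategies(RtoTier.MINUTES)` returns Warm Standby and both Active/Active variants. Choosing among the candidates still depends on cost and on the other requirements, which this sketch does not model.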