A checklist for designing and developing internet-scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."
- Does our design expect failures to happen regularly and handle them gracefully?
- Have we kept things as simple as possible?
- Have we automated everything?
- Can the service survive failure without human administrative interaction?
- Are failure paths frequently tested?
- Have you documented all conceivable component failure modes and combinations thereof?
- Does your design tolerate these failure modes? If not, have you carried out a risk assessment to confirm that the risk is acceptable?
- Are you targeting commodity hardware? (That is, your design does not require special hardware.)
- Are you hosting all users on a single version of the software?
- Can you support multi-tenancy without physical isolation?
- Have you implemented (and automated) a quick service health check?
- Do your developers work in the full environment? (This requires single-server deployment.)
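
As an illustration of the "quick service health check" item, here is a minimal sketch in Python. It assumes each dependency can be probed by a zero-argument callable that returns a truthy value when healthy; the function name `run_health_check` and the probe names are hypothetical and do not come from Hamilton's paper. A probe that raises is recorded as a failure, so the check itself survives a broken dependency.

```python
from typing import Callable, Dict


def run_health_check(probes: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run every probe and return per-probe results plus an overall verdict."""
    checks: Dict[str, bool] = {}
    for name, probe in probes.items():
        try:
            checks[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed check, not a crashed health check.
            checks[name] = False
    return {"healthy": all(checks.values()), "checks": checks}


def database_probe() -> bool:
    # Hypothetical probe that simulates an unreachable database.
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    report = run_health_check({"disk": lambda: True, "database": database_probe})
    print(report)
```

In practice a function like this might be called on a schedule or exposed behind an endpoint, so that the check runs automatically and without an operator.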