A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."
- Does the design expect failures to happen regularly and handle them gracefully?
- Have we kept things as simple as possible?
- Have we automated everything?
- Can the service survive failure without human administrative interaction?
- Are failure paths frequently tested?
- Have we documented all conceivable component failure modes and combinations thereof?
- Does our design tolerate these failure modes? If not, have we undertaken a risk assessment to determine whether the risk is acceptable?
- Are we targeting commodity hardware? (That is, our design does not require special hardware.)
- Are we hosting all users on a single version of the software?
- Can we support multi-tenancy without physical isolation?
- Have we implemented (and automated) a quick service health check? (See the sketch after this list.)
- Do our developers work in the full environment? (This requires single-server deployment.)
- Can we continue to operate at reduced capacity if services (components) we depend on fail?
- Does our design eliminate code redundancy across services/components?
- Can our pods/clusters of services continue to operate independently of each other?
- For rare emergency human intervention, have we worked with operations to come up with recovery plans, and documented, scripted, and tested them?
- Does each of our complexity-adding optimizations (if any) give at least an order-of-magnitude improvement?
- Have we enforced admission control at all levels? (A load-shedding sketch follows this list.)
- Can we partition the service, and is that partitioning infinitely adjustable and fine-grained?
- Have we understood the network design and reviewed it with networking specialists?
- Have we analyzed throughput and latency and determined the most important metric for capacity planning?
- Are all of our operations utilities following the same code review, source code control, testing, etc., as the rest of the code base?
- Have we understood the load this service will put on any backend store / services? Have we measured and validated this load?
- Is everything versioned? The goal is to run single-version software, but multiple versions will always exist during rollouts and testing. Versions n and n+1 of all components need to coexist peacefully.
- Have we avoided single points of failure?
- Are all of our operations restartable?
- Is all persistent state stored redundantly?
- Can we support geo-distribution / multiple data center deployments?
- Have we automated provisioning and installation?
- Are configuration and code delivered by development in a single unit?
- Is the unit created by development used all the way through the lifecycle (test and production deployment)?
- Is there an audit log mechanism to capture all changes made in production?
- Have we designed for roles rather than servers, with the ability to deploy each 'role' on as many or few servers as needed?
- Are we handling failures and correcting errors at the service level?
- Have we eliminated any dependency on local storage for non-recoverable information?
- Is our deployment model as simple as it can possibly be? (Hard to beat file copy!)
- Are we using a chaos monkey in production?
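
For the health-check item above, here is a minimal sketch in Go of what an automated check might look like. The `Checker` type, the `/healthz` path, and the `store` probe are illustrative assumptions, not part of the original checklist; a real probe would ping the actual dependency within the deadline.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// Checker probes a single dependency and should honor ctx's deadline.
type Checker func(ctx context.Context) error

// healthHandler returns 200 only when every dependency check passes
// within a short deadline, so monitoring can script against one URL.
func healthHandler(checks map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, check := range checks {
			if err := check(ctx); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.Write([]byte("ok\n"))
	}
}

func main() {
	checks := map[string]Checker{
		// Hypothetical probe; a real one would ping the store, queue, etc.
		"store": func(ctx context.Context) error { return nil },
	}
	http.HandleFunc("/healthz", healthHandler(checks))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```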
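
Similarly, for the admission-control item, a sketch of one simple approach: a fixed-capacity semaphore in front of an HTTP handler that sheds excess load fast rather than queueing it. The limit of 100 in-flight requests is an arbitrary placeholder.

```go
package main

import (
	"log"
	"net/http"
)

// admit wraps a handler with a fixed-capacity semaphore: requests over
// maxInFlight are rejected immediately with 429 instead of queueing.
func admit(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // shed load rather than let queues build up
			http.Error(w, "over capacity, try again later", http.StatusTooManyRequests)
		}
	})
}

func main() {
	work := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("done\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", admit(100, work)))
}
```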
(The following items concern handling dependencies on other services / components.)
- Can we tolerate highly variable latency in service calls? Do we have timeout mechanisms in place, and can we retry interactions after a timeout (idempotency)?
- Are all retries reported, and have we bounded the number of retries?
- Do we have circuit breakers in place to prevent cascading failures? Do they 'fail fast'? (A combined retry/circuit-breaker sketch appears at the end of this checklist.)
- Are we depending upon shipping and proven components wherever possible?
- Have we implemented inter-service monitoring and alerting?
- Do the services we depend on have the same (or compatible) design points (e.g., SLAs)?
- Can we continue operation (perhaps in a degraded mode) if a component or service we depend on fails?
- Are we shipping often enough?
- Have we defined specific criteria around the intended user experience? Are we continuously monitoring it?
- Are we collecting the actual numbers rather than just summary reports? Raw data will always be needed for diagnosis.
- Have we minimized false-positives in the alerting system?
- Are we analyzing trends on key metrics?
- Is the system health highly visible at all times?
- Is the system continuously monitored?
- Can we support version roll-back? Is this tested and proven?
- Do we support both forward and backward compatibility on every change?
- Can we deploy on a single server to support dev and test?
- Have we run stress tests?
- Do we have a process in place to catch performance and capacity degradations in new releases?
- Are we running tests using real data?
- Do we have (and run) system-level acceptance tests?
- Do we have an environment that lets us test at scale, with the same data collection and mining techniques used in production?
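
To make the timeout, bounded-retry, and circuit-breaker items concrete, here is a minimal combined sketch. The `Breaker` type and its thresholds are assumptions for illustration; a real implementation (or an off-the-shelf library) would add jittered backoff and richer half-open behavior. Note the call being retried must be idempotent, since a timed-out attempt may still have taken effect.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open, so callers fail fast
// instead of piling more load onto a struggling dependency.
var ErrOpen = errors.New("circuit open: failing fast")

// Breaker is a minimal circuit breaker: after maxFails consecutive
// failed calls it opens, and it stays open until cooldown elapses.
type Breaker struct {
	mu       sync.Mutex
	fails    int
	maxFails int
	cooldown time.Duration
	openedAt time.Time
}

// Call runs fn with a per-attempt deadline and a bounded number of
// attempts; every retry is logged so it can be reported and monitored.
// fn must be idempotent: a timed-out attempt may still have succeeded.
func (b *Breaker) Call(ctx context.Context, attempts int,
	timeout time.Duration, fn func(ctx context.Context) error) error {

	b.mu.Lock()
	open := b.fails >= b.maxFails && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrOpen
	}

	var err error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			log.Printf("retry %d after error: %v", i, err) // report every retry
		}
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		err = fn(attemptCtx)
		cancel()
		if err == nil {
			b.mu.Lock()
			b.fails = 0 // success closes the breaker
			b.mu.Unlock()
			return nil
		}
	}

	b.mu.Lock()
	b.fails++
	if b.fails >= b.maxFails {
		b.openedAt = time.Now() // open (or re-open) the breaker
	}
	b.mu.Unlock()
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	b := &Breaker{maxFails: 3, cooldown: 30 * time.Second}
	err := b.Call(context.Background(), 3, 2*time.Second,
		func(ctx context.Context) error {
			// Hypothetical downstream call; a real one would honor ctx.
			return errors.New("downstream unavailable")
		})
	log.Println(err)
}
```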