# Internet Scale Services Checklist

A checklist for designing and developing internet-scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

* http://mvdirona.com/jrh/talksandpapers/jamesrh_lisa.pdf

## Basic tenets

- [ ] Does the design expect failures to happen regularly and handle them gracefully?
- [ ] Have we kept things as simple as possible?
- [ ] Have we automated everything?

## Overall Application Design & Development

- [ ] Can the service survive failure without human administrative interaction?
- [ ] Are failure paths frequently tested?
- [ ] Have we documented all conceivable component failure modes and combinations thereof?
- [ ] Does our design tolerate these failure modes? If not, have we undertaken a risk assessment to determine that the risk is acceptable?
- [ ] Are we targeting commodity hardware? (That is, our design does not require special hardware.)
- [ ] Are we hosting all users on a single version of the software?
- [ ] Can we support multi-tenancy without physical isolation?
- [ ] Have we implemented (and automated) a quick service health check?
- [ ] Do our developers work in the full environment? (Requires single-server deployment.)
- [ ] Can we continue to operate at reduced capacity if services (components) we depend on fail?
- [ ] Does our design eliminate code redundancy across services/components?
- [ ] Can our pods/clusters of services continue to operate independently of each other?
- [ ] For rare emergency human intervention, have we worked with operations to come up with recovery plans, and documented, scripted, and tested them?
- [ ] Does each of our complexity-adding optimizations (if any) give at least an order of magnitude improvement?
- [ ] Have we enforced admission control at all levels?
- [ ] Can we partition the service, and is that partitioning infinitely adjustable and fine-grained?
- [ ] Have we understood the network design and reviewed it with networking specialists?
- [ ] Have we analysed throughput and latency and determined the most important metric for capacity planning?
- [ ] Do all of our operations utilities follow the same code review, source code control, testing, etc. as the rest of the code base?
- [ ] Have we understood the load this service will put on any backend stores and services? Have we measured and validated this load?
- [ ] Is everything versioned? The goal is to run single-version software, but multiple versions will always exist during rollout, testing, etc. Versions n and n+1 of all components need to coexist peacefully.
- [ ] Have we avoided single points of failure?
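
One way to picture the versioning item above (versions n and n+1 coexisting during a rollout) is a message format where older readers ignore fields they do not know and newer readers supply defaults for fields that older writers omit. The following is a minimal sketch under assumptions, not a prescribed design: the message shape, the `region` field, the `DEFAULT_REGION` value, and all function names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schemas for illustration only:
#   version 1 (n):   {"version", "user", "action"}
#   version 2 (n+1): adds an optional "region" field
DEFAULT_REGION = "unknown"  # assumed default, not from the checklist


def encode_v1(user, action):
    """Writer at version n."""
    return json.dumps({"version": 1, "user": user, "action": action})


def encode_v2(user, action, region):
    """Writer at version n+1: same fields plus "region"."""
    return json.dumps(
        {"version": 2, "user": user, "action": action, "region": region}
    )


def decode_as_v1(payload):
    """Reader at version n: keep the fields it knows, ignore the rest."""
    data = json.loads(payload)
    return {"user": data["user"], "action": data["action"]}


def decode_as_v2(payload):
    """Reader at version n+1: default any field an older writer omitted."""
    data = json.loads(payload)
    return {
        "user": data["user"],
        "action": data["action"],
        "region": data.get("region", DEFAULT_REGION),
    }


if __name__ == "__main__":
    # During rollout, both directions of the mix must keep working.
    print(decode_as_v1(encode_v2("ada", "login", "eu")))
    print(decode_as_v2(encode_v1("ada", "login")))
```

The design choice here is that n+1 only adds optional fields and never removes or renames existing ones, so either version can read the other's messages while both are deployed.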