@lichti
Forked from acolyer/service-checklist.md
Created September 22, 2016 18:01
Internet Scale Services Checklist

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

Basic tenets

  • Does the design expect failures to happen regularly and handle them gracefully?
  • Have we kept things as simple as possible?
  • Have we automated everything?
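
As one illustration of the first tenet, the sketch below treats a dependency failure as a normal event: it retries a few times with backoff, then degrades to a fallback instead of raising. The names `call_with_fallback`, `primary`, and `fallback`, and the retry parameters, are hypothetical and not taken from Hamilton's paper; this is a minimal sketch, not a recommended production policy.

```python
import random
import time


def call_with_fallback(primary, fallback, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call `primary`; after `attempts` failures, return `fallback()` instead.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Broad catch keeps the sketch short; real code would catch
            # only the errors that are known to be transient.
            if attempt < attempts - 1:
                # Exponential backoff with jitter so retries do not synchronize.
                sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    # Degrade gracefully: serve a reduced answer rather than fail the request.
    return fallback()
```

A caller might pass a live lookup as `primary` and a cached or default value as `fallback`, so the service keeps answering in reduced capacity while the dependency is down.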

Overall Application Design & Development

  • Can the service survive failure without human administrative interaction?
  • Are failure paths frequently tested?
  • Have we documented all conceivable component failure modes and combinations thereof?
  • Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine whether the risk is acceptable?
  • Are we targeting commodity hardware? (That is, our design does not require special hardware)
  • Are we hosting all users on a single version of the software?
  • Can we support multi-tenancy without physical isolation?
  • Have we implemented (and automated) a quick service health check?
  • Do our developers work in the full environment? (Requires single-server deployment)
  • Can we continue to operate in reduced capacity if services (components) we depend on fail?
  • Does our design eliminate code redundancy across services/components?
  • Can our pods/clusters of services continue to operate independently of each other?
  • For rare emergency human intervention, have we worked with operations to come up with recovery plans, and documented, scripted, and tested them?
  • Does each of our complexity-adding optimizations (if any) give at least an order of magnitude improvement?
  • Have we enforced admission control at all levels?
  • Can we partition the service, and is that partitioning infinitely adjustable and fine-grained?
  • Have we understood the network design and reviewed it with networking specialists?
  • Have we analysed throughput and latency and determined the most important metric for capacity planning?
  • Do all of our operations utilities follow the same code review, source code control, and testing practices as the rest of the code base?
  • Have we understood the load this service will put on any backend store / services? Have we measured and validated this load?
  • Is everything versioned? The goal is to run single-version software, but multiple versions will always exist during rollout, testing, and similar activities. Versions n and n+1 of all components need to peacefully co-exist.
  • Have we avoided single points of failure?
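
The admission control item above can be made concrete with a small sketch. It caps the number of requests in flight and rejects the excess immediately rather than queueing it, so an overloaded component sheds load instead of collapsing. `AdmissionController`, `handle`, and the `"rejected"` return value are hypothetical names chosen for illustration; a real service would apply a check like this at each tier and pick limits from measured capacity.

```python
import threading


class AdmissionController:
    """Admit at most `max_in_flight` concurrent requests (illustrative only)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking acquire: returns False at once when no slot is free,
        # so callers can shed load instead of waiting.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(controller, work):
    """Run `work` if admitted; otherwise report a rejection."""
    if not controller.try_admit():
        return "rejected"  # a web service might map this to HTTP 503
    try:
        return work()
    finally:
        # Always free the slot, even if `work` raises.
        controller.release()
```

Rejecting early keeps latency bounded for the requests that are admitted, which is the reason to prefer it over an unbounded queue in this sketch.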