@wataruoguchi
Created August 16, 2016 05:39

Book club notes: what I learnt from the book

Part 3: Building Long-Term Value

9. Minimize Operational Burden

Instagram's studies

  • The ratio of users to employees was over 3 million to one
  • The team consisted of no more than five engineers in early years
  • The team picked proven and solid technologies instead of shiny or sexy new ones

In this chapter:

  • Examine strategies for minimizing operational burden
  • Analyze Instagram's core mantra -- do the simple thing first
  • Walk through the importance of relentlessly automating mechanical tasks
  • Talk about how making automation idempotent reduces recurring costs
  • Close with why we should practice and develop our ability to recover quickly

Embrace Operational Simplicity

When you first start off trying to solve a problem, the first solutions you come up with are very complex, and most people stop there. But if you keep going, and live with the problem and peel more layers of the onion off, you can oftentimes arrive at some very elegant and simple solutions. Most people just don't put in the time or energy to get there. -- Steve Jobs

An overly complex architecture imposes a maintenance cost

  • Engineering expertise gets splintered across multiple systems
  • Increased complexity introduces more potential single points of failure
  • New engineers face a steeper learning curve when learning and understanding the new systems
  • Effort towards improving abstractions, libraries, and tools gets diluted across the different systems

Instagram and Pinterest's studies

  • Do other team members have experience with the language?
  • Is it easy to pick up?
  • Will it be hard to hire engineers fluent in it?
  • Do your research before adopting a new data store in place of a proven one such as MySQL or PostgreSQL
  • Consider whether the data is actually large enough to need a cluster

Build Systems to Fail Fast

Fail Fast: In a system that fails fast, when a problem occurs, it fails immediately and visibly. Failing fast is a nonintuitive technique: 'failing immediately and visibly' sounds like it would make your software more fragile, but it actually makes it more robust. Bugs are easier to find and fix, so fewer go into production.

Examples of failing fast include:

  • Crashing at startup time when encountering configuration errors
  • Validating software inputs, particularly if they won't be consumed until much later
  • Bubbling up an error from an external service that you don't know how to handle, rather than swallowing it
  • Throwing an exception as soon as possible when certain modifications to a data structure, like a collection, would render dependent data structures, like an iterator, unusable
  • Throwing an exception if key data structures have been corrupted rather than propagating that corruption further within the system
  • Asserting that key invariants hold before or after complex logic flows and attaching sufficiently descriptive failure messages
  • Alerting engineers about any invalid or inconsistent program state as early as possible
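
As a minimal sketch of the first two examples above (crashing at startup on configuration errors and validating inputs early), the following assumes a service configured through environment-style key/value pairs. The names `ConfigError`, `load_config`, `DB_URL`, and `PORT` are hypothetical and chosen only for illustration.

```python
class ConfigError(Exception):
    """Raised at startup when configuration is missing or invalid."""


def load_config(env):
    """Validate every setting up front and fail on the first problem.

    `env` is any mapping of setting names to strings, e.g. os.environ.
    The setting names below are hypothetical.
    """
    if "DB_URL" not in env:
        raise ConfigError("DB_URL is required but was not set")

    raw_port = env.get("PORT", "8080")
    try:
        port = int(raw_port)
    except ValueError:
        raise ConfigError(f"PORT must be an integer, got {raw_port!r}") from None
    if not 1 <= port <= 65535:
        raise ConfigError(f"PORT must be in 1..65535, got {port}")

    return {"db_url": env["DB_URL"], "port": port}
```

Calling `load_config(os.environ)` as the first step of startup means a bad setting stops the process immediately with a descriptive message, instead of surfacing later as an obscure connection failure.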

Relentlessly Automate Mechanical Tasks

The judgement call to decide whether to automate: Will I save more time overall by manually doing a particular task or by paying the upfront cost of automating the process?
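
One rough way to make that judgement call concrete is a break-even estimate. The function below is a sketch with hypothetical inputs; it ignores the cost of maintaining the automation and the benefit of fewer manual mistakes.

```python
def automation_break_even_months(manual_minutes, runs_per_month, automation_hours):
    """Return how many months it takes for automation to pay for itself."""
    hours_saved_per_month = manual_minutes * runs_per_month / 60
    return automation_hours / hours_saved_per_month


# A 15-minute task done 20 times a month costs 5 hours a month,
# so a 10-hour automation effort breaks even after 2 months.
print(automation_break_even_months(15, 20, 10))  # 2.0
```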

Automation can help:

  • Validating that a piece of code, an interaction, or a system behaves as expected
  • Extracting, transforming, and summarizing data
  • Detecting spikes in the error rate
  • Building and deploying software to new machines
  • Capturing and restoring database snapshots
  • Periodically running batch computations
  • Restarting a web service
  • Checking code to ensure it conforms to style guidelines
  • Training a machine learning model
  • Managing user accounts or user data
  • Adding or removing a server to or from a group of services
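
To illustrate one item from the list, detecting spikes in the error rate, here is a deliberately simple sketch. The threshold rule (recent rate above a multiple of a baseline) and the parameter names are assumptions for illustration, not a method taken from the book.

```python
def is_error_spike(recent_errors, recent_requests, baseline_rate,
                   factor=3.0, min_requests=100):
    """Return True when the recent error rate exceeds `factor` times the baseline.

    Windows with fewer than `min_requests` requests are ignored, because a
    handful of errors on very little traffic is too noisy to alert on.
    """
    if recent_requests < min_requests:
        return False
    return recent_errors / recent_requests > factor * baseline_rate
```

A check like this could run on a schedule against request counters and page an engineer only when it returns `True`.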

Make Batch Processes Idempotent

Benefits of an idempotent process

  • It produces the same results regardless of whether it's run once or multiple times
  • It can be run more frequently than strictly necessary, which exposes problems sooner (e.g., a monthly report)
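
A minimal sketch of an idempotent batch job, assuming a hypothetical monthly revenue report: the result is stored under a key derived from the reporting period, so a re-run overwrites the same entry rather than appending a duplicate. The record shape and the dict used as `report_store` are stand-ins for a real data source and database table.

```python
def run_monthly_report(orders, month, report_store):
    """Compute total revenue for `month` and upsert it into `report_store`.

    Running this once or many times for the same month leaves
    `report_store` in the same state.
    """
    total_cents = sum(o["amount_cents"] for o in orders if o["month"] == month)
    report_store[month] = total_cents  # keyed write, not an append
    return total_cents


orders = [
    {"month": "2016-07", "amount_cents": 1200},
    {"month": "2016-07", "amount_cents": 800},
    {"month": "2016-08", "amount_cents": 500},
]
store = {}
run_monthly_report(orders, "2016-07", store)
run_monthly_report(orders, "2016-07", store)  # safe to retry
print(store)  # {'2016-07': 2000}
```

Because a retry is harmless, the same job can be scheduled daily instead of monthly, and a failed run can simply be started again.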

Hone Your Ability to Respond and Recover Quickly

Netflix's Chaos Monkey - randomly kills services

  • Kills services on weekdays during regular work hours
  • Engineers can identify architectural weaknesses
  • The best defense against major unexpected failures is to fail often
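
The idea can be sketched in a few lines. This is not Netflix's implementation: the 09:00 to 17:00 window, the function names, and the list of instance names are all assumptions, and actually terminating the chosen instance is left out.

```python
import random
from datetime import datetime


def in_business_hours(now):
    """True Monday to Friday between 09:00 and 17:00 (an assumed window)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances, now, rng=random):
    """Return one instance to terminate, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


victim = pick_victim(["web-1", "web-2", "worker-1"], datetime.now())
```

Restricting the selection to working hours is the point of the first bullet: failures are induced when engineers are at their desks and able to respond.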

Scripting for success - Bill Walsh

  • Scripting responses to likely scenarios ahead of time helps decision making under pressure

Many other companies also assume that the unexpected and the undesired will happen

  • Google runs annual, multi-day Disaster Recovery Testing (DiRT) events
  • Dropbox often simulates additional load for their production systems

Ask "What if...?"

  • What if a critical bug gets deployed as part of a release? How quickly can we roll it back or respond with a fix, and can we shorten that window?
  • What if a database server fails? How do we fail over to another machine and recover any lost data?
  • What if our servers get overloaded? How can we scale up to handle the increased traffic or shed load so that we respond correctly to at least some of the requests?
  • What if our testing or staging environments get corrupted? How would we bring up a new one?
  • What if a customer reports an urgent issue? How long would it take customer support to notify engineering? How long for engineering to follow up with a fix?
  • What if a manager or other stakeholder at an infrequent review meeting raises objections about the product plan? What questions might they ask, and how might we respond?
  • What if a critical team member gets sick or injured, or leaves? How can we share knowledge so that the team continues to function?
  • What if users revolt over a new and controversial feature? What is our stance and how quickly can we respond?
  • What if a project slips past a promised deadline? How might we predict the slippage early, recover, and respond?
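
For the overload question above, one possible way to "shed load so that we respond correctly to at least some of the requests" is to cap the number of requests handled at once and reject the rest quickly. The `LoadShedder` class below is a hypothetical sketch of that idea, not a technique prescribed by the book.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests and reject the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False  # caller should reply quickly, e.g. with HTTP 503
            self.in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            self.in_flight -= 1
```

A request handler would call `try_acquire()` first, return an error immediately when it gets `False`, and call `release()` in a `finally` block once an admitted request completes.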

Key Takeaways

  • Do the simple thing first
  • Fail fast to pinpoint the source of errors
  • Automate mechanics over decision-making
  • Aim for idempotence and reentrancy
  • Plan and practice failure modes