# Book Club Notes: What I Learnt from the Book

Part 3: Building Long-Term Value

9. Minimize Operational Burden

Instagram's studies

The ratio of users to employees was over 3 million to one
The team consisted of no more than five engineers in its early years
The team picked proven and solid technologies instead of shiny or sexy new ones

In this chapter:

Examine strategies for minimizing operational burden
Analyze Instagram's core mantra -- do the simple thing first
Walk through the importance of relentlessly automating mechanical tasks
Talk about how making automation idempotent reduces recurring costs
Close with why we should practice and develop our ability to recover quickly

Embrace Operational Simplicity

When you first start off trying to solve a problem, the first solutions you come up with are very complex, and most people stop there. But if you keep going, and live with the problem and peel more layers of the onion off, you can oftentimes arrive at some very elegant and simple solutions. Most people just don't put in the time or energy to get there. -- Steve Jobs

An overly complex architecture imposes a maintenance cost

Engineering expertise gets splintered across multiple systems
Increased complexity introduces more potential single points of failure
New engineers face a steeper learning curve when learning and understanding the new systems
Effort towards improving abstractions, libraries, and tools gets diluted across the different systems

Instagram and Pinterest's studies

Do other team members have experience with the language?
Is it easy to pick up?
Will it be hard to hire engineers fluent in it?
Do your research before adopting a newer data store in place of a proven one such as MySQL or PostgreSQL
Consider whether the data is large enough to justify running a cluster

Build Systems to Fail Fast

Fail Fast: In a system that fails fast, when a problem occurs, it fails immediately and visibly. Failing fast is a nonintuitive technique: 'failing immediately and visibly' sounds like it would make your software more fragile, but it actually makes it more robust. Bugs are easier to find and fix, so fewer go into production.

Examples of failing fast include:

Crashing at startup time when encountering configuration errors
Validating software inputs, particularly if they won't be consumed until much later
Bubbling up an error from an external service that you don't know how to handle, rather than swallowing it
Throwing an exception as soon as possible when certain modifications to a data structure, like a collection, would render dependent data structures, like an iterator, unusable
Throwing an exception if key data structures have been corrupted rather than propagating that corruption further within the system
Asserting that key invariants hold before or after complex logic flows and attaching sufficiently descriptive failure messages
Alerting engineers about any invalid or inconsistent program state as early as possible
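
The first two examples above (crashing at startup on configuration errors and validating inputs early) can be sketched as follows. This is a minimal illustration, not code from the book: `ServerConfig`, `ConfigError`, `load_config`, and the setting names are all hypothetical.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at startup when the configuration is unusable."""


@dataclass(frozen=True)
class ServerConfig:
    host: str
    port: int
    max_connections: int


def load_config(raw: dict) -> ServerConfig:
    """Validate every setting up front and raise on the first problem found."""
    # Report all missing keys at once so the fix takes a single iteration.
    missing = [key for key in ("host", "port", "max_connections") if key not in raw]
    if missing:
        raise ConfigError(f"missing required settings: {', '.join(missing)}")

    port = raw["port"]
    if not isinstance(port, int) or not 1 <= port <= 65535:
        raise ConfigError(f"port must be an integer in 1..65535, got {port!r}")

    max_connections = raw["max_connections"]
    if not isinstance(max_connections, int) or max_connections <= 0:
        raise ConfigError(
            f"max_connections must be a positive integer, got {max_connections!r}"
        )

    return ServerConfig(host=str(raw["host"]), port=port, max_connections=max_connections)
```

Calling `load_config` once at process start means a bad value such as `"port": "8080"` stops the service immediately with a descriptive message, instead of surfacing later when the first connection is opened.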

Relentlessly Automate Mechanical Tasks

The judgement call to decide whether to automate: Will I save more time overall by manually doing a particular task or by paying the upfront cost of automating the process?
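
One rough way to frame that judgement call is as a break-even count. The sketch below assumes the costs can be estimated in minutes; the function name and parameters are hypothetical, and the numbers in the usage note are made up.

```python
def automation_break_even(
    manual_minutes: float,
    automation_minutes: float,
    automated_run_minutes: float = 0.0,
) -> float:
    """Return how many runs it takes for automation to pay for itself.

    manual_minutes:        time to do the task by hand once
    automation_minutes:    one-off cost of building the automation
    automated_run_minutes: time each automated run still costs a person
    """
    saved_per_run = manual_minutes - automated_run_minutes
    if saved_per_run <= 0:
        # Automating saves nothing per run, so it never breaks even.
        return float("inf")
    return automation_minutes / saved_per_run
```

For example, `automation_break_even(manual_minutes=15, automation_minutes=240)` returns `16.0`: a 15-minute manual task that takes four hours to automate pays off after 16 runs. The estimate ignores benefits that are harder to measure, such as fewer manual mistakes.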

Automation can help:

Validating that a piece of code, an interaction, or a system behaves as expected
Extracting, transforming, and summarizing data
Detecting spikes in the error rate
Building and deploying software to new machines
Capturing and restoring database snapshots
Periodically running batch computations
Restarting a web service
Checking code to ensure it conforms to style guidelines
Training a machine learning model
Managing user accounts or user data
Adding or removing a server to or from a group of services

Make Batch Processes Idempotent

Benefits of an idempotent process

It produces the same results regardless of whether it's run once or multiple times
It lets you run infrequent processes more often than strictly necessary, to expose problems sooner (e.g., a monthly report)
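
A minimal sketch of an idempotent batch job, assuming a monthly report stored in SQLite. The table names, columns, and sample rows are hypothetical. The key idea is that the job overwrites the row for its month instead of appending, so a rerun leaves the same result.

```python
import sqlite3


def build_monthly_report(conn: sqlite3.Connection, month: str) -> None:
    """Recompute the report for `month` and overwrite any earlier result."""
    total = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE month = ?", (month,)
    ).fetchone()[0]
    with conn:  # commits on success, rolls back if an exception is raised
        # The primary key on `month` makes this a replace, not a second row.
        conn.execute(
            "INSERT OR REPLACE INTO monthly_report (month, total) VALUES (?, ?)",
            (month, total),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (month TEXT, amount INTEGER)")
conn.execute("CREATE TABLE monthly_report (month TEXT PRIMARY KEY, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01", 10), ("2024-01", 5), ("2024-02", 7)],
)

build_monthly_report(conn, "2024-01")
build_monthly_report(conn, "2024-01")  # rerunning leaves the report unchanged
rows = conn.execute("SELECT month, total FROM monthly_report").fetchall()
```

Because a rerun is harmless, the same job could be scheduled daily rather than monthly, which is what surfaces breakage early.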

Hone Your Ability to Respond and Recover Quickly

Netflix's Chaos Monkey - randomly kills services

Kills services on weekdays during regular work hours
Engineers can identify architectural weaknesses
The best defense against major unexpected failures is to fail often
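
The scheduling idea can be sketched in a few lines. This is an illustration only, not Netflix's implementation: the 09:00 to 17:00 window, the function names, and the service names are assumptions, and the function only picks a target rather than terminating anything.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between 09:00 and 17:00 (assumed window)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(
    services: List[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Choose one service to terminate, or None outside business hours."""
    if not services or not in_business_hours(now):
        return None
    return rng.choice(services)


rng = random.Random()
victim = pick_victim(["api", "search", "billing"], datetime.now(), rng)
if victim is not None:
    print(f"would terminate: {victim}")
```

Restricting failures to working hours means engineers are at their desks to observe and fix whatever weakness is exposed.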

Scripting for success - Bill Walsh

Scripting responses ahead of time helps decision making under pressure

Many other companies also assume that the unexpected and the undesired will happen

Google runs annual, multi-day Disaster Recovery Testing (DiRT) events
Dropbox often simulates additional load for their production systems

Ask "What If?"

What if a critical bug gets deployed as part of a release? How quickly can we roll it back or respond with a fix, and can we shorten that window?
What if a database server fails? How do we fail over to another machine and recover any lost data?
What if our servers get overloaded? How can we scale up to handle the increased traffic or shed load so that we respond correctly to at least some of the requests?
What if our testing or staging environments get corrupted? How would we bring up a new one?
What if a customer reports an urgent issue? How long would it take customer support to notify engineering? How long for engineering to follow up with a fix?
What if a manager or other stakeholder at an infrequent review meeting raises objections about the product plan? What questions might they ask, and how might we respond?
What if a critical team member gets sick or injured, or leaves? How can we share knowledge so that the team continues to function?
What if users revolt over a new and controversial feature? What is our stance and how quickly can we respond?
What if a project slips past a promised deadline? How might we predict the slippage early, recover, and respond?

Key Takeaways

Do the simple thing first
Fail fast to pinpoint the source of errors
Automate mechanics over decision-making
Aim for idempotence and reentrancy
Plan and practice failure modes

wataruoguchi/minimizeOperationalBurden.md
