@ryansmith3136
Last active December 14, 2015 06:39
Predictable Failure - Waza 2013

Predictable Failure

Building systems that fail in predictable ways promotes safety and sanity.

Original Audience: Heroku Waza 2013

About

Things I have worked on

  • The Heroku API
  • Message queues (e.g. queue_classic)
  • Heroku's usage & billing system
  • DNS at Heroku
  • Log delivery and visibility tooling (e.g. l2met, log-shuttle)

British Airways Flight 9

  • "I don't believe it – all four engines have failed!"
  • The 747 is designed to fly on a single engine.
  • Pilots are trained to treat the report of a four-engine failure as an instrumentation failure.

My takeaway from the BA9 incident

  • Failure was treated as an improbability.
  • What if the system was designed such that failure is an option?

How this applies to software

  • Your system will fail; it is important to comprehend the consequences.

Categories of applications

  1. Critical, can't go down (e.g. HTTP routing)
  2. Critical, but can be down for 1 hour (e.g. visibility systems)
  3. Critical, but can be down for 1 business day (e.g. data warehouse)
  4. Not critical (e.g. tweet aggregation)

What can you do?

  • Move more apps down the critical chain (toward the less critical categories) by re-architecting.
  • Redundancy

Challenges

  • Simplicity and reliability go hand in glove. Redundancy often introduces complexity.

Query a single database server.

func Query(c Conn, query string) Result {
    return c.DoQuery(query)
}

Query redundant database servers.

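// Query sends the same query to every server and returns the first result.
func Query(conns []Conn, query string) Result {
    // Buffered so the slower goroutines can still send and exit after we return.
    ch := make(chan Result, len(conns))
    for _, conn := range conns {
        go func(c Conn) {
            ch <- c.DoQuery(query)
        }(conn)
    }
    return <-ch // first response wins
}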

How many redundancies do you need?

Look out for infinite regress!

At some point, you need to introduce the component that just works correctly.

Maybe that component is a human?

Whoops. I forgot about independence.

  • Redundant systems can fail on the same input.
  • E.g. Ariane 5 Flight 501
  • Design diversity: same function, different implementations.
  • E.g. aircraft use pressure, radar, and GPS to measure altitude.
  • But all of them use the same type of display component? Fudge!
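
A rough sketch of how diverse sources might be combined: take the median, so one faulty source cannot drag the answer away. The function name and the readings are made up for illustration; this is not how any real avionics system works.

```go
package main

import (
	"fmt"
	"sort"
)

// medianAltitude returns the median of independent altitude readings.
// With three sources, a single wildly wrong reading is outvoted.
// The second return value is false when there are no readings at all.
func medianAltitude(readings []float64) (float64, bool) {
	n := len(readings)
	if n == 0 {
		return 0, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), readings...)
	sort.Float64s(sorted)
	if n%2 == 1 {
		return sorted[n/2], true
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2, true
}

func main() {
	// Hypothetical readings in meters: pressure, radar (faulty), GPS.
	alt, ok := medianAltitude([]float64{10010, 0, 9990})
	fmt.Println(alt, ok)
}
```

The vote only helps if the sources fail independently. Three copies of the same implementation would all report the same wrong number.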

Dormant Failures

  • Our backup server didn't even have the latest release of our software running.
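
One cheap guard against this kind of dormant failure is to compare what the standbys are running against the primary, continuously, long before a failover. A minimal sketch, assuming each server can report its release; `Node`, `staleStandbys`, and the release names are hypothetical.

```go
package main

import "fmt"

// Node is a hypothetical view of a server: its name and the release it runs.
type Node struct {
	Name    string
	Release string
}

// staleStandbys returns the standbys whose release differs from the primary's.
func staleStandbys(primary Node, standbys []Node) []Node {
	var stale []Node
	for _, s := range standbys {
		if s.Release != primary.Release {
			stale = append(stale, s)
		}
	}
	return stale
}

func main() {
	primary := Node{Name: "primary", Release: "v42"}
	standbys := []Node{
		{Name: "backup-1", Release: "v42"},
		{Name: "backup-2", Release: "v37"},
	}
	for _, s := range staleStandbys(primary, standbys) {
		fmt.Printf("%s runs %s, primary runs %s\n", s.Name, s.Release, primary.Release)
	}
}
```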

Propagation

  • Don't let your in-flight entertainment system interrupt your navigation system.
  • Don't let your logging/visibility system bring down your HTTP router.
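
In Go, one way to keep a stalled log pipeline from propagating into the request path is a non-blocking send: when the buffer is full, drop the line and count it. This is a sketch with invented names (`Logger`, `NewLogger`, `Log`, `Dropped`), not the design of any particular router or log shipper.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Logger buffers log lines for a background shipper to drain.
type Logger struct {
	dropped int64 // accessed atomically; kept first for 64-bit alignment
	lines   chan string
}

func NewLogger(capacity int) *Logger {
	return &Logger{lines: make(chan string, capacity)}
}

// Log never blocks the caller. If the buffer is full, the line is
// dropped and counted, and Log reports false.
func (l *Logger) Log(line string) bool {
	select {
	case l.lines <- line:
		return true
	default:
		atomic.AddInt64(&l.dropped, 1)
		return false
	}
}

// Dropped reports how many lines have been discarded so far.
func (l *Logger) Dropped() int64 {
	return atomic.LoadInt64(&l.dropped)
}

func main() {
	// Nothing drains the buffer here, which simulates a stalled log pipeline.
	logger := NewLogger(2)
	for i := 0; i < 5; i++ {
		logger.Log(fmt.Sprintf("request %d routed", i))
	}
	fmt.Println("dropped:", logger.Dropped())
}
```

The trade is explicit: the router loses log lines under pressure, but it keeps routing.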

Don't Worry 99.0 > 98.0

Sometimes you can find yourself lost in the weeds when focused on perfection. Better availability is sufficient for success.
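
To put the 99.0 versus 98.0 comparison into hours, here is a small calculation. It assumes a 365-day year and treats availability as a simple percentage of wall-clock time; `downtimeHours` is a name made up for this sketch.

```go
package main

import "fmt"

// hoursPerYear assumes a 365-day year (leap years ignored).
const hoursPerYear = 365 * 24

// downtimeHours converts an availability percentage into the expected
// hours of downtime per year.
func downtimeHours(availabilityPercent float64) float64 {
	return (100 - availabilityPercent) / 100 * hoursPerYear
}

func main() {
	for _, a := range []float64{98.0, 99.0} {
		fmt.Printf("%.1f%% available: about %.1f hours down per year\n", a, downtimeHours(a))
	}
}
```

Going from 98.0 to 99.0 cuts yearly downtime from roughly 175 hours to roughly 88, which is a real win without chasing perfection.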

Examples

  • Billing
  • DNS
  • Process Partitioning

links
