Building systems that fail in predictable ways promotes safety and sanity.
Original Audience: Heroku Waza 2013
- @ryandotsmith
- Engineer at Heroku for almost 3 years.
- The Heroku API
- Message queues (e.g. queue_classic)
- Heroku's usage & billing system
- DNS at Heroku
- Log delivery and visibility tooling (e.g. l2met, log-shuttle)
- "I don't believe it – all four engines have failed!"
- The 747 is designed to fly on a single engine.
- Pilots are trained to treat the report of a four-engine failure as an instrumentation failure.
- Failure was treated as an improbability.
- What if the system was designed such that failure is an option?
- Your system will fail; it is important to comprehend the consequences.
- Critical, can't go down (e.g. HTTP routing)
- Critical, but can be down for 1 hour (e.g. visibility systems)
- Critical, but can be down for 1 business day (e.g. data warehouse)
- Not critical (e.g. tweet aggregation)
- Move more apps down the criticality chain by re-architecting.
- Redundancy
- Simplicity and reliability go hand in glove. Redundancy often introduces complexity.
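// Single connection: if it fails, the query fails.
func Query(c Conn, query string) Result {
    return c.DoQuery(query)
}

// Redundant connections: ask every replica and return the first answer.
func Query(conns []Conn, query string) Result {
    ch := make(chan Result, len(conns)) // buffered, so slower replicas never block on send
    for _, conn := range conns {
        go func(c Conn) {
            ch <- c.DoQuery(query)
        }(conn)
    }
    return <-ch
}

Look out for infinite regress!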
Maybe that component is a human?
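
The redundant `Query` above waits forever if no replica answers, and it treats a fast failure the same as a fast success. A minimal sketch of one way to bound that, assuming a `Conn` whose `DoQuery` accepts a `context.Context` and returns an error (both are assumptions, not part of the original snippet; `fakeConn` and `QueryFirst` are hypothetical names used only for illustration):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Result and Conn stand in for the types used by Query above (assumed shapes).
type Result string

type Conn interface {
	DoQuery(ctx context.Context, query string) (Result, error)
}

// fakeConn is a hypothetical replica that answers after a delay, or fails.
type fakeConn struct {
	name  string
	delay time.Duration
	fail  bool
}

func (f fakeConn) DoQuery(ctx context.Context, query string) (Result, error) {
	select {
	case <-time.After(f.delay):
	case <-ctx.Done():
		return "", ctx.Err()
	}
	if f.fail {
		return "", errors.New(f.name + ": query failed")
	}
	return Result(f.name + ": " + query), nil
}

type reply struct {
	res Result
	err error
}

// QueryFirst sends the query to every replica and returns the first success.
// It returns an error if every replica fails or the timeout expires.
func QueryFirst(conns []Conn, query string, timeout time.Duration) (Result, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // signals the slower replicas to stop working

	ch := make(chan reply, len(conns)) // buffered so no goroutine blocks on send
	for _, conn := range conns {
		go func(c Conn) {
			res, err := c.DoQuery(ctx, query)
			ch <- reply{res, err}
		}(conn)
	}

	var lastErr error = errors.New("no connections")
	for range conns {
		select {
		case r := <-ch:
			if r.err == nil {
				return r.res, nil
			}
			lastErr = r.err // remember the failure and wait for another replica
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

func main() {
	conns := []Conn{
		fakeConn{name: "a", delay: 5 * time.Millisecond, fail: true},
		fakeConn{name: "b", delay: 20 * time.Millisecond},
		fakeConn{name: "c", delay: 200 * time.Millisecond},
	}
	res, err := QueryFirst(conns, "SELECT 1", time.Second)
	fmt.Println(res, err)
}
```

Note how much code the timeout and error paths add compared with the single-connection version: that is the complexity cost of redundancy mentioned earlier.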
- Redundant systems that fail on the same input.
- E.g. Ariane 5 Flight 501
- Design diversity: same function, different implementations.
- E.g. Aircraft use pressure, radar, and GPS to measure altitude.
- But what if all of them use the same type of display component? Fudge!
- Our backup server didn't even have the latest release of our software running.
- Don't let your in-flight entertainment system interrupt your navigation system.
- Don't let your logging/visibility system bring down your HTTP router.
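
One way to keep logging from taking down the router is to make the log call non-blocking, so a stalled log pipeline costs dropped lines instead of stalled requests. The sketch below is an illustration under assumptions, not how Heroku's router or log-shuttle actually work; `Logger`, `NewLogger`, and `handleRequest` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Logger is a hypothetical log shipper with a bounded in-memory buffer.
type Logger struct {
	dropped uint64 // kept first so atomic access is aligned on 32-bit platforms
	lines   chan string
}

func NewLogger(size int) *Logger {
	return &Logger{lines: make(chan string, size)}
}

// Log never blocks: when the buffer is full, the line is dropped and counted.
func (l *Logger) Log(line string) bool {
	select {
	case l.lines <- line:
		return true
	default:
		atomic.AddUint64(&l.dropped, 1)
		return false
	}
}

// Dropped reports how many lines were discarded because the buffer was full.
func (l *Logger) Dropped() uint64 {
	return atomic.LoadUint64(&l.dropped)
}

// handleRequest stands in for the router's hot path.
func handleRequest(l *Logger, path string) string {
	l.Log("request " + path) // best effort; must not delay the response
	return "200 OK " + path
}

func main() {
	// A tiny buffer with nothing draining it simulates a log system that is down.
	logger := NewLogger(2)
	for i := 0; i < 5; i++ {
		fmt.Println(handleRequest(logger, fmt.Sprintf("/page/%d", i)))
	}
	fmt.Println("dropped log lines:", logger.Dropped())
}
```

Every request still gets its response; the failure shows up only as a dropped-line counter, which is itself something to alert on.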
Sometimes you can find yourself lost in the weeds when focused on perfection. Better availability is sufficient for success.
- Billing
- DNS
- Process Partitioning

