- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work overflow to the Dev team.
- Cap SRE operational load at 50 percent.
- Share 5 percent of Ops work with the Dev team.
- Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
- Aim for a maximum of two events per oncall shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
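
The SLA and Error Budget items above can be illustrated with a little arithmetic. This is a minimal sketch, not something from the talk: the function names, the request-count inputs, and the "launch only while budget remains" rule are all assumptions chosen for illustration.

```python
def error_budget_remaining(sla_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a period.

    Hypothetical helper: sla_target is the promised success ratio
    (e.g. 0.999). The result is 1.0 when nothing has failed, 0.0 when
    the budget is exactly used up, and negative when it is overspent.
    """
    allowed_failures = (1.0 - sla_target) * total_requests
    if allowed_failures <= 0:
        # A 100 percent target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def launch_allowed(sla_target: float, total_requests: int,
                   failed_requests: int) -> bool:
    """Assumed gating rule: launches proceed only while budget remains."""
    return error_budget_remaining(sla_target, total_requests,
                                  failed_requests) > 0.0


if __name__ == "__main__":
    # 99.9 percent target over 1,000,000 requests allows about 1,000 failures.
    print(launch_allowed(0.999, 1_000_000, 400))    # budget left
    print(launch_allowed(0.999, 1_000_000, 1_500))  # budget overspent
```

Measuring and reporting against the SLA supplies the two request counts; the gate itself is then a single comparison.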
- Postmortems are how we learn from outages.
- Each outage educates the team.
- The team then educates the entire company.
- The net result: The organization gets smarter over time.
- We must create an environment where people have time to write a postmortem for every event.
- Collect a time-series of measurements.
- Up/down monitoring == bad:
  - You get alerted when the system is down.
  - The outage is "built in".
- Time-series == good:
  - Alerting when "sick", not "down".
  - Prevents outages.
- Long-term history needed for planning.
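
To make the "sick, not down" idea concrete, here is a small sketch that fits a trend to a time series and raises an alert before the resource is exhausted. Disk usage, the 24-hour horizon, and the names `hours_until_full` and `is_sick` are assumptions for illustration; the talk does not prescribe a specific metric or algorithm.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (hours since start, fraction of disk used)


def hours_until_full(samples: Sequence[Sample]) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches 1.0.

    Fits a least-squares line through the samples to get the growth
    rate, then extrapolates from the most recent reading. Returns None
    when there are too few samples or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (1.0 - last_used) / slope)


def is_sick(samples: Sequence[Sample], horizon_hours: float = 24.0) -> bool:
    """Alert while the system is still up but trending toward failure."""
    eta = hours_until_full(samples)
    return eta is not None and eta <= horizon_hours


if __name__ == "__main__":
    growing = [(0, 0.70), (1, 0.72), (2, 0.74), (3, 0.76)]
    # An up/down check would report "up" here: the disk is only 76 percent full.
    print(hours_until_full(growing), is_sick(growing))
```

The same stored samples that drive the alert also provide the long-term history used for planning.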
Notes from Case Study: Adopting SRE Principles at StackOverflow by Tom Limoncelli, Stack Exchange, Inc.