Site Reliability Practices

Hire only coders.
Have an SLA for your service.
Measure and report performance against the SLA.
Use Error Budgets and gate launches on them.
Have a common staffing pool for SRE and Developers.
Have excess Ops work overflow to the Dev team.
Cap SRE operational load at 50 percent.
Share 5 percent of Ops work with the Dev team.
Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
Aim for a maximum of two events per oncall shift.
Do a postmortem for every event.
Postmortems are blameless and focus on process and technology, not people.
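
The error-budget item above can be sketched in a few lines. This is a minimal illustration, not the method from the talk: it assumes the SLA is expressed as a success ratio over a request count, and the names `error_budget_remaining` and `launch_allowed` are hypothetical.

```python
def error_budget_remaining(sla_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return how many more failures the SLA tolerates in this period.

    Assumption: sla_target is a success ratio such as 0.999, so the
    budget is the share of requests that are allowed to fail.
    """
    allowed_failures = (1.0 - sla_target) * total_requests
    return allowed_failures - failed_requests


def launch_allowed(sla_target: float, total_requests: int,
                   failed_requests: int) -> bool:
    """Gate a launch: permit it only while some error budget is left."""
    return error_budget_remaining(sla_target, total_requests,
                                  failed_requests) > 0
```

With a 99.9 percent target and 1,000,000 requests, the budget is about 1,000 failed requests. A period with 400 failures leaves room to launch; one with 1,500 does not.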

Collect a time-series of measurements.
Up/down monitoring == bad:
- You get alerted when the system is down.
- The outage is "built in".
Time-series == good:
- Alerting when "sick", not "down".
- Prevents outages.
- Long-term history needed for planning.
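
One way to picture alerting when "sick" rather than "down" is to fit a trend to the collected samples and warn before a limit is reached. The sketch below is an assumption-laden example, not something from the talk: it uses disk usage as the metric, a plain least-squares line as the trend, and hypothetical names (`hours_until_full`, `is_sick`, `warn_hours`).

```python
from typing import Optional, Sequence, Tuple

# One sample: (hours since monitoring started, percent of disk used).
Sample = Tuple[float, float]


def hours_until_full(samples: Sequence[Sample],
                     capacity: float = 100.0) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    Fits a least-squares line through the samples. Returns None when
    there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


def is_sick(samples: Sequence[Sample], warn_hours: float = 24.0) -> bool:
    """Alert while the system is still up but heading for trouble."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= warn_hours
```

An up/down check would stay silent until the disk is full. Here, usage growing two points per hour from 70 percent triggers the alert roughly half a day ahead, and the same stored history can feed capacity planning.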

Notes from Case Study: Adopting SRE Principles at StackOverflow by Tom Limoncelli, Stack Exchange, Inc.
