Notes about Site Reliability Engineering

Site Reliability Practices

Ben Treynor's SREcon14 Keynote

  1. Hire only coders.
  2. Have an SLA for your service.
  3. Measure and report performance against the SLA.
  4. Use Error Budgets and gate launches on them.
  5. Have a common staffing pool for SRE and Developers.
  6. Have excess Ops work overflow to the Dev team.
  7. Cap SRE operational load at 50 percent.
  8. Share 5 percent of Ops work with the Dev team.
  9. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
  10. Aim for a maximum of two events per oncall shift.
  11. Do a postmortem for every event.
  12. Postmortems are blameless and focus on process and technology, not people.

Grouped and reordered

Management/budget policies that put value on operations:

  1. Have a common staffing pool for SRE and Developers.

Handle Outages/Incidents Awesomely

  1. Aim for a maximum of two events per oncall shift.
  2. Do a postmortem for every event.
  3. Postmortems are blameless and focus on process and technology, not people.

SLA-Driven Operations & Monitoring

  1. Have an SLA for your service.
  2. Measure and report performance against the SLA.
  3. Use Error Budgets and gate launches on them.
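
A minimal sketch of how an error budget could gate launches, assuming the SLA is expressed as a success ratio over a request window. The function names, the request-count model, and the example figures are illustrative assumptions, not part of the talk.

```python
def error_budget(sla_target: float, total_requests: int) -> float:
    """Failed requests the SLA tolerates in the window (assumed model)."""
    return (1.0 - sla_target) * total_requests


def budget_remaining(sla_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Budget left after subtracting the failures already observed."""
    return error_budget(sla_target, total_requests) - failed_requests


def launch_allowed(sla_target: float, total_requests: int,
                   failed_requests: int) -> bool:
    """Gate launches: allow them only while some budget remains."""
    return budget_remaining(sla_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLA over 1,000,000 requests.
    print(launch_allowed(0.999, 1_000_000, 400))    # budget left
    print(launch_allowed(0.999, 1_000_000, 1_500))  # budget exhausted
```

With these assumed numbers the budget is roughly 1,000 failed requests, so launches continue at 400 failures and stop at 1,500.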

Control Operations Overload

  1. Hire only coders.
  2. Have excess Ops work overflow to the Dev team.
  3. Cap SRE operational load at 50 percent.
  4. Share 5 percent of Ops work with the Dev team.
  5. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
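
The cap and overflow rules can be read as simple arithmetic. The sketch below assumes work is measured in hours per week; the function name and the figures are hypothetical, and only the 50 percent cap and the overflow rule are modeled.

```python
def split_ops_work(ops_hours: float, sre_capacity_hours: float,
                   cap: float = 0.5) -> tuple[float, float]:
    """Return (hours kept by SRE, hours overflowing to the Dev team).

    SRE takes ops work only up to `cap` of its total capacity;
    anything beyond that overflows to Dev.
    """
    sre_limit = cap * sre_capacity_hours
    sre_share = min(ops_hours, sre_limit)
    overflow = ops_hours - sre_share
    return sre_share, overflow


if __name__ == "__main__":
    # Hypothetical week: 30 hours of ops work, 40 hours of SRE capacity.
    print(split_ops_work(30, 40))
```

In this example the SRE team keeps 20 hours of ops work and 10 hours overflow to the Dev team.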

Postmortems | Why important?

  • Postmortems are how we learn from outages.
  • Each outage educates the team.
  • The team then educates the entire company.
  • The net result: The organization gets smarter over time.
  • We must create an environment where people have time to write a postmortem for every event.

Monitoring

  • Collect a time-series of measurements.
  • Up/down monitoring == bad:
    • You get alerted when the system is down.
    • The outage is "built in"
  • Time-series == good:
    • Alerting when "sick", not "down"
    • Prevents outages
    • Long-term history needed for planning.
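
To illustrate the difference between the two styles, here is a minimal sketch that assumes the time series is a list of request latencies in milliseconds. The window size, the 500 ms threshold, and the function names are hypothetical choices, not values from the talk.

```python
from statistics import mean


def is_down(latest_probe_ok: bool) -> bool:
    """Up/down monitoring: fires only once the system has already failed."""
    return not latest_probe_ok


def is_sick(latencies_ms: list[float], window: int = 3,
            threshold_ms: float = 500.0) -> bool:
    """Time-series monitoring: fires when the recent average degrades,
    even though the service still answers."""
    recent = latencies_ms[-window:]
    if not recent:
        return False  # no data yet, nothing to judge
    return mean(recent) > threshold_ms


if __name__ == "__main__":
    # Hypothetical series: latency creeping up while probes still succeed.
    series = [120.0, 130.0, 300.0, 480.0, 620.0, 700.0, 810.0]
    print(is_down(latest_probe_ok=True))  # up/down check stays quiet
    print(is_sick(series))                # trend check raises the alert
```

The up/down check stays silent on this series because the service still responds, while the windowed average crosses the threshold and alerts before an outage.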

Notes from Case Study: Adopting SRE Principles at StackOverflow by Tom Limoncelli, Stack Exchange, Inc.
