Choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.
- Service Level Indicator (SLI) - a carefully defined quantitative measure of some aspect of the level of service that is provided.
- Examples:
- Request latency
- Error rate (often expressed as a fraction of all requests received)
- System throughput (typically measured in requests per second)
- Often aggregated - raw data is collected over a measurement window and then turned into a rate, average, or percentile.
- Examples:
- Availability - the fraction of the time that a service is usable
- Yield - the fraction of well-formed requests that succeed
- Durability - the likelihood that data will be retained over a long period of time
- GCE availability is "three and a half nines" - 99.95% availability
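To make the nines concrete, a small sketch that turns an availability target into an allowed-downtime figure, and computes yield as defined above (function names here are illustrative, not from any particular tool):

```python
def allowed_downtime_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at the given availability target."""
    return (1.0 - availability) * period_hours

def request_yield(successful: int, well_formed: int) -> float:
    """Yield: the fraction of well-formed requests that succeed."""
    return successful / well_formed

# "Three and a half nines" (99.95%) allows roughly 4.4 hours of downtime per year.
print(round(allowed_downtime_hours(0.9995), 2))  # 4.38
print(request_yield(999, 1000))                  # 0.999
```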
- Service Level Objective (SLO) - a target value or range of values for a service level that is measured by an SLI
- Structure: SLI ≤ target, or lower bound ≤ SLI ≤ upper bound
- Example: return search results "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds.
- Service Level Agreements (SLA) - an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
- If there is no explicit consequence, then you are almost certainly looking at an SLO
- SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions.
- SRE does get involved in helping to avoid triggering the consequences of missed SLOs.
- SRE does help to define the SLIs: there obviously needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise
- Choose the right number of indicators: too many and they become noise; too few may leave significant behaviors of your system unexamined
Broad Categories
- User-facing serving systems
- Availability - Could we respond to the request?
- Latency - How long did it take to respond?
- Throughput - How many requests could be handled?
- Storage Systems
- Latency - How long did it take to read or write?
- Availability - Can we access the data on demand?
- Durability - Is the data still there when we need it?
- Big data systems (data processing pipelines)
- Throughput - How much data is being processed?
- End-to-end latency - How long does it take the data to progress from ingestion to completion?
- All systems correctness
- Was the right answer returned?
- Was the right data retrieved?
- Was the right analysis done?
- Correctness is important to track as an indicator of system health, even though it’s often a property of the data in the system rather than the infrastructure per se, and so usually not an SRE responsibility to meet.
- Using averages can hide long tails or large instantaneous loads
- Use distributions/percentiles
- High-order percentile (99th or 99.9th) shows you a plausible worst-case value
- Using median (50th percentile) emphasizes the typical case
User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be good as well.
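A small illustration of why averages mislead, using made-up numbers: 99 fast requests plus one very slow one produce a healthy-looking mean while the tail is terrible. The nearest-rank percentile helper here is a sketch, not from any monitoring library:

```python
import math

# Illustrative data: 99 fast requests and one pathological outlier.
latencies_ms = [10.0] * 99 + [2000.0]

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies_ms) / len(latencies_ms)
print(mean)                            # 29.9 - looks fine
print(percentile(latencies_ms, 50))    # 10.0 - the typical case
print(percentile(latencies_ms, 99.9))  # 2000.0 - the tail the mean hid
```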
Standardize on common definitions for SLIs so that you don’t have to reason about them from first principles each time. Any feature that conforms to the standard definition templates can be omitted from the specification of an individual SLI.
- Aggregation intervals: "Averaged over 1 minute"
- Aggregation regions: "All the tasks in a cluster"
- How frequently measurements are made: "Every 10 seconds"
- Which requests are included: “HTTP GETs from black-box monitoring jobs”
- How the data is acquired: “Through our monitoring, measured at the server”
- Data-access latency: “Time to last byte”
To save effort, build a set of reusable SLI templates for each common metric; these also make it simpler for everyone to understand what a specific SLI means.
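One way such a template might look in code, sketched as a dataclass whose defaults encode the standard choices listed above (the field names and defaults are illustrative; a specific SLI then states only what differs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLITemplate:
    """Reusable SLI definition; defaults capture the common choices."""
    aggregation_interval: str = "averaged over 1 minute"
    aggregation_region: str = "all the tasks in a cluster"
    measurement_frequency: str = "every 10 seconds"
    included_requests: str = "HTTP GETs from black-box monitoring jobs"
    acquisition: str = "through our monitoring, measured at the server"
    latency_definition: str = "time to last byte"

# A specific SLI overrides only its non-default feature.
frontend_latency = SLITemplate(included_requests="all user-facing HTTP GETs")
print(frontend_latency.aggregation_interval)  # the defaults still apply
```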
- Work from desired objectives backward to specific indicators
- Starting with what's easy might mean the SLO is less useful
- SLOs should specify how they’re measured and the conditions under which they’re valid.
- Rely on the SLI defaults
- DO:
99% of Get RPC calls will complete in less than 100 ms
- DON'T:
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers)
- DO: if the shape of the performance curve is important, specify multiple percentile targets:
- 90% of Get RPC calls will complete in less than 1 ms.
- 99% of Get RPC calls will complete in less than 10 ms.
- 99.9% of Get RPC calls will complete in less than 100 ms.
- Heterogeneous workloads: if a bulk processing pipeline cares about throughput and an interactive client cares about latency, it may be appropriate to define separate objectives for each class of workload:
- 95% of throughput clients’ Set RPC calls will complete in < 1 s.
- 99% of latency clients’ Set RPC calls with payloads < 1 kB will complete in < 10 ms.
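A multi-percentile SLO like the ones above can be checked mechanically. This sketch (names are illustrative) evaluates a latency sample set against a list of (percentile, limit) targets using nearest-rank percentiles:

```python
import math

def meets_targets(latencies_ms, targets):
    """True if every (percentile, max_latency_ms) target is satisfied."""
    ordered = sorted(latencies_ms)
    for pct, limit in targets:
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        if ordered[rank - 1] >= limit:
            return False
    return True

# 90% < 1 ms, 99% < 10 ms, 99.9% < 100 ms (as in the performance-curve SLO above).
targets = [(90, 1.0), (99, 10.0), (99.9, 100.0)]
samples = [0.5] * 900 + [5.0] * 90 + [50.0] * 10
print(meets_targets(samples, targets))  # True
```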
Error Budget
- It is unrealistic and undesirable to insist that SLOs will be met 100% of the time
- Doing so can reduce the rate of innovation and deployment
- And can require expensive, overly conservative solutions
- Error budget: a rate at which the SLOs can be missed; track it on a daily or weekly basis.
- An error budget is just an SLO for meeting other SLOs!
- Upper management will probably want a monthly or quarterly assessment, too.
- Use the gap between the SLO violation rate and the error budget as an input to the process that decides when to roll out new releases.
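The budget arithmetic is simple enough to sketch. With a 99.9% success SLO, 0.1% of requests may fail; the fraction of that allowance still unspent can gate risky rollouts (numbers and names below are illustrative):

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = (1.0 - slo) * total  # failures the SLO permits this period
    return (budget - failed) / budget

# 1M requests at a 99.9% SLO permit 1,000 failures; 400 have occurred.
remaining = error_budget_remaining(slo=0.999, total=1_000_000, failed=400)
print(round(remaining, 3))   # 0.6 - 60% of the budget is left
can_release = remaining > 0  # gate new rollouts on remaining budget
print(can_release)           # True
```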
- Don't pick a target based on current performance.
- Keep it simple.
- Avoid absolutes.
- Have as few SLOs as possible
- Choose just enough SLOs to provide good coverage of your system's attributes.
- Perfection can wait.
- Start with a loose target that you can tighten.
- Caution:
- SLOs can—and should—be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about
- A good SLO is a helpful, legitimate forcing function for a development team.
- Poorly thought-out SLO can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO, or a bad product if the SLO is too lax.
- SLOs are a massive lever: use them wisely.
- Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster, more available, and more resilient.
- Alternatively, if the service is doing fine, perhaps staff time should be spent on other priorities, such as paying off technical debt, adding new features, or introducing other products.
- Control Measures
- SLIs and SLOs are crucial elements in the control loops used to manage systems:
- Monitor and measure the system's SLIs.
- Compare the SLIs to the SLOs, and decide whether or not action is needed.
- If action is needed, figure out what needs to happen in order to meet the target.
- Take that action.
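The four steps above can be sketched as a loop; `measure_sli` and `remediate` here are placeholder stand-ins (assumptions) for real monitoring reads and real corrective actions:

```python
SLO_TARGET = 0.999  # e.g. a success-ratio SLO

def measure_sli() -> float:
    """Stand-in for reading the current SLI from monitoring."""
    return 0.9996

def needs_action(sli: float, slo: float) -> bool:
    """Compare the SLI to the SLO and decide whether action is needed."""
    return sli < slo

def remediate() -> None:
    """Stand-in for corrective action (rollback, add capacity, shed load)."""
    print("taking corrective action")

def control_loop_iteration() -> None:
    sli = measure_sli()                # 1. monitor and measure the SLI
    if needs_action(sli, SLO_TARGET):  # 2. compare the SLI to the SLO
        remediate()                    # 3-4. figure out the fix and act
```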
- Set Expectations
- Keep a safety margin
- Keep internal SLOs tighter than external SLOs
- Don't overachieve
- Consistently delivering a higher level of performance than the SLO may lead users to depend on it
- Consider introducing planned outages to keep actual performance close to the stated SLO
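The safety-margin idea can be sketched as two thresholds, an internal target the team operates to and a looser external promise (the 99.95%/99.9% pair below is illustrative); dips that miss the internal SLO are absorbed before users are affected:

```python
INTERNAL_SLO = 0.9995  # what the team holds itself to
EXTERNAL_SLO = 0.999   # what users are promised

def status(measured_availability: float) -> str:
    """Classify a measurement against the internal and external targets."""
    if measured_availability < EXTERNAL_SLO:
        return "external SLO violated"
    if measured_availability < INTERNAL_SLO:
        return "within margin: internal SLO missed, users unaffected"
    return "healthy"

print(status(0.9992))  # the margin absorbs the dip before users notice
```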