Monitoring

Alerting

base rate fallacy: given a 1% false positive rate, a 1% false negative rate, and 99.9% uptime, the positive predictive value (the chance that a firing alert is a true positive) is only about 9%
sensitivity (% of real problems that raise an alert) vs specificity (% of healthy states that raise no alert)
‘Alert liberally; page judiciously. Page on symptoms, rather than causes.’
‘An alert should communicate something specific about your systems in plain language: “Two Cassandra nodes are down” or “90% of all web requests are taking more than 0.5s to process and respond.”’
‘Not all alerts carry the same degree of urgency.’
‘Many alerts will not be associated with a service problem, so a human may never even need to be aware of them. […] should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work.’
‘Response times for your web application, for instance, should have an internal SLA that is at least as aggressive as your strictest customer-facing SLA. Any instance of response times exceeding your internal SLA would warrant immediate attention, whatever the hour.’
‘If you can reasonably automate a response to an issue, you should consider doing so.’
‘Your users will not know or care about server load if the website is still responding quickly’
‘Focus on concurrency, latency, and limitations on fixed-size resources, such as the maximum number of permitted connections. Ask yourself if your basic KPIs are acceptable: Is the query latency within bounds? Is replication working and keeping up with changes?’
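The base rate figure at the top of this section follows from Bayes' rule. A minimal sketch, assuming "99.9% uptime" means the system is actually down 0.1% of the time; the function name and parameters are illustrative, not from any library:

```python
def positive_predictive_value(prevalence, false_positive_rate, false_negative_rate):
    """Probability that a firing alert reflects a real outage (Bayes' rule)."""
    sensitivity = 1.0 - false_negative_rate            # P(alert | outage)
    true_alerts = prevalence * sensitivity             # P(alert and outage)
    false_alerts = (1.0 - prevalence) * false_positive_rate  # P(alert and healthy)
    return true_alerts / (true_alerts + false_alerts)


# Assumption: 99.9% uptime, so an outage is in progress 0.1% of the time.
ppv = positive_predictive_value(
    prevalence=0.001,
    false_positive_rate=0.01,
    false_negative_rate=0.01,
)
print(f"{ppv:.1%}")  # roughly 9%
```

Even with a check that is right 99% of the time, most alerts are false alarms because outages are so rare compared with healthy periods.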

Metrics

service level objective (SLO): ‘best defined as a percentile over intervals of time (e.g. 99.9th percentile response time is less than 5ms in every 5-minute interval)’
‘five nines availability means 99.9th percentile response time -- less than 5ms -- for all but one 5-minute period in the course of a year’
MTBF (mean time between failures): how often things are unavailable, MTTR (mean time to recovery): how long they remain that way
availability(mtbf, mttr) = mtbf / (mtbf + mttr)
‘It’s possible to subdivide MTTR itself into two deeper components: detection time (MTTD, or mean time to detect) and remediation time. MTTR, ultimately, is the sum of the two. So even if remediation time is instantaneous, MTTR will never be less than MTTD, because you can’t fix something if you don’t detect it first. That means that it is vitally important to minimize MTTD. […] The quicker you are to declare an outage, the more likely you are to raise a false alarm.’
‘Focus on measuring what matters; Measure what breaks most; Measure in high resolution; Capture potential problems early, but alert only on actionable problems’
‘Consider everything that falls within the purview of attaining high availability: recognizing how a database might fail; recognizing what can be monitored to increase detection speed; recovering from hiccups as quickly as possible; avoiding hiccups altogether; ascertaining the percentile level of performance that the system achieves even when it’s not failing.’
‘Not specifying the time unit of SLA enforcement is most often the first mistake people make. […] I do not recommend stating an SLA in anything less than one hour spans.’
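The availability formula, the MTTD/remediation split, and the per-interval percentile SLO in this section can be sketched as follows. This is an illustration under stated assumptions: the failure and recovery numbers are hypothetical, the sample data is synthetic, and the nearest-rank percentile is one of several common definitions.

```python
import math


def availability(mtbf, mttr):
    """Fraction of time the service is up; both arguments share one time unit."""
    return mtbf / (mtbf + mttr)


def mean_time_to_recovery(detection, remediation):
    """MTTR as the sum of detection time (MTTD) and remediation time."""
    return detection + remediation


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    # The small epsilon guards against float noise such as 999.0000000000001.
    rank = max(1, math.ceil(pct * len(ordered) / 100 - 1e-9))
    return ordered[rank - 1]


def violating_intervals(samples, interval_seconds, pct, threshold_ms):
    """Return indexes of intervals whose pct-th percentile is not below threshold_ms.

    samples is an iterable of (timestamp_seconds, latency_ms) pairs.
    """
    buckets = {}
    for timestamp, latency_ms in samples:
        buckets.setdefault(int(timestamp // interval_seconds), []).append(latency_ms)
    return sorted(i for i, v in buckets.items() if percentile(v, pct) >= threshold_ms)


# Hypothetical numbers: one failure every 720 hours, 15 minutes to detect, 45 to fix.
mttr_hours = mean_time_to_recovery(detection=0.25, remediation=0.75)
uptime = availability(mtbf=720.0, mttr=mttr_hours)
print(f"availability: {uptime:.4%}")

# Two synthetic 5-minute intervals: the first is fast, the second is uniformly slow.
samples = [(t, 1.0) for t in range(0, 300)] + [(t, 10.0) for t in range(300, 600)]
print(violating_intervals(samples, interval_seconds=300, pct=99.9, threshold_ms=5.0))
```

Because MTTR is a sum, shrinking remediation to zero still leaves the detection time in the denominator, which is why the notes stress minimizing MTTD.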

Visualization

distributions are often multi-modal (multiple peaks)
percentiles are misleading because they're always computed from downsampled/averaged data -- nobody stores sub-second data (or even sub-minute, long-term), so the reported 99th percentile isn't the true 99th percentile
histograms are much better, but need to pick correct representation
banding is useless on a linear scale (since you're looking for outliers), but log scale misrepresents reality
banding also conflates dimensions (quantity and duration)
‘Heatmaps are great for visualizing the shape and density of latencies’
important to be able to compare across graphs -- lag and errors propagate
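A small sketch of why averaged percentiles mislead and why a histogram exposes the shape of the data. The latencies are synthetic and chosen only for illustration; the nearest-rank percentile and the power-of-two bucketing are assumptions, not a recommendation for any particular tool.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct as an integer 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms (made up for illustration).
quiet_minute = [10] * 100                  # little traffic, all fast
busy_minute = [10] * 800 + [500] * 100     # heavy traffic, a slow second mode

# What a downsampling store might keep: one p99 per minute, later averaged.
per_minute_p99 = [percentile(quiet_minute, 99), percentile(busy_minute, 99)]
averaged_p99 = sum(per_minute_p99) / len(per_minute_p99)

# The p99 computed over every raw request.
true_p99 = percentile(quiet_minute + busy_minute, 99)

# Power-of-two buckets: key k covers latencies in [2**k, 2**(k+1)) ms.
histogram = Counter(math.floor(math.log2(ms)) for ms in quiet_minute + busy_minute)

print(averaged_p99, true_p99)
print(sorted(histogram.items()))
```

Averaging the two per-minute values weights the quiet minute as heavily as the busy one, so the result understates what requests actually experienced, while the histogram keeps both modes visible.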

References

Site Reliability Engineering (SRE)

Metrics

Add graphs & alerts for latency
Add graph for stddev in latency (database & web)
Add graph for stddev in response time (database & web)
Add graph for correlation of resource usage & traffic
Add graph for load balancing
Add graph for GC

Production Readiness Review (PRR)

What parts cause problems?
What monitoring have you added/improved?
What won’t show up in monitoring?
What is the service level objective (SLO)?
What are the dependencies and how is failure handled?
How does rollback work?

Principles

‘responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning’
‘free and easy migration […] in the SRE group […] project or a system that is "bad" […] incentive for development teams to not build systems that are horrible to run’
‘Operational only projects have relatively low ROI. I don't put SREs on those.’
‘[…] an SRE team must spend at least 50% of its time actually doing development. […] teams consistently spending less than 50% of their time on development work get redirected or get dissolved.’
‘An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything. […] no user can tell the difference between a system being 100% available and, let's say, 99.999% available. Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong. […] if it's 99.99% available, that means that it's 0.01% unavailable. Now we are allowed to have .01% unavailability and this is a budget. We can spend it on anything we want, as long as we don't overspend it. […] if the service natively sits there and throws errors, you know, .01% of the time, you're blowing your entire unavailability budget on something that gets you nothing.’
‘The only sure way that we can bring the availability level back up is to stop all launches until you have earned back that unavailability. […] We simply freeze launches, other than P0 bug fixes -- things that by themselves represent improved availability.’
‘anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.’
‘There are alerts, which say a human must take action right now. Something that is happening or about to happen, that a human needs to take action immediately to improve the situation. The second category is tickets. A human needs to take action, but not immediately. You have maybe hours, typically, days, but some human action is required. The third category is logging. No one ever needs to look at this information, but it is available for diagnostic or forensic purposes. The expectation is that no one reads it.’
‘There is mean time between failure -- how often does the thing stop working. And then there is mean time to repair -- once it stops working, how long does it take until you fix it? […] You can make it fail very rarely, or you are able to fix it really quickly when it does fail. Google has a well-deserved reputation for extremely high availability. And the way SRE gets that is by doing both.’
‘if you build bad code, then all the good SREs leave and you end up either running it yourself or at best having a junior team who's willing to take a gamble.’
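The error budget idea quoted above can be sketched as simple arithmetic. This is a minimal illustration, assuming the budget is measured in minutes of downtime over a fixed window; the 30-day window, the downtime figures, and the function names are hypothetical, and a request-based budget would work the same way with counts instead of minutes.

```python
def error_budget_minutes(availability_target, period_days):
    """Minutes of unavailability the target permits over the period."""
    return (1.0 - availability_target) * period_days * 24 * 60


def budget_remaining(availability_target, period_days, downtime_minutes):
    """Budget left after the downtime already spent (negative means overspent)."""
    return error_budget_minutes(availability_target, period_days) - downtime_minutes


def launches_frozen(availability_target, period_days, downtime_minutes):
    """Freeze launches once the budget has been overspent."""
    return budget_remaining(availability_target, period_days, downtime_minutes) < 0


# Hypothetical: a 99.99% target over a 30-day window.
budget = error_budget_minutes(0.9999, period_days=30)
print(f"budget: {budget:.2f} minutes")
print(launches_frozen(0.9999, period_days=30, downtime_minutes=2.0))
print(launches_frozen(0.9999, period_days=30, downtime_minutes=5.0))
```

A 99.99% target leaves only a few minutes per month, so a service that throws errors on its own spends the budget before any launch gets to use it.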
