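- base rate fallacy: given a 1% false positive rate, a 1% false negative rate, and 99.9% uptime, the positive predictive value (the chance that a firing alert is a true positive) is only about 9%
- sensitivity (share of real problems that trigger an alert, i.e. the true positive rate) vs specificity (share of healthy states that stay quiet, i.e. 1 minus the false positive rate)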
- ‘Alert liberally; page judiciously. Page on symptoms, rather than causes.’
- ‘An alert should communicate something specific about your systems in plain language: “Two Cassandra nodes are down” or “90% of all web requests are taking more than 0.5s to process and respond.”’
- ‘Not all alerts carry the same degree of urgency.’
- ‘Many alerts will not be associated with a service problem, so a human may never even need to be aware of them. […] should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work.’
- ‘Response times for your web application, for instance, should have an internal SLA that is at least as aggressive as your strictest customer-facing SLA. Any instance of response times exceeding your internal SLA would warrant immediate attention, whatever the hour.’
- ‘If you can reasonably automate a response to an issue, you should consider doing so.’
- ‘Your users will not know or care about server load if the website is still responding quickly’
- ‘Focus on concurrency, latency, and limitations on fixed-size resources, such as the maximum number of permitted connections. Ask yourself if your basic KPIs are acceptable: Is the query latency within bounds? Is replication working and keeping up with changes?’
- service level objective (SLO): ‘best defined as a percentile over intervals of time (e.g. 99.9th percentile response time is less than 5ms in every 5-minute interval)’
- ‘five nines availability means 99.9th percentile response time -- less than 5ms -- for all but one 5-minute period in the course of a year’
- MTBF (mean time between failures): how often things become unavailable; MTTR (mean time to recovery): how long they remain that way
- availability(mtbf, mttr) = mtbf / (mtbf + mttr)
- ‘It’s possible to subdivide MTTR itself into two deeper components: detection time (MTTD, or mean time to detect) and remediation time. MTTR, ultimately, is the sum of the two. So even if remediation time is instantaneous, MTTR will never be less than MTTD, because you can’t fix something if you don’t detect it first. That means that it is vitally important to minimize MTTD. […] The quicker you are to declare an outage, the more likely you are to raise a false alarm.’
- ‘Focus on measuring what matters; Measure what breaks most; Measure in high resolution; Capture potential problems early, but alert only on actionable problems’
- ‘Consider everything that falls within the purview of attaining high availability: recognizing how a database might fail; recognizing what can be monitored to increase detection speed; recovering from hiccups as quickly as possible; avoiding hiccups altogether; ascertaining the percentile level of performance that the system achieves even when it’s not failing.’
- ‘Not specifying the time unit of SLA enforcement is most often the first mistake people make. […] I do not recommend stating an SLA in anything less than one hour spans.’
- distributions are often multi-modal (multiple peaks)
- percentiles are misleading because they're almost always computed from downsampled/averaged data; nobody stores sub-second data (or even sub-minute, long-term), so a reported 99th percentile isn't the true 99th percentile
- histograms are much better, but you need to pick the right representation
- banding is useless on a linear scale (since you're looking for outliers), but log scale misrepresents reality
- banding also conflates dimensions (quantity and duration)
- ‘Heatmaps are great for visualizing the shape and density of latencies’
- important to be able to compare across graphs -- lag and errors propagate
- http://www.brendangregg.com/FrequencyTrails/modes.html
- https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think
- https://www.vividcortex.com/blog/2015/05/21/updated-fault-detection/
- https://blog.danslimmon.com/2012/11/02/car-alarms-and-smoke-alarms-the-tradeoff-between-sensitivity-and-specificity/
- https://www.slideshare.net/danslimmon/car-alarms-smoke-alarms-monitorama
- https://www.vividcortex.com/blog/the-factors-that-impact-availability-visualized
- https://www.circonus.com/2015/02/problem-math/
- https://www.datadoghq.com/blog/monitoring-101-alerting/
- http://www.itproportal.com/2015/07/05/5-database-monitoring-issues-that-need-your-attention-now/
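
The two calculations in the notes above (the base-rate example and the availability formula) can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the MTBF/MTTR figures in the second example are hypothetical, not taken from any of the linked sources.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a firing alert reflects a real outage (Bayes' rule).

    sensitivity: P(alert | outage)      = 1 - false negative rate
    specificity: P(no alert | healthy)  = 1 - false positive rate
    prevalence:  P(outage)              = 1 - uptime
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


def availability(mtbf, mttr):
    """Fraction of time the service is up; mtbf and mttr must share a unit."""
    return mtbf / (mtbf + mttr)


# Base-rate note: 1% false negatives, 1% false positives, 99.9% uptime.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"positive predictive value: {ppv:.1%}")  # roughly 9%

# Hypothetical figures (assumption): one failure every 30 days, 43 minutes to recover.
up = availability(mtbf=30 * 24 * 60, mttr=43)
print(f"availability: {up:.4%}")
```

Even with a check that is right 99% of the time in both directions, roughly ten out of eleven alerts are false alarms, because outages are so much rarer than healthy intervals.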