For when just having a service that does something isn't enough.
For answering the question: what the hell is going on?
a shared identifier propagated across requests makes the entire execution graph visible
When services serve requests, they often require making further requests to other services in order to produce their response...
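A minimal sketch of the shared-identifier idea, assuming a made-up header name (X-Trace-Id), a made-up call topology (CALLS), and an in-memory list (SPANS) standing in for whatever collects trace records. Services call each other as plain functions here; in practice the identifier would travel in request headers or message metadata.

```python
import uuid
from typing import Dict, List, Optional, Tuple

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, not a standard

# Hypothetical topology: which services each service calls to do its work.
CALLS: Dict[str, List[str]] = {
    "frontend": ["orders", "users"],
    "orders": ["inventory"],
    "users": [],
    "inventory": [],
}

# (trace_id, parent_service, service) records; stands in for a trace collector.
SPANS: List[Tuple[str, Optional[str], str]] = []


def handle(service: str, headers: Dict[str, str], parent: Optional[str] = None) -> str:
    """Serve a request, reusing the caller's trace id or minting a new one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    SPANS.append((trace_id, parent, service))
    for callee in CALLS[service]:
        # Forward the same identifier on every downstream request.
        handle(callee, {TRACE_HEADER: trace_id}, parent=service)
    return trace_id
```

Calling handle("frontend", {}) records one entry per service touched, all carrying the same identifier, so grouping SPANS by trace id and following the parent links reconstructs the execution graph for that request.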
Describe what a service is doing
recorded periodically to show behavior over time
gauges (instantaneous measurement)
counters (a gauge for a Long value that can be incremented or decremented)
meters (rate of events over time)
histograms (distribution of values)
timers (rate of computations with the distribution of their duration)
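The five metric kinds above can be sketched in a few lines each. This is an illustration under stated assumptions, not any particular library's API: the class and method names are hypothetical, the histogram keeps every value in memory (real implementations usually sample), and the meter reports only a mean rate since creation.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class Gauge:
    """Instantaneous measurement: reads a value on demand."""
    def __init__(self, read: Callable[[], float]) -> None:
        self._read = read

    def value(self) -> float:
        return self._read()


class Counter:
    """An integer that is incremented and decremented."""
    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n

    def dec(self, n: int = 1) -> None:
        self.count -= n


class Meter:
    """Rate of events over time (mean rate since creation only)."""
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n: int = 1) -> None:
        self.count += n

    def mean_rate(self) -> float:
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Distribution of values; stores everything, which real systems avoid."""
    def __init__(self) -> None:
        self._values: List[float] = []

    def update(self, value: float) -> None:
        self._values.append(value)

    def quantile(self, q: float) -> float:
        if not self._values:
            return 0.0
        ordered = sorted(self._values)
        index = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[index]


class Timer:
    """A meter of calls plus a histogram of their durations."""
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.meter = Meter(clock)
        self.histogram = Histogram()

    def time(self, fn: Callable[[], T]) -> T:
        start = self._clock()
        try:
            return fn()
        finally:
            # Record the call even when fn raises.
            self.histogram.update(self._clock() - start)
            self.meter.mark()
```

The clock is injectable so the rate and duration logic can be tested without sleeping. A reporter would read these objects periodically to show behavior over time.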
binary classification of a metric, representing some agreed-upon expectation of the service
Metrics describe what a service is doing; health checks describe the preconditions necessary to do its work. Can I talk to the database I'm using? Can I read my current configuration file, or does it have some syntax problem?
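The two questions above (can I talk to the database, can I read my configuration file) might look like this as checks. A sketch only: it assumes a SQLite database and a JSON configuration file purely so the example is self-contained, and the names Result, database_check, config_check, and run_checks are hypothetical.

```python
import json
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Result:
    healthy: bool
    message: str = ""


def database_check(path: str) -> Callable[[], Result]:
    """Can I talk to the database? (SQLite is an assumption for the sketch.)"""
    def check() -> Result:
        try:
            conn = sqlite3.connect(path)
            try:
                conn.execute("SELECT 1")
            finally:
                conn.close()
            return Result(True)
        except sqlite3.Error as exc:
            return Result(False, str(exc))
    return check


def config_check(path: str) -> Callable[[], Result]:
    """Can I read my configuration file, and does it parse? (JSON assumed.)"""
    def check() -> Result:
        try:
            with open(path, encoding="utf-8") as handle:
                json.load(handle)
            return Result(True)
        except (OSError, ValueError) as exc:
            # OSError: unreadable file; ValueError: syntax problem in the JSON.
            return Result(False, str(exc))
    return check


def run_checks(checks: Dict[str, Callable[[], Result]]) -> Dict[str, Result]:
    """Run every named check and collect the binary results."""
    return {name: check() for name, check in checks.items()}
```

Each check returns a binary result plus a message, so a monitoring endpoint can report both whether the service is healthy and why it is not.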
Frequency
Common checks
Health checks are performed at regular intervals, but a service should also inspect its own health when it starts up. If a service can't access its dependencies, maybe it shouldn't be running at all; after all, it can't do what it is supposed to do. The checks may fail because a new version of the service was deployed, and if a bug keeps the new version from ever becoming healthy, we'd probably want to roll it back.
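One way to sketch the startup inspection: run every check once before serving and refuse to start if any fails, which gives a deployment tool a clear signal to roll back. The names UnhealthyAtStartup and start are hypothetical, and checks are reduced to plain boolean callables for brevity.

```python
from typing import Callable, Dict


class UnhealthyAtStartup(RuntimeError):
    """Raised when a service fails its own health checks before serving."""


def start(checks: Dict[str, Callable[[], bool]], serve: Callable[[], None]) -> None:
    """Run each check once; only begin serving if all of them pass."""
    failed = sorted(name for name, check in checks.items() if not check())
    if failed:
        raise UnhealthyAtStartup("failed health checks: " + ", ".join(failed))
    serve()
```

Letting the exception end the process with a non-zero exit status is usually enough for a supervisor or deploy script to notice that the new version never became healthy.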
"branch in code"
"deployment should be a non-event"
reduce variation to improve shared learning and understanding
stuff the compiler can't figure out is potentially dangerous
blacklists for libraries, classes, methods
don't write test fixtures, write code that generates test fixtures
aka, generative testing
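A small sketch of the idea using only the standard library: instead of hand-written fixtures, a generator produces random inputs and the test asserts a property that must hold for all of them. The run-length codec is just a stand-in for code under test, and every name here is hypothetical. Dedicated property-based testing libraries add features such as shrinking failing inputs, which this sketch does not attempt.

```python
import random
from typing import List, Tuple


def run_length_encode(text: str) -> List[Tuple[str, int]]:
    """Code under test: collapse runs of a character into (char, count)."""
    pairs: List[Tuple[str, int]] = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def run_length_decode(pairs: List[Tuple[str, int]]) -> str:
    return "".join(ch * count for ch, count in pairs)


def random_text(rng: random.Random, max_len: int = 20) -> str:
    """Fixture generator: a tiny alphabet makes long runs likely."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ab") for _ in range(length))


def check_round_trip(trials: int = 200, seed: int = 0) -> int:
    """Property: decoding an encoding gives back the original text."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        text = random_text(rng)
        assert run_length_decode(run_length_encode(text)) == text, repr(text)
    return trials
```

The failing input is included in the assertion message, so a failure found by the generator can be turned into a regression case.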
good for overall sense of coverage; false sense of security for coverage improvements
"Stop the line so the line doesn't stop"
"sound the alarm when the metric goes up (down), since it should always go down (up)", e.g. the number of compiler warnings
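The alarm rule above can be sketched as a ratchet: compare the current value with a stored baseline, fail if it moved the wrong way, and otherwise tighten the baseline. The names RatchetViolation and ratchet are hypothetical, and where the baseline is stored between builds is left out.

```python
class RatchetViolation(Exception):
    """Raised when a one-way metric moves in the wrong direction."""


def ratchet(name: str, current: float, baseline: float,
            lower_is_better: bool = True) -> float:
    """Return the new baseline, or raise if the metric regressed."""
    regressed = current > baseline if lower_is_better else current < baseline
    if regressed:
        raise RatchetViolation(
            f"{name} regressed: baseline {baseline}, now {current}"
        )
    # Not regressed, so current is at least as good as the old baseline.
    return current
```

For compiler warnings, a build step would call ratchet("compiler warnings", count, stored_baseline) and persist the returned value, so the count can fall but never rise without stopping the line.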
TODO: circuit-breakers, rate-limiting