A checklist for those attempting to get out of bed only when it's important, and to be able to debug both critical and non-critical issues.
Those emphasised should probably get you out of bed when they're too high/low/gone.
This is super opinionated but I welcome feedback. It's biased towards retrofitting/cleaning up/brownfield type work because that's what I know best.
What you're serving, how much of it and how fast
- Inbound traffic volume ("requests") (too low)
  - Group by function (subdomain/high level URL)
  - Group by source
  - Ideally both of the above
- Outbound traffic volume ("responses")
  - Group by function (subdomain/high level URL) if possible
  - Group by HTTP status code groups
    - Good traffic: 2xx (too low)
    - Specific traffic: 3xx, 4xx
    - Bad traffic: 5xx (too high)
- Response times
  - Group by function (subdomain/high level URL) if possible
- HTTP/HTTPS split ratio
- External synthetic user journey test, e.g. from outside of your infrastructure, test your service like a user will use it (too slow, too broken)
- Total connections
- Worker statuses
  - Idle
  - Reading
  - Sending
  - Waiting
- Specific metrics that should be monitored when using Nginx
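The status code grouping above, together with the "too low" and "too high" conditions, could be sketched roughly as follows. This is a minimal illustration, not a recommendation for any particular monitoring tool: the function names and both thresholds (`min_2xx` and `max_5xx_ratio`) are made up for the example, and you'd tune them against your own traffic.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its group, e.g. 503 -> '5xx'."""
    if not 100 <= code <= 599:
        return "other"
    return f"{code // 100}xx"


def summarise(codes) -> Counter:
    """Count responses per status group for one time window."""
    return Counter(status_class(code) for code in codes)


def reasons_to_page(counts: Counter, min_2xx: int = 100,
                    max_5xx_ratio: float = 0.05) -> list:
    """Return the wake-up conditions that are currently true.

    Both thresholds are hypothetical defaults for illustration only.
    """
    total = sum(counts.values())
    reasons = []
    if counts["2xx"] < min_2xx:
        reasons.append("2xx too low")
    if total and counts["5xx"] / total > max_5xx_ratio:
        reasons.append("5xx too high")
    return reasons


if __name__ == "__main__":
    window = summarise([200, 200, 404, 503])
    print(dict(window))
    print(reasons_to_page(window, min_2xx=1, max_5xx_ratio=0.1))
```

Using a ratio for 5xx and an absolute floor for 2xx is one possible choice: a ratio keeps the bad-traffic check meaningful at any traffic level, while a floor on good traffic catches the case where everything has gone quiet.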
What's coming in and how well we're handing it off. You might also monitor your overall HTTP(S) service metrics from your load balancer, but in addition to those...
- Status of backend pool members
- Response rates from pool members
- Traffic levels to each pool member
- Specific metrics that should be monitored when using HAProxy
- Specific metrics that should be monitored when using Nginx as a load balancer
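As a rough sketch of the pool member checks above, the following parses a CSV in the style of HAProxy's stats export and reports which members are down and how traffic is split between them. It assumes the export carries the `pxname`, `svname`, `stot` (total sessions) and `status` columns; the sample data is invented and trimmed to just those columns, and the helper names are hypothetical.

```python
import csv
import io

# Invented sample, trimmed to the handful of columns used below.
SAMPLE = """# pxname,svname,stot,status
www,FRONTEND,300,OPEN
app,app1,200,UP
app,app2,100,UP
app,app3,0,DOWN
app,BACKEND,300,UP
"""


def parse_pool(csv_text: str) -> list:
    """Return one dict per pool member, skipping aggregate rows."""
    # The header line starts with "# ", so strip that before parsing.
    rows = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    return [r for r in rows if r["svname"] not in ("FRONTEND", "BACKEND")]


def down_members(members) -> list:
    """Names of members whose status does not start with UP."""
    return [m["svname"] for m in members if not m["status"].startswith("UP")]


def traffic_share(members) -> dict:
    """Fraction of total sessions handled by each member."""
    total = sum(int(m["stot"]) for m in members)
    return {
        m["svname"]: (int(m["stot"]) / total if total else 0.0)
        for m in members
    }


if __name__ == "__main__":
    pool = parse_pool(SAMPLE)
    print("down:", down_members(pool))
    print("share:", traffic_share(pool))
```

A lopsided share between members that are all up is often worth a look even when nothing is failing outright.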
- Query volume
- Query response times
  - Grouped by query pattern
- Top queries
  - By frequency
  - By returned volume
  - By execution length (slow queries)
- Replication statistics
- Table statistics
  - Read volumes
  - Write volumes
- Specific metrics that should be monitored when using MySQL
- Specific metrics that should be monitored when using RabbitMQ
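For the query items above, "grouped by query pattern" usually means collapsing literal values so that the same query with different parameters counts as one pattern. Here is a minimal sketch of that idea, assuming you can get `(sql, seconds)` pairs out of something like a slow query log. The regexes are deliberately naive and the function names are made up for the example; a real setup would more likely lean on whatever digesting your database or tooling already provides.

```python
import re
from collections import defaultdict

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals
_SPACE = re.compile(r"\s+")


def query_pattern(sql: str) -> str:
    """Replace literals with '?' and normalise whitespace and case."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return _SPACE.sub(" ", sql).strip().lower()


def top_queries(samples, key="count", limit=3) -> list:
    """Rank query patterns by 'count' or 'total_seconds'.

    samples is an iterable of (sql, seconds) pairs.
    """
    stats = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})
    for sql, seconds in samples:
        entry = stats[query_pattern(sql)]
        entry["count"] += 1
        entry["total_seconds"] += seconds
    ranked = sorted(stats.items(), key=lambda kv: kv[1][key], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    samples = [
        ("SELECT * FROM users WHERE id = 1", 0.01),
        ("SELECT * FROM users WHERE id = 2", 0.02),
        ("SELECT * FROM orders WHERE total > 9.5", 1.5),
    ]
    for pattern, entry in top_queries(samples, key="total_seconds"):
        print(pattern, entry)
```

Ranking the same data by `count` and by `total_seconds` tends to surface different problems: the first finds chatty queries, the second finds slow ones.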