@coldclimate
Last active August 2, 2023 16:42
WIP 101 what to monitor and alert on.

Monitoring and Alerting Minimum Viable Product

A checklist for those attempting to only get out of bed when it's important, and to be able to debug critical and non-critical issues.

Those emphasised should probably get you out of bed when they're too high, too low, or gone.

This is super opinionated, but I welcome feedback. It's biased towards retrofitting/cleaning up/brownfield type work because that's what I know best.

HTTP(S) Services

What you're serving, how much of it and how fast

  • Inbound traffic volume ("requests") (too low)
    • Group by function (subdomain/high level URL)
    • Group by source
    • Ideally both of the above
  • Outbound traffic volume ("responses")
    • Group by function (subdomain/high level URL) if possible
    • Group by HTTP status code groups
      • Good traffic: 2xx (too low)
      • Specific traffic: 3xx, 4xx
      • Bad traffic: 5xx (too high)
  • Response times
    • Group by function (subdomain/high level URL) if possible
  • HTTP/HTTPS split ratio
  • External synthetic user journey test, e.g. from outside your infrastructure, test your service like a user would use it (too slow, too broken)
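
As a rough illustration of the grouping above, here is a minimal Python sketch that counts responses per function and status code class, then decides whether anything is worth paging on. The input shape (pairs of function name and status code), the function names, and the thresholds are all assumptions made for the example, not recommendations.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 503 -> '5xx'."""
    return f"{code // 100}xx"


def summarise(responses):
    """Count responses per (function, status class).

    `responses` is an iterable of (function, status_code) pairs, where
    `function` is a subdomain or high level URL (hypothetical input shape).
    """
    counts = Counter()
    for function, code in responses:
        counts[(function, status_class(code))] += 1
    return counts


def reasons_to_page(counts, function, max_5xx_ratio=0.05, min_2xx=1):
    """Return the reasons this function should wake someone up.

    The thresholds are illustrative defaults only; tune them per service.
    """
    reasons = []
    total = sum(n for (f, _), n in counts.items() if f == function)
    if counts.get((function, "2xx"), 0) < min_2xx:
        reasons.append("2xx too low")
    if total and counts.get((function, "5xx"), 0) / total > max_5xx_ratio:
        reasons.append("5xx too high")
    return reasons
```

In practice the counting usually happens in your metrics system rather than in application code; the point is only that "good too low" and "bad too high" are separate checks per function.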

Web Servers

Apache

  • Total connections
  • Worker statuses
    • Idle
    • Reading
    • Sending
    • Waiting
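
One way to get at the worker statuses is Apache's mod_status machine-readable output (`/server-status?auto`). The sketch below parses that text into counts. It assumes mod_status is enabled and that the response contains `BusyWorkers`, `IdleWorkers` and `Scoreboard` lines; treating the checklist's "Idle" as `IdleWorkers` is my reading rather than something stated above, and the function name is hypothetical.

```python
from collections import Counter

# Scoreboard characters as documented by Apache mod_status.
SCOREBOARD_KEYS = {
    "_": "waiting",          # waiting for connection
    "S": "starting",         # starting up
    "R": "reading",          # reading request
    "W": "sending",          # sending reply
    "K": "keepalive",        # keepalive (read)
    "D": "dns_lookup",
    "C": "closing",          # closing connection
    "L": "logging",
    "G": "graceful_finish",  # gracefully finishing
    "I": "idle_cleanup",     # idle cleanup of worker
    ".": "open_slot",        # open slot with no current process
}


def parse_server_status(text: str) -> dict:
    """Parse the body of `/server-status?auto` into worker counts."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    board = Counter(fields.get("Scoreboard", ""))
    workers = {name: board.get(char, 0) for char, name in SCOREBOARD_KEYS.items()}
    return {
        "busy": int(fields.get("BusyWorkers", 0)),
        "idle": int(fields.get("IdleWorkers", 0)),
        "workers": workers,
    }
```

Fetching the page is left out on purpose; how you reach it (and who is allowed to) depends on your setup.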

Nginx

  • Specific metrics
  • That should be monitored
  • When using
  • Nginx

Load Balancers

What's coming in and how well we're handing it off. You might also monitor your overall HTTP(S) Services metrics from your load balancer, but in addition to those...

  • Status of backend pool members
  • Response rates from pool members
  • Traffic levels to each pool member
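
To make the three pool checks concrete, here is a small Python sketch that takes a per-member snapshot and lists anything that looks wrong. The `Member` fields, the thresholds and the "fair share" idea (traffic split evenly across healthy members) are assumptions for illustration; weighted pools would need a different baseline.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    healthy: bool   # status of the backend as the load balancer sees it
    requests: int   # requests routed to this member in the window
    errors: int     # failed responses from this member in the window


def pool_report(members, max_error_ratio=0.05, max_skew=2.0):
    """List problems in a backend pool (thresholds are illustrative)."""
    total = sum(m.requests for m in members)
    healthy = [m for m in members if m.healthy]
    fair_share = 1 / len(healthy) if healthy else 0.0
    problems = []
    for m in members:
        if not m.healthy:
            problems.append(f"{m.name}: down")
            continue
        share = m.requests / total if total else 0.0
        if m.requests and m.errors / m.requests > max_error_ratio:
            problems.append(f"{m.name}: error ratio too high")
        if share > fair_share * max_skew:
            problems.append(f"{m.name}: taking too much traffic")
    if not healthy:
        problems.append("pool: no healthy members")
    return problems
```

A single member being down is usually something for the morning; no healthy members is the one that should get you out of bed.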

HAProxy

  • Specific metrics
  • That should be monitored
  • When using
  • HAProxy

Nginx

  • Specific metrics
  • That should be monitored
  • When using
  • Nginx as a load balancer

Databases

  • Query volume
  • Query response times
    • Grouped by query pattern
  • Top queries
    • By frequency
    • By returned volume
    • By execution length (slow queries)
  • Replication statistics
  • Table statistics
    • Read volumes
    • Write volumes
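
"Grouped by query pattern" and the three "top queries" views can be sketched as below: literals are collapsed so that queries differing only in values count as one pattern, and patterns are then ranked by frequency, returned volume and execution time. The regexes are deliberately crude (they won't cope with every SQL dialect or quoting style), and the log shape of `(sql, seconds, rows_returned)` tuples is an assumption for the example.

```python
import re
from collections import defaultdict


def query_pattern(sql: str) -> str:
    """Collapse literals so queries differing only in values share a pattern."""
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # single-quoted strings
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)    # standalone numeric literals
    s = re.sub(r"\s+", " ", s).strip()          # normalise whitespace
    return s.lower()


def top_queries(log, n=3):
    """Rank query patterns three ways.

    `log` is an iterable of (sql, seconds, rows_returned) tuples
    (hypothetical shape; adapt to whatever your slow/general log gives you).
    """
    stats = defaultdict(lambda: {"count": 0, "seconds": 0.0, "rows": 0})
    for sql, seconds, rows in log:
        entry = stats[query_pattern(sql)]
        entry["count"] += 1
        entry["seconds"] += seconds
        entry["rows"] += rows

    def top(key):
        return sorted(stats, key=lambda p: stats[p][key], reverse=True)[:n]

    return {
        "by_frequency": top("count"),
        "by_rows": top("rows"),
        "by_time": top("seconds"),
    }
```

The three rankings often disagree, which is the useful part: the most frequent query is rarely the slowest one.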

MySQL

  • Specific metrics
  • That should be monitored
  • When using
  • MySQL

Queues

RabbitMQ

  • Specific metrics
  • That should be monitored
  • When using
  • RabbitMQ