@Dieterbe
Created July 7, 2012 09:18
Monitoring thoughts
## graphing/trending dashboards (not alerting)
i see 3 main use cases:
* interactively selecting datasources to satisfy a specific need ("i want to find out what went wrong yesterday 4-5PM on our memcache cluster, so i'm gonna have a look at different graphs in this timerange, each graph containing 1-N datapoints from various sources")
* trend analysis, prediction and anomaly detection
* more "fixed" dashboards, like the amount of storage available/used on a storage cluster over the past month. this becomes less important if you can set up good alerting (based on datapoints or on predictions) but is still useful to have.
### must haves
* date/time range selection through UI widgets (datetime pickers)
* date/time range selection and zoom in/out on both axes (interactive rectangle selection like cacti has)
* viewing multiple datasources (from anywhere in the tree, i.e. different systems) at once, toggling them on/off, and applying functions to each datasource you're viewing.
* a way to quickly get useful graphs (templates?). (i.e. it should know which functions to apply on interface packet counters to see network traffic and draw percentiles)
* something built to match diamond collector plugins would be nice
* an easy way to browse through all systems and see graphs that make sense for them (for example, showing a mysql graph for a machine whose name contains `mysql` or that has datapoints like `machinename.mysql`)
* overlay of same data, one week/month/... ago
* trending, predictions, aberration detection (holt-winters confidence bands etc), something to compare against known trends. Theo Schlossnagle phrased it nicely: `I want a system that I can say: "this looks right, tell me when it stops looking right."` (see http://lethargy.org/~jesus/writes/reconnoiter-and-another-platform). TODO: must figure out triple exponential smoothing (with seasonal variance), or maybe write my own "poor man's trending" function (using a known pattern and known variance trends: days of the week etc)
* auto page refresh (or realtime data updates without refreshing page)
* once a graph is composed through the UI, the ability to save it for later reuse in a modifiable, version-controllable way (i.e. a config file)
* self-servicing graphs and alerting (the latter not necessarily in the same product)
* spikes and gaps and anything that's weird must never be invisible. this is bad:
https://bugs.launchpad.net/graphite/+bug/850475
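
A minimal sketch of the "poor man's trending" idea from the list above. This is not holt-winters: it just assumes the data repeats with a known period (e.g. one week), builds a mean ± `width` standard deviations band per slot in that period, and reports every point that falls outside its band. Gaps (`None`) are reported too, so they never go unnoticed. The function names and the `width` default are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def seasonal_bands(history, period, width=3.0):
    """Build a (lower, upper) band for every slot in the season.

    history: values sampled at a fixed interval, oldest first
    period:  number of samples in one season (e.g. 168 for hourly data, weekly season)
    width:   band half-width, in standard deviations (an arbitrary default)
    """
    slots = defaultdict(list)
    for i, value in enumerate(history):
        if value is not None:  # skip gaps when learning the pattern
            slots[i % period].append(value)

    bands = {}
    for slot, values in slots.items():
        centre = mean(values)
        spread = pstdev(values)
        bands[slot] = (centre - width * spread, centre + width * spread)
    return bands


def find_anomalies(series, bands, period, offset=0):
    """Return the indexes of points that are gaps or fall outside their band.

    offset: position in the season of the first point of `series`
    """
    anomalies = []
    for i, value in enumerate(series):
        band = bands.get((offset + i) % period)
        if band is None:
            continue  # no history for this slot, nothing to compare against
        lower, upper = band
        if value is None or not (lower <= value <= upper):
            anomalies.append(i)
    return anomalies


# three "seasons" of history with a period of 3 samples
bands = seasonal_bands([10, 20, 30, 12, 22, 32, 11, 21, 31], period=3)
suspicious = find_anomalies([11, 40, None], bands, period=3)
```

Here the second point (a spike) and the third point (a gap) are the ones flagged. A real version would need enough history per slot for the standard deviation to mean anything.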
### nice to have
* showing numbers as you hover over them
### contenders, with notes
everything on http://graphite.readthedocs.org/en/1.0/tools.html
* https://github.com/cebailey59/charcoal concise readme, no demo/screenshots
* https://github.com/paperlesspost/graphiti you need to define all your graphs in json (through webui), no interactive buttons on screenshot(?), but maybe they do exist because it's intended to replace the graphite graph composer
* https://github.com/etsy/dashboard supports graphite, ganglia, cacti, newrelic, ..
* https://github.com/fetep/pencil all through yaml configs?
* https://github.com/wayfair/Graphite-Tattle A self service alerting and dashboard frontend for graphite and ganglia
* https://github.com/jondot/graphene realtime dashboard fetches json and uses d3 to render. no UI to configure
* https://github.com/ripienaar/gdash define templates in yaml, not sure if you can browse through systems and have it apply the correct templates automatically.
* https://github.com/obfuscurity/descartes author knows how to use GH issues. too many deps ("stores configuration data in PostgreSQL and Google OpenID state in Redis")
* https://github.com/obfuscurity/tasseo a potential good starting point, though it's only js
### building blocks for writing dashboards yourself:
* https://github.com/prestontimmons/graphitejs
* http://code.shutterstock.com/rickshaw/ or flot. both support time based zoom, rickshaw def. supports toggling. flot def. supports interactive zoom by drawing rectangles (and also moving around by dragging and zooming in/out with the scroll wheel)
### thoughts for DIY:
* avoid dependencies like databases, memcache/redis until there's a clear need
* installation should be just a few commands, and preferably able to run standalone on a specific port
* query graphite for data, render graphs client side, for increased interactivity (seeing numbers when hovering over them, immediate zooming, etc). there's no need for graphite to render a png serverside (even with a huge timerange / lots of metrics, graphite can aggregate serverside, and modern browsers should be able to deal with lots of data)
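
A rough sketch of the "query graphite for data" part, assuming graphite's render API with `format=json` (which returns a list of `{"target": ..., "datapoints": [[value, timestamp], ...]}` objects) and its `timeShift()` function for the "same data, one week ago" overlay. The host name and metric name are placeholders, and the helper names are made up for illustration. Shown in Python for brevity; a client-side dashboard would do the same in js.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, targets, time_from="-24h", until="now"):
    """Build a graphite render URL that returns raw JSON instead of a png."""
    params = [("target", t) for t in targets]  # one target parameter per series
    params += [("from", time_from), ("until", until), ("format", "json")]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


def with_overlay(metric, shift="7d"):
    """Return the metric plus the same metric shifted back in time."""
    return [metric, 'timeShift({}, "{}")'.format(metric, shift)]


def parse_series(payload):
    """Turn a render JSON payload into {target: [(timestamp, value), ...]}.

    Gaps arrive as null and are kept as None, so the graphing
    code can decide how to make them visible.
    """
    series = {}
    for entry in json.loads(payload):
        series[entry["target"]] = [(ts, value) for value, ts in entry["datapoints"]]
    return series


# placeholder host and metric name
url = render_url("http://graphite.example.com", with_overlay("servers.web1.load"))

# a hand-written payload in the shape the render API returns
sample = '[{"target": "servers.web1.load", "datapoints": [[1.5, 100], [null, 160]]}]'
points = parse_series(sample)
```

The dashboard would fetch `url`, pass the response body to something like `parse_series`, and hand the points to rickshaw/flot or whatever does the drawing.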
## Alerting
* something like nagios/cloudkick, but with plugins so you can also use graphite data (and predicted data, points and bands)
* self-servicing, configurable through a UI that saves the result to plaintext config files.
### contenders
* riemann looks very interesting, but it has limitations: no persistence, so it cannot do full alerting because it cannot do holt-winters reliably, for example. unless you have other scripts for every "long-running" thing that query graphite and emit events back into riemann
* http://code.google.com/p/rocksteady/ sort of similar, but more generic and focused on correlation of different things
* graphite-tattle, need to check this out further