Riak is a complex system with many moving parts to monitor, including the health of the hardware, the state of the software, and the responsiveness of the network. This document discusses the metrics, thresholds, and values that indicate when your monitoring system should send alarms.
Metric | Threshold |
---|---|
CPU | 75% * number of cores |
Memory | 70% - buffers |
Disk Space | 75% |
Network | 70% sustained |
File Descriptors | 75% of ulimit |
Swap | > 0 KB |
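
As a rough illustration of how the thresholds in the table could be applied, the following Python sketch compares one sample of system metrics against them. The `sample` dictionary, its key names, and the function name are hypothetical; collecting the values is left to your monitoring agent. The CPU and memory rules reflect one reading of the table (CPU measured on a scale where one fully busy core is 100, and memory counted as used memory excluding buffers), so adjust them if your tooling reports these differently.

```python
# Thresholds from the table, expressed as fractions of capacity.
THRESHOLDS = {
    "cpu": 0.75,
    "memory": 0.70,
    "disk": 0.75,
    "network": 0.70,
    "file_descriptors": 0.75,
}


def breached_thresholds(sample):
    """Return the names of the metrics in `sample` that exceed their threshold.

    `sample` is a hypothetical dict with these keys:
      cpu_percent, cores                  CPU use where one busy core is 100
      mem_used, mem_buffers, mem_total    memory figures in the same unit
      disk_used, disk_total               disk figures in the same unit
      net_used, net_capacity              sustained throughput and link capacity
      fds_open, fds_limit                 open descriptors and the ulimit
      swap_used_kb                        swap in use, in KB
    """
    alarms = []
    # Assumption: "75% * number of cores" means 75 per core on a 100-per-core scale.
    if sample["cpu_percent"] > THRESHOLDS["cpu"] * 100 * sample["cores"]:
        alarms.append("cpu")
    # Assumption: "70% - buffers" means buffers are not counted as used memory.
    used_without_buffers = sample["mem_used"] - sample["mem_buffers"]
    if used_without_buffers > THRESHOLDS["memory"] * sample["mem_total"]:
        alarms.append("memory")
    if sample["disk_used"] > THRESHOLDS["disk"] * sample["disk_total"]:
        alarms.append("disk")
    if sample["net_used"] > THRESHOLDS["network"] * sample["net_capacity"]:
        alarms.append("network")
    if sample["fds_open"] > THRESHOLDS["file_descriptors"] * sample["fds_limit"]:
        alarms.append("file_descriptors")
    # Any swap use at all is treated as an alarm.
    if sample["swap_used_kb"] > 0:
        alarms.append("swap")
    return alarms
```

The function is kept free of I/O so that the same rules can be exercised against recorded samples before they are wired into an alerting system.
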
String | Log File | Reason |
---|---|---|
eaccess | console.log | File/Directory Permissions Issue |
emfile | console.log | Exhausted File Handles |
erofs | console.log | File System Mounted in Read-Only Mode |
noproc | console.log | Unexpectedly Missing Process |
undef | console.log | Missing/Incorrect Erlang Modules |
system_limit | console.log | Erlang VM Resource Exhaustion |
Compaction error | LevelDB LOGs | LevelDB Compaction Error |
waiting | LevelDB LOGs | LevelDB Stalls |
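
One simple way to watch for these strings is a plain substring scan over each log. The sketch below is a minimal example under that assumption; the function name is hypothetical, the strings are copied verbatim from the table, and the two pattern sets are kept separate because a common word such as `waiting` is only meaningful in the LevelDB LOG files.

```python
# Strings to look for in console.log, mapped to the reason from the table.
CONSOLE_PATTERNS = {
    "eaccess": "File/Directory Permissions Issue",
    "emfile": "Exhausted File Handles",
    "erofs": "File System Mounted in Read-Only Mode",
    "noproc": "Unexpectedly Missing Process",
    "undef": "Missing/Incorrect Erlang Modules",
    "system_limit": "Erlang VM Resource Exhaustion",
}

# Strings to look for in the LevelDB LOG files.
LEVELDB_PATTERNS = {
    "Compaction error": "LevelDB Compaction Error",
    "waiting": "LevelDB Stalls",
}


def scan_lines(lines, patterns):
    """Return (line_number, matched_string, reason) for every match in `lines`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for needle, reason in patterns.items():
            if needle in line:
                hits.append((number, needle, reason))
    return hits


def scan_file(path, patterns):
    """Scan one log file; undecodable bytes are replaced rather than raised."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_lines(handle, patterns)
```

A substring match is deliberately crude and can produce false positives, so treat the hits as prompts to read the surrounding log entries rather than as confirmed faults.
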
The riak_nagios checks are located at http://github.com/basho_labs/riak_nagios. Provided tests include:
- check_connection_pools
- check_file_handle_count
- check_leveldb_compaction
- check_node
- check_node_up
- check_port_count
- check_riak_kv_up
- check_riak_repl
- Acreman, S. (2014, April 12). Monitoring Riak. [Blog post]. Retrieved from http://blog.dataloop.io/2014/04/12/monitoring-riak/
  Discusses using CollectD to consume all of the statistics provided by Riak.
- Santero, T. (2013, March 27). Monitoring Riak. [Slide deck]. Retrieved from https://speakerdeck.com/tsantero/monitoring-riak
  Provides monitoring suggestions, including system limits and monitoring tools.
Listing all of the statistics and their values
$ riak-admin stat show '*.**'
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
Querying for a specific statistic
$ riak-admin stat show riak.riak_kv.vnode.gets
[riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
Querying for child statistics
$ riak-admin stat show 'riak.riak_kv.vnode.gets.*'
[riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
Querying for a subtree
$ riak-admin stat show 'riak.riak_kv.vnode.gets.**'
[riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
[riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
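
To feed these values into a monitoring system, the output lines need to be turned into structured data. The following Python sketch parses lines in the format shown in the examples above. The function names and regular expressions are illustrative and assume that every datapoint is a simple `{name,value}` pair, as in the sample output; other output shapes are not handled.

```python
import re

# Matches "[a,b,c]: [ ... ]" and captures the name list and the datapoint list.
LINE_RE = re.compile(r"^\[([^\]]+)\]:\s*\[(.*)\]\s*$")
# Matches one "{name,value}" datapoint.
PAIR_RE = re.compile(r"\{([^,{}]+),([^{}]*)\}")


def _to_number(raw):
    """Convert a value to int or float when possible, else return it unchanged."""
    for convert in (int, float):
        try:
            return convert(raw)
        except ValueError:
            continue
    return raw


def parse_stat_line(line):
    """Parse one line of `riak-admin stat show` output.

    Returns (dotted_name, {datapoint: value}), or None for lines that do not
    look like a statistic (for example the command line itself).
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name = ".".join(part.strip() for part in match.group(1).split(","))
    values = {}
    for key, raw in PAIR_RE.findall(match.group(2)):
        values[key.strip()] = _to_number(raw.strip())
    return name, values


# parse_stat_line("[riak,riak_kv,vnode,gets]: [{count,0},{one,0}]")
# returns ("riak.riak_kv.vnode.gets", {"count": 0, "one": 0})
```

Datapoint names are kept as strings, so the percentile entries appear under keys such as `"95"` and `"999"`.
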
Getting a list of all statistics and their type
$ riak-admin stat info -type '*.**'
[riak,common,cpu_stats]: type = cpu
[riak,common,mem_stats]: type = function
[riak,common,memory_stats]: type = function
[riak,riak_api,pbc_connects]: type = spiral
[riak,riak_api,pbc_connects,active]: type = function
[riak,riak_core,converge_delay]: type = duration
[riak,riak_core,dropped_vnode_requests_total]: type = counter
[riak,riak_core,gossip_received]: type = spiral
[riak,riak_core,handoff_timeouts]: type = counter
[riak,riak_core,ignored_gossip_total]: type = counter
[riak,riak_core,rebalance_delay]: type = duration
...
This output could be used as input to a script that generates a CollectD configuration.
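
As a possible first step for such a script, the Python sketch below groups the statistics by type, based on the line format shown in the sample output. The function name is hypothetical, and the sketch stops short of emitting any CollectD configuration, since that format depends on which CollectD plugin you use.

```python
import re

# Matches "[a,b,c]: type = something" and captures the name list and the type.
TYPE_RE = re.compile(r"^\[([^\]]+)\]:\s*type\s*=\s*(\S+)\s*$")


def group_stats_by_type(lines):
    """Group statistic names from `riak-admin stat info -type` output by type.

    Returns a dict mapping each type to a list of dotted statistic names, in
    the order they were seen. Lines that do not match are skipped.
    """
    groups = {}
    for line in lines:
        match = TYPE_RE.match(line.strip())
        if match is None:
            continue
        name = ".".join(part.strip() for part in match.group(1).split(","))
        groups.setdefault(match.group(2), []).append(name)
    return groups
```

A generator could then walk the result and decide, per type, how each statistic should be represented. For example, it might treat `counter` and `spiral` statistics differently from `duration` statistics; that mapping is a design choice left to the script author.
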