Riak is a complex system with many moving parts to monitor, including the health of the hardware, the well-being of the software, and the responsiveness of the network. This document discusses the metrics, thresholds, and values that indicate when your monitoring system should raise an alarm.
| Metric | Threshold | 
|---|---|
| CPU | 75% * number of cores | 
| Memory | 70% - buffers | 
| Disk Space | 75% | 
| Network | 70% sustained | 
| File Descriptors | 75% of ulimit | 
| Swap | > 0 KB | 
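
The thresholds above can be expressed as a small check. The following is a minimal sketch, not part of Riak or any monitoring tool: the `Sample` fields and the `breached` function are hypothetical names, the CPU figure is assumed to be summed across cores (so 400.0 means four fully busy cores), memory use is assumed to already exclude buffers, and the "sustained" qualifier on network use is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of system metrics (hypothetical structure)."""
    cpu_percent: float           # summed across all cores (assumption)
    cores: int
    memory_used_percent: float   # excluding buffers (assumption)
    disk_used_percent: float
    network_used_percent: float  # a single reading; "sustained" is not modeled
    open_fds: int
    fd_ulimit: int
    swap_used_kb: int


def breached(sample):
    """Return the names of the metrics that exceed the table's thresholds."""
    checks = {
        "CPU": sample.cpu_percent > 75 * sample.cores,
        "Memory": sample.memory_used_percent > 70,
        "Disk Space": sample.disk_used_percent > 75,
        "Network": sample.network_used_percent > 70,
        "File Descriptors": sample.open_fds > 0.75 * sample.fd_ulimit,
        "Swap": sample.swap_used_kb > 0,
    }
    return [name for name, hit in checks.items() if hit]
```

How the readings are collected is left to the monitoring system; the sketch only shows how each row of the table maps to a comparison.
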

The following strings in Riak's log files indicate problems worth alerting on:

| String | Log File | Reason | 
|---|---|---|
| eacces | console.log | File/Directory Permissions Issue | 
| emfile | console.log | Exhausted File Handles | 
| erofs | console.log | File System Mounted in Read-Only Mode | 
| noproc | console.log | Unexpectedly Missing Process | 
| undef | console.log | Missing/Incorrect Erlang Modules | 
| system_limit | console.log | Erlang VM Resource Exhaustion | 
| Compaction error | LevelDB LOGs | LevelDB Compaction Error | 
| waiting | LevelDB LOGs | LevelDB Stalls | 
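
A log watcher for these strings could look like the sketch below. It is an illustration only: `LOG_PATTERNS` and `scan` are hypothetical names, and plain substring matching is an assumption that can produce false positives (for example, `undef` also matches `undefined`).

```python
# Strings to watch for, grouped by the log they appear in (from the table above).
# Erlang reports the POSIX EACCES error as "eacces"; as a substring it also
# matches the longer spelling "eaccess".
LOG_PATTERNS = {
    "console.log": {
        "eacces": "File/Directory Permissions Issue",
        "emfile": "Exhausted File Handles",
        "erofs": "File System Mounted in Read-Only Mode",
        "noproc": "Unexpectedly Missing Process",
        "undef": "Missing/Incorrect Erlang Modules",
        "system_limit": "Erlang VM Resource Exhaustion",
    },
    "LevelDB LOGs": {
        "Compaction error": "LevelDB Compaction Error",
        "waiting": "LevelDB Stalls",
    },
}


def scan(lines, log_kind):
    """Return (line number, matched string, reason) for every hit in lines."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for needle, reason in LOG_PATTERNS[log_kind].items():
            if needle in line:
                hits.append((number, needle, reason))
    return hits
```

For example, `scan(open(path), "console.log")` would scan a console log at a path of your choosing; the returned tuples can then be forwarded to whatever alerting mechanism is in use.
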
The riak_nagios checks are located at http://github.com/basho_labs/riak_nagios. Provided tests include:
- check_connection_pools
- check_file_handle_count
- check_leveldb_compaction
- check_node
- check_node_up
- check_port_count
- check_riak_kv_up
- check_riak_repl
- Acreman, S. (2014, April 12). Monitoring Riak. [Blog post] Retrieved from http://blog.dataloop.io/2014/04/12/monitoring-riak/
  Discusses using CollectD to consume all of the statistics provided by Riak.
- Santero, T. (2013, March 27). Monitoring Riak. [Slide deck] Retrieved from https://speakerdeck.com/tsantero/monitoring-riak
  Provides monitoring suggestions, including system limits and monitoring tools.
Listing all of the statistics and their values
$ riak-admin stat show '*.**'
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
Querying for a specific statistic
$ riak-admin stat show riak.riak_kv.vnode.gets  
 [riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
Querying for child statistics
$ riak-admin stat show 'riak.riak_kv.vnode.gets.*'  
 [riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
Querying for a subtree
$ riak-admin stat show 'riak.riak_kv.vnode.gets.**'  
 [riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
 [riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]
Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
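
To feed these values into a monitoring system, the output lines need to be parsed. The sketch below assumes every statistic line has exactly the shape shown in the examples above (`[name,parts]: [{key,value},...]`); `parse_stat_line` is a hypothetical helper, not something shipped with Riak.

```python
import re

# "[riak,riak_kv,vnode,gets]: [{count,0},{one,0}]"
LINE_RE = re.compile(r"^\s*\[([^\]]+)\]:\s*\[(.*)\]\s*$")
# One "{key,value}" pair inside the value list.
PAIR_RE = re.compile(r"\{([^,{}]+),([^{}]+)\}")


def _number(text):
    """Convert text to int or float when possible, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_stat_line(line):
    """Return (dotted stat name, {key: value}) or None for non-stat lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    name = ".".join(match.group(1).split(","))
    values = {key: _number(raw) for key, raw in PAIR_RE.findall(match.group(2))}
    return name, values
```

Lines that do not match, such as the shell prompt line itself, yield `None` and can simply be skipped.
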
Getting a list of all statistics and their type
$ riak-admin stat info -type '*.**'
 [riak,common,cpu_stats]: type = cpu
 [riak,common,mem_stats]: type = function
 [riak,common,memory_stats]: type = function
 [riak,riak_api,pbc_connects]: type = spiral
 [riak,riak_api,pbc_connects,active]: type = function
 [riak,riak_core,converge_delay]: type = duration
 [riak,riak_core,dropped_vnode_requests_total]: type = counter
 [riak,riak_core,gossip_received]: type = spiral
 [riak,riak_core,handoff_timeouts]: type = counter
 [riak,riak_core,ignored_gossip_total]: type = counter
 [riak,riak_core,rebalance_delay]: type = duration
 ...
This output could be used as input to a script that generates a CollectD configuration.
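
As a starting point for such a script, the sketch below groups statistic names by their reported type. It assumes each line has the `[name,parts]: type = kind` shape shown above; `group_by_type` is a hypothetical helper, and turning the groups into actual CollectD configuration directives is deliberately left out, since that depends on the CollectD plugin in use.

```python
import re
from collections import defaultdict

# "[riak,riak_api,pbc_connects]: type = spiral"
TYPE_RE = re.compile(r"^\s*\[([^\]]+)\]:\s*type\s*=\s*(\S+)\s*$")


def group_by_type(lines):
    """Map each statistic type to the dotted names of the stats of that type."""
    groups = defaultdict(list)
    for line in lines:
        match = TYPE_RE.match(line)
        if match:  # skip the prompt line, "...", and anything else unexpected
            groups[match.group(2)].append(match.group(1).replace(",", "."))
    return dict(groups)
```

The dotted names produced here are in the same form that `riak-admin stat show` accepts, so each group can be queried and mapped to a suitable CollectD data type.
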