According to coreos/prometheus-operator, the following CRDs are available for Prometheus monitoring:
- prometheuses.monitoring.coreos.com
- prometheusrules.monitoring.coreos.com
- servicemonitors.monitoring.coreos.com
- alertmanagers.monitoring.coreos.com
When cluster monitoring is enabled, Rancher provides the following functions:
- Metrics graphs in the UI
- Users can manage the Prometheus Operator CRDs in the Rancher UI and deploy their own Prometheus instances and Prometheus rules.
- Node exporter metrics for project monitoring
Only a cluster admin can enable or disable cluster monitoring in the cluster dashboard. Once it is enabled, Rancher shows metrics graphs in the UI. The following metrics are shown:
Cluster
- CPU
  - CPU usage per second in 2min:
    sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$"}[2m]))
  - CPU user seconds in 2min:
    sum(rate(node_cpu{mode="user"}[2m]))
  - CPU system seconds in 2min:
    sum(rate(node_cpu{mode="system"}[2m]))
  - CPU load in 1min per CPU:
    sum(node_load1) / count(node_cpu{mode="system"})
  - CPU load in 5min per CPU:
    sum(node_load5) / count(node_cpu{mode="system"})
  - CPU load in 15min per CPU:
    sum(node_load15) / count(node_cpu{mode="system"})
- Memory
  - Memory usage percentage:
    1 - node_memory_MemAvailable / node_memory_MemTotal
  - Memory total page-out in bytes per second in 2min:
    1e3 * sum(rate(node_vmstat_pgpgout[2m]))
  - Memory total page-in in bytes per second in 2min:
    1e3 * sum(rate(node_vmstat_pgpgin[2m]))
- Network
  - Network receive/transmit bytes per second in 2min:
    sum(rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    sum(rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
  - Network receive/transmit packets dropped per second in 2min:
    sum(rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    sum(rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
  - Network receive/transmit errors per second in 2min:
    sum(rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    sum(rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
  - Network receive/transmit packets per second in 2min:
    sum(rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    sum(rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
- DiskIO
- FileSystem
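
The cluster network expressions exclude loopback and virtual interfaces through the device!~"lo|veth.*|docker.*|cbr.*|flannel.*" matcher. The following is a minimal Python sketch of which device labels survive that filter. It assumes that PromQL regex matchers are fully anchored (so re.fullmatch is the closest analogue), and the interface names are hypothetical, chosen only for illustration.

```python
import re

# Regex copied from the device!~"..." matcher in the cluster network expressions.
EXCLUDED_DEVICES = r"lo|veth.*|docker.*|cbr.*|flannel.*"


def is_counted(device: str) -> bool:
    """Return True if a series with this device label passes the !~ filter."""
    # PromQL anchors regex matchers at both ends, so the whole label value
    # must match the pattern for the series to be excluded.
    return re.fullmatch(EXCLUDED_DEVICES, device) is None


# Hypothetical interface names, for illustration only.
devices = ["lo", "eth0", "veth1a2b3c", "docker0", "cbr0", "flannel.1", "ens3", "lo0"]
kept = [d for d in devices if is_counted(d)]
print(kept)
```

Because the match is anchored, a device such as lo0 is not excluded by the plain lo alternative; only an exact lo is.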
Node
- CPU
- Memory
- Network
- DiskIO
- FileSystem
Workload
- CPU
- Memory
- Network
- DiskIO
- FileSystem
Pod
- CPU
- Memory
- Network
- DiskIO
- FileSystem
Container
- CPU
- Memory
- Network
- DiskIO
- FileSystem
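
Any of the cluster-level expressions listed earlier can be evaluated outside the Rancher UI through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the request URL and does not send it. The base URL is a placeholder, and how Rancher itself issues these queries is not described here, so treat this as an illustration rather than Rancher's implementation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    # urlencode percent-encodes the braces, quotes and brackets in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Expression taken from the cluster CPU list; the host below is a placeholder.
CLUSTER_CPU_USER = 'sum(rate(node_cpu{mode="user"}[2m]))'
url = instant_query_url("http://prometheus.example:9090/", CLUSTER_CPU_USER)

# Round-trip check: decoding the query string yields the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == CLUSTER_CPU_USER)
```

The same helper works for the memory and network expressions, since only the expr argument changes.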