@orangedeng
Last active August 29, 2018 07:59
Rancher 2.0 Monitoring Design v0.2

Monitoring Design

Custom Resource Definition

Following coreos/prometheus-operator, Prometheus monitoring will use the following CRDs:

  • prometheuses.monitoring.coreos.com
  • prometheusrules.monitoring.coreos.com
  • servicemonitors.monitoring.coreos.com
  • alertmanagers.monitoring.coreos.com
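
To make the role of these CRDs concrete, here is a minimal sketch of a ServiceMonitor object for scraping node exporter, built as a plain Python dict. The `monitoring.coreos.com/v1` API version and the `selector` / `endpoints` fields come from prometheus-operator; the namespace, the `app` label, and the port name are hypothetical placeholders, not values fixed by this design.

```python
import json


def node_exporter_service_monitor(namespace, app_label="node-exporter",
                                  port_name="metrics", interval="30s"):
    """Build a minimal ServiceMonitor manifest as a plain dict.

    Assumption: app_label and port_name are hypothetical; they must match
    the labels and the named port of the Service that fronts node exporter.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": app_label,
            "namespace": namespace,
            "labels": {"app": app_label},
        },
        "spec": {
            # Select Services carrying this label in the same namespace.
            "selector": {"matchLabels": {"app": app_label}},
            # Scrape the named Service port at the given interval.
            "endpoints": [{"port": port_name, "interval": interval}],
        },
    }


# "monitoring" is a hypothetical namespace chosen for illustration.
manifest = node_exporter_service_monitor("monitoring")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with kubectl or submitted through the Kubernetes API once the CRDs above are installed in the cluster.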

Cluster Monitoring

When cluster monitoring is enabled, Rancher provides the following functions:

  • Metrics graphs in the UI
  • Management of Prometheus Operator CRDs in the Rancher UI, so that users can deploy their own Prometheus instances and Prometheus rules
  • Node exporter metrics for project monitoring

Only a cluster admin can enable or disable cluster monitoring from the cluster dashboard. Once it is enabled, Rancher shows metrics graphs in the UI. The following metrics are shown:

Cluster

  • CPU
    • CPU usage per second in 2min: sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$"}[2m]))
    • CPU user seconds in 2min: sum(rate(node_cpu{mode="user"}[2m]))
    • CPU system seconds in 2min: sum(rate(node_cpu{mode="system"}[2m]))
    • CPU load in 1min per cpu: sum(node_load1) / count(node_cpu{mode="system"})
    • CPU load in 5min per cpu: sum(node_load5) / count(node_cpu{mode="system"})
    • CPU load in 15min per cpu: sum(node_load15) / count(node_cpu{mode="system"})
  • Memory
    • Memory usage percentage: 1 - node_memory_MemAvailable / node_memory_MemTotal
    • Memory total page-out in bytes per second in 2min: 1e3 * sum((rate(node_vmstat_pgpgout[2m])))
    • Memory total page-in in bytes per second in 2min: 1e3 * sum((rate(node_vmstat_pgpgin[2m])))
  • Network
    • Network receive/transmit bytes per second in 2min:
      • sum (rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
      • sum (rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    • Network receive/transmit packets dropped per second in 2min:
      • sum (rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
      • sum (rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    • Network receive/transmit errors per second in 2min:
      • sum (rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
      • sum (rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
    • Network receive/transmit packets per second in 2min:
      • sum (rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
      • sum (rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|cbr.*|flannel.*"}[2m]))
  • DiskIO
  • FileSystem
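
The cluster network queries differ only in metric name and share one device filter, so they can be generated rather than written out by hand. The sketch below is one possible way to do that. The helper names are hypothetical, the generated expressions omit the optional space after `sum`, and `query_url` only builds a URL for the standard Prometheus `/api/v1/query` endpoint without sending a request.

```python
from urllib.parse import urlencode

# Device filter copied from the cluster network queries in this document.
DEVICE_FILTER = 'device!~"lo|veth.*|docker.*|cbr.*|flannel.*"'


def network_rate_query(metric, window="2m"):
    """Cluster-wide per-second rate of a node exporter network counter."""
    return f"sum(rate({metric}{{{DEVICE_FILTER}}}[{window}]))"


def load_per_cpu_query(load_metric):
    """Load average divided by CPU count, as in the CPU load queries."""
    return f'sum({load_metric}) / count(node_cpu{{mode="system"}})'


def query_url(base_url, expr):
    """Instant-query URL for the Prometheus HTTP API (no request is made)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


# One entry per direction and counter kind listed under Network.
NETWORK_QUERIES = {
    f"node_network_{direction}_{kind}": network_rate_query(
        f"node_network_{direction}_{kind}")
    for kind in ("bytes", "drop", "errs", "packets")
    for direction in ("receive", "transmit")
}

for name, expr in NETWORK_QUERIES.items():
    print(name, "=>", expr)
```

Keeping the filter in a single constant means a change to the excluded interfaces applies to all eight queries at once.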

Node

  • CPU
  • Memory
  • Network
  • DiskIO
  • FileSystem

Workload

  • CPU
  • Memory
  • Network
  • DiskIO
  • FileSystem

Pod

  • CPU
  • Memory
  • Network
  • DiskIO
  • FileSystem

Container

  • CPU
  • Memory
  • Network
  • DiskIO
  • FileSystem

Project Monitoring

Alerting

Implementation
