Pick any metric documentation snippet in `Documentation/Metrics`.
Check that:
- The help string makes sense as the heading of a Grafana graph.
- The unit is correct.
- The category makes sense and is correct (for the list of categories, see `Documentation/Metrics/template.yaml`).
- The complexity is chosen correctly; the possible values are "simple", "medium" and "advanced". The complexity decides which of the personas we have defined will see the metric in their dashboard. See here for details.
- The list of instance roles which expose the metric is correct. You will have to try it out or look at the code. You can find the name with grep in the code.
- The description is understandable and describes well what the metric measures.
- If applicable, there is a `threshold` entry which describes what value ranges are expected and normal, and what should be considered exceptional.
- If there is a `threshold` entry, there should also be a `troubleshoot` entry.
- Try out the metric with a Prometheus instance attached, open the generated dashboard, and check that the graphs make sense and are understandable.
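To check the list of instance roles, one quick way (a sketch, not an official procedure) is to scrape each instance's metrics endpoint and grep for the metric name. The sample response below is hard-coded so the snippet runs standalone; in practice you would replace it with `curl -s http://localhost:<port>/_admin/metrics/v2` against each instance's port:

```shell
# Hypothetical sketch: decide whether an instance exposes a metric by
# grepping its metrics response for the metric name.
metric="arangodb_shards_number"

# Stand-in for: curl -s "http://localhost:8529/_admin/metrics/v2"
response='# HELP arangodb_shards_number Number of shards
# TYPE arangodb_shards_number gauge
arangodb_shards_number 5'

# Match the metric name at line start, followed by a space or a label brace,
# so that e.g. arangodb_shards_number does not also match
# arangodb_shards_number_something_else.
if printf '%s\n' "$response" | grep -q "^${metric}[ {]"; then
  echo "metric exposed"
else
  echo "metric not exposed"
fi
```

Repeating this against a Coordinator, a DB-Server and an Agent port tells you which roles actually expose the metric; cross-check the result against the role list in the documentation snippet.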
Examples which I have already done:
- `arangodb_shards_number`
- `arangodb_shards_out_of_sync`
- `arangodb_shards_not_replicated`
- `arangodb_shards_leader_number`
- `arangodb_sync_wrong_checksum_total`
- `arangodb_agencycomm_request_time_msec`
- `arangodb_agency_read_no_leader_total`
Instructions for Prometheus and Grafana:
See this blog post for instructions on setting up metrics. For 3.8, you want to use the path `/_admin/metrics/v2` instead of `/_admin/metrics` from the article. Here is the `prometheus.yaml` which I use for my `startLocalCluster.sh` deployments:
```yaml
# Sample config for Prometheus.
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:
    monitor: 'example'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 5s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    # If prometheus-node-exporter is installed, grab stats about the local
    # machine by default.
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'arangodb'
    metrics_path: /_admin/metrics/v2
    params:
      version: ["2"]
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: [
          'localhost:8529',
          'localhost:8530', 'localhost:8531', 'localhost:8532',
          'localhost:8629', 'localhost:8630', 'localhost:8631',
          'localhost:4001', 'localhost:4002', 'localhost:4003' ]
```
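With that file saved as `prometheus.yaml`, the setup can be tried out roughly like this. This is a sketch under two assumptions: the `prometheus` binary is on your PATH, and a cluster from `startLocalCluster.sh` is running on the ports listed above:

```shell
# Start Prometheus with the config above.
prometheus --config.file=prometheus.yaml &

# After a scrape interval or two, list the targets Prometheus sees;
# the ArangoDB instances should report "health": "up".
curl -s http://localhost:9090/api/v1/targets

# Spot-check one metric directly against a Coordinator.
# The trailing `|| true` keeps the sketch from aborting when no
# cluster is running locally.
curl -s http://localhost:8529/_admin/metrics/v2 | grep '^arangodb_shards_number' || true
```

Once the targets are up, point Grafana at Prometheus as a data source and check the generated dashboard for the metric you are reviewing.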