Tempo Operations

This document summarizes the operational aspects of Tempo.

Instrumentation

Tempo is instrumented and can be monitored in three ways:

It exposes Prometheus metrics.
It emits logs in key=value format.
The read and write paths are instrumented for tracing with the Jaeger SDK (https://grafana.com/docs/tempo/latest/operations/monitoring/#traces).
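Because the logs are emitted as key=value pairs, they are easy to turn into structured records. The following is a minimal sketch, assuming values that contain spaces are double-quoted. The sample line is hypothetical and not taken from a real Tempo deployment, and parse_logfmt is an illustrative helper, not part of Tempo.

```python
import shlex


def parse_logfmt(line):
    """Parse one key=value log line into a dict (sketch, not a full logfmt parser)."""
    fields = {}
    # shlex.split keeps a quoted value such as msg="a b" together as one token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that have no "=" at all
            fields[key] = value
    return fields


# Hypothetical log line, for illustration only.
sample = 'level=info caller=main.go:1 msg="Starting Tempo"'
print(parse_logfmt(sample))
```

A line containing unbalanced quotes makes shlex.split raise ValueError, so a production parser would need to handle that case.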

For the exposed metrics, Tempo provides Grafana dashboards for write metrics, read metrics, and resources. An additional dashboard with the main metrics for operating Tempo is also provided: Cluster dev dashboard.

Some relevant metrics

A couple of metrics to check:

For issues with the ingestion of data into Tempo:

tempo_distributor_spans_received_total
tempo_ingester_traces_created_total

If tempo_ingester_traces_created_total stays at 0 while the distributors are receiving spans, a possible cause is a network issue between the distributors and the ingesters.
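The check described above can be sketched in a few lines. This example assumes the input is the combined text of the /metrics output from the distributors and the ingesters in the Prometheus text exposition format. The sample values and the function names parse_metrics and ingestion_status are illustrative only.

```python
def parse_metrics(text):
    """Sum sample values per metric name from Prometheus text exposition format."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def ingestion_status(totals):
    """Compare the two ingestion counters and return a short diagnosis."""
    spans = totals.get("tempo_distributor_spans_received_total", 0.0)
    traces = totals.get("tempo_ingester_traces_created_total", 0.0)
    if spans == 0:
        return "no spans received by distributors"
    if traces == 0:
        return "spans received but no traces created: check distributor-to-ingester connectivity"
    return "ok"


# Illustrative sample, not real output.
SAMPLE = """
# TYPE tempo_distributor_spans_received_total counter
tempo_distributor_spans_received_total{tenant="single-tenant"} 1200
tempo_ingester_traces_created_total{tenant="single-tenant"} 0
"""
print(ingestion_status(parse_metrics(SAMPLE)))
```

Both metrics are counters, so in a live system it is usually more informative to compare their rate over a time window than their absolute values.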

Alerts

Tempo provides preconfigured alerts for the main metrics that need attention:

Ring metrics that could indicate an unhealthy component (cortex_ring_members{state="Unhealthy", name="compactor", namespace=~".*"})
Errors in compaction (tempodb_compaction_errors_total)
Errors when flushing from the ingesters to the backend (tempo_ingester_failed_flushes_total)

Alerts definition: https://github.com/grafana/tempo/blob/main/operations/tempo-mixin-compiled/alerts.yaml
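The metrics behind these alerts can also be inspected by hand through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the query URLs. The Prometheus address and the 1h window are assumptions, and the expressions are simplified illustrations rather than the exact rules from alerts.yaml.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your Prometheus address
WINDOW = "1h"  # assumption: illustrative look-back window, not taken from the alert rules

QUERIES = {
    # Expression taken from the list above.
    "unhealthy_compactors": 'cortex_ring_members{state="Unhealthy", name="compactor", namespace=~".*"}',
    # Counters wrapped in increase() to show growth over the window.
    "compaction_errors": f"increase(tempodb_compaction_errors_total[{WINDOW}])",
    "failed_flushes": f"increase(tempo_ingester_failed_flushes_total[{WINDOW}])",
}


def query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


for name, expr in QUERIES.items():
    print(name, query_url(expr))
```

Each URL can be fetched with any HTTP client. A non-empty result for the ring query, or a non-zero increase for the two counters, is worth investigating.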

Runbook

For some situations, Tempo provides a runbook that suggests actions to take: https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md

Backend storage

https://github.com/os-observability/perf-test-tempo-opensearch/blob/main/resources-tempo/helm-values/tempo-helm-values.yaml#L27

How to monitor in K8s

https://github.com/os-observability/perf-test-tempo-opensearch/blob/4d51b2ffa51fbf7ad4b7ae8a4846ef7bccdf79f9/openshift-tempo/monitoring.yaml