This document summarizes the operational aspects of Tempo.
Tempo is instrumented and can be monitored in three ways:
- It exposes Prometheus metrics.
- It emits logs in key=value format.
- The read and write paths are instrumented for tracing with the Jaeger SDK (https://grafana.com/docs/tempo/latest/operations/monitoring/#traces).
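Because the logs are emitted as key=value pairs, they are easy to process with a small script. The following is a minimal sketch, not Tempo tooling: `parse_logfmt` is a hypothetical helper and the sample line is invented for illustration. It relies on `shlex.split` to keep quoted values together, so a line with unbalanced quotes would raise a `ValueError`.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Parse one key=value log line into a dict (hypothetical helper)."""
    fields: dict[str, str] = {}
    # shlex.split keeps a quoted value such as msg="two words" as one token
    # and strips the surrounding quotes.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore bare words that carry no '='
            fields[key] = value
    return fields


# Invented sample line, not copied from a real Tempo instance.
sample = 'level=error caller=flush.go:10 msg="failed to flush" err=timeout'
record = parse_logfmt(sample)
print(record["level"], "-", record["msg"])
```

Filtering on `record["level"]` is then a plain dictionary lookup.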
For the exposed metrics, Tempo provides Grafana dashboards covering write metrics, read metrics, and resources. An additional dashboard gathers the main metrics needed to operate Tempo (see the cluster dev dashboard).
A couple of metrics to check when diagnosing issues with the ingestion of data into Tempo:
- tempo_distributor_spans_received_total
- tempo_ingester_traces_created_total
If the value of tempo_ingester_traces_created_total is 0, a possible cause is a network issue between the distributors and the ingesters.
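The check above can be sketched as a small decision function. This is only an illustration: `diagnose_ingestion` is a hypothetical name, and it assumes the two inputs are per-second rates of the counters (for example obtained with PromQL `rate(...)` over a recent window) rather than the raw counter values. The first branch, where no spans arrive at all, is an added assumption and not part of the guidance above.

```python
def diagnose_ingestion(spans_received_rate: float, traces_created_rate: float) -> str:
    """Classify the ingestion state from two rates (hypothetical helper).

    spans_received_rate: rate of tempo_distributor_spans_received_total
    traces_created_rate: rate of tempo_ingester_traces_created_total
    """
    if spans_received_rate <= 0:
        # Assumption: nothing reaches the distributors, so the ingesters
        # cannot be blamed yet.
        return "no spans received by distributors"
    if traces_created_rate <= 0:
        # Spans arrive but no traces are created: the case described above.
        return "possible network issue between distributors and ingesters"
    return "ingestion looks healthy"


print(diagnose_ingestion(spans_received_rate=120.0, traces_created_rate=0.0))
```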
Tempo also provides preconfigured alerts for the main metrics that need attention:
- Ring metrics that could indicate an unhealthy component:
  cortex_ring_members{state="Unhealthy", name="compactor", namespace=~".*"}
- Errors in compaction:
  tempodb_compaction_errors_total
- Errors when flushing from the ingesters to the backend:
  tempo_ingester_failed_flushes_total
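To inspect these metrics by hand, one option is the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URLs and sends nothing. The base URL is a placeholder and `instant_query_url` is a hypothetical helper. The expressions are the raw metrics listed above; the packaged alert rules may wrap them in rates and thresholds.

```python
from urllib.parse import urlencode

# Placeholder address; replace with the Prometheus instance that scrapes Tempo.
PROMETHEUS_URL = "http://prometheus.example:9090"

# Raw metric expressions from the list above, keyed by a short description.
CHECKS = {
    "unhealthy compactor ring members":
        'cortex_ring_members{state="Unhealthy", name="compactor", namespace=~".*"}',
    "compaction errors": "tempodb_compaction_errors_total",
    "failed ingester flushes": "tempo_ingester_failed_flushes_total",
}


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for expr (hypothetical helper)."""
    query_string = urlencode({"query": expr})  # percent-encodes braces, quotes, etc.
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


for name, expr in CHECKS.items():
    print(name, "->", instant_query_url(PROMETHEUS_URL, expr))
```

The resulting URLs can be opened with any HTTP client that has access to the Prometheus instance.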
Alert definitions: https://github.com/grafana/tempo/blob/main/operations/tempo-mixin-compiled/alerts.yaml
For some situations, Tempo provides a runbook that suggests actions to take: https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md