When investigating #2775, the TC decided to look into expanding the notion of Resource to include Entity.
During this discussion, we identified a lot of hard, challenging problems OpenTelemetry must tackle going forward, including:
- Telemetry Identity evolving as "scope" increases. E.g. a Jaeger instance running in a single k8s cluster may not need to know the identity of the k8s cluster, as there's only one. However, a datastore spanning multiple k8s clusters WILL need this information.
- A simple "Service" model for OpenTelemetry SDKs (e.g. requiring the `service.name` attribute) works in the short run, but begins to struggle in large distributed systems.
- Defining a consistent guideline on what resource / entity means in practice, how to choose one and what our Entity <-> Signal modelling needs to look like in the long run.
We do believe that an Entity model that allows identity enrichment is the right path forward. These problems are large, thorny, and require some deep thought and attention. However, halting progress on instrumentation to tackle them could halt OpenTelemetry's forward momentum and put instrumentation efforts at great risk. We expect solving (all of) these problems fully to be on the order of years, not months or weeks.
This proposal aims to unblock language instrumentation driven through SDKs (i.e. not opentelemetry-collector). Specifically, if accepted, this proposal would allow the continuation of:
- Trace Instrumentation Semantic conventions (HTTP, RPC, Messaging, etc.)
- Metric Instrumentation Semantic Conventions (HTTP, RPC, Java, etc. but not Process, host, etc.)
- RUM / Client-side Instrumentation
In addition, this should allow progress towards Logging semantic conventions and community-convergence discussions with Elastic Common Schema.
The proposal is split into a few components, but hinges on requiring all SDKs to use a "service" as their de facto Resource and identity for metrics. It involves the following tasks / changes to the OpenTelemetry Specification:
- Update Service resource Semantic Conventions to require SDKs to provide service.instance.id
- Update Resource SDK specification to require Service resource attributes to be discovered first
- Update OpenTelemetry Metrics Data Model such that only identifying attributes in a Resource participate in time series identity
  - For the `Service` resource this would include `service.name` and `service.instance.id` and `service.namespace` when present.
  - Allow other resource types to be defined w/ identity on an ad-hoc / necessity basis. This should only be done to unblock major instrumentation efforts and when a forwards-compatible / "fixable" / "future-proof" design can be made.
- Update Semantic Conventions to include a "sharable" flag for attributes to indicate applicability of sharing an attribute between metrics and other signals.
  - This flag should denote whether the expected cardinality of an attribute is acceptable for most metric backends.
  - This will help prevent issues like java-instrumentation#5307, where `http.url` (a high cardinality label) was encoded in latency metrics.
- Allow Prometheus Metrics Exporters to:
  - Drop `service.*` resource attributes, as is the expectation in Prometheus, where service discovery will provide these.
  - Ensure OTLP => Prometheus-Remote-Write leverages these identifying metrics.
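As a rough illustration of the three metric-related changes above (identifying attributes, the "sharable" flag, and dropping `service.*` in Prometheus exporters), the following Python sketch may help. It is not part of any OpenTelemetry SDK: the function names, the `SHARABLE_WITH_METRICS` table and its values, and the choice to treat unlisted attributes as not sharable are all assumptions made for illustration.

```python
from typing import Dict, Mapping, Tuple

# Identifying keys for a Service resource, as listed in this proposal.
SERVICE_IDENTIFYING_KEYS = (
    "service.namespace",
    "service.name",
    "service.instance.id",
)

# Hypothetical stand-in for the proposed "sharable" flag. The entries are
# illustrative only and are not taken from any published convention.
SHARABLE_WITH_METRICS = {
    "http.method": True,
    "http.status_code": True,
    "http.url": False,  # high cardinality, cf. java-instrumentation#5307
}


def timeseries_resource_identity(
    resource: Mapping[str, str],
) -> Tuple[Tuple[str, str], ...]:
    """Keep only the identifying Service attributes that are present."""
    return tuple(
        (key, resource[key])
        for key in SERVICE_IDENTIFYING_KEYS
        if key in resource
    )


def metric_attributes(attributes: Mapping[str, str]) -> Dict[str, str]:
    """Keep only attributes flagged as sharable with metrics.

    Assumption: attributes without a flag are treated as not sharable.
    """
    return {
        key: value
        for key, value in attributes.items()
        if SHARABLE_WITH_METRICS.get(key, False)
    }


def prometheus_resource_labels(resource: Mapping[str, str]) -> Dict[str, str]:
    """Drop service.* attributes, assuming service discovery supplies them."""
    return {
        key: value
        for key, value in resource.items()
        if not key.startswith("service.")
    }
```

With a resource of `{"service.name": "checkout", "service.instance.id": "627cc493", "host.name": "node-1"}`, only the two `service.*` pairs would participate in time series identity under this sketch, while a Prometheus exporter following the rule above would keep only `host.name` from the resource.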
The RUM / Client-instrumentation SIG has already raised concerns over forcing a "Service" abstraction everywhere. We also know this is a concern for the OpenTelemetry Collector, e.g. the `hostmetricsreceiver`. We intend to lift this restriction as progress is made on the underlying issues around Entity, Identity and topology of signal-generators. However, we see aligning on a simplistic model as a first step towards unblocking instrumentation: one that is in line with common industry standards and that we can evolve over time to address these concerns.