This doc compares the capabilities of popular telemetry sampling systems. The dimensions compared:
Temporal resolution: The time window over which the limiting occurs. E.g., limit the number of:
- Spans per second
- Spans per calendar month
Degree of limiting: In a steady state with spans created at a rate R spans/s greater than the desired limit (see the sketch after this list),
- hard limiting: throughput = limit
- soft limiting: E[throughput] = limit
Horizontally scalable: Is the desired limit enforced per-sampler, or is it a global limit?
- Yes: Global
- No: Per-sampler
Responsiveness: How quickly does the system return to steady state when perturbed (i.e., when R changes)?
Supports statistical estimation: Modifies span metadata such that post hoc analysis can compute unbiased estimates from the data ("count the spans").
- Yes
- No
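To make the hard/soft distinction concrete, here is a minimal sketch in Go (invented names, not taken from any of the systems below): a token bucket keeps throughput at exactly the limit, while coin-flip sampling at probability limit/R only matches the limit in expectation.

```go
package main

import (
	"fmt"
	"math/rand"
)

// hardLimit models a token bucket with capacity `limit`, refilled once per
// second: at most `limit` spans pass per second, so throughput == limit.
func hardLimit(spansPerSecond, limit int) int {
	kept := 0
	tokens := limit
	for i := 0; i < spansPerSecond; i++ {
		if tokens > 0 {
			tokens--
			kept++
		}
	}
	return kept
}

// softLimit keeps each span with probability limit/R, so only the
// expectation of the throughput equals the limit.
func softLimit(spansPerSecond, limit int) int {
	p := float64(limit) / float64(spansPerSecond)
	kept := 0
	for i := 0; i < spansPerSecond; i++ {
		if rand.Float64() < p {
			kept++
		}
	}
	return kept
}

func main() {
	const r, limit = 10000, 100 // R = 10000 spans/s, desired limit = 100 spans/s
	fmt.Println("hard:", hardLimit(r, limit)) // always exactly 100
	fmt.Println("soft:", softLimit(r, limit)) // ~100 on average, varies second to second
}
```

The per-second variance of the soft limiter is what makes "E[throughput] = limit" a weaker guarantee than "throughput = limit".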
- supports estimation: No
- limiting:
  - temporal resolution: Spans per second
  - degree of limiting: Hard
  - horizontally scalable: No
  - responsiveness: < 1 s (token buckets are replenished each second)
The tailsampling processor implements a `ratelimiting` policy (src) equivalent to a token bucket with a capacity of `spans_per_second` many tokens, replenished every second. Sampling a trace costs `trace.SpanCount` many tokens. Support for updating span p-values has been requested in #7962.
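A rough sketch of that decision logic, assuming a single bucket refilled once per second (the type and field names here are invented, not the processor's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// rateLimiter approximates the ratelimiting policy: a token bucket whose
// capacity is spans_per_second, fully replenished once per second.
type rateLimiter struct {
	spansPerSecond int64
	tokens         int64
	lastRefill     time.Time
}

// decide charges spanCount tokens for the trace; if the bucket has run dry
// this second, the trace is not sampled.
func (r *rateLimiter) decide(now time.Time, spanCount int64) bool {
	if now.Sub(r.lastRefill) >= time.Second {
		r.tokens = r.spansPerSecond // refill the bucket each second
		r.lastRefill = now
	}
	if r.tokens < spanCount {
		return false // over the per-second budget: drop
	}
	r.tokens -= spanCount
	return true
}

func main() {
	rl := &rateLimiter{spansPerSecond: 1000, tokens: 1000, lastRefill: time.Now()}
	fmt.Println(rl.decide(time.Now(), 600)) // true: 400 tokens left this second
	fmt.Println(rl.decide(time.Now(), 600)) // false: not enough tokens remain
}
```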
It also has a `composite` policy, which is characterized by a sequence of sub-policies, each of which is subject to individual token bucket limiting. Each bucket's capacity is computed as a share of an overall `max_total_spans_per_second`, but otherwise the decisions are identical to those made by `ratelimiting` (src).
The `composite` policy introduces the concept of "allocating bandwidth" (span throughput) to different families of traces. See the design doc linked from open-telemetry/opentelemetry-collector-contrib#1410.
If there's more than one otelcol instance in the system, then in order to guarantee complete traces you need to ensure that all spans of a given trace are routed to the same otelcol instance. One way to do that is with the loadbalancing exporter.
References
- open-telemetry/opentelemetry-collector-contrib#4758
- aggregate processor, described in https://grafana.com/blog/2020/06/18/how-grafana-labs-enables-horizontally-scalable-tail-sampling-in-the-opentelemetry-collector/
- Issue associated with the loadbalancing exporter
- Issue associated with the tailsampling processor's `composite` policy
- supports estimation: No
- limiting (`sampler.type == 'ratelimiting'`):
  - temporal resolution: Traces per second
  - degree of limiting: Hard
  - horizontally scalable: No
  - responsiveness: < 1 s (token buckets are replenished each second)
- limiting (`SAMPLING_CONFIG_TYPE == 'adaptive'`):
  - temporal resolution: Traces per second
  - degree of limiting: Soft (typically) or none (if data is generated at a high enough volume for `--sampling.min-sampling-probability` to overtake `--sampling.target-samples-per-second`)
  - horizontally scalable: Yes
  - responsiveness: Configurable (at most jaeger-client's polling interval + jaeger-collector's `--sampling.calculation-interval`)
Jaeger SDKs (jaeger-client) get sampling policy in various ways:
- local: hardcoded `AlwaysOn`, `AlwaysOff`, `probability` (static p), or `ratelimiting` (token bucket; parameter: maximum samples per second). No stratification.
- remote, `file`: per-stratum `probability` or `ratelimiting`. jaeger-collector reloads from the filesystem or a URL; clients poll jaeger-agent, which proxies requests to jaeger-collector.
- remote, `adaptive`: each stratum has a target throughput plus some minimums. jaeger-collector maintains the policy based on the spans it has received; clients poll jaeger-agent, which proxies requests to jaeger-collector.
- The first two options use local memory for `ratelimiting`. The third option has cluster-level coordination.
- Spans are stratified by a list of priority-ordered rules: (Service name, Span name) > Span name default > (Service name) > global default.
- In `adaptive`, many jaeger-collectors write strata statistics to shared storage. From this data, every jaeger-collector can independently calculate the whole-system stats needed to adjust sampling probabilities. A collector reads statistics (from a configurable number of epochs back; 1 by default), combines them to get whole-cluster strata stats, and recalculates new per-stratum sampling probabilities (see the sketch after this list). Defaults:
  - stratum sampling probability: initial (1 in 1,000), minimum (1 in 100,000)
  - stratum throughput: target (1 /s), minimum (1 /min)
- Because collectors receive spans, clients don't need to explicitly send statistics themselves (contrast w/ X-Ray, whose sampling and collection APIs are independent).
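The sketch below illustrates that per-stratum recalculation in simplified form; the actual jaeger-collector implementation also damps changes between epochs and aggregates statistics across collectors, and the function and variable names here are illustrative.

```go
package main

import "fmt"

// adaptiveUpdate is a simplified per-stratum probability recalculation:
// scale the current probability toward the target throughput and clamp it
// at the configured floor.
func adaptiveUpdate(currentP, observedTPS, targetTPS, minP float64) float64 {
	if observedTPS <= 0 {
		return currentP // no traffic observed this epoch: leave the stratum alone
	}
	p := currentP * targetTPS / observedTPS
	if p < minP {
		p = minP // floor analogous to --sampling.min-sampling-probability
	}
	if p > 1 {
		p = 1
	}
	return p
}

func main() {
	// A stratum sampled 50 traces/s at p = 0.001 while the target is 1 trace/s,
	// so its probability shrinks by 50x (to 0.00002) for the next epoch.
	fmt.Println(adaptiveUpdate(0.001, 50, 1, 0.00001))
}
```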
- supports estimation: No
- limiting:
  - temporal resolution: Traces per second
  - degree of limiting: Soft
  - horizontally scalable: Yes
  - responsiveness: < 10 s (token buckets are replenished via `GetSamplingTargets` requests, which occur every 10 s by default)
Each actor performing sampling sends statistics to a central API describing how many spans it's seen in a period. At least two SDKs (Java, Go) have contrib `Sampler` implementations that obtain sampling configuration from AWS X-Ray. Like Jaeger's `adaptive` remote sampling, X-Ray serves advisory sampling policies to clients. An X-Ray-based sampling system behaves like so (on average):
- Define a rule as a triple: a predicate over span attributes, a token bucket (e.g.), and a number in [0, 1] called the rule's fixed rate.
- Define the global sampling policy as an ordered collection of rules.
- Given a root span in need of a sampling decision:
  - Match the span to the first rule whose predicate it satisfies.
  - If the token bucket contains at least 1 token, deduct 1 token from the bucket and sample the span and its descendants.
  - Else, sample with probability equal to the matched rule's fixed rate.
Docs refer to "reservoirs", which are per-rule token buckets: https://github.com/open-telemetry/opentelemetry-java-contrib/blob/42818333e243682bb50e510f4f91381016f61f71/aws-xray/src/main/java/io/opentelemetry/contrib/awsxray/SamplingRuleApplier.java#L272. Actors doing sampling are dynamically allotted portions of the desired reservoir size (token bucket capacity), called `ReservoirQuota` in the `GetSamplingTargets` API response (docs).
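A sketch of the decision procedure above (illustrative types and names, not the SDK's actual code; in practice the reservoir counts would be refreshed each second from the ReservoirQuota returned by GetSamplingTargets):

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// rule is the triple described above: a predicate over span attributes,
// a token bucket (the "reservoir"), and a fixed rate in [0, 1].
type rule struct {
	matches   func(attrs map[string]string) bool
	reservoir int64   // tokens remaining this second
	fixedRate float64 // probability used once the reservoir is exhausted
}

// shouldSample matches the root span against the ordered rules and applies
// reservoir-then-fixed-rate logic for the first rule that matches.
func shouldSample(rules []*rule, attrs map[string]string) bool {
	for _, r := range rules {
		if !r.matches(attrs) {
			continue
		}
		if r.reservoir >= 1 {
			r.reservoir-- // take a token: sample
			return true
		}
		return rand.Float64() < r.fixedRate
	}
	return false // no rule matched (a real setup ends with a catch-all rule)
}

func main() {
	rules := []*rule{
		{
			matches:   func(a map[string]string) bool { return strings.HasPrefix(a["http.target"], "/health") },
			reservoir: 0, fixedRate: 0, // never sample health checks
		},
		{
			matches:   func(a map[string]string) bool { return true }, // catch-all
			reservoir: 1, fixedRate: 0.05,
		},
	}
	fmt.Println(shouldSample(rules, map[string]string{"http.target": "/checkout"})) // true: uses the reservoir token
	fmt.Println(shouldSample(rules, map[string]string{"http.target": "/health"}))   // false
}
```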
- supports estimation: Yes, via span attribute `SampleRate`, value = N in "1-in-N" (feature request to support p-value here)
- limiting (`EMADynamicSampler`):
  - temporal resolution: Spans per second
  - degree of limiting: Soft
  - horizontally scalable: No (limiting is per Refinery node)
  - responsiveness: Configurable as `AdjustmentInterval`
- limiting (`TotalThroughputSampler`):
  - temporal resolution: Spans per second
  - degree of limiting: Hard
  - horizontally scalable: No (limiting is per Refinery node)
  - responsiveness: Configurable as `ClearFrequencySec`
Refinery horizontally scales by forwarding spans to the appropriate node as necessary. The node that ought to handle a given trace is determined via consistent hashing of the trace ID (src). Peers are either discovered via Redis or listed in Refinery's configuration file (docs).
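A minimal sketch of that trace-to-peer assignment, using a plain hash rather than Refinery's actual consistent-hash implementation (peer addresses are invented):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ownerOf deterministically maps a trace ID to one peer so that every span of
// the trace lands on the same node. This is a simplification: a real
// consistent-hash scheme also minimizes how many traces move when the peer
// list changes.
func ownerOf(traceID string, peers []string) string {
	sort.Strings(peers) // all nodes must agree on the peer ordering
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return peers[h.Sum32()%uint32(len(peers))]
}

func main() {
	peers := []string{"refinery-0:8081", "refinery-1:8081", "refinery-2:8081"}
	for _, id := range []string{"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"} {
		// Any node receiving a span of this trace forwards it to the same owner.
		fmt.Println(id, "->", ownerOf(id, peers))
	}
}
```

The same idea underlies the loadbalancing exporter mentioned earlier: as long as every node maps a given trace ID to the same owner, each trace is buffered and decided in exactly one place.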
Not set-it-and-forget-it: as a system's rate of telemetry production increases over time, either `GoalSampleRate` or the Honeycomb events-per-month quota will need to be adjusted.
- limiting: Support all of spans per second, spans per month, and GB per month (approximated)
- degree of limiting: Soft is ok
- horizontally scalable: Yes
- Prioritize tail sampling in Collector over head sampling in SDK
- Strive for a configuration that is "set it and forget it" (notwithstanding ad hoc changes to aid in investigation or incident response)
I think this is very much incorrect. On the contrary, we went to great lengths to make sure that probabilistic sampling is the prevailing mode, where trace_weight = 1 / p, and p is captured on the root span. There are various rate-limiting capabilities, but they are mostly for overload protection and other edge cases. E.g. with adaptive sampling, you do not expect the rate limiters to ever fire.