This doc compares the capabilities of popular telemetry sampling systems. The dimensions compared are:
Temporal resolution: The time window over which the limit is enforced. E.g., limit the number of...
- Spans per second
- Spans per calendar month
Degree of limiting: In a steady state with spans created at a rate R span/s that is greater than the desired limit,
- hard limiting: throughput = limit
- soft limiting: E[throughput] = limit
Horizontally scalable: Is the desired limit enforced per-sampler, or is it a global limit?
- Yes: Global
- No: Per-sampler
Responsiveness: How quickly does the system return to steady state when perturbed (i.e., when R changes)?
Supports statistical estimation: Modifies span metadata such that post hoc analysis can compute unbiased estimates from the data ("count the spans").
- Yes
- No
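For a concrete sense of "count the spans" (using the `SampleRate` convention Honeycomb uses, described later in this doc): if each kept span records the N of its 1-in-N sampling decision, summing those Ns over the kept spans gives an unbiased estimate of how many spans were originally produced. A minimal sketch:

```go
// "Count the spans": each kept span carries the N of its 1-in-N sampling
// decision (e.g., Honeycomb's SampleRate attribute), so summing the Ns over
// kept spans gives an unbiased estimate of the original span count.
package main

import "fmt"

func estimateOriginalSpanCount(sampleRates []int) int {
	total := 0
	for _, n := range sampleRates {
		total += n // each kept span stands in for n original spans on average
	}
	return total
}

func main() {
	kept := []int{10, 10, 1, 100} // SampleRate values on four kept spans
	fmt.Println(estimateOriginalSpanCount(kept)) // estimate: 121 spans created
}
```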
- supports estimation: No
- limiting:
  - temporal resolution: Spans per second
  - degree of limiting: Hard
  - horizontally scalable: No
  - responsiveness: < 1 s (token buckets are replenished each second)
The tailsampling processor implements a `ratelimiting` policy (src) equivalent to a token bucket with a capacity of `spans_per_second` tokens, replenished every second. Sampling a trace costs `trace.SpanCount` tokens. Support for updating span p-values has been requested in #7962.
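As a rough sketch (illustrative names, not the processor's actual code), the `ratelimiting` decision described above amounts to a per-second token bucket where admitting a trace costs its span count:

```go
// Hypothetical sketch of a ratelimiting-style decision: a token bucket with
// capacity spans_per_second, refilled once per second; sampling a trace
// costs trace.SpanCount tokens. Names are illustrative only.
package main

import "fmt"

type tokenBucket struct {
	capacity int64 // corresponds to spans_per_second
	tokens   int64
}

// refill runs once per second and tops the bucket back up to capacity.
func (b *tokenBucket) refill() {
	b.tokens = b.capacity
}

// admit reports whether a trace with the given span count is sampled,
// deducting that many tokens when it is.
func (b *tokenBucket) admit(spanCount int64) bool {
	if b.tokens >= spanCount {
		b.tokens -= spanCount
		return true
	}
	return false
}

func main() {
	b := &tokenBucket{capacity: 100}
	b.refill()
	fmt.Println(b.admit(40)) // true: 60 tokens remain
	fmt.Println(b.admit(70)) // false: not enough tokens until the next refill
}
```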
It also has a `composite` policy, which is characterized by a sequence of sub-policies, each of which is subject to individual token-bucket limiting. Each bucket's capacity is computed as a share of an overall `max_total_spans_per_second`, but otherwise the decisions are identical to those made by `ratelimiting` (src).
It introduces the concept of allocating "bandwidth" (span throughput) to different families of traces. See the design doc linked from open-telemetry/opentelemetry-collector-contrib#1410.
If there is more than one otelcol instance in the system, then to guarantee complete traces you must ensure that all spans of a given trace are routed to the same otelcol instance. One way to do that is with the loadbalancing exporter.
References
- open-telemetry/opentelemetry-collector-contrib#4758
- aggregate processor, described in https://grafana.com/blog/2020/06/18/how-grafana-labs-enables-horizontally-scalable-tail-sampling-in-the-opentelemetry-collector/
- Issue associated with the loadbalancing exporter
- Issue associated with the tailsampling processor's `composite` policy
- supports estimation: No
- limiting (`sampler.type == 'ratelimiting'`):
  - temporal resolution: Traces per second
  - degree of limiting: Hard
  - horizontally scalable: No
  - responsiveness: < 1 s (token buckets are replenished each second)
- limiting (`SAMPLING_CONFIG_TYPE == 'adaptive'`):
  - temporal resolution: Traces per second
  - degree of limiting: Soft (typically) or none (if data is generated at a high enough volume for `--sampling.min-sampling-probability` to overtake `--sampling.target-samples-per-second`)
  - horizontally scalable: Yes
  - responsiveness: Configurable (at most jaeger-client's polling interval + jaeger-collector's `--sampling.calculation-interval`)
Jaeger SDKs (jaeger-client) get sampling policy in various ways:
- local: hardcoded `AlwaysOn`, `AlwaysOff`, `probability` (static p), or `ratelimiting` (token bucket; parameter: maximum samples per second). No stratification.
- remote, `file`: per-stratum `probability` or `ratelimiting`. jaeger-collector reloads the policy from the filesystem or a URL; clients poll jaeger-agent, which proxies requests to jaeger-collector.
- remote, `adaptive`: each stratum has a target throughput plus some minimums. jaeger-collector maintains the policy based on the spans it has received; clients poll jaeger-agent, which proxies requests to jaeger-collector.
- The first two options use local memory for `ratelimiting`. The third option has cluster-level coordination.
- Spans are stratified by a list of priority-ordered rules: (Service name, Span name) > Span name default > (Service name) > global default.
- In `adaptive`, many jaeger-collectors write strata statistics to shared storage. From this data, every jaeger-collector can independently calculate the whole-system stats needed to adjust sampling probabilities. A collector reads statistics (from a configurable number of epochs back; 1 by default), combines them to get whole-cluster strata stats, and recalculates new per-stratum sampling probabilities (a simplified sketch follows this list). Defaults:
  - stratum sampling probability: initial (1 in 1,000), minimum (1 in 100,000)
  - stratum throughput: target (1 /s), minimum (1 /min)
- Because collectors receive spans, clients don't need to explicitly send statistics themselves (contrast with X-Ray, whose sampling and collection APIs are independent).
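The simplified sketch mentioned above: a hypothetical, stripped-down version of the per-stratum recalculation (the real jaeger-collector algorithm is more involved, e.g., it dampens changes between epochs). The flag names and defaults are the ones already cited: `--sampling.target-samples-per-second` and `--sampling.min-sampling-probability`.

```go
// Simplified sketch of adaptive per-stratum probability adjustment: scale the
// previous probability so expected sampled throughput hits the target, then
// clamp to the configured floor. Not jaeger-collector's actual algorithm.
package main

import "fmt"

const (
	targetTPS = 1.0     // --sampling.target-samples-per-second (default 1/s)
	minProb   = 0.00001 // --sampling.min-sampling-probability (1 in 100,000)
)

// recalc takes the probability used last epoch and the whole-cluster sampled
// throughput observed for the stratum (samples/s, combined across collectors).
func recalc(oldProb, observedSampledTPS float64) float64 {
	if observedSampledTPS == 0 {
		return oldProb // nothing observed; leave the probability unchanged
	}
	newProb := oldProb * targetTPS / observedSampledTPS
	if newProb < minProb {
		newProb = minProb
	}
	if newProb > 1 {
		newProb = 1
	}
	return newProb
}

func main() {
	// A stratum sampled 20 traces/s at p=0.001; to hit 1/s, p drops to 0.00005.
	fmt.Println(recalc(0.001, 20))
}
```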
- supports estimation: No
- limiting:
  - temporal resolution: Traces per second
  - degree of limiting: Soft
  - horizontally scalable: Yes
  - responsiveness: < 10 s (token buckets are replenished via GetSamplingTargets requests, which occur every 10 s by default)
Each actor performing sampling sends statistics to a central API describing how many spans it has seen in a period. At least two SDKs (Java, Go) have contrib `Sampler` implementations that obtain sampling configuration from AWS X-Ray. Like Jaeger's `adaptive` remote sampling, X-Ray serves advisory sampling policies to clients. An X-Ray-based sampling system behaves as follows (on average):
- Define a rule as a triple: a predicate over span attributes, a token bucket (e.g.), and a number in [0, 1] called the rule's fixed rate.
- Define the global sampling policy as an ordered collection of rules.
- Given a root span in need of a sampling decision (sketched in code after this list),
- Match the span to the first rule whose predicate it satisfies.
- If the token bucket contains at least 1 token, deduct 1 token from the bucket and sample the span and its descendants.
- Else, sample with probability equal to the matched rule's fixed rate.
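A hedged Go sketch of that decision (hypothetical types; not the contrib `Sampler` implementations): walk the ordered rules, take the first matching one, try its token bucket, and fall back to the fixed rate.

```go
// Illustrative X-Ray-style decision for a root span: first matching rule wins;
// spend a reservoir (token bucket) token if available, otherwise sample at the
// rule's fixed rate. The decision covers the span and its descendants.
package main

import (
	"fmt"
	"math/rand"
)

type rule struct {
	matches   func(attrs map[string]string) bool // predicate over span attributes
	tokens    float64                            // per-rule token bucket ("reservoir")
	fixedRate float64                            // number in [0, 1]
}

func sample(rules []*rule, attrs map[string]string) bool {
	for _, r := range rules {
		if !r.matches(attrs) {
			continue
		}
		if r.tokens >= 1 { // a token is available: deduct it and sample
			r.tokens--
			return true
		}
		return rand.Float64() < r.fixedRate // otherwise, sample at the fixed rate
	}
	return false // no rule matched (X-Ray itself ends with a catch-all default rule)
}

func main() {
	rules := []*rule{{
		matches:   func(a map[string]string) bool { return a["service.name"] == "checkout" },
		tokens:    1,
		fixedRate: 0.05,
	}}
	attrs := map[string]string{"service.name": "checkout"}
	fmt.Println(sample(rules, attrs)) // true: consumes the reservoir's only token
	fmt.Println(sample(rules, attrs)) // thereafter, sampled with probability 0.05
}
```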
Docs refer to "reservoirs", which are per-rule token buckets: https://github.com/open-telemetry/opentelemetry-java-contrib/blob/42818333e243682bb50e510f4f91381016f61f71/aws-xray/src/main/java/io/opentelemetry/contrib/awsxray/SamplingRuleApplier.java#L272. Actors doing sampling are dynamically allotted portions of the desired reservoir size (token bucket capacity), called `ReservoirQuota`, in the GetSamplingTargets API response (docs).
- supports estimation: Yes, via span attribute `SampleRate` (value = N in "1-in-N"; feature request to support p-value here)
- limiting (`EMADynamicSampler`):
  - temporal resolution: Spans per second
  - degree of limiting: Soft
  - horizontally scalable: No (limiting is per Refinery node)
  - responsiveness: Configurable as `AdjustmentInterval`
- limiting (`TotalThroughputSampler`):
  - temporal resolution: Spans per second
  - degree of limiting: Hard
  - horizontally scalable: No (limiting is per Refinery node)
  - responsiveness: Configurable as `ClearFrequencySec`
Refinery scales horizontally by forwarding spans to the appropriate peer as necessary. The node that ought to handle a given trace is determined via consistent hashing of the trace ID (src). Peers are either discovered via Redis or specified in Refinery's configuration file (docs).
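A minimal sketch of that trace-to-peer assignment (illustrative only; Refinery's actual peer management and hashing details differ): every node hashes the trace ID onto the same ring of peers, so all spans of a trace get forwarded to one owner.

```go
// Consistent-hashing sketch: map a trace ID onto a sorted ring of peer hashes
// and pick the first peer at or after it (wrapping around). Every node that
// runs this against the same peer list picks the same owner for a trace.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	hashes []uint32          // sorted peer hashes
	peers  map[uint32]string // peer hash -> peer address
}

func newRing(peerAddrs []string) *ring {
	r := &ring{peers: map[uint32]string{}}
	for _, p := range peerAddrs {
		h := hash32(p)
		r.hashes = append(r.hashes, h)
		r.peers[h] = p
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// ownerOf returns the peer responsible for the given trace ID.
func (r *ring) ownerOf(traceID string) string {
	h := hash32(traceID)
	i := sort.Search(len(r.hashes), func(j int) bool { return r.hashes[j] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.peers[r.hashes[i]]
}

func hash32(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

func main() {
	r := newRing([]string{"refinery-0:8081", "refinery-1:8081", "refinery-2:8081"})
	fmt.Println(r.ownerOf("4bf92f3577b34da6a3ce929d0e0e4736")) // same answer on every node
}
```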
Not set-it-and-forget-it: as a system's rate of telemetry production grows over time, either `GoalSampleRate` or the Honeycomb events-per-month quota will need to be adjusted.
- limiting: Support all of: spans per second, spans per month, GB per month (approximated)
- degree of limiting: Soft is ok
- horizontally scalable: Yes
- Prioritize tail sampling in Collector over head sampling in SDK
- Strive for a configuration that is "set it and forget it" (notwithstanding ad hoc changes to aid in investigation or incident response)
Ah, I didn't know this. Apologies. Haven't used Jaeger myself so all I'm piecing together is primarily from docs. Where is it stored?
The `probabilistic` sampler stores its sampling probability (0..1) on span tag `sampler.param`. With the `adaptive` configuration, do the sampling configs served by jaeger-collectors direct jaeger-clients to use `probabilistic` samplers?