Skip to content

Instantly share code, notes, and snippets.

@daviddahl
Created April 7, 2026 16:16
Show Gist options
  • Select an option

  • Save daviddahl/4bd41ab917a69336778bbe419a7efcdb to your computer and use it in GitHub Desktop.

Select an option

Save daviddahl/4bd41ab917a69336778bbe419a7efcdb to your computer and use it in GitHub Desktop.
otap-dataflow buffering and backpressure

How otap-dataflow Handles Data Buffering & Backpressure

The Big Picture

otap-dataflow processes telemetry data (logs, metrics, traces) through a pipeline of three stages:

Receiver ──> Processor ──> Exporter ──> Remote destination
 (ingests)    (transforms)   (sends out)

Each stage is connected by a bounded queue — like a fixed-size conveyor belt. When a downstream stage can't keep up, the queue fills up, and the upstream stage is forced to slow down. This is backpressure — the system's way of protecting itself from being overwhelmed.

There is no central traffic controller. Instead, each queue independently enforces its own capacity limit, and pressure ripples backward naturally through the pipeline.


How the Queues Work

Every connection between pipeline stages uses a queue with a fixed maximum size (default: 128 messages). This is configurable per deployment.

  • Queue not full: Data flows through immediately.
  • Queue full: The sender pauses until space opens up. No data is dropped — the sender simply waits.
  • Queue closed (shutdown): Remaining data is drained before the pipeline stops.

This means the system self-regulates: if a remote destination slows down, the exporter slows down, its input queue fills up, the processor slows down, its input queue fills up, and ultimately the receiver slows down its ingestion rate.


Stage-by-Stage Breakdown

Receivers (Data Ingestion)

Receivers accept incoming telemetry from external sources (gRPC, HTTP, syslog, etc.) and push it into the pipeline.

Key backpressure behaviors:

Receiver What happens when the pipeline is full?
OTLP (gRPC/HTTP) Automatically limits how many requests it accepts at once, matched to pipeline capacity. Excess requests get HTTP 503 (Service Unavailable).
OTAP (Arrow streaming) Caps concurrent requests at 1000 by default.
Topic Tracks how long it's blocked waiting for pipeline space. Logs warnings when blocked > 500ms.
Syslog Batches messages (up to 100 per batch, or every 100ms) before sending downstream.

The OTLP receiver is the most sophisticated — it auto-tunes its admission rate to match the pipeline's capacity, so it never accepts more data than the pipeline can handle.

Processors (Data Transformation)

Processors sit in the middle. Some are stateless pass-throughs (filtering, routing), while others accumulate data internally.

Processors with internal buffering:

Processor What it buffers How it handles overload
Batch Accumulates items until a size or time threshold is met (default: flush every 200ms). Rejects (NACKs) incoming data when its internal tracking slots are full (1024 inbound / 512 outbound by default).
Durable Buffer Persists data to disk (write-ahead log) before forwarding. Default: up to 10 GB. When disk is full, either rejects new data (default) or drops the oldest data to make room. Retries failed downstream sends with exponential backoff (1s to 30s).
Temporal Reaggregation Accumulates metrics over a time window (default: 60 seconds). Tells the engine to stop sending it data when its tracking capacity is nearly full. Resumes when capacity frees up.
Fanout Tracks messages sent to multiple outputs simultaneously. Tells the engine to stop sending data when 10,000 messages are in-flight. Resumes as acknowledgments arrive.
Retry No internal buffer — just reschedules failed messages. Exponential backoff: waits 5s, then 7.5s, then 11.25s, etc., up to 30s per attempt and 5 minutes total.

Stateless processors (Filter, Attributes, Debug, Content Router, Signal Type Router, Log Sampling, Transform) have no internal buffering — they process each message and immediately pass it along.

Exporters (Data Egress)

Exporters send data to remote destinations. They are intentionally lightweight — they handle concurrent dispatch but rely on upstream processors (especially the Durable Buffer) for retry and durability.

Exporter Concurrent sends What happens when the remote is slow?
OTLP gRPC Up to 5 RPCs at once (configurable). Stops accepting new data from the pipeline until an in-flight RPC completes. Failures are NACKed upstream for retry.
OTLP HTTP Up to 5 requests at once (configurable). Same as gRPC. Classifies errors as retryable (429, 502, 503, 504) or permanent.
OTAP (Arrow streaming) 64-message internal queue per signal type. Reconnects with exponential backoff (10ms to 10s) on connection failure.
Azure Monitor Up to 16 concurrent uploads (hardcoded). Pauses intake when at capacity or when authentication token is unavailable. Batches data into compressed payloads (up to 1 MB compressed).
Topic Configurable. Either blocks until the topic queue has space, or drops the newest message and NACKs upstream.

wait_for_result — Should the Client Know About Failures?

By default, when a receiver accepts data from a client, it responds with "success" as soon as the data is placed onto the internal pipeline queue. If something fails downstream (e.g., the exporter can't reach the remote destination), the client never finds out — from its perspective, the send succeeded.

The wait_for_result setting (default: false) changes this behavior:

Setting Client sees... Trade-off
wait_for_result: false Immediate success once data is queued internally. Downstream errors are invisible. Lower latency, higher throughput. Client can't react to failures.
wait_for_result: true Success or failure based on the next pipeline stage's response. Higher latency (client waits for processing), but the client can retry on failure.

Important nuance: wait_for_result: true only waits for the immediate downstream component — not the entire pipeline. If the next stage is a Durable Buffer that ACKs after writing to disk, the client gets a response as soon as the disk write succeeds, even if the final export hasn't happened yet.

Which receivers support it?

  • OTLP receiver — Yes (independently for gRPC and HTTP)
  • OTAP receiver — Yes
  • Topic receiver — No
  • Syslog receiver — No (connectionless protocol, fire-and-forget by nature)

How it affects backpressure

When wait_for_result is enabled, each client request holds a concurrency slot for the entire processing duration (not just the time to enqueue). This means:

  • Under heavy load, slots fill up sooner
  • Excess requests are rejected earlier (HTTP 503 / gRPC RESOURCE_EXHAUSTED)
  • Clients see backpressure directly and can implement their own retry logic

When disabled, slots turn over quickly (just the enqueue time), so the receiver appears to have more capacity — but failures are hidden from the client.

Shutdown behavior

When the pipeline is shutting down, receivers with in-flight wait_for_result requests:

  1. Stop accepting new requests
  2. Keep the connection open for in-flight requests to complete
  3. If the shutdown deadline expires before all requests complete, remaining clients receive an explicit error (not a hang)

The Acknowledgment Loop

Every piece of data that enters the pipeline is tracked with an acknowledgment (ACK/NACK) system:

  • ACK: "I successfully processed/sent this data." Flows backward through the pipeline so each stage knows it can release its reference.
  • NACK (transient): "I failed, but it's worth retrying." Upstream processors (like the Durable Buffer or Retry processor) will re-attempt delivery.
  • NACK (permanent): "This data is rejected and should not be retried." The data is dropped.

This means data loss only happens in explicitly configured scenarios (e.g., the Durable Buffer's "drop oldest" policy when disk is full).


Graceful Shutdown

When the pipeline shuts down:

  1. Receivers stop accepting new data first — they drain any in-progress requests.
  2. Only after all receivers confirm they're drained, processors and exporters receive the shutdown signal.
  3. Processors and exporters drain their remaining queued data before stopping, with a configurable deadline.

This two-phase approach ensures data in the pipeline is flushed rather than dropped.


Configuration Knobs

Setting What it controls Default
policies.channel_capacity.pdata Queue size between pipeline stages 128 messages
policies.channel_capacity.control.node Control message queue per node 256
Batch processor max_batch_duration How long to wait before flushing a partial batch 200ms
Durable Buffer retention_size_cap Maximum disk storage for buffered data 10 GB
Durable Buffer size_cap_policy What to do when disk is full: backpressure or drop_oldest backpressure
Durable Buffer poll_interval How often to check for data to forward 100ms
OTLP Receiver max_concurrent_requests Max simultaneous inbound requests (0 = auto-tune) 0 (auto)
Receiver wait_for_result Whether to hold the client request open until the next stage ACKs/NACKs false
Exporter max_in_flight Max concurrent outbound requests 5 (gRPC/HTTP)

Key Takeaways

  1. No data is silently dropped. Backpressure causes senders to slow down, not discard data. Data loss only occurs through explicit policy (e.g., drop_oldest).

  2. The system is self-regulating. If any stage slows down, pressure automatically propagates backward to the source, slowing ingestion to match throughput.

  3. Exporters are thin; durability lives in processors. Exporters focus on efficient dispatch. The Durable Buffer processor provides disk-backed persistence and retry logic.

  4. The OTLP receiver is the smartest ingress point. It auto-tunes its admission rate to match downstream capacity, rejecting excess requests at the transport layer before they consume memory.

  5. Everything is configurable. Queue sizes, concurrency limits, retry intervals, and capacity policies can all be tuned per deployment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment