How otap-dataflow Handles Data Buffering & Backpressure

The Big Picture

otap-dataflow processes telemetry data (logs, metrics, traces) through a pipeline of three stages:

Receiver ──> Processor ──> Exporter ──> Remote destination
 (ingests)    (transforms)   (sends out)

Each stage is connected by a bounded queue — like a fixed-size conveyor belt. When a downstream stage can't keep up, the queue fills up, and the upstream stage is forced to slow down. This is backpressure — the system's way of protecting itself from being overwhelmed.

There is no central traffic controller. Instead, each queue independently enforces its own capacity limit, and pressure ripples backward naturally through the pipeline.

How the Queues Work

Every connection between pipeline stages uses a queue with a fixed maximum size (default: 128 messages). This is configurable per deployment.

Queue not full: Data flows through immediately.
Queue full: The sender pauses until space opens up. No data is dropped — the sender simply waits.
Queue closed (shutdown): Remaining data is drained before the pipeline stops.

This means the system self-regulates: if a remote destination slows down, the exporter slows down, its input queue fills up, the processor slows down, its input queue fills up, and ultimately the receiver slows down its ingestion rate.
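
The behavior of a single bounded queue can be sketched in a few lines. This is a minimal illustration using Python's `asyncio.Queue`, not otap-dataflow's own channel implementation; the function name `run_pipeline` and the tiny capacity of 2 are assumptions chosen to make the effect visible.

```python
import asyncio

async def run_pipeline(capacity=2, items=10):
    """Push `items` messages through one bounded queue; return (received, max_depth)."""
    queue = asyncio.Queue(maxsize=capacity)  # bounded: put() waits when full
    received = []
    max_depth = 0

    async def producer():
        nonlocal max_depth
        for i in range(items):
            await queue.put(i)               # suspends here while the queue is full
            max_depth = max(max_depth, queue.qsize())
        await queue.put(None)                # sentinel: no more data

    async def consumer():
        while True:
            item = await queue.get()
            if item is None:
                break
            await asyncio.sleep(0.001)       # simulate a slow downstream stage
            received.append(item)

    await asyncio.gather(producer(), consumer())
    return received, max_depth
```

Calling `asyncio.run(run_pipeline())` delivers every item in order while the queue depth never exceeds its capacity: the producer is slowed to the consumer's pace rather than dropping anything.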

Stage-by-Stage Breakdown

Receivers (Data Ingestion)

Receivers accept incoming telemetry from external sources (gRPC, HTTP, syslog, etc.) and push it into the pipeline.

Key backpressure behaviors:

| Receiver | What happens when the pipeline is full? |
| --- | --- |
| OTLP (gRPC/HTTP) | Automatically limits how many requests it accepts at once, matched to pipeline capacity. Excess requests are rejected with HTTP 503 (Service Unavailable) or gRPC RESOURCE_EXHAUSTED. |
| OTAP (Arrow streaming) | Caps concurrent requests at 1000 by default. |
| Topic | Tracks how long it is blocked waiting for pipeline space. Logs a warning when blocked for more than 500 ms. |
| Syslog | Batches messages (up to 100 per batch, or every 100 ms) before sending downstream. |

The OTLP receiver is the most sophisticated of these: it auto-tunes its concurrency limit to the pipeline's capacity, so it does not admit more requests than the pipeline has room for.
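
The admission behavior can be pictured as a non-blocking slot counter: a request either takes a free slot or is turned away immediately. The sketch below is an illustration only; `AdmissionGate` and `handle_request` are hypothetical names, and the real receiver derives its limit from pipeline capacity rather than taking it as a constructor argument.

```python
class AdmissionGate:
    """Non-blocking concurrency limiter: admit up to `max_concurrent` requests."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def try_admit(self):
        """Return True and take a slot, or False if the request should be rejected."""
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Give a slot back once the request has finished."""
        if self.in_flight > 0:
            self.in_flight -= 1

def handle_request(gate):
    """Return the HTTP status a hypothetical handler would send."""
    if not gate.try_admit():
        return 503   # no free slot: reject instead of queueing in memory
    return 200       # slot stays held until the caller invokes gate.release()
```

With a limit of 2, a third simultaneous request gets 503, and the next one succeeds again as soon as a slot is released.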

Processors (Data Transformation)

Processors sit in the middle. Some are stateless pass-throughs (filtering, routing), while others accumulate data internally.

Processors with internal buffering:

| Processor | What it buffers | How it handles overload |
| --- | --- | --- |
| Batch | Accumulates items until a size or time threshold is met (default: flush every 200ms). | Rejects (NACKs) incoming data when its internal tracking slots are full (1024 inbound / 512 outbound by default). |
| Durable Buffer | Persists data to disk (write-ahead log) before forwarding. Default: up to 10 GB. | When disk is full, either rejects new data (default) or drops the oldest data to make room. Retries failed downstream sends with exponential backoff (1s to 30s). |
| Temporal Reaggregation | Accumulates metrics over a time window (default: 60 seconds). | Tells the engine to stop sending it data when its tracking capacity is nearly full. Resumes when capacity frees up. |
| Fanout | Tracks messages sent to multiple outputs simultaneously. | Tells the engine to stop sending data when 10,000 messages are in-flight. Resumes as acknowledgments arrive. |
| Retry | No internal buffer; it only reschedules failed messages. | Exponential backoff: waits 5s, then 7.5s, then 11.25s, etc., up to 30s per attempt and 5 minutes total. |

Stateless processors (Filter, Attributes, Debug, Content Router, Signal Type Router, Log Sampling, Transform) have no internal buffering — they process each message and immediately pass it along.
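
The Retry processor's schedule listed above (5s, 7.5s, 11.25s, capped at 30s per attempt and 5 minutes total) can be reproduced with a short calculation. The multiplier of 1.5 is inferred from the ratio of the listed delays, and the way the 5-minute budget is enforced here (stop before a wait would exceed it) is an assumption, not confirmed behavior.

```python
def backoff_schedule(initial=5.0, multiplier=1.5, max_delay=30.0, max_elapsed=300.0):
    """Return the waits (in seconds) before each retry attempt."""
    delays = []
    delay = initial
    elapsed = 0.0
    # Keep scheduling retries while the next wait still fits in the total budget.
    while elapsed + min(delay, max_delay) <= max_elapsed:
        wait = min(delay, max_delay)   # cap each individual wait
        delays.append(wait)
        elapsed += wait
        delay *= multiplier            # grow the uncapped delay geometrically
    return delays
```

Under these assumptions the schedule starts 5, 7.5, 11.25, 16.875, 25.3125 and then stays at 30 seconds until the budget runs out.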

Exporters (Data Egress)

Exporters send data to remote destinations. They are intentionally lightweight — they handle concurrent dispatch but rely on upstream processors (especially the Durable Buffer) for retry and durability.

| Exporter | Concurrent sends | What happens when the remote is slow? |
| --- | --- | --- |
| OTLP gRPC | Up to 5 RPCs at once (configurable). | Stops accepting new data from the pipeline until an in-flight RPC completes. Failures are NACKed upstream for retry. |
| OTLP HTTP | Up to 5 requests at once (configurable). | Same as gRPC. Classifies errors as retryable (429, 502, 503, 504) or permanent. |
| OTAP (Arrow streaming) | 64-message internal queue per signal type. | Reconnects with exponential backoff (10ms to 10s) on connection failure. |
| Azure Monitor | Up to 16 concurrent uploads (hardcoded). | Pauses intake when at capacity or when authentication token is unavailable. Batches data into compressed payloads (up to 1 MB compressed). |
| Topic | Configurable. | Either blocks until the topic queue has space, or drops the newest message and NACKs upstream. |
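
The OTLP HTTP exporter's split between retryable and permanent errors amounts to a lookup on the response status. A minimal sketch follows, using the four retryable codes from the table; the function name, the string labels, and the treatment of every 2xx response as success are assumptions made for illustration.

```python
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})

def classify_response(status):
    """Map an HTTP status code to 'ack', 'nack_transient', or 'nack_permanent'."""
    if 200 <= status < 300:
        return "ack"              # delivered: upstream can release the data
    if status in RETRYABLE_STATUS:
        return "nack_transient"   # worth retrying after a delay
    return "nack_permanent"       # retrying would not help
```

For example, `classify_response(503)` yields a transient NACK that an upstream Retry or Durable Buffer processor could act on, while `classify_response(400)` yields a permanent one.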

`wait_for_result` — Should the Client Know About Failures?

By default, when a receiver accepts data from a client, it responds with "success" as soon as the data is placed onto the internal pipeline queue. If something fails downstream (e.g., the exporter can't reach the remote destination), the client never finds out — from its perspective, the send succeeded.

The `wait_for_result` setting (default: `false`) changes this behavior:

| Setting | Client sees... | Trade-off |
| --- | --- | --- |
| `wait_for_result: false` | Immediate success once data is queued internally. Downstream errors are invisible. | Lower latency, higher throughput. Client can't react to failures. |
| `wait_for_result: true` | Success or failure based on the next pipeline stage's response. | Higher latency (client waits for processing), but the client can retry on failure. |

Important nuance: wait_for_result: true only waits for the immediate downstream component — not the entire pipeline. If the next stage is a Durable Buffer that ACKs after writing to disk, the client gets a response as soon as the disk write succeeds, even if the final export hasn't happened yet.
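
The difference between the two modes can be sketched with a queue and a future that the next stage resolves. Everything here (`receive`, `next_stage`, `demo`, the `"ok"`/`"error"` strings) is hypothetical and only models the semantics described above, not the receiver's actual code.

```python
import asyncio

async def receive(payload, queue, wait_for_result=False):
    """Hypothetical receiver handler: returns the status the client would see."""
    loop = asyncio.get_running_loop()
    result = loop.create_future()       # resolved by the next stage: True=ACK, False=NACK
    await queue.put((payload, result))  # waits here if the bounded queue is full
    if not wait_for_result:
        return "ok"                     # reply as soon as the data is queued
    return "ok" if await result else "error"

async def next_stage(queue, succeed):
    """Stand-in for the immediate downstream component: ACK or NACK one message."""
    _payload, result = await queue.get()
    result.set_result(succeed)

async def demo(wait_for_result, succeed):
    queue = asyncio.Queue(maxsize=1)
    status, _ = await asyncio.gather(
        receive("logs", queue, wait_for_result),
        next_stage(queue, succeed),
    )
    return status
```

With `wait_for_result=False` the client sees `"ok"` even when the next stage NACKs; with `wait_for_result=True` the same NACK surfaces as `"error"`. Only the immediate downstream stage is consulted, matching the nuance described above.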

Which receivers support it?

OTLP receiver — Yes (independently for gRPC and HTTP)
OTAP receiver — Yes
Topic receiver — No
Syslog receiver — No (syslog has no application-level acknowledgment, so it is fire-and-forget by nature)

How it affects backpressure

When wait_for_result is enabled, each client request holds a concurrency slot for the entire processing duration (not just the time to enqueue). This means:

Under heavy load, slots fill up sooner
Excess requests are rejected earlier (HTTP 503 / gRPC RESOURCE_EXHAUSTED)
Clients see backpressure directly and can implement their own retry logic

When disabled, slots turn over quickly (just the enqueue time), so the receiver appears to have more capacity — but failures are hidden from the client.
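
The effect of slot hold time on apparent capacity follows from Little's law: with a fixed number of slots, the maximum admitted request rate is the slot count divided by the time each request holds its slot. The numbers below (100 slots, 1 ms to enqueue, 50 ms to process) are made up for illustration.

```python
def max_request_rate(slots, hold_ms):
    """Upper bound on admitted requests per second for a fixed slot count.

    Little's law: concurrency = rate * hold time, so rate = concurrency / hold time.
    """
    return slots * 1000 / hold_ms

# wait_for_result disabled: a slot is held only for the enqueue time.
rate_disabled = max_request_rate(slots=100, hold_ms=1)

# wait_for_result enabled: a slot is held until the next stage responds.
rate_enabled = max_request_rate(slots=100, hold_ms=50)
```

In this made-up case the ceiling falls from 100,000 to 2,000 requests per second, which is why rejections start earlier once `wait_for_result` is enabled.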

Shutdown behavior

When the pipeline is shutting down, receivers with in-flight wait_for_result requests:

Stop accepting new requests
Keep the connection open for in-flight requests to complete
If the shutdown deadline expires before all requests complete, remaining clients receive an explicit error (not a hang)

The Acknowledgment Loop

Every piece of data that enters the pipeline is tracked with an acknowledgment (ACK/NACK) system:

ACK: "I successfully processed/sent this data." Flows backward through the pipeline so each stage knows it can release its reference.
NACK (transient): "I failed, but it's worth retrying." Upstream processors (like the Durable Buffer or Retry processor) will re-attempt delivery.
NACK (permanent): "This data is rejected and should not be retried." The data is dropped.

This means data is lost only in explicit, well-defined cases: a permanent NACK, or a configured policy such as the Durable Buffer's "drop oldest" when disk is full.
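
How an upstream stage might react to each of the three outcomes can be sketched as follows. The `Outcome` enum, the `settle` function, and the `in_flight`/`retry_queue` bookkeeping are hypothetical names for illustration; the engine's actual types may look quite different.

```python
from enum import Enum

class Outcome(Enum):
    ACK = "ack"
    NACK_TRANSIENT = "nack_transient"
    NACK_PERMANENT = "nack_permanent"

def settle(message_id, outcome, in_flight, retry_queue):
    """Apply one ACK/NACK to a stage's bookkeeping.

    in_flight:   dict of message_id -> payload still referenced by this stage
    retry_queue: list of (message_id, payload) awaiting another delivery attempt
    Returns True if the payload was dropped without being delivered.
    """
    payload = in_flight.pop(message_id)            # every outcome releases the reference
    if outcome is Outcome.NACK_TRANSIENT:
        retry_queue.append((message_id, payload))  # worth retrying later
        return False
    return outcome is Outcome.NACK_PERMANENT       # permanent NACK: dropped, not retried
```

An ACK and a permanent NACK both release the reference; only the transient NACK hands the payload to a retry path, and only the permanent NACK counts as a drop.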

Graceful Shutdown

When the pipeline shuts down:

Receivers stop accepting new data first — they drain any in-progress requests.
Only after all receivers confirm they're drained, processors and exporters receive the shutdown signal.
Processors and exporters drain their remaining queued data before stopping, with a configurable deadline.

This two-phase approach ensures data in the pipeline is flushed rather than dropped.

Configuration Knobs

| Setting | What it controls | Default |
| --- | --- | --- |
| `policies.channel_capacity.pdata` | Queue size between pipeline stages | 128 messages |
| `policies.channel_capacity.control.node` | Control message queue per node | 256 |
| Batch processor `max_batch_duration` | How long to wait before flushing a partial batch | 200ms |
| Durable Buffer `retention_size_cap` | Maximum disk storage for buffered data | 10 GB |
| Durable Buffer `size_cap_policy` | What to do when disk is full: `backpressure` or `drop_oldest` | `backpressure` |
| Durable Buffer `poll_interval` | How often to check for data to forward | 100ms |
| OTLP Receiver `max_concurrent_requests` | Max simultaneous inbound requests (0 = auto-tune) | 0 (auto) |
| Receiver `wait_for_result` | Whether to hold the client request open until the next stage ACKs/NACKs | `false` |
| Exporter `max_in_flight` | Max concurrent outbound requests | 5 (gRPC/HTTP) |

Key Takeaways

No data is silently dropped. Backpressure causes senders to slow down, not discard data. Data loss only occurs through explicit policy (e.g., drop_oldest).
The system is self-regulating. If any stage slows down, pressure automatically propagates backward to the source, slowing ingestion to match throughput.
Exporters are thin; durability lives in processors. Exporters focus on efficient dispatch. The Durable Buffer processor provides disk-backed persistence and retry logic.
The OTLP receiver is the smartest ingress point. It auto-tunes its admission rate to match downstream capacity, rejecting excess requests at the transport layer before they consume memory.
Most limits are configurable. Queue sizes, concurrency limits, retry intervals, and capacity policies can be tuned per deployment; a few values, such as the Azure Monitor exporter's 16 concurrent uploads, are hardcoded.

daviddahl/otap-dataflow-buffer-backpressure.md
