
@eevmanu
Last active June 25, 2025 15:47
retry strategies - distributed systems - from simplest one to most sophisticated one
  • no retries (the baseline)
    • The caller makes one attempt and propagates any error.
  • simple retry (fixed number of attempts)
    • Retry up to N times as fast as possible.
    • The most basic form of control flow is added: a loop. It introduces the concept of "more than one try" without any timing logic.
  • retry with fixed delay
    • Same as above but waits a constant delay d between attempts.
    • Adds a single, simple parameter—a static wait time. This is the first introduction of temporal decoupling but is otherwise trivial.
  • linear / incremental backoff
    • Delay grows by a fixed increment Δ: t = base + i·Δ.
    • The delay is now dynamic. It requires tracking the attempt number and performing simple arithmetic (addition), making it slightly more complex than a fixed delay.
  • exponential backoff
    • Delay doubles each attempt: base·2^i (capped).
    • The delay calculation becomes more complex (multiplication/exponentiation). This strategy is more effective at handling escalating contention than linear backoff, representing a more sophisticated understanding of system dynamics.
  • exponential backoff with jitter
    • Adds randomness to each delay to decorrelate clients.
    • Adds a layer of randomness on top of the exponential calculation. This sophistication arises from solving a specific, non-obvious distributed systems problem: the "thundering herd" or client-synchronization problem.
    • Common variants
      • Full Jitter : sleep = rand(0, backoff)
      • Equal Jitter : sleep = backoff/2 + rand(0, backoff/2)
      • Decorrelated : sleep = rand(base, prev*3)
  • capped backoff & retry budgets
    • Hard upper delay cap + quota of retries per time-window (budget).
    • Introduces the concept of constraints and state management over a time window (the budget). It's more complex because it's not just about a single operation's lifecycle but about managing the client's overall behavior over time.
  • circuit breaker + retry
    • Circuit breaker trips open when error rate or latency crosses a threshold; retries are short-circuited until a cool-down.
    • A significant leap in sophistication. This pattern introduces a state machine (CLOSED, OPEN, HALF-OPEN) into the client, making it stateful and responsive to the perceived health of the downstream service. It decides whether to try at all, not just how long to wait.
  • bulkheads & isolation-first retries
    • Constrain retries to a dedicated pool (threads, connection pool). Fail fast if the pool is full.
    • Moves beyond temporal logic to resource management. It requires isolating retries into dedicated resource pools (e.g., thread or connection pools), a system design pattern that contains failures. This is more complex as it involves application architecture, not just an algorithm.
  • server-hinted / retry-after aware backoff
    • Server replies with Retry-After seconds or a token; client obeys.
    • Introduces inter-system coordination. While the client-side logic can be simple (obey a header), it represents a more sophisticated, cooperative architecture where the server actively participates in load management.
  • adaptive / token-bucket retries
    • Client observes success rate or latency; a controller adjusts whether to retry and how many outstanding retries are allowed.
    • The client is no longer passive or pre-programmed; it actively learns from the outcomes of its requests (success/failure rates) and adjusts its retry policy dynamically, often using a token-bucket algorithm. This is a closed-loop feedback system.
  • request hedging (speculative or parallel retries)
    • Send the same request to two or more replicas, cancel slower ones when the first completes.
    • Introduces parallelism. This adds significant complexity in managing concurrent requests, handling race conditions for the first successful response, and cancelling redundant, in-flight requests.
  • adaptive concurrency & load-shed retries
    • Combine a concurrency-limiting algorithm (e.g., TCP BBR, Vegas, or gradient-based) with backoff. When the queue grows, drop or shed load.
    • Combines adaptive retries with advanced, dynamic concurrency limiting (e.g., TCP Vegas-style algorithms). The client is not just adapting its retry timing but its entire outbound request volume based on observed latency gradients, representing a deep level of system awareness.
  • coordinated hedging (quorum or erasure-coded)
    • Instead of full duplicates, send partial work (shards or coded pieces) to multiple servers; reconstruct result from the fastest subset (e.g., Facebook’s f4 erasure coded retries).
    • A highly advanced form of parallelism. Instead of sending redundant full requests, it requires complex algorithmic work to shard or erasure-code the request/data, distribute it, and reconstruct the result from a subset of responses.
  • predictive / ml-aware retry (state-of-the-art)
    • Use on-device or edge model to predict failure probability & latency, choose strategy accordingly (skip retry, change pool, hedge, etc.).
    • The pinnacle of complexity. It replaces static rules or simple feedback loops with a predictive machine learning model that makes sophisticated, context-aware decisions on whether and how to retry, hedge, or fail fast.

gemini 2.5 pro from aistudio

Of course. As an expert in distributed systems resilience, I'd be delighted to provide a comprehensive guide to retry strategies.

This is a critical topic. A naive retry strategy can be more dangerous than no retry strategy at all, turning a minor transient fault into a catastrophic system-wide outage. The following spectrum is designed to guide an engineer from basic concepts to a state-of-the-art, robust implementation that actively protects a system from complex failures.

Here is the full spectrum of retry strategies, from the simplest to the most sophisticated.


The Spectrum of Retry Strategies

0. No Retries (The Baseline)

This isn't a strategy but a baseline for comparison. When a request fails, it fails immediately and permanently for the client. The error is propagated up the call stack without any attempt to recover.

  • Concept: A request is attempted exactly once. If it fails, the failure is final.

  • Pros:

    • Simplicity: The easiest possible implementation (i.e., no implementation).
    • Fails Fast: The client gets an immediate response on failure, releasing resources quickly.
    • Safe for Downstream: Puts zero additional load on a potentially struggling downstream service.
  • Cons:

    • Brittle: The system has zero resilience to even the most minor transient network blips or temporary server unavailability. A single dropped packet can cause a user-facing error.
  • Pseudocode:

    function make_request(request) {
        try {
            return downstream_service.call(request);
        } catch (error) {
            // Log the error and return it immediately.
            log("Request failed, no retry.", error);
            throw error;
        }
    }
    
  • Analysis of Complex Failures:

    • This approach cannot cause cascading failures, thundering herds, or metastable failures. By not adding any retry load, it respects the state of the downstream service. However, it also makes the entire system fragile and unable to self-heal from transient issues, leading to poor availability. It's the safest option for the downstream service but the worst for client-side resilience.

1. Simple Retry (Fixed Number of Attempts)

This is the most basic and intuitive form of retry. If an operation fails, try it again a few more times.

  • Concept: Attempt the operation up to a fixed number, N, of times. The retries happen immediately, one after another, with no delay.

  • Pros:

    • Simple to Implement: Very little code is required.
    • Handles Instantaneous Flukes: Can recover from extremely short-lived errors where the resource becomes available again in microseconds.
  • Cons:

    • Brutal on Downstream Services: This is effectively a denial-of-service (DoS) attack on a struggling service. Sending rapid-fire retries will overwhelm it.
    • Wastes Resources: The client holds onto resources (threads, connections) while performing tight-loop retries.
  • Pseudocode:

    function make_request_simple_retry(request) {
        const max_attempts = 3;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
            }
        }
        // All attempts failed, throw the last captured error.
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: This is a major contributor to cascading failures. If a service is slow because it's overloaded, hammering it with immediate retries will push it from "slow" to "completely failed." This failure then cascades to services that depend on it.
    • Thundering Herd: Does not directly cause a thundering herd but exacerbates it. It ensures that every client hammers the downstream service as hard as possible during an outage.

2. Retry with Fixed Delay

A minor improvement, this strategy introduces a waiting period between retries, giving the downstream service a small amount of breathing room.

  • Concept: Wait a fixed amount of time (e.g., 200ms) between each failed attempt.

  • Pros:

    • Slightly Kinder: Less aggressive than immediate retries. Gives the downstream service a brief moment to recover.
    • Still Simple: Easy to understand and implement.
  • Cons:

    • Synchronized Retries (The "Thundering Herd" Problem): This is the critical flaw. If a service fails and all its clients have a 200ms fixed delay, they will all wait 200ms and then retry at the exact same time. This synchronized wave of traffic can knock the service over just as it's trying to recover.
  • Pseudocode:

    function make_request_fixed_delay(request) {
        const max_attempts = 3;
        const fixed_delay_ms = 200;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(fixed_delay_ms);
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Still a significant risk. While better than no delay, the synchronized retries can still overwhelm a struggling service.
    • Thundering Herd: This strategy is a classic cause of the thundering herd problem. The fixed delay synchronizes clients, creating waves of traffic that are highly destructive to recovering systems.

3. Exponential Backoff

This is the first truly viable retry strategy for distributed systems. The delay between retries increases exponentially with each failed attempt.

  • Concept: The wait time between retries grows, typically by a factor of 2, after each failure. For example: wait 100ms, then 200ms, then 400ms, and so on. A maximum delay (cap) is often used to prevent excessively long waits.

  • Pros:

    • Adapts to Outage Duration: By backing off, clients give the downstream service progressively more time to recover. This is effective for both short and longer-lasting issues.
    • Reduces System Pressure: The total load on the struggling service decreases over time as clients wait longer and longer.
  • Cons:

    • Still Prone to Synchronization: While the timing of retries is spread out for a single client, different clients that experience failure at the same time will still have correlated retry schedules (e.g., all retry at 100ms, then all at 200ms, etc.). This can still lead to spikes of traffic.
  • Pseudocode:

    function make_request_exponential_backoff(request) {
        const max_attempts = 5;
        const base_delay_ms = 100;
        let delay_ms = base_delay_ms;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(delay_ms);
                    // Exponentially increase the delay for the next attempt
                    delay_ms *= 2;
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Significantly reduces the risk compared to fixed delays. By backing off, it gives the downstream service a real chance to recover, preventing the client from delivering the final blow.
    • Thundering Herd: Helps mitigate, but does not solve. It can still create smaller, synchronized waves of retries that can be harmful, especially at the first and second retry intervals.

4. Exponential Backoff with Jitter

This is the industry-standard best practice for general-purpose retries. It builds on exponential backoff by adding randomness (jitter) to the wait times. This randomness is the key to breaking the synchronization that causes thundering herds.

  • Concept: The wait time is not just a fixed exponential value but a value chosen randomly from a window that still grows exponentially.

  • Types of Jitter:

    • Full Jitter: The simplest form. The delay is a random number between 0 and the exponential backoff ceiling.
      • sleep = random(0, base_delay * 2^attempt)
      • Pro: Excellent distribution, breaks synchronization effectively.
      • Con: Can result in very short (near-zero) delays on early retries, which can be aggressive.
    • Equal Jitter: A good compromise. It adds jitter to half of the backoff ceiling, ensuring a minimum wait time.
      • temp = base_delay * 2^attempt; sleep = temp/2 + random(0, temp/2)
      • Pro: Prevents the near-zero delays of Full Jitter while still providing good distribution.
    • Decorrelated Jitter (Often considered the best): A more advanced formula that uses the previous sleep time to create a smoother distribution over time, preventing collisions even better.
      • sleep = min(cap, random(base_delay, previous_sleep * 3))
      • Pro: Spreads out retries extremely well. It's highly effective at preventing traffic spikes.
  • Pseudocode (using Decorrelated Jitter):

    function make_request_with_jitter(request) {
        const max_attempts = 8;
        const base_delay_ms = 100;
        const max_delay_ms = 5000;
        let sleep_ms = base_delay_ms;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                // IMPORTANT: Only retry on transient/server-side errors (e.g., 503, not 400).
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(sleep_ms);
                    // Decorrelated Jitter formula
                    sleep_ms = min(max_delay_ms, random(base_delay_ms, sleep_ms * 3));
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Highly effective at preventing them. The combination of backing off and adding jitter gives downstream services maximum breathing room to recover.
    • Thundering Herd: This is the primary solution to the thundering herd problem. By randomizing the retry times, clients no longer retry in synchronized waves. The load from retries is smeared out over time, appearing as a gentle, constant pressure rather than a series of hammer blows.
    • Metastable Failures: Jitter helps prevent the system from getting stuck in a bad state. By smoothing out retry load, it avoids the periodic spikes that can keep a service perpetually overloaded and unable to fully recover.

5. The Role of Circuit Breakers

A Circuit Breaker is not a retry strategy itself, but a critical companion pattern that governs if a request should even be attempted. It acts as a state machine that wraps around our retry logic.

  • Concept: Based on the electrical analogy, a circuit breaker monitors failures.

    1. Closed: The default state. Requests are allowed to pass through (and can be retried using backoff with jitter if they fail). If the number of failures exceeds a threshold in a given time window, the breaker "trips" and moves to the Open state.
    2. Open: For a configured timeout period, all requests fail immediately without even being attempted. This is "failing fast." This gives the downstream service complete protection and time to recover.
    3. Half-Open: After the timeout expires, the breaker allows a single "probe" request to pass through. If it succeeds, the breaker moves to Closed. If it fails, it returns to Open for another timeout period.
  • Pros:

    • Ultimate Downstream Protection: The most effective way to prevent a struggling service from being overwhelmed. It completely stops traffic to it.
    • Client-Side Resilience: Prevents the client application from wasting resources (threads, sockets, CPU) trying to contact a service that is known to be down.
    • Prevents Cascading Failures: This is its primary purpose. By stopping calls to a failed service, it prevents that failure from propagating up to the services that depend on the client.
  • Cons:

    • Implementation Complexity: Adds state and more configuration (failure thresholds, open-state duration) to the client.
    • Can Isolate a Recovered Service: If tuned incorrectly, the "open" state might last too long, preventing traffic from reaching a service that has already recovered.
  • Pseudocode (Conceptual; a minimal sketch of the breaker's internal state machine follows the analysis below):

    // State is shared across all calls to a specific service
    let circuit_breaker = new CircuitBreaker({
        failure_threshold: 5,
        reset_timeout_ms: 10000 // 10 seconds
    });
    
    function execute_with_circuit_breaker(request) {
        if (circuit_breaker.is_open()) {
            throw new Error("Circuit Breaker is open. Failing fast.");
        }
    
        try {
            // The retry logic (e.g., exponential backoff with jitter) goes here
            let result = make_request_with_jitter(request);
            circuit_breaker.record_success();
            return result;
        } catch (error) {
            circuit_breaker.record_failure();
            throw error;
        }
    }
    
  • Analysis of Complex Failures:

    • A Circuit Breaker is the single most important pattern for preventing cascading failures. It acts as a firewall, containing the "blast radius" of a failure to a single service integration point. It allows a system to gracefully degrade instead of collapsing entirely.
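
The CircuitBreaker object used in the conceptual pseudocode above is treated as a black box. As a rough illustration of the CLOSED / OPEN / HALF-OPEN state machine it hides, here is a minimal sketch in Python; the class shape, thresholds, and clock handling are assumptions for this guide, not any particular library's API.

    import time

    class CircuitBreaker:
        CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

        def __init__(self, failure_threshold=5, reset_timeout_s=10.0):
            self.failure_threshold = failure_threshold  # failures (since last success) before tripping
            self.reset_timeout_s = reset_timeout_s      # how long to stay OPEN before probing
            self.state = self.CLOSED
            self.failure_count = 0
            self.opened_at = 0.0

        def is_open(self):
            # After the cool-down expires, let a single probe through (HALF-OPEN).
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                    self.state = self.HALF_OPEN
                    return False
                return True
            return False

        def record_success(self):
            # Any success closes the breaker and clears the failure count.
            self.state = self.CLOSED
            self.failure_count = 0

        def record_failure(self):
            # A failed probe re-opens immediately; otherwise count toward the threshold.
            if self.state == self.HALF_OPEN or self.failure_count + 1 >= self.failure_threshold:
                self.state = self.OPEN
                self.opened_at = time.monotonic()
                self.failure_count = 0
            else:
                self.failure_count += 1

The execute_with_circuit_breaker wrapper above works unchanged against a breaker like this: check is_open() before calling, then record_success() or record_failure() based on the outcome.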

6. Advanced: Request Hedging

This is a sophisticated strategy focused on tail-latency reduction, but it has significant implications for resilience. It's not a replacement for the above but can be used in specific scenarios.

  • Concept: Instead of waiting for a request to fail, you proactively send a second, "hedged" request if the first one doesn't respond within a certain time (e.g., the 95th percentile latency). You then take whichever response comes back first and cancel the other.

  • Pros:

    • Reduces P99 Latency: Excellent at masking the occasional slow response from a server instance or a transient network hiccup, improving the user experience.
    • Increases Success Rate: Can turn a slow request (that might have timed out) into a successful one.
  • Cons:

    • Increases Load: This is the critical danger. Under normal conditions, it might only increase load by 5%, but if the downstream service becomes slow for all requests, hedging will effectively double the traffic, potentially triggering a catastrophic failure.
    • Requires Idempotency: The downstream operation must be idempotent, as it might be executed twice.
  • Pseudocode (Conceptual):

    function make_hedged_request(request) {
        const hedge_delay_ms = 100; // e.g., P95 latency
    
        let request1 = downstream_service.call_async(request);
        let timer = sleep_async(hedge_delay_ms);
    
        // Race the first request against the timer
        let first_result = await Promise.race([request1, timer]);
    
        if (first_result !== "timer_finished") {
            return first_result; // Request 1 finished quickly
        } else {
            // Timer finished, hedge the request
            let request2 = downstream_service.call_async(request);
            // Return whichever request finishes first now
            return await Promise.race([request1, request2]);
        }
    }
    
  • Analysis of Complex Failures:

    • Request Hedging is a double-edged sword. Used judiciously on a healthy system with spare capacity, it improves performance. Used carelessly or on a system already under load, it can actively cause a cascading failure or thundering herd by doubling the request volume at the worst possible moment. It should only be used for critical, idempotent operations and ideally be disabled automatically when system-wide error rates are high (e.g., when a circuit breaker is open).

Putting It All Together: The Ideal Resilience Strategy

For a modern, resilient distributed system, you don't choose one strategy; you layer them.

  1. Foundation: Every external call that can fail transiently should use Exponential Backoff with Decorrelated Jitter. This should be the default retry mechanism.
  2. Protection: Wrap this retry logic within a Circuit Breaker. This protects the rest of your system from a failing dependency and gives that dependency room to recover.
  3. Idempotency: Ensure that any operation that is retried is idempotent. Use mechanisms like a unique Request-ID or Idempotency-Key that the server can use to de-duplicate retried requests.
  4. Selective Optimization: For a small number of critical, latency-sensitive, and idempotent operations where P99 latency is paramount, consider adding Request Hedging, but ensure it can be disabled and is monitored closely.

By following this layered approach, you move from simply hoping a retry works to designing a system that is fundamentally resilient, self-healing, and robust against the complex failure modes that plague distributed environments.
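
To make the layering concrete, here is a minimal sketch (in Python, for concreteness) that stacks the pieces in that order: a circuit-breaker check, one idempotency key per logical operation, and decorrelated-jitter backoff around the call. The downstream_service object, the breaker's method names, and the header name are assumptions for illustration, not a specific framework's API.

    import random
    import time
    import uuid

    def call_with_resilience(downstream_service, request, breaker,
                             max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
        # Layer 2: the breaker decides whether we try at all.
        if breaker.is_open():
            raise RuntimeError("Circuit breaker is open; failing fast")

        # Layer 3: one idempotency key for the whole logical operation,
        # reused on every attempt so the server can de-duplicate replays.
        request.headers["Idempotency-Key"] = str(uuid.uuid4())

        sleep_s = base_delay_s
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                result = downstream_service.call(request)
                breaker.record_success()
                return result
            except Exception as error:   # in real code, retry only transient errors
                last_error = error
                breaker.record_failure()
                if attempt == max_attempts or breaker.is_open():
                    break                # out of attempts, or the breaker tripped
                # Layer 1: exponential backoff with decorrelated jitter.
                time.sleep(sleep_s)
                sleep_s = min(max_delay_s, random.uniform(base_delay_s, sleep_s * 3))
        raise last_error

Whether the breaker records an outcome per attempt (as sketched here) or once per logical operation (as in the conceptual pseudocode of section 5) is a design choice; either way, retries stop as soon as the breaker opens.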




o3 + search from chatgpt

Modern distributed systems live or die by how gracefully they handle the inevitable transient failure. Below is a progression of retry techniques—from “do nothing” to state-of-the-art patterns—showing where each shines, where it breaks, and how it influences dynamics like thundering-herd surges, metastable loops, and cascading outages. Read straight through to level up, or jump to the stage that matches your current architecture.

1 No Retries (baseline)

What it is: Fail fast. Return the error to the caller and let higher layers decide.

Pros

  • Zero extra load; simplest to reason about.
  • Eliminates the risk of retry storms entirely.

Cons

  • Caller must cope with every transient network blip.
  • End-user experience may degrade even though the service would have recovered milliseconds later.
resp = callService()
if resp.isError():
    return error           # no retry

Impact on complex failures: Doesn’t cause herd or cascade problems—but doesn’t help avoid them either. Useful as a contrast point when measuring incremental patterns.


2 Simple Retry (N attempts, no delay)

What it is: Blindly re-send the request a fixed number of times.

Pros

  • Handles the most ephemeral glitches with minimal code.

Cons

  • All retries land back-to-back, compounding server load; can ignite a thundering herd during a short outage. (encore.dev)
for attempt in 1..N:
    resp = call()
    if resp.ok(): return resp
return error

Failure dynamics: If many clients hit the same service, they synchronize their immediate retries, potentially turning a blip into overload and metastable failure. (sigops.org, usenix.org)


3 Retry with Fixed Delay

Insert a constant sleep between attempts.

Pros

  • Gives the target breathing room; trivial to implement.

Cons

  • Hard-coded delay is either too short (still herding) or too long (hurts latency).
  • Responds identically to minor and major outages.
for attempt in 1..N:
    resp = call()
    if resp.ok(): return resp
    sleep(300 ms)

Failure dynamics: Reduces spike amplitude but still lines requests up (a 300 ms cadence can pulse traffic).


4 Exponential Backoff

Delay doubles (or grows exponentially) each retry up to a cap. (aws.amazon.com, docs.aws.amazon.com)

Pros

  • Fast recovery for flukes, increasingly patient for longer outages.
  • Spreads load geometrically, lowering the odds of a service-wide meltdown.

Cons

  • Without randomness, many clients still share an identical delay schedule—the herd just stretches into waves.
base = 100 ms
for attempt in 0..N-1:
    resp = call()
    if resp.ok(): return resp
    sleep(min(base * 2**attempt, MAX))

Failure dynamics: Mitigates cascades by rapidly throttling callers, but synchronized waves can still pile up.


5 Exponential Backoff with Jitter

Add randomness to each delay. Variants: Full, Equal, Decorrelated jitter. (aws.amazon.com, aws.amazon.com, medium.com)

Pros

  • Breaks synchronization; requests smear across time.
  • Proven by AWS simulations to cut server work by orders of magnitude vs. plain exponential.

Cons

  • Requires a good RNG; analysis is probabilistic, not deterministic.
base = 100 ms
sleep_ms = random(0, base * 2**attempt)          # “Full Jitter”
sleep(sleep_ms)

Failure dynamics: Best single tactic for preventing thundering herds and metastable feedback loops—calls rarely collide.


6 Capped Backoff & Retry Budgets

Introduce a maximum delay and a global “retry budget” (e.g., ≤10 % of traffic). (medium.com, medium.com)

Pros

  • Guarantees an upper bound on user latency.
  • Budgets limit total retries across a fleet, preventing storm amplification.

Cons

  • Needs shared metrics or a client-side token bucket.
  • Budget exhaustion may reduce availability during protracted outages.
if retries_this_minute > budget:
    return error                # abort quickly

Failure dynamics: Stops runaway queues that fuel cascades; keeps the system inside its stress envelope.


7 Adaptive / Token-Bucket Retries

Backoff factor is adjusted dynamically based on recent throttle responses (429/503). AWS SDK “adaptive” mode is a reference implementation. (docs.aws.amazon.com)

Pros

  • Learns the safe send-rate per client in real time.
  • Naturally soaks up traffic spikes without a central coordinator.

Cons

  • Slightly more complex; relies on accurate error classification.
bucket.take()         # blocks until token available
call()
bucket.refill(rateFromServerFeedback)
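
A minimal sketch of the kind of token bucket assumed above (Python; the class shape and the way the rate reacts to throttle feedback are illustrative, not the AWS SDK's actual adaptive-mode implementation):

import time

class RetryTokenBucket:
    def __init__(self, rate_per_s=5.0, burst=10.0):
        self.rate_per_s = rate_per_s      # tokens added per second
        self.capacity = burst             # maximum stored tokens
        self.tokens = burst
        self.last = time.monotonic()

    def take(self):
        # Non-blocking variant: True means a retry is allowed right now.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_feedback(self, throttled):
        # Crude adaptation: shrink the refill rate on 429/503, grow it slowly on success.
        if throttled:
            self.rate_per_s = max(0.5, self.rate_per_s * 0.7)
        else:
            self.rate_per_s = min(50.0, self.rate_per_s + 0.1)

Each retry attempt calls take(); a False result means the retry budget is spent and the error should be surfaced instead of retried.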

Failure dynamics: Excels at preventing cascades because it enforces proportional back-pressure on every caller.


8 Circuit Breaker + Retry

A local fuse that “opens” after too many consecutive failures and “half-opens” to probe recovery. (medium.com, resilience4j.readme.io, dev.to)

Pros

  • Converts heavy retry loops into fast failures, shielding dependencies.
  • Provides a real-time health signal to upstream layers.

Cons

  • Poorly tuned thresholds can oscillate.
  • Adds statefulness to stateless clients.
if breaker.isOpen():
    return fallback()

resp = call()
if resp.isError():
    breaker.recordFailure()
else:
    breaker.recordSuccess()

Failure dynamics: Short-circuits cascades and caps the size of retry storms; when combined with jittered backoff, offers robust protection against metastable states.


9 Bulkheads & Isolation-First Retries

Retry only within a partition (per-tenant, per-priority queue).

Pros

  • Keeps a noisy neighbor from starving critical traffic.
  • Each bulkhead can tune its own retry policy.

Cons

  • Requires careful partition key choice; raises operational complexity.
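
A minimal isolation-first sketch (Python; the per-dependency semaphore pool and the call_with_backoff callable are illustrative assumptions):

import threading

# One small, dedicated pool of retry slots per downstream dependency (the bulkhead).
RETRY_SLOTS = {
    "billing": threading.BoundedSemaphore(4),
    "search": threading.BoundedSemaphore(4),
}

def retry_in_bulkhead(dependency, call_with_backoff):
    slots = RETRY_SLOTS[dependency]
    if not slots.acquire(blocking=False):
        # The pool is full: fail fast rather than letting retries pile up.
        raise RuntimeError(f"retry bulkhead for {dependency} is full")
    try:
        return call_with_backoff()     # e.g., jittered backoff scoped to this dependency
    finally:
        slots.release()

A storm of retries against "billing" can exhaust at most its own four slots; "search" keeps its capacity.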

Failure dynamics: Prevents failures in one slice from propagating system-wide, containing potential cascades.


10 Request Hedging (Speculative or Parallel Retries)

Send a second copy of a long-running request to another replica after a short hedge delay; keep the fastest response, cancel the rest. (1pkg.github.io, huizhou92.com)

Pros

  • Compresses long-tail latency (P99/P999).
  • Gracefully handles random stragglers without punishing median latency.

Cons

  • Extra load if hedges fire too early/too often.
  • Not a substitute for fixing root-cause slowness.
primary = callAsync()
wait(HEDGE_DELAY)                        # give the primary a head start
if primary.done(): return primary.result()
hedge = callAsync()                      # speculative copy to a second replica
return firstCompleted(primary, hedge)    # take whichever finishes first, cancel the other

Failure dynamics: Reduces perceived outage scope by masking single-node brownouts; however, if overused during system-wide failures it can increase load—pair with budgets and circuit breakers.


11 Adaptive Concurrency & Load-Shed Retries

Combine retry patterns with concurrency-limit algorithms (BBR, AIMD) so each client maintains just enough in-flight calls. (aws.amazon.com, beerus11.medium.com)

Pros

  • Achieves near-optimal throughput vs. latency across wide failure modes.
  • Automatically sheds lower-priority work when saturation is near.

Cons

  • Heavier implementation effort; needs caller-side latency measurement.
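
A minimal AIMD-style sketch of the idea (Python; the limits, the latency threshold, and the additive/multiplicative constants are illustrative, and production limiters such as gradient or Vegas-style algorithms are considerably more careful):

class AIMDLimiter:
    def __init__(self, limit=10, min_limit=1, max_limit=200):
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.inflight = 0

    def try_acquire(self):
        # Shed the request (first attempt or retry) when we are already at the limit.
        if self.inflight >= self.limit:
            return False
        self.inflight += 1
        return True

    def release(self, latency_s, slow_threshold_s=0.5):
        self.inflight -= 1
        if latency_s > slow_threshold_s:
            # Multiplicative decrease when latency spikes.
            self.limit = max(self.min_limit, int(self.limit * 0.7))
        else:
            # Additive increase while things look healthy.
            self.limit = min(self.max_limit, self.limit + 1)

Retries pass through try_acquire() exactly like first attempts, so a saturated client sheds its own retry load first (a real implementation would also guard this state with a lock).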

Failure dynamics: Actively prevents metastable overload loops by throttling at the source based on observed RTT spikes—one of the few techniques strong enough to arrest live cascades.


Putting it all together

In practice you stack these patterns:

  1. Start with exponential backoff with jitter.
  2. Cap total retries with budgets and deadlines.
  3. Wrap every dependency in a circuit breaker.
  4. Guard concurrency with token buckets or adaptive algorithms.
  5. Add hedging selectively for latency-critical calls.

When simulated at Amazon, the jittered-backoff + budget + breaker stack reduced peak load during an outage by orders of magnitude compared with naïve retries. (aws.amazon.com, docs.aws.amazon.com, aws.amazon.com)


Key Take-aways

  • Randomized backoff is the single most effective upgrade from basic retries.
  • Budgets and token buckets curb retry storms by making retries a scarce resource.
  • Circuit breakers act as automatic firewalls between components, limiting blast-radius.
  • Hedging tackles tail latency but must be paired with budgets to avoid self-harm.
  • Adaptive concurrency closes the loop by using live performance feedback to self-throttle.

Master these rungs, and your system will survive the next partial outage rather than spiraling into a self-inflicted catastrophe.




gemini 2.5 pro + grounding from aistudio

Of course. Here is a comprehensive guide to the full spectrum of retry strategies, designed to be both thoughtful and forward-thinking for engineers building resilient distributed systems.

The Spectrum of Retry Strategies: From Simple to Sophisticated

Here is a breakdown of retry strategies, starting with the most basic and moving towards more elaborate and robust solutions.

1. No Retries (The Baseline)

  • Concept: This is the simplest approach, where a failed request is not attempted again. The failure is immediately propagated back to the calling service or user.
  • Pros:
    • Extremely simple to implement.
    • Prevents the risk of "retry storms" where multiple clients overwhelm a struggling service with repeated requests.
  • Cons:
    • Offers no resilience to transient failures, which are common in distributed systems.
    • Can lead to a poor user experience as temporary glitches result in hard failures.
  • Pseudocode:
    function makeRequest(request) {
      try {
        return execute(request);
      } catch (error) {
        // Immediately return the error
        throw error;
      }
    }
    
  • Analysis: While "No Retries" avoids the complexities and potential dangers of more advanced strategies, it does nothing to prevent cascading failures. A single transient error in a downstream service can ripple upwards, causing failures throughout the system.

2. Simple Retry (Fixed Number of Attempts)

  • Concept: When a request fails, it is immediately retried a fixed number of times.
  • Pros:
    • Simple to implement and understand.
    • Can handle very short-lived transient failures.
  • Cons:
    • Can easily lead to a "thundering herd" problem, where many clients simultaneously retry, overwhelming the downstream service.
    • Does not account for the possibility that the service needs time to recover.
  • Pseudocode:
    function makeRequest(request, maxRetries = 3) {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
        }
      }
    }
    
  • Analysis: This strategy can turn a transient issue into a major outage. By immediately retrying, it can amplify the load on a struggling service, potentially causing a cascading failure.

3. Retry with Fixed Delay

  • Concept: This strategy introduces a fixed waiting period between retry attempts.
  • Pros:
    • Slightly more sophisticated than a simple retry, as it gives the downstream service a brief recovery period.
  • Cons:
    • Still susceptible to the "thundering herd" problem if many clients retry after the same fixed delay.
    • The fixed delay may not be optimal for all failure scenarios.
  • Pseudocode:
    function makeRequest(request, maxRetries = 3, delay = 1000) {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          sleep(delay);
        }
      }
    }
    
  • Analysis: While an improvement, the fixed delay does little to prevent synchronized retries, a key contributor to cascading failures.

4. Exponential Backoff

  • Concept: The delay between retries increases exponentially with each failed attempt. For example, the first retry might be after 1 second, the second after 2 seconds, the third after 4 seconds, and so on.
  • Pros:
    • Significantly reduces the likelihood of a "thundering herd" by spreading out retry attempts over time.
    • Gives the downstream service progressively more time to recover.
  • Cons:
    • Without a cap, the backoff delay can become excessively long.
    • Synchronized failures can still occur if multiple clients start their backoff sequence at the same time.
  • Pseudocode:
    function makeRequest(request, maxRetries = 5) {
      let delay = 1000; // Initial delay of 1 second
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          sleep(delay);
          delay *= 2; // Double the delay for the next attempt
        }
      }
    }
    
  • Analysis: Exponential backoff is a significant step towards preventing cascading failures. By reducing the rate of retries, it lessens the pressure on a struggling service.

5. Exponential Backoff with Jitter

  • Concept: This strategy adds a small, random amount of time to the exponential backoff delay. This "jitter" prevents clients from retrying in synchronized waves.
    • Full Jitter: The next delay is a random value between 0 and the exponential backoff value.
    • Equal Jitter: The next delay is half the exponential backoff value plus a random value between 0 and that half.
  • Pros:
    • Highly effective at preventing the "thundering herd" problem by desynchronizing client retries.
  • Cons:
    • Slightly more complex to implement than simple exponential backoff.
  • Pseudocode (Full Jitter):
    function makeRequest(request, maxRetries = 5) {
      let baseDelay = 1000;
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          const jitter = Math.random() * baseDelay;
          sleep(jitter);
          baseDelay *= 2;
        }
      }
    }
    
  • Analysis: Exponential backoff with jitter is a robust strategy for preventing cascading failures caused by retry storms. It is a widely recommended best practice.

6. Circuit Breaker + Retry

  • Concept: A circuit breaker monitors the number of failures from a particular service. If the failure rate exceeds a threshold, the circuit breaker "trips" or "opens," and for a period of time, all subsequent requests to that service will fail immediately without even being attempted. After a timeout, the circuit breaker enters a "half-open" state, allowing a limited number of test requests. If these succeed, the circuit closes and normal operation resumes. If they fail, the circuit remains open.
  • Pros:
    • Prevents an application from repeatedly trying to execute an operation that is likely to fail, saving system resources.
    • Gives a failing service time to recover by shedding load completely.
  • Cons:
    • Can be complex to configure and tune correctly.
    • In a "half-open" state, a recovering service could be flooded with requests.
  • Pseudocode:
    const circuitBreaker = new CircuitBreaker(options);
    
    function makeRequestWithCircuitBreaker(request) {
      if (!circuitBreaker.isClosed()) {
        throw new Error("Circuit breaker is open");
      }
      try {
        const response = execute(request);
        circuitBreaker.recordSuccess();
        return response;
      } catch (error) {
        circuitBreaker.recordFailure();
        // Can be combined with a retry strategy here
        throw error;
      }
    }
    
  • Analysis: The Circuit Breaker pattern is a powerful defense against cascading failures. By stopping requests to a failing service, it prevents the problem from spreading. When combined with a retry strategy, the retry logic should be aware of the circuit breaker's state to avoid retrying when the circuit is open.

7. Request Hedging (Speculative or Parallel Retries)

  • Concept: Instead of waiting for a request to fail before retrying, request hedging sends a second request after a short delay, without cancelling the first. The first response to return is used, and the other request is cancelled. This is particularly useful for reducing tail latency.
  • Pros:
    • Can significantly improve the latency of requests, especially in systems with high variance in response times.
  • Cons:
    • Increases the load on the downstream service by sending duplicate requests.
    • Should only be used for idempotent operations.
  • Pseudocode:
    function makeHedgedRequest(request, hedgeDelay = 100) {
      const originalRequest = execute(request);
      const hedgedRequest = sleep(hedgeDelay).then(() => execute(request));
    
      return Promise.race([originalRequest, hedgedRequest]);
    }
    
  • Analysis: While effective for latency reduction, request hedging must be used with caution. Unchecked, it can easily exacerbate a system overload and contribute to cascading failures. It's often combined with other mechanisms like retry budgets to limit the number of hedged requests.

8. Adaptive / Token-Bucket Retries

  • Concept: This is an advanced strategy where the retry behavior adapts to the real-time conditions of the system. One common implementation is the "token bucket" algorithm. A client has a "bucket" of tokens that are replenished at a fixed rate. Each retry consumes a token. If the bucket is empty, no more retries are allowed until a token is replenished.
  • Pros:
    • Provides a dynamic and responsive way to control retry rates based on system health.
    • Can prevent a single misbehaving client from overwhelming a service.
  • Cons:
    • More complex to implement and tune than other strategies.
  • Pseudocode (Token Bucket):
    const tokenBucket = new TokenBucket(capacity, refillRate);
    
    function makeRequestWithTokenBucket(request) {
      try {
        return execute(request);
      } catch (error) {
        if (tokenBucket.consume()) {
          // A retry token is available: attempt the request once more
          return execute(request);
        } else {
          throw new Error("Out of retry tokens");
        }
      }
    }
    
  • Analysis: Adaptive retries are highly effective at preventing cascading failures by intelligently throttling retries based on the system's ability to handle them. This approach helps to avoid both metastable failures and thundering herds.

9. Bulkheads & Isolation-First Retries

  • Concept: The bulkhead pattern isolates elements of an application into pools so that if one fails, the others can continue to function. In the context of retries, this means isolating the resources (like thread pools or connection pools) used for retrying requests for a specific service.
  • Pros:
    • Prevents a failure in one service from consuming all the resources of the calling application, thus containing the failure.
  • Cons:
    • Can add complexity to the application's architecture.
  • Pseudocode:
    // Each service has its own dedicated thread pool for retries
    const serviceAThreadPool = new ThreadPool(10);
    const serviceBThreadPool = new ThreadPool(10);
    
    function retryForServiceA(request) {
      serviceAThreadPool.submit(() => {
        // Retry logic for service A
      });
    }
    
  • Analysis: The bulkhead pattern is a crucial strategy for building resilient systems. By isolating failures, it prevents them from cascading and bringing down the entire application. When combined with intelligent retry strategies, it provides a very robust defense against complex failures.

10. Adaptive Concurrency & Load-Shed Retries

  • Concept: This is a sophisticated approach where the system adaptively controls the concurrency of requests based on observed latency. If latency increases, the system reduces the concurrency limit, effectively "shedding" load. Retries are only attempted if there is available capacity within the current concurrency limit.
  • Pros:
    • Provides a very fine-grained and adaptive control over the load placed on downstream services.
    • Can prioritize critical requests during periods of high load.
  • Cons:
    • Requires a sophisticated monitoring and control system.
  • Pseudocode:
    const adaptiveConcurrencyControl = new AdaptiveConcurrencyControl();
    
    function makeRequestWithAdaptiveConcurrency(request) {
      if (adaptiveConcurrencyControl.hasCapacity()) {
        adaptiveConcurrencyControl.onRequestStart();
        const promise = execute(request);
        promise.finally(() => adaptiveConcurrencyControl.onRequestEnd());
        return promise;
      } else {
        // Shed load, potentially retrying later if capacity becomes available
        throw new Error("Concurrency limit reached");
      }
    }
    
  • Analysis: This is one of the most advanced and effective strategies for preventing complex failures. By directly managing concurrency based on real-time performance, it can prevent overload scenarios before they escalate into cascading failures or metastable failure states.



o3-pro from platform openai - 8m4s

THE RETRY SPECTRUM
(Ordered from the most naïve to the most sophisticated)

  1. NO RETRY (BASELINE)

    1. Explanation
      • The caller makes one attempt and propagates any error.
    2. Pros / Cons
      • Zero added load, simplest reasoning.
        – Low availability, caller must deal with every transient error.
    3. Pseudocode
      result = call() # if it fails, bubble up
    4. Effect on Complex Failures
      • Safest for the server: no amplification or thundering herd.
      • Caller must implement its own fallback or accept the outage.
  2. IMMEDIATE SIMPLE RETRY (FIXED ATTEMPTS, NO DELAY)

    1. Explanation
      • Retry up to N times as fast as possible.
    2. Pros / Cons
      • Hides ultra-short blips (TCP SYN drop, etc.).
        – Multiplies traffic by N under real failure, accelerating overload.
    3. Pseudocode
      for i in 1‥N:
          if call(): return ok
      raise Error
    4. Failure Analysis
      • Worsens thundering herd, metastable and cascading failures because every failing request becomes N new requests immediately.
  3. RETRY WITH FIXED DELAY

    1. Explanation
      • Same as above but waits a constant delay d between attempts.
    2. Pros / Cons
      • Gives failing server breathing room.
        – Delay must be tuned; too small = herd, too large = latency.
    3. Pseudocode
      for i in 1‥N:
          if call(): return ok
          sleep(d)
    4. Failure Analysis
      • Reduces rate-amplification factor by spacing retries, but the retry waves remain synchronized → still vulnerable to herd effects.
  4. LINEAR / INCREMENTAL BACKOFF

    1. Explanation
      • Delay grows by a fixed increment Δ: t = base + i·Δ.
    2. Pros / Cons
      • Simpler math than exponential; controllable upper bound.
        – Slow to ease pressure compared with exponential.
    3. Pseudocode
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(delay)
          delay += Δ
    4. Failure Analysis
      • Better than fixed but still partially synchronized → waves of load.
  5. EXPONENTIAL BACKOFF

    1. Explanation
      • Delay doubles each attempt: base·2^i (capped).
    2. Pros / Cons
      • Rapidly shrinks retry traffic on persistent failure.
        – Clients become synchronized because they are deterministic.
    3. Pseudocode
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(delay)
          delay = min(delay*2, maxDelay)
    4. Failure Analysis
      • Helps mitigate cascading failures by lowering pressure quickly, but the synchronized doubling can still produce retry bursts (see the November 2020 AWS Kinesis outage).
  6. EXPONENTIAL BACKOFF WITH JITTER

    1. Explanation
      • Adds randomness to each delay to decorrelate clients.
      • Common variants
      a. Full Jitter : sleep = rand(0, backoff)
      b. Equal Jitter : sleep = backoff/2 + rand(0, backoff/2)
      c. Decorrelated : sleep = rand(base, prev*3)
    2. Pros / Cons
      • Breaks synchronization, dramatically lowers peak load.
        – Harder to test; long tail latency for unlucky clients.
    3. Pseudocode (Full Jitter)
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(rand(0, delay))
          delay = min(delay*2, maxDelay)
    4. Failure Analysis
      • One of the best generic defences against thundering herds and metastable loops. Peaks smooth out; average retries still limited by exponential growth.
  7. CAPPED BACKOFF & RETRY BUDGET

    1. Explanation
      • Hard upper delay cap + quota of retries per time-window (budget).
    2. Pros / Cons
      • Prevents “retry storms” when outages last hours.
      • Budgets give SREs a knob (e.g., 10% of traffic allowed as retries).
        – Might starve low-QPS services during budget exhaustion.
    3. Pseudocode
      bucket = TokenBucket(rate=R, burst=B)
      for i in 1‥N:
          if not bucket.take(): break     # budget exhausted
          if call(): return ok
          sleep(rand(0, min(maxDelay, base*2**i)))
    4. Failure Analysis
      • Converts unbounded amplification into a consciously chosen factor (<=B). Greatly limits cascades.
  8. SERVER-HINTED / RETRY-AFTER AWARE BACKOFF

    1. Explanation
      • Server replies with Retry-After seconds or a token; client obeys.
    2. Pros / Cons
      • Server can coordinate clients (pushback).
        – Requires protocol support and correct server logic.
    3. Pseudocode
      resp = call()
      if resp.status in (429, 503) and resp.retryAfter:
          sleep(resp.retryAfter)          # obey the server's pacing hint
          retry()
    4. Failure Analysis
      • Dramatically reduces overload because the server dictates pacing. Still needs fallbacks if the server is completely down.
  9. TOKEN-BUCKET / ADAPTIVE RETRY (CLIENT SIDE)

    1. Explanation
      • Client observes success rate or latency; a controller adjusts whether to retry and how many outstanding retries are allowed.
    2. Pros / Cons
      • Self-tuning to current conditions.
        – Requires measurements and a feedback loop; unstable if tuned poorly.
    3. Pseudocode (success-rate gating)
      window = SlidingWindow(100)
      for i in 1‥N:
          ok = call()
          window.record(ok)                       # feed the outcome into the sliding window
          if ok: return ok
          if window.successRate() < 0.8: break    # abort retries when health degrades
          sleep(rand(0, backoff))
    4. Failure Analysis
      • Cuts retries once success deteriorates → protects the backend from a spiraling failure, avoids metastable amplification.
  10. CIRCUIT BREAKER + RETRY

    1. Explanation
      • Circuit breaker trips open when error rate or latency crosses a threshold; retries are short-circuited until a cool-down.
    2. Pros / Cons
      • Fast fail protects both client resources and server.
      • The “half-open” probe allows early recovery.
        – Requires correct thresholds; can flap under borderline load.
    3. Pseudocode
      cb = CircuitBreaker(failureRate=0.5, window=10s)
      if not cb.allow(): raise FastFail
      if call(): cb.onSuccess()
      else: cb.onFailure()
    4. Failure Analysis
      • Prevents cascading failures by blocking avalanche of retries once the service is obviously sick.
  11. BULKHEAD / ISOLATION-FIRST RETRIES

    1. Explanation
      • Constrain retries to a dedicated pool (threads, connection pool). Fail fast if the pool is full.
    2. Pros / Cons
      • One tenant or feature cannot exhaust shared resources.
        – Added complexity; needs careful pool sizing.
    3. Pseudocode
      with retryPool.acquire(timeout=0):
          callWithBackoff()
    4. Failure Analysis
      • Limits blast-radius: even if retries misbehave, only the partitioned pool is affected, shielding the rest of the system.
  12. LOAD-SHED / ADAPTIVE CONCURRENCY WITH RETRY

    1. Explanation
      • Combine a concurrency-limiting algorithm (e.g., TCP BBR, Vegas, or gradient-based) with backoff. When the queue grows, drop or shed load.
    2. Pros / Cons
      • Keeps p99 latency predictable under load.
        – Complex control theory, may under-utilise capacity.
    3. Pseudocode
      if inflight >= allowed(): raise FastFail # shed
      callWithBackoff()
    4. Failure Analysis
      • Stops the driver of metastability: queue growth. By refusing excess work, the system avoids entering bad equilibria.
  13. REQUEST HEDGING (SPECULATIVE / PARALLEL RETRIES)

    1. Explanation
      • Send the same request to two or more replicas, cancel slower ones when the first completes.
    2. Pros / Cons
      • Shrinks tail latency in healthy state.
        – Doubles load; catastrophic if used against a degraded service.
        – Requires idempotency!
    3. Pseudocode
      future1 = async call(replicaA)
      sleep(hedgeDelay)                          # give the primary a head start
      if future1.done(): return future1.result()
      future2 = async call(replicaB)             # speculative copy to a second replica
      return firstCompleted(future1, future2)    # take the winner, cancel the other
    4. Failure Analysis
      • Not a protection technique per-se; must be combined with budgets & breakers. Otherwise, can exacerbate thundering herds.
  14. COORDINATED HEDGING (QUORUM OR ERASURE-CODED)

    1. Explanation
      • Instead of full duplicates, send partial work (shards or coded pieces) to multiple servers; reconstruct result from the fastest subset (e.g., Facebook’s f4 erasure coded retries).
    2. Pros / Cons
      • Hedge while bounding the extra load to at most one replica's worth.
        – Complex, application-specific encoding / merging logic.
    3. Pseudocode
      pieces = encode(request, k=6, m=4) # need any 6 of 10
      launch async call(piece) for each
      wait for 6 successes then decode
    4. Failure Analysis
      • Spreads load horizontally; avoids piling queries onto the same sick node; less risk of cascading overload than raw duplication.
  15. PREDICTIVE / ML-AWARE RETRY (STATE-OF-THE-ART)

    1. Explanation
      • Use on-device or edge model to predict failure probability & latency, choose strategy accordingly (skip retry, change pool, hedge, etc.).
    2. Pros / Cons
      • Can approach optimal trade-off dynamically.
        – Requires data, features, an offline pipeline, and constant validation.
    3. Pseudocode
      score = model.predict(features)
      if score.lowRisk(): return simpleRetry()
      elif score.medium(): return backoffWithJitter()
      else: return fastFail()
    4. Failure Analysis
      • If trained well, prevents metastable situations by detecting the onset earlier than threshold-based breakers. If trained poorly, can oscillate—so keep hard budgets as guard-rails.

PRACTICAL GUIDELINES

A. Compose, don’t replace: Jitter + Caps + Budgets + Circuit Breaker is the proven “safe default” stack for 90 % of systems.

B. Respect idempotency: Always ensure requests are safe to replay (dedupe keys, idempotency tokens) before enabling automatic retries or hedging.
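
For instance, a client can mint one idempotency key per logical operation and reuse it verbatim on every retry or hedge so the server can de-duplicate replays (a sketch in Python; the header name, URL, and use of the requests library are illustrative):

    import uuid
    import requests

    def create_order(payload, max_attempts=3):
        # One key per logical operation, reused on every attempt.
        headers = {"Idempotency-Key": str(uuid.uuid4())}
        resp = None
        for attempt in range(max_attempts):
            resp = requests.post("https://api.example.com/orders",
                                 json=payload, headers=headers, timeout=2)
            if resp.status_code < 500:   # success or a non-retryable client error
                return resp
            # Backoff/jitter omitted here to keep the focus on the key.
        return resp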

C. Observe & adapt: Instrument retries (attempt#, latency, outcome). Feed dashboards & alerts; tune before the outage, not during.

D. Fail fast at the edge: The further a request has travelled, the more expensive its retry. Push decision logic as close to the caller as possible (mobile SDK, sidecar).

E. Combine with chaos testing: Inject 500s and latency to ensure your chosen strategy truly contains thundering herds.

By graduating through this spectrum—adding just enough sophistication for your reliability and latency targets—you can prevent the classic distributed-system failure modes without succumbing to over-engineering.


eevmanu commented Jun 25, 2025

more detailed info per “novel” (imo ymmv) retry strategy - https://rentry.co/9edz2eie


eevmanu commented Jun 25, 2025

amazing article related to retry strategies

https://www.barroso.org/publications/TheTailAtScale.pdf

The Tail at Scale

Jeffrey Dean and Luiz André Barroso
