
@eevmanu
Last active June 25, 2025 15:47
retry strategies - distributed systems - from simplest one to most sophisticated one
  • no retries (the baseline)
    • The caller makes one attempt and propagates any error.
  • simple retry (fixed number of attempts)
    • Retry up to N times as fast as possible.
    • The most basic form of control flow is added: a loop. It introduces the concept of "more than one try" without any timing logic.
  • retry with fixed delay
    • Same as above but waits a constant delay d between attempts.
    • Adds a single, simple parameter—a static wait time. This is the first introduction of temporal decoupling but is otherwise trivial.
  • linear / incremental backoff
    • Delay grows by a fixed increment Δ: t = base + i·Δ.
    • The delay is now dynamic. It requires tracking the attempt number and performing simple arithmetic (addition), making it slightly more complex than a fixed delay.
  • exponential backoff
    • Delay doubles each attempt: base·2^i (capped).
    • The delay calculation becomes more complex (multiplication/exponentiation). This strategy is more effective at handling escalating contention than linear backoff, representing a more sophisticated understanding of system dynamics.
  • exponential backoff with jitter
    • Adds randomness to each delay to decorrelate clients.
    • Adds a layer of randomness on top of the exponential calculation. This sophistication arises from solving a specific, non-obvious distributed systems problem: the "thundering herd" or client-synchronization problem.
    • Common variants
      • Full Jitter : sleep = rand(0, backoff)
      • Equal Jitter : sleep = backoff/2 + rand(0, backoff/2)
      • Decorrelated : sleep = rand(base, prev*3)
  • capped backoff & retry budgets
    • Hard upper delay cap + quota of retries per time-window (budget).
    • Introduces the concept of constraints and state management over a time window (the budget). It's more complex because it's not just about a single operation's lifecycle but about managing the client's overall behavior over time.
  • circuit breaker + retry
    • Circuit breaker trips open when error rate or latency crosses a threshold; retries are short-circuited until a cool-down.
    • A significant leap in sophistication. This pattern introduces a state machine (CLOSED, OPEN, HALF-OPEN) into the client, making it stateful and responsive to the perceived health of the downstream service. It decides whether to try at all, not just how long to wait.
  • bulkheads & isolation-first retries
    • Constrain retries to a dedicated pool (threads, connection pool). Fail fast if the pool is full.
    • Moves beyond temporal logic to resource management. It requires isolating retries into dedicated resource pools (e.g., thread or connection pools), a system design pattern that contains failures. This is more complex as it involves application architecture, not just an algorithm.
  • server-hinted / retry-after aware backoff
    • Server replies with Retry-After seconds or a token; client obeys.
    • Introduces inter-system coordination. While the client-side logic can be simple (obey a header), it represents a more sophisticated, cooperative architecture where the server actively participates in load management.
  • adaptive / token-bucket retries
    • Client observes success rate or latency; a controller adjusts whether to retry and how many outstanding retries are allowed.
    • The client is no longer passive or pre-programmed; it actively learns from the outcomes of its requests (success/failure rates) and adjusts its retry policy dynamically, often using a token-bucket algorithm. This is a closed-loop feedback system.
  • request hedging (speculative or parallel retries)
    • Send the same request to two or more replicas, cancel slower ones when the first completes.
    • Introduces parallelism. This adds significant complexity in managing concurrent requests, handling race conditions for the first successful response, and cancelling redundant, in-flight requests.
  • adaptive concurrency & load-shed retries
    • Combine a concurrency-limiting algorithm (e.g., TCP BBR, Vegas, or gradient-based) with backoff. When the queue grows, drop or shed load.
    • Combines adaptive retries with advanced, dynamic concurrency limiting (e.g., TCP Vegas-style algorithms). The client is not just adapting its retry timing but its entire outbound request volume based on observed latency gradients, representing a deep level of system awareness.
  • coordinated hedging (quorum or erasure-coded)
    • Instead of full duplicates, send partial work (shards or coded pieces) to multiple servers; reconstruct result from the fastest subset (e.g., Facebook’s f4 erasure coded retries).
    • A highly advanced form of parallelism. Instead of sending redundant full requests, it requires complex algorithmic work to shard or erasure-code the request/data, distribute it, and reconstruct the result from a subset of responses.
  • predictive / ml-aware retry (state-of-the-art)
    • Use on-device or edge model to predict failure probability & latency, choose strategy accordingly (skip retry, change pool, hedge, etc.).
    • The pinnacle of complexity. It replaces static rules or simple feedback loops with a predictive machine learning model that makes sophisticated, context-aware decisions on whether and how to retry, hedge, or fail fast.

gemini 2.5 pro from aistudio

Of course. As an expert in distributed systems resilience, I'd be delighted to provide a comprehensive guide to retry strategies.

This is a critical topic. A naive retry strategy can be more dangerous than no retry strategy at all, turning a minor transient fault into a catastrophic system-wide outage. The following spectrum is designed to guide an engineer from basic concepts to a state-of-the-art, robust implementation that actively protects a system from complex failures.

Here is the full spectrum of retry strategies, from the simplest to the most sophisticated.


The Spectrum of Retry Strategies

0. No Retries (The Baseline)

This isn't a strategy but a baseline for comparison. When a request fails, it fails immediately and permanently for the client. The error is propagated up the call stack without any attempt to recover.

  • Concept: A request is attempted exactly once. If it fails, the failure is final.

  • Pros:

    • Simplicity: The easiest possible implementation (i.e., no implementation).
    • Fails Fast: The client gets an immediate response on failure, releasing resources quickly.
    • Safe for Downstream: Puts zero additional load on a potentially struggling downstream service.
  • Cons:

    • Brittle: The system has zero resilience to even the most minor transient network blips or temporary server unavailability. A single dropped packet can cause a user-facing error.
  • Pseudocode:

    function make_request(request) {
        try {
            return downstream_service.call(request);
        } catch (error) {
            // Log the error and return it immediately.
            log("Request failed, no retry.", error);
            throw error;
        }
    }
    
  • Analysis of Complex Failures:

    • This approach cannot cause cascading failures, thundering herds, or metastable failures. By not adding any retry load, it respects the state of the downstream service. However, it also makes the entire system fragile and unable to self-heal from transient issues, leading to poor availability. It's the safest option for the downstream service but the worst for client-side resilience.

1. Simple Retry (Fixed Number of Attempts)

This is the most basic and intuitive form of retry. If an operation fails, try it again a few more times.

  • Concept: Attempt the operation up to a fixed number, N, of times. The retries happen immediately, one after another, with no delay.

  • Pros:

    • Simple to Implement: Very little code is required.
    • Handles Instantaneous Flukes: Can recover from extremely short-lived errors where the resource becomes available again in microseconds.
  • Cons:

    • Brutal on Downstream Services: This is effectively a denial-of-service (DoS) attack on a struggling service. Sending rapid-fire retries will overwhelm it.
    • Wastes Resources: The client holds onto resources (threads, connections) while performing tight-loop retries.
  • Pseudocode:

    function make_request_simple_retry(request) {
        const max_attempts = 3;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
            }
        }
        // All attempts failed, throw the last captured error.
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: This is a major contributor to cascading failures. If a service is slow because it's overloaded, hammering it with immediate retries will push it from "slow" to "completely failed." This failure then cascades to services that depend on it.
    • Thundering Herd: Does not directly cause a thundering herd but exacerbates it. It ensures that every client hammers the downstream service as hard as possible during an outage.

2. Retry with Fixed Delay

A minor improvement, this strategy introduces a waiting period between retries, giving the downstream service a small amount of breathing room.

  • Concept: Wait a fixed amount of time (e.g., 200ms) between each failed attempt.

  • Pros:

    • Slightly Kinder: Less aggressive than immediate retries. Gives the downstream service a brief moment to recover.
    • Still Simple: Easy to understand and implement.
  • Cons:

    • Synchronized Retries (The "Thundering Herd" Problem): This is the critical flaw. If a service fails and all its clients have a 200ms fixed delay, they will all wait 200ms and then retry at the exact same time. This synchronized wave of traffic can knock the service over just as it's trying to recover.
  • Pseudocode:

    function make_request_fixed_delay(request) {
        const max_attempts = 3;
        const fixed_delay_ms = 200;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(fixed_delay_ms);
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Still a significant risk. While better than no delay, the synchronized retries can still overwhelm a struggling service.
    • Thundering Herd: This strategy is a classic cause of the thundering herd problem. The fixed delay synchronizes clients, creating waves of traffic that are highly destructive to recovering systems.

3. Exponential Backoff

This is the first truly viable retry strategy for distributed systems. The delay between retries increases exponentially with each failed attempt.

  • Concept: The wait time between retries grows, typically by a factor of 2, after each failure. For example: wait 100ms, then 200ms, then 400ms, and so on. A maximum delay (cap) is often used to prevent excessively long waits.

  • Pros:

    • Adapts to Outage Duration: By backing off, clients give the downstream service progressively more time to recover. This is effective for both short and longer-lasting issues.
    • Reduces System Pressure: The total load on the struggling service decreases over time as clients wait longer and longer.
  • Cons:

    • Still Prone to Synchronization: While the timing of retries is spread out for a single client, different clients that experience failure at the same time will still have correlated retry schedules (e.g., all retry at 100ms, then all at 200ms, etc.). This can still lead to spikes of traffic.
  • Pseudocode:

    function make_request_exponential_backoff(request) {
        const max_attempts = 5;
        const base_delay_ms = 100;
        let delay_ms = base_delay_ms;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(delay_ms);
                    // Exponentially increase the delay for the next attempt
                    delay_ms *= 2;
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Significantly reduces the risk compared to fixed delays. By backing off, it gives the downstream service a real chance to recover, preventing the client from delivering the final blow.
    • Thundering Herd: Helps mitigate, but does not solve. It can still create smaller, synchronized waves of retries that can be harmful, especially at the first and second retry intervals.

4. Exponential Backoff with Jitter

This is the industry-standard best practice for general-purpose retries. It builds on exponential backoff by adding randomness (jitter) to the wait times. This randomness is the key to breaking the synchronization that causes thundering herds.

  • Concept: The wait time is not just a fixed exponential value but a value chosen randomly from a window that still grows exponentially.

  • Types of Jitter:

    • Full Jitter: The simplest form. The delay is a random number between 0 and the exponential backoff ceiling.
      • sleep = random(0, base_delay * 2^attempt)
      • Pro: Excellent distribution, breaks synchronization effectively.
      • Con: Can result in very short (near-zero) delays on early retries, which can be aggressive.
    • Equal Jitter: A good compromise. It adds jitter to half of the backoff ceiling, ensuring a minimum wait time.
      • temp = base_delay * 2^attempt; sleep = temp/2 + random(0, temp/2)
      • Pro: Prevents the near-zero delays of Full Jitter while still providing good distribution.
    • Decorrelated Jitter (Often considered the best): A more advanced formula that uses the previous sleep time to create a smoother distribution over time, preventing collisions even better.
      • sleep = min(cap, random(base_delay, previous_sleep * 3))
      • Pro: Spreads out retries extremely well. It's highly effective at preventing traffic spikes.
  • Pseudocode (using Decorrelated Jitter):

    function make_request_with_jitter(request) {
        const max_attempts = 8;
        const base_delay_ms = 100;
        const max_delay_ms = 5000;
        let sleep_ms = base_delay_ms;
        let last_error;
    
        for (let attempt = 1; attempt <= max_attempts; attempt++) {
            try {
                // IMPORTANT: Only retry on transient/server-side errors (e.g., 503, not 400).
                return downstream_service.call(request);
            } catch (error) {
                log(`Attempt ${attempt} failed.`);
                last_error = error;
                if (attempt < max_attempts) {
                    sleep(sleep_ms);
                    // Decorrelated Jitter formula
                    sleep_ms = min(max_delay_ms, random(base_delay_ms, sleep_ms * 3));
                }
            }
        }
        throw last_error;
    }
    
  • Analysis of Complex Failures:

    • Cascading Failures: Highly effective at preventing them. The combination of backing off and adding jitter gives downstream services maximum breathing room to recover.
    • Thundering Herd: This is the primary solution to the thundering herd problem. By randomizing the retry times, clients no longer retry in synchronized waves. The load from retries is smeared out over time, appearing as a gentle, constant pressure rather than a series of hammer blows.
    • Metastable Failures: Jitter helps prevent the system from getting stuck in a bad state. By smoothing out retry load, it avoids the periodic spikes that can keep a service perpetually overloaded and unable to fully recover.

5. The Role of Circuit Breakers

A Circuit Breaker is not a retry strategy itself, but a critical companion pattern that governs if a request should even be attempted. It acts as a state machine that wraps around our retry logic.

  • Concept: Based on the electrical analogy, a circuit breaker monitors failures.

    1. Closed: The default state. Requests are allowed to pass through (and can be retried using backoff with jitter if they fail). If the number of failures exceeds a threshold in a given time window, the breaker "trips" and moves to the Open state.
    2. Open: For a configured timeout period, all requests fail immediately without even being attempted. This is "failing fast." This gives the downstream service complete protection and time to recover.
    3. Half-Open: After the timeout expires, the breaker allows a single "probe" request to pass through. If it succeeds, the breaker moves to Closed. If it fails, it returns to Open for another timeout period.
  • Pros:

    • Ultimate Downstream Protection: The most effective way to prevent a struggling service from being overwhelmed. It completely stops traffic to it.
    • Client-Side Resilience: Prevents the client application from wasting resources (threads, sockets, CPU) trying to contact a service that is known to be down.
    • Prevents Cascading Failures: This is its primary purpose. By stopping calls to a failed service, it prevents that failure from propagating up to the services that depend on the client.
  • Cons:

    • Implementation Complexity: Adds state and more configuration (failure thresholds, open-state duration) to the client.
    • Can Isolate a Recovered Service: If tuned incorrectly, the "open" state might last too long, preventing traffic from reaching a service that has already recovered.
  • Pseudocode (Conceptual; a minimal sketch of the breaker's internal state machine follows the analysis below):

    // State is shared across all calls to a specific service
    let circuit_breaker = new CircuitBreaker({
        failure_threshold: 5,
        reset_timeout_ms: 10000 // 10 seconds
    });
    
    function execute_with_circuit_breaker(request) {
        if (circuit_breaker.is_open()) {
            throw new Error("Circuit Breaker is open. Failing fast.");
        }
    
        try {
            // The retry logic (e.g., exponential backoff with jitter) goes here
            let result = make_request_with_jitter(request);
            circuit_breaker.record_success();
            return result;
        } catch (error) {
            circuit_breaker.record_failure();
            throw error;
        }
    }
    
  • Analysis of Complex Failures:

    • A Circuit Breaker is the single most important pattern for preventing cascading failures. It acts as a firewall, containing the "blast radius" of a failure to a single service integration point. It allows a system to gracefully degrade instead of collapsing entirely.
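
The CircuitBreaker object used in the conceptual pseudocode above is treated as a black box. As a rough illustration of the CLOSED / OPEN / HALF-OPEN state machine it hides, here is a minimal sketch in Python; the class shape, thresholds, and clock handling are assumptions for this guide, not any particular library's API.

    import time

    class CircuitBreaker:
        CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

        def __init__(self, failure_threshold=5, reset_timeout_s=10.0):
            self.failure_threshold = failure_threshold  # failures (since last success) before tripping
            self.reset_timeout_s = reset_timeout_s      # how long to stay OPEN before probing
            self.state = self.CLOSED
            self.failure_count = 0
            self.opened_at = 0.0

        def is_open(self):
            # After the cool-down expires, let a single probe through (HALF-OPEN).
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                    self.state = self.HALF_OPEN
                    return False
                return True
            return False

        def record_success(self):
            # Any success closes the breaker and clears the failure count.
            self.state = self.CLOSED
            self.failure_count = 0

        def record_failure(self):
            # A failed probe re-opens immediately; otherwise count toward the threshold.
            if self.state == self.HALF_OPEN or self.failure_count + 1 >= self.failure_threshold:
                self.state = self.OPEN
                self.opened_at = time.monotonic()
                self.failure_count = 0
            else:
                self.failure_count += 1

The execute_with_circuit_breaker wrapper above works unchanged against a breaker like this: check is_open() before calling, then record_success() or record_failure() based on the outcome.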

6. Advanced: Request Hedging

This is a sophisticated strategy focused on tail-latency reduction, but it has significant implications for resilience. It's not a replacement for the above but can be used in specific scenarios.

  • Concept: Instead of waiting for a request to fail, you proactively send a second, "hedged" request if the first one doesn't respond within a certain time (e.g., the 95th percentile latency). You then take whichever response comes back first and cancel the other.

  • Pros:

    • Reduces P99 Latency: Excellent at masking the occasional slow response from a server instance or a transient network hiccup, improving the user experience.
    • Increases Success Rate: Can turn a slow request (that might have timed out) into a successful one.
  • Cons:

    • Increases Load: This is the critical danger. Under normal conditions, it might only increase load by 5%, but if the downstream service becomes slow for all requests, hedging will effectively double the traffic, potentially triggering a catastrophic failure.
    • Requires Idempotency: The downstream operation must be idempotent, as it might be executed twice.
  • Pseudocode (Conceptual):

    function make_hedged_request(request) {
        const hedge_delay_ms = 100; // e.g., P95 latency
    
        let request1 = downstream_service.call_async(request);
        let timer = sleep_async(hedge_delay_ms);
    
        // Race the first request against the timer
        let first_result = await Promise.race([request1, timer]);
    
        if (first_result !== "timer_finished") {
            return first_result; // Request 1 finished quickly
        } else {
            // Timer finished, hedge the request
            let request2 = downstream_service.call_async(request);
            // Return whichever request finishes first now
            return await Promise.race([request1, request2]);
        }
    }
    
  • Analysis of Complex Failures:

    • Request Hedging is a double-edged sword. Used judiciously on a healthy system with spare capacity, it improves performance. Used carelessly or on a system already under load, it can actively cause a cascading failure or thundering herd by doubling the request volume at the worst possible moment. It should only be used for critical, idempotent operations and ideally be disabled automatically when system-wide error rates are high (e.g., when a circuit breaker is open).

Putting It All Together: The Ideal Resilience Strategy

For a modern, resilient distributed system, you don't choose one strategy; you layer them.

  1. Foundation: Every external call that can fail transiently should use Exponential Backoff with Decorrelated Jitter. This should be the default retry mechanism.
  2. Protection: Wrap this retry logic within a Circuit Breaker. This protects the rest of your system from a failing dependency and gives that dependency room to recover.
  3. Idempotency: Ensure that any operation that is retried is idempotent. Use mechanisms like a unique Request-ID or Idempotency-Key that the server can use to de-duplicate retried requests.
  4. Selective Optimization: For a small number of critical, latency-sensitive, and idempotent operations where P99 latency is paramount, consider adding Request Hedging, but ensure it can be disabled and is monitored closely.

By following this layered approach, you move from simply hoping a retry works to designing a system that is fundamentally resilient, self-healing, and robust against the complex failure modes that plague distributed environments.
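
To make the layering concrete, here is a minimal sketch (in Python, for concreteness) that stacks the pieces in that order: a circuit-breaker check, one idempotency key per logical operation, and decorrelated-jitter backoff around the call. The downstream_service object, the breaker's method names, and the header name are assumptions for illustration, not a specific framework's API.

    import random
    import time
    import uuid

    def call_with_resilience(downstream_service, request, breaker,
                             max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
        # Layer 2: the breaker decides whether we try at all.
        if breaker.is_open():
            raise RuntimeError("Circuit breaker is open; failing fast")

        # Layer 3: one idempotency key for the whole logical operation,
        # reused on every attempt so the server can de-duplicate replays.
        request.headers["Idempotency-Key"] = str(uuid.uuid4())

        sleep_s = base_delay_s
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                result = downstream_service.call(request)
                breaker.record_success()
                return result
            except Exception as error:   # in real code, retry only transient errors
                last_error = error
                breaker.record_failure()
                if attempt == max_attempts or breaker.is_open():
                    break                # out of attempts, or the breaker tripped
                # Layer 1: exponential backoff with decorrelated jitter.
                time.sleep(sleep_s)
                sleep_s = min(max_delay_s, random.uniform(base_delay_s, sleep_s * 3))
        raise last_error

Whether the breaker records an outcome per attempt (as sketched here) or once per logical operation (as in the conceptual pseudocode of section 5) is a design choice; either way, retries stop as soon as the breaker opens.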




o3 + search from chatgpt

Modern distributed systems live or die by how gracefully they handle the inevitable transient failure. Below is a progression of retry techniques—from “do nothing” to state-of-the-art patterns—showing where each shines, where it breaks, and how it influences dynamics like thundering-herd surges, metastable loops, and cascading outages. Read straight through to level up, or jump to the stage that matches your current architecture.

1 No Retries (baseline)

What it is: Fail fast. Return the error to the caller and let higher layers decide.

Pros

  • Zero extra load; simplest to reason about.
  • Eliminates the risk of retry storms entirely.

Cons

  • Caller must cope with every transient network blip.
  • End-user experience may degrade even though the service would have recovered milliseconds later.
resp = callService()
if resp.isError():
    return error           # no retry

Impact on complex failures: Doesn’t cause herd or cascade problems—but doesn’t help avoid them either. Useful as a contrast point when measuring incremental patterns.


2 Simple Retry (N attempts, no delay)

What it is: Blindly re-send the request a fixed number of times.

Pros

  • Handles the most ephemeral glitches with minimal code.

Cons

  • All retries land back-to-back, compounding server load; can ignite a thundering herd during a short outage. (encore.dev)
for attempt in 1..N:
    resp = call()
    if resp.ok(): return resp
return error

Failure dynamics: If many clients hit the same service, they synchronize their immediate retries, potentially turning a blip into overload and metastable failure. (sigops.org, usenix.org)


3 Retry with Fixed Delay

Insert a constant sleep between attempts.

Pros

  • Gives the target breathing room; trivial to implement.

Cons

  • Hard-coded delay is either too short (still herding) or too long (hurts latency).
  • Responds identically to minor and major outages.
for attempt in 1..N:
    resp = call()
    if resp.ok(): return resp
    sleep(300 ms)

Failure dynamics: Reduces spike amplitude but still lines requests up (a 300 ms cadence can pulse traffic).


4 Exponential Backoff

Delay doubles (or grows exponentially) each retry up to a cap. (aws.amazon.com, docs.aws.amazon.com)

Pros

  • Fast recovery for flukes, increasingly patient for longer outages.
  • Spreads load geometrically, lowering the odds of a service-wide meltdown.

Cons

  • Without randomness, many clients still share an identical delay schedule—the herd just stretches into waves.
base = 100 ms
for attempt in 0..N-1:
    resp = call()
    if resp.ok(): return resp
    sleep(min(base * 2**attempt, MAX))

Failure dynamics: Mitigates cascades by rapidly throttling callers, but synchronized waves can still pile up.


5 Exponential Backoff with Jitter

Add randomness to each delay. Variants: Full, Equal, Decorrelated jitter. (aws.amazon.com, aws.amazon.com, medium.com)

Pros

  • Breaks synchronization; requests smear across time.
  • Proven by AWS simulations to cut server work by orders of magnitude vs. plain exponential.

Cons

  • Requires a good RNG; analysis is probabilistic, not deterministic.
base = 100 ms
sleep_ms = random(0, base * 2**attempt)          # “Full Jitter”
sleep(sleep_ms)

Failure dynamics: Best single tactic for preventing thundering herds and metastable feedback loops—calls rarely collide.


6 Capped Backoff & Retry Budgets

Introduce a maximum delay and a global “retry budget” (e.g., ≤10 % of traffic). (medium.com, medium.com)

Pros

  • Guarantees an upper bound on user latency.
  • Budgets limit total retries across a fleet, preventing storm amplification.

Cons

  • Needs shared metrics or a client-side token bucket.
  • Budget exhaustion may reduce availability during protracted outages.
if retries_this_minute > budget:
    return error                # abort quickly

Failure dynamics: Stops runaway queues that fuel cascades; keeps the system inside its stress envelope.


7 Adaptive / Token-Bucket Retries

Backoff factor is adjusted dynamically based on recent throttle responses (429/503). AWS SDK “adaptive” mode is a reference implementation. (docs.aws.amazon.com)

Pros

  • Learns the safe send-rate per client in real time.
  • Naturally soaks up traffic spikes without a central coordinator.

Cons

  • Slightly more complex; relies on accurate error classification.
bucket.take()         # blocks until token available
call()
bucket.refill(rateFromServerFeedback)
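
A minimal sketch of the kind of token bucket assumed above (Python; the class shape and the way the rate reacts to throttle feedback are illustrative, not the AWS SDK's actual adaptive-mode implementation):

import time

class RetryTokenBucket:
    def __init__(self, rate_per_s=5.0, burst=10.0):
        self.rate_per_s = rate_per_s      # tokens added per second
        self.capacity = burst             # maximum stored tokens
        self.tokens = burst
        self.last = time.monotonic()

    def take(self):
        # Non-blocking variant: True means a retry is allowed right now.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_feedback(self, throttled):
        # Crude adaptation: shrink the refill rate on 429/503, grow it slowly on success.
        if throttled:
            self.rate_per_s = max(0.5, self.rate_per_s * 0.7)
        else:
            self.rate_per_s = min(50.0, self.rate_per_s + 0.1)

Each retry attempt calls take(); a False result means the retry budget is spent and the error should be surfaced instead of retried.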

Failure dynamics: Excels at preventing cascades because it enforces proportional back-pressure on every caller.


8 Circuit Breaker + Retry

A local fuse that “opens” after too many consecutive failures and “half-opens” to probe recovery. (medium.com, resilience4j.readme.io, dev.to)

Pros

  • Converts heavy retry loops into fast failures, shielding dependencies.
  • Provides a real-time health signal to upstream layers.

Cons

  • Poorly tuned thresholds can oscillate.
  • Adds statefulness to stateless clients.
if breaker.isOpen():
    return fallback()

resp = call()
if resp.isError():
    breaker.recordFailure()
else:
    breaker.recordSuccess()

Failure dynamics: Short-circuits cascades and caps the size of retry storms; when combined with jittered backoff, offers robust protection against metastable states.


9 Bulkheads & Isolation-First Retries

Retry only within a partition (per-tenant, per-priority queue).

Pros

  • Keeps a noisy neighbor from starving critical traffic.
  • Each bulkhead can tune its own retry policy.

Cons

  • Requires careful partition key choice; raises operational complexity.
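
A minimal isolation-first sketch (Python; the per-dependency semaphore pool and the call_with_backoff callable are illustrative assumptions):

import threading

# One small, dedicated pool of retry slots per downstream dependency (the bulkhead).
RETRY_SLOTS = {
    "billing": threading.BoundedSemaphore(4),
    "search": threading.BoundedSemaphore(4),
}

def retry_in_bulkhead(dependency, call_with_backoff):
    slots = RETRY_SLOTS[dependency]
    if not slots.acquire(blocking=False):
        # The pool is full: fail fast rather than letting retries pile up.
        raise RuntimeError(f"retry bulkhead for {dependency} is full")
    try:
        return call_with_backoff()     # e.g., jittered backoff scoped to this dependency
    finally:
        slots.release()

A storm of retries against "billing" can exhaust at most its own four slots; "search" keeps its capacity.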

Failure dynamics: Prevents failures in one slice from propagating system-wide, containing potential cascades.


10 Request Hedging (Speculative or Parallel Retries)

Send a second copy of a long-running request to another replica after a short hedge delay; keep the fastest response, cancel the rest. (1pkg.github.io, huizhou92.com)

Pros

  • Compresses long-tail latency (P99/P999).
  • Gracefully handles random stragglers without punishing median latency.

Cons

  • Extra load if hedges fire too early/too often.
  • Not a substitute for fixing root-cause slowness.
primary = callAsync()
wait(HEDGE_DELAY)                        # give the primary a head start
if primary.done(): return primary.result()
hedge = callAsync()                      # speculative copy to a second replica
return firstCompleted(primary, hedge)    # take whichever finishes first, cancel the other

Failure dynamics: Reduces perceived outage scope by masking single-node brownouts; however, if overused during system-wide failures it can increase load—pair with budgets and circuit breakers.


11 Adaptive Concurrency & Load-Shed Retries

Combine retry patterns with concurrency-limit algorithms (BBR, AIMD) so each client maintains just enough in-flight calls. (aws.amazon.com, beerus11.medium.com)

Pros

  • Achieves near-optimal throughput vs. latency across wide failure modes.
  • Automatically sheds lower-priority work when saturation is near.

Cons

  • Heavier implementation effort; needs caller-side latency measurement.
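
A minimal AIMD-style sketch of the idea (Python; the limits, the latency threshold, and the additive/multiplicative constants are illustrative, and production limiters such as gradient or Vegas-style algorithms are considerably more careful):

class AIMDLimiter:
    def __init__(self, limit=10, min_limit=1, max_limit=200):
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.inflight = 0

    def try_acquire(self):
        # Shed the request (first attempt or retry) when we are already at the limit.
        if self.inflight >= self.limit:
            return False
        self.inflight += 1
        return True

    def release(self, latency_s, slow_threshold_s=0.5):
        self.inflight -= 1
        if latency_s > slow_threshold_s:
            # Multiplicative decrease when latency spikes.
            self.limit = max(self.min_limit, int(self.limit * 0.7))
        else:
            # Additive increase while things look healthy.
            self.limit = min(self.max_limit, self.limit + 1)

Retries pass through try_acquire() exactly like first attempts, so a saturated client sheds its own retry load first (a real implementation would also guard this state with a lock).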

Failure dynamics: Actively prevents metastable overload loops by throttling at the source based on observed RTT spikes—one of the few techniques strong enough to arrest live cascades.


Putting it all together

In practice you stack these patterns:

  1. Start with exponential backoff with jitter.
  2. Cap total retries with budgets and deadlines.
  3. Wrap every dependency in a circuit breaker.
  4. Guard concurrency with token buckets or adaptive algorithms.
  5. Add hedging selectively for latency-critical calls.

When simulated at Amazon, the jittered-backoff + budget + breaker stack reduced peak load during an outage by orders of magnitude compared with naïve retries. (aws.amazon.com, docs.aws.amazon.com, aws.amazon.com)


Key Take-aways

  • Randomized backoff is the single most effective upgrade from basic retries.
  • Budgets and token buckets curb retry storms by making retries a scarce resource.
  • Circuit breakers act as automatic firewalls between components, limiting blast-radius.
  • Hedging tackles tail latency but must be paired with budgets to avoid self-harm.
  • Adaptive concurrency closes the loop by using live performance feedback to self-throttle.

Master these rungs, and your system will survive the next partial outage rather than spiraling into a self-inflicted catastrophe.




gemini 2.5 pro + grounding from aistudio

Of course. Here is a comprehensive guide to the full spectrum of retry strategies, designed to be both thoughtful and forward-thinking for engineers building resilient distributed systems.

The Spectrum of Retry Strategies: From Simple to Sophisticated

Here is a breakdown of retry strategies, starting with the most basic and moving towards more elaborate and robust solutions.

1. No Retries (The Baseline)

  • Concept: This is the simplest approach, where a failed request is not attempted again. The failure is immediately propagated back to the calling service or user.
  • Pros:
    • Extremely simple to implement.
    • Prevents the risk of "retry storms" where multiple clients overwhelm a struggling service with repeated requests.
  • Cons:
    • Offers no resilience to transient failures, which are common in distributed systems.
    • Can lead to a poor user experience as temporary glitches result in hard failures.
  • Pseudocode:
    function makeRequest(request) {
      try {
        return execute(request);
      } catch (error) {
        // Immediately return the error
        throw error;
      }
    }
    
  • Analysis: While "No Retries" avoids the complexities and potential dangers of more advanced strategies, it does nothing to prevent cascading failures. A single transient error in a downstream service can ripple upwards, causing failures throughout the system.

2. Simple Retry (Fixed Number of Attempts)

  • Concept: When a request fails, it is immediately retried a fixed number of times.
  • Pros:
    • Simple to implement and understand.
    • Can handle very short-lived transient failures.
  • Cons:
    • Can easily lead to a "thundering herd" problem, where many clients simultaneously retry, overwhelming the downstream service.
    • Does not account for the possibility that the service needs time to recover.
  • Pseudocode:
    function makeRequest(request, maxRetries = 3) {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
        }
      }
    }
    
  • Analysis: This strategy can turn a transient issue into a major outage. By immediately retrying, it can amplify the load on a struggling service, potentially causing a cascading failure.

3. Retry with Fixed Delay

  • Concept: This strategy introduces a fixed waiting period between retry attempts.
  • Pros:
    • Slightly more sophisticated than a simple retry, as it gives the downstream service a brief recovery period.
  • Cons:
    • Still susceptible to the "thundering herd" problem if many clients retry after the same fixed delay.
    • The fixed delay may not be optimal for all failure scenarios.
  • Pseudocode:
    function makeRequest(request, maxRetries = 3, delay = 1000) {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          sleep(delay);
        }
      }
    }
    
  • Analysis: While an improvement, the fixed delay does little to prevent synchronized retries, a key contributor to cascading failures.

4. Exponential Backoff

  • Concept: The delay between retries increases exponentially with each failed attempt. For example, the first retry might be after 1 second, the second after 2 seconds, the third after 4 seconds, and so on.
  • Pros:
    • Significantly reduces the likelihood of a "thundering herd" by spreading out retry attempts over time.
    • Gives the downstream service progressively more time to recover.
  • Cons:
    • Without a cap, the backoff delay can become excessively long.
    • Synchronized failures can still occur if multiple clients start their backoff sequence at the same time.
  • Pseudocode:
    function makeRequest(request, maxRetries = 5) {
      let delay = 1000; // Initial delay of 1 second
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          sleep(delay);
          delay *= 2; // Double the delay for the next attempt
        }
      }
    }
    
  • Analysis: Exponential backoff is a significant step towards preventing cascading failures. By reducing the rate of retries, it lessens the pressure on a struggling service.

5. Exponential Backoff with Jitter

  • Concept: This strategy adds a small, random amount of time to the exponential backoff delay. This "jitter" prevents clients from retrying in synchronized waves.
    • Full Jitter: The next delay is a random value between 0 and the exponential backoff value.
    • Equal Jitter: The next delay is half the exponential backoff value plus a random value between 0 and that half.
  • Pros:
    • Highly effective at preventing the "thundering herd" problem by desynchronizing client retries.
  • Cons:
    • Slightly more complex to implement than simple exponential backoff.
  • Pseudocode (Full Jitter):
    function makeRequest(request, maxRetries = 5) {
      let baseDelay = 1000;
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return execute(request);
        } catch (error) {
          if (attempt === maxRetries) {
            throw error;
          }
          const jitter = Math.random() * baseDelay;
          sleep(jitter);
          baseDelay *= 2;
        }
      }
    }
    
  • Analysis: Exponential backoff with jitter is a robust strategy for preventing cascading failures caused by retry storms. It is a widely recommended best practice.

6. Circuit Breaker + Retry

  • Concept: A circuit breaker monitors the number of failures from a particular service. If the failure rate exceeds a threshold, the circuit breaker "trips" or "opens," and for a period of time, all subsequent requests to that service will fail immediately without even being attempted. After a timeout, the circuit breaker enters a "half-open" state, allowing a limited number of test requests. If these succeed, the circuit closes and normal operation resumes. If they fail, the circuit remains open.
  • Pros:
    • Prevents an application from repeatedly trying to execute an operation that is likely to fail, saving system resources.
    • Gives a failing service time to recover by shedding load completely.
  • Cons:
    • Can be complex to configure and tune correctly.
    • In a "half-open" state, a recovering service could be flooded with requests.
  • Pseudocode:
    const circuitBreaker = new CircuitBreaker(options);
    
    function makeRequestWithCircuitBreaker(request) {
      if (!circuitBreaker.isClosed()) {
        throw new Error("Circuit breaker is open");
      }
      try {
        const response = execute(request);
        circuitBreaker.recordSuccess();
        return response;
      } catch (error) {
        circuitBreaker.recordFailure();
        // Can be combined with a retry strategy here
        throw error;
      }
    }
    
  • Analysis: The Circuit Breaker pattern is a powerful defense against cascading failures. By stopping requests to a failing service, it prevents the problem from spreading. When combined with a retry strategy, the retry logic should be aware of the circuit breaker's state to avoid retrying when the circuit is open.

7. Request Hedging (Speculative or Parallel Retries)

  • Concept: Instead of waiting for a request to fail before retrying, request hedging sends a second request after a short delay, without cancelling the first. The first response to return is used, and the other request is cancelled. This is particularly useful for reducing tail latency.
  • Pros:
    • Can significantly improve the latency of requests, especially in systems with high variance in response times.
  • Cons:
    • Increases the load on the downstream service by sending duplicate requests.
    • Should only be used for idempotent operations.
  • Pseudocode:
    function makeHedgedRequest(request, hedgeDelay = 100) {
      const originalRequest = execute(request);
      const hedgedRequest = sleep(hedgeDelay).then(() => execute(request));
    
      return Promise.race([originalRequest, hedgedRequest]);
    }
    
  • Analysis: While effective for latency reduction, request hedging must be used with caution. Unchecked, it can easily exacerbate a system overload and contribute to cascading failures. It's often combined with other mechanisms like retry budgets to limit the number of hedged requests.

8. Adaptive / Token-Bucket Retries

  • Concept: This is an advanced strategy where the retry behavior adapts to the real-time conditions of the system. One common implementation is the "token bucket" algorithm. A client has a "bucket" of tokens that are replenished at a fixed rate. Each retry consumes a token. If the bucket is empty, no more retries are allowed until a token is replenished.
  • Pros:
    • Provides a dynamic and responsive way to control retry rates based on system health.
    • Can prevent a single misbehaving client from overwhelming a service.
  • Cons:
    • More complex to implement and tune than other strategies.
  • Pseudocode (Token Bucket):
    const tokenBucket = new TokenBucket(capacity, refillRate);
    
    function makeRequestWithTokenBucket(request) {
      try {
        return execute(request);
      } catch (error) {
        if (tokenBucket.consume()) {
          // A retry token is available: attempt the request once more
          return execute(request);
        } else {
          throw new Error("Out of retry tokens");
        }
      }
    }
    
  • Analysis: Adaptive retries are highly effective at preventing cascading failures by intelligently throttling retries based on the system's ability to handle them. This approach helps to avoid both metastable failures and thundering herds.

9. Bulkheads & Isolation-First Retries

  • Concept: The bulkhead pattern isolates elements of an application into pools so that if one fails, the others can continue to function. In the context of retries, this means isolating the resources (like thread pools or connection pools) used for retrying requests for a specific service.
  • Pros:
    • Prevents a failure in one service from consuming all the resources of the calling application, thus containing the failure.
  • Cons:
    • Can add complexity to the application's architecture.
  • Pseudocode:
    // Each service has its own dedicated thread pool for retries
    const serviceAThreadPool = new ThreadPool(10);
    const serviceBThreadPool = new ThreadPool(10);
    
    function retryForServiceA(request) {
      serviceAThreadPool.submit(() => {
        // Retry logic for service A
      });
    }
    
  • Analysis: The bulkhead pattern is a crucial strategy for building resilient systems. By isolating failures, it prevents them from cascading and bringing down the entire application. When combined with intelligent retry strategies, it provides a very robust defense against complex failures.

10. Adaptive Concurrency & Load-Shed Retries

  • Concept: This is a sophisticated approach where the system adaptively controls the concurrency of requests based on observed latency. If latency increases, the system reduces the concurrency limit, effectively "shedding" load. Retries are only attempted if there is available capacity within the current concurrency limit.
  • Pros:
    • Provides a very fine-grained and adaptive control over the load placed on downstream services.
    • Can prioritize critical requests during periods of high load.
  • Cons:
    • Requires a sophisticated monitoring and control system.
  • Pseudocode:
    const adaptiveConcurrencyControl = new AdaptiveConcurrencyControl();
    
    function makeRequestWithAdaptiveConcurrency(request) {
      if (adaptiveConcurrencyControl.hasCapacity()) {
        adaptiveConcurrencyControl.onRequestStart();
        const promise = execute(request);
        promise.finally(() => adaptiveConcurrencyControl.onRequestEnd());
        return promise;
      } else {
        // Shed load, potentially retrying later if capacity becomes available
        throw new Error("Concurrency limit reached");
      }
    }
    
  • Analysis: This is one of the most advanced and effective strategies for preventing complex failures. By directly managing concurrency based on real-time performance, it can prevent overload scenarios before they escalate into cascading failures or metastable failure states.



o3-pro from platform openai - 8m4s

THE RETRY SPECTRUM
(Ordered from the most naïve to the most sophisticated)

  1. NO RETRY (BASELINE)

    1. Explanation
      • The caller makes one attempt and propagates any error.
    2. Pros / Cons
      • Zero added load, simplest reasoning.
        – Low availability, caller must deal with every transient error.
    3. Pseudocode
      result = call() # if it fails, bubble up
    4. Effect on Complex Failures
      • Safest for the server: no amplification or thundering herd.
      • Caller must implement its own fallback or accept the outage.
  2. IMMEDIATE SIMPLE RETRY (FIXED ATTEMPTS, NO DELAY)

    1. Explanation
      • Retry up to N times as fast as possible.
    2. Pros / Cons
      • Hides ultra-short blips (TCP SYN drop, etc.).
        – Multiplies traffic by N under real failure, accelerating overload.
    3. Pseudocode
      for i in 1‥N:
          if call(): return ok
      raise Error
    4. Failure Analysis
      • Worsens thundering herd, metastable and cascading failures because every failing request becomes N new requests immediately.
  3. RETRY WITH FIXED DELAY

    1. Explanation
      • Same as above but waits a constant delay d between attempts.
    2. Pros / Cons
      • Gives failing server breathing room.
        – Delay must be tuned; too small = herd, too large = latency.
    3. Pseudocode
      for i in 1‥N:
          if call(): return ok
          sleep(d)
    4. Failure Analysis
      • Reduces rate-amplification factor by spacing retries, but the retry waves remain synchronized → still vulnerable to herd effects.
  4. LINEAR / INCREMENTAL BACKOFF

    1. Explanation
      • Delay grows by a fixed increment Δ: t = base + i·Δ.
    2. Pros / Cons
      • Simpler math than exponential; controllable upper bound.
        – Slow to ease pressure compared with exponential.
    3. Pseudocode
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(delay)
          delay += Δ
    4. Failure Analysis
      • Better than fixed but still partially synchronized → waves of load.
  5. EXPONENTIAL BACKOFF

    1. Explanation
      • Delay doubles each attempt: base·2^i (capped).
    2. Pros / Cons
      • Rapidly shrinks retry traffic on persistent failure.
        – Clients become synchronized because they are deterministic.
    3. Pseudocode
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(delay)
          delay = min(delay*2, maxDelay)
    4. Failure Analysis
      • Helps mitigate cascading failures by lowering pressure quickly, but the synchronized doubling can still produce retry bursts (see the November 2020 AWS Kinesis outage).
  6. EXPONENTIAL BACKOFF WITH JITTER

    1. Explanation
      • Adds randomness to each delay to decorrelate clients.
      • Common variants
      a. Full Jitter : sleep = rand(0, backoff)
      b. Equal Jitter : sleep = backoff/2 + rand(0, backoff/2)
      c. Decorrelated : sleep = rand(base, prev*3)
    2. Pros / Cons
      • Breaks synchronization, dramatically lowers peak load.
        – Harder to test; long tail latency for unlucky clients.
    3. Pseudocode (Full Jitter)
      delay = base
      for i in 1‥N:
          if call(): return ok
          sleep(rand(0, delay))
          delay = min(delay*2, maxDelay)
    4. Failure Analysis
      • One of the best generic defences against thundering herds and metastable loops. Peaks smooth out; average retries still limited by exponential growth.
  7. CAPPED BACKOFF & RETRY BUDGET

    1. Explanation
      • Hard upper delay cap + quota of retries per time-window (budget).
    2. Pros / Cons
      • Prevents “retry storms” when outages last hours.
      • Budgets give SREs a knob (e.g., 10% of traffic allowed as retries).
        – Might starve low-QPS services during budget exhaustion.
    3. Pseudocode
      bucket = TokenBucket(rate=R, burst=B)
      for i in 1‥N:
          if not bucket.take(): break     # budget exhausted
          if call(): return ok
          sleep(rand(0, min(maxDelay, base*2**i)))
    4. Failure Analysis
      • Converts unbounded amplification into a consciously chosen factor (<=B). Greatly limits cascades.
  8. SERVER-HINTED / RETRY-AFTER AWARE BACKOFF

    1. Explanation
      • Server replies with Retry-After seconds or a token; client obeys.
    2. Pros / Cons
      • Server can coordinate clients (pushback).
        – Requires protocol support and correct server logic.
    3. Pseudocode
      resp = call()
      if resp.status in (429, 503) and resp.retryAfter:
          sleep(resp.retryAfter)          # obey the server's pacing hint
          retry()
    4. Failure Analysis
      • Dramatically reduces overload because the server dictates pacing. Still needs fallbacks if the server is completely down.
  9. TOKEN-BUCKET / ADAPTIVE RETRY (CLIENT SIDE)

    1. Explanation
      • Client observes success rate or latency; a controller adjusts whether to retry and how many outstanding retries are allowed.
    2. Pros / Cons
      • Self-tuning to current conditions.
        – Requires measurements and a feedback loop; unstable if tuned poorly.
    3. Pseudocode (success-rate gating)
      window = SlidingWindow(100)
      for i in 1‥N:
          ok = call()
          window.record(ok)                       # feed the outcome into the sliding window
          if ok: return ok
          if window.successRate() < 0.8: break    # abort retries when health degrades
          sleep(rand(0, backoff))
    4. Failure Analysis
      • Cuts retries once success deteriorates → protects the backend from a spiraling failure, avoids metastable amplification.
  10. CIRCUIT BREAKER + RETRY

    1. Explanation
      • Circuit breaker trips open when error rate or latency crosses a threshold; retries are short-circuited until a cool-down.
    2. Pros / Cons
      • Fast fail protects both client resources and server.
      • The “half-open” probe allows early recovery.
        – Requires correct thresholds; can flap under borderline load.
    3. Pseudocode
      cb = CircuitBreaker(failureRate=0.5, window=10s)
      if not cb.allow(): raise FastFail
      if call(): cb.onSuccess()
      else: cb.onFailure()
    4. Failure Analysis
      • Prevents cascading failures by blocking avalanche of retries once the service is obviously sick.
  11. BULKHEAD / ISOLATION-FIRST RETRIES

    1. Explanation
      • Constrain retries to a dedicated pool (threads, connection pool). Fail fast if the pool is full.
    2. Pros / Cons
      • One tenant or feature cannot exhaust shared resources.
        – Added complexity; needs careful pool sizing.
    3. Pseudocode
      with retryPool.acquire(timeout=0):
          callWithBackoff()
    4. Failure Analysis
      • Limits blast-radius: even if retries misbehave, only the partitioned pool is affected, shielding the rest of the system.
  12. LOAD-SHED / ADAPTIVE CONCURRENCY WITH RETRY

    1. Explanation
      • Combine a concurrency-limiting algorithm (e.g., TCP BBR, Vegas, or gradient-based) with backoff. When the queue grows, drop or shed load.
    2. Pros / Cons
      • Keeps p99 latency predictable under load.
        – Complex control theory, may under-utilise capacity.
    3. Pseudocode
      if inflight >= allowed(): raise FastFail # shed
      callWithBackoff()
    4. Failure Analysis
      • Stops the driver of metastability: queue growth. By refusing excess work, the system avoids entering bad equilibria.
  13. REQUEST HEDGING (SPECULATIVE / PARALLEL RETRIES)

    1. Explanation
      • Send the same request to two or more replicas, cancel slower ones when the first completes.
    2. Pros / Cons
      • Shrinks tail latency in healthy state.
        – Doubles load; catastrophic if used against a degraded service.
        – Requires idempotency!
    3. Pseudocode
      future1 = async call(replicaA)
      sleep(hedgeDelay)                          # give the primary a head start
      if future1.done(): return future1.result()
      future2 = async call(replicaB)             # speculative copy to a second replica
      return firstCompleted(future1, future2)    # take the winner, cancel the other
    4. Failure Analysis
      • Not a protection technique per-se; must be combined with budgets & breakers. Otherwise, can exacerbate thundering herds.
  14. COORDINATED HEDGING (QUORUM OR ERASURE-CODED)

    1. Explanation
      • Instead of full duplicates, send partial work (shards or coded pieces) to multiple servers; reconstruct result from the fastest subset (e.g., Facebook’s f4 erasure coded retries).
    2. Pros / Cons
      • Hedge while bounding the extra load to at most one replica's worth.
        – Complex, application-specific encoding / merging logic.
    3. Pseudocode
      pieces = encode(request, k=6, m=4) # need any 6 of 10
      launch async call(piece) for each
      wait for 6 successes then decode
    4. Failure Analysis
      • Spreads load horizontally; avoids piling queries onto the same sick node; less risk of cascading overload than raw duplication.
  15. PREDICTIVE / ML-AWARE RETRY (STATE-OF-THE-ART)

    1. Explanation
      • Use on-device or edge model to predict failure probability & latency, choose strategy accordingly (skip retry, change pool, hedge, etc.).
    2. Pros / Cons
      • Can approach optimal trade-off dynamically.
        – Requires data, features, an offline pipeline, and constant validation.
    3. Pseudocode
      score = model.predict(features)
      if score.lowRisk(): return simpleRetry()
      elif score.medium(): return backoffWithJitter()
      else: return fastFail()
    4. Failure Analysis
      • If trained well, prevents metastable situations by detecting the onset earlier than threshold-based breakers. If trained poorly, can oscillate—so keep hard budgets as guard-rails.

PRACTICAL GUIDELINES

A. Compose, don’t replace: Jitter + Caps + Budgets + Circuit Breaker is the proven “safe default” stack for 90 % of systems.

B. Respect idempotency: Always ensure requests are safe to replay (dedupe keys, idempotency tokens) before enabling automatic retries or hedging.
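
For instance, a client can mint one idempotency key per logical operation and reuse it verbatim on every retry or hedge so the server can de-duplicate replays (a sketch in Python; the header name, URL, and use of the requests library are illustrative):

    import uuid
    import requests

    def create_order(payload, max_attempts=3):
        # One key per logical operation, reused on every attempt.
        headers = {"Idempotency-Key": str(uuid.uuid4())}
        resp = None
        for attempt in range(max_attempts):
            resp = requests.post("https://api.example.com/orders",
                                 json=payload, headers=headers, timeout=2)
            if resp.status_code < 500:   # success or a non-retryable client error
                return resp
            # Backoff/jitter omitted here to keep the focus on the key.
        return resp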

C. Observe & adapt: Instrument retries (attempt#, latency, outcome). Feed dashboards & alerts; tune before the outage, not during.

D. Fail fast at the edge: The further a request has travelled, the more expensive its retry. Push decision logic as close to the caller as possible (mobile SDK, sidecar).

E. Combine with chaos testing: Inject 500s and latency to ensure your chosen strategy truly contains thundering herds.

By graduating through this spectrum—adding just enough sophistication for your reliability and latency targets—you can prevent the classic distributed-system failure modes without succumbing to over-engineering.


eevmanu commented Jun 25, 2025

more detailed info per “novel” (imo ymmv) retry strategy - https://rentry.co/9edz2eie


eevmanu commented Jun 25, 2025

amazing article related to retry strategies

https://www.barroso.org/publications/TheTailAtScale.pdf

The Tail at Scale

Jeffrey Dean and Luiz André Barroso
