@rponte
Last active June 20, 2025 20:05

THEORY: Little's Law and Applying Back Pressure When Overloaded

Applying Back Pressure When Overloaded

[...]

Let’s assume we have asynchronous transaction services fronted by input and output queues, or similar FIFO structures. If we want the system to meet a response-time quality-of-service (QOS) guarantee, then we need to consider the following three variables:

  1. The time taken for individual transactions on a thread
  2. The number of threads in a pool that can execute transactions in parallel
  3. The length of the input queue to set the maximum acceptable latency
max latency  = (transaction time / number of threads) * queue length
queue length = max latency / (transaction time / number of threads)

By allowing the queue to be unbounded, the latency will continue to increase. So if we want to set a maximum response time, we need to limit the queue length.

By bounding the input queue, we block the thread receiving network packets, which applies back pressure upstream. If the network protocol is TCP, similar back pressure is applied via the filling of network buffers on the sender. This process can repeat all the way back via the gateway to the customer. For each service, we need to configure the queues so that they do their part in achieving the required quality-of-service for the end-to-end customer experience.

One of the biggest wins I often find is to improve the time taken to process individual transactions. This helps in both the best- and worst-case scenarios.
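To make the formula concrete, here is a minimal Python sketch; the transaction time, thread count, and latency target are made-up numbers, not taken from the article:

# Sizing a bounded input queue from the formula above.
# Assumptions (illustrative only): 20 ms per transaction, 8 worker threads, 500 ms latency target.
transaction_time = 0.020   # seconds per transaction on one thread
num_threads = 8
max_latency = 0.500        # quality-of-service target in seconds

queue_length = int(max_latency / (transaction_time / num_threads))    # 200 slots

# Sanity check: worst-case latency seen by the last request that still fits in the queue.
worst_case_latency = (transaction_time / num_threads) * queue_length  # 0.5 seconds

print(queue_length, worst_case_latency)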

[...]

rponte commented Mar 12, 2019

rponte commented May 12, 2021

Notes on Distributed Systems for Young Bloods

Implement backpressure throughout your system. Backpressure is the signaling of failure from a serving system to the requesting system and how the requesting system handles those failures to prevent overloading itself and the serving system. Designing for backpressure means bounding resource use during times of overload and times of system failure. This is one of the basic building blocks of creating a robust distributed system.

Implementations of backpressure usually involve either dropping new messages on the floor, or shipping errors back to users (and incrementing a metric in both cases) when a resource becomes limited or failures occur. Timeouts and exponential back-offs on connections and requests to other systems are also essential.

Without backpressure mechanisms in place, cascading failure or unintentional message loss become likely. When a system is not able to handle the failures of another, it tends to emit failures to another system that depends on it.
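A minimal sketch of what such an implementation could look like, assuming a simple in-process worker with a bounded queue; the queue size, the Overloaded error, and the counter are illustrative, not taken from the quoted post:

# Load shedding with a bounded queue: reject new work when the queue is full,
# increment a metric, and ship an error back to the caller so it can back off.
import queue
import threading

MAX_QUEUE = 100
work_queue = queue.Queue(maxsize=MAX_QUEUE)
rejected_count = 0                     # stand-in for a real metrics client
rejected_lock = threading.Lock()

class Overloaded(Exception):
    """Backpressure signal: the caller should back off or fail fast."""

def submit(task):
    global rejected_count
    try:
        work_queue.put_nowait(task)    # bounded: fail fast instead of growing forever
    except queue.Full:
        with rejected_lock:
            rejected_count += 1        # "incrementing a metric in both cases"
        raise Overloaded("queue full, try again later")

def worker():
    while True:
        task = work_queue.get()
        try:
            task()
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()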

rponte commented May 13, 2021

https://dev.to/avraammavridis/implementing-backpressure-for-smoother-user-experience-in-low-end-devices-3cem

The Reactive Manifesto has a better definition of backpressure than I could probably write:

When one component is struggling to keep up, the system as a whole needs to respond in a sensible way. It is unacceptable for the component under stress to fail catastrophically or to drop messages in an uncontrolled fashion. Since it can’t cope and it can’t fail it should communicate the fact that it is under stress to upstream components and so get them to reduce the load. This back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it. The back-pressure may cascade all the way up to the user, at which point responsiveness may degrade, but this mechanism will ensure that the system is resilient under load, and will provide information that may allow the system itself to apply other resources to help distribute the load.

There are two ways to achieve backpressure, and we have to choose based on the needs of our application: the loss-less strategy and the lossy strategy.

rponte commented May 25, 2021

Asynchronous streams and callbacks in Java

Streams require flow control to prevent a publisher from overflowing a subscriber. Where synchronous streams can rely on primitives such as a BlockingQueue to provide this flow control, asynchronous streams need to rely on callbacks to simulate backpressure. This creates a complex contract between a publisher and subscriber, making implementation and usage of asynchronous streams harder and more error prone than their synchronous counterparts. This is not to say asynchronous streams are bad. But it is to demonstrate that while useful, asynchronous streams are built on top of a complex abstraction, namely that of simulating flow control with callbacks.
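For contrast, a minimal sketch of the synchronous case mentioned above, with Python's queue.Queue playing the role of Java's BlockingQueue (the sizes and the sleep are arbitrary):

# Synchronous flow control: the bounded queue blocks the producer whenever the
# consumer cannot keep up, so no callback contract is needed.
import queue
import threading
import time

stream = queue.Queue(maxsize=10)       # the bound is the backpressure

def producer():
    for i in range(100):
        stream.put(i)                  # blocks while 10 items are already buffered
    stream.put(None)                   # sentinel: end of stream

def consumer():
    while True:
        item = stream.get()
        if item is None:
            break
        time.sleep(0.01)               # deliberately slow subscriber
        print("consumed", item)

threading.Thread(target=producer).start()
consumer()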

rponte commented May 25, 2021

Health Checks and Graceful Degradation in Distributed Systems - by Cindy Sridharan (@copyconstruct)

Coarse-grained health checks might be sufficient for orchestration systems, but prove to be inadequate to ensure quality-of-service and prevent cascading failures in distributed systems. Load balancers need application level visibility in order to successfully and accurately propagate backpressure to clients. It’s impossible for a service to degrade gracefully if it’s not possible to determine its health at any given time accurately. Without timely and sufficient backpressure, services can quickly descend into the quicksands of failure.

rponte commented Jul 4, 2023

ScyllaDB's talk: How to Maximize Database Concurrency

Q: No system supports unlimited concurrency. Otherwise, no matter how fast your database is, you’ll eventually end up reaching the retrograde region. Do you have any recommendations for workloads that are unbound and/or spiky by nature?

A: Yeah, in my career, all workloads are spiky so you always get to solve this problem. You need to think about what happens when the wheels fall off. When work bunches up, what are you going to do? You have two choices. You can queue and wait or you can load shed. You need to make that choice consistently for your system. Your chosen scale chooses a throughput you can handle within SLA. Your SLA tells you how often you’re allowed to violate your latency target so make sure the reality you’ve observed violates your design throughput well less than the limit your SLA requires. And then when reality drives you outside that window, you either queue and wait and blow up your latency or you load shed and effectively force the user to implement the queue themselves. Load shedding is not magic. It’s just choosing to force the user to implement the queue.

rponte commented Jul 11, 2024

rponte commented Aug 9, 2024

Excellent article by Gregor Hohpe on Queues invert control flow but require flow control 👏🏻👏🏻

💡 Tweet by @ghohpe
⭐️ EIP: Queues invert control flow but require flow control

In this article, Gregor talks about how queues change the Traffic Shaping of messaging systems because they invert the control flow of the communication, which allows a system to handle spikes or increases in load:

[figure from the article]

The great advantage of traffic shaping is that it's much easier to build and tune a system that handles the traffic pattern on the consumer (receiver) side than the producer (sender) side.

When Gregor talks about flow control he's talking about:

Managing the rate of data transmission between two nodes to prevent a fast sender from overwhelming a slow receiver.

This is basically about avoiding overloading a system through techniques such as back pressure and load shedding. In his words, there are three approaches:

[figure: ControlFlow_flowcontrol]

  1. Time-to-live (TTL): A limited time to live drops old messages from the queue to make room for new messages. This approach works well for messages whose value decreases over time, such as data streams (the CPU utilization from 2 weeks ago may no longer be relevant for an alert) or customer orders (customers may be surprised to have an order processed that they placed weeks ago).
  2. A Tail drop (see Wikipedia) does the opposite by dropping new messages that arrive. This may be appropriate if old messages are too valuable to be dropped or if senders have a feedback mechanism that allows them to realize that messages are being dropped, allowing them to retry later.
  3. Backpressure informs upstream systems that the queue isn't able to handle incoming messages so that those systems can reduce the arrival rate, for example by showing an error message to the user.

He also comments on some examples of back pressure and load shedding implemented by real systems. Besides that, he points out that Rate Limiting is a proactive way of implementing Traffic Shaping through back pressure. In his words:

In a way, you are exerting constant back pressure to limit the rate of message arrival

According to Gregor, Queues require flow control, and understanding flow control allows us to build robust systems that include queues.
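A minimal sketch of the first two policies from Gregor's list (TTL and tail drop), using an illustrative in-memory queue rather than a real broker:

# Two lossy flow-control policies on one bounded in-memory queue:
# - TTL: messages older than the time-to-live are dropped to make room for fresh ones.
# - Tail drop: new messages are refused while the queue is full.
import time
from collections import deque

class BoundedQueue:
    def __init__(self, capacity, ttl_seconds):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.items = deque()                 # (enqueued_at, message) pairs

    def _evict_expired(self):
        now = time.monotonic()
        while self.items and now - self.items[0][0] > self.ttl:
            self.items.popleft()             # TTL: old messages have lost their value

    def offer(self, message):
        self._evict_expired()
        if len(self.items) >= self.capacity:
            return False                     # tail drop: refuse the newly arriving message
        self.items.append((time.monotonic(), message))
        return True

    def poll(self):
        self._evict_expired()
        return self.items.popleft()[1] if self.items else None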

rponte commented Aug 12, 2024

rponte commented Aug 30, 2024

Simple examples on StackOverflow of how to apply back pressure to RabbitMQ producers.

rponte commented Aug 30, 2024

rponte commented Sep 17, 2024

Building Reliable Systems

Some good articles by Tyler Treat:

  • Designed to Fail
    • With monolithic systems, we care more about preventing failure from occurring. [...] With highly distributed microservice architectures where failure is all but guaranteed, we embrace it.

    • What does it mean to embrace failure? Anticipating failure is understanding the behavior when things go wrong, building systems to be resilient to it, and having a game plan for when it happens, either manual or automated.

    • The key to being highly available is learning to be partially available. Frequently, one of the requirements for partial availability is telling the client “no”.

    • [...] we made explicit decisions to drop messages on the floor if we detect the system is becoming overloaded. As queues become backed up, incoming messages are discarded, a statsd counter is incremented, and a backpressure notification is sent to the client. Upon receiving this notification, the client can respond accordingly by failing fast, exponentially backing off, or using some other flow-control strategy.

    • If an overloaded service is not essential to core business, we fail fast on calls to it to prevent availability or latency problems upstream. For example, a spam-detection service is not essential to an email system, so if it’s unavailable or overwhelmed, we can simply bypass it.

    • By isolating and controlling it, we can prevent failure from becoming widespread and unpredictable. By building in backpressure mechanisms and other types of intentional “failure” modes, we can ensure better availability and reliability for our systems through graceful degradation.

  • Take It to the Limit: Considerations for Building Reliable Systems
    • Anticipating failure is the first step to resilience zen, but the second is embracing it. Telling the client “no” and failing on purpose is better than failing in unpredictable or unexpected ways.

    • Backpressure is another critical resilience engineering pattern. Fundamentally, it’s about enforcing limits. This comes in the form of queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, etc. [...] Relying on unbounded queues and other implicit limits is like someone saying they know when to stop drinking because they eventually pass out.

    • When we choose not to put an upper bound on message sizes, we are making an implicit assumption. Put another way, you and everyone you interact with (likely unknowingly) enters an unspoken contract of which neither party can opt out. This is because any actor may send a message of arbitrary size. This means any downstream consumers of this message, either directly or indirectly, must also support arbitrarily large messages.

    • How can we test something that is arbitrary? We can’t. We have two options: either we make the limit explicit or we keep this implicit, arbitrarily binding contract.

    • Unbounded anything—whether it’s queues, message sizes, queries, or traffic—is a resilience engineering anti-pattern. Without explicit limits, things fail in unexpected and unpredictable ways.

    • [...] Remember, the limits exist, they’re just hidden. By making them explicit, we restrict the failure domain giving us more predictability, longer mean time between failures, and shorter mean time to recovery at the cost of more upfront work or slightly more complexity.

    • [...] By requiring developers to deal with these limitations directly, they will think through their APIs and business logic more thoroughly and design better interactions with respect to stability, scalability, and performance.

  • Slides: Push It to the Limit: Considerations for Building Reliable Systems - by Tyler Treat
  • Sometimes Kill -9 Isn’t Enough
    • If there’s one thing to know about distributed systems, it’s that they have to be designed with the expectation of failure.

    • If you have two different processes talking to each other, you have a distributed system, and it doesn’t matter if those processes are local or intergalactically displaced.

    • We need to be pessimists and design for failure, but injecting failure isn’t enough.

    • Simulating failure is a necessary element for building reliable distributed systems, but system behavior isn’t black and white, it’s a continuum.

    • We build our system in a vacuum and (hopefully) test it under failure, but we should also be observing it in this gray area. How does it perform with unreliable network connections? Low bandwidth? High latency? Dropped packets? Out-of-order packets? Duplicate packets? Not only do our systems need to be fault-tolerant, they need to be pressure-tolerant.

    • Simulating Pressure:
      • iptables -A INPUT -m statistic --mode random --probability 0.1 -j DROP    # randomly drop 10% of incoming packets
      • iptables -A OUTPUT -m statistic --mode random --probability 0.1 -j DROP   # randomly drop 10% of outgoing packets
      • tc qdisc add dev eth0 root netem delay 250ms loss 10% rate 1mbps          # add 250ms delay, 10% loss, and a 1mbps cap on eth0
      • tc qdisc add dev eth0 root netem delay 50ms 20ms                          # 50ms delay with +/-20ms variation
      • tc qdisc add dev eth0 root netem delay 50ms 20ms distribution normal      # same, with normally distributed jitter
      • tc qdisc add dev eth0 root netem reorder 0.02 duplicate 0.05 corrupt 0.01 # reorder, duplicate, and corrupt a small fraction of packets
    • Injecting failure is crucial to understanding systems and building confidence, but like good test coverage, it’s important to examine suboptimal-but-operating scenarios.

rponte commented Dec 31, 2024

  • ⭐️ I'm not feeling the async pressure
    • So why is back pressure all of a sudden a topic to discuss when we wrote thread-based software for years and it did not seem to come up? A combination of many factors, some of which just make it easy to shoot yourself in the foot.

##
# Service-side using semaphores to implement some backpressure (queueing) along with 
# the service's API exposing its actual state
##
from hypothetical_asyncio.sync import Semaphore, Service

semaphore = Semaphore(200)

class RequestHandlerService(Service):
    async def handle(self, request):
        await semaphore.acquire()
        try:
            return generate_response(request)
        finally:
            semaphore.release()

    @property
    def is_ready(self):
        return semaphore.tokens_available()

##
# Caller-side evaluating if the service is overloaded so that it can give up earlier instead of waiting and
# piling up calls infinitely
##
request_handler = RequestHandlerService()
if not request_handler.is_ready:
    response = Response(status_code=503)
else:
    response = await request_handler.handle(request)
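Since hypothetical_asyncio does not exist, below is a runnable approximation of the same idea using the standard asyncio API; RequestHandlerService, generate_response, and the 503 response are placeholders rather than parts of any particular framework:

# Runnable sketch: a semaphore bounds in-flight work (queueing as backpressure),
# and is_ready exposes the service's actual state so callers can shed load early.
import asyncio
from dataclasses import dataclass

@dataclass
class Response:
    status_code: int

async def generate_response(request):
    await asyncio.sleep(0.05)                 # simulated work
    return Response(status_code=200)

class RequestHandlerService:
    def __init__(self, limit=200):
        self._limit = limit
        self._in_flight = 0
        self._semaphore = asyncio.Semaphore(limit)

    @property
    def is_ready(self):
        return self._in_flight < self._limit

    async def handle(self, request):
        async with self._semaphore:           # callers queue up here
            self._in_flight += 1
            try:
                return await generate_response(request)
            finally:
                self._in_flight -= 1

async def serve(handler, request):
    if not handler.is_ready:                  # give up early instead of piling up calls
        return Response(status_code=503)
    return await handler.handle(request)

async def main():
    handler = RequestHandlerService(limit=2)
    responses = await asyncio.gather(*(serve(handler, i) for i in range(5)))
    print([r.status_code for r in responses])  # e.g. [200, 200, 503, 503, 503]

asyncio.run(main())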

rponte commented Jan 2, 2025

⭐️ Articles about handling overloaded systems

These are the best articles about handling overloaded systems and back-pressure mechanisms I have read. Fred Hebert wrote most of them or referenced many in his articles.

rponte commented Jan 3, 2025

rponte commented Jan 8, 2025

⭐️ Retry strategies and their impact on overloaded systems

⭐️⭐️ Good Retry, Bad Retry: An Incident Story

This article is gold! It shows how some retry techniques might overload a system through a DIDACTIC and well-written story. It covers techniques such as:

  1. Simple retry;
  2. Retry with backoff;
  3. Retry with backoff and jitter;
  4. Retry circuit breaker: The service client completely disables retries if the percentage of service errors exceeds a certain threshold (for example, 10%). As soon as the percentage of errors within an arbitrary minute drops below the threshold, retries are resumed. If the service experiences problems, it won’t receive any additional load from retries;
  5. Retry budget (or adaptive retry): Retries are always allowed, but within a budget, for example, no more than 10% of the number of successful requests. In case of service problems, it can receive no more than 10% of additional traffic;
  6. Retry + Circuit breaker (threshold=10%);
  7. Retry + Circuit breaker (threshold=50%);
  8. Retry + Deadline propagation.

Both options (Retry circuit breaker and Retry budget) guarantee that, in case of service problems, clients will add no more than n% of additional load to it.

[...] it’s necessary to differentiate between scenarios when the service is healthy and when it’s experiencing problems. If the service is healthy, it can be retried because errors might be transient. If the service is having issues, retries should be stopped or minimized.

The percentage of retries can be calculated locally without complicating the system with global statistics synchronization.

Ben conducted a simulation: for long-lived clients, local statistics behave identically to global ones, and exponential backoff doesn’t significantly impact amplification.

Based on these findings, Ben decided to propose a new postmortem action item: implementing a retry budget with a 10% limit, in addition to the existing exponential backoff. There’s no need for global statistics synchronization — a local token bucket should be enough.
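As an illustration of that action item, here is a minimal sketch of a local retry budget (a token bucket that earns roughly 10% of successful requests as retry tokens) combined with exponential backoff; the names and numbers are illustrative, not taken from the article:

# Local retry budget: successful requests earn a fraction of a retry token and a
# retry is only attempted while the bucket has tokens. No global coordination is
# needed; each client process keeps its own bucket.
import random
import time

class RetryBudget:
    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio               # 10% of successes become retry tokens
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def on_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # budget exhausted: fail instead of retrying

def call_with_retries(do_request, budget, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            result = do_request()
            budget.on_success()
            return result
        except Exception:
            if attempt + 1 == max_attempts or not budget.can_retry():
                raise
            # exponential backoff with full jitter before the next attempt
            time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))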

References worth reading

Annotations (translated from pt_BR)

This article is PERFECT, my goodness! 🤩🤩🤩
https://medium.com/yandex/good-retry-bad-retry-an-incident-story-648072d3cee6

The article is about how retries can overload your system and how to deal with it.

Summary of the summary:

We already know retries are dangerous. But how retry strategies can make system overload worse is where it gets interesting.

The article tests several retry strategies in a few scenarios through simulations. Since this is a summary of a long article, here we go...

The Retry+backoff+jitter strategy works very well for systems considered healthy, that is, systems facing a temporary overload or a partial and, above all, short unavailability that causes transient errors. However, it does not help much with long overloads (network partitions, application crashes, or high error rates), because it only postpones the load on the application, increasing its recovery time. Put directly, we can infer that if the overload lasts longer than the time the retrying clients are willing to wait, then the retries are only making things worse!

In contrast, adaptive retry (Retry Token Bucket) and Retry Circuit-Breaker (the breaker operates at the retry level, rather than as a complement to it) work for long overloads or outages, and also for short ones, although with a lower success rate for short overloads compared to backoff+jitter. In the case of a long overload, both strategies can reduce the load on the application A LOT, down to a small percentage of the original load, allowing the application to recover faster, which is exactly what you want during an outage.

Another point is that Retry+backoff+jitter works very well for mitigating (reducing or eliminating) system overload in more stable scenarios (usually a closed system), that is, scenarios with long-lived clients or a limited number of clients and/or serial/sequential execution of requests, such as background jobs polling the system or a queue. The Retry Token Bucket and Retry Circuit-Breaker strategies, on the other hand, are ideal for scenarios where there is no control over the number of clients (unbounded clients), for example at the edges of the system where you have no control over users or clients. Here, the important thing is to be aware that in this kind of scenario (usually an open system) there will always be new clients sending new requests (the "first try") regardless of whether other users (or threads) are already backing off in the meantime.

The author did a great job combining Marc Brooker's various resilience articles and using Marc's simulator to validate the hypotheses. It turned out simply AWESOME!

(I follow Marc's work, but I confess I had to reread his articles to refresh my memory and connect the dots better, and my goodness, it is so good!)

rponte commented Jan 8, 2025

  • ⭐️ Google SRE Book: Handling Overload
    • In a majority of cases (although certainly not in all), we've found that simply using CPU consumption as the signal for provisioning works well, for the following reasons:

      • In platforms with garbage collection, memory pressure naturally translates into increased CPU consumption.
      • In other platforms, it's possible to provision the remaining resources in such a way that they're very unlikely to run out before CPU runs out.
    • Our larger services tend to be deep stacks of systems, which may in turn have dependencies on each other. In this architecture, requests should only be retried at the layer immediately above the layer that is rejecting them. When we decide that a given request can't be served and shouldn't be retried, we use an "overloaded; don't retry" error and thus avoid a combinatorial retry explosion.

  • ⭐️ Google SRE Book: Addressing Cascading Failures
    • A cascading failure is a failure that grows over time as a result of positive feedback.

    • Limit retries per request. Don’t retry a given request indefinitely.

    • Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request. [...]

    • Think about the service holistically and decide if you really need to perform retries at a given level. In particular, avoid amplifying retries by issuing retries at multiple levels: [...]

    • Use clear response codes and consider how different failure modes should be handled. For example, separate retriable and nonretriable error conditions. Don’t retry permanent errors or malformed requests in a client, because neither will ever succeed. Return a specific status when overloaded so that clients and other layers back off and do not retry.

    • If handling a request is performed over multiple stages (e.g., there are a few callbacks and RPC calls), the server should check the deadline left at each stage before attempting to perform any more work on the request. For example, if a request is split into parsing, backend request, and processing stages, it may make sense to check that there is enough time left to handle the request before each stage.
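A minimal sketch of that last point, with made-up stage names and timeouts: carry the deadline along with the request and check the remaining time before each stage.

# Deadline propagation: check the remaining budget before each stage and give up
# early instead of doing work the client will no longer wait for.
import time

class DeadlineExceeded(Exception):
    pass

class Deadline:
    def __init__(self, timeout_seconds):
        self.expires_at = time.monotonic() + timeout_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

    def check(self, stage):
        if self.remaining() <= 0:
            raise DeadlineExceeded(f"no time left before stage: {stage}")

def parse(raw_request):
    return {"query": raw_request}          # placeholder parsing stage

def call_backend(request, timeout):
    time.sleep(min(0.1, max(timeout, 0)))  # simulated downstream call bounded by the deadline
    return {"data": request["query"]}

def process(backend_response):
    return backend_response["data"].upper()

def handle_request(raw_request, timeout_seconds=0.5):
    deadline = Deadline(timeout_seconds)

    deadline.check("parse")
    request = parse(raw_request)

    deadline.check("backend request")      # skip the RPC entirely if we are already late
    backend_response = call_backend(request, timeout=deadline.remaining())

    deadline.check("processing")
    return process(backend_response)

print(handle_request("hello"))             # HELLO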

rponte commented Jan 8, 2025

rponte commented Mar 21, 2025

rponte commented Mar 21, 2025

YouTube | ScyllaDB: Resilient Design Using Queue Theory: this talk discusses backpressure, load shedding, and how to optimize latency and throughput.

rponte commented Mar 21, 2025

rafaelpontezup commented

The #1 rule of scalable systems is to avoid congestion collapse - by @jamesacowling
https://x.com/jamesacowling/status/1934991944234770461

A good metaphor for congestion collapse is to imagine you're a barista at a coffee shop that just got popular. The cashier keeps taking orders and stacking them up higher and higher but you can't make coffees any faster. [...] - by @jamesacowling
https://x.com/jamesacowling/status/1935812480254787819
