I can make a case that queues only do what people want if you don’t consider failure cases.
Qs are empty (normal) or full (fail). Normally things are processed quickly. In the failure case, processing time is unbounded (or “too long”).
Solution is always “dump the Q”. Which means you do care about how long it takes to process items. So you want the queue to always be empty.
Which means you only want a non-failing Q.
So why not admit it, use in-proc buffers, run enough servers to handle load? Reject work up front instead of dropping oldest items w/ “flush the queues!”
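A minimal sketch of that "reject work up front" approach, using Python's standard `queue` module; the capacity and the response codes are just illustrative:

```python
import queue

# A bounded in-proc buffer: when it's full, reject the request
# immediately instead of letting an unbounded backlog build up.
work = queue.Queue(maxsize=1000)  # capacity is an illustrative number

def handle_request(item):
    try:
        work.put_nowait(item)  # enqueue without blocking
    except queue.Full:
        return "503 Service Unavailable"  # shed load at the front door
    return "202 Accepted"
```

The caller finds out immediately that the system is saturated, instead of learning it fifteen minutes later when someone flushes the queue.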
it seems that draining the queue is the antipattern. in the case where it saved us today, we were able to fix a problem while the queue filled up. deploy code. watch the queue empty.
whereas the alternative would have been to drop all requests on the floor.
queues were helpful today.
i think making the queues explicit lets you program in a failure policy too.
for example, if the queue exists on a server, instead of just implicitly as a process buffer, then you can tell the server "drop anything older than 15 minutes" or "warn us if things start backing up" (so you can add capacity).
normally all queues should always be empty, but if you have no fallback for what happens when they're not, then your failure-recovery policy is "explode". better to have some conscious policy.
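One rough sketch of what such an explicit policy could look like; the 15-minute cutoff comes from the comment above, and the depth threshold is made up:

```python
import time
from collections import deque

MAX_AGE_SECONDS = 15 * 60  # "drop anything older than 15 minutes"
WARN_DEPTH = 10_000        # "warn us if things start backing up"

q = deque()  # each entry is (enqueue_time, item)

def enqueue(item):
    q.append((time.time(), item))
    if len(q) > WARN_DEPTH:
        print(f"WARN: queue depth {len(q)}; consider adding capacity")

def dequeue():
    while q:
        enqueued_at, item = q.popleft()
        if time.time() - enqueued_at <= MAX_AGE_SECONDS:
            return item
        # stale item: the conscious policy is to discard it here
    return None
```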
Nobody ever seems to have a conscious policy, though. And in many(?) cases, if you're using external queues, your level of writes is such that in a failure case you go from "OK" to "meh" to "WTF!?" incredibly quickly. Or you go back and forth from "OK" to "meh" all the time & boy-who-cried-wolf syndrome sets in.
Even an unconscious policy can lengthen the time you have to recover from catastrophe. Computers are really good at failing fast, and with zero buffering you have to fix your problem at computer time scales, which you're not going to do. Even giving yourself 60 minutes' worth of space to fix an issue brings the problem back to human scale, so at least you have a fighting chance.
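The back-of-envelope math for that hour of space is cheap to do (the numbers below are purely hypothetical, just to show the shape of it):

```python
# How much buffer does "60 minutes to react" actually cost?
writes_per_second = 2_000   # hypothetical steady-state load
avg_message_bytes = 1_024   # hypothetical message size

buffer_bytes = writes_per_second * avg_message_bytes * 60 * 60
print(f"{buffer_bytes / 2**30:.1f} GiB buys one hour to react")
# -> 6.9 GiB: a small price for moving failure to human time scales
```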
That said, managing backpressure in a distributed system is hard and the odds of getting it right without thinking about it are pretty poor.
Don't forget that durable queues can serve an important role as the commit logs in a distributed application architecture. If an action needs to persist some state in more than one data store, and the individual writes can be idempotent, then once the write hits the queue, you can return success to the client. Eventually job replay will recover from all transient errors.
With no durable queue, your application state is never coherent unless you can rest assured that 100% of your clients retry indefinitely in the face of errors.
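A toy illustration of that replay property, with plain dicts standing in for the data stores and a list standing in for the durable queue (all of this is hypothetical scaffolding, not any particular queue's API):

```python
import time

class TransientError(Exception):
    """A failure that replay is expected to eventually recover from."""

def apply_job(job, stores):
    # Each write is idempotent (same key, same value), so replaying a
    # partially-applied job never corrupts state; it just re-writes.
    for store in stores:
        store[job["key"]] = job["value"]

def worker_loop(durable_queue, stores):
    while durable_queue:
        job = durable_queue[0]       # peek; only remove after success
        try:
            apply_job(job, stores)
            durable_queue.pop(0)     # "ack": the job is fully applied
        except TransientError:
            time.sleep(1)            # transient failure: replay later
```

Once the job is durably enqueued, the client can be told "success"; the worker loop will converge every store eventually, no matter how many times it fails and replays in between.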
This is pretty simplistic; it's silly to say outright that queues are bad (and just as silly to say that they are good).
Without even arguing about durability or failure, queues can serve many purposes.
For example, they form a natural load balancer: if you have a large number of low-velocity producers (which cannot use a least-loaded balancer since they don't have enough load), inverting the relationship between producer and consumer effectively gives you load balancing.
Queues allow you to write truly stateless consumers. This can be used to simplify architecture.
Queues can serve as buffers to smooth out spikes in load.
...
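A quick sketch of the first two points, with threads and an in-proc queue standing in for real producers and consumers (names and counts are arbitrary):

```python
import queue
import threading

work = queue.Queue()

def producer(producer_id, n_items):
    # Many low-velocity producers just enqueue; none of them needs to
    # know which consumer is least loaded.
    for i in range(n_items):
        work.put((producer_id, i))

def consumer(consumer_id):
    # Stateless consumer: pull, process, repeat. Idle consumers pull
    # more often, so load balances itself across the pool.
    while True:
        producer_id, i = work.get()
        print(f"consumer {consumer_id}: item {i} from producer {producer_id}")
        work.task_done()

for c in range(3):
    threading.Thread(target=consumer, args=(c,), daemon=True).start()
for p in range(20):
    producer(p, 5)
work.join()  # returns once every item has been processed
```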
Anyway, my broad point is: it's really silly to say that it's outright good or bad. Queues have their use.
My broader point is: it's really silly to say that you can't come up with some rules of thumb for the use of any particular technique. Every technology has its use; it's just not useful for every case.
Rules of thumb are a poor substitute for understanding your system and the trade-offs you are making. Rules of thumb are important, no doubt, when you are dealing with very complex systems that cannot be comprehended easily (your superscalar processor and its memory hierarchy!), but I don't think this holds for design questions involving distributed systems and queues.
A queue being full doesn't always mean the items have to be lost.
It depends on the lifetime requirements of the queue elements and how much time it takes at steady state to clear a queue. Also, when you have the ability to spin up new instances quickly (or repurpose old instances), the failure domain for queues shrinks. But if item freshness is the top, or only, priority and you can't bring new queue workers to bear on the task, then I think you're right.
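To make "bring new queue workers to bear" concrete, a tiny sizing calculation; the throughput and deadline numbers are invented:

```python
def extra_workers_needed(queue_depth, current_workers,
                         items_per_worker_per_min=600,
                         freshness_deadline_min=15):
    # Workers required to drain the backlog before items go stale,
    # rounded up, minus what we already have running.
    per_worker = items_per_worker_per_min * freshness_deadline_min
    needed = -(-queue_depth // per_worker)  # ceiling division
    return max(needed - current_workers, 0)

print(extra_workers_needed(queue_depth=90_000, current_workers=5))
# -> 5: ten workers total clear 90k items inside the 15-minute window
```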