
@iconara
Created September 4, 2012 06:26
Load Testing RabbitMQ 2.8.

At Burt we're heavy into RabbitMQ, we use it as a series of tubes that transports almost all of the wonderful data around our platform. It's served us well, apart from a tiny, tiny detail: give it too much of the good stuff and it sort of gets drunk and won't have any more. Legend has it that version 2.8 solves that problem, and as far as we can tell it's almost all true. Read on for pretty graphs.


Background

Our whole platform runs off of Amazon EC2, and for this test set we set up RabbitMQ clusters of 1, 2 and 3 c1.xlarge instances.

Each cluster gets 24 durable queues. We basically use RabbitMQ for transport, and to get maximum throughput you want many queues, since each queue is essentially single-threaded.

We loaded the queues by parsing about 2.3 gigs worth of messages from old production logs. The messages were combined and sent in batches of ten (another good technique for getting higher throughput), and routed explicitly to one of the 24 queues based on a property of the messages. The messages were marked as persistent. The queues were drained with a dummy application that connects to all the queues and reads data as fast as possible, prefetching 100 message batches and ack'ing each one individually.
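The batching and routing logic above can be sketched as two pure functions. This is an illustration, not Burt's actual loader: the queue naming scheme, the `shard_key` property, and the CRC32 hash are all assumptions; the point is just that the same message property always lands on the same queue, and messages go out in fixed-size batches.

```python
import zlib

NUM_QUEUES = 24
BATCH_SIZE = 10

def route(shard_key):
    """Deterministically map a message property to one of the 24 queues.
    CRC32 is used here only because it is stable across processes."""
    return "queue-%02d" % (zlib.crc32(shard_key.encode()) % NUM_QUEUES)

def batches(messages, size=BATCH_SIZE):
    """Group messages into fixed-size batches before publishing;
    the last batch may be short."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]
```

A loader would then publish each batch with the persistent flag set, using `route(...)` as the routing key.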

The loader process is smart and connects directly to the MQ instance that hosts each queue; the drainer does the same. This avoids unnecessary network traffic between the nodes in the cluster (*).

All graphs were created by measuring the messages per second at 5 second intervals (where each message is a batch of ten actual messages).
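The measurement itself is straightforward: count batches, and every five seconds emit a rate sample. A small sketch of that bookkeeping (the class name and the injectable clock are our own invention, added so the logic is testable):

```python
import time

class ThroughputMeter:
    """Count recorded batches and emit a batches/sec sample once per
    fixed window (5 seconds in the test; each batch = 10 real messages)."""

    def __init__(self, window=5.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.count = 0
        self.start = clock()
        self.samples = []

    def record(self, n=1):
        self.count += n
        now = self.clock()
        if now - self.start >= self.window:
            # Rate over the actual elapsed time, not the nominal window.
            self.samples.append(self.count / (now - self.start))
            self.count = 0
            self.start = now
```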

Below are the tests we performed and their outcome.

Loader/Drainer max, single node, no clustering

Running 1 loader and 5 drainers we arrived at 25k fragments per second as the average drainer speed, pretty consistently.

Running 4 loaders and 1 drainer we arrived at just above 30k fragments per second as the average loader speed. Spiky as hell, but with consistent throughput. This kind of shenanigans used to kill RabbitMQs older than 2.8. Good job! Flow control FTW!

1 loader and 5 drainers

4 loaders and 1 drainer

Clustering, 1, 2 and 3 instances

We spread out the queues evenly across instances. That is to say the 3-instance cluster ran 8 queues per instance.

The result? No difference. No god damn difference whatsoever. As far as speed is concerned, the single MQ instance did juuust fine. And it didn't ripple as much.

1MQ, 6 loaders and 5 drainers

2MQ, 6 loaders and 5 drainers

3MQ, 6 loaders and 5 drainers

This doesn't feel like a very reasonable conclusion, and I do suppose that if we'd started more loaders/drainers we'd see the benefits of the cluster, but damn it I can't be starting and stopping servers all day long. With the awesome infrastructure tools by @mwq it literally takes MINUTES to do, but we ain't got that kind of time. We're busy people. Got code to hack and coffee to drink. That StarCraft II ladder ain't gonna climb itself!

HA

Ah, high availability. Each message on every queue is replicated to another instance. One instance goes down, you can reconnect to another instance and resume operation. Which is a seriously ballsy thing to do. We don't usually do that. We stop the system and restart the failing instance. Not that we wouldn't like to have magical failover awesomeness, but HA in RabbitMQ is expensive, like bricks of gold-pressed latinum expensive. It hasn't proven a huge issue for us -- RabbitMQ has been rock solid since 2.6, so we keep our fingers crossed.
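For reference, in the 2.x line mirroring is switched on per queue at declare time via the `x-ha-policy` queue argument (cluster-wide policies via `rabbitmqctl set_policy` came later, in 3.0). A minimal sketch of the declare arguments; the queue name and the commented pika call are illustrative, and an actual declare of course needs a running broker:

```python
def ha_queue_args(policy="all"):
    """Queue arguments that enable mirroring in RabbitMQ 2.x.
    "all" mirrors the queue to every node in the cluster -- the most
    expensive option, which is where the latinum goes."""
    return {"x-ha-policy": policy}

# With a pika channel and a live broker, a mirrored durable queue
# would be declared roughly like this (illustrative, not executed here):
#   ch.queue_declare(queue="queue-00", durable=True,
#                    arguments=ha_queue_args())
```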

The performance is obviously impacted.

2MQ in HA, 6 loaders and 5 drainers

Conclusion

We're going with the cluster. Probably 2 nodes with durable queues.

Listen, it's not all about throughput. If the real-life drainers die we like to know that we can stockpile messages for a couple of minutes until everything restarts. A single server only has so much memory and will get overloaded more easily.

We won't be using much of the clustering capabilities though, as we create queues and connect to the instances individually. The only thing we use is the neat GUI, so that we can track and monitor the individual instances.

If you like comparing stuff, have a look at the old load tests we performed -- lots of graphs there. The original test and the supplements as requested by the guys on the RabbitMQ mailing list.

Leftovers

  • Machines used were Amazon's c1.xlarge: 7 GB RAM, 8 cores.
  • The trailing off at the end of the graphs is usually one of the loaders finishing earlier than the rest.
  • The dips in the 1 loader/5 drainers graph are probably our loader's GCs, so don't read too much into them.

(*) If you're connected to machine A and the queue you're posting to is on machine B, RabbitMQ will forward your message for you. The problem is that this isn't free: the message needs to go back out on the network and land on machine B, taking time and bandwidth on your network card. This used to be a big problem for us in 2.6. We solved it by running a routing-aware wrapper over the driver, check it out: github.com/burtcorp/autobahn
