- cacheops watches Django make database queries, and caches the resultsets
- When a record is created or updated, cacheops invalidates all the cached resultsets that might have included that record
cacheops operates on the Django ORM's QuerySet objects. When a QuerySet is executed, cacheops analyzes its WHERE structure to build and cache a simplified list of the fields that were filtered on. After caching the resultset itself, it adds the resultset's cache key to a list of queries that filtered on those same fields with the same values.
When a record is saved, cacheops looks through these lists to find the queries that it thinks the record might have matched, and deletes all the resultsets that have been cached for all of those queries. (There's an article by cacheops' author here that explains the concepts behind this process.)
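A toy model of that scheme, with plain dicts standing in for Redis sets and keys (the key format and function names here are illustrative, not cacheops' actual implementation, and real cacheops enumerates subsets of the trackable fields rather than only full-record matches):

```python
# Toy model of cacheops-style invalidation. Dicts stand in for Redis;
# the "conj:" key format is illustrative, not cacheops' actual one.

cache = {}        # resultset cache: cache_key -> rows
conj_index = {}   # conjunction key -> set of cache keys to invalidate

def conj_key(table, filters):
    # A simplified "conjunction" of the fields a query filtered on.
    parts = "&".join(f"{f}={v}" for f, v in sorted(filters.items()))
    return f"conj:{table}:{parts}"

def cache_queryset(table, filters, cache_key, rows):
    cache[cache_key] = rows
    # Register this resultset under its filter-field conjunction,
    # so a matching save can find and delete it later.
    conj_index.setdefault(conj_key(table, filters), set()).add(cache_key)

def invalidate_record(table, record):
    # On save, find every conjunction the record satisfies and delete
    # all resultsets registered under it.
    for key, cache_keys in list(conj_index.items()):
        _, tbl, parts = key.split(":", 2)
        if tbl != table:
            continue
        filters = dict(p.split("=", 1) for p in parts.split("&"))
        if all(str(record.get(f)) == v for f, v in filters.items()):
            for ck in cache_keys:
                cache.pop(ck, None)
            del conj_index[key]
```

So a cached query like owner_id=42 gets registered under conj:dogs:owner_id=42, and saving any dog with owner_id=42 wipes every resultset registered there.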
Both these operations – updating the invalidation metadata, and looping over it to delete cached resultsets – are done by calling Lua scripts that Redis runs internally, like SQL stored procedures.
Rover's flexible filtering lets clients create any query they can imagine. But cacheops' idea of which queries "might" match a changed record is pretty crude – it ignores any filters on joined tables, any filters on TextFields, any filters with an IN list longer than 8 items, any case-insensitive filters, and others.
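Those rules amount to a predicate along these lines (an illustrative sketch, not cacheops' actual code; the function name and arguments are made up, but the 8-item IN limit is cacheops' real default):

```python
# Sketch of the kinds of filters cacheops declines to track.
# Untracked filters are simply dropped from the conjunction, which
# makes the conjunction match far more records than the query did.

def is_trackable(lookup, value, is_text_field=False, is_joined=False):
    if is_joined:          # filter on a joined table
        return False
    if is_text_field:      # TextField comparisons are skipped
        return False
    if lookup == "in" and len(value) > 8:   # long IN lists
        return False
    if lookup in ("iexact", "icontains", "istartswith", "iendswith"):
        return False       # case-insensitive matches
    return True
```

The consequence is the one described above: a query with only untrackable filters degrades to "any change to this table invalidates it".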
So when it goes looking for queries to invalidate, it finds lots that "match" the saved record. And those queries have lots of cached resultsets – thousands! – so the resulting Redis calls can take multiple seconds to enumerate and delete them all.
Which doesn't sound that bad – but Redis is single-threaded. While it's running the invalidation script, it can't handle any other requests. Which leads to the request-processing delays we're seeing.
Done, and it seems to be helping. But growth will catch up with us if we don't change how we use the cache.
This would let us scale out easily ahead of growing traffic. cacheops invalidation doesn't currently work in a sharded cluster – see this GitHub issue that has been open for years – but maybe we can shard by table as Dennis has suggested.
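Sharding by table could be as simple as routing every key for a table to one node by a stable hash of the table name – then each node still holds all the resultsets and invalidation metadata its Lua scripts need. A sketch (the node list and function name are assumptions, not an existing API):

```python
import zlib

# Hypothetical shard list – not real hosts.
NODES = ["redis://cache-0", "redis://cache-1", "redis://cache-2"]

def node_for_table(table, nodes=NODES):
    # Stable hash of the table name -> shard index. Resultsets and
    # invalidation metadata for a table always land on the same node,
    # so cacheops' Lua scripts still see all the data they need.
    return nodes[zlib.crc32(table.encode()) % len(nodes)]
```

The trade-off is that a hot table can't be split across nodes, but it sidesteps the cross-shard invalidation problem in the GitHub issue.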
Our fork is a year old, and there may be relevant fixes in upstream. Ideally, we'll find that our customizations are all available there now and we can just use upstream instead.
There are some issues with cacheops' query analysis that affect us – for example, it skips TextFields on the assumption that they're large and shouldn't be compared, which isn't true on Postgres; and it skips case-insensitive matches, which we (ugh) use by default. Fixing those would make invalidation faster, since it would delete fewer cached resultsets, and it would leave the valid resultsets in the cache, reducing cache misses.
Dennis has found some optimizations in the Lua scripts.
The Lua script is taking too long to enumerate and delete the invalid resultset caches – and while it runs, Redis can't handle other requests. We could do the enumeration and deletion in Python instead, which would be slower overall, but Redis could handle other work in between commands.
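Batched deletion from Python might look like this (UNLINK is a real Redis command that reclaims memory off the main thread, and redis-py exposes it as client.unlink(); the batch size and the stub client are assumptions for illustration):

```python
def delete_in_batches(client, keys, batch_size=500):
    # Delete keys in small batches so Redis can serve other requests
    # between round trips, instead of blocking for the whole
    # invalidation inside one Lua script. UNLINK (vs DEL) also pushes
    # the actual memory reclamation to a background thread.
    keys = list(keys)
    for i in range(0, len(keys), batch_size):
        client.unlink(*keys[i:i + batch_size])

# A stub standing in for redis.Redis, just to show the call pattern.
class StubRedis:
    def __init__(self):
        self.calls = []

    def unlink(self, *keys):
        self.calls.append(keys)
```

Each UNLINK round trip is a point where Redis can interleave other clients' commands, which is exactly what the single monolithic Lua script denies us.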
Once we're handling invalidation in Python, we could queue those tasks for asynchronous processing using RQ, which is already running to handle Bark tasks. I don't know if this would be fast enough to beat Edit UI save/load rendering though.