Django's select_related results in an SQL JOIN. It's very useful when fetching a one-to-one relation:
from django.db.models import Model, OneToOneField, CASCADE

class User(Model):
    ...

class UserProfile(Model):
    user = OneToOneField(User, on_delete=CASCADE, related_name="profile")
    ...

User.objects.select_related("profile").get(pk=1)  # ✅DO!
It often results in massive data over-fetching when fetching a many-to-one relation:
from django.db.models import Model, ForeignKey, CASCADE

class Album(Model):
    ...

class Track(Model):
    album = ForeignKey(Album, on_delete=CASCADE, related_name="tracks")
    ...

x = Track.objects.select_related("album").all()  # ❌DON'T!
If an album has an average of 20 tracks, we're asking the database to fetch and serialize each album's data 20 times on average. With a few JOINs it's easy to end up with result tables that span gigabytes of data, and once those no longer fit in memory, the result set starts being assembled on disk. DevOps don't want you to force the database to write to disk, trust me on this one.
The correct course of action is to use prefetch_related and let Django fetch the related records in a separate query:
x = Track.objects.prefetch_related("album").all() # ✅DO!
While intuitively it may seem slower to do this with two queries rather than one, it's usually faster when the database is working under stress. The cost of the extra round trip to the DB server is negligible, even on a managed service like RDS.
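You can check what each approach actually sends to the database. A minimal sketch, assuming the Album/Track models above are importable (the module path is made up) and that CaptureQueriesContext, a Django test utility, is acceptable in your setting:

from django.db import connection
from django.test.utils import CaptureQueriesContext

from myapp.models import Track  # hypothetical import path

# select_related: a single query that JOINs the album table,
# repeating every album column once per track
with CaptureQueriesContext(connection) as ctx:
    list(Track.objects.select_related("album"))
print(len(ctx.captured_queries))  # -> 1

# prefetch_related: two small queries and no JOIN;
# albums are fetched once each with WHERE id IN (...)
with CaptureQueriesContext(connection) as ctx:
    list(Track.objects.prefetch_related("album"))
print(len(ctx.captured_queries))  # -> 2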
While it may seem convenient to do all optimizations in the methods that filter data (often at the model or manager level), it's much easier to verify that all relations are correctly prefetched (and that no unwanted ones are present!) if the list of prefetched/joined relations lives close to the point where your application uses the data. The harder it is to tell which prefetches/joins you're using, the more prevalent n+1 bugs become, where a single change in logic results in extra queries being executed for each row of the result set.
def active_products():
    return Product.objects.filter(
        active=True
    ).prefetch_related("category", "stock", "seller")

...  # seven seas and seven mountains

def list_products_in_category(request, category_id):
    products = active_products().filter(
        category_id=category_id
    )
# ❌DON'T!
def active_products():
    return Product.objects.filter(
        active=True
    )

...  # seven seas and seven mountains

def list_products_in_category(request, category_id):
    products = active_products().filter(
        category_id=category_id
    ).prefetch_related("seller")  # we don't even need most of them
# ✅DO!
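One way to guard against the n+1 bugs mentioned above is to pin the query count in a test. A minimal sketch, assuming a hypothetical URL for the view above; the exact number of queries depends on your view and middleware:

from django.test import TestCase

class ProductListQueryCountTest(TestCase):
    # hypothetical route for list_products_in_category
    url = "/categories/1/products/"

    def test_no_n_plus_one_queries(self):
        # 1 query for the filtered products + 1 for the prefetched sellers;
        # the assertion fails loudly if a later change adds per-row queries
        with self.assertNumQueries(2):
            self.client.get(self.url)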
Filter queries are cheap unless INNER JOINs or DISTINCT results are involved. In those cases the database has to fetch a huge number of joined rows (in the worst case the Cartesian product of all joined tables), sort them and then deduplicate them before returning the results.
As mentioned before, this gets very slow if the results are forced to be flushed to disk because of their size (or because your database server is executing multiple queries in parallel—congratulations, your business is popular).
authors = Author.objects.filter(book__category__name__iexact="foo")
# ❌slow!
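The join also multiplies rows: an author with three matching books comes back three times, so the next instinct is to bolt on distinct(), which is exactly the sort-and-deduplicate work described above. A sketch of the pattern to avoid, with the same assumed models:

# each matching (book, category) pair produces its own row, so authors repeat...
authors = Author.objects.filter(book__category__name__iexact="foo")
# ...and deduplicating them makes the database sort the whole joined result
authors = authors.distinct()  # ❌still slow!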
These queries are much faster if the database knows you're only interested in rows from a single table and are using the others merely to narrow down the results. Using an EXISTS subquery for this purpose means the database can stop looking for matches as soon as it finds one.
from django.db.models import Exists, OuterRef

categories = Category.objects.filter(
    name__iexact="foo"
).values("id")  # avoid overfetching in the subquery

books = Book.objects.annotate(
    matching=Exists(categories.filter(id=OuterRef("category_id")))
).filter(
    matching=True
).values("id")  # avoid overfetching in the subquery

authors = Author.objects.annotate(
    matching=Exists(books.filter(author_id=OuterRef("id")))
).filter(
    matching=True
)
# ✅FAST!
# but also
# ❌quite ugly!
Unfortunately, Django (before 3.0) does not allow using Exists expressions in filter() directly. This means it's not easy to make the code both fast and readable, so use this technique with care or write helper queryset methods to take care of the boilerplate.
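One possible shape for such a helper, assuming the same Author/Book/Category models as above (the queryset class and method name are made up for illustration):

from django.db.models import Exists, OuterRef, QuerySet

class AuthorQuerySet(QuerySet):
    def with_book_in(self, category_qs):
        # hide the Exists/OuterRef boilerplate behind one chainable method
        books = Book.objects.filter(
            author_id=OuterRef("id"),
            category_id__in=category_qs.values("id"),
        ).values("id")
        return self.annotate(matching=Exists(books)).filter(matching=True)

# on the model: objects = AuthorQuerySet.as_manager()
authors = Author.objects.with_book_in(Category.objects.filter(name__iexact="foo"))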
Django 3.0 greatly simplifies this last section: Exists expressions can be used directly in filter().
Link to Docs
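A sketch of the Django 3.0+ form, with the same assumed models:

from django.db.models import Exists, OuterRef

# Django 3.0+: boolean expressions are allowed directly in filter()
matching_books = Book.objects.filter(
    author_id=OuterRef("id"),
    category__name__iexact="foo",
)
authors = Author.objects.filter(Exists(matching_books))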