This project aims to boost the performance, efficiency, and reliability of Django's database cache backend. The proposed improvements are:
- Using the ORM instead of raw SQL queries: Simplifies the implementation.
- Implementation of zlib Compression: Optimizes storage efficiency by reducing the size of cached data.
- Introduction of Atomic Increment/Decrement Operations: Ensures data integrity during simultaneous updates.
- Adoption of Probabilistic Culling: Effectively manages cache size and the eviction of older data.
- Transition from Text to Blob for Value Storage: Enhances space utilization and performance by storing cache values as blobs instead of text.
- Making the `set_many` Method Use a Single Query: Reduces database load and improves speed by making the `set_many` method execute a single query, or fewer queries.
- Utilization of Upsert Operations: Efficiently handles insert-or-update operations.
To use the ORM, we can create the model at runtime:
```python
...
from django.apps import apps


class BaseDatabaseCache(BaseCache):
    def __init__(self, table, params):
        super().__init__(params)
        self._table = table
        self.cache_model_class = self._get_cache_entry_model(self._table)

    def _get_cache_entry_model(self, table):
        try:
            model = apps.get_registered_model("django_cache", "cacheentry")
            if table == model._meta.db_table:
                return model
            del apps.all_models["django_cache"]["cacheentry"]
        except (LookupError, KeyError):
            pass

        class CacheEntry(models.Model):
            cache_key = models.CharField(max_length=255, unique=True, primary_key=True)
            value = models.TextField()
            expires = models.DateTimeField(db_index=True)

            class Meta:
                app_label = "django_cache"
                db_table = table

        return CacheEntry
...
```
Now, for example, for `get_many`, instead of executing raw SQL we can do:
```python
def get_many(self, keys, version=None):
    ...
    db = router.db_for_read(self.cache_model_class)
    connection = connections[db]
    rows = self.cache_model_class.objects.using(db).filter(
        cache_key__in=list(key_map)
    ).values_list('cache_key', 'value', 'expires')
    ...
```
Another example, for the `has_key` method:
```python
def has_key(self, key, version=None):
    key = self.make_and_validate_key(key, version=version)
    db = router.db_for_read(self.cache_model_class)
    connection = connections[db]
    now = tz_now().replace(microsecond=0)
    return self.cache_model_class.objects.using(db).filter(
        Q(cache_key=key) & Q(expires__gt=now)
    ).exists()
```
All tests pass for these two method changes.
Using zlib's `compress`/`decompress` methods improves performance considerably, especially as the data gets larger, and to some extent for small data as well.
```python
b64encoded = base64.b64encode(zlib.compress(pickled, level=6)).decode("latin1")
value = pickle.loads(zlib.decompress(base64.b64decode(value.encode())))
```
Benchmark:
Cache Alias | set benchmark | set_many benchmark | get benchmark | get_many benchmark | delete benchmark | incr benchmark |
---|---|---|---|---|---|---|
DatabaseCache | 3.520 | 507.169 | 0.214 | 4.701 | 0.687 | 5.362 |
w/ zlib | 2.442 | 363.081 | 0.138 | 2.092 | 0.311 | 2.283 |
Currently, a separate cull query is run on every `set` query. Changing this behaviour so that culling runs only occasionally improves performance, at the cost of culling expired values a little later.
```python
from random import random

_cull_probability = 0.01

if random() <= _cull_probability:
    self._cull(db, cursor, now, num)
```
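For reference, here is a minimal sketch of what the cull itself could look like once it goes through the ORM, assuming the dynamically created `CacheEntry` model above and mirroring the current backend's strategy of deleting everything that sorts before the key at offset `num // cull_frequency` (the probabilistic check above decides when this runs):

```python
def _cull(self, db, now):
    if self._cull_frequency == 0:
        self.clear()
        return
    qs = self.cache_model_class.objects.using(db)
    # Drop expired entries first.
    qs.filter(expires__lt=now).delete()
    num = qs.count()
    if num > self._max_entries:
        cull_num = num // self._cull_frequency
        # Fetch the cache_key at offset cull_num and delete everything that
        # sorts before it, as the raw-SQL implementation does today.
        keys = list(
            qs.order_by("cache_key").values_list("cache_key", flat=True)[
                cull_num : cull_num + 1
            ]
        )
        if keys:
            qs.filter(cache_key__lt=keys[0]).delete()
```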
For upserts, the plan is to use `update_or_create` and/or `bulk_create` with `update_conflicts=True`.
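As an illustration, a minimal sketch of how the single-key `set` path could use `update_or_create`; the `exp`/`b64encoded` preparation is elided here because it is identical to the `set_many` sketch below:

```python
def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
    key = self.make_and_validate_key(key, version=version)
    db = router.db_for_write(self.cache_model_class)
    # exp and b64encoded are prepared exactly as in set_many() below.
    exp = ...
    b64encoded = ...
    self.cache_model_class.objects.using(db).update_or_create(
        cache_key=key,
        defaults={"value": b64encoded, "expires": exp},
    )
```

Note that `update_or_create` is not a single-statement upsert on every backend (it falls back to a `get` plus `create`/update inside a transaction), which is why `bulk_create` with `update_conflicts=True` is also being considered.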
Replace `TextField` with `BinaryField` and remove the base64 encoding/decoding (backwards compatibility needed). A rough sketch follows.
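This sketch shows the field change and the resulting write/read paths, assuming the dynamically created model above; `obj_to_cache`/`stored_value` are illustrative names, and the backwards-compatibility handling for existing text rows is omitted:

```python
# In the dynamically created CacheEntry model:
#     value = models.BinaryField()   # was: models.TextField()

# Writing (set / set_many): store the zlib-compressed pickle bytes directly,
# with no base64 round-trip.
blob = zlib.compress(pickle.dumps(obj_to_cache, self.pickle_protocol), level=6)

# Reading (get / get_many): some backends return memoryview for BinaryField,
# so normalize to bytes before decompressing.
obj_from_cache = pickle.loads(zlib.decompress(bytes(stored_value)))
```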
It looks like `set_many` needs to be overridden and adapted accordingly:
```python
from datetime import datetime, timezone


def set_many(self, data, timeout=DEFAULT_TIMEOUT, version=None, chunk_size=100):
    db = router.db_for_write(self.cache_model_class)
    connection = connections[db]
    with transaction.atomic(using=db):
        objs = []
        timeout = self.get_backend_timeout(timeout)
        for key, value in data.items():
            key = self.make_and_validate_key(key, version=version)
            pickled = pickle.dumps(value, self.pickle_protocol)
            b64encoded = base64.b64encode(pickled).decode("latin1")
            if timeout is None:
                exp = datetime.max
                if settings.USE_TZ:
                    exp = exp.replace(tzinfo=timezone.utc)
            else:
                tz = timezone.utc if settings.USE_TZ else None
                exp = datetime.fromtimestamp(timeout, tz=tz)
            exp = exp.replace(microsecond=0)
            obj = self.cache_model_class(cache_key=key, value=b64encoded, expires=exp)
            objs.append(obj)
            if len(objs) == chunk_size:
                self.cache_model_class.objects.using(db).bulk_create(
                    objs,
                    update_conflicts=True,
                    update_fields=("value", "expires"),
                    unique_fields=("cache_key",),
                )
                objs = []
        if objs:
            self.cache_model_class.objects.using(db).bulk_create(
                objs,
                update_conflicts=True,
                update_fields=("value", "expires"),
                unique_fields=("cache_key",),
            )
        return []
```
All tests pass. Caveats:
- `bulk_create`'s `update_conflicts=True` is not supported on Oracle (see docs).
- Memory vs. time trade-off.
Benchmark:
Cache Alias | set benchmark | set_many benchmark | get benchmark | get_many benchmark | delete benchmark | incr benchmark |
---|---|---|---|---|---|---|
DatabaseCache | 3.520 | 507.169 | 0.214 | 4.701 | 0.687 | 5.362 |
w/ set_many | 9.008 | 75.977 | 0.274 | 7.274 | 0.741 | 5.250 |
The idea is to override the `incr` and `decr` methods from `BaseCache` in `DatabaseCache` and perform the increment atomically, as in the sketch below.
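A minimal sketch of an atomic `incr`, assuming the ORM-based model above; `_decode`/`_encode` stand in for the pickle/base64 helpers and are illustrative names, and `select_for_update()` behaviour varies by backend (it is a no-op on SQLite, for example):

```python
def incr(self, key, delta=1, version=None):
    key = self.make_and_validate_key(key, version=version)
    db = router.db_for_write(self.cache_model_class)
    with transaction.atomic(using=db):
        try:
            # Lock the row so concurrent increments serialize instead of racing.
            obj = (
                self.cache_model_class.objects.using(db)
                .select_for_update()
                .get(cache_key=key)
            )
        except self.cache_model_class.DoesNotExist:
            raise ValueError("Key '%s' not found." % key)
        new_value = self._decode(obj.value) + delta  # illustrative decode helper
        obj.value = self._encode(new_value)          # illustrative encode helper
        obj.save(update_fields=["value"])
    return new_value
```

`decr` can then delegate to `incr` with a negated delta, which is what `BaseCache.decr` already does.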
- Understand the codebase in more depth.
- Interact with the community and my mentors.
- Refine the project plan and timeline with the help of mentors.
- Task: Transition from SQL queries to ORM.
- Deliverable: Database operations in the cache backend use Django ORM.
- Task: Implement zlib compression.
- Deliverable: Cached data is compressed using zlib before being stored.
- Task: Implement atomic increment/decrement operations.
- Deliverable: Atomic increment/decrement operations are supported in the cache backend.
- Task: Implement probabilistic culling.
- Deliverable: Probabilistic culling is used to manage cache size but not on every set operation.
- Task: Transition from text to blob for value storage.
- Deliverable: Cache values are stored as blobs instead of text.
- Task: Make the `set_many` method use a single query.
- Deliverable: The `set_many` operation uses a single query or fewer queries.
- Task: Implement upsert operations.
- Deliverable: Upsert operations are supported in the cache backend.
- Buffer time to finish any pending work, write tests, improve documentation, and clean up the code.
- Submit the code for final evaluation.
If time permits, or after GSoC, I will look into hooking the database cache tables into migrations.
- https://adamj.eu/tech/2015/05/17/building-a-better-databasecache-for-django-on-mysql/
- https://django-mysql.readthedocs.io/en/latest/cache.html
- https://github.com/adamchainz/django-mysql/blob/80de65fa4eb1801f7ff721f40a419ca63d2f338d/src/django_mysql/cache.py
- https://github.com/adamchainz/django-mysql-benchmark-cache
- https://code.djangoproject.com/wiki/DynamicModels
My name is Hisham Mahmood and I am a first-year computer science student at the National University of Sciences and Technology, Islamabad. I started using Python in 2019 and have been using Django since then.
- #31405 : LoginRequiredAuthenticationMiddleware to force all views to require authentication by default.
- #33517 : Extracting seconds also returns fractional seconds on PostgreSQL and Oracle.
- #34044 : Admin app search filter appears in nav sidebar but not on admin homepage.
- #34910 : Color contrast admin plus icon.
- #35099 : Combining QuerySets with "|" or "&" mutates right-hand side.
- #35173 : DisallowedModelAdminLookup raised when filtering with __isnull on a ForeignKey not listed in list_filters.
- PR-17844 : Fixed #35173 -- Reallowed filtering with __isnull on foreign keys not listed in list_filters.
- PR-17829 : Fixed #35099 -- Prevented mutating queryset when combining with & and | operators.
- PR-17792 : Refs #31405 -- Added LoginRequiredAuthenticationMiddleware to force all views to require authentication by default.
- PR-17769 : Refs #33517 -- Prevented __second lookup from returning fractional seconds on Oracle.
- PR-17744 : Fixed #34910 -- Improved color contrast for add/change icons in admin.
- PR-17709 : #34044 -- Add the admin app filter on index pages in addition to the nav sidebar
- Primary Email: [email protected]
- Secondary Email: [email protected]
- LinkedIn: Hisham Mahmood