from django.db.models import Count, Max

unique_fields = ['field_1', 'field_2']

# Group rows by the unique fields; any group with more than one row is a set of duplicates.
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

# For each duplicated combination, keep the row with the highest id and delete the rest.
for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['max_id'])
        .delete()
    )
Amazing script, thanks a lot!
https://docs.djangoproject.com/en/3.0/topics/db/aggregation/#order-of-annotate-and-values-clauses
@stasius12 this works because adding annotate over values does a group_by, check the mentioned link!
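To see the grouping for yourself, here is a minimal sketch (using a hypothetical Book model with a title field, not anything from the gist) that prints the SQL Django generates:

# Hypothetical Book model with a 'title' field, purely for illustration.
from django.db.models import Count

# Because values('title') comes before annotate(), Django turns the annotation
# into a per-group aggregate, i.e. it groups by 'title' instead of annotating each row.
grouped = Book.objects.values('title').annotate(n=Count('id')).order_by()
print(grouped.query)  # the generated SQL ends with a GROUP BY on the title column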
👍
Super, thanks a lot!
Amazing!!
Thanks 👍
@victorono Neat. I've hit a snag though. Perhaps another pair of eyes on this could help clarify?
>>> duplicate_books = Book.objects.values('title').annotate(title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicates:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()
However, I get:
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)
Helpful code snippet! 👍
When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated, so that you have table_size/2 items in the duplicates QuerySet after the first query, and then need to do the delete for each of those one by one.
It gave a really good starting point though. This is what I ended up with; it does it all in one query. Running time went from hours to minutes on a large table.
from django.db import connection


def remove_duplicates(model, unique_fields):
    # Delete, in a single query, every duplicate row except the one with the
    # highest id in each group of matching unique_fields values.
    # PostgreSQL-specific: relies on ARRAY_AGG, ARRAY_REMOVE and UNNEST.
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
        DELETE FROM {model._meta.db_table}
        WHERE id IN (
            SELECT
                UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
            FROM (
                SELECT
                    {fields},
                    MAX(t.id) AS max_id,
                    ARRAY_AGG(t.id) AS dupe_ids
                FROM
                    {model._meta.db_table} t
                GROUP BY
                    {fields}
                HAVING
                    COUNT(t.id) > 1
            ) a
        )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)


remove_duplicates(MyModel, ['field_1', 'field_2'])
Thanks a lot! Excellent approach for improving execution performance.
Great starting point. Tried to reduce the number of queries without using raw SQL.
from django.db.models import Count, Min, Q


def remove_duplicates_from_table(model, lookup_fields):
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )
    # OR together one Q(...) per duplicated combination of lookup_fields,
    # then delete everything that matches except the oldest row of each group.
    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    min_ids_list = duplicates.values_list('min_id', flat=True)
    if fields_lookup:
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()
Ended up using a Q object to avoid making a select query in each iteration while looping over the duplicates list.
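For reference, a minimal usage sketch, assuming the same placeholder MyModel and field names used earlier in the thread:

# Illustration only: with placeholder values, the combined lookup behaves like
#   Q(field_1='a', field_2='b') | Q(field_1='c', field_2='d')
# so a single filter(...) matches every duplicated combination at once.
remove_duplicates_from_table(MyModel, ['field_1', 'field_2'])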
Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
@ahmed-zubair-1998 thanks
Thanks for posting this. It helped me clean up a nagging duplicate issue. I made one change for my use: I switched to Min rather than Max for picking the dupe to exclude from deletion. This was a better fit for me since I may edit some corresponding fields before a dupe is added to the database, so I'd rather keep the original row. If anyone else is interested in this approach, swapping Max for Min and max_id for min_id does the trick.
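For anyone who wants that variant spelled out, here is the original snippet with Min swapped in (MyModel and the field names are the same placeholders as above), so the oldest row of each group is kept:

from django.db.models import Count, Min

unique_fields = ['field_1', 'field_2']

# Same approach as the original snippet, but keep the lowest id (the original row)
# and delete the newer duplicates.
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(min_id=Min('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['min_id'])
        .delete()
    )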