from django.db.models import Count, Max

unique_fields = ['field_1', 'field_2']

# Group rows by the unique fields; any group with more than one row is a set of duplicates.
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

# For each duplicated combination, keep the row with the highest id and delete the rest.
for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['max_id'])
        .delete()
    )
Amazing script, thanks a lot!
https://docs.djangoproject.com/en/3.0/topics/db/aggregation/#order-of-annotate-and-values-clauses
@stasius12 this works because adding annotate over values does a group_by, check the mentioned link!
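To see the grouping for yourself, here is a minimal sketch (using a hypothetical Book model with a title field, not anything from the gist) that prints the SQL Django generates:

# Hypothetical Book model with a 'title' field, purely for illustration.
from django.db.models import Count

# Because values('title') comes before annotate(), Django turns the annotation
# into a per-group aggregate, i.e. it groups by 'title' instead of annotating each row.
grouped = Book.objects.values('title').annotate(n=Count('id')).order_by()
print(grouped.query)  # the generated SQL ends with a GROUP BY on the title column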
👍
Super, thanks a lot!
Amazing!!
Thanks 👍
@victorono Neat. I've hit a snag though. Perhaps another pair of eyes on this could help clarify?
>>> duplicate_books = Book.objects.values('title').annotate(title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicates:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()
However, I get:
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)
Helpful code snippet! 👍
When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated, so that you have table_size/2 items in the duplicates QuerySet after the first query, and then need to do the delete for each of those one by one.
It gave a really good starting point though. This is what I ended up with; it does it all in one query. Running time went from hours to minutes on a large table.
from django.db import connection


def remove_duplicates(model, unique_fields):
    # Delete, in a single query, every duplicate row except the one with the
    # highest id in each group of matching unique_fields values.
    # PostgreSQL-specific: relies on ARRAY_AGG, ARRAY_REMOVE and UNNEST.
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
        DELETE FROM {model._meta.db_table}
        WHERE id IN (
            SELECT
                UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
            FROM (
                SELECT
                    {fields},
                    MAX(t.id) AS max_id,
                    ARRAY_AGG(t.id) AS dupe_ids
                FROM
                    {model._meta.db_table} t
                GROUP BY
                    {fields}
                HAVING
                    COUNT(t.id) > 1
            ) a
        )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)


remove_duplicates(MyModel, ['field_1', 'field_2'])
Thanks a lot! Excellent approach for improving execution performance.
Great starting point. Tried to reduce the number of queries without using raw SQL.
from django.db.models import Count, Min, Q


def remove_duplicates_from_table(model, lookup_fields):
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )
    # OR together one Q(...) per duplicated combination of lookup_fields,
    # then delete everything that matches except the oldest row of each group.
    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    min_ids_list = duplicates.values_list('min_id', flat=True)
    if fields_lookup:
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()
Ended up using a Q object to avoid making a select query in each iteration while looping over the duplicates list.
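For reference, a minimal usage sketch, assuming the same placeholder MyModel and field names used earlier in the thread:

# Illustration only: with placeholder values, the combined lookup behaves like
#   Q(field_1='a', field_2='b') | Q(field_1='c', field_2='d')
# so a single filter(...) matches every duplicated combination at once.
remove_duplicates_from_table(MyModel, ['field_1', 'field_2'])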
Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
@ahmed-zubair-1998 thanks
Thanks for posting this. It helped me clean up a nagging duplicate issue. I made one change for my use: I switched to Min rather than Max for picking the dupe to exclude from deletion. This was a better fit for me since I may edit some corresponding fields before a dupe is added to the database, so I'd rather keep the original row. If anyone else is interested in this approach, swapping Max for Min and max_id for min_id does the trick.
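For anyone who wants that variant spelled out, here is the original snippet with Min swapped in (MyModel and the field names are the same placeholders as above), so the oldest row of each group is kept:

from django.db.models import Count, Min

unique_fields = ['field_1', 'field_2']

# Same approach as the original snippet, but keep the lowest id (the original row)
# and delete the newer duplicates.
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(min_id=Min('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['min_id'])
        .delete()
    )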