from django.db.models import Count, Max

unique_fields = ['field_1', 'field_2']

duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['max_id'])
        .delete()
    )
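Each entry in duplicates is a plain dict, which is what makes the filter/exclude in the loop work. A minimal illustration, with hypothetical values:

# Each item in `duplicates` is a dict like (values hypothetical):
#   {'field_1': 'foo', 'field_2': 'bar', 'max_id': 42, 'count_id': 3}
unique_fields = ['field_1', 'field_2']
duplicate = {'field_1': 'foo', 'field_2': 'bar', 'max_id': 42, 'count_id': 3}
filter_kwargs = {x: duplicate[x] for x in unique_fields}
# filter_kwargs == {'field_1': 'foo', 'field_2': 'bar'}
# filter(**filter_kwargs) selects the whole duplicate group, and
# exclude(id=duplicate['max_id']) keeps the newest row in that group.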
Super, thanks a lot!
Amazing!!
Thanks 👍
@victorono Neat. Got a challenge though. Perhaps an eye on this could help clarify?
>>> duplicate_books = Book.objects.values('title').annotate(title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicates:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()
However, I get:
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)
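For reference, the loop above mixes names from the gist with the Book example (the queryset defined is duplicate_books with only a title_count annotation, while the loop uses duplicates, unique_fields and max_id). A self-consistent version of that loop, keeping the highest id per title, would look roughly like this:

from django.db.models import Count, Max

# Sketch: names aligned with the Book/title example, adding a Max('id')
# annotation so max_id is actually available in each row.
duplicate_books = (
    Book.objects.values('title')
    .order_by()
    .annotate(max_id=Max('id'), title_count=Count('title'))
    .filter(title_count__gt=1)
)
for duplicate in duplicate_books:
    Book.objects.filter(title=duplicate['title']).exclude(id=duplicate['max_id']).delete()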
Helpful code snippet! 👍
When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated, so that you have table_size/2 items in the duplicates QuerySet after the first query, and then need to do the delete for each of those one by one.
It gave a really good starting point though. This is what I ended up with; it does it all in one query. Running time went from hours to minutes on a large table.
from django.db import connection


def remove_duplicates(model, unique_fields):
    # Deletes every row in a duplicate group except the one with the highest id.
    # ARRAY_AGG / ARRAY_REMOVE / UNNEST are PostgreSQL-specific.
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
        DELETE FROM {model._meta.db_table}
        WHERE id IN (
            SELECT
                UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
            FROM (
                SELECT
                    {fields},
                    MAX(t.id) AS max_id,
                    ARRAY_AGG(t.id) AS dupe_ids
                FROM
                    {model._meta.db_table} t
                GROUP BY
                    {fields}
                HAVING
                    COUNT(t.id) > 1
            ) a
        )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)


remove_duplicates(MyModel, ['field_1', 'field_2'])
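A rough dry-run counterpart of the same grouping query (a hypothetical helper, also PostgreSQL-flavoured, not part of the comment above) can report how many rows would be removed before committing to the DELETE:

from django.db import connection

def count_duplicate_rows(model, unique_fields):
    # Same GROUP BY ... HAVING as remove_duplicates above, but only counts
    # the rows that would be deleted (count per group minus the one kept).
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
        SELECT COALESCE(SUM(cnt - 1), 0)
        FROM (
            SELECT COUNT(t.id) AS cnt
            FROM {model._meta.db_table} t
            GROUP BY {fields}
            HAVING COUNT(t.id) > 1
        ) a
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)
        return cursor.fetchone()[0]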
Thanks a lot! Excellent optimization to improve execution performance.
Great starting point. Tried to reduce the number of queries without using raw SQL.
from django.db.models import Count, Min, Q


def remove_duplicates_from_table(model, lookup_fields):
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )
    # Build one OR-ed lookup covering every duplicate group.
    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    # Keep the oldest row (lowest id) of each group, delete the rest.
    min_ids_list = duplicates.values_list('min_id', flat=True)
    if fields_lookup:
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()
Ended up using a Q object to avoid making a select query in each iteration while looping over the duplicates list.
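With the imports in place, usage mirrors the raw-SQL helper above:

remove_duplicates_from_table(MyModel, ['field_1', 'field_2'])

Note that the OR-ed Q objects put every duplicate group into a single WHERE clause, so this trades query count for query size.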
Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
@ahmed-zubair-1998 thanks
👍