Performance improvements for `django-import-export` bulk import

I implemented this for my own use case. It is not fully tested, so use at your own risk.

This is how I improved the performance of django-import-export when importing a large set of new rows.

  • Thinkpad T470, i5 processor (Ubuntu 18.04)
  • 20,000 new rows to be inserted
  • Total import duration: 5.4 seconds

Main improvements

  1. Use bulk_create()
  2. Run with use_transactions=False (see the usage example after the main code below)
  3. Override get_or_init_instance() so that instance lookups are skipped (not needed for new rows)
  4. Remove all diffing code from import_row() if you don't need diffs (cuts ~30% of total processing time); newer versions of django-import-export expose a skip_diff Meta option for this
  5. Ensure that any FK field lookups don't make repeated db calls - use a caching widget (see the sketch after the main example below)
  6. Postgres users have the option of bypassing the ORM and performing direct inserts (sketched just below)
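
For point 6, a minimal sketch of a direct insert using COPY via psycopg2's copy_expert(), reached through Django's raw db cursor. The function name, table and column names are illustrative - adjust to your schema:

import csv
import io

from django.db import connection


def copy_books_into_db(rows):
    # Stream pre-validated rows straight into the table with COPY,
    # bypassing the ORM entirely (psycopg2 only; table/columns are
    # hypothetical - adjust to your schema).
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    with connection.cursor() as cursor:
        cursor.copy_expert(
            "COPY myapp_book (name, author_id) FROM STDIN WITH CSV", buf
        )

The mixin and resource below implement points 1 and 3: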
import logging

from import_export import resources

logger = logging.getLogger(__name__)


class BulkSaveMixin:
    """
    Overrides the save hooks to buffer instances so they can be created in bulk.
    https://github.com/django-import-export/django-import-export/issues/939#issuecomment-509435531
    """
    # Note: a class-level list, shared by all instances of the resource;
    # it is cleared in after_import() below.
    bulk_instances = []

    def save_instance(self, instance, using_transactions=True, dry_run=False):
        self.before_save_instance(instance, using_transactions, dry_run)
        if not using_transactions and dry_run:
            # we don't have transactions and we want to do a dry_run
            pass
        else:
            self.bulk_instances.append(instance)
        self.after_save_instance(instance, using_transactions, dry_run)

    def after_import(self, dataset, result, using_transactions, dry_run, **kwargs):
        if self.bulk_instances:
            try:
                self._meta.model.objects.bulk_create(self.bulk_instances)
            except Exception as e:
                # Be careful here: the exception is re-raised, but if the
                # raise_errors flag is False it will be swallowed upstream,
                # and the row_results will look as if the import was
                # successful. Setting raise_errors to True mitigates this
                # because the import process will then clearly fail.
                # To be completely correct, any errors here should update
                # the result / row_results accordingly.
                logger.error("caught exception during bulk_create: %s", e, exc_info=True)
                raise
            finally:
                self.bulk_instances.clear()


class BookResource(BulkSaveMixin, resources.ModelResource):

    def get_or_init_instance(self, instance_loader, row):
        """
        Override to avoid repeated reads on the DB.
        :return: A tuple of (instance, created). ``True`` indicates this is
        a newly instantiated instance, so no lookup was performed.
        """
        return self._meta.model(), True
    
    class Meta:
        model = Book
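
For point 5, one way to avoid repeated FK lookups is a caching ForeignKeyWidget. A minimal sketch against django-import-export 2.x (the Author model and its 'name' lookup field are assumptions):

from import_export import fields
from import_export.widgets import ForeignKeyWidget


class CachedForeignKeyWidget(ForeignKeyWidget):
    """Cache FK lookups so each distinct value hits the db at most once."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache = {}

    def clean(self, value, row=None, **kwargs):
        if value not in self._cache:
            self._cache[value] = super().clean(value, row=row, **kwargs)
        return self._cache[value]


# On the resource, point the FK field at the caching widget:
#     author = fields.Field(column_name='author', attribute='author',
#                           widget=CachedForeignKeyWidget(Author, 'name'))

For point 2, pass use_transactions=False when calling import_data(). Setting raise_errors=True makes any bulk_create() failure visible, per the note in the mixin above (the file name and format are illustrative):

import tablib

dataset = tablib.Dataset().load(open('books.csv').read(), format='csv')
result = BookResource().import_data(
    dataset,
    use_transactions=False,  # point 2: skip transaction management overhead
    raise_errors=True,       # surface bulk_create() failures
)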

Large data sets

If using bulk_create() as described above, all records have to be held in memory before being written to the db, which can cause memory issues for very large datasets. If you hit this, consider writing in batches, as sketched below.
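
A minimal sketch of one way to batch, building on the BulkSaveMixin above (the threshold is a hypothetical value - tune it for your data and memory budget):

class BatchedBulkSaveMixin(BulkSaveMixin):
    batch_size = 1000  # hypothetical threshold

    def after_save_instance(self, instance, using_transactions, dry_run):
        super().after_save_instance(instance, using_transactions, dry_run)
        if len(self.bulk_instances) >= self.batch_size:
            # Flush the buffer so memory use stays bounded; batch_size also
            # caps the number of rows per INSERT statement.
            self._meta.model.objects.bulk_create(
                self.bulk_instances, batch_size=self.batch_size
            )
            self.bulk_instances.clear()

Any final partial batch is still written by after_import() in the mixin above.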

Handling UPDATES / DELETES

If you need to handle SQL UPDATE and DELETE operations (not just INSERTs), these need special attention: call bulk_update() and delete() respectively.

This involves keeping track of which rows need to be inserted / updated / deleted, and calling the correct function for each group.
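
A rough sketch of that bookkeeping (it assumes Django >= 2.2 for bulk_update(), that get_or_init_instance() is not overridden so instance.pk distinguishes new rows from existing ones, and it omits dry_run handling for brevity):

class BulkOpsMixin:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.to_create, self.to_update, self.to_delete = [], [], []

    def save_instance(self, instance, using_transactions=True, dry_run=False):
        self.before_save_instance(instance, using_transactions, dry_run)
        # A pk means the instance was loaded from the db, so it is an update.
        (self.to_update if instance.pk else self.to_create).append(instance)
        self.after_save_instance(instance, using_transactions, dry_run)

    def delete_instance(self, instance, using_transactions=True, dry_run=False):
        self.to_delete.append(instance.pk)

    def after_import(self, dataset, result, using_transactions, dry_run, **kwargs):
        model = self._meta.model
        if self.to_create:
            model.objects.bulk_create(self.to_create)
        if self.to_update:
            # List the concrete fields your import is allowed to change.
            model.objects.bulk_update(self.to_update, fields=['name'])
        if self.to_delete:
            model.objects.filter(pk__in=self.to_delete).delete()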

Notes on M2M Fields

If you need to save instances containing m2m fields in bulk, this is a tricky issue. Again, my views here are derived from my own reading (not from direct testing), so DYOR.

bulk_create() does not save m2m fields. That means a separate DB call (wrapped in a transaction) is needed to perform the bulk m2m operation.

This second call will only work if the instances created in bulk have their primary keys set (currently only supported by Postgres). If not using Postgres, there will have to be a db call to retrieve the newly created instances along with their PKs. This might be impossible if you have inserted objects with identical field values, as it will not be possible to reliably determine which object to set m2m relationships on.

You can then create the m2m relations using the through model, as sketched below. I haven't tried this, but it looks like it will work for creating new relations; updates and deletes to relations will require more effort.

Note that you would have to do this for each m2m field in the model, which adds to the complication.
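
A minimal sketch of the through-model approach, assuming a hypothetical Book.authors m2m field, Postgres (so bulk_create() sets PKs), and a pairs list of (book, author) relations collected while parsing the rows:

from django.db import transaction

with transaction.atomic():
    books = Book.objects.bulk_create(new_books)  # PKs are set on Postgres
    BookAuthor = Book.authors.through
    BookAuthor.objects.bulk_create(
        [BookAuthor(book_id=b.pk, author_id=a.pk) for b, a in pairs]
    )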
