Skip to content

Instantly share code, notes, and snippets.

@davidread
Created February 16, 2019 11:18
Show Gist options
  • Save davidread/0cb2442dde65222f0dd41e9b5700035f to your computer and use it in GitHub Desktop.
Save davidread/0cb2442dde65222f0dd41e9b5700035f to your computer and use it in GitHub Desktop.
Migration performance tests

I ran the migrate_package_activity.py (ckan/ckan#3972) on a real existing CKAN database to get a sense of the time taken and the change in database size

Timing: Migration took 49s for 236 datasets. If you scale that linearly to 900k datasets (europeandataportal.eu) then you're talking 1 hour. There's likely to be more activities per dataset though, so it would be more like several hours. Seems reasonable to want to do the migration while the site is up.

Size: size of the activity table went up about 20%, because every revision we now store resources, extras and tags on each package. However the overall size came down when we deleted the content of the activity_detail table (which is no longer needed), because that had a quirk that when you add a tag, it stored the dictized package as well in that table - this seems like a mistake.

Test Details

Test database properties:

  • 4 years old (used by OpenGov developers)
  • 236 public datasets (and 1263 private ones, without activity)
  • 1499 activities - an average of 6 activities per dataset
  • 16MB SQL dump
  • Migrated from CKAN 2.7.3

Migration:

49s time taken 1499 activities, of which one dataset had 22 had missing revisions, from before it was deleted, (purged?) and recreated.

What was interesting was to compare the size of the data before and after migration.

Query: sudo -u postgres psql -c "SELECT pg_size_pretty( pg_total_relation_size('activity')) as activity, pg_size_pretty( pg_total_relation_size('activity_detail')) as activity_detail, pg_size_pretty( pg_database_size('ckan_default') ) as total;" ckan_default

Before:

 activity | activity_detail | total
----------+-----------------+--------
 15 MB    | 124 MB          | 186 MB

After (with a vacuum):

 activity | activity_detail | total
----------+-----------------+-------
 17 MB    | 12 MB           | 78 MB

activity_detail has 0 rows, but still occupies 12MB, even after the vacuum - a hangover, I guess.

This is what we got rid of from activity_detail:

select count(object_type), object_type  from activity_detail group by object_type;
 count | object_type
-------+--------------
 21000 | PackageExtra
 40302 | tag
 38632 | Resource
  5666 | Package

Commands used

paster db clean -c $CKAN_INI sudo -u postgres pg_restore --clean --if-exists -d ckan_default < /vagrant/src/ckan/ckansandbox.dump sudo -u postgres psql ckan_default -c "delete from activity_detail ad where ad.object_id not in (select id from package);" # DELETE 105152 sudo -u postgres psql ckan_default -c "delete from activity a where a.object_id not in (select id from package);" # DELETE 9527 sudo -u postgres psql ckan_default -c "VACUUM;" sudo -u postgres psql -c "SELECT pg_size_pretty( pg_total_relation_size('activity')) as activity, pg_size_pretty( pg_total_relation_size('activity_detail')) as activity_detail, pg_size_pretty( pg_database_size('ckan_default') ) as total;" ckan_default python ckan/migration/migrate_package_activity.py -c $CKAN_INI --delete yes > ckansandbox.migrate2 sudo -u postgres psql ckan_default -c "VACUUM;" sudo -u postgres psql -c "SELECT pg_size_pretty( pg_total_relation_size('activity')) as activity, pg_size_pretty( pg_total_relation_size('activity_detail')) as activity_detail, pg_size_pretty( pg_database_size('ckan_default') ) as total;" ckan_default paster search-index rebuild -r --config=/etc/ckan/default/development.ini paster --plugin=ckan user add admin password=password apikey=adminkey email=[email protected] paster --plugin=ckan sysadmin add admin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment