Skip to content

Instantly share code, notes, and snippets.

@sheac
Last active May 15, 2018 16:42
Show Gist options
  • Save sheac/a7a4d92ebe72e152005f5e8ececd8105 to your computer and use it in GitHub Desktop.
Save sheac/a7a4d92ebe72e152005f5e8ececd8105 to your computer and use it in GitHub Desktop.
Verify the safety and effectiveness of the script for backfilling document checksums

1. Purpose

This gist is to verify the safety and effectiveness of the script for backfilling document checksums. The script can be found on this mothership PR.

2. Script Requirements

This "script" is meant to fill in checksums for any documents table row that should have a checksum, but doesn't. Documents "should have" a checksum if pending == false. But not all do.

3. Script Algorithm

- get ids for all documents where pending==false && (checksum==null || checksum=='')
- for each id
  - fetch the checksum from S3
  - update the checksum for that row

4. Executing the Script

$ pwd
~/go/src/github.com/parsable/mothership/tools/backfill_doc_checksum

$ go run backfill_doc_checksum.go \
    -dbuser mothership \
    -dbpass REDACTED \
    -dbhost REDACTED \
    -dbname mothership \
    -awsid REDACTED \
    -awssecret REDACTED \
    -s3bucket wi-mothership-staging \
    -awsregion us-west-2 \
    [-limit 10] # for debugging (default is 100K)

5. Means of Verification

Note: I have not run the script in full on the staging environment. I'm waiting until it has been looked over by someone else before proceeding with that.

5.1 Effectiveness

On the staging environment, I've been running the script so it only grabs and modifies 10 of the roughly 15K "bad" document ids. The process is:

  1. count the number of "bad" documents
  2. run script that only modifies n documents, by setting the CLI flag -limit n
  3. re-count the number of "bad" documents, ensuring it is n less than before

Here is the output I've observed:

  1. Pre-count:
mothership=> select count(*) from documents where pending = false and (checksum is null or checksum = '');
 count
-------
 14651
(1 row)
  1. Test run with n=10:
STARTING ::
getting docs with missing checksums
found 10 docs
getting the s3 bucket
...
----------------------------------------------------------------------------
DONE
num documents updated 10
  1. Post-count:
mothership=> select count(*) from documents where pending = false and (checksum is null or checksum = '');
 count
-------
 14641
(1 row)

In the Safety section below, we'll also look at document rows individually to see that the checksum is updated as desired.

5.2 Safety

  1. We can check that no extra rows are updated

From the script output, I expect the following rows to have changed most recently:

  • 56cf5802-e226-40cc-a893-78b51c343d95
  • 62bfca12-b2fe-4a7f-bb06-354b15ffc303
  • 162a110f-777b-4613-8a1b-75458b3df020
  • 7699217b-1398-4062-b781-184c12a63676
  • 477d2633-9b4e-4cfe-bdcf-1aa7868ddffa
  • f5e81a59-3d17-4d45-8379-bd3e786831ff
  • af2e3a14-aa69-47af-a32b-5f68e749f502
  • 9ba7e6cf-8d64-42af-8175-c49a01fb14b6
  • 08af8e1b-c943-4f17-873f-1664cc647245
  • b0ba0449-8431-445d-8cda-2c8f23bbef5a

Luckily, the staging environment is being used so little at the moment that we can find that these are in fact the most recently updated documents. Moreover, timestamp inspection tells us that there's an approximately 50 second gap between the 3 rows affected by the previous update and the 10 rows affected by our update.

mothership=> select id, updated_at from documents order by updated_at desc limit 13;
                  id                  | updated_at
--------------------------------------+------------
 08af8e1b-c943-4f17-873f-1664cc647245 | 1526343712
 b0ba0449-8431-445d-8cda-2c8f23bbef5a | 1526343712
 af2e3a14-aa69-47af-a32b-5f68e749f502 | 1526343711
 f5e81a59-3d17-4d45-8379-bd3e786831ff | 1526343711
 9ba7e6cf-8d64-42af-8175-c49a01fb14b6 | 1526343711
 477d2633-9b4e-4cfe-bdcf-1aa7868ddffa | 1526343710
 56cf5802-e226-40cc-a893-78b51c343d95 | 1526343710
 62bfca12-b2fe-4a7f-bb06-354b15ffc303 | 1526343710
 162a110f-777b-4613-8a1b-75458b3df020 | 1526343710
 7699217b-1398-4062-b781-184c12a63676 | 1526343710
 50ec4fb4-3810-451e-8fb8-8c9a6d383557 | 1526343660
 3833f119-6248-477b-bed9-cb0903367876 | 1526343660
 7026775e-71b2-4b90-ba97-f501862cca1e | 1526343660
  1. We can check that for the updated rows, only the checksum field has changed

In order to peek at the documents before they're updated, we need to know which are going to change. So we'll change the query slightly to make it more predictable:

SELECT id FROM documents WHERE pending = FALSE AND (checksum is NULL OR checksum = '') ORDER BY UPDATED_AT LIMIT 2

Let's get a snapshot of what our data looks like "before":

mothership=> \x
Expanded display is on.
mothership=> select * from documents where pending = false and (checksum is null or checksum = '') order by updated_at limit 2;
-[ RECORD 1 ]--------+-------------------------------------
id                   | dd95fa4c-f2b1-42f1-bf7e-0690af6f9a5a
mime_type            | video/mp4
name                 | small.mp4
created_at           | 1438628737
updated_at           | 1438628737
team_id              |
size                 |
pending              | f
user_id              |
checksum             |
original_document_id |
caption              |
thumbnail_id         |
-[ RECORD 2 ]--------+-------------------------------------
id                   | 70af8dfe-b9b5-4517-b205-71d680707d94
mime_type            | application/pdf
name                 | SaaS Metrics.pdf
created_at           | 1438628790
updated_at           | 1438628790
team_id              |
size                 |
pending              | f
user_id              |
checksum             |
original_document_id |
caption              |
thumbnail_id         |

We run our script, and then check the values after:

mothership=> SELECT * FROM documents WHERE id IN ('dd95fa4c-f2b1-42f1-bf7e-0690af6f9a5a', '70af8dfe-b9b5-4517-b205-71d680707d94');
-[ RECORD 1 ]--------+-------------------------------------
id                   | dd95fa4c-f2b1-42f1-bf7e-0690af6f9a5a
mime_type            | video/mp4
name                 | small.mp4
created_at           | 1438628737
updated_at           | 1526345629
team_id              |
size                 |
pending              | f
user_id              |
checksum             | a3ac7ddabb263c2d00b73e8177d15c8d
original_document_id |
caption              |
thumbnail_id         |
-[ RECORD 2 ]--------+-------------------------------------
id                   | 70af8dfe-b9b5-4517-b205-71d680707d94
mime_type            | application/pdf
name                 | SaaS Metrics.pdf
created_at           | 1438628790
updated_at           | 1526345629
team_id              |
size                 |
pending              | f
user_id              |
checksum             | d775b64c0dae56bff9804e708983c622
original_document_id |
caption              |
thumbnail_id         |

As we can see, the only updated columns are:

  • updated_at
  • checksum
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment