@larsyencken
Last active July 27, 2022 15:43
OWID: data deployment via the ETL
sequenceDiagram
    participant cron
    participant etl_prod
    participant bulk_backport
    participant walden
    participant etl
    participant disk
    participant s3
    participant github
    participant mysql
    cron->>etl_prod: run (every 5 minutes)
    etl_prod->>disk: acquire lock
    etl_prod->>etl_prod: commit walden changes
    etl_prod->>github: pull and rebase walden
    etl_prod->>github: pull etl
    etl_prod->>bulk_backport: run full backport
    bulk_backport->>mysql: get active datasets
    mysql-->>bulk_backport: dataset ids
    bulk_backport->>mysql: get dataset X metadata
    bulk_backport->>bulk_backport: checksum metadata for X
    bulk_backport->>walden: compare against walden checksum for X
    walden-->>bulk_backport: checksum mismatch
    bulk_backport->>disk: write config file for X
    bulk_backport->>walden: add config for X to catalog
    walden->>s3: store config file for X in catalog
    walden->>disk: write index entry for X config
    bulk_backport->>mysql: fetch data values for X
    bulk_backport->>disk: write data file for X
    bulk_backport->>walden: add data for X to catalog
    walden->>s3: store data file for X in catalog
    walden->>disk: write index entry for X data
    bulk_backport->>bulk_backport: repeat for all other datasets
    etl_prod->>github: commit and push walden state
    etl_prod->>etl: run the full etl
    etl->>disk: update local catalog
    etl_prod->>etl: reindex, prune and republish
    etl->>s3: update remote catalog
    etl_prod->>disk: release lock
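
The lock and checksum-comparison steps in the diagram can be sketched roughly as below. This is a minimal illustration, not the actual `etl_prod` or `bulk_backport` code: the helper names (`file_lock`, `metadata_checksum`, `datasets_to_backport`), the lock-file approach, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json
import os
from contextlib import contextmanager


@contextmanager
def file_lock(path):
    """Hold an exclusive lock file for the duration of the block (hypothetical helper)."""
    # O_EXCL makes creation fail if the lock file already exists, so a cron
    # run that overlaps with a previous one raises FileExistsError.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


def metadata_checksum(metadata):
    """Checksum a metadata dict via a canonical JSON encoding (assumed scheme)."""
    encoded = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def datasets_to_backport(current_metadata, stored_checksums):
    """Return the dataset ids whose metadata checksum differs from the stored one."""
    return [
        dataset_id
        for dataset_id, metadata in sorted(current_metadata.items())
        if metadata_checksum(metadata) != stored_checksums.get(dataset_id)
    ]
```

In this sketch, a dataset with no stored checksum is treated as a mismatch, so new datasets are backported on their first run. Only the mismatching ids would then go through the "write config", "fetch data values" and "add to catalog" steps.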