sequenceDiagram
participant cron
participant etl_prod
participant bulk_backport
participant walden
participant etl
participant disk
participant s3
participant github
participant mysql
cron->>etl_prod: run (every 5 minutes)
etl_prod->>disk: acquire lock
etl_prod->>etl_prod: commit walden changes
etl_prod->>github: pull and rebase walden
etl_prod->>github: pull etl
etl_prod->>bulk_backport: run full backport
bulk_backport->>mysql: get active datasets
mysql-->>bulk_backport: dataset ids
bulk_backport->>mysql: get dataset X metadata
bulk_backport->>bulk_backport: checksum metadata for X
bulk_backport->>walden: compare against walden checksum for X
walden-->>bulk_backport: checksum mismatch
bulk_backport->>disk: write config file for X
bulk_backport->>walden: add config for X to catalog
walden->>s3: store config file for X in catalog
walden->>disk: write index entry for X config
bulk_backport->>mysql: fetch data values for X
bulk_backport->>disk: write data file for X
bulk_backport->>walden: add data for X to catalog
walden->>s3: store data file for X in catalog
walden->>disk: write index entry for X data
bulk_backport->>bulk_backport: repeat for all other datasets
etl_prod->>github: commit and push walden state
etl_prod->>etl: run the full etl
etl->>disk: update local catalog
etl_prod->>etl: reindex, prune and republish
etl->>s3: update remote catalog
etl_prod->>disk: release lock
OWID: data deployment via the ETL
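
For readers who prefer code to boxes and arrows, here is a minimal Python sketch of the per-dataset decision in the bulk_backport step shown above: checksum the dataset's grapher metadata, compare it with the checksum recorded in walden, and only re-export when they differ. The helper callables (get_metadata, get_walden_checksum, backport_one) and the MD5-over-JSON checksum are illustrative assumptions, not the actual bulk_backport API.

# Sketch of the checksum-driven part of bulk_backport, under assumed helper names.
# Only hashlib/json are real; anything marked "hypothetical" stands in for whatever
# the actual ETL code does.
import hashlib
import json
from typing import Iterable


def metadata_checksum(metadata: dict) -> str:
    # Serialise deterministically so identical metadata always hashes identically.
    payload = json.dumps(metadata, sort_keys=True, default=str).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def backport_changed_datasets(
    active_dataset_ids: Iterable[int],
    get_metadata,         # hypothetical: dataset_id -> metadata dict (from MySQL)
    get_walden_checksum,  # hypothetical: dataset_id -> checksum stored in walden, or None
    backport_one,         # hypothetical: re-export config + data and add them to walden
) -> None:
    for dataset_id in active_dataset_ids:
        metadata = get_metadata(dataset_id)
        checksum = metadata_checksum(metadata)

        # "compare against walden checksum for X": unchanged datasets are skipped,
        # so each 5-minute run only touches datasets whose metadata changed.
        if get_walden_checksum(dataset_id) == checksum:
            continue

        # Checksum mismatch: write the config and data files for X, then register
        # both in the walden catalog, which uploads them to S3 and updates the
        # local index on disk.
        backport_one(dataset_id, metadata, checksum)

In the diagram, this loop runs inside a single lock held on disk by etl_prod, so overlapping cron invocations cannot race; only after every active dataset has been processed is the walden state committed and pushed to GitHub, and only then does the full ETL run and republish the remote catalog.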