Layer | Order | Description |
---|---|---|
raw | Sequential | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer to never work with the original data directly! |
intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correctly typed representation as numbers, dates, etc. Our recommended approach is to mirror the raw layer in a typed format such as Apache Parquet (see the sketch after this table). Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted. |
primary | Sequential | Domain-specific data model(s) containing cleansed, transformed and wrangled data built from the raw or intermediate layers, which forms the layer you feed into your feature engineering. |
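To make the raw-to-intermediate step concrete, here is a minimal sketch of mirroring a raw CSV into typed Parquet with pandas; the file paths, column names and dtypes are illustrative assumptions, not part of the convention.

```python
import pandas as pd

# Hypothetical raw CSV; paths, column names and dtypes are placeholders
raw = pd.read_csv("data/raw/orders.csv", dtype=str)  # the raw file itself is never modified

# Intermediate layer: same structure, just typed (plus light field-name cleanup)
intermediate = raw.rename(columns=str.lower)
intermediate["order_id"] = intermediate["order_id"].astype("int64")
intermediate["order_date"] = pd.to_datetime(intermediate["order_date"])
intermediate["amount"] = pd.to_numeric(intermediate["amount"])

# Mirror the raw layer as typed Parquet (requires pyarrow or fastparquet)
intermediate.to_parquet("data/intermediate/orders.parquet", index=False)
```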
```bash
# setup docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# setup airflow 1.10.14
git clone https://github.com/xnuinside/airflow_in_docker_compose
cd airflow_in_docker_compose
docker-compose -f docker-compose-with-celery-executor.yml up --build
```
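Once the containers are up, one way to sanity-check that the scheduler and Celery workers are wired together is to drop a trivial DAG into whatever local directory the compose file mounts as the dags folder (check that repository's compose file for the exact path). A minimal sketch, with an arbitrary dag_id and callable chosen just for illustration:

```python
# smoke_test_dag.py -- minimal sketch for verifying the Airflow 1.10.x stack
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path
from airflow.utils.dates import days_ago

def say_hello():
    print("hello from a Celery worker")

with DAG(dag_id="smoke_test", start_date=days_ago(1), schedule_interval=None) as dag:
    PythonOperator(task_id="hello", python_callable=say_hello)
```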
""" | |
Example of using sub-parser, sub-commands and sub-sub-commands :-) | |
""" | |
import argparse | |
def main(args): | |
""" | |
Just do something |
```python
import pandas as pd
import numpy as np

def generate_random_dates(num_dates: int) -> np.ndarray:
    """Generate a 1D array of `num_dates` random dates."""
    start_date = "2020-01-01"
    # Generate all days for 2020
    available_dates = [np.datetime64(start_date) + days for days in range(365)]
    # Get `num_dates` random dates from 2020
    return np.random.choice(available_dates, size=num_dates)
```
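The original snippet is truncated, so what the dates were used for downstream is not shown; as a purely hypothetical usage check (which also puts the pandas import to work), you could drop the result into a DataFrame:

```python
# Hypothetical usage: 10 random 2020 dates as a DataFrame column
df = pd.DataFrame({"event_date": generate_random_dates(10)})
print(df["event_date"].dt.year.unique())  # expect [2020]
```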
```python
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile
import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

# One way to do it: buffer the archive from S3, then read each CSV member into a DataFrame
fs = s3fs.S3FileSystem()  # picks up your default AWS credentials
with fs.open(f"{bucket}/{key}", "rb") as f:
    with tarfile.open(fileobj=io.BytesIO(f.read()), mode="r:gz") as tar:
        dfs = [pd.read_csv(tar.extractfile(member)) for member in tar.getmembers() if member.isfile()]
```
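If the CSVs inside the archive share the same columns — an assumption about this particular archive, not something the approach guarantees — the per-file frames can then be stacked into one:

```python
# Hypothetical follow-up: combine the per-file frames, assuming a shared schema
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)
```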
UPDATE: I have baked the ideas in this file inside a Python CLI tool called pyds-cli. Please find it here: https://github.com/ericmjl/pyds-cli
Having done a number of data projects over the years, and having seen a number of them up on GitHub, I've come to see that there's a wide range in terms of how "readable" a project is. I'd like to share some practices that I have come to adopt in my projects, which I hope will bring some organization to your projects.
Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.
Disclaimer 2: What I’m writing below is primarily geared towards Python language users. Some ideas may be transferable to other languages; others may not be so. Please feel free to remix whatever you see here!