Layer | Order | Description |
---|---|---|
raw | Sequential | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer to never work with the original data directly! |
intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correctly typed representation as numbers, dates, etc. Our recommended approach is to mirror the raw layer in a typed format such as Apache Parquet (see the sketch after this table). Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted. |
primary | Sequential | Domain-specific data model(s) containing cleansed, transformed and wrangled data built from the raw or intermediate layers, which forms the layer you feed into your feature engineering. |
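To make the raw-to-intermediate step concrete, here is a minimal sketch of mirroring a raw CSV into typed Parquet with pandas; the file paths, column names and dtypes are illustrative assumptions, not part of the convention.

```python
import pandas as pd

# Hypothetical raw CSV; paths, column names and dtypes are placeholders
raw = pd.read_csv("data/raw/orders.csv", dtype=str)  # the raw file itself is never modified

# Intermediate layer: same structure, just typed (plus light field-name cleanup)
intermediate = raw.rename(columns=str.lower)
intermediate["order_id"] = intermediate["order_id"].astype("int64")
intermediate["order_date"] = pd.to_datetime(intermediate["order_date"])
intermediate["amount"] = pd.to_numeric(intermediate["amount"])

# Mirror the raw layer as typed Parquet (requires pyarrow or fastparquet)
intermediate.to_parquet("data/intermediate/orders.parquet", index=False)
```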
```bash
# setup docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# setup airflow 1.10.14
git clone https://github.com/xnuinside/airflow_in_docker_compose
cd airflow_in_docker_compose
docker-compose -f docker-compose-with-celery-executor.yml up --build
```
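Once the containers are up, one way to sanity-check that the scheduler and Celery workers are wired together is to drop a trivial DAG into whatever local directory the compose file mounts as the dags folder (check that repository's compose file for the exact path). A minimal sketch, with an arbitrary dag_id and callable chosen just for illustration:

```python
# smoke_test_dag.py -- minimal sketch for verifying the Airflow 1.10.x stack
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path
from airflow.utils.dates import days_ago

def say_hello():
    print("hello from a Celery worker")

with DAG(dag_id="smoke_test", start_date=days_ago(1), schedule_interval=None) as dag:
    PythonOperator(task_id="hello", python_callable=say_hello)
```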
""" | |
Example of using sub-parser, sub-commands and sub-sub-commands :-) | |
""" | |
import argparse | |
def main(args): | |
""" | |
Just do something |
```python
import pandas as pd
import numpy as np

def generate_random_dates(num_dates: int) -> np.ndarray:
    """Generate a 1D array of `num_dates` random dates."""
    start_date = "2020-01-01"
    # Generate all days for 2020
    available_dates = [np.datetime64(start_date) + days for days in range(365)]
    # Get `num_dates` random dates from 2020
    return np.random.choice(available_dates, size=num_dates)
```
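The original snippet is truncated, so what the dates were used for downstream is not shown; as a purely hypothetical usage check (which also puts the pandas import to work), you could drop the result into a DataFrame:

```python
# Hypothetical usage: 10 random 2020 dates as a DataFrame column
df = pd.DataFrame({"event_date": generate_random_dates(10)})
print(df["event_date"].dt.year.unique())  # expect [2020]
```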
```python
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile
import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

# One way to do it: buffer the archive from S3, then read each CSV member into a DataFrame
fs = s3fs.S3FileSystem()  # picks up your default AWS credentials
with fs.open(f"{bucket}/{key}", "rb") as f:
    with tarfile.open(fileobj=io.BytesIO(f.read()), mode="r:gz") as tar:
        dfs = [pd.read_csv(tar.extractfile(member)) for member in tar.getmembers() if member.isfile()]
```
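If the CSVs inside the archive share the same columns — an assumption about this particular archive, not something the approach guarantees — the per-file frames can then be stacked into one:

```python
# Hypothetical follow-up: combine the per-file frames, assuming a shared schema
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)
```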
UPDATE: I have baked the ideas in this file inside a Python CLI tool called pyds-cli. Please find it here: https://github.com/ericmjl/pyds-cli
Having done a number of data projects over the years, and having seen a number of them up on GitHub, I've come to see that there's a wide range in terms of how "readable" a project is. I'd like to share some practices that I have come to adopt in my projects, which I hope will bring some organization to your projects.
Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.
Disclaimer 2: What I’m writing below is primarily geared towards Python language users. Some ideas may be transferable to other languages; others may not be so. Please feel free to remix whatever you see here!