Run terraform fmt to format your code consistently
Use terraform validate to check for syntax or semantic issues before running apply
Adopt tflint or similar linters to catch anti-patterns or unused code
Go to https://developer.apple.com/downloads/index.action, search for "Command Line Tools", and choose the one matching your version of OS X
Go to http://brew.sh/ and paste the one-liner into the Terminal; you now have Homebrew installed (a better MacPorts)
Install transmission-daemon with
brew install transmission
Symlink the startup config for launchctl with
ln -sfv /usr/local/opt/transmission/*.plist ~/Library/LaunchAgents
version: '2'
services:
  minio:
    restart: always
    image: docker.io/bitnami/minio:2021
    ports:
      - '9000:9000'
    environment:
      - MINIO_ROOT_USER=miniokey
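To sanity-check the running container, here is a minimal boto3 sketch (not part of the original compose notes); it assumes the truncated environment block also sets MINIO_ROOT_PASSWORD=miniosecret, so adjust the credentials to whatever the compose file actually defines.

# Sketch: talk to the local MinIO container over its S3 API with boto3.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:9000',  # port published by the compose file
    aws_access_key_id='miniokey',          # MINIO_ROOT_USER from the compose file
    aws_secret_access_key='miniosecret',   # assumed MINIO_ROOT_PASSWORD
)

s3.create_bucket(Bucket='test-bucket')
print([b['Name'] for b in s3.list_buckets()['Buckets']])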
# Suppose the data file name has the format "datatfile_YYYY-MM-DD.csv"; this file arrives in S3 every day.
file_suffix = "{{ execution_date.strftime('%Y-%m-%d') }}"
bucket_key_template = 's3://[bucket_name]/datatfile_{}.csv'.format(file_suffix)

file_sensor = S3KeySensor(
    task_id='s3_key_sensor_task',
    poke_interval=60 * 30,   # seconds; check for the file every half an hour
    timeout=60 * 60 * 12,    # give up after 12 hours
    bucket_key=bucket_key_template,
    bucket_name=None,        # bucket name is taken from the full s3:// key
    wildcard_match=False,
    dag=dag)                 # assumes a `dag` object defined elsewhere in the pipeline
from airflow import DAG
from airflow.operators.sensors import S3KeySensor
from airflow.operators import BashOperator
from datetime import datetime, timedelta

# Midnight of the previous day, used as the DAG start date.
yday = datetime.combine(datetime.today() - timedelta(1),
                        datetime.min.time())

default_args = {
    'owner': 'msumit',
    'start_date': yday,  # assumed; the rest of the dict is truncated in the original
}
with DAG(**dag_config) as dag:
    # Declare pipeline start and end task
    start_task = DummyOperator(task_id='pipeline_start')
    end_task = DummyOperator(task_id='pipeline_end')

    for account_details in pipeline_config['task_details']['accounts']:
        # Declare account start and end task
        if account_details['runable']:
            acct_start_task = DummyOperator(task_id=account_details['account_id'] + '_start')
            acct_start_task.set_upstream(start_task)
Set up parquet-tools with
brew install parquet-tools
Show the built-in help with
parquet-tools -h
Count rows in a file with
parquet-tools rowcount part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
Print the first record with
parquet-tools head -n 1 part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
Inspect the file metadata with
parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
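The same checks can be done from Python with pyarrow; this is a sketch, not part of the original notes, and it reuses the example part file name above.

# Sketch: inspect a Parquet file with pyarrow instead of the parquet-tools CLI.
import pyarrow.parquet as pq

pf = pq.ParquetFile('part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet')

print(pf.metadata.num_rows)                             # like parquet-tools rowcount
print(pf.schema)                                        # schema portion of parquet-tools meta
print(next(pf.iter_batches(batch_size=1)).to_pydict())  # like parquet-tools head -n 1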
Context: The Integration team has deployed a cron job that dumps a CSV file containing all the new Shopify configurations daily at 2 AM UTC. The task is to build a daily pipeline that will:
download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv
filter out each row with an empty application_id
add a has_specific_prefix column set to true if the value of index_prefix differs from shopify_, else false
load the valid rows into a PostgreSQL instance
The pipeline should process files from 2019-04-01 to 2019-04-07.
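A rough Python sketch of one day of this pipeline, under stated assumptions: the target table name shopify_configurations and the Postgres connection string are invented for illustration, and a real solution would wrap the same steps in an Airflow DAG.

# Sketch of the daily processing step; table name and connection string are assumptions.
import pandas as pd
from sqlalchemy import create_engine

def process_day(ds: str) -> None:
    """ds is the run date as 'YYYY-MM-DD', e.g. '2019-04-01'."""
    url = f'https://alg-data-public.s3.amazonaws.com/{ds}.csv'
    df = pd.read_csv(url)

    # Drop rows with an empty application_id.
    df = df[df['application_id'].notna() & (df['application_id'].astype(str) != '')]

    # has_specific_prefix is true when index_prefix differs from the default 'shopify_'.
    df['has_specific_prefix'] = df['index_prefix'] != 'shopify_'

    engine = create_engine('postgresql://user:password@localhost:5432/shopify')  # assumed DSN
    df.to_sql('shopify_configurations', engine, if_exists='append', index=False)

if __name__ == '__main__':
    for day in pd.date_range('2019-04-01', '2019-04-07'):
        process_day(day.strftime('%Y-%m-%d'))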
play.modules.enabled += "com.samklr.KamonModule"

kamon {
  environment {
    service = "my-svc"
  }
  jaeger {
A running example of the code from:
This gist creates a working example from the blog post, plus an alternate example using a simple worker pool.
TL;DR: if you want simple and controlled concurrency, use a worker pool.
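For comparison, here is a minimal worker-pool sketch in Python; it is not the gist's code, only an illustration that a fixed number of workers caps how much runs concurrently.

# Sketch: a fixed-size pool of workers, so at most max_workers jobs run at once.
from concurrent.futures import ThreadPoolExecutor

def do_work(n: int) -> int:
    return n * n  # stand-in for the real job

with ThreadPoolExecutor(max_workers=4) as pool:  # pool size bounds concurrency
    results = list(pool.map(do_work, range(10)))

print(results)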