Run terraform fmt to format your code consistently
Use terraform validate to check for syntax or semantic issues before running apply
Adopt tflint or similar linters to catch anti-patterns or unused code
Go to https://developer.apple.com/downloads/index.action, search for "Command Line Tools", and choose the one matching your version of OS X
Go to http://brew.sh/ and paste the one-liner into the Terminal; you now have Homebrew installed (a better MacPorts)
Install transmission-daemon with
brew install transmission
Symlink the startup config for launchctl with
ln -sfv /usr/local/opt/transmission/*.plist ~/Library/LaunchAgents
version: '2'
services:
  minio:
    restart: always
    image: docker.io/bitnami/minio:2021
    ports:
      - '9000:9000'
    environment:
      - MINIO_ROOT_USER=miniokey
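To sanity-check the running container, here is a minimal boto3 sketch (not part of the original compose notes); it assumes the truncated environment block also sets MINIO_ROOT_PASSWORD=miniosecret, so adjust the credentials to whatever the compose file actually defines.

# Sketch: talk to the local MinIO container over its S3 API with boto3.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:9000',  # port published by the compose file
    aws_access_key_id='miniokey',          # MINIO_ROOT_USER from the compose file
    aws_secret_access_key='miniosecret',   # assumed MINIO_ROOT_PASSWORD
)

s3.create_bucket(Bucket='test-bucket')
print([b['Name'] for b in s3.list_buckets()['Buckets']])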
# Suppose the data file name has the format "datatfile_YYYY-MM-DD.csv"; this file arrives in S3 every day.
file_suffix = "{{ execution_date.strftime('%Y-%m-%d') }}"
bucket_key_template = 's3://[bucket_name]/datatfile_{}.csv'.format(file_suffix)

file_sensor = S3KeySensor(
    task_id='s3_key_sensor_task',
    poke_interval=60 * 30,   # seconds; check for the file every half an hour
    timeout=60 * 60 * 12,    # give up after 12 hours
    bucket_key=bucket_key_template,
    bucket_name=None,        # bucket name is taken from the full s3:// key
    wildcard_match=False,
    dag=dag)                 # assumes a `dag` object defined elsewhere in the pipeline
from airflow import DAG
from airflow.operators.sensors import S3KeySensor
from airflow.operators import BashOperator
from datetime import datetime, timedelta

# Midnight of the previous day, used as the DAG start date.
yday = datetime.combine(datetime.today() - timedelta(1),
                        datetime.min.time())

default_args = {
    'owner': 'msumit',
    'start_date': yday,  # assumed; the rest of the dict is truncated in the original
}
with DAG(**dag_config) as dag:
    # Declare pipeline start and end task
    start_task = DummyOperator(task_id='pipeline_start')
    end_task = DummyOperator(task_id='pipeline_end')

    for account_details in pipeline_config['task_details']['accounts']:
        # Declare account start and end task
        if account_details['runable']:
            acct_start_task = DummyOperator(task_id=account_details['account_id'] + '_start')
            acct_start_task.set_upstream(start_task)
Set up parquet-tools with
brew install parquet-tools
Show the built-in help with
parquet-tools -h
Count rows in a file with
parquet-tools rowcount part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
Print the first record with
parquet-tools head -n 1 part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
Inspect the file metadata with
parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
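The same checks can be done from Python with pyarrow; this is a sketch, not part of the original notes, and it reuses the example part file name above.

# Sketch: inspect a Parquet file with pyarrow instead of the parquet-tools CLI.
import pyarrow.parquet as pq

pf = pq.ParquetFile('part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet')

print(pf.metadata.num_rows)                             # like parquet-tools rowcount
print(pf.schema)                                        # schema portion of parquet-tools meta
print(next(pf.iter_batches(batch_size=1)).to_pydict())  # like parquet-tools head -n 1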
Context: The Integration team has deployed a cron job that dumps a CSV file containing all the new Shopify configurations daily at 2 AM UTC. The task is to build a daily pipeline that will:
download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv
filter out each row with an empty application_id
add a has_specific_prefix column set to true if the value of index_prefix differs from shopify_, else false
load the valid rows into a PostgreSQL instance
The pipeline should process files from 2019-04-01 to 2019-04-07.
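A rough Python sketch of one day of this pipeline, under stated assumptions: the target table name shopify_configurations and the Postgres connection string are invented for illustration, and a real solution would wrap the same steps in an Airflow DAG.

# Sketch of the daily processing step; table name and connection string are assumptions.
import pandas as pd
from sqlalchemy import create_engine

def process_day(ds: str) -> None:
    """ds is the run date as 'YYYY-MM-DD', e.g. '2019-04-01'."""
    url = f'https://alg-data-public.s3.amazonaws.com/{ds}.csv'
    df = pd.read_csv(url)

    # Drop rows with an empty application_id.
    df = df[df['application_id'].notna() & (df['application_id'].astype(str) != '')]

    # has_specific_prefix is true when index_prefix differs from the default 'shopify_'.
    df['has_specific_prefix'] = df['index_prefix'] != 'shopify_'

    engine = create_engine('postgresql://user:password@localhost:5432/shopify')  # assumed DSN
    df.to_sql('shopify_configurations', engine, if_exists='append', index=False)

if __name__ == '__main__':
    for day in pd.date_range('2019-04-01', '2019-04-07'):
        process_day(day.strftime('%Y-%m-%d'))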
play.modules.enabled += "com.samklr.KamonModule"

kamon {
  environment {
    service = "my-svc"
  }
  jaeger {
A running example of the code from:
This gist creates a working example from the blog post, plus an alternate example using a simple worker pool.
TL;DR: if you want simple and controlled concurrency, use a worker pool.
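For comparison, here is a minimal worker-pool sketch in Python; it is not the gist's code, only an illustration that a fixed number of workers caps how much runs concurrently.

# Sketch: a fixed-size pool of workers, so at most max_workers jobs run at once.
from concurrent.futures import ThreadPoolExecutor

def do_work(n: int) -> int:
    return n * n  # stand-in for the real job

with ThreadPoolExecutor(max_workers=4) as pool:  # pool size bounds concurrency
    results = list(pool.map(do_work, range(10)))

print(results)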