- Don’t use SELECT *; specify explicit column names (Presto is a columnar store).
- Avoid large JOINs; filter each table first.
- In Presto, tables are joined in the order they are listed!
- Join small tables earlier in the plan and leave larger fact tables to the end.
- Avoid cross joins or one-to-many joins, as these can degrade performance.
- ORDER BY and GROUP BY take time; only use ORDER BY in subqueries if it is really necessary.
- When using GROUP BY, order the columns from highest cardinality (that is, the most unique values) to lowest (see the example query below).
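As a rough illustration, here is a minimal sketch of a query shaped along these lines, run through the `presto-python-client` (`prestodb`) package. The table names (`events`, `users`), columns, and connection settings are made-up examples, not something from the original notes.

```python
import prestodb  # presto-python-client; assumed to be installed

# Hypothetical tables: a large `events` fact table and a small `users` dimension table.
QUERY = """
SELECT u.country,                     -- explicit columns instead of SELECT *
       e.event_type,
       count(*) AS n_events
FROM (                                -- small dimension table first, pre-filtered
    SELECT id, country FROM users WHERE is_active = true
) AS u
JOIN (                                -- large fact table last, also pre-filtered
    SELECT user_id, event_type FROM events WHERE ds >= '2018-01-01'
) AS e
  ON e.user_id = u.id
GROUP BY u.country, e.event_type      -- roughly highest- to lowest-cardinality
"""

conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080,
    user="analyst", catalog="hive", schema="default",
)
cursor = conn.cursor()
cursor.execute(QUERY)
print(cursor.fetchall())
```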
How to search in all countries *but* the US (or any other, for that matter)?
LinkedIn country codes: https://developer.linkedin.com/docs/reference/country-codes#
LinkedIn faceted-search URL format: %5B"ca%3A0"%2C"au%3A0"%2C"es%3A0"%5D
Decoded URL: ["ca:0","au:0","es:0"]
=> Complete list for injection into the URL (remove the country you want to exclude); a small helper that builds the encoded parameter is sketched after the list:
["ae:0","ar:0","at:0","au:0","be:0","br:0","ca:0","ch:0","cl:0","cn:0","co:0","cz:0","de:0","dk:0","es:0","fi:0","fr:0","fx:0","gb:0","gr:0","hk:0","hr:0","hu:0","id:0","ie:0","il:0","in:0","is:0","it:0","jp:0","lb:0","lu:0","lv:0","ma:0","mc:0","mx:0","my:0","nl:0","no:0","nz:0","oo:0","pe:0","ph:0","pk:0","pl:0","pr:0","pt:0","py:0","qa:0","ro:0","ru:0","sa:0","se:0","sg:0","sk:0","th:0","tr:0","tw:0","ua:0","us:0","uy:0","ve:0","vn:0","yu:0","za:0"]
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.chrome.options import Options
import zipfile, os

# Builds a small Chrome extension on the fly so Chrome can use a proxy that requires
# username/password authentication (a plain --proxy-server switch cannot pass credentials).
def proxy_chrome(PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS):
    manifest_json = """
    {
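For the simpler case where the proxy does not require credentials, the extension trick is unnecessary and a single Chrome switch is enough. A minimal sketch, with a placeholder proxy address:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

PROXY = "203.0.113.10:3128"  # placeholder host:port

options = Options()
options.add_argument(f"--proxy-server=http://{PROXY}")

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")  # the reported IP should be the proxy's
print(driver.page_source)
driver.quit()
```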
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''Read and write gzip files between a Python application and S3 directly, for Python 3.
Python 2 version - https://gist.github.com/a-hisame/f90815f4fae695ad3f16cb48a81ec06e
'''
import io
import gzip
import json
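The general pattern the gist describes can be sketched roughly as below, using `boto3` with an in-memory buffer so nothing touches the local disk. The bucket and key names are placeholders.

```python
import io
import gzip
import json

import boto3  # assumed; not shown in the snippet above

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "data/sample.json.gz"  # placeholder names

def upload_gzipped_json(obj, bucket=BUCKET, key=KEY):
    """Serialize `obj` to JSON, gzip it in memory, and upload it to S3."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(json.dumps(obj).encode("utf-8"))
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

def download_gzipped_json(bucket=BUCKET, key=KEY):
    """Download a gzipped JSON object from S3 and decode it in memory."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb") as gz:
        return json.loads(gz.read().decode("utf-8"))

# upload_gzipped_json({"hello": "world"})
# print(download_gzipped_json())
```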
I had a really interesting journey today with a thorny little challenge: deleting all the files in an S3 bucket with tons of nested files.
The bucket path (s3://buffer-data/emr/logs/) contained log files created by Elastic MapReduce jobs that ran every day over a couple of years (from early 2015 to early 2018).
Each EMR job would run hourly every day, firing up a cluster of machines, and each machine would output its logs. That resulted in thousands of nested paths (one for each job), each containing thousands of other files. I estimated the total number of nested files to be between 5 and 10 million.
I had to estimate this number by looking at sample counts of some of the nested directories, because getting the true count would mean recursing through the whole S3 tree, which was just too slow. This is also exactly why it was challenging to delete all the files.
Deleting all the files in an S3 bucket like this is pretty challenging, since S3 doesn't really work like a true filesystem.
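For context, one common way to tackle a bulk delete like this is to page through the keys under the prefix and delete them 1,000 at a time with `boto3`. This is only a sketch of that general approach, not necessarily what was done here:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "buffer-data", "emr/logs/"  # taken from the path mentioned above

def delete_prefix(bucket=BUCKET, prefix=PREFIX):
    """List objects under `prefix` page by page and delete them in batches."""
    paginator = s3.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        contents = page.get("Contents", [])
        if not contents:
            continue
        # delete_objects accepts at most 1000 keys per call, which matches the page size.
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": obj["Key"]} for obj in contents]},
        )
        deleted += len(contents)
    return deleted
```

With millions of keys this still takes a while because the listing itself is sequential; an S3 lifecycle expiration rule on the prefix is the usual hands-off alternative.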
jmails.info
sacustomerdelight.co.in
extrobuzzapp.com
ixigo.info
offer4uhub.com
netecart.com
101coupon.in
freedealcode.in
bankmarket.in
hotoffers.co.in
service: service-name
provider:
  name: aws
  runtime: nodejs6.10
functions:
  myfunc:
    handler: handler.myfunc
/*
// AdWords Script: Put Data From AdWords Report In Google Sheets
// --------------------------------------------------------------
// Copyright 2017 Optmyzr Inc., All Rights Reserved
//
// This script takes a Google spreadsheet as input. Based on the column headers, data filters, and date range specified
// on this sheet, it will generate different reports.
//
// The goal is to let users create custom automatic reports with AdWords data that they can then include in an automated reporting
// tool like the one offered by Optmyzr.