vitalibertas’s gists

vitalibertas / scan_pdf.py

Created May 7, 2020 23:07

	# http://docs.wand-py.org/en/0.5.9/
	# http://www.imagemagick.org/script/formats.php
	# brew install freetype imagemagick
	# pip3 install Pillow  (the maintained PIL fork; installed with pip, not brew)
	# brew install tesseract
	# pip3 install wand
	# pip3 install pyocr
	import pyocr.builders
	import requests
	from io import BytesIO

vitalibertas / getApiResults.py

Last active March 11, 2020 16:23

Python API Download Zipped JSON file, Unzip and Format for Redshift, Upload to S3 as GZip.

	gz_buffer = BytesIO()
	json_buffer = StringIO()
	download_url = "{0}{1}/file".format(request_url, file_id)
	request_download = requests.request("GET", download_url, headers=json_header, stream=True)
	with zipfile.ZipFile(BytesIO(request_download.content), mode='r') as z:
	    unzip_file = StringIO(z.read(z.infolist()[0]).decode('utf-8'))

	json_responses = json.load(unzip_file)['responses']
	for response in json_responses:
	    json_buffer.write(json.dumps(response))
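The description mentions uploading to S3 as GZip, but the preview ends before that step. Below is a minimal sketch of the compression half only, assuming the text collected in `json_buffer` is what gets shipped. The function name `gzip_text_buffer` and the sample records are hypothetical, and the S3 upload itself is not shown.

```python
import gzip
import json
from io import BytesIO, StringIO


def gzip_text_buffer(text_buffer: StringIO) -> BytesIO:
    """Compress the contents of a text buffer into an in-memory gzip stream."""
    gz_buffer = BytesIO()
    # GzipFile writes the gzip container into gz_buffer; leaving the
    # with-block flushes the trailer but keeps gz_buffer open.
    with gzip.GzipFile(fileobj=gz_buffer, mode="wb") as gz_file:
        gz_file.write(text_buffer.getvalue().encode("utf-8"))
    gz_buffer.seek(0)  # rewind so a reader or uploader starts at byte 0
    return gz_buffer


# Hypothetical sample records standing in for the API's 'responses' list.
sample_buffer = StringIO()
for record in [{"id": 1}, {"id": 2}]:
    sample_buffer.write(json.dumps(record))

compressed = gzip_text_buffer(sample_buffer)
```

The returned `BytesIO` is a file-like object, so it could be handed to whatever upload client is in use without touching disk.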

vitalibertas / python_venv.md

Last active November 4, 2019 17:57

Python3 Virtualenv Setup

$ brew install python3

vitalibertas / gist:b16ed8a13d7d2d0516ef2d1b57b60402

Last active August 30, 2018 17:15

Readability: Python List Comprehension vs. Not

	# List Comprehension:
	process_dict = dict([(attributes.filename, attributes.st_size) for attributes in file_list if attributes.filename.startswith('solcon')])

	# Whitespace Generous:
	process_dict = {}
	for attributes in file_list:
	    if attributes.filename.startswith('solcon'):
	        process_dict[attributes.filename] = attributes.st_size
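A third option sits between the two: a dict comprehension builds the mapping directly, without the intermediate list of tuples. This is a sketch only; `FileAttributes` and the sample entries are hypothetical stand-ins for whatever objects `file_list` really holds.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects in file_list; only the two
# attributes used by the gist (filename, st_size) are modelled.
FileAttributes = namedtuple("FileAttributes", ["filename", "st_size"])

file_list = [
    FileAttributes("solcon_2018.csv", 1024),
    FileAttributes("readme.txt", 12),
    FileAttributes("solcon_2017.csv", 2048),
]

# Dict comprehension: same filter and same key/value pairs as both versions above.
process_dict = {
    attributes.filename: attributes.st_size
    for attributes in file_list
    if attributes.filename.startswith("solcon")
}
```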

vitalibertas / gist:4eff16e088aca0122d8c167c7977c4c4

Created September 28, 2017 19:27

Hive row_number() in place of an aggregate to determine the maximum event time for each ID per day.

	SET hive.execution.engine = mr;
	SET hive.support.concurrency = false;
	SET hive.exec.parallel = true;
	SET hive.exec.dynamic.partition.mode=nonstrict;

	USE hosting_stats;

	WITH Rank AS (
	    SELECT
	        cid
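The query preview is cut off, so the sketch below only illustrates the general idea in Python rather than the author's actual SQL: order the rows within each (ID, day) partition by event time descending and keep the row that would receive row number 1. The row layout `(cid, event_day, event_time)` and the sample values are assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (cid, event_day, event_time).
rows = [
    ("a", "2017-09-27", "2017-09-27 08:00:00"),
    ("a", "2017-09-27", "2017-09-27 17:30:00"),
    ("a", "2017-09-28", "2017-09-28 09:15:00"),
    ("b", "2017-09-27", "2017-09-27 11:45:00"),
]


def latest_per_id_per_day(rows):
    """Keep the row with the latest event_time for each (cid, event_day)."""
    # Two stable sorts: event_time descending first, then the partition keys.
    # Within a partition the newest row therefore comes first.
    ordered = sorted(rows, key=itemgetter(2), reverse=True)
    ordered = sorted(ordered, key=itemgetter(0, 1))
    # The first row of each group is the one a row_number() of 1 would select.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0, 1))]


latest = latest_per_id_per_day(rows)
```

Unlike a plain `MAX(event_time)` aggregate, this keeps the whole winning row, so any other columns on that row stay available.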

vitalibertas / gist:3653fcf459647ca533dc81e8edf69dd5

Last active August 13, 2017 04:10

Hive query that uses arrays, aggregates, and windowing to determine customer onboarding category.

	WITH Landing AS (
	    SELECT
	        visit_id
	        ,COLLECT_SET(shopper_id) AS shopper_array
	        ,MIN(sequence) AS min_sequence
	    FROM
	        visits
	    WHERE
	        page_type = 'landing'
	    GROUP BY

vitalibertas / gist:d17bba1219ed22e42f6b018608b85b96

Created August 13, 2017 03:51

Check HDFS for a specific file a set number of times before erroring out, so that code depending on that file does not run without it.

	CHECK_HDFS="/some/path/to/file"

	function hdfsCheck {
	    RETRY=0

	    while [ $RETRY -lt 9 ];
	    do
	        COUNT=$(hdfs dfs -ls "${CHECK_HDFS}" | wc -l) 2> stderr.txt

	        if [ $COUNT -lt 1 ]; then
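The script preview stops inside the loop, so the rest of its logic is unknown. The same bounded-retry idea can be sketched in Python as follows. The helper name `wait_for`, the 60-second default delay, and the injectable `sleep` parameter are all assumptions made for illustration; only the limit of 9 attempts comes from the loop condition above.

```python
import time


def wait_for(check, retries=9, delay_seconds=60.0, sleep=time.sleep):
    """Call check() up to `retries` times; return True as soon as it succeeds.

    Returns False if every attempt fails, so the caller can abort before
    running anything that depends on the missing file.
    """
    for attempt in range(retries):
        if check():
            return True
        # Wait between attempts, but not after the final one.
        if attempt < retries - 1:
            sleep(delay_seconds)
    return False
```

A real `check` could wrap `subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0`, which asks HDFS whether the path exists; passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.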

vitalibertas / gist:732c0b2a251f480c287ca6418ab1be65

Created August 13, 2017 03:45

Bash script that stays polite during work hours: it does the hour math to sleep until 5:00 pm.
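No code is shown for this gist, so the following is only a sketch of what "hour math" could look like, written in Python rather than Bash. The function name `seconds_until` and the choice to return zero once the target hour has passed are assumptions.

```python
from datetime import datetime


def seconds_until(hour, now):
    """Seconds from `now` until `hour`:00 on the same day, or 0.0 if already past."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        return 0.0  # the target hour has passed, so there is nothing to wait for
    return (target - now).total_seconds()


# Example: at 2:30 pm there are 2.5 hours (9000 seconds) left until 5:00 pm.
wait = seconds_until(17, datetime(2017, 8, 13, 14, 30))
```

The result can be passed straight to `time.sleep`, with `datetime.now()` supplying the current time in real use.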

Cory Brickner vitalibertas