Tested with Apache Spark 2.1.0, Python 2.7.13, and Java 1.8.0_112.
For older versions of Spark and IPython, see the previous version of this text.
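To confirm the versions on your own setup, a quick sanity check from Python might look like this (a sketch, assuming pyspark is importable, e.g. via a configured SPARK_HOME):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="version-check")
print(sc.version)   # 2.1.0 on the setup described above
print(sys.version)  # 2.7.13 here
sc.stop()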
#!/bin/sh
#############################################
# Output file for HTML5 video               #
# Requirements:                             #
#   - handbrakecli                          #
#                                           #
# usage:                                    #
#   ./html5VideoHandBrakeFolder.sh folder   #
#                                           #
#############################################
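A minimal body for this script might look like the following sketch; the preset name, extension handling, and output location are assumptions, not part of the original:

FOLDER="$1"

for f in "$FOLDER"/*; do
  # Skip anything that is not a regular file.
  [ -f "$f" ] || continue
  # Re-encode to an MP4 that the HTML5 <video> tag can play;
  # "Normal" is a stock HandBrake preset, adjust as needed.
  HandBrakeCLI -i "$f" -o "${f%.*}.mp4" --preset="Normal"
done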
[
    {
        "keys": ["super+b"],
        "command": "build",
        "context": [
            { "key": "selector", "operator": "equal", "operand": "source.c++" }
        ],
        "args": {
            "build_system": "Packages/C++/C++.sublime-build",
            "variant": "Build"
        }
    }
]
Over the last few years I've been quite involved with using Hive for big data analysis.
I've read many web tutorials and blogs about using Hadoop/Hive/Pig for data analysis, but all of them seem to be oversimplified and targeted at a "my first Hive query" kind of audience, instead of showing how to structure Hive tables and queries for real-world use cases, e.g. years of data, recurring batch jobs that build aggregate/reporting tables, and dealing with late-arriving data.
Most of these tutorials look something like this:
Twitter data -> HDFS/external Hive table -> Hive query -> results.
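In code, that canonical tutorial pipeline boils down to something like the sketch below (table and column names are invented for illustration):

-- External table over raw tweets already sitting in HDFS
CREATE EXTERNAL TABLE tweets (
  user_name  STRING,
  tweet_text STRING,
  created_at STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/tweets';

-- "My first Hive query": a one-off aggregate straight to the console
SELECT user_name, count(*) AS tweet_count
FROM tweets
GROUP BY user_name
ORDER BY tweet_count DESC
LIMIT 10;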
# Send
openssl aes-256-cbc -salt -a -e -in /path/to/file | nc -l 3333
# Receive
nc {ip} 3333 | openssl aes-256-cbc -salt -a -d -out /path/to/file
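Both ends prompt for the passphrase interactively; for scripting, openssl's -pass option can read the secret from a file instead (the file name here is an assumption):

# Send, reading the passphrase from a shared secret file
openssl aes-256-cbc -salt -a -e -pass file:secret.txt -in /path/to/file | nc -l 3333
# Receive, using the same secret file
nc {ip} 3333 | openssl aes-256-cbc -salt -a -d -pass file:secret.txt -out /path/to/file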
client
dev tun
remote example.com
resolv-retry infinite
nobind
persist-key
persist-tun
ca [inline]
cert [inline]
key [inline]
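The [inline] directives tell OpenVPN to read the corresponding PEM material from tag-delimited blocks later in the same file, along these lines (certificate bodies elided):

<ca>
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
</ca>
<cert>
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
</cert>
<key>
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
</key>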
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request
from scrapy.selector import Selector
import urllib2
import re
import PyV8
import json
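Taken together, the imports suggest a spider that digs a JavaScript data structure out of a page and evaluates it with PyV8; a minimal sketch under that assumption (spider name, URL, and regex are invented for illustration):

class JsDataSpider(scrapy.Spider):
    name = "js_data"
    start_urls = ["http://example.com/page"]

    def parse(self, response):
        # Grab the inline <script> text that defines the data variable.
        script = Selector(response).xpath("//script/text()").extract_first()
        match = re.search(r"var\s+data\s*=\s*(.+?);", script or "", re.S)
        if match:
            # Evaluate the JS expression in a V8 context, then round-trip
            # it through JSON to get plain Python objects.
            ctxt = PyV8.JSContext()
            ctxt.enter()
            data = json.loads(ctxt.eval("JSON.stringify(%s)" % match.group(1)))
            ctxt.leave()
            yield data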
<property>
  <name>hive.vectorized.groupby.flush.percent</name>
  <value>0.1</value>
</property>
<property>
  <name>hive.vectorized.groupby.maxentries</name>
  <value>10240</value>
</property>
<property>
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <!-- Value truncated in the original; 300 is the stock Tez default. -->
  <value>300</value>
</property>
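For experimenting, the Hive settings can also be changed per session before being baked into hive-site.xml (the Tez timeout is a cluster-side setting and stays in the XML):

SET hive.vectorized.groupby.flush.percent=0.1;
SET hive.vectorized.groupby.maxentries=10240;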
-- Putting percentile_approx() directly in the select list alongside
-- non-aggregated columns makes Hive expect a GROUP BY on account_number
-- and sales, and raises a missing-GROUP-BY error; cross joining the
-- aggregate in as a one-row subquery avoids that.
select
  account_number,
  sales,
  CASE WHEN sales > a.sales_90th_percentile THEN 1 ELSE 0 END as top10pct_sales
from sales
cross join (select percentile_approx(sales, .9) as sales_90th_percentile from sales) a;
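On Hive 0.11 and later, a roughly equivalent flag can be computed without the cross join by using a window function (a sketch, not from the original; note percent_rank() ranks exact values rather than using the approximate percentile):

select
  account_number,
  sales,
  CASE WHEN pr >= 0.9 THEN 1 ELSE 0 END as top10pct_sales
from (
  select account_number, sales,
         percent_rank() over (order by sales) as pr
  from sales
) t;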
The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|s
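To actually use the single-line version from Python, something like the following works (a sketch; the pattern above is truncated, so paste the full single-line regex into URL_PATTERN):

import re

URL_PATTERN = r"..."  # paste the full single-line pattern here

# (?i) inside the pattern already makes the match case-insensitive.
url_re = re.compile(URL_PATTERN)

text = "See http://example.com/docs and www.example.org for details."
for match in url_re.finditer(text):
    print match.group(0)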