- Do you have a GitHub account? If not, create one.
- Install the required tools:
  - Latest Git client
  - GPG tools (a key setup sketch follows the install commands below)
# Ubuntu
sudo apt-get install gpa seahorse
# macOS with https://brew.sh/ (gnupg formula assumed)
brew install gnupg
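Once GPG is installed, generate a key and tell Git to sign your commits with it. A minimal sketch (the key ID below is a placeholder):
# Generate a key pair, then list it to find its long key ID
gpg --full-generate-key
gpg --list-secret-keys --keyid-format=long
# Point Git at the key and enable commit signing (replace ABCDEF1234567890 with your key ID)
git config --global user.signingkey ABCDEF1234567890
git config --global commit.gpgsign true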
sudo cp /etc/zeppelin/conf/zeppelin-site.xml.template /etc/zeppelin/conf/zeppelin-site.xml
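After copying the template, edit the new zeppelin-site.xml for your environment, e.g. zeppelin.server.port (the web UI port, default 8080). A minimal sketch (editor choice and a systemd-managed service are assumptions):
# Adjust properties such as zeppelin.server.port in the copied file
sudo nano /etc/zeppelin/conf/zeppelin-site.xml
# Restart Zeppelin so the change takes effect
sudo systemctl restart zeppelin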
# Install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
conda update -y -n base conda
# Install Jupyter
conda create -y -n jupyter python=3.5 jupyter nb_conda
# Start a detached screen session for the notebook server (Jupyter is launched into it below)
screen -dmS jupyter
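With the environment created and the detached screen session running, the notebook server can be started inside it. A minimal sketch (the port and the use of screen's stuff command to type into the session are assumptions):
# Send a command into the detached session: activate the env and launch Jupyter
screen -S jupyter -X stuff $'source activate jupyter && jupyter notebook --no-browser --port 8888\n'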
Append the following to your spark-submit (or gatk-launch) options, replacing 5005 with a different available port if necessary:
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
This will suspend the driver until it gets a remote connection from IntelliJ.
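For example, a debuggable submission might look like the sketch below (the class and jar names are placeholders); in IntelliJ, attach with a Remote JVM Debug run configuration pointing at port 5005:
spark-submit \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  --class com.example.MyJob \
  my-job.jar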
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0
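A bonded setup also needs a stanza for bond0 itself; a minimal sketch (address, netmask, gateway, and bonding mode are assumptions):
auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    gateway 192.168.1.1
    bond-mode active-backup
    bond-miimon 100
    bond-slaves eth0 eth1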
import datetime
from jinja2 import Environment
start = datetime.datetime.strptime("2017-02-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-07-24", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end - start).days + 1)]
template = """spark-submit --master yarn --deploy-mode cluster --class com.xyz.XXXAPP s3://com.xyz/aa-1.5.11-all.jar --input-request-events s3://com.xyz/data/event_{{date_str}}/* --input-geofence-events s3://com.xyz/data2/event_/{{date_str}}/* --output s3://com.xyz/output/{{date_str}}"""
# Render one spark-submit command for each day in the range
for date in date_generated:
    print(Environment().from_string(template).render(date_str=date.strftime("%Y-%m-%d")))
Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).
To store notebooks on S3, use:
--notebook-dir <s3://your-bucket/folder/>
To store notebooks in a directory different from the user’s home directory, use:
--notebook-dir <local directory>
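These flags are typically passed as arguments to the Jupyter install script supplied as an EMR bootstrap action. A hypothetical sketch (bucket name, script path, release label, and instance settings are all assumptions):
aws emr create-cluster \
  --name jupyter-on-emr \
  --release-label emr-5.14.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://your-bucket/install-jupyter.sh,Args=[--notebook-dir,s3://your-bucket/notebooks/]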
# Clone every repository of a Bitbucket account in one go (replace USERNAME and PASSWORD; uses the Bitbucket 1.0 REST API)
curl -s -k https://USERNAME:[email protected]/1.0/user/repositories | python -c 'import sys, json, os; r = json.loads(sys.stdin.read()); [os.system("git clone %s" % d["resource_uri"].replace("/1.0/repositories","https://USERNAME:[email protected]")+".git") for d in r]'
# Example PySpark job submission on YARN with tuned memory, retry, and speculation settings
spark-submit --master yarn --deploy-mode cluster --name pyspark_job \
  --driver-memory 2G --driver-cores 2 \
  --executor-memory 12G --executor-cores 5 --num-executors 10 \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.task.maxFailures=36 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.network.timeout=800s \
  --conf spark.scheduler.listenerbus.eventqueue.size=500000 \
  --conf spark.speculation=true \
  --py-files lib.zip,lib1.zip,lib2.zip spark_test.py
import sys
import pyspark
from pyspark.sql import SQLContext
# Minimal skeleton of spark_test.py: create the contexts used by the job
sc = pyspark.SparkContext(appName="pyspark_job")
sqlContext = SQLContext(sc)