- Do you have a GitHub account? If not, create one.
- Install the required tools:
  - Latest Git client
  - GPG tools (a key setup sketch follows the install commands below)
# Ubuntu
sudo apt-get install gpa seahorse
# macOS with https://brew.sh/ (gnupg formula assumed)
brew install gnupg
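Once GPG is installed, generate a key and tell Git to sign your commits with it. A minimal sketch (the key ID below is a placeholder):
# Generate a key pair, then list it to find its long key ID
gpg --full-generate-key
gpg --list-secret-keys --keyid-format=long
# Point Git at the key and enable commit signing (replace ABCDEF1234567890 with your key ID)
git config --global user.signingkey ABCDEF1234567890
git config --global commit.gpgsign true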
sudo cp /etc/zeppelin/conf/zeppelin-site.xml.template /etc/zeppelin/conf/zeppelin-site.xml
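After copying the template, edit the new zeppelin-site.xml for your environment, e.g. zeppelin.server.port (the web UI port, default 8080). A minimal sketch (editor choice and a systemd-managed service are assumptions):
# Adjust properties such as zeppelin.server.port in the copied file
sudo nano /etc/zeppelin/conf/zeppelin-site.xml
# Restart Zeppelin so the change takes effect
sudo systemctl restart zeppelin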
# Install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
conda update -y -n base conda
# Install Jupyter
conda create -y -n jupyter python=3.5 jupyter nb_conda
# Start a detached screen session for the notebook server (Jupyter is launched into it below)
screen -dmS jupyter
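With the environment created and the detached screen session running, the notebook server can be started inside it. A minimal sketch (the port and the use of screen's stuff command to type into the session are assumptions):
# Send a command into the detached session: activate the env and launch Jupyter
screen -S jupyter -X stuff $'source activate jupyter && jupyter notebook --no-browser --port 8888\n'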
Append the following to your spark-submit (or gatk-launch) options, replacing 5005 with a different available port if necessary:
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
This will suspend the driver until it gets a remote connection from IntelliJ.
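For example, a debuggable submission might look like the sketch below (the class and jar names are placeholders); in IntelliJ, attach with a Remote JVM Debug run configuration pointing at port 5005:
spark-submit \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  --class com.example.MyJob \
  my-job.jar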
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0
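A bonded setup also needs a stanza for bond0 itself; a minimal sketch (address, netmask, gateway, and bonding mode are assumptions):
auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    gateway 192.168.1.1
    bond-mode active-backup
    bond-miimon 100
    bond-slaves eth0 eth1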
import datetime
from jinja2 import Environment
start = datetime.datetime.strptime("2017-02-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-07-24", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end - start).days + 1)]
template = """spark-submit --master yarn --deploy-mode cluster --class com.xyz.XXXAPP s3://com.xyz/aa-1.5.11-all.jar --input-request-events s3://com.xyz/data/event_{{date_str}}/* --input-geofence-events s3://com.xyz/data2/event_/{{date_str}}/* --output s3://com.xyz/output/{{date_str}}"""
# Render one spark-submit command for each day in the range
for date in date_generated:
    print(Environment().from_string(template).render(date_str=date.strftime("%Y-%m-%d")))
Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).
To store notebooks on S3, use:
--notebook-dir <s3://your-bucket/folder/>
To store notebooks in a directory different from the user’s home directory, use:
--notebook-dir <local directory>
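These flags are typically passed as arguments to the Jupyter install script supplied as an EMR bootstrap action. A hypothetical sketch (bucket name, script path, release label, and instance settings are all assumptions):
aws emr create-cluster \
  --name jupyter-on-emr \
  --release-label emr-5.14.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://your-bucket/install-jupyter.sh,Args=[--notebook-dir,s3://your-bucket/notebooks/]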
# Clone every repository of a Bitbucket account in one go (replace USERNAME and PASSWORD; uses the Bitbucket 1.0 REST API)
curl -s -k https://USERNAME:[email protected]/1.0/user/repositories | python -c 'import sys, json, os; r = json.loads(sys.stdin.read()); [os.system("git clone %s" % d["resource_uri"].replace("/1.0/repositories","https://USERNAME:[email protected]")+".git") for d in r]'
# Example PySpark job submission on YARN with tuned memory, retry, and speculation settings
spark-submit --master yarn --deploy-mode cluster --name pyspark_job \
  --driver-memory 2G --driver-cores 2 \
  --executor-memory 12G --executor-cores 5 --num-executors 10 \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.task.maxFailures=36 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.network.timeout=800s \
  --conf spark.scheduler.listenerbus.eventqueue.size=500000 \
  --conf spark.speculation=true \
  --py-files lib.zip,lib1.zip,lib2.zip spark_test.py
import sys
import pyspark
from pyspark.sql import SQLContext
# Minimal skeleton of spark_test.py: create the contexts used by the job
sc = pyspark.SparkContext(appName="pyspark_job")
sqlContext = SQLContext(sc)