@pierdom
pierdom / rotate_axis_labels.py
Last active September 3, 2018 09:12
[Rotate axis labels in Seaborn/Matplotlib] Useful when labels are long (e.g., long names, full DD/MM/YYYY HH:mm dates, etc.) and don't fit on the axis. #python #matplotlib #visualization
import seaborn as sns
# pois_averages: DataFrame with columns "poi", "visitor_cnt" and "day_type"
g = sns.boxplot(x="poi", y="visitor_cnt", hue="day_type", data=pois_averages, palette="Set1")
for item in g.get_xticklabels():
    item.set_rotation(60)
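
If the boxplot was just drawn on the current axes, the same rotation can also be done with a single pyplot call (an equivalent alternative, not part of the original gist):

import matplotlib.pyplot as plt
plt.xticks(rotation=60)  # rotates the x tick labels of the current axes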
@pierdom
pierdom / local_spark_jupyter.md
Last active September 7, 2017 07:49
[PySpark and Jupyter single node] Quick and dirty notes for running single-node Spark and a Jupyter notebook with PySpark #python #spark #bigdata #sysadmin

Running single-node Spark and a Jupyter notebook with PySpark

Prerequisites:

  1. Java Development Kit (JDK): download at http://www.oracle.com/technetwork/java/javase/downloads/index.html
  2. Spark pre-built for Hadoop 2.6 (in case not already installed): download at http://spark.apache.org/downloads.html
  3. Jupyter notebook: sudo pip install jupyter (or sudo pip3 install jupyter for Python 3)

In this example, I have extracted the tar.gz file in /opt. Remember to use the correct path for the $SPARK_HOME environment variable (see the sketch below), depending on where you installed it.
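
The notes are cut off in this preview; what follows is a minimal sketch of the launch, assuming Spark pre-built for Hadoop 2.6 extracted in /opt (the exact directory name is a placeholder) and using Spark's standard PYSPARK_DRIVER_PYTHON hook:

# placeholder path: adjust to the version you extracted
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

# make pyspark start Jupyter as the driver instead of the plain REPL
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# new notebooks will have sc already defined
pyspark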

@pierdom
pierdom / plot_precounted_hist.py
Last active August 3, 2023 11:56
[Plot histograms with pre-computed counters] Plot histograms with the Matplotlib hist function or the Seaborn distplot function from pre-counted values, using the 'weights' argument. Very useful for plotting distributions of values queried from a very large dataset, where it is impossible to retrieve and load in memory every element of the distribution individually.
#!/usr/bin/env python3
import matplotlib.pyplot as plt
import seaborn as sns
# dictionary with pre-counted bins
test = {1:1,2:1,3:1,4:2,5:3,6:5,7:4,8:2,9:1,10:1}
# with matplotlib
plt.hist(list(test.keys()), weights=list(test.values()))
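
The preview stops at the Matplotlib call; the Seaborn distplot variant mentioned in the description can be sketched the same way, using hist_kws to forward the weights to plt.hist (keys are the bin positions, values their counts):

# with seaborn (hist_kws is forwarded to plt.hist; kde disabled since it would ignore the weights)
sns.distplot(list(test.keys()), bins=10, kde=False, hist_kws={"weights": list(test.values())})
plt.show()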
@pierdom
pierdom / pandas_convert_time.py
Last active September 7, 2017 07:47
[From string to datetime in Pandas] Convert a column in a Pandas DataFrame from string (object) to datetime64 type. #python #datascience #pandas
df["time_column"] = pd.to_datetime(pd.Series(df["time_column"]))
@pierdom
pierdom / jupyter_code_toggler.py
Last active September 7, 2017 07:47
[Code toggle in Jupyter] Add the following code to a cell in a Jupyter notebook (with a Python kernel). It lets you toggle the code cells, leaving only markdown output and figures visible. Very convenient when exporting a notebook to HTML. #jupyter #python
from IPython.display import HTML
HTML('''<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Toggle code cells on/off">
</form>''')
@pierdom
pierdom / fix_python_rocket_icon_mac.md
Last active January 24, 2019 18:22
[Matplotlib and Jupyter on Mac: rocket icon] Avoid the Python rocket icon in the Mac dock when running Python scripts that use Matplotlib (very frequent when running Jupyter Notebook). The problem occurs because Python thinks that something interactive (e.g., a GUI) is going on, while we are simply plotting in-line in Jupyter. The fix below tells Matplotlib to use the non-interactive Agg backend.

Change import matplotlib.pyplot as plt to:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
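
The order matters: matplotlib.use("Agg") must run before the first import of matplotlib.pyplot, because in the Matplotlib versions this note targets, once pyplot has loaded an interactive backend the rocket icon has already appeared and selecting another backend has no effect.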
@pierdom
pierdom / git_ignore_files.md
Last active September 7, 2017 07:46
[Ignore files in git] Remove and ignore specific file names (e.g., .DS_Store files on Macs) from git repositories #git #sysadmin

To remove all '.DS_Store' files from an existing git repo:

find . -name .DS_Store -print0 | xargs -0 git rm --ignore-unmatch

To permanently ignore '.DS_Store' files in all git repos:

echo .DS_Store >> ~/.gitignore_global
git config --global core.excludesfile ~/.gitignore_global
@pierdom
pierdom / jupyter_pyspark.md
Last active January 28, 2020 21:35
[Run Jupyter with Pyspark integration] This how-to assumes that pyspark is installed and correctly configured to access the cluster (or the stand-alone configuration). Jupyter and other Python packages are executed in a virtualenv. #python #spark #bigdata #sysadmin

Jupyter + Pyspark how-to

Version 1.0 2016/11/14
Pierdomenico Fiadino | [email protected]

Synopsis

Install Jupyter Notebook in a dedicated Python virtualenv and integrate it with the Spark shell pyspark on a cluster client (for this example, we will use the tourism-lab node).

This how-to assumes that we have SSH access to the machine and that pyspark is already installed and configured (try executing it and check whether the variables sc, HiveContext and sqlContext are already defined). A sketch of the setup follows.
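
The gist preview ends here; below is a minimal sketch of the steps it describes, with the virtualenv path and notebook port as made-up placeholders:

# create and activate a dedicated virtualenv (path is a placeholder)
virtualenv ~/venvs/jupyter
source ~/venvs/jupyter/bin/activate
pip install jupyter

# make pyspark start Jupyter as its driver instead of the plain REPL
export PYSPARK_DRIVER_PYTHON=$(which jupyter)
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888"

# notebooks opened from here get sc / sqlContext predefined
pyspark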

@pierdom
pierdom / yarn_scheduler_notes.md
Last active September 6, 2017 12:36
[Yarn Capacity Scheduler configs] and how to change it to Fair Scheduler. #bigdata #yarn #sysadmin

My Capacity Scheduler configs

Field: yarn.resourcemanager.scheduler.class Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

The capacity scheduler is configured with 3 queues (default 75%, batch1 25%, batch2 0%), one of which (batch2) is currently unused. Here is the config:

yarn.scheduler.capacity.root.queues=batch1,batch2,default
yarn.scheduler.capacity.root.default.user-limit-factor=1
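
The preview stops after the first two properties; based on the percentages stated above, the per-queue capacities would look roughly like this (property names follow the standard yarn.scheduler.capacity.<queue-path>.capacity scheme):

yarn.scheduler.capacity.root.default.capacity=75
yarn.scheduler.capacity.root.batch1.capacity=25
yarn.scheduler.capacity.root.batch2.capacity=0

Switching to the Fair Scheduler, as the title mentions, amounts to pointing yarn.resourcemanager.scheduler.class at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler instead.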
@pierdom
pierdom / hive_optimization.md
Last active May 28, 2020 12:29
[Hive performance tuning]. Source: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ #bigdata #hive #sysadmin

Five ways to tune Hive performance

1. Use Tez

set hive.execution.engine=tez;

2. Store tables as ORC
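
The listing is cut off here; as an illustrative sketch, an existing table can be rewritten in ORC format with a CTAS statement (the table names are made up):

CREATE TABLE visits_orc STORED AS ORC AS SELECT * FROM visits;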