Francisco Lopez fran0x

Cheat Sheet iTerm2

To install iTerm2 on OS X, run `brew install caskroom/cask/iterm2` (requires Homebrew to be installed first). On recent Homebrew versions the equivalent command is `brew install --cask iterm2`.

| Action | Command |
| --- | --- |
| Vertical split | `Command + d` |
| Horizontal split | `Command + Shift + d` |
| Close the screen | `Command + w` |
| Move around screens | `Command + Alt + (up/down/left/right)` |

I hereby claim:

* I am flopezlasanta on github.
* I am flopezlasanta (https://keybase.io/flopezlasanta) on keybase.
* I have a public key whose fingerprint is 55A8 3CF8 344E 834A 3E00 ED65 3FD4 E16E 77EA DB72

To claim this, I am signing this object:

Install Python3, Scala and Apache Spark via Brew (http://brew.sh/)

brew update
brew install python3
brew install scala
brew install apache-spark

Set environment variables

	#!/bin/bash

	# Configuration
	#export DIGITALOCEAN_ACCESS_TOKEN= # Digital Ocean Token (mandatory to provide)
	export DIGITALOCEAN_SIZE=512mb # default
	export DIGITALOCEAN_REGION=nyc3 # default
	export DIGITALOCEAN_PRIVATE_NETWORKING=true # default=false
	#export DIGITALOCEAN_IMAGE="ubuntu-15-04-x64" # default
	# For other settings see defaults in https://docs.docker.com/machine/drivers/digital-ocean/

	// Measure.time measures how long a block of code takes to complete (in nanoseconds)
	// note: this version does not return the result of the block; a separate version would be needed for that
	object Measure {
	  def time(block: => Unit): Long = {
	    val s = System.nanoTime
	    block
	    System.nanoTime - s
	  }
	}

	// Control.using automatically closes any resource that has a close method
	// note: from the book "Beginning Scala" (by David Pollak)
	object Control {

	  import scala.language.reflectiveCalls

	  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
	    try {
	      f(param)
	    } finally {
	      param.close()
	    }
	}

	# load the "orders" table from Hive into a DataFrame
	orders_df=sqlCtx.sql("select * from orders")
	orders_df.printSchema()

	# 1) calculate the number of orders in SUSPECTED_FRAUD status
	sqlCtx.sql("select count(order_id) from orders where order_status='SUSPECTED_FRAUD'").show(5)

	# load the "order_items" table from Hive into a DataFrame
	order_items_df=sqlCtx.sql("select * from order_items")
	order_items_df.printSchema()

	# copy the Hive configuration file hive-site.xml to the spark configuration folder
	# sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

	# launch pyspark with the spark-csv package (note: version 1.2.0 has some issues, so 1.3.0 is preferred)
	# PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0

	# check dataframes are working
	sqlCtx.createDataFrame([("somekey", 1)])

	# load yelp dataset

	import re

	# note: split_regex is not defined in this snippet and must be provided elsewhere
	def simpleWordTokenizer(string):
	    """ A simple (list-comprehension) implementation of input string tokenization
	    Args:
	        string (str): input string
	    Returns:
	        list: a list of tokens in lowercase and no empty strings
	    """
	    return [x for x in re.split(split_regex, string.lower()) if x]

	starWarsDarkSide = 'Only at the end do you realize the power of the Dark Side.'
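
The tokenizer above depends on a `split_regex` that the snippet does not show. Here is a minimal runnable sketch, assuming `split_regex` is `r'\W+'` (runs of non-word characters). That value is an assumption for illustration and may differ from the author's.

```python
import re

# Assumption: split_regex is not shown in the original snippet; r'\W+' splits
# on runs of non-word characters and is used here only for illustration.
split_regex = r'\W+'

def simpleWordTokenizer(string):
    """Split a string into lowercase tokens, dropping empty strings."""
    return [x for x in re.split(split_regex, string.lower()) if x]

starWarsDarkSide = 'Only at the end do you realize the power of the Dark Side.'
tokens = simpleWordTokenizer(starWarsDarkSide)
print(tokens)
# ['only', 'at', 'the', 'end', 'do', 'you', 'realize', 'the', 'power', 'of', 'the', 'dark', 'side']
```

The `if x` filter is what removes the empty string that `re.split` leaves after the trailing period.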

	import java.io.File
	import java.io.PrintWriter

	import scala.annotation.migration
	import scala.collection.immutable.ListMap
	import scala.collection.mutable.Map

	object JoinDuplicatedLines {
	  def main(args: Array[String]): Unit = {
	    val input = io.Source.fromFile("input.csv")