Francisco Lopez fran0x

Install Python3, Scala and Apache Spark via Brew (http://brew.sh/)

brew update
brew install python3
brew install scala
brew install apache-spark

Set environment variables

Keybase proof

I hereby claim:

I am flopezlasanta on github.
I am flopezlasanta (https://keybase.io/flopezlasanta) on keybase.
I have a public key whose fingerprint is 55A8 3CF8 344E 834A 3E00 ED65 3FD4 E16E 77EA DB72

To claim this, I am signing this object:

Cheat Sheet iTerm2

To install iTerm2 in OS X run `brew install caskroom/cask/iterm2` (requires the almighty Homebrew installed first).

| Action | Command |
|---|---|
| Vertical split | `Command + d` |
| Horizontal split | `Command + Shift + d` |
| Close the screen | `Command + w` |
| Move around screens | `Command + Alt + (up/down/left/right)` |

From https://github.com/spark-jobserver/spark-jobserver#getting-started-with-spark-job-server:

The easiest way to get started is to try the Docker container which prepackages a Spark distribution with the job server and lets you start and deploy it.

➜  spark-jobserver git:(master) docker-machine version
docker-machine version 0.7.0, build a650a40

// https://gist.github.com/radekg/ec5a1575c450a48e5cba

From http://stackoverflow.com/a/32393044/1305344:

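import scala.io.StdIn.readLine

object size extends App {
  // Allocate a million tuples so the heap has something worth inspecting
  (1 to 1000000).map(i => ("foo" + i, ()))
  // Block on input so the JVM stays alive while it is inspected from outside
  val input = readLine("prompt> ")
}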

Run it with sbt 'runMain size', then use jps (to find the PIDs), jstat -gc pid (to query GC statistics) and jmap (heap details, complementary to jstat) to analyze resource allocation.
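
As a complement to the external tools, the heap can also be queried from inside the JVM. The following is only a rough sketch: `HeapReport`, `usedBytes` and `measure` are hypothetical names that are not part of the original answer, and the numbers from `Runtime` are coarse compared to what jstat or jmap report.

```scala
object HeapReport {
  // Bytes of heap currently in use, as seen from inside the JVM
  def usedBytes(): Long = {
    val rt = Runtime.getRuntime
    rt.totalMemory - rt.freeMemory
  }

  // Allocates n tuples like the snippet above and returns
  // (number of tuples, change in used heap in bytes).
  // The change can be negative if a GC cycle runs in between.
  def measure(n: Int): (Int, Long) = {
    val before = usedBytes()
    val data = (1 to n).map(i => ("foo" + i, ()))
    val after = usedBytes()
    (data.size, after - before)
  }
}
```

Calling `HeapReport.measure(1000000)` gives a first impression; jstat -gc remains the better source for per-generation figures.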

Introducing Apache Spark

What use cases are a good fit for Apache Spark? How do you work with Spark?
- Create RDDs, transform them, and execute actions to get the result of a computation
- All computations happen in memory = "memory is cheap" (we do need enough memory to fit all the data in)
  - the fewer disk operations, the faster (you do know it, don't you?)
- You develop such computation flows or pipelines using a programming language (Scala, Python or Java) <-- that's where the ability to write code is paramount
- Data usually lives on a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
- Data mining = analysis / insights / analytics
  - log mining
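
The "transform, then execute an action" flow can be sketched without a cluster. The example below uses a lazy view over a plain Scala collection as a stand-in for an RDD, so it is not Spark code; `LogMining`, `errorMessages` and the sample log lines are hypothetical. Spark's RDD offers `filter` and `map` as well, so a real job would likely have a similar shape.

```scala
object LogMining {
  // Builds a lazy pipeline (the "transformations") and then forces it
  // with toList (the "action"). Nothing is computed before toList.
  def errorMessages(lines: Seq[String]): List[String] = {
    val pipeline = lines.view
      .filter(_.startsWith("ERROR"))        // keep only error lines
      .map(_.stripPrefix("ERROR").trim)     // drop the level, keep the message
    pipeline.toList
  }
}
```

For example, `LogMining.errorMessages(Seq("INFO started", "ERROR disk full"))` keeps only the message of the second line.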

How much of machine learning is statistics and vice versa?

Learning using https://www.coursera.org/learn/machine-learning/home/welcome

Machine learning = teaching a computer to learn concepts using data, without being explicitly programmed.
Supervised learning = "right answers" given
Regression problem
- continuous valued output
- deduce the function for a given data set and predict other values
"in regression problems, we are taking input variables and trying to map the output onto a continuous expected result function."
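
To make "deduce the function for a given data set and predict other values" concrete, here is a minimal sketch of a one-variable least-squares fit. It is one simple way to do regression and is not taken from the course material; `LinearFit`, `fit` and `predict` are hypothetical names.

```scala
object LinearFit {
  // Fits y = intercept + slope * x by ordinary least squares.
  // Returns (intercept, slope).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired points")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx != 0.0, "xs must not all be equal")
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }

  // Predicts a continuous output for a new input
  def predict(model: (Double, Double), x: Double): Double =
    model._1 + model._2 * x
}
```

The training pairs play the role of the "right answers"; `predict` then maps an unseen input onto the fitted line.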

	// Control.using automatically closes any resource that has a close() method
	// note: from the book "Beginning Scala" (by David Pollak)
	object Control {

	  import scala.language.reflectiveCalls

	  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
	    try {
	      f(param)
	    } finally {
	      param.close()
	    }
	}
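
A possible usage sketch, assuming Scala 2.12 or 2.13 (where `scala.language.reflectiveCalls` enables structural types). The `using` method is repeated so the block compiles on its own, and `TrackedResource` is a hypothetical class that exists only to show that `close()` gets called.

```scala
import scala.language.reflectiveCalls

object ControlDemo {
  // Same shape as Control.using, repeated so this sketch is self-contained
  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
    try f(param) finally param.close()

  // Hypothetical resource that records whether close() was called
  class TrackedResource {
    var closed = false
    def read(): String = "data"
    def close(): Unit = { closed = true }
  }
}
```

With `val r = new ControlDemo.TrackedResource`, the call `ControlDemo.using(r)(_.read())` returns the value read and leaves `r.closed` set, also when the function throws.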

	// Measure.time measures how long a block of code takes to complete (in nanoseconds)
	// note: this version does not return the result of the block; a different version should be created for that
	object Measure {
	  def time(block: => Unit): Long = {
	    val start = System.nanoTime
	    block
	    System.nanoTime - start
	  }
	}
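
The note above asks for a version that also returns the result of the block. One way this could look is sketched below; `MeasureResult` and `timed` are hypothetical names, not from the original gist.

```scala
object MeasureResult {
  // Runs the block once and returns its result together with
  // the elapsed time in nanoseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime
    val result = block
    (result, System.nanoTime - start)
  }
}
```

Usage: `val (sum, nanos) = MeasureResult.timed { (1 to 1000).sum }`. Returning a tuple keeps the helper free of side effects such as printing.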

	#!/bin/bash

	# Configuration
	#export DIGITALOCEAN_ACCESS_TOKEN= # Digital Ocean Token (mandatory to provide)
	export DIGITALOCEAN_SIZE=512mb # default
	export DIGITALOCEAN_REGION=nyc3 # default
	export DIGITALOCEAN_PRIVATE_NETWORKING=true # default=false
	#export DIGITALOCEAN_IMAGE="ubuntu-15-04-x64" # default
	# For other settings see defaults in https://docs.docker.com/machine/drivers/digital-ocean/