# Apache Spark
- up to 100x faster than Hadoop MapReduce because it keeps intermediate data in memory
- Spark replaces Hadoop MapReduce, not the Hadoop Distributed File System (HDFS)
- Benchmark: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
- cluster management with YARN
## Spark API
- Spark SQL: distributed query engine
- Spark Streaming / Structured Streaming: processing of live streams of data
- Spark ML: machine learning library
## Resilient Distributed Dataset (RDD)
Immutable, partitioned collection of elements that can be operated on in parallel
DataFrames:
- distributed collection of data organized into named columns
- conceptually equivalent to a table in a relational database
- queries are automatically optimized
Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce
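To make the RDD vs. DataFrame contrast concrete, here is a minimal PySpark sketch (app name and sample data are illustrative, not from these notes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: immutable, partitioned collection operated on in parallel
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

# DataFrame: distributed rows organized into named columns, like a SQL table
df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])
df.filter(df.age > 30).show()  # the query plan is optimized automatically
```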
# Spark Runtime
- master/slave architecture: the master runs the "driver", workers run "executors"
# SparkContext
- main entry point of a Spark application
- part of the driver program
- a task is the smallest unit of work
Execution hierarchy: context -> driver -> job -> stage -> task -> executor
- driver: creates the SparkContext and coordinates the application
- job: triggered by an action on an RDD/DataFrame
- stage: a job is split into stages at shuffle boundaries
- task: a stage is split into tasks, one per partition
- executor: runs tasks on worker nodes
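A small sketch of how these pieces relate (app name and data are made up): the driver creates the SparkContext, and each action submits a job that is split into stages and tasks running on executors.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "runtime-demo")   # driver creates the context

data = sc.parallelize(range(100), numSlices=4)  # 4 partitions -> 4 tasks per stage
total = data.map(lambda x: x * 2).sum()         # action: submits a job
print(total)

sc.stop()
```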
---
## Spark Streaming/Structured Streaming
- Discretized streams (DStreams)
- a continuous stream of data
- represented as a continuous series of RDDs (micro-batches)
- different API than RDDs. DStream data carried no event time, so ordering was not possible when data arrived late.
- no built-in end-to-end guarantees: exactly-once processing is complex
- Structured Streaming was introduced to address these issues (sketch of the old API below)
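For contrast with Structured Streaming below, a minimal sketch of the legacy DStream API; it assumes a text server on localhost:9999 (e.g. `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, 5)                    # micro-batches every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # each batch is an RDD of lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # processing time only, no event time
counts.pprint()

ssc.start()
ssc.awaitTermination()
```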
## Structured Streaming
- scalable, fault-tolerant stream processing engine built on the Spark SQL engine
- uses the DataFrame API to express streaming aggregations, event time, windows
- queries are optimized by the Spark SQL engine
- ensures end-to-end exactly-once fault tolerance through checkpointing and write-ahead logs
Key Characteristics:
- data modes
- fault tolerance
- windowed group aggregation
- data stream as an unbounded table: the table grows as new records arrive
### Input data
Where can Structured Streaming read data from?
- file source, Kafka source, socket source
Kafka: event-driven message broker
### Output Sinks
- File sink
- Console sink (debugging)
- Memory sink (debugging)
Output modes:
Complete Mode: the entire result table is written to external storage
Append Mode: only new rows appended to the result table are written
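Putting source, sink, and output mode together: a hedged sketch of a streaming word count reading from a socket source and writing the full result table to the console sink in complete mode (host/port are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())                       # unbounded table with one 'value' column

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()       # result table grows as records arrive

query = (counts.writeStream
               .outputMode("complete")       # entire result table on every trigger
               .format("console")
               .start())
query.awaitTermination()
```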
## Fault-Tolerance
- Window aggregation: aggregate data over every x-length interval
- late data handling: data that arrives late is placed into its own event-time bucket
- state can pile up ("crowding"); "watermarking" sets a threshold for how late data is expected to arrive,
so older state can be dropped: "maybe data is of no value if it's too late"
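A sketch of windowed aggregation with a watermark; the built-in rate source (which emits timestamp/value rows) stands in for real event data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

windowed = (events
    .withWatermark("timestamp", "10 minutes")        # drop state for data >10 min late
    .groupBy(window(col("timestamp"), "5 minutes"))  # 5-minute event-time buckets
    .count())

query = windowed.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```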
---
# Machine Learning
Typical data analysis process:
- data collection, cleaning, training models, model tuning and evaluation
MLlib:
- classification: binary classification
- regression: predicting values from historical data
- clustering
- collaborative filtering ("recommendation")
- dimensionality reduction
featurization -> training -> model evaluation
### Types
#### Supervised Learning
- uses a pre-defined set of "training examples" to draw accurate conclusions when new data is given
- classification divides items into categories. Most common: binary classification.
- multiclass classification is also possible
#### Spark ML Pipeline:
##### Transformer:
- an algorithm that can transform one DataFrame into another DataFrame
- implements a method called transform()
##### Estimator:
- an algorithm that can be fit on a DataFrame to produce a Transformer
- implements a method called fit() that accepts a DataFrame and produces a Model
##### Pipeline:
- chains multiple Transformers and Estimators together to specify an ML workflow
- a Pipeline is itself an **Estimator**
##### Parameter:
- common API shared by Transformers and Estimators
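A minimal sketch wiring these pieces together (toy data, mirroring the standard Tokenizer -> HashingTF -> LogisticRegression example): Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator, and the Pipeline itself is fit() as an Estimator.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

train = spark.createDataFrame([
    (0, "spark is fast", 1.0),
    (1, "hadoop mapreduce", 0.0),
    (2, "spark streaming is fast", 1.0),
    (3, "hadoop hdfs", 0.0),
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10)                            # Estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])  # Pipeline is an Estimator
model = pipeline.fit(train)     # fit() returns a Transformer (PipelineModel)
model.transform(train).select("text", "prediction").show()
```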
#### Model Selection
- using data to find the best model or parameters for a given task
- cross validation (e.g. tuning the logistic regression threshold)
Receiver Operating Characteristic (ROC) curve: the closer the area under the curve is to 1, the better the classifier; an area of 0.5 (the y=x line) is no better than random guessing
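Continuing the pipeline sketch above, a hedged example of model selection with CrossValidator; BinaryClassificationEvaluator scores each parameter combination by area under the ROC curve (its default metric):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# reuses pipeline, lr, hashingTF, and train from the Pipeline sketch above
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(hashingTF.numFeatures, [1000, 10000])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=3)

cvModel = cv.fit(train)  # picks the parameter combination with the best AUC
```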
#### Unsupervised Learning: the learning algorithm should find patterns on its own
- related paradigms: deep learning, reinforcement learning
---
# SWE642 JPA notes
Java's Serializable interface has no methods to implement; marshalling and unmarshalling happen automatically