# Apache Spark
- Up to 100x faster than Hadoop MapReduce for some workloads because it processes data in memory.
- Spark replaces Hadoop MapReduce, not the Hadoop file system (HDFS).
- Benchmark: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
- Cluster management with YARN.
## Spark API
- Spark SQL: distributed query engine
- Spark Streaming / Structured Streaming: processing of live data streams
- Spark ML: machine learning library
## Resilient Distributed Dataset (RDD)
Immutable, partitioned collection of elements that can be operated on in parallel.
DataFrames:
- distributed collection of data organized into named columns
- conceptually equivalent to a table in a relational database
- query execution is automatically optimized
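A minimal PySpark sketch (names and values are illustrative) contrasting the two abstractions: an RDD operated on in parallel and a DataFrame with named columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: immutable, partitioned collection operated on in parallel
rdd = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)
print(rdd.map(lambda x: x * x).collect())   # [1, 4, 9, 16]

# DataFrame: distributed data organized into named columns; execution is optimized automatically
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).show()

spark.stop()
```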
Mesos: a general-purpose cluster manager (an alternative to YARN) that can also run Hadoop MapReduce
# Spark Runtime
- master "driver" , slave "executor"
# SparkContext
- main entry point of a Spark application
- part of the driver program
- a task is the smallest unit of work
Runtime hierarchy (context -> driver -> job -> stage -> task -> executor; see the sketch below):
- context: the SparkContext created by the driver program
- driver: plans the work and schedules it on the executors
- job: triggered by each action on an RDD/DataFrame
- stage: group of tasks separated by shuffle boundaries
- task: runs on one partition of the data
- executor: worker process that executes tasks
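A minimal PySpark sketch (app name and master URL are illustrative) of the runtime pieces: the SparkContext lives in the driver, an action submits a job, and partitions map to tasks on executors.

```python
from pyspark import SparkConf, SparkContext

# SparkContext lives in the driver program and is the main entry point
conf = SparkConf().setAppName("runtime-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

# 4 partitions -> 4 tasks per stage; reduce() is an action, so it submits a job
rdd = sc.parallelize(range(100), numSlices=4)
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
print(total)   # 9900

sc.stop()
```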
---
## Spark Streaming/Structured Streaming
- Discretized Streams (DStreams): a continuous stream of data, represented as a continuous series of RDDs
- different API from the batch RDD API; DStream records carried no event time, so late-arriving data could not be re-ordered correctly
- no built-in end-to-end guarantees: exactly-once processing is complex
- Structured Streaming was introduced to address these issues (legacy DStream sketch below)
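For contrast, a minimal sketch of the legacy DStream API (host, port, and batch interval are assumptions): each micro-batch is an RDD, and there is no notion of event time.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=5)   # a new RDD every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)   # assumed host/port
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # processing-time word counts; late data is not re-ordered

ssc.start()
ssc.awaitTermination()
```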
## Structured Streaming
- scalable, fault-tolerant stream processing engine built on the Spark SQL engine
- uses the DataFrame API to express streaming aggregations, event time, and windows
- optimized by the Spark SQL engine
- ensures end-to-end exactly-once fault tolerance through checkpointing and write-ahead logs
Key characteristics:
- output modes
- fault tolerance
- windowed group aggregations
- data stream as an unbounded table: the table grows as more records arrive
### Input data
Where can Structured Streaming read data from?
- file source, Kafka source, socket source
- Kafka: a distributed, event-driven message broker
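A minimal sketch of reading from the Kafka source (the broker address and topic name are assumptions; the spark-sql-kafka connector package must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-source").getOrCreate()

# The stream is exposed as an unbounded DataFrame of Kafka records
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
               .option("subscribe", "events")                        # assumed topic
               .load())

# Kafka delivers key/value as binary, plus topic, partition, offset, and timestamp columns
messages = events.selectExpr("CAST(value AS STRING) AS message")
```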
### Output Sinks
- file sink
- console sink (for debugging)
- memory sink (for debugging)
Output modes:
- Complete mode: the entire result table is written to external storage
- Append mode: only new rows are appended to the result table
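A minimal end-to-end sketch (socket source on an assumed localhost:9999, console sink) showing where the output mode fits:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")   # assumed host/port
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
               .outputMode("complete")   # entire result table every trigger; "append" would emit only new rows
               .format("console")        # console sink, for debugging
               .start())
query.awaitTermination()
```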
## Fault-Tolerance
- Window aggregation: aggregate data over a fixed window every x interval
- Late data handling: data that arrives late is still assigned to the window (date-range bucket) matching its event time
- Keeping all window state can cause crowding (unbounded state growth). "Watermarking" sets a threshold for how late data is expected to arrive,
  so older state can be dropped ("maybe the data is of no value if it is too late")
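A minimal sketch of a windowed aggregation with a watermark; `events` is an assumed streaming DataFrame with an `event_time` timestamp column and a `word` column (e.g. parsed from the Kafka source above):

```python
from pyspark.sql.functions import window

# Tumbling 10-minute windows keyed on event time; the 15-minute watermark
# lets Spark drop state for windows older than (max event time - 15 minutes)
windowed_counts = (events
    .withWatermark("event_time", "15 minutes")
    .groupBy(window(events.event_time, "10 minutes"), events.word)
    .count())
```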
---
# Machine Learning
Typical data analysis process:
- data collection, cleaning, training models, model tuning and evaluation
MLlib algorithm families:
- classification: e.g. binary classification
- regression: predicting continuous values from historical data
- clustering: grouping similar, unlabeled records
- collaborative filtering: "recommendation"
- dimensionality reduction: reducing the number of features
Workflow: featurization -> training -> model evaluation
### Types
#### Supervised Learning
- Uses a pre-defined set of labeled "training examples" to draw accurate conclusions when new data is given
- Classification divides items into categories; the most common case is binary classification
- Multiclass classification is also possible
#### Spark ML Pipeline:
##### Transformer:
- An algorithm that transforms one DataFrame into another DataFrame
- Implements a transform() method
##### Estimator:
- An algorithm that can be fit on a DataFrame to produce a Transformer
- Implements a fit() method that accepts a DataFrame and produces a Model
##### Pipeline:
- Chains multiple Transformers and Estimators together to specify an ML workflow
- A Pipeline is itself an **Estimator**
##### Parameter:
- Common API for specifying parameters, shared by Transformers and Estimators
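A minimal sketch of the pattern (column names and the `training`/`test` DataFrames are assumptions): Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator, and the Pipeline chaining them is itself an Estimator.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # Transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features") # Transformer
lr = LogisticRegression(maxIter=10, regParam=0.01)             # Estimator; params set via the shared Param API

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# fit() on the Pipeline (an Estimator) yields a PipelineModel (a Transformer);
# `training` and `test` are assumed DataFrames with (text, label) / (text) columns
model = pipeline.fit(training)
predictions = model.transform(test)
```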
#### Model Selection
- Using data to find the best model or parameters for a given task
- Cross-validation (e.g. tuning the logistic regression threshold)
- Receiver Operating Characteristic (ROC) curve: the closer the area under the curve is to 1, the better the classifier; an area of 0.5 (the y = x line) is no better than random guessing
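A minimal sketch of cross-validation over a small parameter grid, scored by area under the ROC curve; it reuses the hypothetical `pipeline`, `hashing_tf`, `lr`, `training`, and `test` names from the sketch above.

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

param_grid = (ParamGridBuilder()
              .addGrid(hashing_tf.numFeatures, [100, 1000])
              .addGrid(lr.regParam, [0.1, 0.01])
              .build())

# areaUnderROC: closer to 1 is better; 0.5 is no better than random guessing
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model = cv.fit(training)              # picks the parameters with the best average AUC
best_predictions = cv_model.transform(test)
```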
#### Unsupervised Learning: the learning algorithm should find patterns on its own.
- Related: deep learning, reinforcement learning algorithms
---
# SWE642 JPA notes
Java's Serializable interface is a marker interface: it has no methods to implement. Marshalling and unmarshalling are automatic.