# Apache Spark
- up to 100x faster than Hadoop MapReduce because it keeps intermediate data in memory
- Spark replaces Hadoop MapReduce, not the Hadoop Distributed File System (HDFS)
- Benchmark: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
- cluster management with YARN
## Spark API
- Spark SQL: distributed query engine
- Spark Streaming / Structured Streaming: processing of live streams of data
- Spark ML: machine learning library
## Resilient Distributed Dataset (RDD)
Immutable, partitioned collection of elements that can be operated on in parallel
DataFrames:
- distributed collection of data organized into named columns
- conceptually equivalent to a table in a relational database
- queries are automatically optimized
Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce
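To make the RDD vs. DataFrame contrast concrete, here is a minimal PySpark sketch (app name and sample data are illustrative, not from these notes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: immutable, partitioned collection operated on in parallel
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

# DataFrame: distributed rows organized into named columns, like a SQL table
df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])
df.filter(df.age > 30).show()  # the query plan is optimized automatically
```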
# Spark Runtime
- master/slave architecture: the master runs the "driver", workers run "executors"
# SparkContext
- main entry point of a Spark application
- part of the driver program
- a task is the smallest unit of work
Execution hierarchy: context -> driver -> job -> stage -> task -> executor
- driver: creates the SparkContext and coordinates the application
- job: triggered by an action on an RDD/DataFrame
- stage: a job is split into stages at shuffle boundaries
- task: a stage is split into tasks, one per partition
- executor: runs tasks on worker nodes
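A small sketch of how these pieces relate (app name and data are made up): the driver creates the SparkContext, and each action submits a job that is split into stages and tasks running on executors.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "runtime-demo")   # driver creates the context

data = sc.parallelize(range(100), numSlices=4)  # 4 partitions -> 4 tasks per stage
total = data.map(lambda x: x * 2).sum()         # action: submits a job
print(total)

sc.stop()
```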
---
## Spark Streaming/Structured Streaming
- Discretized streams (DStreams)
- a continuous stream of data
- represented as a continuous series of RDDs (micro-batches)
- different API than RDDs. DStream data carried no event time, so ordering was not possible when data arrived late.
- no built-in end-to-end guarantees: exactly-once processing is complex
- Structured Streaming was introduced to address these issues (sketch of the old API below)
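For contrast with Structured Streaming below, a minimal sketch of the legacy DStream API; it assumes a text server on localhost:9999 (e.g. `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, 5)                    # micro-batches every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # each batch is an RDD of lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # processing time only, no event time
counts.pprint()

ssc.start()
ssc.awaitTermination()
```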
## Structured Streaming
- scalable, fault-tolerant stream processing engine built on the Spark SQL engine
- uses the DataFrame API to express streaming aggregations, event time, windows
- queries are optimized by the Spark SQL engine
- ensures end-to-end exactly-once fault tolerance through checkpointing and write-ahead logs
Key Characteristics:
- data modes
- fault tolerance
- windowed group aggregation
- data stream as an unbounded table: the table grows as new records arrive
### Input data
Where can Structured Streaming read data from?
- file source, Kafka source, socket source
Kafka: event-driven message broker
### Output Sinks
- File sink
- Console sink (debugging)
- Memory sink (debugging)
Output modes:
Complete Mode: the entire result table is written to external storage
Append Mode: only new rows appended to the result table are written
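Putting source, sink, and output mode together: a hedged sketch of a streaming word count reading from a socket source and writing the full result table to the console sink in complete mode (host/port are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())                       # unbounded table with one 'value' column

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()       # result table grows as records arrive

query = (counts.writeStream
               .outputMode("complete")       # entire result table on every trigger
               .format("console")
               .start())
query.awaitTermination()
```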
## Fault-Tolerance
- Window aggregation: aggregate data over every x-length interval
- late data handling: data that arrives late is placed into its own event-time bucket
- state can pile up ("crowding"); "watermarking" sets a threshold for how late data is expected to arrive,
so older state can be dropped: "maybe data is of no value if it's too late"
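A sketch of windowed aggregation with a watermark; the built-in rate source (which emits timestamp/value rows) stands in for real event data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

windowed = (events
    .withWatermark("timestamp", "10 minutes")        # drop state for data >10 min late
    .groupBy(window(col("timestamp"), "5 minutes"))  # 5-minute event-time buckets
    .count())

query = windowed.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```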
---
# Machine Learning
Typical data analysis process:
- data collection, cleaning, training models, model tuning and evaluation
MLlib:
- classification: binary classification
- regression: predicting values from historical data
- clustering
- collaborative filtering ("recommendation")
- dimensionality reduction
featurization -> training -> model evaluation
### Types
#### Supervised Learning
- uses a pre-defined set of "training examples" to draw accurate conclusions when new data is given
- classification divides items into categories. Most common: binary classification.
- multiclass classification is also possible
#### Spark ML Pipeline:
##### Transformer:
- an algorithm that can transform one DataFrame into another DataFrame
- implements a method called transform()
##### Estimator:
- an algorithm that can be fit on a DataFrame to produce a Transformer
- implements a method called fit() that accepts a DataFrame and produces a Model
##### Pipeline:
- chains multiple Transformers and Estimators together to specify an ML workflow
- a Pipeline is itself an **Estimator**
##### Parameter:
- common API shared by Transformers and Estimators
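A minimal sketch wiring these pieces together (toy data, mirroring the standard Tokenizer -> HashingTF -> LogisticRegression example): Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator, and the Pipeline itself is fit() as an Estimator.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

train = spark.createDataFrame([
    (0, "spark is fast", 1.0),
    (1, "hadoop mapreduce", 0.0),
    (2, "spark streaming is fast", 1.0),
    (3, "hadoop hdfs", 0.0),
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10)                            # Estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])  # Pipeline is an Estimator
model = pipeline.fit(train)     # fit() returns a Transformer (PipelineModel)
model.transform(train).select("text", "prediction").show()
```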
#### Model Selection
- using data to find the best model or parameters for a given task
- cross validation (e.g. tuning the logistic regression threshold)
Receiver Operating Characteristic (ROC) curve: the closer the area under the curve is to 1, the better the classifier; an area of 0.5 (the y=x line) is no better than random guessing
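Continuing the pipeline sketch above, a hedged example of model selection with CrossValidator; BinaryClassificationEvaluator scores each parameter combination by area under the ROC curve (its default metric):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# reuses pipeline, lr, hashingTF, and train from the Pipeline sketch above
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(hashingTF.numFeatures, [1000, 10000])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=3)

cvModel = cv.fit(train)  # picks the parameter combination with the best AUC
```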
#### Unsupervised Learning: the learning algorithm should find patterns on its own
- related paradigms: deep learning, reinforcement learning
---
# SWE642 JPA notes
Java's Serializable interface has no methods to implement; marshalling and unmarshalling happen automatically