Spark: a driver program runs the user's main function, executing various parallel operations on a cluster.
Main abstraction: Resilient Distributed Dataset (RDD), a fault-tolerant abstraction for in-memory cluster computing.
Motivated by two types of applications that traditional MapReduce did not handle efficiently:
- iterative algorithms: iterative computations such as PageRank, K-means clustering and logistic regression
- interactive data mining tools
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
The challenge is doing fault tolerance efficiently. Existing systems (distributed shared memory, distributed key-value stores, databases, Piccolo) offer fine-grained updates: you manipulate individual cells. RDDs offer coarse-grained updates applied to the whole dataset: map, filter and join. RDDs log the transformations (the lineage) instead of the data itself; when a partition is lost, there is enough information to recompute it. RDDs also allow checkpointing to store intermediate results.
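The lineage idea above can be sketched in a few lines of plain Python. This toy class (not the real Spark API; `ToyRDD` and its methods are made up for illustration) logs the transformation that produced each dataset, so a lost partition can be rebuilt from its parent instead of being replicated:

```python
# Toy sketch of lineage-based fault tolerance (NOT the real Spark API):
# each dataset stores the coarse-grained transformation that produced it,
# not a copy of the data, so a lost partition can be recomputed on demand.

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        # `partitions` is a list of lists; `lineage` is (parent, fn) or None
        self.partitions = partitions
        self.lineage = lineage

    def map(self, fn):
        # Coarse-grained update: apply fn to every record, and log the
        # transformation so any partition can be recomputed later.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, lineage=(self, lambda p: [fn(x) for x in p]))

    def recompute_partition(self, i):
        # Rebuild a lost partition from the parent via the logged transformation
        parent, fn = self.lineage
        return fn(parent.partitions[i])

base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
squared.partitions[1] = None            # simulate losing partition 1
restored = squared.recompute_partition(1)
print(restored)                         # -> [9, 16]
```

The key point: only the function and the parent reference are kept, which is far cheaper than replicating the data across the network after every update.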
Formally, an RDD is a read-only, partitioned collection of records
RDDs let you control 2 aspects:
- persistence: keep in memory or store on disk
- partitioning: the partitioning strategy across cluster nodes; useful when joining 2 datasets
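Why controlling partitioning helps joins can be shown with plain Python (this is an illustration, not Spark code): if two datasets are hash-partitioned with the same partitioner, equal keys land in the same partition index, so the join can proceed partition by partition with no data movement.

```python
# Illustration (plain Python, not Spark): co-partitioning two datasets
# with the same hash partitioner co-locates equal keys, so a join never
# needs to shuffle records between partitions.

NUM_PARTITIONS = 4

def hash_partition(pairs, n=NUM_PARTITIONS):
    parts = [[] for _ in range(n)]
    for key, value in pairs:
        parts[hash(key) % n].append((key, value))
    return parts

users = hash_partition([("alice", 1), ("bob", 2), ("carol", 3)])
events = hash_partition([("alice", "click"), ("bob", "view")])

# Join partition-by-partition: matching keys are guaranteed co-located
joined = []
for u_part, e_part in zip(users, events):
    lookup = dict(u_part)
    for key, event in e_part:
        if key in lookup:
            joined.append((key, lookup[key], event))

print(sorted(joined))
```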
Lazy transformations. Nothing is computed until the first time you actually ask for a result.
Lazy operations (transformations): map, filter, reduceByKey, groupByKey ...
Greedy operations (actions): count, reduce, collect
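The lazy/greedy split can be mimicked in plain Python (an illustrative sketch; `LazySeq` is invented, not Spark's API): transformations only record a plan, and an action is what finally runs it.

```python
# Sketch of lazy evaluation in the Spark style (plain Python, invented API):
# transformations only log what to do; an action triggers the actual work.

class LazySeq:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops          # logged transformations, not yet applied

    def map(self, fn):          # transformation: returns a new plan, computes nothing
        return LazySeq(self.data, self.ops + (("map", fn),))

    def filter(self, pred):     # transformation
        return LazySeq(self.data, self.ops + (("filter", pred),))

    def _run(self):
        out = self.data
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

    def count(self):            # action: forces evaluation
        return len(self._run())

    def collect(self):          # action
        return self._run()

pipeline = LazySeq(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# Nothing has been computed yet; only the list of logged ops grew.
print(pipeline.collect())  # -> [12, 14, 16, 18]
print(pipeline.count())    # -> 4
```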
Operations that force Spark to move data around (a shuffle) in order to process:
- repartition operations: repartition and coalesce
- ByKey operations (except counting), for example reduceByKey and groupByKey
- join operations: join and cogroup
Shuffles have a performance impact: they generate many intermediate files, disk I/O and network traffic. Shuffle behavior can be configured.
Persist intermediate results to speed up subsequent operations. Useful when an intermediate result will be reused by several other operations.
Persisted RDDs can be stored using different storage levels, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
CPU vs memory tradeoff:
If you have enough memory: MEMORY_ONLY.
If not, use MEMORY_ONLY_SER; choosing an efficient serializer is crucial. Data is kept in memory in serialized form, so there is a CPU cost. Use replication for faster recovery.
Only spill to disk if the cost of computing the RDD was very high.
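The space-vs-CPU tradeoff behind MEMORY_ONLY_SER can be illustrated in plain Python (using pickle as a stand-in serializer; the sizes are only indicative): the serialized form is much smaller, but every access pays a deserialization cost.

```python
# Illustration of the MEMORY_ONLY vs MEMORY_ONLY_SER tradeoff in plain
# Python: serialized storage is more compact, but reading it back costs
# CPU to deserialize.
import pickle
import sys

records = [{"id": i, "value": float(i)} for i in range(1000)]

# "MEMORY_ONLY"-style: live objects, fast to access, larger footprint
object_size = sum(sys.getsizeof(r) for r in records)

# "MEMORY_ONLY_SER"-style: one serialized blob, smaller, must be unpickled
blob = pickle.dumps(records)
serialized_size = len(blob)

print(serialized_size < object_size)        # -> True
print(pickle.loads(blob) == records)        # -> True (CPU cost on access)
```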
In an environment with multiple applications or a lot of memory, use off-heap storage, as it has several advantages:
- It allows multiple executors to share the same pool of memory in Tachyon.
- It significantly reduces garbage collection costs.
- Cached data is not lost if individual executors crash.
Broadcast variables: read-only data distributed to the nodes in an efficient way. A typical use case is sharing a large input dataset.
Accumulators: a monoid structure managed by Spark (an initial value and an associative function).
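Why the monoid structure matters can be sketched in plain Python (illustrative only, not Spark's accumulator API): each partition folds locally starting from the identity value, and because the merge function is associative, the driver can combine the per-partition results in whatever order the scheduler delivers them and still get the same total.

```python
# Sketch of why accumulators need a monoid (identity + associative merge):
# per-partition results can be combined in any grouping/order and the
# total stays deterministic.

def make_accumulator(zero, merge):
    # `zero` is the identity value, `merge` the associative function
    return {"zero": zero, "merge": merge}

sum_acc = make_accumulator(0, lambda a, b: a + b)

partitions = [[1, 2, 3], [4, 5], [6]]

# Each partition folds locally, starting from the identity...
partials = []
for part in partitions:
    acc = sum_acc["zero"]
    for x in part:
        acc = sum_acc["merge"](acc, x)
    partials.append(acc)

# ...then the driver merges the partials; associativity makes any
# merge order equivalent.
total = sum_acc["zero"]
for p in partials:
    total = sum_acc["merge"](total, p)

print(partials)  # -> [6, 9, 6]
print(total)     # -> 21
```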