
Spark

A Spark application is a driver program that runs the user's main function and executes various parallel operations on a cluster. Its main abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant abstraction for in-memory cluster computing.
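
A minimal driver-program sketch, assuming Spark 1.x on the classpath and a local master; the application name and input path are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The driver builds a SparkContext, the entry point for creating RDDs.
    val conf = new SparkConf().setAppName("rdd-notes-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Parallel operations run on the cluster's executors.
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```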

RDD

RDDs were motivated by two types of applications that traditional MapReduce did not handle efficiently:

  • iterative algorithms: e.g. iterative ML algorithms such as PageRank, K-means clustering and logistic regression
  • interactive data mining tools

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

The challenge is providing fault tolerance efficiently. Existing systems such as distributed shared memory, distributed key-value stores, databases and Piccolo offer fine-grained updates: you manipulate individual cells. RDDs instead offer coarse-grained updates (map, filter, join) applied to the whole dataset. RDDs log the transformations (the lineage) rather than the data itself; when a partition is lost, there is enough information to recompute it. RDDs also allow checkpointing to store intermediate results.

Formally, an RDD is a read-only, partitioned collection of records.

RDDs let you control two aspects:

  • persistence: keep the data in memory or store it on disk
  • partitioning: the partitioning strategy across cluster nodes, useful when joining two datasets (see the sketch below)
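
A minimal sketch of controlling partitioning before a join, assuming an existing SparkContext `sc`; the file paths and field layout are hypothetical. Co-partitioning both pair RDDs with the same HashPartitioner lets repeated joins avoid reshuffling the persisted side.

```scala
import org.apache.spark.HashPartitioner

// Hypothetical pair RDDs keyed by user id.
val users  = sc.textFile("users.csv").map(line => (line.split(",")(0), line))
val events = sc.textFile("events.csv").map(line => (line.split(",")(0), line))

// Use the same partitioner on both sides and persist the larger dataset,
// so repeated joins reuse its partitioning instead of reshuffling it.
val partitioner = new HashPartitioner(8)
val usersByKey  = users.partitionBy(partitioner).persist()

val joined = events.partitionBy(partitioner).join(usersByKey)
```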

Transformations are lazy: an RDD is only computed the first time you actually ask for results.

Transformation

Lazy operations: map, filter, flatMap, reduceByKey, ...

Action

Greedy operations: count, reduce, collect
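
A small sketch, assuming an existing SparkContext `sc`, showing that transformations only record the lineage while an action triggers the actual computation.

```scala
// Transformations are lazy: nothing is computed here, only the lineage is recorded.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Actions are greedy: they force evaluation of the whole chain.
val howMany = squares.count()        // triggers the computation
val total   = squares.reduce(_ + _)  // recomputes the chain unless squares is persisted
val sample  = squares.take(5)        // brings a few results back to the driver
```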

Shuffle operation

Operations that force Spark to move data around the cluster in order to compute the result:

  • repartition operations: repartition and coalesce
  • ByKey operations (except for counting): for example reduceByKey and groupByKey
  • join operations: join and cogroup

Shuffles have a performance impact: they generate many intermediate files, disk I/O and network traffic. Shuffle behaviour can be configured.
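
A sketch of the operations listed above, assuming an existing SparkContext `sc`; each one redistributes data across executors.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

val sums     = pairs.reduceByKey(_ + _)   // ByKey operation: shuffles records by key
val grouped  = pairs.groupByKey()         // also shuffles, and ships every value: heavier
val spread   = pairs.repartition(16)      // repartition: full shuffle to 16 partitions
val narrowed = spread.coalesce(4)         // coalesce: shrinks partitions, avoiding a full shuffle
val joined   = sums.join(grouped)         // join: shuffles whichever side is not co-partitioned
```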

Persistence

Persist intermediate results to speed up later operations. Useful when an intermediate result will be reused by several other operations.

A persisted RDD can be stored using different storage levels, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
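
A sketch of persisting an intermediate result that several actions reuse, assuming an existing SparkContext `sc`; the storage levels come from org.apache.spark.storage.StorageLevel and the file path is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

// Intermediate RDD reused by two actions: persist it once instead of recomputing it.
val errors = sc.textFile("app.log")
  .filter(_.contains("ERROR"))
  .map(_.toLowerCase)

errors.persist(StorageLevel.MEMORY_ONLY)          // deserialized objects in memory (what cache() does)
// errors.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized: smaller, but costs CPU
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that do not fit to disk
// errors.persist(StorageLevel.MEMORY_ONLY_2)     // replicated on two nodes for faster recovery

val total    = errors.count()              // first action materializes and caches the RDD
val distinct = errors.distinct().count()   // reuses the cached partitions
```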

Which persistence to use?

CPU vs Memory tradeoff:

If you have enough memory: MEMORY_ONLY (the default).

If not, use MEMORY_ONLY_SER; choosing an efficient serializer is crucial. Data is kept in memory in serialized form, so there is a CPU cost to deserialize it. Use the replicated storage levels for faster recovery.

Only spill to disk if the cost of recomputing the RDD is very high.
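
A sketch of combining MEMORY_ONLY_SER with the Kryo serializer, which typically produces much more compact serialized data than Java serialization; the record type, file path and field layout are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical record type; registering it with Kryo keeps its serialized form compact.
case class Visit(userId: String, url: String, durationMs: Long)

val conf = new SparkConf()
  .setAppName("serialized-cache-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Visit]))
val sc = new SparkContext(conf)

val visits = sc.textFile("visits.csv").map { line =>
  val Array(user, url, ms) = line.split(",")
  Visit(user, url, ms.toLong)
}

// Serialized in-memory storage trades CPU for space; the "_2" variant replicates for faster recovery.
visits.persist(StorageLevel.MEMORY_ONLY_SER_2)
```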

In an environment with multiple applications or a lot of memory, use off-heap storage, as it has several advantages:

  • It allows multiple executors to share the same pool of memory in Tachyon.
  • It significantly reduces garbage collection costs.
  • Cached data is not lost if individual executors crash.

Shared Variables

Broadcast variable

Read-only data distributed to the nodes in an efficient way. A typical use case is sharing a large input dataset or lookup table with every node.
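
A sketch, assuming an existing SparkContext `sc`, of broadcasting a small lookup table so each executor receives it once instead of once per task; the table contents are illustrative.

```scala
// Read-only lookup table shipped to each executor once.
val countryNames = Map("FR" -> "France", "BR" -> "Brazil", "SE" -> "Sweden")
val countries    = sc.broadcast(countryNames)

val visitsByCountry = sc.parallelize(Seq("FR", "BR", "FR", "SE", "BR"))
  .map(code => countries.value.getOrElse(code, "unknown"))
  .countByValue()   // e.g. Map("France" -> 2, "Brazil" -> 2, "Sweden" -> 1)
```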

Accumulator

A monoid-like structure managed by Spark (an initial value plus an associative operation); tasks can only add to it, and only the driver program can read its value.
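
A sketch, assuming an existing SparkContext `sc` and the Spark 1.x accumulator API, that counts malformed lines as a side effect; the file path and record layout are hypothetical, and updates made inside transformations can be double-counted if a task is retried.

```scala
// Initial value 0 and associative addition: Spark merges per-task updates on the driver.
val badLines = sc.accumulator(0, "malformed lines")

val parsed = sc.textFile("data.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some(fields) else { badLines += 1; None }
}

parsed.count()            // the action runs the tasks, which update the accumulator
println(badLines.value)   // only the driver reads the merged result
```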
