Assessing Spark & Flink from Clojure POV
**Concerns
- Interactivity
-- Incremental extension or modification of a running system
- Modularity
-- Pluggable serialization, storage, conveyance, scheduling, lifecycle hooks, etc.
**Spark Summary
- RDDs: represent a lazily-computed distributed dataset
-- each dataset is broken up into 'partitions' that individually sit on different machines
-- just a DAG with edges annotated to describe the relationship between partitions of parent and partitions of child
-- DAG is extended via 'transformations' that close over a supplied function
-- eventually an 'action' materializes the root, causing dependencies to evaluate (sketch below)
--- caching & durability policies at nodes; automatic recomputation on worker failure
-- suitable for interactive computation
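-- Sketch of the transformation/action split via Spark's Java API from Clojure interop
   (the class names are real Spark API; the local-mode setup and inline reify are
   illustrative, and on a real cluster function serialization needs help, e.g. from
   sparkling or flambo):
     (import '(org.apache.spark SparkConf)
             '(org.apache.spark.api.java JavaSparkContext)
             '(org.apache.spark.api.java.function Function))

     (def sc (JavaSparkContext.
               (-> (SparkConf.) (.setMaster "local[*]") (.setAppName "rdd-sketch"))))

     ;; 'parallelize' creates a leaf RDD; 'map' is a transformation and only
     ;; extends the DAG, computing nothing yet
     (def doubled
       (.map (.parallelize sc [1 2 3 4 5])
             (reify Function (call [_ x] (* 2 x)))))

     ;; 'collect' is an action: it materializes the root, forcing the
     ;; dependency chain to evaluate
     (.collect doubled)  ;=> [2 4 6 8 10]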
- RDD Representation
-- Logical representation (data sketch below):
--- list of partitions
--- function for computing each partition
--- list of 'dependencies' (DAG children + edge annotation)
--- optional partitioner & optional 'preferred locations'
-- Programmatic representation
--- Concrete classes inheriting from a base class
--- No formal programmatic distinction between transformations and actions
--- rddtypeA.transformationX(fn) -> new rddtypeB(fn, rddtypeA)
---- DAG treated as an implementation detail, not directly exposed or manipulated
---- Computation context ("Task Context") available to the RDD but not directly to fn
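-- The logical contract paraphrased as plain Clojure data (a hypothetical shape for
   illustration; the real thing is the abstract Scala class RDD[T] with these as methods):
     (def example-rdd
       {:partitions          [{:index 0} {:index 1}]        ; how the dataset is split
        :compute             (fn [partition task-context]   ; iterator per partition
                               (.iterator [(:index partition)]))
        :dependencies        []          ; leaf RDD: no parents
        :partitioner         nil         ; optional
        :preferred-locations (fn [partition] ["host-a"])})  ; optional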
- RDD Evaluation
-- RDD value is a logical list of partitions [p1 p2 pN]
-- RDD 'transformations' defined in terms of (transform-fn input-iterator) -> output-iterator
--- (map
      (fn [input-partition-iterator] output-partition-iterator)
      [p1-iterator p2-iterator pN-iterator])
--- specialized transforms like elementwise map, filter, mapcat etc built on top
--- plays well with the 'sequence' transduction context (eduction, available since Clojure 1.7, fits even better; sketch below)
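-- Sketch of the iterator-to-iterator contract with eduction (plain Clojure, runnable as-is):
     (defn transform-partition
       "Partition-level transform: java.util.Iterator in, Iterator out,
        with the per-element logic supplied as a transducer."
       [xform ^java.util.Iterator in]
       ;; eduction is Iterable, so it hands back a fresh output iterator
       (.iterator ^Iterable (eduction xform (iterator-seq in))))

     ;; elementwise map/filter/mapcat fall out as transducers:
     (iterator-seq
       (transform-partition (comp (map inc) (filter even?))
                            (.iterator [1 2 3 4])))
     ;=> (2 4)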
- Coordination
-- Coordination of job execution happens at the driver
--- In the same process as, and interleaved with, the job description
-- User supplies a driver program to a driver (typically via shell command)
-- Driver program builds up RDDs via transformations and performs actions
--- Behind the scenes, the driver process coordinates with workers ('executors') and the cluster manager
--- API entry point is a "Spark Context"; leaf RDDs are created via the Spark Context
--- RDDs created in the driver are serialized and shipped to executors
---- RDD methods are invoked on BOTH driver and executors
-- 'Actions' invoked at the driver trigger executor computation & IO side effects
- Inputs & Outputs
-- Data inputs & outputs happen via side effects; not formalized
--- example input: leaf RDD contains URLs; an RDD xform reads data from the URLs and outputs via iterator
-- Jar + main class are supplied to the driver (example below)
--- driver supplies resources to executors; a fairly inscrutable process
-- Driver can launch a REPL
--- REPL-created classes are supplied to executors via a URL classloader
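-- For reference, the usual shell entry point (standard spark-submit flags; the class
   name, master URL, and jar path are placeholders):
     spark-submit --class my.driver.Main \
                  --master spark://host:7077 \
                  path/to/uberjar.jar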
- Modularity
-- Can plug in functions for consuming/creating iterators (great)
-- Downhill from there
-- Fundamental construct (the DAG) is hidden behind impl details
--- Cannot be directly fabricated or manipulated
-- Concrete types, not interfaces
--- Worse, Scala types (cumbersome to inhabit from Clojure). Java API is an afterthought
-- RDD types form a complex web of relationships without formal semantics
-- Conveyance, coordination, and initialization mechanisms are inscrutable and ad hoc
- Interactivity
-- Core design allows extending the computation DAG
--- extension happens within the driver process, which has many opaque entanglements
-- Scala shell allows new definitions to be visible at executors
--- Hacked into the system; not clear how to extend to Clojure
**Flink Summary
- Job Graph
-- Represents a dataflow
--- Edges of the system are sources and sinks
--- Various 'operators' in between
-- High-level API --> DAG --> Optimizer/Compiler --> 'Job Graph' (interop sketch below)
--- 'Job Graph' distributed execution is managed by the JobManager
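-- Sketch of the high-level API layer from Clojure interop (ExecutionEnvironment,
   MapFunction, and print are real Flink batch API circa 0.10/1.0; the inline reify
   carries the same serialization caveats as with Spark):
     (import '(org.apache.flink.api.java ExecutionEnvironment)
             '(org.apache.flink.api.common.functions MapFunction))

     (def env (ExecutionEnvironment/getExecutionEnvironment))

     ;; source -> operator -> sink: the chain only describes the dataflow;
     ;; the sink call hands it to the optimizer and runs it
     (-> (.fromElements env (into-array Integer (map int [1 2 3])))
         (.map (reify MapFunction (map [_ x] (* 2 x))))
         (.print))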
- Job Graph Representation
-- Set of concrete classes that are just AST-like descriptors (data sketch below)
--- Computation nodes, input & output nodes, edges of various kinds
--- Nodes contain user-supplied functions/components
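-- Paraphrase of that descriptor style as Clojure data (a hypothetical shape, not
   Flink's actual classes; paths are placeholders):
     (def job-graph
       {:vertices [{:id :source, :kind :input,  :path "hdfs://..."}
                   {:id :mapper, :kind :map,    :udf (fn [^String line] (.toUpperCase line))}
                   {:id :sink,   :kind :output, :path "hdfs://..."}]
        :edges    [{:from :source, :to :mapper, :ship-strategy :forward}
                   {:from :mapper, :to :sink,   :ship-strategy :forward}]})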
- Job Graph Evaluation
-- Evaluation executes the dataflow
-- Job Graph turns into an 'Execution Graph' at the JobManager
--- Nodes tell worker components what to do and when to do it, and track high-level state
- Coordination
-- Client process submits the Job Graph + resources to the JobManager
-- Client process is not involved in coordination
- Inputs & Outputs
-- Well-defined interfaces for input and output data (sketch below)
-- Well-defined job submission boundary
-- No built-in facilities for REPL input
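-- e.g. a custom source is a two-method interface; a sketch against Flink's streaming
   SourceFunction (run/cancel), assuming the streaming API (batch inputs go through
   the analogous InputFormat):
     (import '(org.apache.flink.streaming.api.functions.source SourceFunction))

     (def counting-source
       (reify SourceFunction
         (run [_ ctx]
           ;; bounded source: emit 0..9 and finish; an unbounded source
           ;; would loop until 'cancel' flips a flag
           (doseq [i (range 10)]
             (.collect ctx i)))
         (cancel [_])))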
- Modularity
-- Fundamental construct (the Job Graph) is easy to fabricate
-- Interfaces everywhere
--- Most concrete classes are in Java
-- Well-defined components & interactions between them
- Interactivity
-- No affordances specifically for interactivity
-- Provided implementations assume a static Job Graph
-- However, it is plausible to modify the behavior of the JobManager and of worker nodes