Assessing Spark & Flink from Clojure POV
**Concerns
- Interactivity
-- Incremental extension or modification of a running system
- Modularity
-- Pluggable serialization, storage, conveyance, scheduling, lifecycle hooks, etc.
**Spark Summary
- RDDs: represent a lazily-computed distributed dataset
-- each dataset is broken up into 'partitions' that individually sit on different machines
-- just a DAG with edges annotated to describe the relationship between partitions of parent and partitions of child
-- DAG is extended via 'transformations' that close over a supplied function
-- eventually an 'action' materializes the root, causing dependencies to evaluate (sketch below)
--- caching & durability policies at nodes; automatic recomputation on worker failure
-- suitable for interactive computation
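-- Sketch of the transformation/action split via Spark's Java API from Clojure interop
   (the class names are real Spark API; the local-mode setup and inline reify are
   illustrative, and on a real cluster function serialization needs help, e.g. from
   sparkling or flambo):
     (import '(org.apache.spark SparkConf)
             '(org.apache.spark.api.java JavaSparkContext)
             '(org.apache.spark.api.java.function Function))

     (def sc (JavaSparkContext.
               (-> (SparkConf.) (.setMaster "local[*]") (.setAppName "rdd-sketch"))))

     ;; 'parallelize' creates a leaf RDD; 'map' is a transformation and only
     ;; extends the DAG, computing nothing yet
     (def doubled
       (.map (.parallelize sc [1 2 3 4 5])
             (reify Function (call [_ x] (* 2 x)))))

     ;; 'collect' is an action: it materializes the root, forcing the
     ;; dependency chain to evaluate
     (.collect doubled)  ;=> [2 4 6 8 10]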
- RDD Representation
-- Logical representation (data sketch below):
--- list of partitions
--- function for computing each partition
--- list of 'dependencies' (DAG children + edge annotation)
--- optional partitioner & optional 'preferred locations'
-- Programmatic representation
--- Concrete classes inheriting from a base class
--- No formal programmatic distinction between transformations and actions
--- rddtypeA.transformationX(fn) -> new rddtypeB(fn, rddtypeA)
---- DAG treated as an implementation detail, not directly exposed or manipulated
---- Computation context ("Task Context") available to the RDD but not directly to fn
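-- The logical contract paraphrased as plain Clojure data (a hypothetical shape for
   illustration; the real thing is the abstract Scala class RDD[T] with these as methods):
     (def example-rdd
       {:partitions          [{:index 0} {:index 1}]        ; how the dataset is split
        :compute             (fn [partition task-context]   ; iterator per partition
                               (.iterator [(:index partition)]))
        :dependencies        []          ; leaf RDD: no parents
        :partitioner         nil         ; optional
        :preferred-locations (fn [partition] ["host-a"])})  ; optional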
- RDD Evaluation
-- RDD value is a logical list of partitions [p1 p2 pN]
-- RDD 'transformations' defined in terms of (transform-fn input-iterator) -> output-iterator
--- (map
      (fn [input-partition-iterator] output-partition-iterator)
      [p1-iterator p2-iterator pN-iterator])
--- specialized transforms like elementwise map, filter, mapcat etc built on top
--- plays well with the 'sequence' transduction context (eduction, available since Clojure 1.7, fits even better; sketch below)
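-- Sketch of the iterator-to-iterator contract with eduction (plain Clojure, runnable as-is):
     (defn transform-partition
       "Partition-level transform: java.util.Iterator in, Iterator out,
        with the per-element logic supplied as a transducer."
       [xform ^java.util.Iterator in]
       ;; eduction is Iterable, so it hands back a fresh output iterator
       (.iterator ^Iterable (eduction xform (iterator-seq in))))

     ;; elementwise map/filter/mapcat fall out as transducers:
     (iterator-seq
       (transform-partition (comp (map inc) (filter even?))
                            (.iterator [1 2 3 4])))
     ;=> (2 4)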
- Coordination
-- Coordination of job execution happens at the driver
--- In the same process as, and interleaved with, the job description
-- User supplies a driver program to a driver (typically via shell command)
-- Driver program builds up RDDs via transformations and performs actions
--- Behind the scenes, the driver process coordinates with workers ('executors') and the cluster manager
--- API entry point is a "Spark Context"; leaf RDDs are created via the Spark Context
--- RDDs created in the driver are serialized and shipped to executors
---- RDD methods are invoked on BOTH driver and executors
-- 'Actions' invoked at the driver trigger executor computation & IO side effects
- Inputs & Outputs
-- Data inputs & outputs happen via side effects; not formalized
--- example input: leaf RDD contains URLs; an RDD xform reads data from the URLs and outputs via iterator
-- Jar + main class are supplied to the driver (example below)
--- driver supplies resources to executors; a fairly inscrutable process
-- Driver can launch a REPL
--- REPL-created classes are supplied to executors via a URL classloader
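-- For reference, the usual shell entry point (standard spark-submit flags; the class
   name, master URL, and jar path are placeholders):
     spark-submit --class my.driver.Main \
                  --master spark://host:7077 \
                  path/to/uberjar.jar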
- Modularity
-- Can plug in functions for consuming/creating iterators (great)
-- Downhill from there
-- Fundamental construct (the DAG) is hidden behind impl details
--- Cannot be directly fabricated or manipulated
-- Concrete types, not interfaces
--- Worse, Scala types (cumbersome to inhabit from Clojure). Java API is an afterthought
-- RDD types form a complex web of relationships without formal semantics
-- Conveyance, coordination, and initialization mechanisms are inscrutable and ad hoc
- Interactivity
-- Core design allows extending the computation DAG
--- extension happens within the driver process, which has many opaque entanglements
-- Scala shell allows new definitions to be visible at executors
--- Hacked into the system; not clear how to extend to Clojure
**Flink Summary
- Job Graph
-- Represents a dataflow
--- Edges of the system are sources and sinks
--- Various 'operators' in between
-- High-level API --> DAG --> Optimizer/Compiler --> 'Job Graph' (interop sketch below)
--- 'Job Graph' distributed execution is managed by the JobManager
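-- Sketch of the high-level API layer from Clojure interop (ExecutionEnvironment,
   MapFunction, and print are real Flink batch API circa 0.10/1.0; the inline reify
   carries the same serialization caveats as with Spark):
     (import '(org.apache.flink.api.java ExecutionEnvironment)
             '(org.apache.flink.api.common.functions MapFunction))

     (def env (ExecutionEnvironment/getExecutionEnvironment))

     ;; source -> operator -> sink: the chain only describes the dataflow;
     ;; the sink call hands it to the optimizer and runs it
     (-> (.fromElements env (into-array Integer (map int [1 2 3])))
         (.map (reify MapFunction (map [_ x] (* 2 x))))
         (.print))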
- Job Graph Representation
-- Set of concrete classes that are just AST-like descriptors (data sketch below)
--- Computation nodes, input & output nodes, edges of various kinds
--- Nodes contain user-supplied functions/components
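-- Paraphrase of that descriptor style as Clojure data (a hypothetical shape, not
   Flink's actual classes; paths are placeholders):
     (def job-graph
       {:vertices [{:id :source, :kind :input,  :path "hdfs://..."}
                   {:id :mapper, :kind :map,    :udf (fn [^String line] (.toUpperCase line))}
                   {:id :sink,   :kind :output, :path "hdfs://..."}]
        :edges    [{:from :source, :to :mapper, :ship-strategy :forward}
                   {:from :mapper, :to :sink,   :ship-strategy :forward}]})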
- Job Graph Evaluation
-- Evaluation executes the dataflow
-- Job Graph turns into an 'Execution Graph' at the JobManager
--- Nodes tell worker components what to do and when to do it, and track high-level state
- Coordination
-- Client process submits the Job Graph + resources to the JobManager
-- Client process is not involved in coordination
- Inputs & Outputs
-- Well-defined interfaces for input and output data (sketch below)
-- Well-defined job submission boundary
-- No built-in facilities for REPL input
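-- e.g. a custom source is a two-method interface; a sketch against Flink's streaming
   SourceFunction (run/cancel), assuming the streaming API (batch inputs go through
   the analogous InputFormat):
     (import '(org.apache.flink.streaming.api.functions.source SourceFunction))

     (def counting-source
       (reify SourceFunction
         (run [_ ctx]
           ;; bounded source: emit 0..9 and finish; an unbounded source
           ;; would loop until 'cancel' flips a flag
           (doseq [i (range 10)]
             (.collect ctx i)))
         (cancel [_])))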
- Modularity
-- Fundamental construct (the Job Graph) is easy to fabricate
-- Interfaces everywhere
--- Most concrete classes are in Java
-- Well-defined components & interactions between them
- Interactivity
-- No affordances specifically for interactivity
-- Provided implementations assume a static Job Graph
-- However, it is plausible to modify the behavior of the JobManager and of worker nodes