By-name parameters are only evaluated when used. They are in contrast to by-value parameters. To make a parameter called by-name, simply prepend => to its type.
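A minimal sketch (my own toy example, not from the lecture) of the difference:

    // By-value: the argument is evaluated once, before the call.
    def byValue(x: Int): Int = 42
    // By-name: the argument is evaluated each time it is used (here: never).
    def byName(x: => Int): Int = 42

    def loop: Int = loop   // never terminates if evaluated
    byName(loop)           // fine -- the argument is never evaluated
    // byValue(loop)       // would not terminate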
1.5 Square root using Newton's method
Recursive functions NEED an explicit return type (the compiler cannot infer it from the recursive call); for non-recursive functions it is optional (Duh)
1.6 Blocks and Lexical Scope
Nested functions -- put auxiliary functions inside the function sqrt
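A sketch along the lines of the course's Newton's-method sqrt, with the auxiliary functions nested inside sqrt (written from memory, names are illustrative):

    def sqrt(x: Double): Double = {
      // Recursive, so the return type Double is mandatory here.
      def sqrtIter(guess: Double): Double =
        if (isGoodEnough(guess)) guess
        else sqrtIter(improve(guess))

      def isGoodEnough(guess: Double): Boolean =
        math.abs(guess * guess - x) / x < 0.001

      def improve(guess: Double): Double =
        (guess + x / guess) / 2

      sqrtIter(1.0)   // x is visible in the nested functions thanks to lexical scoping
    }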
In Java, arrays are covariant, but the runtime type-checks every array store to prevent issues.
Why was it done? Because they wanted a sort to work for everything.
Barbara Liskov Substitution Principle
If A<:B, then everything you can do with B, you should be able to do with A.
In Scala, Arrays are NOT covariant (for why, see more below)
4.4: Variance
Lists are immutable, arrays are mutable. Immutable types can be covariant, if some other additional conditions are also met.
Types can be covariant C[+A], contravariant C[-A], or nonvariant C[A]
Typing rules for functions --
For a function, if A2 <: A1 and B1 <: B2, then A1 => B1 <: A2 => B2.
Functions must be contravariant in their argument types and covariant in their result types
Example -->
    trait Function1[-T, +U] {
      def apply(x: T): U
    } // variance check is OK because T is contravariant and U is covariant

    class Array[+T] {
      def update(x: T): Unit
    } // variance check fails: covariant T appears in a parameter (contravariant) position
MIND BLOWN! --> This is exactly how the Function1 trait is declared in the Scala standard library (as shown above), i.e. the return type is covariant, the argument type is contravariant.
Can you just put + and - wherever you want?! NOOO -- otherwise we could make Array covariant. The Scala compiler checks that there are no problematic uses of variance annotations.
Roughly, the rule is: covariant type parameters may only appear in method result types, contravariant type parameters may only appear in method parameter types, and nonvariant (invariant) type parameters can appear anywhere.
Takes the example of Nil from the Cons List example. Nothing is a subtype of everything.
Making a class covariant can be a lot of work! -- The prepend example
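Roughly what the prepend example boils down to (a sketch, not the exact course code): a covariant List cannot take T in a parameter position directly, so prepend needs a lower bound.

    trait List[+T] {
      // def prepend(elem: T): List[T]           // rejected: covariant T in a parameter position
      def prepend[U >: T](elem: U): List[U] =    // accepted: U is a supertype of T
        new Cons(elem, this)
    }
    class Cons[T](val head: T, val tail: List[T]) extends List[T]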
Assume the similar Cat, Dog, Animal inheritance tree used earlier, plus the following:
    abstract class SmallAnimal extends Animal
    case class Mouse(name: String) extends SmallAnimal
Suppose we're working with functions that accept types of animals, and return the types of food they eat. If we would like a Cat => SmallAnimal (because cats eat small animals), but are given an Animal => Mouse instead, our program will still work. Intuitively an Animal => Mouse will still accept a Cat as an argument, because a Cat is an Animal, and it returns a Mouse, which is also a SmallAnimal. Since we can safely and invisibly substitute the former by the latter, we can say Animal => Mouse is a subtype of Cat => SmallAnimal.
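A self-contained sketch of that substitution (class definitions repeated so it compiles; names are the hypothetical ones from the example):

    class Animal
    class Cat extends Animal
    abstract class SmallAnimal extends Animal
    case class Mouse(name: String) extends SmallAnimal

    val animalToMouse: Animal => Mouse = _ => Mouse("hypothetical")
    // OK because Function1[-A, +B]: Animal => Mouse <: Cat => SmallAnimal
    val catToSmallAnimal: Cat => SmallAnimal = animalToMouse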
4.5 Decomposition
Expr, Number and Sum example; trait Expr
Too tedious!! Too many classification and accessor methods -- the number of methods grows quadratically as functionality is added.
Maybe use type casts? Nooo, that sucks! Scala does not recommend it.
Define eval in every class -- but then you need to touch all classes whenever you add a new method to the base trait.
4.6 Pattern Matching
All previous methods have shortcomings.
Case Classes
-- Automatically adds companion objects.
-- A generalization of the switch statement.
-- Can match on multiple expressions
-- Matches in the order they are written (can match constructors too, e.g. Sum(x1, x2))
-- You can have the match inside methods of the class/object itself, e.g. this match { ... } -- see the sketch below.
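A minimal sketch of the Expr/Number/Sum evaluation with case classes and pattern matching (close to the lecture's example, written from memory):

    trait Expr
    case class Number(n: Int) extends Expr
    case class Sum(e1: Expr, e2: Expr) extends Expr

    def eval(e: Expr): Int = e match {
      case Number(n)   => n                      // base case: a literal number
      case Sum(e1, e2) => eval(e1) + eval(e2)    // recursively evaluate both operands
    }

    eval(Sum(Number(1), Number(2)))   // 3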
It is not always good! Analyze the use cases and decide... (the Expression Problem)
4.7 Lists
Lists are immutable, and recursive
Similar to the Cons lists implemented before.
All lists are constructed from the empty list Nil using ::, called cons, e.g. fruit = "apples" :: ("banana" :: Nil); also List(x) == x :: Nil
Cons associates to the right -- A :: B :: C is interpreted as A :: (B :: C)
Introduces Var ---> for signals that can be updated/changed (the update method, which is also defined for Arrays).
Important -- defining a signal in terms of another, e.g. Signal A = f(Signal B) -- the relation is maintained going forward (forever in the future!)
Goes over the BankAccount example.
Balance becomes a Signal[Int]
A signal cannot be a function of itself!! balance() = balance() * amount IS WRONG!!
Instead get the value of balance in a val and then do it
IMPORTANT!
Diff between b = b + 1 and b() = b() + 1 -- the latter DOESN'T MAKE SENSE!, as the first one updates a value once, while the second defines a relation that must hold for eternity.
Lecture 4.3 - A simple FRP implementation
Every signal has (1) its current value, (2) the current expression defining it, and (3) all of its observers that need to be re-evaluated if the signal's value changes.
Problem: how to know on whose behalf a signal expression is being evaluated.
Need to track callers in a stack-like fashion -- defines a class for stackable variables.
The type Signal[_] means it can hold signals of any type.
NoSignal -- a sentinel signal.
Adds the current caller to the signal
A signal cannot observe itself (duh) -- Catches the cyclic dependency
computeValue() does the propagation
NoSignal's computeValue does nothing.
Problems with parallelism!!
Going to thread-local state -- DynamicVariable
Alternative: pass the current value of the caller along as an implicit parameter
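A condensed sketch of this implementation (reconstructed from memory; the real course code differs in details):

    import scala.util.DynamicVariable

    class Signal[T](expr: => T) {
      import Signal._
      private var myExpr: () => T = _
      private var myValue: T = _
      private var observers: Set[Signal[_]] = Set()
      update(expr)

      protected def update(expr: => T): Unit = {
        myExpr = () => expr
        computeValue()
      }

      protected def computeValue(): Unit = {
        // evaluate the defining expression "on behalf of" this signal
        val newValue = caller.withValue(this)(myExpr())
        if (myValue != newValue) {
          myValue = newValue
          val obs = observers
          observers = Set()
          obs.foreach(_.computeValue())   // propagate the change to observers
        }
      }

      def apply(): T = {
        observers += caller.value          // remember who depends on us
        assert(!caller.value.observers.contains(this), "cyclic signal definition")
        myValue
      }
    }

    object NoSignal extends Signal[Nothing](???) {
      override def computeValue(): Unit = ()   // sentinel: nothing to compute
    }

    class Var[T](expr: => T) extends Signal[T](expr) {
      override def update(expr: => T): Unit = super.update(expr)
    }

    object Signal {
      private val caller = new DynamicVariable[Signal[_]](NoSignal)
      def apply[T](expr: => T) = new Signal(expr)
    }
    object Var { def apply[T](expr: => T) = new Var(expr) }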
Lecture 4.4 - Latency as an Effect 1
Introduces the monad Future (Computations can take time!!!)
Reading from memory takes time, and sending a packet to Europe takes time!
Converting into a human time scale that makes sense --> on this scale, sending a packet to Europe takes 5 years!
Do not have to wait for a computation -- use a callback -- mind blown!!
Lecture 4.5 - Latency as an Effect 2
A Monad -- Future
Handles both the idea that computations can fail, and can also take time.
Function onComplete --> think of it as an envelope given to the calculator and asking him to find the result, fill the envelope and return it back to us.
Ignoring implicit execution context for now.
The callback function can match on Success or Failure
Can have an onComplete that takes two callbacks (one for success, one for failure)
readFromMemory returns Future[Array[Byte]]
Future's apply returns a future that computes its body asynchronously (Future.apply just expects a body)
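A small sketch with the standard scala.concurrent API (the body of readFromMemory here is a placeholder; the name comes from the lecture):

    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}
    import ExecutionContext.Implicits.global   // the implicit execution context we are ignoring

    def readFromMemory(): Future[Array[Byte]] =
      Future {                                  // Future.apply runs the body asynchronously
        Array[Byte](1, 2, 3)                    // placeholder data
      }

    readFromMemory().onComplete {
      case Success(bytes) => println(s"read ${bytes.length} bytes")
      case Failure(t)     => println(s"failed: ${t.getMessage}")
    }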
Lecture 4.6 - Combinators on Futures 1
Functions on Future... all the usual combinators such as flatMap etc., along with recoverWith
flatMap cleans up the code and makes it nicer -- Very nice examples!
zip combines two futures (not useful here, as zip fails when either of the futures being zipped fails)
recover and recoverWith
Lecture 4.7 - Combinators on Futures 2
Writes a better method fallBackTo --> expressed in terms of recoverWith
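Roughly what the lecture's fallBackTo looks like (a sketch from memory; the standard library also ships a fallbackTo combinator with similar behavior):

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // "Try this future; if it fails, try that; if that also fails, report this one's error."
    def fallbackTo[T](self: Future[T], that: => Future[T]): Future[T] =
      self.recoverWith {
        case _ => that.recoverWith { case _ => self }
      }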
Introduces blocking and the Awaitable trait, but blocking IS NOT RECOMMENDED!!
You should NEVER EVER BLOCK in production!!
Lecture 4.8 - Composing Futures 1
for comprehensions (remember brightroll?!!)
Defines retry in terms of fallBackTo and recursion
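A sketch of retry (the lecture expresses it via its own by-name fallBackTo; here written with recoverWith so the next attempt only starts after the previous one fails):

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    def retry[T](noTimes: Int)(block: => Future[T]): Future[T] =
      if (noTimes == 0)
        Future.failed(new Exception("sorry"))                    // out of attempts
      else
        block.recoverWith { case _ => retry(noTimes - 1)(block) } // retry remaining attempts on failure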
Lecture 4.9 - Implementation of flatMap on Future
Implementation of flatMap in terms of CallBack
Lecture 4.10 - Composing Futures 2
Reminds us of foldLeft and foldRight
Implements retry in terms of foldLeft and foldRight
Sometimes straight recursion is much simpler to understand!
Many calculations can be done in parallel, at the same time
Amdahl!!
After a while, hit a power wall
Just increasing CPU frequency does not help, so added multi-cores!
Parallel programming is HARDER than sequential
But we need to do it for SPEEDUP (No other reason)
Parallel vs Concurrency
Parallel -- Uses Parallel hardware for more efficiency (eg. matrix multiplication, computer graphics) - speedup
Concurrent -- Improves modularity (eg. async apps such as web servers, databases) -- convenience
Both of these can overlap
Different granularities of Parallelism
Bit-level parallelism, instruction-level parallelism, and task-level parallelism
Different types of hardware -- Multi core processors, Symmetric multiprocessors, GPU, FPGA, Computer Clusters
1.2 Parallelism on the JVM
What is an operating system? -- Software that manages hardware and software resources, and schedules program execution
Process - an instance of a program executing in the OS (a process has a PID) -- the most coarse-grained unit in parallel programming
Multitasking -- processes are isolated from each other and each owns its own memory area
Each process can have multiple threads (threads share the process's memory space) -- they can be started within the same program and can exchange information with one another
A thread has a program counter and a program stack
JVM process starts with main thread -- in a sequential program, only the main thread is used.
Goes over the Thread subclass
Goes over an example where the outputs can be different (threads are executed in parallel!)
To ensure a sequence of statements is executed atomically (as one unit, without interleaving), use the synchronized construct! -- at most one thread at a time can hold a monitor
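A minimal sketch of synchronized (a toy example, not the course's exact code):

    // Without `synchronized`, two threads could interleave the read and write
    // of `count` and lose updates.
    class Counter {
      private var count = 0
      def increment(): Unit = this.synchronized {   // at most one thread holds this monitor
        count += 1
      }
      def get: Int = this.synchronized { count }
    }

    val c = new Counter
    val threads = (1 to 2).map { _ =>
      new Thread {
        override def run(): Unit = for (_ <- 1 to 100000) c.increment()
      }
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(c.get)   // always 200000 thanks to synchronization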
1.3 Parallelism on the JVM II
Synchronizations can be nested! -- gives the example of a bank transfer -- locks on both the source and the target account
Excellent example: one thread transfers from a1 to a2 while another transfers from a2 to a1 --> DEADLOCK!
Resolving deadlocks
Acquire locks in the same order!
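A sketch of the fix, ordering lock acquisition by a unique account id (the uid field here is a hypothetical stand-in for the lecture's unique-id idea):

    class Account(val uid: Long, private var balance: Int) {
      private def lockAndTransfer(target: Account, amount: Int): Unit =
        this.synchronized {
          target.synchronized {
            this.balance -= amount
            target.balance += amount
          }
        }

      def transfer(target: Account, amount: Int): Unit =
        // Always acquire the lock of the account with the smaller uid first,
        // so two opposite transfers can never deadlock.
        if (this.uid < target.uid) this.lockAndTransfer(target, amount)
        else target.lockAndTransfer(this, -amount)   // reversed roles, negated amount
    }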
Memory model
How threads interact when accessing shared memory
Two threads writing to different memory locations --> no need to synchronize!
The thread join --> the "observing writes" rule: a thread that joins another is guaranteed to see all of that thread's writes after join returns.
1.4 Running Computations in Parallel
pNorm example
Uses recursion!!
MIND BLOWN! --> Need to use call by name! (need to pass unevaluated computations)
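Why call-by-name matters: a parallel construct must receive the two computations unevaluated, otherwise both would run sequentially in the caller before any thread is spawned. A toy sketch with plain threads (not the course's library):

    // Toy `parallel`: note the by-name (=> A) parameters.
    def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
      var a: Option[A] = None
      val t = new Thread {
        override def run(): Unit = { a = Some(taskA) }   // evaluate taskA on the new thread
      }
      t.start()
      val b = taskB      // evaluate taskB on the current thread
      t.join()           // join guarantees we see the write to `a`
      (a.get, b)
    }

    // e.g. parallel(sumSegment(xs, 0, m), sumSegment(xs, m, xs.length))
    // (sumSegment being a hypothetical helper like the one in the p-norm example)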
Memory contention!! -- speedup depends not only on the number of cores but also on memory bandwidth
Also, you are only as fast as your slowest thread..
1.5 Estimate pi -- monte carlo
Estimate pi by randomly sampling points and checking whether they fall within the circle inscribed in the square
1.6 First class tasks
Using the construct task instead of parallel.
Defines Task interface that takes in a computation as an argument
Breaks the p-Norm implementation into tasks
Defines parallel construct in terms of tasks
1.7 How Fast are Parallel Programs?
Empirical measurement vs asymptotic analysis (worst case run time)
Asymptotic analysis of sequential running time and parallel running time on the sum-segment example
Parallel program time depends on the number of resources --> introduces Work (total running time when everything is sequential) and Depth (running time assuming infinite parallelism)
Can never do better than Depth (as that is the case when parallelism is infinite)
Amdahl's law returns!! -- the law of parallelism
1.8 Benchmarking Parallel Programs
Benchmarking vs Testing
Testing is binary, benchmarking gives a continuous value
Just timing is not enough -- make sure to take care of things such as warm-up and eliminating outliers, and account for effects such as GC, JIT compilation, and aggressive optimizations
Enter ScalaMeter
Be wary of GC! -- runtimes can vary across runs
JVM may apply further optimizations -- ScalaMeter has withWarmer -- awesome!! -- Nice intro to ScalaMeter; it can also measure memory footprint
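A sketch of the ScalaMeter usage shown in the lectures (API from memory; check the ScalaMeter docs for the exact keys):

    import org.scalameter._

    val time = config(
      Key.exec.minWarmupRuns -> 20,    // let the JIT warm up before measuring
      Key.exec.maxWarmupRuns -> 60,
      Key.verbose -> false
    ).withWarmer(new Warmer.Default)
     .measure {
       (0 until 1000000).toArray       // the code being benchmarked
     }
    println(s"array initialization time: $time")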
Assignment -- Nice assignment on blurring in photoshop
Week 2: Basic Task Parallel Algorithms
2.1 Parallel Sorting
Uses an intermediate array -- breaks down sorting into smaller array sizes
merge and the copy functions
parMergeSort vs quickSort -- Uses ScalaMeter
2.2 Data Operations and Parallel Mapping
Gives examples of operations on collections (fold, scan, map etc.)
List is NOT good (cannot find the middle easily) --> use Array or Tree
The naive parallel implementation on lists (apply f to the first element, cons onto the processed rest of the list) is still O(n)!!
So we look at Array
Breaks into parallel implementation -- (1) Writes need to be disjoint (2) Threshold for sequential needs to be large enough to have efficiency!
Goes over implementation of Tree
Leaves store the elements (as small arrays), and nodes store the number of elements below them
Compare Arrays vs Trees --> with trees there is less worry about disjointness of writes
2.3 Parallel Fold (Reduce) Operation
Goes over differences between foldLeft, foldRight, reduceLeft, reduceRight
Parallelism can be done ONLY on associative operations --- as we want to split into parts as we wish
Each application of an associative operation can be shown as a tree --> leaves are the values and internal nodes are the operations.
To reason about the ordering of elements in a tree, convert it to a list (toList).
Tree rotation preserves the result.
For arrays reduction, can convert into tree
Gives reduce and reduceSeg function. (reduce is done in a sequential fashion)
2.4 Associativity 1
Compares associativity with commutativity
Gives examples of operations that are both associative and commutative
Examples where it is associative but NOT commutative
String concat
Matrix multiplication
Example of where it is the opposite case
Should NOT LOSE associativity
Floating point operations are not necessarily associative! (Great examples!!) -- due to precision considerations
Trick to make any function commutative! --> always pass the "lesser" argument first (by defining a less comparison on the arguments)
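The trick, roughly (a sketch; less is any total ordering on the argument type):

    // Make any binary function commutative by always passing arguments in a canonical order.
    def makeCommutative[A, B](f: (A, A) => B)(less: (A, A) => Boolean): (A, A) => B =
      (x, y) => if (less(y, x)) f(y, x) else f(x, y)

    // g(x, y) == g(y, x) for all x, y; no such generic trick exists for associativity.
    val g = makeCommutative[Int, Int](_ - _)(_ < _)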
2.5 Associativity II
Gives an example of associativity for tuples
Example of average using two reduces --> one for sum and other for length
Gives the case when commutativity also implies associativity (given the rotation condition)
Gives two examples of applying this theorem
Floating-point numbers -- BE CAREFUL!!
Sets example (Closure operation -- A Closure B)
2.6 Scan
scan is like a combo of map and fold
HOW to PARALLELIZE?!!
each value depends on the previous value!
Do more work (store the intermediate results) --> parallelize, totally worth it.
Have trees that store intermediate values
Think of summing up trees from leaf to the head, storing results in an intermediate fashion as you go up.
upsweep and downsweep functions to the rescue
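A sketch of upsweep/downsweep on trees (close to the lecture, written from memory; in the course the two recursive calls on each Node are wrapped in parallel(...)):

    sealed trait Tree[A]
    case class Leaf[A](value: A) extends Tree[A]
    case class Node[A](left: Tree[A], right: Tree[A]) extends Tree[A]

    // Result tree: like Tree, but every node also stores the reduction of its subtree.
    sealed trait TreeRes[A] { val res: A }
    case class LeafRes[A](res: A) extends TreeRes[A]
    case class NodeRes[A](left: TreeRes[A], res: A, right: TreeRes[A]) extends TreeRes[A]

    // Upsweep: reduce bottom-up, keeping all intermediate results.
    def upsweep[A](t: Tree[A], f: (A, A) => A): TreeRes[A] = t match {
      case Leaf(v) => LeafRes(v)
      case Node(l, r) =>
        val (tl, tr) = (upsweep(l, f), upsweep(r, f))   // parallel(...) in the course
        NodeRes(tl, f(tl.res, tr.res), tr)
    }

    // Downsweep: push the prefix computed so far (a0) down to the leaves.
    def downsweep[A](t: TreeRes[A], a0: A, f: (A, A) => A): Tree[A] = t match {
      case LeafRes(a) => Leaf(f(a0, a))
      case NodeRes(l, _, r) =>
        val (tl, tr) = (downsweep(l, a0, f), downsweep(r, f(a0, l.res), f))   // parallel(...)
        Node(tl, tr)
    }

(scanLeft then prepends a0 to the downsweep result.)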
Application to arrays --> store the Segment of array in the non-leaf nodes
Assignment -- Parallel programming
Count change problem -- how to divide stuff if we do not know the workload?!!
Counting change is a canonical example of a task-parallel problem in which partitioning the workload across processors is solution-driven -- to know how to optimally partition the work, we would first need to solve the problem itself.
The synchronization costs of spawning a task for every recursive call are way too high. Instead, we need to agglomerate parts of the computation. We do this by falling back to the sequential countChange below a threshold.
Speedup is measured using ScalaMeter -- letting it warm up and run GC first
Nice line of sight problem
Week 3: Data-parallelism
3.1 Data Parallel Programming
Task parallelization --> a form of parallelization that distributes execution across computing nodes
Data parallelization --> distributes data across nodes
Parallel for loop
Introduces the parallel for loop --> MUST BE CAREFUL of adding .par to the loop
Mandelbrot set --> keeps iterating the complex sequence until the absolute value diverges towards infinity (or an iteration limit is reached)
Goes over the image array in parallel.
Compares data parallelism for this problem, vs task parallelism --> Why?!! -- compares workload for different parts of the data
The task of balancing the workload is managed by the scheduler -- we do not have to care about dividing work equally; it is done for us..
3.2 Data Parallel Operations
Some methods like filter can be done in par.
But foldLeft CANNOT be done --> great lego example!! -- foldLeft depends on the previous part --> so MUST be sequential
But fold can be done --> it is like a reduction tree
3.3 Data Parallel Operations II
Uses fold to calculate max using Int.MinValue as the neutral element
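E.g. (a sketch; on Scala 2.13+ the .par conversion needs the separate scala-parallel-collections module):

    import scala.collection.parallel.CollectionConverters._

    val xs = Array(4, 1, 7, 3)
    // Int.MinValue is the neutral element for max, so fold can group partial results any way it likes.
    val maxValue = xs.par.fold(Int.MinValue)(math.max)   // 7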
Rock, paper and scissors example!
Play is NOT associative --> it is commutative
For fold to work --> the following two conditions must hold, making the operator f (with the neutral element z) a MONOID
f(a, f(b,c)) == f(f(a,b),c) -- ASSOCIATIVE
f(z,a) == f(a,z) == a -- z acts as an identity (neutral) element
Commutativity does not matter
For fold -- the neutral element and the result must have the same type as the collection's elements!!!
In comes aggregate --> a combination of fold and foldLeft
    z  A    z  A    z  A    z  A
     \ /     \ /     \ /     \ /      seqop (per chunk)
      B       B       B       B
       \     /         \     /
        \   /           \   /         combop
          B               B
           \             /
            \           /
                 B                    combop
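So aggregate takes a neutral element, a sequential operator applied within each chunk, and a combine operator for the partial results. A sketch (same .par import caveat as above):

    import scala.collection.parallel.CollectionConverters._

    val words = Array("the", "quick", "brown", "fox").par
    // Count total characters: seqop folds elements into an Int per chunk,
    // combop merges the per-chunk Ints.
    val totalChars = words.aggregate(0)((acc, w) => acc + w.length, _ + _)   // 16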
K-means is an example of a bulk synchronous parallel algorithm (BSP). BSP algorithms are composed
from a sequence of supersteps, each of which contains:
parallel computation, in which processes independently perform
local computations and produce some values;
communication, in which processes exchange data;
barrier synchronisation, during which processes wait until every process finishes
The recursive call should be the last thing that the function does (no operation after it!) -- i.e. it must be tail-recursive.
Assignment tries different choices of initial means selection and compares them
Week 4: Data Structures for parallel computing
4.1 - Implementing Combiners
Transformer methods --> like filter, map etc (create new collection)
Scala Builder for sequential transformer operations
For parallel collections, we use a Combiner (a Builder with a combine method: combine this with that)
Combine is different for different collections (for Set and Map it is a union)
Combine MUST be efficient!! Must execute in O(log n + log m) time
To gain any speedup, the operation must be logarithmic!!
Combiner
Gives a bad implementation of Combiner --> O(n)
Arrays CANNOT be combined efficiently --> as they are contiguous
Compares runtime for various Data Structures.
Well... implementing a combiner is NOT trivial!
4.2 Parallel Two-phase Construction
Construct data structures in parallel using the two-phase construction (Use an intermediate data structure)
The intermediate data structure must have an efficient combine method, an efficient += method, and must be convertible into the resulting data structure in O(n/P) time
Many intermediate data structures are created and combined, and then the result is converted into the final data structure in parallel
Uses nested array buffers -- combine then just copies references!
The result method to copy back into the array in a parallel fashion
Can extend to other data structures, need to choose an efficient intermediate data structure
Intermediate data structures MUST be non-overlapping
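A condensed sketch of the two-phase idea for arrays (simplified from the lecture; combine is cheap because it only moves buffer references, and the single O(n) copy happens in the result step, which the course then parallelizes):

    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag

    class ArrayCombiner[T: ClassTag] {
      private val buffers = new ArrayBuffer[ArrayBuffer[T]]
      buffers += new ArrayBuffer[T]
      private var numElems = 0

      def +=(elem: T): this.type = {
        buffers.last += elem            // phase 1: append into the current buffer
        numElems += 1
        this
      }

      // O(number of buffers): just concatenates the lists of buffer references.
      def combine(that: ArrayCombiner[T]): ArrayCombiner[T] = {
        buffers ++= that.buffers
        numElems += that.numElems
        this
      }

      // Phase 2: one O(n) copy into the final array (done in parallel in the course).
      def result(): Array[T] = {
        val array = new Array[T](numElems)
        var i = 0
        for (buf <- buffers; elem <- buf) { array(i) = elem; i += 1 }
        array
      }
    }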
4.3 Conc-tree Data Structure
Conc-tree: a parallel counterpart of the cons list
Lists are built for sequential computations, trees can be done in parallel.
Trees are ONLY good for parallelism if they are balanced
Enter Conc trees! Contains level and size info at each node
An invariant -- the level (height) difference between the left and right subtrees of a node cannot be greater than one
<> to concat trees -- concat makes sure that the tree balance invariant is maintained
Goes over cases of concatenating trees in various unbalanced situations.