(a moderately, well, shameless rework of Holden Karau's "Scala - Crash Course")
Scala is a multi-paradigm high-level language for the JVM.
It offers the ability to use both Object-oriented & Functional approaches.
Scala is statically typed, and type inference removes the need for most explicit type declarations. It is intended to work alongside Java code: it can use any Java class (and inherit from it, etc.), and it can be called from Java classes.
To follow examples in Spark, it helps if you have an understanding of the following concepts:
- variables
- functions
- closures
- Scala Collections API
- tuples & case classes
We can explore Scala interactively through sbt console or dse spark.
We'll refer to this as the REPL (which stands for Read-Evaluate-Print Loop).
In Java, defining a primitive int and an immutable, read-only reference to a String looks like:
int x = 7;
final String y = "hello";
In Scala, we'd do the following:
scala> var x: Int = 7
x: Int = 7
scala> val y: String = "hello"
y: String = hello
As we mentioned, we don't always need to declare the types; they can be inferred:
scala> var x = 7
x: Int = 7
scala> val y = "hello"
y: String = hello
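To make the difference between var and val concrete, here is a minimal sketch. The object name VarValSketch and the names counter and greeting are made up for illustration; the point is only that a var can be reassigned while a val cannot.

```scala
object VarValSketch {
  // `var` declares a reassignable variable.
  var counter = 7
  counter = counter + 1 // allowed: counter is now 8

  // `val` declares a read-only reference, similar to `final` in Java.
  val greeting = "hello"
  // greeting = "goodbye" // would not compile: reassignment to val
}
```

Pasting this into the REPL and evaluating VarValSketch.counter should show the reassigned value.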
In Java, we might define a static function that squares a primitive integer like so:
...
public static int square(int x) {
    return x*x;
}
...
In Scala, a named one-line function can be defined simply as:
scala> def square(x: Int): Int = x*x
square: (x: Int)Int
scala> square(3)
res0: Int = 9
And we can use a code block to define the body of the function too:
scala> def square(x: Int): Int = {
| x*x
| }
square: (x: Int)Int
scala> square(3)
res1: Int = 9
The pipes, |, are added by the REPL to show that the definition continues on the next line.
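As a small extension of the two forms above, the following sketch shows a block-bodied function with more than one parameter. The wrapper object FunctionSketch and the helper sumOfSquares are hypothetical names chosen for this example.

```scala
object FunctionSketch {
  // One-line form, as in the text above.
  def square(x: Int): Int = x * x

  // Block form with two parameters (hypothetical helper).
  // The value of the last expression in the block is the result.
  def sumOfSquares(a: Int, b: Int): Int = {
    val aSquared = square(a)
    val bSquared = square(b)
    aSquared + bSquared
  }
}
```

Each parameter needs its own type annotation, while the intermediate vals inside the block are inferred.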
We might find it handy to print variables as text, and maybe even define a function to help.
In Java, we'd do:
...
void announce(String text) {
    System.out.println(text);
}
...
In Scala, we'd do:
def announce(text: String): Unit = {
  println(text)
}
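When printing variables, Scala's string interpolation (the s"..." prefix) can be handy for building the text. The sketch below assumes a hypothetical helper named describe alongside announce; neither the helper nor the wrapper object AnnounceSketch comes from the original material.

```scala
object AnnounceSketch {
  // Builds the message separately so it can be inspected as well as printed.
  // `$name` and `$value` are replaced by the values of the parameters.
  def describe(name: String, value: Int): String = s"$name = $value"

  // Same function as in the text above, with an explicit Unit result type.
  def announce(text: String): Unit = println(text)
}
```

For example, AnnounceSketch.announce(AnnounceSketch.describe("x", 7)) prints the line x = 7.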
With Scala, we can define anonymous functions (also called lambdas or function literals). An anonymous function that captures variables from its enclosing scope is a closure.
We can do so with varying levels of ceremony, depending on type inference to help reduce declarations...
(x: Int) => x + 2 // full version, indicating the argument type
x => x + 2 // dropping the type; it can be determined via type inference
_ + 2 // 'placeholder' syntax; removes the need to declare 'x'
x => { // just like functions, the body can be a block of code
  val numberToAdd = 2
  x + numberToAdd
}
Note: As in some scripting languages, the value of the last expression in a block of code is returned.
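The shorter forms only compile where the compiler already knows the expected function type. As a sketch of how each form might be bound to a name, the following uses made-up val names inside a hypothetical LambdaSketch object, and adds one example that captures a variable from its surrounding scope.

```scala
object LambdaSketch {
  // Full version: the parameter type is given, so the function type is inferred.
  val addTwoFull = (x: Int) => x + 2

  // The declared type Int => Int lets the compiler infer the parameter type.
  val addTwoInferred: Int => Int = x => x + 2

  // Placeholder syntax: `_` stands for the single parameter.
  val addTwoPlaceholder: Int => Int = _ + 2

  // A closure: the function body captures `offset` from the enclosing scope.
  val offset = 10
  val addOffset: Int => Int = x => x + offset
}
```

All three addTwo variants behave identically; they differ only in how much the compiler is asked to infer.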
These closures are not far from regular functions. They're just missing the defined names:
x => { // just like functions, the body can be a block of code
  val numberToAdd = 2
  x + numberToAdd
}
def addTwo(x: Int): Int = {
  val numberToAdd = 2
  x + numberToAdd
}
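One practical consequence is that a named function can usually be passed wherever an anonymous one is expected. The sketch below assumes a hypothetical wrapper object and a small sample list to show both styles side by side.

```scala
object NamedVsAnonymousSketch {
  def addTwo(x: Int): Int = {
    val numberToAdd = 2
    x + numberToAdd
  }

  val numbers = List(1, 2, 3)

  // Passing an anonymous function.
  val viaLambda = numbers.map(x => x + 2)

  // Passing the named method; the compiler converts it to a function value.
  val viaMethod = numbers.map(addTwo)
}
```

Both results hold the same elements, so the choice is mostly about readability and reuse.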
The Scala Collections API was partly the inspiration for the API used by Spark. It helps to be familiar with the operations you'll find in Scala Collections (like foreach, map, filter, reduce, etc.). For a more thorough listing of functions, see the documentation for Seq.
scala> val lst = List(1, 2, 3)
lst: List[Int] = List(1, 2, 3)
scala> lst.foreach(x => println(x))
1
2
3
scala> lst.foreach(println)
1
2
3
scala> lst.map(x => x + 2)
res5: List[Int] = List(3, 4, 5)
scala> lst.map(_ + 2)
res6: List[Int] = List(3, 4, 5)
scala> lst.filter(x => x % 2 == 1)
res7: List[Int] = List(1, 3)
scala> lst.filter(_ % 2 == 1)
res8: List[Int] = List(1, 3)
scala> lst.reduce((x, y) => x + y)
res9: Int = 6
scala> lst.reduce(_ + _)
res10: Int = 6
scala> lst
res11: List[Int] = List(1, 2, 3)
Note: a new result was shown with each operation, but lst was not modified. All of these operations leave the list unchanged.
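Because each operation returns a new collection, the calls can be chained. The following is a minimal sketch using the same list; the wrapper object and the name sumOfOddSquares are illustrative only.

```scala
object CollectionsSketch {
  val lst = List(1, 2, 3)

  // Keep the odd values, square them, then sum them.
  val sumOfOddSquares = lst.filter(_ % 2 == 1).map(x => x * x).reduce(_ + _)

  // lst itself is untouched by the chain above.
}
```

The same filter, map, reduce shape shows up frequently in Spark examples, which is why it is worth getting comfortable with it here first.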
Scala provides a simple tuple definition: (val1, val2)
A case class in Scala is similar to a namedtuple in Python or a Java Bean or Plain Old Java Object (POJO). It offers a simple definition and gives an equality method and a formatted toString.
scala> val points = List((0.1,1.3),(1.1,2.2),(0.1,1.2))
points: List[(Double, Double)] = List((0.1,1.3), (1.1,2.2), (0.1,1.2))
scala> case class Point(x: Double, y: Double)
defined class Point
scala> val pt1 = Point(points(0)._1, points(0)._2)
pt1: Point = Point(0.1,1.3)
This shows that you can refer to fields in a tuple by position (x._1 or x._2).
If we have a tuple that also holds a z-coordinate, referring to the third value would be:
scala> val t = (0.1,1.3,4.2)
t: (Double, Double, Double) = (0.1,1.3,4.2)
scala> t._3
res2: Double = 4.2
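Positional accessors work, but a pattern can name the tuple fields directly, and the generated equality and toString of a case class can be checked the same way. This is a sketch under the assumption that the same Point definition and sample points are used; the wrapper object and the val names are invented for the example.

```scala
object PointSketch {
  case class Point(x: Double, y: Double)

  val points = List((0.1, 1.3), (1.1, 2.2), (0.1, 1.2))

  // Destructure each tuple with a pattern instead of using ._1 and ._2.
  val asPoints = points.map { case (x, y) => Point(x, y) }

  // Case classes compare by field values, not by reference.
  val sameValues = Point(0.1, 1.3) == asPoints.head

  // The generated toString includes the class name and the field values.
  val rendered = asPoints.head.toString
}
```

The pattern form tends to read better once tuples have more than two fields.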
It would be nice to have an example of .map(<blah>).flatten and .flatMap(<blah>). It might also be nice to take a string, split it, and use reduceByKey for a lo-fi word count.
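One possible sketch of those examples, using only plain Scala collections, follows. Note that reduceByKey is an operation on Spark pair RDDs, not part of the Scala Collections API, so groupBy stands in for it here. The sample lines, the wrapper object, and the val names are all assumptions made for illustration.

```scala
object FlatMapSketch {
  val lines = List("to be or", "not to be")

  // map produces one list of words per line; flatten concatenates them.
  val viaMapFlatten = lines.map(line => line.split(" ").toList).flatten

  // flatMap does both steps in one call.
  val viaFlatMap = lines.flatMap(line => line.split(" ").toList)

  // Lo-fi word count: group equal words, then count the size of each group.
  // (In Spark, mapping each word to (word, 1) and calling reduceByKey(_ + _)
  // would play a similar role.)
  val counts = viaFlatMap
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.size) }
}
```

Both viaMapFlatten and viaFlatMap yield the same flat list of words, and counts is a Map from each word to the number of times it appears.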