@oleiade
Created December 7, 2012 11:32
Spark thoughts

Spark

RDD

  • Spark manipulates input data as RDDs, which basically are distributed datasets.
  • RDD transformations (map, filter, …) are lazy: they only record a roadmap of operations to apply over the dataset; nothing is computed yet.
  • RDD actions evaluate the recorded transformations and reductions in order to generate and return the result.
  • RDD transformations are re-evaluated on each action by default, unless you cache the RDD (as sketched below).
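  • A minimal sketch of the laziness/caching behaviour, written against the 2012-era spark.* API used in these notes; the master string, paths and jar name are placeholders:

    import spark.SparkContext
    import spark.SparkContext._

    object LazinessDemo {
      def main(args: Array[String]) {
        // master, job name, Spark home and application jars (all placeholders here)
        val sc = new SparkContext("local", "laziness-demo", "/path/to/spark", Seq("target/demo.jar"))

        val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

        // Transformation: nothing runs yet, Spark only records the lineage
        val lengths = lines.map(_.length)

        // Without this, the map above would be re-evaluated by each action below
        lengths.cache()

        // Actions: these trigger the actual evaluation
        println("total = " + lengths.reduce(_ + _))
        println("count = " + lengths.count())
      }
    }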

Tips

  • Working with RDD objects, you might hit weird errors such as: "value reduceByKey is not a member of spark.RDD[MyEncapsulatedType]". It is raised whenever Spark can't convert an RDD into a type compatible with the operation you're requesting. Worry not, and import spark.SparkContext._, which will handle it for you automatically (see the sketch below).
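  • A sketch of the fix on a pair RDD, reusing the same placeholder SparkContext arguments as above (the input file name is hypothetical):

    import spark.SparkContext
    // Brings the implicit conversions that make reduceByKey & co. available on pair RDDs
    import spark.SparkContext._

    object WordCount {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "wordcount", "/path/to/spark", Seq("target/wordcount.jar"))
        val words = sc.textFile("input.txt").flatMap(_.split("\\s+"))
        // Without the SparkContext._ import, this line fails with
        // "value reduceByKey is not a member of spark.RDD[(String, Int)]"
        val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
        counts.collect().foreach(println)
      }
    }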

Usage

  • The best way I found to use Spark as a framework is to link against it and use it as the documentation says, and to compile, package and run your application through the Spark sbt util.
  • SparkContext takes a Seq of jars as its last argument, which basically lists your application's jars. So don't forget to package your application before running it.
  • The typical way to run an app which embeds Spark is: sbt compile:package::run
  • Except for certain specific operations, RDD transformations always return an RDD or a subclass of RDD.
  • Spark can read a resource from the local fs, HDFS or S3; just use the correct URI scheme (s3n://, hdfs://, and so on) when pointing the SparkContext at the resource (see the sketch after this list).
  • Spark supports working over compressed files (.gz format tested so far).
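  • A sketch of the jar and URI-scheme points above; the bucket name, hostnames, paths and jar name are placeholders:

    import spark.SparkContext
    import spark.SparkContext._

    object Sources {
      def main(args: Array[String]) {
        // Last argument: the Seq of application jars to ship to the workers -- package first
        val sc = new SparkContext("local", "sources-demo", "/path/to/spark", Seq("target/app.jar"))

        // The URI scheme picks the storage backend
        val local = sc.textFile("/tmp/input.txt")
        val hdfs  = sc.textFile("hdfs://namenode:9000/data/input.txt")
        val s3    = sc.textFile("s3n://my-bucket/input.txt.gz") // .gz input is decompressed transparently

        println(local.count() + hdfs.count() + s3.count())
      }
    }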

Tools

  • spark-ec2 is awesome: it lets you create a fully configured Spark/Mesos/Hadoop/HDFS cluster in just one or two commands.
  • You can at any time start, stop or destroy the cluster. To do it, install Spark locally and use the spark-ec2 script like so: ./spark-ec2 -k NAMEOFKEY -i /path/to/identityfile start|stop|destroy NAMEOFTHECLUSTER

Through Mesos

  • Ensure MESOS_NATIVE_LIBRARY is exported on the machine which will run your Spark application.

  • Ensure that when you set up a SparkContext to use Mesos, HOST is replaced with the Mesos master's address in http://HOST:5050 (see the sketch after this list).

  • To update the Spark config, edit /root/spark/conf/spark-env.sh.

  • To spread your Spark config across the nodes, use `/root/mesos-ec2/copy-dir /root/spark/`.

  • Generally, a task stuck in TASK_STAGING means the task has failed.

  • Generally, if your task history keeps getting stuck on logging to STDERR, you've probably not supplied the correct IP of your Mesos master to the SparkContext.
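  • A sketch of pointing a SparkContext at a Mesos master; the mesos:// master URL form, the address and the jar name are assumptions based on the Spark-on-Mesos docs of this era, not taken from these notes:

    import spark.SparkContext
    import spark.SparkContext._

    object MesosApp {
      def main(args: Array[String]) {
        // MESOS_MASTER_IP is a placeholder for the real Mesos master address (port 5050);
        // MESOS_NATIVE_LIBRARY must be exported in the environment before this runs
        val sc = new SparkContext("mesos://MESOS_MASTER_IP:5050", "mesos-app",
                                  "/root/spark", Seq("target/app.jar"))
        println(sc.textFile("hdfs://namenode:9000/data/input.txt").count())
      }
    }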


Scala

  • catching (from scala.util.control.Exception) is useful to apply the Option pattern to a call: it returns an Option, Some if there was no exception, None if there was. Example: catching(classOf[NumberFormatException]).opt(Resource(line.split("\t")))

  • You can pattern match on types, and thus on Some/None, like:

    .filter {
        resource =>
          resource match {
            case Some(r)    => true
            case None       => false
          }
    }
  • Existential types are quite useful when it comes to using a generic type in a signature without explicitly binding its type parameter. They are written with the underscore symbol. For example, an existential type is used in the following to tell the compiler that T must be a subtype of RDD, whatever its element type:

      class View[T <: RDD[_]](input: T)
  • Tuples are 1-indexed, I repeat: tuple accessors start at ._1, not ._0! (see the sketch below)
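  • A small self-contained sketch tying the three Scala tips above together (the tab-separated sample lines are made up, and Resource from the example above is replaced by a plain tuple):

    import scala.util.control.Exception.catching

    object ScalaTips {
      def main(args: Array[String]) {
        val lines = Seq("foo\t1", "bar\toops", "baz\t3")

        // catching(...).opt wraps the call: Some(result) on success, None on exception
        val parsed = lines.map { line =>
          catching(classOf[NumberFormatException]).opt {
            val fields = line.split("\t")
            (fields(0), fields(1).toInt)
          }
        }

        // Pattern matching on Some/None to keep only the successful parses
        val ok = parsed.filter {
          case Some(_) => true
          case None    => false
        }.map(_.get)

        // Tuples are 1-indexed: ._1 is the first element, ._2 the second
        ok.foreach(t => println(t._1 + " -> " + t._2))
      }
    }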
