- Spark manipulates input data as RDDs, which are basically distributed datasets.
- RDD transformations (map, etc.) are lazy: they only build a roadmap of operations to apply over the dataset, nothing is computed yet.
- RDD actions trigger evaluation of the pending transformations in order to compute and return a result.
- RDD transformations are re-evaluated on each action by default, unless you cache the RDD (see the sketch below).
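A minimal sketch of the lazy-transformation / action / cache behaviour described above, assuming the 0.x-era Spark API these notes use (spark.SparkContext, a local master); names and values are illustrative:

```scala
import spark.SparkContext
import spark.SparkContext._

object LazyDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "LazyDemo")

    val numbers = sc.parallelize(1 to 1000)
    // Transformation: nothing runs yet, this only records the "roadmap"
    val squares = numbers.map(n => n * n)

    // Without cache(), every action below would re-evaluate the map from scratch
    squares.cache()

    // Actions: these trigger the actual evaluation and return results
    val sum   = squares.reduce(_ + _)
    val count = squares.count()
    println("sum=" + sum + " count=" + count)
  }
}
```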
tips
- Working with RDD objects, you might run into weird errors such as: "value reduceByKey is not a member of spark.RDD[MyEncapsulatedType]". This is raised whenever Spark can't convert your RDD into a type compatible with the operation you're requesting. Worry not, just add
import spark.SparkContext._
which brings the implicit conversions into scope and handles it for you (see the sketch below)
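For instance, a word count only compiles thanks to that import, which (in the 0.x API assumed here) pulls in the implicit conversion from RDD[(K, V)] to the pair-RDD functions:

```scala
import spark.SparkContext
import spark.SparkContext._   // implicit conversions: gives RDD[(K, V)] its reduceByKey

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount")
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Without the SparkContext._ import this line fails with
    // "value reduceByKey is not a member of spark.RDD[(String, Int)]"
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.collect().foreach(println)
  }
}
```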
- The best way I found to use spark as a framework is to link against it and use it as the docs say to, then compile, package, and run the application through the spark sbt util
- SparkContext takes a Seq of jars as its last argument, which is basically your application jars; so don't forget to package your app before running it (see the sketch below)
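A sketch of that constructor, assuming the 0.x signature (master, job name, spark home, jars); the jar path is hypothetical and depends on your sbt layout:

```scala
import spark.SparkContext
import spark.SparkContext._

object MyApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      "local",                             // or your mesos master
      "MyApp",                             // job name
      System.getenv("SPARK_HOME"),         // spark home on the driver machine
      Seq("target/scala-2.9.2/myapp_2.9.2-0.1.jar"))  // your packaged application jars, shipped to the workers

    // ... build RDDs with sc here
  }
}
```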
- The typical way to run an app which embeds spark is:
sbt compile:package::run
- Except for actions, which return plain values to the driver, RDD operations always return an RDD or a subclass of RDD
- Spark can read a resource from the local fs, HDFS, or S3; just use the correct URI scheme (s3n://, hdfs://, and so on) when loading it through the SparkContext (see the sketch below)
- Spark supports working over compressed files (.gz format tested so far)
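A sketch of reading from the three filesystems, again assuming the 0.x API; the paths, the namenode address, and the bucket name are made up:

```scala
import spark.SparkContext

object ReadDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "ReadDemo")

    // The URI scheme selects the filesystem (paths below are hypothetical)
    val localFile = sc.textFile("/data/logs/access.log")
    val hdfsFile  = sc.textFile("hdfs://namenode:9000/logs/access.log")
    val s3File    = sc.textFile("s3n://my-bucket/logs/access.log.gz")  // .gz is decompressed transparently

    println(localFile.count() + hdfsFile.count() + s3File.count())
  }
}
```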
- Spark-ec2 is awesome: it lets you create a fully configured spark/mesos/hadoop/hdfs cluster in just about one or two commands
- You can start, stop, or destroy the cluster at any time. To do it, install spark locally and use the spark-ec2 script like so:
./spark-ec2 -k NAMEOFKEY -i /path/to/identityfile start|stop|destroy NAMEOFTHECLUSTER
- Ensure MESOS_NATIVE_LIBRARY is exported on the machine which will run your spark application
- Ensure that when you set up a SparkContext to use mesos, HOST is replaced with the mesos master in: http://HOST:5050
- To update the spark config, update /root/spark/conf/spark-env.sh
- To spread your spark config through the nodes, use `/root/mesos-ec2/copy-dir /root/spark/`
- Generally, TASK_STAGING means a task that has failed
- Generally, if your task history stays stuck on "logging to STDERR", you've probably not supplied the correct IP for your mesos master to the SparkContext
- catching (from scala.util.control.Exception) is useful to apply the Option pattern to a call that may throw: .opt returns an Option, Some if there was no exception, None if there was. Example:
catching(classOf[NumberFormatException]).opt(Resource(line.split("\t")))
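A self-contained sketch of the same pattern, with a hypothetical Record type standing in for Resource:

```scala
import scala.util.control.Exception.catching

object ParseDemo {
  // Hypothetical record type standing in for the Resource case class above
  case class Record(id: Int, name: String)

  def parse(line: String): Option[Record] = {
    val fields = line.split("\t")
    // Some(Record(...)) if the id parses, None if a NumberFormatException is thrown
    catching(classOf[NumberFormatException]).opt(Record(fields(0).toInt, fields(1)))
  }

  def main(args: Array[String]) {
    println(parse("42\tfoo"))    // Some(Record(42,foo))
    println(parse("oops\tfoo"))  // None
  }
}
```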
- You can pattern match on types, and so on Some/None, like:
.filter { resource => resource match { case Some(r) => true case None => false } }
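A tiny self-contained illustration on plain collections (the Options here stand in for parsed RDD elements):

```scala
val parsed: Seq[Option[Int]] = Seq(Some(1), None, Some(3))

// Pattern matching on Some/None inside filter
val onlyDefined = parsed.filter {
  case Some(_) => true   // keep values that parsed
  case None    => false  // drop the failures
}
// onlyDefined == Seq(Some(1), Some(3))

// Equivalent, a bit shorter, and also unwraps the values: Seq(1, 3)
val unwrapped = parsed.flatten
```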
- Existential types are quite useful when you want to use a generic type in a signature without explicitly binding it. They are written with the underscore symbol. For example, an existential type is used in the following to tell the compiler that T must be upper-bounded by RDD[nomatterwhattype]:
class View[T <: RDD[_]](input: T)
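A self-contained variant using plain Seq instead of RDD, just to show the bound compiles with any element type:

```scala
// T must be some kind of Seq, but its element type is left open
class View[T <: Seq[_]](val input: T) {
  def size: Int = input.length
}

val v1 = new View(List(1, 2, 3))      // T = List[Int]
val v2 = new View(Vector("a", "b"))   // T = Vector[String]; the element type doesn't matter
```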
- Tuples are 1-indexed. I repeat, tuple indexing starts at 1, not 0!!
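Quick reminder in code:

```scala
val pair = ("spark", 2012)
pair._1   // "spark" -- the first element is _1, not _0
pair._2   // 2012
// pair._0 does not compile: there is no _0 accessor
```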