@oleiade
Created December 7, 2012 11:32
Spark thoughts

Spark

RDD

  • Spark manipulates input data as RDDs, which basically are distributed datasets.
  • RDD transformations (map, filter, …) are lazy: they only record a roadmap of operations to apply over the dataset; nothing is computed yet.
  • RDD actions evaluate the recorded transformations and reductions in order to generate and return the result.
  • RDD transformations are re-evaluated on each action by default, unless you cache the RDD (as sketched below).
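  • A minimal sketch of the laziness/caching behaviour, written against the 2012-era spark.* API used in these notes; the master string, paths and jar name are placeholders:

    import spark.SparkContext
    import spark.SparkContext._

    object LazinessDemo {
      def main(args: Array[String]) {
        // master, job name, Spark home and application jars (all placeholders here)
        val sc = new SparkContext("local", "laziness-demo", "/path/to/spark", Seq("target/demo.jar"))

        val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")

        // Transformation: nothing runs yet, Spark only records the lineage
        val lengths = lines.map(_.length)

        // Without this, the map above would be re-evaluated by each action below
        lengths.cache()

        // Actions: these trigger the actual evaluation
        println("total = " + lengths.reduce(_ + _))
        println("count = " + lengths.count())
      }
    }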

Tips

  • Working with RDD objects, you might hit weird errors such as: "value reduceByKey is not a member of spark.RDD[MyEncapsulatedType]". It is raised whenever Spark can't convert an RDD into a type compatible with the operation you're requesting. Worry not, and import spark.SparkContext._, which will handle it for you automatically (see the sketch below).
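  • A sketch of the fix on a pair RDD, reusing the same placeholder SparkContext arguments as above (the input file name is hypothetical):

    import spark.SparkContext
    // Brings the implicit conversions that make reduceByKey & co. available on pair RDDs
    import spark.SparkContext._

    object WordCount {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "wordcount", "/path/to/spark", Seq("target/wordcount.jar"))
        val words = sc.textFile("input.txt").flatMap(_.split("\\s+"))
        // Without the SparkContext._ import, this line fails with
        // "value reduceByKey is not a member of spark.RDD[(String, Int)]"
        val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
        counts.collect().foreach(println)
      }
    }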

Usage

  • The best way I found to use Spark as a framework is to link against it and use it as the documentation says, and to compile, package and run your application through the Spark sbt util.
  • SparkContext takes a Seq of jars as its last argument, which basically lists your application's jars. So don't forget to package your application before running it.
  • The typical way to run an app which embeds Spark is: sbt compile:package::run
  • Except for certain specific operations, RDD transformations always return an RDD or a subclass of RDD.
  • Spark can read a resource from the local fs, HDFS or S3; just use the correct URI scheme (s3n://, hdfs://, and so on) when pointing the SparkContext at the resource (see the sketch after this list).
  • Spark supports working over compressed files (.gz format tested so far).
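  • A sketch of the jar and URI-scheme points above; the bucket name, hostnames, paths and jar name are placeholders:

    import spark.SparkContext
    import spark.SparkContext._

    object Sources {
      def main(args: Array[String]) {
        // Last argument: the Seq of application jars to ship to the workers -- package first
        val sc = new SparkContext("local", "sources-demo", "/path/to/spark", Seq("target/app.jar"))

        // The URI scheme picks the storage backend
        val local = sc.textFile("/tmp/input.txt")
        val hdfs  = sc.textFile("hdfs://namenode:9000/data/input.txt")
        val s3    = sc.textFile("s3n://my-bucket/input.txt.gz") // .gz input is decompressed transparently

        println(local.count() + hdfs.count() + s3.count())
      }
    }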

Tools

  • spark-ec2 is awesome: it lets you create a fully configured Spark/Mesos/Hadoop/HDFS cluster in just one or two commands.
  • You can at any time start, stop or destroy the cluster. To do it, install Spark locally and use the spark-ec2 script like so: ./spark-ec2 -k NAMEOFKEY -i /path/to/identityfile start|stop|destroy NAMEOFTHECLUSTER

Through Mesos

  • Ensure MESOS_NATIVE_LIBRARY is exported on the machine which will run your Spark application.

  • Ensure that when you set up a SparkContext to use Mesos, HOST is replaced with the Mesos master's address in http://HOST:5050 (see the sketch after this list).

  • To update the Spark config, edit /root/spark/conf/spark-env.sh.

  • To spread your Spark config across the nodes, use `/root/mesos-ec2/copy-dir /root/spark/`.

  • Generally, a task stuck in TASK_STAGING means the task has failed.

  • Generally, if your task history keeps getting stuck on logging to STDERR, you've probably not supplied the correct IP of your Mesos master to the SparkContext.
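  • A sketch of pointing a SparkContext at a Mesos master; the mesos:// master URL form, the address and the jar name are assumptions based on the Spark-on-Mesos docs of this era, not taken from these notes:

    import spark.SparkContext
    import spark.SparkContext._

    object MesosApp {
      def main(args: Array[String]) {
        // MESOS_MASTER_IP is a placeholder for the real Mesos master address (port 5050);
        // MESOS_NATIVE_LIBRARY must be exported in the environment before this runs
        val sc = new SparkContext("mesos://MESOS_MASTER_IP:5050", "mesos-app",
                                  "/root/spark", Seq("target/app.jar"))
        println(sc.textFile("hdfs://namenode:9000/data/input.txt").count())
      }
    }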


Scala

  • catching (from scala.util.control.Exception) is useful to apply the Option pattern to a call: it returns an Option, Some if there was no exception, None if there was. Example: catching(classOf[NumberFormatException]).opt(Resource(line.split("\t")))

  • You can pattern match on types, and thus on Some/None, like:

    .filter {
        resource =>
          resource match {
            case Some(r)    => true
            case None       => false
          }
    }
  • Existential types are quite useful when it comes to using a generic type in a signature without explicitly binding its type parameter. They are written with the underscore symbol. For example, an existential type is used in the following to tell the compiler that T must be a subtype of RDD, whatever its element type:

      class View[T <: RDD[_]](input: T)
  • Tuples are 1-indexed, I repeat: tuple accessors start at ._1, not ._0! (see the sketch below)
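  • A small self-contained sketch tying the three Scala tips above together (the tab-separated sample lines are made up, and Resource from the example above is replaced by a plain tuple):

    import scala.util.control.Exception.catching

    object ScalaTips {
      def main(args: Array[String]) {
        val lines = Seq("foo\t1", "bar\toops", "baz\t3")

        // catching(...).opt wraps the call: Some(result) on success, None on exception
        val parsed = lines.map { line =>
          catching(classOf[NumberFormatException]).opt {
            val fields = line.split("\t")
            (fields(0), fields(1).toInt)
          }
        }

        // Pattern matching on Some/None to keep only the successful parses
        val ok = parsed.filter {
          case Some(_) => true
          case None    => false
        }.map(_.get)

        // Tuples are 1-indexed: ._1 is the first element, ._2 the second
        ok.foreach(t => println(t._1 + " -> " + t._2))
      }
    }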
