Skip to content

Instantly share code, notes, and snippets.

@jamesrajendran
Last active June 20, 2017 09:13
Show Gist options
  • Save jamesrajendran/2cb8e5ea0e858bfe33a3422d1f417695 to your computer and use it in GitHub Desktop.
Save jamesrajendran/2cb8e5ea0e858bfe33a3422d1f417695 to your computer and use it in GitHub Desktop.
Advantages of Dataset/Dataframe over RDDs from spark2.0
--------------------------------------------------------
1.Dataset's Strong Typing: compile check
Strong typing is virtual - at the 'table' level same datatype
Makes it portable across languages like sql/python/scala
eg:val ds = sqlContext.range(3)
ds.as[String]
2.Existing Business logic can be reused:
case class(zip:String, city:String, loc:Array[Double], pop.Long, state:String){
val latChicago = 41.87
val lonChicago = -87
def nearChicago = {
math.sqrt(math.pow(loc(8)-lonChicago, 2) + math.pow(loc(1)-latChicago,2)) <1
}
}
sqlContext.read.json("fileName").withColumnRenamed("_id","zip").as[Zip].filter(_.nearChicago).show
ds.groupBy('state).count.orderBy('state).show
3. Plan Optimization
Parsed plan
Logical plan
Optimized plan
Physical plan
4.Wholestage codegen
Code will be rearranged and executed in groups/stages
For example above business logic function will be part of the code generated by spark vs running it on top of the underlying code.
use debugCodegen function to see this
import org.apache.spark.sql.execution.debug._
5.Hive Integration is for Free
display(spark.catalog.listDatabases)
display(dbutils.fs.ls("hdfs hive table folder"))
6.Structured streaming is just an extension of existing APIs
7.Dataset vs DataFrame
case class Person(name : String , age : Int)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
import sqlContext.implicits._
val personDF = sqlContext.createDataFrame(personRDD)
prdd = personDF.rdd -- returns RDD[Row]
val x = personDF.filter("age >10")
val x = personDF.filter(p => p.age >10) --will fail
val personDS = personDF.as[Person]
val perRDD = personDS.rdd -- returns RDD[person]
val y = personDS.filter(p => p.age >10 ) -- will not fail
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment