Advantages of Dataset/DataFrame over RDDs from Spark 2.0
--------------------------------------------------------
1. Dataset's strong typing: checks happen at compile time.
   The strong typing is virtual - it is a typed view over the same 'table'-level datatypes,
   which makes the data portable across languages like SQL/Python/Scala.
   eg: val ds = sqlContext.range(3)   // Dataset[java.lang.Long]
       ds.as[String]                  // view the same data as a Dataset of a different element type
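   A minimal sketch of the difference this makes (assuming Spark 2.x with sqlContext in scope):
   element types of a Dataset are checked by the compiler, while DataFrame column names are only
   checked when the plan is analyzed.

   import sqlContext.implicits._
   val ds = sqlContext.range(3)       // Dataset[java.lang.Long]
   ds.map(_ + 1)                      // element type known: checked at compile time
   ds.toDF("id").select("idd")        // typo compiles, but throws AnalysisException at analysis time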
2. Existing business logic can be reused:
   case class Zip(zip: String, city: String, loc: Array[Double], pop: Long, state: String) {
     val latChicago = 41.87
     val lonChicago = -87.0
     // loc holds [longitude, latitude]; true when the zip is within ~1 degree of Chicago
     def nearChicago = {
       math.sqrt(math.pow(loc(0) - lonChicago, 2) + math.pow(loc(1) - latChicago, 2)) < 1
     }
   }
   val zips = sqlContext.read.json("fileName").withColumnRenamed("_id", "zip").as[Zip]
   zips.filter(_.nearChicago).show
   zips.groupBy('state).count.orderBy('state).show
3. Plan optimization (Catalyst)
   Every query goes through four stages:
   Parsed logical plan
   Analyzed logical plan
   Optimized logical plan
   Physical plan
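   A minimal sketch of inspecting these stages, reusing the zips Dataset from point 2;
   explain(true) prints all four plans:

   // prints the parsed, analyzed and optimized logical plans plus the physical plan
   zips.filter(_.nearChicago).explain(true)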
4. Whole-stage codegen
   Code is rearranged and executed in groups/stages: the operators inside a stage are fused
   into a single generated Java function.
   For example, the business logic function above becomes part of the code generated by Spark,
   instead of running as a function call on top of the underlying code.
   Use the debugCodegen function to see this:
   import org.apache.spark.sql.execution.debug._
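   A minimal sketch of calling it; the import above adds debugCodegen to Datasets via an implicit:

   // prints the Java source generated for each whole-stage codegen subtree
   zips.filter(_.nearChicago).debugCodegen()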
5. Hive integration comes for free
   display(spark.catalog.listDatabases)             // display is a Databricks notebook helper
   display(dbutils.fs.ls("hdfs hive table folder")) // dbutils is Databricks-specific
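   Outside Databricks, a minimal sketch of getting the same Hive-backed catalog
   (assuming a Hive metastore is reachable from the cluster):

   import org.apache.spark.sql.SparkSession

   // enableHiveSupport connects the session's catalog to the Hive metastore
   val spark = SparkSession.builder()
     .appName("hive-demo")
     .enableHiveSupport()
     .getOrCreate()

   spark.catalog.listDatabases.show()
   spark.sql("show tables").show()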
6. Structured Streaming is just an extension of the existing APIs
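   A minimal sketch of that symmetry, assuming JSON files with the zip schema land in an
   incoming directory (a hypothetical path); readStream/writeStream mirror the batch
   read/write calls:

   import org.apache.spark.sql.types._
   import spark.implicits._

   // streaming file sources need the schema declared up front
   val zipSchema = new StructType()
     .add("zip", StringType).add("city", StringType)
     .add("loc", ArrayType(DoubleType)).add("pop", LongType).add("state", StringType)

   // same DataFrame operations, but the source is an unbounded stream of files
   val query = spark.readStream.schema(zipSchema).json("path/to/incoming/")
     .groupBy('state).count()
     .writeStream.outputMode("complete").format("console").start()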
7. Dataset vs DataFrame
   case class Person(name: String, age: Int)
   val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
   import sqlContext.implicits._
   val personDF = sqlContext.createDataFrame(personRDD)
   val prdd = personDF.rdd                       // returns RDD[Row]
   val x = personDF.filter("age > 10")           // string expression works on a DataFrame
   // val x = personDF.filter(p => p.age > 10)   // will not compile: Row has no 'age' member
   val personDS = personDF.as[Person]
   val perRDD = personDS.rdd                     // returns RDD[Person]
   val y = personDS.filter(p => p.age > 10)      // compiles and runs: p is a Person