jamesrajendran · June 20, 2017 09:13
diff --git a/SparkDatasetVsRDD b/SparkDatasetVsRDD
 Advantages of Dataset/Dataframe over RDDs from spark2.0
 --------------------------------------------------------
 1.Dataset's Strong Typing: compile check
 						 Strong typing is virtual - at the 'table' level same datatype 
 						 Makes it portable across languages like sql/python/scala
 						 eg:val ds = sqlContext.range(3)
 							ds.as[String]
 2.Existing Business logic can be reused:
 	case class(zip:String, city:String, loc:Array[Double], pop.Long, state:String){
 		val latChicago = 41.87
 		val lonChicago = -87
 		
 		def nearChicago = {
 		 math.sqrt(math.pow(loc(8)-lonChicago, 2) + math.pow(loc(1)-latChicago,2)) <1
 		}
 	
 	}
 	
 	sqlContext.read.json("fileName").withColumnRenamed("_id","zip").as[Zip].filter(_.nearChicago).show
 	ds.groupBy('state).count.orderBy('state).show
 	
 3. Plan Optimization
 		Parsed plan
 		Logical plan
 		Optimized plan
 		Physical plan
 		
 4.Wholestage codegen
 			Code will be rearranged and executed in groups/stages
 			For example above business logic function will be part of the code generated by spark vs running it on top of the underlying code.
 			
 			use debugCodegen function to see this
 			import org.apache.spark.sql.execution.debug._
 5.Hive Integration is for Free
  display(spark.catalog.listDatabases)		
  display(dbutils.fs.ls("hdfs hive table folder"))  

 6.Structured streaming is just an extension of existing APIs

 7.Dataset vs DataFrame
 case class Person(name : String , age : Int)
 val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
 import sqlContext.implicits._

 val personDF = sqlContext.createDataFrame(personRDD)
 prdd = personDF.rdd  -- returns RDD[Row]
 val x = personDF.filter("age >10")
 val x = personDF.filter(p => p.age >10) --will fail

 val personDS = personDF.as[Person]
 val perRDD = personDS.rdd	-- returns RDD[person]
 val y = personDS.filter(p => p.age >10 ) -- will not fail
	Advantages of Dataset/Dataframe over RDDs from spark2.0
	--------------------------------------------------------
	1.Dataset's Strong Typing: compile check
	Strong typing is virtual - at the 'table' level same datatype
	Makes it portable across languages like sql/python/scala
	eg:val ds = sqlContext.range(3)
	ds.as[String]
	2.Existing Business logic can be reused:
	case class(zip:String, city:String, loc:Array[Double], pop.Long, state:String){
	val latChicago = 41.87
	val lonChicago = -87

	def nearChicago = {
	math.sqrt(math.pow(loc(8)-lonChicago, 2) + math.pow(loc(1)-latChicago,2)) <1
	}

	}

	sqlContext.read.json("fileName").withColumnRenamed("_id","zip").as[Zip].filter(_.nearChicago).show
	ds.groupBy('state).count.orderBy('state).show

	3. Plan Optimization
	Parsed plan
	Logical plan
	Optimized plan
	Physical plan

	4.Wholestage codegen
	Code will be rearranged and executed in groups/stages
	For example above business logic function will be part of the code generated by spark vs running it on top of the underlying code.

	use debugCodegen function to see this
	import org.apache.spark.sql.execution.debug._
	5.Hive Integration is for Free
	display(spark.catalog.listDatabases)
	display(dbutils.fs.ls("hdfs hive table folder"))

	6.Structured streaming is just an extension of existing APIs

	7.Dataset vs DataFrame
	case class Person(name : String , age : Int)
	val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
	import sqlContext.implicits._

	val personDF = sqlContext.createDataFrame(personRDD)
	prdd = personDF.rdd -- returns RDD[Row]
	val x = personDF.filter("age >10")
	val x = personDF.filter(p => p.age >10) --will fail

	val personDS = personDF.as[Person]
	val perRDD = personDS.rdd -- returns RDD[person]
	val y = personDS.filter(p => p.age >10 ) -- will not fail