Spark RDD ScratchPad
// ============================================================
// Generate a test key-value pair RDD
spark.conf.set("spark.sql.shuffle.partitions", 2)
val num = Seq((2000,10),(2001,20),(2000,20),(2002,30),(2003,30),(2004,50),(2004,100),(2004,250),(2005,250),(2005,25),
  (2006,150),(2006,225),(2007,250),(2007,125),(2008,250),(2009,25),(2010,250),(2010,125))
val rdd = sc.parallelize(num)
// Pass the partition count to the shuffle operator itself; chaining a
// repartition() afterwards would discard the partitioner (it becomes None).
val prdd = rdd.reduceByKey(_ + _, 2)
val srdd = rdd.sortByKey(numPartitions = 2)
// HashPartitioner is used
prdd.partitioner
// RangePartitioner is used
srdd.partitioner
// Save hash-partitioned data to file
prdd.saveAsTextFile("file:///Users/abe/Personal/Apache Spark/data/hashp")
// Save range-partitioned data to file
srdd.saveAsTextFile("file:///Users/abe/Personal/Apache Spark/data/rangep")
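As a sanity check that needs no Spark cluster, the per-key sums that `reduceByKey(_ + _)` produces can be reproduced on the plain Scala collection. `groupMapReduce` (Scala 2.13+) is a local stand-in for the shuffle here, not Spark API:

```scala
// Plain-Scala sketch of the per-key aggregation reduceByKey(_ + _) performs.
// This runs on the local Seq only; no SparkContext is involved.
val num = Seq((2000,10),(2001,20),(2000,20),(2002,30),(2003,30),(2004,50),(2004,100),(2004,250),(2005,250),(2005,25),
  (2006,150),(2006,225),(2007,250),(2007,125),(2008,250),(2009,25),(2010,250),(2010,125))
// Group by key, project the value, then combine values with +.
val reduced = num.groupMapReduce(_._1)(_._2)(_ + _)
// e.g. key 2004 -> 50 + 100 + 250 = 400
```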
// ============================================================
// Generate a sequence of numbers
val rdd = sc.parallelize(Seq.range(0, 100))
rdd.foreach(println)
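Locally, `Seq.range(0, 100)` is just the half-open range `0 until 100`; a minimal check of the collection Spark parallelizes above:

```scala
// Seq.range(0, 100) yields the 100 integers 0 .. 99 (the end is exclusive).
val nums = Seq.range(0, 100)
// nums.head is 0, nums.last is 99, nums.length is 100
```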