@tecmaverick
Created December 19, 2022 04:36
Get the number of students per course
// Input
// studentid,coursename
// 01,CHEM|PHY|MATH
// 02,CHEM|PHY
// 03,MATH|COMP|CHEM
// 04,MATH|PHY
// 05,HIST|MARKT|BIOL
// 06,BIOL|PHY
// 07,BOTONY|ZOOL
// 08,BOTONY|COMP
// 09,HIST|COMP
// 010,MATH|COMP|STAT
// 011,COMP|STAT
//Output - Students per course
// (COMP,5)
// (MATH,4)
// (PHY,4)
// (CHEM,3)
// (STAT,2)
// (BIOL,2)
// (HIST,2)
// (BOTONY,2)
// (ZOOL,1)
// (MARKT,1)
import org.apache.spark.sql.SparkSession

val filePath = "file:///Users/abe/Personal/Apache Spark/Scripts/AjpRDDReverseMapping/data/data.csv"

val spark = SparkSession.builder()
  .master("local[3]")
  .appName("Students Per Course")
  .config("spark.sql.shuffle.partitions", 2)
  .getOrCreate()

val csv_rdd = spark.sparkContext.textFile(filePath)
val header = csv_rdd.first()

val courseCounts = csv_rdd
  .filter(row => row != header)              // skip the header line
  .map { row =>
    val fields = row.split(",")
    val studentId = fields(0)
    val courses = fields(1).split("\\|")     // courses are pipe-delimited
    (studentId, courses)
  }
  .flatMapValues(identity)                   // one (studentId, course) pair per enrolled course
  .groupBy { case (_, course) => course }    // group the pairs by course name
  .map { case (course, pairs) => (course, pairs.size) }
  .coalesce(1)                               // single partition so the printed order matches the sort
  .sortBy({ case (_, count) => count }, ascending = false)

courseCounts.foreach(println)

spark.stop()
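The parsing and counting logic above can be exercised without a Spark cluster: the same split / flatten / group / count pipeline works on plain Scala collections. A minimal sketch, using a hypothetical three-row subset of the input:

```scala
object CourseCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical subset of the gist's input rows (header already dropped)
    val rows = Seq("01,CHEM|PHY|MATH", "02,CHEM|PHY", "03,MATH|COMP|CHEM")

    val counts: Map[String, Int] = rows
      .map(_.split(","))                 // split "studentid,coursename"
      .flatMap(fields => fields(1).split("\\|")) // one element per course
      .groupBy(identity)                 // group occurrences of each course
      .map { case (course, occ) => (course, occ.size) }

    println(counts)
  }
}
```

For this subset, CHEM appears in all three rows, PHY and MATH in two each, and COMP in one, so `counts` is `Map(CHEM -> 3, PHY -> 2, MATH -> 2, COMP -> 1)`. On an RDD, replacing `groupBy` + `size` with `map(course => (course, 1)).reduceByKey(_ + _)` avoids shuffling the full list of occurrences per course.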