Created February 22, 2017 02:53
Disjoint (additional) objects count with Apache Spark.
// This method uses a window function to eliminate double counting of objects
// that belong to multiple groups. `groups` is a DataFrame with two columns:
// `id` and `group`. The first column identifies the object, the second is the
// group name. One use of this method is customer segmentation.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{countDistinct, dense_rank}
import spark.implicits._  // for the $"col" syntax (already in scope in spark-shell)

// Keep each object in exactly one group: rank its groups alphabetically and keep the first.
val disjointGroups = groups
  .withColumn("_rank", dense_rank().over(Window.partitionBy("id").orderBy("group")))
  .filter($"_rank" === 1)
  .drop("_rank")

// Show disjoint groups with the additional (non-overlapping) count each group contributes.
disjointGroups
  .groupBy("group")
  .agg(countDistinct("id") as "number_records")
  .orderBy("group")
  .show()
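
For reference, a minimal sketch of how this could be exercised in spark-shell. The `groups` DataFrame here is a hypothetical example, not part of the original snippet; it assumes an object may appear in several groups and is kept only under its alphabetically first group.

// Hypothetical input: object 1 belongs to both "premium" and "retail",
// so it should be counted only once, under "premium".
val groups = Seq(
  (1, "premium"), (1, "retail"),
  (2, "retail"),
  (3, "premium")
).toDF("id", "group")

// Running the snippet above on this input would show:
// +-------+--------------+
// |  group|number_records|
// +-------+--------------+
// |premium|             2|
// | retail|             1|
// +-------+--------------+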