@saswata-dutta
Created July 23, 2020 11:43
scala> val values = Seq((1, "a", 1), (1, "b", 2), (2, "c", 2), (3, "d", 1), (3, "e", 1), (3, "f", 0))
values: Seq[(Int, String, Int)] = List((1,a,1), (1,b,2), (2,c,2), (3,d,1), (3,e,1), (3,f,0))

scala> val df = values.toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]

scala> val max_df = df.groupBy("_1").agg(max("_3").alias("_3"))
max_df: org.apache.spark.sql.DataFrame = [_1: int, _3: int]

scala> df.join(max_df, Seq("_1", "_3"), "leftsemi").dropDuplicates("_1", "_3").show
+---+---+---+
| _1| _3| _2|
+---+---+---+
| 3| 1| d|
| 2| 2| c|
| 1| 2| b|
+---+---+---+
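The transcript above selects, for each group key (`_1`), one row carrying the maximum of `_3`: the left-semi join keeps only rows whose (`_1`, `_3`) pair matches the per-group maximum, and `dropDuplicates` breaks ties (e.g. `d` vs. `e` in group 3). The same max-per-group logic can be sketched in plain Scala collections, without Spark; this is an illustrative equivalent, not part of the original gist:

```scala
// Max-per-group over in-memory tuples, mirroring the Spark example above.
val values = Seq((1, "a", 1), (1, "b", 2), (2, "c", 2), (3, "d", 1), (3, "e", 1), (3, "f", 0))

val maxPerGroup = values
  .groupBy(_._1)                               // group rows by the key (_1)
  .map { case (_, rows) => rows.maxBy(_._3) }  // keep the row with the largest _3 (first wins on ties)
  .toSeq
  .sortBy(_._1)
// maxPerGroup == Seq((1,"b",2), (2,"c",2), (3,"d",1))
```

In Spark itself, a window function (`row_number` over `Window.partitionBy("_1").orderBy(desc("_3"))`, then filtering rank 1) is a common alternative to the semi-join approach, at the cost of a shuffle per partition key.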