@saswata-dutta
Last active October 4, 2020 16:59
Drop duplicate rows by id, keeping the one with the latest timestamp.
// https://www.datasciencemadesimple.com/distinct-value-of-dataframe-in-pyspark-drop-duplicates/
// https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
// rank() assigns the same rank to rows that tie on the timestamp, so a unique
// tiebreaker column is added to guarantee exactly one row per id gets rank 1
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// within each id, order by timestamp descending; ties fall back to the tiebreaker
val byId = Window.partitionBy("id").orderBy(col("last_updated").desc, col("tiebreak"))

val deduped = df
  .withColumn("tiebreak", monotonically_increasing_id()) // unique per row
  .withColumn("rank", rank().over(byId))
  .filter(col("rank") === 1) // keep only the latest row per id
  .drop("rank", "tiebreak")
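The window-rank pattern above reduces to: within each id, order rows by last_updated descending with a unique tiebreaker, then keep the first row. A minimal plain-Python sketch of that same logic (no Spark; the row shape and column names are illustrative, and the insertion index plays the role of monotonically_increasing_id):

```python
def dedupe_latest(rows, key="id", ts="last_updated"):
    """Keep one row per key: the one with the latest timestamp.

    On a timestamp tie, the earlier-seen row wins, mirroring
    orderBy(col(ts).desc, col("tiebreak")) in the Spark snippet.
    """
    best = {}  # key value -> (timestamp, insertion index, row)
    for i, row in enumerate(rows):
        k = row[key]
        # Compare (timestamp, -index): a later timestamp wins outright;
        # on equal timestamps, the smaller index (earlier row) wins.
        if k not in best or (row[ts], -i) > (best[k][0], -best[k][1]):
            best[k] = (row[ts], i, row)
    # Return survivors in first-seen order for a deterministic result.
    return [entry[2] for entry in sorted(best.values(), key=lambda e: e[1])]

rows = [
    {"id": 1, "last_updated": 10, "v": "a"},
    {"id": 1, "last_updated": 30, "v": "b"},
    {"id": 2, "last_updated": 20, "v": "c"},
    {"id": 1, "last_updated": 30, "v": "d"},  # ties with "b"; "b" was seen first
]
print(dedupe_latest(rows))
```

This keeps row "b" for id 1 (latest timestamp, earliest among the tied rows) and row "c" for id 2, matching what the window/rank/filter pipeline produces.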