@doppiomacchiatto
Created August 19, 2019 16:47
Find duplicates in a Spark DataFrame
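// Assumes a spark-shell session, where `spark` and `spark.implicits._` are already in scope.
// `header` and `inferSchema` are CSV reader options with no effect on JSON, so they are omitted.
val transactions = spark.read
  .json("s3n://bucket-name/transaction.json")
// Count rows per (id, organization) key and keep only the keys that occur more than once.
transactions.groupBy("id", "organization").count
  .filter($"count" > 1)
  .sort($"count".desc)
  .show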
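
The grouping above only reports which keys repeat. A minimal sketch of the usual follow-up steps, listing the full duplicated rows and dropping the extras, might look like the following. It assumes a local SparkSession and a small in-memory sample; the object name `FindDuplicates`, the helper names, and the `amount` column are hypothetical and not part of the original gist.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FindDuplicates {
  // Keys (id, organization) that occur more than once, with their counts.
  def duplicateKeys(df: DataFrame): DataFrame =
    df.groupBy("id", "organization").count().filter(col("count") > 1)

  // Every row whose key is duplicated, found by joining back on the key columns.
  def duplicateRows(df: DataFrame): DataFrame =
    df.join(duplicateKeys(df).drop("count"), Seq("id", "organization"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the gist assumes an existing `spark`.
    val spark = SparkSession.builder()
      .appName("find-duplicates-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for transaction.json.
    val transactions = Seq(
      (1, "acme", 10.0),
      (1, "acme", 12.5),
      (2, "acme", 7.0),
      (3, "globex", 3.0)
    ).toDF("id", "organization", "amount")

    duplicateKeys(transactions).show()
    duplicateRows(transactions).show()

    // dropDuplicates keeps one row per key; which row survives is not guaranteed.
    transactions.dropDuplicates("id", "organization").show()

    spark.stop()
  }
}
```

Joining back on the key columns keeps the other columns of each duplicated row visible, which helps when deciding which copy to keep before calling `dropDuplicates`.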