Snippet of a Spark job that merges Parquet files, optionally removing duplicate rows.
// Number of output partitions. This value depends on data and volumes and will differ in every case.
val partitions = 5

// Read the source Parquet files into a DataFrame and register it as a temp view.
val df = spark.read.parquet("URI://path/to/parquet/files/")
df.createOrReplaceTempView("df")

val df_output = spark
  .sql("SELECT DISTINCT * FROM df") // DISTINCT removes duplicates; drop this line if deduplication is not needed
  .coalesce(partitions)

df_output.write.parquet("URI://path/to/destination")
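The same merge can be expressed with the DataFrame API instead of SQL, which avoids the temp view. A minimal sketch, assuming a Spark 2.x+ `SparkSession` named `spark` and the same placeholder paths as above:

// Read, deduplicate, compact into fewer partitions, and write back as Parquet.
spark.read.parquet("URI://path/to/parquet/files/")
  .distinct()             // remove this call if duplicates should be kept
  .coalesce(partitions)   // fewer, larger output files; tune per data volume
  .write
  .mode("overwrite")      // assumption: overwriting the destination is acceptable
  .parquet("URI://path/to/destination")

Note that coalesce only reduces the partition count without a full shuffle; if you need evenly sized output files, repartition(partitions) is an alternative at the cost of a shuffle.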