@oluies
Created March 12, 2016 07:27
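// Read each input directory as a DataFrame, union them all, reduce the result
// to the desired number of partitions (Spark typically writes one part file
// per partition), and write it back out as Parquet.

// sqlContext.read.parquet replaces parquetFile, which was deprecated in Spark 1.4.
val dfSeq = sourceDirsToConsolidate.map(dir => sqlContext.read.parquet(dir))
// reduce throws on an empty collection, so sourceDirsToConsolidate must be non-empty.
val masterDf = dfSeq.reduce((df1, df2) => df1.unionAll(df2))
// coalesce avoids a full shuffle but can only lower the partition count.
masterDf.coalesce(numOutputFiles).write.mode(saveMode).parquet(destDir)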
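
// A minimal sketch of the same consolidation wrapped in a reusable function.
// The function name, its parameter names and types, and the Overwrite default
// are assumptions for illustration; they are not part of the original snippet.
// It relies on DataFrameReader.parquet accepting several paths at once, which
// may stand in for the explicit union when all directories share a schema.
import org.apache.spark.sql.{SQLContext, SaveMode}

def consolidateParquet(
    sqlContext: SQLContext,
    sourceDirs: Seq[String],       // hypothetical: input directories to merge
    destDir: String,               // hypothetical: output directory
    numOutputFiles: Int,           // target number of output partitions
    saveMode: SaveMode = SaveMode.Overwrite): Unit = {
  require(sourceDirs.nonEmpty, "at least one source directory is required")
  require(numOutputFiles > 0, "numOutputFiles must be positive")
  // Load every directory in one call instead of unioning DataFrames pairwise.
  val merged = sqlContext.read.parquet(sourceDirs: _*)
  // coalesce only merges existing partitions; it will not raise the count.
  merged.coalesce(numOutputFiles).write.mode(saveMode).parquet(destDir)
}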