Apache Spark Repartition vs coalesce
Repartition:
1. Produces roughly the same number of records in each resulting partition, so resources are consumed evenly
2. Performs a full shuffle, so it is expensive
3. Used to increase or decrease the number of partitions
Coalesce:
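1. Can produce an uneven number of records in the resulting partitions, so the load may be unbalanced
2. Avoids a full shuffle by merging existing partitions, so it is usually faster
3. Used to decrease the number of partitions; asking a DataFrame for more partitions than it currently has leaves the count unchanged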
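The difference in balance can be sketched with a toy model in plain Python. This is not Spark code: the helper functions below are hypothetical stand-ins, the round-robin dealing only approximates what a full shuffle achieves, and merging neighbouring partitions is an assumed simplification of how coalesce groups them.

```python
from typing import List

Partition = List[int]


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy full shuffle: every record may move, dealt round-robin into n partitions."""
    out: List[Partition] = [[] for _ in range(n)]
    records = [r for p in partitions for r in p]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy coalesce: whole input partitions are merged, records are never split up."""
    if n >= len(partitions):
        # Like DataFrame.coalesce, this toy version cannot increase the count.
        return [list(p) for p in partitions]
    out: List[Partition] = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        # Assumption: neighbouring partitions are grouped together.
        out[i * n // len(partitions)].extend(p)
    return out


# Four skewed input partitions holding 70, 10, 10 and 10 records.
skewed = [list(range(0, 70)), list(range(70, 80)),
          list(range(80, 90)), list(range(90, 100))]

repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]

print(repartitioned_sizes)  # [50, 50]
print(coalesced_sizes)      # [80, 20]
```

In this sketch the repartitioned output is balanced because every record is redistributed, while the coalesced output inherits the skew of the inputs it merged. In real Spark the corresponding calls are df.repartition(n) and df.coalesce(n).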
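When creating an RDD we can specify the number of partitions directly, for example sc.parallelize(data, numSlices) or sc.textFile(path, minPartitions). DataFrame readers generally take no such argument: the initial partition count follows from the input files and settings such as spark.sql.files.maxPartitionBytes, and shuffles use spark.sql.shuffle.partitions (200 by default), so we call repartition or coalesce afterwards to change it.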