


chhantyal / spark_rdd_to_pandas_distributed.py

Last active April 27, 2023 23:53

Convert a Spark RDD to Pandas DataFrames inside the Spark executors, then build a Spark DataFrame from the resulting RDD. The work stays distributed, i.e. the RDD never has to be collected to the driver.
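The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the gist's own code: the names `summarize_partition` and `run_on_spark`, the `key`/`value` columns, and the per-key sum are all hypothetical placeholders. The per-partition function is plain Python plus Pandas, so it can be tried on an ordinary iterator before it is handed to `mapPartitions`.

```python
from typing import Dict, Iterable, Iterator

import pandas as pd


def summarize_partition(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Meant to run on an executor: one Pandas DataFrame per partition."""
    pdf = pd.DataFrame(list(rows))
    if pdf.empty:
        return iter([])  # an empty partition contributes no rows
    # Placeholder for any Pandas-only logic. Note the sum is per partition:
    # a key spread over several partitions yields several partial rows.
    out = pdf.groupby("key", as_index=False)["value"].sum()
    keys = out["key"].tolist()      # tolist() gives plain Python values,
    values = out["value"].tolist()  # which Spark can infer a schema from
    return iter([{"key": k, "value": v} for k, v in zip(keys, values)])


def run_on_spark(spark, rows, num_partitions=2):
    """Hypothetical wiring; assumes pyspark and an active SparkSession."""
    from pyspark.sql import Row  # imported lazily so the sketch loads without Spark

    rdd = spark.sparkContext.parallelize(rows, num_partitions)
    result = rdd.mapPartitions(summarize_partition).map(lambda d: Row(**d))
    return spark.createDataFrame(result)  # schema is inferred from the rows
```

Calling `summarize_partition(iter(rows))` on a local list of dicts exercises the Pandas step without a cluster. With a session available, `run_on_spark(spark, rows)` would be expected to return a Spark DataFrame built entirely from executor-side output, assuming at least one partition produces rows so that the schema can be inferred.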

	"""
	A Spark DataFrame is distributed, but it lacks many features that Pandas offers.
	You can't simply convert a Spark DataFrame to Pandas, because that means collecting all of it to the driver,
	which can be slow, or fail outright, when the data is large.

	So the only way to use Pandas is to create small DataFrames inside the executors.
	This gist shows how to create Pandas DataFrames from an RDD inside Spark executors and build a Spark DataFrame from the final output.
	"""

	# Convert function to use in mapPartitions