Skip to content

Instantly share code, notes, and snippets.

@joshlk
Last active September 19, 2025 16:11
Show Gist options
  • Save joshlk/871d58e01417478176e7 to your computer and use it in GitHub Desktop.
Save joshlk/871d58e01417478176e7 to your computer and use it in GitHub Desktop.
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
:param df: pyspark.sql.DataFrame
:param n_partitions: int or None
:return: pandas.DataFrame
"""
if n_partitions is not None: df = df.repartition(n_partitions)
df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
df_pand = pd.concat(df_pand)
df_pand.columns = df.columns
return df_pand
@amarvin
Copy link

amarvin commented Sep 19, 2025

Doesn't work for Spark Connect, which doesn't allow the use of the KDD API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment