Skip to content

Instantly share code, notes, and snippets.

@nguyenhieuec
Forked from joshlk/faster_toPandas.py
Created November 13, 2020 05:37
Show Gist options
  • Select an option

  • Save nguyenhieuec/8d0c04019d4008dd29c5d0595d50afb5 to your computer and use it in GitHub Desktop.

Select an option

Save nguyenhieuec/8d0c04019d4008dd29c5d0595d50afb5 to your computer and use it in GitHub Desktop.
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
:param df: pyspark.sql.DataFrame
:param n_partitions: int or None
:return: pandas.DataFrame
"""
if n_partitions is not None: df = df.repartition(n_partitions)
df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
df_pand = pd.concat(df_pand)
df_pand.columns = df.columns
return df_pand
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment