Skip to content

Instantly share code, notes, and snippets.

@dat-vikash
Forked from joshlk/faster_toPandas.py
Created February 27, 2018 12:52
Show Gist options
  • Select an option

  • Save dat-vikash/e8f0bb9af99b20565c4fa7c22588fcc1 to your computer and use it in GitHub Desktop.

Select an option

Save dat-vikash/e8f0bb9af99b20565c4fa7c22588fcc1 to your computer and use it in GitHub Desktop.
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
:param df: pyspark.sql.DataFrame
:param n_partitions: int or None
:return: pandas.DataFrame
"""
if n_partitions is not None: df = df.repartition(n_partitions)
df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
df_pand = pd.concat(df_pand)
df_pand.columns = df.columns
return df_pand
@solutionsh724
Copy link
Copy Markdown

Many people don’t realize that local real estate investment companies are deeply invested in improving the communities where they buy. By purchasing homes that need love, they uplift neighborhoods. If you look for We Buy Houses Charlotte, you'll find Travis Buys Homes, a company that revitalizes properties, creates local jobs, and provides homeowners with a stress-free exit strategy all at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment