TariqAHassan/PandasConcatWorkaround.adoc

Last active November 6, 2023 22:57

Star (15) You must be signed in to star a gist
Fork (3) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/TariqAHassan/fc77c00efef4897241f49e61ddbede9e.js"></script>
Save TariqAHassan/fc77c00efef4897241f49e61ddbede9e to your computer and use it in GitHub Desktop.

Fast Alternative to pd.concat() for row-wise concatenation

Raw

Pandas DataFrames are fantastic. However, concatenating them using standard approaches, such as pandas.concat(), can be very slow with large dataframes. This is a work around for that problem.

Note: this approach assumes that: (a) the goal is a row-wise concatenation (i.e., axis=0) and (b) all dataframes share the same column names.

If these assumptions are not met, this approach could still work…but it will likely need to be modified.

Start by importing os, Pandas and chain from itertools.

import os
import pandas as pd
from itertools import chain

Set the path to the data files:

PATH_TO_FILES = '/your/path/here/'

Read in the Data as Pandas DataFrames (csv files, in this example):

frames = list()
for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
    frames.append(pd.read_csv(csv))

Define a function to flatten large 2D lists quickly:

def fast_flatten(input_list):
    return list(chain.from_iterable(input_list))

Next, construct a dictionary using the column names from one of the dataframes (located at index 0):

COLUMN_NAMES = frames[0].columns

Now, construct a dictionary from the column names:

df_dict = dict.fromkeys(COLUMN_NAMES, [])

Iterate though the columns:

for col in COLUMN_NAMES:
    extracted = (frame[col] for frame in frames)

    # Flatten and save to df_dict
    df_dict[col] = fast_flatten(extracted)

Lastly use the from_dict method to produce the combined DataFrame:

df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]

While this method is not very pretty, it typically is much faster than pd.concat() and yields the exact same result.

kaijfox commented Jul 30, 2018 •

edited

Loading

Any idea why this provides a speedup? pd.concat is faster for me, and I'd like to figure out what aspect of my test case is slowing this method down.

HuyTu7 commented Jun 5, 2019

I have the same problem, I tried to compare this with pd.concat dask dataframe dd.concat and both were faster?

efendi54 commented Aug 1, 2019

Same here: pd.concat seems still to be faster

Author

TariqAHassan commented Aug 1, 2019 •

edited

Loading

Hey all. This guide was posted several years ago now and may no longer be true.

tspeterkim commented Nov 8, 2019

This worked really well for me! Thanks for sharing

swalkoAI commented Dec 3, 2019

@TariqAHassan Thanks for this. This is many times faster than pandas.concat() and I work with lists of 100,000s of dataframes. Excellent. Thank you for sharing.

saphir746 commented Nov 11, 2020 •

edited

Loading

An extension for Dfs with different columns:
(where frames is a list of dataframes to concat together)

# N = total # of rows to collate

def fast_flatten(input_list):
    a=list(chain.from_iterable(input_list))
    a += [False] * (N - len(a)) # collating logical arrays - missing values are replaced with False
    return list(a)

def combine_lists(frames):
    COLUMN_NAMES = [frames[i].columns for i in range(len(frames))]
    COL_NAMES=list(set(list(itertools.chain(*COLUMN_NAMES))))
    df_dict = dict.fromkeys(COL_NAMES, [])
    for col in COL_NAMES:
        extracted = (frame[col] for frame in frames10 if col in frame.columns.tolist())
        df_dict[col] = fast_flatten(extracted)
    Df_new = pd.DataFrame.from_dict(df_dict)[COL_NAMES]
    return Df_new

DaltonWorsnup commented Apr 13, 2023 •

edited

Loading

This is still WAY faster than pd.concat in April 2023. For reference, I am concatenating ~7000 dataframes with an average of 5 columns and 1337 rows per dataframe. In fact, on my machine, pd.concat runs out of memory and crashes after several minutes - the approach in this gist takes ~10 seconds and does not crash. (Python 3.9 - Pandas 2.0.0)

akamil-etsy commented Jul 15, 2023 •

edited

Loading

yes, but simpler, this is faster than using concat on a list of dataframes (pd.concat(dfs)) , just load each df into dictionary and combine dictionaries than convert to df

in this case i'm reading hundreds of 4MB parquet files from a directory, where each parquet file contains 'id' and 'i' columns, and return a combined dataframe

def read_and_sort_parquets(directory):
    data_dict = {}

    for filename in os.listdir(directory):
        if filename.endswith('.parquet'):
            print(f"Processing: {filename}")
            file_df = pd.read_parquet(os.path.join(directory, filename))
            file_dict = file_df.set_index('id').to_dict()['i']
            data_dict.update(file_dict)

    # Sort the dictionary by keys (i.e., by 'id')
    data_dict = dict(sorted(data_dict.items()))  
    
    # Convert the sorted dictionary to a DataFrame
    df = pd.DataFrame(list(data_dict.items()), columns=['id', 'i'])

    return df

ids_df = read_and_sort_parquets("parquet_dir")
print(f"size: {len(ids_df)}")
print(ids_df.head())

sammlapp commented Nov 6, 2023 •

edited

Loading

Nice! In my case, I noticed that this code was >100x faster than pd.concat when my individual dataframes had lots of content in the index (actually MultiIndex). This code removes the index and is fast. When I .reset_index() before pd.concat(), pd.concat() is very fast and I can restore the original indices rather than losing them.

TariqAHassan/PandasConcatWorkaround.adoc

kaijfox commented Jul 30, 2018 •

edited

Loading

Uh oh!

HuyTu7 commented Jun 5, 2019

Uh oh!

efendi54 commented Aug 1, 2019

Uh oh!

TariqAHassan commented Aug 1, 2019 •

edited

Loading

Uh oh!

tspeterkim commented Nov 8, 2019

Uh oh!

swalkoAI commented Dec 3, 2019

Uh oh!

saphir746 commented Nov 11, 2020 •

edited

Loading

Uh oh!

DaltonWorsnup commented Apr 13, 2023 •

edited

Loading

Uh oh!

akamil-etsy commented Jul 15, 2023 •

edited

Loading

Uh oh!

sammlapp commented Nov 6, 2023 •

edited

Loading

Uh oh!

TariqAHassan/PandasConcatWorkaround.adoc

kaijfox commented Jul 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuyTu7 commented Jun 5, 2019

Uh oh!

efendi54 commented Aug 1, 2019

Uh oh!

TariqAHassan commented Aug 1, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tspeterkim commented Nov 8, 2019

Uh oh!

swalkoAI commented Dec 3, 2019

Uh oh!

saphir746 commented Nov 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DaltonWorsnup commented Apr 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akamil-etsy commented Jul 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sammlapp commented Nov 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaijfox commented Jul 30, 2018 •

edited

Loading

TariqAHassan commented Aug 1, 2019 •

edited

Loading

saphir746 commented Nov 11, 2020 •

edited

Loading

DaltonWorsnup commented Apr 13, 2023 •

edited

Loading

akamil-etsy commented Jul 15, 2023 •

edited

Loading

sammlapp commented Nov 6, 2023 •

edited

Loading