
@TariqAHassan
Last active November 6, 2023 22:57
Fast Alternative to pd.concat() for row-wise concatenation

Pandas DataFrames are fantastic. However, concatenating them using standard approaches, such as pandas.concat(), can be very slow with large dataframes. This is a workaround for that problem.

Note: this approach assumes that: (a) the goal is a row-wise concatenation (i.e., axis=0) and (b) all dataframes share the same column names.

If these assumptions are not met, this approach could still work, but it will likely need to be modified.

Start by importing os, pandas, and chain from itertools.

import os
import pandas as pd
from itertools import chain

Set the path to the data files:

PATH_TO_FILES = '/your/path/here/'

Read in the Data as Pandas DataFrames (csv files, in this example):

frames = list()
for csv in [os.path.join(PATH_TO_FILES, f) for f in os.listdir(PATH_TO_FILES) if f.endswith('.csv')]:
    frames.append(pd.read_csv(csv))
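
Before continuing, it can be worth verifying that assumption (b) actually holds for the frames just loaded. A minimal sketch (the helper name `columns_match` is mine, not part of the gist):

```python
import pandas as pd

def columns_match(frames):
    """Return True if every dataframe has the same columns, in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames)
```

If this returns False, the column-alignment logic that pd.concat() performs (and that this shortcut skips) is probably what you actually want.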

Define a function to flatten large 2D lists quickly:

def fast_flatten(input_list):
    return list(chain.from_iterable(input_list))
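
For example, applied to a small nested list (a toy illustration, not part of the gist):

```python
from itertools import chain

def fast_flatten(input_list):
    return list(chain.from_iterable(input_list))

# Flattens one level of nesting in a single pass
print(fast_flatten([[1, 2], [3], [4, 5]]))  # → [1, 2, 3, 4, 5]
```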

Next, grab the column names from one of the dataframes (located at index 0):

COLUMN_NAMES = frames[0].columns

Now, construct a dictionary keyed by those column names (every value is reassigned in the loop below, so the shared empty list that dict.fromkeys creates is harmless here):

df_dict = dict.fromkeys(COLUMN_NAMES, [])

Iterate through the columns:

for col in COLUMN_NAMES:
    extracted = (frame[col] for frame in frames)

    # Flatten and save to df_dict
    df_dict[col] = fast_flatten(extracted)

Lastly, use the from_dict method to produce the combined DataFrame:

df = pd.DataFrame.from_dict(df_dict)[COLUMN_NAMES]

While this method is not very pretty, it is typically much faster than pd.concat() and yields the same values (note, though, that any existing index is discarded).
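
To check both the speedup and the equal-values claim on your own data, a self-contained comparison along these lines can help (the frame sizes here are arbitrary, and timings vary by pandas version and data shape):

```python
import time
from itertools import chain

import numpy as np
import pandas as pd

# Toy stand-ins for the CSV-loaded frames
frames = [
    pd.DataFrame(np.random.rand(10_000, 4), columns=list("abcd"))
    for _ in range(20)
]

start = time.perf_counter()
via_concat = pd.concat(frames, ignore_index=True)
concat_secs = time.perf_counter() - start

start = time.perf_counter()
df_dict = {
    col: list(chain.from_iterable(frame[col] for frame in frames))
    for col in frames[0].columns
}
via_dict = pd.DataFrame.from_dict(df_dict)
dict_secs = time.perf_counter() - start

# Same values either way (the dict route discards any index)
assert via_concat.equals(via_dict)
print(f"pd.concat: {concat_secs:.4f}s, dict route: {dict_secs:.4f}s")
```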

@akamil-etsy
akamil-etsy commented Jul 15, 2023

Yes, but there is a simpler approach that is still faster than using concat on a list of dataframes (pd.concat(dfs)): load each dataframe into a dictionary, combine the dictionaries, then convert back to a dataframe.

In this case I'm reading hundreds of 4 MB parquet files from a directory, where each parquet file contains 'id' and 'i' columns, and returning a combined dataframe:

def read_and_sort_parquets(directory):
    data_dict = {}

    for filename in os.listdir(directory):
        if filename.endswith('.parquet'):
            print(f"Processing: {filename}")
            file_df = pd.read_parquet(os.path.join(directory, filename))
            file_dict = file_df.set_index('id').to_dict()['i']
            data_dict.update(file_dict)

    # Sort the dictionary by keys (i.e., by 'id')
    data_dict = dict(sorted(data_dict.items()))

    # Convert the sorted dictionary to a DataFrame
    df = pd.DataFrame(list(data_dict.items()), columns=['id', 'i'])

    return df

ids_df = read_and_sort_parquets("parquet_dir")
print(f"size: {len(ids_df)}")
print(ids_df.head())

@sammlapp
sammlapp commented Nov 6, 2023

Nice! In my case, this code was >100x faster than pd.concat when my individual dataframes had a lot of content in the index (actually a MultiIndex). This code drops the index, which is part of why it is fast. If I call .reset_index() before pd.concat(), pd.concat() is also very fast, and I can restore the original indices rather than losing them.
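
A sketch of that reset_index pattern (the MultiIndex and values here are made up for illustration):

```python
import pandas as pd

# Two toy frames sharing a two-level index
idx1 = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["grp", "n"])
idx2 = pd.MultiIndex.from_tuples([("b", 1), ("b", 2)], names=["grp", "n"])
frames = [
    pd.DataFrame({"x": [1.0, 2.0]}, index=idx1),
    pd.DataFrame({"x": [3.0, 4.0]}, index=idx2),
]

# Flatten the index into columns before concatenating, then restore it afterwards
index_names = list(frames[0].index.names)
combined = (
    pd.concat([f.reset_index() for f in frames], ignore_index=True)
    .set_index(index_names)
)
print(combined)
```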
