@goddoe
Created January 31, 2019 02:29
pandas apply multiprocessing

(ref: https://stackoverflow.com/questions/26784164/pandas-multiprocessing-apply) I'm trying to use multiprocessing with a pandas DataFrame: split the DataFrame into 8 parts, then apply a function to each part using apply, with each part processed in a separate process.

EDIT: Here's the solution I finally found:

import multiprocessing as mp

import numpy as np
import pandas as pd
import pandas.testing as pdt  # pandas.util.testing is deprecated in favor of pandas.testing

def process_apply(x):
    # do some stuff to data here
    return x  # placeholder so the function is valid; return the computed row or value

def process(df):
    res = df.apply(process_apply, axis=1)
    return res

if __name__ == '__main__':
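    # big_df is assumed to be an existing DataFrame defined before this point
    p = mp.Pool(processes=8)
    split_dfs = np.array_split(big_df, 8)
    # map the chunk-level function defined above over the parts
    pool_results = p.map(process, split_dfs)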
    p.close()
    p.join()

    # merging parts processed by different processes
    parts = pd.concat(pool_results, axis=0)

    # merging newly calculated parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)

    # checking if the dfs were merged correctly
    pdt.assert_series_equal(parts['id'], big_df['id'])
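
For reference, here is a minimal self-contained sketch of the same split / map / concat pattern that can be run as-is. The column names `a` and `b` and the helpers `add_total`, `split_frame` and `parallel_apply` are made up for illustration and are not part of the original solution. It also slices with `iloc` instead of calling `np.array_split` on the DataFrame, which is an assumption on my part to avoid the deprecation warning that some recent pandas versions emit for that call.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def add_total(row):
    # Hypothetical per-row function: sum two columns of the row.
    return row["a"] + row["b"]


def process_chunk(chunk):
    # Runs in a worker process: apply the row function to one chunk.
    return chunk.apply(add_total, axis=1)


def split_frame(df, n_parts):
    # Split row-wise into n_parts nearly equal, order-preserving chunks.
    # Assumes n_parts <= len(df) so that no chunk is empty.
    bounds = np.linspace(0, len(df), n_parts + 1).astype(int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def parallel_apply(df, n_parts=4, processes=4):
    # Process each chunk in the pool, then stitch the results back together.
    chunks = split_frame(df, n_parts)
    with mp.Pool(processes=processes) as pool:
        results = pool.map(process_chunk, chunks)
    return pd.concat(results, axis=0)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    total = parallel_apply(df)
    # The parallel result should match a plain serial apply.
    pd.testing.assert_series_equal(total, df.apply(add_total, axis=1))
    print(total.head())
```

The worker function is defined at module level because the pool has to pickle it by name, and the pool is only created under the `if __name__ == "__main__":` guard so that worker processes do not try to start pools of their own when they import the module. For small frames like this one the process start-up cost will likely outweigh any gain; the pattern is meant for expensive per-row functions on large frames.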