Last active
November 9, 2023 19:59
-
-
Save jlln/338b4b0b55bd6984f883 to your computer and use it in GitHub Desktop.
Efficiently split Pandas Dataframe cells containing lists into multiple rows, duplicating the other column's values.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def splitDataFrameList(df,target_column,separator): | |
''' df = dataframe to split, | |
target_column = the column containing the values to split | |
separator = the symbol used to perform the split | |
returns: a dataframe with each entry for the target column separated, with each element moved into a new row. | |
The values in the other columns are duplicated across the newly divided rows. | |
''' | |
def splitListToRows(row,row_accumulator,target_column,separator): | |
split_row = row[target_column].split(separator) | |
for s in split_row: | |
new_row = row.to_dict() | |
new_row[target_column] = s | |
row_accumulator.append(new_row) | |
new_rows = [] | |
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator)) | |
new_df = pandas.DataFrame(new_rows) | |
return new_df |
Super excited, this is my first time commenting with something I made, be gentle. I needed something similar when a number and a string were in one column separated by a comma (blanks as well). I modified my code a bit to hopefully work a little more universally.
The code checks how many times a delimiter is used in each column row, then repeats that line for each one.
import pandas as pd
from itertools import chain
import numpy as np
def chainer(df,col,sep,dtype):
df = df.astype({col:dtype})
lens = df[col].str.split(sep).map(len)
dicts = {}
for cols in df.columns:
if cols == col:
dicts[cols] = list(chain.from_iterable(df[cols].str.split(sep)))
else:
dicts[cols] = np.repeat(df[cols],lens)
return pd.DataFrame.from_dict(dicts)
df = chainer(df,'Combined Column',',','str')
Added the astype because my column wouldn't convert a float or NaN, after using str worked like a champ.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Need some help. I am unable to run script provided by @namanjh and others above. I am using the same input data file separated by commas. Script errors due to new_df not being defined. Please help. Thanks.
Input data:
Script:
#import numpy as np
import pandas as pd
from IPython.display import display
df = pd.DataFrame(pd.read_excel("file_path.xlsx"))
column_selectors = list(df)
row_delimiters = ','
#new_df = []
display(df)
display(column_selectors)
def split_dataframe_rows(df, column_selectors, row_delimiters):
def _split_list_to_rows(row, row_accumulator, column_selector, row_delimiter):
split_rows = {}
max_split = 0
for column_selector in column_selectors:
split_row = row[column_selector].split(row_delimiter)
split_rows[column_selector] = split_row
if len(split_row) > max_split:
max_split = len(split_row)
df2 = new_df
display(df2)