-
-
Save jlln/338b4b0b55bd6984f883 to your computer and use it in GitHub Desktop.
def splitDataFrameList(df, target_column, separator):
    '''Split delimiter-separated values in one column into multiple rows.

    df: dataframe to split
    target_column: the column containing the values to split
    separator: the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated,
        with each element moved into a new row. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_column, separator):
        # One new record per fragment; the other columns are copied as-is.
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df
Hey I made it so it can accept multiple columns and try to split on all of them at the same time
def split_dataframe_rows(df, column_selectors, row_delimiter):
    """Split delimited strings in several columns into rows simultaneously.

    df: dataframe to split
    column_selectors: list of column names whose string values are split
    row_delimiter: the delimiter used for every listed column
    returns: a new dataframe (same column order as ``df``, so we keep track
        of the ordering of the columns) where row i of each listed column
        holds the i-th fragment; columns with fewer fragments are padded
        with '' and the remaining columns are duplicated across rows.
    """
    def _split_list_to_rows(row, row_accumulator, column_selectors, row_delimiter):
        # (The original declared an unused singular ``column_selector`` param
        # and silently relied on the closure; the plural param is now real.)
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            max_split = max(max_split, len(split_row))
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                fragments = split_rows[column_selector]
                # Index instead of repeated .pop(0) (which is O(n) per call);
                # pad with '' when this column runs out of fragments.
                new_row[column_selector] = fragments[i] if i < len(fragments) else ''
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors, row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
into
And here is some variation of @JoaoCarabetta's split function, that leaves additional columns as they are (no drop of columns) and sets list-columns with empty lists with None, while copying the other rows as they were.
def split_data_frame_list(df,
                          target_column,
                          output_type=float):
    '''
    Accepts a column with multiple types and splits list variables to several rows.

    df: dataframe to split
    target_column: the column containing the values to split
    output_type: accepted for backward compatibility (not applied to outputs)
    returns: a dataframe with each entry for the target column separated, with
        each element moved into a new row. An empty list becomes one row with
        None; non-list values pass through unchanged. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    accumulated = []

    def _explode(row):
        cell = row[target_column]
        if isinstance(cell, list):
            for element in cell:
                record = row.to_dict()
                record[target_column] = element
                accumulated.append(record)
            # An empty list still yields one row, carrying None.
            if not cell:
                record = row.to_dict()
                record[target_column] = None
                accumulated.append(record)
        else:
            # Non-list cells are copied through untouched.
            record = row.to_dict()
            record[target_column] = cell
            accumulated.append(record)

    df.apply(_explode, axis=1)
    return pd.DataFrame(accumulated)
>>> df = pd.DataFrame({'name':['a','b','c','d'], "items":[['a1','a2','a3'],['b1','b2','b3'],['c1','c2','c3'],[]],'leave me':range(4)})
>>> df
items leave me name
0 [a1, a2, a3] 0 a
1 [b1, b2, b3] 1 b
2 [c1, c2, c3] 2 c
3 [] 3 d
>>> split_data_frame_list(df, target_column='items')
items leave me name
0 a1 0 a
1 a2 0 a
2 a3 0 a
3 b1 1 b
4 b2 1 b
5 b3 1 b
6 c1 2 c
7 c2 2 c
8 c3 2 c
9 None 3 d
helps a lot, thank you =D
@zouweilin 's extended version for lists
def split_dataframe_rows(df, column_selectors):
    """Explode list-valued cells in several columns into rows simultaneously.

    df: dataframe whose listed columns hold Python lists
    column_selectors: list of column names to explode together
    returns: a new dataframe (same column order as ``df``, so we keep track
        of the ordering of the columns) where row i of each listed column
        holds that list's i-th element; shorter lists are padded with ''
        and the remaining columns are duplicated across rows.
    """
    def _split_list_to_rows(row, row_accumulator, column_selectors):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            # BUG FIX: copy the list. The original aliased the DataFrame's
            # own list object, so the .pop(0) below emptied the caller's
            # cells in place.
            split_row = list(row[column_selector])
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
Try this too:
def flatten_data(json=None):
    """Repeatedly normalize list-valued columns of record-oriented JSON data.

    json: records accepted by ``pd.DataFrame`` (list of dicts, dict of lists).
        The parameter name shadows the stdlib ``json`` module inside this
        function; it is kept for backward compatibility with keyword callers.
    returns: a dataframe where every list column has been exploded via
        ``pd.json_normalize``, its new columns prefixed with ``<col>_``.

    NOTE(review): list columns are detected by inspecting row 0 only, so all
    rows are assumed to share one shape — confirm against the data source.
    """
    df = pd.DataFrame(json)
    list_cols = [c for c in df.columns if isinstance(df.loc[0, c], list)]
    for i, col in enumerate(list_cols):
        # Non-list columns (plus list columns not yet exploded) ride along as metadata.
        meta_cols = [c for c in df.columns if not isinstance(df.loc[0, c], list)] + list_cols[i + 1:]
        json_data = df.to_dict('records')
        # pd.json_normalize replaces the deprecated bare json_normalize import.
        df = pd.json_normalize(data=json_data, record_path=col, meta=meta_cols,
                               record_prefix=col + '_', sep='_')
    return pd.json_normalize(df.to_dict('records'))
ENJOY..!!!
Was helpful. Thanks.
Works just fine! Thank you.
Can someone help me with the code for the below problem:
I have multiple columns with more than 1 value separated by delimiter. I need to create separate rows for those columns such that each value in the column will become a new row keeping the other values same.
I have attached the input and expected output in the excel sheet.
Literally just made a github account right now so I could say thank you.
Thank you!
EXACTLY what I was looking for and worked like a charm.
Big thanks, worked really well, much faster and cleaner than 1-st link solutions from Google.
So I'm trying to use this code, but when I call the function and insert my delimiter (a semicolon) into the function, I get an invalid syntax error. I might be doing something wrong. Sorry, I'm new to programming.
splitDataFrameList(df, alert_rule, ;)
Thanks so much, using this saved me a ton of time!
Thank you, I modified it a little bit to accommodate multiple delimiters by generating a regex pattern.
Like -
delimiters = ",","|"
Import re module for this
import re
def splitDataFrameList(df, target_column, delimiters):
    '''Split one column on any of several delimiters, one row per fragment.

    df = dataframe to split
    target_column = the column containing the values to split
    delimiters = iterable of delimiter strings; they are escaped and OR-ed
        together into a single regex pattern
    returns: a dataframe with each entry for the target column separated,
        with each element moved into a new row. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    # Escape each delimiter so regex metacharacters like '|' split literally.
    regexPattern = "|".join(map(re.escape, delimiters))

    def splitListToRows(row, row_accumulator, target_column, regexPattern):
        for fragment in re.split(regexPattern, row[target_column]):
            record = row.to_dict()
            record[target_column] = fragment
            row_accumulator.append(record)

    accumulated = []
    df.apply(splitListToRows, axis=1, args=(accumulated, target_column, regexPattern))
    return pd.DataFrame(accumulated)
Hey I made it so it can accept multiple columns and try to split on all of them at the same time
def split_dataframe_rows(df, column_selectors, row_delimiter):
    """Split delimited strings in several columns into rows at the same time.

    df: dataframe to split
    column_selectors: list of column names whose string values are split
    row_delimiter: delimiter used for every listed column
    returns: a new dataframe (same column order as ``df``, so we keep track
        of the ordering of the columns) where columns with fewer fragments
        are padded with '' and the other columns are duplicated across rows.
    """
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row, row_accumulator, column_selectors, row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors, row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
into
Thanks bro .. worked like a charm..
Worked wonders, Thank you so much :D
Thank you, I modified it a little bit to accommodate multiple delimiters by generating a regex pattern.
Like -
delimiters = ",","|"
Import re module for this
import re
def splitDataFrameList(df, target_column, delimiters):
    '''Split one column on any of several delimiters, one row per fragment.

    df = dataframe to split
    target_column = the column containing the values to split
    delimiters = iterable of delimiter strings, escaped and OR-ed into one
        regex pattern
    returns: a dataframe with each entry for the target column separated,
        with each element moved into a new row. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    regexPattern = "|".join(map(re.escape, delimiters))
    def splitListToRows(row, row_accumulator, target_column, regexPattern):
        split_row = re.split(regexPattern, row[target_column])
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, regexPattern))
    new_df = pd.DataFrame(new_rows)
    return new_df
I have a dataframe that contains fieldnames and field content. There is no specific delimiter, but the fieldnames are limitative.
Have a somewhat clumsy way
https://stackoverflow.com/questions/60773067/split-string-no-delimiter-with-limitative-field-names-and-content
But this is far from optimal. Any advice?
What a help! very handy, saved time and worked like a magic!! Thank you!
Exactly what I was looking for.
Hey I made it so it can accept multiple columns and try to split on all of them at the same time
def split_dataframe_rows(df, column_selectors, row_delimiter):
    """Split delimited strings in several columns into rows at the same time.

    df: dataframe to split
    column_selectors: list of column names whose string values are split
    row_delimiter: delimiter used for every listed column
    returns: a new dataframe (same column order as ``df``, so we keep track
        of the ordering of the columns) where columns with fewer fragments
        are padded with '' and the other columns are duplicated across rows.
    """
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row, row_accumulator, column_selectors, row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors, row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
into
Works great, exactly what I was looking for. Thanks!
Very useful. Thanks
Thanks A lot For this code
Thanks! Very useful!
You could use 'pd.' instead of 'pandas.' :)
From pandas 0.25 on, one can use explode
Thank you !
very helpful!
Need some help. I am unable to run script provided by @namanjh and others above. I am using the same input data file separated by commas. Script errors due to new_df not being defined. Please help. Thanks.
Input data:
Contact | email2 | phone | notes | |
---|---|---|---|---|
adam,bob | adam.con, bob.com, john.com | adam.com2, bob.com2, john.com2 | adamphone, bobphone, johnphone | should be same for everyone |
other contact | don’t touch this | asdf | asdf | don’t touch this |
rachael, simone, snake | rachael.com, simone.com | rachael.com2, simone.com2, snake.com2 | rachaelphone, simonephone | should be same for everyone |
other contact | don’t touch this | asdf | asdf | don’t touch this |
Script:
#import numpy as np
import pandas as pd
from IPython.display import display
# Load the workbook. NOTE(review): pd.read_excel already returns a
# DataFrame, so the extra pd.DataFrame(...) wrapper is redundant.
df = pd.DataFrame(pd.read_excel("file_path.xlsx"))
# list(df) yields the column names, i.e. split on every column of the sheet.
column_selectors = list(df)
# Single delimiter used for every column.
row_delimiters = ','
#new_df = []
# Show the raw frame and the columns being targeted before splitting.
display(df)
display(column_selectors)
def split_dataframe_rows(df, column_selectors, row_delimiters):
    """Split delimited strings in several columns into rows simultaneously.

    df: dataframe to split
    column_selectors: list of column names whose string values are split
    row_delimiters: the delimiter string used for every listed column
    returns: a new dataframe (same column order as ``df``) where columns with
        fewer fragments are padded with '' and other columns are duplicated.
    """
    def _split_list_to_rows(row, row_accumulator, column_selectors, row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)
    new_rows = []
    # BUG FIX: the original passed the undefined name ``row_delimiter`` here
    # (the parameter is ``row_delimiters``), raising NameError on first call.
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors, row_delimiters))
    # BUG FIX: the DataFrame keyword is ``columns``, not ``column``.
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
# BUG FIX: ``new_df`` only exists inside split_dataframe_rows (this was the
# reported "new_df not defined" error). Call the function to get the result.
df2 = split_dataframe_rows(df, column_selectors, row_delimiters)
display(df2)
Super excited, this is my first time commenting with something I made, be gentle. I needed something similar when a number and a string were in one column separated by a comma (blanks as well). I modified my code a bit to hopefully work a little more universally.
The code checks how many times a delimiter is used in each column row, then repeats that line for each one.
import pandas as pd
from itertools import chain
import numpy as np
def chainer(df, col, sep, dtype):
    """Explode delimiter-separated values in *col* into one row per value.

    df: input dataframe
    col: name of the column holding sep-joined values
    sep: delimiter to split on
    dtype: dtype *col* is cast to first (e.g. 'str', so NaN/floats split safely)
    returns: a new dataframe where each fragment of *col* gets its own row
        and every other column's value is repeated to match.
    """
    df = df.astype({col: dtype})
    # How many fragments each row produces — other columns repeat this often.
    counts = df[col].str.split(sep).map(len)
    exploded = {}
    for name in df.columns:
        if name == col:
            exploded[name] = list(chain.from_iterable(df[name].str.split(sep)))
        else:
            exploded[name] = np.repeat(df[name], counts)
    return pd.DataFrame.from_dict(exploded)
df = chainer(df,'Combined Column',',','str')
Added the astype because my column wouldn't convert a float or NaN, after using str worked like a champ.
This variation might be a bit faster.
Example: