@bsweger
Last active November 13, 2024 19:55

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Data Types and Conversion

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)

pd.to_numeric(df['Column Name'])

Convert Series datatype to numeric, changing non-numeric values to NaN
(h/t @makmanalp for the updated syntax!)

pd.to_numeric(df['Column Name'], errors='coerce')

Change data type of DataFrame column

df.column_name = df.column_name.astype(np.int64)
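
Note that astype(np.int64) raises an error if the column contains NaN. One workaround, as a minimal sketch assuming a reasonably recent pandas (>= 0.24) with the nullable integer dtype:

# assumption: pandas >= 0.24; 'Int64' (capital I) is the nullable integer dtype
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce').astype('Int64')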

Exploring and Finding Data

Get a report of all duplicate records in a DataFrame, based on specific columns

dupes = df[df.duplicated(
    ['col1', 'col2', 'col3'], keep=False)]

List unique values in a DataFrame column
(h/t @makmanalp for the updated syntax!)

df['Column Name'].unique()

For each unique value in a DataFrame column, get a frequency count

df['Column Name'].value_counts()

Grab DataFrame rows where column = a specific value

df = df.loc[df.column == 'somevalue']

Grab DataFrame rows where column value is present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(test_data.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[df.col2.isin(valuelist)]

Grab DataFrame rows where column value is not present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(test_data.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[~df.col2.isin(valuelist)]

Select from DataFrame using criteria from multiple columns
(use | instead of & to do an OR)

newdf = df[(df['column_one']>2004) & (df['column_two']==9)]
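
For example, the OR form of the same filter would look like this (same placeholder column names):

newdf = df[(df['column_one'] > 2004) | (df['column_two'] == 9)]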

Loop through rows in a DataFrame
(if you must)

for index, row in df.iterrows():
    print (index, row['some column'])

Much faster way to loop through DataFrame rows if you can work with tuples
(h/t hughamacmullaniv)

for row in df.itertuples():
    print(row)
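
If you need individual fields, itertuples yields namedtuples, so you can access columns by attribute (a sketch assuming a column literally named col1):

for row in df.itertuples(index=False):
    print(row.col1)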

Get top n for each group of columns in a sorted DataFrame
(make sure DataFrame is sorted first)

top5 = df.groupby(
    ['groupingcol1',
    'groupingcol2']).head(5)
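
A sketch of the full pattern with the sort included (sortcol is a placeholder for whatever column defines "top"):

top5 = (df
    .sort_values('sortcol', ascending=False)
    .groupby(['groupingcol1', 'groupingcol2'])
    .head(5))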

Grab DataFrame rows where specific column is null/notnull

newdf = df[df['column'].isnull()]

Select from DataFrame using multiple keys of a hierarchical index

df.xs(
    ('index level 1 value','index level 2 value'),
    level=('level 1','level 2'))
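
A small self-contained example with made-up data, showing the same call against a two-level index:

# toy data for illustration only
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1)], names=['level 1', 'level 2'])
mdf = pd.DataFrame({'val': [10, 20, 30]}, index=idx)
mdf.xs(('a', 1), level=('level 1', 'level 2'))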

Slice values in a DataFrame column (aka Series)

df.column.str[0:2]

Get quick count of rows in a DataFrame

len(df.index)

Get length of data in a DataFrame column

df.column_name.str.len()

Updating and Cleaning Data

Delete column from DataFrame

del df['column']

Rename several DataFrame columns

df = df.rename(columns = {
    'col1 old name':'col1 new name',
    'col2 old name':'col2 new name',
    'col3 old name':'col3 new name',
})

Lower-case all DataFrame column names

df.columns = [col.lower() for col in df.columns]

Even more fancy DataFrame column re-naming
apply an arbitrary function to each column name (in this example, keep only the part after the last '.')

df.rename(columns=lambda x: x.split('.')[-1], inplace=True)

Lower-case everything in a DataFrame column

df.column_name = df.column_name.str.lower()

Sort DataFrame by multiple columns

df = df.sort_values(
    ['col1', 'col2', 'col3'], ascending=[True, True, False])

Change all NaNs to None (useful before loading to a db)

df = df.where((pd.notnull(df)), None)
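
Heads-up: in newer pandas versions, numeric columns may quietly turn that None back into NaN. Casting to object first is one workaround (a sketch worth verifying against your pandas version):

# assumption: casting to object keeps None from being upcast back to NaN
df = df.astype(object).where(pd.notnull(df), None)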

More pre-db insert cleanup...make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to None
(not especially recommended but including here b/c I had to do this in real life once)

df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None)

Get rid of non-numeric values throughout a DataFrame:

for col in refunds.columns.values:
    refunds[col] = refunds[col].replace(
        '[^0-9]+.-', '', regex=True)

Set DataFrame column values based on other column values
(h/t: @mlevkov)

df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value
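
A tiny runnable example with made-up data (99 stands in for new_value):

df = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4], 'column_to_change': [0, 0]})
df.loc[(df['column1'] == 1) & (df['column2'] == 3), ['column_to_change']] = 99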

Clean up missing values in multiple DataFrame columns

df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})

Doing calculations with DataFrame columns that have missing values. In the example below, swap in 0 for df['col1'] cells that contain null.

df['new_col'] = np.where(
    pd.isnull(df['col1']), 0, df['col1']) + df['col2']

Split delimited values in a DataFrame column into two new columns

df['new_col1'], df['new_col2'] = zip(
    *df['original_col'].apply(
        lambda x: x.split(': ', 1)))
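
An alternative sketch using pandas' built-in string splitting (same placeholder column names):

df[['new_col1', 'new_col2']] = df['original_col'].str.split(': ', n=1, expand=True)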

Collapse hierarchical column indexes

df.columns = df.columns.get_level_values(0)

Reshaping, Concatenating, and Merging Data

Pivot data (with flexibility about what becomes a column and what stays a row).

pd.pivot_table(
    df, values='cell_value',
    index=['col1', 'col2', 'col3'],  # these stay as columns; will fail silently if any of these cols have null values
    columns=['col4'])  # data values in this column become their own column
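
A toy example with made-up data showing the reshape; the default aggfunc is mean:

sales = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 110, 90, 95]})
pd.pivot_table(
    sales, values='revenue', index=['region'], columns=['quarter'])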

Concatenate two DataFrame columns into a new, single column
(useful when dealing with composite keys, for example)
(h/t @makmanalp for improving this one!)

df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str)

Display and formatting

Set up formatting so larger numbers aren't displayed in scientific notation
(h/t @thecapacity)

pd.set_option('display.float_format', lambda x: '%.3f' % x)

To display with commas and no decimals

pd.options.display.float_format = '{:,.0f}'.format
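
To undo either of these and return to pandas' default float formatting:

pd.reset_option('display.float_format')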

Creating DataFrames

Create a DataFrame from a Python dictionary

df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2'])
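
A related sketch: if you want the dictionary keys to become the index instead, from_dict with orient='index' handles that (made-up data; the columns argument assumes pandas >= 0.23):

a_dictionary = {'row1': [1, 2], 'row2': [3, 4]}
df = pd.DataFrame.from_dict(a_dictionary, orient='index', columns=['colA', 'colB'])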

Convert Django queryset to DataFrame

qs = DjangoModelName.objects.all()
q = qs.values()
df = pd.DataFrame.from_records(q)

Creating Series

Create a new series using an index from an existing series

# original_series = the original series
# simplistically generate a list of values to use in the new series
new_series_values = list(range(1, len(original_series) + 1))
new_series = pd.Series(new_series_values, index=original_series.index)

@pashute

pashute commented Nov 6, 2017

Could you add how to calculate a column's values from rows in other DataFrames that share a multi-index?
# df1 data:
# category, date, data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-06 01:01:01', 21, 22
# dog, '2017-11-04 01:01:01', 11, 12
# dog, '2017-11-07 01:01:01', 21, 31

# df2 data:
# category, date, data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-07 11:11:11', 21, 22
# dog, '2017-11-06 01:01:01', 11, 12
# dog, '2017-11-07 10:01:01', 21, 31

dfresult = pd.DataFrame(
    columns=['category', 'date', 'indication1', 'indication2', 'calculated'],
    index=('category', 'date'))
dfresult.append(df1[['category', 'date']])
# dfresult data:
# category, date, indication1, indication2, calculated
# cat, '2017-11-05 01:01:01', na, na, na
# dog, '2017-11-07 10:01:01', na, na, na

# now set dfresult['data1'] to be a function:
#     - if both df1['data1'] and df2['data1'] of that index are na: "x"
#     - if only df1['data1'] of that index is na: "x1"
#     - if only df2['data1'] of that index is na: "x2"
#     - if df2 is zero then: "z"
#     - otherwise set to: df1['data1'] / df2['data1'] - 1
# and set dfresult['calculated'] to a function of the current row, i.e. = dfresult['data1'] < tolerance

@oserttas-math

Hi All,

Thank you Ms. Sweger for sharing this valuable asset. I have a quick question about replacing columns in a DataFrame with columns from another DataFrame. What I found is to assign the new column to the old one, but this raises a warning.
Basically what I did is
df_1['oldColumn'] = df_2['newColumn']
This does the work but emits "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame."

I was wondering what your solution would be for this one Ms. Sweger...

Thanks again!

@krauseling

Lovely collection! Thanks much!

@blaggacao

blaggacao commented Dec 17, 2017

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(valuelist)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))

@sidharthbolar

thanks!

@aic5

aic5 commented Jan 14, 2018

That is great! Thanks for sharing.

@kmr0877

kmr0877 commented Jan 26, 2018

Really cool stuff. Appreciate the efforts 👍

@salman6049

Thanks for sharing. I am interested in knowing how to grab column values in a DataFrame through a loop.

@vsaraogi

you saved my day, thanks

@jdfthetech

excellent work

@jerisalan

Immensely helpful. Kudos for preparing this 👍

@brianz

brianz commented Mar 8, 2018

refunds[col].replace('[^0-9]+.-', '', regex=True) was a great find for me...thank you!

@GitPatrickHarris

This is huge!

@manmartgarc

Fantastic!

@cnmetzi

cnmetzi commented Jun 1, 2018

Thanks for sharing! Very helpful!

@musalys

musalys commented Jul 11, 2018

Thanks for this!! really helpful and useful!

@ineedme

ineedme commented Oct 24, 2018

# Validate a date string in DD/MM/YYYY format
import datetime

inputDate = '24/10/2018'  # example value; substitute your own input
day, month, year = inputDate.split('/')
isValidDate = True
try:
    datetime.datetime(int(year), int(month), int(day))
except ValueError:
    isValidDate = False
if isValidDate:
    print("Input date is valid.")
else:
    print("Input date is not valid.")

@nonagonyang

This is extremely useful. Thank you

@bsweger
Author

bsweger commented Dec 6, 2018

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(valuelist)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))

Suuuper belated response, but I think what happened here is that my wording in the example was super confusing. By column, I meant "the name of the column you're searching" but that wasn't at all clear. Updated those snippets, so hopefully it won't trip up anyone else. Thanks for the feedback!

@PrabhaDias

Thank you so much for doing this - it's really, really helpful! You've saved me a couple of hours of head-scratching while wading through various stackoverflow answers.

@pedrolugo2

pedrolugo2 commented Aug 12, 2020

Incredible work! Thank you for sharing!
I made a VSCode snippet file out of it, if anyone is interested!

@andreeagheorghe

Nice list!

I created a way to quickly search and locate pandas snippets. It saves me a ton of time.

https://AllTheSnippets.com/search/

@adsd19

adsd19 commented Feb 14, 2022

Best.
