@bsweger
Last active November 13, 2024 19:55

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Data Types and Conversion

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)

pd.to_numeric(df['Column Name'])

Convert Series datatype to numeric, changing non-numeric values to NaN
(h/t @makmanalp for the updated syntax!)

pd.to_numeric(df['Column Name'], errors='coerce')

Change data type of DataFrame column

df.column_name = df.column_name.astype(np.int64)
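
Note that astype(np.int64) raises an error if the column contains NaN. One workaround, as a minimal sketch assuming a reasonably recent pandas (>= 0.24) with the nullable integer dtype:

# assumption: pandas >= 0.24; 'Int64' (capital I) is the nullable integer dtype
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce').astype('Int64')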

Exploring and Finding Data

Get a report of all duplicate records in a DataFrame, based on specific columns

dupes = df[df.duplicated(
    ['col1', 'col2', 'col3'], keep=False)]

List unique values in a DataFrame column
(h/t @makmanalp for the updated syntax!)

df['Column Name'].unique()

For each unique value in a DataFrame column, get a frequency count

df['Column Name'].value_counts()

Grab DataFrame rows where column = a specific value

df = df.loc[df.column == 'somevalue']

Grab DataFrame rows where column value is present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(test_data.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[df.col2.isin(valuelist)]

Grab DataFrame rows where column value is not present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(test_data.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[~df.col2.isin(valuelist)]

Select from DataFrame using criteria from multiple columns
(use | instead of & to do an OR)

newdf = df[(df['column_one']>2004) & (df['column_two']==9)]
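
For example, the OR form of the same filter would look like this (same placeholder column names):

newdf = df[(df['column_one'] > 2004) | (df['column_two'] == 9)]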

Loop through rows in a DataFrame
(if you must)

for index, row in df.iterrows():
    print (index, row['some column'])

Much faster way to loop through DataFrame rows if you can work with tuples
(h/t hughamacmullaniv)

for row in df.itertuples():
    print(row)
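
If you need individual fields, itertuples yields namedtuples, so you can access columns by attribute (a sketch assuming a column literally named col1):

for row in df.itertuples(index=False):
    print(row.col1)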

Get top n for each group of columns in a sorted DataFrame
(make sure DataFrame is sorted first)

top5 = df.groupby(
    ['groupingcol1',
    'groupingcol2']).head(5)
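
A sketch of the full pattern with the sort included (sortcol is a placeholder for whatever column defines "top"):

top5 = (df
    .sort_values('sortcol', ascending=False)
    .groupby(['groupingcol1', 'groupingcol2'])
    .head(5))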

Grab DataFrame rows where specific column is null/notnull

newdf = df[df['column'].isnull()]

Select from DataFrame using multiple keys of a hierarchical index

df.xs(
    ('index level 1 value','index level 2 value'),
    level=('level 1','level 2'))
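
A small self-contained example with made-up data, showing the same call against a two-level index:

# toy data for illustration only
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1)], names=['level 1', 'level 2'])
mdf = pd.DataFrame({'val': [10, 20, 30]}, index=idx)
mdf.xs(('a', 1), level=('level 1', 'level 2'))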

Slice values in a DataFrame column (aka Series)

df.column.str[0:2]

Get quick count of rows in a DataFrame

len(df.index)

Get length of data in a DataFrame column

df.column_name.str.len()

Updating and Cleaning Data

Delete column from DataFrame

del df['column']

Rename several DataFrame columns

df = df.rename(columns = {
    'col1 old name':'col1 new name',
    'col2 old name':'col2 new name',
    'col3 old name':'col3 new name',
})

Lower-case all DataFrame column names

df.columns = [col.lower() for col in df.columns]

Even more fancy DataFrame column re-naming
apply an arbitrary function to each column name (in this example, keep only the part after the last '.')

df.rename(columns=lambda x: x.split('.')[-1], inplace=True)

Lower-case everything in a DataFrame column

df.column_name = df.column_name.str.lower()

Sort DataFrame by multiple columns

df = df.sort_values(
    ['col1', 'col2', 'col3'], ascending=[True, True, False])

Change all NaNs to None (useful before loading to a db)

df = df.where((pd.notnull(df)), None)
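
Heads-up: in newer pandas versions, numeric columns may quietly turn that None back into NaN. Casting to object first is one workaround (a sketch worth verifying against your pandas version):

# assumption: casting to object keeps None from being upcast back to NaN
df = df.astype(object).where(pd.notnull(df), None)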

More pre-db insert cleanup...make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to None
(not especially recommended but including here b/c I had to do this in real life once)

df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None)

Get rid of non-numeric values throughout a DataFrame:

for col in refunds.columns.values:
    refunds[col] = refunds[col].replace(
        '[^0-9]+.-', '', regex=True)

Set DataFrame column values based on other column values
(h/t: @mlevkov)

df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value
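
A tiny runnable example with made-up data (99 stands in for new_value):

df = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4], 'column_to_change': [0, 0]})
df.loc[(df['column1'] == 1) & (df['column2'] == 3), ['column_to_change']] = 99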

Clean up missing values in multiple DataFrame columns

df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})

Doing calculations with DataFrame columns that have missing values. In the example below, swap in 0 for df['col1'] cells that contain null.

df['new_col'] = np.where(
    pd.isnull(df['col1']), 0, df['col1']) + df['col2']

Split delimited values in a DataFrame column into two new columns

df['new_col1'], df['new_col2'] = zip(
    *df['original_col'].apply(
        lambda x: x.split(': ', 1)))
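
An alternative sketch using pandas' built-in string splitting (same placeholder column names):

df[['new_col1', 'new_col2']] = df['original_col'].str.split(': ', n=1, expand=True)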

Collapse hierarchical column indexes

df.columns = df.columns.get_level_values(0)

Reshaping, Concatenating, and Merging Data

Pivot data (with flexibility about what becomes a column and what stays a row).

pd.pivot_table(
    df, values='cell_value',
    index=['col1', 'col2', 'col3'],  # these stay as columns; will fail silently if any of these cols have null values
    columns=['col4'])  # data values in this column become their own column
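
A toy example with made-up data showing the reshape; the default aggfunc is mean:

sales = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 110, 90, 95]})
pd.pivot_table(
    sales, values='revenue', index=['region'], columns=['quarter'])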

Concatenate two DataFrame columns into a new, single column
(useful when dealing with composite keys, for example)
(h/t @makmanalp for improving this one!)

df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str)

Display and formatting

Set up formatting so larger numbers aren't displayed in scientific notation
(h/t @thecapacity)

pd.set_option('display.float_format', lambda x: '%.3f' % x)

To display with commas and no decimals

pd.options.display.float_format = '{:,.0f}'.format
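
To undo either of these and return to pandas' default float formatting:

pd.reset_option('display.float_format')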

Creating DataFrames

Create a DataFrame from a Python dictionary

df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2'])
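
A related sketch: if you want the dictionary keys to become the index instead, from_dict with orient='index' handles that (made-up data; the columns argument assumes pandas >= 0.23):

a_dictionary = {'row1': [1, 2], 'row2': [3, 4]}
df = pd.DataFrame.from_dict(a_dictionary, orient='index', columns=['colA', 'colB'])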

Convert Django queryset to DataFrame

qs = DjangoModelName.objects.all()
q = qs.values()
df = pd.DataFrame.from_records(q)

Creating Series

Create a new series using an index from an existing series

# original_series = the original series
# simplistically generate a list of values to use in the new series
new_series_values = list(range(1, len(original_series) + 1))
new_series = pd.Series(new_series_values, index=original_series.index)

@pashute

pashute commented Nov 6, 2017

Could you add how to calculate a column's values from rows in other DataFrames that share a multi-index?
# df1 data:
# category, date, data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-06 01:01:01', 21, 22
# dog, '2017-11-04 01:01:01', 11, 12
# dog, '2017-11-07 01:01:01', 21, 31

# df2 data:
# category, date, data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-07 11:11:11', 21, 22
# dog, '2017-11-06 01:01:01', 11, 12
# dog, '2017-11-07 10:01:01', 21, 31

dfresult = pd.DataFrame(
    columns=['category', 'date', 'indication1', 'indication2', 'calculated'],
    index=('category', 'date'))
dfresult.append(df1[['category', 'date']])
# dfresult data:
# category, date, indication1, indication2, calculated
# cat, '2017-11-05 01:01:01', na, na, na
# dog, '2017-11-07 10:01:01', na, na, na

# now set dfresult['data1'] to be a function:
#     - if both df1['data1'] and df2['data1'] of that index are na: "x"
#     - if only df1['data1'] of that index is na: "x1"
#     - if only df2['data1'] of that index is na: "x2"
#     - if df2 is zero then: "z"
#     - otherwise set to: df1['data1'] / df2['data1'] - 1
# and set dfresult['calculated'] to a function of the current row, i.e. = dfresult['data1'] < tolerance

@oserttas-math

Hi All,

Thank you Ms. Sweger for sharing this valuable asset. I have a quick question about replacing columns in a DataFrame with columns from another DataFrame. What I found is to assign the new column to the old one, but this raises a warning.
Basically what I did is
df_1['oldColumn'] = df_2['newColumn']
This does the work but emits "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame."

I was wondering what your solution would be for this one Ms. Sweger...

Thanks again!

@krauseling

Lovely collection! Thanks much!

@blaggacao

blaggacao commented Dec 17, 2017

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(valuelist)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))

@sidharthbolar

thanks!

@aic5

aic5 commented Jan 14, 2018

That is great! Thanks for sharing.

@kmr0877

kmr0877 commented Jan 26, 2018

Really cool stuff. Appreciate the efforts 👍

@salman6049

Thanks for sharing. I am interested in knowing how to grab column values in a DataFrame through a loop.

@vsaraogi

you saved my day, thanks

@jdfthetech

excellent work

@jerisalan

Immensely helpful. Kudos for preparing this 👍

@brianz

brianz commented Mar 8, 2018

refunds[col].replace('[^0-9]+.-', '', regex=True) was a great find for me...thank you!

@GitPatrickHarris

This is huge!

@manmartgarc

Fantastic!

@cnmetzi

cnmetzi commented Jun 1, 2018

Thanks for sharing! Very helpful!

@musalys

musalys commented Jul 11, 2018

Thanks for this!! really helpful and useful!

@ineedme

ineedme commented Oct 24, 2018

# Validate a date string in DD/MM/YYYY format
import datetime

inputDate = '24/10/2018'  # example value; substitute your own input
day, month, year = inputDate.split('/')
isValidDate = True
try:
    datetime.datetime(int(year), int(month), int(day))
except ValueError:
    isValidDate = False
if isValidDate:
    print("Input date is valid.")
else:
    print("Input date is not valid.")

@nonagonyang

This is extremely useful. Thank you

@bsweger
Author

bsweger commented Dec 6, 2018

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(valuelist)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))

Suuuper belated response, but I think what happened here is that my wording in the example was super confusing. By column, I meant "the name of the column you're searching" but that wasn't at all clear. Updated those snippets, so hopefully it won't trip up anyone else. Thanks for the feedback!

@PrabhaDias

Thank you so much for doing this - it's really, really helpful! You've saved me a couple of hours of head-scratching while wading through various stackoverflow answers.

@pedrolugo2

pedrolugo2 commented Aug 12, 2020

Incredible work! Thank you for sharing!
I made a VSCode snippet file out of it, if anyone is interested!

@andreeagheorghe

Nice list!

I created a way to quickly search and locate pandas snippets. It saves me a ton of time.

https://AllTheSnippets.com/search/

@adsd19

adsd19 commented Feb 14, 2022

Best.
