the tl;dr of https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
- to select a column of data, use brackets: `df['column_name']`
- to select rows of data, use `.loc`, or slice a datetime index: `df['2019-01-01':'2019-02-28']`
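A quick sketch of both (the frame and its values are invented):

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [1.0, 2.0, 3.0]},
    index=pd.date_range('2019-01-01', periods=3, freq='MS'),
)

prices = df['price']                    # column selection with brackets
jan = df.loc['2019-01-01']              # row selection with .loc
winter = df['2019-01-01':'2019-02-28']  # slicing a DatetimeIndex
```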
- if performance is the primary concern, use a NumPy array instead of Pandas
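Dropping down to NumPy might look like this (a sketch; `.to_numpy()` is the standard accessor):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])

arr = s.to_numpy()   # plain NumPy array
total = np.sum(arr)  # NumPy ops skip pandas' per-call overhead
```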
- use `read_csv` and its many arguments for reading files
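For instance (the file name, columns, and dtypes are placeholders):

```python
import pandas as pd

# usecols, parse_dates, and dtype are a few of read_csv's many arguments
df = pd.read_csv(
    'data.csv',
    usecols=['date', 'price'],
    parse_dates=['date'],
    dtype={'price': 'float64'},
)
```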
- use the `.isna` method to build a boolean mask of NaN values, then filter rows with it
- use `~` to negate the mask
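Put together (the column name is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.0, np.nan, 3.0]})

missing = df[df['price'].isna()]   # rows where price is NaN
present = df[~df['price'].isna()]  # ~ negates the mask: rows with a value
```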
- prefer the operators (`+ - * / ** // %` and `> < >= <= == !=`) over their method counterparts (`add sub mul truediv pow floordiv mod` and `gt lt ge le eq ne`)
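Both forms give the same result; the operator form reads better:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

doubled = s * 2       # prefer this...
doubled_m = s.mul(2)  # ...over the method form
small = s < 3         # prefer this...
small_m = s.lt(3)     # ...over the method form
```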
- use pandas' math aggregation methods instead of the built-in math functions:
  - `df['column_name'].sum()` instead of `sum(df['column_name'])`
  - `df['column_name'].max()` instead of `max(df['column_name'])`
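The builtins loop in Python and treat NaN differently, so the results can diverge too:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

s.sum()  # 4.0: vectorized, skips NaN by default
sum(s)   # nan: Python-level loop, NaN poisons the result
```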
- prefer `df.groupby(...).agg(...)` for groupby aggregation
  - Good: `df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})`
    - e.g. `df.groupby('fruit').agg({'tastiness': 'mean'})`
    - e.g. `df.groupby('fruit').agg({'tastiness': 'mean', 'weight': ['mean', 'median']})`
  - OK: `df.groupby('grouping column')['aggregating column'].agg('aggregating function')`
    - e.g. `df.groupby('fruit')['tastiness'].mean()`
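A runnable version of the "Good" form (the fruit data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'pear'],
    'tastiness': [7.0, 9.0, 6.0],
    'weight': [150, 170, 180],
})

df.groupby('fruit').agg({'tastiness': 'mean', 'weight': ['mean', 'median']})
```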
- for going from wide to long format, prefer `melt` over `stack`
- for going from long to wide format, prefer `pivot_table` over `unstack` or `pivot`
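A round trip with both (the wide table is invented):

```python
import pandas as pd

wide = pd.DataFrame({
    'fruit': ['apple', 'pear'],
    '2019': [10, 20],
    '2020': [30, 40],
})

# wide -> long: melt rather than stack
long_df = wide.melt(id_vars='fruit', var_name='year', value_name='sales')

# long -> wide: pivot_table rather than unstack/pivot
back = long_df.pivot_table(index='fruit', columns='year', values='sales')
```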