@DGrady
Last active August 18, 2018 04:16
Analyze data frames that contain mainly categorical (string) data
import pandas as pd


def describe_population(df: pd.DataFrame) -> pd.DataFrame:
    """
    Report the populated and uniqueness counts for each column of the input.
    The ratio columns are given as percents.
    """
    N = len(df)
    dtypes = df.dtypes
    distincts = df.nunique()
    nas = df.isnull().sum()
    pop = N - nas
    out = pd.DataFrame()
    out['dtype'] = dtypes
    out['na'] = nas
    out['populated'] = pop
    out['distinct'] = distincts
    out['pop/N'] = 100 * pop / N
    out['dist/pop'] = 100 * distincts / pop
    out.columns.name = "N = {:,}".format(N)
    return out
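
A small usage sketch may help show what the report contains. The sample frame and its column names (`city`, `tier`) are made up for illustration, and the function is restated in condensed form only so that the snippet runs on its own; it is intended to compute the same columns as the version above.

```python
import pandas as pd


def describe_population(df: pd.DataFrame) -> pd.DataFrame:
    # Condensed restatement of the gist's function so this snippet is standalone.
    n = len(df)
    nas = df.isnull().sum()        # missing values per column
    pop = n - nas                  # populated (non-null) values per column
    distincts = df.nunique()       # distinct non-null values per column
    out = pd.DataFrame({
        "dtype": df.dtypes,
        "na": nas,
        "populated": pop,
        "distinct": distincts,
        "pop/N": 100 * pop / n,
        "dist/pop": 100 * distincts / pop,
    })
    out.columns.name = "N = {:,}".format(n)
    return out


# Hypothetical sample data: four rows, with one missing value in "city".
sample = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Austin"],
    "tier": ["a", "a", "b", "b"],
})

report = describe_population(sample)
print(report)
```

For this sample, `city` has one missing value, three populated values, and two distinct values, so `pop/N` is 75 and `dist/pop` is about 66.7. `tier` is fully populated with two distinct values, so `pop/N` is 100 and `dist/pop` is 50.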
Syrus commented Aug 18, 2018

Thank you! I will modify it to reveal data values since I don't have to worry about revealing secrets in my use case.
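
One way such a modification might look is sketched below. This is an illustrative guess, not Syrus's actual code: the helper name `describe_top_values` and the output columns `top` and `top_count` are hypothetical, and the sample data is invented. It reports the most frequent non-null value in each column, which does expose data values and so only suits cases where that is acceptable.

```python
import pandas as pd


def describe_top_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: most frequent non-null value per column and its count."""
    rows = {}
    for name in df.columns:
        # value_counts() drops NaN by default and sorts by count, descending.
        counts = df[name].value_counts()
        if counts.empty:
            rows[name] = {"top": None, "top_count": 0}
        else:
            rows[name] = {"top": counts.index[0], "top_count": int(counts.iloc[0])}
    return pd.DataFrame.from_dict(rows, orient="index")


# Invented sample data with a clear most-common value in each column.
sample = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Austin"],
    "tier": ["a", "a", "a", "b"],
})

top = describe_top_values(sample)
print(top)
```

The result is indexed by column name, like the output of `describe_population`, so the two frames could be joined side by side if a single report is preferred.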
