Last active August 18, 2018 04:16
Analyze data frames that contain mainly categorical (string) data
import pandas as pd


def describe_population(df: pd.DataFrame) -> pd.DataFrame:
    """
    Report the populated and uniqueness counts for each column of the input.

    The ratio columns are given as percents.
    """
    N = len(df)

    dtypes = df.dtypes
    distincts = df.nunique()  # distinct non-null values per column
    nas = df.isnull().sum()   # missing values per column
    pop = N - nas             # populated (non-null) values per column

    # One output row per input column.
    out = pd.DataFrame()
    out['dtype'] = dtypes
    out['na'] = nas
    out['populated'] = pop
    out['distinct'] = distincts
    out['pop/N'] = 100 * pop / N
    out['dist/pop'] = 100 * distincts / pop

    # Show the total row count as the label of the column axis.
    out.columns.name = "N = {:,}".format(N)

    return out
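
A small usage sketch may help show what the summary looks like. The `sample` frame and its values are made up for illustration, and the function is restated in condensed form only so the snippet runs on its own; it is meant to behave the same as the version above.

```python
import pandas as pd


def describe_population(df: pd.DataFrame) -> pd.DataFrame:
    # Condensed restatement of the function above so this snippet runs alone.
    N = len(df)
    nas = df.isnull().sum()
    pop = N - nas
    distincts = df.nunique()
    out = pd.DataFrame({
        'dtype': df.dtypes,
        'na': nas,
        'populated': pop,
        'distinct': distincts,
        'pop/N': 100 * pop / N,
        'dist/pop': 100 * distincts / pop,
    })
    out.columns.name = "N = {:,}".format(N)
    return out


# Hypothetical sample data: two string columns with some missing values.
sample = pd.DataFrame({
    'color': ['red', 'blue', 'red', None],
    'size': ['S', None, None, 'L'],
})

summary = describe_population(sample)
print(summary)
```

Each row of `summary` describes one column of `sample`. Here `color` has one missing value and two distinct values among its three populated rows, so its `pop/N` is 75 and its `dist/pop` is about 66.7.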
Thank you! I will modify it to reveal data values since I don't have to worry about revealing secrets in my use case.
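
One way such a modification could look, as a rough sketch only: the helper name `describe_top_values` and the choice to report the most frequent value per column are assumptions for illustration, not something taken from the gist or from the commenter.

```python
import pandas as pd


def describe_top_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: most frequent non-null value in each column."""
    rows = {}
    for col in df.columns:
        # value_counts() ignores missing values and sorts by count, descending.
        counts = df[col].value_counts()
        if counts.empty:
            rows[col] = {'top': None, 'freq': 0}
        else:
            rows[col] = {'top': counts.index[0], 'freq': int(counts.iloc[0])}
    # One output row per input column, with 'top' and 'freq' columns.
    return pd.DataFrame.from_dict(rows, orient='index')


# Made-up example data.
example = pd.DataFrame({
    'color': ['red', 'blue', 'red', None],
    'size': ['S', 'S', None, 'L'],
})

top = describe_top_values(example)
print(top)
```

Because this output exposes actual data values, it only suits cases where, as in the comment above, revealing them is not a concern.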