Skip to content

Instantly share code, notes, and snippets.

@decorouz
Last active August 5, 2024 13:46
Show Gist options
  • Save decorouz/a17da385110593c5ff43915bce3c9e77 to your computer and use it in GitHub Desktop.
Save decorouz/a17da385110593c5ff43915bce3c9e77 to your computer and use it in GitHub Desktop.
# When analyzing survey data, it is common for missing values to be represented by various placeholders
# such as "DK", "NO RESPONSE", "Missing/DK", and "REFUSED". During exploratory data analysis (EDA),
# it is standard practice to examine instances of missing values using the command df.isna().sum().
# However, this command does not account for the aforementioned placeholders.
# Consequently, it is beneficial to employ a utility function that computes the count of these specific
# representations of missing values.
def count_special_values(df: pd.DataFrame)->pd.DataFrame:
"""
Counts the occurrences of specific special values (DK, REFUSED, NO RESPONSE, Missing/DK)
and missing values (NaN) in each column.
Parameters:
df (pandas DataFrame): The input DataFrame
Returns:
pandas DataFrame: A DataFrame that shows the counts of special values and missing values for each column
"""
# List of special values to look for
special_values = ["DK", "NO RESPONSE", "Missing/DK"]
# Create a dictionary to hold the counts
counts = {val: {} for val in special_values}
counts['MISSING'] = {}
for column in df.columns:
for val in special_values:
counts[val][column] = (df[column] == val).sum()
# Count missing values (NaN) in the current column
counts['MISSING'][column] = df[column].isna().sum()
result_df = pd.DataFrame(counts)
# Filter out columns where the count of all special values and missing values is 0
result_df = result_df.loc[(result_df > 0).any(axis=1)]
return result_df
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment