Skip to content

Instantly share code, notes, and snippets.

@phucnh
Last active September 10, 2019 05:45
Show Gist options
  • Save phucnh/769edf35ff3d882b3af64a8ee5ff297d to your computer and use it in GitHub Desktop.
Save phucnh/769edf35ff3d882b3af64a8ee5ff297d to your computer and use it in GitHub Desktop.
Find rate of empty value grouped by column in csv (or tsv) file

Count empty values in csv file, grouped by column with pandas

Reference: Data Cleaning with Python and Pandas: Detecting Missing Values

Prerequisites

  • Python 3.6
  • pip

Running

Preparation

  1. Place csv file into your local computer

  2. Ensure that csv file contains header row

  3. Install requirements by:

$ pip3 install -r requirements.txt

Executing

$ cat data/sample.csv
PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
100001000,104,PUTNAM,Y,3,1,1000
100002000,197,LEXINGTON,N,3,1.5,
100003000,,LEXINGTON,N,,1,850
100004000,,BERKELEY,12,1,,700
100005000,203,BERKELEY,Y,3,2,1600
100006000,207,BERKELEY,Y,,1,800
100007000,100,WASHINGTON,,2,HURLEY,950
100008000,213,TREMONT,Y,1,1,
100009000,215,TREMONT,Y,,2,1800

$ python3 app.py --path data/sample.csv --type=csv
File type is [csv]
Sample data:
         PID  ST_NUM    ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH   SQ_FT
0  100001000   104.0     PUTNAM            Y           3.0        1  1000.0
1  100002000   197.0  LEXINGTON            N           3.0      1.5     NaN
2  100003000     NaN  LEXINGTON            N           NaN        1   850.0
3  100004000     NaN   BERKELEY           12           1.0      NaN   700.0
4  100005000   203.0   BERKELEY            Y           3.0        2  1600.0


Rate of empty values, grouped by column.
Left is column name, right is rate of empty values (range 0.0 ~ 1.0):
ST_NUM          0.222222
OWN_OCCUPIED    0.111111
NUM_BEDROOMS    0.333333
NUM_BATH        0.111111
SQ_FT           0.222222
dtype: float64

Packaging

$ pip3 install pyinstaller
$ pip3 install -r requirements.txt
$ pyinstaller --onefile \
  --hiddenimport "fsspec.implementations" \
  --hiddenimport "fsspec.implementations.local" \
  app.py
...

# Execute program
$ ./dist/app --path <your-csv-file> --type <csv-or-tsv>
# -*- coding: utf-8 -*-
import argparse
import dask.dataframe as ddf
import os
import traceback
def detect_null_columns(file_path, file_type):
"""Detect columns that contains empty value(s).
Find rate of empty values of each column."""
# Check file path
if not (os.path.exists(file_path) and os.path.isfile(file_path)):
raise Exception(f'[{file_path}] file is not found or is not file')
# Read csv file to DataFrame, also trim whitespaces
delimiter = ''
if file_type == 'tsv':
print('File type is [tsv]')
delimiter = '\t'
elif file_type == 'csv':
print('File type is [csv]')
delimiter = ','
else:
raise Exception(f'Unknown file type [{file_type}]')
read_df = ddf.read_csv(file_path, sep=delimiter,
engine='python', encoding='utf-8')
print('Sample data:')
print(read_df.head())
null_columns = read_df.columns[read_df.isnull().any()]
null_columns_summary = (read_df[null_columns].isnull().mean()).compute(scheduler='processes')
print('\n')
print("""Rate of empty values, grouped by column.
Left is column name, right is rate of empty values (range 0.0 ~ 1.0):""")
print(null_columns_summary)
def run(args):
"""Execute program"""
try:
file_path = args.path
file_type = args.type
detect_null_columns(file_path, file_type) # type: ignore
except Exception as e:
print('Got unexpected exception:', e)
print(traceback.format_exc())
# print(e)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='CLI argument parser')
parser.add_argument('--path', required=True,
type=str, help='Path to csv file')
parser.add_argument('--type', required=True,
type=str, help='Type of file (e.g. csv, tsv)')
args = parser.parse_args()
run(args)
argparse==1.4.0
# After https://github.com/cloudpipe/cloudpickle/pull/299 is released,
# MUST use cloudpickle 1.1.1 because >=1.2.0 is work only with python 3.7.
# bump cloudpickle version.
cloudpickle==1.1.1
dask==2.3.0
dask[dataframe]
fsspec==0.4.4
numpy==1.17.2
pandas==0.25.1
python-dateutil==2.8.0
pytz==2019.2
six==1.12.0
toolz==0.10.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment