glebmikha’s gists

Calculate the percentage of missing values in each column and sort them in descending order.
1. Missing values and outliers are not problems to be fixed! They are facts.
2. During EDA you must not “fix” them, because you have to deal with your data and your problem as they are.
3. If you see missing values, just report them.
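
A minimal sketch of this step with pandas, assuming the data is already loaded into a DataFrame. The toy frame and its column names (age, city, income) are hypothetical and only illustrate the call chain. Nothing is filled in or dropped; the missing values are only reported.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Oslo", None, None, None],
    "income": [50_000, 62_000, 58_000, 71_000],
})

# isna() marks missing cells as True; the column-wise mean of those
# booleans is the fraction missing, and mul(100) makes it a percentage.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Report only: no imputation, no dropping of rows or columns.
print(missing_pct)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand and gives the same result as summing the mask and dividing by len(df).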
Identify and understand your target variable.
1. Understand the type of the target variable: binary, categorical, or numeric.
2. Examine the distribution of the target variable.
  1. For a binary variable (converted into 0s and 1s first if it is stored as strings), the mean, which equals the proportion of 1s, is used.
  2. For a categorical variable, value counts are used.
  3. For a numeric variable, a histogram or pandas' describe() table is used.
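
The three cases above can be sketched with pandas as follows. The series names and values (churn, plan, price) are hypothetical, and the "yes"/"no" mapping is an assumption about how a string-encoded binary target might look.

```python
import pandas as pd

# Binary target stored as strings (hypothetical labels "yes"/"no").
churn = pd.Series(["yes", "no", "no", "yes", "no"])
# Convert to 0s and 1s first; the mean is then the proportion of 1s.
churn01 = churn.map({"no": 0, "yes": 1})
rate = churn01.mean()
print(rate)

# Categorical target: count how often each category occurs.
plan = pd.Series(["basic", "pro", "basic", "basic"])
plan_counts = plan.value_counts()
print(plan_counts)

# Numeric target: describe() gives count, mean, std, min, quartiles, max.
price = pd.Series([10.0, 12.5, 11.0, 30.0])
price_summary = price.describe()
print(price_summary)
```

For the numeric case, price.hist() would draw the histogram instead, provided matplotlib is installed.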

Gleb Mikhaylov glebmikha