- Calculate the percentage of missing values in each column and sort them in descending order.
- Missing values and outliers are not problems to be fixed! They are facts.
- During EDA, do not “fix” them: the goal is to understand your data and your problem as they are.
- If you see missing values, just report them.
- Identify and understand your target variable.
- Understand the type of the target variable: binary, categorical, or numeric.
- Examine the distribution of the target variable.
- For a binary variable (converted to 0s and 1s first if it is stored as strings), use the mean, which equals the proportion of 1s.
- For a categorical variable, use value counts.
- For a numeric variable, use a histogram or the summary table from pandas' `describe()`.
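
The checks above can be sketched in pandas as follows. This is a minimal illustration, not a prescribed implementation: the helper names `missing_percentage` and `summarize_target`, the `positive_label` parameter, and the toy DataFrame are all assumptions made for the example. Treating any column with exactly two distinct values as binary is likewise a simplifying assumption.

```python
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def summarize_target(target: pd.Series, positive_label=None):
    """Summarize a target by type: binary, numeric, or categorical."""
    # Missing values are skipped for the summary only; nothing is imputed
    # and the original Series is left unchanged.
    values = target.dropna()
    if values.nunique() == 2:
        # Binary: encode as 0/1, then the mean is the proportion of 1s.
        # Assumption: if no positive label is given, use the larger value.
        if positive_label is None:
            positive_label = values.max()
        return (values == positive_label).astype(int).mean()
    if pd.api.types.is_numeric_dtype(values):
        # Numeric: summary table (count, mean, std, quartiles, min, max).
        return values.describe()
    # Categorical: frequency of each category.
    return values.value_counts()


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["Oslo", "Rome", None, None],
    "plan": ["basic", "pro", "basic", "team"],
    "churned": ["yes", "no", "no", "yes"],
})

print(missing_percentage(df))                                # report, do not fix
print(summarize_target(df["churned"], positive_label="yes")) # binary target
print(summarize_target(df["plan"]))                          # categorical target
print(summarize_target(df["age"]))                           # numeric target
```

For a numeric target, `df["age"].hist()` would draw the histogram mentioned above; it is left out of the sketch because it requires matplotlib.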