Data science checklist
Step 1
Analyse the project's goals/objectives
- check whether the goal is to predict, classify, or cluster the data
Data loading
- check data sources:
  - if file:
    - check the file extension
    - check the file size/dimensions (whether it fits in memory)
    - use the appropriate method for reading/loading the data from the file
  - if URL:
    - check that the URL is healthy (e.g., responds successfully)
    - download the URL contents to disk/memory
    - then follow the file steps above
  - if streaming:
    - store a batch of data
    - then follow the file steps above
- check for available metadata (sometimes there is extra information about the data)
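The file branch above can be sketched as a small dispatch on the file extension. The reader mapping and the size threshold are illustrative assumptions, not a fixed rule:

```python
import os

# Map common file extensions to pandas reader names (illustrative subset).
READERS = {
    ".csv": "read_csv",
    ".json": "read_json",
    ".parquet": "read_parquet",
    ".xlsx": "read_excel",
}

def pick_reader(path, max_bytes=1_000_000_000):
    """Return the pandas reader name for `path`, checking extension and size."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"unsupported extension: {ext}")
    # Only check the size for files that actually exist on disk.
    if os.path.exists(path) and os.path.getsize(path) > max_bytes:
        raise MemoryError("file may not fit in memory; consider chunked reading")
    return READERS[ext]
```

The same function covers the URL and streaming branches once their contents have been written to disk.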
Exploratory data analysis
- visualize a data sample from the train/test sets
- check the data types
- determine the independent and dependent variables in the dataset
Data preprocessing/cleaning/wrangling
- check if the data is clean
  - check for missing values
    - if a feature has missing values:
      - if % of missing values < 10%: delete the affected rows
      - if % of missing values > 10%: delete the feature
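The 10% rule above can be sketched with pandas (the threshold comes from this checklist; it is a heuristic, not a universal constant):

```python
import numpy as np
import pandas as pd

def handle_missing(df, threshold=0.10):
    """Drop features with more than `threshold` fraction of missing values,
    then drop the remaining rows that still contain missing values."""
    frac_missing = df.isna().mean()          # per-column fraction of NaNs
    keep = frac_missing[frac_missing <= threshold].index
    return df[keep].dropna()

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10],             # 10% missing -> keep column
    "b": [1, np.nan, np.nan, np.nan, 5, 6, 7, 8, 9, 10],   # 30% missing -> drop column
})
clean = handle_missing(df)
```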
Data modeling
- Model selection:
  - select metrics for evaluation
  - select a random forest algorithm to optimize
  - define the model's hyperparameters (a small set of parameters)
  - train/optimize models using cross-validation
  - select the best model
  - evaluate the model for under- or overfitting effects
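A minimal sketch of this baseline step, assuming a scikit-learn-style workflow (the dataset and metric are placeholders for your own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Small, explicit hyperparameter set, as the checklist suggests.
model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)

# 5-fold cross-validation with accuracy as the evaluation metric.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
baseline = scores.mean()
```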
Solution analysis
- check the model accuracy and set it as the minimal baseline for the dataset
- check which features contributed most to the model
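Feature contributions can be read off a fitted random forest via its `feature_importances_` attribute (sketched here on a toy dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# Pair each feature name with its importance, highest first.
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda p: p[1], reverse=True)
```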
Step 2
Data preprocessing/cleaning/wrangling
- check the data format
  - check that the data has a valid range of values (e.g., ages above 100 are usually unlikely)
  - check the data type (numerical or categorical)
    - if categorical:
      - check the % of unique values vs. total values
        - if low, the feature is categorical and does not need further processing
        - if high, the feature may need further analysis and processing
    - if numerical:
      - check whether it is continuous or categorical (% of unique values vs. total values)
        - if continuous:
          - check for homoscedasticity
          - check for normality (skewness/kurtosis)
          - check for linearity
        - if categorical:
          - convert it to a categorical type
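The %-unique heuristic and the normality check can be sketched as follows; the 5% cutoff is an illustrative assumption, not a standard value:

```python
import pandas as pd

def looks_categorical(series, max_unique_frac=0.05):
    """Heuristic: a low ratio of unique to total values suggests a categorical feature."""
    return series.nunique() / len(series) <= max_unique_frac

s_cat = pd.Series([0, 1, 0, 1] * 25)   # 2 unique / 100 values -> categorical
s_cont = pd.Series(range(100))         # 100 unique / 100 values -> continuous

# For continuous features, inspect skewness and kurtosis as a rough normality check.
skew, kurt = s_cont.skew(), s_cont.kurt()
```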
- preprocess the data
  - if numerical, scale the data by subtracting the mean and dividing by the standard deviation
  - if categorical, convert to one-hot encoding (with n-1 categories)
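Both preprocessing rules above can be sketched with pandas and scikit-learn; `drop_first=True` gives the n-1 encoding:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "city": ["lisbon", "porto", "lisbon", "faro"],
})

# Numerical: mean subtraction and std division (standardization).
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Categorical: one-hot encoding with n-1 categories (first level dropped).
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
```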
- check if the data is clean
  - check for missing values
    - if a feature has missing values:
      - try multiple imputation methods (e.g., train several random forest models and compare their accuracy)
      - decide whether to delete the affected rows/features or impute the missing values
  - check for outliers
    - univariate
    - bivariate
    - multivariate
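A univariate outlier check can be sketched with the conventional 1.5×IQR rule (the multiplier is a common default, not part of this checklist):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

outliers = iqr_outliers([1, 2, 3, 4, 5, 100])
```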
- feature engineering
  - check for correlations between features
  - evaluate factor analysis
  - evaluate dimensionality reduction
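The correlation and dimensionality-reduction checks, sketched with pandas and scikit-learn on synthetic data (the features here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "x2": 2 * x + rng.normal(scale=0.1, size=100),  # nearly collinear with x
    "noise": rng.normal(size=100),
})

# Pairwise Pearson correlations between features.
corr = df.corr()

# PCA: how much variance each of the first two components explains.
explained = PCA(n_components=2).fit(df).explained_variance_ratio_
```

A near-1 correlation (as between `x` and `x2`) suggests one of the two features is redundant.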
Data modeling
- Model selection:
  - select metrics for evaluation
  - select algorithms to optimize
  - define the models' hyperparameters
  - train/optimize models using cross-validation
  - evaluate the trained models for under- or overfitting effects
  - select the best model(s)
- Model training:
  - train/optimize the best model(s) using the full training data
  - (optional) train an ensemble of models
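Step 2's model selection can be sketched by cross-validating several candidate algorithms and optionally ensembling them; the two algorithms chosen here are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Cross-validate each candidate and keep the mean accuracy.
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in candidates.items()}
best_name = max(results, key=results.get)

# Optional: an ensemble of the candidates, trained on the full training data.
ensemble = VotingClassifier(list(candidates.items())).fit(X, y)
```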
Step 3
Inference
- Apply the trained model on new data
Overall process
- data load
  - import data
- data preprocess
  - clean
  - transform
  - normalize
- data modeling
  - model selection
  - hyperparameter selection
  - model training
  - select the best model
  - model ensemble (optional)
- inference
  - evaluate the model on unseen data