```python
from imblearn.pipeline import make_pipeline
from imblearn import datasets
from imblearn.over_sampling import SMOTE

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, KFold
from sklearn.metrics import recall_score, roc_auc_score

import pandas as pd
import numpy as np
```
## Load some data

We are going to load the `thyroid_sick` dataset from imbalanced-learn's collection. This is a classification task where we are trying to determine whether someone has a bad thyroid or not.

Imbalanced-learn's datasets are typically labelled `-1` (the majority class) and `1` (the rare class we are trying to find). Let's change that to `0` and `1`, which we are more used to for binary classification.
```python
thyroid_collection = datasets.fetch_datasets()['thyroid_sick']
X = pd.DataFrame(thyroid_collection['data'])
y = thyroid_collection['target']
y[y == -1] = 0
```
What is the imbalance? We can ask what fraction of the `y` values are `1` (which is just the mean of the values in `y`):

```python
# What is the imbalance?
y.mean()
```
```
0.0612407211028632
```
Okay, about 6% of our patients have thyroid issues. Another way of thinking about this is that we have approximately 15 people with healthy thyroids for every person with a bad thyroid.

Our goal will be to get a good recall (i.e. we want our classifier to find as many positive cases as it can). We have to be aware there is a danger in using this metric: simply predicting that _everyone_ has a bad thyroid makes the recall 100%.

In practice, you might want to optimize some other metric like AUC, and then choose a decision threshold that optimizes recall once you have selected a model. Because AUC can be confusing, I am using the slightly more dangerous recall as my metric in this problem.
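For reference, here is a minimal sketch of the threshold-picking approach mentioned above. This is an editorial addition, not part of the original notebook; the function `threshold_for_recall` and the names `clf`, `X_val`, `y_val` are hypothetical, assuming a fitted classifier with a `predict_proba` method:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(clf, X_val, y_val, target_recall=0.9):
    """Largest probability threshold whose recall still meets the target."""
    probs = clf.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    # precision[:-1] and recall[:-1] line up with thresholds; recall is
    # non-increasing as the threshold grows, so take the last index that
    # still meets the target.
    ok = np.where(recall[:-1] >= target_recall)[0]
    return thresholds[ok[-1]] if len(ok) else thresholds[0]

# Usage (illustrative): classify as positive when P(y=1) exceeds the threshold.
# t = threshold_for_recall(clf, X_val, y_val)
# y_pred = (clf.predict_proba(X_test)[:, 1] >= t).astype(int)
```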
## Standardizing our splits

Let's make sure that our results are consistent as we try different methods. It is a little simpler to pass `cv=5` in all of our grid searches and cross-validations, but then we would get different splits each time.

If we use `cv=kf`, where `kf` is a `KFold` object, we ensure that we get the same splits each time. (Note that with `shuffle=False` the folds are contiguous blocks and `random_state` has no effect; recent versions of scikit-learn raise an error for this combination, in which case you can simply drop the `random_state` argument.)
```python
kf = KFold(n_splits=5, random_state=42, shuffle=False)
```
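As a quick sanity check (an illustrative addition, not in the original notebook), the same `KFold` object yields identical splits every time it is iterated, so every grid search below sees exactly the same folds:

```python
# Iterating the same KFold object twice gives the same index arrays.
first = [val_idx for _, val_idx in kf.split(X)]
second = [val_idx for _, val_idx in kf.split(X)]
assert all((a == b).all() for a, b in zip(first, second))
```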
## Standard approach, no oversampling

Let's get a baseline result by picking a random forest.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)

rf = RandomForestClassifier(n_estimators=100, random_state=13)

cross_val_score(rf, X_train, y_train, cv=kf, scoring='recall')
```
```
array([0.81081081, 0.73684211, 0.875     , 0.7037037 , 0.7804878 ])
```
These are pretty decent results, and we haven't even optimized the model.

Let's try to do some hyperparameter tuning:
```python
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 10, 12],
    'random_state': [13]
}

grid_no_up = GridSearchCV(rf, param_grid=params, cv=kf,
                          scoring='recall').fit(X_train, y_train)
```

```python
grid_no_up.best_score_
```
```
0.7803820054409211
```
```python
grid_no_up.cv_results_['mean_test_score']
```
```
array([0.37592633, 0.16071796, 0.08487883, 0.64779239, 0.65553493,
       0.62892524, 0.77319976, 0.76886478, 0.78038201, 0.77010354,
       0.77497469, 0.77784313])
```

```python
grid_no_up.best_params_
```
```
{'max_depth': 10, 'n_estimators': 200, 'random_state': 13}
```
Ok, we see that we get about 78% recall on the best of our models before we have tried oversampling. This is the number to beat.

Normally we would wait until we had finished our modeling to look at the test set, but an important part of this exercise is to see how oversampling, done incorrectly, can make us overconfident in our ability to generalize based on cross-validation. We haven't oversampled yet, so let's just check that the test score is in line with what we expect from the CV scores above (i.e. about 78%):
```python
recall_score(y_test, grid_no_up.predict(X_test))
```
```
0.8035714285714286
```

Ok, this looks like it is (roughly) consistent with the CV results.
## Oversampling (the wrong way)

Let's just oversample the training data (we are smart enough not to oversample the test data):
```python
# Note: the original cell used fit_sample, the older name for this method;
# current versions of imbalanced-learn call it fit_resample (as used later on).
X_train_upsample, y_train_upsample = SMOTE(random_state=42).fit_resample(X_train, y_train)
```
Let's check that upsampling gave us an even split of the two classes:

```python
y_train_upsample.mean()
```
```
0.5
```
Ok, let's cross-validate using a grid search. Remember that the training set has already been upsampled; the upsampling is not being done as part of the grid search:
```python
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 10, 12],
    'random_state': [13]
}

grid_naive_up = GridSearchCV(rf, param_grid=params, cv=kf,
                             scoring='recall').fit(X_train_upsample,
                                                   y_train_upsample)
grid_naive_up.best_score_
```
```
0.9843160927198451
```
This is an amazing recall! If we look at the validation scores, they _all_ look pretty good:

```python
grid_naive_up.cv_results_['mean_test_score']
```
```
array([0.93360792, 0.9345499 , 0.93337591, 0.94714925, 0.94736138,
       0.94273667, 0.97585677, 0.98218414, 0.97864618, 0.98237253,
       0.98187974, 0.98431609])
```
Here is the model that made these results:

```python
grid_naive_up.best_params_
```
```
{'max_depth': 12, 'n_estimators': 200, 'random_state': 13}
```
Ok, let's look at how it does on the training set as a whole (once we eliminate the upsampling):

```python
recall_score(y_train, grid_naive_up.predict(X_train))
```
```
1.0
```
Ok, what about the test set?

```python
# But wait ... uh-oh, SpaghettiOs!
recall_score(y_test, grid_naive_up.predict(X_test))
```
```
0.9107142857142857
```
Ok, time for some good news/bad news:
- good: the recall on the test set is 91%, better than the 80% we got without upsampling
- bad: our confidence in the cross-validation results went down. With no upsampling, the validation recall was 78%, which was a good estimate of the test recall of 80%. With upsampling, the validation recall was 98%, which isn't a good measure of the test recall (91%)
## Let's make SMOTE-ing part of our cross validation!

The issue is that we oversampled _before_ splitting into cross-validation folds.

You can imagine the simplest method of oversampling (namely, copying the data point). Let's say every data point is copied 6 times before we cross-validate. It isn't hard to see that in 3-fold validation, a typical split leaves (on average) 2 copies of each point in each fold. If the classifier memorizes the training data, it can score perfectly on the validation fold, because the validation fold contains no genuinely new data points!

Instead, we should split into training and validation folds first. Then, for each fold, we should
1. Oversample the minority class using only the training folds
2. Train the classifier on the (upsampled) training folds
3. Validate the classifier on the remaining fold, left untouched

Let's see this in detail.
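But first, a small illustration (an editorial sketch, not in the original notebook) of why naive duplication plus cross-validation leaks: almost every validation point also appears, as a copy, in the training folds.

```python
# Copy every point 6 times, then 3-fold split: the validation fold is
# almost entirely made of points the model has already seen in training.
import numpy as np
from sklearn.model_selection import KFold

points = np.arange(10)
duplicated = np.repeat(points, 6)  # each point copied 6 times
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(duplicated):
    leaked = set(duplicated[val_idx]) & set(duplicated[train_idx])
    print(f"{len(leaked)} of {len(set(duplicated[val_idx]))} validation values also appear in training")
```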
```python
example_params = {
    'n_estimators': 100,
    'max_depth': 5,
    'random_state': 13
}

def score_model(model, params, cv=None):
    """Fit `model` with SMOTE applied inside each CV fold; return per-fold recall.

    Note: this closes over the global X_train / y_train defined above.
    """
    if cv is None:
        cv = KFold(n_splits=5)  # (random_state is meaningless without shuffle=True)

    smoter = SMOTE(random_state=42)

    scores = []
    for train_fold_index, val_fold_index in cv.split(X_train, y_train):
        # Split into the training folds and the held-out validation fold.
        X_train_fold, y_train_fold = X_train.iloc[train_fold_index], y_train[train_fold_index]
        X_val_fold, y_val_fold = X_train.iloc[val_fold_index], y_train[val_fold_index]

        # Upsample ONLY the training folds; the validation fold stays untouched.
        X_train_fold_upsample, y_train_fold_upsample = smoter.fit_resample(X_train_fold,
                                                                           y_train_fold)
        model_obj = model(**params).fit(X_train_fold_upsample, y_train_fold_upsample)
        score = recall_score(y_val_fold, model_obj.predict(X_val_fold))
        scores.append(score)
    return np.array(scores)
```
```python
score_model(RandomForestClassifier, example_params, cv=kf)
```
```
array([0.78378378, 0.76315789, 0.96875   , 0.81481481, 0.90243902])
```
Janky grid search, reusing the same parameter grid as before:

```python
params
```
```
{'n_estimators': [50, 100, 200],
 'max_depth': [4, 6, 10, 12],
 'random_state': [13]}
```
```python
score_tracker = []
for n_estimators in params['n_estimators']:
    for max_depth in params['max_depth']:
        example_params = {
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'random_state': 13
        }
        example_params['recall'] = score_model(RandomForestClassifier,
                                               example_params, cv=kf).mean()
        score_tracker.append(example_params)
```
```python
sorted(score_tracker, key=lambda x: x['recall'], reverse=True)
```
```
[{'n_estimators': 50, 'max_depth': 4, 'random_state': 13, 'recall': 0.8486884268736002},
 {'n_estimators': 100, 'max_depth': 6, 'random_state': 13, 'recall': 0.8291411035554168},
 {'n_estimators': 50, 'max_depth': 12, 'random_state': 13, 'recall': 0.8283643801053943},
 {'n_estimators': 200, 'max_depth': 12, 'random_state': 13, 'recall': 0.8254883333269085},
 {'n_estimators': 50, 'max_depth': 10, 'random_state': 13, 'recall': 0.825245471723328},
 {'n_estimators': 200, 'max_depth': 6, 'random_state': 13, 'recall': 0.82387794566068},
 {'n_estimators': 100, 'max_depth': 12, 'random_state': 13, 'recall': 0.8224316180750713},
 {'n_estimators': 50, 'max_depth': 6, 'random_state': 13, 'recall': 0.8196299476626819},
 {'n_estimators': 100, 'max_depth': 4, 'random_state': 13, 'recall': 0.8172428365464309},
 {'n_estimators': 200, 'max_depth': 4, 'random_state': 13, 'recall': 0.810992836546431},
 {'n_estimators': 100, 'max_depth': 10, 'random_state': 13, 'recall': 0.8096988596426978},
 {'n_estimators': 200, 'max_depth': 10, 'random_state': 13, 'recall': 0.8095566121320295}]
```
The best estimator appears to have the parameters
```
{'n_estimators': 50,
 'max_depth': 4,
 'random_state': 13}
```
and a validation recall of about 85%. Let's see how this compares with the test score:
```python
rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=13)
rf.fit(X_train_upsample, y_train_upsample)
recall_score(y_test, rf.predict(X_test))
```
```
0.8392857142857143
```
Note that this is roughly consistent (84% vs. 85%).
## Let's use the imbalanced-learn pipeline

The imbalanced-learn package extends scikit-learn's built-in pipelines. Specifically, with scikit-learn you can import
```python
from sklearn.pipeline import Pipeline, make_pipeline
```
which allows you to chain multiple steps at once. It is also nice that if you _fit_ the pipeline, all the steps (such as scaling, and the model) are fit at once, while if you _predict_ with the pipeline, preprocessing steps such as scaling only _transform_ the data.

There is a restriction, though. It comes partially from the naming of functions (e.g. `transform` vs `resample`), but one way of thinking of it is that scikit-learn's pipeline only allows each row to be transformed into another row (perhaps with different or added features). To upsample, we need to _increase_ the number of rows. Imbalanced-learn generalizes the pipeline to allow resampling steps, while keeping the syntax and function names the same:
```python
from imblearn.pipeline import Pipeline, make_pipeline
```
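To make the restriction concrete, here is a minimal sketch (an editorial addition, not from the original notebook) showing that scikit-learn's own pipeline refuses a sampler step, while imbalanced-learn's accepts it:

```python
# sklearn's Pipeline requires intermediate steps to implement fit/transform;
# SMOTE only implements fit_resample, so step validation fails when fitting.
from sklearn.pipeline import Pipeline as SklearnPipeline

try:
    SklearnPipeline([('smote', SMOTE()),
                     ('rf', RandomForestClassifier())]).fit(X_train, y_train)
except TypeError as err:
    print(err)  # complains that intermediate steps must be transformers
```

Now let's see the imbalanced-learn version in action: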
```python
imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))

imba_pipeline
```
```
Pipeline(memory=None,
         steps=[('smote', SMOTE(k_neighbors=5, random_state=42, sampling_strategy='auto', ...)),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=100, random_state=13, ...))])
```
```python
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
```
```
array([0.75675676, 0.78947368, 0.90625   , 0.77777778, 0.7804878 ])
```
```python
# Prefix the hyperparameter names with the name of the pipeline step they belong to.
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True)
```
```python
grid_imba.fit(X_train, y_train)
```
```
GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=False),
             estimator=Pipeline(steps=[('smote', SMOTE(...)),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(...))]),
             param_grid={'randomforestclassifier__n_estimators': [50, 100, 200],
                         'randomforestclassifier__max_depth': [4, 6, 10, 12],
                         'randomforestclassifier__random_state': [13]},
             refit=True, return_train_score=True, scoring='recall')
```
```python
grid_imba.cv_results_['mean_test_score'], grid_imba.cv_results_['mean_train_score']
```
```
(array([0.84867805, 0.81722134, 0.81096913, 0.81961792, 0.82913244,
        0.82386742, 0.82525267, 0.80970919, 0.80956689, 0.82837268,
        0.8224292 , 0.82550424]),
 array([0.90125477, 0.8855561 , 0.87690479, 0.93433132, 0.92447245,
        0.91290858, 0.99573952, 0.99584802, 0.99584802, 0.99855072,
        0.99855072, 0.99855072]))
```
```python
grid_imba.best_score_
```
```
0.8486780485230826
```

```python
grid_imba.best_params_
```
```
{'randomforestclassifier__max_depth': 4,
 'randomforestclassifier__n_estimators': 50,
 'randomforestclassifier__random_state': 13}
```
```python
y_test_predict = grid_imba.best_estimator_.predict(X_test)
recall_score(y_test, y_test_predict)
```
```
0.8392857142857143
```
Note that we are getting the same result we would get if we bypassed the SMOTE step altogether. When making a prediction, the SMOTE step does nothing:

```python
y_test_predict = grid_imba.best_estimator_.named_steps['randomforestclassifier'].predict(X_test)
recall_score(y_test, y_test_predict)
```
```
0.8392857142857143
```
## What if I just trained a model with the "best" params?

```python
snip_len = len('randomforestclassifier__')
best_params = {key[snip_len:]: grid_imba.best_params_[key] for key in grid_imba.best_params_}
best_params
```
```
{'max_depth': 4, 'n_estimators': 50, 'random_state': 13}
```
```python
rf = RandomForestClassifier(**best_params).fit(X_train_upsample, y_train_upsample)
recall_score(y_test, rf.predict(X_test))
```
```
0.8392857142857143
```
---

Thanks for a great guide. Quick question: in `cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)`, is it correct that we can use `X` and `y` instead of `X_train` and `y_train`, since `imba_pipeline` will take care of not doing the oversampling on the test data?
---

Thanks for the great guide! Is there a reason for not using `StratifiedKFold`?
---

@Vini-002 `StratifiedKFold` doesn't solve the imbalance problem. What it does do is make sure that each of the folds is equally imbalanced: if the thyroid cases are only 10% of the population, each fold will contain about 10% thyroid cases. This will still lead some ML classifiers to just predict the majority class.

There is no reason you could not use stratified sampling in addition to one of the techniques used here.
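(Editor's note: a minimal sketch of combining the two, assuming the `imba_pipeline` and training data defined in the notebook above; this code is not part of the original thread.)

```python
# Stratified folds keep the class ratio roughly constant across folds, and can
# be passed anywhere a KFold object is accepted, including with the SMOTE pipeline.
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=skf)
```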
---

I understand that it doesn't solve the imbalance, but I thought it could be bad not to use `StratifiedKFold`, as you could end up with a fold with no samples of the minority class.
---

That is absolutely fair. I probably wouldn't add it into the main article (I think it is useful to focus on one problem), but I can definitely see a section at the bottom for "going further" or "future improvements".

I'll take the advice on board and add a section. Thank you!!
---

Thanks for the best imbalanced-data Jupyter notebook that I've seen so far.

My question: in the last cell [36] you got 0.8392857142857143, the exact same value as in cell [34] (`recall_score(y_test, y_test_predict)`). Was this a coincidence or not, given that the model used in cell [34] was trained on the CV folds?

Thanks,
Paulo Praça