MNIST.ipynb
{
"cells": [
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nimport pandas as pd\nfrom sklearn.datasets import fetch_mldata\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport matplotlib.style as style\nstyle.use('bmh')\n%matplotlib inline\n\npd.options.display.max_rows = 14\n\nfrom IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactivity = \"all\"",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Load and explore the dataset"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mnist = fetch_mldata('MNIST original')\n# What's the structure of the object returned by sklearn?\nmnist",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* `DESCR` should contain more information about the dataset, but unfortunately it often doesn't.\n* You should find out more about the dataset on your own.\n* Features are available in `mnist.data`, labels in `mnist.target`."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mnist.DESCR",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X = mnist['data']\ny = mnist['target']\n\nX.shape, y.shape\nnp.sqrt(784)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* There are 70,000 observations and 784 features/columns. Each observation is a 28x28-pixel image flattened into 784 values.\n* To train, use the 784 features as they are.\n* To display an observation as an image, convert it (e.g. X[432] or X[766]) to 28x28 form using `reshape` (e.g. X[22].reshape(28, 28))."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "d = X[34911]\nl = y[34911]\n\na = plt.imshow(d.reshape(28, 28), cmap = matplotlib.cm.binary)\na = plt.axis('off')\n\n# what's the label for the above observation?\n# l",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Split the set: first 60k observations into training set, remaining 10k observations into test set\nX_train, y_train = X[:60000], y[:60000]\nX_test, y_test = X[60000:], y[60000:]\n\n# Shuffle the training set.\n# First compute indices in random order so the same order can be applied to both X_train and y_train;\n# this wouldn't be needed if features and labels were stored in a single dataset.\nshuffle_index = np.random.permutation(60000)\nX_train, y_train = X_train[shuffle_index], y_train[shuffle_index]",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Binary Classifier: Is the digit 5 or not?"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Let us create a simple binary classifier that classifies digits 0-9 into two categories: `5` or `not 5` (hence the name)\nfrom sklearn.linear_model import SGDClassifier\n\n# wherever the label is 5, set that label to `True`\n# wherever the label is 0,1,2,3,4,6,7,8,9, set that label to `False`\ny_train_5 = (y_train == 5)\ny_test_5 = (y_test == 5)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "sgd_clf = SGDClassifier(random_state=42, max_iter=5)\nx = sgd_clf.fit(X_train, y_train_5) # train!",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# predict on a sample observation and check with its label\nsgd_clf.predict([d])\nl\n# Exercise: try out predict() on a few other training observations",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Performance Measures"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "### Calculate accuracy of classification using k-fold cross-validation\nfrom sklearn.model_selection import cross_val_score\ncross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* Is ~94-96% accuracy good? Let us see the accuracy of one of the worst classifiers possible.\n* What's the accuracy of a classifier that classifies every digit as not being a 5?\n\n* The following classifier does no training at all. Whenever `predict` is called, it just returns one 0 (False) per row of X, with shape (len(X), 1). A tiny illustration of that output follows below.\n* Do you understand how the output shape is being specified here and why?\n* If you forgot how the required shape can be passed to `np.zeros`: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.zeros.html"
},
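{
"metadata": {},
"cell_type": "markdown",
"source": "An added quick illustration (not part of the original notebook) of what `np.zeros((n, 1), dtype=bool)` returns; this is exactly the kind of output the dummy classifier defined below produces: one `False` per observation."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# A tiny example (3 fake observations) of the array the Never5Classifier below will return\nnp.zeros((3, 1), dtype=bool)",
"execution_count": null,
"outputs": []
},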
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.base import BaseEstimator\nclass Never5Classifier(BaseEstimator):\n    def fit(self, X, y=None):\n        pass\n    def predict(self, X):\n        return np.zeros((len(X), 1), dtype=bool)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "### Calculate accuracy for this dumb classifier\ncross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring='accuracy')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* That's ~90-91% accuracy on this dataset when you always predict `not 5`, with the learning part skipped entirely!\n* Our previous classifier was only ~4% more accurate than one of the worst possible classifiers.\n* Before we think of better classifier models, let us first learn better ways to measure performance.\n\n### A. Confusion Matrix\n\n* A confusion matrix tells us how many times category A was incorrectly classified as category B, C, D, etc., and similarly for B, C, D, etc.\n* E.g. in a 10-class (digits 0-9) confusion matrix, the 4th row and 6th column tells us how many times the classifier incorrectly classified images of 3 as 5.\n* To create a confusion matrix we need the actual labels and the predicted labels. Use the `cross_val_predict` function here to get predictions instead of scores.\n\n### cross_val_predict vs. cross_val_score & predict\n* Exercise: Repeat the following with Never5Classifier if you want to, even though you already know what that confusion matrix would look like.\n* The `predict` function gives predictions from a model that has already been trained on the corresponding labels.\n* Predictions obtained via `cross_val_predict` are generated using cross-validation.\n* This means each prediction comes from a model that never saw that observation during training.\n* How is this possible? A sketch of the idea follows in the next cell."
},
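{
"metadata": {},
"cell_type": "markdown",
"source": "The next cell is an added sketch (not part of the original notebook) of roughly what `cross_val_predict` does internally with `cv=3`: split the training set into 3 folds, and for each fold train a fresh clone of the model on the other two folds and predict only the held-out fold. Every prediction therefore comes from a model that never saw that observation during training."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# A rough manual equivalent of cross_val_predict(sgd_clf, X_train, y_train_5, cv=3) -- a sketch, not sklearn's exact implementation\nfrom sklearn.model_selection import StratifiedKFold\nfrom sklearn.base import clone\nfrom sklearn.metrics import confusion_matrix\n\nskfolds = StratifiedKFold(n_splits=3)\ny_oof = np.zeros(len(X_train), dtype=bool)  # out-of-fold predictions\n\nfor train_idx, test_idx in skfolds.split(X_train, y_train_5):\n    clf = clone(sgd_clf)                                # fresh, untrained copy of the model\n    clf.fit(X_train[train_idx], y_train_5[train_idx])   # train on the other two folds\n    y_oof[test_idx] = clf.predict(X_train[test_idx])    # predict only the held-out fold\n\nconfusion_matrix(y_train_5, y_oof)  # should be close to the matrix computed with cross_val_predict below",
"execution_count": null,
"outputs": []
},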
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.model_selection import cross_val_predict\nfrom sklearn.metrics import confusion_matrix\n\ny_ps = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)\nconfusion_matrix(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "| | not 5 (predicted) | 5 (predicted) |\n|:-:|:-:|:-:|\n| not 5 (actual) | TN | FP |\n| 5 (actual) | FN | TP |\n\n* How to interpret the above result?\n  * Rows = actual class & columns = predicted class\n* This means:\n  * ~50k of the digits that are not 5 were correctly classified as `not 5`. These are called _true negatives_ (TN).\n  * ~1k of the digits that are not 5 were incorrectly classified as `5`. These are called _false positives_ (FP).\n  * ~1.5k of the digits that are 5 were incorrectly classified as `not 5`. These are called _false negatives_ (FN).\n  * ~4k of the digits that are 5 were correctly classified as `5`. These are called _true positives_ (TP)."
},
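{
"metadata": {},
"cell_type": "markdown",
"source": "An added note (not part of the original notebook): for a binary problem `confusion_matrix` returns a 2x2 array laid out as [[TN, FP], [FN, TP]], so the four counts behind the bullets above can be unpacked directly."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Unpack the four counts from the 2x2 confusion matrix (row-major order: TN, FP, FN, TP)\ntn, fp, fn, tp = confusion_matrix(y_train_5, y_ps).ravel()\ntn, fp, fn, tp",
"execution_count": null,
"outputs": []
},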
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Confusion matrix of a perfect classifier: the predictions are identical to the labels\nconfusion_matrix(y_train_5, y_train_5)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### B. Precision & Recall\n\n$$ Precision = \\frac{TP}{TP + FP} $$\n\n$$ Recall = \\frac{TP}{TP + FN} $$\n\nWhere:\n\nTP = True positives, FP = False positives, FN = False negatives"
},
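{
"metadata": {},
"cell_type": "markdown",
"source": "An added sanity check (not part of the original notebook): computing precision and recall straight from the formulas above, using the confusion-matrix counts; the values should match `precision_score` and `recall_score` in the next cell."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Precision and recall computed by hand from the confusion-matrix counts\ntn, fp, fn, tp = confusion_matrix(y_train_5, y_ps).ravel()\nprecision_manual = tp / (tp + fp)\nrecall_manual = tp / (tp + fn)\nprecision_manual, recall_manual",
"execution_count": null,
"outputs": []
},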
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import precision_score, recall_score\nprecision_score(y_train_5, y_ps)\nrecall_score(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### What do these values mean?\n\n* When our SGDClassifier-based model predicts a `5`, it is correct ~80% of the time (that is its precision).\n* It detects ~73% of the actual 5s (that is its recall).\n* Precision and recall can be combined into a single measure called the F1 score:\n\n### C. F1 Score\n\n$$ F_1 = \\frac{TP}{TP + \\frac{FN+FP}{2}} $$\n\n* F1 favours classifiers that have similar precision and recall scores. Sometimes you want a classifier with high precision or high recall instead; in those cases you still use the individual precision and recall scores as performance measures. Can you think of examples?"
},
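{
"metadata": {},
"cell_type": "markdown",
"source": "Another added sanity check (not part of the original notebook): the F1 formula above computed by hand; it is equivalent to the harmonic mean of precision and recall and should match `f1_score` in the next cell."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# F1 computed by hand from the confusion-matrix counts\ntn, fp, fn, tp = confusion_matrix(y_train_5, y_ps).ravel()\nf1_manual = tp / (tp + (fn + fp) / 2)\nprecision, recall = tp / (tp + fp), tp / (tp + fn)\nf1_manual, 2 * precision * recall / (precision + recall)  # the two forms should agree",
"execution_count": null,
"outputs": []
},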
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import f1_score\nf1_score(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "How do we get high precision as well as high recall? For a given model we can't have both at once: raising the decision threshold increases precision but decreases recall, and lowering it does the opposite.\n\n### Precision/Recall trade-off\n\nHow does SGDClassifier decide?\n* For each observation, it computes a score using a _decision function_: score > threshold ? positive class : negative class.\n* sklearn does not let us set this threshold directly, but it does expose the scores via `decision_function`, so we can apply our own threshold."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "y_scores = sgd_clf.decision_function([d])\nthreshold = 0\ny_scores > threshold",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "threshold = 200000\ny_scores > threshold",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Raising the threshold generally increases precision but decreases recall.\n\nDigits ordered by their decision score (lowest to highest), with three candidate threshold positions [1], [2] and [3]:\n\n8 <===> 7 <===> 3 <===> 9 [1] <===> 5 <===> 2 <===> 5 [2] <===> 5 <===> 6 <===> 5 [3] <===> 5 <===> 5\n\n(precision, recall) at each position, counting only the digits to its right as predicted `5`:\n\n* [1] (75%, 100%)\n* [2] (80%, 67%)\n* [3] (100%, 33%)\n\n### PR Curve and Threshold values\nHow do we pick a threshold that satisfies our precision/recall requirements? Plot precision and recall against the threshold, and precision against recall (the PR curve); a short sketch after the two plots below shows how to read a concrete threshold off these arrays."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import precision_recall_curve\n\ndef plot_pr_v_t(precisions, recalls, thresholds):\n    plt.figure(figsize=(14,8))\n    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')\n    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')\n    plt.xlabel('Threshold')\n    plt.legend(loc='center left')\n    plt.ylim(0, 1)\n\ny_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')\nprecisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)\n#precisions.shape\n#recalls.shape\n#thresholds.shape\nplot_pr_v_t(precisions, recalls, thresholds)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Precision vs Recall Curve or PR Curve\ndef plot_p_v_r(precisions, recalls):\n    plt.figure(figsize=(14,8))\n    plt.plot(recalls, precisions, 'b--')\n    plt.xlabel('Recall')\n    plt.ylabel('Precision')\n    plt.xlim(0, 1)\n    plt.ylim(0, 1)\n\nplot_p_v_r(precisions, recalls)",
"execution_count": null,
"outputs": []
},
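{
"metadata": {},
"cell_type": "markdown",
"source": "An added sketch (not part of the original notebook) of one way to read a threshold off these arrays: find the lowest threshold that reaches a target precision (90% here, an arbitrary illustrative target) and check the recall it leaves us with."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Lowest threshold with >= 90% precision (assumes the target is actually reached somewhere on the curve)\n# precisions and recalls have one more element than thresholds, hence the [:-1]\nidx = np.argmax(precisions[:-1] >= 0.90)\nthreshold_90 = thresholds[idx]\nthreshold_90, precisions[idx], recalls[idx]",
"execution_count": null,
"outputs": []
},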
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# how to use a custom threshold value for predictions?\ny_pred_with_custom_threshold = y_scores > 130000\nprecision_score(y_train_5, y_pred_with_custom_threshold)\nrecall_score(y_train_5, y_pred_with_custom_threshold)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### ROC AUC\n* Another way to measure the performance of a classifier: ROC AUC.\n* ROC = Receiver Operating Characteristic, AUC = Area Under the Curve.\n* The ROC curve plots the _true positive rate_ (i.e. recall) against the _false positive rate_ at every possible threshold."
},
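{
"metadata": {},
"cell_type": "markdown",
"source": "An added sketch (not part of the original notebook): the TPR and FPR computed directly from their definitions, using the out-of-fold predictions `y_ps` from earlier (i.e. at the default threshold of 0)."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# TPR (recall) = TP / (TP + FN); FPR = FP / (FP + TN)\ntn, fp, fn, tp = confusion_matrix(y_train_5, y_ps).ravel()\ntpr_at_default = tp / (tp + fn)\nfpr_at_default = fp / (fp + tn)\ntpr_at_default, fpr_at_default",
"execution_count": null,
"outputs": []
},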
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import roc_curve\n\ndef plot_roc_curve(fpr, tpr, label=None):\n    plt.figure(figsize=(14,8))\n    plt.plot(fpr, tpr, linewidth=2, label=label)\n    plt.axis([-0.01, 1, -0.01, 1])\n    plt.xlabel('False Positive Rate (FPR)')\n    plt.ylabel('True Positive Rate (TPR)')\n\nfpr, tpr, thresholds = roc_curve(y_train_5, y_scores)\nplot_roc_curve(fpr, tpr)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* As recall/TPR increases, the false positive rate (FPR) also increases.\n* A good classifier's curve stays close to the top-left corner, i.e. the area above the curve is small.\n* 1 - AUC gives us that area above the curve, which we want to minimise."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import roc_auc_score\n1 - roc_auc_score(y_train_5, y_scores)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Should we use the ROC curve or the PR curve/threshold to fine-tune the classifier?\n* Prefer the PR curve whenever the positive class is rare, or when you care more about false positives than false negatives.\n* Use the ROC curve otherwise.\n\nNext up: train another model (e.g. RandomForestClassifier) and check its ROC AUC score.\n\n## RandomForestClassifier"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.ensemble import RandomForestClassifier\n\nforest_clf = RandomForestClassifier(random_state=42)\n# RFC's `predict_proba` returns probabilities that the given observation belongs to a given class\ny_probas = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method='predict_proba')\n# convert probabilities to scores\ny_scores_fr = y_probas[:, 1] # probability of positive class\nfpr_fr, tpr_fr, thr_fr = roc_curve(y_train_5, y_scores_fr)\nplot_roc_curve(fpr, tpr, label='SGDClassifier')\nx = plt.plot(fpr_fr, tpr_fr, 'b:', label='RandomForestClassifier')\nx = plt.legend(loc='lower right')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "roc_auc_score(y_train_5, y_scores_fr)\ny_fr_ps = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)\nprecision_score(y_train_5, y_fr_ps)\nrecall_score(y_train_5, y_fr_ps)\n\n# Exercise: Summarise the steps to train a binary classifier",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "MNIST.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
} |