{
"cells": [
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nimport pandas as pd\nfrom sklearn.datasets import fetch_mldata\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport matplotlib.style as style\nstyle.use('bmh')\n%matplotlib inline\n\npd.options.display.max_rows = 14\n\nfrom IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactivity = \"all\"",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Load and explore the dataset"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mnist = fetch_mldata('MNIST original')\n# What's the structure of the object returned by sklearn?\nmnist",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* DESCR should contain more information about the dataset but unfortunately it most often doesn't\n* You should find out more information about the dataset on your own\n* features are available in `mnist.data`, labels are available in `mnist.target`"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mnist.DESCR",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X = mnist['data']\ny = mnist['target']\n\nX.shape, y.shape\nnp.sqrt(784)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* There are 70,000 observations. 784 features/columns. Each observation has the image data in the form of 28x28 pixels per image. \n* To train, use 784 features as it is. \n* To print in image form, convert each observation(e.g. X[432] or X[766]) to 28x28 form using function `reshape`(e.g. X[22].reshape(28, 28))"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "d = X[34911]\nl = y[34911]\n\na = plt.imshow(d.reshape(28, 28), cmap = matplotlib.cm.binary)\na = plt.axis('off')\n\n# what's the label for the above observation?\n# l",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Split the set: first 60k observations into training set, remaining 10k observations into test set\nX_train, y_train = X[:60000], y[:60000]\nX_test, y_test = X[60000:], y[60000:]\n\n# Shuffle the training set. \n# First compute indices in random order so it can be used on both X_train and y_train; may not be needed if both are in the same dataset\nshuffle_index = np.random.permutation(60000)\nX_train, y_train = X_train[shuffle_index], y_train[shuffle_index]",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Binary Classifier: Is the digit 5 or not?"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Let us create a simple binary classifier that can classify digits 0-9 as two categories: `5` or `not 5`(hence the name)\nfrom sklearn.linear_model import SGDClassifier\n\n# whereever label is 5, set that label to `True` instead\n# wherever the label is 0,1,2,3,4,6,7,8,9, set that label to `False` instead\ny_train_5 = (y_train == 5) \ny_test_5 = (y_test == 5) ",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "sgd_clf = SGDClassifier(random_state=42, max_iter=5)\nx = sgd_clf.fit(X_train, y_train_5) # train!",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# predict on a sample observation and check with its label\nsgd_clf.predict([d])\nl\n# Exercise: try out predict() on a few other training observations",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Performance Measures"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "### Calculate accuracy of classification using k-fold cross-validation\nfrom sklearn.model_selection import cross_val_score\ncross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* Is ~94-96% accuracy good? Let us see the accuracy of one of the worst classifiers possible.\n* What's the accuracy of the classifier that classifies every number as not being 5\n\n* The following classifier does no training at all. Whenever predict is called, it just returns all 0s in the shape of X.\n* Do you understand how reshaping is being done here and why?\n* If you forgot how required shape can be passed to `np.zeros`: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.zeros.html"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.base import BaseEstimator\nclass Never5Classifier(BaseEstimator):\n def fit(self, X, y=None):\n pass\n def predict(self, X):\n return np.zeros((len(X), 1), dtype=bool)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "### Calculate accuracy for this dumb classifier\ncross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring='accuracy')",
"execution_count": null,
"outputs": []
},
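{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sanity check (a minimal sketch using the existing `y_train_5` labels): the accuracy above is essentially the fraction of non-5s in the training set, which is exactly what you get by always predicting `not 5`."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Fraction of training observations that are not a 5 -- the accuracy of always predicting `not 5`\n(~y_train_5).mean()",
"execution_count": null,
"outputs": []
},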
{
"metadata": {},
"cell_type": "markdown",
"source": "* That's ~90-91% accuracy for this dataset when you always predict `not 5` as the result where the learning part is entirely skipped! \n* Our previous classifier was only ~4% more accurate than one of the worst possible classifiers. \n* Before we think of better classifier models, let us first learn better ways to measure performance \n\n### A. Confusion Matrix\n\n* Confusion matrix tells us the number of times category A was classified incorrectly as category B, C, D, etc. and similarly for B, C, D, etc.\n* E.g. 3rd row and 4th column in the confusion matrix tells us how many times the classifier incorrectyl classified the images of 3 with images of 5.\n* To create a confusion matrix we need actual labels and predicted labels. use `cross_val_predict` function in this case to get predictions instead of the scores.\n\n### cross_val_predict vs. cross_val_score & predict\n* Exercise: Repeat the following with Never5Classifier if you want to even though you know already what that confusion matrix would look like\n* `predict` function gives a prediction after the corresponding label has been seen and trained on already\n* Predictions obtained via `cross_val_predict` are generated using cross validation technique\n* This means that the predictions were generated without looking at the training labels\n* How is this possible?"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.model_selection import cross_val_predict\nfrom sklearn.metrics import confusion_matrix\n\ny_ps = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3) \nconfusion_matrix(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "| | not 5 | 5 |\n|:-:|:-:|:-:|\n| not 5 | | |\n | 5 | |       |\n\n* How to interpret the above result?\n * Rows = actual class & columns = predicted class\n* This means:\n * ~50k of the digits that are not digit 5 were correctly classified as `not 5`. These are called as _true negatives_.\n * ~1k of the digits that are not digit 5 were incorrectly classified as `5`. These are called _false positives_. \n * ~1.5k of the digits that are digit 5 were incorrectly classified as `not 5`. These are called _false negatives_.\n * ~4k of the digits that are digit 5 were correctly classified as `5`. These are called _true positives_."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "confusion_matrix(y_train_5, y_train_5)",
"execution_count": null,
"outputs": []
},
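{
"metadata": {},
"cell_type": "markdown",
"source": "To make the four terms above concrete (a minimal sketch; the variable names `tn`, `fp`, `fn`, `tp` are just illustrative), the counts can be unpacked directly from the 2x2 matrix: for binary labels sklearn lays it out as [[TN, FP], [FN, TP]]."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Unpack the 2x2 confusion matrix into its four counts: [[TN, FP], [FN, TP]]\ntn, fp, fn, tp = confusion_matrix(y_train_5, y_ps).ravel()\ntn, fp, fn, tp",
"execution_count": null,
"outputs": []
},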
{
"metadata": {},
"cell_type": "markdown",
"source": "### B. Precision & Recall\n\n$$ Precision = \\frac{TP}{TP + FP} $$\n\n$ $\n\n$$ Recall = \\frac{TP}{TP + FN} $$\n\n$ $\nWhere: \n\n$ $\nTP = True positives, FP = False positives, FN = False negatives"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import precision_score, recall_score\nprecision_score(y_train_5, y_ps)\nrecall_score(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
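{
"metadata": {},
"cell_type": "markdown",
"source": "The same two numbers can be recomputed by hand from the confusion-matrix counts to confirm the formulas above (a small sketch, reusing the `tp`, `fp`, `fn` values unpacked earlier)."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Precision = TP / (TP + FP), Recall = TP / (TP + FN) -- should match sklearn's values above\ntp / (tp + fp)\ntp / (tp + fn)",
"execution_count": null,
"outputs": []
},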
{
"metadata": {},
"cell_type": "markdown",
"source": "### What do these values mean?\n\n* When our SGDClassifier based model predicts a `5`, the accuracy is ~80%.\n* It detects ~73% of 5s.\n* Precision and recall measures can be combined into one measure called an F1 score:\n\n### C. F1 Score\n$ $\n$$ F_1 = \\frac{TP}{TP + \\frac{FN+FP}{2}} $$\n\n$ $\n* F1 prefers classifiers that have similar Precision and Recall scores. Sometimes you want classifiers that have high precision or high recall;\n In those cases you still use precision and recall scores as performance measures. Give examples."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import f1_score\nf1_score(y_train_5, y_ps)",
"execution_count": null,
"outputs": []
},
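{
"metadata": {},
"cell_type": "markdown",
"source": "As a cross-check, the F1 value can also be computed straight from the counts using the form of the formula above (a sketch, reusing `tp`, `fp`, `fn` from earlier)."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# F1 = TP / (TP + (FN + FP) / 2), the harmonic mean of precision and recall\ntp / (tp + (fn + fp) / 2)",
"execution_count": null,
"outputs": []
},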
{
"metadata": {},
"cell_type": "markdown",
"source": "How to get high precision as well as high recall? That's not possible because as one increases the other one decreases and vice-versa.\n### Precision/Recall trade-off\n\nHow does SGDClassifier work?\n* For each observation, it computes a score based on a _decision function_. score > threshold ? class A : class B\n* sklearn has no function that let's us control/modify the threshold value used in the decision function."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "y_scores = sgd_clf.decision_function([d])\nthreshold = 0\ny_scores > threshold",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "threshold = 200000\ny_scores > threshold",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Raising the threshold decreases recall. \n\n8 <===> 7 <===> 3 <===> 9 [1] <===> 5 <===> 2 <===> 5 [2] <====> 5 <===> 6 <====> 5 [3] <===> 5 <===> 5 \n\n[1] (75%, 100%) \n[2] (80%, 67%) \n[3] (100%, 50%)\n\n### PR Curve and Threshold values\nHow to pick the threshold that satisfies our precision/recall requirements? Look at the Precision/Recall curve and Precision vs. Recall curve."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import precision_recall_curve\n\ndef plot_pr_v_t(precisions, recalls, thresholds):\n plt.figure(figsize=(14,8))\n plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')\n plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')\n plt.xlabel('Threshold')\n plt.legend(loc='center left')\n plt.ylim(0, 1)\n \ny_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')\nprecisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)\n#precisions.shape\n#recalls.shape\n#thresholds.shape\nplot_pr_v_t(precisions, recalls, thresholds)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Precision vs Recall Curve or PR Curve\ndef plot_p_v_r(precisions, recalls):\n plt.figure(figsize=(14,8))\n plt.plot(recalls, precisions, 'b--')\n plt.xlabel('Recall')\n plt.ylabel('Precision')\n plt.xlim(0, 1)\n plt.ylim(0, 1)\n \nplot_p_v_r(precisions, recalls)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# how to use custom threshold value for predictions? \ny_pred_with_custom_threshold = y_scores > 130000\nprecision_score(y_train_5, y_pred_with_custom_threshold)\nrecall_score(y_train_5, y_pred_with_custom_threshold)",
"execution_count": null,
"outputs": []
},
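{
"metadata": {},
"cell_type": "markdown",
"source": "Instead of eyeballing a threshold from the plot, it can also be picked programmatically from the curve data (a sketch; the 90% precision target below is arbitrary)."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Lowest threshold that achieves at least 90% precision\n# (precisions has one more entry than thresholds, so drop its last value when indexing)\nthreshold_90 = thresholds[np.argmax(precisions[:-1] >= 0.90)]\nthreshold_90\nprecision_score(y_train_5, y_scores > threshold_90)\nrecall_score(y_train_5, y_scores > threshold_90)",
"execution_count": null,
"outputs": []
},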
{
"metadata": {},
"cell_type": "markdown",
"source": "### ROC AUC\n* Another way to tune the performance of the classifier: ROC AUC\n* ROC = Receiver operating characteristic and AUC = Area under curve\n* ROC plots _true positive rate_ (i.e. Recall) vs _false positive rate_. \n"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import roc_curve\n\ndef plot_roc_curve(fpr, tpr, label=None):\n plt.figure(figsize=(14,8))\n plt.plot(fpr, tpr, linewidth=2, label=label)\n plt.axis([-0.01, 1, -0.01, 1])\n plt.xlabel('False positive rate(FPR)')\n plt.ylabel('True Positive Rate(TPR)')\n\nfpr, tpr, thresholds = roc_curve(y_train_5, y_scores)\nplot_roc_curve(fpr, tpr)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* As Recall/TPR increases, false positives(FPR) also increase.\n* A good classifier stays close to top-left corner(i.e. minimise the area in that region).\n* 1 - AUC gives us that area in the top-left corner that needs to be minimised."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.metrics import roc_auc_score\n1 - roc_auc_score(y_train_5, y_scores)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Should we use ROC curve or PR curve/Threshold to fine-tune the classifier?\n* Prefer PR Curve whenever the positve alss is rare/you need to minimise the false positives more than the false negatives\n* Use ROC otherwise\n\nNext Up: Train another model(e.g. RandomForestClassifier and check its ROC AUC score).\n\n## RandomForestClassifier"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.ensemble import RandomForestClassifier\n\nforest_clf = RandomForestClassifier(random_state=42)\n# RFC's `predict_proba` returns probabilities that the given observation belongs to a given class\ny_probas = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method='predict_proba') \n# convert probabilities to scores\ny_scores_fr = y_probas[:, 1] # probability of positive class\nfpr_fr, tpr_fr, thr_fr = roc_curve(y_train_5, y_scores_fr)\nplot_roc_curve(fpr, tpr, label='SGDClassifier')\nx = plt.plot(fpr_fr, tpr_fr, 'b:', label='RandomForestClassifier')\nx = plt.legend(loc='lower right')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "roc_auc_score(y_train_5, y_scores_fr)\ny_fr_ps = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)\nprecision_score(y_train_5, y_fr_ps)\nrecall_score(y_train_5, y_fr_ps)\n\n# Exercise: Summarise the steps to train a binary classifier",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "MNIST.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}