Last active March 7, 2017 03:20
assignment two from the Udacity Deep Learning course
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "5hIbr52I7Z7U" | |
}, | |
"source": [ | |
"Deep Learning\n", | |
"=============\n", | |
"\n", | |
"Assignment 1\n", | |
"------------\n", | |
"\n", | |
"The objective of this assignment is to learn about simple data curation practices and to familiarize you with some of the data we'll be reusing later.\n", | |
"\n", | |
"This notebook uses the [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) dataset for Python experiments. This dataset is designed to look like the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset while looking a little more like real data: it's a harder task, and the data is a lot less 'clean' than MNIST." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"collapsed": true, | |
"id": "apJbCsBHl-2A" | |
}, | |
"outputs": [], | |
"source": [ | |
"# These are all the modules we'll be using later. Make sure you can import them\n", | |
"# before proceeding further.\n", | |
"from __future__ import print_function\n", | |
"import matplotlib.pyplot as plt\n", | |
"import matplotlib.cm as cm\n", | |
"%matplotlib inline\n", | |
"import numpy as np\n", | |
"import os\n", | |
"import sys\n", | |
"import tarfile\n", | |
"from IPython.display import display, Image\n", | |
"from scipy import ndimage\n", | |
"from sklearn.linear_model import LogisticRegression\n", | |
"from six.moves.urllib.request import urlretrieve\n", | |
"from six.moves import cPickle as pickle" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "jNWGtZaXn-5j" | |
}, | |
"source": [ | |
"First, we'll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k labelled examples and the test set about 19,000. Given these sizes, it should be possible to train models quickly on any machine." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"output_extras": [ | |
{ | |
"item_id": 1 | |
} | |
] | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"executionInfo": { | |
"elapsed": 186058, | |
"status": "ok", | |
"timestamp": 1444485672507, | |
"user": { | |
"color": "#1FA15D", | |
"displayName": "Vincent Vanhoucke", | |
"isAnonymous": false, | |
"isMe": true, | |
"permissionId": "05076109866853157986", | |
"photoUrl": "//lh6.googleusercontent.com/-cCJa7dTDcgQ/AAAAAAAAAAI/AAAAAAAACgw/r2EZ_8oYer4/s50-c-k-no/photo.jpg", | |
"sessionId": "2a0a5e044bb03b66", | |
"userId": "102167687554210253930" | |
}, | |
"user_tz": 420 | |
}, | |
"id": "EYRJ4ICW6-da", | |
"outputId": "0d0f85df-155f-4a89-8e7e-ee32df36ec8d" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Found and verified notMNIST_large.tar.gz\n", | |
"Found and verified notMNIST_small.tar.gz\n" | |
] | |
} | |
], | |
"source": [ | |
"url = 'http://yaroslavvb.com/upload/notMNIST/'\n", | |
"\n", | |
"def maybe_download(filename, expected_bytes, force=False):\n", | |
"  \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", | |
"  if force or not os.path.exists(filename):\n", | |
"    filename, _ = urlretrieve(url + filename, filename)\n", | |
"  statinfo = os.stat(filename)\n", | |
"  if statinfo.st_size == expected_bytes:\n", | |
"    print('Found and verified', filename)\n", | |
"  else:\n", | |
"    raise Exception(\n", | |
"      'Failed to verify ' + filename + '. Can you get to it with a browser?')\n", | |
"  return filename\n", | |
"\n", | |
"train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)\n", | |
"test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "cC3p0oEyF8QT" | |
}, | |
"source": [ | |
"Extract the dataset from the compressed .tar.gz file.\n", | |
"This should give you a set of directories, labelled A through J." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"output_extras": [ | |
{ | |
"item_id": 1 | |
} | |
] | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"executionInfo": { | |
"elapsed": 186055, | |
"status": "ok", | |
"timestamp": 1444485672525, | |
"user": { | |
"color": "#1FA15D", | |
"displayName": "Vincent Vanhoucke", | |
"isAnonymous": false, | |
"isMe": true, | |
"permissionId": "05076109866853157986", | |
"photoUrl": "//lh6.googleusercontent.com/-cCJa7dTDcgQ/AAAAAAAAAAI/AAAAAAAACgw/r2EZ_8oYer4/s50-c-k-no/photo.jpg", | |
"sessionId": "2a0a5e044bb03b66", | |
"userId": "102167687554210253930" | |
}, | |
"user_tz": 420 | |
}, | |
"id": "H8CBE-WZ8nmj", | |
"outputId": "ef6c790c-2513-4b09-962e-27c79390c762" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"notMNIST_large already present - Skipping extraction of notMNIST_large.tar.gz.\n", | |
"['notMNIST_large/A', 'notMNIST_large/B', 'notMNIST_large/C', 'notMNIST_large/D', 'notMNIST_large/E', 'notMNIST_large/F', 'notMNIST_large/G', 'notMNIST_large/H', 'notMNIST_large/I', 'notMNIST_large/J']\n", | |
"notMNIST_small already present - Skipping extraction of notMNIST_small.tar.gz.\n", | |
"['notMNIST_small/A', 'notMNIST_small/B', 'notMNIST_small/C', 'notMNIST_small/D', 'notMNIST_small/E', 'notMNIST_small/F', 'notMNIST_small/G', 'notMNIST_small/H', 'notMNIST_small/I', 'notMNIST_small/J']\n" | |
] | |
} | |
], | |
"source": [ | |
"num_classes = 10\n", | |
"np.random.seed(133)\n", | |
"\n", | |
"def maybe_extract(filename, force=False):\n", | |
"  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz\n", | |
"  if os.path.isdir(root) and not force:\n", | |
"    # You may override by setting force=True.\n", | |
"    print('%s already present - Skipping extraction of %s.' % (root, filename))\n", | |
"  else:\n", | |
"    print('Extracting data for %s. This may take a while. Please wait.' % root)\n", | |
"    tar = tarfile.open(filename)\n", | |
"    sys.stdout.flush()\n", | |
"    tar.extractall()\n", | |
"    tar.close()\n", | |
"  data_folders = [\n", | |
"    os.path.join(root, d) for d in sorted(os.listdir(root))\n", | |
"    if os.path.isdir(os.path.join(root, d))]\n", | |
"  if len(data_folders) != num_classes:\n", | |
"    raise Exception(\n", | |
"      'Expected %d folders, one per class. Found %d instead.' % (\n", | |
"        num_classes, len(data_folders)))\n", | |
"  print(data_folders)\n", | |
"  return data_folders\n", | |
"\n", | |
"train_folders = maybe_extract(train_filename)\n", | |
"test_folders = maybe_extract(test_filename)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "4riXK3IoHgx6" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 1\n", | |
"---------\n", | |
"\n", | |
"Let's take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images that we just downloaded. Hint: you can use the package IPython.display.\n", | |
"\n", | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAABxUlEQVR4nG2SvWuUQRCHn53dl1xC\nNFGJ2FhoTAohCtoZOEGwCGihoiJa2ASsgvkD0llZxUasrMTggUjAIoWgiFZqJUIQwUYDmlSRcOo7\nHxY58Q5vqmEfnmXmt8uj0OitrbVXi82MAEvxU3371FVV69ojwl42ydAKjfBu09XUI+aRNH1Qjl/f\nEQnLz+6MGgMTFydcMPLlFpA5peFRx10qSAw+CYuwWB0ml6rwPDTquMdgKaXi0I/wCI8LYlrDJgE4\nqqq1fHqPgzMl/FeJrU7XBzo7t8k76VFERKqYnELQ9HaldMHAnaQsDDpW2jfrbjiyZ0zJB+ZmXKJ8\nvfZaAArLUXdHFLG2uA+h51oHcB9Yvf14Mxelr9n+9uLGUI9p+emt3ZaGx2eajcbYydmrq/mf2cmW\nxKWtqOv4fqwnBCHnnKVqLZCKjt0f6YaOmZkZrY0cRY/O9YsvffmMkznfB5L4DSQm+71KDOwiAb9K\nTknLfgQQSgCpbh4OweUNIMyGR9cqRz6GRWhcKdPjQ6fPkiAxcWbUqPaemGmEuFUPlmiFhvX+alc1\ni+UhxFDrTBVmZqpGFlmfP9dORSj575idpr3xYeXheo74A33eLRoVEtD0AAAAAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAABhElEQVR4nGWSv24TQRCHv5ndU5Bo\nMBDFOiTENSAaKh4AiY6SAqgo0iJ6xDvwFLwDSHRBNCgCxB+LCiFBgm0lMrEjk+C73aG4W9t3TDXS\nt7+Z+c2siDOL1HG2t1UURX7z1baaAd4q8Jub/cvFlXyr3wPKLEMrAN9/nBf9Cz3XiC0SslMUAH/u\nqQkQDRARcRgnzUv99UIW0Uydc04FQPiT4GwXURFWIcwT5APrBEyYYw38vu9ji66V5dtn2pByqczm\ngyZdh0nJx5i1abUsG3n/oyNdHKeyUQd7HahZSszxqdlWM2w4f3/pzvHg1KKtItjxLXyydemgBW1h\nz7OkVd61oZV2O3USnlk7gr3eWK7rTgdatHtNXeHi4j/41aeBN3YsdMVP8HXfvzvd9UYe5pUqoLzp\nQq2uP8LVXq4edsxYsP0bKGCM36abmoVQhRCszLdrqNNdyhCigYhz3jnn/Mnda74+zYBMBYhHk6PJ\n4Wg8Gk2ms9++nu3LyzM/D4bD4Xg6n60+zT+uVRLLi2wr8AAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAACB0lEQVR4nF2STUhUYRSGn/Ode02N\nHBsqk6BCN6E2RUVWEkEt2gWSiwhtES1c1kaEhggxMAsR+rFSWggFbSTMRVAtTIjAFlIEMmKkFUYb\nCdSxmbn3tLjTpPOtDjznfd9zOJ8AuJDquk3ZX7OL5ggp37F9d23tt5dvBRDbn2x8/3o+HZ9O5ahP\npOfTS+WN/RU9gOPIUuY00dv3qDsRVVctCzjG7BK+ej6VAx9bQJwTn8EgBUAqqFHnoHnhTRwVwKP+\nt10AHCOWcCWU3rU7oIB4bP1k3QCOxMoIVE9YEnWIKhyey11BAYSmn70Nn60P3wFQ+zAz1kBU44gN\np22+ygOtqGsdnBxuAgUBEPVfnFq6lQszbkPF6vLU5CIiYX45ZdCsF8pisTIBoEQosOMWLFRqFOJx\ncej5bZdPxDFq1hUNh1KfMzv0bxxh54r92Suaj58w6813gkerWaoEAUQZsOBLrODq0WX2DgcoJC1j\nZwpCPG4GNiPO94SNnUHWnvxnKC22Gl4GONvXvxIuVElhEcSVjpvZ6PV7449rPpidWyMEYfOD74H9\nuH+UHrNn6xgIxPfsAk6GNh13a0xVQaIrVs3Y8sEiYRSsnoyZta1j3vm2MgDx6TbrwFsreWrpAzjU\n0W7WUeRp2WPRB7hhs83FeUN2TUBPvJrr3FLMRNvbvk6tbotPjKTRYD38C/nmun2Ap6MaAAAAAElF\nTkSuQmCC\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAB1klEQVR4nG2STWuTQRSFn3tnkmhS\ni7XFYovE+rEQiquCKynoVoQuXLpzoz9C99K1G/+AVOgvcOEHokiq0IWipVBjMFpMa5v6keSde128\niYbSszs898wMcw8cLL3x4AyR46fbf7pZylKWSQwhhBBLlXj1Dq/WoyzNdyyZuWUuqqIqGmOgp+eI\nC/OUDjjWJJxCr2313N3dWPvh5gOpUoXR8UXvuSffOHzTs3/Qzesjcc8+A5i+/P0eIfnguTJ5NqrP\n5G5XMoTw/9bi+eg+iQDgHVN72hQA8YvVdCk64/lkCYHa5UHw+hIa4VDujhCUnVhMAJpqX+8vRigA\nCEcJYFnWT27MtojQy90IAWYXWq7lSmXiyUorWIRunhyljE0v94P3VopdorCX2zLHcDcAUlghQYRW\nDjtMgCgIrtkqDips4gBtxnFEBExWP4iBQgPA2WQaYavdBd296wpEqKPgNKkSbi8XNSI/myQgGmum\nLtDgJNsPt/vfbgDq1BuA8KVwgjfbQVVVc4a6/nqHAd9mxniOmJmZ9fMob3GgMyf+jME6+yrogmdu\n/vGTN8dE9pfpQuqX58V+wtSjdXd3T8nar68MVQHglg9pDh1CkalkBSBZ9/vO45rYEPwLYK/ugXEC\nV6oAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAB3ElEQVR4nG2QPUjVYRjFf8/7vte4\nqYsNFWWgePuUsIhsCKSkLCKqJSwC9zALijaXSscWCdqyFpegpQglIqqxICjSJIcwUiojSOxe/x+n\n4X5ane15zvNxzjHKMJfKtR/uyjWvpvBt4ePb8TdW5hxpQ9+p+qnJL/mm3Uc3yWQPypzHXfz0qCcD\nYGQvF5KCRksnPR1Tc4cw8z6E4D1nlepMmevXeCPel75Y4L4WmgHM/IjGMF8RR6BXE0XOjeppHa7K\n4TmioaLOEX3dQs0eOFpvbAMC56VLBP4DR8dPTTZgf/UtGBgT0tWVR6s4KS1tXqGmBo+ll/8cLRna\n1Z2617ik0qjMKQ7HM8t172um49rNTmC+WtcNNC0uGZBkC7fDDhzLFf3Krt3Y2SID2Z0CS4rUW+vE\nMah8Xu/acESKNVhDhgzr81KynwDfFWl85WYuTvQCD24ao2t7Ug0hcMzHPAPgpuJY9yr2LNDyWbEO\n4oG9kZRoABe8cz441r3SsmbXFEMbU6RUQ9liZX0/lEZ6WAq7dU5RGmv6Wk/71u7hmYVbvcO/dKEo\n0bNvXmlSkKRUc1caCQxpT0m/Y8Pd30qVavHJ6QzmV3FuNmCAgZPaDuTqFz88n5H5RCG+vvOES4E/\nRjTHFXKnzAQAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAABsElEQVR4nG2SMWtUQRSFv5k3LxhB\nQRbEYl0ski5FIBYSxEYSEAstgiBIQIT8GmshkMYiYKFsI2pAFALpwjZCEEQFFSNqYkTWXd+bucfi\nbXbfRk8zXD7O4R7uwFCexa292N/7+PbD14Ne6j9u1tn53zINFbU2ghmr+mOSmZlJMu2HESWA4V01\nKIWiHnvhVz026a4bUaepa43jy6fkQO7ng81HtVQcZGwqSirVJnO+BuVC4DCqS676QigKDWMSdec/\nCoDzzoR38T/QkwASztlR6KWL12cb3Z32hrkjVs/0M0mS6elp8mGVdXJgZlexjCmVhTZqPdfJ8ROr\nZ8osZN6HkBavpPHUlfmYD5oZU4x180vVKAP4MQ7DXDU6Z8p3X5BNDkCJ8McAjHsdn326/ZlzZ/GA\n4wAIX5pyiN7cZdfZnyhuNFIGOHYQtJUkU/cSHsdSMbh3bxoPN5Ukmbprt64uP0zVLyh0v1rliQpJ\nVhmqp9T7ZnXY1juVSbIYY4wmKZX6Pl8ZPa2XMovJTDJL0UzbM4d1PdnKa9X06k4+YA48NrmwMNs6\ncZLetzed51t9T3XYv7yWIh47bySkAAAAAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAACEUlEQVR4nFXSTUiUURTG8f+991Vz\n1FBRklIbUcZ2MQXixkXRxyJr0cJNQavcBBERtIloE9W2vaswA/uEVgUhSC1MaLKPQcI+SCVpIiYn\nZpyZ9z4t3nlhOsv743Duee41ADivRDo12B0Wvs1nK9af77heNSIydlx7H0qStPVqgr0ljeFiO7kq\nL+VX1/9KXtPjJR2roeOiQl+5dyLZ2p6aeFSUNspKYiObVFnZUQyAIf1GVfl+LGBJF0K97MQ5a4x1\njq6svI86Dc/lfw4RUKtGTimsIQdV0TkaYsPQs6FqX4RT0odt1JVhSdV+LNi2IzBTCv7Dz7hhDNhU\nNyygGIwxlnfQgjEm2NdEYSVGgyDk7khujlAm6IRiLkYFHcaU8tmjQFNzPkhAWKr1qXXmUGjzmU9f\nmwd7e3fd4Za03hCl4xhTVXUVAEFzBQDP2+nR0Ki7o4rVWviCq1J+d7RxfKnkR/nK6cDAGXkdiB/P\nWMBxSXqKwdrXRZGOZoI8YCjAGgHeLi9ZxgnrEhJDRAc2vI8fG5GrxzS04AH6/ng9xpnYAlKb0hOi\n8VMq6QpBYI0x1gUwqy3NgQHLwKrKutEQf5OdD1UNlWnEAI7RnLaUOTvcHiT6D9/OSZlFfdkep7Zn\nUV4qbnxfz0te2eSCNntq61kSl3/IS5K8Nm+2M6+Vtnh3a+mafJD9Xf61/OzCAI79s8ex8A9BYgpb\njKp24gAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAl0lEQVR4nO1SMQ7CMAw8J1bZeUFZ\neQ5v4Rm8heewEgl1Zadycgypi5BIQZ3xYjt3smP7kFj40QpTwIItgkoUqaFUT9aUkdohyBt94oDo\nZL+pzHg/nExhejxvc2U99OL1LYEAka438zcvGi34F6N31eKtMEUFOf8yyh/8Bq5fvL6O3UMAQb+b\nj42hLc1Bx7bARpW2wGT9nE+rA2qle2HnpwAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAABbElEQVR4nHWTMUucQRCGn5ndu0/k\nUCFRJI1goaCmFAJR0qRUIZVNwCJgKwhKUogWIkhARFIEoo2VbQIpEn+BYKlgsD6wMGk06t3n7lh4\nyvm5vuU+vDM7LzNyJPbsuQkAmFTPs9YuCCrEATTTJavbrXJ7p+0vXn88NIsW+sGz2AQnECCbvbIY\n+pVGxYYEL+pqa5OXZihFmcXgs+8zmoIA1zX/7bcPKaiMDUQ2YtIpMj8a2ftfSUCJ1nqG+7dz4ROQ\nyssaEqYTZdXHqXINUCk6jTpvVlDg8YdU++Z+VhrJPHQ6Pi8MZtTLdzM1K7A8Mb51/1jseVqt7u5/\nabBiz5Jkpa/bBIDrIjQLUVfxQHkmEUK0P8ctKL3rifjMyd82lJGQcGK2eSDBfXApGNmilL9/FVMQ\nnObdy09sgmje8aPH9BEUdc7yoV/DQfFYYY46dE99ag8OvLrYvJxZR8/g27FOogPkwTnASd7ZcnsN\nwA2lTIwg6p4fUAAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAABcUlEQVR4nHXRMUhVcRTH8e85//uv\n9xREGgrRofBBIAaZSAgNjjo2K4K5uurS3uQSEQiNNbi0NeRgk7n5GiIwipYaEh1E6ZX33v85DbeW\n3r8znOXD4XB+R96NX7QHAX72BGIr/vrU3d5xNYCxzvCGV1755lDn+o2bU8tdN3szSQFAwQOvvPSH\nREAIL7zy3hIBVBUBQFBVlSKtHRZV6/lKCqgZDoBjZmZ1PN0kOFv3kir/Vs2rE5V06UnL+tH1qIuF\n+vYK/YjyHlBWYwadL4ByZzaD8B0gyUIWz5vTprJYIiB0smjN5NX/IwxkscmTOosBwDnPouLgfM3i\n5WbyYxavNIv3sjgKuJ7tZFCYwDFef+5HqWUC8WCPMy8Tbs0Qarb2Qz8WLEqq4tt1rAjhTxBKMEDK\nmTX3uH+/p6YpkUh/e7J67uWAh2fzx2oUG9fO7hJVmI+DJq2h6Ukp9x7tqhjI05FU/hDwdttBtfft\nw8FhCubAbxkykmaN5JxPAAAAAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<IPython.core.display.Image object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"for i in range(10):\n", | |
"  random_dir = np.random.choice(test_folders)\n", | |
"  filenames = [f for f in os.listdir(random_dir) if os.path.isfile(os.path.join(random_dir, f))]\n", | |
"  random_filename = np.random.choice(filenames)\n", | |
"  display(Image(filename=os.path.join(random_dir, random_filename)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "PBdkjESPK8tw" | |
}, | |
"source": [ | |
"Now let's load the data in a more manageable format. Since, depending on your computer setup you might not be able to fit it all in memory, we'll load each class into a separate dataset, store them on disk and curate them independently. Later we'll merge them into a single dataset of manageable size.\n", | |
"\n", | |
"We'll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road. \n", | |
"\n", | |
"A few images might not be readable, we'll just skip them." | |
] | |
}, | |
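{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick check of the normalization described above, here is a minimal sketch (not part of the original assignment) that applies the mapping `(pixel - pixel_depth / 2) / pixel_depth` to a synthetic 8-bit image. The random image and the name `fake_image` are assumptions made purely for illustration; real notMNIST glyphs are mostly background, so their mean and standard deviation will differ from what this toy input gives." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from __future__ import print_function\n", | |
"import numpy as np\n", | |
"\n", | |
"pixel_depth = 255.0  # Same constant as in the loading code.\n", | |
"\n", | |
"# Hypothetical stand-in for one 28x28 grayscale image with values in 0..255.\n", | |
"rng = np.random.RandomState(0)\n", | |
"fake_image = rng.randint(0, 256, size=(28, 28)).astype(float)\n", | |
"\n", | |
"normalized = (fake_image - pixel_depth / 2) / pixel_depth\n", | |
"\n", | |
"# Raw values 0 and 255 map to -0.5 and 0.5, so every pixel lands in [-0.5, 0.5].\n", | |
"print('min:', normalized.min(), 'max:', normalized.max())\n", | |
"print('mean:', normalized.mean(), 'std:', normalized.std())" | |
] | |
}, | |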
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"output_extras": [ | |
{ | |
"item_id": 30 | |
} | |
] | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"executionInfo": { | |
"elapsed": 399874, | |
"status": "ok", | |
"timestamp": 1444485886378, | |
"user": { | |
"color": "#1FA15D", | |
"displayName": "Vincent Vanhoucke", | |
"isAnonymous": false, | |
"isMe": true, | |
"permissionId": "05076109866853157986", | |
"photoUrl": "//lh6.googleusercontent.com/-cCJa7dTDcgQ/AAAAAAAAAAI/AAAAAAAACgw/r2EZ_8oYer4/s50-c-k-no/photo.jpg", | |
"sessionId": "2a0a5e044bb03b66", | |
"userId": "102167687554210253930" | |
}, | |
"user_tz": 420 | |
}, | |
"id": "h7q0XhG3MJdf", | |
"outputId": "92c391bb-86ff-431d-9ada-315568a19e59" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"notMNIST_large/A.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/B.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/C.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/D.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/E.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/F.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/G.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/H.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/I.pickle already present - Skipping pickling.\n", | |
"notMNIST_large/J.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/A.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/B.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/C.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/D.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/E.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/F.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/G.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/H.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/I.pickle already present - Skipping pickling.\n", | |
"notMNIST_small/J.pickle already present - Skipping pickling.\n" | |
] | |
} | |
], | |
"source": [ | |
"image_size = 28  # Pixel width and height.\n", | |
"pixel_depth = 255.0  # Number of levels per pixel.\n", | |
"\n", | |
"def load_letter(folder, min_num_images):\n", | |
"  \"\"\"Load the data for a single letter label.\"\"\"\n", | |
"  image_files = os.listdir(folder)\n", | |
"  dataset = np.ndarray(shape=(len(image_files), image_size, image_size),\n", | |
"                       dtype=np.float32)\n", | |
"  image_index = 0\n", | |
"  print(folder)\n", | |
"  for image in os.listdir(folder):\n", | |
"    image_file = os.path.join(folder, image)\n", | |
"    try:\n", | |
"      # Note: ndimage.imread was removed in SciPy 1.2, so this call needs an\n", | |
"      # older SciPy; newer setups need another reader (e.g. imageio.imread).\n", | |
"      image_data = (ndimage.imread(image_file).astype(float) -\n", | |
"                    pixel_depth / 2) / pixel_depth\n", | |
"      if image_data.shape != (image_size, image_size):\n", | |
"        raise Exception('Unexpected image shape: %s' % str(image_data.shape))\n", | |
"      dataset[image_index, :, :] = image_data\n", | |
"      image_index += 1\n", | |
"    except IOError as e:\n", | |
"      print('Could not read:', image_file, ':', e, '- it\\'s ok, skipping.')\n", | |
"\n", | |
"  # Keep only the slots that were actually filled with readable images.\n", | |
"  num_images = image_index\n", | |
"  dataset = dataset[0:num_images, :, :]\n", | |
"  if num_images < min_num_images:\n", | |
"    raise Exception('Many fewer images than expected: %d < %d' %\n", | |
"                    (num_images, min_num_images))\n", | |
"\n", | |
"  print('Full dataset tensor:', dataset.shape)\n", | |
"  print('Mean:', np.mean(dataset))\n", | |
"  print('Standard deviation:', np.std(dataset))\n", | |
"  return dataset\n", | |
"\n", | |
"def maybe_pickle(data_folders, min_num_images_per_class, force=False):\n", | |
"  dataset_names = []\n", | |
"  for folder in data_folders:\n", | |
"    set_filename = folder + '.pickle'\n", | |
"    dataset_names.append(set_filename)\n", | |
"    if os.path.exists(set_filename) and not force:\n", | |
"      # You may override by setting force=True.\n", | |
"      print('%s already present - Skipping pickling.' % set_filename)\n", | |
"    else:\n", | |
"      print('Pickling %s.' % set_filename)\n", | |
"      dataset = load_letter(folder, min_num_images_per_class)\n", | |
"      try:\n", | |
"        with open(set_filename, 'wb') as f:\n", | |
"          pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)\n", | |
"      except Exception as e:\n", | |
"        print('Unable to save data to', set_filename, ':', e)\n", | |
"\n", | |
"  return dataset_names\n", | |
"\n", | |
"train_datasets = maybe_pickle(train_folders, 45000)\n", | |
"test_datasets = maybe_pickle(test_folders, 1800)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "vUdbskYE2d87" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 2\n", | |
"---------\n", | |
"\n", | |
"Let's verify that the data still looks good. Displaying a sample of the labels and images from the ndarray. Hint: you can use matplotlib.pyplot.\n", | |
"\n", | |
"---" | |
] | |
}, | |
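{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One possible way to approach this problem, offered as a hedged sketch rather than the reference solution: load one of the per-class pickles and show a few randomly chosen images with `matplotlib.pyplot`. The helper name `show_samples`, the default of five images, and the example path are assumptions for illustration; the sketch also assumes the pickle was written by the same Python version that reads it." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from __future__ import print_function\n", | |
"import os\n", | |
"import pickle\n", | |
"import numpy as np\n", | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"def show_samples(pickle_file, num_samples=5):\n", | |
"  # Hypothetical helper: plot a few random images from one per-class pickle.\n", | |
"  with open(pickle_file, 'rb') as f:\n", | |
"    letter_set = pickle.load(f)  # ndarray of shape (num_images, 28, 28)\n", | |
"  label = os.path.basename(pickle_file)[0]  # 'A' for 'notMNIST_small/A.pickle'\n", | |
"  print(label, letter_set.shape)\n", | |
"  indices = np.random.choice(len(letter_set), num_samples, replace=False)\n", | |
"  # squeeze=False keeps axes two-dimensional even when num_samples is 1.\n", | |
"  fig, axes = plt.subplots(1, num_samples, squeeze=False,\n", | |
"                           figsize=(2 * num_samples, 2))\n", | |
"  for ax, idx in zip(axes[0], indices):\n", | |
"    ax.imshow(letter_set[idx], cmap='gray')\n", | |
"    ax.set_title(label)\n", | |
"    ax.axis('off')\n", | |
"  plt.show()\n", | |
"\n", | |
"# Assumes the pickling step has already written this file to disk.\n", | |
"show_samples(os.path.join('notMNIST_small', 'A.pickle'))" | |
] | |
}, | |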
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"J.pickle\n", | |
"(52911, 28, 28)\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAGEAAAD/CAYAAADhariQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztfWtsXMmV3lf9frGfZPNNihQ94tiShuMZSTNjw5Fkb2Ib\ntjdIjMXu2nHixIsAziYBEiD2LmB4kOSHnR8GdpPdH+tsFt4ggzgxYK8deJzxzGKsiT0vaTSShnqS\nIkU22exmv99PVn40T6n6st8vtsT+gAab3ffWvV3frTpVp06dj3HOMcDhQnXYNzDAgIS+wICEPsCA\nhD7AgIQ+wICEPkBbJDDGPs0Yu80Yu8sY+0anbuqogbU6T2CMqQDcBfBJANsA3gXwu5zz2527vaOB\ndlrCWQD3OOcPOOd5AP8TwG935raOFtohYRLApvS/Z/+zAZqEptsXYIwN/CL74JyzSp+3Q8IWgBnp\n/6n9z3oKxhhkuzY3N4e5uTnMz8/j+PHjmJycxN7enjjmJz/5Cb74xS9Cr9fDYDBgZmYGMzMzsFgs\nUKvVYKy8nl588UW8+OKLFa/NOUc6ncaDBw+wvr6ORCKBl156CZ///Odx79493Lp1C+FwGOFwGB98\n8EHV39AOCe8CWGCMzQLwAvhdAL/XRnktQTmwiMVi8Hg8SKfT8Pl8sFqt4Jxjb28PALCysoJXXnkF\nQ0NDsNlsyOfzcDqdMBqNYIxBrVY3de1MJoP19XW8/fbbCAQCWF1dxcsvvwyNRgOTyYRIJIJMJlOz\nnJZJ4JwXGWN/COAVlGzLX3LOb7VaXqeQyWQQiUSQz+cRj8dhMBjAOQfnHIwxhEIh3L17F3q9Hkaj\nERaLBW63G4wx2O12mM1mABDHV8Pe3h6KxSJSqRQ2Nzdx7do1hMNhBINBrKysCDKj0Sii0WjNe27L\nJnDOfwHgRDtldBqZTEYQoFaroVarRWthjKFYLOLevXviqR8ZGcHMzAzMZjP0er0gASgRcf78+YrX\n4ZyjUCggmUzC4/Hg+vXriMViyGQySCQSgsBCoYB8Pl/znrtumHuNYrGIYrFY8TuVSgXGGDKZjOie\nwuEwYrEYUqmUqCy5i5NJkFtHoVBAOp1GPB5HOBzG7u4uksnkgfMbwWNHAoCybkT5XqUqH5VzzlEs\nFsuMdyUov6NWEI1Gy0hljIlr0jn1SHksSQBQsT+XK0geVdHntWwAgVpDLpdDPB5HJBJBOp0uK6vS\n8bWIaIsExtg6gCiAPQB5zvnZdsrrFKr9YHpa5WP0ej2GhoZgNpuh1WobvkY2m0UoFILP50MikSgr\nm8qnVz202xL2AJznnIfbLKfjqEcEPaFEgsViaYsE5fUaJQBonwSGR8gdLncZarUaWq0WJpMJQ0ND\nMJlM0Gg04vtqFUjdTS6XQyQSQTAYRDKZbLj/r4R2K5AD+CVj7F3G2B+0WVZPQC1Aq9XCYDDAZDLB\nYrHAaDQKEhpBLpdDNBpFMBgsswmtoN2W8DHOuZcxNoISGbc45/+vzTK7AvnpVqlUMBgMsFqtsFgs\nMJvNMBgMZbNlpYGVRzzkrgiHw/D7/aI7koewzZDSVkvgnHv3/+4C+DFK7u2+g3L0o1arYTab4XK5\nYLVaYTQaodVqDwxfK4FmyslkEru7u9jZ2SmzCc3YAkLLJDDGTIwxy/57M4C/C6C6l6qPoNFoYLFY\nMDw8DKvVCoPBAI1G0zAJhUIBqVQKoVAIu7u7bZPQTnc0CuDH+65qDYD/wTl/pY3yegaNRgO73Y6J\niQnYbDZhC6qN9WXQTDmVSiGVSiGdTiOfzx+OTeCcrwFYavnKhwitVgubzXaAhEZQKBSQyWSQSqWQ\nTCaRSqXaIgB4hIaX7UK2CVqtFg6HA1NTU7Db7TVJkLsXzjlSqRSCwaBwUXcilvfIkCBDq9XC6XRi\nenoaDoej4Uka5xzJZBJ+vx+hUAjZbBbAQ59U
I26PSjgyJMjDU61WC6vVitHRUQwNDTW1kJNMJhEI\nBBAOhwUJABr2PVVCXRIYY3/JGPMxxq5LnzkYY68wxu4wxv4vY8zW0tV7CJkEnU4Hk8kEm80Go9Eo\nSKhXidQd7e7uIhwO110xaxSNtIS/AvD3FJ99E8CrnPMTAP4WwB915G66BHkipdFoxOINDU/loWkt\nIqg72t3dPdAdtYO6JOzPgJUOut8G8IP99z8A8PfbuosuQq4gvV4Pq9UKh8NRNkmrVYnyDLhYLCKR\nSIiWQCQ04zGthFaHqG7OuW//BnYYY+4Wy+kqlJWr0+lgs9ngdDqF006lUjU0SaNggUoktItOLer0\nbWyR/CQbDAY4HA64XC6YzeamDHKxWEQul0MikRBDVLklyH+bRaujIx9jbBQAGGNjAPwtltNTGAwG\nOJ1ODA8Pw2QyiRFNva6EFvVpNS0YDCIajSKXy5Ud0yoaJYHtvwg/BfBP9t//YwB/0/IddBHK7sho\nNGJ4eBhutxsWi6WuESbs7e0hl8shmUwikUggHo8jnU6jUCh05D4bGaK+BOA3AJ5gjG0wxr4K4DsA\nfosxdgelqOzvdORuugyz2YyxsTFMTEzAYrE0fB4t6ofDYaRSKRSLRTHaandkBDRgEzjnv1/lq0+1\nffUuQxnAZTKZMDo6ivHxcQwNDVXtQuQ+nmKVkskkQqGQIKGTeGyjLQhyf6/T6UT4o16vb7iMXC6H\ncDiM7e1tRCIR0Q3JBPfCJjySUBpcnU4Hq9UKq9XaFAn5fB7hcBhbW1uIRqMoFAploTPtolW3xbcZ\nYx7G2Hv7r0+3fScdhDLgS6fTwWw2i6gKk8lU02mnrOBCoYB4PC6WMjvdHbXqtgCA73HOP7r/+kVH\n76oDkCvSZDLB6XTCbrdjaGiooUV9mUgigSIr5Gi9nriyq7gtgPIha99CpVKVkWAymaDT6Q4s6lfr\nVjjnyOfziMVi8Pv9iMfjoiV0Ki9IOzbhDxlj7zPG/ms/eVGVFapSqWA2mzEyMgK73S4cdvVCE4GD\n68nb29vCJhDa8RmJe2zxvD8HMM85XwKwA+B7bd1FF0BEyCTQqEj2FdWqQCIhnU4LEmKxmCChEwQA\nLZLAOd/lD6/+fQBn2r6TLkGlUsFms2FychLDw8NNjYrk9eROLepXvMcGjytzW+z7iwj/AH0c6qJW\nqwUJLpcLBoOh4XOLxSIymQySySTS6TSy2WzHXBUy6k7W9t0W5wG4GGMbAL4N4AJjbAmlgOB1AP+8\n43fWIahUKgwNDWFsbAwOhwM6na7m8fIsm7Zeye4KAGWOv06gVbfFX3Xk6l2AsmLIJlCgV635gfLc\ndDqN3d1d+P1+pFIpAJ0nAHiMZ8wUAaHVamE2m8VCTjPh7+l0GoFAAIFAQGyFksvvFB4rEuSnVK1W\nw2QywWq1ik0g8sioEbcDjYoo8lq+Tr1zm8FjQ4KyQtRqNYxGY1nktV6vb2o1LZPJVCSh02jEdzTF\nGPtbxtgyY+wGY+xf7X/ed2EvMhEU6jgyMgKr1Xpglqw8Xv6f1pMp2i4QCJSR0EmXBdBYSygA+Dec\n848AeB7Av2CMLaLPw150Oh3sdjvcbncZCfWMKmOsLPw9EAggGAwKw0zoqWHmnO9wzt/ff58AcAul\nPBZ9Hfai1+vhdDoxPj4Oq9UKjUbT8O7Mvb095PN5sZomB3p1apYsoymbwBg7hlIk9lsARuWwFwB9\nFfZiMBgwPDwsgn4rhbVUIoUcdrQzPxKJIBqNloW3dJqIhlfW9jeE/AjAv+acJ9jBFDp9Ffai1+sx\nPDyMiYkJWK3WhmKLgFIF06J+PB5HLBY7MDztNBq6M8aYBiUC/jvnnCIr+jrsRafTweFwYGxsrGrQ\nb6XdNcViEdFoFB6PB8Fg8EDQbzfQaHf03wDc5Jz/ifRZX4W9KEMRySYQCY22hGKxiEgkgq2tLQSD\nQRFbpNz3
1kk04jv6GIAvAbjBGLuKUrfzxwC+C+B/Mcb+KYAHAH6n43fXJDjn0Ov10Ov1cDgcsNls\nGBoagl6vb7jyCoUCQqEQ1tbW4Pf7OxZ5XQuN+I5+DaDaDKelsBdlhbRj5JRbVo1GI+x2O1wuF2w2\nm0iXIM9ya12vUCggGAxidXUVOzs7/UFCv0MZdmI2mzE6OirmB9VagUwExRZRniSfz4f19XXs7u4+\nviTUmyw1chwdq2wJVqsV09PTmJycLLMFlcqS3da5XA6xWAw+nw9bW1t48OABAoEAcrncgY3kncah\nkFCtO1JWaiMuY2U3Y7fby0ioVobS9ZDJZBAIBLC5uQmPx4Otra2yDF7dqHxCI4Z5CsBfo7RveQ/A\nX3DO/zNj7NsA/gAPh6Z/XC30hSpXpVJBo9FAq9WKtGg0OaJlQ+XTVs+g7u3tCZe1Wq3G2NgYTpw4\ngfn5edhsD91Zyj3K9DefzyOXy2FzcxNXrlzBlStXsLKyUjdlWifRSEsg39H7+xO2K4yxX+5/9z3O\neUOL/CqVCmq1WiT1oLgfcpQBJaMoB9sC1cNRlGN8jUYDg8GAsbExLC4uYn5+viyfnXyeXGY+nxfJ\nBN966y386le/QigUqpiKrVtoZHS0g1JEBfZnyrfwMBNwQ+O+ubk5YSxNJpMIviI/TTqdRjqdFmHn\n0WgUoVAIoVDoQH4ieg9AtCqyA1NTUzh58iTGxsZgsVjKArzkbokqnip/c3MTV69exe3bt7Gzs4Nc\nLteTyic0ZRMk39HbAD6OUuzRPwJwGcC/5ZxXzD15+vRpPPXUU1haWioLviISstksstksdnd3sb29\njbW1NSwvLyMejyOfz4vWQaCnnwiYmZnBuXPncO7cOSwsLGB4eFgMS5XdEFAywoFAAF6vF5cvX8bl\ny5exsrKCra0tpNPpA1m8uo12fEd/DuDfc845Y+w/ohR79M8qnbu9vQ2tVotUKoWLFy/ihRdegEql\nOmAT/H4/Njc3MTIyAsZKWRvJd1MsFqHRaMpeNB+Yn5/HmTNn8Pzzz4uVNLVaLVzS+XxehK/Qahll\n9X333Xfx9ttvw+fzIZfLdSWaoh4aIqGS74iX0usQvg/gZ9XOf+6553D+/HlcuHABRqNRbNaTn2iV\nSgWn0ykysFitVszOzmJjYwMbGxvI5XKwWCwYGhqC0+mEw+EQL7fbjZmZGRFNQQTQFqdQKCSiqmn0\ns729ja2tLRHuns/ne94CCI22hAO+I8bY2L69AOrEHnk8norGVTlqMhgMcLlcWFlZwYULF/DUU0/h\n+vXruHbtGlKplNjqRPmtnU4nbDYb3n77bZw8ebKsbEoKm0ql4PP5sLGxgRs3buDVV18VeUwDgYC4\nj8NEO76j32809ujKlSsYHR3F5uYm3G43XC6X2Es8NDQEnU4nnmC1Wo1Lly7h7NmzcDqdeOKJJ2A2\nm0VLsFgssNlssNlsUKvVSKfTePnllzE/Py+SxdI6AMUM7e7uCntz7949qNVqZDKZQ698Qju+o4bD\n4X0+H1555RVcvnwZ8/PzWFhYwLFjxzA1NYWJiQmYzWZYLBbo9XoRkmIwGGAwGGA2mzE7OysiKCiY\nlzGGbDYrUt5sbGyIeFGv14utrS1sbW3B5/MhmUwimUwim80iHo+LJUzl7LyXIyIZPZkxZzIZ+Hw+\nxONx6PV62Gw22O122O12ZDIZ6HQ6kaWX+mVqFRQ7Kg9PC4WC6O8pLXMqlRIpk/1+P7xer+j7M5kM\nstnsgUlgv7SEljV1Gr7AQMRCgFcRseg6CQPUx2MT/PUoY0BCH2BAQh+g6ySwJlQJGWPrjLFrjLGr\njLF3KnzfVBayKsdX3P7bbLhnheP/Za3ya0L24Xf6hRLJKwBmAWgBvA9gscbx9wE4anz/cZQciNel\nz74L4N/tv/8GgO/UOf7bKLnmlWWPAVjaf28BcAfAYrXyaxxfsfxar263hG
ZVCWtmoedNZiGrcjxd\nR1l2U+GeVY5vysVP6DYJzaoStpKFviwLGRoLx6y5/bfZcE+Fi79u+Ur0m2H+GOf8owA+i1L098db\nKKPexKfm9l+ly75CebzO8U1vL+62FHBTqoS8tSz0TYVj8hrbfyu57GuVX83FX638amgna7wKwH9B\nKe/FRwD8HivtW5AhVAkZYzqUVAl/WqW8RrPQN5uFrJntv82Ge1Z08dcovzLaGPk8B+Bl6f9vAvhG\nheM+jdLI4R6Ab9Yobw6l0dNVADcqHQvgJZS0n7MANgB8FYADwKv713gFgL3O8X8N4Pr+tX6CUp8P\nAB8DUJTu4b39e3dWKr/G8RXLr1mXbZDwD1EKf6H/vwzgTyscxwev0qtaXT5yYZAajQaTk5OYnJzE\nwsICTpw4gXfffRdf/vKXRZpNisLQ6XQiHomWPP/sz/4MX/va14Q7nMJs5P+B0sNpMpnwwx/+EF//\n+tfFOgYFJWSzWeRyOWSzWREtks1m8dOf/hQXLlwQ3wUCAezu7uLHP/5x9d/URn0cihSwRqPB+Pg4\nTp8+jWeeeQZnzpxBPB7HZz7zmYobA5UReiMjI3jqqafK1hbovdR6xTlvvPEGTp8+XfVYzrlYBykU\nClhfX8dXvvIVXLp0CW+88YbIHFnzN7VRHz2VAqZ1aIvFgpmZGSwtLeHEiRMYGRkRSlGNhCxSOc1e\ntxJkQoBSIBstWn32s5/Fpz71KTx48AAbGxv4xS+qL0Q+MlLAKpVK5LqenZ3F0tISZmZmYLPZcOHC\nhQMEKP8CpSf7E5/4hFjabGRljQRQaxFL5ahUKly4cAF6vR6cc+h0OrhcrrphNI+kFLDcBRSLRXzi\nE58A5+XpN+XALxnVpH3l8+TrnD9/viYB8nmMMVy8eFF8JsfI1sIjY5gpUi8Wi2FtbQ2XL19GKpXC\n8ePHRTBYJXn3aqE2zaDVtWjOucgaVguPFAmUq3p9fV1k+DUajTAajdDpdGJDiNIgK/+X/yrRyFNP\nx1GLpL9KFAoFRCIR7OzsVCjhIR4ZEgjFYhF+vx83b95EKpWCx+MRaZcpfwXFMdH+NQqfMRgM4n86\nrhHbIHd/BIpxogBmkoCnGFvaC3337l3cvXu3ZvltLfSzBqSAuxFtQeN/yvZrt9tFaKTZbBY5UIeG\nhkQyWgoYs9lsIujMYrE0pJ/A+cMwG2oB9IR7vV4RWslYSdOTIs0TiQRWV1exsrKCtbU18G5EWzDG\n7gN4hteQAu5myItWq4Ver4fRaBQVbjAYYDQaYTKZBCEUXEatxe1240Mf+hCOHz8uWoa8rUrZMpLJ\nJHZ2drCzsyO01UiBNhAIwOfzYWdnB4yV5GLIfqXT6bLzqpHwyEoBUxQd5aXLZrOIRqNixkyzZvpL\n7zUaDWZnZ/HJT34STqdTZAOjAGXgIBGJRAK3bt3Ce++9B6/XK/Q2s9msyJGXSCQAQJQjB6XJOguV\n0C4JtAhTRMmP9P02y2sKcuh7pV2Wlfp6zjn8fj/m5ubw9NNPi25N/p7Oo/LD4TDu3buHN998U2wq\nyWQyovWQ26NVPLJSwJW60WojHzK+soGlBCI2m61sN5AMesp9Ph82Nzdx//59oSBCoyIqV2lXmgmz\nb3eyJhZhGGO0CNNzPWblJE1JAhlfIoC6MepSlLmv6fxsNisM8ObmJtbW1spGSbLKoPKa8sbIenb3\nsZACln9oAy74MlEK2gVUqesig0yZgWnblrzJhQisd81aeKykgCv5iyp9TyRQFng5Fb9MRjKZhNfr\nFSQoCWimomvhSEgBKyuMNHLkliD37wSS9EokEl3d19xv0RZdQ6XuqFpLaNVX1CoeObdFK1B2FyTf\nSCTIw0uZABqiFgqFshFUJ7ogGUdGhVYGiVL4fL6yISeBVsni8XjZ5IxQz/Y0iyOhQkugSisUCoKE\nWCxWNu4HHraAeDwu/ENEwqG0BP6Iq9
AS5EorFotIp9OIRCJIJpNi7xsdw/a39VICK4fD0VTK/2bx\nWKvQVgPNmCm5Ce3kp6gMmoTZbDbMzMzA5/OJPc/dQKdGR13zlHYD5O+XhSlkNwRlpLFarZiamsL0\n9DRsNlvXRk5HSoW2Gqr18bJWp+xtpXM6hcdahbZdyCTYbDaxjt1ptCrn8h0A/5v1UTrOZkDJr8xm\ns1gI0mg0ZWsKwMO8eJFIBKlUqmy+UC2aoxU81iq01aBWq2GxWMSijsFgKEvbSSAvKmnq5PP5Mrd4\np3Bk3BZyBZO2wsTEhBDAq5TWXylsJNuOThroI0GCMuxFq9XC6XRienoaTqcTOp2ubG2AkMlkEA6H\nEQqFylbu6oXNNIsjQYISFNM6MjIixI6UkXdAOQmHLefScSlg5ZPZa5ArOxgMIpFIiFAWAi3UU/Sc\nsiX0i+8IaFEKuBUCOkGY0m1BXlRaK5D7epmEYDBY1hL6yXcENLlXVyqv4z+ikWvKf0lp3OPxIBAI\niLVm8h9RBEcikRA6a0pNnU7iUKSAmyWCjm2nRcjXyufzIiOk3+9HNBpFOp0WizvFYlHEvVLqtm4K\nG7XqwGs4HacSFCMqx4ZWg3J7EmXwahfU1WQyGezu7sLn88HtdkOtVkOr1YqFHDneVGkzOomWSOBN\npONUwmQyYXZ2FhMTE5iamsL4+HjF48iRRiGGHo8Hm5ub8Hq9rdxyRXDOEQqFsLKyItL8Dw0NiS6J\nwluUc4PDIuHAXmDeYDpOJdxuN+bm5nDq1CksLS3hwx/+MJV5wJ+vVqtx+/Zt3Lp1C9euXUMikWib\nBGVFhkIhrK6uwuVyCT8RLeoQEXJX2A1b1hMpYIvFgsnJSUxNTeHEiRNYXFzE3NyckN/av47yumCM\nYXZ2Fnq9HiaTCXq9HhaLRQTZ1ovxbASUIVK51txL9EQK2Gw248SJEzh37hwWFxfxxBNPYHR0tGyP\nQDVQJnin0ykisK9evSpEh5rtIpTXIgOcTCa7ojreCHoSbTE6OoqFhQUsLS2JfKgUhCuHHVaqUBKx\nLhQKiEajSCaT2Nzc7JhLmTFWtpp2GOgJCdPT0zh27BgWFhbEEw00Z+B0Oh2cTqdYYCESZBIbgTLm\nVKfTwWw2C02HwyCiJySMjY1hcnIS09PTB4SGlPOFSpVATyttBqlURqvQ6XSwWCxlwhq9Rk+kgGUF\n2EqQ0+qn02lkMhmx+E6jk0wmA7/fj/X1dYTDYTGxatelodVqy4Q1DgM9kQLW6XRi5UrZ78uL7kRA\nNpsVmge0kkWpntfX1xEKhcpCF9sBdUeHSUKrci6UG+7v7B/2AwCvo0TMAVQSpAYebotdWVnB3bt3\nkUwmhVCRPLPW6/UIBAJYXl7G7du3sbu727HhpFarLbMJh4FW5VwO5IarFXtUSZadYvszmQyWl5fx\n8ssvCzUPAKJiaMMfjYq2trYQCoXKsrG02h2RYSaboNVqK+5z6DZ6IgX861//Gpubm7h06RIuXrxY\nltpgb28PoVAI9+/fx9ramnAtEwm085I8n6SNLHdH7Rrmvu+OgNq54TjnvnqxR1/4whdw5swZnD17\nVqhKAQ9zP1itVrjdbkQikbL8QWSwaZcM2QvlIkyjqGTElUNU5bG9GLK2LOeCh7FH30Wd2CNSlJJB\n/iEiYXR0FH6/H8FgsKzCu9Ul0LCXXCFKEvqqO2IdkAKm4Wmlp4oEjY4fP45wOCzyQBBJ1Sqjkcqp\ndD3OufBlTU9P4+TJkxgfHxeyMsViEcFgEB6PBz6frz8EUHkHpIDJLbBfXlkFqtVquFwuLCwswOv1\n4tatW4KASiQ082RWGg4DpZb5xBNP4Ny5czh16pQggTR6gsEg7t+//3hJAe/s7ODYsWPI5/MiskF+\nWa1WTExMYHR0tEwpsNrT38yKnNwaHA4HRkZGMD8/j49+9KN4+umnMTMzIww/pUhYXl7Ge++9h/X1\ndS
STyZau3Qx6QsLdu3cxNzeHbDYr0pnJTjuLxSJUp6hbIIkuQruVwBjD5OQknnnmGSwtLeHJJ5/E\n4uIiLBYLtFotgsEgVlZWcPv2bVy5cgWXL18WOkDKe+g0ekLC2toavF4vwuFSvIDJZBLdE2NMCOCN\njo5iYmICExMTiMViiMfjZZu3W6kElUolJnwLCws4e/Ysnn32WUxOTmJiYkK4S7a3t3Hr1i1cuXIF\n165dw/LyckeWUhtBT6SA/X4/1tbW8MEHH2Bubg7T09NCyguA6H6OHTuGixcvwu12Y3V1Faurq2It\nmDZ8V1vdquTQY4xBr9fj2LFjOHbsGE6dOoWPfOQjGB8fh9lsRrFYxNbWFtbX13Hz5k1cvXoVN2/e\nhNfrFXny+mJ0hA5IARMJy8vL0Gq1GBkZgdVqFT+QdsjMzc3B4XBgYWEBb731FlQqlZig5XK5Mieg\nsnJkgmQSLBYLnn32WTz33HOYm5vD5OSk0GkmEq5cuYL33nsPN27cwOrqqpjN9w0JVXxHTekEpNNp\nbGxsiCyJKpUKU1NTBzJzkZGenJzE6dOnYTAYkEgkxNIjOQArDVmrucANBgNOnDiB48ePw+VywWg0\nIpPJiJ36ly9fxtWrV7GysoJgMCgir/d/byM/r230RAo4m81ic3MT0WhUBFgFg0HY7XbYbDYhZGow\nGKDT6TA8PAy9Xo/Z2VmhUFtpX0CtdQhqGbTtiXbaqNVqRCIRkb/ogw8+wPLyMvx+v5CC7PUSZ8OZ\nv/a7otcB/AfO+d+wUnqdgBR7NM45PxB7JPuYaOH+zJkzYh8YpU5zOp0YGRnByMgI7Ha78KIC5ZVd\n636rOfMociKZTIpEhr/5zW/w5ptviuSxNBSt1tI6Ad5O+rV939H/QSlL/J9U+H4WwM8456crfMf3\n/wIAXC6XiIamUQulSFtYWMDJkyexsLCAkZERDA8Pl+Woa6aboH6dgn9Jn/P+/fu4e/culpeXsby8\njFgshlgsVrbI362WUI2EnkgB798AAIjYThmUGm1paUm4tj/3uc/BaDSWJRyXXR/y+9dff70sky8F\nb1FGsGAwiGAwiJs3b+JnP/sZIpEINjY2RPJAOWFUX0Zb1PAdNSwFrCiv7C/wMBTd6/XiypUrCAQC\nyGazcLvdopVQ/lOj0ViWYtNoNOLnP/85nnzySXDOyySAKaKaUmZubW3h+vXr0Gq1ZWkS2pmHdAI9\nkQJWolL3UigUsLOzg0gkgkKhgB/96Ecwm83CXjgcDqFcK78cDodIkcY5F7Lv9Nfj8Yh0OZRhWKVS\nlSWaOqwfOlVnAAALQ0lEQVTKJ/R8FaNSSwAgFvPJYba9vQ2DwSA2fNPCPxlfcoMbjUYUi0Xh9iZB\nbBLB3tzcFJrN8vyhnzCQAu4h2hodDdBdHMmNg/2GAQl9gIEK7UCF9sDxAxXaLmCgQtsABiq0AxXa\nAxio0HYBAxXabqrQ7t9EvZHPQIW2yyq0DY18MFChratC27LbgjH2HIBvc84/s///N0uc8u+2VOAR\nRjte1Eojn56oSz2q4F0SNup7qNVqse4wNTWFcDiMixcvCskueYWuEt555x2cPdvI+KDUtb/zzjs4\nd+4cvF4vPB6PSFoVDAarnvfISQE3C76/1pzP58WaBGV0IRJqgfbTNQraUzE8PAyXy4X19XVwzrtG\nQk+lgJuBvGJGYZBWqxUjIyPgnGN2drZMALsWEaurq5iZman6vQzOOVZXV0WEIecl7c16KZ4fGSng\nVkEBYA6HA2NjY5idncXCwgJyuVxZoFc1IrLZLObn5xu6Ft9fHTx+/LggIRKJYGurdgfxSEoBNwPG\nmBDNtlqteOaZZzA6OopcLidyY9RqCW5347nYOS8pn6vVauh0OnDOYbfbYTaba5732BtmGXJmYFIh\nVJKgjMBr9v+9vT2RvIpzLhLf1sKRIoECAywWy4GW0InFfxoE0D7s
vb09GI3GmtnNgDZJYA2o0PYa\ncoXy/cgM6h4oN4bdbheGuZPZXcQMmJWEUAuFAqxWa9e7oz0A53kNFdrDAHURVLlEgslkEiTQU9vp\nlkBdk0qlQi6XE5LDtfDIqtDWg9xXUyXTVi3qr4mobpBA128k3X+7FdjKIswACjyyKrSPEx4LFdp+\nxOuvv47XXntN5OyohcdChbYfcf78eXzrW9/Cl770JTz//PM1j32sVGgfVRwJFdp+R18OL48aBiT0\nAQYk9AGOpBRwv+FISQH3K46MFHA/o1Wb0EoQ7gBVcCSlgPsNrU7WGk7HeVTRDd/RkZQCbgfN+I4a\nGaK+BOA3AJ5gjG0wxr6KkhTwbzHG7gD45P7/A7SIIykF3G8YzJj7AAMS+gADEvoAhyIFPEA5ei4F\nPMBB9FwK+DDQqbiibuFQpIAPC/1KRM+lgI8KmnFb9FwK+Kjg/PnzeOGFF/Dqq6/itddew1tvvVX1\n2JZ8Ry1tmH54bqOHHhn0RAq4lxWvvBYFAh+2yGkt9EQKWN4v0Ow5raDSbhql4mE/oSc7dUhITqvV\nIp/PC500qhBZAVwOL28V8h4BrVYLi8UCp9OJoaGhAzJe/YCekGC32zE0NASr1Yp4PC4E66gSaOtS\noVAQWeWbJUJubbR1VqPRwGAwwOl0YnJyEi6Xq+7WpW6g3m/pGQlut1sI2u3u7iKdTouKI3G7bDZb\nJmbXSougMml/GmWjHx8fL9OC7hSq3SPtBKJN67XQipzL9znnf8oYcwD4IUpZXtYB/A6vop9gt9sx\nPj6O2dlZZDIZxOPxsj3EqVQKqVQK8XgcsVgMyWRStI79e6h3m6IyNBoNNBoNLBYLHA4HRkdHMT8/\nD5fLVSZop9xS1SlQuZRFIJVKIRqtWC0Crcq5vIJS2ppXOef/aT/X0R+higqtw+HAxMSE2JRNafLp\nhpPJJFKpFILBIHZ3dxEOh0W6ffnH1QLZFJIDcLlcmJycxOTkJGZnZ+F0OmEymTomK18LnD/MWp9O\npxGLxWoe3xMpYIPBIGS9aKQiP4WkLsVYSRXWbrcL2cdGQd2XTqcTxthms4lrV+remm0FtbpH+fdw\nzsUARM7/XQ09kQIm3U3qXihV/n6ZACA0OE0mkxgtyerj9SCPiJQq6NQvk9HvNKhMIoLEM4iAeg9T\nT6SAb9y4gbW1NVgsFszPz+PYsWNlTw6phdCQUq/Xt+X5ZIyJbC6tqNa2QhTZgb29PWSzWfzyl7/E\na6+9hnv37uH+/fs1z+2JFPCpU6eEBkI+n4fP50OxWBRdED29lI6glmBqA/cKAKI7UKlUcLvdQq2k\nXplKAioRoiyDujrqfhKJBBYWFmCz2XDp0iXk83lsbGxUvWZPpIAZY8jn80gmk0in00gkEmI3vexS\noD3GsouhFSLIMOZyObGznkiQpVuU5zTyGX2unJdQC4jH40LFxO/3lw0uqqEnUsCLi4tCqKhQKCCb\nzYrd9ACESEUqlUIikUAmkxE/rFWYzWaMjY3B6XRibm4OIyMjZeKqjaCSu0We5ZM+KM1vEokEYrEY\nIpEIQqEQwuGwkJephZ5IAS8uLmJmZkYkb6LhJBmxSCSCaDQKr9eLra0tBAIBMYxttH9WPplmsxmz\ns7OYnZ3F6OgoRkZGoNPphLywfE6jZcsg9RJSKSH5sEQigWg0KkigVy30ZMYcCoVgsVhgsVhEf6/T\n6WCxWITAHX0HlEZTJEwKtObjoVny+Pi4GKrKBFSD3L1Q10nan7JLhQhIpVJIp9NIpVJlNiEUCiEQ\nCCAYDCKVStW8Zk9IeP/99+Hz+eDxeESfb7fbMT09jfHxcahUKthsNpEgymazIZFIIB6PN2wXqHLI\nnjidTrjdblitViEpCTQ2U6aJlt/vh8fjQSAQEK4Vuk4mkxEEKFO5pdNpQQIpZdVCT1Rob9++jUAg\ngJ2dHUGC2+0Wrgur1QqbzQaz
2QyVSgWz2SxE6Gj0VA/09JLbgrK5UAugOUet88mg0wx+Y2MDd+7c\nwfb2tuhyaBiay+XEEJig1Wqh0+mQzWZFKwiHw2J+VA09UaGNRCLgnCOXy4lRD+mfeTwejI2NYWxs\nTEzWdDodrFZrU/MFZUvQarXY29sTXUGtMmhekc1mkUwm4fV64fV6sbm5CY/Hg2AwWObLAh4aZlkT\nlIihlkBdUb0HoCcqtOFwWIwegJJh1ul08Hg8sFqtOH78OObn5zEzM4OpqSm4XK6yLqQV0IybpL2q\njbSIZOrLg8Egbt++jTt37sDv9yMYDB6QB6OXPIyWBwZUTiAQaGhw0RMVWpvNhrGxMYyOjoomTynK\n9Ho9jEYj8vk8IpEIVCpVU7agGug6ysUiuUw53xFJRNKDYrPZhK2SW0ClWT4RQk8/GfFGh9jtuC0a\nDnsZHh7GwsICTpw4Ac65aMZUATRiisViSKVSZTagWSIqHd/IMFf2/+t0OkxMTJQRqISSBMYYVlZW\nkEqlxHmNomW3RTNhL2tra0ilUtja2sLi4iIWFhbEcI+cXTRZk5/edqDsJqodo+xmyJ7IwquVBFLJ\nQBNx8ryhEc+pjJ6o0I6OjmJxcRFPPvkkgJJcPBkr2fFFL7n7kJ/EdgMF5GEq/VW+p5apfMoJdA/U\ndSUSCWGMd3Z24PV6EYlE6hpjGT1Rod3e3sbw8DC2t7cBoMylXOmJ3dzcxNTUVFUylNja2sLkZC2p\nnvJ7mZqaqnjtSpVN9yJ/zjlHLBYTw1BSuU2n00gmk027zHuiQhsIBPDgwQMxXKTcodTsTSYTzGaz\nWJDZ3t7GwsKCcLgpHX1K0u7cuVMzn7U8C7558yaWlpYOtD5ab6CukVbFlpeXxYRMrtx0Oi3cFDS7\nz+fzYtLWcZvQLjKZDB48eACfzwegRAItxJtMJpFhnUJSKMk4iWITOeQJVfbTNputJglyRVssFuFI\npArP5XLCqVgsFpHJZJBMJhEOhxGNRnHv3j3hpCPQuUROLSNeDz0hoVgsIp1Oi6Ee5xxarVb0m7Tw\nIhtktVotXNs6nU68tFrtgdUzak2VINucfD4vcqTSuVTxRCoNbeUWQS54mQQyxs30/dUwkALuIfhA\nCrh/Mdg42AcYkNAHGEgBD6SAB1LAnA+kgOWyB1LA+xhIAfcBBlLAXcBACrjbUsANYCAF3E0p4CZG\nSAMp4Dp1NHBb9AH6zTAfSQxI6AMMSOgDDEjoAwxI6AMMSOgDDEjoAwxI6AP8f/wmsSMpuFJ2AAAA\nAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7fc81814e0d0>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"training_dir = 'notMNIST_large'\n", | |
"pickle_filenames = [f for f in os.listdir(training_dir) if 'pickle' in f]\n", | |
"# Pick one per-letter pickle at random and load its image array.\n", | |
"random_pickle = np.random.choice(pickle_filenames)\n", | |
"print(random_pickle)\n", | |
"with open(os.path.join(training_dir, random_pickle), 'rb') as pickle_file:\n", | |
"  data = pickle.load(pickle_file)\n", | |
"print(data.shape)\n", | |
"entries = data.shape[0]\n", | |
"subplots = 3\n", | |
"# Show a few randomly chosen images from that letter, stacked vertically.\n", | |
"for i in range(subplots):\n", | |
"  index = np.random.randint(entries)\n", | |
"  plt.subplot(subplots, 1, i+1)\n", | |
"  plt.imshow(data[index], cmap=cm.Greys_r)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "cYznx5jUwzoO" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 3\n", | |
"---------\n", | |
"Another check: we expect the data to be balanced across classes. Verify that.\n", | |
"\n", | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"G.pickle (52912, 28, 28)\n", | |
"E.pickle (52912, 28, 28)\n", | |
"C.pickle (52912, 28, 28)\n", | |
"H.pickle (52912, 28, 28)\n", | |
"B.pickle (52911, 28, 28)\n", | |
"A.pickle (52909, 28, 28)\n", | |
"I.pickle (52912, 28, 28)\n", | |
"F.pickle (52912, 28, 28)\n", | |
"D.pickle (52911, 28, 28)\n", | |
"J.pickle (52911, 28, 28)\n" | |
] | |
} | |
], | |
"source": [ | |
"# Print the number of samples in each per-letter pickle; similar counts mean balanced classes.\n", | |
"training_dir = 'notMNIST_large'\n", | |
"for filename in os.listdir(training_dir):\n", | |
"  if 'pickle' not in filename:\n", | |
"    continue\n", | |
"  with open(os.path.join(training_dir, filename), 'rb') as pickle_file:\n", | |
"    data = pickle.load(pickle_file)\n", | |
"  print(filename, data.shape)" | |
] | |
}, | |
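The balance check above can also be expressed as a single number. The following is a minimal sketch, assuming the per-class counts have been collected into a dict; the values in `class_counts` are copied from the output printed above, and `balance_report` is a hypothetical helper name, not part of the course code.

```python
import numpy as np

# Per-class sample counts, copied from the output of the cell above.
class_counts = {
    'A': 52909, 'B': 52911, 'C': 52912, 'D': 52911, 'E': 52912,
    'F': 52912, 'G': 52912, 'H': 52912, 'I': 52912, 'J': 52911,
}


def balance_report(counts):
    """Return (mean count, largest relative deviation from the mean)."""
    values = np.array(list(counts.values()), dtype=np.float64)
    mean = values.mean()
    return mean, np.abs(values - mean).max() / mean


mean_count, max_rel_dev = balance_report(class_counts)
print('mean samples per class: %.1f' % mean_count)
print('largest relative deviation: %.6f' % max_rel_dev)
```

A largest relative deviation far below one percent would suggest the classes are balanced for practical purposes; the exact threshold is a judgment call.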
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "LA7M7K22ynCt" | |
}, | |
"source": [ | |
"Merge and prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune `train_size` as needed. The labels will be stored into a separate array of integers 0 through 9.\n", | |
"\n", | |
"Also create a validation dataset for hyperparameter tuning." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"output_extras": [ | |
{ | |
"item_id": 1 | |
} | |
] | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"executionInfo": { | |
"elapsed": 411281, | |
"status": "ok", | |
"timestamp": 1444485897869, | |
"user": { | |
"color": "#1FA15D", | |
"displayName": "Vincent Vanhoucke", | |
"isAnonymous": false, | |
"isMe": true, | |
"permissionId": "05076109866853157986", | |
"photoUrl": "//lh6.googleusercontent.com/-cCJa7dTDcgQ/AAAAAAAAAAI/AAAAAAAACgw/r2EZ_8oYer4/s50-c-k-no/photo.jpg", | |
"sessionId": "2a0a5e044bb03b66", | |
"userId": "102167687554210253930" | |
}, | |
"user_tz": 420 | |
}, | |
"id": "s3mWgZLpyuzq", | |
"outputId": "8af66da6-902d-4719-bedc-7c9fb7ae7948" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Training: (200000, 28, 28) (200000,)\n", | |
"Validation: (10000, 28, 28) (10000,)\n", | |
"Testing: (10000, 28, 28) (10000,)\n" | |
] | |
} | |
], | |
"source": [ | |
"def make_arrays(nb_rows, img_size):\n", | |
"  if nb_rows:\n", | |
"    dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)\n", | |
"    labels = np.ndarray(nb_rows, dtype=np.int32)\n", | |
"  else:\n", | |
"    dataset, labels = None, None\n", | |
"  return dataset, labels\n", | |
"\n", | |
"def merge_datasets(pickle_files, train_size, valid_size=0):\n", | |
"  num_classes = len(pickle_files)\n", | |
"  valid_dataset, valid_labels = make_arrays(valid_size, image_size)\n", | |
"  train_dataset, train_labels = make_arrays(train_size, image_size)\n", | |
"  vsize_per_class = valid_size // num_classes\n", | |
"  tsize_per_class = train_size // num_classes\n", | |
"\n", | |
"  start_v, start_t = 0, 0\n", | |
"  end_v, end_t = vsize_per_class, tsize_per_class\n", | |
"  end_l = vsize_per_class+tsize_per_class\n", | |
"  for label, pickle_file in enumerate(pickle_files):\n", | |
"    try:\n", | |
"      with open(pickle_file, 'rb') as f:\n", | |
"        letter_set = pickle.load(f)\n", | |
"        # let's shuffle the letters to have random validation and training set\n", | |
"        np.random.shuffle(letter_set)\n", | |
"        if valid_dataset is not None:\n", | |
"          valid_letter = letter_set[:vsize_per_class, :, :]\n", | |
"          valid_dataset[start_v:end_v, :, :] = valid_letter\n", | |
"          valid_labels[start_v:end_v] = label\n", | |
"          start_v += vsize_per_class\n", | |
"          end_v += vsize_per_class\n", | |
"\n", | |
"        train_letter = letter_set[vsize_per_class:end_l, :, :]\n", | |
"        train_dataset[start_t:end_t, :, :] = train_letter\n", | |
"        train_labels[start_t:end_t] = label\n", | |
"        start_t += tsize_per_class\n", | |
"        end_t += tsize_per_class\n", | |
"    except Exception as e:\n", | |
"      print('Unable to process data from', pickle_file, ':', e)\n", | |
"      raise\n", | |
"\n", | |
"  return valid_dataset, valid_labels, train_dataset, train_labels\n", | |
"\n", | |
"\n", | |
"train_size = 200000\n", | |
"valid_size = 10000\n", | |
"test_size = 10000\n", | |
"\n", | |
"valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(\n", | |
"  train_datasets, train_size, valid_size)\n", | |
"_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)\n", | |
"\n", | |
"print('Training:', train_dataset.shape, train_labels.shape)\n", | |
"print('Validation:', valid_dataset.shape, valid_labels.shape)\n", | |
"print('Testing:', test_dataset.shape, test_labels.shape)" | |
] | |
}, | |
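The per-class slicing in the merge step above can be tried on toy data. This is a minimal sketch, assuming in-memory arrays instead of pickle files; `split_per_class` and `toy_classes` are hypothetical names, and the function builds its outputs with `np.concatenate` rather than preallocated arrays, so it illustrates the idea and is not a drop-in replacement for `merge_datasets`.

```python
import numpy as np


def split_per_class(class_arrays, train_size, valid_size):
    """Take equal-sized validation and training slices from each class array."""
    num_classes = len(class_arrays)
    v_per = valid_size // num_classes
    t_per = train_size // num_classes
    valid, valid_y, train, train_y = [], [], [], []
    for label, samples in enumerate(class_arrays):
        # Shuffled copy along the first axis; the input array is left untouched.
        samples = np.random.permutation(samples)
        valid.append(samples[:v_per])
        valid_y.append(np.full(v_per, label, dtype=np.int32))
        # Training samples start where the validation slice ends, so the
        # two splits never share a sample from this class array.
        train.append(samples[v_per:v_per + t_per])
        train_y.append(np.full(t_per, label, dtype=np.int32))
    return (np.concatenate(valid), np.concatenate(valid_y),
            np.concatenate(train), np.concatenate(train_y))


# Toy stand-ins for the per-letter arrays: 3 classes of 10 tiny 2x2 "images",
# where every pixel of class k holds the value k.
toy_classes = [np.full((10, 2, 2), k, dtype=np.float32) for k in range(3)]
v_x, v_y, t_x, t_y = split_per_class(toy_classes, train_size=12, valid_size=6)
print(v_x.shape, t_x.shape, np.bincount(v_y), np.bincount(t_y))
```

Because the validation and training slices are taken from disjoint index ranges of the same shuffled array, each class contributes the same number of samples to each split.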
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "GPTCnjIcyuKN" | |
}, | |
"source": [ | |
"Next, we'll randomize the data. It's important to have the labels well shuffled for the training and test distributions to match." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"id": "6WZ2l2tN2zOL" | |
}, | |
"outputs": [], | |
"source": [ | |
"def randomize(dataset, labels):\n", | |
"  permutation = np.random.permutation(labels.shape[0])\n", | |
"  shuffled_dataset = dataset[permutation,:,:]\n", | |
"  shuffled_labels = labels[permutation]\n", | |
"  return shuffled_dataset, shuffled_labels\n", | |
"train_dataset, train_labels = randomize(train_dataset, train_labels)\n", | |
"test_dataset, test_labels = randomize(test_dataset, test_labels)\n", | |
"valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "puDUTe6t6USl" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 4\n", | |
"---------\n", | |
"Convince yourself that the data is still good after shuffling!\n", | |
"\n", | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"G\n", | |
"B\n", | |
"J\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7fc7e962b790>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"train_samples = train_dataset.shape[0]\n", | |
"letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']\n", | |
"subplots = 3\n", | |
"# For a few random samples, print the label and plot the image;\n", | |
"# the printed letter should match the picture.\n", | |
"for i in range(subplots):\n", | |
"  index = np.random.randint(train_samples)\n", | |
"  print(letters[train_labels[index]])\n", | |
"  plt.subplot(subplots, 1, i+1)\n", | |
"  plt.imshow(train_dataset[index], cmap=cm.Greys_r)" | |
] | |
}, | |
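Eyeballing a few images is one check; the pairing can also be tested programmatically on synthetic data. This is a minimal sketch, assuming toy images whose pixel values encode their own label; `shuffle_in_unison` is a hypothetical name that mirrors the logic of `randomize` so the example runs on its own.

```python
import numpy as np


def shuffle_in_unison(dataset, labels):
    # Same idea as randomize(): one permutation indexes both arrays,
    # so image i and label i move to the same new position.
    permutation = np.random.permutation(labels.shape[0])
    return dataset[permutation, :, :], labels[permutation]


# Toy data: every pixel of an image equals its label, so alignment is checkable.
toy_labels = np.arange(10, dtype=np.int32).repeat(5)
toy_dataset = np.ones((50, 28, 28), dtype=np.float32) * toy_labels[:, None, None]

shuffled_dataset, shuffled_labels = shuffle_in_unison(toy_dataset, toy_labels)

# Each shuffled image must still carry the value of its shuffled label.
pairs_intact = bool(np.all(shuffled_dataset[:, 0, 0] == shuffled_labels))
# Shuffling must not change how many samples each class has.
counts_intact = np.array_equal(np.bincount(toy_labels),
                               np.bincount(shuffled_labels))
print(pairs_intact, counts_intact)
```

On the real data the pixel trick is not available, but comparing `np.bincount(train_labels)` before and after shuffling is a cheap sanity check in the same spirit.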
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "tIQJaJuwg5Hw" | |
}, | |
"source": [ | |
"Finally, let's save the data for later reuse:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"id": "QiR_rETzem6C" | |
}, | |
"outputs": [], | |
"source": [ | |
"pickle_filename = 'notMNIST.pickle'\n", | |
"\n", | |
"try:\n", | |
"  save = {\n", | |
"    'train_dataset': train_dataset,\n", | |
"    'train_labels': train_labels,\n", | |
"    'valid_dataset': valid_dataset,\n", | |
"    'valid_labels': valid_labels,\n", | |
"    'test_dataset': test_dataset,\n", | |
"    'test_labels': test_labels,\n", | |
"  }\n", | |
"  # The context manager closes the file even if pickle.dump raises.\n", | |
"  with open(pickle_filename, 'wb') as f:\n", | |
"    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)\n", | |
"except Exception as e:\n", | |
"  print('Unable to save data to', pickle_filename, ':', e)\n", | |
"  raise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"cellView": "both", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"output_extras": [ | |
{ | |
"item_id": 1 | |
} | |
] | |
}, | |
"colab_type": "code", | |
"collapsed": false, | |
"executionInfo": { | |
"elapsed": 413065, | |
"status": "ok", | |
"timestamp": 1444485899688, | |
"user": { | |
"color": "#1FA15D", | |
"displayName": "Vincent Vanhoucke", | |
"isAnonymous": false, | |
"isMe": true, | |
"permissionId": "05076109866853157986", | |
"photoUrl": "//lh6.googleusercontent.com/-cCJa7dTDcgQ/AAAAAAAAAAI/AAAAAAAACgw/r2EZ_8oYer4/s50-c-k-no/photo.jpg", | |
"sessionId": "2a0a5e044bb03b66", | |
"userId": "102167687554210253930" | |
}, | |
"user_tz": 420 | |
}, | |
"id": "hQbLjrW_iT39", | |
"outputId": "b440efc6-5ee1-4cbc-d02d-93db44ebd956" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Compressed pickle size: 690800441\n" | |
] | |
} | |
], | |
"source": [ | |
"statinfo = os.stat(pickle_filename)\n", | |
"print('Compressed pickle size:', statinfo.st_size)" | |
] | |
}, | |
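{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick sanity check (a minimal sketch, not part of the original assignment), the pickle can be read back to confirm that every saved array comes out with the expected shape. The name `restored` is introduced here for illustration only." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Sketch: reload the pickle written above and report the shape of each array.\n", | |
"# Assumes pickle_filename and the pickle module from the earlier cells.\n", | |
"with open(pickle_filename, 'rb') as f:\n", | |
"  restored = pickle.load(f)\n", | |
"for name in sorted(restored):\n", | |
"  print(name, restored[name].shape)" | |
] | |
}, | |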
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "gE_cRAQB33lk" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 5\n", | |
"---------\n", | |
"\n", | |
"By construction, this dataset might contain a lot of overlapping samples, including training data that's also contained in the validation and test sets! Overlap between training and test data can skew the results if you expect to use your model in an environment where there is never an overlap, but is actually fine if you expect to see training samples recur when you use it.\n", | |
"Measure how much overlap there is between training, validation and test samples.\n", | |
"\n", | |
"Optional questions:\n", | |
"- What about near duplicates between datasets? (images that are almost identical)\n", | |
"- Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.\n", | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"932 65 1189\n" | |
] | |
} | |
], | |
"source": [ | |
"import hashlib\n", | |
"train_hashes = [hashlib.sha1(image).digest() for image in train_dataset]\n", | |
"valid_hashes = [hashlib.sha1(image).digest() for image in valid_dataset]\n", | |
"test_hashes = [hashlib.sha1(image).digest() for image in test_dataset]\n", | |
"\n", | |
"valid_in_train = np.intersect1d(train_hashes, valid_hashes)\n", | |
"valid_in_test = np.intersect1d(test_hashes, valid_hashes)\n", | |
"test_in_train = np.intersect1d(test_hashes, train_hashes)\n", | |
"\n", | |
"print(len(valid_in_train), len(valid_in_test), len(test_in_train))" | |
] | |
}, | |
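{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One possible way to build the sanitized validation and test sets mentioned in the optional questions is sketched below. It assumes the hash lists from the previous cell and only removes exact duplicates of training images; near duplicates are not handled. The names `train_hash_set`, `valid_keep`, `test_keep` and the `*_clean` arrays are hypothetical and introduced only for this example." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Sketch: keep only validation/test images whose hash is absent from the training set.\n", | |
"train_hash_set = set(train_hashes)\n", | |
"valid_keep = np.array([h not in train_hash_set for h in valid_hashes])\n", | |
"test_keep = np.array([h not in train_hash_set for h in test_hashes])\n", | |
"\n", | |
"# Boolean masks select the surviving images and their labels together.\n", | |
"valid_dataset_clean = valid_dataset[valid_keep]\n", | |
"valid_labels_clean = valid_labels[valid_keep]\n", | |
"test_dataset_clean = test_dataset[test_keep]\n", | |
"test_labels_clean = test_labels[test_keep]\n", | |
"\n", | |
"print(valid_dataset_clean.shape, test_dataset_clean.shape)" | |
] | |
}, | |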
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "L8oww1s4JMQx" | |
}, | |
"source": [ | |
"---\n", | |
"Problem 6\n", | |
"---------\n", | |
"\n", | |
"Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.\n", | |
"\n", | |
"Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.\n", | |
"\n", | |
"Optional question: train an off-the-shelf model on all the data!\n", | |
"\n", | |
"---" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"50 -> 54.00% correct\n", | |
"100 -> 64.00% correct\n", | |
"1000 -> 76.00% correct\n", | |
"5000 -> 77.00% correct\n" | |
] | |
} | |
], | |
"source": [ | |
"from sklearn.linear_model import LogisticRegression\n", | |
"\n", | |
"samples = [50, 100, 1000, 5000]\n", | |
"models = []\n", | |
"for datapoints in samples:\n", | |
"  model = LogisticRegression()\n", | |
"  dataset = train_dataset[:datapoints]\n", | |
"  dataset = dataset.reshape(datapoints, -1)\n", | |
"  labels = train_labels[:datapoints]\n", | |
"  model.fit(dataset, labels)\n", | |
"  models.append(model)\n", | |
"\n", | |
"  len_valid_dataset = valid_dataset.shape[0]\n", | |
"  validation_set = valid_dataset.reshape(len_valid_dataset, -1)\n", | |
"  prediction = model.predict(validation_set)\n", | |
"  # Multiply by 100.0 so integer division under Python 2 does not truncate the percentage.\n", | |
"  print('%s -> %0.2f%% correct' % (datapoints, 100.0 * sum(valid_labels == prediction) / len_valid_dataset))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"84.00% correct\n" | |
] | |
} | |
], | |
"source": [ | |
"# And now run the test set with the model trained on 5000 samples.\n", | |
"best_model = models[3]\n", | |
"len_test_set = test_dataset.shape[0]\n", | |
"test_set = test_dataset.reshape(len_test_set, -1)\n", | |
"prediction = best_model.predict(test_set)\n", | |
"# Multiply by 100.0 so integer division under Python 2 does not truncate the percentage.\n", | |
"print('%0.2f%% correct' % (100.0 * sum(test_labels == prediction) / len_test_set))" | |
] | |
} | |
], | |
"metadata": { | |
"colab": { | |
"default_view": {}, | |
"name": "1_notmnist.ipynb", | |
"provenance": [], | |
"version": "0.3.2", | |
"views": {} | |
}, | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |