Last active
December 21, 2015 06:29
Language-agnostic instructions for Digits Recognizer machine learning dojo
This dojo is directly inspired by the Digit Recognizer competition from Kaggle.com:
http://www.kaggle.com/c/digit-recognizer
The datasets below are simply shorter versions of the training dataset from Kaggle.

The dataset
*************
2 datasets can be downloaded here:
1) a training set of 5,000 examples
http://brandewinder.blob.core.windows.net/public/trainingsample.csv
2) a validation set of 500 examples; this dataset is supplied so that
you can evaluate the performance of your classification model on
"fresh" data that hasn't been used to construct the classifier.
http://brandewinder.blob.core.windows.net/public/validationsample.csv
The files are CSV files; the first line contains column labels, and
each subsequent row represents a scanned hand-written digit:
* the first element is the actual digit (0 to 9)
* the next 784 elements are the 28 x 28 pixels of the image, flattened
into a single vector. Each pixel is gray scale, from 0 (pure black) to 255 (pure white).
The full training dataset (42,000 examples) is available at http://www.kaggle.com/c/digit-recognizer.
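
As an illustration of the file layout described above, here is a minimal sketch in Python. The dojo itself is language-agnostic, so the language is an assumption, and the function name read_examples and the tiny 3-pixel sample are hypothetical. It skips the header line, then reads each row as a label followed by its pixel values.

```python
import csv
import io

def read_examples(csv_file):
    """Parse rows of 'label, pixel, pixel, ...' into (label, pixels) pairs.

    read_examples is a hypothetical helper name, not part of the dojo.
    """
    reader = csv.reader(csv_file)
    next(reader)  # skip the first line, which holds the column labels
    examples = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        label = int(row[0])                   # first element: the actual digit
        pixels = [int(v) for v in row[1:]]    # remaining elements: the pixels
        examples.append((label, pixels))
    return examples

# Tiny in-memory stand-in for the real file (3 pixels per row instead of 784).
sample = io.StringIO("label,pixel0,pixel1,pixel2\n7,0,255,128\n3,12,0,0\n")
examples = read_examples(sample)
print(examples)
```

With the downloaded file, the same function could be called as `with open("trainingsample.csv", newline="") as f: training = read_examples(f)`, assuming the file was saved under that name in the working directory.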

Naive KNN (K Nearest Neighbors) algorithm
********************************************
Naive KNN in Pseudo Code:
  Given a Target = an image with unknown Label (28x28 Pixels),
  Given a Training Set of Examples = images with known Label (Label: Actual Digit, Observation: 28x28 Pixels),
  Given a Distance function between Observations, measuring how "similar" two images are,
    For every Example, compute Distance between Example and Target,
    Find the Neighbors of the Target = the K Examples with smallest distance to Target,
    Return the most frequent Label among the Neighbors
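
The pseudo code above could be sketched in Python as follows. This is one possible reading, not a reference implementation: the names squared_distance and knn_classify, the default k=3, and the 2-pixel toy data are all assumptions made for illustration. Examples are (label, pixels) pairs.

```python
from collections import Counter

def squared_distance(xs, ys):
    """Sum of squared differences; ranks neighbors the same as Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))

def knn_classify(training_set, target, k=3, distance=squared_distance):
    """Return the most frequent label among the k examples closest to target."""
    # Compute the distance to every example and keep the k nearest.
    neighbors = sorted(training_set, key=lambda ex: distance(ex[1], target))[:k]
    # Count the labels of the neighbors and return the most frequent one.
    votes = Counter(label for label, _ in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: 2 "pixels" per image instead of 784 (hypothetical values).
training_set = [(0, [0, 0]), (0, [1, 0]), (1, [9, 9]), (1, [10, 10]), (1, [9, 10])]
print(knn_classify(training_set, [8, 9], k=3))
```

Sorting the whole training set is the "naive" part: it is simple, but it costs one distance computation per example for every image to classify. Passing the distance as a parameter makes it easy to swap in other similarity measures later.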

Simplest thing that could possibly work: 1-Nearest Neighbor ("Closest Neighbor")
  From the training set,
    Find the closest example to the Target,
    Return its Label

Suggested plan of attack
**************************
- Read the training set into a collection of "examples", each with a Label (the actual digit) and its pixels.
- Start with the Euclidean Distance to measure similarity between images:
    X = [ x1; x2; .. xn ]
    Y = [ y1; y2; .. yn ]
    Dist(X, Y) = sqrt((x1-y1)^2 + (x2-y2)^2 + .. + (xn-yn)^2)
  (the square root does not change which example is closest, so it can be dropped when only comparing distances)
- Build a "Closest Neighbor" classifier, a function which, given an unlabeled image as input,
  returns the Label of the closest neighbor from the Training Set
- Check what proportion of the Validation Set gets properly classified using that model
... now find ways to improve the model :)
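
The last steps of the plan can be sketched in Python as below. This is a sketch under stated assumptions: the function names euclidean_distance, closest_neighbor_label and accuracy are hypothetical, and the 2-pixel toy data stands in for the real training and validation sets, both represented as lists of (label, pixels) pairs.

```python
import math

def euclidean_distance(xs, ys):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def closest_neighbor_label(training_set, target):
    """1-Nearest Neighbor: return the label of the training example closest to target."""
    label, _ = min(training_set, key=lambda ex: euclidean_distance(ex[1], target))
    return label

def accuracy(training_set, validation_set):
    """Proportion of validation examples whose predicted label matches the actual one."""
    correct = sum(
        1
        for label, pixels in validation_set
        if closest_neighbor_label(training_set, pixels) == label
    )
    return correct / len(validation_set)

# Toy data (hypothetical values): 2 "pixels" per image instead of 784.
training_set = [(0, [0, 0]), (1, [10, 10])]
validation_set = [(0, [1, 1]), (1, [9, 8]), (1, [2, 2]), (0, [0, 3])]
print(accuracy(training_set, validation_set))  # 3 of 4 correct -> 0.75
```

Keeping the accuracy check separate from the classifier means the same measurement can be reused unchanged when trying out improvements, such as a different distance or a different K.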

More information
******************
More on the KNN algorithm:
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Discussion on algorithm approaches and improvements:
http://www.kaggle.com/c/digit-recognizer
Slides:
http://www.slideshare.net/mathias-brandewinder/fsharp-and-machine-learning-dojo