This is an overview of the inner workings of Detexify. Extended knowledge of pattern recognition or machine learning is not necessary, as I will explain some basics, but an understanding of linear algebra will definitely help. I have to note that I am not an expert either; I more or less stumbled into this field because of this project. Experts in this field may safely skip the first section.
The specific problem in this case is "What's the LaTeX code for this symbol?", but we can generalise it. We have a bunch of objects that can be partitioned into k categories. Suppose we look at a certain object and ask "Which category does this object belong to?". While human beings are quite good at answering this kind of question, computers find it rather difficult.
The first thing to do is to give the computer some understanding of the objects that should be classified. This is done by mapping the objects into the so-called feature space. A feature is some property of the object that can be quantified - for example, if the objects are people I may choose their age in years as one feature. It is often possible to choose the n-dimensional rational vector space as the feature space. fig.1 shows a two-dimensional feature space with objects of categories A, B and C mapped into it. Classifying a new object then means deciding which category is nearest. This of course means that the computer must know some samples of each category (called training data) to base its decision on. (This is called statistical pattern recognition - there are other approaches, but they will not be described here as Detexify is based on the statistical one.)
fig.1
A x x
x x x B
x x x
x x x
x x
x x
x x x x
C X x x
A* x
x
x x
The objects in Detexify are symbols, but what the computer gets as raw data are strokes of ink (polylines). These look basically like [[{x:3,y:5,timestamp:123}, {x:10,y:45,timestamp:125}]] - an Array of Strokes, where each Stroke is an Array of Hashes containing a position and a timestamp. From these strokes a set of features is extracted (called the feature vector, described later). The feature space is the n-dimensional rational vector space (n=15 at the time of writing).
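As a small sketch, the raw data could be represented like this in Python (Detexify itself is not written in Python; the field names follow the example above):

```python
# Raw strokes as described above: an array of strokes, each stroke an
# array of hashes with position and timestamp.
strokes = [
    [  # one stroke: a polyline of timestamped points
        {"x": 3,  "y": 5,  "timestamp": 123},
        {"x": 10, "y": 45, "timestamp": 125},
    ],
]

def stroke_count(strokes):
    # The simplest feature: how many strokes the symbol consists of.
    return len(strokes)

print(stroke_count(strokes))  # prints 1
```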
The classification in Detexify is implemented with the k-nearest-neighbour algorithm (almost). To classify an unknown pattern, the training data is first sorted by Euclidean distance in the feature space to this pattern. Usually the k nearest neighbours are then examined and the pattern is assigned the class most common among them, but since I want 5 entries in the hit list of symbols, nearest neighbours are examined until symbols of 5 different classes are found. The score of a symbol (class) is then its number of occurrences among these nearest neighbours.
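A minimal Python sketch of this variant, under one plausible reading of the stopping rule (stop as soon as the 5th distinct class appears; Detexify itself is not written in Python):

```python
import math

def classify(unknown, training):
    """Rank training samples (feature_vector, label) by Euclidean
    distance to the unknown pattern, then walk through the neighbours
    until 5 different classes have appeared. The score of a class is
    its number of occurrences among the neighbours examined."""
    ranked = sorted(training, key=lambda sample: math.dist(unknown, sample[0]))
    scores = {}
    for _, label in ranked:
        scores[label] = scores.get(label, 0) + 1
        if len(scores) == 5:
            break
    # hit list: classes ordered by score, best first
    return sorted(scores.items(), key=lambda item: -item[1])
```

Because the scores are raw neighbour counts, a class with many training samples close to the unknown pattern naturally ranks at the top of the hit list.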
The chosen features are an important aspect of pattern recognition. When reading papers about handwritten character recognition I realized that there does not seem to be any consensus about what the best feature set for a given alphabet of characters is. Everyone just develops an ad-hoc solution. The requirements on the feature set are pretty clear, of course: it must separate different classes sufficiently and abstract from intra-class variation.
My solution is a potpourri of approaches I found in different papers, especially those on Chinese and mathematical symbol recognition.
Before extracting any features I normalize (preprocess) the strokes in two ways. First, the strokes are scaled and translated to fit into and be centered in [0,1]x[0,1]. Next, the points are redistributed. The browser creates strokes that are homogeneous in time, firing a mousemove event every few milliseconds, but I want strokes that are homogeneous in space, so that neighbouring points are equidistant.
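Both preprocessing steps could look roughly like this in Python (a sketch only: the point count n is my assumption, the text does not fix one, and Detexify itself is not written in Python):

```python
import math

def scale_and_center(strokes):
    """Scale all strokes of a symbol uniformly and translate them so the
    drawing fits into and is centered in [0,1]x[0,1]."""
    points = [p for stroke in strokes for p in stroke]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    s = 1.0 / max(w, h, 1e-9)              # the longer side gets length 1
    ox = (1 - w * s) / 2 - min(xs) * s     # offsets that center the box
    oy = (1 - h * s) / 2 - min(ys) * s
    return [[(x * s + ox, y * s + oy) for x, y in stroke] for stroke in strokes]

def resample(stroke, n=24):
    """Redistribute the points of one stroke so that neighbouring points
    are equidistant along the polyline (n is an assumed parameter)."""
    segments = list(zip(stroke, stroke[1:]))
    lengths = [math.dist(p, q) for p, q in segments]
    total = sum(lengths)
    if total == 0:
        return [stroke[0]] * n
    out = []
    for k in range(n):
        target = total * k / (n - 1)       # arc length of the k-th sample
        acc = 0.0
        for (p, q), length in zip(segments, lengths):
            if length > 0 and acc + length >= target:
                t = (target - acc) / length
                out.append((p[0] + t * (q[0] - p[0]),
                            p[1] + t * (q[1] - p[1])))
                break
            acc += length
        else:
            out.append(stroke[-1])         # guard against rounding error
    return out
```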
The features I use (at the time of writing) are the number of strokes, point density and directional features. Point density simply means that I count how many points of the strokes fall into a certain area of [0,1]x[0,1] - for example [0,1]x[0,0.2]. The directional features I use are best described with the help of some ASCII art:
fig.2
NW N NE | x --> x |
\|/ | v ^ w | v -> NE
W -o- E | / |
/|\ | x | w -> E
SW S SE | \ |
For each pair of succeeding points in a stroke, the vector between them is assigned one of N, NE, E, SE, S, SW, W and NW. This yields a histogram of orientations, each entry containing the number of stroke segments between two succeeding points heading in a specific direction.
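A small Python sketch of both histogram features (point density and direction), assuming mathematical y-up coordinates after normalization (flip the sign of dy for screen coordinates; Detexify itself is not written in Python):

```python
import math

DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def direction_histogram(strokes):
    """Each segment between two succeeding points votes for one of the
    8 compass directions; the result is a histogram of orientations."""
    hist = {d: 0 for d in DIRECTIONS}
    for stroke in strokes:
        for (x1, y1), (x2, y2) in zip(stroke, stroke[1:]):
            angle = math.atan2(y2 - y1, x2 - x1)       # -pi..pi, 0 = east
            octant = round(angle / (math.pi / 4)) % 8  # nearest of 8 sectors
            hist[DIRECTIONS[octant]] += 1
    return hist

def point_density(strokes, ymin=0.0, ymax=0.2):
    """Fraction of all points falling into the band [0,1]x[ymin,ymax]
    of the normalized unit square (the example region from the text)."""
    points = [p for stroke in strokes for p in stroke]
    inside = sum(1 for _, y in points if ymin <= y <= ymax)
    return inside / len(points)
```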
All the features extracted are numeric values and thus can be concatenated into a rational vector - this is the feature vector.
Handwritten character recognition bears inherent difficulties. There is a large variety in different writers' output, in things such as stroke number and stroke order, but also in many other respects. Another difficulty is the sheer number of LaTeX symbols (the comprehensive list contains over 4000). I chose a k-nearest-neighbours classifier because I thought it would accommodate some of this variety. To understand why, please look at fig.1 again. Class A might be any symbol, but there are two predominant ways to write it (A and A*), which results in a distribution in the feature space like that in fig.1. If training samples of both ways are available, the classifier should be able to recognize the correct class anyway. The chosen features are all histogram features, which abstract from stroke order. This combination works pretty well so far - better than I expected, actually.
Of course there is a lot of room for improvement. The training samples are saved as full strokes, so a change in the features used is possible at any moment. The vast amount of training data you have already provided also makes a good test set for different classifiers: a few samples could be used as test samples while the rest remains as training data. A new classifier can then be run on the test samples and its output compared with their real classes.
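Such a hold-out evaluation could be sketched like this (the function names and the test fraction are my assumptions, not part of Detexify):

```python
import random

def holdout_accuracy(samples, classify, test_fraction=0.2, seed=0):
    """Split labelled samples (feature_vector, label) into a test set
    and a training set, then report how often the classifier's answer
    on a test sample matches its real class."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    test, train = shuffled[:cut], shuffled[cut:]
    correct = sum(1 for vector, label in test if classify(vector, train) == label)
    return correct / len(test)
```

Running this with the same split for every candidate classifier makes their accuracies directly comparable.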
All in all I think Detexify is a solid base to improve on.
Does Detexify provide an interface to the mouse movements?
I mean conversion of a mouse-drawn curve to a list of coordinate pairs.
My program in Haskell operates with curves represented as lists of coordinate pairs.
But it needs an interface of
mouse-drawn curve <--> list of coordinate pairs.
For now I use a Python script for this, but it would be more natural to do everything within Haskell.