Combinatorial question: how many segmentations of a word of length n are there?
Intuition:
helloworld
h|elloworld
he|lloworld
...
h|e|lloworld
...
h|e|l|l|o|w|o|r|l|d
You can place a single bar in any of the n - 1 gaps between characters (n - 1 choose 1), two bars in n - 1 choose 2 ways, and so forth. The total is the number of k-combinations of the gaps, summed over all k.
Another way to look at it: we have n - 1 slots to place bars into. For each slot, we can independently decide whether it stays empty or not, so the total number of segmentations amounts to 2 ^ (n - 1).
You can also view the space as the set of all binary numbers of length n - 1.
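To make the count concrete, here is a small sketch (the all_segmentations helper is hypothetical, not part of the text) that enumerates every segmentation of a short word and checks the result against 2 ^ (n - 1):

def all_segmentations(word):
    # Keep the first i + 1 characters as one piece, then recursively
    # segment whatever remains; the empty word has one (empty) segmentation.
    if not word:
        return [[]]
    return [[word[:i + 1]] + tail
            for i in range(len(word))
            for tail in all_segmentations(word[i + 1:])]

assert len(all_segmentations("hello")) == 2 ** (len("hello") - 1)  # 16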
The number of segmentations is exponential in the word length, so a plain brute force method does not scale.
We can't be too greedy here, for example:
lovesnails
should not automatically be resolved to
love snails
but
loves nails
needs to be considered, too.
One can use a metric (which is introduced later) to find the most probable segmentation. The computation is simple, fewer than ten lines:
def split_pairs(word):
    # All ways to split a word into a non-empty prefix and the remaining suffix.
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    # Return the segmentation that scores best under word_seq_fitness.
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)
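For illustration only, here is one way segment could be called with a toy fitness function (the word list and the scoring are made up and are not the metric introduced later):

# Toy fitness: only segmentations whose pieces are all in a small word
# list are acceptable; among those, prefer fewer pieces.
KNOWN = {"hello", "world", "love", "loves", "snails", "nails"}

def toy_fitness(words):
    if not all(w in KNOWN for w in words):
        return float("-inf")
    return -len(words)

print(segment("helloworld", toy_fitness))  # ['hello', 'world']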
The problem now shifts to the question of which strings are actual words. Choosing a plain dictionary list might end up with a poor model, since a lot of real-world strings do not make it into the dictionary; a string like bbconline, for example, could be problematic.
A solution is to create a probabilistic model of the English language.
In the simplest case, we need a list of words and associated counts, representing each word's frequency in some corpus. The task is then to calculate the likelihood that a given sequence of words has been drawn from the distribution that generated this corpus (and whose word frequencies we know).
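A minimal sketch of such a unigram model, assuming a plain-text file with one word and its count per line (the file name and format are assumptions, not from the text):

from collections import Counter

def load_counts(path):
    # Expects lines of the form "word<TAB>count"; adjust to the actual corpus format.
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            word, count = line.split("\t")
            counts[word.strip()] = int(count)
    return counts

counts = load_counts("word_counts.tsv")  # hypothetical corpus file
total = sum(counts.values())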
Determine the probability of each word in the candidate segmentation and multiply these probabilities. This is Bayes' rule, with the assumption that each event (the occurrence of a word) is independent of the others.
The independence assumption is strong and quite wrong; words are not independent of each other at all. (Is this alone enough to call it a bag-of-words model?)
This is also called naive Bayes, and it is called naive because it makes the wrong, but useful, independence assumption: a data point has a number of features, all of which are treated as independent.
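Under this assumption, the fitness of a candidate segmentation is simply the product of its word probabilities; summing log probabilities maximizes the same quantity and avoids floating point underflow. A sketch, reusing the hypothetical counts and total from above:

import math

def word_seq_fitness(words):
    # Naive Bayes score: sum of log unigram probabilities
    # (equivalent to maximizing their product).
    score = 0.0
    for word in words:
        if counts[word] == 0:
            return float("-inf")  # unseen word; see the note on corpus coverage below
        score += math.log(counts[word] / total)
    return score

print(segment("lovesnails", word_seq_fitness))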
For natural language, n-grams should also be considered, since words tend to appear in groups. One can also model occurrences within texts (without a fixed order, e.g. with a sequence memoizer). We restrict n to 5 here.
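A bigram variant, again just as a sketch: the back-off to unigram probabilities for unseen pairs is a crude choice made here for illustration, and bigram_counts would have to be collected from the same corpus.

import math

def conditional_probability(prev, word, unigram_counts, bigram_counts, total):
    # P(word | prev); fall back to the plain unigram estimate when the
    # bigram was never observed in the corpus.
    if bigram_counts[(prev, word)] > 0 and unigram_counts[prev] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return unigram_counts[word] / total

def bigram_fitness(words, unigram_counts, bigram_counts, total):
    # Log of P(w1) * P(w2 | w1) * ...; assumes every candidate word has a
    # nonzero unigram count (unseen words are discussed below).
    score = math.log(unigram_counts[words[0]] / total)
    for prev, word in zip(words, words[1:]):
        score += math.log(conditional_probability(prev, word,
                                                  unigram_counts,
                                                  bigram_counts, total))
    return score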
See segmentation.py.
A bigger corpus does not mean that the analysis gets better. There is also the problem of words that are not in the corpus.
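One common workaround (not from the text, just a hedge against zero probabilities) is to give unseen words a small, length-dependent probability instead of zero, so that long nonsense strings are penalized more than short plausible ones:

def probability(word, counts, total):
    # Seen words get their relative frequency; unseen words get a tiny,
    # length-dependent estimate so the product never collapses to zero.
    if counts[word] > 0:
        return counts[word] / total
    return 10.0 / (total * 10 ** len(word))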