Created
June 6, 2012 21:43
# Very (very) naive word segmentation algorithm for Chinese
# (or any language with similar characteristics; works at the
# character level).
class Partitioner
  attr_reader :ngrams

  # +ngrams+ Enumerable list of known n-grams
  # +lookahead+ maximum number of characters to try as one segment
  def initialize(ngrams, lookahead = 6)
    @lookahead = lookahead
    @ngrams = {}
    ngrams.each { |ng| @ngrams[ng] = true }
  end

  # Goes from beginning to end, each time trying to find the longest
  # initial n characters that are in the list of known n-grams.
  # Falls back to a single character when nothing longer matches.
  def partition(text)
    text = text.chars
    result = []
    until text.empty?
      # Never look further ahead than the characters that remain.
      lookahead = [@lookahead, text.size].min
      while lookahead > 0
        test = text[0...lookahead].join
        if lookahead == 1 || ngrams[test]
          result << test
          text = text[lookahead..-1]
          break
        end
        lookahead -= 1
      end
    end
    result
  end
end
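To make the greedy behaviour concrete, here is a minimal standalone sketch of the same longest-match-first idea. The helper name `greedy_segment` and the toy word list are assumptions made up for illustration; they are not part of the gist.

```ruby
require 'set'

# Hypothetical helper (not from the gist): greedy longest-match-first
# segmentation over a set of known words.
def greedy_segment(text, dictionary, max_len = 6)
  known = dictionary.to_set
  chars = text.chars
  result = []
  until chars.empty?
    # Never look further ahead than the characters that remain.
    len = [max_len, chars.size].min
    # Shrink the window until it is a known word or a single character.
    len -= 1 while len > 1 && !known.include?(chars.first(len).join)
    result << chars.shift(len).join
  end
  result
end

# Toy dictionary, chosen only for illustration.
words = %w[中国 中国人 人民 我们]
p greedy_segment('我们是中国人民', words)
# => ["我们", "是", "中国人", "民"]
```

The output also shows why this approach is naive: the greedy match takes 中国人 and leaves 民 stranded, where a reader would likely prefer 中国 / 人民. A purely left-to-right longest match never reconsiders an earlier choice.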
This one is really just a placeholder so I can work on other aspects of my app now, and then revisit the segmentation problem later.
No, I don't plan to come up with my own scheme; there have already been plenty of academic efforts to come up with good algorithms. I'm thinking more of porting one or more of those to Ruby, or creating Ruby bindings to a C implementation.