Created
June 6, 2012 21:43
# Very (very) naive word segmentation algorithm for Chinese
# (or any language with similar characteristics; works at the
# character level).
class Partitioner
  attr_reader :ngrams

  # +ngrams+ Enumerable list of known n-grams
  # +lookahead+ maximum n-gram length (in characters) to try
  def initialize(ngrams, lookahead = 6)
    @lookahead = lookahead
    @ngrams = {}
    ngrams.each { |ng| @ngrams[ng] = true }
  end

  # Goes from beginning to end, each time trying to find the longest
  # initial n characters that are in the list of known n-grams.
  # Falls back to a single character when nothing longer matches.
  def partition(text)
    text = text.chars
    result = []
    until text.empty?
      # Never look past the end of the remaining text
      lookahead = [@lookahead, text.length].min
      while lookahead > 0
        test = text[0...lookahead].join
        if lookahead == 1 || ngrams[test]
          result << test
          text = text[lookahead..-1]
          break
        end
        lookahead -= 1
      end
    end
    result
  end
end
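The greedy strategy used by `partition` (often called forward maximum matching) can be illustrated with a small standalone sketch. The helper name `greedy_segment`, its `max_len` parameter, and the toy dictionaries below are assumptions made for illustration and are not part of the gist. The second call shows a known weakness of this approach: once a longer entry such as 中国人 is in the dictionary, the greedy match takes it and leaves a stray single character behind.

```ruby
require 'set'

# Hypothetical helper (not from the gist): greedy longest-match segmentation.
# At each position, take the longest prefix of up to max_len characters that
# is in the dictionary, falling back to a single character.
def greedy_segment(text, dictionary, max_len = 6)
  raise ArgumentError, 'max_len must be at least 1' if max_len < 1

  known  = dictionary.to_set
  chars  = text.chars
  result = []
  pos    = 0

  while pos < chars.length
    # Do not look past the end of the text
    len = [max_len, chars.length - pos].min
    # Shrink the candidate until it is a known n-gram or a single character
    len -= 1 while len > 1 && !known.include?(chars[pos, len].join)
    result << chars[pos, len].join
    pos += len
  end

  result
end

p greedy_segment("中国人民", ["中国", "人民"])           # ["中国", "人民"]
p greedy_segment("中国人民", ["中国", "人民", "中国人"])  # ["中国人", "民"]
```

Because the match is purely greedy, the quality of the result depends entirely on the dictionary; there is no backtracking or scoring of alternative segmentations.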
No, I don't plan to come up with my own scheme; there have already been plenty of academic efforts to come up with good algorithms. I am thinking more of porting one or more of those to Ruby, or creating Ruby bindings to a C implementation.
This one is really just a placeholder so I can work on other aspects of my app now, and then revisit the segmentation problem later.
I've been looking for a good segmentation toolkit for Ruby or Python. Apparently there is an annual competition for Chinese segmentation, but I don't know which of the available toolkits is best for use from Ruby or Python. Are you starting your own effort?