@lukhnos
Created February 4, 2014 09:12
How to Use Whoosh to Index Documents that Contain CJK Characters (First Take)

Whoosh's default analyzer does not handle CJK characters (in particular Chinese and Japanese) well. If you pass it a typical Chinese or Japanese paragraph, you will often find that an entire sentence is treated as a single token.
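
For example, here is a minimal sketch of the problem using Whoosh's StandardAnalyzer (the sample sentence is arbitrary):

from whoosh.analysis import StandardAnalyzer

# The default pipeline tokenizes on runs of "word" characters; a stretch of
# CJK text has no whitespace breaks, so the whole phrase survives as one token.
for token in StandardAnalyzer()(u"我們需要全文檢索"):
    print(token.text)
# Prints the entire phrase as a single token.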

A Whoosh analyzer consists of one tokenizer and zero or more filters. As a result, we can easily borrow this recipe from Lucene's CJKAnalyzer:

An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter

This inspired me to write the following first take:

from whoosh.analysis import Filter, NgramTokenizer


class CJKFilter(Filter):
    """Re-tokenize tokens that start with a CJK codepoint into 1- and 2-grams."""

    def __call__(self, tokens):
        ngt = NgramTokenizer(minsize=1, maxsize=2)

        for t in tokens:
            # U+2E80 is the first codepoint of the CJK Radicals Supplement block.
            if len(t.text) > 0 and ord(t.text[0]) >= 0x2e80:
                for ngram_token in ngt(t.text):
                    ngram_token.pos = True
                    yield ngram_token
            else:
                yield t

This is a crude way of testing whether a token contains CJK characters: I simply check whether the first codepoint of the token's text is greater than or equal to U+2E80, the first codepoint of the CJK Radicals Supplement block. But as a first take, this already works quite well.
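
To make the limits of the heuristic concrete, here is a small sketch (the helper name looks_cjk and the sample strings are just for illustration):

# The check only looks at the first codepoint of the token's text.
def looks_cjk(text):
    return len(text) > 0 and ord(text[0]) >= 0x2e80

print(looks_cjk(u"漢字"))    # True: U+6F22 >= U+2E80, so it gets bigrammed
print(looks_cjk(u"hello"))   # False: plain ASCII is left to the regular pipeline
print(looks_cjk(u"abc漢字")) # False: a mixed token starting with a Latin
                             # character slips through without bigramming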

Once we have this filter, we can then create our own analyzer:

my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()

You can pipe the entire thing to StopFilter() if you need to remove stop words.
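
As a usage sketch (the schema and field names below are hypothetical, not part of the original recipe), the resulting analyzer can be attached to a TEXT field:

from whoosh.analysis import LowercaseFilter, RegexTokenizer, StopFilter
from whoosh.fields import ID, TEXT, Schema

# Compose the pipeline; the trailing StopFilter() is optional, as noted above.
my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter() | StopFilter()

# A hypothetical schema whose content field uses the CJK-aware analyzer.
schema = Schema(path=ID(stored=True, unique=True),
                content=TEXT(analyzer=my_analyzer))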
