Whoosh's default analyzer does not handle CJK characters (in particular Chinese and Japanese) well. If you pass it a typical Chinese or Japanese paragraph, you'll often find an entire sentence treated as a single token.
A Whoosh analyzer consists of one tokenizer and zero or more filters. Because of this, we can easily borrow this recipe from Lucene's CJKAnalyzer:
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
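Most of those steps already have Whoosh counterparts that chain together with the | operator; as a rough sketch (this is approximately how Whoosh's built-in StandardAnalyzer is assembled), the non-CJK part of the recipe looks like this:

from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter

# A tokenizer piped into filters -- the same shape as the Lucene recipe,
# minus the CJK bigram step, which Whoosh does not ship.
standard_like = RegexTokenizer() | LowercaseFilter() | StopFilter()

print([t.text for t in standard_like("The quick brown fox")])
# e.g. ['quick', 'brown', 'fox'] -- "the" is dropped as a stop word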
That recipe inspired my first take at the missing piece:
from whoosh.analysis import Filter, NgramTokenizer

class CJKFilter(Filter):
    def __call__(self, tokens):
        ngt = NgramTokenizer(minsize=1, maxsize=2)
        for t in tokens:
            # Crude CJK test: is the token's first codepoint at or above U+2E80?
            if len(t.text) > 0 and ord(t.text[0]) >= 0x2e80:
                # Re-tokenize the CJK run into unigrams and bigrams
                for token in ngt(t.text):
                    token.pos = True
                    yield token
            else:
                yield t
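The n-gram step does the real work here. To get a feel for what NgramTokenizer(minsize=1, maxsize=2) contributes, this is roughly what it emits for a four-character Chinese string (the sample text is only an illustration):

from whoosh.analysis import NgramTokenizer

ngt = NgramTokenizer(minsize=1, maxsize=2)
print([t.text for t in ngt("搜索引擎")])
# Unigrams plus overlapping bigrams, roughly:
# ['搜', '搜索', '索', '索引', '引', '引擎', '擎']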
The ord(t.text[0]) >= 0x2e80 test is a flawed way of checking whether a token contains CJK characters: I'm only looking at the first codepoint of the token's text and accepting anything at or above U+2E80, the first codepoint of the CJK radicals block. But as a first take, it already works quite well.
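If you want something stricter, one option (a hypothetical contains_cjk helper, not what the filter above uses) is to check every character against the main CJK and kana blocks:

def contains_cjk(text):
    # Sketch of a stricter test: look at every character and restrict the
    # check to the main CJK/kana blocks instead of "anything above U+2E80".
    return any(
        0x2E80 <= ord(ch) <= 0x2FDF      # CJK radicals and Kangxi radicals
        or 0x3040 <= ord(ch) <= 0x30FF   # hiragana and katakana
        or 0x3400 <= ord(ch) <= 0x4DBF   # CJK unified ideographs extension A
        or 0x4E00 <= ord(ch) <= 0x9FFF   # CJK unified ideographs
        or 0xF900 <= ord(ch) <= 0xFAFF   # CJK compatibility ideographs
        for ch in text
    )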
Once we have this filter, we can then create our own analyzer:
my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()
You can pipe the whole chain into a StopFilter() if you need to remove stop words.
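To sanity-check the pipeline, run the analyzer over a mixed English/Chinese string and look at the tokens it produces (the sample text is only an illustration):

from whoosh.analysis import RegexTokenizer, LowercaseFilter
from whoosh.fields import Schema, TEXT

my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()

# English words pass through whole (lowercased); the Chinese run is
# broken into unigrams and bigrams by CJKFilter.
print([t.text for t in my_analyzer("Whoosh 全文搜索")])

# The analyzer plugs into a schema like any built-in one:
schema = Schema(content=TEXT(analyzer=my_analyzer))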