Whoosh's default analyzer does not handle CJK text (in particular Chinese and Japanese) well. If you feed it a typical Chinese or Japanese paragraph, you'll often find an entire sentence treated as a single token.
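A quick check makes the problem concrete (the sample sentence is arbitrary):

```python
from whoosh.analysis import StandardAnalyzer

# The default analyzer splits on runs of "word" characters, and a run of
# CJK ideographs with no spaces in between counts as a single run, so the
# whole sentence comes back as one token.
analyzer = StandardAnalyzer()
print([token.text for token in analyzer("我们都是好孩子")])
# ['我们都是好孩子']
```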
A Whoosh analyzer consists of one tokenizer and zero or more filters. That means we can easily adapt this recipe from Lucene's CJKAnalyzer:
> An Analyzer that tokenizes text with `StandardTokenizer`, normalizes content with `CJKWidthFilter`, folds case with `LowerCaseFilter`, forms bigrams of CJK with `CJKBigramFilter`, and filters stopwords with `StopFilter`
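In Whoosh, an equivalent pipeline is assembled by chaining a tokenizer and filters with the `|` operator. A minimal sketch of the chain (Whoosh ships the tokenizer, lowercasing, and stopword pieces; it has no built-in width normalization or CJK bigram filter, which is the part we need to supply):

```python
from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter

# A Whoosh analyzer is a tokenizer chained with filters via |;
# the composite object is itself an analyzer.
analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter()

# Once the CJK bigram filter below exists, it slots into the chain:
# analyzer = RegexTokenizer() | CJKFilter() | LowercaseFilter() | StopFilter()
```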
That recipe inspired my first take:
from whoosh.analysis import Filter

class CJKFilter(Filter):
    def __call__(self, tokens):