Whoosh's default analyzer does not handle CJK characters (in particular Chinese and Japanese) well. If you pass it a typical Chinese or Japanese paragraph, you'll often find an entire sentence treated as a single token.
A Whoosh analyzer consists of one tokenizer and zero or more filters. Because of this, we can easily borrow this recipe from Lucene's CJKAnalyzer:
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
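Most of those steps already have Whoosh counterparts that chain together with the | operator; as a rough sketch (this is approximately how Whoosh's built-in StandardAnalyzer is assembled), the non-CJK part of the recipe looks like this:

from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter

# A tokenizer piped into filters -- the same shape as the Lucene recipe,
# minus the CJK bigram step, which Whoosh does not ship.
standard_like = RegexTokenizer() | LowercaseFilter() | StopFilter()

print([t.text for t in standard_like("The quick brown fox")])
# e.g. ['quick', 'brown', 'fox'] -- "the" is dropped as a stop word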
That recipe inspired my first take at the missing piece:
from whoosh.analysis import Filter, NgramTokenizer

class CJKFilter(Filter):
    def __call__(self, tokens):
        ngt = NgramTokenizer(minsize=1, maxsize=2)
        for t in tokens:
            # Crude CJK test: is the token's first codepoint at or above U+2E80?
            if len(t.text) > 0 and ord(t.text[0]) >= 0x2e80:
                # Re-tokenize the CJK run into unigrams and bigrams
                for token in ngt(t.text):
                    token.pos = True
                    yield token
            else:
                yield t
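The n-gram step does the real work here. To get a feel for what NgramTokenizer(minsize=1, maxsize=2) contributes, this is roughly what it emits for a four-character Chinese string (the sample text is only an illustration):

from whoosh.analysis import NgramTokenizer

ngt = NgramTokenizer(minsize=1, maxsize=2)
print([t.text for t in ngt("搜索引擎")])
# Unigrams plus overlapping bigrams, roughly:
# ['搜', '搜索', '索', '索引', '引', '引擎', '擎']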
The ord(t.text[0]) >= 0x2e80 test is a flawed way of checking whether a token contains CJK characters: I'm only looking at the first codepoint of the token's text and accepting anything at or above U+2E80, the first codepoint of the CJK radicals block. But as a first take, it already works quite well.
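If you want something stricter, one option (a hypothetical contains_cjk helper, not what the filter above uses) is to check every character against the main CJK and kana blocks:

def contains_cjk(text):
    # Sketch of a stricter test: look at every character and restrict the
    # check to the main CJK/kana blocks instead of "anything above U+2E80".
    return any(
        0x2E80 <= ord(ch) <= 0x2FDF      # CJK radicals and Kangxi radicals
        or 0x3040 <= ord(ch) <= 0x30FF   # hiragana and katakana
        or 0x3400 <= ord(ch) <= 0x4DBF   # CJK unified ideographs extension A
        or 0x4E00 <= ord(ch) <= 0x9FFF   # CJK unified ideographs
        or 0xF900 <= ord(ch) <= 0xFAFF   # CJK compatibility ideographs
        for ch in text
    )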
Once we have this filter, we can then create our own analyzer:
my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()
You can pipe the whole chain into a StopFilter() if you need to remove stop words.
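To sanity-check the pipeline, run the analyzer over a mixed English/Chinese string and look at the tokens it produces (the sample text is only an illustration):

from whoosh.analysis import RegexTokenizer, LowercaseFilter
from whoosh.fields import Schema, TEXT

my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()

# English words pass through whole (lowercased); the Chinese run is
# broken into unigrams and bigrams by CJKFilter.
print([t.text for t in my_analyzer("Whoosh 全文搜索")])

# The analyzer plugs into a schema like any built-in one:
schema = Schema(content=TEXT(analyzer=my_analyzer))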