The new index will test this indexing strategy. The same tokenizer and filters are run at indexing time and on users' queries.
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.
Note that words are split at hyphens.
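A minimal sketch of this splitting behavior, driving Lucene's StandardTokenizer directly (the API shown is from recent Lucene releases; the class name and sample text are illustrative):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(
                "Visit example.com or mail admin@example.com about state-of-the-art search"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Prints: Visit, example.com, or, mail, admin, example.com,
            // about, state, of, the, art, search
            // (domain kept whole, email split at "@", hyphens split)
            System.out.println(term);
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```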
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
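A quick way to inspect these types is to read the TypeAttribute next to each term (again a sketch against a recent Lucene release; the input text is made up):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class TypeDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("Lucene 99 ひらがな"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Lucene -> <ALPHANUM>, 99 -> <NUM>, and each hiragana
            // character (ひ, ら, が, な) becomes its own <HIRAGANA> token
            System.out.println(term + " " + type.type());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```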
This filter does nothing in recent versions of Lucene.
Converts any uppercase letters in a token to their lowercase equivalents. All other characters are left unchanged.
This filter applies the Porter Stemming Algorithm for English.
This filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block (the first 128 ASCII characters) to their ASCII equivalents, where such an equivalent exists.
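As a sketch of how these pieces compose (the no-op filter is omitted; filter classes and import paths are for recent Lucene 8.x/9.x releases, and the class name, field name, and sample text are illustrative), a custom Analyzer can wire the Standard Tokenizer into the lowercase, Porter, and ASCII-folding filters and be reused for both indexing and queries, as the opening paragraph requires:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChainDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // Filters run in the order described above:
                // lowercase, then Porter stemming, then ASCII folding.
                Tokenizer source = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(source);
                stream = new PorterStemFilter(stream);
                stream = new ASCIIFoldingFilter(stream);
                return new TokenStreamComponents(source, stream);
            }
        };
        // Passing this same Analyzer to IndexWriterConfig and to the query
        // parser ensures documents and queries are analyzed identically.
        TokenStream ts = analyzer.tokenStream("body", "Stemming CAFÉS");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term); // prints "stem", then "cafe"
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
```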