The new index will test this indexing strategy. The same tokenizer and filters are run at indexing time and on users' queries.
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.
Note that words are split at hyphens.
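A minimal sketch of this splitting behavior, driving Lucene's StandardTokenizer directly (the API shown is from recent Lucene releases; the class name and sample text are illustrative):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(
                "Visit example.com or mail admin@example.com about state-of-the-art search"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Prints: Visit, example.com, or, mail, admin, example.com,
            // about, state, of, the, art, search
            // (domain kept whole, email split at "@", hyphens split)
            System.out.println(term);
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```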
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
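A quick way to inspect these types is to read the TypeAttribute next to each term (again a sketch against a recent Lucene release; the input text is made up):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class TypeDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("Lucene 99 ひらがな"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Lucene -> <ALPHANUM>, 99 -> <NUM>, and each hiragana
            // character (ひ, ら, が, な) becomes its own <HIRAGANA> token
            System.out.println(term + " " + type.type());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```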
This filter does nothing in recent versions of Lucene.
Converts any uppercase letters in a token to their lowercase equivalents. All other characters are left unchanged.
This filter applies the Porter Stemming Algorithm for English.
This filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block (the first 128 ASCII characters) to their ASCII equivalents, where such an equivalent exists.
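As a sketch of how these pieces compose (the no-op filter is omitted; filter classes and import paths are for recent Lucene 8.x/9.x releases, and the class name, field name, and sample text are illustrative), a custom Analyzer can wire the Standard Tokenizer into the lowercase, Porter, and ASCII-folding filters and be reused for both indexing and queries, as the opening paragraph requires:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChainDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // Filters run in the order described above:
                // lowercase, then Porter stemming, then ASCII folding.
                Tokenizer source = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(source);
                stream = new PorterStemFilter(stream);
                stream = new ASCIIFoldingFilter(stream);
                return new TokenStreamComponents(source, stream);
            }
        };
        // Passing this same Analyzer to IndexWriterConfig and to the query
        // parser ensures documents and queries are analyzed identically.
        TokenStream ts = analyzer.tokenStream("body", "Stemming CAFÉS");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term); // prints "stem", then "cafe"
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
```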