Elasticsearch
Analyzer
An analyzer consists of three building blocks:
1. Character filters => add, remove, or transform characters before tokenization, e.g. stripping HTML tags from the document.
2. Tokenizer => splits the text into terms (tokens).
3. Token filters => zero or more filters that add, remove, or change tokens, e.g. uppercase, synonyms, stop words.
Analyze API
POST _analyze
{
  "tokenizer": "standard",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}

POST _analyze
{
  "filter": ["lowercase"],
  "char_filter": ["html_strip"],
  "text": "I'm in the mood for drinking semi-dry red wine!"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}
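
All three building blocks can also be combined in a single request. A sketch (the sample HTML text is just an illustration):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<em>I'm in the mood for drinking semi-dry red wine!</em>"
}

Resulting tokens: [i'm, in, the, mood, for, drinking, semi, dry, red, wine]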
Inverted index
The result of analysis is stored in an inverted index.
An inverted index is maintained per text field: an index with two full-text fields will have two
inverted indices, one for each field.
Below is the inverted index for the "title" field:

Term         Document #1    Document #2
best              ✓
carbonara         ✓
delicious                        ✓
pasta             ✓               ✓
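
For illustration, two hypothetical documents that would produce the inverted index above (the recipes index name and titles are assumptions):

PUT /recipes/_doc/1
{
  "title": "Best pasta carbonara"
}

PUT /recipes/_doc/2
{
  "title": "Delicious pasta"
}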
Tokenizer
1. Word-oriented tokenizers, e.g. the standard tokenizer
letter tokenizer => divides text into terms whenever it encounters a character that is not a letter
lowercase tokenizer => like the letter tokenizer, but also lowercases all terms
whitespace tokenizer => divides text into terms whenever it encounters whitespace
uax_url_email tokenizer => like the standard tokenizer, but treats URLs and email addresses as single tokens
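
For example, the uax_url_email tokenizer keeps an email address intact where the standard tokenizer would split it (the sample text is just an illustration):

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Send feedback to john@example.com"
}

Resulting tokens: [Send, feedback, to, john@example.com]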
2. Partial word tokenizers
Break up text or words into small fragments; used for partial word matching.
ngram tokenizer => breaks text into words when encountering certain characters and then emits N-grams of the specified lengths from each word.
"Red Wine"
[Re, Red, ed, Wi, Win, Wine, in, ine, ne]
edge_ngram tokenizer => breaks text into words when encountering certain characters and then emits N-grams of each word, anchored to the start of the word.
"Red Wine"
[Re, Red, Wi, Win, Wine]
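
The edge_ngram output above can be reproduced with the Analyze API, which accepts an inline tokenizer definition; the min_gram/max_gram values here are assumptions chosen to match the example:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter"]
  },
  "text": "Red Wine"
}

Resulting tokens: [Re, Red, Wi, Win, Wine]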
3. Structured text tokenizers
Used for structured text such as email addresses, zip codes, identifiers, etc.
keyword tokenizer
pattern tokenizer
path_hierarchy tokenizer => splits hierarchical values (e.g. file system paths) and emits a term for each component in the tree
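
For example, running the path_hierarchy tokenizer against a file system path emits one term per level of the hierarchy:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/bin"
}

Resulting tokens: [/usr, /usr/local, /usr/local/bin]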
Token filters
standard token filter
lowercase filter
uppercase filter
ngram filter
edge_ngram token filter
stop filter
word_delimiter filter => splits words into subwords and performs transformations on the subword groups.
[Wi-Fi, PowerShell] => [Wi, Fi, Power, Shell]
stemmer token filter
keyword marker token filter (keyword_marker) => protects words from being modified by stemmers
snowball token filter => stems words using a Snowball-generated stemmer
synonym token filter
trim token filter
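
The word_delimiter example above can be verified with the Analyze API; the whitespace tokenizer is used here so the hyphenated word reaches the filter intact:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["word_delimiter"],
  "text": "Wi-Fi PowerShell"
}

Resulting tokens: [Wi, Fi, Power, Shell]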
Built-in analyzers
standard
whitespace
simple
keyword
stop
pattern
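
The building blocks above can be wired together into a custom analyzer in the index settings. A minimal sketch, assuming hypothetical names my_index and my_analyzer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

The analyzer can then be tested with POST /my_index/_analyze using "analyzer": "my_analyzer", or assigned to a text field via the field's "analyzer" mapping parameter.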