@scottrice10
Created September 10, 2013 01:00
An interesting problem from StackOverflow: how to preserve a special character inside a token while also tokenizing the parts on either side of it. Example: "H&R Blocks" tokenized as "H", "R", "H&R", "Blocks". With the help of this blogpost, a possible solution:
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "blocks_filter" : {
          "type" : "word_delimiter",
          "preserve_original" : "true"
        },
        "shingle" : {
          "type" : "shingle",
          "max_shingle_size" : 5,
          "min_shingle_size" : 2,
          "output_unigrams" : "true"
        },
        "filter_stop" : {
          "type" : "stop",
          "enable_position_increments" : "false"
        }
      },
      "analyzer" : {
        "blocks_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "blocks_filter", "shingle"]
        }
      }
    }
  },
  "mappings" : {
    "type" : {
      "properties" : {
        "company" : {
          "type" : "string",
          "analyzer" : "blocks_analyzer"
        }
      }
    }
  }
}
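To make the filter chain easier to follow, here is a rough Python sketch of the first three stages of blocks_analyzer: the whitespace tokenizer, the lowercase filter, and a word_delimiter filter with preserve_original enabled. It is only an approximation written for illustration, not Lucene's implementation. The function names are hypothetical, the split happens only on non-alphanumeric characters (the real filter also handles case changes, digits, and possessives), and the shingle stage is left out.

```python
import re


def approx_word_delimiter(token, preserve_original=True):
    """Simplified stand-in for word_delimiter: split on non-alphanumerics only."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if parts == [token]:
        # Nothing to split, so the token passes through unchanged.
        return parts
    # With preserve_original, the unsplit token is emitted ahead of its parts.
    return ([token] if preserve_original else []) + parts


def approx_analyze(text):
    """Whitespace tokenizer -> lowercase -> approximate word_delimiter."""
    tokens = []
    for raw in text.split():
        tokens.extend(approx_word_delimiter(raw.lower()))
    return tokens


print(approx_analyze("H&R Blocks"))  # ['h&r', 'h', 'r', 'blocks']
```

Because the lowercase filter runs before blocks_filter, the tokens come out lowercased. In the real analyzer the shingle filter would then add multi-token combinations on top of these unigrams.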