Created
February 7, 2012 14:45
elasticsearch : dealing with case and accents
# delete index (will print an error if 'my_index' doesn't exist, you can safely ignore it)
curl -XDELETE 'http://localhost:9200/my_index'
# create index with its settings
curl -XPOST 'http://localhost:9200/my_index' -d '{
"index.analysis.analyzer.default.type":"custom",
"index.analysis.analyzer.default.tokenizer":"standard",
"index.analysis.analyzer.default.filter.0":"lowercase",
"index.analysis.analyzer.default.filter.1":"asciifolding"
}'
# check index analyzer behaviour
# we can note that lowercase filter and asciifolding filters work at index phase
# 2 tokens are stored : 'ingenieur' and 'java'
curl -XGET 'localhost:9200/my_index/_analyze?text=Ingénieur+Java'
# add data
curl -XPUT 'http://localhost:9200/my_index/my_type/1' -d '{"reference":"ADV-REF-00000001", "title":"Ingénieur Java"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/2' -d '{"reference":"ADV-REF-00000002", "title":"Conservateur documentaliste"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/3' -d '{"reference":"ADV-REF-00000003", "title":"Technicien qualité validation H/F"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/4' -d '{"reference":"ADV-REF-00000004", "title":"Valet de chambre"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/5' -d '{"reference":"ADV-REF-00000005", "title":"Ingénieur PHP"}'
# search data
# the queries below should all return the same results (2 hits)
# URLs are quoted so the shell does not try to expand '?' and '*'
curl 'http://localhost:9200/my_index/my_type/_search?q=Ingénieur*'
curl 'http://localhost:9200/my_index/my_type/_search?q=ingénieur*'
curl 'http://localhost:9200/my_index/my_type/_search?q=ingenieur*'
curl 'http://localhost:9200/my_index/my_type/_search?q=Ingén*'
curl 'http://localhost:9200/my_index/my_type/_search?q=ingén*'
curl 'http://localhost:9200/my_index/my_type/_search?q=ingen*'
My problem is the opposite: I also need to match "Ingen"... I tried it like this:
curl -XGET 'http://172.16.181.128:9200/sandbox/tests/_search' -d '{
  "query" : {
    "text" : {
      "user" : {
        "query" : "ingen",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.5"
      }
    }
  }
}'
And it works, but because the match is approximate, it fails when there are too many differences between the words, so this does not solve the whole accent problem. Does anyone know how to simply index while ignoring accents?
You can try this:
Or replace the accented characters with ?. Example:
Find: "camión"
{ "query": { "query_string": { "analyze_wildcard": true, "query": "cami?n" } } }
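If you want to script that replacement, here is a minimal sketch in POSIX shell. It assumes the input is UTF-8, and it swaps every non-ASCII character (not only accented letters) for a single `?`. The function name `mask_non_ascii` is made up for this example.

```shell
# Hypothetical helper: replace each non-ASCII character with '?'.
# Assumption: the input is UTF-8 and contains no newlines.
mask_non_ascii() (
  out=""
  # od prints every input byte as a two-digit lowercase hex number
  for hex in $(printf '%s' "$1" | od -An -v -tx1); do
    case "$hex" in
      [89ab]?) ;;              # UTF-8 continuation byte: covered by its lead byte
      [cdef]?) out="$out?" ;;  # UTF-8 lead byte: one '?' per non-ASCII character
      *)                       # ASCII byte: turn the hex value back into the character
        out="$out$(printf '%b' "\\0$(printf '%03o' "0x$hex")")" ;;
    esac
  done
  printf '%s\n' "$out"
)
```

For example, `mask_non_ascii 'camión'` prints `cami?n`, which can then go into the "query" field of the query_string request.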
Hi,
Yes, the key is how the accents are encoded in the URL. Instead of "curl http://localhost:9200/my_index/my_type/_search?q=Ingén*" use "curl http://localhost:9200/my_index/my_type/_search?q=Ing%C3%A9n*"
Cheers
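To avoid encoding by hand: in UTF-8, "é" is the two bytes C3 A9, so it becomes %C3%A9 in a URL. Below is a rough POSIX shell sketch that percent-encodes a search term byte by byte. The function name `urlencode` is made up for this example, and leaving `*` unencoded is a choice made here so the wildcard stays readable.

```shell
# Hypothetical helper: percent-encode a string byte by byte.
# Assumption: the terminal/script encoding is UTF-8, so "é" arrives as bytes C3 A9.
urlencode() (
  out=""
  # od prints every input byte as a two-digit lowercase hex number
  for hex in $(printf '%s' "$1" | od -An -v -tx1); do
    case "$hex" in
      # keep A-Z, a-z, 0-9, '-', '.', '_', '~' and the '*' wildcard as they are
      2a|2d|2e|5f|7e|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
        out="$out$(printf '%b' "\\0$(printf '%03o' "0x$hex")")" ;;
      # everything else becomes %XX with uppercase hex digits
      *)
        out="$out%$(printf '%s' "$hex" | tr 'a-f' 'A-F')" ;;
    esac
  done
  printf '%s\n' "$out"
)
```

With this, `urlencode 'Ingén*'` prints `Ing%C3%A9n*`, so the search becomes `curl "http://localhost:9200/my_index/my_type/_search?q=$(urlencode 'Ingén*')"`. curl can also do the encoding itself with `curl -G --data-urlencode 'q=Ingén*' 'http://localhost:9200/my_index/my_type/_search'`.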