doc-search01.lo:19200
curl -XPUT "doc-search01.lo:19200/megacorp/employee/5" -d '
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}
'
curl -XGET "doc-search01.lo:19200/megacorp/employee/2"
Lists first 10 entries:
http://doc-search01.lo:19200/megacorp/employee/_search
Lists first 1000 entries:
http://doc-search01.lo:19200/megacorp/employee/_search?size=1000
Lists all entries with "Smith":
http://doc-search01.lo:19200/megacorp/employee/_search?q=Smith
Lists all entries with "music" in interests:
http://doc-search01.lo:19200/megacorp/employee/_search?q=interests:music
You can perform the same interests/music search using the DSL. You just need to pass a JSON request body:
curl -XGET "doc-search01.lo:19200/megacorp/employee/_search" -d '
{
"query" : {
"match" : {
"interests" : "music"
}
}
}
'
It's more verbose! Why use it?
Try this:
curl -XGET "doc-search01.lo:19200/megacorp/employee/_search?pretty=1" -d '
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
'
It will match people who have "rock", "climbing", or "rock climbing" in their about section, sorted by relevance. Nice!
If you want to search for the exact phrase "rock climbing", use "match_phrase":
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
}
The match query is the go-to query—the first query that you should reach for whenever you need to query any field. It is a high-level full-text query, meaning that it knows how to deal with both full-text fields and exact-value fields.
By default, if you search for "brown dog", ES will return docs that have "brown" OR "dog". You can change that to an AND, so it returns docs that contain "brown" AND "dog": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-multi-word.html#match-improving-precision
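That AND behavior is the operator parameter on match; a sketch (index and field names here are made up):
GET /my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "brown dog",
        "operator": "and"
      }
    }
  }
}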
Or you can specify that "at least 1/2 the search words should be in a doc": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-multi-word.html#match-precision
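That's the minimum_should_match parameter; a sketch (again, made-up names):
GET /my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}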
You can retrieve just the title and author of an indexed book with:
GET /books_test/author/_search?_source=title,author
Or if you just want the indexed content with none of the metadata (like "found", "_index", "_type" etc):
GET /books_test/author/1/_source
A doc is immutable. So to "update" it, you just PUT a new version of that doc. What if you just want to make an incremental change to the existing doc? Elasticsearch has an API for that:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/partial-updates.html
But behind the scenes, it is just grabbing the current doc data, changing it, and then PUTting that as a new version.
If you do a partial update like this:
POST /website/blog/1/_update
{
"doc" : {
"tags" : [ "testing" ],
"views": 0
}
}
It will merge this new data with the old data. You can also run some code IN ELASTICSEARCH to make the update:
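Something like this, assuming dynamic scripting is enabled (views is the field from the doc above):
POST /website/blog/1/_update
{
  "script" : "ctx._source.views+=1"
}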
You can get multiple docs if you know their ids:
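That's the mget API; a sketch (index/type/ids made up):
GET /_mget
{
  "docs" : [
    { "_index" : "website", "_type" : "blog", "_id" : 2 },
    { "_index" : "website", "_type" : "pageviews", "_id" : 1 }
  ]
}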
Maybe there is a nice way to search several indexes at once? YUP!
/gb,us/_search
search all types in the gb and us indices
/g*,u*/_search
search all types in any indices beginning with g or beginning with u
How big is too big?
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big
They recommend batches of 1,000-5,000 documents to start, or around 5-15 MB in size.
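For reference, the _bulk body is newline-delimited JSON: an action/metadata line followed (for index/create) by the source line, and the whole body must end with a newline. A sketch reusing the megacorp example (the document data is made up):
POST /_bulk
{ "index" : { "_index" : "megacorp", "_type" : "employee", "_id" : "6" }}
{ "first_name" : "John", "last_name" : "Doe", "age" : 25, "about" : "I like hiking", "interests" : [ "sports" ] }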
Turn off refreshes completely when bulk indexing:
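That's the refresh_interval index setting; set it to -1 before the bulk load (my_index is a made-up name):
PUT /my_index/_settings
{ "refresh_interval": -1 }
And remember to restore it once the bulk load is done:
PUT /my_index/_settings
{ "refresh_interval": "1s" }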
You specify a mapping for a type. You can think of it as:
- index = database
- type = table
- mapping = schema for table
- document = row in a table
An index has multiple documents. A document has a type. Each document has one or more fields. For example, you might have an index "scribd". That might have types word_document and user. A document of type word_document might have a title and an author. A user might have a name and a login.
You don't need to specify a mapping at all. If you don't, ES will figure it out automatically: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/dynamic-mapping.html
But it might be useful for you to specify a mapping, like this:
"name": {
"type": "string",
"analyzer": "whitespace"
}
"name" is a field in your document. This says that name is a string, and before you index name into the inverted index, analyze it with the whitespace analyzer.
Here are the types you can have: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping-intro.html
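Put together, a mapping request for that field might look like this (index and type names made up):
PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "analyzer": "whitespace"
    }
  }
}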
Quin strongly recommends that you only have one type per index. You can run into issues if you have multiple types with the same field name on the same index: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_avoiding_type_gotchas
According to the ES documentation itself:
To ensure that you don't run into these conflicts, it is advisable to ensure that fields with the same name are mapped in the same way in every type in an index.
Create an index with the settings and mappings:
PUT /my_index
{
"settings": { ... any settings ... },
"mappings": {
"type_one": { ... any mappings ... },
"type_two": { ... any mappings ... },
...
}
}
Turn off replicas and use only one shard for an index:
PUT /index_name
{
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}
fox and foxes are basically the same thing. When you store these terms in your inverted index, you don't want to store both since they are similar. Instead, "foxes" can be stemmed, i.e. reduced to its root form, which is "fox".
Take this phrase: "the lord of the rings". We need to store this in our inverted index. An analyzer will:
- tokenize this phrase (typically into ["the", "lord", "of", "the", "rings"]). The components that do this are called tokenizers.
- normalize each token (lowercase, remove whitespace, stemming, remove stopwords like "a" and "the"). These are called token filters.
Then you can put those normalized tokens in your inverted index. ES contains some built-in analyzers. It also contains a bunch of tokenizers and token filters that you can mix and match to create custom analyzers.
Built-in analyzers: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/analysis-intro.html#_built_in_analyzers
Language-specific analyzers are interesting too:
Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like "and" or "the" that don't have much impact on relevance), which it removes, and it is able to stem English words because it understands the rules of English grammar.
GET /_analyze?analyzer=standard
Text to analyze
Will show you how "Text to analyze" will get analyzed by the standard analyzer.
In Search Lite we introduced the _all field: a special field that indexes the values from all other fields as one big string. The query_string query clause (and searches performed as ?q=john) defaults to searching in the _all field if no other field is specified.
Search all fields at once:
GET /_search
{
  "query": {
    "match": {
      "_all": "john smith marketing"
    }
  }
}
The timed_out value tells us whether the query timed out or not. By default, search requests do not time out. If low response times are more important to you than complete results, you can specify a timeout as 10 or "10ms" (10 milliseconds), or "1s" (1 second):
GET /_search?timeout=10ms
It should be noted that this timeout does not halt the execution of the query, it merely tells the coordinating node to return the results collected so far and to close the connection. In the background, other shards may still be processing the query even though results have been sent.
Use the timeout because it is important to your SLA, not because you want to abort the execution of long running queries.
Good work Elasticsearch.
A must_not clause excludes matches. For example, to exclude results where geo is "US":
"bool": {
  "must_not": { "match": { "geo": "US" }}
}
In general, you have must, must_not, and should (i.e. nice-to-have):
"bool": {
"must": { "match": { "tweet": "elasticsearch" }},
"must_not": { "match": { "name": "mary" }},
"should": { "match": { "tweet": "full text" }}
}
The bool filter is used to combine multiple filter clauses using Boolean logic. It accepts three parameters:
For exact matches:
- must: these clauses must match, like and
- must_not: these clauses must not match, like not
- should: at least one of these clauses must match, like or
For full-text matches, should means something different:
- should: if these clauses match, they increase the _score; otherwise they have no effect. They are simply used to refine the relevance score for each document.
SEE THE MOST IMPORTANT QUERIES AND FILTERS HERE:
query = full-text search, inexact (how WELL does this match?)
filter = exact match (does "2014-06-01" match "2014-06-02" EXACTLY?)
Use query for text search, use filter for geo
As a general rule, use query clauses for full text search or for any condition that should affect the relevance score, and use filter clauses for everything else.
Filters are important because they are very, very fast. Filters do not calculate relevance (avoiding the entire scoring phase) and are easily cached.
"you should use filters as often as you can"
Filters could be used for ISBNs? What else? If you use a filter for text, make sure you read this caveat: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_filter_with_text
searching with filters in depth: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html
Internally, when you filter docs, the results are cached in a bitset in memory: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_internal_filter_operation
When you search, filters are applied before full-text searches. ES gets all the docs that matched using these bitsets. So it is able to narrow down the # of docs it has to run a full-text search on.
Updates to these bitsets happen incrementally as you add documents to your index.
You can ask ElasticSearch to sort the results:
GET /_search
{
"query" : {
"filtered" : {
"filter" : { "term" : { "user_id" : 1 }}
}
},
"sort": { "date": { "order": "desc" }} <-- sort by date
}
Or sort by date, then by the _score that Elasticsearch automatically assigns to the results:
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
Suppose you are searching for "the lord of the rings". Documents that contain all of the search words will be given more weight than docs that contain only some of them. This is called "coordination": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#coord
When you search, you can boost a particular field like "weight author match heavier than translator match here": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-query-strings.html#prioritising-clauses
Not sure if useful for us since we will most likely just do a full-text search with _all.
You can also boost values with the "should" clause i.e. if this book matches "lord of the rings", show it. And then specify "should" = "tolkien". So books with "tolkien" in them will be ranked higher.
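A sketch of that (field names made up):
GET /my_index/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "lord of the rings" }},
      "should": { "match": { "author": "tolkien" }}
    }
  }
}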
You can also specify how much to boost:
{ "match": {
"content": {
"query": "Lucene",
"boost": 2
}
}}
A prefix query looks like:
GET /my_index/books/_search
{
"query": {
"prefix": {
"title": "the lor"
}
}
}
finds all books with titles beginning with "the lor".
The prefix query is a low-level query that works at the term level. It doesn't analyze the query string before searching. It assumes that you have passed it the exact prefix that you want to find.
By default, the prefix query does no relevance scoring. It just finds matching documents and gives them all a score of 1. Really, it behaves more like a filter than a query. The only practical difference between the prefix query and the prefix filter is that the filter can be cached.
This explains how a prefix query works:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/prefix-query.html
the shorter the prefix, the more terms need to be visited. If we were to look for the prefix W instead of W1, perhaps we would match 10 million terms instead of just one million.
You can also use "match_phrase_prefix":
{
"match_phrase_prefix" : {
"brand" : "johnnie walker bl"
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/slop.html
Not sure if this is useful for us. Suppose you set slop to 2. Now instead of searching for "lord of the rings", you can search for "of lord the rings". Slop just makes it so that even if words are out of order, it will still match. The higher the slop, the more words can be out of order.
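Syntax-wise, slop is a parameter on match_phrase; a sketch echoing the example above (field name made up):
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "of lord the rings",
        "slop": 2
      }
    }
  }
}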
Suppose you are searching for "the lor". You can't use a full-text search, since that would search your inverted index for "the" and "lor". "lor" wouldn't match anything. So you can do a prefix search for "the lor" and that will match terms in your index that START WITH "lor". But prefix search is slower! You have to SCAN your index! What's an alternative?
Convert your input into edge ngrams: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html
And then do a full-text search on that.
Here's the walkthrough: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
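The setup from that walkthrough looks roughly like this (autocomplete_filter and autocomplete are the walkthrough's names):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  }
}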
If you index something with your custom edge ngram analyzer, make sure the search query you use later specifies the "standard" analyzer! Otherwise it will use an edge ngram analyzer on your search query! Search for "The name:f condition is satisfied by the second document".
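i.e. something along these lines:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "brown fo",
        "analyzer": "standard"
      }
    }
  }
}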
ES docs suggest using the completion suggester:
This data structure lives in memory and makes prefix lookups extremely fast, much faster than any term-based query could be. It is an excellent match for autocompletion of names and brands, whose words are usually organized in a common order: "Johnny Rotten" rather than "Rotten Johnny."
When word order is less predictable, edge n-grams can be a better solution than the completion suggester. This particular cat may be skinned in myriad ways.
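For reference, a completion suggester sketch (field and suggester names made up; ES 1.x exposes this via the _suggest endpoint):
PUT /my_index/_mapping/brand
{
  "properties": {
    "name_suggest": { "type": "completion" }
  }
}
POST /my_index/_suggest
{
  "brand_suggest": {
    "text": "johnnie w",
    "completion": { "field": "name_suggest" }
  }
}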
Given a string like "lord of the rings":
- A tokenizer converts it into tokens: ["lord", "of", "the", "rings"]
- A token filter then runs on each one of those tokens, like "lowercase" or "n-gram".
After you create your own custom analyzer, you can test how it will perform on a string (i.e. what it will index):
curl -XGET 'doc-search01.lo:19200/test/_analyze?analyzer=<your_analyzer>&pretty' -d 'Foo Bar'
Benchmark the following on speed and quality of results:
- Use a prefix search. Advantages: built-in solution, no special indexing required. Disadvantages: could be slow!
- Use full-text search with an edge n-gram analyzer. Advantages: hopefully FAST to query since it does an exact match. Disadvantages: index size could be large, indexing will take time.
- Use the completion suggester I was using. Advantages: supposedly blazing fast. Disadvantages: inflexible, indexing takes a lot of time, and we have to compute every possible input ourselves.
On inflexibility:
No filtering, or advanced queries. In many, and perhaps most, autocomplete applications, no advanced querying is required. Let's suppose, however, that I only want auto-complete results to conform to some set of filters that have already been established (by the selection of category facets on an e-commerce site, for example). There is no way to handle this with completion suggest. If I need more advanced querying capabilities I will need to set up a different sort of autocomplete system.
GET _cluster/health
GET _cluster/health?level=indices
GET _cluster/health?level=shards