Elasticsearch Introduction

What is Elasticsearch?

It is a highly scalable, open-source, full-text search engine.
It allows you to store and search data quickly and in near real time.
It is built on top of Apache Lucene.
It is schemaless.
It stores data in the form of JSON documents.
It exposes REST APIs for storing and searching data.

ES Components

Cluster = Server(s)
Node = Server
Index = Database
Type = Table
Document = Record (or row)

Types of Nodes

Data Node - Stores the data and performs operations on it (indexing, searching, aggregation, etc.).
Master Node - Maintains the health of the cluster and performs administrative tasks (creating/deleting indices, tracking which nodes are part of the cluster).
Coordinating Node - Receives requests from client applications, forwards them to the relevant nodes, and merges the results.
By default, a node is both a master-eligible node and a data node.

Installing Elasticsearch v5.6.0

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.tar.gz
tar -xvf elasticsearch-5.6.0.tar.gz
cd elasticsearch-5.6.0/bin
./elasticsearch

Installing Kibana v5.6.0

curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-darwin-x86_64.tar.gz
tar -xvf kibana-5.6.0-darwin-x86_64.tar.gz
cd kibana-5.6.0-darwin-x86_64/bin
./kibana

Start ES

cd ~/elasticsearch-5.6.0
bin/elasticsearch   # serves http://localhost:9200

Start Kibana

cd ~/kibana-5.6.0-darwin-x86_64/
bin/kibana   # serves http://localhost:5601

ES configurations

elasticsearch.yml
jvm.options

Console

Kibana -> Dev Tools -> Console (previously called Sense)

Explore Elasticsearch Cluster

GET /
GET /_cat/health?v
GET /_cat/nodes?v
GET /_cat/indices?v

Create an index

PUT library
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}

Create a Document

PUT /library/books/1
{
  "title": "The quick brown fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}
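
Outside the Kibana console, the same request can be sent with any HTTP client. A minimal sketch using only Python's standard library, assuming Elasticsearch listens on http://localhost:9200 as in the setup above. The helper name build_index_request is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: the local node started earlier in these notes


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "library", "books", 1,
    {"title": "The quick brown fox", "price": 5, "colors": ["red", "green", "blue"]},
)
print(request.get_method(), request.full_url)

# To send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```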

Document meta fields

_index
_type
_id
_score
_source

Create documents in Bulk

index is the operation here; for each document we also specify its _id.

POST library/books/_bulk
{ "index": { "_id": 2 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 15, "colors": ["blue", "yellow"] }
{ "index": { "_id": 3 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 8, "colors": ["red", "blue"] }
{ "index": { "_id": 4 } }
{ "title": "Brown fox brown dog", "price": 2, "colors": ["black", "yellow", "red", "blue"] }
{ "index": { "_id": 5 } }
{ "title": "Lazy dog", "price": 9, "colors": ["red", "blue", "green"] }
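
The bulk body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. A small sketch of how such a body could be generated from application data; the function name build_bulk_body is made up for illustration.

```python
import json


def build_bulk_body(docs_by_id):
    """Return the newline-delimited JSON body for a _bulk request.

    Each document contributes two lines: an action line and a source line.
    """
    lines = []
    for doc_id, source in docs_by_id.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


books = {
    4: {"title": "Brown fox brown dog", "price": 2, "colors": ["black", "yellow", "red", "blue"]},
    5: {"title": "Lazy dog", "price": 9, "colors": ["red", "blue", "green"]},
}
body = build_bulk_body(books)
print(body, end="")
```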

Get a Document

GET /library/books/1

Update a Document

By re-indexing it (all attributes must be specified again)

POST /library/books/1
{
  "title": "The quick fantastic fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}

Or by using the update API (you can specify the attribute(s) to be updated)

POST /library/books/1/_update
{
  "doc": {
    "title": "The quick brown fox"
  }
}
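
The difference between the two approaches can be modelled with plain dictionaries. This is only a toy model of the observable behaviour, not how Elasticsearch is implemented: the merge below is shallow, and both function names are hypothetical.

```python
def reindex(stored, new_source):
    """Full re-index: the new body replaces the stored document entirely."""
    return dict(new_source)


def partial_update(stored, doc):
    """Update API with "doc": the given fields are merged into the stored document."""
    merged = dict(stored)
    merged.update(doc)
    return merged


stored = {"title": "The quick fantastic fox", "price": 5, "colors": ["red", "green", "blue"]}

replaced = reindex(stored, {"title": "The quick brown fox"})
updated = partial_update(stored, {"title": "The quick brown fox"})

print(replaced)  # only the title is left
print(updated)   # price and colors are kept
```

This is why re-indexing with only the changed attribute would silently drop price and colors.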

Delete a Document

DELETE /library/books/1

Basic Search (Find all documents)

Get all documents in the books type.
No query is specified, so no relevance scoring takes place and every document gets the same score.

GET library/books/_search

Find all documents having "fox" in their title

Get documents having fox in their title field.

GET library/books/_search
{
  "query": {
    "match": {
      "title": "fox"
    }
  }
}

Relevance

The relevance score of each document is represented by a positive floating-point number called the _score.
The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
The default scoring algorithm in Elasticsearch 5.x is BM25, which refines the classic TF/IDF (term frequency/inverse document frequency) model. Both build on the factors below.

Term frequency

How often does the term appear in the field?
The more often, the more relevant.
A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

Inverse document frequency

How often does each term appear in the index?
The more often, the less relevant.
Terms that appear in many documents have a lower weight than more-uncommon terms.

Field-length norm

How long is the field?
The longer it is, the less likely it is that words in the field will be relevant.
A term appearing in a short title field carries more weight than the same term appearing in a long content field.
When a query has multiple clauses, the _score from each matching clause is combined into the overall _score for the document, so the more clauses that match, the higher the _score.
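
The three factors can be sketched with the classic Lucene TF/IDF formulas (square root of the term count, a logarithmic inverse document frequency, and one over the square root of the field length). This is an illustration on a toy corpus with whitespace tokenization; scores reported by Elasticsearch will differ, since the similarity in use combines these factors differently.

```python
import math


def term_frequency(term, tokens):
    """tf = sqrt(number of times the term appears in the field)."""
    return math.sqrt(tokens.count(term))


def inverse_document_frequency(term, all_docs):
    """idf = 1 + ln(num_docs / (docs_containing_term + 1))."""
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return 1 + math.log(len(all_docs) / (doc_freq + 1))


def field_length_norm(tokens):
    """norm = 1 / sqrt(number of terms in the field)."""
    return 1 / math.sqrt(len(tokens))


titles = [
    "the quick brown fox jumps over the lazy dog",
    "brown fox brown dog",
    "lazy dog",
]
docs = [title.split() for title in titles]

for title, tokens in zip(titles, docs):
    print(f"{title!r}: tf(brown)={term_frequency('brown', tokens):.3f} "
          f"norm={field_length_norm(tokens):.3f}")

for term in ("quick", "dog"):
    print(f"idf({term}) = {inverse_document_frequency(term, docs):.3f}")
```

In this corpus "quick" is rarer than "dog", so it gets the higher idf, and the two-word title gets a higher norm than the nine-word one.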

Find all "quick" or "dog" documents (`match` query with multiple terms)

Get documents having either quick or dog in their title field.

GET library/books/_search
{
  "query": {
    "match": {
      "title": "quick dog"
    }
  }
}

Find documents with phrase "quick dog" (`match_phrase` query)

Get documents having the phrase quick dog in their title field (the words must be adjacent and in this order).

GET library/books/_search
{
  "query": {
    "match_phrase": {
      "title": "quick dog"
    }
  }
}
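
The difference between the two queries can be shown with a toy matcher over a few of the sample titles. It lowercases and splits on whitespace instead of running a real analyzer, and it ignores scoring; the function names are only illustrative.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(title, query):
    """True if the title contains at least one of the query terms (OR)."""
    tokens = analyze(title)
    return any(term in tokens for term in analyze(query))


def match_phrase(title, query):
    """True if the query terms appear in the title consecutively and in order."""
    tokens, phrase = analyze(title), analyze(query)
    return any(
        tokens[i:i + len(phrase)] == phrase
        for i in range(len(tokens) - len(phrase) + 1)
    )


titles = {
    2: "The quick brown fox jumps over the lazy dog",
    4: "Brown fox brown dog",
    5: "Lazy dog",
}

match_ids = [doc_id for doc_id, title in titles.items() if match(title, "quick dog")]
phrase_ids = [doc_id for doc_id, title in titles.items() if match_phrase(title, "quick dog")]
lazy_ids = [doc_id for doc_id, title in titles.items() if match_phrase(title, "lazy dog")]

print(match_ids)   # [2, 4, 5]
print(phrase_ids)  # []
print(lazy_ids)    # [2, 5]
```

No title contains quick immediately followed by dog, so the phrase query finds nothing even though the match query finds every title.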

We can also do combinations of queries

Let's find all docs with "quick" and "lazy dog".
The bool query allows us to combine multiple queries.
The must clause is similar to AND in SQL: all conditions inside it must match.

GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Or negate parts of a query

Get documents whose title contains neither quick nor the phrase lazy dog (a document is excluded if any must_not clause matches).

GET library/books/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Let's find all docs with "quick dog" OR "lazy dog".

Clauses can be boosted so that some matches count for more than others.
The should clause is similar to OR in SQL: at least one of the conditions inside it must match.

GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog"
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}
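
How must, must_not and should interact can be sketched as a toy evaluator. The "score" here is simply the number of matching clauses, which mirrors the idea that more matching clauses rank higher but is not the real scoring; contains and bool_query are hypothetical names, and boosting is left out.

```python
def contains(term):
    """Predicate factory: does the lowercased title contain the term?"""
    return lambda title: term in title.lower().split()


def bool_query(title, must=(), must_not=(), should=()):
    """Toy bool query: return a clause count as a stand-in score, or None if no match."""
    if not all(clause(title) for clause in must):
        return None
    if any(clause(title) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(title))
    # Without must clauses, at least one should clause has to match.
    if not must and should and should_hits == 0:
        return None
    return len(must) + should_hits


titles = {
    2: "The quick brown fox jumps over the lazy dog",
    4: "Brown fox brown dog",
    5: "Lazy dog",
}

must_scores = {i: bool_query(t, must=[contains("quick"), contains("dog")]) for i, t in titles.items()}
must_not_scores = {i: bool_query(t, must_not=[contains("quick")]) for i, t in titles.items()}
should_scores = {i: bool_query(t, should=[contains("quick"), contains("lazy")]) for i, t in titles.items()}

print(must_scores)      # {2: 2, 4: None, 5: None}
print(must_not_scores)  # {2: None, 4: 0, 5: 0}
print(should_scores)    # {2: 2, 4: None, 5: 1}
```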

Highlighting matching fragments

It tells you which parts of the title field match.
You can configure this to use different kinds of emphasis markers.

GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

Filtering

Filtering is often faster than querying because it does not have to calculate a score.
Get documents that have price more than 5.

GET library/books/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": {
            "gt": 5
          }
        }
      }
    }
  }
}

Querying & Filtering together

Get documents that have dog in the title and the price is between 5 & 10.

GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "dog"
          }
        }
      ],
      "filter": {
        "range": {
          "price": {
            "gte": 5,
            "lte": 10
          }
        }
      }
    }
  }
}
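
A rough way to picture the split: the filter is a yes/no check that only narrows the candidate set, while the query part decides the ranking. A toy sketch over the sample books, where the "score" is just the term count in the title; price_in_range and search are illustrative names, not Elasticsearch APIs.

```python
def price_in_range(doc, gte=None, lte=None):
    """Range filter: a yes/no check that contributes nothing to the score."""
    price = doc["price"]
    if gte is not None and price < gte:
        return False
    if lte is not None and price > lte:
        return False
    return True


def search(docs, term, gte=None, lte=None):
    """Keep docs that pass the filter, then rank them by a toy score (term count)."""
    hits = []
    for doc_id, doc in docs.items():
        if not price_in_range(doc, gte, lte):
            continue
        score = doc["title"].lower().split().count(term)
        if score > 0:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


books = {
    2: {"title": "The quick brown fox jumps over the lazy dog", "price": 15},
    3: {"title": "The quick brown fox jumps over the lazy dog", "price": 8},
    4: {"title": "Brown fox brown dog", "price": 2},
    5: {"title": "Lazy dog", "price": 9},
}

filtered_hits = search(books, "dog", gte=5, lte=10)
unfiltered_hits = search(books, "brown")

print(filtered_hits)    # [(3, 1), (5, 1)]
print(unfiltered_hits)  # [(4, 2), (2, 1), (3, 1)]
```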

Analysis

How does full-text search actually work?
When documents are indexed, each document undergoes an analysis step.
Analysis is a combination of tokenization and token filters.
Analysis = Tokenization + Token filters
Tokenization = Takes the field and breaks it into multiple parts called tokens.
Token Filters = Apply filters to the tokens to massage them into a different format (for example, lowercasing).

Tokenization breaks sentences into discrete tokens

GET /library/books/_analyze
{
  "tokenizer": "standard",
  "text": "Brown fox brown dog"
}

And token filters manipulate those tokens

GET /library/books/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Brown fox brown dog"
}

You can combine multiple token filters

GET /library/books/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "unique"],
  "text": "Brown brown brown fox brown fox dog"
}

Instead of specifying a `tokenizer` and `token filter`, you can specify an `analyzer`.

Analyzer = A tokenizer + 0 or more token filters
This applies the standard tokenizer and the lowercase token filter.

GET /library/books/_analyze
{
  "analyzer": "standard",
  "text": "Brown fox brown dog"
}
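
The "tokenizer + token filters" composition can be sketched as a small pipeline. The tokenizer below is a simple regular expression, not the real standard tokenizer (which follows Unicode text segmentation rules), and all names are made up for illustration.

```python
import re


def simple_tokenizer(text):
    """Toy tokenizer: every run of word characters becomes a token."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def unique_filter(tokens):
    """Token filter: drop repeated tokens, keeping the first occurrence of each."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


def analyze(text, tokenizer, filters=()):
    """Analyzer: run the tokenizer, then apply each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "Brown brown brown fox brown fox dog"
both = analyze(text, simple_tokenizer, [lowercase_filter, unique_filter])
unique_only = analyze(text, simple_tokenizer, [unique_filter])

print(both)         # ['brown', 'fox', 'dog']
print(unique_only)  # ['Brown', 'brown', 'fox', 'dog']
```

Filter order matters: without lowercasing first, Brown and brown count as different tokens.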

Understanding analysis is very important because the emitted tokens determine whether a document matches a query, and therefore how relevant your results are.

The standard tokenizer does not split quick.brown_Fox, and it drops characters such as $ and @.

GET /library/books/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}

Let's look at the `letter` tokenizer

Now quick.brown_Fox is split into separate tokens,
but the numbers and special characters are ignored
because this tokenizer only emits letters.

GET /library/books/_analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}
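
The behaviour of the letter tokenizer can be approximated with a regular expression that keeps only runs of letters. This is a sketch of the idea, not the Lucene implementation.

```python
import re


def letter_tokenizer(text):
    """Toy letter tokenizer: emit runs of letters, splitting on everything else."""
    # [^\W\d_] matches word characters that are neither digits nor underscores.
    return re.findall(r"[^\W\d_]+", text)


tokens = [token.lower() for token in letter_tokenizer("THE quick.brown_Fox Jumped! $19.95 @ 3.0")]
print(tokens)  # ['the', 'quick', 'brown', 'fox', 'jumped']
```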

Another example with `uax_url_email` tokenizer

With the standard tokenizer
This breaks the email and the URL into separate words

GET /library/books/_analyze
{
  "tokenizer": "standard",
  "text": "[email protected] website https://www.elastic.co"
}

With the uax_url_email tokenizer
This does not break up the email and the URL

GET /library/books/_analyze
{
  "tokenizer": "uax_url_email",
  "text": "[email protected] website https://www.elastic.co"
}

Aggregations

Aggregations can be used to explore your data and get statistics on the stored data.

Let's find popular colors (without search results)

GET /library/_search
{
  "size": 0,
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}

And you can search/aggregate at the same time

Aggregation works on the documents returned by the search query.

GET /library/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  },
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}

Multiple aggregations can be calculated at once and can be nested to further perform calculations.

GET /library/_search
{
  "size": 0,
  "aggs": {
    "price-statistics": {
      "stats": {
        "field": "price"
      }
    },
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      },
      "aggs": {
        "avg-price-per-color": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
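
What a terms aggregation with a nested avg computes can be reproduced in a few lines over the sample data. A toy equivalent, assuming only documents 2 to 5 are in the index (document 1 was deleted earlier); the function names are illustrative.

```python
from collections import Counter, defaultdict

books = [
    {"price": 15, "colors": ["blue", "yellow"]},
    {"price": 8, "colors": ["red", "blue"]},
    {"price": 2, "colors": ["black", "yellow", "red", "blue"]},
    {"price": 9, "colors": ["red", "blue", "green"]},
]


def terms_agg(docs, field):
    """Count how many documents fall into each bucket (one bucket per distinct value)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc[field]))
    return counts


def avg_per_bucket(docs, bucket_field, value_field):
    """Nested aggregation: average of value_field within each bucket."""
    values = defaultdict(list)
    for doc in docs:
        for key in set(doc[bucket_field]):
            values[key].append(doc[value_field])
    return {key: sum(vals) / len(vals) for key, vals in values.items()}


popular = terms_agg(books, "colors")
averages = avg_per_bucket(books, "colors", "price")

print(popular.most_common(2))  # [('blue', 4), ('red', 3)]
print(averages["blue"])        # 8.5
```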

Index Mappings

ES is schemaless: when you index a document, ES tries to infer the type of each field in the document.
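
The inference can be pictured roughly as follows. This is a simplified sketch of the default dynamic mapping behaviour in Elasticsearch 5.x, written from memory of the documented rules: real date detection accepts more formats than the single pattern used here, and strings additionally get a keyword sub-field. The function name is hypothetical.

```python
import re


def infer_field_type(value):
    """Guess a mapping type for a JSON value (simplified dynamic-mapping rules)."""
    if isinstance(value, bool):  # check before int: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first element in this sketch.
        return infer_field_type(value[0]) if value else None
    if isinstance(value, str):
        # Assumption: only yyyy-MM-dd strings are treated as dates here.
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            return "date"
        return "text"
    return None


book = {"title": "The quick brown fox", "price": 5, "colors": ["red", "green", "blue"]}
inferred = {field: infer_field_type(value) for field, value in book.items()}
print(inferred)  # {'title': 'text', 'price': 'long', 'colors': 'text'}
```

This also explains why the aggregations above use colors.keyword: the inferred text field is analyzed, while its keyword sub-field holds the exact values.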

How to define an index mapping

famous-librarians is a new index
librarian is the type
Fields of type text are analyzed for full-text search; keyword fields are indexed as exact values.

PUT /famous-librarians
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "my-desc-analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "librarian": {
      "properties": {
        "name": {
          "type": "text"
        },
        "favorite-colors": {
          "type": "keyword"
        },
        "birth-date": {
          "type": "date",
          "format": "year_month_day"
        },
        "hometown": {
          "type": "geo_point"
        },
        "description": {
          "type": "text",
          "analyzer": "my-desc-analyzer"
        }
      }
    }
  }
}

Get the index mapping

GET /famous-librarians/_mapping

Let's add a few documents to the `famous-librarians` index

PUT /famous-librarians/librarian/1
{
  "name": "Sarah Byrd Askew",
  "favorite-colors": ["yellow", "light-grey"],
  "birth-date": "1877-02-15",
  "hometown": {
    "lat": "32.349722",
    "lon": "-86.641111"
  },
  "description": "An American public librarian who pioneered the establishment of libraries in the United States. https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
}

PUT /famous-librarians/librarian/2
{
  "name": "John J Beckley",
  "favorite-colors": ["red", "white"],
  "birth-date": "1757-08-07",
  "hometown": {
    "lat": "51.507222",
    "lon": "-0.1275"
  },
  "description": "An American political campaign manager and the first Librarian of the United States Congress - https://en.wikipedia.org/wiki/John_J._Beckley"
}

Search librarians

POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "name": "john"
    }
  }
}

POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
    }
  }
}

POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/John_J._Beckley"
    }
  }
}

Next Steps

Elasticsearch Documentation - https://www.elastic.co/guide/index.html

nitinstp23/es_talk_notes.md
