
Elasticsearch

Getting started

Getting started - Concepts

Cluster - A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name.

Node - A single server that is part of your cluster.

Index - An index is a collection of documents that have somewhat similar characteristics. An index is identified by a (lowercase) name.

Document - A basic unit of information that can be indexed, expressed in JSON.

Glossary

You can read the Glossary.

APIs

Create an index (a new "customer" index)

PUT /customer?pretty

and list all the indices:

GET /_cat/indices?v

Then index a document into the "customer" index (with an explicit ID of 1):

PUT /customer/_doc/1?pretty
{
  "name": "John Doe"
}

Please note: a new PUT on the same ID replaces the whole document!
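For example, PUTting a different body to the same ID overwrites document 1 entirely (a quick sketch, not in the original notes):

PUT /customer/_doc/1?pretty
{
  "name": "Jane Doe"
}

After this call, document 1 contains only the fields of the new body.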

It is important to note that Elasticsearch does not require you to explicitly create an index first before you can index documents into it. In the previous example, Elasticsearch will automatically create the customer index if it didn’t already exist beforehand.

When indexing, the ID part is optional. If not specified, Elasticsearch will generate a random ID and then use it to index the document.
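For example (note the POST verb instead of PUT, since we don't specify an ID in the URL):

POST /customer/_doc?pretty
{
  "name": "Jane Doe"
}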

Putting it all together

PUT /customer
PUT /customer/_doc/1
{
  "name": "John Doe"
}
GET /customer/_doc/1
DELETE /customer

Elasticsearch pattern = <HTTP Verb> /<Index>/<Endpoint>/<ID>

Elasticsearch provides data manipulation and search capabilities in near real time. By default, you can expect a one second delay (refresh interval) from the time you index/update/delete your data until the time it appears in your search results. This is an important distinction from other platforms such as SQL, where data is immediately available after a transaction completes.
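If you need a document to be searchable immediately (e.g. in tests), you can force a refresh explicitly with the Refresh API; a minimal sketch (avoid it in production, it is expensive):

POST /customer/_refresh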

Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.
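Besides full re-indexing via PUT, the Update API accepts a partial document to merge into the existing one. For example (this also adds the age field that the script below relies on):

POST /customer/_update/1?pretty
{
  "doc": { "name": "Jane Doe", "age": 20 }
}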

Updates can also be performed by using simple scripts. This example uses a script to increment the age by 5:

POST /customer/_update/1?pretty
{
  "script" : "ctx._source.age += 5"
}

As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:

POST /customer/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }

This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:

POST /customer/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}

The Bulk API does not fail due to failures in one of the actions. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in).
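The response is shaped roughly like this (an abbreviated sketch; real responses carry more fields per item):

{
  "took": 30,
  "errors": false,
  "items": [
    { "update": { "_index": "customer", "_id": "1", "status": 200 } },
    { "delete": { "_index": "customer", "_id": "2", "status": 200 } }
  ]
}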

Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format.
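For example, via the request URI method (this one-liner is equivalent to the body search shown right after):

GET /bank/_search?q=*&sort=account_number:asc&pretty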

An example of the request body method:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

Introducing the Query Language: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-query-lang.html

Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.

GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 1,
  "sort": { "balance": { "order": "desc" } }
}

Note that if size is not specified, it defaults to 10; if from is not specified, it defaults to 0.

By default, the full JSON document is returned as part of all searches. This is referred to as the source (_source field in the search hits). If we don’t want the entire source document returned, we have the ability to request only a few fields from within source to be returned.
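For example, to return only two fields of the source:

GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}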

Let’s move on to the query part. Previously, we’ve seen how the match_all query is used to match all documents. Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).

This example returns the account numbered 20:

GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}

This example returns all accounts containing the term "mill" or "lane" in the address:

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

Let’s now introduce the bool query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.

This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:

GET /bank/_search

{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.

This example returns all accounts of anybody who is 40 years old but doesn't live in ID (Idaho):

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
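For completeness, a should clause expresses "or" semantics; this returns all accounts containing "mill" or "lane" in the address:

GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}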

In the previous section, we skipped over a little detail called the document score (_score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query we specified. The higher the score, the more relevant the document; the lower the score, the less relevant it is.

But queries do not always need to produce scores, in particular when they are only used for "filtering" the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.

The bool query that we introduced in the previous section also supports filter clauses which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response.

To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.

Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all the aggregations. You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.

Building on the previous aggregation, let’s now sort on the average balance in descending order:


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Pretty Results

When appending ?pretty=true to any request, the JSON returned will be pretty-formatted (use it for debugging only!). Another option is to set ?format=yaml, which will cause the result to be returned in the (sometimes) more readable YAML format.
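For example:

GET /bank/_search?pretty=true
GET /bank/_search?format=yaml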

Human readable output

Statistics are returned in a format suitable for humans (e.g. "exists_time": "1h" or "size": "1kb") and for computers (e.g. "exists_time_in_millis": 3600000 or "size_in_bytes": 1024). The human readable values can be turned off by adding ?human=false to the query string. This makes sense when the stats results are being consumed by a monitoring tool, rather than intended for human consumption. The default for the human flag is false.
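For example, to read index stats with the human-friendly units turned on:

GET /customer/_stats?human=true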

Date Math

Most parameters which accept a formatted date value — such as gt and lt in range queries, or from and to in daterange aggregations — understand date math.

The expression starts with an anchor date, which can either be now, or a date string ending with ||. This anchor date can optionally be followed by one or more maths expressions:

  • +1h: Add one hour
  • -1d: Subtract one day
  • /d: Round down to the nearest day

The docs page lists a lot of other useful tips.
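As a sketch (assuming an index with a @timestamp date field, which the sample bank data lacks), a range query covering yesterday as a whole day:

GET /logs/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  }
}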

The Elastic Common Schema (ECS) is an open source specification, developed with support from the Elastic user community. ECS defines a common set of fields to be used when storing event data in Elasticsearch, such as logs and metrics.

ECS defines "Core" and "Extended" fields.

  • Core fields. Fields that are most common across all use cases are defined as core fields. These generalized fields are used by analysis content across use cases. Focus on populating these fields first.

  • Extended fields. Any field that is not a core field is defined as an extended field. Extended fields may apply to more narrow use cases, or may be more open to interpretation depending on the use case. Extended fields are more likely to change over time.

Each ECS field in a table is identified as core or extended.

General guidelines

  • The document MUST have the @timestamp field.
  • Use the data types defined for an ECS field.
  • Use the ecs.version field to define which version of ECS is used.
  • Map as many fields as possible to ECS.

Guidelines for field names

  • Field names must be lower case

  • Combine words using underscore

  • No special characters except underscore

  • Use present tense unless field describes historical information.

  • Use singular and plural names properly to reflect the field content.

    • For example, use requests_per_sec rather than request_per_sec.
  • Use prefixes for all fields, except for the base fields.

    • For example, all host fields are prefixed with "host.". Such a grouping is called a field set.
  • Nest fields inside a field set with dots

    • The document structure should be nested JSON objects. If you're ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.
  • General to specific. Organise the nesting of field sets from general to specific, to allow grouping fields into objects with a prefix like host.*.

  • Avoid repetition or stuttering of words

    • If part of the field name is already in the name of the field set, avoid repeating it. Example: host.host_ip should be host.ip.
    • Exceptions can be made, when changing the name of the field would break a strong convention in the community. Example: host.hostname is an exception to this rule.
  • Avoid abbreviations when possible

    • Exceptions can be made, when the name used for the concept is too strongly in favor of the abbreviation. Example: ip fields, or field sets such as os, geo.
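Putting these guidelines together, a minimal ECS-style document could look like this (hypothetical index name and values, for illustration only):

PUT /ecs-example/_doc/1
{
  "@timestamp": "2019-05-29T13:18:00Z",
  "ecs": { "version": "1.0.0" },
  "message": "user logged in",
  "host": {
    "hostname": "web-01",
    "ip": "10.42.42.42"
  }
}

Note the nested host object (not a flat "host.ip" string key) and the host.hostname exception mentioned above.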

Remember that IDs and most codes are keywords, not integers.
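For example, an explicit mapping that stores an order ID as a keyword (hypothetical index and field names):

PUT /orders
{
  "mappings": {
    "properties": {
      "order_id": { "type": "keyword" }
    }
  }
}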

Elasticsearch can index text using these datatypes:

  • text: Text indexing allows for full text search, or searching arbitrary words that are part of the field.
  • keyword: Keyword indexing offers faster exact match filtering, prefix search (like autocomplete), and makes aggregations (like Kibana visualizations) possible.

Some notable ECS base fields:

  • @timestamp: Date/time when the event originated.
  • labels: Custom key/value pairs. Can be used to add meta information to events. Should not contain nested objects. All values are stored as keyword. Example: docker and k8s labels.
    type: object
    example: {'application': 'foo-bar', 'env': 'production'}
  • message: For log events the message field contains the log message, optimized for viewing in a log viewer.
  • tags: List of keywords used to tag each event.
    type: keyword
    example: ["production", "env2"]

The agent fields contain the data about the software entity, if any, that collects, monitors, or observes events. Beats are an example of such an agent.
