@ktkaushik
Last active December 23, 2015 04:19
Basics : Elasticsearch

# Elasticsearch

## My findings

#### Features

  • Real-Time search
  • Easy distributed indexing and searching capabilities
  • Automatically knocking off failed nodes to ensure safety of data
  • Document oriented
  • Schema-free
  • RESTful

### Practical Usage

There are myriad cases in which elasticsearch is useful. Some use cases call for it more clearly than others. Listed below are some tasks for which elasticsearch is particularly well suited.

  • Searching a large number of product descriptions for the best match for a specific phrase (say “chef’s knife”) and returning the best results
  • Given the previous example, breaking down the various departments where “chef’s knife” appears
  • Searching text for words that sound like “season”
  • Auto-completing a search box from partially typed words, using previously issued searches and accounting for misspellings
  • Storing a large quantity of semi-structured (JSON) data in a distributed fashion, with a specified level of redundancy across a cluster of machines
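
As a concrete illustration of the first use case, a full-text search request is just a JSON body. A minimal sketch (the "description" field name and the size of 10 are hypothetical choices, not anything prescribed by elasticsearch):

```python
import json

# Hypothetical request body for a full-text search over product
# descriptions; "description" is an assumed field name.
query = {
    "query": {
        "match": {
            "description": "chef's knife"
        }
    },
    "size": 10  # ask for the 10 best-scoring hits
}

# This dict would be serialized and sent as the body of a search request.
body = json.dumps(query)
```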

### Problems

It’s actually bad at solving problems for which relational databases are optimized, such as those listed below. This may be because it stores data in a schema-free manner.

  • Calculating how many items are left in the inventory
  • Figuring out the sum of all line-items on all the invoices sent out in a given month
  • Executing two operations transactionally with rollback support
  • Creating records that are guaranteed to be unique across multiple given terms, for instance a phone number and extension

### Installation

Refer to this page

Get some preloaded data here

# Document and Field

The smallest individual unit of data in elasticsearch is a field. A field contains a single piece of data, like the number 42 or the string "Hello, World!", or a single list of data of the same type, such as the array [5, 6, 7, 8].

Documents are collections of fields, and comprise the base unit of storage in elasticsearch; something like a row in a traditional RDBMS. The reason a document is considered the base unit of storage is because, peculiar to Lucene, all field updates fully rewrite a given document to storage (while preserving unmodified fields).
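
A toy illustration of that rewrite-on-update behavior, with plain Python dicts standing in for stored documents (this mirrors the idea, not the actual Lucene API):

```python
def update_field(stored_doc, field, value):
    """Simulate a Lucene-style field update: the whole document is
    rewritten to storage, with unmodified fields carried over unchanged."""
    new_doc = dict(stored_doc)  # a full copy, not an in-place patch
    new_doc[field] = value
    return new_doc

doc_v1 = {"username": "ktkaushik", "repos": 24}
doc_v2 = update_field(doc_v1, "repos", 25)
# doc_v1 is untouched; doc_v2 is a complete, rewritten document
# in which the unmodified "username" field was preserved.
```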

### Everything is JSON

As mentioned earlier, the data is always stored as JSON.

Example

{
    "_id":"1",
    "username":"ktkaushik",
    "repos":"24",
    "languages":["ruby", "javascript", "coffeescript"]
}

Some key names are reserved for internal use by Elasticsearch. For example, the _id field above is reserved and must be unique for every document. While elasticsearch deals exclusively with JSON, internally the JSON is converted to flat fields for Lucene’s key/value API. Arrays in documents are mapped to Lucene multi-values.
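
A rough sketch of that flattening, where nested keys become dotted field names and arrays become multi-values (this mirrors the idea, not Elasticsearch's actual internal code):

```python
def flatten(doc, prefix=""):
    """Flatten a JSON-like document into Lucene-style flat fields.
    Every field becomes a list of values; arrays map directly to
    multi-valued fields."""
    fields = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested objects flatten to dotted field names.
            fields.update(flatten(value, prefix=name + "."))
        else:
            fields[name] = value if isinstance(value, list) else [value]
    return fields

doc = {
    "_id": "1",
    "username": "ktkaushik",
    "languages": ["ruby", "javascript", "coffeescript"],
}
flat = flatten(doc)
# flat["languages"] is a multi-value holding all three strings
```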

#Clustering and Indexing

The elastic part of Elasticsearch is inspired by its promising clustering capabilities: seamlessly adding and removing servers to expand capacity. To understand this, we first need to understand the basics of Lucene indexing.

### Lucene & Elasticsearch Indexing

A Lucene index is subdivided into a variable number of segments at any given time. Each of these segments is a completely separate index in and of itself.

Lucene indexes create more segments as documents are added, and as segments accumulate, Lucene tries to merge them back into fewer segments. The smaller the number of segments, the faster operations run (one segment is optimal). Merging has costs of its own, however, so Lucene attempts to merge segments at a rate that balances merge cost against search efficiency.

It is due to this architecture that searches can be seamlessly executed over multiple indexes. A search over 2 indexes with 1 segment apiece is almost identical to a search over 1 index with 2 segments.
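
A toy model of that equivalence: each segment is searched independently and the hits are combined, so searching two one-segment indexes and searching one two-segment index do the same work. (Purely illustrative; real Lucene segments are inverted indexes, not lists.)

```python
def search_segments(segments, term):
    """Search each segment independently and combine the hits,
    the way a multi-segment (or multi-index) search works."""
    hits = []
    for segment in segments:  # one lookup per segment
        hits.extend(doc for doc in segment if term in doc)
    return hits

def merge(segments):
    """Merge all segments into one, trading merge cost now
    for cheaper searches later."""
    merged = []
    for segment in segments:
        merged.extend(segment)
    return [merged]

two_segments = [["chef's knife", "season"], ["knife sharpener"]]
one_segment = merge(two_segments)

# Same results either way, which is why a search over 2 indexes with
# 1 segment each behaves like a search over 1 index with 2 segments.
assert search_segments(two_segments, "knife") == search_segments(one_segment, "knife")
```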

But Elasticsearch rolls a little differently. ES is a wrapper that sits on top of Lucene's indexing architecture. By default, it splits each index's data across 5 shards.

A shard itself is a single logical index, but is comprised of a number of Lucene indexes (a primary and a configurable number of replicas), all of which contain the same documents but are full Lucene indexes in and of themselves. Multiples are created to allow for both durability guarantees and distributed search scalability across clusters.

#### On adding a document

When a document is added to the elasticsearch index, it is routed to the proper shard based on its id. When a search is executed, it is run in parallel over all the shards in an index (on either a primary or a replica Lucene index), and then the results are combined. This means that splitting your documents over one elasticsearch index with 5 shards is equivalent to manually splitting your data over 5 elasticsearch indexes with one shard each.
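
The routing rule above can be sketched as a hash of the document id modulo the shard count. (Elasticsearch uses a murmur3 hash internally; crc32 stands in for it here.)

```python
import zlib

def route_to_shard(doc_id, num_shards=5):
    """Pick the shard for a document id. The same id always routes
    to the same shard, so a lookup by id hits exactly one shard.
    crc32 stands in for the murmur3 hash Elasticsearch uses."""
    return zlib.crc32(doc_id.encode()) % num_shards

# Routing is deterministic: the same id always lands on the same
# of the 5 default shards.
shard_a = route_to_shard("1")
shard_b = route_to_shard("1")
```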

### Index Durability

Durability in elasticsearch is implemented by its replica feature, whereby data is mirrored to multiple servers simultaneously. By default, elasticsearch indexes have a replica count of 1. This means that each piece of data will exist on at least 2 servers in a running elasticsearch cluster: once in a primary location, and once in a secondary location. Upping the replica count to 4 would mean that the same piece of data is guaranteed to exist on at least 5 separate servers.

Should a server fail, elasticsearch will self-heal. Given a replica count of 1 and a cluster consisting of 3 servers, each server will hold 2/3 of the cluster's data. In this scenario the loss of a single server is tolerated without data loss. If a single server in that setup were to fail, the cluster state, visible at the cluster health endpoint /_cluster/health, would change from green to yellow: some data on the cluster would be present on only a single server, which would cause elasticsearch to re-balance the replicas, dividing replica indexes evenly between the 2 remaining servers. Should the third server be fixed and added back in, elasticsearch would re-migrate the data back across all 3 servers. If two of the three servers were to fail, the state would change from yellow to red.
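
The arithmetic behind those guarantees is simple: with a replica count of r, every shard exists as r + 1 copies, so up to r servers can fail without losing any shard entirely. A sketch of that bookkeeping (not the actual allocator logic):

```python
def total_copies(replica_count):
    """Total copies of each shard: one primary plus the replicas."""
    return replica_count + 1

def tolerated_failures(replica_count):
    """Servers that can be lost without losing any shard entirely:
    one per replica, since the primary's data survives on replicas."""
    return replica_count

# Default replica count of 1: each piece of data exists on at least
# 2 servers, so losing a single server costs no data.
# Replica count of 4: at least 5 servers hold each piece of data.
```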

### Write Durability

In an elasticsearch cluster one may write data at one of three consistency levels:

  • all
  • quorum
  • one

..with decreasing guarantees for data-durability.

  • The one consistency level is easy enough to understand: a single node will receive the data and persist it before acknowledging the write. After that point, the data will eventually be replicated to all replicas of the shard.
  • The all consistency level is similarly simple: each and every replica in the shard acknowledges the write before a response is returned.
  • Lastly, when using the default quorum consistency level, a majority of the shard's copies must acknowledge the write before a response is returned.


The one consistency level will return most quickly, followed by the slower quorum consistency level, and finally the all consistency level, which is slowest of all. For most applications the quorum consistency level is a good trade-off between safety and performance.
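
The three levels boil down to how many copies must acknowledge a write. A sketch of that calculation, with quorum as the usual majority formula over all copies (primary plus replicas):

```python
def required_acks(consistency, replica_count):
    """Copies that must acknowledge a write at each consistency level."""
    copies = replica_count + 1  # primary + replicas
    if consistency == "one":
        return 1                 # fastest, weakest guarantee
    if consistency == "all":
        return copies            # slowest, strongest guarantee
    if consistency == "quorum":
        return copies // 2 + 1   # majority of all copies
    raise ValueError("unknown consistency level: " + consistency)

# With the default replica count of 1 (2 copies total):
# one -> 1 ack, quorum -> 2 acks, all -> 2 acks.
```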

# River

The River is a constant stream of data, the push and pull that intends to keep the data Real Time.

That constant data stream can come in different forms and from different sources. It can come directly from a user in an application that uses elasticsearch directly: for example, publishing a new status message, a new blog comment, or a review of a restaurant in apps that automatically apply that change to elasticsearch.
