Elasticsearch exposes many metrics that can be used to determine whether a cluster is healthy. Listed below are the metrics that are currently worth monitoring, the reasons to monitor them, and possible recourse when problems appear.
Unless otherwise noted, all of the API requests work starting with 1.0.0.
Metrics are an easy way to monitor the health of a cluster, and they can be accessed directly from the HTTP API. Each metrics table is broken down by source; if a metric is not available in 1.0.0, the version of Elasticsearch required to retrieve it is noted next to its name.
Each metric has an associated warning level and error level. These levels indicate when the metric merely needs closer watching versus when it should be acted upon as soon as possible.
Further reading that adds to these details can be found in the Elasticsearch guide's coverage of monitoring individual nodes for the latest version of Elasticsearch.
With the release of 1.0.0 came the introduction of the _cat APIs. These APIs add a human element to metrics by returning them in a tabular, easy-to-read format. At any time, you can quickly find all available _cat APIs by making a request to the top-level _cat endpoint:
# Find all available _cat APIs available to you
# (this was run on Elasticsearch 1.2.2):
curl -XGET localhost:9200/_cat
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
As you read through the rest of this document, feel free to cross reference those requests with the options found here. For example:
# Ensure that both data nodes see the same master
curl localhost:9200/_cat/master?v
id host ip node
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin
# Check other data nodes
curl localhost:9201/_cat/master?v
id host ip node
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin
# Check all nodes in the cluster
curl localhost:9200/_cat/nodes?v
host ip heap.percent ram.percent load node.role master name
my-hostname 192.168.1.31 12 75 3.84 d * Amergin
my-hostname 192.168.1.31 18 75 3.84 d m Deathlok
The _cat/master API tells you, on a per-node basis, which node it sees as its master (notice the port changes between the two requests). This is generally much easier for a human to check than the equivalent /_cluster/state/master_node request. Details on the various _cat APIs can be found in the Elasticsearch _cat API documentation.
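Each _cat API also accepts a few helpful query parameters: v adds a header row, help lists the columns an endpoint can display, and h selects specific columns. The column names below are taken from the default _cat/nodes output shown above; exact parameter support may vary slightly by version:
# List every column that _cat/nodes knows how to display
curl -XGET localhost:9200/_cat/nodes?help
# Display only the columns you care about
curl -XGET localhost:9200/_cat/nodes?v\&h=host,heap.percent,load,master,name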
Node information represents details about individual nodes in a cluster that are critical to stability.
All API calls in this table must be prefixed by /_nodes/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.
These metrics are currently only concerned with versions. Except during upgrades (when a temporary mismatch is expected), every node should run the same Java and Elasticsearch versions.
# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/{api-call}?human\&pretty
For example:
# Java Version
curl -XGET localhost:9200/_nodes/jvm?human\&pretty
{
"cluster_name" : "elasticsearch-cluster-name",
"nodes" : {
"8Y_a2JvMRnGGX3ypdL85Dw" : {
"name" : "Amergin",
"transport_address" : "inet[my-hostname/192.168.1.31:9300]",
"host" : "my-hostname",
"ip" : "192.168.1.31",
"version" : "1.2.2",
"build" : "9902f08",
"http_address" : "inet[/192.168.1.31:9200]",
"jvm" : {
"pid" : 16669,
"version" : "1.8.0_20",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.20-b23",
"vm_vendor" : "Oracle Corporation",
"start_time" : "2014-11-18T20:24:00.063Z",
"start_time_in_millis" : 1416342240063,
"mem" : {
"heap_init" : "256mb",
"heap_init_in_bytes" : 268435456,
"heap_max" : "990.7mb",
"heap_max_in_bytes" : 1038876672,
"non_heap_init" : "2.4mb",
"non_heap_init_in_bytes" : 2555904,
"non_heap_max" : "0b",
"non_heap_max_in_bytes" : 0,
"direct_max" : "990.7mb",
"direct_max_in_bytes" : 1038876672
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
}
}
}
}
To access the Java Version, nodes.8Y_a2JvMRnGGX3ypdL85Dw.jvm.version would be the expansion for nodes.{node}.{json-path}.
Metric | API Call | JSON Path | Explanation | Solution |
---|---|---|---|---|
Java Version | jvm | jvm.version | Elasticsearch is written in Java. In rare cases, intercommunication between nodes relies on behavior that changes between Java releases, which can cause issues when the Java version does not match on both sides of the communication. Not maintaining the same version can therefore lead to unexpected failures at this level. | Maintain the same version of Java installed on all servers running Elasticsearch. |
Elasticsearch Version | jvm | version | Elasticsearch is written with backward compatibility in mind, but non-backward compatible changes sometimes occur to support desired features and improve performance. The version field is returned at the node level by any node info request, including /_nodes/jvm, as shown in the example above. | Maintain the same version of Elasticsearch installed on all nodes. |
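A quick way to spot version drift is to pull the JVM info for every node and scan the version fields; each node reports both its Elasticsearch version and its Java version in that output. A minimal sketch (the grep is purely illustrative):
# Every node should report the same Elasticsearch "version" and jvm "version"
curl -s -XGET localhost:9200/_nodes/jvm?pretty | grep '"version"'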
For the memory- and file-related metrics that follow, distributing the workload across more nodes is also a potential solution to capacity constraints.
All API calls in this table must be prefixed by /_nodes/stats/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.
# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/stats/{api-call}?human\&pretty
For example:
# File Descriptors
curl -XGET localhost:9200/_nodes/stats/process?human\&pretty
{
"cluster_name" : "elasticsearch-cluster-name",
"nodes" : {
"8Y_a2JvMRnGGX3ypdL85Dw" : {
"timestamp" : 1416458421494,
"name" : "Amergin",
"transport_address" : "inet[my-hostname/192.168.1.31:9300]",
"host" : "my-hostname",
"ip" : [ "inet[my-hostname/192.168.1.31:9300]", "NONE" ],
"process" : {
"timestamp" : 1416458421494,
"open_file_descriptors" : 435,
"cpu" : {
"percent" : 1,
"sys" : "7m",
"sys_in_millis" : 420451,
"user" : "32.1m",
"user_in_millis" : 1927525,
"total" : "39.1m",
"total_in_millis" : 2347976
},
"mem" : {
"resident" : "344.9mb",
"resident_in_bytes" : 361709568,
"share" : "-1b",
"share_in_bytes" : -1,
"total_virtual" : "4.8gb",
"total_virtual_in_bytes" : 5162328064
}
}
}
}
}
To access the File Descriptors, nodes.8Y_a2JvMRnGGX3ypdL85Dw.process.open_file_descriptors would be the expansion for nodes.{node}.{json-path}.
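Because the File Descriptors levels below are percentages of the maximum, it helps to read the current and maximum values together. The maximum is typically reported as process.max_file_descriptors by the node info API (an assumption to verify on your version), while current usage comes from the node stats call shown above:
# Maximum allowed file descriptors per node (node info API)
curl -XGET localhost:9200/_nodes/process?human\&pretty
# Currently open file descriptors per node (node stats API, as shown above)
curl -XGET localhost:9200/_nodes/stats/process?human\&pretty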
Metric | API Call | JSON Path | Warning | Error | Explanation | Solution |
---|---|---|---|---|---|---|
Java GC CMS | jvm | jvm.gc.old.collection_count | Increased by greater than X per minute. | Increased by greater than X per minute. | The number of Concurrent Mark Sweep collections that run per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Bursts in server load should be reflected by bursts in collections, but they should stabilize with the load. On a cluster that constantly needs to collect more and more, the risk is that more and more time is spent doing garbage collection rather than processing. In worst case scenarios, this can lead to slow responses and out of memory issues that eventually lead to failed nodes. | Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server. |
Available Disk Space | fs/data | fs.data.available_in_bytes | 20% of total disk space left | 10% of total disk space left | Running out of disk space means that nothing can be inserted or updated. As a result, the node will fail. | Add more disk space. |
File Descriptors | process | process.open_file_descriptors | 70% of the maximum number of file descriptors | 90% of the maximum number of file descriptors | File descriptors are used for connections and file operations. As Elasticsearch grows and scales, this number will increase, particularly when it is under heavy load. If this number reaches the maximum, then new connections and file operations cannot occur until old ones have closed, which will cause intermittent node failures. The current maximum value can be read by calling the node info API (/_nodes/process reports it as process.max_file_descriptors). | Increase the system's maximum file descriptor count, which is OS specific (see ulimit for many Linux distributions). |
Java Heap Size | jvm | jvm.mem.heap_used_percent | 80% of total heap for 10 minutes | 90% of total heap for 10 minutes | The Java Virtual Machine (JVM) heap is the main memory used by the Java processes, which includes Elasticsearch. Like any other process, if it runs out of memory, then it will crash and lead to node failures. | Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server. |
HTTP Connections | http | http.total_opened | Increases by greater than 50 per minute. | Increases by greater than 100 per minute. | The number of HTTP connections can be indicative of server demand by request, but not by content. Some requests are much easier to fulfill than others, but having too many requests--simple or complex--can cause issues for a node. | Better distributing the workload (adding more nodes) is the easiest way to reduce the number of connections to a particular node. In some cases, it may be possible to tune client software to send fewer requests per minute if the software is written to poll Elasticsearch on an interval. Naturally, avoiding the need to make requests in the first place, such as not duplicating requests unnecessarily, is the easiest way to reduce the number that come in. Many clients, such as the PHP and JavaScript clients, create a new connection for every request; in those cases it cannot be avoided, but in other cases it can. For instance, the .NET client defaults to HTTP pipelining, which allows a single HTTP connection to be reused. Using appropriate keep-alive times, persistent connections, and pipelining from the client can greatly reduce the number of connections, which reduces network overhead. For clients that do not allow those features to be controlled, it can help to set up a proxy in front of Elasticsearch (e.g., Nginx) so that the proxy can use those features, which allows you to continue using "harmful" clients while still reducing the number of overall connections. |
Thread Pool Rejections | thread_pool | thread_pool.POOL.rejected | Increased by greater than X per minute. | Increased by greater than X per minute. | The number of rejected threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. A rejected thread means the requested action did not occur (e.g., nodes.NODE.thread_pool.get.rejected indicates failed get requests) because the associated thread pool was full; it does not mean that it started, but failed later. | Adding more available threads to the problematic thread pool can lower the number of rejections, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel. |
Thread Pool Queue | thread_pool | thread_pool.POOL.queue | Increased by greater than X per minute. | Increased by greater than X per minute. | The number of queued threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Optimally, this value should be 0, but peak periods may reasonably see threads queued. A queued thread means the requested action has not occurred yet (e.g., nodes.NODE.thread_pool.get.queue indicates delayed get requests) because the associated thread pool was full; it is waiting to be processed and has not yet been rejected. | Adding more available threads to the problematic thread pool can lower the number of queued threads, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel. |
Load Average | os | os.load_average.AVERAGE | Relative to nodes.NODE.os.available_processors from /_nodes/os. | Relative to nodes.NODE.os.available_processors from /_nodes/os. | The average processor load on the node. As the load approaches complete utilization of each processor, other server processes cannot run and some threads within the Elasticsearch process itself will likely execute more slowly. The number of available processors is reported as os.available_processors by the /_nodes/os node info call. | Add more processors to the server chassis or virtual machine. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel. |
Filter Cache Size | indices/filter_cache | indices.filter_cache.memory_size_in_bytes | Total sum of cache sizes is greater than 60% of heap size. | Total sum of cache sizes is greater than 70% of heap size. | Elasticsearch uses caches to speed up frequently performed actions. If the caches take up too much memory, then it is possible to get into situations where the rest of the Elasticsearch process is waiting for memory to become available, which may cause actions to run more slowly. | If the warning or error levels are reached, then lowering the size of the worst offending or least used caches can help to speed up Elasticsearch. Note: the ID Cache is the in-memory join table maintaining parent/child relationships; there is currently no setting to control the amount of memory it uses and not much can be done to affect its footprint, but because it resides on the heap it is still a good idea to monitor its usage (starting in 1.1.0, the ID Cache is actually stored in the Field Data Cache, although both metrics are still reported separately). Each cache can be cleared independently, or all at the same time, using the clear cache API (see the sketch after this table); if the settings still allow the behavior causing concern, then clearing the cache may just delay the problem. Updating the specific settings for each cache can control situations that consistently cause issues; for example, setting indices.fielddata.cache.size (setting name assumed from the 1.x defaults) limits how much heap the Field Data Cache may use. |
ID Cache Size | indices/id_cache | indices.id_cache.memory_size_in_bytes | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size |
Field Data Size | indices/fielddata | indices.fielddata.memory_size_in_bytes | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size |
Percolate Size | indices/percolate | indices.percolate.memory_size_in_bytes | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size |
Query Cache Size (>= 1.4.0) | indices/query_cache | indices.query_cache.memory_size_in_bytes | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size | Same as Filter Cache Size |
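To watch the caches above without issuing four separate requests, the indices-level node stats return all of them at once; the clear cache API referenced in the Filter Cache Size row can then reset them. A minimal sketch (the clear-cache parameter names are assumed from the 1.x API and should be verified for your version):
# Cache sizes for every node (includes filter cache, ID cache, field data, and percolate)
curl -XGET localhost:9200/_nodes/stats/indices?human\&pretty
# Clear the filter and field data caches across all indices
curl -XPOST localhost:9200/_cache/clear?filter=true\&field_data=true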
The overall health of the cluster is an important aspect of Elasticsearch deployments that have more than a single shard. Understanding these core metrics can lead to a more stable deployment.
All API calls in this table must be prefixed by /_cluster/{api-call}. Each JSON Path represents the relevant field starting at the returned object's root (just {json-path}) from the specified API Call.
# This is how all of these commands should look:
curl -XGET localhost:9200/_cluster/{api-call}?pretty
For example:
# Status
curl -XGET localhost:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch-cluster-name",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 26,
"active_shards" : 26,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 26
}
To access the Status, status would be the expansion for {json-path}.
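The health API can also block until a desired status is reached, which makes it convenient for simple alerting and deployment scripts; a minimal sketch using the wait_for_status and timeout parameters:
# Returns once the cluster is at least "yellow", or after 30 seconds
# (the response reports "timed_out" : true if the status was not reached in time)
curl -XGET localhost:9200/_cluster/health?wait_for_status=yellow\&timeout=30s\&pretty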
Metric | API Call | JSON Path | Warning | Error | Explanation | Solution |
---|---|---|---|---|---|---|
Status | health | status | The status is "yellow" for more than 5 minutes. | The status is "red". | "green" is the desired status, which indicates a healthy cluster that has properly allocated all primary and replica shards. A "yellow" status indicates that at least one replica is missing, but all data is still searchable. The worst is the "red" status, which indicates that at least one primary shard, as well as its replicas, is missing; this means that searches will return partial results and indexing into the missing shard(s) will cause an exception. | Investigate the cause of any node, shard, or replica issues by checking the logs and monitoring other metrics. Once recovered, it is important to ensure that the original problem does not repeat itself. For example, if a node failed because its disk was full, then restarting it will not prevent the issue from immediately recurring. |
Data Nodes | health | number_of_data_nodes | The value is less than expected. | | Elasticsearch clusters are only healthy if all data nodes are available. If data nodes are missing, then it is possible to get only partial results from queries, as well as to fail to index data that would otherwise be stored there. | |
Master Node | state/master_node | master_node | The value is different for any node. | | If any node is in disagreement about the master_node, then problems can quickly occur because the cluster is not in a safe state. A disagreement on the master_node is called a "split brain". In a split brain situation, there is effectively more than one cluster: multiple nodes are still running, but some cannot talk to the others, so each side thinks that it is _the_ cluster and behaves as though the other nodes are simply missing from it. The best way to detect a split brain is by checking the master reported by every node individually (e.g., with _cat/master as shown earlier, or the sketch after this table). Under the worst circumstances, outside connections are able to see each side and assume that they are working together in the background; in this scenario, each side can accept writes independently, leading to conflicting data. | Stop the problematic nodes as soon as possible to avoid issues. Before restarting any nodes, ensure that network connectivity between all nodes is working properly so that intercommunication can happen. |
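Because split brain detection requires asking every node individually, a small loop over each node's HTTP endpoint makes the comparison easy (the ports below are placeholders for your own node addresses); every line should report the same master id:
# Ask each node who it thinks the master is; the id must match on every line
for port in 9200 9201; do
  curl -s localhost:$port/_cat/master
done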
The overall health of an index is an important aspect of all searches, indexes (writes), and retrievals. If an index is having problems, then all users of the index will be directly affected.
All API results in this table come from /_stats?level=cluster (not specifying the level shows all indices by default, which is unnecessary here). Each JSON Path represents the relevant field starting at the returned _all.total sub-object (_all.total.{json-path}) from the API Call.
To determine the relative warning/error state, you must monitor values between lookups, looking for relatively large variance.
# This is how all of these commands should look:
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty
For example:
# Total Search Requests
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty
{
"_shards" : {
"total" : 52,
"successful" : 26,
"failed" : 0
},
"_all" : {
... removed for brevity ...
"total" : {
"docs" : {
"count" : 495599,
"deleted" : 0
},
"store" : {
"size" : "362.4mb",
"size_in_bytes" : 380072547,
"throttle_time" : "2.1m",
"throttle_time_in_millis" : 128652
},
... removed for brevity ...
"search" : {
"open_contexts" : 0,
"query_total" : 8664,
"query_time" : "16.4s",
"query_time_in_millis" : 16458,
"query_current" : 0,
"fetch_total" : 8664,
"fetch_time" : "2.4s",
"fetch_time_in_millis" : 2484,
"fetch_current" : 0
},
... removed for brevity ...
}
}
}
To access the Total Search Requests, _all.total.search.query_total would be the expansion for _all.total.{json-path}.
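Because these are lifetime counters, the useful signal is the change between two lookups rather than the absolute value. A minimal sketch of sampling the delta by hand (the 60-second interval and the python one-liner used for extraction are illustrative only):
# Sample the total query counter twice, one minute apart; the difference is the
# number of queries served during that minute.
curl -s localhost:9200/_stats?level=cluster | python -c 'import json,sys; print(json.load(sys.stdin)["_all"]["total"]["search"]["query_total"])'
sleep 60
curl -s localhost:9200/_stats?level=cluster | python -c 'import json,sys; print(json.load(sys.stdin)["_all"]["total"]["search"]["query_total"])'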
Metric | JSON Path | Explanation | Solution |
---|---|---|---|
Total Search Requests | search.query_total | The total number of queries (searches). | Determine the cause for the sudden surge or drop in queries. It is possible that a connected application has lost all network connectivity to the cluster. Surges in queries could be innocent, unintentionally looped queries in application code, or the sign of possible abuse through connected applications. |
Total Search Request Time | search.query_time_in_millis | The total time spent on queries (searches), in milliseconds. | In addition to looking at the number of requests, the complexity of requests is important. The search slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled (see the example after this table). |
Total Index Requests | indexing.index_total | The total number of indexes (writes). | Determine the cause for the sudden surge or drop in indexes. It is possible that a connected application has lost all network connectivity to the cluster. Surges in indexing could be innocent, unintentionally looped operations in application code, or the sign of possible abuse through connected applications. |
Total Index Request Time | indexing.index_time_in_millis | The total time spent on indexing (writing), in milliseconds. | In addition to looking at the number of requests, the complexity of requests is important. The index slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled (see the example after this table). |
Successful Get Requests | get.exists_total | The total number of get requests. | Determine the cause for the sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, unintentionally looped operations in application code, or the sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally does a get request for document(s)). |
Successful Get Request Time | get.exists_time_in_millis | The total time spent on successful get requests, in milliseconds. | Get requests can be performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster. |
Missed Get Requests | get.missing_total | The total number of get request misses. | Determine the cause for the sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, unintentionally looped operations in application code, or the sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally does a get request for document(s)). The most obvious cause of frequently missed get requests is applications that use predictable IDs defined by a given pattern (e.g., associated with a user's ID) against indices that do not always have those documents defined. Depending on the application, this may or may not be an issue. |
Missed Get Request Time | get.missing_time_in_millis | The total time spent on get request misses, in milliseconds. | Get requests can be performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster. |
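As noted in the table above, the search and index slow logs must be enabled before they record anything. They are controlled per index through index settings; a minimal sketch with illustrative thresholds (my-index and the times are placeholders to tune for your own latency expectations):
# Enable slow logging for one index (replace my-index with your own index name)
curl -XPUT localhost:9200/my-index/_settings -d '{
  "index.search.slowlog.threshold.query.warn" : "10s",
  "index.search.slowlog.threshold.fetch.warn" : "1s",
  "index.indexing.slowlog.threshold.index.warn" : "10s"
}'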
Some information is available in Elasticsearch's logs. The following table provides some of the log output that should trigger a warning or alert based on the assumed severity. Each log line should be loosely interpreted as a regular expression, so .* effectively represents a placeholder.
Metric | Monitored Text | Seriousness | Explanation | Solution |
---|---|---|---|---|
Out of Memory | java.lang.OutOfMemoryError | Error | The running node ran out of memory. | Determine the cause of the error and adjust the memory settings accordingly (e.g., add more memory to the server or reduce cache sizes). Be sure to use an appropriate Java heap size for your environment; the default Java heap size (roughly 1 GB, as the heap_max in the example output above shows) is generally too small for production workloads. |
File Descriptor Issues | java.io.*Exception .* (Too many open files) | Error | Elasticsearch has tried to open too many file descriptors. | This generally means that you need to adjust the OS-level file descriptor settings to increase the number available to a single process. |
Internal Communication Failures | java.io.StreamCorruptedException: invalid internal transport message format | Error | Elasticsearch communication failed internally between nodes. | The solution depends largely on the reason for the failure. The most common failure is because nodes are running on different JVM versions. |
Corrupted Translog | failed to retrieve translog after .* operations, ignoring the rest, considered corrupted | Error | Elasticsearch cannot parse a translog, which could mean data loss has occurred after a failed restart. | The solution depends largely on the reason for the failure. Contact support. |
Lucene Merge Issues | org.apache.lucene.index.MergePolicy$MergeException | Warning | An issue occurred while merging Lucene indices; this relates to a single Elasticsearch shard. | The solution depends largely on the reason for the failure. Merging usually occurs automatically in the background, so this could just be a non-issue. If the problem persists, then contact support. |
Lucene Index Issues | org.apache.lucene.index.CorruptIndexException | Warning | An issue occurred while reading a Lucene index; this relates to a single Elasticsearch shard. | The solution depends largely on the reason for the failure. If the problem persists, then contact support. |
Low Disk Space | After allocating, node .* would have less than the required .*% free disk threshold .*, preventing allocation | Warning | The specified node is low on disk space. | Allocate more disk space to the specified node. |
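A simple way to start watching for these lines is to grep the Elasticsearch log files directly and alert on a non-zero count; the log path below is an assumption for package-based Linux installs and should be adjusted to your environment:
# Count occurrences of the most serious patterns in the logs
# (log path is an assumption; adjust to your installation)
grep -cE 'OutOfMemoryError|StreamCorruptedException|CorruptIndexException' /var/log/elasticsearch/*.log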