- How do I troubleshoot search latency spikes in my Amazon OpenSearch Service cluster?
- How can I improve the indexing performance on my Amazon OpenSearch Service cluster?
- Resolution
- Be sure that the shards are distributed evenly across the data nodes for the index that you're ingesting into
- Increase the refresh_interval to 60 seconds or more
- Change the replica count to zero
- Experiment to find the optimal bulk request size
- Use an instance type that has SSD instance store volumes (such as I3)
- Reduce response size
- Related information
- How do I resolve search or write rejections in OpenSearch Service?
https://repost.aws/knowledge-center/opensearch-latency-spikes
For search requests, OpenSearch Service calculates the round trip time as follows:
Round trip = Time the query spends in the query phase + Time in the fetch phase + Time spent in the queue + Network latency
The SearchLatency metric on Amazon CloudWatch gives you the time that the query spent in the query phase.
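For example, you can pull SearchLatency statistics with the AWS CLI. The following is a minimal sketch that assumes a hypothetical domain name (my-domain), account ID, and time window; OpenSearch Service publishes its metrics under the AWS/ES namespace with the DomainName and ClientId dimensions:
aws cloudwatch get-metric-statistics \
  --namespace "AWS/ES" \
  --metric-name SearchLatency \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
  --statistics Average Maximum \
  --period 60 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z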
To troubleshoot search latency spikes in your OpenSearch Service cluster, there are multiple steps that you can take:
- Check for insufficient resources provisioned on the cluster
- Check for search rejections using the ThreadpoolSearchRejected metric in CloudWatch
- Use the search slow logs API and the profile API
- Resolve any 504 gateway timeout errors
If you have insufficient resources provisioned on your cluster, then you might experience search latency spikes. Use the following best practices to make sure that you have sufficient resources provisioned.
1. Review the CPUUtilization and JVMMemoryPressure metrics of the cluster in CloudWatch. Confirm that they're within the recommended thresholds. For more information, see Recommended CloudWatch alarms for Amazon OpenSearch Service.
2. Use the node stats API to get node level statistics on your cluster:
GET /_nodes/stats
In the output, check the following sections: caches, fielddata, and jvm. To compare the outputs, run this API multiple times with some delay between each output.
3. OpenSearch Service uses multiple caches to improve its performance and the response time of requests:
- The file-system cache, or page cache, that exists on the operating system level
- The shard level request cache and query cache that both exist on the OpenSearch Service level
Review the node stats API output for cache evictions. A high number of cache evictions in the output means that the cache size is inadequate to serve the request. To reduce your evictions, use bigger nodes with more memory.
To view specific cache information with the node stats API, use the following requests. The second request is for a shard-level request cache:
GET /_nodes/stats/indices/request_cache?human
GET /_nodes/stats/indices/query_cache?human
For more information on OpenSearch caches, see Elasticsearch caching deep dive: Boosting query speed one cache at a time on the Elastic website.
For steps to clear the various caches, see Clear index or data stream cache on the OpenSearch website.
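As a sketch, the following request clears the query, request, and fielddata caches for a single index; the index name is a placeholder:
POST /index-name/_cache/clear?query=true&request=true&fielddata=true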
4. Performing aggregations on fields that contain highly unique values might cause an increase in heap usage. When aggregation queries are in use, search operations use fielddata. Fielddata is also used to sort and access field values in scripts. Fielddata evictions depend on the indices.fielddata.cache.size setting, which accounts for 20% of the JVM heap space. When the cache is exceeded, evictions start.
To view the fielddata cache, use this request:
GET /_nodes/stats/indices/fielddata?human
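To see which fields consume the most fielddata memory on each node, you can also use the cat fielddata API (the column selection shown here is optional):
GET /_cat/fielddata?v&h=node,field,size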
For more information on troubleshooting high JVM memory, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?
To troubleshoot high CPU utilization, see How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?
To check for search rejections using CloudWatch, follow the steps in How do I resolve search or write rejections in Amazon OpenSearch Service?
To identify both long-running queries and the time that a query spent on a particular shard, use slow logs. You can set thresholds for the query phase and the fetch phase for each index. For more information on setting up slow logs, see Viewing Amazon OpenSearch Service slow logs. For a detailed breakdown of the time that your query spends in the query phase, set "profile": true for your search query.
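The following is a minimal sketch of both techniques, assuming a hypothetical index name, field, and thresholds (5s and 1s) that you should adapt to your workload. The first request sets query-phase and fetch-phase slow log thresholds, and the second runs a profiled search:
PUT /index-name/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
GET /index-name/_search
{
  "profile": true,
  "query": { "match": { "user": "testuser" } }
}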
Note: If you set the threshold for logging to a very low value, your JVM memory pressure might increase. This might lead to more frequent garbage collection that then increases CPU utilization and adds to cluster latency. Logging more queries might also increase your costs. A long output of the profile API also adds significant overhead to any search queries.
From the application logs of your OpenSearch Service cluster, you can see specific HTTP error codes for individual requests. For more information on resolving HTTP 504 gateway timeout errors, see How can I prevent HTTP 504 gateway timeout errors in Amazon OpenSearch Service?
Note: You must activate error logs to identify specific HTTP error codes. For more information about HTTP error codes, see Viewing Amazon OpenSearch Service error logs.
There are a number of other factors that can cause high search latency. Use the following tips to further troubleshoot high search latency:
- Frequent or long-running garbage collection activity might cause search performance issues. Garbage collection activity might pause threads and increase search latency. For more information, see A Heap of Trouble: Managing Elasticsearch's Managed Heap on the Elastic website.
- Provisioned IOPS (or I3 instances) might help you remove any Amazon Elastic Block Store (Amazon EBS) bottleneck. In most cases, you don't need them. Before you move instances to I3, it's a best practice to test the performance between I3 nodes and R5 nodes.
- A cluster with too many shards might increase resource utilization, even when the cluster is inactive. Too many shards slow down query performance. Although increasing the replica shard count can help you achieve faster searches, don't go beyond 1,000 shards on a given node. Also, make sure that the shard sizes are between 10 GiB and 50 GiB. Ideally, keep no more than 20 shards per GiB of JVM heap on a node.
- Too many segments or too many deleted documents might affect search performance. To improve performance, use force merge on read-only indices. Also, increase the refresh interval on the active indices, if possible. For more information, see Lucene's handling of deleted documents on the Elastic website.
- If your cluster is in a Virtual Private Cloud (VPC), then it's a best practice to run your applications within the same VPC.
- Use UltraWarm nodes or hot data nodes for read-only data. Hot storage provides the fastest possible performance for indexing and searching new data. However, UltraWarm nodes are a cost-effective way to store large amounts of read-only data on your cluster. For indices that you don't need to write to and don't require high performance, UltraWarm offers significantly lower costs per GiB of data.
- Determine if your workload benefits from having the data that you're searching for available on all nodes. Some applications benefit from this approach, especially if there are few indices on your cluster. To do this, increase the number of replica shards.
Note: This might add to indexing latency. Also, you might need additional Amazon EBS storage to accommodate the replicas that you add. This increases your EBS volume costs.
- Search as few fields as possible, and avoid scripts and wildcard queries. For more information, see Tune for search speed on the Elastic website.
- For indices with many shards, use custom routing to help speed up searches. Custom routing makes sure that you query only the shards that hold your data, rather than broadcast the request to all shards. For more information, see Customizing your document routing on the Elastic website.
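As a hedged illustration of custom routing, the following requests index and then search a document with the same routing value, so the query goes only to the shard that holds that routing value. The index name, document, and routing value (user1) are placeholders:
PUT /index-name/_doc/1?routing=user1
{ "user" : "user1", "message" : "example" }
GET /index-name/_search?routing=user1
{ "query": { "match": { "user": "user1" } } }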
Recommended CloudWatch alarms for Amazon OpenSearch Service
https://repost.aws/knowledge-center/opensearch-indexing-performance
Be sure that the shards are distributed evenly across the data nodes for the index that you're ingesting into
Use the following formula to confirm that the shards are evenly distributed:
Number of shards for index = k * (Number of data nodes), where k is the number of shards per node
For example, if there are 24 shards in the index, and there are eight data nodes, then OpenSearch Service assigns three shards to each node. For more information, see Get started with Amazon OpenSearch Service: How many shards do I need?
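The shard count is set when you create the index. As a sketch, the following request (with a placeholder index name) creates an index with 24 primary shards to match the example above:
PUT /index-name
{
  "settings": {
    "index.number_of_shards": 24,
    "index.number_of_replicas": 1
  }
}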
Refresh your OpenSearch Service index so that your documents are available for search. Note that refreshing your index requires the same resources that are used by indexing threads.
The default refresh interval is one second. When you increase the refresh interval, the data node makes fewer API calls, and the refresh operation consumes fewer resources that would otherwise go to indexing. To prevent 429 errors, it's a best practice to increase the refresh interval.
Note: The default refresh interval is one second for indices that receive one or more search requests in the last 30 seconds. For more information about the updated default interval, see _refresh API version 7.x on the Elasticsearch website.
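For example, the following request (with a placeholder index name) raises the refresh interval to 60 seconds:
PUT /index-name/_settings
{
  "index": { "refresh_interval": "60s" }
}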
If you anticipate heavy indexing, consider setting the index.number_of_replicas value to 0. Each replica duplicates the indexing process. As a result, disabling the replicas improves your cluster performance. After the heavy indexing is complete, restore the replica count.
Important: If a node fails while replicas are disabled, you might lose data. Disable the replicas only if you can tolerate data loss for a short duration.
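The following is a minimal sketch with a placeholder index name: the first request disables replicas before heavy indexing, and the second restores a single replica afterward. Adjust the restored count to your durability requirements:
PUT /index-name/_settings
{ "index": { "number_of_replicas": 0 } }
PUT /index-name/_settings
{ "index": { "number_of_replicas": 1 } }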
Start with the bulk request size of 5 MiB to 15 MiB. Then, slowly increase the request size until the indexing performance stops improving. For more information, see Using and sizing bulk requests on the Elasticsearch website.
Note: Some instance types limit bulk requests to 10 MiB. For more information, see Network limits.
I3 instances provide fast, local Non-Volatile Memory Express (NVMe) storage. I3 instances deliver better ingestion performance than instances that use General Purpose SSD (gp2) Amazon Elastic Block Store (Amazon EBS) volumes. For more information, see Petabyte scale for Amazon OpenSearch Service.
To reduce the size of OpenSearch Service's response, use the filter_path parameter to exclude unnecessary fields. Be sure that you don't filter out any fields that are required to identify or retry failed requests. Those fields can vary by client.
In the following example, the _index, _type, and took fields are excluded from the response:
curl -XPOST "es-endpoint/index-name/type-name/_bulk?pretty&filter_path=-took,-items.index._index,-items.index._type" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "user" : "testuser" }
{ "update" : {"_id" : "1", "_index" : "test2"} }
{ "doc" : {"user" : "example"} }
For more information, see Reducing response size.
Increase the value of index.translog.flush_threshold_size
By default, index.translog.flush_threshold_size is set to 512 MB. This means that the translog is flushed when it reaches 512 MB. The indexing load determines the frequency of translog flushes. When you increase index.translog.flush_threshold_size, the node performs the flush operation less frequently. Because OpenSearch Service flushes are resource-intensive operations, reducing the flush frequency improves indexing performance. By increasing the flush threshold size, the OpenSearch Service cluster also creates fewer, larger segments instead of many small segments. Large segments merge less often, and more threads are used for indexing instead of merging.
Note: An increase in index.translog.flush_threshold_size can also increase the time that it takes for a translog to complete. If a shard fails, then recovery takes more time because the translog is larger.
Before increasing index.translog.flush_threshold_size, call the following API operation to get current flush operation statistics:
curl -XPOST "os-endpoint/index-name/_stats/flush?pretty"
Replace the os-endpoint and index-name with your respective variables.
In the output, note the number of flushes and the total time. The following example output shows that there are 124 flushes, which took 17,690 milliseconds:
{
"flush": {
"total": 124,
"total_time_in_millis": 17690
}
}
To increase the flush threshold size, call the following API operation:
$ curl -XPUT "os-endpoint/index-name/_settings?pretty" -d "{"index":{"translog.flush_threshold_size" : "1024MB"}}"
In this example, the flush threshold size is set to 1024 MB, which is ideal for instances that have more than 32 GB of memory.
Note: Choose the appropriate threshold size for your OpenSearch Service domain.
Run the _stats API operation again to see whether the flush activity changed:
$ curl -XGET "os-endpoint/index-name/_stats/flush?pretty"
Note: It's a best practice to increase the index.translog.flush_threshold_size only for the current index. After you confirm the outcome, apply the changes to the index template.
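After you confirm the outcome on the current index, you can carry the setting into a composable index template so that new indices inherit it. The following sketch uses a hypothetical template name and index pattern:
PUT /_index_template/flush-threshold-template
{
  "index_patterns": ["index-name-*"],
  "template": {
    "settings": {
      "index.translog.flush_threshold_size": "1024MB"
    }
  }
}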
Segment replication strategy
Segment replication can be applied in a variety of scenarios, including:
- High write loads without high search requirements and with longer refresh times.
- When experiencing very high loads, you want to add new nodes but don’t want to index all data immediately.
- OpenSearch cluster deployments with low replica counts, such as those used for log analytics.
To use segment replication for an index, set the replication type in the index settings when you create the index:
PUT /my-index1
{
"settings": {
"index": {
"replication.type": "SEGMENT"
}
}
}
For the best performance, it is recommended that you enable the following settings:
- Segment replication backpressure
- Balanced primary shard allocation, using the following command:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.balance.prefer_primary": true,
"segrep.pressure.enabled": true
}
}
You can set the default replication type for newly created cluster indexes in the opensearch.yml file as follows:
cluster.indices.replication.strategy: 'SEGMENT'
Try Segment replication approach
Best practices for Amazon OpenSearch Service
https://repost.aws/knowledge-center/opensearch-resolve-429-error
When you write or search for data in your OpenSearch Service cluster, you might receive the following HTTP 429 error or es_rejected_execution_exception:
error":"elastic: Error 429 (Too Many Requests): rejected execution of org.elasticsearch.transport.TransportService$7@b25fff4 on
EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@768d4a66[Running,
pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 820898]] [type=es_rejected_execution_exception]"
Reason={"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@3ad6b683 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bef81a5[Running, pool size = 25, active threads = 23, queued tasks = 1000, completed tasks = 440066695]]"
The following variables can contribute to an HTTP 429 error or es_rejected_execution_exception:
- Data node instance types and search or write limits
- High values for instance metrics
- Active and Queue threads
- High CPU utilization and JVM memory pressure
HTTP 429 errors can occur because of search and write requests to your cluster. Rejections can also come from a single node or multiple nodes of your cluster.
Note: Different versions of Elasticsearch use different thread pools to process calls to the _index API. Elasticsearch versions 1.5 and 2.3 use the index thread pool. Elasticsearch versions 5.x, 6.0, and 6.2 use the bulk thread pool. Elasticsearch versions 6.3 and later use the write thread pool. For more information, see Thread pool on the Elasticsearch website.
A data node instance type has a fixed number of virtual CPUs (vCPUs). Plug the vCPU count into the following formulas to determine how many concurrent search or write operations your node can perform before requests enter a queue. If all active threads are busy, then requests spill over to the queue and are eventually rejected. For more information about the relationship between vCPUs and node types, see OpenSearch Service pricing.
Additionally, there is a limit to how many searches per node or writes per node that you can perform. This limit is based on the thread pool definition and Elasticsearch version number. For more information, see Thread pool on the Elasticsearch website.
For example, if you choose the R5.2xlarge node type for five nodes in your Elasticsearch cluster (version 7.4), then each node has 8 vCPUs.
Use the following formula to calculate maximum active threads for search requests:
int ((# of available_processors * 3) / 2) + 1
Use the following formula to calculate maximum active threads for write requests:
int (# of available_processors)
For an R5.2xlarge node, you can perform a maximum of 13 search operations:
(8 VCPUs * 3) / 2 + 1 = 13 operations
For an R5.2xlarge node, you can perform a maximum of 8 write operations:
8 VCPUs = 8 operations
For an OpenSearch Service cluster with five nodes, you can perform a maximum of 65 search operations:
5 nodes * 13 = 65 operations
For an OpenSearch Service cluster with five nodes, you can perform a maximum of 40 write operations:
5 nodes * 8 = 40 operations
To troubleshoot your 429 exception, check the following Amazon CloudWatch metrics for your cluster:
- IndexingRate: The number of indexing operations per minute. A single call to the _bulk API that adds two documents and updates two documents counts as four operations that might be spread across nodes. If that index has one or more replicas, then other nodes in the cluster also record a total of four indexing operations. Document deletions don't count towards the IndexingRate metric.
- SearchRate: The total number of search requests per minute for all shards on a data node. A single call to the _search API might return results from many different shards. If five different shards are on one node, then the node reports "5" for this metric, even if the client only made one request.
- CoordinatingWriteRejected: The total number of rejections that occurred on the coordinating node. These rejections are caused by the indexing pressure that accumulated since the last OpenSearch Service startup.
- PrimaryWriteRejected: The total number of rejections that occurred on the primary shards. These rejections are caused by indexing pressure that accumulated since the last OpenSearch Service startup.
- ReplicaWriteRejected: The total number of rejections that occurred on the replica shards because of indexing pressure. These rejections are caused by indexing pressure that accumulated since the last OpenSearch Service startup.
- ThreadpoolWriteQueue: The number of queued tasks in the write thread pool. This metric tells you whether a request is being rejected because of high CPU usage or high indexing concurrency.
- ThreadpoolWriteRejected: The number of rejected tasks in the write thread pool.
Note: The default write queue size was increased from 200 to 10,000 in OpenSearch Service version 7.9. As a result, this metric is no longer the only indicator of rejections from OpenSearch Service. Use the CoordinatingWriteRejected, PrimaryWriteRejected, and ReplicaWriteRejected metrics to monitor rejections in versions 7.9 and later.
- ThreadpoolSearchQueue: The number of queued tasks in the search thread pool. If the queue size is consistently high, then consider scaling your cluster. The maximum search queue size is 1,000.
- ThreadpoolSearchRejected: The number of rejected tasks in the search thread pool. If this number continually grows, then consider scaling your cluster.
- JVMMemoryPressure: The JVM memory pressure specifies the percentage of the Java heap that's used in a cluster node. If JVM memory pressure reaches 75%, then OpenSearch Service initiates the Concurrent Mark Sweep (CMS) garbage collector. Garbage collection is a CPU-intensive process. If JVM memory pressure stays at this percentage for a few minutes, then you might encounter cluster performance issues. For more information, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?
Note: The thread pool metrics that are listed help inform you about the IndexingRate and SearchRate.
For more information about monitoring your OpenSearch Service cluster with CloudWatch, see Instance metrics.
If there is a lack of CPUs or high request concurrency, then a queue can fill up quickly, resulting in an HTTP 429 error. To monitor your queue threads, check the ThreadpoolSearchQueue and ThreadpoolWriteQueue metrics in CloudWatch.
To check your Active and Queue threads for any search rejections, use the following command:
GET /_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed
To check Active and Queue threads for write rejections, replace "search" with "write". The rejected and completed values in the output are cumulative node counters, which are reset when new nodes are launched. For more information, see the Example with explicit columns section of cat thread pool API on the Elasticsearch website.
Note: The bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using. When the queue is full, new requests are rejected.
Search rejections
A search rejection error indicates that active threads are busy and that queues are filled up to the maximum number of tasks. As a result, your search request can be rejected. You can configure OpenSearch Service logs so that these error messages appear in your search slow logs.
Note: To avoid extra overhead, set your slow log threshold to a generous amount. For example, if most of your queries take 11 seconds and your threshold is "10", then OpenSearch Service takes more time to write a log. You can avoid this overhead by setting your slow log threshold to 20 seconds. Then, only a small percentage of the slower queries (that take longer than 11 seconds) is logged.
After your cluster is configured to push search slow logs to CloudWatch, set a specific threshold for slow log generation with the following HTTP PUT call:
curl -XPUT "http://<your domain's endpoint>/index/_settings" -H 'Content-Type: application/json' -d '{"index.search.slowlog.threshold.query.<level>":"10s"}'
Write rejections
A 429 error message as a write rejection indicates a bulk queue error. The es_rejected_execution_exception[bulk] indicates that your queue is full and that any new requests are rejected. This bulk queue error occurs when the number of requests to the cluster exceeds the bulk queue size (threadpool.bulk.queue_size). A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using.
You can configure OpenSearch Service logs so that these error messages appear in your index slow logs.
Note: To avoid extra overhead, set your slow log threshold to a generous amount. For example, if most of your queries take 11 seconds and your threshold is "10", then OpenSearch Service will take additional time to write a log. You can avoid this overhead by setting your slow log threshold to 20 seconds. Then, only a small percentage of the slower queries (that take longer than 11 seconds) is logged.
After your cluster is configured to push index slow logs to CloudWatch, set a specific threshold for slow log generation with the following HTTP PUT call:
curl -XPUT "http://<your domain's endpoint>/index/_settings" -H 'Content-Type: application/json' -d '{"index.indexing.slowlog.threshold.index.<level>":"10s"}'
Here are some best practices that mitigate write rejections:
- When documents are indexed faster, the write queue is less likely to reach capacity.
- Tune bulk size according to your workload and desired performance. For more information, see Tune for indexing speed on the Elasticsearch website.
- Add exponential retry logic in your application logic (see the sketch after this list). The exponential retry logic makes sure that failed requests are automatically retried.
Note: If your cluster continuously experiences high concurrent requests, then the exponential retry logic won't help resolve the 429 error. Use this best practice when there is a sudden or occasional spike of traffic.
- If you ingest data from Logstash, then tune the worker count and bulk size. It's a best practice to set your bulk size between 3 MB and 5 MB.
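The following shell loop is a minimal sketch of exponential retry logic: it retries a bulk request only when OpenSearch Service returns HTTP 429 and doubles the wait between attempts. The os-endpoint value, the index name, and bulk-payload.json are placeholders:
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w "%{http_code}" -XPOST "os-endpoint/index-name/_bulk" \
    -H 'Content-Type: application/json' --data-binary @bulk-payload.json)
  if [ "$status" != "429" ]; then
    break    # success or a non-retryable error; stop retrying
  fi
  sleep "$delay"
  delay=$((delay * 2))    # exponential backoff: 1s, 2s, 4s, 8s
done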
For more information about indexing performance tuning, see How can I improve the indexing performance on my OpenSearch Service cluster?
Here are some best practices that mitigate search rejections:
- Switch to a larger instance type. OpenSearch Service relies heavily on the filesystem cache for faster search results. The number of threads in the thread pool on each node for search requests is equal to the following: int((# of available_processors * 3) / 2) + 1. Switch to an instance with more vCPUs to get more threads to process search requests.
- Turn on search slow logs for a given index or for all indices with a reasonable threshold value. Check to see which queries are taking longer to run and implement search performance strategies for your queries. For more information, see Troubleshooting Elasticsearch searches, for beginners or Advanced tuning: Finding and fixing slow Elasticsearch queries on the Elasticsearch website.