So. I've run into a great deal of stress around ElasticSearch/Logstash performance lately. These are just a few lessons learned, documented so I have a chance of finding them again.
Both ElasticSearch and Logstash produce logs. On my RHEL install they're located in /var/log/elasticsearch and /var/log/logstash. These will give you some idea of problems when things go really wrong. For example, in my case, ElasticSearch got so slow that Logstash would time out sending it logs, and those timeouts show up in the Logstash logs. Also, ElasticSearch would start logging warnings when JVM garbage collection took longer than 30 seconds, which is a good indicator of memory pressure on ElasticSearch.
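If you just want a quick way to spot those garbage collection warnings, something like this works. The exact log file names and message format vary by version, so treat the pattern as a starting point rather than gospel:

    grep -i '\[gc\]' /var/log/elasticsearch/*.log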
ElasticSearch (and Logstash when it's joined to an ES cluster) processes tasks in a queue that you can peek into. Before I realized this I didn't have any way to understand what was happening in ElasticSearch besides the logs. You can look at the pending tasks queue with this command:
curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'
The basic analysis is: queued tasks: bad. Queued tasks that sit in the queue for a long time: worse.
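For reference, the response is a list of tasks, each with a priority, a description of what it is, and how long it has been waiting. Hypothetical output, trimmed down, looks roughly like this (the source string and timings are made up for illustration):

    {
      "tasks" : [ {
        "insert_order" : 4721,
        "priority" : "HIGH",
        "source" : "update-mapping [logstash-2015.06.01]",
        "time_in_queue_millis" : 12834,
        "time_in_queue" : "12.8s"
      } ]
    }

A mapping update that has been sitting in the queue for many seconds is a bad sign, for reasons that come up below.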
What the pending tasks query has going for it is that it returns pretty quickly, even when ElasticSearch is under pressure. So if you're tweaking settings and re-trying fixes, it's a quick indicator of whether what you've done is helping.
When used to store Logstash logs, ElasticSearch will build a mapping for the logs that are sent to it. A mapping is basically a schema definition that ElasticSearch uses to determine how to index documents, and in the Logstash case it builds this on the fly. The problem I ran into was with logs that are very dynamic. This led to huge mappings that slowed down log ingestion to the point of causing ElasticSearch to thrash with mapping updates. You can get the mappings for an index with this command:
curl -XGET 'http://localhost:9200/index_name/_mapping'
If the mapping is huge, that could be a problem. If the mappings are updating frequently, that's a bigger problem. You can see mapping update tasks in the pending tasks queue, as well as in the ElasticSearch logs if the updates get slow enough to trip a log warning.
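To give a sense of what "huge" means here: a dynamic field ends up looking something like this in the mapping response, except with hundreds or thousands of one-off properties instead of three. The index name, type name, and field names below are placeholders, not anything from a real install:

    {
      "logstash-2015.06.01" : {
        "mappings" : {
          "logs" : {
            "properties" : {
              "context" : {
                "properties" : {
                  "user_id" : { "type" : "long" },
                  "retry_count" : { "type" : "long" },
                  "cache_hit" : { "type" : "boolean" }
                }
              }
            }
          }
        }
      }
    }

Every new field name that shows up in the logs adds another entry, and another mapping update task.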
You can increase the amount of memory available to ElasticSearch in /etc/sysconfig/elasticsearch on RHEL. I've given my install half of the available RAM, but sizing memory for JVM apps is a black art, especially with OSes that steal available RAM to build filesystem caches, which RHEL does. If you see ElasticSearch doing excessive garbage collection, via the logs or the pending tasks, it's memory starved. Throwing more memory at it can help, but it can also gloss over underlying problems.
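For what it's worth, on my version the relevant setting in that file is the heap size variable; the value below is just an illustration, and newer ElasticSearch releases configure the heap elsewhere (jvm.options):

    # /etc/sysconfig/elasticsearch
    # Heap for the ElasticSearch JVM; roughly half of system RAM in my case.
    ES_HEAP_SIZE=4g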
Don't have dynamic logs. Or rather, don't send dynamic logs over to ElasticSearch as JSON. In my case I have a single field in my logs, 'context', that is dynamic and is used to hold debug information. In Logstash I broke that field down into JSON for processing and then sent it along to ElasticSearch, which dutifully indexed all of the fields in the context, even when they changed wildly from app to app and even log to log. My solution was to break context out into a temporary field in Logstash, process it, then get rid of it before sending the log to ElasticSearch. This leaves context as a serialized JSON string rather than a JSON object that ElasticSearch will index.
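A minimal sketch of that filter arrangement, assuming the serialized JSON lives in a field called 'context'. The temporary field name and the example processing step are mine, not anything required:

    filter {
      # Parse the serialized JSON string into a temporary field so its values
      # are available to other filters. The original 'context' string is untouched.
      json {
        source => "context"
        target => "context_tmp"
      }

      # Do whatever processing needs the parsed fields here, for example
      # promoting one useful value to a top-level field (request_id is hypothetical).
      if [context_tmp][request_id] {
        mutate { add_field => { "request_id" => "%{[context_tmp][request_id]}" } }
      }

      # Drop the parsed object before the event goes to ElasticSearch, so only
      # the original serialized string gets indexed.
      mutate {
        remove_field => [ "context_tmp" ]
      }
    }

The key detail is the json filter's target option: it keeps the parsed object out of the original field, and the remove_field at the end keeps it out of ElasticSearch entirely.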
Context is still searchable. It's a little slower to search, since it's one big string rather than being split into individual fields, but that's better than things being so slow that they break.