Keep in mind, our use case is largely timeseries analytics, but here are the broad themes of issues we encountered:
- Realtime indexing + querying is tough. It required us to throw beefed-up dedicated hardware at that problem while serving historical queries on nodes with a different config (the typical hot/warm/cold node configuration).
- As always, skewed data sets require special consideration in index and document schema modelling.
- JVM heap, aggregation query, and doc mapping optimization are all needed, or you'll easily hit OOMs on nodes, which can lead to...
- Bad failure scenarios where the entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put a node, and then the whole cluster, in a very bad state.
- Depending on your document mapping, disk storage requirements can easily bite you, but they are made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch
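For the hot/warm/cold setup mentioned above, the usual pattern is to tag nodes with a custom attribute and then pin indices to a tier via allocation filtering. A minimal sketch (the `box_type` attribute name and index name are illustrative, and the exact attribute syntax varies across ES versions):

```yaml
# elasticsearch.yml on an indexing-heavy "hot" node
node.attr.box_type: hot

# elasticsearch.yml on a query-only "warm" node with cheaper disks
node.attr.box_type: warm
```

Then, once a daily index goes cold, relocate it with an index settings update:

```json
PUT /metrics-2016.01.15/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```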
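On the OOM point: Elasticsearch ships circuit breakers that can reject a greedy query before it exhausts the heap, and tightening them below the defaults was one of the few real mitigations. A hedged sketch of the relevant cluster-wide settings (the percentages here are illustrative, not recommendations; names and defaults differ by ES version):

```yaml
# elasticsearch.yml — trip breakers earlier than the defaults
indices.breaker.fielddata.limit: 40%   # per-node cap on fielddata loaded to heap
indices.breaker.request.limit: 30%     # cap on per-request structures (e.g. agg buckets)
indices.breaker.total.limit: 70%       # combined cap across breakers
```

This turns a would-be node OOM into a single failed query, which is the failure mode you actually want.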
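The store-compression improvement in the linked post is exposed as an index codec setting. A minimal sketch (index name illustrative; `best_compression` trades some read CPU for smaller `_source` storage and must be set at index creation or on a closed index):

```json
PUT /metrics-2016.01.16
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```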
+1 to the ES team though, they do listen and fix issues quickly. Making doc values the default for all fields fixed one of the most frequent issues newbies encounter that leads to JVM OOMs. Disk compression is another great example, and they have a tendency to fix every bug we hit in the next minor version release.
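Before doc values became the default, you had to opt in per field in the mapping or sorting/aggregating would pull the field into heap-resident fielddata. A sketch of the 1.x-era opt-in (type and field names are illustrative):

```json
PUT /metrics/_mapping/event
{
  "properties": {
    "host": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}
```

With `doc_values: true`, aggregations on `host` read columnar data from disk (OS page cache) instead of building fielddata on the JVM heap, which is exactly the OOM trap the new default removed.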
Of course not everyone shares that opinion https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0.