Notes by: Aleck Landgraf
- #1 takeaway: use 3 master nodes
- these can be wimpy EC2 instances
- don't issue search queries to them
- don't put data on them, i.e. don't index to them
- they only hold the state of the cluster
- then have data nodes
- then, at scale or for aggs, have search/client nodes
- this prevents bad GC issues
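A rough sketch of how those roles could be set in each node's elasticsearch.yml (ES 1.x-style settings; each stanza goes in the config of the corresponding node, not all in one file):

```yaml
# dedicated master node: master-eligible, holds cluster state only
node.master: true
node.data: false

# data node: holds and indexes data, never master-eligible
node.master: false
node.data: true

# search/client node: no data, not master-eligible; coordinates searches and aggs
node.master: false
node.data: false
```

With node.master: false on the data and client nodes, a long GC pause on a busy node can't destabilize the elected master.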
- use a unique cluster.name
- set ES_HEAP_SIZE to half the available memory or 30g, whichever is smaller
- e.g. export ES_HEAP_SIZE=10g then restart elasticsearch, or for ubuntu edit the param in /etc/default/elasticsearch
- decimals are not allowed (10.5g); use m instead (10500m)
- for write-heavy or bulk indexing, set the index.refresh_interval to -1, then back to 1s after indexing, or force a manual refresh
- e.g. http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
- curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'
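And flipping it back after the bulk load, via the same update-settings API (index name test matches the example above):

```shell
# re-enable near-real-time refresh
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "1s" } }'

# or force a one-off refresh manually
curl -XPOST localhost:9200/test/_refresh
```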
- set doc_values to true per field (numeric or not_analyzed string) to keep field data for those fields out of memory http://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html
- e.g. PUT /music/_mapping/song { "properties" : { "tag": { "type": "string", "index" : "not_analyzed", "doc_values": true } } }
- setting doc_values to true at field creation time is all that is required to use disk-based fielddata instead of in-memory fielddata
- doc values are supported on not_analyzed string fields and numeric fields (byte, short, integer, long, float, double, date)
- if creating a lot of indexes, use index templates
- when a new index is created, the matching template will be used for all index config and mappings
- great for time-series daily indexes
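A sketch of an index template for daily indexes (template name, pattern, shard count, and mapping here are all made up; ES 1.x syntax):

```shell
curl -XPUT localhost:9200/_template/logs_template -d '{
  "template" : "logs-*",
  "settings" : { "number_of_shards" : 2 },
  "mappings" : {
    "log" : {
      "properties" : {
        "isoDate" : { "type" : "date" }
      }
    }
  }
}'
```

Any new index whose name matches logs-* (e.g. logs-2015.04.12) picks up these settings and mappings automatically at creation time.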
- time series (TS) indexes
- create an index per day (date) or week or month, etc.
- search across indexes
- can route indexes to different nodes by age
- for the use case where you might have the current month on two big data nodes, and the rest down-sampled on 4 small data nodes
- allows cheap closing of an index, which frees up resources other than disk
- allows cheap deleting of an index, versus deleting data from an index
- still have to write your own down-sampling
- use a timeParts format where each entry has the ISO date and time parts, to allow cheap aggs and analysis:
  'isoDate': '2015-04-12T13:30:22+08:00', 'timeParts': { 'month': 'April', 'weekDay': 'Monday', 'week': 18, 'day': 12, 'hour': 13, 'minute': 30, 'AM/PM': 'PM', 'quarter': 2 }
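The cheap cross-index, close, and delete operations above might look like this (the daily logs-YYYY.MM.DD index names are hypothetical):

```shell
# search across all of April's daily indexes with a wildcard
curl -XGET 'localhost:9200/logs-2015.04.*/_search?q=level:error'

# close last month's indexes: frees heap and file handles, keeps data on disk
curl -XPOST 'localhost:9200/logs-2015.03.*/_close'

# drop old data by deleting whole indexes, far cheaper than delete-by-query
curl -XDELETE 'localhost:9200/logs-2015.01.*'
```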
- lots of data? use few doc_types, maybe only one
- NO Java 6
- use the Oracle JVM
- use the same JVM on all nodes
- use count where you can
- It's very fast since it doesn't have to go to disk for data
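e.g. a count sketch (index and field names assumed):

```shell
# returns only a hit count, no documents are fetched from disk
curl -XGET 'localhost:9200/logs-*/_count' -d '{
  "query" : { "term" : { "level" : "error" } }
}'
```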
- use filters, not queries; try to get your whole search via filters unless you need ranking
- nesting filters into aggs is probably better than a flat search plus aggs, since the flat approach has to search twice
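A sketch of an all-filter search using the ES 1.x filtered query, which skips scoring and can use the filter cache (field names are made up):

```shell
curl -XGET 'localhost:9200/logs-*/_search' -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "level" : "error" } },
            { "range" : { "isoDate" : { "gte" : "2015-04-01" } } }
          ]
        }
      }
    }
  }
}'
```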
- use explain/validate to tune search queries (maybe like postgres EXPLAIN?)
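e.g. the validate API with explain (1.x; index and query are placeholders):

```shell
# reports whether the query is valid and roughly how it will execute
curl -XGET 'localhost:9200/logs-*/_validate/query?explain' -d '{
  "query" : { "match" : { "message" : "timeout" } }
}'
```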
- hot threads API will point out bad things
- check cluster health and stats
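The relevant endpoints, roughly:

```shell
curl 'localhost:9200/_cluster/health?pretty'    # status, shard counts
curl 'localhost:9200/_cluster/stats?pretty'     # aggregate node/index stats
curl 'localhost:9200/_nodes/hot_threads'        # busiest threads per node
```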
- use marvel locally (dev license) to figure out what you should be watching
- set up alerts for disk almost full, etc.
- see Tim's github elasticsearch talk for a ton of info here.
- /_nodes/stats to check heap usage