Notes by: Aleck Landgraf
- #1 takeaway: use 3 master nodes
- these can be wimpy EC2 instances
- don't issue search queries to them
- don't put data on them, i.e. don't index to them
- they only hold the state of the cluster
- then have data nodes
- then, at scale or for aggs, have search/client nodes
- this prevents bad GC issues
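A rough sketch of how those roles could be set in each node's elasticsearch.yml (ES 1.x-style settings; each stanza goes in the config of the corresponding node, not all in one file):

```yaml
# dedicated master node: master-eligible, holds cluster state only
node.master: true
node.data: false

# data node: holds and indexes data, never master-eligible
node.master: false
node.data: true

# search/client node: no data, not master-eligible; coordinates searches and aggs
node.master: false
node.data: false
```

With node.master: false on the data and client nodes, a long GC pause on a busy node can't destabilize the elected master.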
- use a unique cluster.name
- set ES_HEAP_SIZE to half the available memory or 30g, whichever is smaller
- e.g. export ES_HEAP_SIZE=10g then restart elasticsearch, or for ubuntu edit the param in /etc/default/elasticsearch
- decimals are not allowed (10.5g); use m instead (10500m)
- for write-heavy or bulk indexing, set the index.refresh_interval to -1, then back to 1s after indexing, or force a manual refresh
- e.g. http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
- curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'
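And flipping it back after the bulk load, via the same update-settings API (index name test matches the example above):

```shell
# re-enable near-real-time refresh
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "1s" } }'

# or force a one-off refresh manually
curl -XPOST localhost:9200/test/_refresh
```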
- set doc_values to true per field (numeric or not_analyzed string) to keep field data for those fields out of memory http://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html
- e.g. PUT /music/_mapping/song { "properties" : { "tag": { "type": "string", "index" : "not_analyzed", "doc_values": true } } }
- setting doc_values to true at field creation time is all that is required to use disk-based fielddata instead of in-memory fielddata
- doc values are supported on not_analyzed string fields and numeric fields (byte, short, integer, long, float, double, date)
- if creating a lot of indexes, use index templates
- when a new index is created, the matching template will be used for all index config and mappings
- great for time-series daily indexes
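A sketch of an index template for daily indexes (template name, pattern, shard count, and mapping here are all made up; ES 1.x syntax):

```shell
curl -XPUT localhost:9200/_template/logs_template -d '{
  "template" : "logs-*",
  "settings" : { "number_of_shards" : 2 },
  "mappings" : {
    "log" : {
      "properties" : {
        "isoDate" : { "type" : "date" }
      }
    }
  }
}'
```

Any new index whose name matches logs-* (e.g. logs-2015.04.12) picks up these settings and mappings automatically at creation time.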
- time series (TS) indexes
- create an index per day (date) or week or month, etc.
- search across indexes
- can route indexes to different nodes by age
- for the use case where you might have the current month on two big data nodes, and the rest down-sampled on 4 small data nodes
- allows cheap closing of an index, which frees up resources other than disk
- allows cheap deleting of an index, versus deleting data from an index
- still have to write your own down-sampling
- use a timeParts format where each entry has the ISO date and time parts, to allow cheap aggs and analysis:
  'isoDate': '2015-04-12T13:30:22+08:00', 'timeParts': { 'month': 'April', 'weekDay': 'Monday', 'week': 18, 'day': 12, 'hour': 13, 'minute': 30, 'AM/PM': 'PM', 'quarter': 2 }
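The cheap cross-index, close, and delete operations above might look like this (the daily logs-YYYY.MM.DD index names are hypothetical):

```shell
# search across all of April's daily indexes with a wildcard
curl -XGET 'localhost:9200/logs-2015.04.*/_search?q=level:error'

# close last month's indexes: frees heap and file handles, keeps data on disk
curl -XPOST 'localhost:9200/logs-2015.03.*/_close'

# drop old data by deleting whole indexes, far cheaper than delete-by-query
curl -XDELETE 'localhost:9200/logs-2015.01.*'
```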
- lots of data? use few doc_types, maybe only one
- NO Java 6
- use the Oracle JVM
- use the same JVM on all nodes
- use count where you can
- It's very fast since it doesn't have to go to disk for data
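e.g. a count sketch (index and field names assumed):

```shell
# returns only a hit count, no documents are fetched from disk
curl -XGET 'localhost:9200/logs-*/_count' -d '{
  "query" : { "term" : { "level" : "error" } }
}'
```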
- use filters, not queries; try to get your whole search via filters unless you need ranking
- nesting filters into aggs is probably better than a flat search plus aggs, since the flat approach has to search twice
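A sketch of an all-filter search using the ES 1.x filtered query, which skips scoring and can use the filter cache (field names are made up):

```shell
curl -XGET 'localhost:9200/logs-*/_search' -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "level" : "error" } },
            { "range" : { "isoDate" : { "gte" : "2015-04-01" } } }
          ]
        }
      }
    }
  }
}'
```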
- use explain/validate to tune search queries (maybe like postgres EXPLAIN?)
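e.g. the validate API with explain (1.x; index and query are placeholders):

```shell
# reports whether the query is valid and roughly how it will execute
curl -XGET 'localhost:9200/logs-*/_validate/query?explain' -d '{
  "query" : { "match" : { "message" : "timeout" } }
}'
```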
- hot threads API will point out bad things
- check cluster health and stats
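The relevant endpoints, roughly:

```shell
curl 'localhost:9200/_cluster/health?pretty'    # status, shard counts
curl 'localhost:9200/_cluster/stats?pretty'     # aggregate node/index stats
curl 'localhost:9200/_nodes/hot_threads'        # busiest threads per node
```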
- use marvel locally (dev license) to figure out what you should be watching
- set up alerts for disk almost full, etc.
- see Tim's github elasticsearch talk for a ton of info here.
- /_nodes/stats to check heap usage