@alecklandgraf
Last active January 13, 2017 20:42
elasticon notes

Elastic{ON} 2015 best practices

Notes by: Aleck Landgraf

config and index settings

  • #1 takeaway: use 3 dedicated master nodes

    • these can be wimpy EC2 instances
    • don't issue search queries to them
    • don't put data on them, i.e. don't index to them
    • they only hold the state of the cluster
    • then have data nodes
    • then at scale or for aggs, have search/client nodes
    • keeping masters free of data and search load helps prevent bad GC issues
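
A minimal Python sketch of the three node roles above, assuming Elasticsearch 1.x setting names (`node.master`, `node.data`, `discovery.zen.minimum_master_nodes`). The helper names `master_quorum` and `node_settings` are made up for illustration, and the quorum setting is not from the talk notes; it is the usual companion to running 3 master-eligible nodes.

```python
def master_quorum(master_eligible_nodes):
    """Majority of master-eligible nodes, the usual value for
    discovery.zen.minimum_master_nodes (guards against split brain)."""
    return master_eligible_nodes // 2 + 1


def node_settings(role, cluster_name, master_eligible_nodes=3):
    """Return elasticsearch.yml-style settings for one node role."""
    roles = {
        # Holds cluster state only: no data, no indexing.
        "master": {"node.master": True, "node.data": False},
        # Holds shards and does the indexing work.
        "data": {"node.master": False, "node.data": True},
        # Search/client node: routes requests and reduces aggs.
        "client": {"node.master": False, "node.data": False},
    }
    settings = {"cluster.name": cluster_name}
    settings.update(roles[role])
    settings["discovery.zen.minimum_master_nodes"] = master_quorum(master_eligible_nodes)
    return settings
```

With 3 master-eligible nodes the quorum works out to 2, so the cluster can lose one master and still elect a new one.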
  • unique cluster.name

  • set ES_HEAP_SIZE to half the available memory or 30g, whichever is smaller

    • e.g. export ES_HEAP_SIZE=10g then restart elasticsearch
    • or for ubuntu edit the param in /etc/default/elasticsearch
    • decimals are not allowed (10.5g); use m instead (10752m)
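
The heap rule above can be sketched as a small Python helper. The name `es_heap_size` and the choice to always emit whole megabytes (which sidesteps the no-decimals restriction) are assumptions for illustration, not something from the talk.

```python
def es_heap_size(total_ram_mb, cap_mb=30 * 1024):
    """Half of the machine's RAM or the cap (30g by default),
    whichever is smaller, as a whole number of megabytes."""
    heap_mb = min(total_ram_mb // 2, cap_mb)
    return "%dm" % heap_mb
```

For example, a 21 GB machine (21504 MB) gets `10752m`, and anything with 60 GB or more is capped at `30720m`.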
  • for write-heavy or bulk indexing, set index.refresh_interval to -1, then set it back to 1s after indexing or force a manual refresh

    curl -XPUT localhost:9200/test/_settings -d '{
      "index" : {
        "refresh_interval" : "-1"
      }
    }'
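
To make sure the refresh interval is always restored after a bulk load, the disable/restore pair can be wrapped in a context manager. This is a rough Python sketch: `put_settings` is a hypothetical caller-supplied function that performs the HTTP PUT to `/<index>/_settings`, so no particular client library is assumed.

```python
import json
from contextlib import contextmanager


def refresh_settings(interval):
    """JSON body for PUT /<index>/_settings."""
    return json.dumps({"index": {"refresh_interval": interval}})


@contextmanager
def refresh_disabled(put_settings, index, restore="1s"):
    """Turn refresh off for the duration of a bulk load, then restore it.

    put_settings(index, body) is a hypothetical callable that sends
    the settings body to Elasticsearch.
    """
    put_settings(index, refresh_settings("-1"))
    try:
        yield
    finally:
        # Runs even if the bulk load raises, so refresh is never left off.
        put_settings(index, refresh_settings(restore))
```

Usage would look like `with refresh_disabled(put_settings, "test"): run_bulk_load()`, where `run_bulk_load` stands in for whatever indexing code you have.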
    
  • set doc_values to true on each numeric or not_analyzed string field to keep fielddata for those fields on disk instead of in heap memory http://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html

    PUT /music/_mapping/song
    {
      "properties" : {
        "tag": {
          "type":       "string",
          "index" :     "not_analyzed",
          "doc_values": true 
        }
      }
    }
    

    Setting doc_values to true at field creation time is all that is required to use disk-based fielddata instead of in-memory fielddata. Doc values are supported on non-analyzed string fields and numeric fields (byte, short, integer, long, float, double, date).
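
Since doc values have to be set per field, a small helper can apply them across a mapping before the index is created. A minimal Python sketch, assuming the 1.x-style mapping shown above; `enable_doc_values` is a made-up name, and it only looks at top-level fields (nested object properties are left untouched).

```python
# Field types listed above as supporting doc values.
DOC_VALUE_TYPES = {"byte", "short", "integer", "long", "float", "double", "date"}


def enable_doc_values(properties):
    """Return a copy of a mapping's properties with doc_values set to
    true on every field type that supports it."""
    result = {}
    for name, field in properties.items():
        field = dict(field)  # copy so the caller's mapping is not mutated
        is_plain_string = (field.get("type") == "string"
                           and field.get("index") == "not_analyzed")
        if field.get("type") in DOC_VALUE_TYPES or is_plain_string:
            field["doc_values"] = True
        result[name] = field
    return result
```

Analyzed string fields are skipped on purpose, since doc values are not supported on them.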

  • when creating a lot of indexes, use index templates

    • when a new index is created, the matching template is applied for all index settings and mappings
    • great for time series daily indexes
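
A rough Python sketch of pairing a template with daily index names, assuming the Elasticsearch 1.x template format where the body's `template` key holds the index name pattern. The `logs` prefix, the `YYYY.MM.DD` suffix, and both helper names are illustrative choices, not from the talk.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Name of the index for one day, e.g. logs-2015.04.12."""
    return "%s-%s" % (prefix, day.strftime("%Y.%m.%d"))


def index_template(prefix, settings, mappings):
    """Body for PUT /_template/<name> (Elasticsearch 1.x format).

    Every new index whose name matches "<prefix>-*" picks up these
    settings and mappings automatically.
    """
    return {
        "template": prefix + "-*",
        "settings": settings,
        "mappings": mappings,
    }
```

Because each day's index name starts with the prefix, the template only has to be registered once.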
  • time series (TS) indexes

    • create an index per day (date) or week or month, etc

    • search across indexes

    • can route index to different nodes by age

      • for the use case where you might have the current month on two big data nodes, and the rest down sampled on 4 small data nodes
    • allows cheap closing of an index which frees up resources other than disk

    • allows cheap deleting of an index versus deleting data from an index

    • still have to write your own downsampling

    • use timeParts format where each entry has the iso date and time parts to allow cheap aggs and analysis:

      'isoDate': '2015-04-12T13:30:22+08:00',
      'timeParts': {
        'month': 'April',
        'weekDay': 'Sunday',
        'week': 15,
        'day': 12,
        'hour': 13,
        'minute': 30,
        'AM/PM': 'PM',
        'quarter': 2,
      }
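
The time parts can be computed at index time from the ISO timestamp. A minimal Python sketch (3.7+ for `datetime.fromisoformat`); `time_parts` is a made-up name, the parts are taken in the timestamp's own UTC offset, and `week` uses ISO week numbering, which is an assumption since other week schemes give different numbers.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]


def time_parts(iso_date):
    """Split an ISO 8601 timestamp into pre-computed parts so aggs can
    bucket on plain fields instead of doing date math at query time."""
    dt = datetime.fromisoformat(iso_date)
    return {
        "month": MONTHS[dt.month - 1],
        "weekDay": WEEKDAYS[dt.weekday()],  # weekday(): Monday is 0
        "week": dt.isocalendar()[1],        # ISO week number
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "AM/PM": "AM" if dt.hour < 12 else "PM",
        "quarter": (dt.month - 1) // 3 + 1,
    }
```

Hard-coded English names are used instead of `strftime("%B")` so the output does not depend on the machine's locale.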
  • with lots of data, use few doc types, maybe only one

  • NO Java 6

  • use the Oracle JVM

  • use the same JVM on all nodes

search

  • use count where you can
    • It's very fast since it doesn't have to go to disk for data
  • use filters, not queries; try to express your whole search as filters unless you need ranking
  • nesting filters into aggs is probably better than a flat search with aggs since it'll have to search twice
  • use explain/validate to tune search queries (maybe like postgres explain?)
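
A rough Python sketch of the filter-first advice, assuming Elasticsearch 1.x query syntax (the `filtered` query, the `bool` filter, and the `filter` bucket aggregation). The helper names and field names are made up for illustration. Sending the body with `?search_type=count` would return only totals and aggs, with no hits fetched.

```python
def filtered_search(filters, aggs=None):
    """Search body that matches via filters only, so no scoring is done."""
    body = {"query": {"filtered": {"filter": {"bool": {"must": filters}}}}}
    if aggs:
        body["aggs"] = aggs
    return body


def filter_agg(name, filter_clause, sub_aggs):
    """Nest a filter inside an aggregation so the sub-aggs only see
    documents matching that filter."""
    return {name: {"filter": filter_clause, "aggs": sub_aggs}}


# Example: count PM events per weekday among documents from April.
body = filtered_search(
    [{"term": {"timeParts.month": "April"}}],
    aggs=filter_agg(
        "pm_only",
        {"term": {"timeParts.AM/PM": "PM"}},
        {"by_weekday": {"terms": {"field": "timeParts.weekDay"}}},
    ),
)
```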

maintenance

  • hot threads API will point out bad things
  • check cluster health and stats
  • use Marvel locally (dev license) to figure out what you should be watching
  • set up alerts for disk almost full, etc.
  • see Tim's GitHub elasticsearch talk for a ton of info here.
  • /_nodes/stats to check heap usage
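
A minimal Python sketch of a heap check on top of `/_nodes/stats`, assuming the response has already been fetched and parsed into a dict, and that each node entry carries `name` and `jvm.mem.heap_used_percent`. The function name and the 75% threshold are arbitrary choices for illustration.

```python
def nodes_over_heap(stats, threshold_percent=75):
    """Return (node name, heap used %) pairs at or above the threshold,
    highest first, from a parsed /_nodes/stats response."""
    hot = []
    for node in stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot.append((node["name"], used))
    return sorted(hot, key=lambda pair: pair[1], reverse=True)
```

The result could feed whatever alerting you already have, alongside the disk checks mentioned above.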