# elasticsearch Crash Course!
- A way to search... things
- A way to search your data in terms of natural language, and so much more
- A distributed version of Lucene with a JSON API.
- A fancy clustered database
Under the hood is Lucene, a software library providing full-text indexing and search; elasticsearch provides an HTTP interface, clustering support, and other tools on top of it.
- Data is stored in an index, similar to an SQL DB
- Each index can store multiple types, each similar to an SQL table
- Items inside the index are documents that have a type
- Specifying attributes for a type is optional
- All data is sent as JSON, and can have an arbitrary depth
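For example, an arbitrarily nested document can be stored as-is. This is just a sketch: the :scratch index, :event type, and field names below are made-up, and elasticsearch will map the nested fields dynamically when the document is indexed.
require 'stretcher'
# A sketch: index, type, and field names here are arbitrary examples
server = Stretcher::Server.new('http://localhost:9200')
server.index(:scratch).type(:event).put(1, {
  text: "arbitrary nesting is fine",
  user: {name: "fry", location: {city: "New New York"}},
  tags: ["delivery", "late"]
})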
# Set up our server connection
require 'stretcher'
server = Stretcher::Server.new('http://localhost:9200')
# Create the index with its schema
server.index(:foo).create(mappings: {
  tweet: {
    properties: {
      text: {type: 'string', analyzer: 'snowball'}}}}) rescue nil
# Index one word per document
words = %w(Many dogs dog cat cats candles candleizer abscond rightly candlestick monkey monkeypulley deft deftly)
id = 0
words.each {|w|
  id += 1
  server.index(:foo).type(:tweet).put(id, {text: w})
}
- The document is a simple JSON hash:
{"text": "word" }
- Each document has a unique ID
- We use `put`; elasticsearch has a RESTish API (a sketch of the raw request follows below)
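To make the RESTish point concrete, here is a sketch of roughly the request that `put` issues, using only Ruby's standard library. The URL follows the /index/type/id document API, and the index, type, id, and body mirror the example above.
require 'net/http'
require 'json'
# A sketch of the underlying REST call: PUT /index/type/id with a JSON body
uri = URI('http://localhost:9200/foo/tweet/1')
req = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
req.body = {text: 'Many'}.to_json
res = Net::HTTP.start(uri.host, uri.port) {|http| http.request(req) }
puts res.body  # elasticsearch replies with JSON describing the indexed doc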
# A simple search
server.index(:foo).search(query: {match: {text: "abscond"}}).results.map(&:text)
# => ["abscond"]
- Our query is actually a JSON object
- Our response is also JSON!
Analysis is the process whereby words are transformed into tokens. The Snowball analyzer, for instance, turns English words into tokens based on their stems.
server.analyze("deft", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("deftly", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("deftness", analyzer: :snowball).tokens.map(&:token)
=> ["deft"]
server.analyze("candle", analyzer: :snowball).tokens.map(&:token)
=> ["candl"]
server.analyze("candlestick", analyzer: :snowball).tokens.map(&:token)
=> ["candlestick"]
# Will match deft and deftly
server.index(:foo).search(query: {match: {text: "deft"}}).results.map(&:text)
# => ["deft", "deftly"]
# Matches "candles" (same stem as "candle"), but not "candlestick"
server.index(:foo).search(query: {match: {text: "candle"}}).results.map(&:text)
# => ["candles"]
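Because the match query runs the same analyzer over the query text, a form that was never indexed should still find the stemmed documents. This is a sketch against the index built above; the expected result is inferred from the stems shown earlier.
# "deftness" is not in the word list, but it stems to "deft" (see above),
# so it should match the same documents as a search for "deft"
server.index(:foo).search(query: {match: {text: "deftness"}}).results.map(&:text)
# expected: ["deft", "deftly"]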
# NGram
server.analyze("news", tokenizer: "ngram", filter: "lowercase").tokens.map(&:token)
# => ["n", "e", "w", "s", "ne", "ew", "ws"]
# Stop word
server.analyze("The quick brown fox jumps over the lazy dog.", analyzer: :stop).tokens.map(&:token)
# => ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
# Path Hierarchy
server.analyze("/var/lib/racoons", tokenizer: :path_hierarchy).tokens.map(&:token)
# => ["/var", "/var/lib", "/var/lib/racoons"]
# Create the index
server.index(:users).create(
  settings: {
    analysis: {
      analyzer: {
        my_ngram: {type: "custom", tokenizer: "ngram", filter: "lowercase"}}}},
  mappings: {
    user: {
      properties: {
        name: {type: :string, analyzer: :my_ngram}}}})
# Store some fake data
users = %w(bender fry lela hubert cubert hermes calculon)
users.each_with_index {|name,i| server.index(:users).type(:user).put(i, {name: name}) }
# Our analyzer in action
server.index(:users).analyze("hubert", analyzer: :my_ngram).tokens.map(&:token)
# => ["h", "u", "b", "e", "r", "t", "hu", "ub", "be", "er", "rt"]
# Some queries
# An exact name: the ngram analyzer gives every user at least one matching token, but "hubert" scores highest
server.index(:users).search(query: {match: {name: "Hubert"}}).results.map(&:name)
# => ["hubert", "cubert", "bender", "hermes", "fry", "calculon", "lela"]
# A misspelled query still ranks the intended user first
server.index(:users).search(query: {match: {name: "Calclulon"}}).results.map(&:name)
# => ["calculon", "lela", "cubert", "bender", "hubert"]
# Individual docs can be boosted
server.index(:users).type(:user).put(1000, {name: "boiler", "_boost" => 1_000_000})
server.index(:users).search(query: {match: {name: "bender"}}).results.map(&:name)
# Wha?
# => ["boiler", "bender", "hermes", "cubert", "hubert", "calculon", "fry", "lela"]
server.index(:users).search(query: {match: {name: "lela"}}).results.map(&:name)
# Sweet Zombie Jesus!
# => ["boiler", "lela", "calculon", "bender", "hermes", "cubert", "hubert"]
Elasticsearch can report counts of common terms across documents. These counts are 'facets', the filter lists frequently seen on the left-hand side of web sites.
# Create a mapping for bands, with a 'name' and a 'genre'
server.index(:bands).create(mappings: {
  band: {
    properties: {
      name: {type: :string},
      genre: {type: :string, index: :not_analyzed}}}})
# Import some docs
[["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"],
 ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]].
  each_with_index {|b,i|
    server.index(:bands).type(:band).put(i, {name: b[0], genre: b[1]})
  }
# Perform a search
server.index(:bands).search(facets: {bands: {terms: {field: :genre}}}).
  facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["Post Rock", 2], ["IDM", 2], ["madchester", 1], ["Calypso", 1]]
# A more specific search
server.index(:bands).search(query: {match: {name: "Boards"}},
                            facets: {bands: {terms: {field: :genre}}}).
  facets.bands.terms.map {|f| [f[:term], f[:count]]}
# => [["IDM", 1]]
- Queries fan out across all of an index's shards, in parallel
- Shards are allocated automatically to nodes and rebalanced
- A query sent to any node will work; it is routed to the nodes holding the right shards
- Shard allocation can be made rack-aware
- Indexes have a configurable number of replicas; set this based on your failure tolerance (see the sketch below)
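As a sketch of that last point, shard and replica counts are index settings chosen at creation time. The :logs index name and the numbers here are arbitrary; replica counts can be changed later, shard counts cannot.
# A sketch: configure shard and replica counts when creating an index
server.index(:logs).create(settings: {
  number_of_shards: 5,     # fixed once the index is created
  number_of_replicas: 2    # extra copies, for failure tolerance and read throughput
})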
- elasticsearch is easy to set up!
- Just a Java jar; all you need is Java installed
- Has a .deb package available
- Clustering just works...
- If on a LAN they will find each other and figure everything out
- If on EC2, install the EC2 plugin and they will find each other
- There is no built-in security, but putting nginx in front as a proxy works well
- http://www.elasticsearch.org/
- Paramedic Cluster Monitoring tool: https://github.com/karmi/elasticsearch-paramedic
- This presentation: https://gist.github.com/andrewvc/5022184