@JasonTrue
Last active July 17, 2024 10:01
Searchkick and Elastic Search guidance

Resources:

https://github.com/ankane/searchkick

Indexing

By default, simply adding the call 'searchkick' to a model naively indexes all of its fields (but not has_many or belongs_to associations).

In practice, you'll need to customize what gets indexed. This is done by defining a method on your model called search_data:

def search_data
  {
    id: id,
    stringified_id: id.to_s,
    tags: tags.join(" "),
    user: user.full_name,
    pass_rate: calculate_pass_rate
  }
end

When you change the search_data hash structure, you'll need to reindex that model. You can do that in the Rails console with Model.reindex, or use the rake tasks: searchkick:reindex:all reindexes every model, and searchkick:reindex CLASS=Model reindexes just one.
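To see what such a hash looks like without a Rails app, here is a minimal sketch that uses plain Structs in place of ActiveRecord models. The Course and User structs, their attributes, and the pass-rate formula are hypothetical stand-ins; in a real model, search_data would sit next to the searchkick call.

```ruby
# Hypothetical stand-ins for ActiveRecord models; not part of Searchkick.
User = Struct.new(:full_name)

Course = Struct.new(:id, :tags, :user, :passed, :attempts) do
  # Illustrative calculation; any derived value can be indexed.
  def calculate_pass_rate
    return 0.0 if attempts.zero?
    (passed.to_f / attempts).round(2)
  end

  # Same shape as the search_data method above.
  def search_data
    {
      id: id,
      stringified_id: id.to_s, # string copy so IDs can match full-text queries
      tags: tags.join(" "),
      user: user.full_name,
      pass_rate: calculate_pass_rate
    }
  end
end

course = Course.new(42, %w[ruby search], User.new("Ada Lovelace"), 3, 4)
data = course.search_data
```

Inspecting the hash like this before reindexing is a cheap way to check that every value is a type you intend to search or filter on.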

Searching

In all recent versions of Elasticsearch, you need to explicitly specify the fields you'll search.

Your search should look something like this:

ModelName.search(query, fields: ['stringified_id', 'name', 'description', ...])

Common Indexing challenges, common solutions

We want to be able to search by ID (in full-text queries)

By default, an integer field can only be searched as an integer, but if you coerce the field to be a string it's searchable with full text search.

def search_data
  {
    id: id,
    stringified_id: id.to_s
  }
end

Your search should look something like this:

ModelName.search(query, fields: ['stringified_id', 'name', 'description', ...])

It's worth noting that because you can index non-string types (including arrays of non-string types), it sometimes comes in handy to do more of your searching/filtering in Elasticsearch than in Postgres. You can combine a full-text query with filters on specific fields.

def search_data
  {
    blog_id: blog_id,
    author: user.name,
    author_id: user.id,
    publish_year: publish_at.year,
    publish_month: publish_at.month,
    publish_day: publish_at.day,
    publish_at: publish_at,
    created_at: created_at,
    updated_at: updated_at,
    tags: tag_list,
    story: story,
    title: title,
    approved: approved
  }
end

Then a flexible, type-aware search that still does full text search on some fields, like title and story:

search_params = { approved: true, publish_at: { lte: 'now/m' } }
search_params = search_params.merge(blog_id: @blog.id) if @blog.present?
search_params = search_params.merge(publish_year: @year) if @year.present?
search_params = search_params.merge(publish_month: @month) if @month.present?
search_params = search_params.merge(publish_day: @day) if @day.present?
search_params = search_params.merge(tags: {all: @tags}) if @tags.present?

if @query.present?
  @posts = Post.search(@query, fields: [:title, :story], where: search_params, page: params[:page])
else
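  # No query string: "*" (Searchkick's default term) matches everything, so only the filters apply.
  @posts = Post.search("*", fields: [:title, :story], where: search_params, page: params[:page], per_page: 20,
                       order: [{publish_at: :desc}])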
end
logger.info({ query: @query, params: search_params })      
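The chain of conditional merges can also be collapsed into one small helper. This is a sketch under assumptions: build_search_params is a hypothetical name, and it drops only nil values and empty tag lists, whereas present? in the controller code above also drops blank strings.

```ruby
# Hypothetical helper: builds the `where` hash for Post.search from optional
# filters. Keys mirror the search_data fields above.
def build_search_params(blog_id: nil, year: nil, month: nil, day: nil, tags: nil)
  base = { approved: true, publish_at: { lte: "now/m" } }
  optional = {
    blog_id: blog_id,
    publish_year: year,
    publish_month: month,
    publish_day: day,
    tags: (tags.nil? || tags.empty?) ? nil : { all: tags }
  }
  # Hash#compact removes the filters that were not supplied (nil values).
  base.merge(optional.compact)
end

params = build_search_params(blog_id: 7, year: 2024, tags: %w[ruby rails])
defaults = build_search_params
```

Keeping the filter construction in a plain method makes it testable without Elasticsearch running.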

We want to eager load associations so that it's not so expensive to update the index.

Define a scope named search_import, and invoke the appropriate #joins or #includes.

scope :search_import, -> { includes(study_tracking: :study_tracking_details) }

We have soft-deleted records and want to exclude them from indexing

Similar to the solution above, put the filter in the search_import scope:

scope :search_import, -> { where(deleted: false) }

However, this scope is used only for batch import. When an individual entity is saved, it is updated separately, so you'll also want to implement:

  def should_index?
    !deleted
  end

Avoid short query strings (single or two character searches) returning lots of results.

By default, misspelling-tolerant (fuzzy) search is turned on in Searchkick. So the two ways to reduce unwanted search results are to turn off or adjust the misspelling tolerance, or to query with a relevancy score filter.

For example, UserCourse.search(params[:query], fields: ["name^5", "id"], misspellings: {below: 5}) only falls back to misspelling-tolerant matching when the plain search returns fewer than 5 results.

Alternatively, MyModel.search query, body_options: {min_score: 1} tunes out a lot of noise.
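One way to apply this only to very short queries is to choose the options based on query length. The sketch below is an assumption-laden illustration: search_options_for and the 3-character threshold are hypothetical, while misspellings: false and misspellings: {below: 5} are regular Searchkick options.

```ruby
# Assumed threshold; tune it for your data.
MIN_FUZZY_LENGTH = 3

# Hypothetical helper: returns the options hash to pass to Model.search.
def search_options_for(query, fields:)
  options = { fields: fields }
  if query.to_s.strip.length < MIN_FUZZY_LENGTH
    # Very short queries: turn misspelling tolerance off entirely.
    options[:misspellings] = false
  else
    # Longer queries: fuzzy matching only when there are few plain matches.
    options[:misspellings] = { below: 5 }
  end
  options
end

short_opts = search_options_for("ab", fields: ["name^5", "id"])
long_opts  = search_options_for("algebra", fields: ["name^5", "id"])
# In an app with Searchkick loaded:
#   UserCourse.search(query, **search_options_for(query, fields: ["name^5", "id"]))
```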

Match "Reallyenglish, Co., Ltd." organization with "really"

Make sure the index includes this configuration for the field you want:

searchkick word_start: [:name]

Match word start on a specific search:

UserCourse.search(query,
  fields: ['stringified_id', 'name', 'description', ...],
  match: :word_start
)

Match "Test Reallyenglish Program" program with "english" (In the middle of name)

Make sure the index includes this configuration for the field you want:

searchkick word_middle: [:name]

Then for the search:

UserCourse.search(query,
  fields: ['stringified_id', 'name', 'description', ...],
  match: :word_middle
)
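To make the difference between word_start and word_middle concrete, here is a conceptual sketch in plain Ruby. As far as I understand, Searchkick backs these options with edge n-gram and n-gram style analysis inside Elasticsearch; the helpers below (edge_ngrams, ngrams, tokens, and the two match predicates) are hypothetical and only imitate the idea, not the real analyzers.

```ruby
# Conceptual illustration only; Elasticsearch builds such tokens in its analyzers.

# All prefixes of a word, e.g. "abc" -> ["a", "ab", "abc"].
def edge_ngrams(word)
  (1..word.length).map { |n| word[0, n] }
end

# All contiguous substrings of a word.
def ngrams(word)
  (0...word.length).flat_map do |start|
    (1..word.length - start).map { |n| word[start, n] }
  end.uniq
end

# Naive tokenizer: lower-case, split on anything that is not a letter or digit.
def tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# word_start: the query must be a prefix of some word.
def word_start_match?(text, query)
  tokens(text).any? { |t| edge_ngrams(t).include?(query.downcase) }
end

# word_middle: the query may appear anywhere inside a word.
def word_middle_match?(text, query)
  tokens(text).any? { |t| ngrams(t).include?(query.downcase) }
end

starts  = word_start_match?("Reallyenglish, Co., Ltd.", "really")
prefix  = word_start_match?("Test Reallyenglish Program", "english")
middle  = word_middle_match?("Test Reallyenglish Program", "english")
```

Here starts and middle are true while prefix is false, which is why the second example needs word_middle rather than word_start. The trade-off is index size: storing every substring costs far more than storing only prefixes.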

Don't match "新潟大学" organization with "新人" (Disabling ambiguity)

Exact match with User Email

UserCourse.search(query,
  fields: [{email: :exact}, :name],
  match: :word_middle
)

This is a case-sensitive search, however, and probably not exactly what you want. More likely you'll want a tokenizer that treats an email address as a single word, which is a little more complicated. The article below covers this, but it requires a custom mapping and a reconfigured analyzer.

https://medium.com/linagora-engineering/searching-email-address-in-elasticsearch-3b09a11e3c2b

This will require something like:

searchkick merge_mappings: true, mappings: {...}

It may also require using an explicit search body.

However, one way to avoid this complexity is to use the exact matching above, index the field in lower case, and perhaps lower-case any query strings that look like email addresses before searching.
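The query-side half of that workaround can be sketched as a small pre-filter. Everything here is an assumption for illustration: normalize_query is a hypothetical helper, and EMAIL_LIKE is a deliberately loose pattern, not a full email validator. On the indexing side, the matching step would be something like email: email.downcase in search_data.

```ruby
# Loose, assumed pattern for "looks like an email address"; not RFC-complete.
EMAIL_LIKE = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Lower-case the query only when it looks like an email address, so that it
# can match an email field that was indexed in lower case.
def normalize_query(query)
  stripped = query.to_s.strip
  stripped.match?(EMAIL_LIKE) ? stripped.downcase : stripped
end

email_query = normalize_query(" Jane.Doe@Example.COM ")
plain_query = normalize_query("New York")
```

Other queries pass through unchanged, so names and titles keep their original casing.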

Case sensitivity

By default, searches are case insensitive. To override that for everything, you can alter the searchkick call, e.g. searchkick case_sensitive: [:field, :list], or use exact matching on individual fields:

UserCourse.search(query,
  fields: [{my_field: :exact}, :other_field]
)

Japanese-aware indexing

While there's reasonable support out of the box for Japanese search, you can get additional features with the elasticsearch analysis-kuromoji plugin.

searchkick language: "japanese"

If you go down this route, and want to support multiple analyzers, you need to use the searchkick mappings feature and multiple fields. It's not terribly hard, but it's more involved than a quick FAQ can handle.

See https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html for some possible options, and the searchkick docs for how to do custom mappings and custom/advanced search.

Notes on combining the options above

Generally combinations are supported by choosing the right field to query. Most of the parameters that normally take a symbol can be replaced with a hash from that symbol to various options. ( https://github.com/ankane/searchkick will have better examples than I can provide).

Options that are not compatible with each other

In principle, you can create several fields that have their own analyzers and behaviors. When you build up the Search call, you can combine options. I'm not aware of specific incompatibilities but relevancy weighting may appear better or worse depending on the user's expectations. So for example, if you have a dilemma about how to search something, you could potentially use very dissimilar pseudo_fields with different search rules, and just include all of them, with potentially different boosting rules, in your search call.

Some parameters update frequently or require a lot of CPU time to reindex

In conjunction with a scheduled background job, you can call ModelName.reindex(:custom_reindexer) and define a method like the one below that returns only the fields that need special treatment.

def custom_reindexer
  {
    just_the_field_that_matters: calculation_method
  }
end
@AdityaBhutani

GREAT !! Really helpful l

@34code

34code commented Dec 2, 2023

thanks for this!

@anko20094

By default searches are case insensitive.

I'm not sure about this.
If you have some record with value NEW YORK It won't find by new york or New York
