New reports system

In order to improve the response time of our reports, as well as to enable new features such as visualizations, we are comparing two possible solutions for storing, querying, and exporting precompiled datasets: Elasticsearch and PostgreSQL.

Requirements

We want to perform fast searches within datasets of up to millions of documents, adding up to several GBs.

The user will filter the results by choosing predefined fields, selecting operators, and filling in values, then combining these conditions with AND and OR operators.
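
For concreteness, a filter like "status is active AND (country is JP OR country is KR)" would map to the two candidates roughly as follows. This is a minimal sketch; the `reports` index/table and all field names are hypothetical:

```python
# Sketch only: "status" and "country" are hypothetical field names.

# Elasticsearch expresses the boolean tree in its query DSL
# (bool/must ~ AND, bool/should ~ OR):
es_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "active"}},
                {
                    "bool": {
                        "should": [
                            {"term": {"country": "JP"}},
                            {"term": {"country": "KR"}},
                        ]
                    }
                },
            ]
        }
    }
}

# PostgreSQL expresses the same tree as a WHERE clause with bound parameters:
pg_sql = """
    SELECT * FROM reports
    WHERE status = %s AND (country = %s OR country = %s)
"""
pg_params = ("active", "JP", "KR")
```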

The user will also sort the results by choosing multiple columns, each with its own sort direction.
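
Both engines support this kind of multi-column, per-direction sorting. A hedged sketch, with hypothetical field names:

```python
# Elasticsearch takes a list of {field: {"order": direction}} objects.
# Sorting on analyzed text usually requires a keyword subfield (assumed here).
es_sort = [
    {"created_at": {"order": "desc"}},
    {"title.keyword": {"order": "asc"}},
]

# PostgreSQL takes an ORDER BY list, each column with its own direction:
pg_sql = "SELECT * FROM reports ORDER BY created_at DESC, title ASC"
```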

These results will be paginated in the frontend, loading 10k documents initially and 100k per subsequent page.
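
A sketch of how those page sizes could translate to offset-based pagination in each engine (the `reports` index/table is hypothetical):

```python
def es_page(page: int) -> dict:
    """Build an Elasticsearch from/size request body for the given page."""
    if page == 0:
        return {"from": 0, "size": 10_000}
    return {"from": 10_000 + (page - 1) * 100_000, "size": 100_000}


def pg_page(page: int) -> str:
    """Build the equivalent LIMIT/OFFSET query for PostgreSQL."""
    if page == 0:
        return "SELECT * FROM reports LIMIT 10000 OFFSET 0"
    offset = 10_000 + (page - 1) * 100_000
    return f"SELECT * FROM reports LIMIT 100000 OFFSET {offset}"
```

One caveat: Elasticsearch caps `from + size` at `index.max_result_window` (10,000 by default), so pages this deep would need that limit raised, or `search_after`/scrolling instead of plain offsets.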

To prevent the UI from freezing, we will only render a fixed maximum number of records at a time, so supporting backwards pagination (as the user scrolls back) is a nice-to-have.

The entire query result set (all pages) may also be exported to CSV or Excel. It's ok for this operation not to feel instantaneous.
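
Each candidate has a natural export path: scrolling through the whole result set in Elasticsearch, or letting PostgreSQL emit CSV directly with COPY. A minimal sketch using the official Python clients; the connection details and the `reports` index/table are hypothetical:

```python
import csv

import psycopg2
from elasticsearch import Elasticsearch, helpers


def export_es_csv(path: str) -> None:
    """Stream every hit of a query to CSV via the scroll helper."""
    es = Elasticsearch("http://localhost:9200")
    with open(path, "w", newline="") as f:
        writer = None
        for hit in helpers.scan(es, index="reports",
                                query={"query": {"match_all": {}}}):
            row = hit["_source"]
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=row.keys())
                writer.writeheader()
            writer.writerow(row)


def export_pg_csv(path: str) -> None:
    """Let PostgreSQL itself produce the CSV with COPY."""
    conn = psycopg2.connect("dbname=reports")
    with conn.cursor() as cur, open(path, "w") as f:
        cur.copy_expert("COPY (SELECT * FROM reports) TO STDOUT WITH CSV HEADER", f)
```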

The search must take the user's current language into account and match on the appropriate field's translation in the index (when available, falling back to the default otherwise).
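
One possible shape for the fallback logic, assuming per-language fields such as `title_ja` alongside a `title_default` (all names hypothetical):

```python
def es_title_query(lang: str, text: str) -> dict:
    """Query both the translated and default fields; boosting the
    language-specific one makes it win whenever a translation exists."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [f"title_{lang}^2", "title_default"],
            }
        }
    }


# In SQL, the fallback can be made explicit with COALESCE:
pg_sql = "SELECT * FROM reports WHERE COALESCE(title_ja, title_default) ILIKE %s"
```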

The index must work well with different character sets, such as Japanese, Chinese and Korean.

In the near future we will want to perform full-text search over long texts (in multiple languages), so we need to take that into account.
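
As a rough sketch of what that would look like (the `body` field is hypothetical):

```python
# Elasticsearch analyzes text at index time, so a plain match query suffices;
# CJK support comes from analysis plugins such as kuromoji (Japanese).
es_query = {"query": {"match": {"body": "quarterly revenue"}}}

# PostgreSQL uses tsvector/tsquery with a language-specific configuration for
# stemming; CJK support typically needs extensions such as pg_bigm or PGroonga.
pg_sql = """
    SELECT * FROM reports
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
"""
```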

We will also calculate aggregations and build visualizations over these datasets.
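
A typical visualization-backing aggregation, counting documents per category, would look roughly like this in each engine (the `category` field is hypothetical):

```python
# Elasticsearch: a terms aggregation; size 0 skips the hits themselves.
es_query = {
    "size": 0,
    "aggs": {"per_category": {"terms": {"field": "category"}}},
}

# PostgreSQL: a plain GROUP BY.
pg_sql = "SELECT category, COUNT(*) FROM reports GROUP BY category"
```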

Comparison

| Elasticsearch | PostgreSQL |
| --- | --- |
| Query language is simple but has a learning curve | Industry-standard query language (SQL) |
| Non-transferable skill set | Most skills transfer to other relational DBs |
| Product-specific implementation, so changing vendors would be difficult | Implementation can be mostly product-agnostic as long as it sticks to SQL |
| Much younger product; maintenance and third-party support are good today, but there's no guarantee they will stay that way | Stronger guarantee of continued maintenance and great third-party support |
| Maintained by a single big company, so it's easier to get qualified support | Not maintained by a single centralized company, but several companies offer support |
| Huge community and great documentation | Huge community and great documentation |
| Easier and faster aggregations | Complex aggregations may require multiple queries |
| More features out of the box | Can achieve all the same features, but requires additional effort or plugins/libraries |
| Adds more tools (such as facets) to our current stack, giving us a larger toolset for new features | Not that different from MySQL (note: it does have better support for full-text search, unstructured data, geocoding, etc.) |
| Lower latency by connecting directly from the client | No simple, safe way to connect directly from the client, so requests must go through an application server |
| No need to decode queries since they are sent directly to ES, but requires additional authn/authz | Can reuse the authn/authz from our application server, but adds the maintenance cost of safely decoding queries |
| Fine-tuned control over full-text search, at the cost of a steeper learning curve | Fewer options for full-text search, usually relying on external plugins |
| Supports non-alphabetic languages via plugins | Supports non-alphabetic languages via plugins |
| Configuration is described by a data structure, so it's simple to model in code and write tests for | Schema and indices are defined in separate SQL-specific statements, which are more complex to model in code and harder to test |
| More resilient to the configuration errors common in migrations; the worst-case scenario is usually omitted results | Configuration errors usually cause runtime failures |
| Automatically generated configuration is decent in many cases, so it's faster to prototype with | All configuration needs to be precisely defined |
| Simpler to implement offset-based pagination | Needs an additional query to retrieve the total number of results (see the sketch after this table) |
| First-party UI (Kibana) with monitoring, visualization, etc. | Third-party, less powerful solutions |
| Horizontal scaling is simple; multiple nodes are managed transparently | Horizontal scaling needs third-party solutions and hard-coded configuration; vertical scaling is easier |
| We already have plans for other features that would use Elasticsearch | No current plans to use PostgreSQL for anything except this |
| Slower batch changes such as bulk inserts, which we will perform when updating the database | Very fast batch changes, which can be made even faster by disabling indices before the update |
| No simple way to export the entire result of a query (possible solutions include Logstash or scrolling through all pages) | Easy to export the whole result |
| Edge-case scenarios can cause inconsistencies between replicas (note: this has been mitigated in recent versions, and our use case is not prone to these scenarios) | Basically no possibility of inconsistencies |
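
To illustrate the pagination row above: Elasticsearch returns the total hit count alongside each page, while PostgreSQL needs a second query for it. A hedged sketch; the `reports` index/table is hypothetical:

```python
# One Elasticsearch request yields both the page and the exact total
# (track_total_hits avoids the default capped count):
es_body = {
    "query": {"match_all": {}},
    "from": 0,
    "size": 100,
    "track_total_hits": True,
}
# resp["hits"]["hits"] holds the page; resp["hits"]["total"]["value"] the count.

# PostgreSQL needs the page query plus a separate COUNT(*):
pg_page_sql = "SELECT * FROM reports LIMIT 100 OFFSET 0"
pg_count_sql = "SELECT COUNT(*) FROM reports"
```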