To improve the response time of our reports, and to enable new features such as visualizations, we are comparing two candidate solutions for storing, querying and exporting precompiled datasets: Elasticsearch and PostgreSQL.
We want to perform fast searches within datasets of up to millions of documents, adding up to gigabytes in size.
The user will filter the results by choosing predefined fields, selecting operators and entering values, then combining these conditions with AND and OR.
The user will also sort the results by choosing multiple columns, each with its own sorting direction.
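As a sketch of how these combinable filters and multi-column sorts could translate into a query, here is a minimal example that builds an Elasticsearch-style request body from user-chosen conditions. The field names, the operator set, and the `build_search_body` helper are illustrative assumptions, not part of our schema:

```python
# Sketch: translate user-selected (field, operator, value) conditions,
# combined with AND, into an Elasticsearch bool query with sorting.
# OR groups would map to a nested bool/should clause in the same way.
OPERATORS = {
    "eq": lambda f, v: {"term": {f: v}},
    "gt": lambda f, v: {"range": {f: {"gt": v}}},
    "lt": lambda f, v: {"range": {f: {"lt": v}}},
}

def build_search_body(and_conditions, sort_columns):
    """and_conditions: list of (field, operator, value) combined with AND.
    sort_columns: list of (column, "asc"|"desc") pairs, applied in order."""
    return {
        "query": {
            "bool": {
                "filter": [OPERATORS[op](f, v) for f, op, v in and_conditions]
            }
        },
        "sort": [{col: {"order": direction}} for col, direction in sort_columns],
    }

body = build_search_body(
    [("status", "eq", "open"), ("amount", "gt", 100)],
    [("created_at", "desc"), ("id", "asc")],
)
```

The equivalent PostgreSQL query would be a `WHERE` clause with `AND`/`OR` plus a multi-column `ORDER BY`, but it would have to be assembled server-side with careful parameterization.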
These results will be paginated in the frontend, loading 10k documents initially and then 100k documents for each subsequent page.
To prevent the UI from freezing, we will render at most a fixed number of records at a time, so supporting backwards pagination (as the user scrolls back up) is a nice-to-have.
The query result set may also be exported in its entirety to CSV or Excel (including all pages). It's OK for this operation not to feel instantaneous.
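The full export can be implemented the same way against either store, by paging through the whole result set and streaming rows into a CSV. A minimal sketch, where `fetch_page` is a hypothetical callback standing in for either an ES search or a SQL `SELECT` with `LIMIT`/`OFFSET`:

```python
import csv
import io

def export_to_csv(fetch_page, fieldnames, page_size=100_000):
    """Stream every page of a query result into a CSV string.
    fetch_page(offset, limit) -> list of row dicts (hypothetical callback)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    offset = 0
    while True:
        rows = fetch_page(offset, page_size)
        if not rows:
            break  # past the last page
        writer.writerows(rows)
        offset += len(rows)
    return buf.getvalue()
```

In production this would write to a file or HTTP response rather than memory, but the paging loop is the part that stays the same across both candidates.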
The search must take the user's current language into account, matching on the appropriate field's translation in the index when available and falling back to the default otherwise.
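The fallback logic itself is small; a sketch assuming a `base.lang` / `base.default` sub-field naming scheme (the scheme and the helper name are assumptions, not our actual mapping):

```python
def field_for_language(base_field, lang, available_fields):
    """Pick the translated sub-field for the user's language,
    falling back to the default sub-field when no translation is indexed.
    available_fields is the set of sub-fields present in the index."""
    candidate = f"{base_field}.{lang}"
    if candidate in available_fields:
        return candidate
    return f"{base_field}.default"
```

The chosen field name would then be plugged into the match clause (ES) or column reference (SQL) of the generated query.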
The index must work well with different writing systems, such as Japanese, Chinese and Korean.
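On the Elasticsearch side this means per-language analyzers; for example, Japanese text can use the `kuromoji` analyzer from the official analysis-kuromoji plugin, with analogous plugins for Chinese (analysis-smartcn) and Korean (analysis-nori). A sketch of index mappings, with illustrative field names:

```python
# Sketch: mappings giving the Japanese translation of a field the kuromoji
# analyzer (requires the analysis-kuromoji plugin), while the default
# translation keeps the standard analyzer. Field names are illustrative.
index_body = {
    "mappings": {
        "properties": {
            "description.ja": {"type": "text", "analyzer": "kuromoji"},
            "description.default": {"type": "text", "analyzer": "standard"},
        }
    }
}
```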
In the near future we will also want to perform full-text search over long texts, including texts in multiple languages, so we need to take that into account.
We will also calculate aggregations and build visualizations over these datasets.
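For a sense of what these aggregations look like in each candidate, here is the same "count documents per category" aggregation sketched in both; the field, table and bucket names are illustrative:

```python
# Sketch: count documents per category in Elasticsearch (terms aggregation)
# and in PostgreSQL (GROUP BY). Names are illustrative only.
es_aggregation = {
    "size": 0,  # we only want the buckets, not the matching documents
    "aggs": {
        "per_category": {"terms": {"field": "category", "size": 20}}
    },
}

sql_aggregation = """
SELECT category, COUNT(*) AS doc_count
FROM documents
GROUP BY category
ORDER BY doc_count DESC
LIMIT 20;
"""
```

The single-request ES form extends naturally to nested buckets and metrics, which is the "easier and faster aggregations" point in the comparison table.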
| Elasticsearch | PostgreSQL |
|---|---|
| Query language is simple but has a learning curve | Industry-standard SQL |
| Non-transferable skill set | Most skills transfer to other relational DBs |
| Product-specific implementation, so changing vendors would be difficult | Implementation can stay mostly product-agnostic as long as it sticks to SQL |
| Much younger product; well maintained with decent third-party support today, but no guarantee this will continue | Stronger guarantee of continued maintenance and third-party support |
| Maintained by a single big company, so qualified support is easier to find | Not maintained by one centralized company, but several companies offer support |
| Both have huge communities and great documentation | - |
| Easier and faster aggregations | Complex aggregations may require multiple queries |
| More features out of the box | Can achieve the same features, but with additional effort or plugins/libraries |
| Adds new tools (such as facets) to our current stack, giving us a larger toolset for new features | Not that different from MySQL (though it has better support for full-text search, unstructured data, geocoding, etc.) |
| Lower latency by connecting directly from the client | No simple, safe way to connect directly from the client, so queries must go through an application server |
| Queries can be sent to ES as-is with no server-side decoding, but this requires additional authn/authz | Can reuse our application server's authn/authz, but adds the maintenance cost of decoding queries safely |
| Fine-tuned control over full-text search, at the cost of a steeper learning curve | Fewer options for full-text search, usually relying on external plugins |
| Both support non-alphabetic languages via plugins | - |
| Configuration is described by a data structure, so it is simple to model in code and write tests for | Schema and indexes are defined in separate SQL-specific statements, which are more complex to code and harder to test |
| More resilient to the configuration errors common in migrations; the worst case is usually omitted results | Configuration errors usually cause runtime failures |
| Automatically generated configuration is decent in many cases, so prototyping is faster | All configuration needs to be defined precisely |
| Simpler to implement offset-based pagination | Needs an additional query to retrieve the total number of results |
| First-party UI (Kibana) with monitoring, visualization, etc. | Only third-party, less powerful solutions |
| Simple horizontal scaling; manages multiple nodes transparently | Horizontal scaling needs a third-party solution and hard-coded configuration; vertical scaling is easier |
| We already have plans for other features that would use Elasticsearch | No current plans to use PostgreSQL for anything except this |
| Slower batch changes such as bulk inserts, which we will do when updating the database | Very fast batch changes; can be even faster by disabling indexes before the update |
| No simple way to export the entire result of a query (possible solutions include Logstash or scrolling through all pages of a query) | Easy to export the whole result |
| Edge cases can cause inconsistencies between replicas (mitigated in recent versions, and our use case is not prone to them) | Essentially no possibility of inconsistencies |