cratedb Index/Partitions/Shards

This document aims to give some general understand and guidance on some topics for cratedb in the context of our IoT use-case with timeseries data. Where possible sample numbers from the c-holistic-comparision pitch are added for the purpose of illustration:

Indices and how cratedb analises the OBJECT type.
Cardinality of the data and why it is important to understand it.
Some basic undertanding of paritioning.
Sharding Sizes and basic numbers of thumb.
At the end there is FAQ with information and links about Lucene indices in general.
What todo with the vast amount of data you are probably inserting

a little bit context of the IoT use-case

CREATE TABLE IF NOT EXISTS "doc"."timeseries" (
   "ts" TIMESTAMP WITH TIME ZONE,
   "g_ts_day" TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('week', "ts"),
   "payload" OBJECT(STRICT) AS (
      "button_press" BIGINT,
      "line" BIGINT,
      "plant" BIGINT,
      "sensor_id" BIGINT,
      "value" REAL
   )
)
CLUSTERED INTO 21 SHARDS
PARTITIONED BY ("g_ts_week")

General lecture (//todo)

How to set an index and what to consider?

Obvisouly without an index a databases life would be very hard. Fortunately cratedb makes it very easy to end up with a decent performance without thinking too much about indices at all. But nevertheless there are some points to understand.

PRIMARY KEY

The link tuning for insert performance suggests to use the system column _id instead of a uuid in case that there is no explicty primary key.

The value is a unique identifier for each row in a table and is a compound string representation of all primary key values of that row. If no primary keys are defined the id is randomly generated. If no dedicated routing column is defined the _id value is used for distributing the records on the shards.

Here is an example doing a select _id, ts from timeseries limit 3; with an auto-generated _id, as there is no primary key.

_id	ts
M7-seHEB4zR5qR-wVs_Z	 1586892842500
Nb-seHEB4zR5qR-wVs_Z	 1586892842500
W7-seHEB4zR5qR-wVs_Z	 1586892842500

This table has a primary key defined select _id, value from outlier limit 1;

_id	value
1	  6.420000076293945

In the context of our IoT use case we tend to come up with a schema-less table design which gives a maximum of flexibility. Nevertheless it makes sense to have some sort of primary key over the data. Be warned about using a timestamp as your primary key. Since Albert Einstein we know that time is relative. It might even be the case that the time setup on the nodes in a cluster is kind of messed up. Or what makes things further complicated if time between nodes doing data processing and the cluster nodes is out of sync (clock drift or clock skew).

ROUTING //todo

Keeps data together in Shards.

OBJECT (IGNORE)

Just to be aware of the option to disable indexing of an object by specifingignored

Cardinality

In SQL, the term cardinality refers to the uniqueness of data values contained in a particular column of a database table. The lower the cardinality, the more duplicated elements in a column. Thus, a column with the lowest possible cardinality would have the same value for every row. SQL databases use cardinality to help determine the optimal query plan for a given query.

Wikipedia - Cardinality (SQL statements) distinguishes between

High-cardinality (eg. a primary Index, something more like a UserID)
Normal-cardinality (eg. customer last name)
Low-cardinality (eg. a boolean, or Y/N/.. )

select count(*) from  (
select distinct payload['sensor_id'] as sensors from timeSERIES
 ) as cardio

c-holistic dataset analysis

field	cardinality	detail/avg. ratio
payload['sensor_id']	2500	2_452_848
payload['line']	5	1_226_424_000
payload['plant']	100	61_321_200
payload['value']	141	43_490_200
payload['button_press']	2	306_606_0000
g_ts_week	4	partition key ~2_503_047_000 per week
g_ts_day	4x7	~432_000_000 per day
count(*)		6_132_120_000
1h over 1 sensor		~8_000
24h over 1 sensor		~192_000
size on disk		~400GB

QUERY PERFORMANCE

This is where the magic happens. Typically it is the questions of a good index choosen. A primary key is a good thing, if you use it in the query. In our IoT case we typically reccommend to use the field type OBJECT (for the purpose of schema-less operations), which puts an index on the fields unless the option ignored is set, which does not create an index (which IMHO is not that clear in the documentation). Which can be done eg. by something similar to

...where payload['plant'] = '1' -- and _docid % 10 = 0

To be fast the query has to use an index and the number of records returned should be as small as possible. Especially if you want to to aggreations on top of that data.

There is some interesting information in our documentation, be sure to know that, but it might not be a valid thing in each case.

PARTITIONING

There is a decent explanation in the documentation.

parititon stores the data in Shards. The number of shards is set by create table (default=3). Paritioning data is a very important concept. It is something like a sub-table. It makes perfect sense to think about the queries you want to run and how that would influence the partitioning.

Working on timeseries style-data it makes probably sense to partition on hour, week, day, ...

Questions to ask:

How do you plan to query your data?

SHARDING

The number of shards is set by create table (default=3) and can be changed.

Changing the number of shards in general works in the following steps.

A new target table is created but with more/less number of primary shards.

The segments from the source table (the underling Lucene index to be precise) are hard-linked into the target table at file system level.

The source table is dropped while the new table is renamed into the source and then recovered in the cluster.

crate - SHARDING GUIDE

General Guidance

Rules of "thumb":

a shard size of 50GB seems to be something that works good in many cases.
20 shards / GB of HEAP

To get this right. The partition size is nothing that is configured. It is defined by how much data is stored into a table/partition and then further by the number of shards specified by the create table statement. If you are still in the design phase of your datamodel it might need some adjustments to get this right. Probably some trial and error.

Shard Basics

Data is organized in indices.
Each index is made up to one or many shards.
Each shard is an instance of a Lucene index.
Data is peridodically published into new immutable Lucene segemnts on disk. Then it can be queried.
shards are the unit at which data is moved around in the cluster. eg. rebalancing data.
50GB is often quoted as a good shard size and works for many scenarios
number of shards a node can hold is proportional to the amount of HEAP available. No fixed limit. Keep max. 20 shards per GB. eg. 30GB heap --> 600 shards

Shards an performance

Each query is executed in a single thread per shard.
Multiple shards can be queried in parallel.
minimum query latency (without caching)
- depend on data
- type of query
- *size of the shard
best way to determine the maximum shard size from a query performance perspective is realistic benchmarks

Example anaylsis from the c-holistic use-case

Looking at the rules of thumb: a) Shard Size: 9 GB b) Shards per Node / per Parition: 3 c) HEAP: 4GB RAM

52 weeks x 3 Shards = 156 / 4 GB RAM = 39 Shards / GB

Which would probably suggest to increase the RAM on the nodes!? //TODO

select sum(size / 1024 / 1024 / 1024) as size_in_gb, 
   sum(num_docs), id, -- partition_ident as pid,
       node['name'] as node_name
  from sys.shards
  where primary = true  
     and partition_ident = '04732d9o6spj8c1o60o30c1g' 
     and table_name = 'timeseries'
  group by id, num_docs, node_name
  order by node_name asc

Shard Size / documents per Node

size_in_gb	sum(num_docs)	id	node_name
9	 143998259	 1	 crate-b2734c87681143269e27395d3479440e-0
9	 144006288	 12	 crate-b2734c87681143269e27395d3479440e-0
9	 144009711	 8	 crate-b2734c87681143269e27395d3479440e-0
9	 143982328	 14	 crate-b2734c87681143269e27395d3479440e-1
9	 144009269	 10	 crate-b2734c87681143269e27395d3479440e-1
9	 143996393	 15	 crate-b2734c87681143269e27395d3479440e-2
9	 144012346	 3	 crate-b2734c87681143269e27395d3479440e-2
9	 143977050	 13	 crate-b2734c87681143269e27395d3479440e-2
9	 144012496	 5	 crate-b2734c87681143269e27395d3479440e-3
9	 144027387	 19	 crate-b2734c87681143269e27395d3479440e-3
9	 144008157	 2	 crate-b2734c87681143269e27395d3479440e-4
9	 143983898	 0	 crate-b2734c87681143269e27395d3479440e-4
9	 144020408	 9	 crate-b2734c87681143269e27395d3479440e-4
9	 143997629	 7	 crate-b2734c87681143269e27395d3479440e-5
9	 143990450	 17	 crate-b2734c87681143269e27395d3479440e-5
9	 143993762	 20	 crate-b2734c87681143269e27395d3479440e-6
9	 143999000	 4	 crate-b2734c87681143269e27395d3479440e-6
9	 143997743	 6	 crate-b2734c87681143269e27395d3479440e-6
9	 143977914	 11	 crate-b2734c87681143269e27395d3479440e-6
9	 144006164	 16	 crate-b2734c87681143269e27395d3479440e-6
9	 143993348	 18	 crate-b2734c87681143269e27395d3479440e-6

select  count(*) as number_of_shards, 
       node['name'] as node_name
   from sys.shards
  where primary = true  and partition_ident = '04732d9o6spj8c1o60o30c1g' and
    table_name = 'timeseries'
  group by node_name
  order by node_name asc
  limit 100;

Shards per Node

number_of_shards	node_name
3	 crate-b2734c87681143269e27395d3479440e-0
2	 crate-b2734c87681143269e27395d3479440e-1
3	 crate-b2734c87681143269e27395d3479440e-2
2	 crate-b2734c87681143269e27395d3479440e-3
3	 crate-b2734c87681143269e27395d3479440e-4
2	 crate-b2734c87681143269e27395d3479440e-5
6	 crate-b2734c87681143269e27395d3479440e-6

Sizing

crate - SHARDING GUIDE How many shards should my I have in my Elasticsearch cluster? Scale is not infinite (es) realistic benchmarks (shards)

FAQ

What is a lucence index?

Can be thought of as a self-contained search egine that indexes and handels queries for a subset of the data in a cluster.

Apache Lucene 8.5.1 Documentation

quantitave cluster sizing

realistic benchmarks](https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing)

bulk indexing? analyzed fields .. what is not analyzed? peak or constant data flow coming in? compression level? force merge - 100% sure we are done with writing into this index plot index speed and query response time????

WalBeh/cratedb - partitions - shards.md

Select an option

No results found