This document aims to give some general understand and guidance on some topics for cratedb in the context of our IoT use-case with timeseries data. Where possible sample numbers from the c-holistic-comparision pitch are added for the purpose of illustration:
- Indices and how cratedb analises the
OBJECTtype. - Cardinality of the data and why it is important to understand it.
- Some basic undertanding of paritioning.
- Sharding Sizes and basic numbers of thumb.
- At the end there is FAQ with information and links about Lucene indices in general.
- What todo with the vast amount of data you are probably inserting
CREATE TABLE IF NOT EXISTS "doc"."timeseries" (
"ts" TIMESTAMP WITH TIME ZONE,
"g_ts_day" TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('week', "ts"),
"payload" OBJECT(STRICT) AS (
"button_press" BIGINT,
"line" BIGINT,
"plant" BIGINT,
"sensor_id" BIGINT,
"value" REAL
)
)
CLUSTERED INTO 21 SHARDS
PARTITIONED BY ("g_ts_week")Obvisouly without an index a databases life would be very hard. Fortunately cratedb makes it very easy to end up with a decent performance without thinking too much about indices at all. But nevertheless there are some points to understand.
The link tuning for insert performance suggests to use the system column _id instead of a uuid in case that there is no explicty primary key.
The value is a unique identifier for each row in a table and is a compound string representation of all primary key values of that row. If no primary keys are defined the id is randomly generated. If no dedicated routing column is defined the _id value is used for distributing the records on the shards.
Here is an example doing a select _id, ts from timeseries limit 3; with an auto-generated _id, as there is no primary key.
_id ts
M7-seHEB4zR5qR-wVs_Z 1586892842500
Nb-seHEB4zR5qR-wVs_Z 1586892842500
W7-seHEB4zR5qR-wVs_Z 1586892842500
This table has a primary key defined select _id, value from outlier limit 1;
_id value
1 6.420000076293945
In the context of our IoT use case we tend to come up with a schema-less table design which gives a maximum of flexibility. Nevertheless it makes sense to have some sort of primary key over the data. Be warned about using a timestamp as your primary key. Since Albert Einstein we know that time is relative. It might even be the case that the time setup on the nodes in a cluster is kind of messed up. Or what makes things further complicated if time between nodes doing data processing and the cluster nodes is out of sync (clock drift or clock skew).
Keeps data together in Shards.
Just to be aware of the option to disable indexing of an object by specifingignored
In SQL, the term cardinality refers to the uniqueness of data values contained in a particular column of a database table. The lower the cardinality, the more duplicated elements in a column. Thus, a column with the lowest possible cardinality would have the same value for every row. SQL databases use cardinality to help determine the optimal query plan for a given query.
Wikipedia - Cardinality (SQL statements) distinguishes between
- High-cardinality (eg. a primary Index, something more like a UserID)
- Normal-cardinality (eg. customer last name)
- Low-cardinality (eg. a boolean, or Y/N/.. )
select count(*) from (
select distinct payload['sensor_id'] as sensors from timeSERIES
) as cardio
| field | cardinality | detail/avg. ratio |
|---|---|---|
| payload['sensor_id'] | 2500 | 2_452_848 |
| payload['line'] | 5 | 1_226_424_000 |
| payload['plant'] | 100 | 61_321_200 |
| payload['value'] | 141 | 43_490_200 |
| payload['button_press'] | 2 | 306_606_0000 |
| g_ts_week | 4 | partition key ~2_503_047_000 per week |
| g_ts_day | 4x7 | ~432_000_000 per day |
| count(*) | 6_132_120_000 | |
| 1h over 1 sensor | ~8_000 | |
| 24h over 1 sensor | ~192_000 | |
| size on disk | ~400GB |
This is where the magic happens. Typically it is the questions of a good index choosen. A primary key is a good thing, if you use it in the query. In our IoT case we typically reccommend to use the field type OBJECT (for the purpose of schema-less operations), which puts an index on the fields unless the option ignored is set, which does not create an index (which IMHO is not that clear in the documentation). Which can be done eg. by something similar to
...where payload['plant'] = '1' -- and _docid % 10 = 0
To be fast the query has to use an index and the number of records returned should be as small as possible. Especially if you want to to aggreations on top of that data.
There is some interesting information in our documentation, be sure to know that, but it might not be a valid thing in each case.
There is a decent explanation in the documentation.
parititon stores the data in Shards. The number of shards is set by create table (default=3). Paritioning data is a very important concept. It is something like a sub-table. It makes perfect sense to think about the queries you want to run and how that would influence the partitioning.
Working on timeseries style-data it makes probably sense to partition on hour, week, day, ...
- How do you plan to query your data?
The number of shards is set by create table (default=3) and can be changed.
Changing the number of shards in general works in the following steps.
- A new target table is created but with more/less number of primary shards.
- The segments from the source table (the underling Lucene index to be precise) are hard-linked into the target table at file system level.
- The source table is dropped while the new table is renamed into the source and then recovered in the cluster.
Rules of "thumb":
- a shard size of 50GB seems to be something that works good in many cases.
- 20 shards / GB of HEAP
To get this right. The partition size is nothing that is configured. It is defined by how much data is stored into a table/partition and then further by the number of shards specified by the create table statement. If you are still in the design phase of your datamodel it might need some adjustments to get this right. Probably some trial and error.
- Data is organized in indices.
- Each index is made up to one or many shards.
- Each shard is an instance of a Lucene index.
- Data is peridodically published into new immutable Lucene segemnts on disk. Then it can be queried.
- shards are the unit at which data is moved around in the cluster. eg. rebalancing data.
- 50GB is often quoted as a good shard size and works for many scenarios
- number of shards a node can hold is proportional to the amount of HEAP available. No fixed limit. Keep max. 20 shards per GB. eg. 30GB heap --> 600 shards
- Each query is executed in a single thread per shard.
- Multiple shards can be queried in parallel.
- minimum query latency (without caching)
- depend on data
- type of query
- *size of the shard
- best way to determine the maximum shard size from a query performance perspective is realistic benchmarks
Looking at the rules of thumb: a) Shard Size: 9 GB b) Shards per Node / per Parition: 3 c) HEAP: 4GB RAM
52 weeks x 3 Shards = 156 / 4 GB RAM = 39 Shards / GB
Which would probably suggest to increase the RAM on the nodes!? //TODO
select sum(size / 1024 / 1024 / 1024) as size_in_gb,
sum(num_docs), id, -- partition_ident as pid,
node['name'] as node_name
from sys.shards
where primary = true
and partition_ident = '04732d9o6spj8c1o60o30c1g'
and table_name = 'timeseries'
group by id, num_docs, node_name
order by node_name asc
size_in_gb sum(num_docs) id node_name
9 143998259 1 crate-b2734c87681143269e27395d3479440e-0
9 144006288 12 crate-b2734c87681143269e27395d3479440e-0
9 144009711 8 crate-b2734c87681143269e27395d3479440e-0
9 143982328 14 crate-b2734c87681143269e27395d3479440e-1
9 144009269 10 crate-b2734c87681143269e27395d3479440e-1
9 143996393 15 crate-b2734c87681143269e27395d3479440e-2
9 144012346 3 crate-b2734c87681143269e27395d3479440e-2
9 143977050 13 crate-b2734c87681143269e27395d3479440e-2
9 144012496 5 crate-b2734c87681143269e27395d3479440e-3
9 144027387 19 crate-b2734c87681143269e27395d3479440e-3
9 144008157 2 crate-b2734c87681143269e27395d3479440e-4
9 143983898 0 crate-b2734c87681143269e27395d3479440e-4
9 144020408 9 crate-b2734c87681143269e27395d3479440e-4
9 143997629 7 crate-b2734c87681143269e27395d3479440e-5
9 143990450 17 crate-b2734c87681143269e27395d3479440e-5
9 143993762 20 crate-b2734c87681143269e27395d3479440e-6
9 143999000 4 crate-b2734c87681143269e27395d3479440e-6
9 143997743 6 crate-b2734c87681143269e27395d3479440e-6
9 143977914 11 crate-b2734c87681143269e27395d3479440e-6
9 144006164 16 crate-b2734c87681143269e27395d3479440e-6
9 143993348 18 crate-b2734c87681143269e27395d3479440e-6
select count(*) as number_of_shards,
node['name'] as node_name
from sys.shards
where primary = true and partition_ident = '04732d9o6spj8c1o60o30c1g' and
table_name = 'timeseries'
group by node_name
order by node_name asc
limit 100;
number_of_shards node_name
3 crate-b2734c87681143269e27395d3479440e-0
2 crate-b2734c87681143269e27395d3479440e-1
3 crate-b2734c87681143269e27395d3479440e-2
2 crate-b2734c87681143269e27395d3479440e-3
3 crate-b2734c87681143269e27395d3479440e-4
2 crate-b2734c87681143269e27395d3479440e-5
6 crate-b2734c87681143269e27395d3479440e-6
crate - SHARDING GUIDE How many shards should my I have in my Elasticsearch cluster? Scale is not infinite (es) realistic benchmarks (shards)
Can be thought of as a self-contained search egine that indexes and handels queries for a subset of the data in a cluster.
realistic benchmarks](https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing)
bulk indexing? analyzed fields .. what is not analyzed? peak or constant data flow coming in? compression level? force merge - 100% sure we are done with writing into this index plot index speed and query response time????