So I have some time series data, potentially lots of it - 10-100M+ items.
The schema is roughly:
timestamp
: datetime of event

value
: float value of event

source_id
: the "thing" that generated this data

zone_id
: the container in which source_id is located
Data will be inserted into the store in arbitrary order.
I need to be able to efficiently work out, at a given point in time, the latest-recorded/last-known value for every source_id within a given zone_id. So for the given data, queried at time=8:
timestamp | value | source_id | zone_id
---------------------------------------
10 | 5 | 1 | 2
8 | 3 | 1 | 2
1 | 6 | 2 | 2
3 | 4 | 1 | 2
5 | 3 | 3 | 2
2 | 2 | 2 | 2
...the results would be:
source_id | value
-----------------
1 | 3
2 | 2
3 | 3
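To pin down the semantics, here is a minimal sketch of that "latest value per source_id at or before time T" logic on the sample rows above. The single-pass dict approach is purely illustrative of the query, not of any storage engine:

```python
rows = [
    # (timestamp, value, source_id, zone_id) -- the sample data above
    (10, 5, 1, 2),
    (8, 3, 1, 2),
    (1, 6, 2, 2),
    (3, 4, 1, 2),
    (5, 3, 3, 2),
    (2, 2, 2, 2),
]

def latest_values(rows, at_time, zone_id):
    # Keep, per source_id, the row with the greatest timestamp <= at_time.
    latest = {}  # source_id -> (timestamp, value)
    for ts, value, src, zone in rows:
        if zone != zone_id or ts > at_time:
            continue
        if src not in latest or ts > latest[src][0]:
            latest[src] = (ts, value)
    return {src: value for src, (ts, value) in latest.items()}

print(latest_values(rows, at_time=8, zone_id=2))
# {1: 3, 2: 2, 3: 3}
```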
I'm trying to determine the most appropriate storage engine for performing this query. At the moment I'm using PostgreSQL with window queries, but it's fairly slow. This doesn't seem like a particularly hard problem to solve, so does anybody know of a better technology I could try?
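For reference, the window-query pattern I mean looks roughly like the following. This runs against an in-memory SQLite database purely to keep the example self-contained; the table/column names are assumed from the schema above, and the real query runs in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        timestamp INTEGER,
        value     REAL,
        source_id INTEGER,
        zone_id   INTEGER
    )
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(10, 5, 1, 2), (8, 3, 1, 2), (1, 6, 2, 2),
     (3, 4, 1, 2), (5, 3, 3, 2), (2, 2, 2, 2)],
)

# Rank each source_id's rows by descending timestamp (among rows at or
# before the query time), then keep only the top-ranked row per source.
query = """
    SELECT source_id, value FROM (
        SELECT source_id, value,
               ROW_NUMBER() OVER (
                   PARTITION BY source_id ORDER BY timestamp DESC
               ) AS rn
        FROM events
        WHERE zone_id = ? AND timestamp <= ?
    ) AS ranked
    WHERE rn = 1
    ORDER BY source_id
"""
print(conn.execute(query, (2, 8)).fetchall())
# [(1, 3.0), (2, 2.0), (3, 3.0)]
```

With 10-100M+ rows, the cost is dominated by scanning and partitioning every matching row per query, which is where it gets slow.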