@jaz303
Last active August 29, 2015 14:23

So I have some time series data - potentially lots of it: 10-100M+ items.

The schema is roughly:

  • timestamp: datetime of event
  • value: float value of event
  • source_id: the "thing" that generated this data
  • zone_id: the container in which source_id is located

Data will be inserted into the store in arbitrary order.

I need to be able to efficiently work out, at a given point in time, the latest-recorded/last-known value for every source_id within a given zone_id. So for the data below, queried at time=8:

timestamp | value | source_id | zone_id
---------------------------------------
10        | 5     | 1         | 2
8         | 3     | 1         | 2
1         | 6     | 2         | 2
3         | 4     | 1         | 2
5         | 3     | 3         | 2
2         | 2     | 2         | 2

...the results would be:

source_id | value
-----------------
1         | 3
2         | 2
3         | 3
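The query above can be sketched as a single scan: for each source_id in the zone, keep the row with the greatest timestamp that is still <= the query time. A minimal in-memory sketch (function and variable names are illustrative, not from any particular library):

```python
def last_known_values(rows, zone_id, at_time):
    """For each source_id in zone_id, return the value of its
    latest row with timestamp <= at_time."""
    latest = {}  # source_id -> (timestamp, value)
    for ts, value, source, zone in rows:
        if zone != zone_id or ts > at_time:
            continue  # wrong zone, or recorded after the query time
        if source not in latest or ts > latest[source][0]:
            latest[source] = (ts, value)
    return {source: value for source, (ts, value) in latest.items()}

# The sample data from above, as (timestamp, value, source_id, zone_id):
rows = [
    (10, 5, 1, 2),
    (8, 3, 1, 2),
    (1, 6, 2, 2),
    (3, 4, 1, 2),
    (5, 3, 3, 2),
    (2, 2, 2, 2),
]
print(last_known_values(rows, zone_id=2, at_time=8))  # {1: 3, 2: 2, 3: 3}
```

Note the row at timestamp 10 is skipped because it post-dates the query time, so source 1 resolves to its timestamp-8 value.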

I'm trying to determine the most appropriate storage engine for performing this query. At the moment I'm using PostgreSQL with window queries but it's fairly slow. This doesn't seem like a particularly hard problem to solve, so does anybody know of a better technology I could try?
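For what it's worth, the window query can often be rewritten as a greatest-timestamp-per-source aggregate joined back to the table, which planners tend to handle better than a window function over the whole set. A sketch of that SQL shape, run against SQLite purely for illustration (the `events` table and column names are assumed from the schema above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    timestamp INTEGER, value REAL, source_id INTEGER, zone_id INTEGER)""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (10, 5, 1, 2), (8, 3, 1, 2), (1, 6, 2, 2),
    (3, 4, 1, 2), (5, 3, 3, 2), (2, 2, 2, 2),
])

# For each source_id in the zone, find its max timestamp <= the query
# time, then join back to fetch the value recorded at that timestamp.
query = """
SELECT e.source_id, e.value
FROM events e
JOIN (SELECT source_id, MAX(timestamp) AS ts
      FROM events
      WHERE zone_id = ? AND timestamp <= ?
      GROUP BY source_id) latest
  ON e.source_id = latest.source_id AND e.timestamp = latest.ts
WHERE e.zone_id = ?
ORDER BY e.source_id
"""
print(conn.execute(query, (2, 8, 2)).fetchall())  # [(1, 3.0), (2, 2.0), (3, 3.0)]
```

In PostgreSQL itself the same result can also be expressed with `SELECT DISTINCT ON (source_id) ... ORDER BY source_id, timestamp DESC`, which with an index on `(zone_id, source_id, timestamp)` typically avoids scanning the full series.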
