@xeoncross
Last active February 4, 2020 18:28
Compact storage of time series values by using bitmaps and assigning every value a unique id for that bitmap.

At most of the companies I’ve worked with, I’ve needed to store basic client + timestamp activity logs in a database. It could be access logs, ad impressions, thread/post views, email activity, campaign clicks, etc.; it’s always some client identifier plus a timestamp. Then I need a report (SUM) of how many events exist in a given block of time. However, I also often need the actual individual rows so I can look at how a particular client performed over a certain block of time.

Storing 8/16-byte client ids plus 4-byte timestamps is rather costly when doing this hundreds of thousands or millions of times per day. What am I looking for?

Originally I thought I might have a good use case for time series with bitmaps for each date (or time deltas). However, clients are often very sparse: I might only see them a few times before they disappear, and the cost of generating a bitmap for the whole search space is greater than simply storing a few regular records.

ip+timestamp+someid
email+timestamp+someid
user_id+timestamp+someid
etc..

Where “someid” is the topic we’re watching this data for.

Had an interesting idea: what if values were saved and given an id, and each row referenced which values it stored using the bitmap offset of that value’s id? Assuming second-level resolution, you would have ~31 million values per year, resulting in about 4MB of bitmap for each thing you’re tracking. If the data is sparse, it would obviously be a bad idea to allocate the full year’s bitmap, so deltas could be used to skip large swaths of the bitmap you don’t need.
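The idea above can be sketched roughly like this. The `Bitmap` class and `BASE` epoch are illustrative names, not from the gist: one bit per second since a base epoch, grown lazily so a sparse client never pays for the full ~4MB year up front.

```python
from datetime import datetime, timezone

# Illustrative base epoch; every timestamp becomes a bit offset from here.
BASE = int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp())

class Bitmap:
    """One bitmap per tracked key; one bit per second of activity."""

    def __init__(self):
        self.buf = bytearray()

    def set(self, ts: int) -> None:
        bit = ts - BASE                  # second-resolution offset
        byte, mask = bit // 8, 1 << (bit % 8)
        if byte >= len(self.buf):        # grow lazily instead of pre-allocating
            self.buf.extend(b"\x00" * (byte - len(self.buf) + 1))
        self.buf[byte] |= mask

    def count(self, start: int, end: int) -> int:
        """The SUM report: number of events in [start, end)."""
        return sum(
            1 for ts in range(start, end)
            if (bit := ts - BASE) // 8 < len(self.buf)
            and self.buf[bit // 8] & (1 << (bit % 8))
        )
```

A naive per-second scan like `count` is fine for a sketch; a real implementation would popcount whole bytes/words across the range instead.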

A uint32 gives us 4,294,967,296 distinct values. At ~31 million seconds per year, we can store almost 140 years of second-resolution timestamp numbers.
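The arithmetic behind that claim, for reference:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000
capacity = 2 ** 32                   # distinct uint32 values

# ~136 years of second-resolution ids fit in a uint32
print(capacity // SECONDS_PER_YEAR)  # 136
```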

value = 2020-02-04T18:06:05
value_id = next free uint32
key -> byte slice of marker+bytes pairs
  {ip/email/user/clientid} -> [marker][byte1][byte2],[marker][bitmapbyte1],etc...

The marker would specify the length in bytes of the number it's marking (0-255); you then read the next X bytes to get the number of bitmap bits you are skipping. Then there is a marker specifying how long the bitmap is, then a marker for the next segment you are "skipping".

I guess the markers for the “swaths” would need to be 1 byte specifying the length of the following number, which would probably offset the savings versus just storing a straight 8-byte timestamp value.

[marker][byte1][byte2],[marker][bitmapbyte1],[marker][byte1],[marker][bitmapbyte],etc…
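A rough, hypothetical encoder/decoder for that layout, working at byte rather than bit granularity. The one-byte markers and the 255-byte segment cap are my assumptions; the gist doesn't pin them down:

```python
def encode(bitmap: bytes) -> bytes:
    """Encode a sparse bitmap as [skip-marker][skip bytes][len-marker][bitmap bytes]...
    Skip-marker = width in bytes of the skip count; len-marker = segment length."""
    out, i = bytearray(), 0
    while i < len(bitmap):
        skip = 0
        while i < len(bitmap) and bitmap[i] == 0:   # count zero bytes to skip
            skip += 1
            i += 1
        if i >= len(bitmap):
            break                                   # trailing zeros are dropped
        seg = bytearray()
        while i < len(bitmap) and bitmap[i] != 0 and len(seg) < 255:
            seg.append(bitmap[i])
            i += 1
        skip_bytes = skip.to_bytes((skip.bit_length() + 7) // 8 or 1, "big")
        out.append(len(skip_bytes))                 # marker: width of skip count
        out += skip_bytes
        out.append(len(seg))                        # marker: bitmap segment length
        out += seg
    return bytes(out)

def decode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        w = data[i]; i += 1
        skip = int.from_bytes(data[i:i + w], "big"); i += w
        out += b"\x00" * skip                       # re-expand the skipped swath
        n = data[i]; i += 1
        out += data[i:i + n]; i += n
    return bytes(out)
```

For a bitmap with 1000 leading zero bytes and 2 set bytes, this encodes to 6 bytes, which is where the savings over a raw bitmap come from; the overhead discussed above is the two marker bytes per segment.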

Could also use bitmaps with a value header instead of markers inside the content values.
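That header variant could be as simple as prefixing each stored value with a fixed-width base offset that says where the bitmap starts; the `struct` layout here is just an assumption for illustration:

```python
import struct

def pack_with_header(base_offset: int, bits: bytes) -> bytes:
    """Prefix the bitmap with a uint32 base offset instead of inline skip markers."""
    return struct.pack(">I", base_offset) + bits

def unpack_with_header(value: bytes) -> tuple[int, bytes]:
    (base,) = struct.unpack(">I", value[:4])
    return base, value[4:]
```

One header per value is cheaper than repeated markers when a client's activity is clustered in a single stretch of time, but it can't skip gaps in the middle the way the marker scheme can.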
