The current on-disk format is that each KV is stored in the tree file as
<<$t,TreeId:22/binary,$s,Segment:64/integer,Key/binary>>
TreeId is 22 bytes (20 bytes from the vnode id, and 2 bytes for the N value). The implementation in hashtree uses 1M segments, and the Segment is stored as a 64-bit unsigned integer. Finally, the Key is usually term_to_binary({Bucket,Key}).
See https://github.com/basho/riak_core/blob/develop/src/hashtree.erl#L651-L662
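For reference, a minimal sketch of what the current layout amounts to (the function name is illustrative, not the actual code in hashtree.erl):

current_key(TreeId, Segment, Key) when byte_size(TreeId) =:= 22 ->
    <<$t, TreeId:22/binary, $s, Segment:64/integer, Key/binary>>.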
All in all, we can shave this down significantly, saving valuable I/O.
Imagine this:
DbKey = <<BucketLen:16, Bucket/binary, Key/binary>>,
TreeId = <<Partition:24, N:8>>,
<<$t, TreeId/binary, $s, Segment:32, DbKey/binary>>
This would shave 33 bytes off each such entry. For a system with 1B keys and N=3, that is an overall reduction in storage of roughly 100GB. That's a lot of data that needs to be read to do a rehash.
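To make that concrete, here is a small sketch (assuming Bucket and Key are plain binaries; entry_saving is an illustrative name, not existing code) that builds both layouts for the same bucket/key and returns the per-entry difference:

entry_saving(Bucket, Key) ->
    TreeId22 = <<0:176>>,              %% placeholder 22-byte tree id
    OldKey = term_to_binary({Bucket, Key}),
    Old = <<$t, TreeId22/binary, $s, 0:64/integer, OldKey/binary>>,
    New = <<$t, 0:24, 3:8, $s, 0:32,   %% TreeId = Partition:24, N:8
            (byte_size(Bucket)):16, Bucket/binary, Key/binary>>,
    byte_size(Old) - byte_size(New).   %% 33 for any binary Bucket/Key

The 33 bytes break down as 18 from TreeId, 4 from Segment, and 11 from replacing the 13 bytes of external term format framing around the key with a 2-byte length prefix.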
Limitations:
- can only have 2^32 segments
- can only have 2^24 partitions
- max N-value is 255.
But I could live with that.
This is a minor optimization compared to the above.
When hashing segments, we load up the full list of {<<Key>>,<<Hash>>} pairs, run term_to_binary on it, and then pass the result to crypto:sha. I imagine we could save something by utilizing knowledge of the Erlang term format.
The problem, as I see it, is that the term_to_binary call can produce a very large binary, and with it a need to copy a whole lot of data.
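For illustration, one way around that would be plain incremental hashing with crypto:hash_init/hash_update/hash_final, feeding the pairs into the SHA context piecewise instead of materialising one big binary. Note this is not the term-format trick, and it would not reproduce the digest of the current crypto:sha(term_to_binary(...)) call (hash_segment is an illustrative name):

hash_segment(KeyHashes) ->
    Final = lists:foldl(
              fun({Key, Hash}, Ctx) ->
                      %% feed Key and Hash directly; a real version would
                      %% need length framing to keep the input unambiguous
                      crypto:hash_update(crypto:hash_update(Ctx, Key), Hash)
              end,
              crypto:hash_init(sha),
              KeyHashes),
    crypto:hash_final(Final).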
The associated sha_test.erl proves that concern wrong, btw.