Given a collection of elements X, a sharding function should map every element of X into a shard.
trait Sharder[X] {
/// The final output must always be in the range [0, totalShards-1], hopefully evenly distributed in that range.
def apply(totalShards: Int): X => Int
}Note that we have to fix the number of shards we want before getting our sharding function. One thing we may want is to change the number of shards. When we do this, it is expected that a bunch of x: X will now map to new shards. Depending on how we intend to change the number of shards, we can be clever about defining Sharder[X] so as to minimize the number of x: X which move to a new shard.
For simplicitly, I'm going to assume X is Int in what follows (since you could hash X).
If every time we reshard we intend to multiply the number of shards by d, the ideal Sharder is one that takes the is the mod of the input:
object MultiplySharder[Int] {
def apply(totalShards: Int): Int => Int = _ % totalShards
}On average,
If every time we reshard we intend to add just one more shard, a good Sharder is one that will use contiguous intervals as shards:
object AddingOneSharder[Int] {
def apply(totalShards: Int): Int => Int = {
val shardSize = Int.MaxValue / totalShards + 1 // the +1 is for some edge cases - it doesn't change distributions
_ / shardSize
}
}On average, 1 / 2 elements will change shards. Here's a quick sketch of why.
The first shard contains 0 to Int.MaxValue / totalShards. When you reshard, it'll have switch to containing 0 to Int.MaxValue / (totalShards + 1), so you'll need to move Int.MaxValue / totalShards - Int.MaxValue / (totalShards + 1) elements. The next shard will need to move twice that, then next three times that, and so on. You end up needing to move half the elements.
( \sum_{i=1}^n i ) * ( 1/n - 1/(n + 1) )
= n(n + 1)/2 * (n + 1 - n)/(n(n + 1))
= 1/2
Note that the shards N / 2 to N move more than half of their elements to another shard, so it would be more efficient to just re-label the shards. If we assume re-labelling of shards is possible, the second half of shards will need to move the same number of elements as the first half. In the end, you only need to move approximately between N / 4 and N / 2 elements (the larger the N, the closer you get to N / 4).
( 2 * \sum_{i=1}^{n/2} i ) * ( 1/n - 1/(n + 1) )
= 2 * n(n + 2)/8 * (n + 1 - n)/(n(n + 1))
= 1/4 * (n+2)/(n+1)