Skip to content

Instantly share code, notes, and snippets.

@harpocrates
Last active February 1, 2019 19:42
Show Gist options
  • Select an option

  • Save harpocrates/d48d982972412d706bc021afe1f28371 to your computer and use it in GitHub Desktop.

Select an option

Save harpocrates/d48d982972412d706bc021afe1f28371 to your computer and use it in GitHub Desktop.

Given a collection of elements X, a sharding function should map every element of X into a shard.

trait Sharder[X] {
  /// The final output must always be in the range [0, totalShards-1], hopefully evenly distributed in that range.
  def apply(totalShards: Int): X => Int
}

Note that we have to fix the number of shards we want before getting our sharding function. One thing we may want is to change the number of shards. When we do this, it is expected that a bunch of x: X will now map to new shards. Depending on how we intend to change the number of shards, we can be clever about defining Sharder[X] so as to minimize the number of x: X which move to a new shard.

For simplicitly, I'm going to assume X is Int in what follows (since you could hash X).

Growing the number of shards exponentially

If every time we reshard we intend to multiply the number of shards by d, the ideal Sharder is one that takes the is the mod of the input:

object MultiplySharder[Int] {
   def apply(totalShards: Int): Int => Int = _ % totalShards
}

On average, $\frac{d-1}{d}$ elements will change shards.

Growing the number of shards by 1

If every time we reshard we intend to add just one more shard, a good Sharder is one that will use contiguous intervals as shards:

object AddingOneSharder[Int] {
  def apply(totalShards: Int): Int => Int = {
    val shardSize = Int.MaxValue / totalShards + 1  // the +1 is for some edge cases - it doesn't change distributions
    _ / shardSize
  }
}

On average, 1 / 2 elements will change shards. Here's a quick sketch of why.

The first shard contains 0 to Int.MaxValue / totalShards. When you reshard, it'll have switch to containing 0 to Int.MaxValue / (totalShards + 1), so you'll need to move Int.MaxValue / totalShards - Int.MaxValue / (totalShards + 1) elements. The next shard will need to move twice that, then next three times that, and so on. You end up needing to move half the elements.

( \sum_{i=1}^n i ) * ( 1/n - 1/(n + 1) )
  = n(n + 1)/2 * (n + 1 - n)/(n(n + 1))
  = 1/2

Note that the shards N / 2 to N move more than half of their elements to another shard, so it would be more efficient to just re-label the shards. If we assume re-labelling of shards is possible, the second half of shards will need to move the same number of elements as the first half. In the end, you only need to move approximately between N / 4 and N / 2 elements (the larger the N, the closer you get to N / 4).

( 2 * \sum_{i=1}^{n/2} i ) * ( 1/n - 1/(n + 1) )
  = 2 * n(n + 2)/8 * (n + 1 - n)/(n(n + 1))
  = 1/4 * (n+2)/(n+1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment