Redis SCAN cursor

Have you ever wondered what the cursor value that Redis returns for SCAN commands means? I did, and set out to make sense of these seemingly arbitrary numbers.

Some time ago I was reading a StackOverflow article, that linked the documentation inside the Redis SCAN implementation. At the time, I only quickly skimmed the text. I remember thinking that it should be possible to come up with an estimate of how far along the SCAN is.

Today I 'cracked' the algorithm. It is not really that much of an achievement, given it is thoroughly documented, but figuring it out without the source code documentation still gave me a Eureka-moment. The result:

-> % node debug-redis-SCAN.js 21
INPUT   REVERSE-BITWISE       REVERSE VAL  % OF 2^21-1 (111111…)
858947  11000010110110001011  1596182      76.11%
1885267 110010100010001100111 1655911      78.96%
2008915 110010101110010101111 1662127      79.25%
1566163 110010111010011111101 1668349      79.55%
962867  11001100100011010111  1675694      79.90%
307123  1100110111110101001   1687204      80.45%
784031  11111001011011111101  2043386      97.43%

The table says it all if you look closely, but might seem a bit cryptic at first. I'll explain what Redis does, spoiler: it is as elegant as it is simple.

Suppose you have N keys, then Redis has reserved a table that is a bit bigger than N, namely it reserves a power of 2 such that 2^X > N (read: such that it fits). In above result the X is 21, so the maximum number of 'buckets' the table fits is 2.097.151. Once you use more keys than that, Redis will double the table, more on that later. The minimal index of that table is 00000… and the maximum is 11111… (with N bits).

If you SCAN, you give it a number where to start, which defaults to 0. Redis will take this number, and revert the bits. Then it will read the keys at that index, increment the index, and loop, until the COUNT indexes have been tried. Suppose your table can hold 8 items (X=3, 2³=16). Lets see what happens if we SCAN with a COUNT of 1, Redis would use these values:

COMMAND  BITS SWAPPED  NEXT  SWAPPED  CURSOR RETURNED
SCAN 0   __0  000      001   100      4
SCAN 4   100  001      010   010      2
SCAN 2   _10  010      011   110      6
SCAN 6   110  011      100   001      1
SCAN 1   __1  100      101   101      5
SCAN 5   101  101      110   011      3
SCAN 3   _11  110      111   111      7
SCAN 7   111  111      000   000      0

'Rehashing'

Here, we increase the cursor with 1 every time, by using COUNT 1. If we would instead use a larger COUNT, then Redis would scan the range of keys that are inside the [START…START+COUNT] range. Because it swaps the bit order, it can 'survive' situations where the table size is doubled/quadrupled. Example when using COUNT 2:

# no resize
COMMAND  BITS  SWAPPED  +1    +2    SWAPPED  CURSOR RETURNED
SCAN 0   __0   000      001   010   010      2
SCAN 2   _10   010      011   100   001      1
vs
# with resize
COMMAND  BITS  SWAPPED  +1    +2    SWAPPED  CURSOR RETURNED
SCAN 0   __0   000      001   010   010      2
====== table resizes to X=4 =========================
SCAN 2   __10  0100     0101  0110  0110     6
SCAN 6   _110  0110     0111  1000  0001     1

As you scan see from this demo, it doesn't matter that the table resizes half-way the SCAN: it still reaches cursor 1, it only takes 1 extra iteration.

Ok, great, but how does this ensure we get every element at least once? Because when the table is resized, the hashing is 'stable'. So a key that is stored at index 001 with x=3, will be stored at 0010 or 0011 with x=4. Important: the key keeps the prefix 001! And since we're using this bit-ordering to walk through all keys, we'll never miss a key when the table is rehashed.

Since a picture says more than a thousand words… here is a picture of what happens when we resize a x=2 table to x=4. Say we just got [KEY2, KEY3] and cursor 2 from Redis. If we would continue, we would get KEY1. But what happens if KEY5 is inserted first, and Redis rehashes: if we continue after the rehashing, we will get [KEY5] and cursor 6, and we'll get [KEY1] the time after.

If the table doubles, every row is 'split' and the existing cursors out in the client applications will point at the first of the 2 new rows.

Exploiting this knowledge

Given this knowledge, we can now do more stuff than we could before. For example:

if we know the approximate size of the table, we can guess X
given X, we can generate a cursor at approximately 50% and parallelize the iteration
given a cursor and knowing X we can estimate how far along we are

In a followup article we'll dive into how to do these things!

References:

During the process of writing this article I found this article, which explains how SCAN works in more detail: https://medium.freecodecamp.org/redis-hash-table-scan-explained-537cc8bb9f52
Source of SCAN: https://github.com/antirez/redis/blob/fd0ee469ab165d0e005e9fe1fca1c4f5c604cd56/src/dict.c#L778-L861

hermanbanken/README.md

Select an option