Hash Integrals: Slice Hashing below O(log n) ============================================ **Update**: the technique employed here is called a [polynomial rolling hash](https://en.wikipedia.org/wiki/Rolling_hash#Polynomial_rolling_hash) as I have recently learned. Some notes on how to enable fast computation of hashes of slices of vectors (below complexity `O(log n)`), as well as hashes of joined vectors. We start from first principles. To simplify matters, a vector has the same element type as our hash, which stands in for the actual data element we wish to hash. Each vector `v` is paired with a secondary summed-hash vector `h` of which each element `i` stores `hash(v[0..i])` to form a tuple of vectors `(v,h)`. Appending a new element `p` to `v` is then performed in `O(1)` as follows, with a simple auxiliary linear operation: ``` append((v, h), p) == append(v, p), append(h, (last(h) * C + hash(p)) last(h) == h[countof(h)-1] h[-1] == hash_type(0) ``` Where `C` is a fixed prime constant close to the maximum value of `hash_type`. Note that the hashing of `p` with the user-chosen function `hash()` is optional and needs only to be performed if `p` is of larger size than `sizeof(hash_type)`. For 64-bit hashes, we can use `C := 0x66d6cf4cc5ddd26d:u64` (its modulo-inverse is `C^-1 := 0x24ffe0c7fcc70765:u64`). We compute the hash of any left-aligned half-open subrange `v[0..x)` by accessing the cached accumulated value in `h`: ``` hash(v[0..x)) == hash(h[x-1]) ``` The hash of an arbitrary subrange `[a..b)` of `v` can be computed in `O(log (b - a))` by a pseudolinear operation: ``` hash(v[a..b)) == hash(h[b-1] - h[a-1] * (C ** (b - a))) ``` where `**` denotes integer exponentiation. Finally, two vectors `(u, g)` and `(v, h)` can be joined in `O(n + log countof(h))` for `n` elements in `v` as follows: ``` join((u, g), (v, h)) == join(u, v), join(g, last(g) * (C ** countof(h)) + h) ``` The hash of the combined range of two joined vectors can therefore be computed in `O(log countof(h))` just from the very last value of `h'`: ``` hash(join((u, g), (v, h))) == hash(last(g) * (C ** countof(h)) + last(h)) ``` Substring Search ---------------- An arbitrary vector `q` can be found in `v` in `O(n log m)` amortized for `m == countof(q), n == countof(v) - m` by testing all `n` slice hashes of size `m` in `v` until a match with the hash of `q` has been found. This is faster than the naive approach with complexity `O(n * m)` for large sizes of `q`. This is equivalent to the [Rabin-Karp algorithm](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm). An improved technique allows us to find a substring in `O(m log m)`, by precomputing bucketed maps of total size `2 * n` which map every possible log-sized, log-aligned subtile hash to its location in advance. We can then break down the substring into all phase-shifted variants of its log-sized and log-aligned subtiles, and search for their adjacent occurrence in the map. Parallelization --------------- The computation of `h` from `v` is massively parallelizable on the GPU using [prefix sums](https://en.wikipedia.org/wiki/Prefix_sum) or [jump flooding](https://en.wikipedia.org/wiki/Jump_flooding_algorithm). Higher Dimensions ----------------- As this solution is analog to integral arithmetic, it readily extends to higher dimensions. In 2D, we represent the accumulated hash as a [summed-area table](https://en.wikipedia.org/wiki/Summed-area_table), where each top-left-aligned half-open range hash can be computed from `(v, h)` as follows: ``` h[x, y] == C * (h[x - 1, y] + h[x, y - 1] - C * h[x - 1, y - 1]) + hash(v[x,y]) h[-1, y] == h[x, -1] == hash_type(0) ``` All other operations can be derived accordingly, and analog to how integrals of SAT subranges are classically performed. Efficient substring (subtile) search also extends to higher dimensions with a precomputed map.