Created
May 27, 2017 17:46
-
-
Save malcolmgreaves/963856d7e6c8f76335d997c4a7ecbd6c to your computer and use it in GitHub Desktop.
Primer on hash functions. From: https://github.com/google/farmhash/blob/master/Understanding_Hash_Functions
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| UNDERSTANDING HASH FUNCTIONS | |
| by Geoff Pike | |
| Version 0.2 --- early draft --- comments and questions welcome! | |
| References appear in square brackets. | |
| 1 INTRODUCTION | |
| Hashing has proven tremendously useful in constructing various fast | |
| data structures and algorithms. It is typically possible to simplify | |
| the analysis of hash-based algorithms if one assumes that the relevant | |
| hash functions are high quality. At the other extreme, if the | |
| relevant hash functions were always to return the same value, many | |
| hash-based algorithms become algorithms that are slower, simpler, but still well-known. | |
| For example, a chaining hash table devolves into a linked list. | |
| There are many possible definitions of hash function quality. For | |
| example, one might want a list of keys and their hashes to provide no | |
| pattern that would allow an opponent to predict anything about the | |
| hashes of other keys. Although I cannot prove it, I think I can meet | |
| this and many other definitions of quality with | |
| f(s) = SHA-3(concatenation of z and s), | |
| where z is some secret string known only to me. This well-known trick | |
| provides, I think, more high-quality hash functions than anyone will | |
| need, though greater computational power in the future may push us to | |
| replace SHA-3 from time to time. | |
| In short, discussions about choosing a hash function are almost always | |
| discussions about speed, energy consumption, or similar. Concerns | |
| about hash quality are easy to fix, for a price. | |
| 2 ANATOMY OF A HASH FUNCTION | |
| Hash functions that input strings of arbitrary length are written in | |
| terms of an internal state, S. In many cases the internal state is a | |
| fixed number of bits and will fit in machine registers. One generic | |
| sketch of a string hash is: | |
| let S = some initial value | |
| let c = the length of S in bits | |
| while (input is not exhausted) { | |
| let t = the next c bits of input (padded with zeroes if less than c remain) | |
| S = M(S xor t) | |
| } | |
| let n = the number of bytes hashed | |
| return F(S, n) | |
| where M is a hash function that inputs and outputs c bits, and F is a | |
| hash function that inputs c bits (plus, say, 64 for its second argument) | |
| and outputs however many bits one needs to return. In some sense we have | |
| reduced the string-hashing problem to two integer hashing problems. | |
| 2.1 INTEGER HASHING TECHNIQUES | |
| A hash function that inputs and outputs the same number of bits, say, | |
| 32, can use reversible bit-twiddling operations, each of which is | |
| "onto" in the mathematical sense. For example, multiplication by an | |
| odd constant is reversible, as all odd numbers are relatively prime to | |
| 2^32. Other commonly used reversible operations include: | |
| o Adding or xoring a constant | |
| o Bitwise rotation or other bitwise permutations | |
| o bit j = (bit j) xor (bit k) for unequal constants j and k | |
| o "Shift mix": S = S xor (S >> k), where k is, say, 17 | |
| o Replacing a fixed-length bit string with its cyclic redundancy | |
| checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike] | |
| Each of the above is a "bad" hash function that inputs and outputs | |
| the same number of bits. One can simply compose two or more of those | |
| bad hash functions to construct a higher-quality hash function. | |
| One common quality goal for integer hashing (and string hashing) is | |
| that flipping the 19th bit, or any other small change, applied to | |
| multiple input keys, causes a seemingly unpredictable difference each | |
| time. Similarly, any change to an input should lead to a seemingly | |
| unpredictable selection of the output bits to flip. | |
| Therefore, if we want a high-quality hash function that inputs c bits | |
| and outputs fewer than c bits, we can simply truncate the output of a | |
| high-quality hash function that inputs and outputs c bits. | |
| To give a concrete example, here is Bob Jenkins' mix(), published in | |
| 1996 [Jenkins]. Its input is 96 bits in three 32-bit variables, and its output | |
| is 96 bits. However, one may use a subset of the output bits, as every | |
| output bit is affected by every non-empty subset of the input bits. | |
| Input: a, b, and c | |
| Algorithm: | |
| a -= b; a -= c; a ^= (c>>13); | |
| b -= c; b -= a; b ^= (a<<8); | |
| c -= a; c -= b; c ^= (b>>13); | |
| a -= b; a -= c; a ^= (c>>12); | |
| b -= c; b -= a; b ^= (a<<16); | |
| c -= a; c -= b; c ^= (b>>5); | |
| a -= b; a -= c; a ^= (c>>3); | |
| b -= c; b -= a; b ^= (a<<10); | |
| c -= a; c -= b; c ^= (b>>15); | |
| Output: a, b, and c | |
| 2.2 VARIATIONS ON STRING HASHING | |
| There are three variations on our initial sketch worth noting. | |
| First, for speed, one can special-case short inputs, as the CityHash | |
| and FarmHash algorithms do. The number of special cases can be | |
| reduced by using loads that may overlap: for example, a hash of a 9- | |
| to 16-byte string can be implemented by a hash that inputs two 8-byte | |
| values (the first 8 and last 8 bytes of the input string) and the string | |
| length [CityHash, FarmHash]. | |
| Second, one may choose different means of incorporating input bits | |
| into the internal state. One example: the mixing of S and input bits | |
| may be interleaved with the mixing of parts of S and other parts of S. | |
| Another example: the input bits processed in a loop iteration might be | |
| xor'ed into multiple places in S, rather than just one, or might be | |
| hashed with each other before touching S [Murmur]. The advantages and | |
| disadvantages of these are unclear. | |
| Third, one may repeatedly "squeeze information" from S, by remixing it with | |
| itself and then revealing a subset of S. This is convenient when one would | |
| like a family of hash functions with different output lengths. A special | |
| case of the idea, called the "sponge construction," has been well studied and | |
| adopted by the authors of Keccak and others [SHA-3]. | |
| 3 HASH FUNCTIONS FOR HASH TABLES | |
| It isn't hard to find real-life examples where hash tables or the hash | |
| functions for them take more than 5% of a program's CPU time. | |
| Improvements to hash tables and their hash functions are therefore a | |
| classic example of software performance tuning. Unfortunately, the | |
| best choice may be platform-dependent, so to avoid writing your own | |
| collection of #ifdefs, please consider selecting something like the | |
| FarmHash family of hash functions, that supply decent | |
| platform-dependent logic for you. | |
| To tune a program, often one will replace an existing hash function with a | |
| faster, lower-quality hash function, despite the increased chance of unlucky | |
| or pathological performance problems. Clever algorithms can mitigate this | |
| risk. For example, hash tables can start with one hash function and then | |
| switch to another if things seem to be going poorly. Therefore, one should | |
| rarely plan to spend much CPU time on a secure hash function (such as SHA-3) | |
| or a near-universal hash function (such as VHASH) when seeking the best | |
| possible performance from a hash table. Against that, those types of hash | |
| functions can limit the risk of pathological performance problems when one is | |
| designing around typical hash-based algorithms that stick with a single hash | |
| function no matter how it behaves on the data at hand. | |
| 4 | |
| REFERENCES | |
| [Murmur] Appleby, Austin. https://code.google.com/p/smhasher, | |
| https://sites.google.com/site/murmurhash/ | |
| [SMHasher] Appleby, Austin. https://code.google.com/p/smhasher | |
| [SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/ | |
| [Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html | |
| [VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In | |
| Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006. | |
| [CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash | |
| [FarmHash] Pike, Geoff. https://code.google.com/p/farmhash | |
| [Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment