Created
November 19, 2012 14:39
-
-
Save iperelivskiy/4110988 to your computer and use it in GitHub Desktop.
JS simple hash function
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
var hash = function(s) { | |
/* Simple hash function. */ | |
var a = 1, c = 0, h, o; | |
if (s) { | |
a = 0; | |
/*jshint plusplus:false bitwise:false*/ | |
for (h = s.length - 1; h >= 0; h--) { | |
o = s.charCodeAt(h); | |
a = (a<<6&268435455) + o + (o<<14); | |
c = a & 266338304; | |
a = c!==0?a^c>>21:a; | |
} | |
} | |
return String(a); | |
}; |
@bryc - super explanation, thank you
Going with the 32-bit hash for now but will change to the 53-bit version if we experience frequent collisions.
A comprehensive list of general purpose hash functions and their implementations can found here:
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@u01jmg3
Yes, actually. If you're just comparing two strings at once, and not storing the hash results or crosschecking all possibilities, then a simple 32-bit hash will do the job without any problems. But keep in mind, hashes only tell you if two messages don't match. If two messages do match, it's not absolute proof that the messages are identical.
For example, with FunHash/TSH, if you iterate 100s of millions of random JSON string comparisons, you will eventually find some collisions, where two different JSON strings produce the same hash and erroneously pass the comparison test. However, this is extremely uncommon and you would need to do up to 2 billion iterations to find collisions. Through my own testing, I've found it typically requires 650 million iterations before a single collision is found. But it can take as few as 35 million or as many as 1.44 billion iterations to find a collision. In fact, I once ran my 1.5 billion test 4 times in a row, and got 0 collisions each, but it does happen.
So unless you need to do more than 30 million JSON comparisons in one go, you should not have to worry about collisions.
Here's some simple test code that tests for this. It's set to 2 million, because JS isn't really fast enough to do 1.5 billion.
However, 32-bit hashes can actually be a poor choice in some situations. Here's a test where 250,000 iterations is practically assured to give at least one collision against the full set, when storing them all in a hash map:
With the above test, collisions can be found after as low as 20,000 iterations.
This is why sometimes it's better to just use a larger hash size than 32-bit. See my answer on StackOverflow: https://stackoverflow.com/a/52171480/815680.
cyrb53
is a 53-bit hash that's still comparably fast to the 32-bit one, but the fact that it has 53 bits reduces collisions by an exponential scale.TSH
(TinySimpleHash) at the bottom of that post is basically the same algorithm asFunHash
just with a different name.If you want something highly robust, I have an efficient port of 128-bit MurmurHash3, but it needs 4 comparisons instead of one: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js.
Further reading:
https://en.wikipedia.org/wiki/Pigeonhole_principle
https://en.wikipedia.org/wiki/Birthday_problem