var hash = function(s) {
  /* Simple hash function. */
  var a = 1, c = 0, h, o;
  if (s) {
    a = 0;
    /*jshint plusplus:false bitwise:false*/
    for (h = s.length - 1; h >= 0; h--) {
      o = s.charCodeAt(h);
      a = (a << 6 & 268435455) + o + (o << 14);
      c = a & 266338304;
      a = c !== 0 ? a ^ c >> 21 : a;
    }
  }
  return String(a);
};
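Note that the function returns the hash as a decimal string, while the comments below quote the same values in hex. A quick check (my own illustration, using the function as defined above):

```javascript
// The gist's hash() returns a decimal string; converting to hex shows the
// same values quoted later in the thread (without leading zeros).
var hash = function(s) {
  var a = 1, c = 0, h, o;
  if (s) {
    a = 0;
    for (h = s.length - 1; h >= 0; h--) {
      o = s.charCodeAt(h);
      a = (a << 6 & 268435455) + o + (o << 14);
      c = a & 266338304;
      a = c !== 0 ? a ^ c >> 21 : a;
    }
  }
  return String(a);
};

console.log(hash("a"));                              // "1589345"
console.log(parseInt(hash("a"), 10).toString(16));   // "184061"
console.log(parseInt(hash("b"), 10).toString(16));   // "188062"
```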
@bryc, can you please share your test functions?
@Undefitied Here you go. It's set to 4096 iterations per test, just for faster loading. And it also tests the bit distribution under "Bit averages range", where 50% is the expected value of true randomness. Just keep in mind these are just basic tests with small sample sizes. SMHasher is better but you might need to be a rocket scientist to use it.
@bryc: would funhash() be efficient to use against a JSON string to see quickly whether two JSON strings match each other (rather than comparing full JSON strings)?
Yes, actually. If you're just comparing two strings at once, and not storing the hash results or cross-checking all possibilities, then a simple 32-bit hash will do the job without any problems. But keep in mind, hashes can only prove that two messages don't match: if the hashes differ, the messages differ. If the hashes do match, that's not absolute proof that the messages are identical.
For example, with FunHash/TSH, if you iterate hundreds of millions of random JSON string comparisons, you will eventually find some collisions, where two different JSON strings produce the same hash and erroneously pass the comparison test. However, this is extremely uncommon, and you may need to do up to 2 billion iterations to find one. Through my own testing, I've found it typically takes around 650 million iterations before a single collision is found, though it can take as few as 35 million or as many as 1.44 billion. In fact, I once ran my 1.5 billion test 4 times in a row and got 0 collisions each time, but it does happen.
So unless you need to do more than 30 million JSON comparisons in one go, you should not have to worry about collisions.
Here's a simple test for this. It's set to 2 million iterations, because JS isn't really fast enough to do 1.5 billion.
// TSH (TinySimpleHash) is the same as FunHash with slightly different numbers and a new name.
TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}
for(var i = 0; i < 2000000; i++) {
var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
var hash2 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
if (hash1 === hash2) console.warn(`Dupe at ${i}, Number ${hash1}`);
}
console.info("Done");
However, 32-bit hashes can actually be a poor choice in some situations. Here's a test where 250,000 iterations is practically assured to give at least one collision against the full set, when storing them all in a hash map:
TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}
var hashMap = {};
for(var i = 0; i < 250000; i++) {
var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
if (hashMap[hash1]) console.warn(`Dupe at ${i}, Index ${hash1}`);
hashMap[hash1] = (hashMap[hash1] || 0) + 1;
}
console.info("Done");
With the above test, collisions can be found after as few as 20,000 iterations.
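These numbers follow from the birthday problem. A quick sketch of the standard approximation (my own illustration, not part of the gist):

```javascript
// Birthday-problem estimate: with n uniformly random b-bit hashes, the
// probability of at least one collision is roughly 1 - exp(-n*(n-1) / 2^(b+1)).
const pCollision = (n, bits = 32) =>
  1 - Math.exp(-n * (n - 1) / 2 ** (bits + 1));

console.log(pCollision(20000));       // a few percent, so early dupes do happen
console.log(pCollision(250000));      // > 99.9%, matching the hash-map test above
console.log(pCollision(250000, 53));  // vanishingly small for a 53-bit hash
```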
This is why it's sometimes better to just use a larger hash size than 32 bits. See my answer on StackOverflow: https://stackoverflow.com/a/52171480/815680. cyrb53 is a 53-bit hash that's still comparably fast to the 32-bit one, but the extra bits reduce the collision probability exponentially. TSH (TinySimpleHash) at the bottom of that post is basically the same algorithm as FunHash, just with a different name.
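For reference, here is cyrb53 as I recall it from that StackOverflow answer (reproduced from memory; verify the exact constants against the linked post before relying on it):

```javascript
// cyrb53: a fast 53-bit string hash (two 32-bit accumulators, combined into
// one number that fits losslessly in a JS double).
const cyrb53 = (str, seed = 0) => {
  let h1 = 0xdeadbeef ^ seed, h2 = 0x41c6ce57 ^ seed;
  for (let i = 0, ch; i < str.length; i++) {
    ch = str.charCodeAt(i);
    h1 = Math.imul(h1 ^ ch, 2654435761);
    h2 = Math.imul(h2 ^ ch, 1597334677);
  }
  h1 = Math.imul(h1 ^ (h1 >>> 16), 2246822507);
  h1 ^= Math.imul(h2 ^ (h2 >>> 13), 3266489909);
  h2 = Math.imul(h2 ^ (h2 >>> 16), 2246822507);
  h2 ^= Math.imul(h1 ^ (h1 >>> 13), 3266489909);
  // Top 21 bits of h2 plus all 32 bits of h1 = 53 bits, the integer
  // precision limit of a JS Number.
  return 4294967296 * (2097151 & h2) + (h1 >>> 0);
};

console.log(cyrb53("revenge") !== cyrb53("revenue")); // true
```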
If you want something highly robust, I have an efficient port of 128-bit MurmurHash3, but it needs 4 comparisons instead of one: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js.
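Since a 128-bit result doesn't fit in a single JS number, the port returns multiple 32-bit words, so an equality check has to compare each word. A hypothetical sketch (the actual return shape is defined in the linked file):

```javascript
// Hypothetical helper: treat two hashes, each an array of 32-bit words,
// as equal only when every word matches.
const sameHash = (a, b) =>
  a.length === b.length && a.every((w, i) => w === b[i]);

console.log(sameHash([1, 2, 3, 4], [1, 2, 3, 4])); // true
console.log(sameHash([1, 2, 3, 4], [1, 2, 3, 5])); // false
```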
Further reading:
https://en.wikipedia.org/wiki/Pigeonhole_principle
https://en.wikipedia.org/wiki/Birthday_problem
@bryc - super explanation, thank you
Going with the 32-bit hash for now but will change to the 53-bit version if we experience frequent collisions.
A comprehensive list of general purpose hash functions and their implementations can be found here:
Seems to be from ga.js (Google Analytics), where it was used to hash domain name strings. Honestly, it's not that good: distribution and collision resistance are very poor. It's only useful for its original purpose of hashing domain name strings.
For example, the hash for "a" is 00184061 and "b" is 00188062; "revenge" is 097599b9 and "revenue" is 0575be19. They have many common bits, which implies bad mixing. And collisions occur with higher-than-chance probability: "1zbN" and "1t6g" both give 00e8b236, and "RivW" and "R1G7" both give 10063cd2.

Did a quick sanity test:
Fails most tests, but 484 in 102,401 is not bad. I can see why they used it for domain strings. But an even simpler function is faster AND better: