#Scala .hashCode vs. MurmurHash3 for Spark's MLlib
This is simple test of two hashing functions:
- Scala's native implementation (
obj.##
), used inHashingTF
- MurmurHash3, included in Scala, used by Vowpal Wabbit and many others
The test uses the aspell dictionary generated with the "insane" setting (download), which produces 676,547 entries, and explores the following grid:
- Feature vector sizes: 2^18..22
- Multiple: 1..4 (a 4 multiple repeats the word 4 times to emulate longer strings)
For English dictionary words, which are short, Scala's native implementation does OK relative to MurmurHash3 at all feature vector sizes. However, as input string length increases, the native hashCode
implementation performs badly, its collision rate becoming 200-500% worse than Murmur's.
Dictionary size 676547 with 1x multiple and 262144 features:
hashCode: 242165 uniques, 64.2057% collision rate
murmur: 242251 uniques, 64.1930% collision rate
Dictionary size 676547 with 2x multiple and 262144 features:
hashCode: 121887 uniques, 81.9840% collision rate
murmur: 242273 uniques, 64.1898% collision rate
Dictionary size 676547 with 3x multiple and 262144 features:
hashCode: 242281 uniques, 64.1886% collision rate
murmur: 242236 uniques, 64.1952% collision rate
Dictionary size 676547 with 4x multiple and 262144 features:
hashCode: 65225 uniques, 90.3591% collision rate
murmur: 242196 uniques, 64.2012% collision rate
Dictionary size 676547 with 1x multiple and 524288 features:
hashCode: 379436 uniques, 43.9158% collision rate
murmur: 380189 uniques, 43.8045% collision rate
Dictionary size 676547 with 2x multiple and 524288 features:
hashCode: 194525 uniques, 71.2474% collision rate
murmur: 380070 uniques, 43.8221% collision rate
Dictionary size 676547 with 3x multiple and 524288 features:
hashCode: 379644 uniques, 43.8851% collision rate
murmur: 380139 uniques, 43.8119% collision rate
Dictionary size 676547 with 4x multiple and 524288 features:
hashCode: 121794 uniques, 81.9977% collision rate
murmur: 379846 uniques, 43.8552% collision rate
Dictionary size 676547 with 1x multiple and 1048576 features:
hashCode: 497309 uniques, 26.4931% collision rate
murmur: 498609 uniques, 26.3009% collision rate
Dictionary size 676547 with 2x multiple and 1048576 features:
hashCode: 266298 uniques, 60.6387% collision rate
murmur: 498648 uniques, 26.2951% collision rate
Dictionary size 676547 with 3x multiple and 1048576 features:
hashCode: 497693 uniques, 26.4363% collision rate
murmur: 498645 uniques, 26.2956% collision rate
Dictionary size 676547 with 4x multiple and 1048576 features:
hashCode: 194769 uniques, 71.2113% collision rate
murmur: 498493 uniques, 26.3181% collision rate
Dictionary size 676547 with 1x multiple and 2097152 features:
hashCode: 577030 uniques, 14.7095% collision rate
murmur: 578521 uniques, 14.4892% collision rate
Dictionary size 676547 with 2x multiple and 2097152 features:
hashCode: 336288 uniques, 50.2935% collision rate
murmur: 578186 uniques, 14.5387% collision rate
Dictionary size 676547 with 3x multiple and 2097152 features:
hashCode: 577426 uniques, 14.6510% collision rate
murmur: 578171 uniques, 14.5409% collision rate
Dictionary size 676547 with 4x multiple and 2097152 features:
hashCode: 266570 uniques, 60.5985% collision rate
murmur: 578144 uniques, 14.5449% collision rate
Dictionary size 676547 with 1x multiple and 4194304 features:
hashCode: 623614 uniques, 7.8240% collision rate
murmur: 625075 uniques, 7.6080% collision rate
Dictionary size 676547 with 2x multiple and 4194304 features:
hashCode: 415648 uniques, 38.5633% collision rate
murmur: 624846 uniques, 7.6419% collision rate
Dictionary size 676547 with 3x multiple and 4194304 features:
hashCode: 623818 uniques, 7.7938% collision rate
murmur: 624398 uniques, 7.7081% collision rate
Dictionary size 676547 with 4x multiple and 4194304 features:
hashCode: 336577 uniques, 50.2508% collision rate
murmur: 624710 uniques, 7.6620% collision rate