Created
September 12, 2015 04:42
Revisions
-
ssimeonov created this gist
Sep 12, 2015 .There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,95 @@ #Scala .hashCode vs. MurmurHash3 for Spark's MLlib This is simple test of two hashing functions: - Scala's native implementation (`obj.##`), used in [`HashingTF`](https://github.com/swoop-inc/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala#L45) - MurmurHash3, included in Scala, used by Vowpal Wabbit and many others The test uses the aspell dictionary generated with the "insane" setting ([download](https://s3.amazonaws.com/uploads.hipchat.com/14531/52695/70GtMUiHpypac80/aspell.txt)), which produces 676,547 entries, and explores the following grid: - Feature vector sizes: 2^<sup>18..22</sup> - Multiple: 1..4 (a 4 multiple repeats the word 4 times to emulate longer strings) For English dictionary words, which are short, Scala's native implementation does OK relative to MurmurHash3 at all feature vector sizes. However, as input string length increases, the native `hashCode` implementation performs badly, its collision rate becoming 200-500% worse than Murmur's. ``` Dictionary size 676547 with 1x multiple and 262144 features: hashCode: 242165 uniques, 64.2057% collision rate murmur: 242251 uniques, 64.1930% collision rate Dictionary size 676547 with 2x multiple and 262144 features: hashCode: 121887 uniques, 81.9840% collision rate murmur: 242273 uniques, 64.1898% collision rate Dictionary size 676547 with 3x multiple and 262144 features: hashCode: 242281 uniques, 64.1886% collision rate murmur: 242236 uniques, 64.1952% collision rate Dictionary size 676547 with 4x multiple and 262144 features: hashCode: 65225 uniques, 90.3591% collision rate murmur: 242196 uniques, 64.2012% collision rate Dictionary size 676547 with 1x multiple and 524288 features: hashCode: 379436 uniques, 43.9158% collision rate murmur: 380189 uniques, 43.8045% collision rate Dictionary size 676547 with 2x multiple and 524288 features: hashCode: 194525 uniques, 71.2474% collision rate murmur: 380070 uniques, 43.8221% collision rate Dictionary size 676547 with 3x multiple and 524288 features: hashCode: 379644 uniques, 43.8851% collision rate murmur: 380139 uniques, 43.8119% collision rate Dictionary size 676547 with 4x multiple and 524288 features: hashCode: 121794 uniques, 81.9977% collision rate murmur: 379846 uniques, 43.8552% collision rate Dictionary size 676547 with 1x multiple and 1048576 features: hashCode: 497309 uniques, 26.4931% collision rate murmur: 498609 uniques, 26.3009% collision rate Dictionary size 676547 with 2x multiple and 1048576 features: hashCode: 266298 uniques, 60.6387% collision rate murmur: 498648 uniques, 26.2951% collision rate Dictionary size 676547 with 3x multiple and 1048576 features: hashCode: 497693 uniques, 26.4363% collision rate murmur: 498645 uniques, 26.2956% collision rate Dictionary size 676547 with 4x multiple and 1048576 features: hashCode: 194769 uniques, 71.2113% collision rate murmur: 498493 uniques, 26.3181% collision rate Dictionary size 676547 with 1x multiple and 2097152 features: hashCode: 577030 uniques, 14.7095% collision rate murmur: 578521 uniques, 14.4892% collision rate Dictionary size 676547 with 2x multiple and 2097152 features: hashCode: 336288 uniques, 50.2935% collision rate murmur: 578186 uniques, 14.5387% collision rate Dictionary size 676547 with 3x multiple and 2097152 features: hashCode: 577426 uniques, 14.6510% collision rate murmur: 578171 uniques, 14.5409% collision rate Dictionary size 676547 with 4x multiple and 2097152 features: hashCode: 266570 uniques, 60.5985% collision rate murmur: 578144 uniques, 14.5449% collision rate Dictionary size 676547 with 1x multiple and 4194304 features: hashCode: 623614 uniques, 7.8240% collision rate murmur: 625075 uniques, 7.6080% collision rate Dictionary size 676547 with 2x multiple and 4194304 features: hashCode: 415648 uniques, 38.5633% collision rate murmur: 624846 uniques, 7.6419% collision rate Dictionary size 676547 with 3x multiple and 4194304 features: hashCode: 623818 uniques, 7.7938% collision rate murmur: 624398 uniques, 7.7081% collision rate Dictionary size 676547 with 4x multiple and 4194304 features: hashCode: 336577 uniques, 50.2508% collision rate murmur: 624710 uniques, 7.6620% collision rate ``` This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,67 @@ #!/bin/sh exec scala -J-Xmx8g "$0" "$@" !# import scala.io.Source._ import scala.util.hashing.MurmurHash3 def pow2(n: Int) = Math.pow(2, n).toInt val murmur = MurmurHash3.stringHashing case class HashFunction(name: String, hashFunction: (String) => Int, numFeatures: Int) { val uniques = scala.collection.mutable.SortedSet[Int]() def hash(str: String) = uniques += nonNegativeMod(hashFunction(str)) def collisionRate(numFeatures: Int) = (numFeatures - uniques.size).toDouble / numFeatures private def nonNegativeMod(x: Int): Int = { val rawMod = x % numFeatures rawMod + (if (rawMod < 0) numFeatures else 0) } } case class Test(numFeatures: Int, multiple: Int, hashFunctions: Array[HashFunction]) { def hash(str: String) = { val scaledString = str * multiple hashFunctions.foreach(_.hash(scaledString)) } def report(wordCount: Int) = { println(s"\nDictionary size $wordCount with ${multiple}x multiple and $numFeatures features:") hashFunctions.foreach { hashFunction => println(f" ${hashFunction.name}: ${hashFunction.uniques.size} uniques, ${100 * hashFunction.collisionRate(wordCount)}%1.4f%% collision rate") } } } object Script { val tests = for ( numFeatures <- Seq(18, 19, 20, 21, 22).map(pow2); multiple <- Seq(1, 2, 3, 4) ) yield new Test( numFeatures, multiple, Array( new HashFunction("hashCode", (word: String) => word.##, numFeatures), new HashFunction("murmur", (word: String) => murmur.hash(word), numFeatures) )) def main(args: Array[String]) { val path = args(0) var wordCount = 0 fromFile(path).getLines().foreach { word => wordCount += 1 tests.foreach(_.hash(word)) } tests.foreach(_.report(wordCount)) } } Script.main(args)