Simeon Simeonov ssimeonov

# Scala .hashCode vs. MurmurHash3 for Spark's MLlib

This is a simple test of two hashing functions: Scala's `.hashCode` and MurmurHash3.

The test uses the aspell dictionary generated with the "insane" setting (download), which produces 676,547 entries, and explores the following grid:
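The grid itself is not reproduced here. As a rough illustration of the kind of comparison described, the sketch below counts bucket collisions for both hash functions over a few bucket counts. The object name `HashCollisions`, the helper names, the synthetic word list, and the bucket sizes are all assumptions made for this example; they are not the original test's code or parameters.

```scala
import scala.util.hashing.MurmurHash3

object HashCollisions {
  /** Maps a raw 32-bit hash to a bucket index in [0, numBuckets). */
  def bucket(hash: Int, numBuckets: Int): Int =
    java.lang.Math.floorMod(hash, numBuckets)

  /** Number of words that land in a bucket already used by an earlier word. */
  def collisions(words: Seq[String], numBuckets: Int)(hash: String => Int): Int =
    words.size - words.map(w => bucket(hash(w), numBuckets)).distinct.size

  def main(args: Array[String]): Unit = {
    // Stand-in word list (assumption); the test described above uses the aspell dictionary.
    val words = (1 to 10000).map(i => s"word$i")
    // Hypothetical grid of bucket counts: 2^10, 2^14, 2^18.
    for (bits <- Seq(10, 14, 18)) {
      val n = 1 << bits
      val viaHashCode = collisions(words, n)(_.hashCode)
      val viaMurmur = collisions(words, n)(w => MurmurHash3.stringHash(w))
      println(s"buckets=2^$bits hashCode=$viaHashCode murmur3=$viaMurmur")
    }
  }
}
```

`floorMod` is used rather than `%` so that negative hash values still map to a non-negative bucket index.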

	package wordle

	/** Wordle solver, game runner & simulator
	  *
	  * Optimizes based on a combination of an allowed word list (from the Wordle source code or any
	  * other source), word frequency data and the move in the game.
	  *
	  * @note
	  * [[Wordle.Game]] is mutable to allow for play in an environment without easy STDIN input. Use
	  * [[Wordle.Game.nextMove()]]. All words are in lowercase. Patterns are entered as strings of

	➜ jvm-packages git:(master) ✗ mvn -Dspark.version=2.1.0 package
	Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
	[INFO] Scanning for projects...
	[WARNING]
	[WARNING] Some problems were encountered while building the effective model for ml.dmlc:xgboost4j:jar:0.7
	[WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:exec-maven-plugin is missing. @ line 40, column 29
	[WARNING]
	[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
	[WARNING]
	[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.

	case class FInfo(
	  path: String,
	  parent: String,
	  isDir: Boolean,
	  size: Long,
	  modificationTime: Long,
	  partitions: Map[String, String]) {

	  // @todo encoding issues
	  def hasExt(ext: String): Boolean = path.endsWith(ext)

	object DataFrameFunctions {

	final val TEMP_TABLE_PLACEHOLDER = "~tbl~"

	/** Executes a SQL statement on the dataframe.
	  * Behind the scenes, it registers and cleans up a temporary table.
	  *
	  * @param df input dataframe
	  * @param stmtTemplate SQL statement template that uses the value of
	  *                     `TEMP_TABLE_PLACEHOLDER` for the table name.

	➜ spark git:(master) ✗ build/sbt sql/test
	Using /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home as default JAVA_HOME.
	Note, this will be overridden by -java-home if it is set.
	Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
	[info] Loading global plugins from /Users/sim/.sbt/0.13/plugins
	[info] Loading project definition from /Users/sim/dev/spx/spark/project/project
	[info] Loading project definition from /Users/sim/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
	[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
	[info] Loading project definition from /Users/sim/dev/spx/spark/project
	[info] Set current project to spark-parent (in build file:/Users/sim/dev/spx/spark/)

	[info] spark-streaming: found 30 potential binary incompatibilities (filtered 8)
	[error] * method delaySeconds()Int in class org.apache.spark.streaming.Checkpoint does not have a correspondent in new version
	[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.Checkpoint.delaySeconds")
	[error] * class org.apache.spark.streaming.receiver.ActorSupervisorStrategy does not have a correspondent in new version
	[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.ActorSupervisorStrategy")
	[error] * object org.apache.spark.streaming.receiver.IteratorData does not have a correspondent in new version
	[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.IteratorData$")
	[error] * class org.apache.spark.streaming.receiver.ByteBufferData does not have a correspondent in new version
	[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streami

	object ContrivedAdd {

	import shapeless._
	import record._
	import syntax.singleton._
	import shapeless.ops.record.Updater
	import scalaz._
	import Scalaz._

	case class S[L <: HList](total: Int, scratch: L)

	val ctx = sqlContext
	import ctx.implicits._

	// With nested structs, sometimes JSON is a much more readable form than display()
	def showall(df: DataFrame, num: Int): Unit = df.limit(num).toJSON.collect.foreach(println)
	def showall(sql: String, num: Int = 100): Unit = showall(ctx.sql(sql), num)

	def hivePath(name: String) = s"/user/hive/warehouse/$name"

	// Bug workaround