srnghn’s gists

srnghn / ANOVA_Spark_2.0.scala

Last active December 6, 2019 07:25

ANOVA Test for Spark 2.0 (using RelationalGroupedDataset instead of Iterable[RDD[Double]]). The categorical and scale columns to be evaluated are to be selected from a DataFrame, converted to class type Dataset[CatTuple] (defined in this code) and passed to the ANOVA function. The returned object is of class ANOVAStats (also defined here) and co…

	/**
	* Create a class, CatTuple, to pass to the ANOVA function so that columns can be referred to by specific names.
	* Create a class, ANOVAStats, that will be returned from the ANOVA function so that its outputs can be selected and referred to by name.
	**/
	final case class CatTuple(cat: String, value: Double)
	final case class ANOVAStats(dfb: Long, dfw: Double, F_value: Double, etaSq: Double, omegaSq: Double)

	// Column names to use when converting to CatTuple
	val colnames = Seq("cat", "value")
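
The preview above stops before the ANOVA function itself, so the formulas it uses are not visible here. As a reference point, the fields of ANOVAStats are assumed to correspond to the standard one-way ANOVA definitions below, where k is the number of categories, N the total number of rows, n_j the size of category j, and MS_w the within-group mean square. The symbols are illustrative and do not come from the gist.

```latex
\begin{align*}
SS_b &= \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2, &
SS_w &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 \\
df_b &= k - 1, & df_w &= N - k \\
MS_w &= \frac{SS_w}{df_w}, & F &= \frac{SS_b / df_b}{MS_w} \\
\eta^2 &= \frac{SS_b}{SS_b + SS_w}, &
\omega^2 &= \frac{SS_b - df_b \, MS_w}{SS_b + SS_w + MS_w}
\end{align*}
```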

srnghn / Pearsons_R_Correlation_Spark_2.0.scala

Created October 5, 2016 00:22

Pearson's R Correlation for Spark 2.0. Created after getting inconsistent results with Statistics.corr. The two scale columns to be evaluated are to be selected from a DataFrame, converted to class type Dataset[ScaleTuple] (defined in this code) and passed to the correlation function.

	// Create a class, ScaleTuple, to pass to the Pearson's R function so that columns can be referred to by specific names.
	final case class ScaleTuple(var1: Double, var2: Double)

	// Column names to use when converting to ScaleTuple
	val colnames = Seq("var1", "var2")

	/**
	* Implementation of Pearson's R function: calculates r, the measurement of linear dependence between two variables
	* Utilizes Dataset's 'agg' function
	**/
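
The body of the correlation function is cut off in this preview. A minimal plain-Python sketch of the same idea, assuming the implementation reduces the data to a handful of aggregate sums (as an 'agg' call would) and then applies the usual computational formula for r, might look like this. The name pearson_r and the sample pairs are hypothetical, and the sketch does not use Spark.

```python
import math


def pearson_r(pairs):
    """Pearson's r from single-pass aggregate sums over (var1, var2) pairs.

    Hypothetical helper: mirrors what an aggregation-based implementation
    could compute, but is not the gist's actual code.
    """
    n = 0
    sum_x = sum_y = sum_xy = sum_x2 = sum_y2 = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x
        sum_y2 += y * y

    # r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has zero variance")
    return numerator / denominator


# Made-up sample data: one perfectly increasing and one perfectly decreasing relation.
perfect = pearson_r([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
inverse = pearson_r([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)])
```

Only six running totals are needed, which is why this formulation maps naturally onto a single distributed aggregation. For data with a large mean relative to its spread, the subtraction in the numerator can lose precision, so a production version may prefer a centered two-pass computation.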

srnghn / ANOVA_Spark_2.0.py

Created October 20, 2016 23:16

ANOVA Test for Spark 2.0 using PySpark. The function returns 5 values: degrees of freedom between (numerator), degrees of freedom within (denominator), F-value, eta squared and omega squared.

	from pyspark.sql.functions import *

	# Implementation of ANOVA function: calculates the degrees of freedom, F-value, eta squared and omega squared values.
	# Expects 'categoryData' to have two columns: the first is the categorical independent variable and the second is the scale dependent variable.

	def getAnovaStats(categoryData):
	    # Relies on an active SparkSession bound to the name 'spark'
	    cat_val = categoryData.toDF("cat","value")
	    cat_val.createOrReplaceTempView("df")
	    newdf = spark.sql("select A.cat, A.value, cast((A.value * A.value) as double) as valueSq, ((A.value - B.avg) * (A.value - B.avg)) as diffSq from df A join (select cat, avg(value) as avg from df group by cat) B where A.cat = B.cat")
	    grouped = newdf.groupBy("cat")
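
The preview ends before the five return values are computed. In the query above, diffSq is each row's squared deviation from its category mean, so its sum over all rows is the within-group sum of squares. A minimal plain-Python sketch of how the five statistics could follow from there, assuming the standard one-way ANOVA definitions, is shown below. The name anova_stats and the sample rows are hypothetical, and the sketch does not use Spark.

```python
from collections import defaultdict


def anova_stats(rows):
    """One-way ANOVA over (cat, value) rows.

    Returns (df_between, df_within, F, eta_sq, omega_sq).
    Hypothetical helper using textbook formulas, not the gist's code.
    """
    groups = defaultdict(list)
    for cat, value in rows:
        groups[cat].append(value)

    n_total = sum(len(values) for values in groups.values())
    grand_mean = sum(sum(values) for values in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        # Between: group size times squared distance of group mean from grand mean.
        ss_between += len(values) * (mean - grand_mean) ** 2
        # Within: squared deviations from the group mean (the SQL's diffSq).
        ss_within += sum((x - mean) ** 2 for x in values)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    ss_total = ss_between + ss_within

    f_value = ms_between / ms_within
    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return df_between, df_within, f_value, eta_sq, omega_sq


# Made-up sample data: three categories with three observations each.
rows = [("a", 1.0), ("a", 2.0), ("a", 3.0),
        ("b", 2.0), ("b", 3.0), ("b", 4.0),
        ("c", 5.0), ("c", 6.0), ("c", 7.0)]
dfb, dfw, f_value, eta_sq, omega_sq = anova_stats(rows)
```

The sketch assumes at least two categories and more rows than categories; otherwise one of the degrees of freedom is zero and the divisions fail.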