Skip to content

Instantly share code, notes, and snippets.

@tbertelsen
Forked from kaja47/pearson.scala
Last active February 28, 2021 22:14
Show Gist options
  • Save tbertelsen/4353d4a8a4386afb0abb to your computer and use it in GitHub Desktop.
Save tbertelsen/4353d4a8a4386afb0abb to your computer and use it in GitHub Desktop.
Calculating pearson for Breeze vectors
import breeze.linalg._
import breeze.stats._
import scala.math.sqrt
/**
* Effecient for sparse vectors. Scales in O(activeSize)
*/
// Must take SparseVector, for implicits to be linked correctly
def pearson(a: SparseVector[Double], b: SparseVector[Double]): Double = {
if (a.length != b.length)
throw new IllegalArgumentException("Vectors not of the same length.")
val n = a.length
val dot = a.dot(b)
val adot = a.dot(a)
val bdot = b.dot(b)
val amean = mean(a)
val bmean = mean(b)
// See Wikipedia http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
(dot - n * amean * bmean ) / ( sqrt(adot - n * amean * amean) * sqrt(bdot - n * bmean * bmean) )
}
/**
* Works for all all vectors. Scales in O(length)
*/
def pearson(a: Vector[Double], b: Vector[Double]): Double = {
// Delegate to efficient method if possible,
if (a.isInstanceOf[SparseVector] && b.isInstanceOf[SparseVector]) {
return pearson(a.asInstanceOf[SparseVector[Double]], b.asInstanceOf[SparseVector[Double]])
}
if (a.length != b.length)
throw new IllegalArgumentException("Vectors not of the same length.")
val n = a.length
val dot = a.dot(b)
val adot = a.dot(a)
val bdot = b.dot(b)
val amean = mean(a)
val bmean = mean(b)
// See Wikipedia http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
(dot - n * amean * bmean ) / ( sqrt(adot - n * amean * amean) * sqrt(bdot - n * bmean * bmean) )
}
@tbertelsen
Copy link
Author

The efficient method can be several thousand times faster. For two vectors with approximately 1000 non-zero elements, this is the difference:

Length Effecient method Inefficient method
1 048 576 <0.005 s 0.22 s
4 194 304 <0.005 s 1.05 s
16 777 216 <0.005 s 2.86 s
67 108 864 ~0.01 s Throws OutOfMemory

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment