Skip to content

Instantly share code, notes, and snippets.

@MishaelRosenthal
Last active August 29, 2015 13:59
Show Gist options
  • Select an option

  • Save MishaelRosenthal/10981683 to your computer and use it in GitHub Desktop.

Select an option

Save MishaelRosenthal/10981683 to your computer and use it in GitHub Desktop.
package com.liveperson.lpbt.research.hadoop.scripts.EligibilityExperiment.auc.aucCalculators
import com.twitter.scalding._
/**
* Created with IntelliJ IDEA.
* User: mishaelr
* Date: 4/17/14
* Time: 1:21 PM
*
* Based on:
* http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
*/
trait AucCalculator[Score, Label] {
type Counter = Int
case class AucAccumulator(numRectangles: Counter, currentHeight: Counter, numPositives: Counter, numNegatives: Counter)
def isPositive(value: Label): Boolean
def calAucRocWithSorting(scoresAndLabelsPipe: TypedPipe[(Score, Label)])(implicit ord:Ordering[Score]): TypedPipe[Double] = {
val sorted = scoresAndLabelsPipe.groupAll.sortBy(_._1).reverse.mapValues(_._2)
calcAucRoc(sorted)
}
/**
* It is assumed that the pipe is sorted according to the scores.
* @return The AUC of ROC if defined, NaN otherwise. For example NaN will be returned if isPositive(label)
* is always true or always false.
*/
def calcAucRoc(sortedLabelsPipe: Grouped[Unit, Label]): TypedPipe[Double] = {
val accumulated = sortedLabelsPipe.foldLeft(AucAccumulator(0, 0, 0, 0)){
case (acc, label) =>
if(isPositive(label))
acc.copy(currentHeight = acc.currentHeight + 1, numPositives = acc.numPositives + 1)
else
acc.copy(numRectangles = acc.numRectangles + acc.currentHeight, numNegatives = acc.numNegatives + 1)
}
accumulated.values.map{
case AucAccumulator(numRectangles, _, numPositives, numNegatives) => numRectangles * 1.0 / numPositives * 1.0 / numNegatives
}
}
}
package com.liveperson.lpbt.research.hadoop.scripts.EligibilityExperiment.auc.aucCalculators
/**
* Created with IntelliJ IDEA.
* User: mishaelr
* Date: 4/17/14
* Time: 1:21 PM
*/
trait DoubleIntAucCalculator extends AucCalculator[Double, Int] {
def isPositive(value: Int) = value > 0
}
@MishaelRosenthal
Copy link
Author

Doesn't distribute well, since it uses an exact calculation, which requires sorting.
Sorting is done using groupAll which forces the pipe into one reducer.

A better distributed approximate algorithm can be derived from the probabilistic definition of AUC of the ROC:
"A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment