Last active
August 29, 2015 13:59
-
-
Save MishaelRosenthal/10981683 to your computer and use it in GitHub Desktop.
AUC of ROC Scalding Calculator: http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| package com.liveperson.lpbt.research.hadoop.scripts.EligibilityExperiment.auc.aucCalculators | |
| import com.twitter.scalding._ | |
| /** | |
| * Created with IntelliJ IDEA. | |
| * User: mishaelr | |
| * Date: 4/17/14 | |
| * Time: 1:21 PM | |
| * | |
| * Based on: | |
| * http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve | |
| */ | |
| trait AucCalculator[Score, Label] { | |
| type Counter = Int | |
| case class AucAccumulator(numRectangles: Counter, currentHeight: Counter, numPositives: Counter, numNegatives: Counter) | |
| def isPositive(value: Label): Boolean | |
| def calAucRocWithSorting(scoresAndLabelsPipe: TypedPipe[(Score, Label)])(implicit ord:Ordering[Score]): TypedPipe[Double] = { | |
| val sorted = scoresAndLabelsPipe.groupAll.sortBy(_._1).reverse.mapValues(_._2) | |
| calcAucRoc(sorted) | |
| } | |
| /** | |
| * It is assumed that the pipe is sorted according to the scores. | |
| * @return The AUC of ROC if defined, NaN otherwise. For example NaN will be returned if isPositive(label) | |
| * is always true or always false. | |
| */ | |
| def calcAucRoc(sortedLabelsPipe: Grouped[Unit, Label]): TypedPipe[Double] = { | |
| val accumulated = sortedLabelsPipe.foldLeft(AucAccumulator(0, 0, 0, 0)){ | |
| case (acc, label) => | |
| if(isPositive(label)) | |
| acc.copy(currentHeight = acc.currentHeight + 1, numPositives = acc.numPositives + 1) | |
| else | |
| acc.copy(numRectangles = acc.numRectangles + acc.currentHeight, numNegatives = acc.numNegatives + 1) | |
| } | |
| accumulated.values.map{ | |
| case AucAccumulator(numRectangles, _, numPositives, numNegatives) => numRectangles * 1.0 / numPositives * 1.0 / numNegatives | |
| } | |
| } | |
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| package com.liveperson.lpbt.research.hadoop.scripts.EligibilityExperiment.auc.aucCalculators | |
| /** | |
| * Created with IntelliJ IDEA. | |
| * User: mishaelr | |
| * Date: 4/17/14 | |
| * Time: 1:21 PM | |
| */ | |
| trait DoubleIntAucCalculator extends AucCalculator[Double, Int] { | |
| def isPositive(value: Int) = value > 0 | |
| } |
Author
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Doesn't distribute well, since it uses an exact calculation, which requires sorting.
Sorting is done using groupAll which forces the pipe into one reducer.
A better distributed approximate algorithm can be derived from the probabilistic definition of AUC of the ROC:
"A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example."