Made on HackMD; use it to get the correct display. Available at:
What are the differences between the various metrics used to evaluate binary classification performance?
Link: Towards Data Science
See also p. 197+, chap. 5.4.11+: look for a file named like StatisticsMachineLearningPython.pdf
On a photo of a bee's wing, detect the intersection points (pixel coordinates) between the wing's veins, with an allowed error of a 30 px radius circle:
- pixels marked as 'intersection' are labeled 'positive'
- all other pixels of the image are considered 'negative'
- huge imbalance between positive and negative classes (in this example)
:::info
TP : true positives
FP : false positives
TN : true negatives
FN : false negatives
:::
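To make these counts concrete, here is a minimal Python sketch with made-up toy arrays (not the real Bee Wings masks), assuming the pixel labels have been flattened to 1-D:

```python
# Minimal sketch (toy data): counting TP / FP / TN / FN from flattened
# binary ground-truth and prediction arrays.
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # 1 = real 'intersection' pixel
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])  # labels given by our program

TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, really positive
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, really negative
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, really negative
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, really positive

print(TP, FP, TN, FN)  # 2 1 6 1
```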
:::warning
aka.
- Positive Predictive Value (PPV)
:::
- ratio of pixels correctly labeled 'intersection' by our program over all pixels labeled 'intersection' by our program
- how sure you are that a positive prediction really is a true positive
- choose precision if you want to be more confident in your positive predictions
:::info
$$
precision = \frac{TP}{TP + FP}
$$
:::
:::danger
$precision \neq accuracy$
:::
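A minimal sketch of the precision computation on the same kind of toy arrays, checked against scikit-learn's `precision_score` (assuming scikit-learn is installed):

```python
# Sketch: precision = TP / (TP + FP), on made-up toy arrays.
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))

print(TP / (TP + FP))                   # 0.666... : 2 of our 3 'intersection' labels are right
print(precision_score(y_true, y_pred))  # same value from scikit-learn
```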
:::warning
aka.
- Sensitivity (SEN)
- True Positive Rate (TPR)
- Hit rate
:::
- ratio of pixels correctly labeled 'intersection' by our program over all pixels that really are 'intersection'
- counterpart of specificity for the positive class (but not the same thing)
- how sure you are that you are not missing any positives
- choose recall if false positives are far more acceptable than false negatives (e.g. cancer / HIV detection...)
:::info
$$
recall = \frac{TP}{TP + FN}
$$
:::
:::danger
$recall = sensitivity$ BUT $recall \neq specificity$
:::
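Same kind of sketch for recall, checked against scikit-learn's `recall_score` (toy arrays again):

```python
# Sketch: recall = TP / (TP + FN), on made-up toy arrays.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FN = np.sum((y_pred == 0) & (y_true == 1))

print(TP / (TP + FN))                # 0.666... : we found 2 of the 3 real intersections
print(recall_score(y_true, y_pred))  # same value from scikit-learn
```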
:::warning
aka.
- True Negative Rate (TNR)
- Recall of the negative class (a term rarely used)
:::
- ratio of pixels correctly labeled 'not intersection' by our program over all pixels that really are 'not intersection'
- counterpart of recall for the negative class (but not the same thing)
- choose specificity if you want to cover all negatives, meaning you don't want any false alarms (i.e. false positives)
:::info
$$
specificity = \frac{TN}{TN + FP}
$$
:::
:::danger
$recall = sensitivity$ BUT $recall \neq specificity$
:::
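A sketch for specificity; scikit-learn has no dedicated specificity function, but the recall of the negative class (`pos_label=0`) gives the same number, matching the 'aka.' list above:

```python
# Sketch: specificity = TN / (TN + FP), on made-up toy arrays.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))

print(TN / (TN + FP))                             # 0.857... : 6 of the 7 real negatives kept
print(recall_score(y_true, y_pred, pos_label=0))  # same value: recall of the negative class
```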
:::warning
aka.
- Dice score
- F1-score
- F-measure
- F-value
:::
- combines precision and recall: it is the harmonic mean of precision and recall
- the F-score is highest when there is some sort of balance between precision & recall in the system
- conversely, the F-score stays low if one measure is improved at the expense of the other; for example, if precision is 1 & recall is 0 (or recall is 1 & precision is 0), the F-score will be 0
- F-score is a good measure if you have an uneven class distribution between positive and negative counts --> suited for the analogy of the Bee Wings project
:::info
$$
F_{score} = \frac{2 \times (precision \times recall)}{precision + recall}
$$
:::
:::danger
$$
F_{score} \neq \frac{precision + recall}{2}
$$
:::
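A sketch of the F-score, checked against scikit-learn's `f1_score`; the toy prediction here is deliberately chosen so that precision ≠ recall, which makes the difference between harmonic and arithmetic mean visible:

```python
# Sketch: F-score = harmonic mean of precision and recall, on made-up toy arrays.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # very cautious: a single positive call

p = precision_score(y_true, y_pred)  # 1.0      (the only positive call is correct)
r = recall_score(y_true, y_pred)     # 0.333... (2 real intersections missed)

print(2 * p * r / (p + r))       # 0.5 : harmonic mean
print(f1_score(y_true, y_pred))  # 0.5 : same value from scikit-learn
print((p + r) / 2)               # 0.666... : arithmetic mean, NOT the F-score
```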
- ratio of correctly labeled subjects over the whole pool of subjects
- great measure, but only when you have symmetric datasets (false negative & false positive counts are close)
- if the costs of false positives and false negatives are different, then F1 is your savior
:::info
$$
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$
:::
:::danger
$precision \neq accuracy$
:::
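A sketch of accuracy on the same toy arrays, plus a made-up, heavily imbalanced pixel example showing why accuracy alone is misleading for a problem like the Bee Wings one:

```python
# Sketch: accuracy = (TP + TN) / (TP + TN + FP + FN), on made-up toy arrays.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
print(accuracy_score(y_true, y_pred))  # 0.8

# Imbalanced case: ~0.1% positive pixels. A model that answers 'not intersection'
# everywhere still reaches ~0.999 accuracy while finding no intersection at all.
rng = np.random.default_rng(0)
pixels = (rng.random(100_000) < 0.001).astype(int)
print(accuracy_score(pixels, np.zeros_like(pixels)))  # ~0.999
```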
- arithmetic mean of sensitivity (recall) and specificity, or the average accuracy obtained on either class
- avoids inflated performance estimates on imbalanced datasets
:::info
$$
bACC = \frac{recall + specificity}{2}
$$
:::
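A sketch of balanced accuracy, checked against scikit-learn's `balanced_accuracy_score` (same toy arrays as before):

```python
# Sketch: bACC = (recall + specificity) / 2, on made-up toy arrays.
import numpy as np
from sklearn.metrics import recall_score, balanced_accuracy_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

recall = recall_score(y_true, y_pred)                    # TPR, recall of the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # TNR, recall of the negative class

print((recall + specificity) / 2)               # 0.7619...
print(balanced_accuracy_score(y_true, y_pred))  # same value from scikit-learn
```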
:::info
ROC: Receiver Operating Characteristic
AUC: Area Under Curve (under the ROC)
:::
:::warning
aka. curve of Recall (TPR) (y-axis) relative to the False Positive Rate (FPR) (x-axis), where:
$$
FPR = \frac{FP}{FP + TN}
$$
:::
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
- ideal situation: the curve sticks to the y-axis and the 'ceiling' (top-left corner)
- worst situation: the curve follows the identity line (diagonal), i.e. fully random predictions
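ROC / AUC needs a continuous score per pixel (e.g. a predicted probability of being an 'intersection'), not hard labels; the scores below are made up for illustration:

```python
# Sketch: ROC curve (FPR on x-axis, TPR/recall on y-axis) and its AUC,
# computed from made-up per-pixel scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([1,   0,   0,    1,   0,   0,    0,   1,   0,    0])
y_score = np.array([0.9, 0.1, 0.65, 0.4, 0.2, 0.05, 0.3, 0.8, 0.15, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))              # ~0.95 (1.0 = ideal, 0.5 = full random)
```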