Machine Learning Metrics

These metrics are primarily focused on classification tasks, evaluating how well a model distinguishes between different categories.

General Definitions (Foundation)

Before diving into metrics, the paper establishes common terms based on a confusion matrix:

  • TP (True Positives): Correctly predicted positive instances.
  • TN (True Negatives): Correctly predicted negative instances.
  • FP (False Positives): Incorrectly predicted positive instances (Type I error).
  • FN (False Negatives): Incorrectly predicted negative instances (Type II error).
  • P (Positives): Total actual positive instances ($P = TP + FN$).
  • N (Negatives): Total actual negative instances ($N = TN + FP$).
  • T (Total): Total instances ($T = P + N = TP + TN + FP + FN$).
  • n: Number of classes (for multi-class problems).
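
To make the terms concrete, here is a minimal sketch (using NumPy and illustrative label arrays) that derives all of these counts from binary predictions:

```python
import numpy as np

# Illustrative ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

TP = int(np.sum((y_pred == 1) & (y_true == 1)))  # correctly predicted positives
TN = int(np.sum((y_pred == 0) & (y_true == 0)))  # correctly predicted negatives
FP = int(np.sum((y_pred == 1) & (y_true == 0)))  # Type I errors
FN = int(np.sum((y_pred == 0) & (y_true == 1)))  # Type II errors

P = TP + FN  # actual positives
N = TN + FP  # actual negatives
T = P + N    # total instances
print(TP, TN, FP, FN, P, N, T)
```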

Precision / Positive Predictive Value (PPV)

Formula

$PPV = \frac{TP}{TP + FP}$

Description

Measures the accuracy of the positive predictions. It answers the question: “Of all instances predicted as positive, what proportion actually are positive?”

Why Use It / Context

Crucial when the cost of a False Positive is high. For example:

  • Spam Detection: You want to minimize legitimate emails being marked as spam (FP). High precision means if the model says it’s spam, it’s very likely spam.
  • Medical Diagnosis (Specific Test): When a positive test leads to costly or invasive follow-up procedures, high precision ensures fewer healthy patients undergo them unnecessarily.
  • Fraud Detection: Minimizing false alarms that inconvenience legitimate users.

Advantages / Disadvantages

  • (+) Minimizes false positives.
  • (+) Useful when the cost or consequence of FPs is significant.
  • (-) Ignores True Negatives and False Negatives entirely.
  • (-) Can be misleadingly high if the model is overly conservative (predicts positive very rarely). It might achieve high precision by missing many actual positive cases (low recall).

Precision Averaging Methods (for Multi-Class)

Macro Average Precision (APmacro)

  • Formula: $AP_{\text{macro}} = \frac{1}{n} \sum_{i=1}^{n} PPV_i$
  • Description: Calculates precision for each class independently, then computes the unweighted average.
  • Why Use: To give equal importance to the precision of every class, regardless of how common or rare it is. Useful for a balanced view across categories.
  • Pros/Cons: (+) Treats all classes equally. (-) Can be skewed by performance on very rare classes; doesn’t reflect overall instance-level accuracy.

Micro Average Precision (APmicro)

  • Formula: $AP_{\text{micro}} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}$
  • Description: Aggregates the TPs and FPs across all classes before calculating a single precision value.
  • Why Use: To assess overall performance giving weight according to class frequency. Useful when you care more about correctly classifying the majority instances or for an aggregate view on imbalanced datasets.
  • Pros/Cons: (+) Reflects overall instance contribution. (-) Can be dominated by the performance on larger classes, potentially masking poor performance on smaller classes.

Weighted Average Precision (APweighted)

  • Formula: $AP_{\text{weighted}} = \frac{\sum_{i=1}^{n} s_i \cdot PPV_i}{\sum_{i=1}^{n} s_i}$ (where $s_i$ is the support/number of true instances for class $i$)
  • Description: Calculates precision for each class, then computes a weighted average based on the number of true instances for each class (support).
  • Why Use: To balance the macro approach (equal weight) and micro approach (instance weight) by considering class size but still averaging per-class scores. Often provides a good balance on imbalanced datasets.
  • Pros/Cons: (+) Accounts for class imbalance better than macro. (-) Can still obscure issues if weights don’t perfectly reflect importance.
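
A minimal sketch of the three averaging modes, assuming scikit-learn is available and using an illustrative three-class toy example:

```python
from sklearn.metrics import precision_score

# Illustrative 3-class ground truth and predictions
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(precision_score(y_true, y_pred, average='macro'))     # unweighted mean of per-class precision
print(precision_score(y_true, y_pred, average='micro'))     # global TP / (TP + FP) over all classes
print(precision_score(y_true, y_pred, average='weighted'))  # per-class precision weighted by support
```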

Negative Predictive Value (NPV)

Formula

$NPV = \frac{TN}{TN + FN}$

Description

Measures the accuracy of the negative predictions. It answers the question: “Of all instances predicted as negative, what proportion actually are negative?”

Why Use It / Context

Important when the reliability of a negative prediction is critical, i.e., when you need confidence that a negative result truly means negative because a missed positive (False Negative) hidden behind it would be costly.

  • Medical Screening: A high NPV means if a patient tests negative, they are very likely disease-free, reducing anxiety and unnecessary follow-up.
  • Quality Control: Ensuring that items marked as ‘not defective’ are indeed not defective.

Advantages / Disadvantages

  • (+) Measures the reliability of a negative prediction.
  • (+) Useful in contexts with low disease prevalence where negative results are common.
  • (-) Ignores True Positives and False Positives.
  • (-) Less informative if the rate of False Negatives is very high.

Recall / True Positive Rate (TPR) / Sensitivity / Hit Rate

Formula

$TPR = \frac{TP}{TP + FN} = \frac{TP}{P}$

Description

Measures the model’s ability to find all the actual positive instances. It answers the question: “Of all the actual positive instances, what proportion did the model correctly identify?”

Why Use It / Context

Crucial when the cost of a False Negative is high. You want to miss as few positive cases as possible.

  • Disease Detection: Failing to detect a disease (FN) can have severe consequences. High recall is paramount.
  • Fraud Detection: Missing a fraudulent transaction (FN) can be costly.
  • Safety Systems: Failing to detect a hazard (FN).

Advantages / Disadvantages

  • (+) Ensures minimal false negatives (missed positives).
  • (+) Critical for identifying rare but important positive cases.
  • (-) Ignores True Negatives and False Positives.
  • (-) Perfect recall can be trivially achieved by predicting everything as positive, which would lead to terrible precision.

Recall Averaging Methods (for Multi-Class)

Macro Average Recall (ARmacro)

  • Formula: $AR_{\text{macro}} = \frac{1}{n} \sum_{i=1}^{n} TPR_i$
  • Description: Calculates recall for each class independently, then computes the unweighted average.
  • Why Use: To give equal importance to the recall of every class, regardless of its frequency.
  • Pros/Cons: (+) Treats all classes equally. (-) Can be skewed by performance on rare classes.

Micro Average Recall (ARmicro)

  • Formula: $AR_{\text{micro}} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}$
  • Description: Aggregates the TPs and FNs across all classes before calculating a single recall value. (Note: This value is identical to Micro Average Precision and overall Accuracy in multi-class settings where each instance belongs to exactly one class).
  • Why Use: To assess overall instance-level recall performance, weighted by class frequency.
  • Pros/Cons: (+) Reflects overall instance contribution. (-) Dominated by larger classes.

Weighted Average Recall (ARweighted)

  • Formula: $AR_{\text{weighted}} = \frac{\sum_{i=1}^{n} s_i \cdot TPR_i}{\sum_{i=1}^{n} s_i}$ (where $s_i$ is the support for class $i$)
  • Description: Calculates recall for each class, then computes a weighted average based on class support.
  • Why Use: Balances per-class recall performance with class frequency. Good for imbalanced datasets.
  • Pros/Cons: (+) Accounts for imbalance. (-) May obscure issues if weights don’t reflect importance.

True Negative Rate (TNR) / Specificity / Selectivity

Formula

$TNR = \frac{TN}{TN + FP} = \frac{TN}{N}$

Description

Measures the model’s ability to find all the actual negative instances. It answers the question: “Of all the actual negative instances, what proportion did the model correctly identify?”

Why Use It / Context

Important when correctly identifying negatives is crucial, avoiding false alarms.

  • Medical Screening: High specificity ensures healthy individuals are correctly identified as negative, avoiding unnecessary stress and procedures (minimizes FPs).
  • Spam Detection: Ensuring legitimate emails are not classified as spam.

Advantages / Disadvantages

  • (+) Emphasizes correct classification of negatives, reduces false alarms (FPs).
  • (+) Important for protecting healthy/normal cases from incorrect positive classification.
  • (-) Ignores True Positives and False Negatives.
  • (-) Must be balanced with Recall (Sensitivity). A model predicting everything as negative has perfect TNR but zero Recall.
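
A short sketch deriving sensitivity and specificity from scikit-learn's confusion matrix (labels are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, sklearn orders the flattened matrix as [TN, FP, FN, TP]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # Recall / Sensitivity
tnr = tn / (tn + fp)  # Specificity (its complement FP/(FP+TN) is the FPR)
print(tpr, tnr)
```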

Prevalence

Formula

$\text{Prevalence} = \frac{P}{T} = \frac{TP + FN}{TP + TN + FP + FN}$

Description

The proportion of actual positive instances in the dataset.

Why Use It / Context

Understanding the baseline distribution of classes is crucial before evaluating a model. High prevalence of one class indicates an imbalanced dataset, which affects how other metrics (like Accuracy) should be interpreted and might necessitate specific modeling techniques (e.g., resampling, class weighting).

Advantages / Disadvantages

  • (+) Identifies class distribution imbalance.
  • (+) Guides selection of appropriate evaluation metrics and modeling techniques.
  • (-) If imbalance is ignored, models can appear accurate just by predicting the majority class.
  • (-) It describes the dataset, not the model’s performance directly.

Accuracy (A)

Formula

$A = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{T}$

Description

The overall proportion of correct classifications (both positive and negative).

Why Use It / Context

Provides a simple, intuitive measure of overall model correctness. It’s often the first metric people look at. Best suited for datasets where classes are relatively balanced and the costs of FP and FN are similar.

Advantages / Disadvantages

  • (+) Easy to understand and calculate.
  • (+) Suitable for balanced datasets.
  • (-) Very misleading on imbalanced datasets (a model predicting the majority class can achieve high accuracy without learning anything useful about the minority class).
  • (-) Does not distinguish between FP and FN errors, which often have different costs.

Balanced Accuracy (BA)

Formula

  • Binary: $BA = \frac{TPR + TNR}{2}$
  • Multi-class: $BA = \frac{1}{n} \sum_{i=1}^{n} TPR_i$ (identical to Macro Average Recall)

Description

Calculates the average recall obtained on each class. For binary cases, it’s the average of sensitivity and specificity.

Why Use It / Context

Provides a better measure of overall performance on imbalanced datasets than raw accuracy because it averages performance across classes, giving equal weight to each class.

Advantages / Disadvantages

  • (+) Less biased by class imbalance than standard accuracy.
  • (+) Useful when overall performance across all classes is important, regardless of their size.
  • (-) Doesn’t incorporate precision; a model could have high BA but poor precision on some classes.
  • (-) The multi-class version is just Macro Recall.
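
A small sketch, with an illustrative 90/10 imbalanced toy set, of how Balanced Accuracy exposes a majority-class predictor that plain Accuracy flatters:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Illustrative imbalanced problem: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate model that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))           # 0.90 -- looks good despite learning nothing
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- reveals chance-level performance
```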

Balanced Accuracy Weighted (BAW)

Formula

$BAW = \frac{\sum_{i=1}^{n} s_i \cdot TPR_i}{\sum_{i=1}^{n} s_i}$

(where $s_i$ is the support for class $i$; identical to Weighted Average Recall)

Description

A weighted average of the recall for each class, where weights are typically the class supports. It extends the BA concept by weighting the per-class recall scores.

Why Use It / Context

Useful for multi-class problems with imbalance where you want a recall-focused metric that accounts for class sizes, balancing between Macro and Micro perspectives.

Advantages / Disadvantages

  • (+) Robust against class imbalance in multi-class settings.
  • (-) Identical to Weighted Average Recall, which might be a more common name.
  • (-) Requires careful calculation/justification of weights if not using standard support.

Average Accuracy (AA)

Formula

$AA = \frac{1}{n} \sum_{i=1}^{n} A_i$

(where $A_i$ is the accuracy computed for class $i$ in a one-vs-rest manner)

Description

Calculates the accuracy for each class individually (treating it as a binary one-vs-rest problem) and then averages these per-class accuracies.

Why Use It / Context

An attempt to get a class-aware accuracy measure. Might be useful if the dataset is balanced and you want to see the average one-vs-rest accuracy.

Advantages / Disadvantages

  • (+) Provides a class-wise accuracy perspective.
  • (-) Less common than other averaging methods.
  • (-) Can yield poor results on imbalanced data if the underlying one-vs-rest classifiers are biased.

Average Class Accuracy (ACA)

Formula

$ACA = w \cdot TPR + (1 - w) \cdot TNR$

(where $w$ is the weight for the positive/minority class, $0 \le w \le 1$)

Description

A weighted average of the True Positive Rate (Recall/Sensitivity, often representing minority class accuracy) and the True Negative Rate (Specificity, often representing majority class accuracy).

Why Use It / Context

Specifically designed for binary classification problems with significant class imbalance, allowing explicit weighting of performance on the minority vs. majority class.

Advantages / Disadvantages

  • (+) Directly addresses significant class imbalance by allowing tunable focus.
  • (-) Primarily for binary classification.
  • (-) Choosing the appropriate weight can be subjective or require domain knowledge.

Error Rate (ER)

Formula

$ER = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A$

Description

The proportion of all predictions that were incorrect. It’s the complement of Accuracy.

Why Use It / Context

Focuses directly on the model’s mistakes rather than its successes. Useful when the primary goal is error minimization.

Advantages / Disadvantages

  • (+) Direct measure of overall errors.
  • (-) Same limitations as Accuracy: Misleading on imbalanced datasets.

Average Error Rate (AER)

Formula

$AER = \frac{1}{n} \sum_{i=1}^{n} ER_i$

(where $ER_i$ is the error rate for class $i$, one-vs-rest)

Description

Calculates the error rate for each class individually (one-vs-rest) and then averages these per-class error rates.

Why Use It / Context

Provides a class-aware view of the error rate, useful when minimizing errors per class is important.

Advantages / Disadvantages

  • (+) Focuses on errors per class.
  • (-) Same limitations as AA: Less common, can be poor on imbalanced data.

F-score / Fβ-score

Formula

$F_\beta = (1 + \beta^2) \cdot \frac{PPV \cdot TPR}{\beta^2 \cdot PPV + TPR}$

Description

The weighted harmonic mean of Precision and Recall. The parameter $\beta$ controls the weighting between precision and recall.

  • $\beta = 1$ (F1-score): Balances precision and recall equally.
  • $\beta > 1$: Gives more weight to recall (e.g., F2-score).
  • $\beta < 1$: Gives more weight to precision (e.g., F0.5-score).
  • $\beta = 0$ (F0-score): The formula simplifies to just Precision.

Why Use It / Context

Provides a single metric that combines both precision and recall. Useful when you need to balance the trade-off between minimizing FPs (Precision) and minimizing FNs (Recall), and the relative importance can be quantified by . Widely used when class distribution is potentially imbalanced.

Specific Variants Mentioned:

  • F1-score ($\beta = 1$): $F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$. Most common variant, balances P and R equally.
  • F0.5-score ($\beta = 0.5$): $F_{0.5} = \frac{1.25 \cdot PPV \cdot TPR}{0.25 \cdot PPV + TPR}$. Emphasizes Precision more.
  • F2-score ($\beta = 2$): $F_2 = \frac{5 \cdot PPV \cdot TPR}{4 \cdot PPV + TPR}$. Emphasizes Recall more.
  • F0-score ($\beta = 0$): $F_0 = PPV$. Focuses entirely on Precision.

Advantages / Disadvantages

  • (+) Combines precision and recall into one score.
  • (+) Tunable emphasis via .
  • (+) Often more informative than accuracy on imbalanced datasets.
  • (-) Less interpretable than Precision and Recall individually.
  • (-) Doesn’t consider True Negatives.
  • (-) As a harmonic mean, it is sensitive to low values: if either P or R is very low, the F-score will be low.

(Macro, Micro, Weighted averages for F-score follow the same logic as for Precision/Recall)
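
A minimal sketch of the tunable β trade-off using scikit-learn's f1_score and fbeta_score on illustrative labels:

```python
from sklearn.metrics import fbeta_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

print(f1_score(y_true, y_pred))               # beta = 1: precision and recall weighted equally
print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1: precision emphasized
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1: recall emphasized
```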


False Discovery Rate (FDR)

Formula

$FDR = \frac{FP}{TP + FP} = 1 - PPV$

Description

The proportion of positive predictions that were actually false (false positives). It’s the complement of Precision.

Why Use It / Context

Crucial in situations with multiple comparisons or hypothesis tests (like genomics, exploratory research) where you expect some false positives due to chance. FDR control aims to limit the proportion of false discoveries among all claimed discoveries (positive predictions).

Advantages / Disadvantages

  • (+) Focuses on the error rate among positive predictions (discoveries).
  • (+) Important concept in multiple hypothesis testing to control false positives.
  • (-) Ignores True Negatives and False Negatives.
  • (-) Sensitive to class imbalance.

(Macro, Micro, Weighted averages for FDR follow the same logic)


False Omission Rate (FOR)

Formula

$FOR = \frac{FN}{TN + FN} = 1 - NPV$

Description

The proportion of negative predictions that were actually false (false negatives). It’s the complement of NPV.

Why Use It / Context

Focuses on the error rate among instances predicted as negative. Important when you need to know the likelihood that a negative prediction is wrong.

Advantages / Disadvantages

  • (+) Measures error rate among negative predictions.
  • (-) Ignores True Positives and False Positives.

False Positive Rate (FPR) / Fall-out

Formula

$FPR = \frac{FP}{FP + TN} = 1 - TNR$

Description

The proportion of actual negative instances that were incorrectly classified as positive. It’s the complement of Specificity (TNR).

Why Use It / Context

Measures the rate of false alarms among the actual negative population. Crucial component of the ROC curve. Important when the cost of acting on a false positive is high (e.g., unnecessary medical treatment, flagging legitimate transactions).

Advantages / Disadvantages

  • (+) Focuses on error rate within the true negative population.
  • (-) Ignores True Positives and False Negatives.

False Negative Rate (FNR) / Miss Rate

Formula

$FNR = \frac{FN}{FN + TP} = 1 - TPR$

Description

The proportion of actual positive instances that were incorrectly classified as negative. It’s the complement of Recall (TPR).

Why Use It / Context

Measures the rate of missed positives among the actual positive population. Critical when failing to detect a positive case is costly or dangerous (e.g., missing a disease, security threat).

Advantages / Disadvantages

  • (+) Focuses on the error rate within the true positive population (missed cases).
  • (+) Important for recall-focused assessments.
  • (-) Ignores True Negatives and False Positives.
  • (-) Focusing solely on minimizing FNR can lead to poor precision.

(Macro, Micro, Weighted averages for FNR follow the same logic)


Positive Likelihood Ratio (LR+)

Formula

$LR^{+} = \frac{TPR}{FPR} = \frac{TPR}{1 - TNR}$

Description

Ratio of the probability of a positive test result given the condition is present, to the probability of a positive test result given the condition is absent. Answers: “How much more likely is a positive test in someone with the condition compared to someone without?”

Why Use It / Context

Medical diagnostics. Quantifies the strength of a positive test result in increasing the likelihood (odds) of having the condition. Higher LR+ indicates stronger evidence for the condition.

Advantages / Disadvantages

  • (+) Combines sensitivity and specificity into a measure of diagnostic evidence strength for positive tests.
  • (-) Does not directly consider NPV or FNR.

Negative Likelihood Ratio (LR-)

Formula

$LR^{-} = \frac{FNR}{TNR} = \frac{1 - TPR}{TNR}$

Description

Ratio of the probability of a negative test result given the condition is present, to the probability of a negative test result given the condition is absent. Answers: “How much more likely is a negative test in someone with the condition compared to someone without?” (Lower values are better).

Why Use It / Context

Medical diagnostics. Quantifies the strength of a negative test result in decreasing the likelihood (odds) of having the condition. Lower LR- indicates stronger evidence against the condition.

Advantages / Disadvantages

  • (+) Combines sensitivity and specificity into a measure of diagnostic evidence strength for negative tests.
  • (-) Does not directly consider PPV or FPR.

Diagnostic Odds Ratio (DOR)

Formula

$DOR = \frac{LR^{+}}{LR^{-}} = \frac{TP \cdot TN}{FP \cdot FN}$

Description

The ratio of the odds of a positive test in the group with the condition to the odds of a positive test in the group without the condition. It summarizes the diagnostic accuracy into a single number.

Why Use It / Context

Provides a single summary statistic for diagnostic test performance that is independent of prevalence. Higher DOR indicates better discriminatory ability.

Advantages / Disadvantages

  • (+) Single metric summarizing diagnostic performance.
  • (+) Independent of disease prevalence.
  • (-) Less intuitive interpretation than LR+ or LR-.
  • (-) Requires all four values (TP, TN, FP, FN). Increased calculation complexity.
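
A small worked example with assumed (illustrative) sensitivity and specificity values, showing how the three diagnostic ratios relate:

```python
# Illustrative diagnostic test: sensitivity (TPR) = 0.90, specificity (TNR) = 0.80
tpr, tnr = 0.90, 0.80

lr_pos = tpr / (1 - tnr)   # LR+ = TPR / FPR = 0.90 / 0.20 = 4.5
lr_neg = (1 - tpr) / tnr   # LR- = FNR / TNR = 0.10 / 0.80 = 0.125
dor = lr_pos / lr_neg      # DOR = 4.5 / 0.125 = 36.0
print(lr_pos, lr_neg, dor)
```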

Fowlkes–Mallows Index (FM)

Formula

$FM = \sqrt{PPV \cdot TPR}$

Description

The geometric mean of Precision and Recall. Also used in clustering to measure the similarity between two clusterings.

Why Use It / Context

Provides a single score balancing precision and recall using the geometric mean (which tends to be lower than the arithmetic mean when the values differ greatly). It assesses similarity between predicted and true classifications/clusters.

Advantages / Disadvantages

  • (+) Balances Precision and Recall.
  • (+) Claimed to be robust to noise (in clustering contexts).
  • (-) Less common than F1-score (harmonic mean).

Informedness / Bookmaker Informedness (BM) / Youden’s J statistic / Youden’s index

Formula

$BM = TPR + TNR - 1$

Description

Measures how informed a prediction is, compared to random guessing. It represents the probability that a prediction is informed (vs. random chance). Ranges from -1 (perversely incorrect) to +1 (perfectly informed), with 0 indicating chance-level performance.

Why Use It / Context

Provides a single measure of discriminative power that considers both positive and negative classes, correcting for chance. Useful when you want to know how much better than random guessing the model is.

Advantages / Disadvantages

  • (+) Considers both TPR and TNR.
  • (+) Corrects for chance performance.
  • (-) Result range [-1, 1] might be less intuitive for some users than [0, 1].

Markedness (MK)

Formula

$MK = PPV + NPV - 1$

Description

Measures how marked a condition is by the predictor, i.e., the probability that a predictor identifies the condition. It’s the dual metric to Informedness. Ranges from -1 to +1.

Why Use It / Context

Assesses the predictive power of the test result itself (how reliable is a positive/negative prediction?). Complements Informedness.

Advantages / Disadvantages

  • (+) Considers both Precision and NPV.
  • (+) Complements Informedness.
  • (-) Result range [-1, 1]. Less common than Informedness.

Matthews Correlation Coefficient (MCC) / phi coefficient / Yule phi coefficient

Formula (as defined in paper)

(Note: The standard formula is $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$, which ranges from -1 to +1 and is generally considered more robust. The paper's version appears restricted to [0, 1]).

Description

Measures the correlation between the predicted and actual classifications. Takes into account all four entries of the confusion matrix (TP, TN, FP, FN).

Why Use It / Context

Considered one of the most robust single-value metrics for binary classification, especially useful for imbalanced datasets, as it requires good performance across all four confusion matrix categories to score highly.

Advantages / Disadvantages

  • (+) Balanced measure considering all four matrix entries.
  • (+) Generally robust to class imbalance (especially the standard formula).
  • (-) The paper’s formula variant might be less standard/robust.
  • (-) More complex calculation and interpretation than basic metrics.
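
A minimal sketch using scikit-learn's matthews_corrcoef, which implements the standard [-1, 1] formula rather than the paper's variant (labels are illustrative):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Standard MCC in [-1, 1]; 0 corresponds to chance-level prediction
print(matthews_corrcoef(y_true, y_pred))
```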

Jaccard Index (JI) / Threat Score (TS) / Critical Success Index (CSI)

Formula

$JI = TS = \frac{TP}{TP + FP + FN}$

Description

Ratio of the true positives (intersection) to the sum of true positives, false negatives, and false positives (union of predicted and actual positives). Measures the similarity between predicted positive set and actual positive set.

Why Use It / Context

Commonly used in object detection and segmentation (where it’s identical to IoU calculated on pixels/voxels). Useful for evaluating overlap while penalizing both missed detections (FN) and false alarms (FP).

Advantages / Disadvantages

  • (+) Good for overlap evaluation in detection/segmentation.
  • (-) Ignores True Negatives. Performance on background/negative class doesn’t affect the score.

Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC)

Description

  • ROC Curve: A plot of True Positive Rate (TPR, Sensitivity) on the y-axis against False Positive Rate (FPR, 1-Specificity) on the x-axis at various classification thresholds.
  • AUC: The area under the ROC curve. Ranges from 0 to 1.

Formula (AUC)

$AUC = \int_0^1 TPR(FPR) \, d(FPR)$

(Conceptual integral; in practice approximated numerically over the discrete thresholds, e.g., by the trapezoidal rule.)

Why Use It / Context

AUC measures the model’s ability to distinguish between positive and negative classes across all possible thresholds. A value of 1.0 indicates perfect separation, while 0.5 indicates random chance performance. It’s widely used for comparing the overall discriminative power of models, especially when the optimal operating threshold isn’t known or when dealing with imbalanced data (though its utility here is debated vs. PR AUC).

Advantages / Disadvantages

  • (+) Evaluates model performance across all thresholds.
  • (+) Provides a single score for overall discriminative ability.
  • (+) Standard and widely understood.
  • (+) Suitable for imbalanced data (less affected than accuracy).
  • (-) Does not reflect performance at specific, practical thresholds.
  • (-) Can be insensitive to significant changes in model performance that don’t alter rank ordering of predictions.
  • (-) Focuses on ranking rather than probability calibration.
  • (-) Ignores the cost of errors.
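
A minimal sketch, assuming scikit-learn and illustrative prediction scores, of tracing the ROC curve and computing its area:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative ground truth and predicted positive-class scores (not hard labels)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # area under that curve
```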

Average Precision (AP) / Area Under the Precision-Recall Curve

Description

  • Precision-Recall (PR) Curve: A plot of Precision (PPV) on the y-axis against Recall (TPR) on the x-axis at various classification thresholds.
  • AP / PR AUC: The area under the PR curve. Ranges from 0 to 1.

Formula (AP)

$AP = \int_0^1 PPV(TPR) \, d(TPR) \approx \sum_k (R_k - R_{k-1}) \cdot P_k$

(Conceptual integral; practical calculation often uses interpolation methods or the step-wise sum above, where $P_k$ and $R_k$ are the precision and recall at the k-th threshold).

Why Use It / Context

AP summarizes the PR curve, focusing on the performance regarding the positive class. It is particularly useful for highly imbalanced datasets where the number of true negatives is vast, as ROC AUC might appear overly optimistic in such cases. Standard metric in information retrieval and object detection.

Advantages / Disadvantages

  • (+) More informative than ROC AUC on highly imbalanced datasets with few positive instances.
  • (+) Focuses on positive class performance trade-offs.
  • (+) Evaluates performance across thresholds.
  • (-) Not as easily interpretable as a single point metric like F1.
  • (-) Calculation can be complex depending on the interpolation method used.
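
Similarly, a sketch with scikit-learn's precision_recall_curve and average_precision_score on illustrative scores (not tied to any particular interpolation convention):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))  # step-wise area under the PR curve
```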

Mean Average Precision (mAP)

Formula

$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$

(where $AP_i$ is the Average Precision for class $i$).

Description

The mean of the Average Precision (AP) scores calculated for each class individually in a multi-class problem.

Why Use It / Context

Standard evaluation metric for multi-class object detection and information retrieval tasks. Provides a single score summarizing the performance across all classes, based on the PR curve for each.

Advantages / Disadvantages

  • (+) Provides a comprehensive evaluation across multiple classes for detection/retrieval.
  • (+) Widely adopted standard in certain fields (e.g., object detection).
  • (-) Calculation complexity is high (requires calculating AP for each class).
  • (-) Interpretation as a single number can mask variability in performance between classes.

H-measure (H)

Formula

$H = 1 - \frac{\int L(t)\, w(t)\, dt}{L_{\max}}$

where $L(t)$ is the expected misclassification cost at threshold $t$, $w(t)$ is a weight distribution over thresholds, and $L_{\max}$ is the corresponding cost of a trivial reference classifier used for normalization.

Description

Integrates misclassification costs across a range of decision thresholds, weighted by a distribution reflecting the importance or likelihood of different thresholds being used. It aims to provide a cost-sensitive alternative to AUC.

Why Use It / Context

Addresses limitations of AUC by incorporating the costs of different types of errors (FP vs. FN) and focusing on relevant threshold ranges. Useful in applications where misclassification costs are unequal and known (e.g., medical diagnosis, fraud detection with specific cost models).

Advantages / Disadvantages

  • (+) Cost-sensitive evaluation.
  • (+) Handles class imbalance through cost functions.
  • (+) Directly incorporates threshold relevance.
  • (-) More complex to understand and implement than AUC.
  • (-) Requires defining a cost function ($L$) and a threshold weight distribution ($w$), which can be subjective or hard to determine. Results are sensitive to these choices.

Cohen’s Kappa (κ)

Formula

$\kappa = \frac{p_o - p_e}{1 - p_e}$

where $p_o$ is the observed accuracy and $p_e$ is the expected accuracy due to chance agreement (calculated from the marginal totals of the confusion matrix).

Description

Measures the agreement between predicted and actual classifications, correcting for the agreement that would be expected purely by chance. Ranges typically from -1 to +1.

Why Use It / Context

Originally for inter-rater reliability, it’s used in ML to assess if the model’s agreement with the ground truth is significantly better than random chance, especially useful when class distributions might make chance agreement high.

Advantages / Disadvantages

  • (+) Accounts for chance agreement, potentially providing a more robust measure than raw accuracy.
  • (+) Useful for comparing model to human performance or assessing annotation quality.
  • (-) Sensitive to the prevalence (base rates) of classes; can give low scores even with high accuracy if prevalence is skewed (paradoxes of kappa).
  • (-) Interpretation of magnitude can be debated.
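
A minimal sketch with scikit-learn's cohen_kappa_score on illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Agreement between prediction and ground truth, corrected for chance agreement
print(cohen_kappa_score(y_true, y_pred))
```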

Gini Impurity

Formula

$\text{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^2$

(where $p_i$ is the proportion of instances of class $i$ in dataset/node $D$).

Description

Measures the “impurity” of a set of instances. A Gini impurity of 0 means all instances belong to the same class (pure node). It represents the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of labels in the subset.

Why Use It / Context

Primarily used as a splitting criterion in decision tree algorithms (like CART). The algorithm chooses splits that result in the largest decrease in Gini impurity, aiming to create purer child nodes.

Advantages / Disadvantages

  • (+) Computationally efficient splitting criterion for decision trees.
  • (-) Tends to favor splits that result in binary outcomes.
  • (-) Can be biased towards features with a large number of distinct values.
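
A small sketch of the impurity calculation itself (plain NumPy, illustrative label lists):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0   -> pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5   -> maximally impure binary node
print(gini_impurity([0, 0, 0, 1]))  # 0.375
```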

P4 metric

Formula

$P_4 = \frac{4}{\frac{1}{PPV} + \frac{1}{TPR} + \frac{1}{TNR} + \frac{1}{NPV}}$

Description

The harmonic mean of the four basic conditional probabilities: Precision, Recall, Specificity, and NPV.

Why Use It / Context

An attempt to create a single, symmetrical metric that incorporates all four fundamental aspects of binary classification performance.

Advantages / Disadvantages

  • (+) Symmetrical with respect to positive/negative class definition.
  • (+) Tends to zero if any component is zero, tends to one if all components are one.
  • (-) Does not allow for weighting the importance of the components.
  • (-) Very rarely used in practice.

Skill Score (SS)

Formula (as defined in paper)

(where $\text{Metric}_{\text{model}}$ is the model's score and $\text{Metric}_{\text{best}}$ is the score of a perfect/reference model).

Description

Measures the performance of a forecast relative to some reference forecast (here, the “best possible expectation”). SS=1 is perfect, SS=0 means performance is equal to reference, SS<0 means worse than reference.

Why Use It / Context

Provides an intuitive way to rank model performance against a defined benchmark or theoretical best score. The base Metric could be accuracy, error rate, etc.

Advantages / Disadvantages

  • (+) Intuitive ranking relative to a benchmark.
  • (-) Interpretation depends heavily on the chosen base Metric.
  • (-) Can scale indefinitely negative if the model performs much worse than the reference, depending on the base metric.

Relative Improvement Factor

Formula (as defined in paper)

(Assuming Metric is accuracy-like [0,1])

Description

Measures the relative quality compared to a reference, focusing on the reduction in error or gain in accuracy relative to the maximum possible improvement.

Why Use It / Context

An alternative way to rank model performance relative to a benchmark, focusing on the ‘gap’ to perfection.

Advantages / Disadvantages

  • (+) Intuitive ranking perspective.
  • (-) Interpretation depends heavily on the chosen base Metric.
  • (-) Can scale indefinitely positive as model performance approaches perfection.

Computer Vision Metrics

These metrics are often used for tasks like regression (e.g., depth estimation), image quality assessment, or segmentation/detection evaluation.

Error (E)

Formula

$E = GT - P$

Description

The simple difference between the ground truth value ($GT$) and the predicted value ($P$).

Why Use It / Context

Fundamental building block for many other regression metrics. Indicates the direction and magnitude of deviation for a single prediction.

Advantages / Disadvantages

  • (+) Simple, direct measure of deviation.
  • (-) Single instance measure, needs aggregation.
  • (-) Sensitive to outliers if averaged directly (see ME).

Absolute Error / Sum of Absolute Errors (AE)

Formula

$AE = \sum_{i=1}^{T} |GT_i - P_i|$

Description

The sum of the absolute differences between predicted and ground truth values over all instances.

Why Use It / Context

Measures the total magnitude of errors, ignoring their direction. Useful when the direction doesn’t matter, only the total deviation.

Advantages / Disadvantages

  • (+) Captures total error magnitude.
  • (-) Does not average; scales with the number of instances ($T$).
  • (-) Large individual errors can be masked by many small errors if only the sum is considered without context.

Relative Absolute Error (RAE)

Formula

$RAE = \frac{\sum_{i=1}^{T} |GT_i - P_i|}{\sum_{i=1}^{T} |GT_i - \overline{GT}|}$

(where $\overline{GT}$ is the mean of the ground truth values)

Description

Normalizes the sum of absolute errors (AE) by the sum of absolute errors of a baseline model that always predicts the mean of the ground truth values.

Why Use It / Context

Provides a scale-independent measure of performance relative to a simple baseline. Values less than 1 indicate the model is better than the baseline. Useful for comparing models on datasets with different scales.

Advantages / Disadvantages

  • (+) Scale-independent comparison relative to baseline.
  • (-) Can be unstable if the baseline error (the denominator) is very small.

Mean Error (ME) / Bias

Formula

$ME = \frac{1}{T} \sum_{i=1}^{T} (GT_i - P_i)$

Description

The average of the errors. Measures the average tendency of the model to over-predict or under-predict (bias).

Why Use It / Context

Useful for identifying systematic bias in predictions (e.g., consistently predicting depth slightly too high or too low). Simple check for overall bias.

Advantages / Disadvantages

  • (+) Simple check for systematic bias.
  • (-) Positive and negative errors cancel out, potentially hiding large individual errors of opposite signs.
  • (-) Sensitive to outliers.

Mean Percentage Error (MPE)

Formula

$MPE = \frac{100\%}{T} \sum_{i=1}^{T} \frac{GT_i - P_i}{GT_i}$

Description

The average of the percentage errors.

Why Use It / Context

Measures the average bias in percentage terms, providing a relative sense of over/under prediction.

Advantages / Disadvantages

  • (+) Provides relative bias information.
  • (-) Undefined if any $GT_i = 0$.
  • (-) Errors can cancel out.
  • (-) Can be skewed by small values.

Mean Absolute Error (MAE) / L1 Loss

Formula

$MAE = \frac{1}{T} \sum_{i=1}^{T} |GT_i - P_i|$

Description

The average of the absolute errors. Measures the average magnitude of errors.

Why Use It / Context

Common metric for regression tasks. Provides an easily interpretable measure of average error magnitude in the original units of the data. Less sensitive to outliers than MSE.

Advantages / Disadvantages

  • (+) Interpretable in original units.
  • (+) Robust to outliers compared to MSE.
  • (-) Does not penalize large errors as heavily as MSE.

Mean Absolute Percentage Error (MAPE)

Formula

$MAPE = \frac{100\%}{T} \sum_{i=1}^{T} \left| \frac{GT_i - P_i}{GT_i} \right|$

Description

The average of the absolute percentage errors.

Why Use It / Context

Provides a scale-independent measure of average error magnitude, expressed as a percentage. Useful for comparing forecast accuracy across time series or datasets with different scales.

Advantages / Disadvantages

  • (+) Scale-independent, intuitive percentage interpretation.
  • (-) Undefined if any $GT_i = 0$.
  • (-) Can produce extreme or undefined values if $GT_i$ is very close to zero.
  • (-) Asymmetric: Penalizes negative errors (P > GT) more heavily than positive errors (P < GT).

Mean Absolute Scaled Error (MASE)

Formula (Standard Definition, inferred correction to paper)

$MASE = \frac{\frac{1}{T} \sum_{i=1}^{T} |GT_i - P_i|}{\frac{1}{T - m} \sum_{i=m+1}^{T} |GT_i - GT_{i-m}|}$

(where $m$ is the period for the seasonal naive forecast, often $m = 1$ for non-seasonal data).

Description

Scales the MAE of the model by the MAE of a naive benchmark forecast (e.g., predicting the previous value, or the value from the previous season).

Why Use It / Context

Provides a scale-independent measure of forecast accuracy relative to a simple but often hard-to-beat benchmark. Recommended for comparing forecast methods across different time series. MASE < 1 indicates the model is better than the naive forecast. Suitable for time-series predictions (e.g., motion tracking).

Advantages / Disadvantages

  • (+) Scale-independent and interpretable relative to naive baseline.
  • (+) Avoids issues of division by zero (unlike MAPE).
  • (-) Can be less intuitive than MAPE for non-technical audiences.
  • (-) Naive forecast MAE can be zero if data is constant.
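
A minimal sketch of the scaled error, assuming a plain NumPy implementation and, for brevity, computing the naive-forecast denominator on the same series (in practice it is usually computed on the training data):

```python
import numpy as np

def mase(gt, pred, m=1):
    """MASE: model MAE scaled by the MAE of a (seasonal) naive forecast."""
    gt, pred = np.asarray(gt, dtype=float), np.asarray(pred, dtype=float)
    mae_model = np.mean(np.abs(gt - pred))
    mae_naive = np.mean(np.abs(gt[m:] - gt[:-m]))  # naive forecast: repeat value from m steps back
    return mae_model / mae_naive

gt   = [10, 12, 11, 13, 14, 13, 15]
pred = [10, 11, 12, 13, 13, 14, 15]
print(mase(gt, pred))  # < 1 means better than the naive forecast
```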

Mean Normalized Bias (MNB)

Formula

Description

Calculates the average bias (error) normalized by the predicted value for each instance.

Why Use It / Context

Evaluates systematic errors relative to the predicted magnitude, useful for understanding if relative bias depends on the prediction level.

Advantages / Disadvantages

  • (+) Evaluates systematic errors relative to prediction magnitude.
  • (-) Undefined if any $P_i = 0$.
  • (-) Does not detect localized errors.

Normalized Mean Bias (NMB)

Formula

$NMB = \frac{\sum_{i=1}^{T} (GT_i - P_i)}{\sum_{i=1}^{T} P_i}$

Description

The total bias (sum of errors) divided by the sum of the predicted values. Represents the overall relative bias.

Why Use It / Context

Used for comparing overall model bias independently of the scale of the predictions/observations (e.g., commonly used in air quality model evaluation).

Advantages / Disadvantages

  • (+) Allows comparison of overall bias across different scales/models.
  • (-) Sensitive to outliers (in both numerator and denominator).
  • (-) Denominator can be zero or near-zero.

Squared Error / Sum of Squared Errors (SE) / L2 Loss (sum)

Formula

$SE = \sum_{i=1}^{T} (GT_i - P_i)^2$

Description

The sum of the squares of the errors between predicted and ground truth values.

Why Use It / Context

Forms the basis of MSE and RMSE. Squaring the errors heavily penalizes large deviations. Often used implicitly in optimization due to its mathematical properties (differentiability).

Advantages / Disadvantages

  • (+) Strongly emphasizes large errors.
  • (+) Mathematically convenient for optimization.
  • (-) Not in the original units of the data.
  • (-) Very sensitive to outliers.
  • (-) Scales with the number of instances.

Mean Square Error (MSE) / L2 Loss (mean)

Formula

$MSE = \frac{1}{T} \sum_{i=1}^{T} (GT_i - P_i)^2$

Description

The average of the squared errors.

Why Use It / Context

A very common metric for regression problems and loss function in model training. It strongly penalizes large errors.

Advantages / Disadvantages

  • (+) Strongly penalizes large errors.
  • (+) Differentiable, widely used in optimization.
  • (-) Not in the original units of the data, making direct interpretation difficult.
  • (-) Highly sensitive to outliers.

Root Mean Square Error (RMSE)

Formula

$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (GT_i - P_i)^2}$

Description

The square root of the Mean Square Error.

Why Use It / Context

Another very common regression metric. It represents the standard deviation of the prediction errors (residuals). It’s in the same units as the original data, making it more interpretable than MSE, while still penalizing large errors more than MAE.

Advantages / Disadvantages

  • (+) In the same units as the target variable.
  • (+) Penalizes large errors more than MAE.
  • (-) More sensitive to outliers than MAE.
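
A small sketch computing ME, MAE, MSE, and RMSE side by side with NumPy on illustrative values, which makes the outlier-sensitivity differences easy to see:

```python
import numpy as np

gt   = np.array([2.0, 3.5, 4.0, 5.0, 10.0])  # ground truth
pred = np.array([2.5, 3.0, 4.0, 4.0, 12.0])  # predictions

err  = gt - pred
me   = err.mean()          # Mean Error / bias (signed errors can cancel)
mae  = np.abs(err).mean()  # Mean Absolute Error
mse  = (err ** 2).mean()   # Mean Square Error (penalizes large errors)
rmse = np.sqrt(mse)        # Root Mean Square Error, back in original units
print(me, mae, mse, rmse)
```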

Normalized Root Mean Square Error (NRMSE)

Formula (Example: Normalization by Mean)

$NRMSE = \frac{RMSE}{\overline{GT}}$

(Other normalizations exist, e.g., dividing by the range or the standard deviation of $GT$)

Description

Normalizes the RMSE by dividing it by a measure of the scale of the ground truth data (e.g., mean, range, standard deviation).

Why Use It / Context

To compare RMSE values across datasets or variables with different scales. Provides a relative measure of error.

Advantages / Disadvantages

  • (+) Allows comparison across different scales.
  • (-) The choice of normalization method affects the value and interpretation.
  • (-) Inherits sensitivity to outliers from RMSE.

Root Mean Squared Logarithmic Error (RMSLE)

Formula

$RMSLE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left( \log(P_i + 1) - \log(GT_i + 1) \right)^2}$

Description

Calculates the RMSE on the logarithmically transformed predicted and ground truth values (adding 1 to handle potential zeros).

Why Use It / Context

Useful when you care more about the relative (percentage) error than the absolute error, especially when the target variable spans several orders of magnitude. It penalizes under-prediction more heavily than over-prediction. Often used for predicting prices, counts, or depth.

Advantages / Disadvantages

  • (+) Focuses on relative errors.
  • (+) Less sensitive to large absolute errors if the relative error is small.
  • (-) Penalizes under-prediction more than over-prediction.
  • (-) Log transformation assumes positive values (+1 used for zero handling).
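
A minimal sketch using NumPy's log1p, which implements the +1 offset directly (values are illustrative):

```python
import numpy as np

gt   = np.array([10.0, 100.0, 1000.0, 0.0])
pred = np.array([12.0,  90.0, 1200.0, 1.0])

# log1p(x) = log(1 + x), handling zero values safely
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(gt)) ** 2))
print(rmsle)
```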

Peak Signal-to-Noise Ratio (PSNR)

Formula

$PSNR = 10 \cdot \log_{10}\left( \frac{MAX^2}{MSE} \right)$

(where $MAX$ is the maximum possible pixel value, e.g., 255 for an 8-bit grayscale image)

Description

Measures the ratio between the maximum possible power (energy) of a signal and the power of corrupting noise that affects its fidelity. Expressed in decibels (dB). Higher PSNR generally indicates better reconstruction quality.

Why Use It / Context

Widely used standard metric in image and video compression and restoration to quantify the reconstruction quality relative to the original. Simple to calculate if MSE is known.

Advantages / Disadvantages

  • (+) Simple, standard, widely used baseline.
  • (+) Directly related to MSE.
  • (-) Often correlates poorly with human perception of image quality. Small pixel-wise differences that significantly impact perception might yield high PSNR, and vice versa.
  • (-) Less effective for complex distortions beyond simple noise or compression artifacts.
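
A small sketch of PSNR computed from MSE with NumPy, assuming 8-bit images and synthetic noise for illustration:

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """PSNR in dB for images with pixel values in [0, max_val]."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
noisy = np.clip(img + rng.normal(0, 5, size=img.shape), 0, 255).astype(np.uint8)
print(psnr(img, noisy))
```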

Structural Similarity (SSIM) Index

Formula

$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$

(Typically computed on local image patches and averaged; $c_1$ and $c_2$ are small constants for numerical stability.)

Description

Measures the similarity between two images based on perceived structural information, luminance, and contrast. Ranges from -1 to +1, where 1 indicates identical images.

Why Use It / Context

Designed to better reflect human visual perception of image quality compared to PSNR/MSE. Used for evaluating image compression, restoration, denoising algorithms where perceptual quality is important.

Advantages / Disadvantages

  • (+) Correlates better with human perception of quality than PSNR/MSE.
  • (+) Considers structure, luminance, and contrast.
  • (-) More computationally complex than PSNR/MSE.
  • (-) Can be sensitive to local variations or specific textures.
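
A minimal sketch, assuming scikit-image is available, of computing SSIM between an image and a noisy copy (synthetic data for illustration):

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
noisy = np.clip(img + rng.normal(0, 10, size=img.shape), 0, 255).astype(np.uint8)

# Computed over local windows and averaged; data_range is the dynamic range of the pixels
score = structural_similarity(img, noisy, data_range=255)
print(score)
```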

Structural Dissimilarity (DSSIM)

Formula

$DSSIM = \frac{1 - SSIM}{2}$

Description

A simple transformation of SSIM into a distance metric, ranging from 0 (identical) to 1 (completely different).

Why Use It / Context

Provides a dissimilarity measure based on structural properties, potentially useful as a loss function where lower values are better and the range is normalized [0, 1].

Advantages / Disadvantages

  • (+) Inherits SSIM’s correlation with perception.
  • (+) Provides a bounded [0, 1] distance metric.
  • (-) Inherits SSIM’s complexity.

Intersection over Union (IoU) / Jaccard Index (JI)

Formula

$IoU = \frac{|P \cap GT|}{|P \cup GT|}$

Description

Calculates the ratio of the intersection area to the union area of the predicted region ($P$) and the ground truth region ($GT$). Ranges from 0 to 1.

Why Use It / Context

The standard metric for evaluating the accuracy of object detection and semantic segmentation tasks. Measures how well the predicted boundary/mask aligns with the true boundary/mask. Higher IoU means better alignment.

Advantages / Disadvantages

  • (+) Intuitive measure of overlap.
  • (+) Standard benchmark for detection/segmentation.
  • (+) Scale-invariant (based on ratios).
  • (-) Sensitive to small misalignments, especially for small objects.
  • (-) Does not distinguish types of errors (e.g., slightly too small vs. completely misplaced).
  • (-) Ignores performance on the negative/background class.

Dice Coefficient (DC) / F1-score (Segmentation) / Sørensen–Dice index

Formula

$DC = \frac{2 \cdot |P \cap GT|}{|P| + |GT|} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$

Description

Calculates twice the intersection area divided by the sum of the areas of the predicted and ground truth regions. Ranges from 0 to 1. Mathematically related to IoU ($DC = \frac{2 \cdot IoU}{1 + IoU}$) and equivalent to the F1-score calculated on pixels/voxels.

Why Use It / Context

Very common metric for semantic segmentation, especially in medical imaging. Balances precision and recall at the pixel/voxel level.

Advantages / Disadvantages

  • (+) Widely used in segmentation, particularly medical imaging.
  • (+) Directly related to F1-score, balancing false positives and false negatives at pixel level.
  • (-) Like IoU, ignores background performance.
  • (-) Can be more sensitive to the size of the segmentation compared to IoU in some edge cases (though paper states no size differentiation).
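
A small sketch computing IoU and Dice on illustrative boolean masks with NumPy, which also shows the fixed relationship between the two:

```python
import numpy as np

# Illustrative binary segmentation masks (True = foreground)
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
gt   = np.zeros((8, 8), dtype=bool); gt[3:7, 3:7]   = True

inter = np.logical_and(pred, gt).sum()
union = np.logical_or(pred, gt).sum()

iou  = inter / union                        # Jaccard / IoU
dice = 2 * inter / (pred.sum() + gt.sum())  # equals 2*IoU / (1 + IoU)
print(iou, dice)
```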

Overlap Coefficient (OC) / Szymkiewicz–Simpson coefficient

Formula

$OC = \frac{|P \cap GT|}{\min(|P|, |GT|)}$

Description

Calculates the ratio of the intersection area to the area of the smaller of the two regions (predicted or ground truth). Ranges from 0 to 1.

Why Use It / Context

Measures overlap relative to the size of the smaller region. Useful in scenarios testing for containment or if one region is expected to be a subset of the other, or when evaluating detection of smaller objects within larger ones.

Advantages / Disadvantages

  • (+) Focuses on overlap relative to the smaller region’s size.
  • (-) Less common than IoU or Dice.
  • (-) Value of 1 doesn’t necessarily mean perfect match (only that the smaller region is fully contained within the larger one’s overlap).