7.1 Introduction
All Bayesian classification methods are fundamentally based on Bayes’ Theorem.
Bayes’ Theorem Formula:
Given two random events, $A$ and $B$, Bayes' Theorem is stated as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This can also be expanded using the law of total probability for the denominator, assuming $A$ and its complement $\bar{A}$ form a partition of the sample space:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}$$
(Eq 7.1)
Terminology:
- $A$ and $B$: Random events.
- $\bar{A}$: The complement of event $A$.
- $A$: Typically represents a hypothesis to be tested or predicted.
- $B$: Represents the new data, observation, or evidence used to predict $A$.
- $P(A)$: The prior probability of hypothesis $A$. This reflects our belief or knowledge before observing the evidence $B$.
- $P(B \mid A)$: The likelihood of observing evidence $B$ given that hypothesis $A$ is true. Represents experience or prior knowledge about the relationship between $A$ and $B$.
- $P(B)$: The observation probability or evidence probability. This is the overall probability of observing the evidence $B$, regardless of hypothesis $A$. It acts as a normalization constant.
- $P(A \mid B)$: The posterior probability of hypothesis $A$ after observing the evidence $B$. This is the updated belief about $A$.
Core Idea of Bayes’ Theorem:
- It allows converting the computation of the posterior probability $P(A \mid B)$ into a form involving the likelihood $P(B \mid A)$, which is often easier to estimate or compute.
- Extremely helpful when $A$ and $B$ are dependent events and predicting $A$ directly is difficult due to lack of information.
- Information about $B$ (evidence) helps predict $A$ (hypothesis) more accurately.
- Information needed (like $P(A)$ and $P(B \mid A)$) can often be obtained from historical data or experience (see the sketch below).
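To make the update concrete, here is a minimal Python sketch of Eq 7.1. The function name and the numbers plugged in at the bottom are made up for illustration; they are not taken from the examples below.

```python
def posterior(prior_a: float, lik_b_given_a: float, lik_b_given_not_a: float) -> float:
    """Posterior P(A|B) via Bayes' theorem with the total-probability denominator (Eq 7.1)."""
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1.0 - prior_a)
    return lik_b_given_a * prior_a / evidence

# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2.
print(posterior(0.3, 0.8, 0.2))  # ~0.632
```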
General Application Idea:
Predicting one event using other related events is a common practice (e.g., symptoms predict disease, clouds predict rain, rainfall predicts harvest). Bayes’ theorem provides a formal framework for this.
Example 1: Weather Prediction
- Goal: Predict if the weather will be fine (No Rain) for a weekend sports event, given the prediction of high humidity.
- Events: $R$ = Rain, $\bar{R}$ = No Rain, $H$ = High Humidity.
- Prior Information (from history/meteorology):
  - $P(H \mid R)$: the likelihood of high humidity given that it rains.
  - $P(H \mid \bar{R})$: the likelihood of high humidity given no rain. (Implicitly assumed in the PDF's calculation for this specific example, though not explicitly stated beforehand; the value used appears to be $P(H \mid \bar{R}) = 1$. This seems counter-intuitive, likely meaning that in this simplified scenario humidity is guaranteed to be high even when it does not rain, or there is a typo in the text/calculation setup. We follow the PDF's calculation flow.)
  - $P(R)$ and $P(\bar{R}) = 1 - P(R)$: the prior probabilities of rain and no rain.
- Evidence: We know there will be high humidity ($H$).
- Prediction: Calculate the chance of no rain given high humidity, $P(\bar{R} \mid H)$.
- Calculation (using Bayes' Theorem as presented in the PDF):
  $$P(\bar{R} \mid H) = \frac{P(H \mid \bar{R})\,P(\bar{R})}{P(H \mid R)\,P(R) + P(H \mid \bar{R})\,P(\bar{R})}$$
  (Note: The specific numbers substituted in the PDF's calculation do not follow consistently from its stated priors. The formula structure is correct, but the values applied do not cleanly lead to the PDF's final answer, so that result should be read with this caveat.)
- Conclusion: This posterior probability helps make a decision about the sports event. Prior information was crucial. In practice, more factors (temperature, pressure) are combined.
Example 2: Image Classification (Zebra)
- Goal: Determine how likely an image containing black and white stripes ($S$) is a zebra ($Z$). We want $P(Z \mid S)$.
- Challenge: Need evidence/statistics to determine this likelihood accurately (e.g., 70%, 80%, 99%?).
- Approach: Use statistics from a sample of images (an image database) to approximate population statistics.
- Events: $Z$ = the image is a zebra, $\bar{Z}$ = the image is not a zebra, $S$ = the image contains black and white stripes.
- Prior/Likelihood Information (from training set/database):
  - $P(S \mid Z)$: the likelihood of stripes given it's a zebra (taken to be 1, assuming all zebras have stripes).
  - $P(Z)$: the prior probability of an image being a zebra.
  - $P(S \mid \bar{Z})$: the likelihood of stripes given it's not a zebra (small but non-zero, since some other things might have stripes).
- Evidence: We detected black and white stripes ($S$) in a new image.
- Prediction: Calculate $P(Z \mid S)$.
- Calculation (using Bayes' Theorem):
  $$P(Z \mid S) = \frac{P(S \mid Z)\,P(Z)}{P(S \mid Z)\,P(Z) + P(S \mid \bar{Z})\,P(\bar{Z})}$$
  Note: $P(\bar{Z}) = 1 - P(Z)$.
- Conclusion: The resulting posterior is high, so there is a high chance the image is a zebra, i.e., high confidence in the classification.
Extension to Multiple Events (Hypotheses)
Bayes' theorem can be extended when the evidence $B$ could be related to one of $n$ mutually exclusive and exhaustive events $A_1, A_2, \dots, A_n$:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)}$$
Where the denominator is expanded using the law of total probability:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{k=1}^{n} P(B \mid A_k)\,P(A_k)}$$
(Eq 7.2)
- Use Case: Predicting which specific event $A_i$ is responsible for the observation $B$.
- Example: Fever ($F$) can be caused by many diseases (flu, infection, etc.). Each disease $A_i$ has a different probability of causing fever, $P(F \mid A_i)$. Given a patient has fever ($F$), Eq 7.2 can be used to find the probability they have a specific disease, e.g., flu: $P(\text{flu} \mid F)$ (see the sketch below).
- Note: In practice (like clinical diagnosis), multiple symptoms are often combined to increase diagnostic accuracy.
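As a sketch of Eq 7.2 for the fever example, the snippet below normalizes likelihood-times-prior over several candidate causes. The cause names, priors, and likelihoods are invented placeholders, not values from the source.

```python
import numpy as np

# Hypothetical priors P(A_i) and likelihoods P(F | A_i); placeholder values only.
priors = np.array([0.05, 0.02, 0.93])       # flu, infection, other
likelihoods = np.array([0.90, 0.70, 0.05])  # P(fever | each cause)

# Eq 7.2: posterior over all candidate causes, given that fever is observed.
joint = likelihoods * priors
posteriors = joint / joint.sum()
print(dict(zip(["flu", "infection", "other"], posteriors.round(3))))
```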
7.2 Naïve Bayesian Image Classification
Based on a simple application of Bayes’ theorem to numerical and high-dimensional image data.
7.2.1 NB Formulation
Setup:
- Given a set of images: $I = \{I_1, I_2, \dots, I_M\}$.
- Given a set of semantic classes: $C = \{c_1, c_2, \dots, c_K\}$ (these are the "hypotheses" or events).
- Each image belongs to one class in $C$.
- Each image is represented by a feature vector $x$ (this is the "evidence" or observation).
Goal:
Classify (or annotate) a given image (represented by its feature vector $x$) into one of the classes in $C$. We want to find the class $c_i$ that maximizes the posterior probability $p(c_i \mid x)$.
Applying Bayes' Theorem:
$$p(c_i \mid x) = \frac{p(x \mid c_i)\,p(c_i)}{p(x)}$$
(Eq 7.3)
Expanding the denominator (evidence probability):
$$p(c_i \mid x) = \frac{p(x \mid c_i)\,p(c_i)}{\sum_{k=1}^{K} p(x \mid c_k)\,p(c_k)}$$
(Eq 7.4)
Simplification:
- The denominator $p(x)$ is the same for all classes when classifying a specific image with features $x$.
- It acts as a scaling factor and is independent of the class we are considering.
- Let $\delta = 1/p(x)$. Then:
$$p(c_i \mid x) = \delta \, p(x \mid c_i)\,p(c_i)$$
(Eq 7.5)
Decision Rule: Maximum A Posteriori (MAP)
To decide the class of an image with features $x$, we choose the class with the highest posterior probability. Since $\delta$ is constant for a given $x$, maximizing $p(c_i \mid x)$ is equivalent to maximizing the numerator $p(x \mid c_i)\,p(c_i)$:
$$c^{*} = \arg\max_{c_i \in C} \; p(x \mid c_i)\,p(c_i)$$
(Eq 7.6)
Where $c^{*}$ is the predicted class.
Components:
- Prior Probability $p(c_i)$:
  - Often assumed to be uniform for all classes ($p(c_i) = 1/K$) if there's no prior knowledge favoring any class.
  - Alternatively, it can be estimated from the training data as the frequency or proportion of training samples belonging to class $c_i$.
- Likelihood $p(x \mid c_i)$:
  - The probability of observing the feature vector $x$ given that the image belongs to class $c_i$.
  - This is the core component that needs to be modeled or learned from the training data.
  - Since image features are often numerical and continuous, they typically need to be discretized or modeled using probability distributions before computing the likelihood.
Procedure for Likelihood Modeling (Typical Practice):
Described in the following Training section.
Training Steps:
- Create Training Database: Collect images from all classes in $C$.
- Feature Extraction & Clustering:
  - Extract feature vectors from all training images.
  - Cluster these feature vectors into clusters using a vector quantization (VQ) algorithm (e.g., K-means). Each cluster represents a region in the feature space.
- Compute Centroids: Calculate the centroid $x_j$ for each cluster $j$. The centroid acts as a representative feature vector for all samples in that cluster.
- Calculate Likelihoods: Estimate the likelihood $p(x_j \mid c_i)$ (the probability of a feature vector belonging to cluster $j$ given it's from class $c_i$). This is done by calculating the frequency (see the sketch after these steps):
  $$p(x_j \mid c_i) = \frac{\text{number of class-}c_i \text{ training samples in cluster } j}{\text{total number of class-}c_i \text{ training samples}}$$
  (Eq 7.7)
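A minimal sketch of these training steps, assuming scikit-learn's KMeans for the vector quantization; the function and variable names are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_nb_vq(features: np.ndarray, labels: np.ndarray, n_clusters: int, n_classes: int):
    """Cluster training features and estimate p(x_j | c_i) by frequency counting (Eq 7.7)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    cluster_ids = kmeans.labels_                      # cluster index j of each training sample
    likelihoods = np.zeros((n_classes, n_clusters))   # rows: classes c_i, columns: clusters j
    priors = np.zeros(n_classes)
    for c in range(n_classes):
        in_class = labels == c
        priors[c] = in_class.mean()                   # p(c_i) estimated as the class frequency
        counts = np.bincount(cluster_ids[in_class], minlength=n_clusters)
        likelihoods[c] = counts / counts.sum()        # p(x_j | c_i), Eq 7.7
    return kmeans.cluster_centers_, likelihoods, priors
```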
Classification/Annotation Steps:
- Input: A new image with its extracted feature vector $x$.
- Match Feature: Find the closest cluster centroid $x_j$ to the input feature vector $x$ (e.g., using Euclidean distance). This step effectively discretizes the input feature by mapping $x$ to $x_j$.
- Apply MAP: Use the MAP decision rule (Eq 7.6), replacing the continuous likelihood $p(x \mid c_i)$ with the discretized likelihood $p(x_j \mid c_i)$ calculated during training (Eq 7.7):
  $$c^{*} = \arg\max_{c_i \in C} \; p(x_j \mid c_i)\,p(c_i)$$
  Calculate this value for all classes and choose the class that gives the maximum value; it is proportional to the posterior probability $p(c_i \mid x)$ (see the sketch below).
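Continuing the training sketch above, a possible annotation step under the same assumptions (the helper name is hypothetical):

```python
import numpy as np

def classify_nb_vq(x: np.ndarray, centroids: np.ndarray,
                   likelihoods: np.ndarray, priors: np.ndarray) -> int:
    """MAP classification (Eq 7.6) after discretizing x to its nearest cluster centroid."""
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid index
    scores = likelihoods[:, j] * priors                        # p(x_j | c_i) * p(c_i) per class
    return int(np.argmax(scores))                              # predicted class c*
```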
Figure 7.1: Image classification with Naïve Bayesian method
- Diagram Description: Illustrates the two main modules: Training and Annotation.
- Training Module:
  - Input images are fed into feature extraction.
  - The extracted features are clustered (visualized as groups of points), and a centroid $x_j$ is found for each cluster.
  - Model building calculates the likelihoods $p(x_j \mid c_i)$ (using Eq 7.7) for each centroid and class.
- Annotation Module:
  - The input image (a new image) goes through feature extraction.
  - Matching finds the training centroid closest to the input features $x$, and the corresponding likelihood $p(x_j \mid c)$ is retrieved.
  - The value $p(x_j \mid c)\,p(c)$ (proportional to the posterior) is calculated.
  - The MAP decision selects the class with the highest posterior; the output is the predicted label $c^{*}$.
- A "matching" arrow connects the centroids from training to the matching step in annotation, indicating the link between the learned model and its application.
7.2.2 NB with Independent Features
Assumption: The individual features $x_1, x_2, \dots, x_n$ within the feature vector $x$ are independent of each other, given the class $c$.
- This is the “naïve” part of Naïve Bayes - this assumption is often violated in practice but simplifies the model significantly and works surprisingly well.
Likelihood Calculation:
Under the independence assumption, the likelihood of the entire feature vector is the product of the likelihoods of its individual features:
$$p(x \mid c) = p(x_1, x_2, \dots, x_n \mid c) = \prod_{k=1}^{n} p(x_k \mid c)$$
(Eq 7.8)
Applicability:
- Useful when features are naturally independent or treated as such.
- Examples:
  - Nominal features from web crawling (e.g., presence/absence of keywords like 'tree', 'grass', 'sand') used to decide whether an image is "beach" vs. "non-beach".
  - Combining different types of numerical image features, where independence is assumed for computational simplicity (see the sketch below).
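A minimal sketch of Eq 7.8, assuming the per-feature likelihoods p(x_k | c) have already been estimated and are passed in as callables; computing in log space avoids underflow when many small probabilities are multiplied. All names are illustrative.

```python
import math
from typing import Callable, Sequence

def log_likelihood_independent(x: Sequence[float],
                               feature_likelihoods: Sequence[Callable[[float], float]]) -> float:
    """log p(x | c) = sum_k log p(x_k | c) under the naive independence assumption (Eq 7.8)."""
    return sum(math.log(p_k(x_k)) for x_k, p_k in zip(x, feature_likelihoods))
```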
7.2.3 NB with Bag of Features
Setup:
- An image is segmented into $R$ regions (or patches, blocks).
- Each region $r$ is represented by its own feature vector $x_r$.
- The image is represented as a bag (set) of these regional feature vectors: $X = \{x_1, x_2, \dots, x_R\}$. The spatial arrangement is ignored.
Assumption:
- The regions within the image are independent of each other, given the class $c$.
Conditional Probability Calculation:
The probability of observing the entire bag of features given the class $c$ is the product of the probabilities of observing each regional feature vector:
$$p(X \mid c) = p(x_1, x_2, \dots, x_R \mid c) = \prod_{r=1}^{R} p(x_r \mid c)$$
(Eq 7.9)
- Here, $p(x_r \mid c)$ could be modeled using the clustering approach from section 7.2.1 (mapping each $x_r$ to a cluster centroid) or using distributional assumptions if the features are continuous (see the sketch below).
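A minimal sketch of Eq 7.9 combined with the MAP rule, reusing the centroid-based likelihood table from the sketch in 7.2.1; the names and the log-space formulation are illustrative assumptions, not the source's implementation.

```python
import numpy as np

def classify_bag_of_features(regions: np.ndarray, centroids: np.ndarray,
                             likelihoods: np.ndarray, priors: np.ndarray) -> int:
    """MAP over classes with p(X | c) = prod_r p(x_r | c) (Eq 7.9), each region mapped to a centroid."""
    # Nearest centroid index for every regional feature vector x_r.
    dists = np.linalg.norm(regions[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Sum of log-likelihoods over regions, plus the log prior, for every class.
    eps = 1e-12  # guards against zero entries in the frequency table
    log_scores = np.log(priors + eps) + np.log(likelihoods[:, nearest] + eps).sum(axis=1)
    return int(log_scores.argmax())
```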
7.7 Image Classification with Gaussian Process (GP)
Alternative Perspective on Feature Vectors:
- GMM View: Treats a multidimensional feature vector as a single point in an $n$-dimensional space ($x \in \mathbb{R}^n$). Models the distribution of these points.
- GP View: Treats a multidimensional feature vector as a discretized function $f(i) = x_i$, where $i = 1, 2, \dots, n$. The indices $i$ represent points in some domain, and the feature values $x_i$ are the function outputs.
- Example: A histogram feature vector. The bin indices $i$ form the domain, and the bin values $x_i$ are the function outputs.
Figure 7.6: Feature vectors shown as functions
- Diagram Description: Shows three different histograms (normalized feature vectors) from the same class, represented by vertical bars (red, green, blue). Above the bars, corresponding continuous functions (curves in red, green, blue) are shown, visualizing how each histogram can be seen as a sample from an underlying function.
Analogy to Regression:
- Linear regression fits a line to a cluster of data points.
- Gaussian Process can be thought of as fitting a mean function and confidence band to a cluster of functions (where each function corresponds to a data sample like a histogram).
Figure 7.7: A cluster of multidimensional data (green) and the approximation function (pink)
- Diagram Description: Shows many data points (green dots) representing the values from multiple sample functions (histograms) plotted together. The pink curves show the learned Gaussian Process model: a mean function (center pink curve) and confidence bounds (outer pink curves) that capture the distribution of functions in this class.
Formal Setup:
- Data: A set of $N$ data samples $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ from a certain class $c$.
- Dimensionality: Each sample is a $D$-dimensional feature vector: $x^{(k)} = (x^{(k)}_1, x^{(k)}_2, \dots, x^{(k)}_D)$.
- Matrix Representation: Create a matrix $X$ of size $D \times N$. Each row $i$ ($i = 1, \dots, D$) represents the $i$-th dimension across all samples:
  $$X = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_D \end{bmatrix}$$
  (Eq 7.24) Here, $\mathbf{x}_i = (x^{(1)}_i, x^{(2)}_i, \dots, x^{(N)}_i)$ is the $i$-th row vector.
- Gaussian Process Definition:
  - Assume the elements within each dimensional vector $\mathbf{x}_i$ are samples from (or follow) a Normal distribution $\mathcal{N}(\mu_i, \sigma_i^2)$.
  - The entire matrix $X$ (viewed as a collection of function values) is modeled as a Gaussian Process (GP).
  - A GP is defined by a mean function $m(\cdot)$ and a covariance (kernel) function $k(\cdot, \cdot)$.
  - The data follows a multivariate Normal distribution: $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, K)$.
- Mean Vector ($\boldsymbol{\mu}$) and Covariance Matrix ($K$):
  - The mean vector contains the mean values for each dimension:
    $$\boldsymbol{\mu} = (\mu_1, \mu_2, \dots, \mu_D)$$
    (Eq 7.25) (Often simplified to assume a zero-mean prior.)
  - The covariance matrix $K$ (or $\Sigma$) describes the correlations between dimensions:
    $$K = \begin{bmatrix} k(1,1) & \cdots & k(1,D) \\ \vdots & \ddots & \vdots \\ k(D,1) & \cdots & k(D,D) \end{bmatrix}$$
    (Eq 7.26) Where $k(i,j)$ is the kernel function evaluating the covariance between dimension $i$ and dimension $j$. Typical kernels include the RBF kernel.
- Prediction with GP:
  - Given new data points $\mathbf{x}_*$.
  - Concatenate the observed data $\mathbf{x}$ and the new data $\mathbf{x}_*$. The joint distribution is also Gaussian:
    $$\begin{bmatrix} \mathbf{x} \\ \mathbf{x}_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^{T} & K_{**} \end{bmatrix} \right)$$
    (Eq 7.27, using the $K$ notation of Eq 7.26) Where $K$, $K_*$, and $K_{**}$ are covariance matrices computed using the kernel function between the respective data points.
  - The predictive distribution for the new data given the observed data is also Gaussian, $p(\mathbf{x}_* \mid \mathbf{x}) = \mathcal{N}(\bar{\boldsymbol{\mu}}_*, \bar{K}_*)$, where:
    - Predictive Mean: $\bar{\boldsymbol{\mu}}_* = \boldsymbol{\mu}_* + K_*^{T} K^{-1} (\mathbf{x} - \boldsymbol{\mu})$
    - Predictive Covariance: $\bar{K}_* = K_{**} - K_*^{T} K^{-1} K_*$ (Eq 7.28)
  - This allows predicting the values of the new data points and quantifying the uncertainty in the prediction (see the sketch below).
  - Proof referenced to Appendix [7–9].
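A minimal numpy sketch of GP prediction with an RBF kernel, assuming a zero-mean prior so that Eq 7.28 reduces to the two marked lines; the kernel parameters, helper names, and toy data are illustrative, not from the source.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float = 1.0) -> np.ndarray:
    """RBF covariance k(i, j) = exp(-(a_i - b_j)^2 / (2 * length_scale^2)) for 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_obs, y_obs, x_new, length_scale=1.0, jitter=1e-6):
    """Predictive mean and covariance of a zero-mean GP (Eq 7.28)."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + jitter * np.eye(len(x_obs))
    K_star = rbf_kernel(x_obs, x_new, length_scale)    # K_*
    K_ss = rbf_kernel(x_new, x_new, length_scale)      # K_**
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ y_obs                    # predictive mean
    cov = K_ss - K_star.T @ K_inv @ K_star             # predictive covariance
    return mean, cov

# Toy example: one histogram dimension observed at a few bin positions.
x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([0.10, 0.40, 0.35, 0.15])
mean, cov = gp_predict(x_obs, y_obs, np.array([1.5, 2.5]))
print(mean, np.sqrt(np.diag(cov)))
```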
7.8 Summary
Chapter Overview:
- Introduced Bayesian classification as the first image classification method.
- Described several important applications and variations:
- Naïve Bayes (NB)
- Word Co-occurrence (WCC) model
- Cross-Media Relevance Model (CMRM)
- Parametric model (using GMMs)
- Gaussian Process (GP) method
Key Features of Bayesian Classifiers:
- Generative:
- They are typically generative models.
- They assume an underlying distribution or model for the likelihood probability (e.g., $p(x \mid c)$).
- This likelihood model is usually learned or estimated from known training samples.
- Intuitive:
- Compared to “black-box” classifiers like SVM or Neural Networks (ANN), Bayesian classifiers are often more intuitive.
- The results can be more easily interpreted by humans.
- The core idea directly reflects the process of updating prior beliefs using new evidence via Bayes’ theorem.