7.1 Introduction
All Bayesian classification methods are fundamentally based on Bayes’ Theorem.
Bayes’ Theorem Formula:
Given two random events, $A$ and $B$, Bayes' Theorem is stated as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This can also be expanded using the law of total probability for the denominator, assuming $A$ and its complement $\bar{A}$ form a partition of the sample space:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}$$
(Eq 7.1)
Terminology:
- $A$ and $B$: Random events.
- $\bar{A}$: The complement of event $A$.
- $A$: Typically represents a hypothesis to be tested or predicted.
- $B$: Represents the new data, observation, or evidence used to predict $A$.
- $P(A)$: The prior probability of hypothesis $A$. This reflects our belief or knowledge before observing the evidence $B$.
- $P(B \mid A)$: The likelihood of observing evidence $B$ given that hypothesis $A$ is true. Represents experience or prior knowledge about the relationship between $A$ and $B$.
- $P(B)$: The observation probability or evidence probability. This is the overall probability of observing the evidence $B$, regardless of hypothesis $A$. It acts as a normalization constant.
- $P(A \mid B)$: The posterior probability of hypothesis $A$ after observing the evidence $B$. This is the updated belief about $A$.
Core Idea of Bayes’ Theorem:
- It allows converting the computation of the posterior probability $P(A \mid B)$ into a form involving the likelihood $P(B \mid A)$, which is often easier to estimate or compute.
- Extremely helpful when $A$ and $B$ are dependent events and predicting $A$ directly is difficult due to lack of information.
- Information about $B$ (evidence) helps predict $A$ (hypothesis) more accurately.
- Information needed (like $P(A)$ and $P(B \mid A)$) can often be obtained from historical data or experience (see the sketch below).
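To make the update concrete, here is a minimal Python sketch of Eq 7.1. The function name and the numbers plugged in at the bottom are made up for illustration; they are not taken from the examples below.

```python
def posterior(prior_a: float, lik_b_given_a: float, lik_b_given_not_a: float) -> float:
    """Posterior P(A|B) via Bayes' theorem with the total-probability denominator (Eq 7.1)."""
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1.0 - prior_a)
    return lik_b_given_a * prior_a / evidence

# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2.
print(posterior(0.3, 0.8, 0.2))  # ~0.632
```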
General Application Idea:
Predicting one event using other related events is a common practice (e.g., symptoms predict disease, clouds predict rain, rainfall predicts harvest). Bayes’ theorem provides a formal framework for this.
Example 1: Weather Prediction
- Goal: Predict if the weather will be fine (No Rain) for a weekend sports event, given the prediction of high humidity.
- Events: $R$ = Rain, $\bar{R}$ = No Rain, $H$ = High Humidity.
- Prior Information (from history/meteorology):
  - $P(H \mid R)$: the likelihood of high humidity given that it rains.
  - $P(H \mid \bar{R})$: the likelihood of high humidity given no rain. (Implicitly assumed in the PDF's calculation for this specific example, though not explicitly stated beforehand; the value used appears to be $P(H \mid \bar{R}) = 1$. This seems counter-intuitive, likely meaning that in this simplified scenario humidity is guaranteed to be high even when it does not rain, or there is a typo in the text/calculation setup. We follow the PDF's calculation flow.)
  - $P(R)$ and $P(\bar{R}) = 1 - P(R)$: the prior probabilities of rain and no rain.
- Evidence: We know there will be high humidity ($H$).
- Prediction: Calculate the chance of no rain given high humidity, $P(\bar{R} \mid H)$.
- Calculation (using Bayes' Theorem as presented in the PDF):
  $$P(\bar{R} \mid H) = \frac{P(H \mid \bar{R})\,P(\bar{R})}{P(H \mid R)\,P(R) + P(H \mid \bar{R})\,P(\bar{R})}$$
  (Note: The specific numbers substituted in the PDF's calculation do not follow consistently from its stated priors. The formula structure is correct, but the values applied do not cleanly lead to the PDF's final answer, so that result should be read with this caveat.)
- Conclusion: This posterior probability helps make a decision about the sports event. Prior information was crucial. In practice, more factors (temperature, pressure) are combined.
Example 2: Image Classification (Zebra)
- Goal: Determine how likely an image containing black and white stripes ($S$) is a zebra ($Z$). We want $P(Z \mid S)$.
- Challenge: Need evidence/statistics to determine this likelihood accurately (e.g., 70%, 80%, 99%?).
- Approach: Use statistics from a sample of images (an image database) to approximate population statistics.
- Events: $Z$ = the image is a zebra, $\bar{Z}$ = the image is not a zebra, $S$ = the image contains black and white stripes.
- Prior/Likelihood Information (from training set/database):
  - $P(S \mid Z)$: the likelihood of stripes given it's a zebra (taken to be 1, assuming all zebras have stripes).
  - $P(Z)$: the prior probability of an image being a zebra.
  - $P(S \mid \bar{Z})$: the likelihood of stripes given it's not a zebra (small but non-zero, since some other things might have stripes).
- Evidence: We detected black and white stripes ($S$) in a new image.
- Prediction: Calculate $P(Z \mid S)$.
- Calculation (using Bayes' Theorem):
  $$P(Z \mid S) = \frac{P(S \mid Z)\,P(Z)}{P(S \mid Z)\,P(Z) + P(S \mid \bar{Z})\,P(\bar{Z})}$$
  Note: $P(\bar{Z}) = 1 - P(Z)$.
- Conclusion: The resulting posterior is high, so there is a high chance the image is a zebra, i.e., high confidence in the classification.
Extension to Multiple Events (Hypotheses)
Bayes' theorem can be extended when the evidence $B$ could be related to one of $n$ mutually exclusive and exhaustive events $A_1, A_2, \dots, A_n$:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)}$$
Where the denominator is expanded using the law of total probability:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{k=1}^{n} P(B \mid A_k)\,P(A_k)}$$
(Eq 7.2)
- Use Case: Predicting which specific event $A_i$ is responsible for the observation $B$.
- Example: Fever ($F$) can be caused by many diseases (flu, infection, etc.). Each disease $A_i$ has a different probability of causing fever, $P(F \mid A_i)$. Given a patient has fever ($F$), Eq 7.2 can be used to find the probability they have a specific disease, e.g., flu: $P(\text{flu} \mid F)$ (see the sketch below).
- Note: In practice (like clinical diagnosis), multiple symptoms are often combined to increase diagnostic accuracy.
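As a sketch of Eq 7.2 for the fever example, the snippet below normalizes likelihood-times-prior over several candidate causes. The cause names, priors, and likelihoods are invented placeholders, not values from the source.

```python
import numpy as np

# Hypothetical priors P(A_i) and likelihoods P(F | A_i); placeholder values only.
priors = np.array([0.05, 0.02, 0.93])       # flu, infection, other
likelihoods = np.array([0.90, 0.70, 0.05])  # P(fever | each cause)

# Eq 7.2: posterior over all candidate causes, given that fever is observed.
joint = likelihoods * priors
posteriors = joint / joint.sum()
print(dict(zip(["flu", "infection", "other"], posteriors.round(3))))
```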
7.2 Naïve Bayesian Image Classification
Based on a simple application of Bayes’ theorem to numerical and high-dimensional image data.
7.2.1 NB Formulation
Setup:
- Given a set of images: $I = \{I_1, I_2, \dots, I_M\}$.
- Given a set of semantic classes: $C = \{c_1, c_2, \dots, c_K\}$ (these are the "hypotheses" or events).
- Each image belongs to one class in $C$.
- Each image is represented by a feature vector $x$ (this is the "evidence" or observation).
Goal:
Classify (or annotate) a given image (represented by its feature vector $x$) into one of the classes in $C$. We want to find the class $c_i$ that maximizes the posterior probability $p(c_i \mid x)$.
Applying Bayes' Theorem:
$$p(c_i \mid x) = \frac{p(x \mid c_i)\,p(c_i)}{p(x)}$$
(Eq 7.3)
Expanding the denominator (evidence probability):
$$p(c_i \mid x) = \frac{p(x \mid c_i)\,p(c_i)}{\sum_{k=1}^{K} p(x \mid c_k)\,p(c_k)}$$
(Eq 7.4)
Simplification:
- The denominator $p(x)$ is the same for all classes when classifying a specific image with features $x$.
- It acts as a scaling factor and is independent of the class we are considering.
- Let $\delta = 1/p(x)$. Then:
$$p(c_i \mid x) = \delta \, p(x \mid c_i)\,p(c_i)$$
(Eq 7.5)
Decision Rule: Maximum A Posteriori (MAP)
To decide the class of an image with features $x$, we choose the class with the highest posterior probability. Since $\delta$ is constant for a given $x$, maximizing $p(c_i \mid x)$ is equivalent to maximizing the numerator $p(x \mid c_i)\,p(c_i)$:
$$c^{*} = \arg\max_{c_i \in C} \; p(x \mid c_i)\,p(c_i)$$
(Eq 7.6)
Where $c^{*}$ is the predicted class.
Components:
- Prior Probability $p(c_i)$:
  - Often assumed to be uniform for all classes ($p(c_i) = 1/K$) if there's no prior knowledge favoring any class.
  - Alternatively, it can be estimated from the training data as the frequency or proportion of training samples belonging to class $c_i$.
- Likelihood $p(x \mid c_i)$:
  - The probability of observing the feature vector $x$ given that the image belongs to class $c_i$.
  - This is the core component that needs to be modeled or learned from the training data.
  - Since image features are often numerical and continuous, they typically need to be discretized or modeled using probability distributions before computing the likelihood.
Procedure for Likelihood Modeling (Typical Practice):
Described in the following Training section.
Training Steps:
- Create Training Database: Collect images from all classes in $C$.
- Feature Extraction & Clustering:
  - Extract feature vectors from all training images.
  - Cluster these feature vectors into clusters using a vector quantization (VQ) algorithm (e.g., K-means). Each cluster represents a region in the feature space.
- Compute Centroids: Calculate the centroid $x_j$ for each cluster $j$. The centroid acts as a representative feature vector for all samples in that cluster.
- Calculate Likelihoods: Estimate the likelihood $p(x_j \mid c_i)$ (the probability of a feature vector belonging to cluster $j$ given it's from class $c_i$). This is done by calculating the frequency (see the sketch after these steps):
  $$p(x_j \mid c_i) = \frac{\text{number of class-}c_i \text{ training samples in cluster } j}{\text{total number of class-}c_i \text{ training samples}}$$
  (Eq 7.7)
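A minimal sketch of these training steps, assuming scikit-learn's KMeans for the vector quantization; the function and variable names are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_nb_vq(features: np.ndarray, labels: np.ndarray, n_clusters: int, n_classes: int):
    """Cluster training features and estimate p(x_j | c_i) by frequency counting (Eq 7.7)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    cluster_ids = kmeans.labels_                      # cluster index j of each training sample
    likelihoods = np.zeros((n_classes, n_clusters))   # rows: classes c_i, columns: clusters j
    priors = np.zeros(n_classes)
    for c in range(n_classes):
        in_class = labels == c
        priors[c] = in_class.mean()                   # p(c_i) estimated as the class frequency
        counts = np.bincount(cluster_ids[in_class], minlength=n_clusters)
        likelihoods[c] = counts / counts.sum()        # p(x_j | c_i), Eq 7.7
    return kmeans.cluster_centers_, likelihoods, priors
```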
Classification/Annotation Steps:
- Input: A new image with its extracted feature vector $x$.
- Match Feature: Find the closest cluster centroid $x_j$ to the input feature vector $x$ (e.g., using Euclidean distance). This step effectively discretizes the input feature by mapping $x$ to $x_j$.
- Apply MAP: Use the MAP decision rule (Eq 7.6), replacing the continuous likelihood $p(x \mid c_i)$ with the discretized likelihood $p(x_j \mid c_i)$ calculated during training (Eq 7.7):
  $$c^{*} = \arg\max_{c_i \in C} \; p(x_j \mid c_i)\,p(c_i)$$
  Calculate this value for all classes and choose the class that gives the maximum value; it is proportional to the posterior probability $p(c_i \mid x)$ (see the sketch below).
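Continuing the training sketch above, a possible annotation step under the same assumptions (the helper name is hypothetical):

```python
import numpy as np

def classify_nb_vq(x: np.ndarray, centroids: np.ndarray,
                   likelihoods: np.ndarray, priors: np.ndarray) -> int:
    """MAP classification (Eq 7.6) after discretizing x to its nearest cluster centroid."""
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid index
    scores = likelihoods[:, j] * priors                        # p(x_j | c_i) * p(c_i) per class
    return int(np.argmax(scores))                              # predicted class c*
```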
Figure 7.1: Image classification with Naïve Bayesian method
- Diagram Description: Illustrates the two main modules: Training and Annotation.
- Training Module:
  - Input images are fed into feature extraction.
  - The extracted features are clustered (visualized as groups of points), and a centroid $x_j$ is found for each cluster.
  - Model building calculates the likelihoods $p(x_j \mid c_i)$ (using Eq 7.7) for each centroid and class.
- Annotation Module:
  - The input image (a new image) goes through feature extraction.
  - Matching finds the training centroid closest to the input features $x$, and the corresponding likelihood $p(x_j \mid c)$ is retrieved.
  - The value $p(x_j \mid c)\,p(c)$ (proportional to the posterior) is calculated.
  - The MAP decision selects the class with the highest posterior; the output is the predicted label $c^{*}$.
- A "matching" arrow connects the centroids from training to the matching step in annotation, indicating the link between the learned model and its application.
7.2.2 NB with Independent Features
Assumption: The individual features $x_1, x_2, \dots, x_n$ within the feature vector $x$ are independent of each other, given the class $c$.
- This is the “naïve” part of Naïve Bayes - this assumption is often violated in practice but simplifies the model significantly and works surprisingly well.
Likelihood Calculation:
Under the independence assumption, the likelihood of the entire feature vector is the product of the likelihoods of its individual features:
$$p(x \mid c) = p(x_1, x_2, \dots, x_n \mid c) = \prod_{k=1}^{n} p(x_k \mid c)$$
(Eq 7.8)
Applicability:
- Useful when features are naturally independent or treated as such.
- Examples:
  - Nominal features from web crawling (e.g., presence/absence of keywords like 'tree', 'grass', 'sand') used to decide whether an image is "beach" vs. "non-beach".
  - Combining different types of numerical image features, where independence is assumed for computational simplicity (see the sketch below).
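A minimal sketch of Eq 7.8, assuming the per-feature likelihoods p(x_k | c) have already been estimated and are passed in as callables; computing in log space avoids underflow when many small probabilities are multiplied. All names are illustrative.

```python
import math
from typing import Callable, Sequence

def log_likelihood_independent(x: Sequence[float],
                               feature_likelihoods: Sequence[Callable[[float], float]]) -> float:
    """log p(x | c) = sum_k log p(x_k | c) under the naive independence assumption (Eq 7.8)."""
    return sum(math.log(p_k(x_k)) for x_k, p_k in zip(x, feature_likelihoods))
```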
7.2.3 NB with Bag of Features
Setup:
- An image is segmented into $R$ regions (or patches, blocks).
- Each region $r$ is represented by its own feature vector $x_r$.
- The image is represented as a bag (set) of these regional feature vectors: $X = \{x_1, x_2, \dots, x_R\}$. The spatial arrangement is ignored.
Assumption:
- The regions within the image are independent of each other, given the class $c$.
Conditional Probability Calculation:
The probability of observing the entire bag of features given the class $c$ is the product of the probabilities of observing each regional feature vector:
$$p(X \mid c) = p(x_1, x_2, \dots, x_R \mid c) = \prod_{r=1}^{R} p(x_r \mid c)$$
(Eq 7.9)
- Here, $p(x_r \mid c)$ could be modeled using the clustering approach from section 7.2.1 (mapping each $x_r$ to a cluster centroid) or using distributional assumptions if the features are continuous (see the sketch below).
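A minimal sketch of Eq 7.9 combined with the MAP rule, reusing the centroid-based likelihood table from the sketch in 7.2.1; the names and the log-space formulation are illustrative assumptions, not the source's implementation.

```python
import numpy as np

def classify_bag_of_features(regions: np.ndarray, centroids: np.ndarray,
                             likelihoods: np.ndarray, priors: np.ndarray) -> int:
    """MAP over classes with p(X | c) = prod_r p(x_r | c) (Eq 7.9), each region mapped to a centroid."""
    # Nearest centroid index for every regional feature vector x_r.
    dists = np.linalg.norm(regions[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Sum of log-likelihoods over regions, plus the log prior, for every class.
    eps = 1e-12  # guards against zero entries in the frequency table
    log_scores = np.log(priors + eps) + np.log(likelihoods[:, nearest] + eps).sum(axis=1)
    return int(log_scores.argmax())
```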
7.7 Image Classification with Gaussian Process (GP)
Alternative Perspective on Feature Vectors:
- GMM View: Treats a multidimensional feature vector as a single point in an $n$-dimensional space ($x \in \mathbb{R}^n$). Models the distribution of these points.
- GP View: Treats a multidimensional feature vector as a discretized function $f(i) = x_i$, where $i = 1, 2, \dots, n$. The indices $i$ represent points in some domain, and the feature values $x_i$ are the function outputs.
- Example: A histogram feature vector. The bin indices $i$ form the domain, and the bin values $x_i$ are the function outputs.
Figure 7.6: Feature vectors shown as functions
- Diagram Description: Shows three different histograms (normalized feature vectors) from the same class, represented by vertical bars (red, green, blue). Above the bars, corresponding continuous functions (curves in red, green, blue) are shown, visualizing how each histogram can be seen as a sample from an underlying function.
Analogy to Regression:
- Linear regression fits a line to a cluster of data points.
- Gaussian Process can be thought of as fitting a mean function and confidence band to a cluster of functions (where each function corresponds to a data sample like a histogram).
Figure 7.7: A cluster of multidimensional data (green) and the approximation function (pink)
- Diagram Description: Shows many data points (green dots) representing the values from multiple sample functions (histograms) plotted together. The pink curves show the learned Gaussian Process model: a mean function (center pink curve) and confidence bounds (outer pink curves) that capture the distribution of functions in this class.
Formal Setup:
- Data: A set of $N$ data samples $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ from a certain class $c$.
- Dimensionality: Each sample is a $D$-dimensional feature vector: $x^{(k)} = (x^{(k)}_1, x^{(k)}_2, \dots, x^{(k)}_D)$.
- Matrix Representation: Create a matrix $X$ of size $D \times N$. Each row $i$ ($i = 1, \dots, D$) represents the $i$-th dimension across all samples:
  $$X = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_D \end{bmatrix}$$
  (Eq 7.24) Here, $\mathbf{x}_i = (x^{(1)}_i, x^{(2)}_i, \dots, x^{(N)}_i)$ is the $i$-th row vector.
- Gaussian Process Definition:
  - Assume the elements within each dimensional vector $\mathbf{x}_i$ are samples from (or follow) a Normal distribution $\mathcal{N}(\mu_i, \sigma_i^2)$.
  - The entire matrix $X$ (viewed as a collection of function values) is modeled as a Gaussian Process (GP).
  - A GP is defined by a mean function $m(\cdot)$ and a covariance (kernel) function $k(\cdot, \cdot)$.
  - The data follows a multivariate Normal distribution: $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, K)$.
- Mean Vector ($\boldsymbol{\mu}$) and Covariance Matrix ($K$):
  - The mean vector contains the mean values for each dimension:
    $$\boldsymbol{\mu} = (\mu_1, \mu_2, \dots, \mu_D)$$
    (Eq 7.25) (Often simplified to assume a zero-mean prior.)
  - The covariance matrix $K$ (or $\Sigma$) describes the correlations between dimensions:
    $$K = \begin{bmatrix} k(1,1) & \cdots & k(1,D) \\ \vdots & \ddots & \vdots \\ k(D,1) & \cdots & k(D,D) \end{bmatrix}$$
    (Eq 7.26) Where $k(i,j)$ is the kernel function evaluating the covariance between dimension $i$ and dimension $j$. Typical kernels include the RBF kernel.
- Prediction with GP:
  - Given new data points $\mathbf{x}_*$.
  - Concatenate the observed data $\mathbf{x}$ and the new data $\mathbf{x}_*$. The joint distribution is also Gaussian:
    $$\begin{bmatrix} \mathbf{x} \\ \mathbf{x}_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^{T} & K_{**} \end{bmatrix} \right)$$
    (Eq 7.27, using the $K$ notation of Eq 7.26) Where $K$, $K_*$, and $K_{**}$ are covariance matrices computed using the kernel function between the respective data points.
  - The predictive distribution for the new data given the observed data is also Gaussian, $p(\mathbf{x}_* \mid \mathbf{x}) = \mathcal{N}(\bar{\boldsymbol{\mu}}_*, \bar{K}_*)$, where:
    - Predictive Mean: $\bar{\boldsymbol{\mu}}_* = \boldsymbol{\mu}_* + K_*^{T} K^{-1} (\mathbf{x} - \boldsymbol{\mu})$
    - Predictive Covariance: $\bar{K}_* = K_{**} - K_*^{T} K^{-1} K_*$ (Eq 7.28)
  - This allows predicting the values of the new data points and quantifying the uncertainty in the prediction (see the sketch below).
  - Proof referenced to Appendix [7–9].
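A minimal numpy sketch of GP prediction with an RBF kernel, assuming a zero-mean prior so that Eq 7.28 reduces to the two marked lines; the kernel parameters, helper names, and toy data are illustrative, not from the source.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float = 1.0) -> np.ndarray:
    """RBF covariance k(i, j) = exp(-(a_i - b_j)^2 / (2 * length_scale^2)) for 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_obs, y_obs, x_new, length_scale=1.0, jitter=1e-6):
    """Predictive mean and covariance of a zero-mean GP (Eq 7.28)."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + jitter * np.eye(len(x_obs))
    K_star = rbf_kernel(x_obs, x_new, length_scale)    # K_*
    K_ss = rbf_kernel(x_new, x_new, length_scale)      # K_**
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ y_obs                    # predictive mean
    cov = K_ss - K_star.T @ K_inv @ K_star             # predictive covariance
    return mean, cov

# Toy example: one histogram dimension observed at a few bin positions.
x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([0.10, 0.40, 0.35, 0.15])
mean, cov = gp_predict(x_obs, y_obs, np.array([1.5, 2.5]))
print(mean, np.sqrt(np.diag(cov)))
```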
7.8 Summary
Chapter Overview:
- Introduced Bayesian classification as the first image classification method.
- Described several important applications and variations:
- Naïve Bayes (NB)
- Word Co-occurrence (WCC) model
- Cross-Media Relevance Model (CMRM)
- Parametric model (using GMMs)
- Gaussian Process (GP) method
Key Features of Bayesian Classifiers:
- Generative:
- They are typically generative models.
- They assume an underlying distribution or model for the likelihood probability (e.g., $p(x \mid c)$).
- This likelihood model is usually learned or estimated from known training samples.
- Intuitive:
- Compared to “black-box” classifiers like SVM or Neural Networks (ANN), Bayesian classifiers are often more intuitive.
- The results can be more easily interpreted by humans.
- The core idea directly reflects the process of updating prior beliefs using new evidence via Bayes’ theorem.