Problem: Massive numbers of digital images online are useless without organization.
Image Classification:
What: Organizing images into different classes based on image features. Typically assigns a single main label (e.g., ‘mountain’).
Why: To enable searching and retrieval.
How (Basic): Extract features (e.g., color histogram) from an unknown image and compare them to features of known, labeled images. Assign the label of the best match.
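A minimal sketch of this nearest-example matching, assuming images arrive as (H, W, 3) uint8 NumPy arrays; the 8-bins-per-channel histogram and L1 distance are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels, count how often each
    (r, g, b) bin combination occurs, and normalize to a distribution."""
    quantized = (image // (256 // bins)).reshape(-1, 3)
    hist = np.zeros((bins, bins, bins))
    for r, g, b in quantized:
        hist[r, g, b] += 1
    return hist.ravel() / hist.sum()

def classify_nearest(unknown, labeled_examples):
    """Assign the label of the known (image, label) pair whose histogram
    is closest (L1 distance) to the unknown image's histogram."""
    h = color_histogram(unknown)
    best = min(labeled_examples,
               key=lambda ex: np.abs(h - color_histogram(ex[0])).sum())
    return best[1]  # label of the best match
```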
Image Annotation:
What: Labeling an image with multiple relevant semantic keywords (e.g., ‘mountain’, ‘sky’, ‘snow’, ‘trees’). Classifying an image into multiple classes.
How: Often uses Multiple Instance Learning (MIL) where an image is a ‘bag’ of features/instances. If any instance is positive for a label, the bag (image) gets the label.
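The MIL bag rule fits in a few lines; `instance_classifiers` below is a hypothetical mapping from each keyword to a per-instance predicate, used only to illustrate the any-instance-positive rule:

```python
def label_bag(instances, instance_classifiers):
    """MIL labeling rule: the image (a 'bag' of region/feature instances)
    receives a keyword if ANY of its instances is positive for it."""
    return {word for word, is_positive in instance_classifiers.items()
            if any(is_positive(x) for x in instances)}
```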
Relation: Closely related tasks; good classification helps annotation, and vice versa.
Challenge: Simple matching to one example image is unreliable as one image doesn’t represent the whole class well. Need robust classifiers trained on many examples.
III.1.1 Generative Approach
What: Builds a model representing how data for each class is generated. Often a probabilistic distribution P(features∣class).
Why (Analogy): Like having an abstract concept/model for each object type (e.g., an “ideal apple”).
How: Learn the probability distribution from many sample images of the class (e.g., using Gaussian Mixture Models, Bayesian methods).
Use: Classify a new image by finding which class model most likely generated its features (using Bayes’ theorem to find P(class∣features)).
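A sketch of this pipeline using Gaussian Mixture Models, assuming per-class training features as NumPy arrays and scikit-learn for the GMM fit; the component count is an arbitrary placeholder:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_generative(features_by_class, n_components=5):
    """Fit one GMM per class as a model of P(features | class); use the
    class frequency in the training set as the prior P(class)."""
    total = sum(len(f) for f in features_by_class.values())
    models = {}
    for c, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components).fit(feats)
        models[c] = (gmm, len(feats) / total)
    return models

def classify(x, models):
    """Bayes' theorem in log space: pick the class maximizing
    log P(x | C) + log P(C), i.e. the most probable generator of x."""
    scores = {c: gmm.score_samples(x.reshape(1, -1))[0] + np.log(prior)
              for c, (gmm, prior) in models.items()}
    return max(scores, key=scores.get)
```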
III.1.2 Discriminative Approach
What: Learns a decision boundary directly between different classes in the feature space. Doesn’t explicitly model each class.
Why: Focuses on separating classes rather than describing them individually.
How: Collect training data from different classes. Find an optimal separator (e.g., hyperplane) that distinguishes between the classes based on feature similarity/difference.
Use: Classify a new image based on which side of the decision boundary its features fall.
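For contrast, a discriminative sketch with a linear SVM; the random arrays stand in for real image feature vectors and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder training data: rows are image feature vectors, y their labels.
X_train = np.random.rand(200, 64)
y_train = np.random.randint(0, 2, 200)

# Learn a separating hyperplane directly; no per-class density is modeled.
clf = LinearSVC().fit(X_train, y_train)

# A new image is classified by which side of the hyperplane it lands on.
x_new = np.random.rand(1, 64)
print(clf.predict(x_new))
```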
Chapter 7: Bayesian Classification
7.1 Introduction (Bayes’ Theorem)
What: A mathematical formula relating conditional probabilities. Foundation for Bayesian classifiers.
Formula: P(A∣B) = P(B∣A) P(A) / P(B)
P(A∣B): Posterior probability (probability of hypothesis A given evidence B).
P(B∣A): Likelihood (probability of evidence B given hypothesis A).
P(A): Prior probability (initial belief in hypothesis A).
P(B): Evidence (probability of observing B).
Why: Allows updating beliefs (prior → posterior) based on new evidence. Often easier to estimate P(B∣A) than P(A∣B) directly. Incorporates prior knowledge.
Use: Predict events based on related information (e.g., disease from symptoms, image class from features).
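A worked instance of the disease-from-symptoms example, with made-up numbers:

```python
# Hypothetical numbers: 1% prevalence, a test that detects 90% of cases
# but also fires on 5% of healthy people.
p_disease = 0.01            # prior P(A)
p_pos_given_disease = 0.90  # likelihood P(B|A)
p_pos_given_healthy = 0.05  # P(B|not A)

# Evidence P(B) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) = P(B|A) P(A) / P(B)  ->  about 0.154
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)
```

Even with a positive test, the posterior stays low because the prior is small; this is the prior → posterior update the theorem formalizes.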
7.2 Naïve Bayesian (NB) Image Classification
What: A simple classification method applying Bayes’ theorem to image features.
Goal: Given image features x, find the most probable class Ci. Maximize the posterior P(Ci∣x).
MAP Criterion: Choose the class Ĉ that maximizes P(x∣Ci)P(Ci) (since P(x) is constant for all classes).
Ĉ = argmax_i {P(x∣Ci) P(Ci)}
How (Likelihood P(x∣Ci) Estimation):
Discretization/VQ: Cluster features from all training images (Vector Quantization). x is mapped to its nearest cluster centroid xj. The likelihood P(x∣Ci) is then approximated by P(xj∣Ci), the proportion of class-Ci samples within cluster j. (Fig 7.1)
Naïve Assumption (Independent Features): Assumes the feature vector components x = (x1, ..., xm) are independent given the class, which simplifies the likelihood: P(x∣Ci) = ∏_{j=1}^{m} P(xj∣Ci).
Bag of Features (BoF): Image I is represented as a set of independent region features {x1, ..., xk}. Likelihood: P(I∣Ci) = ∏_{j=1}^{k} P(xj∣Ci).
Use: Basic image classification.
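A minimal NB sketch combining the VQ and bag-of-features ideas above, assuming each training image arrives as an array of region feature vectors; KMeans as the quantizer and add-one smoothing are illustrative choices:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def train_nb(region_features, labels, n_words=100):
    """Vector-quantize all training region features into visual words,
    then estimate P(word | class) by counting, with add-one smoothing."""
    kmeans = KMeans(n_clusters=n_words).fit(np.vstack(region_features))
    counts = {c: np.ones(n_words) for c in set(labels)}  # Laplace smoothing
    for regions, c in zip(region_features, labels):
        for w in kmeans.predict(regions):
            counts[c][w] += 1
    likelihoods = {c: v / v.sum() for c, v in counts.items()}
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    return kmeans, likelihoods, priors

def classify_nb(regions, kmeans, likelihoods, priors):
    """Bag-of-features NB: maximize log P(Ci) + sum_j log P(xj | Ci)."""
    words = kmeans.predict(regions)
    scores = {c: np.log(priors[c]) + np.log(likelihoods[c][words]).sum()
              for c in priors}
    return max(scores, key=scores.get)
```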
7.3 Image Annotation with Word Co-occurrence (WCC)
What: An early method for explicit image annotation (assigning multiple keywords).
Goal: Link visual features (from image blocks) to semantic words (labels).
How:
Divide labeled training images into blocks.
Cluster blocks using VQ to create visual words (VWs).
For each VW cluster ci, build a histogram of co-occurring text words P(wj∣ci).
Annotate a new image: find nearest VWs for its blocks, sum their word histograms, select words from top histogram bins.
Key Idea: Learns the probability of a semantic word given a visual word/cluster (P(w∣c)). Allows multi-label assignment.
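A sketch of WCC training and annotation under comparable assumptions (captions as lists of words, image blocks as feature arrays); names like `train_wcc` are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_wcc(block_features, captions, n_vw=200):
    """For each visual-word cluster, accumulate a histogram of the text
    words that co-occur with it across the training images."""
    vocab = sorted({w for cap in captions for w in cap})
    word_idx = {w: i for i, w in enumerate(vocab)}
    kmeans = KMeans(n_clusters=n_vw).fit(np.vstack(block_features))
    hist = np.zeros((n_vw, len(vocab)))
    for blocks, cap in zip(block_features, captions):
        for vw in kmeans.predict(blocks):
            for w in cap:
                hist[vw, word_idx[w]] += 1
    return kmeans, hist, vocab

def annotate(blocks, kmeans, hist, vocab, top=4):
    """Sum the word histograms of the image's nearest visual words and
    return the words in the top bins."""
    summed = hist[kmeans.predict(blocks)].sum(axis=0)
    return [vocab[i] for i in np.argsort(summed)[::-1][:top]]
```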
7.7 Image Classification with Gaussian Process (GP)
What: Treats a feature vector (e.g., histogram) as a function and uses Gaussian Processes to model distributions over these functions for classification.
How:
Assume feature vectors from a class are sample functions drawn from a GP.
A GP is defined by a mean function and a covariance (kernel) function k(di,dj) describing similarity between dimensions/points of the function.
Learn the GP parameters from training data X.
Predict the probability of a new feature vector X∗ belonging to the class using the conditional GP distribution P(X∗∣X).
Use: Classification method that models the underlying function generating the features, capturing dependencies between feature dimensions.
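One way to make this concrete: score a new histogram under a GP whose index set is the histogram's dimensions. The empirical mean function and the RBF kernel over dimension indices below are simplifying assumptions; the chapter's actual method learns the GP parameters from the training data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(d, length_scale=2.0):
    """Covariance k(i, j) between histogram dimensions i and j: nearby
    bins are assumed to co-vary more strongly than distant ones."""
    idx = np.arange(d)
    return np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / length_scale) ** 2)

def gp_class_score(x_new, X_class, noise=1e-3):
    """Log density of a new feature vector under a GP whose mean function
    is the class's average histogram and whose covariance is the RBF
    kernel over dimension indices plus observation noise."""
    d = X_class.shape[1]
    mean = X_class.mean(axis=0)
    cov = rbf_kernel(d) + noise * np.eye(d)
    return multivariate_normal.logpdf(x_new, mean=mean, cov=cov)
```

A higher score means the class's GP is a more plausible generator of X∗; comparing scores across classes yields the classification.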
7.8 Summary of Bayesian Classifiers
Generative: Models how the data of each class is generated (P(x∣C)).
Intuitive: Incorporates prior knowledge; results are often interpretable.