Binary Classification

Problem Definition

  • Given: Training data consisting of pairs $(\mathbf{x}_i, y_i)$ for $i = 1, \dots, N$.
    • $\mathbf{x}_i$ is the feature vector for the $i$-th data point (input).
    • $y_i \in \{-1, +1\}$ is the class label for the $i$-th data point (output).
  • Goal: Learn a classifier function $f(\mathbf{x})$ such that it can predict the class label for new, unseen data points.
  • Desired Property: For the training data, the classifier should ideally satisfy: $f(\mathbf{x}_i) > 0$ if $y_i = +1$ and $f(\mathbf{x}_i) < 0$ if $y_i = -1$.
  • Correct Classification Condition: This can be concisely written as $y_i f(\mathbf{x}_i) > 0$ for a correct classification. (Note: The boundary case $f(\mathbf{x}_i) = 0$ is sometimes included in the positive class, as shown in the slide definition.)
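To make the condition concrete, here is a minimal Python sketch (the weight vector, bias, and data below are illustrative placeholders, not from the slides) that checks $y_i f(\mathbf{x}_i) > 0$ for a linear scoring function:

```python
import numpy as np

# Check the correct-classification condition y_i * f(x_i) > 0 for a linear f.
# w, b, X, y are illustrative toy values.
w, b = np.array([1.0, -1.0]), 0.5
f = lambda x: w @ x + b           # f(x) = w.x + b

X = np.array([[2.0, 1.0],         # should score positive
              [0.0, 2.0]])        # should score negative
y = np.array([+1, -1])

scores = np.array([f(x) for x in X])
print(y * scores > 0)             # [ True  True ] -> both correctly classified
```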

Review of Linear Classifiers

Linear Separability

  • Definition: A dataset is linearly separable if there exists a linear discriminant function (a line in 2D, a plane in 3D, a hyperplane in higher dimensions) that can perfectly separate the data points of the different classes.

Linear Classifiers: Definition and Form

  • General Form: A linear classifier has the discriminant function $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$:
    • $\mathbf{w}$: Weight vector.
    • $b$: Bias (or intercept).
    • $\mathbf{x}$: Input feature vector.
  • Decision Boundary: The boundary between the classes is defined by the equation $f(\mathbf{x}) = 0$, which is: $\mathbf{w}^T \mathbf{x} + b = 0$.
  • Geometric Interpretation:
    • In 2D: The decision boundary is a line.
      • $\mathbf{w}$ is the vector normal (perpendicular) to the line.
      • $b$ determines the offset of the line from the origin.
      • Points on one side of the line have $f(\mathbf{x}) > 0$, points on the other side have $f(\mathbf{x}) < 0$.
    • In 3D: The decision boundary is a plane. $\mathbf{w}$ is the normal vector to the plane.
    • In nD (n > 3): The decision boundary is a hyperplane. $\mathbf{w}$ is the normal vector to the hyperplane.
  • Comparison with K-NN:
    • K-NN: Requires storing (carrying) the entire training dataset to make predictions.
    • Linear Classifier: Uses the training data only to learn the parameters $\mathbf{w}$ and $b$. After learning, the training data is discarded. Only $\mathbf{w}$ and $b$ are needed for classifying new data.
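As a minimal sketch of this contrast (the weights below are hand-picked placeholders, not learned from any data set), prediction with a linear classifier needs only $\mathbf{w}$ and $b$:

```python
import numpy as np

# Once (w, b) are learned, classification needs nothing else -- unlike k-NN,
# which must keep the whole training set around. These values are hand-picked.
w = np.array([0.8, -0.3])   # normal vector to the decision boundary
b = -0.2                    # offset of the boundary from the origin

def predict(x):
    """Classify by the sign of the discriminant f(x) = w.x + b."""
    return 1 if w @ x + b >= 0 else -1

print(predict(np.array([1.0, 0.5])))   # +1
print(predict(np.array([-1.0, 1.0])))  # -1
```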

The Perceptron Classifier

  • Goal: Given linearly separable data with labels $y_i \in \{-1, +1\}$, find a weight vector $\mathbf{w}$ and bias $b$ such that the discriminant function $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ correctly separates all data points, i.e., $y_i f(\mathbf{x}_i) > 0$ for all $i$.
  • Finding the Separating Hyperplane: The Perceptron Algorithm provides a method.

The Perceptron Algorithm

  1. Homogeneous Coordinates (Optional but convenient):
    • Rewrite the classifier as $f(\mathbf{x}) = \sum_{j=1}^{d} w_j x_j + w_0$, where $w_1, \dots, w_d$ are the weights for the original features and $w_0$ is the bias $b$.
    • Define an augmented weight vector $\tilde{\mathbf{w}} = (w_0, w_1, \dots, w_d)^T$ and an augmented feature vector $\tilde{\mathbf{x}} = (1, x_1, \dots, x_d)^T$.
    • The classifier becomes $f(\mathbf{x}) = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}$.
    • Note: The slides use this notation implicitly. We will use $\mathbf{w}$ for the potentially augmented weight vector and $\mathbf{x}$ for the potentially augmented input.
  2. Initialization: Initialize the weight vector $\mathbf{w}$ (e.g., $\mathbf{w} = \mathbf{0}$). Set a learning rate $\eta$ (often $\eta = 1$).
  3. Cycle through Data: Iterate through the data points repeatedly.
  4. Check for Misclassification: For the current point $(\mathbf{x}_i, y_i)$, calculate $f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i$. The point is misclassified if $y_i f(\mathbf{x}_i) \le 0$.
  5. Update Rule: If $(\mathbf{x}_i, y_i)$ is misclassified, update $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i$.
    • Note on the update: This is the standard Perceptron update; the form written on the slide appears non-standard. Adding $\eta\, y_i \mathbf{x}_i$ adjusts $\mathbf{w}$ in the direction that increases $\mathbf{w}^T \mathbf{x}_i$ if $y_i = +1$ and decreases it if $y_i = -1$, pushing the misclassified point toward the correct side. (The Page 7 example, which subtracts $\eta\, \mathbf{x}_i$ for a point with $y_i = -1$, is consistent with this update.)
  6. Termination: Repeat steps 3-5 until no misclassifications occur during a full cycle through the data. A minimal implementation sketch follows.
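The sketch below implements the algorithm in homogeneous coordinates on toy data, using the standard update $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i$ (this is not the slides' code):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron in homogeneous coordinates; returns the augmented weight vector."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1 to absorb the bias
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X_aug, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += eta * yi * xi        # standard Perceptron update
                mistakes += 1
        if mistakes == 0:                 # full pass with no errors: converged
            break
    return w                              # w[0] is the bias b, w[1:] the weights

# Toy, linearly separable data.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))   # matches y
```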

Perceptron Example (2D - Page 7)

  • Illustrates the update rule.
  • Before Update: Shows a data point $\mathbf{x}_i$ (blue circle, assume $y_i = -1$) misclassified by the current boundary defined by $\mathbf{w}$ (it lies on the wrong side of the line, so $y_i\, \mathbf{w}^T \mathbf{x}_i \le 0$).
  • Update: Since $y_i = -1$, the update is $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \mathbf{x}_i$. This moves the weight vector away from $\mathbf{x}_i$.
  • After Update: Shows the new weight vector and the corresponding shifted decision boundary, which is now closer to classifying $\mathbf{x}_i$ correctly.
  • Final Weights: After convergence, the final weight vector can be expressed as a linear combination of the training data points: $\mathbf{w} = \sum_{i} \alpha_i\, y_i\, \mathbf{x}_i$, where $\alpha_i$ reflects how often point $\mathbf{x}_i$ contributed to an update (see the sketch below).
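A small extension of the earlier Perceptron sketch (again on toy data) counts how often each point triggers an update and confirms that the final weights are the corresponding linear combination of the (augmented) data points:

```python
import numpy as np

def perceptron_with_counts(X, y, eta=1.0, max_epochs=100):
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X_aug.shape[1])
    alpha = np.zeros(X.shape[0])          # alpha_i = number of updates caused by point i
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X_aug, y)):
            if yi * (w @ xi) <= 0:
                w += eta * yi * xi
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return w, alpha

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, alpha = perceptron_with_counts(X, y)
X_aug = np.hstack([np.ones((4, 1)), X])
# Since w starts at zero and eta = 1, w = sum_i alpha_i * y_i * x_i (augmented).
print(np.allclose(w, (alpha * y) @ X_aug))   # True
```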

Perceptron Properties

  • Convergence Theorem: If the training data is linearly separable, the Perceptron algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps.
  • Convergence Speed: Convergence can be slow.
  • Solution Quality: The algorithm finds any separating hyperplane, not necessarily the “best” one. The resulting separating line can be very close to the training data points.
  • Generalization: A line close to the data points might not generalize well to new, unseen data. We would prefer a solution with a larger margin.

Support Vector Machine (SVM) Classifier

The Concept of Margin

  • Problem: For linearly separable data, there can be infinitely many separating hyperplanes. Which one is the best?
  • Idea: Choose the hyperplane that maximizes the margin.
  • Margin: The margin is the distance between the separating hyperplane and the closest data point(s) from either class. The total width of the “empty” slab around the decision boundary is twice this distance.
  • Maximum Margin Solution: The hyperplane that maximizes this margin. This solution is considered the most stable under perturbations of the input data points and often leads to better generalization.
  • Support Vectors: The data points that lie exactly on the margin boundaries (closest to the separating hyperplane) are called support vectors. They “support” the hyperplane.

Geometric Intuition (Linearly Separable Case)

    • Decision Boundary: The solid line $\mathbf{w}^T \mathbf{x} + b = 0$.
    • Margin Boundaries: Two dashed lines parallel to the decision boundary, defined by $\mathbf{w}^T \mathbf{x} + b = +1$ and $\mathbf{w}^T \mathbf{x} + b = -1$.
    • Margin Width: The perpendicular distance between the two margin boundaries is $\frac{2}{\|\mathbf{w}\|}$.
    • Support Vectors: The data points (circled) that lie exactly on the margin boundaries.
  • Classifier Form: The final SVM classifier can be expressed as a sum involving only the support vectors: $f(\mathbf{x}) = \sum_{i \in \text{SV}} \alpha_i\, y_i\, (\mathbf{x}_i^T \mathbf{x}) + b$. (This form arises from the dual formulation, mentioned later.)
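A hedged sketch of this support-vector form, using scikit-learn's `SVC` on toy data (in scikit-learn, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data.
X = np.array([[2.0, 2.0], [1.0, 3.0], [3.0, 1.5],
              [-1.0, -1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

# The weight vector is determined by the support vectors alone:
# w = sum over support vectors of (alpha_i * y_i) * x_i
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_sv, clf.coef_))   # True: only support vectors matter
print(clf.support_)                        # indices of the support vectors
```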

Margin Calculation and Normalization (Sketch Derivation)

  1. Scaling Freedom: The hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$ is the same as $(c\mathbf{w})^T \mathbf{x} + cb = 0$ for any constant $c > 0$. We can exploit this to choose a specific normalization for $\mathbf{w}$ and $b$.
  2. Canonical Hyperplanes: Choose the scaling such that for the support vectors $\mathbf{x}_+$ (positive class) and $\mathbf{x}_-$ (negative class) closest to the hyperplane: $\mathbf{w}^T \mathbf{x}_+ + b = +1$ and $\mathbf{w}^T \mathbf{x}_- + b = -1$.
    • These define the margin boundaries.
  3. Margin Calculation:
    • Consider the vector difference between support vectors on opposite margins: $\mathbf{x}_+ - \mathbf{x}_-$.
    • The distance between the margin hyperplanes is the projection of this vector onto the unit normal $\frac{\mathbf{w}}{\|\mathbf{w}\|}$.
    • Margin Width $= \frac{\mathbf{w}^T (\mathbf{x}_+ - \mathbf{x}_-)}{\|\mathbf{w}\|}$
    • Substitute $\mathbf{w}^T \mathbf{x}_+ = 1 - b$ and $\mathbf{w}^T \mathbf{x}_- = -1 - b$: Margin Width $= \frac{2}{\|\mathbf{w}\|}$.
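In display form, the derivation reads:

$$\text{margin width} \;=\; \frac{\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-)}{\|\mathbf{w}\|} \;=\; \frac{(1 - b) - (-1 - b)}{\|\mathbf{w}\|} \;=\; \frac{2}{\|\mathbf{w}\|}$$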

Optimization Problem (Hard Margin SVM)

  • Goal: Maximize the margin width $\frac{2}{\|\mathbf{w}\|}$.

  • Constraints: All data points must be correctly classified and lie on or outside their respective margin boundary. Using the canonical representation, this means:

    • $\mathbf{w}^T \mathbf{x}_i + b \ge +1$ for $y_i = +1$
    • $\mathbf{w}^T \mathbf{x}_i + b \le -1$ for $y_i = -1$
    • These can be combined into a single constraint: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$.
  • Equivalent Optimization Problem: Maximizing $\frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing $\|\mathbf{w}\|$, which is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ (the $\frac{1}{2}$ is often added for mathematical convenience in the derivation, though the slide omits it). Primal Formulation (Hard Margin):

    $$\min_{\mathbf{w}, b} \; \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad \text{for } i = 1, \dots, N$$

    (Note: Slide 13 states the objective without the factor $\frac{1}{2}$; with or without it, both are valid formulations.)

  • Properties:

    • This is a Quadratic Programming (QP) problem (quadratic objective, linear constraints).
    • It has a unique minimum because the objective function is convex and the constraints define a convex feasible region.
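Because it is a small QP, the primal can be handed directly to a generic convex solver. The snippet below is a sketch using cvxpy (an arbitrary choice of solver library, not mentioned in the slides) on toy, linearly separable data:

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable data.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(cp.sum_squares(w))            # minimize ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w.x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(2 / np.linalg.norm(w.value))                    # resulting margin width
```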

Limitations of Hard Margin and Need for Soft Margin

  • Problem 1: Strict Separability: The hard margin SVM requires the data to be perfectly linearly separable. It cannot handle data that is not linearly separable (as in page 3 bottom examples).
  • Problem 2: Sensitivity to Outliers: Even if data is separable, a single outlier might force the margin to be very narrow, potentially leading to a less robust classifier (as hinted in page 14 top example).
  • Trade-off: Sometimes, allowing a few points to be misclassified or lie within the margin might lead to a much larger margin and better overall generalization (as suggested in page 14 bottom example).

Soft Margin SVM: Slack Variables

  • Idea: Introduce slack variables, denoted $\xi_i$ (Greek letter xi), one for each data point $\mathbf{x}_i$.
  • Purpose: Allow individual points to violate the margin constraint $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$.
  • Modified Constraint: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$.
  • Slack Variable Constraint: We require $\xi_i \ge 0$.
  • Interpretation of $\xi_i$ (Page 15):
    • $\xi_i = 0$: Point is correctly classified and lies on or outside the margin boundary (satisfies the hard margin constraint).
    • $0 < \xi_i \le 1$: Point is on the correct side of the boundary ($0 \le y_i (\mathbf{w}^T \mathbf{x}_i + b) < 1$) but lies inside the margin. This is a margin violation.
    • $\xi_i > 1$: Point is misclassified ($y_i (\mathbf{w}^T \mathbf{x}_i + b) < 0$).
  • Penalty: We want to minimize the total amount of slack introduced, $\sum_i \xi_i$ (a sketch of computing the slack values follows this list).
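For any fixed classifier $(\mathbf{w}, b)$, the slack a point needs is $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$. A minimal sketch with illustrative values:

```python
import numpy as np

def slacks(w, b, X, y):
    """Slack xi_i = max(0, 1 - y_i * (w.x_i + b)) needed by each point."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0],    # well outside the margin         -> xi = 0
              [0.8, 0.5],    # inside the margin, correct side  -> 0 < xi <= 1
              [1.5, 0.5]])   # on the wrong side                -> xi > 1
y = np.array([+1, +1, -1])
print(slacks(w, b, X, y))    # [0.  0.7 2. ]
```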

Optimization Problem (Soft Margin SVM)

  • Objective: Minimize a combination of the classifier complexity (related to margin width) and the total slack. Primal Formulation (Soft Margin):

    $$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$$
    $$\text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1, \dots, N$$
    $$\quad \xi_i \ge 0, \quad \text{for } i = 1, \dots, N$$

    • $\boldsymbol{\xi} = (\xi_1, \dots, \xi_N)$ is the vector of slack variables.
    • $C \ge 0$ is the **regularization parameter**.
  • Role of the Regularization Parameter $C$:

    • Controls the trade-off between maximizing the margin (minimizing $\|\mathbf{w}\|^2$) and minimizing the classification/margin errors (minimizing $\sum_i \xi_i$).
    • Small $C$: Lower penalty on the slack variables $\xi_i$. Allows more points to violate the margin (larger $\xi_i$) in favor of achieving a wider margin (smaller $\|\mathbf{w}\|$). Can lead to simpler models, potentially underfitting if too small.
    • Large $C$: Higher penalty on slack variables. Forces the optimizer to minimize margin violations. Behaves more like the hard margin SVM. Can lead to narrower margins and potentially overfitting if too large.
    • $C \to \infty$: Corresponds to the hard margin SVM (no slack allowed).
  • Properties:

    • Still a Quadratic Programming problem.
    • Has a unique minimum.
    • $C$ is a hyperparameter that needs to be chosen (e.g., via cross-validation); see the sketch below.
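A sketch of choosing $C$ by cross-validation with scikit-learn (toy random data and an arbitrary grid of $C$ values, not taken from the slides):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-blob data; the grid of C values is an arbitrary choice.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(40, 2)),
               rng.normal(-1.5, 1.0, size=(40, 2))])
y = np.array([+1] * 40 + [-1] * 40)

grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)    # the C value with the best cross-validated accuracy
```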

Examples of Varying C (Pages 17-19)

  • Data: 2D data that is linearly separable but requires a narrow margin for perfect separation (Page 17).
  • Case 1: Very large $C$ (Hard Margin - Page 18)
    • Finds the separating hyperplane with the largest margin while correctly classifying all points.
    • Result: Margin = 0.0966, Training error = 0.00%, 3 Support Vectors. The margin is very narrow.
  • Case 2: Smaller, finite $C$ (Soft Margin - Page 19)
    • Allows some margin violation to potentially achieve a wider margin.
    • Result: Margin = 0.2265, Training error = 3.70% (one point is likely inside the margin or misclassified), 4 Support Vectors. The margin is significantly wider.
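The same comparison can be reproduced qualitatively on toy data (not the slides' data set): fit with a very large $C$ and with a small $C$, then report the margin $2/\|\mathbf{w}\|$, training error, and number of support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 0.6, size=(25, 2)),
               rng.normal(-1.0, 0.6, size=(25, 2))])
y = np.array([+1] * 25 + [-1] * 25)

for C in (1e6, 0.1):                       # ~hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    err = np.mean(clf.predict(X) != y)
    print(f"C={C:g}: margin={margin:.4f}, training error={err:.2%}, "
          f"support vectors={len(clf.support_)}")
```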

Application: Pedestrian Detection in Computer Vision

Objective

  • Detect (localize) standing humans in images.

Approach: Sliding Window Classifier

  1. Define a window of a fixed size.
  2. Slide this window across the image at different positions and scales.
  3. For each window, classify whether it contains the object of interest (a pedestrian) or not.
  4. This reduces object detection to a binary classification problem.
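A minimal sketch of the sliding-window loop (single scale, for brevity); `classify_window` is a placeholder for a trained classifier, e.g., a linear SVM applied to features extracted from the window:

```python
import numpy as np

def sliding_window_detect(image, win_h, win_w, stride, classify_window):
    """Return the top-left corners of windows classified as containing the object."""
    H, W = image.shape[:2]
    detections = []
    for top in range(0, H - win_h + 1, stride):
        for left in range(0, W - win_w + 1, stride):
            window = image[top:top + win_h, left:left + win_w]
            if classify_window(window) > 0:      # positive score -> "pedestrian"
                detections.append((top, left))
    return detections

# Dummy usage: a random "image" and a trivial stand-in classifier.
image = np.random.rand(128, 96)
dummy_classifier = lambda win: win.mean() - 0.5  # placeholder score, not a real detector
print(len(sliding_window_detect(image, 64, 32, 16, dummy_classifier)))
```

Scanning over multiple scales would repeat the same loop on resized copies of the image.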