Support Vector Machines (SVMs) - Exam Preparation Notes

I. Introduction & Overview

  1. Q (Short): What is a Support Vector Machine (SVM), according to the definition provided? A: A system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalization theory and exploiting optimization theory.

II. Basic Concepts

  1. Q (Short): Write the formula for the scalar product of two vectors, $\mathbf{x}$ and $\mathbf{y}$. A: $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$

  2. Q (Short): What is the general form of a decision function, $f(\mathbf{x})$, for binary classification, and what are the resulting class labels ($y$)? A: $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$. - If $f(\mathbf{x}) \geq 0$, then $y = +1$. - If $f(\mathbf{x}) < 0$, then $y = -1$.
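
A minimal sketch of this decision rule in NumPy (the weight vector, bias, and test points are made up for illustration):

```python
import numpy as np

# Hypothetical weights and bias for a 2-D linear decision function f(x) = w.x + b
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    """Return the class label y in {+1, -1} from the sign of f(x) = w.x + b."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # f = 2.5 >= 0  ->  +1
print(classify(np.array([0.0, 3.0])))   # f = -2.5 < 0  ->  -1
```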

  3. Q (Medium): Briefly describe the core idea behind how SVMs work. A: SVMs find the best separating hyperplane between classes according to a criterion (e.g., maximum margin). The training process is an optimization problem, and the training data is effectively reduced to a set of support vectors.

III. Key Concepts: Feature Spaces and Kernels

  1. Q (Short): Why might we map data to a higher-dimensional feature space? A: To make the data linearly separable, even if it isn’t in the original input space.

  2. Q (Medium): What is a kernel function, and what property must it satisfy? A: A kernel function, $K(\mathbf{x}, \mathbf{z})$, implicitly maps data to a new feature space. It must be equivalent to an inner product in some feature space. Formally, $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^\top \phi(\mathbf{z})$ for some feature mapping $\phi$.

  3. Q (Short): Give three examples of common kernel functions. A:

    • Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$
    • Polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^\top \mathbf{x}_j)^p$
    • Gaussian: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
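
All three are one-liners to compute directly; a minimal NumPy sketch (the degree $p$ and width $\sigma$ defaults are arbitrary choices for illustration):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```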

IV. Linear Separators and Maximum Margin

  1. Q (Medium): How can binary classification be viewed in terms of feature space? A: Binary classification is the task of separating classes in feature space using a hyperplane. The decision function is $f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$.

  2. Q (Long): Given multiple possible linear separators, how does an SVM choose the “optimal” one? Explain the concept of the margin. A: The SVM aims to find the hyperplane that maximizes the margin. The margin is the distance between the hyperplane and the closest data points from either class (the support vectors). A larger margin generally leads to better generalization.

  3. Q (Medium): Explain the process of finding the optimal linear separator using convex hulls. A:

    1. Find the closest points in the convex hulls of the two classes.
    2. The optimal hyperplane bisects the line segment connecting these two closest points. The normal vector of the hyperplane, $\mathbf{w}$, is given by $\mathbf{w} = \mathbf{c} - \mathbf{d}$, where $\mathbf{c}$ and $\mathbf{d}$ are the closest points. The hyperplane passes through the midpoint $(\mathbf{c} + \mathbf{d})/2$, so its equation is $\mathbf{w}^\top \mathbf{x} = \frac{\|\mathbf{c}\|^2 - \|\mathbf{d}\|^2}{2}$.
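
A sketch of step 2 in NumPy, assuming the closest points $\mathbf{c}$ and $\mathbf{d}$ from step 1 are already known (finding them is itself a small optimization problem, omitted here):

```python
import numpy as np

# Hypothetical closest points of the two convex hulls (assumed already computed)
c = np.array([2.0, 1.0])   # closest point in class +1's hull
d = np.array([0.0, 0.0])   # closest point in class -1's hull

w = c - d                   # normal vector of the separating hyperplane
midpoint = (c + d) / 2      # the hyperplane passes through this point
b = w @ midpoint            # so the hyperplane is w.x = b

# Sanity check: c and d lie on opposite sides, at equal distance
print(w @ c - b, w @ d - b)   # 2.5 and -2.5
```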

  4. Q (Short): What are support vectors? A: The data points closest to the separating hyperplane.

  5. Q (Short): What is the distance, $r$, from an example data point $\mathbf{x}_i$ to the separating hyperplane? A: $r = \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|}$

  6. Q (Short): What is the margin, $\rho$, of the separator? A: The margin is the width of separation between the classes, i.e., twice the distance from the hyperplane to the closest examples: $\rho = 2r$.
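
A quick numeric check of the distance and margin formulas, with a made-up separator and support vector:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0        # hypothetical separator, ||w|| = 5

def distance(x, y):
    """Signed distance r = y (w.x + b) / ||w|| from example (x, y) to the hyperplane."""
    return y * (w @ x + b) / np.linalg.norm(w)

x_sv, y_sv = np.array([2.0, 0.0]), +1.0  # w.x + b = 1, so this is a support vector
r = distance(x_sv, y_sv)                 # 1 / ||w|| = 0.2
rho = 2 * r                              # margin = 0.4: twice the closest distance
print(r, rho)
```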

  7. Q (Medium): Why is maximizing the margin considered a good strategy, both intuitively and theoretically? A:

    • Intuitively: A larger margin provides more “buffer” against misclassification of new, unseen data.
    • Theoretically: Maximizing the margin minimizes the complexity of the model (related to VC dimension), which helps prevent overfitting and improves generalization.
  8. Q (Short): In maximum margin classification, which training examples are crucial, and which can be ignored? A: Only the support vectors are important; other training examples are ignorable.

V. Linear SVM - Mathematical Formulation

  1. Q (Long): Formulate the optimization problem for finding the maximum margin hyperplane (the primal problem). Include the constraints. A: We want to find $\mathbf{w}$ and $b$ such that:

    • The margin $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized.
    • Subject to the constraints:
      • $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$ if $y_i = +1$
      • $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$ if $y_i = -1$

    This is often reformulated as:

    • Minimize: $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top \mathbf{w}$
    • Subject to: $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ for all $i$
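
To make the reformulated primal concrete, here is a sketch that hands it to the cvxpy modelling library on a tiny made-up dataset (any QP solver would do):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy dataset (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2) w.w subject to y_i (w.x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin separator; margin = 2 / ||w||
```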
  2. Q (Long): Explain how the optimization problem is solved (the dual problem). What are Lagrange multipliers, and what is the final form of the solution? A: We use Lagrange multipliers, $\alpha_i$, one for each constraint (one per training point). The dual problem is:

    • Maximize: $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j$
    • Subject to:
      • $\sum_i \alpha_i y_i = 0$
      • $\alpha_i \geq 0$ for all $i$

    The solution has the form:

    • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
    • $b = y_k - \mathbf{w}^\top \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $\alpha_k > 0$

    The classifying function is: $f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i^\top \mathbf{x} + b$
  3. Q (Short): How do you identify support vectors from the solution of the dual problem? A: Support vectors correspond to training points where the Lagrange multiplier $\alpha_i$ is non-zero.

  4. Q (Medium): Explain the solution to the optimization problem. A: The solution is $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $b = y_k - \mathbf{w}^\top \mathbf{x}_k$ for any $k$ such that $\alpha_k > 0$. Each non-zero $\alpha_i$ indicates that the corresponding $\mathbf{x}_i$ is a support vector. The classifying function is $f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i^\top \mathbf{x} + b$.
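
Recovering the solution from solved dual variables is mechanical; a NumPy sketch on a two-point toy problem (for which $\alpha_1 = \alpha_2 = 1/9$ happens to be the exact dual solution):

```python
import numpy as np

# Two-point toy problem; the alphas would normally come from a QP solver
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([1/9, 1/9])     # satisfies sum_i alpha_i y_i = 0

sv = alpha > 1e-8                # non-zero alphas mark the support vectors
w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i
k = np.argmax(sv)                # index of any support vector
b = y[k] - w @ X[k]              # b = y_k - w.x_k

def f(x):
    """Classifying function f(x) = sum_i alpha_i y_i x_i.x + b."""
    return (alpha * y) @ (X @ x) + b

print(np.sign(f(np.array([1.0, 1.0]))))   # +1
```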

VI. Non-linear SVMs

  1. Q (Medium): How do non-linear SVMs handle data that is not linearly separable? A: They map the data to a higher-dimensional feature space where the data becomes linearly separable (or nearly so, using a soft margin).

  2. Q (Medium): Describe non-linear classification with an example. A: Consider 2-dimensional vectors $\mathbf{x} = [x_1, x_2]$ and the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^\top \mathbf{x}_j)^2$. Then, expanding the square, $K(\mathbf{x}_i, \mathbf{x}_j) = 1 + x_{i1}^2 x_{j1}^2 + 2x_{i1}x_{j1}x_{i2}x_{j2} + x_{i2}^2 x_{j2}^2 + 2x_{i1}x_{j1} + 2x_{i2}x_{j2}$. Now consider the mapping $\phi(\mathbf{x}) = [1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2]$, and we have $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$: the kernel is an inner product in a 6-dimensional feature space, computed without ever forming $\phi(\mathbf{x})$ explicitly.
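
This identity is easy to verify numerically; a short sketch comparing the kernel value with the explicit 6-dimensional inner product:

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel (1 + xi.xj)^2."""
    return (1 + xi @ xj) ** 2

def phi(x):
    """Explicit 6-dimensional feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))   # both ~ 4.0
```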

  3. Q (Long): Explain the “kernel trick.” Why is it important? A: The kernel trick avoids explicitly computing the mapping to the high-dimensional feature space, $\phi(\mathbf{x})$. Instead, we use a kernel function, $K(\mathbf{x}_i, \mathbf{x}_j)$, which computes the inner product in the feature space directly from the input space: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$. This is crucial because the feature space can be extremely high-dimensional (even infinite), making explicit computation infeasible.

  4. Q (Long): How does the dual problem formulation change for non-linear SVMs? A: The only change is that the inner product $\mathbf{x}_i^\top \mathbf{x}_j$ is replaced by the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$:

    • Maximize: $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$
    • Subject to:
      • $\sum_i \alpha_i y_i = 0$
      • $\alpha_i \geq 0$ for all $i$

    The classifying function becomes: $f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b$
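
Correspondingly, only the prediction code changes relative to the linear case; a sketch of the kernelized classifying function, assuming $\alpha$, $b$, and the training data are already available from the dual solution:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def f(x, X, y, alpha, b, kernel=gaussian_kernel):
    """Kernelized decision function f(x) = sum_i alpha_i y_i K(x_i, x) + b.

    X, y, alpha, b are assumed to come from a solved dual problem.
    """
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
```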

VII. Kernel Functions and Mercer’s Theorem

  1. Q (Medium): What is a positive definite matrix? A: A square matrix $A$ is positive definite if $\mathbf{z}^\top A \mathbf{z} > 0$ for all nonzero column vectors $\mathbf{z}$.

  2. Q (Short): What are negative definite, positive semi-definite, and negative semi-definite matrices? A: A square matrix $A$ is negative definite if $\mathbf{z}^\top A \mathbf{z} < 0$ for all nonzero column vectors $\mathbf{z}$. It is positive semi-definite if $\mathbf{z}^\top A \mathbf{z} \geq 0$, and negative semi-definite if $\mathbf{z}^\top A \mathbf{z} \leq 0$, for all $\mathbf{z}$.

  3. Q (Medium): State Mercer’s theorem. Why is it important for SVMs? A: Mercer’s theorem states that every positive semi-definite symmetric function is a kernel. This is important because it provides a way to determine whether a function is a valid kernel (i.e., corresponds to an inner product in some feature space) without having to explicitly find the feature mapping $\phi$.

  4. Q (Medium): What is a Gram matrix? A: A matrix, $K$, composed of kernel values for all pairs of training points, $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. For example, for training points $\mathbf{x}_1, \dots, \mathbf{x}_n$:

    $$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_n, \mathbf{x}_1) & \cdots & K(\mathbf{x}_n, \mathbf{x}_n) \end{pmatrix}$$
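
On a finite sample, Mercer’s condition can be checked by building the Gram matrix and verifying that it is symmetric positive semi-definite; a small sketch:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the Gram matrix K_ij = kernel(x_i, x_j) over a sample X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.randn(5, 2)
K = gram_matrix(X, lambda xi, xj: (1 + xi @ xj) ** 2)   # polynomial kernel

# A valid kernel's Gram matrix is symmetric with non-negative eigenvalues
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))
```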

  5. Q (Short): Give examples of kernel functions besides the linear, polynomial, and Gaussian kernels. A: Two-layer perceptron (sigmoid) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta_0\, \mathbf{x}_i^\top \mathbf{x}_j + \beta_1)$

VIII. SVM Applications and Extensions

  1. Q (Short): List some application areas where SVMs have been successfully used. A: Text classification, genomic data analysis, and many other classification tasks.

  2. Q (Short): Name two popular optimization algorithms for training SVMs. A: SMO (Sequential Minimal Optimization) and SVMlight.
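
For reference, scikit-learn’s SVC is backed by libsvm, which uses an SMO-type solver; a minimal usage sketch (the dataset and hyperparameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data, just for demonstration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # Gaussian kernel, soft margin C
clf.fit(X, y)
print(clf.support_vectors_.shape)               # the learned support vectors
```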

  3. Q (Medium): List some extensions to the basic SVM framework. A:

    • Regression
    • Variable Selection
    • Boosting
    • Density Estimation
    • Unsupervised Learning (Novelty/Outlier Detection, Feature Detection, Clustering)