Support Vector Machines (SVMs) - Exam Preparation Notes (Continued)
XI. Introduction (From the New PDF)
-
Q (Medium): Why is the “kernel trick” important for SVMs? A: It lets the SVM handle nonlinearly separable data by computing dot products in a high-dimensional feature space implicitly, without ever performing the mapping explicitly, which is what makes the SVM one of the most effective and efficient machine learning tools.
-
Q (Short): What two classifiers are combined to form the basis of an SVM? A: A linear classifier and a k-nearest neighbor (K-NN) classifier.
XII. Linear Classifier
-
Q (Medium): How do Bayesian methods (mentioned in the context of linear classifiers) make decisions, and what is a limitation related to data distributions? A: Bayesian methods are model-based and give good decisions if the models are accurate. However, because data distributions are usually unknown, accurate models require a large number of training samples.
-
Q (Long): Explain the alternative approach to Bayesian methods that uses a functional form decision boundary. What is a linear classifier an example of? A: Instead of modeling the data distribution, we assume a functional form for the decision boundary between classes. The parameters of this boundary are estimated from training data. A linear classifier is an example of this approach, assuming a linear decision boundary.
-
Q (Medium): Given a feature vector x = (x₁, x₂, …, xₙ), write the general form of a linear discriminant function, g(x). A: g(x) = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ (or, equivalently, g(x) = w·x, where x is augmented with x₀ = 1 and w = (w₀, w₁, …, wₙ)).
-
Q (Short): In the linear discriminant function, what are the xᵢ and wᵢ terms? A: The xᵢ are the variables (features), and the wᵢ are the coefficients or weights.
-
Q (Short): Geometrically, what does g(x) = 0 represent? A: A hyperplane in n-dimensional space (and the decision boundary).
-
Q (Short): How is a data point with feature vector x classified using the linear discriminant function? A:
- If g(x) > 0, x belongs to class 1.
- If g(x) < 0, x belongs to class 2.
-
Q (Long): Describe the “classical” way to find the weights of a linear classifier (the theoretical solution). A: Solve a set of linear equations. For n-dimensional data, you need n + 1 linear equations (or n + 1 samples) to solve for the n + 1 weights. If yᵢ represents the class label (yᵢ = +1 for class 1, yᵢ = −1 for class 2), and you have samples (xᵢ, yᵢ), i = 1, …, n + 1, you can set up the equations: w₀ + w₁xᵢ₁ + … + wₙxᵢₙ = yᵢ for i = 1, …, n + 1. This can be written in matrix form as Xw = y and solved as w = X⁻¹y.
-
Q (Long): Describe the “optimal” solution for finding the weights of a linear classifier. What is minimized? A: The optimal solution minimizes the total squared error of g(x) over the entire training dataset. If you have N data points (xᵢ, yᵢ), you minimize: E = Σᵢ (g(xᵢ) − yᵢ)² (summed over all N samples). Taking the partial derivative of E with respect to each wₖ and setting it to zero leads to a system of linear equations. The solution can be expressed as w = (XᵀX)⁻¹Xᵀy (the pseudo-inverse solution).
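A minimal NumPy sketch of this least-squares solution, using made-up toy data (the variable names and values are illustrative, not from the notes):

```python
import numpy as np

# Toy 2-D training set with labels y = +1 (class 1) and y = -1 (class 2).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])

# Augment every sample with x0 = 1 so the first weight acts as w0.
Xa = np.hstack([np.ones((len(X), 1)), X])

# Minimize E = sum_i (g(x_i) - y_i)^2; the minimizer is the
# pseudo-inverse solution w = (X^T X)^(-1) X^T y.
w, residuals, rank, sv = np.linalg.lstsq(Xa, y, rcond=None)

def g(x):
    """Linear discriminant g(x) = w0 + w1*x1 + ... + wn*xn."""
    return w[0] + w[1:] @ x

x_new = np.array([2.0, 2.5])
print("class 1" if g(x_new) > 0 else "class 2")
```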
-
Q (Medium): Explain the equations obtained by taking the partial derivatives of the error. A: By taking the partial derivative of the total squared error, E, with respect to each weight wₖ and setting the derivative to zero (∂E/∂wₖ = 0), we get n + 1 linear equations: Σᵢ 2(g(xᵢ) − yᵢ)xᵢₖ = 0, which simplifies to Σᵢ g(xᵢ)xᵢₖ = Σᵢ yᵢxᵢₖ, for k = 0, 1, 2, …, n (with xᵢ₀ = 1).
-
Q (Long): Describe the iterative optimization procedure (suboptimal solution) for finding the weights of a linear classifier. A: (a minimal sketch in code follows this list)
- Initialize the weights with small random values.
- Take the next training sample (xᵢ, yᵢ), where yᵢ = +1 or yᵢ = −1.
- Compute g(xᵢ).
- If yᵢ·g(xᵢ) ≤ 0 (a misclassification), update the weights: w₀ ← w₀ + c₁yᵢ and wₖ ← wₖ + c₂yᵢxᵢₖ for k = 1, …, n, where c₁ and c₂ are positive constants.
- Repeat steps 2-4 for all training samples until all samples are correctly classified or the weights stop changing.
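A short Python sketch of the iterative update just described, assuming labels yᵢ ∈ {+1, −1} and illustrative constants c₁ = c₂ = 0.1 (the data and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data with labels +1 and -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 1.0], [7.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w0 = rng.normal(scale=0.01)                    # bias weight, small random init
w = rng.normal(scale=0.01, size=X.shape[1])    # remaining weights
c1, c2 = 0.1, 0.1                              # positive step-size constants

for epoch in range(1000):
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w0 + w @ xi) <= 0:            # sample is misclassified
            w0 += c1 * yi                      # update the bias weight
            w += c2 * yi * xi                  # update the other weights
            changed = True
    if not changed:                            # every sample classified correctly
        break
```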
-
Q (Medium): How can a linear classifier be extended to build a nonlinear classifier? Give an example. A: By mapping the original feature vector to a higher-dimensional space. For instance, a 2D vector x = (x₁, x₂) can be mapped to a 5D vector z = (x₁², x₂², √2x₁x₂, √2x₁, √2x₂). This corresponds (up to a constant term) to the polynomial kernel K(x, y) = (x·y + 1)².
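A quick NumPy check of this correspondence, using the 5-D mapping above on two made-up points (the helper name phi is illustrative):

```python
import numpy as np

def phi(x):
    """Map a 2-D point into the 5-D feature space described above."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

x = np.array([1.5, -0.5])
y = np.array([0.3, 2.0])

# The dot product in the 5-D space equals (x . y + 1)^2 - 1, i.e. the
# degree-2 polynomial kernel up to its constant term.
print(np.isclose(phi(x) @ phi(y), (x @ y + 1) ** 2 - 1))   # True
```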
XIII. K-Nearest Neighbor (K-NN) Classification
-
Q (Medium): Describe the basic idea of the K-Nearest Neighbor (K-NN) algorithm. A: K-NN stores all available cases and classifies new cases based on a decision function (usually a distance measure). It finds the K nearest neighbors to the new data point and assigns the new data point to the majority class among those neighbors.
-
Q (Long): Given a training dataset and a distance measure, describe the K-NN classification algorithm. A: (a minimal sketch in code follows this list)
- Input the new data point x.
- Compute the distance between x and every training sample xᵢ: dᵢ = d(x, xᵢ).
- Sort the distances in ascending order and rank the training samples accordingly: d(1) ≤ d(2) ≤ … ≤ d(N).
- For a 1-NN classifier, classify x to the class of the nearest neighbor (the sample at distance d(1)).
- For a K-NN classifier, classify x to the majority class among the top K ranked data points.
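A compact NumPy sketch of this K-NN rule with Euclidean distance (the dataset, labels, and K value are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_classify(np.array([2.0, 2.0]), X_train, y_train, k=3))   # -> 1
```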
-
Q (Short): What are two common distance measures used in K-NN? A: Euclidean distance (d(x, y) = √(Σₖ (xₖ − yₖ)²)) and city block distance (d(x, y) = Σₖ |xₖ − yₖ|).
-
Q (Medium): What is the effect of the value of ‘k’ in K-NN, and how does it affect the classifier’s resistance to outliers? A: The value of k has a smoothing effect. Larger values of k make the classifier more resistant to outliers but can also lead to misclassifications. The choice of k is usually determined empirically.
-
Q (Medium): What is weighted K-NN, and what is its advantage? A: It gives greater weight to nearer neighbors and less weight to farther ones. This helps overcome the misclassifications caused by weighting all k neighbors equally.
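One common choice of weights is the inverse of the distance; the notes do not specify a weight function, so the sketch below should be read as an assumption:

```python
import numpy as np

def weighted_knn_classify(x_new, X_train, y_train, k=3, eps=1e-9):
    """Weighted K-NN: each of the k nearest neighbors votes with weight 1/distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)        # class with the largest weighted vote

X_train = np.array([[1.0, 1.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.0]])
y_train = np.array([1, 1, 2, 2])
print(weighted_knn_classify(np.array([1.5, 1.2]), X_train, y_train, k=3))   # -> 1
```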
-
Q (Medium): What is a disadvantage of K-NN related to its dependence on training data? How does this contrast with other classifiers? A: A K-NN classifier is tied to the training data. It needs to “carry” all the training data for classification, unlike other classifiers that, once trained, can discard the training data.
-
Q (Short): What is a key advantage of a K-NN classifier related to its ability to classify nonlinearly separable data? A: It can classify data that is nonlinearly separable.
XIV. Support Vector Machine (SVM)
-
Q (Medium): What are the disadvantages of a linear classifier that SVM aims to address? A:
- The solution is either not optimal or computationally expensive.
- It cannot classify nonlinearly separable data.
-
Q (Medium): What are the disadvantages of a K-NN classifier that SVM aims to address? A:
- It’s difficult to choose ‘k’.
- Dependence on training data.
-
Q (Short): State the two primary goals of a basic (linear) SVM. A:
- Maximize the margin separating the two classes (optimal).
- Use only a few training data points (support vectors) to define the hyperplane (efficient).
-
Q (Short): What additional goal does a kernel-based SVM achieve? A: To be able to classify data that is nonlinearly separable.
-
Q (Medium): What is a perceptron, and how does its training process relate to that of a linear classifier? A: A perceptron is a binary linear classifier. Its training process is similar to that of the linear classifier, but it can perform online learning, processing training data one sample at a time.
-
Q (Long): Given a training dataset D = {(xᵢ, yᵢ)}, where yᵢ ∈ {+1, −1}, and the dot product w·x = Σⱼ wⱼxⱼ, write the formulation of a perceptron and its decision function. A:
- Let w ← (w, b) and x ← (x, 1) (augmented vectors); then h(xᵢ) = yᵢ(w·xᵢ), and the decision function is f(x) = sign(w·x).
-
Q (Long): Describe the perceptron training algorithm. A: (a minimal sketch in code follows this list)
1. Take the next training sample (xᵢ, yᵢ) ∈ D.
2. If h(xᵢ) ≥ 0, keep the weights: wₖ₊₁ ← wₖ.
3. If h(xᵢ) < 0, update the weights: wₖ₊₁ ← wₖ + ηyᵢxᵢ, with η > 0.
4. Repeat from step 1.
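A runnable sketch of this online update, using the augmented-vector formulation from the previous question (the data and the value of η are illustrative; treating h = 0 as a misclassification is a small practical tweak so the zero initialization still triggers updates):

```python
import numpy as np

# Toy linearly separable data with labels +1 and -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 1.0], [7.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

Xa = np.hstack([X, np.ones((len(X), 1))])   # augment x with 1 so w carries the bias
w = np.zeros(Xa.shape[1])
eta = 0.5                                   # learning rate, eta > 0

for epoch in range(1000):
    updated = False
    for xi, yi in zip(Xa, y):
        h = yi * (w @ xi)                   # h(x_i) = y_i (w . x_i)
        if h <= 0:                          # misclassified (boundary counted as an error)
            w = w + eta * yi * xi           # perceptron update
            updated = True
    if not updated:                         # a full pass with no updates: done
        break

print(np.sign(Xa @ w))                      # predicted labels on the training set
```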
-
Q (Long): Explain the concept of the margin between two classes in the context of SVM. Why is maximizing the margin important? A: The margin is the distance between the separating hyperplane and the closest data points from either class (the support vectors). Maximizing the margin leads to a more robust classifier that generalizes better to unseen data.
-
Q (Medium): Given the two inequalities for the two classes, w·xᵢ + b ≥ +1 for yᵢ = +1 and w·xᵢ + b ≤ −1 for yᵢ = −1, combine them into one inequality. A: yᵢ(w·xᵢ + b) ≥ 1 for all i.
-
Q (Medium): Define the equations for the hyperplanes H, H₁, and H₂ in terms of w, x, and b. A:
- H: w·x + b = 0
- H₁: w·x + b = +1
- H₂: w·x + b = −1
-
Q (Medium): What is the distance from the hyperplane H: w·x + b = 0 to the origin in n-dimensional space? A: |b|/||w||.
-
Q (Long): What is the margin between the two hyperplanes H₁ and H₂? What are the data points that lie on H₁ and H₂ called? A: The margin is 2/||w||. The data points on H₁ and H₂ are called support vectors.
-
Q (Long): Formulate the optimization problem for maximizing the margin (the primal form of the SVM). A: (a short illustration in code follows this list)
- Minimize: ½||w||²
- Subject to: yᵢ(w·xᵢ + b) ≥ 1 for i = 1, …, N
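In practice this quadratic program is handed to a solver; as an illustration, scikit-learn’s SVC with a linear kernel and a very large C approximates the hard-margin problem above (the toy dataset is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; labels follow the +1 / -1 convention above.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C makes the soft-margin solver behave (almost) like the
# hard-margin problem: minimize (1/2)||w||^2 s.t. y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("margin =", 2.0 / np.linalg.norm(w))    # margin between H1 and H2
print("support vectors:\n", clf.support_vectors_)
```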
-
Q (Long): Write the Lagrange function, Lₚ, for the SVM optimization problem. A: Lₚ = ½||w||² − Σᵢ αᵢ[yᵢ(w·xᵢ + b) − 1], where the αᵢ ≥ 0 are the Lagrange multipliers.
-
Q (Long): State the primal form of the SVM optimization problem. A:
- Minimize: Lₚ with respect to w and b
- Subject to: αᵢ ≥ 0, for i = 1, …, N