Support Vector Machines (SVMs) - Exam Preparation Notes (Continued)
XI. Introduction (From the New PDF)
-
Q (Medium): Why is the “kernel trick” important for SVMs? A: It lets the SVM handle nonlinearly separable data by computing dot products in a high-dimensional feature space implicitly, without ever performing the mapping explicitly, which is what makes the SVM one of the most effective and efficient machine learning tools.
-
Q (Short): What two classifiers are combined to form the basis of an SVM? A: A linear classifier and a k-nearest neighbor (K-NN) classifier.
XII. Linear Classifier
-
Q (Medium): How do Bayesian methods (mentioned in the context of linear classifiers) make decisions, and what is a limitation related to data distributions? A: Bayesian methods are model-based and give good decisions if the models are accurate. However, because data distributions are usually unknown, accurate models require a large number of training samples.
-
Q (Long): Explain the alternative approach to Bayesian methods that uses a functional form decision boundary. What is a linear classifier an example of? A: Instead of modeling the data distribution, we assume a functional form for the decision boundary between classes. The parameters of this boundary are estimated from training data. A linear classifier is an example of this approach, assuming a linear decision boundary.
-
Q (Medium): Given a feature vector x = (x₁, x₂, …, xₙ), write the general form of a linear discriminant function, g(x). A: g(x) = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ (or, equivalently, g(x) = w·x, where x is augmented with x₀ = 1 and w = (w₀, w₁, …, wₙ)).
-
Q (Short): In the linear discriminant function, what are the xᵢ and wᵢ terms? A: The xᵢ are the variables (features), and the wᵢ are the coefficients or weights.
-
Q (Short): Geometrically, what does g(x) = 0 represent? A: A hyperplane in n-dimensional space (and the decision boundary).
-
Q (Short): How is a data point with feature vector x classified using the linear discriminant function? A:
- If g(x) > 0, x belongs to class 1.
- If g(x) < 0, x belongs to class 2.
-
Q (Long): Describe the “classical” way to find the weights of a linear classifier (the theoretical solution). A: Solve a set of linear equations. For n-dimensional data, you need n + 1 linear equations (or n + 1 samples) to solve for the n + 1 weights. If yᵢ represents the class label (yᵢ = +1 for class 1, yᵢ = −1 for class 2), and you have samples (xᵢ, yᵢ), i = 1, …, n + 1, you can set up the equations: w₀ + w₁xᵢ₁ + … + wₙxᵢₙ = yᵢ for i = 1, …, n + 1. This can be written in matrix form as Xw = y and solved as w = X⁻¹y.
-
Q (Long): Describe the “optimal” solution for finding the weights of a linear classifier. What is minimized? A: The optimal solution minimizes the total squared error of g(x) over the entire training dataset. If you have N data points (xᵢ, yᵢ), you minimize: E = Σᵢ (g(xᵢ) − yᵢ)² (summed over all N samples). Taking the partial derivative of E with respect to each wₖ and setting it to zero leads to a system of linear equations. The solution can be expressed as w = (XᵀX)⁻¹Xᵀy (the pseudo-inverse solution).
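A minimal NumPy sketch of this least-squares solution, using made-up toy data (the variable names and values are illustrative, not from the notes):

```python
import numpy as np

# Toy 2-D training set with labels y = +1 (class 1) and y = -1 (class 2).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])

# Augment every sample with x0 = 1 so the first weight acts as w0.
Xa = np.hstack([np.ones((len(X), 1)), X])

# Minimize E = sum_i (g(x_i) - y_i)^2; the minimizer is the
# pseudo-inverse solution w = (X^T X)^(-1) X^T y.
w, residuals, rank, sv = np.linalg.lstsq(Xa, y, rcond=None)

def g(x):
    """Linear discriminant g(x) = w0 + w1*x1 + ... + wn*xn."""
    return w[0] + w[1:] @ x

x_new = np.array([2.0, 2.5])
print("class 1" if g(x_new) > 0 else "class 2")
```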
-
Q (Medium): Explain the equations obtained by taking the partial derivatives of the error. A: By taking the partial derivative of the total squared error, E, with respect to each weight wₖ and setting the derivative to zero (∂E/∂wₖ = 0), we get n + 1 linear equations: Σᵢ 2(g(xᵢ) − yᵢ)xᵢₖ = 0, which simplifies to Σᵢ g(xᵢ)xᵢₖ = Σᵢ yᵢxᵢₖ, for k = 0, 1, 2, …, n (with xᵢ₀ = 1).
-
Q (Long): Describe the iterative optimization procedure (suboptimal solution) for finding the weights of a linear classifier. A: (a minimal sketch in code follows this list)
- Initialize the weights with small random values.
- Take the next training sample (xᵢ, yᵢ), where yᵢ = +1 or yᵢ = −1.
- Compute g(xᵢ).
- If yᵢ·g(xᵢ) ≤ 0 (a misclassification), update the weights: w₀ ← w₀ + c₁yᵢ and wₖ ← wₖ + c₂yᵢxᵢₖ for k = 1, …, n, where c₁ and c₂ are positive constants.
- Repeat steps 2-4 for all training samples until all samples are correctly classified or the weights stop changing.
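A short Python sketch of the iterative update just described, assuming labels yᵢ ∈ {+1, −1} and illustrative constants c₁ = c₂ = 0.1 (the data and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data with labels +1 and -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 1.0], [7.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w0 = rng.normal(scale=0.01)                    # bias weight, small random init
w = rng.normal(scale=0.01, size=X.shape[1])    # remaining weights
c1, c2 = 0.1, 0.1                              # positive step-size constants

for epoch in range(1000):
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w0 + w @ xi) <= 0:            # sample is misclassified
            w0 += c1 * yi                      # update the bias weight
            w += c2 * yi * xi                  # update the other weights
            changed = True
    if not changed:                            # every sample classified correctly
        break
```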
-
Q (Medium): How can a linear classifier be extended to build a nonlinear classifier? Give an example. A: By mapping the original feature vector to a higher-dimensional space. For instance, a 2D vector x = (x₁, x₂) can be mapped to a 5D vector z = (x₁², x₂², √2x₁x₂, √2x₁, √2x₂). This corresponds (up to a constant term) to the polynomial kernel K(x, y) = (x·y + 1)².
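A quick NumPy check of this correspondence, using the 5-D mapping above on two made-up points (the helper name phi is illustrative):

```python
import numpy as np

def phi(x):
    """Map a 2-D point into the 5-D feature space described above."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

x = np.array([1.5, -0.5])
y = np.array([0.3, 2.0])

# The dot product in the 5-D space equals (x . y + 1)^2 - 1, i.e. the
# degree-2 polynomial kernel up to its constant term.
print(np.isclose(phi(x) @ phi(y), (x @ y + 1) ** 2 - 1))   # True
```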
XIII. K-Nearest Neighbor (K-NN) Classification
-
Q (Medium): Describe the basic idea of the K-Nearest Neighbor (K-NN) algorithm. A: K-NN stores all available cases and classifies new cases based on a decision function (usually a distance measure). It finds the K nearest neighbors to the new data point and assigns the new data point to the majority class among those neighbors.
-
Q (Long): Given a training dataset and a distance measure, describe the K-NN classification algorithm. A: (a minimal sketch in code follows this list)
- Input the new data point x.
- Compute the distance between x and every training sample xᵢ: dᵢ = d(x, xᵢ).
- Sort the distances in ascending order and rank the training samples accordingly: d(1) ≤ d(2) ≤ … ≤ d(N).
- For a 1-NN classifier, classify x to the class of the nearest neighbor (the sample at distance d(1)).
- For a K-NN classifier, classify x to the majority class among the top K ranked data points.
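A compact NumPy sketch of this K-NN rule with Euclidean distance (the dataset, labels, and K value are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_classify(np.array([2.0, 2.0]), X_train, y_train, k=3))   # -> 1
```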
-
Q (Short): What are two common distance measures used in K-NN? A: Euclidean distance (d(x, y) = √(Σₖ (xₖ − yₖ)²)) and city block distance (d(x, y) = Σₖ |xₖ − yₖ|).
-
Q (Medium): What is the effect of the value of ‘k’ in K-NN, and how does it affect the classifier’s resistance to outliers? A: The value of k has a smoothing effect. Larger values of k make the classifier more resistant to outliers but can also lead to misclassifications. The choice of k is usually determined empirically.
-
Q (Medium): What is weighted K-NN, and what is its advantage? A: It gives greater weight to nearer neighbors and less weight to farther ones. This helps overcome the misclassifications caused by weighting all k neighbors equally.
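One common choice of weights is the inverse of the distance; the notes do not specify a weight function, so the sketch below should be read as an assumption:

```python
import numpy as np

def weighted_knn_classify(x_new, X_train, y_train, k=3, eps=1e-9):
    """Weighted K-NN: each of the k nearest neighbors votes with weight 1/distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)        # class with the largest weighted vote

X_train = np.array([[1.0, 1.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.0]])
y_train = np.array([1, 1, 2, 2])
print(weighted_knn_classify(np.array([1.5, 1.2]), X_train, y_train, k=3))   # -> 1
```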
-
Q (Medium): What is a disadvantage of K-NN related to its dependence on training data? How does this contrast with other classifiers? A: A K-NN classifier is tied to the training data. It needs to “carry” all the training data for classification, unlike other classifiers that, once trained, can discard the training data.
-
Q (Short): What is a key advantage of a K-NN classifier related to its ability to classify nonlinearly separable data? A: It can classify data that is nonlinearly separable.
XIV. Support Vector Machine (SVM)
-
Q (Medium): What are the disadvantages of a linear classifier that SVM aims to address? A:
- The solution is either not optimal or computationally expensive.
- It cannot classify nonlinearly separable data.
-
Q (Medium): What are the disadvantages of a K-NN classifier that SVM aims to address? A:
- It’s difficult to choose ‘k’.
- Dependence on training data.
-
Q (Short): State the two primary goals of a basic (linear) SVM. A:
- Maximize the margin separating the two classes (optimal).
- Use only a few training data points (support vectors) to define the hyperplane (efficient).
-
Q (Short): What additional goal does a kernel-based SVM achieve? A: To be able to classify data that is nonlinearly separable.
-
Q (Medium): What is a perceptron, and how does its training process relate to that of a linear classifier? A: A perceptron is a binary linear classifier. Its training process is similar to that of the linear classifier, but it can perform online learning, processing training data one sample at a time.
-
Q (Long): Given a training dataset D = {(xᵢ, yᵢ)}, where yᵢ ∈ {+1, −1}, and the dot product w·x = Σⱼ wⱼxⱼ, write the formulation of a perceptron and its decision function. A:
- Let w ← (w, b) and x ← (x, 1) (augmented vectors); then h(xᵢ) = yᵢ(w·xᵢ), and the decision function is f(x) = sign(w·x).
-
Q (Long): Describe the perceptron training algorithm. A: (a minimal sketch in code follows this list)
1. Take the next training sample (xᵢ, yᵢ) ∈ D.
2. If h(xᵢ) ≥ 0, keep the weights: wₖ₊₁ ← wₖ.
3. If h(xᵢ) < 0, update the weights: wₖ₊₁ ← wₖ + ηyᵢxᵢ, with η > 0.
4. Repeat from step 1.
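A runnable sketch of this online update, using the augmented-vector formulation from the previous question (the data and the value of η are illustrative; treating h = 0 as a misclassification is a small practical tweak so the zero initialization still triggers updates):

```python
import numpy as np

# Toy linearly separable data with labels +1 and -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 1.0], [7.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

Xa = np.hstack([X, np.ones((len(X), 1))])   # augment x with 1 so w carries the bias
w = np.zeros(Xa.shape[1])
eta = 0.5                                   # learning rate, eta > 0

for epoch in range(1000):
    updated = False
    for xi, yi in zip(Xa, y):
        h = yi * (w @ xi)                   # h(x_i) = y_i (w . x_i)
        if h <= 0:                          # misclassified (boundary counted as an error)
            w = w + eta * yi * xi           # perceptron update
            updated = True
    if not updated:                         # a full pass with no updates: done
        break

print(np.sign(Xa @ w))                      # predicted labels on the training set
```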
-
Q (Long): Explain the concept of the margin between two classes in the context of SVM. Why is maximizing the margin important? A: The margin is the distance between the separating hyperplane and the closest data points from either class (the support vectors). Maximizing the margin leads to a more robust classifier that generalizes better to unseen data.
-
Q (Medium): Given the two inequalities for the two classes, w·xᵢ + b ≥ +1 for yᵢ = +1 and w·xᵢ + b ≤ −1 for yᵢ = −1, combine them into one inequality. A: yᵢ(w·xᵢ + b) ≥ 1 for all i.
-
Q (Medium): Define the equations for the hyperplanes H, H₁, and H₂ in terms of w, x, and b. A:
- H: w·x + b = 0
- H₁: w·x + b = +1
- H₂: w·x + b = −1
-
Q (Medium): What is the distance from the hyperplane H: w·x + b = 0 to the origin in n-dimensional space? A: |b|/||w||.
-
Q (Long): What is the margin between the two hyperplanes H₁ and H₂? What are the data points that lie on H₁ and H₂ called? A: The margin is 2/||w||. The data points on H₁ and H₂ are called support vectors.
-
Q (Long): Formulate the optimization problem for maximizing the margin (the primal form of the SVM). A: (a short illustration in code follows this list)
- Minimize: ½||w||²
- Subject to: yᵢ(w·xᵢ + b) ≥ 1 for i = 1, …, N
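In practice this quadratic program is handed to a solver; as an illustration, scikit-learn’s SVC with a linear kernel and a very large C approximates the hard-margin problem above (the toy dataset is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; labels follow the +1 / -1 convention above.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C makes the soft-margin solver behave (almost) like the
# hard-margin problem: minimize (1/2)||w||^2 s.t. y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("margin =", 2.0 / np.linalg.norm(w))    # margin between H1 and H2
print("support vectors:\n", clf.support_vectors_)
```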
-
Q (Long): Write the Lagrange function, Lₚ, for the SVM optimization problem. A: Lₚ = ½||w||² − Σᵢ αᵢ[yᵢ(w·xᵢ + b) − 1], where the αᵢ ≥ 0 are the Lagrange multipliers.
-
Q (Long): State the primal form of the SVM optimization problem. A:
- Minimize: Lₚ with respect to w and b
- Subject to: αᵢ ≥ 0, for i = 1, …, N