Support Vector Machines (SVMs) - Exam Preparation Notes (Continued)

XI. Introduction (From the New PDF)

  1. Q (Medium): Why is the “kernel trick” important for SVMs? A: It lets an SVM classify nonlinearly separable data by implicitly computing dot products in a higher-dimensional feature space without ever forming the mapping explicitly, which is what makes SVM one of the most effective and efficient machine learning tools.

  2. Q (Short): What two classifiers are combined to form the basis of an SVM? A: A linear classifier and a k-nearest neighbor (K-NN) classifier.

XII. Linear Classifier

  1. Q (Medium): How do Bayesian methods (mentioned in the context of linear classifiers) make decisions, and what is a limitation related to data distributions? A: Bayesian methods are model-based and give good decisions if the models are accurate. However, because data distributions are usually unknown, accurate models require a large number of training samples.

  2. Q (Long): Explain the alternative approach to Bayesian methods that uses a functional form decision boundary. What is a linear classifier an example of? A: Instead of modeling the data distribution, we assume a functional form for the decision boundary between classes. The parameters of this boundary are estimated from training data. A linear classifier is an example of this approach, assuming a linear decision boundary.

  3. Q (Medium): Given a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, write the general form of a linear discriminant function, $g(\mathbf{x})$. A: $g(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$ (or, equivalently, $g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0$, where $\mathbf{w} = (w_1, w_2, \ldots, w_n)$).

  4. Q (Short): In the linear discriminant function, what are the $x_i$ and $w_i$ terms? A: The $x_i$ are the variables (features), and the $w_i$ are the coefficients or weights.

  5. Q (Short): Geometrically, what does $g(\mathbf{x}) = 0$ represent? A: A hyperplane in $n$-dimensional space (and the decision boundary).

  6. Q (Short): How is a data point with feature vector $\mathbf{x}$ classified using the linear discriminant function? A:

    • If $g(\mathbf{x}) > 0$, $\mathbf{x}$ belongs to class 1.
    • If $g(\mathbf{x}) < 0$, $\mathbf{x}$ belongs to class 2.
  7. Q (Long): Describe the “classical” way to find the weights of a linear classifier (the theoretical solution). A: Solve a set of linear equations. For $n$-dimensional data, you need $n+1$ linear equations (i.e., $n+1$ samples) to solve for the $n+1$ weights. If $y_i$ represents the class label ($+1$ for class 1, $-1$ for class 2), and you have samples $(\mathbf{x}_i, y_i)$, you can set up the equations $g(\mathbf{x}_i) = y_i$ for $i = 1, 2, \ldots, n+1$. This can be written in matrix form as $X\mathbf{w} = \mathbf{y}$ and solved as $\mathbf{w} = X^{-1}\mathbf{y}$, as sketched below.
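
A minimal sketch of this square-system solution (NumPy; the toy 2-D data points are assumptions for illustration):

```python
import numpy as np

# n = 2 features, so n + 1 = 3 samples pin down the 3 weights (w0, w1, w2).
samples = np.array([[1.0, 2.0],   # class 1
                    [2.0, 3.0],   # class 1
                    [4.0, 1.0]])  # class 2
labels = np.array([1.0, 1.0, -1.0])

# Prepend a column of ones so that X @ w computes w0 + w1*x1 + w2*x2.
X = np.hstack([np.ones((3, 1)), samples])
w = np.linalg.solve(X, labels)  # solve X w = y exactly

# Classify a new point: class 1 if g(x) > 0, class 2 if g(x) < 0.
g = lambda x: w[0] + w[1] * x[0] + w[2] * x[1]
print("weights:", w, "g([1.5, 2.5]) =", g([1.5, 2.5]))
```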

  8. Q (Long): Describe the “optimal” solution for finding the weights of a linear classifier. What is minimized? A: The optimal solution minimizes the total squared error of $g(\mathbf{x})$ over the entire training dataset. If you have $N$ data points $(\mathbf{x}_i, y_i)$, you minimize $E = \sum_{i=1}^{N} \left( g(\mathbf{x}_i) - y_i \right)^2$. Taking the partial derivative of $E$ with respect to each $w_k$ and setting it to zero leads to a system of linear equations. The solution can be expressed as $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$ (the least-squares sketch below shows this).
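
A least-squares sketch under the same toy-data assumption (np.linalg.lstsq computes the $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$ solution in a numerically stable way):

```python
import numpy as np

# Assumed toy data: N = 6 points in 2-D, labels +1 / -1.
pts = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5],
                [4.0, 1.0], [5.0, 2.0], [4.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

X = np.hstack([np.ones((len(pts), 1)), pts])  # prepend the bias column

w, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes E = sum (g(x_i) - y_i)^2
print("weights:", w)
print("training predictions:", np.sign(X @ w))
```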

  9. Q (Medium): Explain the equations obtained after taking the partial derivative of the error. A: By taking the partial derivative of the total squared error, $E$, with respect to each weight, $w_k$, and setting the derivative to zero ($\partial E / \partial w_k = 0$), we get $n+1$ linear equations: $\sum_{i=1}^{N} 2 \left( g(\mathbf{x}_i) - y_i \right) x_{ik} = 0$, which simplifies to $\sum_{i=1}^{N} g(\mathbf{x}_i)\, x_{ik} = \sum_{i=1}^{N} y_i\, x_{ik}$, for $k = 0, 1, 2, \ldots, n$ (with $x_{i0} = 1$).

  10. Q (Long): Describe the iterative optimization procedure (suboptimal solution) for finding the weights of a linear classifier. A: (a Python sketch follows this list)

    1. Initialize the weights $w_0, w_1, \ldots, w_n$ with small random values.
    2. Take the next training sample $(\mathbf{x}_i, y_i)$, where $y_i = +1$ or $y_i = -1$.
    3. Compute $g(\mathbf{x}_i)$.
    4. If $y_i\, g(\mathbf{x}_i) \le 0$ (a misclassification), update the weights: $w_0 \leftarrow w_0 + \alpha\, y_i$ and $w_k \leftarrow w_k + \beta\, y_i\, x_{ik}$ for $k = 1, \ldots, n$, where $\alpha$ and $\beta$ are positive constants.
    5. Repeat steps 2-4 for all training samples until all samples are correctly classified or the weights stop changing.
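
A minimal Python sketch of these steps (the toy data, the learning rates $\alpha$ and $\beta$, and the epoch cap are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: two linearly separable 2-D classes, labels +1 / -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5],
              [4.0, 1.0], [5.0, 2.0], [4.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = rng.normal(scale=0.01, size=3)   # step 1: small random weights (w0, w1, w2)
alpha = beta = 0.1                   # positive constants for the updates

for epoch in range(100):             # cap the passes in case the data isn't separable
    changed = False
    for xi, yi in zip(X, y):         # step 2: next training sample
        g = w[0] + w[1:] @ xi        # step 3: compute g(x_i)
        if yi * g <= 0:              # step 4: misclassification -> update
            w[0] += alpha * yi
            w[1:] += beta * yi * xi
            changed = True
    if not changed:                  # step 5: stop once every sample is classified
        break

print("weights:", w)
```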
  11. Q (Medium): How can a linear classifier be extended to build a nonlinear classifier? Give an example. A: By mapping the original feature vector to a higher-dimensional space. For instance, a 2D vector $\mathbf{x} = (x_1, x_2)$ can be mapped to a 5D vector $\mathbf{z} = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2)$; a linear boundary in the 5D space is nonlinear in the original 2D space. This mapping corresponds to the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2$, since $\mathbf{z}_i \cdot \mathbf{z}_j + 1 = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2$.
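
A quick numerical check of that kernel correspondence (NumPy; the two sample points are arbitrary assumptions):

```python
import numpy as np

def phi(x):
    """Map a 2-D point to (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

lhs = phi(a) @ phi(b) + 1   # dot product computed in the mapped 5-D space
rhs = (a @ b + 1) ** 2      # polynomial kernel evaluated in the original 2-D space
print(lhs, rhs)             # both print 25.0: the kernel avoids the explicit mapping
```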

XIII. K-Nearest Neighbor (K-NN) Classification

  1. Q (Medium): Describe the basic idea of the K-Nearest Neighbor (K-NN) algorithm. A: K-NN stores all available cases and classifies new cases based on a similarity measure (usually a distance function). It finds the K nearest neighbors to the new data point and assigns the new data point to the majority class among those neighbors.

  2. Q (Long): Given a training dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ and a distance measure, describe the K-NN classification algorithm. A: (a Python sketch follows this list)

    1. Input the new data point $\mathbf{x}$.
    2. Compute the distance between $\mathbf{x}$ and all training samples $\mathbf{x}_i$: $d_i = d(\mathbf{x}, \mathbf{x}_i)$, $i = 1, \ldots, N$.
    3. Sort the distances in ascending order and rank the training samples accordingly: $d_{(1)} \le d_{(2)} \le \cdots \le d_{(N)}$.
    4. For a 1-NN classifier, classify $\mathbf{x}$ to the class of the nearest neighbor ($\mathbf{x}_{(1)}$).
    5. For a K-NN classifier, classify $\mathbf{x}$ to the majority class among the top $K$ ranked data points.
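
A minimal sketch of these steps (NumPy; the toy data and $k = 3$ are assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))      # step 2: Euclidean distances
    nearest = np.argsort(dists)[:k]                        # step 3: rank, keep top k
    return Counter(y_train[nearest]).most_common(1)[0][0]  # steps 4-5: majority class

# Assumed toy data: labels +1 / -1.
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5],
                    [4.0, 1.0], [5.0, 2.0], [4.5, 0.5]])
y_train = np.array([1, 1, 1, -1, -1, -1])
print(knn_classify(np.array([1.5, 2.0]), X_train, y_train, k=3))  # -> 1
```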
  3. Q (Short): What are two common distance measures used in K-NN? A: Euclidean distance ($d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$) and city block distance ($d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$).

  4. Q (Medium): What is the effect of the value of ‘k’ in K-NN, and how does it affect the classifier’s resistance to outliers? A: The value of $k$ has a smoothing effect. Larger values of $k$ make the classifier more resistant to outliers but can also lead to misclassifications. The choice of $k$ is usually determined empirically.

  5. Q (Medium): What is weighted K-NN, and what is its advantage? A: It gives greater weight to nearer neighbors and less weight to farther ones. This reduces the misclassifications that occur when distant neighbors dominate a plain majority vote; a sketch follows.
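
A minimal weighted K-NN sketch (the inverse-distance weights $1/d$ are one common weighting scheme, assumed here):

```python
import numpy as np

def weighted_knn_classify(x, X_train, y_train, k=3, eps=1e-9):
    """Weighted K-NN: each of the k nearest neighbors votes with weight 1/distance."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    votes = {}
    for i in np.argsort(dists)[:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)  # class with the largest total weight

# Assumed toy data: labels +1 / -1.
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 2.0]])
y_train = np.array([1, 1, -1, -1])
print(weighted_knn_classify(np.array([3.5, 1.5]), X_train, y_train, k=3))  # -> -1
```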

  6. Q (Medium): What is a disadvantage of K-NN related to its dependence on training data? How does this contrast with other classifiers? A: A K-NN classifier is tied to the training data. It needs to “carry” all the training data for classification, unlike other classifiers that, once trained, can discard the training data.

  7. Q (Short): What is a key advantage of a K-NN classifier related to its ability to classify nonlinearly separable data? A: It can classify data that is nonlinearly separable.

XIV. Support Vector Machine (SVM)

  1. Q (Medium): What are the disadvantages of a linear classifier that SVM aims to address? A:

    • The solution is either not optimal or computationally expensive.
    • It cannot classify nonlinearly separable data.
  2. Q (Medium): What are the disadvantages of a K-NN classifier that SVM aims to address? A:

    • It’s difficult to choose ‘k’.
    • It depends on, and must carry, all the training data.
  3. Q (Short): State the two primary goals of a basic (linear) SVM. A:

    • Maximize the margin separating the two classes (optimal).
    • Use only a few training data points (support vectors) to define the hyperplane (efficient).
  4. Q (Short): What additional goal does a kernel-based SVM achieve? A: To be able to classify data that is nonlinearly separable.

  5. Q (Medium): What is a perceptron, and how does its training process relate to that of a linear classifier? A: A perceptron is a binary linear classifier. Its training process is similar to that of a linear classifier, but it can perform online learning, processing training data one sample at a time.

  6. Q (Long): Given a training dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$, and the dot product $\mathbf{w} \cdot \mathbf{x} = \sum_{k=0}^{n} w_k x_k$, write the formulation of a perceptron and its decision function. A:

    1. Let $\mathbf{w} = (w_0, w_1, \ldots, w_n)$ and $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{in})$ (the leading 1 absorbs the bias $w_0$); then $h(\mathbf{x}_i) = y_i (\mathbf{w} \cdot \mathbf{x}_i)$, and the decision function assigns $\mathbf{x}$ to class 1 if $\mathbf{w} \cdot \mathbf{x} > 0$ and to class 2 otherwise.
  7. Q (Long): Describe the perceptron training algorithm. A: (a Python sketch follows)

    1. Take the next training sample $(\mathbf{x}_i, y_i) \in D$.
    2. If $h(\mathbf{x}_i) \ge 0$, keep the weights: $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k$.
    3. If $h(\mathbf{x}_i) < 0$, update the weights: $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k + \eta\, y_i \mathbf{x}_i$, where $\eta > 0$ is the learning rate.
    4. Repeat from step 1.
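
A minimal perceptron sketch (the toy data, learning rate, and epoch cap are assumptions; the test uses $h(\mathbf{x}_i) \le 0$ so the all-zero start still triggers updates):

```python
import numpy as np

# Assumed toy data: two linearly separable 2-D classes, labels +1 / -1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5],
              [4.0, 1.0], [5.0, 2.0], [4.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

Xa = np.hstack([np.ones((len(X), 1)), X])  # leading 1 absorbs the bias w0
w = np.zeros(Xa.shape[1])
eta = 0.1                                  # learning rate, eta > 0

for epoch in range(100):                   # cap the passes in case data isn't separable
    updated = False
    for xi, yi in zip(Xa, y):
        if yi * (w @ xi) <= 0:             # h(x_i) = y_i (w . x_i): misclassified
            w = w + eta * yi * xi          # w_{k+1} = w_k + eta * y_i * x_i
            updated = True
    if not updated:
        break

print("weights:", w)
```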

  8. Q (Long): Explain the concept of the margin between two classes in the context of SVM. Why is maximizing the margin important? A: The margin is the distance between the separating hyperplane and the closest data points from either class (the support vectors). Maximizing the margin leads to a more robust classifier that generalizes better to unseen data.

  9. Q (Medium): Given the two inequalities for the two classes: $\mathbf{w} \cdot \mathbf{x}_i + b \ge +1$ for $y_i = +1$ and $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ for $y_i = -1$, combine them into one inequality. A: $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0$ for all $i$.

  10. Q (Medium): Define the equations for the hyperplanes $H$, $H_1$, and $H_2$ in terms of $\mathbf{w}$, $\mathbf{x}$, and $b$. A:

    • $H$: $\mathbf{w} \cdot \mathbf{x} + b = 0$ (the separating hyperplane)
    • $H_1$: $\mathbf{w} \cdot \mathbf{x} + b = +1$
    • $H_2$: $\mathbf{w} \cdot \mathbf{x} + b = -1$
  11. Q (Medium): What is the distance from the hyperplane $H$ to the origin in $n$-dimensional space? A: $|b| / \|\mathbf{w}\|$.

  12. Q (Long): What is the margin between the two hyperplanes $H_1$ and $H_2$? What are the data points that lie on $H_1$ and $H_2$ called? A: The margin is $2 / \|\mathbf{w}\|$ (derived below). The data points on $H_1$ and $H_2$ are called support vectors.
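
To see where $2 / \|\mathbf{w}\|$ comes from, rewrite both hyperplanes as level sets of $\mathbf{w} \cdot \mathbf{x}$ and apply the distance formula for parallel hyperplanes:

$$
H_1:\ \mathbf{w} \cdot \mathbf{x} = 1 - b, \qquad H_2:\ \mathbf{w} \cdot \mathbf{x} = -1 - b
$$

$$
\text{margin} = \frac{\left| (1 - b) - (-1 - b) \right|}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}
$$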

  13. Q (Long): Formulate the optimization problem for maximizing the margin (the primal form of the SVM). A: (a solver sketch follows this list)

    • Minimize: $\frac{1}{2} \|\mathbf{w}\|^2$
    • Subject to: $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0$ for $i = 1, \ldots, N$
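
A minimal sketch of solving this problem with scikit-learn's linear SVC (the toy data is an assumption; a large C approximates the hard-margin problem stated above):

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two linearly separable 2-D classes.
X = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5],
              [4.0, 1.0], [5.0, 2.0], [4.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

print("w:", clf.coef_[0], "b:", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)      # the points lying on H1 / H2
print("margin:", 2 / np.linalg.norm(clf.coef_[0]))   # 2 / ||w||
```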
  14. Q (Long): Write the Lagrange function, $L_P$, for the SVM optimization problem. A: $L_P = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$, where the $\alpha_i \ge 0$ are Lagrange multipliers.

  15. Q (Long): State the primal form of the SVM optimization problem. A:

    • Minimize: $L_P$ with respect to $\mathbf{w}$ and $b$
    • Subject to: $\alpha_i \ge 0$, $i = 1, \ldots, N$