4. Max-Margin Classifier
Functional margin: $\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)$.
Geometric margin: $\gamma^{(i)} = y^{(i)}\left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$.
We feel more confident in a prediction when the functional margin is larger. Note that rescaling $(w, b)$ does not change the separating plane, and the geometric margin is invariant to such rescaling.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
5. Maximize the Margin
Optimization problem: maximize the minimal geometric margin subject to the classification constraints:
$$\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \gamma, \ \|w\| = 1.$$
Introduce a scaling factor such that $\hat{\gamma} = 1$; the problem then becomes the convex quadratic program
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \ i = 1, \dots, m.$$
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
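Spelling out the scaling step (a reconstruction of the standard argument in the cited notes, not text from the slide): the geometric margin is the functional margin normalized by $\|w\|$,
$$\gamma = \frac{\hat{\gamma}}{\|w\|}, \qquad \text{so fixing} \quad \hat{\gamma} = \min_i \, y^{(i)}(w^T x^{(i)} + b) = 1$$
$$\Rightarrow \quad \max_{w, b} \ \gamma = \max_{w, b} \ \frac{1}{\|w\|} \ \Longleftrightarrow \ \min_{w, b} \ \frac{1}{2}\|w\|^2.$$
Since rescaling $(w, b)$ leaves the plane unchanged (slide 4), imposing $\hat{\gamma} = 1$ costs no generality.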
6. Optimization Subject to Constraints
Maximize $f(x, y)$ subject to the constraint $g(x, y) = c$ → the Lagrange multiplier method.
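A toy worked example of the method (mine, not the slide's): maximize $f(x, y) = xy$ subject to $g(x, y) = x + y = 2$.
$$\mathcal{L}(x, y, \lambda) = xy - \lambda(x + y - 2)$$
$$\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 2) = 0$$
$$\Rightarrow \quad x = y = \lambda = 1, \qquad f(x, y) = 1.$$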
7. Lagrange Duality
Primal optimization problem:
$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ h_i(w) = 0.$$
Generalized Lagrangian:
$$\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w).$$
Primal optimization problem (equivalent min-max form): $p^* = \min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$.
Dual optimization problem: $d^* = \max_{\alpha, \beta : \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta)$, and in general $d^* \le p^*$.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
8. Dual Problem
Equality ($d^* = p^*$) holds when $f$ and the $g_i$ are convex and the $h_i$ are affine; the primal and dual optima then satisfy the KKT conditions.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
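Written out (following the cited notes), the KKT conditions at the optimum $(w^*, \alpha^*, \beta^*)$ are:
$$\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad \frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0,$$
$$\alpha_i^* \, g_i(w^*) = 0 \ \text{(complementary slackness)}, \qquad g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0.$$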
9. Optimal Margin Classifier
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1.$$
Its Lagrangian:
$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right].$$
Its dual problem:
$$\max_\alpha \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i, j = 1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.} \quad \alpha_i \ge 0, \ \sum_{i=1}^m \alpha_i y^{(i)} = 0.$$
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
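A minimal numerical sketch of this dual (my code, not the slides'; it assumes SciPy's generic SLSQP solver, whereas the cited notes use the SMO algorithm, and `fit_hard_margin_svm` and the toy data are mine):

```python
import numpy as np
from scipy.optimize import minimize

# Solve the hard-margin dual: max_a sum_i a_i - 1/2 sum_ij y_i y_j a_i a_j <x_i, x_j>
# subject to a_i >= 0 and sum_i a_i y_i = 0.
def fit_hard_margin_svm(X, y):
    m = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j <x_i, x_j>

    def neg_dual(a):                            # minimize the negated dual objective
        return 0.5 * a @ G @ a - a.sum()

    res = minimize(
        neg_dual,
        x0=np.zeros(m),
        method="SLSQP",
        bounds=[(0, None)] * m,                               # a_i >= 0
        constraints=[{"type": "eq", "fun": lambda a: a @ y}], # sum_i a_i y_i = 0
    )
    a = res.x
    w = ((a * y)[:, None] * X).sum(axis=0)      # w = sum_i a_i y_i x_i
    sv = a > 1e-6                               # support vectors have a_i > 0
    b = np.mean(y[sv] - X[sv] @ w)              # from y_i (w.x_i + b) = 1 at the margin
    return w, b

# Toy usage on a linearly separable set; labels must be in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(fit_hard_margin_svm(X, y))
```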
10. Support Vector Machine (cont'd)
If the data are not linearly separable, we can find a nonlinear solution. Technically, it is still a linear solution, just in a higher-dimensional feature space: the kernel trick.
11. Kernel and Feature Mapping
Kernel: $K(x, z) = \phi(x)^T \phi(z)$ for some feature mapping $\phi$. A valid kernel's Gram matrix is positive semi-definite and symmetric (Mercer's condition). For example, the Gaussian kernel $K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$.
Loose intuition: $K(x, z)$ measures the "similarity" between the features $\phi(x)$ and $\phi(z)$.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
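A small check of the two properties above, assuming NumPy, a Gaussian kernel, and random data (a sketch of mine, not the slide's example):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # PSD, up to rounding error
```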
12. Soft Margin (L1 Regularization)
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0.$$
C = ∞ leads to the hard-margin SVM.
Rychetsky (2001). Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
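To see the role of C concretely, a sketch assuming scikit-learn and synthetic blobs (not from the slides): a very large C approximates the hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 1e6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)   # margin violations (and support vectors) shrink as C grows
```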
14. Bias/Variance Tradeoff
Underfitting (high bias) vs. overfitting (high variance).
Training error (in-sample error): $\hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^m 1\{ h(x^{(i)}) \ne y^{(i)} \}$.
Generalization error (out-of-sample error): $\varepsilon(h) = P_{(x, y) \sim \mathcal{D}}(h(x) \ne y)$.
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
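A synthetic illustration (mine, not the deck's): as polynomial degree grows, training error keeps falling while test error typically rises again once the model overfits.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(-1, 1, 30)
x_te = rng.uniform(-1, 1, 300)
y_tr = f(x_tr) + rng.normal(0, 0.3, x_tr.size)
y_te = f(x_te) + rng.normal(0, 0.3, x_te.size)

for deg in (1, 5, 15):
    coef = np.polyfit(x_tr, y_tr, deg)
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)  # in-sample error
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)  # out-of-sample proxy
    print(deg, round(err_tr, 3), round(err_te, 3))
```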
15. Bias/Variance Tradeoff
[Figure illustrating the tradeoff, from Hastie et al.]
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.
17. Chernoff Bound (|H| finite)
Lemma: Assume $Z_1, Z_2, \dots, Z_m$ are drawn iid from Bernoulli($\phi$), let $\hat{\phi} = \frac{1}{m} \sum_{i=1}^m Z_i$, and let $\gamma > 0$ be fixed. Then
$$P(|\phi - \hat{\phi}| > \gamma) \le 2 \exp(-2 \gamma^2 m).$$
Based on this lemma, one can show that, with probability $1 - \delta$ (k = number of hypotheses),
$$\varepsilon(\hat{h}) \le \left( \min_{h \in H} \varepsilon(h) \right) + 2 \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
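A Monte Carlo sanity check of the lemma, assuming NumPy (not from the slides): the empirical frequency of large deviations should sit below the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.05, 100_000

Z = rng.random((trials, m)) < phi            # trials x m Bernoulli(phi) draws
phi_hat = Z.mean(axis=1)
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma ** 2 * m)
print(empirical, bound)                      # empirical frequency <= bound
```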
18. Chernoff Bound (|H| infinite)
VC dimension $d$: the size of the largest set that $H$ can shatter.
E.g., for H = linear classifiers in 2-D, VC(H) = 3: three points in general position can be labeled arbitrarily, but no set of four can (the XOR labeling is not realizable).
With probability at least $1 - \delta$,
$$\varepsilon(\hat{h}) \le \varepsilon(h^*) + O\left( \sqrt{\frac{d}{m} \log \frac{m}{d}} + \frac{1}{m} \log \frac{1}{\delta} \right).$$
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
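A brute-force illustration of the VC(H) = 3 claim, assuming NumPy; `shatters` is my helper, which samples random affine classifiers and counts the distinct labelings they realize.

```python
import numpy as np

def shatters(points, n_candidates=200_000):
    rng = np.random.default_rng(0)
    X = np.hstack([points, np.ones((len(points), 1))])  # affine classifier: w.x + b
    W = rng.normal(size=(n_candidates, 3))
    labelings = {tuple(row) for row in (X @ W.T > 0).T.astype(int)}
    return len(labelings) == 2 ** len(points)

tri = np.array([[0, 0], [1, 0], [0, 1]])              # 3 points in general position
xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])      # 4 points, XOR configuration
print(shatters(tri))   # True: all 8 labelings realizable, so VC(H) >= 3
print(shatters(xor))   # False: the XOR labeling is never achieved
```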
21. Model Selection
Loop over the possible parameters: pick one parameter setting, e.g. C = 2.0; run cross-validation to get an error estimate; then pick C_best (the setting with the minimal error estimate) as the final parameter (see the sketch below).
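A sketch of this selection loop, assuming scikit-learn and synthetic data; the candidate grid for C is arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

best_C, best_err = None, np.inf
for C in (0.1, 0.5, 2.0, 10.0, 50.0):
    # 5-fold cross-validation gives an error estimate for this C
    err = 1 - cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5).mean()
    if err < best_err:
        best_C, best_err = C, err
print(best_C, best_err)
```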
22. Multiclass SVM
One against one: there are $k(k-1)/2$ binary SVMs (1v2, 1v3, …). To predict, each SVM votes between its 2 classes, and the class with the most votes wins (see the voting sketch below).
One against all: there are $k$ binary SVMs (1 v rest, 2 v rest, …). To predict, evaluate $w_j^T x + b_j$ for each class $j$ and pick the largest.
Multiclass SVM by solving ONE optimization problem:
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.
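A sketch of one-against-one voting, assuming scikit-learn (whose SVC uses this scheme internally); `ovo_fit_predict` and the three-blob data are mine.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def ovo_fit_predict(X, y, X_new):
    classes = np.unique(y)
    votes = np.zeros((len(X_new), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):   # k(k-1)/2 class pairs
        mask = np.isin(y, [classes[i], classes[j]])
        clf = SVC(kernel="linear").fit(X[mask], y[mask])
        pred = clf.predict(X_new)
        for c in (i, j):                                # each binary SVM casts one vote
            votes[:, c] += (pred == classes[c])
    return classes[votes.argmax(axis=1)]                # majority vote wins

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 30)
print(ovo_fit_predict(X, y, np.array([[-2.0, -2.0], [2.1, 1.9]])))
```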