2. Table of Contents
Support Vector Machines
Least Squares Support Vector Machine Classifier
Conclusion
3. Support Vector Machines
SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
SVMs were introduced by Boser, Guyon and Vapnik at COLT-92.
Initially popularized in the NIPS community, SVMs are now an important and active field of machine learning research.
What is an SVM?
SVMs are learning systems that
use a hypothesis space of linear functions
in a high-dimensional feature space (kernel functions),
trained with a learning algorithm from optimization theory (Lagrange multipliers),
implementing a learning bias derived from statistical learning theory (generalization).
4. Support Vector Machines for Classification
Given a training set of $N$ data points $\{y_k, x_k\}_{k=1}^{N}$, the support vector method approach aims at constructing a classifier of the form
$$y(x) = \mathrm{sign}\Big[\sum_{k=1}^{N} \alpha_k y_k \,\psi(x, x_k) + b\Big] \qquad (1)$$
where
$x_k \in \mathbb{R}^n$ is the $k$-th input pattern,
$y_k \in \mathbb{R}$ is the $k$-th output,
$\alpha_k$ are positive real constants and $b$ is a real constant.
Typical kernel choices are
$$\psi(x, x_k) = \begin{cases} x_k^T x & \text{(linear SVM)} \\ (x_k^T x + 1)^d & \text{(polynomial SVM of degree } d\text{)} \\ \exp\{-\|x - x_k\|_2^2 / \sigma^2\} & \text{(RBF SVM)} \\ \tanh(\kappa\, x_k^T x + \theta) & \text{(two-layer neural SVM)} \end{cases}$$
where $\sigma$, $\theta$, $\kappa$ are constants.
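For concreteness, here is a minimal NumPy sketch of the kernel choices above and of the decision rule (1). The data and the $\alpha_k$, $b$ values are placeholders for illustration only; they would normally come from the training procedure described on the following slides.

```python
import numpy as np

def psi(x, xk, kind="rbf", d=3, sigma=1.0, kappa=1.0, theta=0.0):
    # The kernel functions psi(x, x_k) listed above.
    if kind == "linear":
        return xk @ x
    if kind == "poly":
        return (xk @ x + 1.0) ** d
    if kind == "rbf":
        return np.exp(-np.sum((x - xk) ** 2) / sigma**2)
    if kind == "neural":
        return np.tanh(kappa * (xk @ x) + theta)
    raise ValueError(kind)

def classify(x, X_sv, y_sv, alpha, b, **kw):
    # Decision rule (1): sign of the kernel expansion plus bias.
    s = sum(a * yk * psi(x, xk, **kw) for a, yk, xk in zip(alpha, y_sv, X_sv))
    return np.sign(s + b)

# Placeholder values (alpha and b are NOT trained here, purely illustrative).
X_sv = np.array([[1.0, 2.0], [-1.0, -0.5]])
y_sv = np.array([+1.0, -1.0])
alpha = np.array([0.7, 0.7])
b = 0.0
print(classify(np.array([0.8, 1.5]), X_sv, y_sv, alpha, b, kind="rbf"))
```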
5. SVMs for Classification
The classifier is constructed as follows. One assumes that
$$\begin{cases} \omega^T \phi(x_k) + b \geq +1, & \text{if } y_k = +1 \\ \omega^T \phi(x_k) + b \leq -1, & \text{if } y_k = -1 \end{cases} \qquad (2)$$
which is equivalent to
$$y_k[\omega^T \phi(x_k) + b] \geq 1, \quad k = 1, \ldots, N \qquad (3)$$
where $\phi(\cdot)$ is a non-linear function which maps the input space into a higher dimensional space.
In order to have the possibility to violate (3), in case a separating hyperplane in this high dimensional space does not exist, slack variables $\xi_k$ are introduced such that
$$y_k[\omega^T \phi(x_k) + b] \geq 1 - \xi_k, \quad \xi_k \geq 0, \quad k = 1, \ldots, N \qquad (4)$$
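To make the role of the slack variables concrete, the small sketch below (assuming NumPy and, purely for illustration, the identity feature map $\phi(x) = x$) computes the smallest $\xi_k$ that makes constraint (4) feasible for a candidate $(\omega, b)$, i.e. $\xi_k = \max(0,\, 1 - y_k[\omega^T \phi(x_k) + b])$.

```python
import numpy as np

# Toy data; phi is taken as the identity here as a simplifying assumption
# (in general it maps into a higher-dimensional feature space).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

# A candidate hyperplane (omega, b); not necessarily optimal.
omega = np.array([1.0, 1.0])
b = 0.0

# Margins y_k [omega^T phi(x_k) + b] appearing in constraint (3).
margins = y * (X @ omega + b)

# Smallest slacks xi_k making constraint (4) feasible:
# xi_k = max(0, 1 - margin_k); points with margin >= 1 need no slack.
xi = np.maximum(0.0, 1.0 - margins)
print("margins:", margins, "slacks:", xi)
```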
6. SVMs for Classification
According to the structural risk minimization principle, the risk bound is minimized by formulating the optimization problem
$$\min_{\omega, \xi_k} J_1(\omega, \xi_k) = \frac{1}{2}\omega^T \omega + c \sum_{k=1}^{N} \xi_k \qquad (5)$$
subject to (4). Therefore, one constructs the Lagrangian
$$L_1(\omega, b, \xi_k; \alpha_k, \nu_k) = J_1(\omega, \xi_k) - \sum_{k=1}^{N} \alpha_k \{y_k[\omega^T \phi(x_k) + b] - 1 + \xi_k\} - \sum_{k=1}^{N} \nu_k \xi_k \qquad (6)$$
by introducing Lagrange multipliers $\alpha_k \geq 0$, $\nu_k \geq 0$ $(k = 1, \ldots, N)$.
The solution is given by the saddle point of the Lagrangian by computing
$$\max_{\alpha_k, \nu_k} \min_{\omega, b, \xi_k} L_1(\omega, b, \xi_k; \alpha_k, \nu_k). \qquad (7)$$
7. SVMs for Classification
From (7) one obtains
$$\frac{\partial L_1}{\partial \omega} = 0 \;\rightarrow\; \omega = \sum_{k=1}^{N} \alpha_k y_k \phi(x_k), \qquad \frac{\partial L_1}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} \alpha_k y_k = 0, \qquad \frac{\partial L_1}{\partial \xi_k} = 0 \;\rightarrow\; 0 \leq \alpha_k \leq c, \quad k = 1, \ldots, N \qquad (8)$$
which leads to the solution of the following quadratic programming problem
$$\max_{\alpha_k} Q_1(\alpha_k; \phi(x_k)) = -\frac{1}{2} \sum_{k,l=1}^{N} y_k y_l \,\phi(x_k)^T \phi(x_l)\, \alpha_k \alpha_l + \sum_{k=1}^{N} \alpha_k \qquad (9)$$
such that $\sum_{k=1}^{N} \alpha_k y_k = 0$, $0 \leq \alpha_k \leq c$, $k = 1, \ldots, N$. The function $\phi(x_k)$ in (9) is related to $\psi(x, x_k)$ by imposing
$$\phi(x)^T \phi(x_k) = \psi(x, x_k). \qquad (10)$$
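As an illustrative sketch rather than the procedure prescribed here, the dual problem (9) can be handed to a general-purpose solver. The code below assumes SciPy's SLSQP solver is available, uses an RBF kernel in place of $\phi(x_k)^T\phi(x_l)$ via (10), and takes the box and equality constraints directly from (9).

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, sigma=1.0):
    # Gram matrix psi(x_k, x_l) = exp(-||x_k - x_l||^2 / sigma^2), cf. (1).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma**2)

def solve_dual(X, y, c=10.0, sigma=1.0):
    N = len(y)
    K = rbf_kernel(X, sigma)
    H = (y[:, None] * y[None, :]) * K              # y_k y_l psi(x_k, x_l)

    # Minimize the negative of Q_1 from (9)/(11).
    obj = lambda a: 0.5 * a @ H @ a - a.sum()
    grad = lambda a: H @ a - np.ones(N)

    cons = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_k alpha_k y_k = 0
    bounds = [(0.0, c)] * N                           # 0 <= alpha_k <= c
    res = minimize(obj, np.zeros(N), jac=grad,
                   method="SLSQP", bounds=bounds, constraints=cons)
    return res.x

# Tiny toy example.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print("support values:", np.round(solve_dual(X, y), 3))
```

A dedicated QP solver would normally be used instead; the point of the sketch is only to show how (9) maps onto an optimizer's objective, bounds and equality constraint.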
8. SVMs for Classification
Note that for the two-layer neural SVM, Mercer's condition only holds for certain parameter values of $\kappa$ and $\theta$. The classifier (1) is designed by solving
$$\max_{\alpha_k} Q_1(\alpha_k; \psi(x_k, x_l)) = -\frac{1}{2}\sum_{k,l=1}^{N} y_k y_l \,\psi(x_k, x_l)\,\alpha_k \alpha_l + \sum_{k=1}^{N} \alpha_k \qquad (11)$$
subject to the constraints in (9). One does not have to calculate $\omega$ nor $\phi(x_k)$ in order to determine the decision surface. The solution to (11) will be global.
Further, it can be shown that a hyperplane (3) satisfying the constraint $\|\omega\|_2 \leq a$ has a VC dimension $h$ which is bounded by
$$h \leq \min([r^2 a^2], n) + 1 \qquad (12)$$
where $[\cdot]$ denotes the integer part and $r$ is the radius of the smallest ball containing the points $\phi(x_1), \ldots, \phi(x_N)$. Such a ball is found by defining the Lagrangian
$$L_2(r, q, \lambda_k) = r^2 - \sum_{k=1}^{N} \lambda_k \big(r^2 - \|\phi(x_k) - q\|_2^2\big) \qquad (13)$$
9. SVMs for Classification
In (13), $q$ is the center of the ball and $\lambda_k$ are positive Lagrange multipliers. Here $q = \sum_k \lambda_k \phi(x_k)$, where the $\lambda_k$ follow from
$$\max_{\lambda_k} Q_2(\lambda_k; \phi(x_k)) = -\sum_{k,l=1}^{N} \phi(x_k)^T \phi(x_l)\,\lambda_k \lambda_l + \sum_{k=1}^{N} \lambda_k \,\phi(x_k)^T \phi(x_k) \qquad (14)$$
such that $\sum_{k=1}^{N} \lambda_k = 1$, $\lambda_k \geq 0$, $k = 1, \ldots, N$. Based on (10), $Q_2$ can also be expressed in terms of $\psi(x_k, x_l)$. Finally, one selects a support vector machine VC dimension by solving (11) and computing (12) from (14).
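To make this model-selection step concrete, the following sketch (same assumed NumPy/SciPy setup and kernel matrix as in the previous example) solves the ball problem (14) in kernel form and evaluates the VC bound (12) for a given value of $a$; the identification of the optimal $Q_2$ with $r^2$ relies on strong duality for the enclosing-ball problem.

```python
import numpy as np
from scipy.optimize import minimize

def ball_radius_sq(K):
    # Dual of the smallest enclosing ball, problem (14), written with the
    # kernel matrix K_kl = psi(x_k, x_l) via (10).
    N = K.shape[0]
    obj = lambda lam: lam @ K @ lam - lam @ np.diag(K)   # negative of Q_2
    cons = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * N
    res = minimize(obj, np.full(N, 1.0 / N),
                   method="SLSQP", bounds=bounds, constraints=cons)
    lam = res.x
    # At the optimum, Q_2 equals r^2 (strong duality assumption).
    return lam @ np.diag(K) - lam @ K @ lam

def vc_bound(K, a, n):
    # Bound (12): h <= min([r^2 a^2], n) + 1, with [.] the integer part.
    r2 = ball_radius_sq(K)
    return min(int(np.floor(r2 * a**2)), n) + 1

# Example usage, reusing rbf_kernel and the toy X from the previous sketch:
#   K = rbf_kernel(X, sigma=1.0)
#   print(vc_bound(K, a=2.0, n=X.shape[1]))
```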
10. Least Squares Support Vector Machines
A least squares version of the SVM classifier is obtained by formulating the classification problem as
$$\min_{\omega, b, e} J_3(\omega, b, e) = \frac{1}{2}\omega^T \omega + \gamma\,\frac{1}{2}\sum_{k=1}^{N} e_k^2 \qquad (15)$$
subject to the equality constraints
$$y_k[\omega^T \phi(x_k) + b] = 1 - e_k, \quad k = 1, \ldots, N. \qquad (16)$$
The Lagrangian is defined as
$$L_3(\omega, b, e; \alpha) = J_3(\omega, b, e) - \sum_{k=1}^{N} \alpha_k\{y_k[\omega^T \phi(x_k) + b] - 1 + e_k\} \qquad (17)$$
where $\alpha_k$ are Lagrange multipliers. The conditions for optimality
$$\frac{\partial L_3}{\partial \omega} = 0 \;\rightarrow\; \omega = \sum_{k=1}^{N} \alpha_k y_k \phi(x_k), \qquad \frac{\partial L_3}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} \alpha_k y_k = 0, \qquad (18)$$
11. Least Squares Support Vector Machines
$$\frac{\partial L_3}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N, \qquad \frac{\partial L_3}{\partial \alpha_k} = 0 \;\rightarrow\; y_k[\omega^T \phi(x_k) + b] - 1 + e_k = 0, \quad k = 1, \ldots, N$$
can be written as the solution to the following set of linear equations
$$\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix}\begin{bmatrix} \omega \\ b \\ e \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix} \qquad (19)$$
where
$Z = [\phi(x_1)^T y_1; \ldots; \phi(x_N)^T y_N]$, $Y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$.
12. Least Squares Support Vector Machines
The solution is given by
$$\begin{bmatrix} 0 & -Y^T \\ Y & ZZ^T + \gamma^{-1} I \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix} \qquad (20)$$
Mercer's condition can be applied again to the matrix $\Omega = ZZ^T$, where
$$\Omega_{kl} = y_k y_l\,\phi(x_k)^T \phi(x_l) = y_k y_l\,\psi(x_k, x_l). \qquad (21)$$
Hence the classifier (1) is found by solving the linear equations (20)-(21) instead of quadratic programming. The parameters of the kernel, such as $\sigma$ for the RBF kernel, can be optimally chosen according to (12). The support values $\alpha_k$ are proportional to the errors at the data points by (18), while for the standard SVM problem (11) most values are equal to zero. Hence one could rather speak of a support value spectrum in the least squares case.
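As a minimal sketch, assuming NumPy and the RBF kernel choice used in the earlier examples (an illustration rather than a reference implementation), the LS-SVM classifier can be trained by assembling $\Omega$ from (21) and solving the linear system (20):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # psi(x, x_k) = exp(-||x - x_k||^2 / sigma^2), cf. (1).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / sigma**2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)   # eq. (21)
    # Assemble and solve the linear system (20).
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = -y                       # first row: -Y^T, i.e. sum_k alpha_k y_k = 0
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # b, alpha

def lssvm_predict(X_train, y_train, b, alpha, X_new, sigma=1.0):
    # Classifier (1): sign of the kernel expansion plus bias.
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)

# Toy usage.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, alpha = lssvm_train(X, y)
print(lssvm_predict(X, y, b, alpha, X))
```

Since (20) is a single linear system of size $N+1$, a standard dense solver suffices here, which is the low computational cost highlighted in the conclusion.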
13. Conclusion
Due to the equality constraints, a set of linear equations has to be solved instead of a quadratic programming problem.
Mercer's condition is applied as in other SVMs.
A least squares SVM with RBF kernel is readily found, with excellent generalization performance and low computational cost.
References
1. J.A.K. Suykens and J. Vandewalle, "Least Squares Support Vector Machine Classifiers," Neural Processing Letters, 1999.