Lecture 3: Linear SVM with slack variables
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014
March 9, 2014
The non separable case

[Figure: two-class data in the plane for which no error-free linear separation exists.]
Road map

1 Linear SVM
  The non separable case
  The C (L1) SVM
  The L2 SVM and others
  The hinge loss

[Figure: slack variable illustration.]
The non separable case: a bi criteria optimization problem

Modeling potential errors: introducing slack variables ξi. For each data point (xi, yi):
  no error:  yi(w⊤xi + b) ≥ 1  ⇒  ξi = 0
  error:     ξi = 1 − yi(w⊤xi + b) > 0

[Figure: slack variable illustration.]

The two criteria, to be minimized jointly:

$$
\begin{cases}
\min_{w,b,\xi}\ \dfrac{1}{2}\|w\|^2 \\[4pt]
\min_{w,b,\xi}\ \dfrac{C}{p}\displaystyle\sum_{i=1}^n \xi_i^p \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
\end{cases}
$$

Our hope: almost all ξi = 0
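The slacks are cheap to evaluate once a candidate (w, b) is available. Below is a minimal numpy sketch (not from the original slides; the function name and toy data are made up for illustration):

```python
import numpy as np

def slacks(X, y, w, b):
    """xi_i = max(0, 1 - y_i (w'x_i + b)) for each training point."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# made-up toy example: points with margin >= 1 get xi = 0
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0
print(slacks(X, y, w, b))      # -> [0.  0.  1.2]
```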
Bi criteria optimization and dominance

$$
L(w) = \frac{1}{p}\sum_{i=1}^n \xi_i^p, \qquad P(w) = \|w\|^2
$$

Dominance: w1 dominates w2 if L(w1) ≤ L(w2) and P(w1) ≤ P(w2).

Pareto frontier: the set of all non-dominated solutions.

Figure: dominated point (red), non-dominated point (purple) and Pareto frontier (blue).

Pareto frontier ⇔ regularization path
3 equivalent formulations to reach Pareto's front

$$
\min_{w\in\mathbb{R}^d}\ \frac{1}{p}\sum_{i=1}^n \xi_i^p + \lambda\,\|w\|^2
$$

$$
\begin{cases}
\min_{w}\ \dfrac{1}{p}\displaystyle\sum_{i=1}^n \xi_i^p \\[4pt]
\text{s.t. } \|w\|^2 \le k
\end{cases}
$$

$$
\begin{cases}
\min_{w}\ \|w\|^2 \\[4pt]
\text{s.t. } \dfrac{1}{p}\displaystyle\sum_{i=1}^n \xi_i^p \le k
\end{cases}
$$

It works for CONVEX criteria!
The non separable case

Modeling potential errors: introducing slack variables ξi. For each data point (xi, yi):
  no error:  yi(w⊤xi + b) ≥ 1  ⇒  ξi = 0
  error:     ξi = 1 − yi(w⊤xi + b) > 0

Minimizing also the slack (the error), for a given C > 0:

$$
\begin{cases}
\min_{w,b,\xi}\ \dfrac{1}{2}\|w\|^2 + \dfrac{C}{p}\displaystyle\sum_{i=1}^n \xi_i^p \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad i = 1,\dots,n \\[2pt]
\phantom{\text{with }} \xi_i \ge 0,\quad i = 1,\dots,n
\end{cases}
$$

Looking for the saddle point of the Lagrangian, with Lagrange multipliers αi ≥ 0 and βi ≥ 0:

$$
L(w,b,\alpha,\beta) = \frac{1}{2}\|w\|^2 + \frac{C}{p}\sum_{i=1}^n \xi_i^p
 - \sum_{i=1}^n \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr)
 - \sum_{i=1}^n \beta_i\,\xi_i
$$
The KKT conditions (p = 1)

$$
L(w,b,\alpha,\beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i
 - \sum_{i=1}^n \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr)
 - \sum_{i=1}^n \beta_i\,\xi_i
$$

stationarity:          w − Σ_{i=1}^n αi yi xi = 0   and   Σ_{i=1}^n αi yi = 0
                       C − αi − βi = 0,  i = 1, …, n
primal admissibility:  yi(w⊤xi + b) ≥ 1 − ξi,  i = 1, …, n
                       ξi ≥ 0,  i = 1, …, n
dual admissibility:    αi ≥ 0,  i = 1, …, n
                       βi ≥ 0,  i = 1, …, n
complementarity:       αi (yi(w⊤xi + b) − 1 + ξi) = 0,  i = 1, …, n
                       βi ξi = 0,  i = 1, …, n

Let's eliminate β!
KKT (p = 1), after eliminating β

stationarity:          w − Σ_{i=1}^n αi yi xi = 0   and   Σ_{i=1}^n αi yi = 0
primal admissibility:  yi(w⊤xi + b) ≥ 1 − ξi,   ξi ≥ 0,   i = 1, …, n
dual admissibility:    αi ≥ 0,   C − αi ≥ 0,   i = 1, …, n
complementarity:       αi (yi(w⊤xi + b) − 1 + ξi) = 0,   (C − αi) ξi = 0,   i = 1, …, n

The three resulting sets of points:

set            | I0                | Iα                        | IC
αi             | 0                 | 0 < αi < C                | C
βi             | C                 | C − αi                    | 0
ξi             | 0                 | 0                         | 1 − yi(w⊤xi + b)
yi(w⊤xi + b)   | > 1               | = 1                       | < 1
status         | useless           | useful (support vector)   | suspicious
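Given a solved α, these three sets can be recovered numerically. A small sketch, using a tolerance because solvers never return exact zeros (function name and tolerance are my own choices):

```python
import numpy as np

def kkt_sets(alpha, C, tol=1e-8):
    """Split indices into I0 (alpha = 0), I_alpha (0 < alpha < C) and IC (alpha = C)."""
    I0 = np.where(alpha <= tol)[0]                        # useless: strictly outside the margin
    IC = np.where(alpha >= C - tol)[0]                    # suspicious: inside the margin or misclassified
    Ia = np.where((alpha > tol) & (alpha < C - tol))[0]   # support vectors, exactly on the margin
    return I0, Ia, IC
```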
The importance of being support

[Figure: a 2D data set shown twice, with the decision boundary determined by the support vectors.]

data point xi   | α            | constraint value      | set
useless         | αi = 0       | yi(w⊤xi + b) > 1      | I0
support         | 0 < αi < C   | yi(w⊤xi + b) = 1      | Iα
suspicious      | αi = C       | yi(w⊤xi + b) < 1      | IC

Table: when a data point is a « support » vector, it lies exactly on the margin.

Here lies the efficiency of the algorithm (and its complexity)! Sparsity: αi = 0 for most points.
Optimality conditions (p = 1)

$$
L(w,b,\alpha,\beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i
 - \sum_{i=1}^n \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr)
 - \sum_{i=1}^n \beta_i\,\xi_i
$$

Computing the gradients:

$$
\nabla_w L(w,b,\alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i, \qquad
\frac{\partial L(w,b,\alpha)}{\partial b} = -\sum_{i=1}^n \alpha_i y_i, \qquad
\nabla_{\xi_i} L(w,b,\alpha) = C - \alpha_i - \beta_i
$$

no change for w and b (same stationarity conditions as in the separable case)
βi ≥ 0 and C − αi − βi = 0  ⇒  αi ≤ C

The dual formulation:

$$
\begin{cases}
\min_{\alpha\in\mathbb{R}^n}\ \dfrac{1}{2}\,\alpha^\top G\alpha - e^\top\alpha \\[4pt]
\text{with } y^\top\alpha = 0 \\[2pt]
\text{and } 0 \le \alpha_i \le C,\quad i = 1,\dots,n
\end{cases}
$$
SVM primal vs. dual

Primal

$$
\begin{cases}
\min_{w,b,\xi\in\mathbb{R}^n}\ \dfrac{1}{2}\|w\|^2 + C\displaystyle\sum_{i=1}^n \xi_i \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i \\[2pt]
\phantom{\text{with }} \xi_i \ge 0,\quad i = 1,\dots,n
\end{cases}
$$

  d + n + 1 unknowns
  2n constraints
  classical QP
  to be used when n is too large to build G

Dual

$$
\begin{cases}
\min_{\alpha\in\mathbb{R}^n}\ \dfrac{1}{2}\,\alpha^\top G\alpha - e^\top\alpha \\[4pt]
\text{with } y^\top\alpha = 0 \\[2pt]
\text{and } 0 \le \alpha_i \le C,\quad i = 1,\dots,n
\end{cases}
$$

  n unknowns
  G the Gram matrix (pairwise influence matrix)
  2n box constraints
  easy to solve
  to be used when n is not too large
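As an illustration of the dual side, here is a hedged sketch that feeds the dual QP above to scipy's SLSQP solver, assuming G_ij = y_i y_j x_i⊤x_j and made-up toy data; a dedicated SVM solver (SMO, libsvm, ...) would be preferred in practice:

```python
import numpy as np
from scipy.optimize import minimize

def linear_svm_dual(X, y, C):
    """Solve the C-SVM (p = 1) dual: min 1/2 a'Ga - e'a  s.t.  y'a = 0, 0 <= a_i <= C."""
    n = X.shape[0]
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                   # assumed convention: G_ij = y_i y_j x_i' x_j
    fun = lambda a: 0.5 * a @ G @ a - a.sum()
    jac = lambda a: G @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: y @ a, "jac": lambda a: y}]
    res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0.0, C)] * n,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X                             # stationarity: w = sum_i alpha_i y_i x_i
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]   # points in I_alpha (assumed non-empty)
    b = np.mean(y[sv] - X[sv] @ w)                  # y_i (w'x_i + b) = 1 on the margin
    return w, b, alpha

# made-up toy data: two overlapping Gaussian clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(0.0, 1.0, (20, 2))])
y = np.r_[np.ones(20), -np.ones(20)]
w, b, alpha = linear_svm_dual(X, y, C=1.0)
```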
The smallest C

C small ⇒ all the points are in IC : αi = C

[Figure: decision function obtained with a very small C on a 2D data set.]

$$
-1 \le f_j = C\sum_{i=1}^n y_i\,(x_i^\top x_j) + b \le 1,
\qquad f_M = \max_j f_j,\quad f_m = \min_j f_j,
\qquad C_{\max} = \frac{2}{f_M - f_m}
$$
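One possible reading of the formula above in code, assuming f_j is written as C g_j + b with g_j = Σ_i y_i x_i⊤x_j (a sketch, not taken from the slides):

```python
import numpy as np

def c_max(X, y):
    """With all alpha_i at the bound C, f_j = C * g_j + b where g_j = sum_i y_i x_i'x_j;
    the spread of f scales with C, hence C_max = 2 / (max_j g_j - min_j g_j)."""
    g = (X @ X.T) @ y
    return 2.0 / (g.max() - g.min())
```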
Road map

1 Linear SVM
  The non separable case
  The C (L1) SVM
  The L2 SVM and others
  The hinge loss

[Figure: slack variable illustration.]
Optimality conditions (p = 2)

$$
L(w,b,\alpha,\beta) = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2
 - \sum_{i=1}^n \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr)
$$

Computing the gradients:

$$
\nabla_w L(w,b,\alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i, \qquad
\frac{\partial L(w,b,\alpha)}{\partial b} = -\sum_{i=1}^n \alpha_i y_i, \qquad
\nabla_{\xi_i} L(w,b,\alpha) = C\xi_i - \alpha_i
$$

no need of the positivity constraint on ξi
no change for w and b

$$
C\xi_i - \alpha_i = 0 \;\Rightarrow\;
\frac{C}{2}\sum_{i=1}^n \xi_i^2 - \sum_{i=1}^n \alpha_i\xi_i
= -\frac{1}{2C}\sum_{i=1}^n \alpha_i^2
$$

The dual formulation:

$$
\begin{cases}
\min_{\alpha\in\mathbb{R}^n}\ \dfrac{1}{2}\,\alpha^\top\!\bigl(G + \tfrac{1}{C}I\bigr)\alpha - e^\top\alpha \\[4pt]
\text{with } y^\top\alpha = 0 \\[2pt]
\text{and } 0 \le \alpha_i,\quad i = 1,\dots,n
\end{cases}
$$
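The L2 dual differs from the L1 dual only through the regularized matrix G + I/C and the missing upper bound, so the same generic QP call can be reused. A sketch under the same assumptions as before:

```python
import numpy as np
from scipy.optimize import minimize

def l2_svm_dual(X, y, C):
    """L2-SVM dual: same QP as the L1 case, with G + (1/C) I and no upper bound on alpha."""
    n = X.shape[0]
    Yx = y[:, None] * X
    G = Yx @ Yx.T + np.eye(n) / C                   # regularized Gram matrix
    fun = lambda a: 0.5 * a @ G @ a - a.sum()
    jac = lambda a: G @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: y @ a, "jac": lambda a: y}]
    res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0.0, None)] * n,
                   constraints=cons, method="SLSQP")
    return res.x
```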
SVM primal vs. dual (p = 2)

Primal

$$
\begin{cases}
\min_{w,b,\xi\in\mathbb{R}^n}\ \dfrac{1}{2}\|w\|^2 + \dfrac{C}{2}\displaystyle\sum_{i=1}^n \xi_i^2 \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad i = 1,\dots,n
\end{cases}
$$

  d + n + 1 unknowns
  n constraints
  classical QP
  to be used when n is too large to build G

Dual

$$
\begin{cases}
\min_{\alpha\in\mathbb{R}^n}\ \dfrac{1}{2}\,\alpha^\top\!\bigl(G + \tfrac{1}{C}I\bigr)\alpha - e^\top\alpha \\[4pt]
\text{with } y^\top\alpha = 0 \\[2pt]
\text{and } 0 \le \alpha_i,\quad i = 1,\dots,n
\end{cases}
$$

  n unknowns
  G, the Gram matrix, is regularized
  n box constraints
  easy to solve
  to be used when n is not too large
One more variant: the ν SVM

$$
\begin{cases}
\max_{v,a}\ m \\[2pt]
\text{with } \min_{i=1,\dots,n} |v^\top x_i + a| \ge m \\[2pt]
\phantom{\text{with }} \|v\|^2 = k
\end{cases}
$$

$$
\begin{cases}
\min_{v,a}\ \dfrac{1}{2}\|v\|^2 - \nu\, m + \displaystyle\sum_{i=1}^n \xi_i \\[4pt]
\text{with } y_i(v^\top x_i + a) \ge m - \xi_i \\[2pt]
\phantom{\text{with }} \xi_i \ge 0
\end{cases}
$$
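For completeness, scikit-learn exposes this parameterisation as NuSVC. A hedged usage sketch on made-up data, assuming scikit-learn is installed and that the chosen ν is feasible for the data:

```python
import numpy as np
from sklearn.svm import NuSVC

# made-up toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(0.0, 1.0, (30, 2))])
y = np.r_[np.ones(30), -np.ones(30)]

# nu upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors
clf = NuSVC(nu=0.3, kernel="linear").fit(X, y)
print(len(clf.support_))   # number of support vectors
```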
One more way to derive the SVM

[Figure: the two convex hulls of the classes and the segment joining their closest points.]

Minimizing the distance between the convex hulls:

$$
\begin{cases}
\min_{\alpha}\ \|u - v\| \\[2pt]
\text{with } u(x) = \displaystyle\sum_{\{i\,|\,y_i=1\}} \alpha_i\,(x_i^\top x), \quad
v(x) = \displaystyle\sum_{\{i\,|\,y_i=-1\}} \alpha_i\,(x_i^\top x) \\[4pt]
\text{and } \displaystyle\sum_{\{i\,|\,y_i=1\}} \alpha_i = 1, \quad
\displaystyle\sum_{\{i\,|\,y_i=-1\}} \alpha_i = 1, \quad 0 \le \alpha_i,\ i = 1,\dots,n
\end{cases}
$$

$$
f(x) = \frac{2}{\|u - v\|^2}\bigl(u(x) - v(x)\bigr)
\quad\text{and}\quad
b = \frac{\|v\|^2 - \|u\|^2}{\|u - v\|^2}
$$
SVM with non symmetric costs

Problem in the primal:

$$
\begin{cases}
\min_{w,b,\xi\in\mathbb{R}^n}\ \dfrac{1}{2}\|w\|^2
 + C^{+}\!\!\displaystyle\sum_{\{i\,|\,y_i=1\}}\!\xi_i^p
 + C^{-}\!\!\displaystyle\sum_{\{i\,|\,y_i=-1\}}\!\xi_i^p \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
\end{cases}
$$

For p = 1 the dual formulation is the following:

$$
\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{2}\,\alpha^\top G\alpha + \alpha^\top e
\quad\text{with } \alpha^\top y = 0
\text{ and } 0 \le \alpha_i \le C^{+} \text{ or } C^{-},\quad i = 1,\dots,n
$$

(the upper bound is C+ for positive examples and C− for negative ones)

It generalizes to any cost (useful for unbalanced data)
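One practical way to get such asymmetric costs, shown as a hedged sketch: in scikit-learn, class_weight rescales C per class, which matches the C+/C− construction above up to that reparameterisation (the weight values below are arbitrary):

```python
from sklearn.svm import SVC

# arbitrary illustrative values: errors on the positive class cost 10x more
C_plus, C_minus = 10.0, 1.0
clf = SVC(kernel="linear", C=1.0, class_weight={1: C_plus, -1: C_minus})
# clf.fit(X, y)   # with labels y in {+1, -1}
```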
Road map

1 Linear SVM
  The non separable case
  The C (L1) SVM
  The L2 SVM and others
  The hinge loss

[Figure: slack variable illustration.]
Eliminating the slack but not the possible mistakes

$$
\begin{cases}
\min_{w,b,\xi\in\mathbb{R}^n}\ \dfrac{1}{2}\|w\|^2 + C\displaystyle\sum_{i=1}^n \xi_i \\[4pt]
\text{with } y_i(w^\top x_i + b) \ge 1 - \xi_i \\[2pt]
\phantom{\text{with }} \xi_i \ge 0,\quad i = 1,\dots,n
\end{cases}
$$

Introducing the hinge loss, ξi = max(1 − yi(w⊤xi + b), 0):

$$
\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \max\bigl(0,\ 1 - y_i(w^\top x_i + b)\bigr)
$$

Back to d + 1 variables, but this is no longer an explicit QP.
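Even though the objective is non-smooth, it can be evaluated and a subgradient picked at any point. A small numpy sketch (function names are mine):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """J(w, b) = 1/2 ||w||^2 + C sum_i max(0, 1 - y_i (w'x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * xi.sum()

def primal_subgradient(w, b, X, y, C):
    """One subgradient of J at (w, b); 'active' points are those with positive slack."""
    active = (1.0 - y * (X @ w + b)) > 0
    gw = w - C * (y[active, None] * X[active]).sum(axis=0)
    gb = -C * y[active].sum()
    return gw, gb
```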
Oops! the notion of subdifferential

Definition (subgradient): a subgradient of J : ℝ^d → ℝ at f0 is any vector g ∈ ℝ^d such that
  ∀f ∈ V(f0),  J(f) ≥ J(f0) + g⊤(f − f0)

Definition (subdifferential): ∂J(f), the subdifferential of J at f, is the set of all subgradients of J at f.

Examples (d = 1):
  J3(x) = |x|             ∂J3(0) = {g ∈ ℝ | −1 ≤ g ≤ 1}
  J4(x) = max(0, 1 − x)   ∂J4(1) = {g ∈ ℝ | −1 ≤ g ≤ 0}
Regularization path for SVM

$$
\min_{w}\ \sum_{i=1}^n \max(1 - y_i\, w^\top x_i,\ 0) + \frac{\lambda_o}{2}\|w\|^2
$$

Iα is the set of support vectors, i.e. the points such that yi w⊤xi = 1;

$$
\partial_w J(w) = \sum_{i\in I_\alpha} \alpha_i y_i x_i - \sum_{i\in I_1} y_i x_i + \lambda_o w
\quad\text{with}\quad \alpha_i \in \partial H(1) = [-1, 0]
$$

Let λn be a value close enough to λo to keep the sets I0, Iα and IC unchanged.
In particular, at a point xj ∈ Iα (wo⊤xj = wn⊤xj = yj), taking the inner product of the optimality condition 0 ∈ ∂wJ(w) with xj gives:

$$
\sum_{i\in I_\alpha} \alpha_{io}\, y_i\, x_i^\top x_j = \sum_{i\in I_1} y_i\, x_i^\top x_j - \lambda_o y_j,
\qquad
\sum_{i\in I_\alpha} \alpha_{in}\, y_i\, x_i^\top x_j = \sum_{i\in I_1} y_i\, x_i^\top x_j - \lambda_n y_j
$$

$$
G(\alpha_n - \alpha_o) = (\lambda_o - \lambda_n)\,y
\quad\text{with}\quad G_{ij} = y_i\, x_i^\top x_j
$$

$$
\alpha_n = \alpha_o + (\lambda_o - \lambda_n)\, d,
\qquad d = G^{-1} y
$$

The αi are therefore piecewise linear in λ: the whole regularization path can be followed at the cost of a few linear systems.
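A sketch of the corresponding update, restricted to the margin set Iα and assuming the sets really do not change between λo and λn (this is the piecewise-linear step of the path-following algorithm; variable names are mine):

```python
import numpy as np

def path_update(alpha_o, lam_o, lam_n, X, y, I_alpha):
    """Move the alphas of the margin set I_alpha from lambda_o to lambda_n,
    assuming the sets I_0, I_alpha, I_C stay unchanged on the way.
    alpha_o contains only the coefficients of the points in I_alpha."""
    Xa, ya = X[I_alpha], y[I_alpha]
    G = (Xa @ Xa.T) * ya[None, :]        # G_ji = y_i x_i' x_j, restricted to I_alpha
    d = np.linalg.solve(G, ya)           # d = G^{-1} y
    return alpha_o + (lam_o - lam_n) * d
```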
Solving SVM in the primal

$$
\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \max\bigl(0,\ 1 - y_i(w^\top x_i + b)\bigr)
$$

What for: Yahoo!, Twitter, Amazon, Google (Sibyl), Facebook, . . . : big data, data-intensive machine learning systems, "on terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines".

How: a hybrid online + batch approach with adaptive gradient updates (stochastic gradient descent).

Code available: http://olivier.chapelle.cc/primal/
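The quoted terascale system is not reproduced here; as a stand-in, a plain stochastic subgradient sketch of the same primal objective (step sizes and epoch counts are arbitrary choices):

```python
import numpy as np

def sgd_linear_svm(X, y, C=1.0, epochs=20, lr0=0.1, seed=0):
    """Stochastic subgradient descent on 1/2 ||w||^2 + C sum_i max(0, 1 - y_i (w'x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for epoch in range(epochs):
        lr = lr0 / (1.0 + epoch)                       # decreasing step size
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            # subgradient of the i-th term (regularizer split evenly over the n points)
            gw = w / n - (C * y[i] * X[i] if margin < 1 else 0.0)
            gb = -C * y[i] if margin < 1 else 0.0
            w -= lr * gw
            b -= lr * gb
    return w, b
```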
Solving SVM in the primal

$$
J(w,b) = \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^n \max\bigl(1 - y_i(w^\top x_i + b),\ 0\bigr)^2
       = \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\,\xi^\top\xi
\quad\text{with}\quad \xi_i = \max\bigl(1 - y_i(w^\top x_i + b),\ 0\bigr)
$$

$$
\nabla_w J(w,b) = w - C\sum_{i=1}^n \max\bigl(1 - y_i(w^\top x_i + b),\ 0\bigr)\, y_i x_i
              = w - C\,\bigl(\mathrm{diag}(y)X\bigr)^\top \xi
$$

$$
H_w J(w,b) = I_d + C \sum_{i\notin I_0} x_i x_i^\top
$$

Optimal step size ρ in the Newton direction:

$$
w^{\text{new}} = w^{\text{old}} - \rho\, H_w^{-1}\,\nabla_w J(w^{\text{old}}, b^{\text{old}})
$$
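A sketch of these Newton updates with a fixed step ρ = 1, folding the bias into an appended constant feature (which slightly regularizes b as well, a deliberate simplification):

```python
import numpy as np

def newton_primal_l2svm(X, y, C=1.0, iters=20, tol=1e-8):
    """Newton iterations on J(w) = 1/2 ||w||^2 + C/2 sum_i max(0, 1 - y_i w'x_i)^2,
    with the bias handled as an appended constant feature."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    n, d = Xb.shape
    w = np.zeros(d)
    for _ in range(iters):
        xi = np.maximum(0.0, 1.0 - y * (Xb @ w))          # slacks (zero for points in I_0)
        grad = w - C * (y * xi) @ Xb                      # w - C (diag(y) X)' xi
        if np.linalg.norm(grad) < tol:
            break
        active = xi > 0                                   # points not in I_0
        H = np.eye(d) + C * Xb[active].T @ Xb[active]     # I_d + C sum_active x_i x_i'
        w = w - np.linalg.solve(H, grad)                  # full Newton step (rho = 1)
    return w[:-1], w[-1]                                  # (w, b)
```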
The hinge and other losses

Square hinge, LP SVM and Lasso SVM (Huber / hinge loss):

$$
\min_{w,b}\ \|w\|_1 + C\sum_{i=1}^n \max\bigl(1 - y_i(w^\top x_i + b),\ 0\bigr)^p
$$

Penalized logistic regression (Maxent):

$$
\min_{w,b}\ \|w\|_2^2 + C\sum_{i=1}^n \log\bigl(1 + e^{-2 y_i(w^\top x_i + b)}\bigr)
$$

The exponential loss (commonly used in boosting):

$$
\min_{w,b}\ \|w\|_2^2 + C\sum_{i=1}^n e^{-y_i(w^\top x_i + b)}
$$

The sigmoid loss:

$$
\min_{w,b}\ \|w\|_2^2 - C\sum_{i=1}^n \tanh\bigl(y_i(w^\top x_i + b)\bigr)
$$

[Figure: the 0/1, hinge, squared hinge, logistic, exponential and sigmoid losses as functions of the margin y f(x).]
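For reference, the losses above can be tabulated as functions of the margin u = y f(x); a small sketch (the sigmoid loss is shifted by +1 so that it is nonnegative, purely for comparison):

```python
import numpy as np

def classification_losses(u):
    """Losses as functions of the margin u = y f(x)."""
    return {
        "0/1":         (u <= 0).astype(float),
        "hinge":       np.maximum(0.0, 1.0 - u),
        "hinge^2":     np.maximum(0.0, 1.0 - u) ** 2,
        "logistic":    np.log(1.0 + np.exp(-2.0 * u)),
        "exponential": np.exp(-u),
        "sigmoid":     1.0 - np.tanh(u),     # -tanh(u), shifted to be nonnegative
    }

u = np.linspace(-1.0, 2.0, 7)
for name, values in classification_losses(u).items():
    print(f"{name:12s}", np.round(values, 2))
```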
Choosing the data fitting term and the penalty

For a given C, controlling the tradeoff between loss and penalty:

$$
\min_{w,b}\ \mathrm{pen}(w) + C\sum_{i=1}^n \mathrm{Loss}\bigl(y_i(w^\top x_i + b)\bigr)
$$

For a long list of possible penalties, see A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011.

A tentative classification:
  convex / non convex
  differentiable / non differentiable

What are we looking for:
  consistency
  efficiency → sparsity
Conclusion: variables or data points?

seeking a universal learning algorithm
  no model for IP(x, y)
  the linear case: data is separable
  the non separable case
double objective: minimizing the error together with the regularity of the solution
  multi objective optimisation
duality: variable – example
  use the primal when d < n (in the linear case) or when the matrix G is hard to compute
  otherwise use the dual
universality = nonlinearity
  kernels
Bibliography

T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, The entire regularization path for the support vector machine, JMLR, 2004.
P. Bartlett, M. Jordan, J. McAuliffe, Convexity, classification, and risk bounds, JASA, 2006.
A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011.
A. Agarwal, O. Chapelle, M. Dudík, J. Langford, A reliable effective terascale linear learning system, 2011.
informatik.unibas.ch/fileadmin/Lectures/FS2013/CS331/Slides/my_SVM_without_b.pdf
http://ttic.uchicago.edu/~gregory/courses/ml2010/lectures/lect12.pdf
http://olivier.chapelle.cc/primal/