support vector machine
Submitted by:
Garisha Chowdhary,
MCSE 1st year,
Jadavpur University
A set of related supervised learning methods

A non-probabilistic binary linear classifier

Linear learners like perceptrons, but unlike them they use the concepts of maximum margin, linearization, and kernel functions

Used for classification and regression analysis
Map non-linearly separable instances to higher dimensions to overcome linearity constraints

Select between hyperplanes, using the maximum margin as the test
[Figure: three candidate hyperplanes separating Class 1 and Class 2, illustrating what "A good separation" looks like.]
Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class

The larger the margin, the lower the generalization error (more confident predictions)
[Figure: the maximum-margin hyperplane separating Class 1 and Class 2.]
Given N samples
• {(x1,y1), (x2,y2), … , (xN,yN)}
• where yi = +1/-1 are the labels of the data and xi belongs to Rn

Find a hyperplane wTx + b = 0 such that
• wTxi + b > 0 : for all i such that yi = +1
• wTxi + b < 0 : for all i such that yi = -1
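
As a small illustration (not from the slides; the hyperplane and points are hypothetical values), such a hyperplane classifies a point by the sign of wTx + b:

```python
# Sketch: classify a point by which side of the hyperplane w^T x + b = 0 it falls on.
import numpy as np

def classify(x, w, b):
    """Return +1 if w.x + b > 0, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical hyperplane and test points (illustrative values only).
w, b = np.array([1.0, -1.0]), 0.5
print(classify(np.array([3.0, 1.0]), w, b))   # +1
print(classify(np.array([0.0, 2.0]), w, b))   # -1
```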


Functional Margin
• With respect to a training example (x(i), y(i)), defined by ˆγ(i) = y(i)(wTx(i) + b)
• We want the functional margin to be large, i.e. y(i)(wTx(i) + b) >> 0
• w and b may be rescaled without altering the decision function, but doing so multiplies the functional margin by the scale factor
• This lets us impose the normalization condition ||w|| = 1, i.e. consider the functional margin of (w/||w||, b/||w||)
• With respect to the training set, it is defined by ˆγ = min ˆγ(i) over all i
Geometric margin
• Defined by γ(i) = y(i)((w/||w||)Tx(i) + b/||w||)
• If ||w|| = 1, the functional margin equals the geometric margin
• Invariant to scaling of the parameters w and b, so w may be scaled such that ||w|| = 1
• Also, γ = min γ(i) over all i
Now, the objective is to

Maximize γ w.r.t. γ, w, b s.t.
• y(i)(wTx(i) + b) >= γ for all i
• ||w|| = 1

or equivalently, maximize ˆγ/||w|| w.r.t. ˆγ, w, b s.t.
• y(i)(wTx(i) + b) >= ˆγ for all i

• Introducing the scaling constraint that the functional margin be 1, the objective can be further simplified to maximizing 1/||w||, or

Minimize (1/2)||w||^2 s.t.
• y(i)(wTx(i) + b) >= 1 for all i
Using the Lagrangian to solve the inequality-constrained optimization problem, we have

L = ½||w||^2 - Σαi(yi(wTxi + b) - 1)

Setting the gradient of L w.r.t. w and b to 0, we have

w = Σαiyixi over all i ,        Σαiyi = 0

Substituting w into L, we get the dual of the primal problem:

maximize W(α) = Σαi - ½ΣΣαiαjyiyjxiTxj , s.t. αi >= 0 , Σαiyi = 0

Solve for α and recover

w = Σαiyixi , b∗ = ( −max over i:y(i)=−1 of wTx(i) + min over i:y(i)=1 of wTx(i) ) / 2
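
As a hedged illustration (not part of the original slides), the sketch below fits a linear SVM on a made-up toy dataset with scikit-learn and recovers w = Σαiyixi and b from the fitted support vectors; the data and the large C value are illustrative assumptions.

```python
# Sketch: recover w and b from the dual solution of a linear SVM (assumes scikit-learn and numpy).
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative only).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # large C approximates the hard-margin SVM
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so w = sum(alpha_i * y_i * x_i).
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

print("w from duals:", w)            # matches clf.coef_ for a linear kernel
print("b:", b)
print("geometric margin:", 1.0 / np.linalg.norm(w))
```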
For conversion of the primal problem to the dual problem, the following Karush-Kuhn-Tucker (KKT) conditions must be satisfied
  •   (∂/∂wi)L(w, α) = 0, i = 1, . . . , n
  •   αi gi(w,b) = 0, i = 1, . . . , k
  •   gi(w,b) <= 0, i = 1, . . . , k
  •   αi >= 0

From the KKT complementary slackness condition (the 2nd above):

• αi > 0 => gi(w,b) = 0 (active constraint) => (x(i), y(i)) has functional margin 1 (these are the support vectors)
• gi(w,b) < 0 => αi = 0 (inactive constraint, non-support vectors)
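
A minimal sketch of that observation (my own illustration; the `alphas` array is a hypothetical output of a dual solver): the support vectors are exactly the points whose dual variable is non-zero, up to numerical tolerance.

```python
# Sketch: pick out support vectors as the points with alpha_i > tol (active constraints).
import numpy as np

def support_vector_mask(alphas, tol=1e-8):
    """Boolean mask of training points whose dual variable is (numerically) non-zero."""
    return alphas > tol

# Made-up dual values: only the first and last points are support vectors here.
alphas = np.array([0.7, 0.0, 0.0, 0.7])
print(support_vector_mask(alphas))   # [ True False False  True]
```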

                                                                Support vectors
                                                Class 2




                                Class 1
In the case of non-linearly separable data, mapping the data to a high-dimensional feature space via a non-linear mapping function φ increases the likelihood that the data becomes linearly separable

A kernel function is used to simplify computations over the high-dimensional mapped data: it corresponds to the dot product of some non-linear mapping of the data

Having found the αi, classifying a test point x requires only inner products between x and the support vectors
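
A brief sketch of that kernelized decision rule (my own illustration; the αi, support vectors, bias b, and kernel function are assumed to be already available from training):

```python
# Sketch: kernelized SVM decision function, f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b ).
import numpy as np

def decision(x, alphas, sv_y, sv_X, b, kernel):
    """Classify x using only kernel evaluations against the support vectors."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, sv_y, sv_X)) + b
    return np.sign(score)
```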


A kernel function is a measure of similarity between two vectors

A kernel function is valid if it satisfies Mercer's theorem, which states that the corresponding kernel matrix K must be symmetric positive semi-definite (zTKz >= 0 for all z)
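
As a hedged illustration of the Mercer condition, the sketch below builds a kernel (Gram) matrix on random sample points and checks that it is symmetric with non-negative eigenvalues; the RBF kernel and the data are placeholder assumptions.

```python
# Sketch: empirical Mercer check - the Gram matrix of a valid kernel is symmetric PSD.
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

rbf = lambda a, b, sigma=1.0: np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))
X = np.random.randn(20, 3)
K = gram_matrix(X, rbf)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)   # symmetric, eigenvalues >= 0
```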
Polynomial kernel with degree d
• K(x,y) = (xTy + 1)^d

Radial basis function (RBF) kernel with width σ
• K(x,y) = exp(-||x-y||^2 / (2σ^2))
• Feature space is infinite dimensional

Sigmoid kernel with parameters κ and θ
• K(x,y) = tanh(κ xTy + θ)
• It does not satisfy the Mercer condition for all κ and θ
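
A minimal sketch of these three kernels as plain functions (my own illustration; the parameter defaults are arbitrary assumptions):

```python
# Sketch: the three kernels above implemented with numpy.
import numpy as np

def poly_kernel(x, y, d=3):
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    # Not positive semi-definite for every (kappa, theta), so not always a valid Mercer kernel.
    return np.tanh(kappa * np.dot(x, y) + theta)
```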
High dimensionality doesn't guarantee linear separation; the hyperplane may also be susceptible to outliers

Relax the constraints by introducing 'slack variables' ξi that allow each constraint to be violated by a small amount

Penalize the objective function for the violations

The parameter C controls the trade-off between the penalty and the margin.

So the objective now becomes: min over w, b, ξ of (1/2)||w||^2 + C Σξi s.t. y(i)(wTx(i) + b) >= 1 − ξi , ξi >= 0

This tries to ensure that most examples have functional margin at least 1

Forming the corresponding Lagrangian, the dual problem now is to:
maximize over α: Σαi - ½ΣΣαiαjyiyjxiTxj , s.t. 0 <= αi <= C , Σαiyi = 0
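
A hedged sketch of this trade-off using scikit-learn's SVC (the overlapping toy data and the particular C values are illustrative assumptions, not from the slides):

```python
# Sketch: effect of C on the soft-margin SVM - small C tolerates violations, large C penalizes them.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])   # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors:", clf.n_support_.sum())         # typically fewer SVs as C grows
```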
[Figure: Class 1 and Class 2 separated by a soft-margin hyperplane.]


Parameter Selection

• The effectiveness of an SVM depends on the selection of the kernel, the kernel parameters, and the parameter C
• A common choice is the Gaussian kernel, which has a single parameter γ
• The best combination of C and γ is often selected by a grid search with exponentially increasing sequences of C and γ
• Each combination is checked using cross-validation, and the one with the best accuracy is chosen
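
A minimal sketch of such a grid search with scikit-learn (the exponential grids and the synthetic dataset are illustrative assumptions):

```python
# Sketch: grid search over exponentially spaced C and gamma with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_grid = {
    "C":     [2 ** k for k in range(-5, 16, 2)],      # exponentially increasing C
    "gamma": [2 ** k for k in range(-15, 4, 2)],      # exponentially increasing gamma
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```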
Drawbacks
• Cannot be directly applied to multiclass problems; it requires algorithms that convert the multiclass problem into multiple binary classification problems (e.g. one-vs-rest or one-vs-one)
• Class membership probabilities are uncalibrated
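
A hedged sketch of the usual workarounds in scikit-learn: one-vs-rest decomposition for multiclass data and Platt scaling for probability estimates (the dataset and settings are illustrative assumptions, not part of the slides):

```python
# Sketch: multiclass SVM via one-vs-rest binary SVMs, with Platt-scaled probability estimates.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # 3-class toy problem
clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
clf.fit(X, y)
print(clf.predict(X[:3]))
print(clf.predict_proba(X[:3]))                          # Platt-scaled class probabilities
```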