Support Vector Machine
sumit singh
October 2017
0.1 Mathematics of SVM
Source: MIT 6.034 (Artificial Intelligence), Prof. Patrick Winston
Figure 1: SVM — the median w · x + b = 0, the regions w · x + b > 1 and w · x + b < −1, and the margin between the gutters.
w is a normal to the median of the street.
u is the unknown vector for deciding classification of an unknown data point.
w · u ≥ c
w · u + b ≥ 0 =⇒ ‘+’ Decision Rule
w · x+ + b ≥ 1 (1)
w · x− + b ≤ −1 (2)
w · u + b ≥ 0 then ‘+’ Decision Rule (3)
We introduce yi for mathematical convenience. This will reduce the above
two equations into one, as shown in the next few equations.
Figure 2: SVM2 — a 2-D plot (x and y axes from 0 to 5) showing the median line and the gutter lines.
yi = +1 for ‘+’ samples, −1 for ‘−’ samples (4)
yi(w · xi + b) ≥ 1 (5)
Taking 1 to the LHS results in:
yi(w · xi + b) − 1 ≥ 0 (6)
If the point xi lies on the gutter line, then:
yi(w · xi + b) − 1 = 0 (7)
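As a quick sanity check, here is a minimal Python sketch (not part of the original notes; the data and the values of w and b are made-up toy numbers) that evaluates the combined constraint of equation (6) on a few labelled points:

import numpy as np

# Toy data: two '+' samples and two '-' samples (hypothetical values).
X = np.array([[3.0, 3.0], [4.0, 4.0],   # '+' samples
              [1.0, 1.0], [0.0, 2.0]])  # '-' samples
y = np.array([+1, +1, -1, -1])

w = np.array([0.5, 0.5])   # a hand-picked normal to the median of the street
b = -2.0                   # a hand-picked offset

margins = y * (X @ w + b) - 1   # yi(w . xi + b) - 1, eq. (6)
print(margins)                  # all >= 0 when (w, b) honours the constraint; zeros lie on a gutter, eq. (7)
print(np.all(margins >= 0))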
There is no point in between the two gutter lines. We are trying to get the
two gutter lines such that the resulting street, which separates the ‘+’
samples from the ‘−’ samples, is as wide as possible.
It would be a good idea to express the distance between the two gutters.
Insert Diagram here: A vector to the positive gutter line, x+, and a vector
to the negative gutter line, x−. The difference of the two vectors is
x+ − x−. Now, if we have a unit normal that is normal to the median line of
the street, then the dot product of this difference vector with that unit
normal is the width of the street. But we already have w, which is normal to
the median of the street. Hence, w/||w|| is a unit vector normal to the median.
WIDTH = (x+ − x−) · w/||w|| (8)
The above dot product is a scalar and is the width of the street. Now, using
the equation yi(w · xi + b) − 1 = 0 and considering two points, one on each of
the two gutter lines, results in:
x+ · w = 1 − b (9)
x− · w = −1 − b (10)
Substituting (9) and (10) into (8) results in,
WIDTH = 2/||w|| (11)
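A small numerical check (a sketch using the same hypothetical toy numbers as the snippet above, not taken from the notes) that projecting x+ − x− onto w/||w|| as in (8) gives exactly 2/||w||:

import numpy as np

w = np.array([0.5, 0.5])         # hand-picked normal (with offset b = -2)
x_plus = np.array([3.0, 3.0])    # lies on the '+' gutter: w . x + b = +1
x_minus = np.array([1.0, 1.0])   # lies on the '-' gutter: w . x + b = -1

width_projection = (x_plus - x_minus) @ (w / np.linalg.norm(w))   # eq. (8)
width_formula = 2.0 / np.linalg.norm(w)                           # eq. (11)
print(width_projection, width_formula)   # both come out to about 2.828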
The whole idea is to maximize the width of the street.
MAX 2/||w|| = MAX 1/||w|| (12)
We dropped the constant. Now, maximizing 1/||w|| is the same as minimizing
||w||, which works just as well as minimizing ||w||²/2. Why did we square and
halve it? Because it is mathematically convenient. :)
MIN ||w||²/2 (13)
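For reference, this minimization can be handed to an off-the-shelf solver. The sketch below is only an illustration under the assumption that scikit-learn is available; it approximates the hard-margin problem by giving SVC a very large C, then reads the street width off as 2/||w||:

import numpy as np
from sklearn.svm import SVC

# Same hypothetical toy data as in the earlier snippets.
X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # huge C ~ (nearly) hard margin
w = clf.coef_[0]
print(w, clf.intercept_[0])        # the learned normal and offset
print(2.0 / np.linalg.norm(w))     # width of the street, 2/||w||, eq. (11)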
Now honoring the constraints:
To find the extremum of a function with constraints we have to use Lagrange
multipliers. (See Appendix) That will give us a new expression which we
can maximize or minimize without thinking about the constraints anymore.
This takes us towards box number 4: minimize L, where L is the original
objective minus the sum of the constraints, each weighted by a multiplier αi.
L = (1/2)||w||² − Σi αi [yi(w · xi + b) − 1] (14)
We have got the original thing we were trying to work with. Now, we have got
Lagrange multipliers, each multiplied by a constraint. Each constraint is
required to be non-negative. Now, with a little bit of mathematical sleight of
hand: the samples for which the Lagrange multipliers are zero are the ones
lying away from the gutter line; there is slack between them and the gutter
line. The ones for which the Lagrange multipliers are non-zero are the ones
which lie on the gutter line.
We are trying to find the extremum. So, we first take the First Order
Condition [FOC], that is, take the derivatives and set them to zero. Then, we
take the Second Order Condition [SOC], that is, the second derivative, and
depending on its sign confirm a maximum or a minimum.
∂L/∂w = w − Σi αiyixi = 0 (15)
=⇒ w = Σi αiyixi (16)
This tells us that the vector w is a linear sum of the samples, either all
or some of them (for some samples α will be 0).
Differentiating L with respect to anything else that may vary:
∂L/∂b = −Σi αiyi = 0 (17)
=⇒ Σi αiyi = 0 (18)
Because of the square, this is a quadratic optimization problem. At this
stage, numerical analysis can be used to solve the problem. But we will
continue with more math. We are interested in the fact that the decision
variable is a linear sum of the samples. Plugging the expression for w back
into the L expression (the optimization function), we get:
L = (1/2)(Σi αiyixi) · (Σj αjyjxj) − (Σi αiyixi) · (Σj αjyjxj) − Σi αiyib + Σi αi (19)
Now Σi αiyi = 0, and writing the second part as (Σi αiyixi) · (Σj αjyjxj),
we get:
L = Σi αi − (1/2) Σi Σj αiαjyiyj (xi · xj) (20)
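One way to actually carry out this maximization numerically (a sketch, assuming a generic constrained optimizer is acceptable; it is not the method developed in these notes) is to hand the negative of L, the bounds αi ≥ 0, and the equality Σi αiyi = 0 to SciPy's SLSQP routine:

import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 2.0]])  # toy samples
y = np.array([+1.0, +1.0, -1.0, -1.0])

G = (y[:, None] * y[None, :]) * (X @ X.T)       # G_ij = yi yj (xi . xj)

def neg_dual(alpha):
    # negative of eq. (20): minimize this to maximize L(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), method='SLSQP',
               bounds=[(0.0, None)] * len(y),                         # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # eq. (18)
alpha = res.x
print(np.round(alpha, 3))   # non-zero alphas mark samples lying on the gutters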
This we can solve by numerical analysis. But then we could have done that at
an earlier stage, so why did we come this far? We did it to find out how the
maximization depends on the sample vectors.
The optimization depends only on the dot products of pairs of samples.
We want to take the w and stick it back not only into the Lagrangian but
also into the decision rule.
Σi αiyi (xi · u) + b ≥ 0 THEN ‘+’ (21)
u is the unknown vector.
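Continuing the toy example (the α values below are just the dual solution for that hypothetical data set, rounded), the sketch recovers w from equation (16), gets b from a sample on a gutter, and classifies an unknown point u with the rule above:

import numpy as np

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])   # a dual solution for this toy set

w = (alpha * y) @ X                 # eq. (16): w = sum_i alpha_i yi xi
sv = int(np.argmax(alpha))          # any sample with alpha_i > 0 lies on a gutter
b = y[sv] - w @ X[sv]               # from yi(w . xi + b) = 1 with yi = +/-1

u = np.array([2.5, 3.5])            # a hypothetical unknown vector
score = (alpha * y) @ (X @ u) + b   # sum_i alpha_i yi (xi . u) + b, eq. (21)
print('+' if score >= 0 else '-')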
The decision rule also depends on the dot products of the samples with the
unknown. So, all of the math depends entirely on dot products.
The optimization is done over a convex space, hence a local maximum is
the global maximum. So, in contrast with things like neural nets, where you
have a plague of local maxima, this method never gets stuck in a local maximum.
What happens in situations where the points are not linearly separable, like
in the following figure?
INSERT FIGURE
Then, a powerful idea comes to the rescue: when stuck, switch to another
perspective. We need a transformation which will take us from the space
that we are currently in into a higher-dimensional space where things are more
convenient, where the points become separable. Let's call the transformation
φ(x). Since we have seen that the maximization depends only on the dot
product, all we need to do to maximize is to take the transformation of
one vector dotted with the transformation of another vector, φ(xi) · φ(xj),
and maximize it. Now, if we have a function K such that:
K(xi, xj) = φ(xi) · φ(xj) (22)
then we are done. This is called the KERNEL function and it provides us
with the dot product of the two vectors in another space. We don’t have to
know the transformation into the other space. And that’s the reason that
this stuff is a miracle. Some of the popular kernels are:
(u · v + 1)^n (23)
e^(−||xi − xj|| / σ) (24)
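Both kernels are easy to write down directly. The sketch below (an illustration with n and σ chosen arbitrarily) also verifies, for n = 2 in two dimensions, that the polynomial kernel really is a dot product φ(u) · φ(v) in a higher-dimensional space, even though that map is never needed at decision time:

import numpy as np

def poly_kernel(u, v, n=2):
    return (u @ v + 1.0) ** n                       # (u . v + 1)^n, eq. (23)

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) / sigma)   # e^(-||u - v|| / sigma), eq. (24)

def phi(x):
    # An explicit feature map matching the n = 2 polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

u = np.array([1.0, 2.0]); v = np.array([2.0, 0.5])
print(poly_kernel(u, v), phi(u) @ phi(v))   # the two numbers agree (both 16.0)
print(rbf_kernel(u, v))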
We have got a general method which is convex and guaranteed to produce a
global solution. We have got a mechanism that easily allows us to transform
this into another space. But there are issues: for instance, if we choose σ
[the one in the exponential kernel above] small enough, then the kernel bumps
are essentially shrunk right around the samples, and we could get overfitting.
So, it doesn't immunize us against overfitting, but it does immunize us against
local maxima and does provide us with a general mechanism for doing a
transformation into another space with a better perspective.
The history of this is: Vapnik immigrated from the Soviet Union to the
United States in 1991. Nobody had ever heard of this stuff before he immi-
grated. He actually had done this work on the basic support vector idea in his
Ph.D. thesis at Moscow University in the early 60s. But it wasn't possible
for him to do anything with it, because he didn't have any computer that
he could try it out on. So he spent the next 25 years in some oncology
institute in the Soviet Union doing applications. Somebody from Bell Labs
discovered him and invited him over to the United States, where he subsequently
decided to immigrate. In around 1992, Vapnik submitted three papers to
NIPS, the Neural Information Processing Systems conference. All of them were
rejected! Around 1992-93, Bell Labs was interested in hand-written character
recognition and in neural nets. Vapnik thought that neural nets were not very
good. So he had a bet with a colleague, for a good dinner, that the SVM
would eventually do better at handwriting recognition than neural nets. So
the colleague decided to implement the SVM with a kernel with n = 2, just
slightly non-linear. It worked like a charm! As soon as it was shown to work
in the early 90s on the problem of handwriting recognition, Vapnik resuscitated
the idea of the kernel and began to develop it, and it became an essential part
of the whole approach of using SVMs. It was about 30 years between the
concept and anybody hearing about it.