Chapter 7
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 7. Sparse Kernel Machines
2
Kernel-based regression & classification machine
Like the Gaussian process, many kernel-based approaches require evaluating the kernel function over the full training set.
In this section, we are going to cover some sparse-solution machines.
‘What does a sparse solution mean?’
= A method which uses only part of the full dataset.
This method is called a ‘support vector machine.’
** As I mentioned, I have uploaded a report which covers the detailed idea of the support vector machine!
There are some interesting parts in the support vector machine.
1st : The support vector machine does not yield a probability for its decision. It only gives the classification result.
2nd : We can always find the global optimal solution with convex optimization.
3rd : It can be extended to Bayesian methods via the Relevance Vector Machine (covered soon).
Since I wrote the basic idea in another file, I am going to skip the basics!
The important idea of this method is ‘using decision boundaries and maximizing the margin!’
First, let’s take a look at optimization issues.
Chapter 7.0. Lagrange Multipliers and KKT condition
3
Lagrange Multipliers
Consider that we are maximizing a function 𝑓(𝑋) with respect to 𝑋,
under some constraint 𝑔(𝑋) = 0.
The constraint 𝑔(𝑋) = 0 forms a (𝐷 − 1)-dimensional surface in the feature space.
Here, consider the constraint 𝑔(𝑥, 𝑦, 𝑧) = 𝑥 + 𝑦 − 𝑧 = 0.
This constraint forms a plane, the grey surface in the figure on the right.
The gradient of 𝑔 is ∇𝑔(𝑥, 𝑦, 𝑧) = (∂𝑔/∂𝑥, ∂𝑔/∂𝑦, ∂𝑔/∂𝑧) = (1, 1, −1).
Check that this gradient is orthogonal to the constraint surface.
Now, let’s extend this idea to a general dimension.
By using a Taylor series,
𝑔(𝑥 + 𝜖) ≈ 𝑔(𝑥) + 𝜖ᵀ∇𝑔(𝑥).
If both 𝑥 and 𝑥 + 𝜖 lie on the constraint surface, then 𝑔(𝑥 + 𝜖) = 𝑔(𝑥), so as 𝜖 → 0 (with 𝜖 lying in the constraint surface)
𝜖ᵀ∇𝑔(𝑥) ≈ 0, which fits the result of the toy example above.
Chapter 7.0. Lagrange Multipliers and KKT condition
4
Lagrange Multipliers
Now, let’s get back to our original optimization problem. 𝑓(𝑋) is a function defined over this D-dimensional space.
Here, the constrained maximum of 𝑓(𝑋) occurs where a contour of 𝑓 just kisses the constraint surface (they share the tangent).
This indicates that the two gradient vectors are (anti-)parallel: ∇𝑓(𝑋) + 𝜆∇𝑔(𝑋) = 0, where 𝜆 ≠ 0 is a constant that adjusts the magnitude and sign.
Thus, we can find the stationary points from the Lagrangian 𝐿(𝑋, 𝜆) = 𝑓(𝑋) + 𝜆𝑔(𝑋).
Now let’s consider an inequality constraint 𝑔(𝑋) ≥ 0.
Most of the argument is the same: when the constraint is active, the optimum still occurs at the ‘kissing point’.
1st. However, we now turn the constraint term on or off according to whether it is active (𝑔(𝑋) = 0) or inactive (𝑔(𝑋) > 0).
2nd. The sign of 𝜆 is important, since at an active constraint ∇𝑓 must point away from the shaded feasible region 𝑔(𝑋) > 0.
The conditions that characterize such an optimum are the Karush-Kuhn-Tucker (KKT) conditions.
Such conditions are given below.
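A compact statement of these conditions, for maximizing 𝑓(𝑋) subject to 𝑔(𝑋) ≥ 0 (the standard form, as in the book's appendix):

```latex
\nabla f(X) + \lambda \nabla g(X) = 0, \qquad
g(X) \ge 0, \qquad
\lambda \ge 0, \qquad
\lambda\, g(X) = 0 .
```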
Chapter 7.0. Lagrange Multipliers and KKT condition
5
Summary
So, we have gained some intuition about optimization with Lagrange multipliers.
We are solving the following problem.
1. We want to find the maximum of 𝑓(𝑋), with the constraints 𝑔_𝑗(𝑋) = 0 / ℎ_𝑘(𝑋) ≥ 0.
2. The objective (Lagrangian) is 𝐿(𝑋, 𝜆, 𝜇) = 𝑓(𝑋) + Σ_{j=1}^{J} 𝜆_𝑗 𝑔_𝑗(𝑋) + Σ_{k=1}^{K} 𝜇_𝑘 ℎ_𝑘(𝑋).
3. But subject to 𝜇_𝑘 ≥ 0 and 𝜇_𝑘 ℎ_𝑘(𝑋) = 0.
4. So, in short,
argmax_𝑋 𝐿(𝑋, 𝜆, 𝜇) = 𝑓(𝑋) + Σ_{j=1}^{J} 𝜆_𝑗 𝑔_𝑗(𝑋) + Σ_{k=1}^{K} 𝜇_𝑘 ℎ_𝑘(𝑋)
s.t. 𝜇_𝑘 ≥ 0
s.t. 𝜇_𝑘 ℎ_𝑘(𝑋) = 0
Check how these equations are used in the
optimization of the Support Vector Machine!
- Dual representation
- Lagrangian
- KKT conditions
(A small numerical example follows below.)
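As a sanity check, here is a minimal sketch (not from the slides) of solving a toy constrained maximization numerically; it assumes scipy is available, and the toy function and constraint are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: maximize f(x, y) = 1 - x^2 - y^2  subject to  g(x, y) = x + y - 1 = 0.
# SLSQP handles equality/inequality constraints internally via Lagrange/KKT machinery.
objective = lambda v: -(1.0 - v[0] ** 2 - v[1] ** 2)   # negate: minimizing -f maximizes f
constraints = [{"type": "eq", "fun": lambda v: v[0] + v[1] - 1.0}]

res = minimize(objective, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(res.x)   # analytic optimum is (0.5, 0.5)
```

Working the Lagrangian 𝐿 = 𝑓 + 𝜆𝑔 by hand for this toy problem gives x = y = 1/2, matching the numerical result.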
Chapter 7.1. Maximum Margin Classifiers
6
General formula
The output takes only two values, {−1, +1}.
𝑦(𝑋) = 𝑊ᵀ𝜙(𝑋) + 𝑏, with label 𝑡_𝑛 = +1 if 𝑦(𝑋_𝑛) > 0 and 𝑡_𝑛 = −1 if 𝑦(𝑋_𝑛) < 0.
Thus, correct classification of every data point can be expressed as 𝑡_𝑛 𝑦(𝑋_𝑛) > 0.
Here, we assume the data are perfectly separable!
We discussed perpendicular distance and other related issues in chapter 4.
The distance from an arbitrary (correctly classified) data point to the decision boundary can be expressed as 𝑡_𝑛 𝑦(𝑋_𝑛)/‖𝑤‖.
As our goal is to maximize the margin, the distance of the closest point should be maximized.
By using this, we can set up our optimization function as shown.
Here, we are free to set the inner term of the equation to 1 for the closest point. (We can achieve
this by simple re-scaling of 𝑤 and 𝑏, and this point is the one closest to the decision surface.)
Then, the following condition holds for every data point.
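For reference, the resulting maximum-margin problem and its canonical (rescaled) form are the standard ones:

```latex
\arg\max_{w,\,b}\;\frac{1}{\lVert w\rVert}\,\min_{n}\big[\,t_n\,(w^{\mathsf T}\phi(X_n)+b)\,\big]
\;\;\Longrightarrow\;\;
\arg\min_{w,\,b}\;\tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{s.t.}\quad t_n\,(w^{\mathsf T}\phi(X_n)+b)\ \ge\ 1,\quad n=1,\dots,N .
```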
Chapter 7.1. Maximum Margin Classifiers
7
General formula
Thus, our final objective function becomes…
Since there is a constraint 𝑡_𝑛(𝑊ᵀ𝜙(𝑋_𝑛) + 𝑏) ≥ 1, we can use Lagrange multipliers! (Imposing the constraint on the objective function!)
Here, we can re-write the optimization function as shown.
By computing 𝜕𝐿(𝑤, 𝑏, 𝑎)/𝜕𝑤 and 𝜕𝐿(𝑤, 𝑏, 𝑎)/𝜕𝑏 and setting them to zero, we can get the following equations,
which lead to the dual representation of the optimization!
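For reference, the Lagrangian and the dual it leads to (standard forms; the slide's equation images are not reproduced in this extraction):

```latex
L(w,b,a) = \tfrac{1}{2}\lVert w\rVert^{2}
 - \sum_{n=1}^{N} a_n\big\{t_n\,(w^{\mathsf T}\phi(X_n)+b) - 1\big\}, \qquad a_n \ge 0,
\\[6pt]
\frac{\partial L}{\partial w}=0 \;\Rightarrow\; w = \sum_{n} a_n t_n \phi(X_n), \qquad
\frac{\partial L}{\partial b}=0 \;\Rightarrow\; \sum_{n} a_n t_n = 0,
\\[6pt]
\widetilde{L}(a) = \sum_{n=1}^{N} a_n
 - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, k(X_n, X_m),
\qquad \text{s.t. } a_n \ge 0,\;\; \sum_{n} a_n t_n = 0 .
```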
Chapter 7.1. Maximum Margin Classifiers
8
General formula
Here, the kernel function is the inner product of two feature vectors: 𝑘(𝑋_𝑛, 𝑋_𝑚) = 𝜙(𝑋_𝑛)ᵀ𝜙(𝑋_𝑚).
1. By changing to the dual formulation, the apparent complexity can increase (we now have 𝑁 dual variables instead of the weight vector).
2. However, this allows us to use the kernel function in the optimization.
For a new input, we can classify it by using this equation!
The KKT conditions should be satisfied!
Now, let’s talk about support vectors.
Most of the well-classified data satisfy 𝑡_𝑛 𝑦(𝑋_𝑛) > 1.
The support vectors are the data that lie on the margin boundary of the classifier.
They can be defined by 𝑡_𝑛 𝑦(𝑋_𝑛) = 1.
The circled data in the figure are
the support vectors.
What the dual representation means:
min(original equation) ≥ (dual equation) for any feasible dual variables.
So, if we maximize the dual representation, we get the greatest lower bound of the original equation!
The two values become equal under the KKT conditions, and we can express the equation entirely in terms of the kernel!
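As an illustration of the sparsity of the solution, here is a minimal sketch using scikit-learn (an assumption, not used in the slides); it fits a linear SVM and inspects the support vectors and their dual coefficients.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: with a large C this behaves like the separable case.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e3).fit(X, y)

print(clf.support_)            # indices of the support vectors (the points with a_n > 0)
print(clf.support_vectors_)    # the 'circled' points in the slide's figure
print(clf.dual_coef_)          # a_n * t_n, stored only for the support vectors
print(clf.intercept_)          # the bias term b
```

Only a handful of the 40 points appear in support_, which is exactly the sparsity described above.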
Chapter 7.1. Maximum Margin Classifiers
9
Finding support vectors
By optimizing the aforementioned equation, we can get the values of 𝑎 (the Lagrange multipliers, often written 𝜆).
Note the constraints 𝑎_𝑛 ≥ 0 and Σ_𝑛 𝑎_𝑛 𝑡_𝑛 = 0.
Here, the data points satisfying 𝑡_𝑛 𝑦(𝑋_𝑛) = 1 are the support vectors.
Equivalently, this means the data with 𝒂_𝒏 ≠ 𝟎 are the support vectors!
Finding them is easy since we have already found all the 𝑎_𝑛 values! Now, think about how we can get the bias term.
Here, 𝑆 is the set of support vectors,
and 𝑡_𝑛 indicates the label of a support vector. (Figure: SVM with a Gaussian kernel.)
A numerically stable solution for the bias is shown below.
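For reference, the stable (averaged) bias the slide refers to is the standard expression

```latex
b = \frac{1}{N_S}\sum_{n\in S}\Big( t_n - \sum_{m\in S} a_m\, t_m\, k(X_n, X_m) \Big),
```

where 𝑆 is the set of support vectors and 𝑁_𝑆 is their number.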
Chapter 7.1. Maximum Margin Classifiers
10
Overlapping class distributions
So far we have assumed the data are perfectly separable, which is an unrealistic assumption in practice.
That is, the class distributions may overlap (right figure).
In order to take this into account, we introduce a new quantity, a slack variable.
A slack variable 𝜉_𝑛 takes a different value for each data point:
𝜉_𝑛 = |𝑡_𝑛 − 𝑦(𝑋_𝑛)| ≥ 0 for points on the wrong side of the margin boundary, and 𝜉_𝑛 = 0 otherwise.
Correctly classified (outside the margin) : 𝜉 = 0
On the decision boundary : 𝜉 = 1
Misclassified : 𝜉 > 1.
The slack variables should be as small as possible!
Thus, they are added to the original objective function with a hyper-parameter 𝐶.
Now, we are trying to minimize 𝑳 under the constraints on 𝝃_𝒏:
𝑳 = 𝐶 Σ_{n=1}^{N} 𝜉_𝑛 + ½‖𝑤‖²
Chapter 7.1. Maximum Margin Classifiers
11
Optimization with slack variable
By taking the partial derivative with respect to each parameter, we can get the following equations.
Most parts are the same as in the previous separable-case
example (without slack variables).
Here, the Lagrange multiplier 𝑎_𝑛 has the upper limit 𝑪 (the box constraint 0 ≤ 𝑎_𝑛 ≤ 𝐶).
Dual representation (Maximization)
Chapter 7.1. Maximum Margin Classifiers
12
Slack variable optimization + Nu SVM
Here again, the support vectors are the data which satisfy 𝑎_𝑛 > 0, which means 𝑡_𝑛 𝑦(𝑋_𝑛) = 1 − 𝜉_𝑛.
1. If 𝑎_𝑛 < 𝐶, this implies 𝜇_𝑛 > 0, → 𝜉_𝑛 = 0 / correctly classified, lying on the margin!
2. If 𝑎_𝑛 = 𝐶, this implies 𝜇_𝑛 = 0, → 𝜉_𝑛 can be non-zero / again two possible cases:
2.1. 𝜉_𝑛 ≤ 1 : correctly classified, but inside the margin.
2.2. 𝜉_𝑛 > 1 : misclassified!
To compute the bias, we again find the data with 𝟎 < 𝒂_𝒏 < 𝑪.
Note that the scalar 𝐶 is a trade-off parameter between margin violations and margin width.
In order to get a more intuitive hyper-parameter, there is a
variant called 𝜈-SVM (nu-SVM).
Here, the optimization equation becomes…
Here, 𝝂 is
- an upper bound on the fraction of margin errors (𝜉_𝑛 > 0; such points may or may not be misclassified),
- a lower bound on the fraction of support vectors.
(A usage sketch follows below.)
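A minimal sketch of the ν interpretation, assuming scikit-learn (not used in the slides); the dataset and the chosen ν are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# nu upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = NuSVC(nu=0.1, kernel="rbf").fit(X, y)

frac_sv = len(clf.support_) / len(X)
print(frac_sv)   # expected to be at least roughly 0.1, the chosen nu
```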
Chapter 7.1. Maximum Margin Classifiers
13
Characteristic & SMO
SMO from https://www.youtube.com/watch?v=vqoVIchkM7I
In SMO (sequential minimal optimization), we update the Lagrange multipliers two at a time!
There are various heuristics for selecting which pair of 𝑎_𝑛 to update.
The resulting two-variable sub-problem can be solved in closed form!
Consider the label-prediction equation of the SVM.
Do we have to store all the data and perform a weighted sum over
all the data every time?
Actually not, since the data away from the margin have 𝑎_𝑛 = 0.
This means we only need the data with 𝒂_𝒏 > 𝟎, which are the
Support Vectors!
Chapter 7.1. Maximum Margin Classifiers
14
Relation to the logistic regression
https://www.slideshare.net/ssuser36cf8e/prml-chapter-7-svm-supplementary-files
Check pages 15 & 16 of this file!!
Chapter 7.1. Maximum Margin Classifiers
15
SVM for regression
We can extend the simple idea of ‘error acceptance’ to linear regression.
This is called the ‘𝝐-insensitive error function’.
Red : 𝝐-insensitive loss / Green : squared loss
That is, an error smaller than 𝜖 is okay (no penalty); larger errors are penalized.
The boundary (red region) is called a ‘tube’.
However, not every data point can lie within the 𝜖-tube.
Thus, we introduce slack variables again, one for each side of the tube.
Thus, the error can be computed as shown.
This can be viewed as a
regularized error function!
There still exist the constraints
𝝃_𝒏 ≥ 𝟎 & 𝝃̂_𝒏 ≥ 𝟎
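For reference, the ε-insensitive loss and the regularized objective with the two slack variables take the standard form:

```latex
E_{\epsilon}\big(y(x)-t\big) =
\begin{cases}
0, & |y(x)-t| < \epsilon,\\
|y(x)-t| - \epsilon, & \text{otherwise},
\end{cases}
\qquad
\min\; C\sum_{n=1}^{N}\big(\xi_n + \hat{\xi}_n\big) + \tfrac{1}{2}\lVert w\rVert^{2}
\\[6pt]
\text{s.t.}\quad t_n \le y(X_n) + \epsilon + \xi_n, \qquad
t_n \ge y(X_n) - \epsilon - \hat{\xi}_n, \qquad
\xi_n \ge 0,\;\; \hat{\xi}_n \ge 0 .
```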
Chapter 7.1. Maximum Margin Classifiers
16
SVM for regression
Here, the Lagrangian objective function can be written as shown.
By plugging them in…
Thus, the prediction for a new input can be written as…
Chapter 7.1. Maximum Margin Classifiers
17
SVM for regression
The dual representation should satisfy the KKT conditions to be the greatest lower bound. These are…
Interpretation of the Lagrange multipliers:
𝑎_𝑛 ≠ 0 : support vectors on or above the tube boundary
𝑎̂_𝑛 ≠ 0 : support vectors on or below the tube boundary
Here, the bias term can be computed as…
Just like the classification case, here also we can implement a
𝝂-SVM regressor.
Interpretation of the hyper-parameter 𝜈:
1. At most 𝜈𝑁 data points fall outside the tube.
2. At least 𝜈𝑁 data points are support vectors.
(See the sketch below.)
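A minimal SVM-regression sketch, again assuming scikit-learn (not used in the slides); the toy data, kernel, and hyper-parameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

# epsilon sets the half-width of the tube; C trades off tube violations against regularization.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, 60))[:, None]
t = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, t)
print(len(reg.support_), "support vectors out of", len(X))  # only points on/outside the tube
```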
Chapter 7.2. Relevance Vector Machines
18
Limitation of SVM, derivation of RVM
One fundamental limitation of the SVM is that ‘it cannot yield a probability’.
It can only decide whether a specific data point belongs to a certain class or not.
In order to overcome this issue (to produce probabilities), we can think of a new model called the ‘relevance vector machine’.
It uses the idea of kernels, but still has the structure of a probabilistic model.
Let’s begin with a regression example.
The RVM also has the structure of a probability density.
Here, the predicted mean 𝑦(𝑋) is as shown.
The RVM replaces the basis
functions 𝝓(𝑿) with kernel functions, one centred on each data point.
Thus, it includes 𝑴 = 𝑵 + 𝟏 parameters in total (including the bias).
The basic idea is clear. Now, let’s move on to ‘how do we define the distributions?’
First, we have to define the likelihood function.
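For reference, the RVM regression model referred to above is the standard one:

```latex
p(t \mid x, w, \beta) = \mathcal{N}\big(t \mid y(x),\, \beta^{-1}\big),
\qquad
y(x) = \sum_{n=1}^{N} w_n\, k(x, x_n) + b ,
```

with one weight per training point plus the bias, hence 𝑀 = 𝑁 + 1 parameters.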
Chapter 7.2. Relevance Vector Machines
19
RVM for regression
Now, we have to define the prior distribution of 𝑊, the parameters of the model.
Here, please note that we are fitting an individual precision 𝜶_𝒊 for
each dimension of 𝒘 (the ARD prior).
By computing the product of likelihood and prior (𝑝(𝑤|𝒕) ∝ 𝑝(𝒕|𝑤)𝑝(𝑤)), we can get the posterior.
We can also use the general result which we derived in chapter 3.
Here, we haven’t yet determined the nuisance
parameters (hyper-parameters) 𝛼, 𝛽 of the model.
We use the evidence approximation, which we
covered in chapter 3.
** Evidence approx.
We get rid of the influence of 𝑤 by
integrating it out, then compute the most likely
value of each hyper-parameter (type-II maximum likelihood).
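For reference, the ARD prior and the chapter-3-style Gaussian posterior referenced above are:

```latex
p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\big(w_i \mid 0,\, \alpha_i^{-1}\big),
\qquad
p(w \mid \mathbf{t}, X, \alpha, \beta) = \mathcal{N}(w \mid m, \Sigma),
\\[6pt]
m = \beta\, \Sigma\, \Phi^{\mathsf T}\mathbf{t},
\qquad
\Sigma = \big(A + \beta\, \Phi^{\mathsf T}\Phi\big)^{-1},
\qquad
A = \operatorname{diag}(\alpha_1,\dots,\alpha_M).
```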
Chapter 7.2. Relevance Vector Machines
20
RVM for regression
Thus, in order to estimate 𝛼 and 𝛽, we have to compute the marginal likelihood (evidence).
This can be transformed into the following expression by taking the logarithm.
We have to maximize ln 𝑝(𝒕|𝑋, 𝛼, 𝛽) with respect to 𝛼 and 𝛽.
This optimization has no closed-form solution, so we have to use iterative re-estimation. That is,
here, Σ_𝑖𝑖 is the i-th diagonal element of the posterior
covariance matrix.
Take a look at 𝛼:
a huge 𝜶_𝒊 indicates (nearly) zero variance with mean zero (𝛼 is a precision)
for that weight parameter. Thus, that basis function does not have any influence and is effectively pruned.
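To make the iterative re-estimation concrete, here is a minimal numpy sketch of the standard update rules; the function name, initialization, and pruning threshold are illustrative, not from the slides.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, tol=1e-6, prune=1e9):
    """Sketch of RVM evidence (type-II ML) re-estimation for regression.
    Phi: (N, M) design matrix built from kernels, t: (N,) targets."""
    N, M = Phi.shape
    alpha = np.ones(M)                 # one precision per weight
    beta = 1.0 / (np.var(t) + 1e-12)   # rough initial noise precision
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)    # posterior covariance
        m = beta * Sigma @ Phi.T @ t                     # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)             # gamma_i = 1 - alpha_i * Sigma_ii
        # alpha_i <- gamma_i / m_i^2, capped for numerical stability (capped bases ~ pruned)
        alpha_new = np.minimum(gamma / (m ** 2 + 1e-12), prune)
        beta_new = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
        converged = np.max(np.abs(alpha_new - alpha)) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    relevant = alpha < prune    # surviving basis functions = relevance vectors
    return m, Sigma, alpha, beta, relevant
```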
Chapter 7.2. Relevance Vector Machines
21
RVM for regression
After we find the optimal values for 𝛼 and 𝛽, we can generate the predictive distribution for the target value 𝒕.
Now let’s compare the SVM regression and the RVM regression.
(Figures: SVM fit vs. RVM fit.)
1. The RVM requires a much smaller number of relevance (support) vectors, so prediction is faster.
2. However, the RVM takes more time to train the model, due to the inversion of the 𝑪 matrix.
Chapter 7.2. Relevance Vector Machines
22
Analysis of Sparsity
Let’s focus on the parameter 𝛼. How does it contribute to the model’s sparsity (selecting a reasonable set of basis functions)?
Consider a model with only one basis function and two data points (𝑥_1, 𝑡_1), (𝑥_2, 𝑡_2).
Then, the aforementioned matrix 𝑪 can be computed as shown.
𝝋 is the N-dimensional vector (𝜙(𝑋_1), 𝜙(𝑋_2))ᵀ,
and similarly 𝒕 = (𝑡_1, 𝑡_2)ᵀ.
(Figure panels: when 𝛼 has an infinite value vs. a finite value of 𝛼.)
The direction of 𝝋 relative to 𝒕 is what is significant!
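For reference, the matrix 𝑪 for this two-point, single-basis toy case is

```latex
C = \beta^{-1} I + \alpha^{-1}\, \varphi\, \varphi^{\mathsf T}.
```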
Chapter 7.2. Relevance Vector Machines
23
Mathematical perspective
We now move on to the general 𝑁-dimensional case. We are still maximizing the marginal likelihood (which involves 𝐶) with respect to 𝛼 and 𝛽.
We can re-write 𝐶 as shown.
Here, 𝝋_𝒊 indicates the i-th column of the design matrix 𝚽.
Here, we have to compute the terms involving |𝐶| and 𝐶⁻¹.
However, we do not want to handle |𝐶| and 𝐶⁻¹ directly:
we have to think about how we can express
them with 𝑪_{−𝒊}, 𝜶_𝒊, and 𝝋_𝒊 (where 𝑪_{−𝒊} is 𝐶 with the contribution of basis 𝑖 removed),
by using the matrix determinant and matrix inversion (Woodbury) identities.
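For reference, the decomposition of 𝐶 that isolates the contribution of basis 𝑖 (used together with the determinant and inversion identities) is

```latex
C = \beta^{-1} I + \sum_{j \ne i} \alpha_j^{-1} \varphi_j \varphi_j^{\mathsf T}
  + \alpha_i^{-1} \varphi_i \varphi_i^{\mathsf T}
  = C_{-i} + \alpha_i^{-1} \varphi_i \varphi_i^{\mathsf T}.
```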
Chapter 7.2. Relevance Vector Machines
24
Mathematical perspective
We can organize everything with two new variables, 𝑠_𝑖 and 𝑞_𝑖.
Here, 𝒔_𝒊 indicates the sparsity and 𝒒_𝒊 indicates the quality of 𝝋_𝒊.
1. Sparsity (𝑠_𝑖) measures the extent to which basis function 𝜑_𝑖
overlaps with the other basis vectors in the model.
2. Quality (𝑞_𝑖) measures the alignment of the basis vector 𝝋_𝒊 with
the training target vector 𝒕.
Now, in order to decide the optimal value of 𝛼_𝑖, we do not need to
reconsider the values of the other 𝛼_𝑗. So, we only have to calculate the derivative
of 𝜆(𝛼_𝑖), which will be introduced on the following page.
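For reference, the two quantities named on this slide are defined as

```latex
s_i = \varphi_i^{\mathsf T}\, C_{-i}^{-1}\, \varphi_i,
\qquad
q_i = \varphi_i^{\mathsf T}\, C_{-i}^{-1}\, \mathbf{t},
```

where 𝐶_{−𝑖} denotes 𝐶 with the contribution of basis function 𝑖 removed.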
Chapter 7.2. Relevance Vector Machines
25
Mathematical perspective
The relevant part of the log marginal likelihood, 𝜆(𝛼_𝑖), can be written as shown.
Recalling that 𝛼_𝑖 ≥ 0 (it’s a precision!), we should think of two cases.
1. If 𝑞_𝑖² < 𝑠_𝑖, then 𝛼_𝑖 → ∞ / the second term goes positive, so the first term should be close to zero (the basis function is pruned).
2. If 𝑞_𝑖² > 𝑠_𝑖, a finite solution exists, as shown below.
According to these equations, we can get an iterative optimization method for the RVM.
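For completeness, the resulting update (the standard sparse solution, not visible in this extraction) is

```latex
\alpha_i =
\begin{cases}
\dfrac{s_i^{2}}{\,q_i^{2} - s_i\,}, & q_i^{2} > s_i \quad \text{(basis function kept)},\\[10pt]
\infty, & q_i^{2} \le s_i \quad \text{(basis function pruned)}.
\end{cases}
```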
Chapter 7.2. Relevance Vector Machines
26
RVM for classification
The relevance vector machine can be extended to a classification model by simply using the logistic regression model
with an ARD prior.
As we covered in chapter 4, we cannot integrate over 𝑤 analytically; instead, we use the Laplace approximation.
It’s been a while, so let’s briefly revise the Laplace approximation.
That is, the weight parameters each have their own prior precision, and are independent a priori!
What we need are…
1. The mode of the posterior.
2. The Hessian of the posterior at the mode.
Here, the mode satisfies…
Note that 𝐵 is the 𝑁×𝑁 diagonal matrix with entries 𝑦_𝑛(1 − 𝑦_𝑛).
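For reference, the Laplace-approximation quantities referred to on this slide (gradient of the log posterior, its mode, and the negative Hessian) take the standard form:

```latex
\nabla \ln p(w \mid \mathbf{t}, \alpha) = \Phi^{\mathsf T}(\mathbf{t} - \mathbf{y}) - A w,
\qquad
w^{\star} = A^{-1}\Phi^{\mathsf T}(\mathbf{t} - \mathbf{y}),
\\[6pt]
-\nabla\nabla \ln p(w \mid \mathbf{t}, \alpha) = \Phi^{\mathsf T} B\, \Phi + A,
\qquad
\Sigma = \big(\Phi^{\mathsf T} B\, \Phi + A\big)^{-1},
\qquad
B = \operatorname{diag}\big(y_n(1-y_n)\big).
```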
Chapter 7.2. Relevance Vector Machines
27
RVM for classification
Here, since we don’t know the exact value of 𝛼, we have to estimate it by maximizing the evidence.
After substituting the Laplace approximation, we get the estimate of 𝛼 by setting the derivative of the marginal likelihood to zero.
Note that the result is equivalent
to that of the regression case.
At the same time, by defining a pseudo-target 𝒕̂ as follows, we can take a much simpler path.
Note that this result matches the result of the regression example.
Thus, we can apply the same sparsity analysis of 𝜶 as we did before!
For the multi-class case, we can simply train 𝑘 different
models for the 𝑘 class labels, then use a softmax function.