1. Introduction + Chapter 1
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
2. Study Introduction
Study Objective
- Assuming that we are familiar with Mathematical Statistics 1 & 2, regression analysis, and basic Bayesian statistics:
- Get the intuition behind the various algorithms.
- Understand the mathematical concepts of the algorithms.
- Review the algorithms from a statistical perspective.
Time
- Fixed time TBD
Method
- A week before each session, we choose the scope of that session.
- I will prepare a summary of the scope.
- Every participant should study the scope and prepare some related questions!
3. Notation
y(x, w) : estimated value given the parameters w (our estimate of y).
t_n : true target value (the actual y).
𝒕 : the set of observed target values, 𝒕 = (t_1, …, t_N)ᵀ, paired with the inputs 𝐱 = (x_1, …, x_N)ᵀ.
E(w) = L(y(x, w), t) : error function, which measures the misfit between the estimated value and the true value.
‖w‖² = wᵀw = w_1² + w_2² + ⋯ + w_n² : squared Euclidean norm (also called the l2-norm).
μ, Σ, |Σ| : mean, covariance matrix, and determinant of the covariance.
β = Σ⁻¹ (for the univariate case, β = 1/σ²) : precision parameter (inverse of the covariance).
𝜇𝑀𝐿 : Estimated mean by maximum likelihood estimation.
4. Chapter 1.1. Polynomial Curve Fitting
We have already covered most of the sections in chapter 1 in our undergraduate classes.
Thus, I would like to cover only the concepts which are unfamiliar to us.
Most of our regression coursework focuses on (simple) linear regression, where the least-squares estimate is
β̂ = (XᵀX)⁻¹ Xᵀ Y.
The estimate above can be obtained via the normal equations.
However, how can we fit data like the example here, where the relationship is clearly not linear?
Here we construct the model using polynomial features: y(x, w) = w_0 + w_1 x + w_2 x² + ⋯ + w_M x^M.
We can still apply the squared error, E(w) = ½ Σ_n {y(x_n, w) − t_n}²!
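Not from the slides, just a minimal numpy sketch of what this looks like in practice, assuming a toy sinusoidal data set and a hand-picked degree M; the least-squares weights come straight from the normal equations.

```python
import numpy as np

# Synthetic data: noisy samples from sin(2*pi*x), in the spirit of PRML's running example.
rng = np.random.default_rng(0)
N, M = 10, 3                       # number of points and polynomial degree (assumed values)
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with polynomial features: Phi[n, j] = x_n ** j
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares solution of the normal equations: w = (Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Prediction y(x, w) and the sum-of-squares error E(w)
y = Phi @ w
E = 0.5 * np.sum((y - t) ** 2)
print("w =", w, " E(w) =", E)
```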
5. Chapter 1.2.6 Bayesian Curve Fitting
As we all know, in the Bayesian treatment we assume a distribution over the parameters.
Furthermore, we have to marginalize the parameters out in order to make a prediction!
This process can be expressed by the predictive distribution p(t|x, 𝐱, 𝒕) = ∫ p(t|x, w) p(w|𝐱, 𝒕) dw.
This entire process will be covered in detail in chapter 3!
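For reference, a minimal sketch of the closed-form predictive distribution quoted in Section 1.2.6 (mean m(x), variance s²(x), and matrix S as in the book); the toy data, the degree, and the hyperparameters α and β are my own illustrative choices.

```python
import numpy as np

# Toy data and polynomial design matrix (same setup as the previous sketch).
rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, M + 1, increasing=True)

# Assumed hyperparameters: prior precision alpha, noise precision beta.
alpha, beta = 5e-3, 11.1

# S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predictive(x_new):
    """Gaussian predictive distribution: mean m(x) and variance s^2(x)."""
    phi = x_new ** np.arange(M + 1)
    mean = beta * phi @ S @ (Phi.T @ t)
    var = 1.0 / beta + phi @ S @ phi
    return mean, var

m, s2 = predictive(0.5)
print(f"p(t | x=0.5, data) ~ N({m:.3f}, {s2:.3f})")
```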
6. Chapter 1.5. Decision Theory
Our goal: obtaining the joint distribution p(x, t), but in most cases this is extremely hard.
In practice, we estimate the posterior, p(t|x) = p(C_k|x) = p(x|C_k) p(C_k) / p(x).
For cancer diagnosis, the prior p(C_k) encodes our belief: the knowledge we have before taking the X-ray.
Consider that we are trying to build a decision rule.
For binary classification, we divide the input space into two decision regions, R1 and R2.
What we do in ML is "minimize the misclassification rate".
Here, let's take the decision boundary to be at x̂.
The optimal boundary is x = x0, the point where the joint densities p(x, C1) and p(x, C2) cross.
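A toy numeric illustration (my own made-up class densities, not from the book): sweeping the boundary x0 and measuring p(mistake) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx shows that the best boundary sits where the joint densities cross.

```python
import numpy as np
from scipy.stats import norm

# Toy joint densities p(x, C_k) = p(x | C_k) p(C_k) for two classes (all values made up).
prior = np.array([0.6, 0.4])
cond = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]

xs = np.linspace(-6.0, 8.0, 1401)
dx = xs[1] - xs[0]
joint = np.stack([prior[k] * cond[k].pdf(xs) for k in range(2)])   # p(x, C_k) on a grid

def p_mistake(x0):
    """p(mistake) = int_{R1} p(x, C2) dx + int_{R2} p(x, C1) dx, with R1 = {x < x0}, R2 = {x >= x0}."""
    return (joint[1, xs < x0].sum() + joint[0, xs >= x0].sum()) * dx

errors = np.array([p_mistake(x0) for x0 in xs])
i_opt = int(np.argmin(errors))
print(f"best boundary x0 ~ {xs[i_opt]:.2f}, p(mistake) ~ {errors[i_opt]:.3f}")
# At the optimum the joint densities (equivalently the posteriors) are approximately equal:
print(f"p(x0, C1) ~ {joint[0, i_opt]:.4f}, p(x0, C2) ~ {joint[1, i_opt]:.4f}")
```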
7. Chapter 1.5. Decision Theory
We need a generalization of the concept of "loss".
Here we define the "loss function".
L_kj : element of a loss matrix, the loss incurred when the true class is C_k but we assign the input to class C_j.
We minimize the average (expected) loss, E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx.
In practice, estimating the probability alone is not enough.
We need to assign a specific label! That is, we need to decide on a cut-off.
Declining to make a decision when the largest posterior p(C_k|x) falls below a threshold θ is called the "reject option".
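A minimal sketch (the loss values, the threshold θ, and the way the reject option is combined with the loss matrix are my own illustrative choices): given posteriors p(C_k|x), pick the class j that minimizes Σ_k L_kj p(C_k|x), and refuse to decide when the largest posterior is too small.

```python
import numpy as np

# Loss matrix L[k, j]: loss when the true class is C_k but we predict C_j (assumed values).
# In the spirit of cancer diagnosis: missing a cancer (k=1, j=0) is 10x worse than a false alarm.
L = np.array([[0.0,  1.0],
              [10.0, 0.0]])

def decide(posterior, theta=0.8):
    """Return 'reject', or the class index minimizing the expected loss sum_k L[k, j] * p(C_k | x)."""
    posterior = np.asarray(posterior, dtype=float)
    if posterior.max() < theta:            # reject option: too uncertain to commit
        return "reject"
    expected_loss = posterior @ L          # entry j = sum_k p(C_k|x) * L[k, j]
    return int(np.argmin(expected_loss))

print(decide([0.95, 0.05]))   # -> 0 : predict 'normal'
print(decide([0.85, 0.15]))   # -> 1 : the asymmetric loss tips the decision toward 'cancer'
print(decide([0.70, 0.30]))   # -> 'reject' : largest posterior is below theta
```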
8. Chapter 1.5. Decision Theory
Ways of performing classification
(A) Generative model
- Models the distribution of the input & output: estimates p(x|C_k) and the prior p(C_k) (equivalently the joint p(x, C_k)), then obtains the posterior via Bayes' theorem.
- Makes it possible to generate synthetic data.

(B) Discriminative model
- Estimates the posterior p(C_k|x) only.
- We calculate only the probability of our interest.

(C) Direct classification (discriminant function)
- Learns a mapping C_k = f(x) that directly yields the class label.
- We do not calculate any probability.
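To make the three approaches concrete, a quick scikit-learn sketch (my own choice of models and toy data, purely illustrative): Gaussian naive Bayes as a generative model, logistic regression as a discriminative model, and a linear SVM as a direct discriminant function.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB           # (A) generative: models p(x|C_k) and p(C_k)
from sklearn.linear_model import LogisticRegression  # (B) discriminative: models p(C_k|x) only
from sklearn.svm import LinearSVC                    # (C) direct discriminant: C_k = f(x), no probabilities

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

gen = GaussianNB().fit(X, y)
disc = LogisticRegression().fit(X, y)
direct = LinearSVC().fit(X, y)

x_new = X[:1]
print("generative posterior     p(C_k|x):", gen.predict_proba(x_new))
print("discriminative posterior p(C_k|x):", disc.predict_proba(x_new))
print("direct discriminant label C_k = f(x):", direct.predict(x_new))
# Only the generative model also captures p(x, C_k), so it could be used to sample synthetic data.
```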
9. Chapter 1.6. Information Theory
We are interested in "how much information is received when we observe a specific event?"
This is closely connected to the idea of uncertainty!
Let h(·) be a function giving the information gained by observing a specific event.
If two events x and y are independent, it should satisfy h(x, y) = h(x) + h(y).
At the same time, independent events satisfy p(x, y) = p(x) p(y), so h must turn a product of probabilities into a sum.
It is therefore natural to use h(x) = −log2 p(x).
What is the average amount of information received? It can be written as the entropy, H[x] = −Σ_x p(x) log2 p(x).
Check how the entropy values change as the probability changes
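A tiny numeric check of the two formulas above (my own toy probabilities): rarer events carry more information, and a two-state variable is most "surprising" when p = 0.5.

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x), with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Rare events carry more information: h(x) = -log2 p(x)
for prob in (0.5, 0.25, 0.01):
    print(f"p(x) = {prob:>4}: h(x) = {-np.log2(prob):.2f} bits")

# Entropy of a binary variable is largest at p = 0.5 and shrinks as p moves toward 0 or 1.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p}: H = {entropy_bits([p, 1 - p]):.3f} bits")
```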
10. Chapter 1.6. Information Theory
Ideation.
Consider a random variable with 8 possible states.
1st Case : all states equally likely (probability 1/8 each).
2nd Case : probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
See how the average information changes: the entropy is 3 bits in the first case but only 2 bits in the second.
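A quick check of the two cases with scipy (the values match the book: 3 bits vs. 2 bits):

```python
from scipy.stats import entropy   # -sum p log p, with the base chosen below

uniform = [1/8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(uniform, base=2))   # 3.0 bits
print(entropy(skewed, base=2))    # 2.0 bits: the non-uniform distribution needs fewer bits on average
```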
At the same time, we can define entropy as ‘average amount of information needed to specify the state of a random variable.’
Now, consider a multinomial setting: dividing N identical objects among a set of boxes, with n_i objects in box i.
The number of distinct ways of doing this, the multiplicity, is W = N! / (n_1! n_2! ⋯ n_M!).
This can be interpreted as the multi-box version of the binomial coefficient C(n, k).
Similarly to the binomial case, we are assigning each object to one of several different boxes.
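A small numeric illustration (toy counts of my own choosing) of the multiplicity and its relation to the binomial coefficient:

```python
from math import comb, factorial, prod

def multiplicity(counts):
    """W = N! / (n_1! * ... * n_M!): ways to assign N objects to boxes with the given counts."""
    return factorial(sum(counts)) // prod(factorial(n) for n in counts)

# With only two boxes this is exactly the binomial coefficient C(n, k).
print(multiplicity([3, 7]), comb(10, 3))   # both 120
# With more boxes it is the multinomial coefficient.
print(multiplicity([2, 3, 5]))             # 2520
```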
11. Chapter 1.6. Information Theory
Let's take a deeper look at this quantity. We are interested in how much information we need to specify a particular arrangement of the objects.
Thus, we again turn it into an entropy by scaling with N, i.e., H = (1/N) ln W.
By applying Stirling's approximation…
Then, when is this entropy maximized?
We can find out by solving a constrained optimization with a Lagrange multiplier.
Here, the maximum is attained at p(x_i) = 1/M, i.e., the uniform distribution.
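For reference, the steps the slide compresses, written out (standard derivation, notation as in the slides):

```latex
\[
H = \frac{1}{N}\ln W = \frac{1}{N}\ln\frac{N!}{\prod_i n_i!}
\;\;\overset{\ln N!\,\simeq\, N\ln N - N}{\longrightarrow}\;\;
-\sum_i \frac{n_i}{N}\ln\frac{n_i}{N} = -\sum_i p(x_i)\ln p(x_i)
\]
\[
\text{maximize } \tilde H = -\sum_i p(x_i)\ln p(x_i) + \lambda\Big(\sum_i p(x_i)-1\Big)
\;\Rightarrow\; p(x_i) = \frac{1}{M}\ \text{for all } i,\qquad H_{\max} = \ln M
\]
```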
12. Chapter 1.6. Information Theory
Let's extend this idea to continuous variables by using the mean value theorem (for integrals).
** From Wikipedia: if a function f(x) is continuous on the closed interval [a, b], then for some value c between a and b the following holds (something like a Riemann sum):
∫_a^b f(x) dx = f(c) (b − a)
We may simply extend it to each bin of width Δ: ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i) Δ for some x_i in the i-th bin.
Obviously, the interval Δ should be as small as possible to increase the accuracy of the approximation, so we consider Δ → 0.
In this way we can express a continuous variable's entropy starting from the discrete form.
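Written out, the limiting argument referred to above (standard steps; the diverging −ln Δ term is discarded):

```latex
\[
\int_{i\Delta}^{(i+1)\Delta} p(x)\,dx = p(x_i)\,\Delta
\quad\Rightarrow\quad
H_\Delta = -\sum_i p(x_i)\Delta\,\ln\!\big(p(x_i)\Delta\big)
         = -\sum_i p(x_i)\Delta\,\ln p(x_i) \;-\; \ln\Delta
\]
\[
\text{dropping the diverging } -\ln\Delta \text{ term and letting } \Delta\to 0:\qquad
H[x] = -\int p(x)\ln p(x)\,dx \quad\text{(differential entropy)}
\]
```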
13. Chapter 1.6. Information Theory
Let's again maximize this (differential) entropy by using Lagrange multipliers.
Constraints: the distribution must be properly normalized, and its mean and variance are fixed.
We are maximizing the differential entropy subject to these constraints.
We set the (functional) derivative of the Lagrangian to zero, and after some more math (solving for the multipliers) we get…
Oh… this is amazing…
The probability distribution that gives maximum entropy is the Gaussian distribution!!
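A sketch of the constrained maximization being described (constraints and result as in the book; the intermediate algebra is omitted):

```latex
\[
\text{maximize } -\int p(x)\ln p(x)\,dx
\quad\text{s.t.}\quad
\int p(x)\,dx = 1,\qquad
\int x\,p(x)\,dx = \mu,\qquad
\int (x-\mu)^2 p(x)\,dx = \sigma^2
\]
\[
\frac{\delta}{\delta p}\big[\cdot\big] = 0
\;\Rightarrow\;
p(x) = \exp\!\big\{-1+\lambda_1+\lambda_2 x+\lambda_3 (x-\mu)^2\big\}
\;\Rightarrow\;
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\} = \mathcal{N}(x\,|\,\mu,\sigma^2)
\]
```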
14. Chapter 1.6. Information Theory
Kullback-Leibler divergence (KL divergence)
We have all heard of the KL divergence many times. But what exactly does it indicate?
Consider a variable x with true distribution p(x), which we are trying to model using q(x).
The KL divergence is the average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x).
In short, it indicates "how much more information do we need?"
KL(p‖q) = −∫ p(x) ln q(x) dx − (−∫ p(x) ln p(x) dx): the second term is the original entropy of p(x), and the first is the new (cross-)entropy we pay when encoding x with the estimated q(x).
We have already covered the KL divergence's inequality in Mathematical Statistics I, using Jensen's inequality.
Note that KL(p‖q) ≥ 0, with equality if and only if p(x) = q(x), and that KL(p‖q) ≠ KL(q‖p) in general.
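A small numeric check with toy discrete distributions of my own choosing: KL is nonnegative, vanishes only when q = p, and is not symmetric.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) * ln(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([1/3, 1/3, 1/3])
print(kl(p, p))   # 0.0   : no extra information needed when q = p
print(kl(p, q))   # > 0   : extra nats needed when coding with the wrong q
print(kl(q, p))   # different value : KL is not symmetric
```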
15. Chapter 1.6. Information Theory
KL divergence in ML
KL(p‖q) = −∫ p(x) ln(q(x)/p(x)) dx = E_x[−ln(q(x)/p(x))].
By using the fact that an expectation under p can be approximated by a finite sample average, E[f] ≃ (1/N) Σ_{n=1}^N f(x_n), we can estimate the KL divergence from data.
Now, let's think of data x_n drawn from an unknown distribution p(x).
We are trying to model p(x) by using a parametric distribution q(x|θ).
If q(x|θ) is similar to p(x), then the KL divergence KL(p‖q) is relatively small.
Applying the sample average, KL(p‖q) ≃ (1/N) Σ_{n=1}^N {−ln q(x_n|θ) + ln p(x_n)}.
Here, ln p(x_n) does not depend on θ; it is already fixed, so we do not need the second term.
Thus, we only need to minimize Σ_{n=1}^N {−ln q(x_n|θ)}, which is exactly the negative log likelihood!
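A numeric sanity check (my own toy setup; the function names are mine): minimizing the sample average of −ln q(x_n|θ) for a Gaussian q recovers the usual maximum likelihood estimates.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # samples from the "unknown" p(x)

# Sample estimate of KL(p || q_theta) up to the theta-independent entropy of p:
# (1/N) * sum_n -ln q(x_n | theta), i.e. the average negative log likelihood.
def avg_neg_log_lik(theta):
    mu, log_sigma = theta
    return -np.mean(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(avg_neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Same answer as the closed-form maximum likelihood estimates:
print(mu_hat, sigma_hat)          # ~ 3.0, ~ 2.0
print(x.mean(), x.std())          # mu_ML, sigma_ML (match up to optimizer tolerance)
```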