[AI x Robotics : The First] Event - Lecture by Dr. Hongbae Kim (김홍배)
Bayesian Inference : From the Kalman Filter to Optimization
AI Robotics KR
(https://www.facebook.com/groups/airoboticskr/)
2. Bayes Rule
P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}
• Bayes rule tells us how to do inference about hypotheses from data.
• Learning and prediction can be seen as forms of inference.
Given information
Estimate hypothesis
Rev'd Thomas Bayes (1702-1761)
Data : Observation, Hypothesis : Model
Countbayesie.com/blog/2016/5/1/a-guide-to-Bayesian-statistics
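As a quick worked illustration of the rule above, here is a minimal Python sketch; the hypothesis, data event, and all probability values are made up for this example and are not from the talk.

```python
# Made-up example: hypothesis H = "part is defective", data D = "inspection flags the part".
p_h = 0.01            # prior P(hypothesis)
p_d_given_h = 0.95    # likelihood P(data | hypothesis)
p_d_given_not_h = 0.10

# Evidence P(data) via the law of total probability
p_d = p_d_given_h * p_h + p_d_given_not_h * (1.0 - p_h)

# Bayes rule: P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(hypothesis | data) = {p_h_given_d:.3f}")  # ~ 0.088
```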
3. Contents :
- Learning : Maximum a Posteriori Estimator (MAP)
- Prediction : Kalman Filter and its implementation
- Optimization : Bayesian Optimization and its application
5. Learning : Logistic regression
Cost to minimize, the cross-entropy error function:
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log P(y_i \mid x_i; \theta)
Likelihood: P(y \mid x; \theta)
Maximum likelihood estimator (MLE):
\theta^* = \arg\max_{\theta} \log P(y \mid x; \theta) = \arg\min_{\theta} J(\theta)
• This approach is inherently ill-conditioned: it is sensitive to noise and model error.
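A minimal numpy sketch of this MLE view of logistic regression; the toy data, learning rate, and iteration count are illustrative choices, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def j_mle(theta, X, y):
    """Cross-entropy error J(theta) = -(1/m) * sum_i log P(y_i | x_i; theta)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy data; the MLE is found by minimizing J(theta) with plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

theta = np.zeros(2)
for _ in range(500):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of J(theta)
    theta -= 0.5 * grad

print("theta_MLE =", theta, " J(theta_MLE) =", j_mle(theta, X, y))
```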
6. Learning : Regularized Logistic Regression
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes rule:
P(\theta \mid x, y) = \frac{P(y \mid x, \theta) \, P(\theta)}{P(y)}
Posterior distribution over model parameters: P(\theta \mid x, y)
7. Learning : Regularized Logistic Regression
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes rule:
Data likelihood for specific parameters: P(y \mid x, \theta)
(could be modeled with a deep network!)
8. Learning : Regularized Logistic Regression
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes rule:
Prior distribution over parameters: P(\theta)
(describes our prior knowledge and/or our desires for the model)
9. Learning : Regularized Logistic Regression
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes rule:
Bayesian evidence: P(y)
A powerful method for model selection!
10. Learning : Regularized Logistic Regression
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes rule:
As a rule, the evidence term is intractable :(
(the integral over all parameter values can essentially never be computed in closed form)
11. Learning : Regularized Logistic Regression
The core idea of the Maximum a Posteriori (MAP) estimator:
J_{MAP}(\theta) = -\log P(\theta \mid x, y) = -\log P(y \mid x, \theta) - \log P(\theta) + \log P(y)
             = J_{MLE}(\theta) + \frac{1}{2\sigma_w^2} \sum_i \theta_i^2 + \text{const}
\theta_{MAP}^* = \arg\max_{\theta} \left( \log P(y \mid x, \theta) + \log P(\theta) \right) = \arg\min_{\theta} J_{MAP}(\theta)
This is the loss function of the posterior distribution over the model parameters, assuming a Gaussian prior for the weights.
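Continuing the same toy setup, a hedged sketch of the MAP version: with a Gaussian prior on the weights, the only change is the added L2 penalty term. The prior variance σ_w² = 1 is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_j_map(theta, X, y, sigma_w2=1.0):
    """Gradient of J_MAP(theta) = J_MLE(theta) + (1/(2*sigma_w^2)) * sum_i theta_i^2."""
    grad_mle = X.T @ (sigmoid(X @ theta) - y) / len(y)
    return grad_mle + theta / sigma_w2            # extra term from the Gaussian prior

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

theta = np.zeros(2)
for _ in range(500):
    theta -= 0.5 * grad_j_map(theta, X, y)

print("theta_MAP =", theta)   # shrunk toward zero compared with the MLE solution
```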
20. Prediction : Kalman Filter
Autonomous Mobile Robot Design, Dr. Kostas Alexis (CSE)
Kalman Filter – A Primer
Consider a time-discrete stochastic process (Markov chain).
21. Prediction : Kalman Filter
The Kalman filter estimates the state x_t of a discrete-time controlled process that is governed by the linear stochastic difference equation
x_t = A x_{t-1} + B u_t + w_t
and (linear) measurements of the state
z_t = H x_t + v_t
with w_t \sim N(0, Q) and v_t \sim N(0, R).
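A minimal predict/update sketch of the linear Kalman filter described above, on a toy 1-D constant-velocity target; the model matrices and noise levels are made-up illustrative values, and the control input term is omitted (u = 0).

```python
import numpy as np

# Process model  x_k = A x_{k-1} + w_k,  w ~ N(0, Q)   (control input omitted here)
# Measurement    z_k = H x_k + v_k,      v ~ N(0, R)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity dynamics, state = [pos, vel]
H = np.array([[1.0, 0.0]])               # only the position is measured
Q = 1e-3 * np.eye(2)                     # process noise covariance
R = np.array([[0.04]])                   # measurement noise covariance

def kf_step(x, P, z):
    # Prediction
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with the measurement z
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Track a target moving at 1 m/s from noisy position measurements
rng = np.random.default_rng(0)
x_est, P = np.zeros(2), np.eye(2)
for k in range(1, 51):
    z = np.array([1.0 * k * dt + rng.normal(scale=0.2)])
    x_est, P = kf_step(x_est, P, z)

print("estimated [position, velocity] =", x_est)
```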
25. Prediction : Implementation of Kalman Filter
GPS-aided IMU:
- The gyro has drift, bias, and alignment errors
- GPS, vision, or kinematics can compensate for these inherent problems
"Assessment of Integrated GPS/INS for the EX-171 Extended Range Guided Munition", AIAA-98-4416
26. Prediction : Implementation of Kalman Filter
• Equation of the error dynamics (state equation)
• Measurement model (output equation)
34. Why GPs?
- They provide closed-form predictions!
- They are effective for small-data problems
- And they are explainable!
35. How Do We Deal With Many Parameters, Little Data?
1. Regularization
   e.g., smoothing, an L1 penalty, dropout in neural nets, a large K for K-nearest neighbors
2. Standard Bayesian approach
   specify the probability of the data given the weights, P(D|W)
   specify a prior over the weights given a hyper-parameter α, P(W|α)
   find the posterior over the weights given the data, P(W|D, α)
   with little data, a strong weight prior constrains inference
3. Gaussian processes
   place a prior over functions, p(f), directly, rather than over model parameters, p(w)
36. Functions : Relationship between Input and Output
Distribution of functions that lie within the range of the input X and the output f
Prior over functions, no constraints
[Figure: sample functions f over the input range X, drawn from the prior]
37. Gaussian Process Approach
• A GP specifies a prior over functions, f(x)
• Suppose we have a set of observations: D = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)}
• Standard Bayesian approach: p(f|D) ∝ p(D|f) p(f)
One view of Bayesian inference:
• generate samples from the prior
• discard all samples inconsistent with our data, leaving the samples of interest (the posterior)
• the Gaussian process allows us to do this analytically
[Figure: function samples from the prior and from the posterior]
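A hedged numpy sketch of that analytic conditioning; the squared-exponential (RBF) kernel, its hyper-parameters, and the toy observations are assumptions made for the example, not choices from the slides.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5, var=1.0):
    """Squared-exponential covariance k(a, b) between two sets of 1-D inputs."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# Observations D = {(x_i, y_i)}
x_obs = np.array([-1.5, -0.5, 0.3, 1.2])
y_obs = np.sin(3.0 * x_obs)
x_star = np.linspace(-2.0, 2.0, 101)     # test inputs
noise = 1e-4                             # small observation-noise jitter

# Condition the joint Gaussian on the data: p(f(x_star) | D) is again Gaussian.
K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_s = rbf_kernel(x_obs, x_star)
K_ss = rbf_kernel(x_star, x_star)
K_inv = np.linalg.inv(K)

mu_post = K_s.T @ K_inv @ y_obs                 # posterior mean
cov_post = K_ss - K_s.T @ K_inv @ K_s           # posterior covariance
std_post = np.sqrt(np.clip(np.diag(cov_post), 0.0, None))

print("posterior mean/std at x = 0:", mu_post[50], std_post[50])  # index 50 is x = 0
```

Because everything stays Gaussian, the "discard inconsistent samples" step never has to be carried out explicitly.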
38. Gaussian Process Approach
A Bayesian data-modeling technique that accounts for uncertainty
Bayesian kernel regression machines
39. Procedure to sample from a GP
1. Assume the input X and the function values f are distributed as follows
2. Compute the covariance matrix K for a given X = {x_1, …, x_n}
[Figure: input X vs. function values f]
40. Procedure to sample from a GP
3. Compute the SVD or Cholesky decomposition of K to get orthogonal basis functions:
   K = A S B^T = L L^T
4. Compute the sampled functions:
   f_i = A S^{1/2} u_i  or  f_i = L u_i
   where u_i is a random vector with zero mean and unit variance,
   and L is the lower-triangular factor of the Cholesky decomposition of K.
[Figure: sampled functions f over X, from the prior and from the posterior]
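A short numpy sketch of steps 2–4 above; the kernel choice, its length scale, and the small diagonal jitter are assumptions for the example.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# 1.-2. Input grid X and its covariance matrix K
X = np.linspace(-2.0, 2.0, 200)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))   # jitter keeps K positive definite

# 3. Cholesky decomposition K = L L^T
L = np.linalg.cholesky(K)

# 4. f_i = L u_i with u_i ~ N(0, I): each f_i is one function drawn from the GP prior
rng = np.random.default_rng(0)
samples = [L @ rng.standard_normal(len(X)) for _ in range(5)]

print(len(samples), "prior function samples, each evaluated at", len(X), "inputs")
```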
41. A simple PD control example
J = \theta_1 \left( r(t) - y(t) \right) + \theta_2 \, y(t)
Which globally optimal gains \theta give a minimum cost J?
42. A simple PD control example
Procedure of Bayesian Optimization:
1. GP prior, before observing any data
2. GP posterior, after five noisy evaluations
3. The next parameters \theta_{next} are chosen at the maximum of the acquisition function
Repeat until a globally optimal \theta is found (see the sketch below).
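A hedged sketch of that loop with a 1-D GP surrogate; to keep it short it tunes a single gain θ, and the stand-in cost function, the RBF kernel, and the confidence-bound acquisition (used here in place of the unspecified acquisition function on the slide) are all illustrative assumptions.

```python
import numpy as np

def cost(theta):
    """Stand-in for the closed-loop cost J(theta); in practice each evaluation
    would run the PD controller and measure the resulting tracking error."""
    return (theta - 0.7) ** 2 + 0.05 * np.sin(8.0 * theta)

def rbf(a, b, length=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """GP posterior mean and standard deviation on a query grid."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_query)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_obs
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.5, 300)            # candidate gains theta
theta_obs = rng.uniform(0.0, 1.5, size=3)    # a few initial noisy evaluations
J_obs = cost(theta_obs) + 0.01 * rng.standard_normal(3)

for _ in range(15):
    mu, sigma = gp_posterior(theta_obs, J_obs, grid)
    # Confidence-bound acquisition for minimization: prefer low predicted
    # cost or high uncertainty; its optimum picks the next gain to try.
    acq = mu - 2.0 * sigma
    theta_next = grid[np.argmin(acq)]
    J_next = cost(theta_next) + 0.01 * rng.standard_normal()
    theta_obs = np.append(theta_obs, theta_next)
    J_obs = np.append(J_obs, J_next)

print("best gain found:", theta_obs[np.argmin(J_obs)])
```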
44. Acquisition function and Entropy Search
Information gain I, the mutual information for an observed data point:
I = H(x^* \mid D_t) - H(x^* \mid D_t \cup \{x, y\})
i.e., the reduction of uncertainty about the location x^* obtained by selecting the points (x, y) that are expected to cause the largest reduction in the entropy of the distribution H(x^* \mid D_t).
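As a tiny worked illustration of that quantity, a discrete Python sketch; the three candidate locations and all probabilities are made up, and the entropy-search setting on the slide uses continuous distributions rather than this toy discrete case.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Made-up discrete example: 3 candidate locations for the optimum x*,
# and a binary observation y gathered at some query point x.
p_xstar = np.array([0.5, 0.3, 0.2])              # current belief P(x* | D_t)
p_y_given_xstar = np.array([[0.9, 0.1],          # P(y | x*) for each location
                            [0.4, 0.6],
                            [0.2, 0.8]])

p_y = p_xstar @ p_y_given_xstar                  # marginal P(y)
h_before = entropy(p_xstar)                      # H(x* | D_t)

# Expected entropy after observing y: average of H(x* | D_t U {x, y}) over y
h_after = 0.0
for j in range(2):
    posterior = p_y_given_xstar[:, j] * p_xstar / p_y[j]   # Bayes update
    h_after += p_y[j] * entropy(posterior)

print("information gain I =", h_before - h_after)   # > 0: the query is informative
```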