1. Computer Vision: Least Squares Minimization
IIT Kharagpur
Computer Science and Engineering,
Indian Institute of Technology Kharagpur
2. Solution of Linear equations
Consider a system of equations of the form Ax = b. Let A be an m × n
matrix.
If m < n there are more unknowns than equations. In this case
there will not be a unique solution, but rather a vector space of
solutions.
If m = n there will be a unique solution as long as A is invertible.
If m > n there will be more equations than unknowns. In general
the system will not have a solution.
3. Least-squares solution: Full rank case
Consider the case m > n and assume that A is of rank n. We seek a
vector x that is closest to providing a solution to the system Ax = b.
We seek x such that ||Ax − b|| is minimized. Such an x is known as
the least squares solution to the over-determined system.
Taking the SVD A = U D V^T, we seek x that minimizes ||Ax − b|| = ||U D V^T x − b||.
Because orthogonal transforms preserve the norm,
||U D V^T x − b|| = ||D V^T x − U^T b||
Writing y = V^T x and b' = U^T b, the problem becomes one of minimizing ||Dy − b'||, where D is a diagonal matrix.
4. Least-squares solution (continued)
The matrix D has diagonal entries d1, . . . , dn followed by m − n rows of zeros, so
Dy = (d1 y1, d2 y2, . . . , dn yn, 0, . . . , 0)^T and b' = (b'1, b'2, . . . , b'n, b'n+1, . . . , b'm)^T
The nearest Dy can approach to b' is the vector (b'1, b'2, . . . , b'n, 0, . . . , 0)^T
This is achieved by setting yi = b'i / di for i = 1, . . . , n
The assumption rank A = n ensures that di ≠ 0
Finally x is retrieved from x = Vy.
5. Algorithm: Least Squares
Objective:
Find the least-squares solution to the m × n set of equations Ax = b,
where m > n and rank A = n.
Algorithm:
(i) Find the SVD A = U D V^T
(ii) Set b' = U^T b
(iii) Find the vector y defined by yi = b'i / di, where di is the i-th diagonal entry of D
(iv) The solution is x = Vy
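A minimal numpy sketch of this algorithm, assuming a full-rank A with m > n (the function name and the use of numpy.linalg.svd are illustrative, not part of the slides):

```python
import numpy as np

def svd_least_squares(A, b):
    """Least-squares solution of Ax = b (m > n, rank A = n) via the SVD."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(d) V^T
    b_prime = U.T @ b          # (ii) b' = U^T b
    y = b_prime / d            # (iii) y_i = b'_i / d_i (requires d_i != 0)
    return Vt.T @ y            # (iv) x = V y

# Over-determined example; agrees with numpy's built-in solver
A = np.random.rand(6, 3)
b = np.random.rand(6)
x = svd_least_squares(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```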
6. Pseudo Inverse
Given a square diagonal matrix D, we define its pseudo-inverse to be the diagonal matrix D+ such that
(D+)ii = 0 if Dii = 0, and (D+)ii = 1/Dii otherwise.
For an m × n matrix A with m ≥ n, let the SVD be A = U D V^T. The pseudo-inverse of matrix A is
A+ = V D+ U^T
The least-squares solution to an m × n system of equations Ax = b of
rank n is given by x = A+ b. In the case of a deficient-rank system,
x = A+ b is the solution that minimizes ||x||.
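A small numpy illustration of the SVD-based pseudo-inverse (a sketch; the tolerance and names are my own choices, and numpy's np.linalg.pinv computes the same quantity):

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Pseudo-inverse A+ = V D+ U^T, inverting only the non-zero singular values."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])
    return Vt.T @ np.diag(d_plus) @ U.T

A = np.random.rand(5, 3)
assert np.allclose(pinv_svd(A), np.linalg.pinv(A))
# Least-squares solution of Ax = b as x = A+ b
b = np.random.rand(5)
x = pinv_svd(A) @ b
```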
7. Linear least-squares using normal equations
Consider a system of equations of the form Ax = b. Let A be an m × n matrix with m > n.
In general, no solution x will exist for this set of equations.
Consequently, the task is to find the vector x that minimizes the
norm ||Ax − b||.
As the vector x varies over all values, the product Ax varies over the complete column space of A, i.e. the subspace of R^m spanned by the columns of A.
The task is to find the closest vector to b that lies in the column
space of A.
8. Linear least-squares using normal equations
Let x be the solution to this problem.
Thus Ax is the closest point to b. In this case, the difference
Ax − b must be orthogonal to the column space of A.
This means that Ax − b is perpendicular to each of the columns of
A, hence
A^T (Ax − b) = 0, i.e. (A^T A) x = A^T b
The solution is given as:
x = (A^T A)^(−1) A^T b
i.e. x = A+ b with A+ = (A^T A)^(−1) A^T
The pseudo-inverse of matrix A computed via the SVD is given as
A+ = V D+ U^T
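A short numpy sketch contrasting the normal-equation solution with the pseudo-inverse route (names are illustrative; note that forming A^T A squares the condition number, so the SVD route is usually preferred numerically):

```python
import numpy as np

A = np.random.rand(8, 3)
b = np.random.rand(8)

# Normal equations: (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Pseudo-inverse route: x = A+ b
x_pinv = np.linalg.pinv(A) @ b

assert np.allclose(x_normal, x_pinv)
```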
9. Least-squares solution of homogeneous equations
Solving a set of equations of the form Ax = 0.
x has a homogeneous representation; hence if x is a solution, then kx is also a solution for any non-zero scalar k.
A reasonable constraint would be to seek a solution for which
||x|| = 1
In general, such a set of equations will not have an exact solution.
The problem is to find x that minimizes ||Ax|| subject to ||x|| = 1
10. Least-squares solution of homogeneous equations
Let A = U D V^T
We need to minimize ||U D V^T x||.
Note that ||U D V^T x|| = ||D V^T x||, so we need to minimize ||D V^T x||
Note that ||x|| = ||V^T x||, so we have the condition that ||V^T x|| = 1
Let y = V^T x, so we minimize ||Dy|| subject to ||y|| = 1
Since D is a diagonal matrix with its diagonal entries in descending order, it follows that the solution to this problem is y = (0, 0, . . . , 0, 1)^T.
Since y = V^T x, the solution x = Vy is simply the last column of V.
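A numpy sketch of this recipe (illustrative names): the unit-norm minimizer of ||Ax|| is the last column of V, i.e. the last row of the V^T factor returned by np.linalg.svd.

```python
import numpy as np

def homogeneous_least_squares(A):
    """Unit-norm x minimizing ||Ax||: right singular vector of the smallest singular value."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                      # last column of V = last row of V^T

A = np.random.rand(10, 4)
x = homogeneous_least_squares(A)
assert np.isclose(np.linalg.norm(x), 1.0)
# Spot-check: no random unit vector does better
for _ in range(100):
    v = np.random.randn(4)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A @ v) + 1e-9
```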
11. Iterative estimation techniques
X = f(P)
X is a measurement vector in R^N
P is a parameter vector in R^M.
We wish to seek the vector P satisfying X = f(P) − ε for which ||ε|| is minimized.
The linear least-squares problem is exactly of this type, with the function f being defined as a linear function f(P) = AP.
12. Iterative estimation methods
If the function f is not a linear function, we use iterative estimation techniques.
13. Iterative estimation methods
We start with an initial estimated value P0, and proceed to refine the estimate under the assumption that the function f is locally linear.
Let ε0 = f(P0) − X
We assume that the function is approximated at P0 by
f(P0 + ∆) = f(P0) + J∆
J is the linear mapping represented by the Jacobian matrix
J = ∂f/∂P
14. Iterative estimation methods
We seek a point f(P1), with P1 = P0 + ∆, which minimizes
f(P1) − X = f(P0) + J∆ − X = ε0 + J∆
Thus it is required to minimize ||ε0 + J∆|| over ∆, which is a linear minimization problem.
The vector ∆ is obtained by solving the normal equations
J^T J ∆ = −J^T ε0, i.e. ∆ = −J+ ε0
15. Iterative estimation methods
The solution vector P is obtained by starting with an estimate P0 and computing successive approximations according to the formula
Pi+1 = Pi + ∆i
where ∆i is the solution to the linear least-squares problem
J ∆i = −εi
Matrix J is the Jacobian ∂f/∂P evaluated at Pi, and εi = f(Pi) − X.
Ideally the algorithm converges to a least-squares solution P.
However, convergence may be to a local minimum only, or there may be no convergence at all.
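A compact numpy sketch of this iterative scheme using a finite-difference Jacobian; the exponential model, the data, and all names below are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def numerical_jacobian(f, P, h=1e-6):
    """Finite-difference approximation of J = df/dP at P."""
    f0 = f(P)
    J = np.zeros((f0.size, P.size))
    for j in range(P.size):
        dP = np.zeros_like(P)
        dP[j] = h
        J[:, j] = (f(P + dP) - f0) / h
    return J

def iterative_least_squares(f, X, P0, n_iter=20):
    """P_{i+1} = P_i + Delta_i, where J Delta_i = -eps_i is solved in the least-squares sense."""
    P = P0.astype(float)
    for _ in range(n_iter):
        eps = f(P) - X
        J = numerical_jacobian(f, P)
        delta = np.linalg.lstsq(J, -eps, rcond=None)[0]
        P = P + delta
    return P

# Toy example: fit a * exp(-k t) to noisy samples
t = np.linspace(0, 5, 30)
f = lambda P: P[0] * np.exp(-P[1] * t)
X = f(np.array([2.0, 0.7])) + 0.01 * np.random.randn(t.size)
P_hat = iterative_least_squares(f, X, P0=np.array([1.0, 1.0]))
```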
16. Newton’s method
We consider finding minima of functions of many variables.
Consider an arbitrary scalar-valued function g(P) where P is a
vector.
The optimization problem is simply to minimize g(P) over all
values of P.
Expand g(P) about P0 in a Taylor series to get
g(P0 + ∆) = g + gP ∆ + ∆^T gPP ∆ / 2 + . . .
where gP denotes the derivative of g(P) with respect to P, and gPP the derivative of gP with respect to P.
17. Newton’s method
Expand g(P) about P0 in a Taylor series to get
g(P0 + ∆) = g + gP ∆ + ∆^T gPP ∆ / 2 + . . .
Differentiating with respect to ∆ and setting the result to zero gives
gP + gPP ∆ = 0, i.e. gPP ∆ = −gP
Hessian matrix: gPP is the matrix of second derivatives, the Hessian of g. Its (i, j)-th entry is ∂²g/∂pi∂pj, where pi and pj are the i-th and j-th parameters. The vector gP is the gradient of g.
The method of Newton iteration consists in starting with an initial
value of the parameters, P0 and iteratively computing parameter
increments ∆ until convergence occurs.
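A minimal Newton-iteration sketch for a scalar cost g(P), with the gradient gP and Hessian gPP supplied analytically; the quadratic test function and the names are illustrative assumptions.

```python
import numpy as np

def newton_minimize(grad, hess, P0, n_iter=10):
    """Newton iteration: repeatedly solve gPP * delta = -gP and update P."""
    P = P0.astype(float)
    for _ in range(n_iter):
        delta = np.linalg.solve(hess(P), -grad(P))
        P = P + delta
    return P

# Toy quadratic: g(P) = 0.5 P^T Q P - c^T P, minimized at P = Q^{-1} c
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
P_min = newton_minimize(lambda P: Q @ P - c, lambda P: Q, P0=np.zeros(2))
assert np.allclose(P_min, np.linalg.solve(Q, c))
```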
18. Gauss-Newton Method
Consider the special case where g(P) is the squared norm of an error function ε(P):
g(P) = ||ε(P)||² / 2 = ε(P)^T ε(P) / 2
ε(P) is the error function ε(P) = f(P) − X; it is a vector-valued function of the parameter P.
The gradient is gP = ∂g(P)/∂P = εP^T ε
where εP = ∂ε(P)/∂P = ∂f(P)/∂P = fP
We know that fP = J, therefore εP = J, hence we have gP = J^T ε
19. Gauss-Newton Method
Consider the second derivative gPP.
gP = εP^T ε, therefore gPP = εP^T εP + εPP^T ε
Since εP = fP, and assuming that f(P) is linear, εPP vanishes.
gPP = εP^T εP = J^T J
We thus have an approximation of the 2nd derivative gPP.
Now, using Newton's equation gPP ∆ = −gP, we get J^T J ∆ = −J^T ε
This is the Gauss-Newton method, in which we use the approximation gPP ≈ J^T J of the Hessian of the function g(P).
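A minimal sketch of a single Gauss-Newton step (names are illustrative; J and ε would come from the current linearization, e.g. the finite-difference helper sketched earlier):

```python
import numpy as np

def gauss_newton_step(J, eps):
    """Solve the Gauss-Newton normal equations J^T J delta = -J^T eps."""
    return np.linalg.solve(J.T @ J, -J.T @ eps)

# Check: for a linear model f(P) = A P, one step from P = 0 lands exactly
# on the least-squares solution of A P = X.
A = np.random.rand(7, 3)
X = np.random.rand(7)
eps = A @ np.zeros(3) - X          # residual eps(P) = f(P) - X at P = 0
delta = gauss_newton_step(A, eps)  # for a linear model the Jacobian is A itself
assert np.allclose(delta, np.linalg.lstsq(A, X, rcond=None)[0])
```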
20. Gradient Descent
The gradient of g(P) is given as gP = εP^T ε
The negative gradient vector −gP = −εP^T ε defines the direction of most rapid decrease of the cost function.
Gradient descent is a strategy for minimizing g in which we move iteratively in the negative gradient direction.
We take small steps in the direction of descent:
∆ = −gP / λ, where λ controls the length of the step
Recall that in Newton's method the step satisfies gPP ∆ = −gP; in gradient descent the Hessian gPP is effectively approximated by the scalar matrix λI.
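For comparison, a tiny gradient-descent sketch using the step rule ∆ = −gP/λ (the quadratic test function, the value of λ, and the names are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, P0, lam=10.0, n_iter=500):
    """Repeatedly step in the negative gradient direction: delta = -gP / lam."""
    P = P0.astype(float)
    for _ in range(n_iter):
        P = P - grad(P) / lam
    return P

# Same toy quadratic as before: slow but steady convergence to Q^{-1} c
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
P_min = gradient_descent(lambda P: Q @ P - c, P0=np.zeros(2))
assert np.allclose(P_min, np.linalg.solve(Q, c), atol=1e-4)
```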
21. Gradient Descent
Gradient descent by itself is not a very good minimization strategy,
typically characterized by slow convergence due to zig-zagging.
However, gradient descent can be quite useful in conjunction with Gauss-Newton iteration as a way of getting out of tight corners.
The Levenberg-Marquardt method is essentially a Gauss-Newton method that transitions smoothly to gradient descent when the Gauss-Newton updates fail.
22. Summary
g(P) is an arbitrary scalar-valued function, g(P) = ε(P)^T ε(P)/2.
Newton's Method: gPP ∆ = −gP, where gPP = εP^T εP + εPP^T ε and gP = εP^T ε. The cost function is approximated as quadratic near the minimum.
Gauss-Newton: εP^T εP ∆ = −εP^T ε. The Hessian is approximated as εP^T εP.
Gradient Descent: λ∆ = −εP^T ε = −gP. The Hessian is replaced by λI.
23. Levenberg-Marquardt (LM) iteration
This is a slight variation of the Gauss-Newton iteration method.
We have the augmented normal equations:
J^T J ∆ = −J^T ε  −→  (J^T J + λI) ∆ = −J^T ε
The value of λ varies from iteration to iteration.
A typical initial value of λ is 10^(−3) times the average of the diagonal elements of J^T J.
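The augmented normal equations in one line of numpy (a sketch; names are illustrative):

```python
import numpy as np

def lm_step(J, eps, lam):
    """Solve the augmented normal equations (J^T J + lam I) delta = -J^T eps."""
    return np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), -J.T @ eps)

# Typical initial value of lambda, as described above:
# lam0 = 1e-3 * np.mean(np.diag(J.T @ J))
```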
24. Levenberg-Marquardt (LM) iteration
If the value of ∆ obtained by solving the augmented normal equations leads to a reduction of the error, then the increment is accepted and λ is divided by a factor (typically 10) before the next iteration.
If the value of ∆ leads to an increased error, then λ is multiplied by the same factor and the augmented normal equations are solved again. This process continues until a value of ∆ is found that gives rise to a decreased error.
The process of repeatedly solving the augmented
normal equations for different values of λ until an
acceptable ∆ is found constitutes one iteration of
the LM algorithm.
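A compact Levenberg-Marquardt loop implementing this accept/reject rule, built on the lm_step and numerical_jacobian helpers sketched earlier; apart from the factor of 10 and the initial λ taken from the slides, the details are illustrative assumptions.

```python
import numpy as np

def levenberg_marquardt(f, X, P0, n_iter=50):
    """Accept a step and divide lambda by 10 if the error drops; otherwise multiply by 10 and retry."""
    P = P0.astype(float)
    eps = f(P) - X
    J = numerical_jacobian(f, P)
    lam = 1e-3 * np.mean(np.diag(J.T @ J))        # typical initial value
    for _ in range(n_iter):
        delta = lm_step(J, eps, lam)
        eps_new = f(P + delta) - X
        if np.linalg.norm(eps_new) < np.linalg.norm(eps):
            P, eps = P + delta, eps_new           # decreased error: accept the increment
            J = numerical_jacobian(f, P)
            lam /= 10.0
        else:
            lam *= 10.0                           # increased error: reject, re-solve with larger lambda
    return P
```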
25. Robust cost functions
[Plots: squared-error cost function (convex), its PDF, and its attenuation function]
26. Robust cost functions
[Plots: Blake-Zisserman cost (non-convex) and corrupted Gaussian cost (non-convex), each with its PDF and attenuation function]
27. Robust cost functions
[Plots: Cauchy cost (non-convex) and L1 cost (convex), each with its PDF and attenuation function]
28. Robust cost functions
[Plots: Huber cost (convex) and pseudo-Huber cost (convex), each with its PDF and attenuation function]
29. Squared Error cost function
C(δ) = δ²;  PDF = exp(−C(δ))
Its main drawback is that it is not robust to outliers in the
measurements.
Because of the rapid growth of the quadratic curve, distant outliers
exert an excessive influence, and can draw the cost minimum well
away from the desired value.
The squared-error cost function is generally very susceptible to
outliers, and may be regarded as unusable as long as outliers are
present.
If outliers have been thoroughly eradicated, using for instance
RANSAC, then it may be used.
30. Non-convex cost functions
The Blake-Zisserman, corrupted Gaussian and Cauchy cost
functions seek to mitigate the deleterious effect of outliers by
giving them diminished weight.
As is seen in the plot of the first two of these, once the error
exceeds a certain threshold, it is classified as an outlier, and the
cost remains substantially constant.
The Cauchy cost function also seeks to deemphasize the cost of
outliers, but this is done more gradually.
31. Asymptotically Linear cost functions
The L1 cost function measures the absolute value of the error.
The main effect of this is to give outliers less weight compared
with the squared error.
This cost function acts to find the median of a set of data.
Consider a set of real-valued data {ai} and a cost function defined by C(x) = Σi |x − ai|. The minimum of this function is at the median of the set {ai}.
For higher-dimensional data ai ∈ R^n, the minimum of the cost function C(x) = Σi ||x − ai|| has similar stability properties with regard to outliers.
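A quick numerical check of the median property, as a sketch (the data and grid search are illustrative):

```python
import numpy as np

a = np.random.randn(101)                          # 1-D data set {a_i}
C = lambda x: np.sum(np.abs(x - a))               # C(x) = sum_i |x - a_i|
grid = np.linspace(a.min(), a.max(), 10001)
x_star = grid[np.argmin([C(x) for x in grid])]    # brute-force minimizer
assert abs(x_star - np.median(a)) < 1e-2          # the minimizer sits at the median
```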
32. Huber Cost function
The Huber cost function takes the form of a quadratic for small
values of the error, δ, and becomes linear for values of δ beyond a
given threshold.
It retains the outlier stability of the L1 cost function, while for inliers
it reflects the property that the squared-error cost function gives
the Maximum Likelihood estimate.
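A small sketch of a Huber-style cost in numpy; the threshold value and names are illustrative assumptions, with the quadratic and linear pieces matched in value and slope at the threshold.

```python
import numpy as np

def huber_cost(delta, threshold=1.0):
    """Quadratic for |delta| <= threshold, linear beyond it (continuous value and slope)."""
    abs_d = np.abs(delta)
    return np.where(abs_d <= threshold,
                    0.5 * delta ** 2,
                    threshold * (abs_d - 0.5 * threshold))

# Small errors are penalised quadratically, outliers only linearly
print(huber_cost(np.array([0.1, 1.0, 10.0])))     # -> [0.005 0.5 9.5]
```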
33. Non-convex Cost functions
The non-convex cost functions, though generally having a stable minimum that is not much affected by outliers, have the significant disadvantage of possessing local minima, which can make convergence to the global minimum uncertain.
The estimate is not strongly attracted to the minimum from outside of its immediate neighbourhood.
Thus, they are not useful unless (or until) the estimate is already close to the final correct value.
34. Maximum Likelihood method
Maximum likelihood is the procedure of finding the value of one or more parameters, for a given statistic, that makes the known likelihood distribution a maximum.
The maximum likelihood estimate for a parameter µ is denoted µ̂.
f(x1, x2, . . . , xn | µ, σ) = Π_{i=1..n} (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²))
                             = ((2π)^(−n/2) / σ^n) exp(−Σi (xi − µ)²/(2σ²))
Taking the logarithm,
log f = −(n/2) log(2π) − n log σ − Σi (xi − µ)²/(2σ²)
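A quick numpy check, as a sketch with simulated data, that the closed-form Gaussian ML estimates derived on the next slide do maximize this log-likelihood:

```python
import numpy as np

x = np.random.normal(loc=2.0, scale=1.5, size=1000)
n = x.size

def log_f(mu, sigma):
    """log f = -(n/2) log(2 pi) - n log(sigma) - sum_i (x_i - mu)^2 / (2 sigma^2)"""
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

mu_hat = np.sum(x) / n                               # ML estimate of the mean
sigma_hat = np.sqrt(np.sum((x - mu_hat) ** 2) / n)   # ML estimate of sigma

# Nearby parameter values never score a higher log-likelihood
for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert log_f(mu_hat, sigma_hat) >= log_f(mu_hat + d_mu, sigma_hat + d_sigma)
```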
35. To maximize the log likelihood
∂(log f)/∂µ = Σi (xi − µ)/σ² = 0, giving µ̂ = Σi xi / n
Similarly
∂(log f)/∂σ = −n/σ + Σi (xi − µ)²/σ³ = 0, giving σ̂² = Σi (xi − µ̂)²/n
Minimizing the least-squares cost function gives a result which is equivalent to the maximum likelihood estimate, assuming a Gaussian distribution.
In general, the maximum likelihood estimate of the parameter vector θ is given as
θ̂_ML = arg max_θ p(x | θ)