1. Computer Vision: Least Squares Minimization
IIT Kharagpur
Computer Science and Engineering,
Indian Institute of Technology Kharagpur
2. Solution of Linear equations
Consider a system of equations of the form Ax = b. Let A be an m × n
matrix.
If m < n there are more unknowns than equations. In this case
there will not be a unique solution, but rather a vector space of
solutions.
If m = n there will be a unique solution as long as A is invertible.
If m > n there will be more equations than unknowns. In general
the system will not have a solution.
3. Least-squares solution: Full rank case
Consider the case m > n and assume that A is of rank n. We seek a
vector x that is closest to providing a solution to the system Ax = b.
We seek x such that ||Ax − b|| is minimized. Such an x is known as
the least squares solution to the over-determined system.
Taking the SVD A = U D V^T, we seek x that minimizes ||Ax − b|| = ||U D V^T x − b||.
Because orthogonal transforms preserve the norm,
||U D V^T x − b|| = ||D V^T x − U^T b||
Writing y = V^T x and b' = U^T b, the problem becomes one of minimizing ||Dy − b'||, where D is a diagonal matrix.
4. Least-squares solution (continued)
The matrix D has diagonal entries d1, . . . , dn followed by m − n rows of zeros, so
Dy = (d1 y1, d2 y2, . . . , dn yn, 0, . . . , 0)^T and b' = (b'1, b'2, . . . , b'n, b'n+1, . . . , b'm)^T
The nearest Dy can approach to b' is the vector (b'1, b'2, . . . , b'n, 0, . . . , 0)^T
This is achieved by setting yi = b'i / di for i = 1, . . . , n
The assumption rank A = n ensures that di ≠ 0
Finally x is retrieved from x = Vy.
5. Algorithm: Least Squares
Objective:
Find the least-squares solution to the m × n set of equations Ax = b,
where m > n and rank A = n.
Algorithm:
(i) Find the SVD A = U D V^T
(ii) Set b' = U^T b
(iii) Find the vector y defined by yi = b'i / di, where di is the i-th diagonal entry of D
(iv) The solution is x = Vy
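A minimal numpy sketch of this algorithm, assuming a full-rank A with m > n (the function name and the use of numpy.linalg.svd are illustrative, not part of the slides):

```python
import numpy as np

def svd_least_squares(A, b):
    """Least-squares solution of Ax = b (m > n, rank A = n) via the SVD."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(d) V^T
    b_prime = U.T @ b          # (ii) b' = U^T b
    y = b_prime / d            # (iii) y_i = b'_i / d_i (requires d_i != 0)
    return Vt.T @ y            # (iv) x = V y

# Over-determined example; agrees with numpy's built-in solver
A = np.random.rand(6, 3)
b = np.random.rand(6)
x = svd_least_squares(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```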
6. Pseudo Inverse
Given a square diagonal matrix D, we define its pseudo-inverse to be the diagonal matrix D+ such that
(D+)ii = 0 if Dii = 0, and (D+)ii = 1/Dii otherwise.
For an m × n matrix A with m ≥ n, let the SVD be A = U D V^T. The pseudo-inverse of matrix A is
A+ = V D+ U^T
The least-squares solution to an m × n system of equations Ax = b of
rank n is given by x = A+ b. In the case of a deficient-rank system,
x = A+ b is the solution that minimizes ||x||.
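A small numpy illustration of the SVD-based pseudo-inverse (a sketch; the tolerance and names are my own choices, and numpy's np.linalg.pinv computes the same quantity):

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Pseudo-inverse A+ = V D+ U^T, inverting only the non-zero singular values."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])
    return Vt.T @ np.diag(d_plus) @ U.T

A = np.random.rand(5, 3)
assert np.allclose(pinv_svd(A), np.linalg.pinv(A))
# Least-squares solution of Ax = b as x = A+ b
b = np.random.rand(5)
x = pinv_svd(A) @ b
```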
7. Linear least-squares using normal equations
Consider a system of equations of the form Ax = b. Let A be an m × n matrix with m > n.
In general, no solution x will exist for this set of equations.
Consequently, the task is to find the vector x that minimizes the
norm ||Ax − b||.
As the vector x varies over all values, the product Ax varies over the complete column space of A, i.e. the subspace of R^m spanned by the columns of A.
The task is to find the closest vector to b that lies in the column
space of A.
8. Linear least-squares using normal equations
Let x be the solution to this problem.
Thus Ax is the closest point to b. In this case, the difference
Ax − b must be orthogonal to the column space of A.
This means that Ax − b is perpendicular to each of the columns of
A, hence
A^T (Ax − b) = 0, i.e. (A^T A) x = A^T b
The solution is given as:
x = (A^T A)^(−1) A^T b
i.e. x = A+ b with A+ = (A^T A)^(−1) A^T
The pseudo-inverse of matrix A computed via the SVD is given as
A+ = V D+ U^T
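A short numpy sketch contrasting the normal-equation solution with the pseudo-inverse route (names are illustrative; note that forming A^T A squares the condition number, so the SVD route is usually preferred numerically):

```python
import numpy as np

A = np.random.rand(8, 3)
b = np.random.rand(8)

# Normal equations: (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Pseudo-inverse route: x = A+ b
x_pinv = np.linalg.pinv(A) @ b

assert np.allclose(x_normal, x_pinv)
```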
9. Least-squares solution of homogeneous equations
Solving a set of equations of the form Ax = 0.
x has a homogeneous representation; hence if x is a solution, then kx is also a solution for any non-zero scalar k.
A reasonable constraint would be to seek a solution for which
||x|| = 1
In general, such a set of equations will not have an exact solution.
The problem is to find x that minimizes ||Ax|| subject to ||x|| = 1
10. Least-squares solution of homogeneous equations
Let A = U D V^T
We need to minimize ||U D V^T x||.
Note that ||U D V^T x|| = ||D V^T x||, so we need to minimize ||D V^T x||
Note that ||x|| = ||V^T x||, so we have the condition that ||V^T x|| = 1
Let y = V^T x, so we minimize ||Dy|| subject to ||y|| = 1
Since D is a diagonal matrix with its diagonal entries in descending order, it follows that the solution to this problem is y = (0, 0, . . . , 0, 1)^T.
Since y = V^T x, the solution x = Vy is simply the last column of V.
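A numpy sketch of this recipe (illustrative names): the unit-norm minimizer of ||Ax|| is the last column of V, i.e. the last row of the V^T factor returned by np.linalg.svd.

```python
import numpy as np

def homogeneous_least_squares(A):
    """Unit-norm x minimizing ||Ax||: right singular vector of the smallest singular value."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                      # last column of V = last row of V^T

A = np.random.rand(10, 4)
x = homogeneous_least_squares(A)
assert np.isclose(np.linalg.norm(x), 1.0)
# Spot-check: no random unit vector does better
for _ in range(100):
    v = np.random.randn(4)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A @ v) + 1e-9
```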
11. Iterative estimation techniques
X = f(P)
X is a measurement vector in R^N
P is a parameter vector in R^M.
We wish to seek the vector P satisfying X = f(P) − ε for which ||ε|| is minimized.
The linear least-squares problem is exactly of this type, with the function f being defined as a linear function f(P) = AP.
12. Iterative estimation methods
If the function f is not a linear function, we use iterative estimation techniques.
13. Iterative estimation methods
We start with an initial estimated value P0, and proceed to refine the estimate under the assumption that the function f is locally linear.
Let ε0 = f(P0) − X
We assume that the function is approximated at P0 by
f(P0 + ∆) = f(P0) + J∆
J is the linear mapping represented by the Jacobian matrix
J = ∂f/∂P
14. Iterative estimation methods
We seek a point f(P1), with P1 = P0 + ∆, which minimizes
f(P1) − X = f(P0) + J∆ − X = ε0 + J∆
Thus it is required to minimize ||ε0 + J∆|| over ∆, which is a linear minimization problem.
The vector ∆ is obtained by solving the normal equations
J^T J ∆ = −J^T ε0, i.e. ∆ = −J+ ε0
15. Iterative estimation methods
The solution vector P is obtained by starting with an estimate P0 and computing successive approximations according to the formula
Pi+1 = Pi + ∆i
where ∆i is the solution to the linear least-squares problem
J ∆i = −εi
Matrix J is the Jacobian ∂f/∂P evaluated at Pi, and εi = f(Pi) − X.
Ideally the algorithm converges to a least-squares solution P.
However, convergence may be to a local minimum only, or there may be no convergence at all.
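A compact numpy sketch of this iterative scheme using a finite-difference Jacobian; the exponential model, the data, and all names below are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def numerical_jacobian(f, P, h=1e-6):
    """Finite-difference approximation of J = df/dP at P."""
    f0 = f(P)
    J = np.zeros((f0.size, P.size))
    for j in range(P.size):
        dP = np.zeros_like(P)
        dP[j] = h
        J[:, j] = (f(P + dP) - f0) / h
    return J

def iterative_least_squares(f, X, P0, n_iter=20):
    """P_{i+1} = P_i + Delta_i, where J Delta_i = -eps_i is solved in the least-squares sense."""
    P = P0.astype(float)
    for _ in range(n_iter):
        eps = f(P) - X
        J = numerical_jacobian(f, P)
        delta = np.linalg.lstsq(J, -eps, rcond=None)[0]
        P = P + delta
    return P

# Toy example: fit a * exp(-k t) to noisy samples
t = np.linspace(0, 5, 30)
f = lambda P: P[0] * np.exp(-P[1] * t)
X = f(np.array([2.0, 0.7])) + 0.01 * np.random.randn(t.size)
P_hat = iterative_least_squares(f, X, P0=np.array([1.0, 1.0]))
```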
16. Newton’s method
We consider finding minima of functions of many variables.
Consider an arbitrary scalar-valued function g(P) where P is a
vector.
The optimization problem is simply to minimize g(P) over all
values of P.
Expand g(P) about P0 in a Taylor series to get
g(P0 + ∆) = g + gP ∆ + ∆^T gPP ∆ / 2 + . . .
where gP denotes the derivative of g(P) with respect to P, and gPP the derivative of gP with respect to P.
17. Newton’s method
Expand g(P) about P0 in a Taylor series to get
g(P0 + ∆) = g + gP ∆ + ∆^T gPP ∆ / 2 + . . .
Differentiating with respect to ∆ and setting the result to zero gives
gP + gPP ∆ = 0, i.e. gPP ∆ = −gP
Hessian matrix: gPP is the matrix of second derivatives, the Hessian of g. Its (i, j)-th entry is ∂²g/∂pi∂pj, where pi and pj are the i-th and j-th parameters. The vector gP is the gradient of g.
The method of Newton iteration consists in starting with an initial
value of the parameters, P0 and iteratively computing parameter
increments ∆ until convergence occurs.
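A minimal Newton-iteration sketch for a scalar cost g(P), with the gradient gP and Hessian gPP supplied analytically; the quadratic test function and the names are illustrative assumptions.

```python
import numpy as np

def newton_minimize(grad, hess, P0, n_iter=10):
    """Newton iteration: repeatedly solve gPP * delta = -gP and update P."""
    P = P0.astype(float)
    for _ in range(n_iter):
        delta = np.linalg.solve(hess(P), -grad(P))
        P = P + delta
    return P

# Toy quadratic: g(P) = 0.5 P^T Q P - c^T P, minimized at P = Q^{-1} c
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
P_min = newton_minimize(lambda P: Q @ P - c, lambda P: Q, P0=np.zeros(2))
assert np.allclose(P_min, np.linalg.solve(Q, c))
```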
18. Gauss-Newton Method
Consider the special case where g(P) is the squared norm of an error function ε(P):
g(P) = ||ε(P)||² / 2 = ε(P)^T ε(P) / 2
ε(P) is the error function ε(P) = f(P) − X; it is a vector-valued function of the parameter P.
The gradient is gP = ∂g(P)/∂P = εP^T ε
where εP = ∂ε(P)/∂P = ∂f(P)/∂P = fP
We know that fP = J, therefore εP = J, hence we have gP = J^T ε
19. Gauss-Newton Method
Consider the second derivative gPP.
gP = εP^T ε, therefore gPP = εP^T εP + εPP^T ε
Since εP = fP, and assuming that f(P) is linear, εPP vanishes.
gPP = εP^T εP = J^T J
We thus have an approximation of the 2nd derivative gPP.
Now, using Newton's equation gPP ∆ = −gP, we get J^T J ∆ = −J^T ε
This is the Gauss-Newton method, in which we use the approximation gPP ≈ J^T J of the Hessian of the function g(P).
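A minimal sketch of a single Gauss-Newton step (names are illustrative; J and ε would come from the current linearization, e.g. the finite-difference helper sketched earlier):

```python
import numpy as np

def gauss_newton_step(J, eps):
    """Solve the Gauss-Newton normal equations J^T J delta = -J^T eps."""
    return np.linalg.solve(J.T @ J, -J.T @ eps)

# Check: for a linear model f(P) = A P, one step from P = 0 lands exactly
# on the least-squares solution of A P = X.
A = np.random.rand(7, 3)
X = np.random.rand(7)
eps = A @ np.zeros(3) - X          # residual eps(P) = f(P) - X at P = 0
delta = gauss_newton_step(A, eps)  # for a linear model the Jacobian is A itself
assert np.allclose(delta, np.linalg.lstsq(A, X, rcond=None)[0])
```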
20. Gradient Descent
The gradient of g(P) is given as gP = εP^T ε
The negative gradient vector −gP = −εP^T ε defines the direction of most rapid decrease of the cost function.
Gradient descent is a strategy for minimizing g in which we move iteratively in the negative gradient direction.
We take small steps in the direction of descent:
∆ = −gP / λ, where λ controls the length of the step
Recall that in Newton's method the step satisfies gPP ∆ = −gP; in gradient descent the Hessian gPP is effectively approximated by the scalar matrix λI.
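For comparison, a tiny gradient-descent sketch using the step rule ∆ = −gP/λ (the quadratic test function, the value of λ, and the names are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, P0, lam=10.0, n_iter=500):
    """Repeatedly step in the negative gradient direction: delta = -gP / lam."""
    P = P0.astype(float)
    for _ in range(n_iter):
        P = P - grad(P) / lam
    return P

# Same toy quadratic as before: slow but steady convergence to Q^{-1} c
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
P_min = gradient_descent(lambda P: Q @ P - c, P0=np.zeros(2))
assert np.allclose(P_min, np.linalg.solve(Q, c), atol=1e-4)
```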
21. Gradient Descent
Gradient descent by itself is not a very good minimization strategy,
typically characterized by slow convergence due to zig-zagging.
However, gradient descent can be quite useful in conjunction with Gauss-Newton iteration as a way of getting out of tight corners.
The Levenberg-Marquardt method is essentially a Gauss-Newton method that transitions smoothly to gradient descent when the Gauss-Newton updates fail.
22. Summary
g(P) is an arbitrary scalar-valued function, g(P) = ε(P)^T ε(P)/2.
Newton's Method: gPP ∆ = −gP, where gPP = εP^T εP + εPP^T ε and gP = εP^T ε. The cost function is approximated as quadratic near the minimum.
Gauss-Newton: εP^T εP ∆ = −εP^T ε. The Hessian is approximated as εP^T εP.
Gradient Descent: λ∆ = −εP^T ε = −gP. The Hessian is replaced by λI.
23. Levenberg-Marquardt (LM) iteration
This is a slight variation of the Gauss-Newton iteration method.
We have the augmented normal equations:
J^T J ∆ = −J^T ε  −→  (J^T J + λI) ∆ = −J^T ε
The value of λ varies from iteration to iteration.
A typical initial value of λ is 10^(−3) times the average of the diagonal elements of J^T J.
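The augmented normal equations in one line of numpy (a sketch; names are illustrative):

```python
import numpy as np

def lm_step(J, eps, lam):
    """Solve the augmented normal equations (J^T J + lam I) delta = -J^T eps."""
    return np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), -J.T @ eps)

# Typical initial value of lambda, as described above:
# lam0 = 1e-3 * np.mean(np.diag(J.T @ J))
```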
24. Levenberg-Marquardt (LM) iteration
If the value of ∆ obtained by solving the augmented normal equations leads to a reduction of the error, then the increment is accepted and λ is divided by a factor (typically 10) before the next iteration.
If the value of ∆ leads to an increased error, then λ is multiplied by the same factor and the augmented normal equations are solved again. This process continues until a value of ∆ is found that gives rise to a decreased error.
The process of repeatedly solving the augmented
normal equations for different values of λ until an
acceptable ∆ is found constitutes one iteration of
the LM algorithm.
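A compact Levenberg-Marquardt loop implementing this accept/reject rule, built on the lm_step and numerical_jacobian helpers sketched earlier; apart from the factor of 10 and the initial λ taken from the slides, the details are illustrative assumptions.

```python
import numpy as np

def levenberg_marquardt(f, X, P0, n_iter=50):
    """Accept a step and divide lambda by 10 if the error drops; otherwise multiply by 10 and retry."""
    P = P0.astype(float)
    eps = f(P) - X
    J = numerical_jacobian(f, P)
    lam = 1e-3 * np.mean(np.diag(J.T @ J))        # typical initial value
    for _ in range(n_iter):
        delta = lm_step(J, eps, lam)
        eps_new = f(P + delta) - X
        if np.linalg.norm(eps_new) < np.linalg.norm(eps):
            P, eps = P + delta, eps_new           # decreased error: accept the increment
            J = numerical_jacobian(f, P)
            lam /= 10.0
        else:
            lam *= 10.0                           # increased error: reject, re-solve with larger lambda
    return P
```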
25. Robust cost functions
[Plots: squared-error cost function (convex), its PDF, and its attenuation function]
26. Robust cost functions
[Plots: Blake-Zisserman cost (non-convex) and corrupted Gaussian cost (non-convex), each with its PDF and attenuation function]
27. Robust cost functions
[Plots: Cauchy cost (non-convex) and L1 cost (convex), each with its PDF and attenuation function]
28. Robust cost functions
[Plots: Huber cost (convex) and pseudo-Huber cost (convex), each with its PDF and attenuation function]
29. Squared Error cost function
C(δ) = δ²;  PDF = exp(−C(δ))
Its main drawback is that it is not robust to outliers in the
measurements.
Because of the rapid growth of the quadratic curve, distant outliers
exert an excessive influence, and can draw the cost minimum well
away from the desired value.
The squared-error cost function is generally very susceptible to
outliers, and may be regarded as unusable as long as outliers are
present.
If outliers have been thoroughly eradicated, using for instance
RANSAC, then it may be used.
30. Non-convex cost functions
The Blake-Zisserman, corrupted Gaussian and Cauchy cost
functions seek to mitigate the deleterious effect of outliers by
giving them diminished weight.
As is seen in the plot of the first two of these, once the error
exceeds a certain threshold, it is classified as an outlier, and the
cost remains substantially constant.
The Cauchy cost function also seeks to deemphasize the cost of
outliers, but this is done more gradually.
31. Asymptotically Linear cost functions
The L1 cost function measures the absolute value of the error.
The main effect of this is to give outliers less weight compared
with the squared error.
This cost function acts to find the median of a set of data.
Consider a set of real-valued data {ai} and a cost function defined by C(x) = Σi |x − ai|. The minimum of this function is at the median of the set {ai}.
For higher-dimensional data ai ∈ R^n, the minimum of the cost function C(x) = Σi ||x − ai|| has similar stability properties with regard to outliers.
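A quick numerical check of the median property, as a sketch (the data and grid search are illustrative):

```python
import numpy as np

a = np.random.randn(101)                          # 1-D data set {a_i}
C = lambda x: np.sum(np.abs(x - a))               # C(x) = sum_i |x - a_i|
grid = np.linspace(a.min(), a.max(), 10001)
x_star = grid[np.argmin([C(x) for x in grid])]    # brute-force minimizer
assert abs(x_star - np.median(a)) < 1e-2          # the minimizer sits at the median
```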
32. Huber Cost function
The Huber cost function takes the form of a quadratic for small
values of the error, δ, and becomes linear for values of δ beyond a
given threshold.
It retains the outlier stability of the L1 cost function, while for inliers
it reflects the property that the squared-error cost function gives
the Maximum Likelihood estimate.
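A small sketch of a Huber-style cost in numpy; the threshold value and names are illustrative assumptions, with the quadratic and linear pieces matched in value and slope at the threshold.

```python
import numpy as np

def huber_cost(delta, threshold=1.0):
    """Quadratic for |delta| <= threshold, linear beyond it (continuous value and slope)."""
    abs_d = np.abs(delta)
    return np.where(abs_d <= threshold,
                    0.5 * delta ** 2,
                    threshold * (abs_d - 0.5 * threshold))

# Small errors are penalised quadratically, outliers only linearly
print(huber_cost(np.array([0.1, 1.0, 10.0])))     # -> [0.005 0.5 9.5]
```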
33. Non-convex Cost functions
The non-convex cost functions, though generally having a stable minimum that is not much affected by outliers, have the significant disadvantage of possessing local minima, which can make convergence to the global minimum uncertain.
The estimate is not strongly attracted to the minimum from outside of its immediate neighbourhood.
Thus, they are not useful unless (or until) the estimate is already close to the final correct value.
34. Maximum Likelihood method
Maximum likelihood is the procedure of finding the value of one or more parameters, for a given statistic, that makes the known likelihood distribution a maximum.
The maximum likelihood estimate for a parameter µ is denoted µ̂.
f(x1, x2, . . . , xn | µ, σ) = Π_{i=1..n} (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²))
                             = ((2π)^(−n/2) / σ^n) exp(−Σi (xi − µ)²/(2σ²))
Taking the logarithm,
log f = −(n/2) log(2π) − n log σ − Σi (xi − µ)²/(2σ²)
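A quick numpy check, as a sketch with simulated data, that the closed-form Gaussian ML estimates derived on the next slide do maximize this log-likelihood:

```python
import numpy as np

x = np.random.normal(loc=2.0, scale=1.5, size=1000)
n = x.size

def log_f(mu, sigma):
    """log f = -(n/2) log(2 pi) - n log(sigma) - sum_i (x_i - mu)^2 / (2 sigma^2)"""
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

mu_hat = np.sum(x) / n                               # ML estimate of the mean
sigma_hat = np.sqrt(np.sum((x - mu_hat) ** 2) / n)   # ML estimate of sigma

# Nearby parameter values never score a higher log-likelihood
for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert log_f(mu_hat, sigma_hat) >= log_f(mu_hat + d_mu, sigma_hat + d_sigma)
```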
35. To maximize the log likelihood
∂(log f)/∂µ = Σi (xi − µ)/σ² = 0, giving µ̂ = Σi xi / n
Similarly
∂(log f)/∂σ = −n/σ + Σi (xi − µ)²/σ³ = 0, giving σ̂² = Σi (xi − µ̂)²/n
Minimizing the least-squares cost function gives a result which is equivalent to the maximum likelihood estimate, assuming a Gaussian distribution.
In general, the maximum likelihood estimate of the parameter vector θ is given as
θ̂_ML = arg max_θ p(x | θ)