03 conditional random field

Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields

Part 4: Conditional Random Fields

Sebastian Nowozin and Christoph H. Lampert

Colorado Springs, 25th June 2011

1 / 39


Problem (Probabilistic Learning)
Let d(y|x) be the (unknown) true conditional distribution.
Let D = {(x1 , y 1 ), . . . , (xN , y N )} be i.i.d. samples from d(x, y).
Find a distribution p(y|x) that we can use as a proxy for d(y|x).
or
Given a parametrized family of distributions, p(y|x, w), ﬁnd the
parameter w∗ making p(y|x, w) closest to d(y|x).

Open questions:
What do we mean by closest?
What’s a good candidate for p(y|x, w)?
How to actually ﬁnd w∗ ?
conceptually, and
numerically
2 / 39


Principle of Parsimony (Parsimoney, aka Occam’s razor)
“Pluralitas non est ponenda sine neccesitate.”
William of Ockham

“We are to admit no more causes of natural things than such as are both
true and suﬃcient to explain their appearances.”
Isaac Newton

“Make everything as simple as possible, but not simpler.”
Albert Einstein

“Use the simplest explanation that covers all the facts.”
what we’ll use

3 / 39


1) Define what aspects we consider relevant facts about the data.
2) Pick the simplest distribution reflecting that.

Definition (Simplicity ≡ Entropy)
The simplicity of a distribution p is given by its entropy:

H(p) = − p(z) log p(z)
z∈Z

Definition (Relevant Facts ≡ Feature Functions)
By φi : Z → R for i = 1, . . . , D we denote a set of feature functions that
express everything we want to be able to model about our data.

For example: the grayvalue of a pixel,
a bag-of-words histogram of an image,
the time of day an image was taken,
a flag if a pixel is darker than half of its neighbors.
4 / 39


Principle (Maximum Entropy Principle)
Let z 1 , . . . , z N be samples from a distribution d(z). Let φ1 , . . . , φD be
1
feature functions, and denote by µi := N n φi (z n ) their means over the
sample set.
The maximum entropy distribution, p, is the solution to

max H(p) subject to Ez∼p(z) {φi (z)} = µi .
p is a prob.distr.
be faithful to what we know
be as simple as possible

Theorem (Exponential Family Distribution)
Under some very reasonable conditions, the maximum entropy distribution
has the form
1
p(z) = exp wi φi (z)
Z i

for some parameter vector w = (w1 , . . . , wD ) and constant Z.

5 / 39


Example:
Let Z = R, φ1 (z) = z, φ2 (z) = z 2 .
The exponential family distribution is
1
p(z) = exp( w1 z + w2 z 2 )
Z(w)
b2 a 2 w1
= exp( a z − b ) for a = w2 , b = − .
Z(a, b) w2

It’s a Gaussian!

Given examples z 1 , . . . , z N , we can compute a and b, and derive w.

6 / 39


Example:
Let Z = {1, . . . , K}, φk (z) = z = k , for k = 1, . . . , K.
1
p(z) = exp( wk φk (z) )
Z(w)
k

exp(w1 )/Z for z = 1,



exp(w )/Z for z = 2,
2
=
. . .



exp(w )/Z for z = K.
K

with Z = exp(w1 ) + · · · + exp(wK ).

It’s a Multinomial!

7 / 39


Example:
Let Z = {0, 1}N ×M image grid,
φi (y) := yi for each pixel i,
φN M (y) = i∼j yi = yj (summing over all 4-neighbor pairs)
1
p(z) = exp( w, φ(y) + w
˜ yi = yj )
Z(w)
i,j

It’s a (binary) Markov Random Field!

8 / 39


Conditional Random Field Learning

Assume:
a set of i.i.d. samples D = {(xn , y n )}n=1,...,N , (xn , y n ) ∼ d(x, y)
feature functions ( φ1 (x, y), . . . , φD (x, y) ) ≡: φ(x, y)
1
parametrized family p(y|x, w) = Z(x,w) exp( w, φ(x, y) )

Task:
adjust w of p(y|x, w) based on D.

Many possible technique to do so:
Expectation Matching
Maximum Likelihood
Best Approximation
MAP estimation of w

Punchline: they all turn out to be (almost) the same!
9 / 39


Maximum Likelihood Parameter Estimation Idea: maximize conditional
likelihood of observing outputs y 1 , . . . , y N for inputs x1 , . . . , xN

w∗ = argmax p(y 1 , . . . , y N |x1 , . . . , xN , w)
w∈RD
N
i.i.d.
= argmax p(y n |xn , w)
w∈RD n=1
N
− log(·)
= argmin − log p(y n |xn , w)
w∈RD n=1

negative conditional log-likelihood (of D)

10 / 39


Best Approximation
Idea: find p(y|x, w) that is closest to d(y|x)

Definition (Similarity between conditional distributions)
For fixed x ∈ X : KL-divergence measure similarity

d(y|x)
KLcond (p||d)(x) := d(y|x) log
p(y|x, w)
y∈Y

For x ∼ d(x), compute expectation:

KLtot (p||d) : = Ex∼d(x) KLcond (p||d)(x)
d(y|x)
= d(x, y) log
p(y|x, w)
x∈X y∈Y

11 / 39


Best Approximation
Idea: ﬁnd p(y|x, w) of minimal KLtot -distance to d(y|x)

d(y|x)
w∗ = argmin d(x, y) log
w∈RD x∈X y∈Y p(y|x, w)
drop const.
= argmin − d(x, y) log p(y|x, w)
w∈RD (x,y)∈X ×Y
N
(xn ,y n )∼d(x,y)
≈ argmin − log p(y n |xn , w)
w∈RD n=1

negative conditional log-likelihood (of D)

12 / 39


N
w∗ = argmin − log p(w) − log p(y n |xn , w)
w∈RD n=1

Choices for p(w):
p(w) :≡ const. (uniform; in RD not really a distribution)
N
w∗ = argmin − log p(y n |xn , w) + const.
w∈RD n=1
negative conditional log-likelihood

1 2
p(w) := const. · e− 2σ2 w
(Gaussian)
N
1
w∗ = argmin − w 2
+ log p(y n |xn , w) + const.
w∈RD 2σ 2
n=1
regularized negative conditional log-likelihood

14 / 39


Probabilistic Models for Structured Prediction - Summary
Negative (Regularized) Conditional Log-Likelihood (of D)

N
1 2 w,φ(xn ,y)
L(w) = 2 w − w, φ(xn , y n ) − log e
2σ
n=1 y∈Y

(σ 2 → ∞ makes it unregularized)

Probabilistic parameter estimation or training means solving

w∗ = argmin L(w).
w∈RD

Same optimization problem as for multi-class logistic regression.

15 / 39


Negative Conditional Log-Likelihood (Toy Example)
negative log likelihood σ2 =0.01 negative log likelihood σ2 =0.10
3 3

128.0
2 2

00
.000

512.0
512

00

0
0

.00
.00

128
1 1
256

00
64.0
128.000

32.000
.0 00
16 0 0
0 0 2.0

32.000

8.000
64.000

1024.000

4.0
00
1 1

00
16.0
2 2
3 2 1 0 1 2 3 4 5 3 2 1 0 1 2 3 4 5
negative log likelihood σ2 =1.00 negative log likelihood σ2 → ∞
3 2.5
0
.00

2.0
128

2
1.5
00

00
.0

00
.0

1.0
32

16

64.0
1
0 0
4.00
0

0.5
8.0

00

31 2 25 50
32.0

0.0.00.1 0.2
0000 00

0.0 0000 00
0001 02
00 00

04 8 16
0.5 .0 2.0

0.0.0 0.0
00

0.0.0 0.0
4.0 8.0

0.0.0 0.0
0.0

16.0

0
6

00
00

1
0

0
0
0
0
64.0

00

0.5
1.000
0.5

1 1.0

00
1.5
2.0
2 2.0
3 2 1 0 1 2 3 4 5 3 2 1 0 1 2 3 4 5

16 / 39


Steepest Descent Minimization – minimize L(w)
input tolerance > 0
1: wcur ← 0
2: repeat
3: v ← w L(wcur )
4: η ← argminη∈R L(wcur − ηv)
5: wcur ← wcur − ηv
6: until v <
output wcur

Alternatives:
L-BFGS (second-order descent without explicit Hessian)
Conjugate Gradient
We always need (at least) the gradient of L.

17 / 39


N
1 2 w,φ(xn ,y)
L(w) = w − w, φ(xn , y n ) + log e
2σ 2
n=1 y∈Y

N
e w,φ(xn ,y) φ(xn , y)
1 n n y∈Y
w L(w) = 2 w − φ(x , y ) − w,φ(xn ,¯)
y
σ y ∈Y
¯ e
n=1
N
1
= w− φ(xn , y n ) − p(y|xn , w)φ(xn , y)
σ2
n=1 y∈Y
N
1
= w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y)
σ2
n=1

N
1
∆L(w) = IdD×D + [Ey∼p(y|xn ,w) φ(xn , y)][Ey∼p(y|xn ,w) φ(xn , y)]
σ2
n=1
18 / 39


N
1 2 w,φ(xn ,y)
L(w) = w − w, φ(xn , y n ) + log e
2σ 2
n=1 y∈Y

C ∞ -diﬀerentiable on all RD .

19 / 39


N
1
w L(w) = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y)
σ2
n=1

For σ → ∞:

Ey∼p(y|xn ,w) φ(xn , y) = φ(xn , y n ) ⇒ w L(w) = 0,

criticial point of L (local minimum/maximum/saddle point).

Interpretation:
We aim for expectation matching: Ey∼p φ(x, y) = φ(x, y obs )
but discriminatively: only for x ∈ {x1 , . . . , xn }.

20 / 39


N
1
∆L(w) = IdD×D + [Ey∼p(y|xn ,w) φ(xn , y)][Ey∼p(y|xn ,w) φ(xn , y)]
σ2
n=1

positive deﬁnite Hessian matrix → L(w) is convex
→ w L(w) = 0 implies global minimum.

21 / 39


Milestone I: Probabilistic Training (Conditional Random Fields)
p(y|x, w) log-linear in w ∈ RD .
Training: many probabilistic derivations lead to same optimization
problem → minimize negative conditional log-likelihood, L(w)
L(w) is differentiable and convex,
→ gradient descent will find global optimum with w L(w) =0
Same structure as multi-class logistic regression.

For logistic regression: this is where the textbook ends. we’re done.

For conditional random fields: we’re not in safe waters, yet!

22 / 39


Task: Compute v = w L(wcur ), evaluate L(wcur + ηv):

N
1 2 w,φ(xn ,y)
L(w) = w − w, φ(xn , y n ) + log e
2σ 2
n=1 y∈Y
N
1
w L(w) = w− φ(xn , y n ) − p(y|xn , w)φ(xn , y)
σ2
n=1 y∈Y

Problem: Y typically is very (exponentially) large:
binary image segmentation: |Y| = 2640×480 ≈ 1092475
ranking N images: |Y| = N !, e.g. N = 1000: |Y| ≈ 102568 .

We must use the structure in Y, or we’re lost.

23 / 39


Solving the Training Optimization Problem Numerically
N
1
σ2
n=1

Computing the Gradient (naive): O(K M N D)

N
1 2
L(w) = 2 w − w, φ(xn , y n ) + log Z(xn , w)
2σ
n=1

Line Search (naive): O(K M N D) per evaluation of L

N : number of samples
D: dimension of feature space
M : number of output nodes ≈ 100s to 1,000,000s
K: number of possible labels of each output nodes ≈ 2 to 100s
24 / 39


Probabilistic Inference to the Rescue Remember: in a graphical model with
factors F, the features decompose:

φ(x, y) = φF (x, yF )
F ∈F

Ey∼p(y|x,w) φ(x, y) = Ey∼p(y|x,w) φF (x, yF )
F ∈F
= EyF ∼p(yF |x,w) φF (x, yF ) F ∈F

EyF ∼p(yF |x,w) φF (x, yF ) = p(yF |x, w) φF (x, yF )
yF ∈YF
factor marginals
K |F | terms

Factor marginals µF = p(yF |x, w)
are much smaller than complete joint distribution p(y|x, w),
can be computed/approximated, e.g., with (loopy) belief propagation.
25 / 39



N
1
σ2
n=1

Computing the Gradient: X O(M K |Fmax | N D):
O(K MX X
XXX
nd),

N
1 2 w,φ(xn ,y)
L(w) = w − w, φ(xn , y n ) + log e
2σ 2
n=1 y∈Y

Line Search: XX O(M K |Fmax | N D) per evaluation of L
O(K MX X
X X
nd),

N : number of samples ≈ 10s to 1,000,000s
D: dimension of feature space
M : number of output nodes
K: number of possible labels of each output nodes
26 / 39


What, if the training set D is too large (e.g. millions of examples)?
Simplify Model

Train simpler model (smaller factors)
Less expressive power ⇒ results might get worse rather than better

Subsampling

Create random subset D ⊂ D. Train model using D
Ignores all information in D D

Parallelize
Train multiple models in parallel. Merge the models.
Follows the multi-core/GPU trend
How to actually merge? (or ?)
Doesn’t really save computation.
27 / 39


What, if the training set D is too large (e.g. millions of examples)?

Stochastic Gradient Descent (SGD)

Keep maximizing p(w|D).
In each gradient descent step:
Create random subset D ⊂ D, ← often just 1–3 elements!
Follow approximate gradient

˜w L(w) = 1 w − φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y)
σ2
(xn ,y n )∈D

Line search no longer possible. Extra parameter: stepsize η
SGD converges to argminw L(w)! (if η chosen right)
SGD needs more iterations, but each one is much faster

more: see L. Bottou, O. Bousquet: ”The Tradeoﬀs of Large Scale Learning”, NIPS 2008.
also: http://leon.bottou.org/research/largescale
28 / 39



N
1
σ2
n=1

Computing the Gradient: X O(M K 2 N D) (if BP is possible):
O(K MX X
XXX
nd),

N
1 2 w,φ(xn ,y)
L(w) = w − w, φ(xn , y n ) + log e
2σ 2
n=1 y∈Y

Line Search: XX O(M K 2 N D) per evaluation of L
O(K MX X
X X
nd),

N : number of samples
D: dimension of feature space: ≈ φi,j 1–10s, φi : 100s to 10000s
M : number of output nodes
K: number of possible labels of each output nodes
29 / 39


Typical feature functions in image segmentation:

φi (yi , x) ∈ R≈1000 : local image features, e.g. bag-of-words
→ wi , φi (yi , x) : local classiﬁer (like logistic-regression)

φi,j (yi , yj ) = yi = yj ∈ R1 : test for same label
→ wij , φij (yi , yj ) : penalizer for label changes (if wij 0)

combined: argmaxy p(y|x) is smoothed version of local cues

original local classiﬁcation local + smoothness

30 / 39


Typical feature functions in handwriting recognition:

φi (yi , x) ∈ R≈1000 : image representation (pixels, gradients)
→ wi , φi (yi , x) : local classiﬁer if xi is letter yi

φi,j (yi , yj ) = eyi ⊗ eyj ∈ R26·26 : letter/letter indicator
→ wij , φij (yi , yj ) : encourage/suppress letter combinations

combined: argmaxy p(y|x) is ”corrected” version of local cues

local classiﬁcation local + ”correction”

31 / 39


Typical feature functions in pose estimation:

φi (yi , x) ∈ R≈1000 : local image representation, e.g. HoG
→ wi , φi (yi , x) : local confidence map

φi,j (yi , yj ) = good fit(yi , yj ) ∈ R1 : test for geometric fit
→ wij , φij (yi , yj ) : penalizer for unrealistic poses

together: argmaxy p(y|x) is sanitized version of local cues

original local classification local + geometry
[V. Ferrari, M. Marin-Jimenez, A. Zisserman: ”Progressive Search Space Reduction for Human Pose Estimation”, CVPR 2008.]

32 / 39


Typical feature functions for CRFs in Computer Vision:

φi (yi , x): local representation, high-dimensional
→ wi , φi (yi , x) : local classiﬁer

φi,j (yi , yj ): prior knowledge, low-dimensional
→ wij , φij (yi , yj ) : penalize outliers

learning adjusts parameters:
unary wi : learn local classiﬁers and their importance
binary wij : learn importance of smoothing/penalization

argmaxy p(y|x) is cleaned up version of local prediction

33 / 39


Solving the Training Optimization Problem Numerically Idea: split
learning of unary potentials into two parts:
local classifiers,
their importance.

Two-Stage Training

pre-train fiy (x) = log p(yi |x)
ˆ
use φ˜i (yi , x) := f y (x) ∈ RK (low-dimensional)
i
keep φij (yi , yj ) are before
˜
perform CRF learning with φi and φij

Advantage:
lower dimensional feature space during inference → faster
fiy (x) can be stronger classifiers, e.g. non-linear SVMs
Disadvantage:
if local classifiers are bad, CRF training cannot fix that.
34 / 39


Solving the Training Optimization Problem Numerically CRF training
methods is based on gradient-descent optimization.
The faster we can do it, the better (more realistic) models we can use:

N
˜w L(w) = w − φ(xn , y n ) − p(y|xn , w) φ(xn , y) ∈ RD
σ2 n=1 y∈Y

A lot of research on accelerating CRF training:

problem ”solution” method(s)
|Y| too large exploit structure (loopy) belief propagation
smart sampling contrastive divergence
use approximate L e.g. pseudo-likelihood
N too large mini-batches stochastic gradient descent
D too large trained φunary two-stage training

35 / 39


Summary – CRF Learning Given:
i.i.d.
training set {(x1 , y 1 ), . . . , (xn , y n )} ⊂ X × Y, (xn , y n ) ∼ d(x, y)
feature function φ : X × RD .
1
Task: ﬁnd parameter vector w such that Z exp( w, φ(x, y) ) ≈ d(y|x).

CRF solution derived by minimizing negative conditional log-likelihood:
N
∗ 1 2 w,φ(xn ,y)
w = argmin w − w, φ(xn , y n ) − log e
w 2σ 2
n=1 y∈Y

convex optimization problem → gradient descent works
training needs repeated runs of probabilistic inference

36 / 39


Extra I: Beyond Fully Supervised Learning So far, training was fully
supervised, all variables were observed.
In real life, some variables can be unobserved even during training.

missing labels in training data latent variables, e.g. part location

latent variables, e.g. part occlusion latent variables, e.g. viewpoint

37 / 39


Graphical model: three types of variables
x ∈ X always observed,
y ∈ Y observed only in training,
z ∈ Z never observed (latent).

Marginalization over Latent Variables
Construct conditional likelihood as usual:
1
p(y, z|x, w) = exp( w, φ(x, y, z) )
Z(x, w)

Derive p(y|x, w) by marginalizing over z:

1
p(y|x, w) = p(y, z|x, w)
Z(x, w)
z∈Z

38 / 39


Negative regularized conditional log-likelihood:
N
2
L(w) = λ w − log p(y n |xn , w)
n=1
N
2
=λ w − log p(y n , z|xn , w)
n=1 z∈Z
N
2
=λ w − log exp( w, φ(xn , y n , z) )
n=1 z∈Z
N
+ log exp( w, φ(xn , y, z) )
n=1 z∈Z
y∈Y

L is not convex in w → can have local minima
no agreed on best way for treating latent variables
39 / 39

03 conditional random field

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (16)

Similaire à 03 conditional random field

Similaire à 03 conditional random field (20)

Plus de zukun

Plus de zukun (20)

Dernier

Dernier (20)

03 conditional random field