This document discusses structured prediction problems in machine learning for natural language processing tasks. It covers using linear classifiers like perceptrons and SVMs for structured outputs by factorizing feature representations. Sequence labeling tasks are used as a running example, with explanations of how to apply the Viterbi algorithm for inference and conditional random fields for learning. Dependency parsing is presented as a case study for structured prediction.
Machine Learning for Structured Prediction
1. Machine Learning for Language Technology
Uppsala University
Department of Linguistics and Philology
Structured Prediction
October 2013
Slides borrowed from previous courses.
Thanks to Ryan McDonald (Google Research)
and Prof. Joakim Nivre
2. Structured Prediction
Outline
Last time:
Preliminaries: input/output, features, etc.
Linear classifiers
Perceptron
Large-margin classifiers (SVMs, MIRA)
Logistic regression
Today:
Structured prediction with linear classifiers
Structured perceptron
Structured large-margin classifiers (SVMs, MIRA)
Conditional random fields
Case study: Dependency parsing
3. Structured Prediction
Structured Prediction (i)
Sometimes our output space Y does not consist of simple
atomic classes
Examples:
Parsing: for a sentence x, Y is the set of possible parse trees
Sequence tagging: for a sentence x, Y is the set of possible
tag sequences, e.g., part-of-speech tags, named-entity tags
Machine translation: for a source sentence x, Y is the set of
possible target language sentences
4. Structured Prediction
Hidden Markov Models
Generative model – maximizes the joint likelihood P(x, y)
We are looking at discriminative versions of this
Not just sequences, though that will be the running example
5. Structured Prediction
Structured Prediction (ii)
Can’t we just use our multiclass learning algorithms?
In all of these cases, the size of the set Y is exponential in the length of the input x
It is non-trivial to apply our learning algorithms in such cases
6. Structured Prediction
Perceptron
Training data: T = {(xt, yt)}, t = 1..|T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. Let y' = arg max_y w(i) · f(xt, y) (**)
5. if y' ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y')
7. i = i + 1
8. return w(i)
(**) Solving the argmax requires a search over an exponential
space of outputs!
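For concreteness, here is a brute-force version of the argmax in step 4, sketched in Python (an illustration, not from the slides; it assumes w and f(x, y) are dicts mapping feature names to values). It simply enumerates the candidate set, which stops being feasible once Y is exponential in the input length.

```python
def argmax_enumerate(x, candidates, w, f):
    """Brute-force argmax over an explicitly enumerated output set Y.

    w and f(x, y) are dicts of feature weights/values (assumed representation).
    For structured outputs the candidate set grows exponentially with |x|,
    so this enumeration becomes infeasible."""
    def score(y):
        return sum(w.get(k, 0.0) * v for k, v in f(x, y).items())
    return max(candidates, key=score)
```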
7. Structured Prediction
Large-Margin Classifiers
Batch (SVMs):
min (1/2)||w||²
such that:
w · f(xt, yt) − w · f(xt, y') ≥ 1
∀(xt, yt) ∈ T and y' ∈ Ȳt (**)
Online (MIRA):
Training data: T = {(xt, yt)}, t = 1..|T|
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. w(i+1) = arg min_{w*} ||w* − w(i)||
such that:
w* · f(xt, yt) − w* · f(xt, y') ≥ 1
∀y' ∈ Ȳt (**)
5. i = i + 1
6. return w(i)
(**) There are exponentially many constraints in the size of each input!
8. Structured Prediction
Factor the Feature Representations
We can assume that our feature representation factors relative to the output
Example:
Context-free parsing:
f(x, y) = Σ_{A→BC ∈ y} f(x, A → BC)
Sequence analysis – Markov assumptions:
f(x, y) = Σ_{i=1}^{|y|} f(x, yi−1, yi)
These kinds of factorizations allow us to run algorithms like
CKY and Viterbi to compute the argmax function
9. Structured Prediction
Example – Sequence Labeling
Many NLP problems can be cast in this light
Part-of-speech tagging
Named-entity extraction
Semantic role labeling
...
Input: x = x0x1 . . . xn
Output: y = y0y1 . . . yn
Each yi ∈ Yatom – which is small
Each y ∈ Y = (Yatom)^n – which is large
Example: part-of-speech tagging – Yatom is set of tags
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
10. Structured Prediction
Sequence Labeling – Output Interaction
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
Why not just break the sequence up into a set of multiclass predictions?
Because there are interactions between neighbouring tags
What tag does "saw" have?
What if I told you the previous tag was article?
What if it was noun?
11. Structured Prediction
Sequence Labeling – Markov Factorization
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
Markov factorization – factor by adjacent labels
First-order (like HMMs)
f(x, y) = Σ_{i=1}^{|y|} f(x, yi−1, yi)
kth-order
f(x, y) = Σ_{i=k}^{|y|} f(x, yi−k, . . . , yi−1, yi)
12. Structured Prediction
Sequence Labeling – Features
x = John saw Mary with the telescope
y = noun verb noun preposition article noun
First-order
f(x, y) = Σ_{i=1}^{|y|} f(x, yi−1, yi)
f(x, yi−1, yi) is any feature of the input & two adjacent labels
fj(x, yi−1, yi) = 1 if xi = "saw" and yi−1 = noun and yi = verb, 0 otherwise
fk(x, yi−1, yi) = 1 if xi = "saw" and yi−1 = article and yi = verb, 0 otherwise
wj should get a high weight and wk should get a low weight
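As an illustration (a sketch, not from the original slides), these two indicator features could be written in Python as follows, assuming words and tags are plain strings:

```python
def f_j(x, i, y_prev, y_cur):
    # fires when the current word is "saw", the previous tag is noun,
    # and the current tag is verb (should receive a high weight)
    return 1.0 if x[i] == "saw" and y_prev == "noun" and y_cur == "verb" else 0.0

def f_k(x, i, y_prev, y_cur):
    # same word, but previous tag article (should receive a low weight)
    return 1.0 if x[i] == "saw" and y_prev == "article" and y_cur == "verb" else 0.0
```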
13. Structured Prediction
Sequence Labeling - Inference
How does factorization affect inference?
y' = arg max_y w · f(x, y)
= arg max_y w · Σ_{i=1}^{|y|} f(x, yi−1, yi)
= arg max_y Σ_{i=1}^{|y|} w · f(x, yi−1, yi)
= arg max_y Σ_{i=1}^{|y|} Σ_{j=1}^{m} wj · fj(x, yi−1, yi)
Can use the Viterbi algorithm
14. Structured Prediction
Sequence Labeling – Viterbi Algorithm
Let αy,i be the score of the best labeling
Of the sequence x0x1 . . . xi
Where yi = y
Let's say we know α; then max_y αy,n is the score of the best labeling of the sequence
αy,i can be calculated with the following recursion:
αy,0 = 0.0 ∀y ∈ Yatom
αy,i = max_{y*} [ αy*,i−1 + w · f(x, y*, y) ]
15. Structured Prediction
Sequence Labeling - Back-Pointers
But that only tells us what the best score is
Let βy,i be the (i−1)th label in the best labeling
Of the sequence x0x1 . . . xi
Where yi = y
βy,i can be calculated with the following recursion:
βy,0 = nil ∀y ∈ Yatom
βy,i = arg max_{y*} [ αy*,i−1 + w · f(x, y*, y) ]
Thus:
The last label in the best sequence is yn = arg max_y αy,n
The second-to-last label is yn−1 = βyn,n
...
The first label is y0 = βy1,1
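A compact Python sketch of the Viterbi recursion with back-pointers (illustrative, not from the slides; it assumes w is a dict of feature weights and feats(x, i, y_prev, y) returns the dict of features f(x, yi−1, yi) for position i):

```python
def viterbi(x, labels, w, feats):
    """Best labeling of x under the first-order factorization (sketch)."""
    def score(i, y_prev, y_cur):
        return sum(w.get(k, 0.0) * v for k, v in feats(x, i, y_prev, y_cur).items())

    n = len(x)
    alpha = [{y: 0.0 for y in labels}]   # alpha[0][y] = 0.0 for all y
    back = [{y: None for y in labels}]   # back[0][y] = nil
    for i in range(1, n):
        alpha.append({})
        back.append({})
        for y in labels:
            best = max(labels, key=lambda y_prev: alpha[i - 1][y_prev] + score(i, y_prev, y))
            alpha[i][y] = alpha[i - 1][best] + score(i, best, y)
            back[i][y] = best
    # read off the best sequence by following the back-pointers
    y_last = max(labels, key=lambda y: alpha[n - 1][y])
    seq = [y_last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```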
16. Structured Prediction
Structured Learning
We know we can solve the inference problem
At least for sequence labeling
But also for many other problems where one can factor
features appropriately (context-free parsing, dependency
parsing, semantic role labeling, . . . )
How does this change learning?
for the perceptron algorithm?
for SVMs?
for logistic regression?
17. Structured Prediction
Structured Perceptron
Exactly like original perceptron
Except now the argmax function uses factored features
Which we can solve with algorithms like the Viterbi algorithm
All of the original analysis carries over!!
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. Let y' = arg max_y w(i) · f(xt, y) (**)
5. if y' ≠ yt
6. w(i+1) = w(i) + f(xt, yt) − f(xt, y')
7. i = i + 1
8. return w(i)
(**) Solve the argmax with Viterbi for sequence problems!
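A minimal Python sketch of this training loop (illustrative; `decode` stands for whatever argmax routine fits the structure, e.g. the Viterbi sketch above for sequence problems, and w/feats follow the same dict conventions):

```python
from collections import defaultdict

def structured_perceptron(data, labels, feats, decode, epochs=10):
    """data: list of (x, y) pairs; decode: argmax routine, e.g. the Viterbi
    sketch above. Returns a weight dict (sketch)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = decode(x, labels, w, feats)       # (**) argmax via Viterbi
            if list(y_pred) != list(y_gold):
                for i in range(1, len(x)):             # factored feature update
                    for k, v in feats(x, i, y_gold[i - 1], y_gold[i]).items():
                        w[k] += v
                    for k, v in feats(x, i, y_pred[i - 1], y_pred[i]).items():
                        w[k] -= v
    return dict(w)
```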
18. Structured Prediction
Online Structured SVMs (or Online MIRA)
1. w(0) = 0; i = 0
2. for n : 1..N
3. for t : 1..|T|
4. w(i+1) = arg min_{w*} ||w* − w(i)||
such that:
w* · f(xt, yt) − w* · f(xt, y') ≥ L(yt, y')
∀y' ∈ Ȳt and y' ∈ k-best(xt, w(i))
5. i = i + 1
6. return w(i)
k-best(xt, w(i)) is the set of k outputs with the highest scores under w(i)
Simple solution – only consider the single highest-scoring output y' ∈ Ȳt (k = 1)
Note: The old fixed margin of 1 is now a fixed loss L(yt, y') between two structured outputs
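For the simple case of a single constraint (k = 1), the quadratic program in step 4 has a closed-form solution; a hedged Python sketch, with feature vectors represented as dicts:

```python
def mira_update(w, f_gold, f_pred, loss):
    """Single-constraint MIRA step: smallest change to w that satisfies
    w . (f_gold - f_pred) >= loss (sketch)."""
    delta = {k: f_gold.get(k, 0.0) - f_pred.get(k, 0.0)
             for k in set(f_gold) | set(f_pred)}
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w                          # prediction has identical features
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())
    tau = max(0.0, (loss - margin) / norm_sq)
    new_w = dict(w)
    for k, v in delta.items():
        new_w[k] = new_w.get(k, 0.0) + tau * v
    return new_w
```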
19. Structured Prediction
Structured SVMs
min (1/2)||w||²
such that:
w · f(xt, yt) − w · f(xt, y') ≥ L(yt, y')
∀(xt, yt) ∈ T and y' ∈ Ȳt
Still have an exponential number of constraints
Feature factorizations permit solutions (max-margin Markov
networks, structured SVMs)
Note: The old fixed margin of 1 is now a fixed loss L(yt, y') between two structured outputs
20. Structured Prediction
Conditional Random Fields (i)
What about structured logistic regression?
Such a thing exists – Conditional Random Fields (CRFs)
Consider again the sequential case with 1st order factorization
Inference is identical to the structured perceptron – use Viterbi
arg max_y P(y|x) = arg max_y e^{w·f(x,y)} / Zx
= arg max_y e^{w·f(x,y)}
= arg max_y w · f(x, y)
= arg max_y Σ_{i=1}^{|y|} w · f(x, yi−1, yi)
21. Structured Prediction
Conditional Random Fields (ii)
However, learning does change
Reminder: pick w to maximize log-likelihood of training data:
w = arg max_w F(w), where F(w) = Σ_t log P(yt|xt)
Take the gradient and use gradient ascent
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
And the gradient is:
∇F(w) = ( ∂F(w)/∂w0, ∂F(w)/∂w1, . . . , ∂F(w)/∂wm )
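A minimal gradient-ascent sketch in Python (illustrative; empirical_counts and expected_counts are assumed helpers passed in by the caller, and the second one is exactly the quantity the next slides compute with forward-backward):

```python
from collections import defaultdict

def crf_gradient_step(w, data, empirical_counts, expected_counts, learning_rate=0.1):
    """One gradient-ascent step on F(w) = sum_t log P(y_t | x_t) (sketch).
    empirical_counts(x, y) and expected_counts(x, w) are assumed helpers
    returning dicts of feature counts; the second requires forward-backward."""
    grad = defaultdict(float)
    for x, y in data:
        for k, v in empirical_counts(x, y).items():   # first term: gold feature counts
            grad[k] += v
        for k, v in expected_counts(x, w).items():    # second term: model expectation
            grad[k] -= v
    for k, g in grad.items():
        w[k] = w.get(k, 0.0) + learning_rate * g
    return w
```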
22. Structured Prediction
Conditional Random Fields (iii)
Problem: sum over output space Y
∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y'∈Y} P(y'|xt) fi(xt, y')
= Σ_t Σ_{j=1}^{|yt|} fi(xt, yt,j−1, yt,j) − Σ_t Σ_{y'∈Y} Σ_{j=1}^{|y'|} P(y'|xt) fi(xt, y'j−1, y'j)
Can easily calculate first term – just empirical counts
What about the second term?
23. Structured Prediction
Conditional Random Fields (iv)
Problem: sum over output space Y
Σ_t Σ_{y'∈Y} Σ_{j=1}^{|y'|} P(y'|xt) fi(xt, y'j−1, y'j)
We need to show we can compute it for arbitrary xt:
Σ_{y'∈Y} Σ_{j=1}^{|y'|} P(y'|xt) fi(xt, y'j−1, y'j)
Solution: the forward-backward algorithm
24. Structured Prediction
Forward Algorithm (i)
Let α^m_u be the forward scores
Let |xt| = n
α^m_u is the sum over all labelings of x0 . . . xm such that ym = u:
α^m_u = Σ_{|y'|=m, y'm=u} e^{w·f(xt,y')} = Σ_{|y'|=m, y'm=u} e^{Σ_{j=1}^{m} w·f(xt,y'j−1,y'j)}
i.e., the sum of all labelings of length m, ending at position m with label u
Note then that
Zxt = Σ_{y'} e^{w·f(xt,y')} = Σ_u α^n_u
25. Structured Prediction
Forward Algorithm (ii)
We can fill in α as follows:
α^0_u = 1.0 ∀u
α^m_u = Σ_v α^{m−1}_v × e^{w·f(xt,v,u)}
26. Structured Prediction
Backward Algorithm
Let β^m_u be the symmetric backward scores
i.e., the sum over all labelings of xm . . . xn such that ym = u
We can fill in β as follows:
β^n_u = 1.0 ∀u
β^m_u = Σ_v β^{m+1}_v × e^{w·f(xt,u,v)}
Note: β is overloaded – different from back-pointers
27. Structured Prediction
Conditional Random Fields - Final
Let's show we can compute it for arbitrary xt:
Σ_{y'∈Y} Σ_{j=1}^{|y'|} P(y'|xt) fi(xt, y'j−1, y'j)
So we can re-write it as:
Σ_{j=1}^{n} Σ_{yj−1, yj} ( α^{j−1}_{yj−1} × e^{w·f(xt,yj−1,yj)} × β^j_{yj} / Zxt ) × fi(xt, yj−1, yj)
Forward-backward can calculate partial derivatives efficiently
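Putting the pieces together, a Python sketch of forward-backward and the pairwise marginals that weight the expected feature counts (illustrative; same w/feats conventions as the Viterbi sketch above, and no log-space safeguards against underflow):

```python
import math

def forward_backward(x, labels, w, feats):
    """Returns Z_x and the marginals P(y_{i-1}=u, y_i=v | x) (sketch)."""
    n = len(x)
    def phi(i, u, v):
        # local potential exp(w . f(x, y_{i-1}=u, y_i=v))
        return math.exp(sum(w.get(k, 0.0) * val
                            for k, val in feats(x, i, u, v).items()))
    # forward: alpha[0][u] = 1.0, alpha[i][v] = sum_u alpha[i-1][u] * phi(i, u, v)
    alpha = [{u: 1.0 for u in labels}]
    for i in range(1, n):
        alpha.append({v: sum(alpha[i - 1][u] * phi(i, u, v) for u in labels)
                      for v in labels})
    # backward: beta[n-1][u] = 1.0, beta[i][u] = sum_v phi(i+1, u, v) * beta[i+1][v]
    beta = [None] * n
    beta[n - 1] = {u: 1.0 for u in labels}
    for i in range(n - 2, -1, -1):
        beta[i] = {u: sum(phi(i + 1, u, v) * beta[i + 1][v] for v in labels)
                   for u in labels}
    Z = sum(alpha[n - 1][u] for u in labels)
    # pairwise marginals: the weights on f_i(x, y_{j-1}, y_j) in the gradient
    marginals = {(i, u, v): alpha[i - 1][u] * phi(i, u, v) * beta[i][v] / Z
                 for i in range(1, n) for u in labels for v in labels}
    return Z, marginals
```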
28. Structured Prediction
Conditional Random Fields Summary
Inference: Viterbi
Learning: Use the forward-backward algorithm
What about non-sequential problems?
Context-free parsing – can use inside-outside algorithm
General problems – message passing & belief propagation
29. Structured Prediction
Case Study: Dependency Parsing
Given an input sentence x, predict syntactic dependencies y
30. Structured Prediction
Model 1: Arc-Factored Graph-Based Parsing
y' = arg max_y w · f(x, y)
= arg max_y Σ_{(i,j)∈y} w · f(i, j)
(i, j) ∈ y means xi → xj , i.e., a dependency from xi to xj
Solving the argmax
w · f(i, j) is weight of arc
A dependency tree is a spanning tree of a dense graph over x
Use max spanning tree algorithms for inference
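A sketch of arc-factored scoring in Python (illustrative; arc_feats(x, i, j) is an assumed feature extractor, and the greedy head selection shown is only an approximation: the true argmax requires a maximum-spanning-tree algorithm such as Chu-Liu/Edmonds, or Eisner's algorithm for projective trees):

```python
def arc_scores(x, w, arc_feats):
    """Score every candidate arc i -> j; position 0 is an artificial root (sketch)."""
    scores = {}
    for i in range(len(x)):            # candidate heads
        for j in range(1, len(x)):     # candidate dependents (the root takes no head)
            if i != j:
                scores[(i, j)] = sum(w.get(k, 0.0) * v
                                     for k, v in arc_feats(x, i, j).items())
    return scores

def greedy_heads(scores, n):
    # Illustration only: picking each word's best head independently can create
    # cycles; a real parser runs Chu-Liu/Edmonds (non-projective) or Eisner's
    # algorithm (projective) over these scores to get a true tree.
    return {j: max((i for i in range(n) if i != j), key=lambda i: scores[(i, j)])
            for j in range(1, n)}
```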
31. Structured Prediction
Defining f(i, j)
Can contain any feature over arc or the input sentence
Some example features
Identities of xi and xj
Their part-of-speech tags
The part-of-speech of surrounding words
The distance between xi and xj
...
32. Structured Prediction
Empirical Results
Spanning tree dependency parsing results (McDonald 2006)
Trained using MIRA (online SVMs)
Language   Accuracy   Complete
English    90.7       36.7
Czech      84.1       32.2
Chinese    79.7       27.2
Simple structured linear classifier
Near state-of-the-art performance for many languages
Higher-order models give higher accuracy
33. Structured Prediction
Model 2: Transition-Based Parsing
y' = arg max_y w · f(x, y)
= arg max_y Σ_{t(s)∈T(y)} w · f(s, t)
t(s) ∈ T(y) means that the derivation of y includes the
application of transition t to state s
Solving the argmax
w · f(s, t) is score of transition t in state s
Use beam search to find best derivation from start state s0
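A generic beam-search sketch in Python (illustrative; `system` is an assumed transition-system object, so the method names here are placeholders rather than any particular parser's API):

```python
def beam_search_parse(x, w, system, beam_size=8):
    """Approximate argmax over transition sequences (sketch).
    `system` is an assumed transition system exposing initial_state(x),
    is_final(state), transitions(state), apply(state, t), and feats(state, t)."""
    beam = [(0.0, system.initial_state(x))]               # (score, state) pairs
    while not all(system.is_final(s) for _, s in beam):
        candidates = []
        for score, s in beam:
            if system.is_final(s):
                candidates.append((score, s))             # keep finished derivations
                continue
            for t in system.transitions(s):
                gain = sum(w.get(k, 0.0) * v
                           for k, v in system.feats(s, t).items())
                candidates.append((score + gain, system.apply(s, t)))
        # keep only the beam_size highest-scoring (partial) derivations
        beam = sorted(candidates, key=lambda p: p[0], reverse=True)[:beam_size]
    return max(beam, key=lambda p: p[0])[1]               # state encoding the best parse
```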
34. Structured Prediction
Defining f(s, t)
Can contain any feature over parser states
Some example features
Identities of words in s (e.g., top of stack, head of queue)
Their part-of-speech tags
Their head and dependents (and their part-of-speech tags)
The number of dependents of words in s
...
35. Structured Prediction
Empirical Results
Transition-based dependency parsing with beam search (**)
(Zhang and Nivre 2011)
Trained using perceptron
Language   Accuracy   Complete
English    92.9       48.0
Chinese    86.0       36.9
Simple structured linear classifier
State-of-the-art performance with rich non-local features
(**) Beam search is a heuristic search algorithm that explores a graph by expanding only the most promising nodes in a limited set. It is an optimization of best-first search that reduces memory requirements, and it finds only an approximate solution.
36. Structured Prediction
Structured Prediction Summary
Can’t use multiclass algorithms – search space too large
Solution: factor representations
Can allow for efficient inference and learning
Showed for sequence learning: Viterbi + forward-backward
But also true for other structures
CFG parsing: CKY + inside-outside
Dependency parsing: spanning tree algorithms or beam search
General graphs: message passing and belief propagation