Slides from my 2011 Association for Computational Linguistics paper and talk (joint work with Jason Baldridge and Katrin Erk). Presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases (like noun phrases) in raw text.
Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk
1. Simple Unsupervised Grammar Induction from
Raw Text with Cascaded Finite State Models
Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics
The University of Texas at Austin
Association for Computational Linguistics
19–24 June, 2011
Ponvert, Baldridge, Erk (UT Austin)
Simple Unsupervised Grammar Induction
ACL 2011
1 / 34
2. Why unsupervised parsing?
1 Less reliance on annotated training
Hello!
2 Apply to new languages and domains
Særær man
annær man
mæþæn
3. Assumptions made in parser learning
Getting these labels right AS WELL AS the structure
of the tree is hard
(S (PP (P on) (N Sunday))
   (, ,)
   (NP (Det the) (A brown) (N bear))
   (VP (V sleeps)))
4. Assumptions made in parser learning
So the task is to identify the structure alone
( ( on Sunday ) , ( the brown bear ) ( sleeps ) )
5. Assumptions made in parser learning
Learning operates from gold-standard parts-of-speech
(POS) rather than raw text
P   N       ,   Det  A      N     V
on  Sunday  ,   the  brown  bear  sleeps

From gold POS tags:
- Klein & Manning 2003: CCM
- Klein & Manning 2005: DMV
- Bod 2006a, 2006b
- Successors to DMV: Smith 2006; Smith & Cohen 2009; Headden et al. 2009; Spitkovsky et al. 2010a,b; etc.

From raw text:
- Gao et al. 2003, 2004
- Seginer 2007
- this work
6. Unsupervised parsing: desiderata
Raw text
Standard NLP / extensible
Scalable and fast
7. A new approach: start from the bottom
Unsupervised Partial Parsing =
segmentation of (non-overlapping) multiword constituents
8. Unsupervised segmentation of constituents
leaves some room for interpretation
Possible segmentations
( the cat ) in ( the hat ) knows ( a lot ) about that
( the cat ) ( in the hat ) knows ( a lot ) ( about that )
( the cat in the hat ) knows ( a lot about that )
( the cat in the hat ) ( knows a lot about that )
( the cat in the hat ) ( knows a lot ) ( about that )
9. Defining UPP by evaluation
1. Constituent chunks:
non-hierarchical multiword constituents
(S (NP (D The) (N cat) (PP (P in) (NP (D the) (N hat))))
   (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

[The original figure highlights the constituent chunks in this tree.]
10. Defining UPP by evaluation
2. Base NPs:
non-recursive noun phrases
(S (NP (D The) (N cat) (PP (P in) (NP (D the) (N hat))))
   (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

[The original figure highlights the base NPs in this tree.]
11. Multilingual data for direct evaluation
WSJ    English: Penn Treebank (Wall Street Journal)
Negra  German: Negra corpus
CTB    Chinese: Penn Chinese Treebank

        Sentences  Types  Tokens
WSJ     49K        44K    1M
Negra   21K        49K    300K
CTB     19K        37K    430K
12. Constituent chunks and NPs in the data
               WSJ    Negra  CTB
Chunks         203K   59K    92K
NPs            172K   33K    56K
Chunks ∩ NPs   161K   23K    43K
13. The benchmark: CCL parser
the cat saw the red dog run

[Figure: a constituency tree for this sentence alongside its Common Cover Links representation]

Seginer (2007 ACL; 2007 PhD, UvA)
14. Hypothesis
Segmentation can be learned by
generalizing on phrasal boundaries
15. UPP as a tagging problem
the  cat  in  the  hat
 B    I   O    B    I

B  Beginning of a constituent
I  Inside a constituent
O  Not inside a constituent
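The B/I/O encoding above can be sketched in a few lines; a minimal illustration (not the authors' code):

```python
# Encode a non-overlapping multiword-chunk segmentation as B/I/O tags.
def chunks_to_bio(tokens, chunks):
    """chunks: (start, end) spans, end-exclusive, non-overlapping."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"                 # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining words of the chunk
    return tags

tokens = "the cat in the hat".split()
# ( the cat ) in ( the hat )
print(chunks_to_bio(tokens, [(0, 2), (3, 5)]))  # ['B', 'I', 'O', 'B', 'I']
```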
18. UPP: Models
Hidden Markov Model

the cat in the hat  →  B I O B I

Emission conditioned on the current tag only:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_i)

Probabilistic right-linear grammar (PRLG)

Emission conditioned on the previous and current tags:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_{i-1}, y_i)

Learning: expectation maximization (EM) via forward-backward (run to convergence)
19. UPP: Models
Hidden Markov Model

the cat in the hat  →  B I O B I

Emission conditioned on the current tag only:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_i)

Probabilistic right-linear grammar (PRLG)

Emission conditioned on the previous and current tags:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_{i-1}, y_i)

Decoding: Viterbi
Smoothing: additive smoothing on emissions
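The two factorizations differ only in the emission term; a toy numeric sketch (all probability values below are invented for illustration):

```python
# Score a tagged sentence under the HMM and PRLG factorizations.
def hmm_score(tags, words, trans, emit):
    """Emission conditioned on the current tag: P(w_i | y_i)."""
    p, prev = 1.0, "#"                    # "#" marks the sentence start
    for y, w in zip(tags, words):
        p *= trans[(prev, y)] * emit[y][w]
        prev = y
    return p

def prlg_score(tags, words, trans, emit):
    """Emission conditioned on both tags: P(w_i | y_{i-1}, y_i)."""
    p, prev = 1.0, "#"
    for y, w in zip(tags, words):
        p *= trans[(prev, y)] * emit[(prev, y)][w]
        prev = y
    return p

trans = {("#", "B"): 0.6, ("B", "I"): 0.9}
emit_hmm = {"B": {"the": 0.2}, "I": {"cat": 0.05}}
emit_prlg = {("#", "B"): {"the": 0.3}, ("B", "I"): {"cat": 0.04}}

print(hmm_score(["B", "I"], ["the", "cat"], trans, emit_hmm))    # 0.6*0.2 * 0.9*0.05
print(prlg_score(["B", "I"], ["the", "cat"], trans, emit_prlg))  # 0.6*0.3 * 0.9*0.04
```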
20. UPP: Constraints on sequences
#     the  cat  in  the  hat  #
STOP   B    I   O    B    I   STOP

Sentence boundaries are marked with an explicit STOP symbol.
[Figure: allowed transitions among STOP, B, I and O]
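One way to impose such sequence constraints is a hard mask over transitions during Viterbi decoding. A hedged sketch: the specific constraint set here (chunks span at least two words, so B must be followed by I, and I can only follow B or I) is my reading of the diagram, and the probability tables are invented:

```python
# Viterbi over B/I/O with hard constraints on transitions, including
# explicit sentence boundaries: "#" (start) and "STOP" (end).
ALLOWED = {
    "#": {"B", "O"},               # a sentence cannot start with I
    "B": {"I"},                    # chunks are multiword: B must continue
    "I": {"I", "B", "O", "STOP"},
    "O": {"B", "O", "STOP"},       # O cannot be followed by I
}

def viterbi(words, trans, emit):
    # delta maps state -> (best probability, best tag path)
    delta = {
        s: (trans.get(("#", s), 0.0) * emit[s].get(words[0], 0.0), [s])
        for s in ALLOWED["#"]
    }
    for w in words[1:]:
        new = {}
        for s in ("B", "I", "O"):
            best, path = 0.0, None
            for p, (score, ppath) in delta.items():
                if s in ALLOWED[p]:   # skip forbidden transitions
                    cand = score * trans.get((p, s), 0.0) * emit[s].get(w, 0.0)
                    if cand > best:
                        best, path = cand, ppath + [s]
            if path is not None:
                new[s] = (best, path)
        delta = new
    # the final state must be allowed to transition to STOP
    finals = [v for p, v in delta.items() if "STOP" in ALLOWED[p]]
    return max(finals)[1]

trans = {("#", "B"): 0.7, ("#", "O"): 0.3, ("B", "I"): 1.0,
         ("I", "I"): 0.2, ("I", "B"): 0.4, ("I", "O"): 0.3,
         ("O", "B"): 0.5, ("O", "O"): 0.5}
emit = {"B": {"the": 0.5, "cat": 0.1, "in": 0.05, "hat": 0.1},
        "I": {"cat": 0.4, "hat": 0.4, "the": 0.1, "in": 0.05},
        "O": {"in": 0.6, "the": 0.1, "cat": 0.1, "hat": 0.1}}
print(viterbi("the cat in the hat".split(), trans, emit))  # ['B', 'I', 'O', 'B', 'I']
```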
21. UPP evaluation: Setup
Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both get tokenization, punctuation,
sentence boundaries
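The evaluation itself is simple set arithmetic over constituent spans; a minimal sketch:

```python
# Precision, recall and F-score on matched constituents, with each
# parse represented as a collection of (start, end) spans.
def prf(gold, pred):
    gold, pred = set(gold), set(pred)
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# gold: ( the cat ) in ( the hat )    predicted: ( the cat ) ( in the hat )
print(prf([(0, 2), (3, 5)], [(0, 2), (2, 5)]))  # (0.5, 0.5, 0.5)
```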
24. UPP: Review
Sequence models can generalize on indicators
for phrasal boundaries
Leads to improved unsupervised segmentation
25. Question
Are we limited to segmentation?
26. Hypothesis
Identification of higher level constituents
can also be learned by generalizing on
phrasal boundaries
27. Cascaded UPP: 1 Segment raw text
there is no asbestos in our products now
⇒ there is ( no asbestos ) in ( our products ) now
28. Cascaded UPP: 2 Choose stand-ins for phrases
there is ( no asbestos ) in ( our products ) now
⇒ each chunk is replaced by a single pseudoword stand-in
29. Cascaded UPP: 3 Segment text + phrasal stand-ins
there is x1 in x2 now   (x1, x2 stand in for the level-1 chunks)

The same chunking procedure is applied to this reduced sequence.
30. Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4
Chunks found at the new level are replaced by stand-ins in turn, and chunking is repeated until no new chunks are found.
31. Cascaded UPP: 5 Unwind to output tree
Replacing each stand-in by the chunk it represents unwinds the cascade into a nested bracketing of the original sentence, with ( no asbestos ) and ( our products ) as the innermost constituents.
32. Cascaded UPP: Review
Separate models are learned at each cascade level
Models share hyperparameters (smoothing, etc.)
Pseudowords serve as phrasal stand-ins
Pseudoword identification: corpus frequency
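The cascade's control flow can be sketched as follows. This is not the released upparse code: `stub_chunker` stands in for a trained UPP chunking model at each level, with its spans fixed by hand, and the most-frequent-word rule is one concrete reading of "pseudoword identification by corpus frequency":

```python
from collections import Counter

def leaves(t):
    # Words under a node (a node is a word or a list of nodes).
    return [t] if isinstance(t, str) else [w for c in t for w in leaves(c)]

def stand_in(t, freq):
    # Pseudoword for a chunk: its most corpus-frequent word (an assumption).
    return max(leaves(t), key=lambda w: freq[w])

def merge(seq, spans):
    # Replace each chunked span with a single subtree node.
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(seq[i:start])
        out.append(seq[start:end])
        i = end
    out.extend(seq[i:])
    return out

def cascade(tokens, chunker, freq, max_levels=5):
    seq = list(tokens)
    for _ in range(max_levels):
        labels = [t if isinstance(t, str) else stand_in(t, freq) for t in seq]
        spans = chunker(labels)
        if not spans:                  # no new chunks: stop and unwind
            break
        seq = merge(seq, spans)
    return seq

def stub_chunker(labels):
    # Hand-fixed spans standing in for a trained model at each level.
    if labels == "there is no asbestos in our products now".split():
        return [(2, 4), (5, 7)]        # ( no asbestos ) ( our products )
    if labels == "there is no in our now".split():
        return [(3, 5)]                # group the stand-ins for "in our products"
    return []

freq = Counter("there is no asbestos in our products now no our in".split())
print(cascade("there is no asbestos in our products now".split(),
              stub_chunker, freq))
# ['there', 'is', ['no', 'asbestos'], ['in', ['our', 'products']], 'now']
```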
34. More example parses
Gold standard (Negra):

tut   die  csu  das   doch          in  bayern   auch  sehr  erfolgreich
does  the  CSU  this  nevertheless  in  Bavaria  also  very  successfully

"Nevertheless, the CSU does this in Bavaria very successfully as well."

Cascaded PRLG – Negra (correct and incorrect brackets are color-coded in the original figure)
35. More example parses
Gold standard (Negra):

bei   den  windsors  bleibt  alles       in  der  familie
with  the  Windsors  stays   everything  in  the  family

"With the Windsors everything stays in the family."

Cascaded PRLG – Negra (correct and incorrect brackets are color-coded in the original figure)
37. What we’ve learned
Unsupervised identification of base NPs and
local constituents is possible
A cascade of chunking models for raw text
parsing has state-of-the-art results
38. Future directions
Improvements to the sequence models
Better phrasal stand-in (pseudoword)
construction
Learning joint models rather than a cascade
39. What’s in the paper
Comparison to Klein & Manning's CCM
Discussion of phrasal punctuation
  the chunkers still do well without punctuation
Analysis of chunking and parsing Chinese
Error analysis
40. Thanks!
Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse
This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship and the Office of Graduate Studies at The University of Texas at Austin.
42. More example parses
two share a house almost devoid of furniture

Gold standard vs. Cascaded PRLG – WSJ (correct and incorrect brackets are color-coded in the original figure)
43. More example parses
what is one to think of all this

Gold standard vs. Cascaded PRLG – WSJ (correct and incorrect brackets are color-coded in the original figure)
44. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (10–40K) and of EM iterations (up to 100); PRLG chunking model: WSJ]
45. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (5–15K) and of EM iterations (up to 150); PRLG chunking model: Negra]
46. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (5–15K) and of EM iterations (up to 100); PRLG chunking model: CTB]
47. What are the models learning?
B      P(w|B)     I        P(w|I)     O      P(w|O)
the    21.0       %        1.8        of     5.8
a       8.7       million  1.6        and    4.0
to      6.5       be       1.3        in     3.7
's      2.8       company  0.9        that   2.2
in      1.9       year     0.8        to     2.1
mr.     1.8       market   0.7        for    2.0
its     1.6       billion  0.6        is     2.0
of      1.4       share    0.5        it     1.7
an      1.4       new      0.5        said   1.7
and     1.4       than     0.5        on     1.5

HMM Emissions: WSJ
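These distributions come from EM's expected counts with additive smoothing on the emissions, as noted on the model slide; a toy sketch with invented counts:

```python
# Add-alpha smoothing of emission probabilities P(w | state).
def smoothed_emissions(counts, vocab, alpha=0.1):
    """counts: {state: {word: expected count}} -> {state: {word: prob}}."""
    probs = {}
    for state, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        probs[state] = {w: (c.get(w, 0) + alpha) / total for w in vocab}
    return probs

counts = {"B": {"the": 8, "a": 2}, "I": {"cat": 5, "hat": 5}}
vocab = ["the", "a", "cat", "hat"]
p = smoothed_emissions(counts, vocab, alpha=1.0)
print(round(p["B"]["the"], 3))         # (8+1)/(10+4) = 9/14 ≈ 0.643
print(round(sum(p["B"].values()), 3))  # each row sums to 1.0
```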
48. What are the models learning?
B     (gloss)  P(w|B)     I            (gloss)     P(w|I)    O      (gloss)  P(w|O)
der    the     13.0       uhr          o'clock     0.8       in     in       3.4
die    the     12.2       juni         June        0.6       und    and      2.7
den    the      4.4       jahren       years       0.4       mit    with     1.7
und    and      3.3       prozent      percent     0.4       für    for      1.6
im     in       3.2       mark         (currency)  0.3       auf    on       1.5
das    the      2.9       stadt        city        0.3       zu     to       1.4
des    the      2.7       000                      0.3       von    of       1.3
dem    the      2.4       millionen    millions    0.3       sich   (refl.)  1.3
eine   a        2.1       jahre        year        0.3       ist    is       1.3
ein    a        2.0       frankfurter  Frankfurt   0.3       nicht  not      1.2

HMM Emissions: Negra
49. What are the models learning?
B     (gloss)   P(w|B)    I    (gloss)        P(w|I)    O     (gloss)  P(w|O)
的    de, of    14.3      的   de             3.9       在    at, in   3.4
一    one        3.1      了   (perf. asp.)   2.2       是    is       2.4
和    and        1.1      个   ge (measure)   1.5       中国  China    1.4
两    two        0.9      年   year           1.3       也    also     1.2
这    this       0.8      说   say            1.0       不    no       1.2
有    have       0.8      中   middle         0.9       对    pair     1.1
经济  economy    0.7      上   on, above      0.9       和    and      1.0
各    each       0.7      人   person         0.7       的    de       1.0
全    all        0.7      大   big            0.7       将    (fut.)   1.0
不    no         0.6      国   country        0.6       有    have     1.0

HMM Emissions: CTB