Slides from my 2011 Association for Computational Linguistics paper and talk (joint work with Jason Baldridge and Katrin Erk). Presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases (like noun phrases) in raw text.
Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk
1. Simple Unsupervised Grammar Induction from
Raw Text with Cascaded Finite State Models
Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics
The University of Texas at Austin
Association for Computational Linguistics
19–24 June, 2011
Ponvert, Baldridge, Erk (UT Austin)
Simple Unsupervised Grammar Induction
ACL 2011
1 / 34
2. Why unsupervised parsing?
1 Less reliance on annotated training
Hello!
2 Apply to new languages and domains
Særær man
annær man
mæþæn
3. Assumptions made in parser learning
Getting these labels right AS WELL AS the structure
of the tree is hard
(S (PP (P on) (N Sunday))
   (, ,)
   (NP (Det the) (A brown) (N bear))
   (VP (V sleeps)))
4. Assumptions made in parser learning
So the task is to identify the structure alone
( ( on Sunday ) , ( the brown bear ) ( sleeps ) )
5. Assumptions made in parser learning
Learning operates from gold-standard parts-of-speech
(POS) rather than raw text
P   N       ,   Det  A      N     V
on  Sunday  ,   the  brown  bear  sleeps

From gold POS tags:
- Klein & Manning 2003: CCM
- Klein & Manning 2005: DMV
- Bod 2006a, 2006b
- Successors to DMV: Smith 2006; Smith & Cohen 2009; Headden et al. 2009; Spitkovsky et al. 2010a,b; etc.

From raw text:
- Gao et al. 2003, 2004
- Seginer 2007
- this work
6. Unsupervised parsing: desiderata
Raw text
Standard NLP / extensible
Scalable and fast
7. A new approach: start from the bottom
Unsupervised Partial Parsing =
segmentation of (non-overlapping) multiword constituents
8. Unsupervised segmentation of constituents
leaves some room for interpretation
Possible segmentations
( the cat ) in ( the hat ) knows ( a lot ) about that
( the cat ) ( in the hat ) knows ( a lot ) ( about that )
( the cat in the hat ) knows ( a lot about that )
( the cat in the hat ) ( knows a lot about that )
( the cat in the hat ) ( knows a lot ) ( about that )
9. Defining UPP by evaluation
1. Constituent chunks:
non-hierarchical multiword constituents
(S (NP (D The) (N cat) (PP (P in) (NP (D the) (N hat))))
   (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

[The original figure highlights the constituent chunks in this tree.]
10. Defining UPP by evaluation
2. Base NPs:
non-recursive noun phrases
(S (NP (D The) (N cat) (PP (P in) (NP (D the) (N hat))))
   (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

[The original figure highlights the base NPs in this tree.]
11. Multilingual data for direct evaluation
WSJ    English: Penn Treebank (Wall Street Journal)
Negra  German: Negra corpus
CTB    Chinese: Penn Chinese Treebank

        Sentences  Types  Tokens
WSJ     49K        44K    1M
Negra   21K        49K    300K
CTB     19K        37K    430K
12. Constituent chunks and NPs in the data
               WSJ    Negra  CTB
Chunks         203K   59K    92K
NPs            172K   33K    56K
Chunks ∩ NPs   161K   23K    43K
13. The benchmark: CCL parser
the cat saw the red dog run

[Figure: a constituency tree for this sentence alongside its Common Cover Links representation]

Seginer (2007 ACL; 2007 PhD, UvA)
14. Hypothesis
Segmentation can be learned by
generalizing on phrasal boundaries
15. UPP as a tagging problem
the  cat  in  the  hat
 B    I   O    B    I

B  Beginning of a constituent
I  Inside a constituent
O  Not inside a constituent
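The B/I/O encoding above can be sketched in a few lines; a minimal illustration (not the authors' code):

```python
# Encode a non-overlapping multiword-chunk segmentation as B/I/O tags.
def chunks_to_bio(tokens, chunks):
    """chunks: (start, end) spans, end-exclusive, non-overlapping."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"                 # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining words of the chunk
    return tags

tokens = "the cat in the hat".split()
# ( the cat ) in ( the hat )
print(chunks_to_bio(tokens, [(0, 2), (3, 5)]))  # ['B', 'I', 'O', 'B', 'I']
```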
18. UPP: Models
Hidden Markov Model

the cat in the hat  →  B I O B I

Emission conditioned on the current tag only:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_i)

Probabilistic right-linear grammar (PRLG)

Emission conditioned on the previous and current tags:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_{i-1}, y_i)

Learning: expectation maximization (EM) via forward-backward (run to convergence)
19. UPP: Models
Hidden Markov Model

the cat in the hat  →  B I O B I

Emission conditioned on the current tag only:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_i)

Probabilistic right-linear grammar (PRLG)

Emission conditioned on the previous and current tags:
P(y_i, w_i | y_{i-1}) = P(y_i | y_{i-1}) · P(w_i | y_{i-1}, y_i)

Decoding: Viterbi
Smoothing: additive smoothing on emissions
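The two factorizations differ only in the emission term; a toy numeric sketch (all probability values below are invented for illustration):

```python
# Score a tagged sentence under the HMM and PRLG factorizations.
def hmm_score(tags, words, trans, emit):
    """Emission conditioned on the current tag: P(w_i | y_i)."""
    p, prev = 1.0, "#"                    # "#" marks the sentence start
    for y, w in zip(tags, words):
        p *= trans[(prev, y)] * emit[y][w]
        prev = y
    return p

def prlg_score(tags, words, trans, emit):
    """Emission conditioned on both tags: P(w_i | y_{i-1}, y_i)."""
    p, prev = 1.0, "#"
    for y, w in zip(tags, words):
        p *= trans[(prev, y)] * emit[(prev, y)][w]
        prev = y
    return p

trans = {("#", "B"): 0.6, ("B", "I"): 0.9}
emit_hmm = {"B": {"the": 0.2}, "I": {"cat": 0.05}}
emit_prlg = {("#", "B"): {"the": 0.3}, ("B", "I"): {"cat": 0.04}}

print(hmm_score(["B", "I"], ["the", "cat"], trans, emit_hmm))    # 0.6*0.2 * 0.9*0.05
print(prlg_score(["B", "I"], ["the", "cat"], trans, emit_prlg))  # 0.6*0.3 * 0.9*0.04
```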
20. UPP: Constraints on sequences
#     the  cat  in  the  hat  #
STOP   B    I   O    B    I   STOP

Sentence boundaries are marked with an explicit STOP symbol.
[Figure: allowed transitions among STOP, B, I and O]
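One way to impose such sequence constraints is a hard mask over transitions during Viterbi decoding. A hedged sketch: the specific constraint set here (chunks span at least two words, so B must be followed by I, and I can only follow B or I) is my reading of the diagram, and the probability tables are invented:

```python
# Viterbi over B/I/O with hard constraints on transitions, including
# explicit sentence boundaries: "#" (start) and "STOP" (end).
ALLOWED = {
    "#": {"B", "O"},               # a sentence cannot start with I
    "B": {"I"},                    # chunks are multiword: B must continue
    "I": {"I", "B", "O", "STOP"},
    "O": {"B", "O", "STOP"},       # O cannot be followed by I
}

def viterbi(words, trans, emit):
    # delta maps state -> (best probability, best tag path)
    delta = {
        s: (trans.get(("#", s), 0.0) * emit[s].get(words[0], 0.0), [s])
        for s in ALLOWED["#"]
    }
    for w in words[1:]:
        new = {}
        for s in ("B", "I", "O"):
            best, path = 0.0, None
            for p, (score, ppath) in delta.items():
                if s in ALLOWED[p]:   # skip forbidden transitions
                    cand = score * trans.get((p, s), 0.0) * emit[s].get(w, 0.0)
                    if cand > best:
                        best, path = cand, ppath + [s]
            if path is not None:
                new[s] = (best, path)
        delta = new
    # the final state must be allowed to transition to STOP
    finals = [v for p, v in delta.items() if "STOP" in ALLOWED[p]]
    return max(finals)[1]

trans = {("#", "B"): 0.7, ("#", "O"): 0.3, ("B", "I"): 1.0,
         ("I", "I"): 0.2, ("I", "B"): 0.4, ("I", "O"): 0.3,
         ("O", "B"): 0.5, ("O", "O"): 0.5}
emit = {"B": {"the": 0.5, "cat": 0.1, "in": 0.05, "hat": 0.1},
        "I": {"cat": 0.4, "hat": 0.4, "the": 0.1, "in": 0.05},
        "O": {"in": 0.6, "the": 0.1, "cat": 0.1, "hat": 0.1}}
print(viterbi("the cat in the hat".split(), trans, emit))  # ['B', 'I', 'O', 'B', 'I']
```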
21. UPP evaluation: Setup
Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both get tokenization, punctuation,
sentence boundaries
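The evaluation itself is simple set arithmetic over constituent spans; a minimal sketch:

```python
# Precision, recall and F-score on matched constituents, with each
# parse represented as a collection of (start, end) spans.
def prf(gold, pred):
    gold, pred = set(gold), set(pred)
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# gold: ( the cat ) in ( the hat )    predicted: ( the cat ) ( in the hat )
print(prf([(0, 2), (3, 5)], [(0, 2), (2, 5)]))  # (0.5, 0.5, 0.5)
```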
24. UPP: Review
Sequence models can generalize on indicators
for phrasal boundaries
Leads to improved unsupervised segmentation
25. Question
Are we limited to segmentation?
26. Hypothesis
Identification of higher level constituents
can also be learned by generalizing on
phrasal boundaries
27. Cascaded UPP: 1 Segment raw text
there is no asbestos in our products now
⇒ there is ( no asbestos ) in ( our products ) now
28. Cascaded UPP: 2 Choose stand-ins for phrases
there is ( no asbestos ) in ( our products ) now
⇒ each chunk is replaced by a single pseudoword stand-in
29. Cascaded UPP: 3 Segment text + phrasal stand-ins
there is x1 in x2 now   (x1, x2 stand in for the level-1 chunks)

The same chunking procedure is applied to this reduced sequence.
30. Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4
Chunks found at the new level are replaced by stand-ins in turn, and chunking is repeated until no new chunks are found.
31. Cascaded UPP: 5 Unwind to output tree
Replacing each stand-in by the chunk it represents unwinds the cascade into a nested bracketing of the original sentence, with ( no asbestos ) and ( our products ) as the innermost constituents.
32. Cascaded UPP: Review
Separate models are learned at each cascade level
Models share hyperparameters (smoothing, etc.)
Pseudowords serve as phrasal stand-ins
Pseudoword identification: corpus frequency
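The cascade's control flow can be sketched as follows. This is not the released upparse code: `stub_chunker` stands in for a trained UPP chunking model at each level, with its spans fixed by hand, and the most-frequent-word rule is one concrete reading of "pseudoword identification by corpus frequency":

```python
from collections import Counter

def leaves(t):
    # Words under a node (a node is a word or a list of nodes).
    return [t] if isinstance(t, str) else [w for c in t for w in leaves(c)]

def stand_in(t, freq):
    # Pseudoword for a chunk: its most corpus-frequent word (an assumption).
    return max(leaves(t), key=lambda w: freq[w])

def merge(seq, spans):
    # Replace each chunked span with a single subtree node.
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(seq[i:start])
        out.append(seq[start:end])
        i = end
    out.extend(seq[i:])
    return out

def cascade(tokens, chunker, freq, max_levels=5):
    seq = list(tokens)
    for _ in range(max_levels):
        labels = [t if isinstance(t, str) else stand_in(t, freq) for t in seq]
        spans = chunker(labels)
        if not spans:                  # no new chunks: stop and unwind
            break
        seq = merge(seq, spans)
    return seq

def stub_chunker(labels):
    # Hand-fixed spans standing in for a trained model at each level.
    if labels == "there is no asbestos in our products now".split():
        return [(2, 4), (5, 7)]        # ( no asbestos ) ( our products )
    if labels == "there is no in our now".split():
        return [(3, 5)]                # group the stand-ins for "in our products"
    return []

freq = Counter("there is no asbestos in our products now no our in".split())
print(cascade("there is no asbestos in our products now".split(),
              stub_chunker, freq))
# ['there', 'is', ['no', 'asbestos'], ['in', ['our', 'products']], 'now']
```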
34. More example parses
Gold standard (Negra):

tut   die  csu  das   doch          in  bayern   auch  sehr  erfolgreich
does  the  CSU  this  nevertheless  in  Bavaria  also  very  successfully

"Nevertheless, the CSU does this in Bavaria very successfully as well."

Cascaded PRLG – Negra (correct and incorrect brackets are color-coded in the original figure)
35. More example parses
Gold standard (Negra):

bei   den  windsors  bleibt  alles       in  der  familie
with  the  Windsors  stays   everything  in  the  family

"With the Windsors everything stays in the family."

Cascaded PRLG – Negra (correct and incorrect brackets are color-coded in the original figure)
37. What we’ve learned
Unsupervised identification of base NPs and
local constituents is possible
A cascade of chunking models for raw text
parsing has state-of-the-art results
38. Future directions
Improvements to the sequence models
Better phrasal stand-in (pseudoword)
construction
Learning joint models rather than a cascade
39. What’s in the paper
Comparison to Klein & Manning's CCM
Discussion of phrasal punctuation
  the chunkers still do well without punctuation
Analysis of chunking and parsing Chinese
Error analysis
40. Thanks!
Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse
This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship and the Office of Graduate Studies at The University of Texas at Austin.
42. More example parses
two share a house almost devoid of furniture

Gold standard vs. Cascaded PRLG – WSJ (correct and incorrect brackets are color-coded in the original figure)
43. More example parses
what is one to think of all this

Gold standard vs. Cascaded PRLG – WSJ (correct and incorrect brackets are color-coded in the original figure)
44. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (10–40K) and of EM iterations (up to 100); PRLG chunking model: WSJ]
45. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (5–15K) and of EM iterations (up to 150); PRLG chunking model: Negra]
46. Learning curves: Base NPs
[Figure: F-score as a function of training sentences (5–15K) and of EM iterations (up to 100); PRLG chunking model: CTB]
47. What are the models learning?
B      P(w|B)     I        P(w|I)     O      P(w|O)
the    21.0       %        1.8        of     5.8
a       8.7       million  1.6        and    4.0
to      6.5       be       1.3        in     3.7
's      2.8       company  0.9        that   2.2
in      1.9       year     0.8        to     2.1
mr.     1.8       market   0.7        for    2.0
its     1.6       billion  0.6        is     2.0
of      1.4       share    0.5        it     1.7
an      1.4       new      0.5        said   1.7
and     1.4       than     0.5        on     1.5

HMM Emissions: WSJ
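These distributions come from EM's expected counts with additive smoothing on the emissions, as noted on the model slide; a toy sketch with invented counts:

```python
# Add-alpha smoothing of emission probabilities P(w | state).
def smoothed_emissions(counts, vocab, alpha=0.1):
    """counts: {state: {word: expected count}} -> {state: {word: prob}}."""
    probs = {}
    for state, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        probs[state] = {w: (c.get(w, 0) + alpha) / total for w in vocab}
    return probs

counts = {"B": {"the": 8, "a": 2}, "I": {"cat": 5, "hat": 5}}
vocab = ["the", "a", "cat", "hat"]
p = smoothed_emissions(counts, vocab, alpha=1.0)
print(round(p["B"]["the"], 3))         # (8+1)/(10+4) = 9/14 ≈ 0.643
print(round(sum(p["B"].values()), 3))  # each row sums to 1.0
```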
48. What are the models learning?
B     (gloss)  P(w|B)     I            (gloss)     P(w|I)    O      (gloss)  P(w|O)
der    the     13.0       uhr          o'clock     0.8       in     in       3.4
die    the     12.2       juni         June        0.6       und    and      2.7
den    the      4.4       jahren       years       0.4       mit    with     1.7
und    and      3.3       prozent      percent     0.4       für    for      1.6
im     in       3.2       mark         (currency)  0.3       auf    on       1.5
das    the      2.9       stadt        city        0.3       zu     to       1.4
des    the      2.7       000                      0.3       von    of       1.3
dem    the      2.4       millionen    millions    0.3       sich   (refl.)  1.3
eine   a        2.1       jahre        year        0.3       ist    is       1.3
ein    a        2.0       frankfurter  Frankfurt   0.3       nicht  not      1.2

HMM Emissions: Negra
49. What are the models learning?
B     (gloss)   P(w|B)    I    (gloss)        P(w|I)    O     (gloss)  P(w|O)
的    de, of    14.3      的   de             3.9       在    at, in   3.4
一    one        3.1      了   (perf. asp.)   2.2       是    is       2.4
和    and        1.1      个   ge (measure)   1.5       中国  China    1.4
两    two        0.9      年   year           1.3       也    also     1.2
这    this       0.8      说   say            1.0       不    no       1.2
有    have       0.8      中   middle         0.9       对    pair     1.1
经济  economy    0.7      上   on, above      0.9       和    and      1.0
各    each       0.7      人   person         0.7       的    de       1.0
全    all        0.7      大   big            0.7       将    (fut.)   1.0
不    no         0.6      国   country        0.6       有    have     1.0

HMM Emissions: CTB