Simple Unsupervised Grammar Induction from
Raw Text with Cascaded Finite State Models
Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics
The University of Texas at Austin
Association for Computational Linguistics
19–24 June, 2011

Ponvert, Baldridge, Erk (UT Austin)

Simple Unsupervised Grammar Induction

ACL 2011

1 / 34
Why unsupervised parsing?
1 Less reliance on annotated training
Hello!

2 Apply to new languages and domains
Særær man
annær man
mæþæn

Assumptions made in parser learning
Getting these labels right AS WELL AS the structure
of the tree is hard
(S (PP (P on) (NP (N Sunday)))
   (, ,)
   (NP (Det the) (A brown) (N bear))
   (VP (V sleeps)))
Assumptions made in parser learning

So the task is to identify the structure alone

( ((P on) (N Sunday)) (, ,) ((Det the) (A brown) (N bear)) (V sleeps) )
Assumptions made in parser learning
Learning operates from gold-standard parts-of-speech
(POS) rather than raw text
POS input:  P N , Det A N V
Raw text:   on Sunday , the brown bear sleeps

Learning from gold-standard POS:
- Klein & Manning 2003 (CCM)
- Bod 2006a, 2006b
- Klein & Manning 2005 (DMV)
- Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c

Learning from raw text:
- J. Gao et al 2003, 2004
- Seginer 2007
- this work
Unsupervised parsing: desiderata

Raw text
Standard NLP / extensible
Scalable and fast

A new approach: start from the bottom

Unsupervised Partial Parsing =
segmentation of (non-overlapping) multiword constituents

Unsupervised segmentation of constituents
leaves some room for interpretation
Possible segmentations
( the cat ) in ( the hat ) knows ( a lot ) about that
( the cat ) ( in the hat ) knows ( a lot ) ( about that )
( the cat in the hat ) knows ( a lot about that )
( the cat in the hat ) ( knows a lot about that )
( the cat in the hat ) ( knows a lot ) ( about that )

Defining UPP by evaluation
1. Constituent chunks:
non-hierarchical multiword constituents
(S (NP (D The) (N Cat)
       (PP (P in) (NP (D the) (N hat))))
   (VP (V knows)
       (NP (D a) (N lot)
           (PP (P about) (NP (N that))))))
Defining UPP by evaluation
2. Base NPs:
non-recursive noun phrases
(S (NP (D The) (N Cat)
       (PP (P in) (NP (D the) (N hat))))
   (VP (V knows)
       (NP (D a) (N lot)
           (PP (P about) (NP (N that))))))
Multilingual data for direct evaluation

Corpus  Language  Source                  Sentences  Types  Tokens
WSJ     English   Penn Treebank           49K        44K    1M
Negra   German    Negra German Corpus     21K        49K    300K
CTB     Chinese   Penn Chinese Treebank   19K        37K    430K
Constituent chunks and NPs in the data

Corpus  Chunks  NPs   Chunks ∩ NPs
WSJ     203K    172K  161K
Negra   59K     33K   23K
CTB     92K     56K   43K
The benchmark: CCL parser
[Figure: constituency tree for "the cat saw the red dog run", shown with the equivalent set of links between its words]

Common Cover Links representation
Seginer (2007 ACL; 2007 PhD UvA)
Hypothesis

Segmentation can be learned by
generalizing on phrasal boundaries

UPP as a tagging problem
Words:   the  cat  in  the  hat
Tags:    B    I    O   B    I
Chunks:  ( the cat ) in ( the hat )

B Beginning of a constituent
I Inside a constituent
O Not inside a constituent
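
The tag scheme above can be read back into constituent spans mechanically. The following is a minimal sketch of that decoding step; the function names `bio_to_chunks` and `bracket` are hypothetical and not taken from the authors' code.

```python
def bio_to_chunks(tags):
    """Return (start, end) spans, end exclusive, one per B I* run."""
    chunks = []
    start = None  # index where the currently open chunk began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                chunks.append((start, i))
            start = None
        elif tag == "I":
            if start is None:
                raise ValueError(f"I at position {i} does not continue a chunk")
        else:
            raise ValueError(f"unknown tag {tag!r}")
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


def bracket(words, tags):
    """Render the segmentation in the slide's '( the cat ) in ...' notation."""
    spans = bio_to_chunks(tags)
    opens = {s for s, _ in spans}
    closes = {e - 1 for _, e in spans}
    out = []
    for i, word in enumerate(words):
        if i in opens:
            out.append("(")
        out.append(word)
        if i in closes:
            out.append(")")
    return " ".join(out)


words = "the cat in the hat".split()
tags = ["B", "I", "O", "B", "I"]
print(bio_to_chunks(tags))
print(bracket(words, tags))
```

Spans are half-open so that `words[start:end]` yields the chunk directly.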
Learning from boundaries

Tags:   STOP  B    I    O   B    I    STOP
Words:  #     the  cat  in  the  hat  #
Learning from punctuation

Tags:   STOP  B   I       STOP  B    I      I     O       STOP
Words:  #     on  sunday  ,     the  brown  bear  sleeps  #
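
One way to prepare such input is to wrap each sentence in boundary symbols and fix the tag STOP at every boundary and punctuation mark, leaving the remaining tags to be inferred. The sketch below assumes a small hand-picked punctuation set (`PUNCT`) and a hypothetical helper name (`add_boundaries`); neither comes from the talk.

```python
# Assumed set of phrasal punctuation; the talk does not list the exact symbols.
PUNCT = {",", ".", ";", ":", "!", "?"}


def add_boundaries(tokens, punct=PUNCT):
    """Wrap a sentence in '#' and mark where the tag is known to be STOP.

    Returns (symbols, fixed_tags). fixed_tags[i] is "STOP" at sentence
    boundaries and punctuation, and None where the tag (B, I, or O) is
    left for the sequence model to infer.
    """
    symbols = ["#"] + list(tokens) + ["#"]
    fixed = ["STOP"] + ["STOP" if t in punct else None for t in tokens] + ["STOP"]
    return symbols, fixed


symbols, fixed = add_boundaries("on sunday , the brown bear sleeps".split())
for sym, tag in zip(symbols, fixed):
    print(f"{sym:8s} {tag or '?'}")
```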

UPP: Models
Hidden Markov Model (HMM)
Tags:   B    I    O   B    I
Words:  the  cat  in  the  hat
$P(\text{the}, I \mid B) \approx P(I \mid B)\, P(\text{the} \mid B)$

Probabilistic right linear grammar (PRLG)
Example rule:  B → the I
$P(\text{the}, I \mid B) = P(I \mid B)\, P(\text{the} \mid B, I)$

Learning: expectation maximization (EM) via
forward-backward (run to convergence)
Decoding: Viterbi
Smoothing: additive smoothing on emissions
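
To make the decoding and smoothing steps concrete, here is a minimal sketch of Viterbi decoding for the HMM variant with add-alpha smoothing on the emissions. All probabilities, counts, and the value of `alpha` are made-up toy numbers for illustration; they are not the parameters learned in this work.

```python
import math

STATES = ["B", "I", "O"]
NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def make_log_emit(counts, vocab_size, alpha):
    """Build log P(word | state) from counts with add-alpha smoothing."""
    totals = {s: sum(counts[s].values()) for s in STATES}

    def log_emit(state, word):
        c = counts[state].get(word, 0)
        return math.log((c + alpha) / (totals[state] + alpha * vocab_size))

    return log_emit


def viterbi(words, init, trans, log_emit):
    """Most probable tag sequence under the HMM, computed in log space."""
    if not words:
        return []
    # best[i][s]: best log score of any tag path ending in state s at word i
    best = [{s: log(init[s]) + log_emit(s, words[0]) for s in STATES}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + log(trans[p][s]))
            scores[s] = best[-1][prev] + log(trans[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    tags = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return tags[::-1]


# Toy parameters (assumed, for illustration only).
INIT = {"B": 0.6, "I": 0.0, "O": 0.4}
TRANS = {
    "B": {"B": 0.0, "I": 1.0, "O": 0.0},
    "I": {"B": 0.2, "I": 0.3, "O": 0.5},
    "O": {"B": 0.6, "I": 0.0, "O": 0.4},
}
COUNTS = {
    "B": {"the": 8, "a": 2},
    "I": {"cat": 4, "hat": 4, "dog": 2},
    "O": {"in": 5, "knows": 5},
}

log_emit = make_log_emit(COUNTS, vocab_size=7, alpha=0.1)
print(viterbi("the cat in the hat".split(), INIT, TRANS, log_emit))
```

Smoothing only the emissions keeps zero-probability transitions at exactly zero, so any hard constraints on the tag sequence survive decoding.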
UPP: Constraints on sequences
Tags:   STOP  B    I    O   B    I    STOP
Words:  #     the  cat  in  the  hat  #

[Figure: transition diagram over the states STOP, O, B, and I]
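
A sketch of how such constraints could be imposed on a transition table follows. The allowed-transition set is an assumption inferred from the requirement that constituents be multiword (B must be followed by I, and I may only follow B or I); the slide's diagram may encode a different set, and the names `ALLOWED`, `is_valid`, and `mask_transitions` are hypothetical.

```python
TAGS = ["STOP", "B", "I", "O"]

# Assumed constraints: a chunk has at least two words, so B -> I is forced,
# and I can only be reached from B or I.
ALLOWED = {
    "STOP": {"STOP", "B", "O"},
    "B": {"I"},
    "I": {"STOP", "B", "I", "O"},
    "O": {"STOP", "B", "O"},
}


def is_valid(tags):
    """True if every adjacent tag pair is an allowed transition."""
    return all(b in ALLOWED[a] for a, b in zip(tags, tags[1:]))


def mask_transitions(trans):
    """Zero out disallowed transitions and renormalise each row."""
    masked = {}
    for a in TAGS:
        row = {b: (trans[a][b] if b in ALLOWED[a] else 0.0) for b in TAGS}
        total = sum(row.values())
        if total == 0:
            raise ValueError(f"no allowed transition out of {a} has mass")
        masked[a] = {b: p / total for b, p in row.items()}
    return masked


uniform = {a: {b: 0.25 for b in TAGS} for a in TAGS}
masked = mask_transitions(uniform)
print(masked["B"])
print(is_valid(["STOP", "B", "I", "O", "B", "I", "STOP"]))
```

Applying the mask before EM is sufficient because a transition that starts at probability zero receives zero expected count and therefore stays at zero.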

UPP evaluation: Setup

Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both systems are given tokenization, punctuation,
and sentence boundaries
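
For reference, precision, recall, and F-score over matched constituents can be computed as below. This is a minimal sketch that assumes a predicted constituent counts as matched only when its span is identical to a gold span in the same sentence; the function name `prf` and the toy spans are illustrative.

```python
def prf(predicted, gold):
    """Precision, recall, and F1 over sets of (sentence_id, start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans (made up): two of four predictions match two of three gold spans.
gold = {(0, 0, 2), (0, 3, 5), (0, 6, 8)}
pred = {(0, 0, 2), (0, 3, 5), (0, 5, 8), (0, 8, 10)}
p, r, f = prf(pred, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```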

UPP evaluation: Chunking (F-score)
[Bar chart: chunking F-score (axis 0 to 80) on WSJ, Negra, and CTB for three systems: CCL∗, HMM Chunker, PRLG Chunker]

CCL∗: CCL non-hierarchical constituents (first-level parsing output)
UPP evaluation: Base NPs (F-score)
[Bar chart: base NP F-score (axis 0 to 80) on WSJ, Negra, and CTB for three systems: CCL∗, HMM Chunker, PRLG Chunker]

CCL∗: CCL non-hierarchical constituents (first-level parsing output)
UPP: Review

Sequence models can generalize on indicators
for phrasal boundaries
Leads to improved unsupervised segmentation

Question

Are we limited to segmentation?

Hypothesis

Identification of higher level constituents
can also be learned by generalizing on
phrasal boundaries

Cascaded UPP: 1 Segment raw text

Raw text:  there is no asbestos in our products now

[Figure: the same sentence with its first-level chunks marked]
Cascaded UPP: 2 Choose stand-ins for phrases

[Figure: each chunk in "there is no asbestos in our products now" is replaced by one of its own words as a stand-in]
Cascaded UPP: 3 Segment text + phrasal stand-ins

Text with stand-ins:  there is in our now

[Figure: the same sequence with its next-level chunks marked]
Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

[Figure: the chunks found over "there is in our now" are replaced by stand-ins in turn, and segmentation is repeated on the shorter sequence]
Cascaded UPP: 5 Unwind to output tree

[Figure: the stand-ins are expanded back into the chunks they replaced, giving the output tree over "there is no asbestos in our products now"]
Cascaded UPP: Review

Separate models learned at each cascade level
Models share hyper-parameters (smoothing etc)
Choice of pseudowords as phrasal stand-ins
Pseudoword-identification: corpus frequency
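
One cascade step can be sketched as follows. The sketch assumes that the stand-in for a chunk is the chunk's most frequent word in the corpus, which is one reading of "corpus frequency" above; the frequency counts, the chunk spans, and the function names are all made up for illustration.

```python
from collections import Counter


def choose_stand_in(chunk, freq):
    """Pick the chunk word with the highest corpus frequency (leftmost on ties)."""
    return max(chunk, key=lambda w: freq[w])


def collapse(words, spans, freq):
    """Replace each chunk span (start, end) by a single stand-in word.

    Returns a list of (symbol, chunk) pairs, where chunk is the list of
    words the stand-in replaced, or None for a word left as it was.
    """
    ends = dict(spans)  # start index -> exclusive end index
    out, i = [], 0
    while i < len(words):
        if i in ends:
            chunk = words[i:ends[i]]
            out.append((choose_stand_in(chunk, freq), chunk))
            i = ends[i]
        else:
            out.append((words[i], None))
            i += 1
    return out


# Made-up corpus frequencies and first-level chunk spans.
freq = Counter({"there": 20, "is": 50, "no": 10, "asbestos": 1,
                "in": 60, "our": 8, "products": 3, "now": 12})
words = "there is no asbestos in our products now".split()
level1 = collapse(words, [(1, 4), (5, 7)], freq)
print(" ".join(symbol for symbol, _ in level1))
```

Keeping the replaced chunk alongside each stand-in is what allows the final unwinding step: expanding every stand-in recursively recovers the nested bracketing.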

Cascaded UPP: Evaluation
[Bar chart: all-constituent F-score (axis 0 to 60) on WSJ, Negra, and CTB for three systems: CCL, Cascaded HMM, Cascaded PRLG]

Cascade run to convergence
More example parses
Glosses: tut = does, die = the, csu = CSU, das = this, in = in, bayern = Bavaria, doch = nevertheless, auch = also, sehr = very, erfolgreich = successfully
Translation: Nevertheless, the CSU does this in Bavaria very successfully as well

[Figure: gold-standard tree beside the Cascaded PRLG parse (Negra), with predicted constituents marked correct or incorrect]
More example parses
Glosses: bei = with, den = the, windsors = Windsors, bleibt = stays, alles = everything, in = in, der = the, familie = family
Translation: With the Windsors everything stays in the family.

[Figure: gold-standard tree beside the Cascaded PRLG parse (Negra), with predicted constituents marked correct or incorrect]
More example parses

Glosses: überaltern = over-age, anlagenteile = machine parts, immer = ever, mehr = more
Translation: (with) more and more machine parts over-age

[Figure: Cascaded PRLG parse (Negra), with predicted constituents marked correct or incorrect]
What we’ve learned

Unsupervised identification of base NPs and
local constituents is possible
A cascade of chunking models for raw text
parsing has state-of-the-art results

Future directions

Improvements to the sequence models
Better phrasal stand-in (pseudoword)
construction
Learning joint models rather than a cascade

What’s in the paper

Comparison to Klein & Manning’s CCM
Discussion of phrasal punctuation
- the chunkers still do well without punctuation

Analysis of chunking and parsing Chinese
Error analysis

Thanks!

Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse
This work is supported in part by the U.S. Army Research Laboratory and
the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship,
Office of Graduate Studies at The University of Texas at Austin.

Appendices

More example parses
Sentence: two share a house almost devoid of furniture

[Figure: gold-standard tree beside the Cascaded PRLG parse (WSJ), with predicted constituents marked correct or incorrect]
More example parses
Sentence: what is one to think of all this

[Figure: gold-standard tree beside the Cascaded PRLG parse (WSJ), with predicted constituents marked correct or incorrect]
Learning curves: Base NPs
[Figure: learning curves for the PRLG chunking model on WSJ: base NP F-score (axis 20 to 80) as a function of training sentences (10K to 40K) and of EM iterations (0 to 100)]
Learning curves: Base NPs
[Figure: learning curves for the PRLG chunking model on Negra: base NP F-score (axis 10 to 50) as a function of training sentences (5K to 15K) and of EM iterations (0 to 150)]
Learning curves: Base NPs
[Figure: learning curves for the PRLG chunking model on CTB: base NP F-score (axis 0 to 30) as a function of training sentences (5K to 15K) and of EM iterations (0 to 100)]
What are the models learning?
HMM Emissions: WSJ

B      P(w|B)    I         P(w|I)    O      P(w|O)
the    21.0      %         1.8       of     5.8
a      8.7       million   1.6       and    4.0
to     6.5       be        1.3       in     3.7
’s     2.8       company   0.9       that   2.2
in     1.9       year      0.8       to     2.1
mr.    1.8       market    0.7       for    2.0
its    1.6       billion   0.6       is     2.0
of     1.4       share     0.5       it     1.7
an     1.4       new       0.5       said   1.7
and    1.4       than      0.5       on     1.5
What are the models learning?
HMM Emissions: Negra

B             P(w|B)    I                         P(w|I)    O                  P(w|O)
der (the)     13.0      uhr (o’clock)             0.8       in (in)            3.4
die (the)     12.2      juni (June)               0.6       und (and)          2.7
den (the)     4.4       jahren (years)            0.4       mit (with)         1.7
und (and)     3.3       prozent (percent)         0.4       für (for)          1.6
im (in)       3.2       mark (currency)           0.3       auf (on)           1.5
das (the)     2.9       stadt (city)              0.3       zu (to)            1.4
des (the)     2.7       000                       0.3       von (of)           1.3
dem (the)     2.4       millionen (millions)      0.3       sich (oneself)     1.3
eine (a)      2.1       jahre (year)              0.3       ist (is)           1.3
ein (a)       2.0       frankfurter (Frankfurt)   0.3       nicht (not)        1.2
What are the models learning?
HMM Emissions: CTB

B                P(w|B)    I                   P(w|I)    O                 P(w|O)
的 (de, of)      14.3      的 (de)             3.9       在 (at, in)       3.4
一 (one)         3.1       了 (perf. asp.)     2.2       是 (is)           2.4
和 (and)         1.1       个 (ge, measure)    1.5       中国 (China)      1.4
两 (two)         0.9       年 (year)           1.3       也 (also)         1.2
这 (this)        0.8       说 (say)            1.0       不 (no)           1.2
有 (have)        0.8       中 (middle)         0.9       对 (pair)         1.1
经济 (economy)   0.7       上 (on, above)      0.9       和 (and)          1.0
各 (each)        0.7       人 (person)         0.7       的 (de)           1.0
全 (all)         0.7       大 (big)            0.7       将 (fut. tns.)    1.0
不 (no)          0.6       国 (country)        0.6       有 (have)         1.0

Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

  • 1. Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models Elias Ponvert, Jason Baldridge, Katrin Erk Department of Linguistics The University of Texas at Austin Association for Computational Linguistics 19–24 June, 2011 Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34
  • 2. Why unsupervised parsing? (1) Less reliance on annotated training. (2) Apply to new languages and domains. [Illustrations: "Hello!"; "Særær man annær man mæþæn"]
  • 3. Assumptions made in parser learning: Getting these labels right AS WELL AS the structure of the tree is hard. [Figure: labeled constituency tree (S, PP, NP, VP, P, N, Det, A, V) for "on Sunday , the brown bear sleeps"]
  • 4. Assumptions made in parser learning: So the task is to identify the structure alone. [Figure: unlabeled bracketing over the POS-tagged sentence "on Sunday , the brown bear sleeps"]
  • 5. Assumptions made in parser learning: Learning operates from gold-standard parts-of-speech (POS) rather than raw text. Example: the POS sequence "P N , Det A N V" versus the raw text "on Sunday , the brown bear sleeps". Learning from POS: Klein & Manning 2003 (CCM); Bod 2006a, 2006b; Klein & Manning 2005 (DMV); successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c. Learning from raw text: J. Gao et al 2003, 2004; Seginer 2007; this work.
  • 6. Unsupervised parsing: desiderata. Raw text. Standard NLP / extensible. Scalable and fast.
  • 7. A new approach: start from the bottom. Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents.
  • 8. Unsupervised segmentation of constituents leaves some room for interpretation. Possible segmentations: ( the cat ) in ( the hat ) knows ( a lot ) about that; ( the cat ) ( in the hat ) knows ( a lot ) ( about that ); ( the cat in the hat ) knows ( a lot about that ); ( the cat in the hat ) ( knows a lot about that ); ( the cat in the hat ) ( knows a lot ) ( about that ).
  • 9. Defining UPP by evaluation. 1. Constituent chunks: non-hierarchical multiword constituents. [Figure: labeled parse tree of "The cat in the hat knows a lot about that"]
  • 10. Defining UPP by evaluation. 2. Base NPs: non-recursive noun phrases. [Figure: labeled parse tree of "The cat in the hat knows a lot about that"]
  • 11. Multilingual data for direct evaluation. English: WSJ (Penn Treebank), 49K sentences, 44K types, 1M tokens. German: Negra (Negra German Corpus), 21K sentences, 49K types, 300K tokens. Chinese: CTB (Penn Chinese Treebank), 19K sentences, 37K types, 430K tokens.
  • 12. Constituent chunks and NPs in the data. WSJ: Chunks 203K, NPs 172K, Chunks ∩ NPs 161K. Negra: Chunks 59K, NPs 33K, Chunks ∩ NPs 23K. CTB: Chunks 92K, NPs 56K, Chunks ∩ NPs 43K.
  • 13. The benchmark: CCL parser. [Figure: constituency tree and Common Cover Links representation of "the cat saw the red dog run"] Seginer (2007 ACL; 2007 PhD UvA).
  • 14. Hypothesis: Segmentation can be learned by generalizing on phrasal boundaries.
  • 15. UPP as a tagging problem. "the cat in the hat" is tagged B I O B I. B: Beginning of a constituent. I: Inside a constituent. O: Not inside a constituent.
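The tagging view on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `chunks_to_bio` and `bio_to_chunks` and the (start, end) span convention (end exclusive) are assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical convention: (start, end), end exclusive


def chunks_to_bio(n_tokens: int, chunks: List[Span]) -> List[str]:
    """Encode non-overlapping chunks as one B/I/O tag per token."""
    tags = ["O"] * n_tokens
    for start, end in chunks:
        tags[start] = "B"                 # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the chunk
    return tags


def bio_to_chunks(tags: List[str]) -> List[Span]:
    """Recover chunk spans from a B/I/O tag sequence."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            chunks.append((start, i))     # a B or O closes the open chunk
            start = None
        if tag == "B":
            start = i                     # a B opens a new chunk
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


tokens = "the cat in the hat".split()
tags = chunks_to_bio(len(tokens), [(0, 2), (3, 5)])
print(list(zip(tokens, tags)))
```

The two functions are inverses of each other on well-formed input, which is what lets a chunker be trained and evaluated purely as a sequence tagger.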
  • 16. Learning from boundaries. "# the cat in the hat #" is tagged STOP B I O B I STOP, where # marks a sentence boundary.
  • 18. UPP: Models. Example: "the cat in the hat" tagged B I O B I. Hidden Markov Model: P(the, I | B) ≈ P(I | B) P(the | B). Probabilistic right linear grammar (PRLG): P(the, I | B) = P(I | B) P(the | B, I). Learning: expectation maximization (EM) via forward-backward (run to convergence).
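The difference between the two factorizations can be made concrete with a small scoring sketch. Everything here is assumed for illustration: the function names, the dictionary layout of the tables, the toy probabilities, and the simplification that the initial STOP state emits nothing. It does not reproduce the EM training mentioned on the slide.

```python
import math


def hmm_log_prob(words, tags, trans, emit):
    """HMM: each step contributes P(t_next | t) * P(w | t)."""
    states = ["STOP"] + list(tags) + ["STOP"]
    logp = math.log(trans[("STOP", states[1])])
    for i, w in enumerate(words):
        t, t_next = states[i + 1], states[i + 2]
        logp += math.log(emit[(t, w)]) + math.log(trans[(t, t_next)])
    return logp


def prlg_log_prob(words, tags, trans, emit):
    """PRLG: the emission also conditions on the next state, P(w | t, t_next)."""
    states = ["STOP"] + list(tags) + ["STOP"]
    logp = math.log(trans[("STOP", states[1])])
    for i, w in enumerate(words):
        t, t_next = states[i + 1], states[i + 2]
        logp += math.log(emit[(t, t_next, w)]) + math.log(trans[(t, t_next)])
    return logp


# Toy tables (invented numbers, not learned from data).
TRANS = {("STOP", "B"): 0.5, ("B", "I"): 1.0, ("I", "STOP"): 0.4}
HMM_EMIT = {("B", "the"): 0.2, ("I", "cat"): 0.1}
PRLG_EMIT = {("B", "I", "the"): 0.2, ("I", "STOP", "cat"): 0.3}

hmm_score = hmm_log_prob(["the", "cat"], ["B", "I"], TRANS, HMM_EMIT)
prlg_score = prlg_log_prob(["the", "cat"], ["B", "I"], TRANS, PRLG_EMIT)
print(math.exp(hmm_score), math.exp(prlg_score))
```

In this toy setup the PRLG can give "cat" a higher probability when it ends the sentence than the HMM can, because its emission table is indexed by the pair of states rather than by the current state alone.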
  • 19. UPP: Models. Example: "the cat in the hat" tagged B I O B I. Hidden Markov Model: P(the, I | B) ≈ P(I | B) P(the | B). Probabilistic right linear grammar (PRLG): P(the, I | B) = P(I | B) P(the | B, I). Decoding: Viterbi. Smoothing: additive smoothing on emissions.
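A minimal sketch of Viterbi decoding with additively smoothed emissions, for the HMM variant only. The transition table and emission counts below are hand-set toy values rather than parameters learned by EM, and the names `smoothed_emission` and `viterbi`, the smoothing constant `alpha`, and the table layout are all assumptions. The sketch assumes a non-empty sentence.

```python
import math
from collections import Counter


def smoothed_emission(counts, vocab_size, alpha=0.1):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts.values())
    return lambda word: (counts[word] + alpha) / (total + alpha * vocab_size)


def viterbi(words, tags, trans, emit):
    """Return the most probable tag sequence; missing transitions are forbidden."""
    def log_trans(a, b):
        p = trans.get((a, b), 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of tagging words[:i + 1] with last tag t
    best = [{t: log_trans("STOP", t) + math.log(emit[t](words[0])) for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(p, t))
            scores[t] = best[-1][prev] + log_trans(prev, t) + math.log(emit[t](w))
            ptrs[t] = prev
        best.append(scores)
        back.append(ptrs)
    last = max(tags, key=lambda t: best[-1][t] + log_trans(t, "STOP"))
    path = [last]
    for ptrs in reversed(back[1:]):       # follow back-pointers right to left
        path.append(ptrs[path[-1]])
    return path[::-1]


VOCAB = ["the", "a", "cat", "hat", "lot", "in", "knows", "about"]
COUNTS = {
    "B": Counter({"the": 8, "a": 2}),
    "I": Counter({"cat": 4, "hat": 4, "lot": 2}),
    "O": Counter({"in": 5, "knows": 3, "about": 2}),
}
EMIT = {t: smoothed_emission(c, len(VOCAB)) for t, c in COUNTS.items()}
TRANS = {
    ("STOP", "B"): 0.6, ("STOP", "O"): 0.4,
    ("B", "I"): 1.0,
    ("I", "I"): 0.2, ("I", "B"): 0.2, ("I", "O"): 0.3, ("I", "STOP"): 0.3,
    ("O", "B"): 0.5, ("O", "O"): 0.3, ("O", "STOP"): 0.2,
}

decoded = viterbi("the cat in the hat".split(), ["B", "I", "O"], TRANS, EMIT)
print(decoded)
```

Smoothing keeps every emission probability above zero, so unseen word and tag pairs lower a path's score without ruling the path out; only the missing transitions act as hard constraints.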
  • 20. UPP: Constraints on sequences. "# the cat in the hat #" is tagged STOP B I O B I STOP. [Figure: permitted transitions among the states STOP, O, B and I]
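The transition diagram itself did not survive extraction, so the table below is an assumption derived from the tag definitions on the earlier slides: I is inside a constituent, so it can only follow B or I, and constituents are multiword, so B must be followed by I. The names `ALLOWED` and `is_valid` are hypothetical.

```python
# Assumed transition constraints (not copied from the slide's figure).
ALLOWED = {
    "STOP": {"B", "O"},               # a sentence cannot start inside a chunk
    "B": {"I"},                       # chunks are multiword: B must be followed by I
    "I": {"I", "B", "O", "STOP"},
    "O": {"B", "O", "STOP"},          # I cannot follow O
}


def is_valid(tags):
    """Check a tag sequence against ALLOWED, with STOP added at both ends."""
    padded = ["STOP"] + list(tags) + ["STOP"]
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(padded, padded[1:]))


print(is_valid(["B", "I", "O", "B", "I"]))  # well-formed
print(is_valid(["B", "O"]))                 # one-word chunk, rejected
```

In a probabilistic tagger, constraints of this kind would typically be enforced by fixing the corresponding transition probabilities to zero before training.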
  • 21. UPP evaluation: Setup. Evaluation by comparison to treebank data. Standard train / development / test splits. Precision and recall on matched constituents. Benchmark: CCL. Both get tokenization, punctuation, sentence boundaries.
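Precision and recall on matched constituents can be sketched as set overlap between predicted and gold spans. The exact matching criteria used in the paper are not given on the slide; this example assumes exact span match and a hypothetical (sentence_id, start, end) representation.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted spans against gold spans by exact match."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Spans are (sentence_id, start, end) with end exclusive.
gold_spans = [(0, 0, 2), (0, 3, 5)]          # ( the cat ) in ( the hat )
predicted_spans = [(0, 0, 2), (0, 2, 5)]     # ( the cat ) ( in the hat )
p, r, f = precision_recall_f1(predicted_spans, gold_spans)
print(p, r, f)
```

Including the sentence id in each span lets the same function score a whole corpus at once without spans from different sentences colliding.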
  • 22. UPP evaluation: Chunking (F-score). [Bar chart: F-score (axis 0 to 80) on WSJ, Negra and CTB for CCL∗, HMM Chunker and PRLG Chunker] CCL∗: CCL non-hierarchical constituents (first-level parsing output).
  • 23. UPP evaluation: Base NPs (F-score). [Bar chart: F-score (axis 0 to 80) on WSJ, Negra and CTB for CCL∗, HMM Chunker and PRLG Chunker] CCL∗: CCL non-hierarchical constituents (first-level parsing output).
  • 24. UPP: Review. Sequence models can generalize on indicators for phrasal boundaries. Leads to improved unsupervised segmentation.
  • 25. Question: Are we limited to segmentation?
  • 26. Hypothesis: Identification of higher level constituents can also be learned by generalizing on phrasal boundaries.
  • 27. Cascaded UPP: 1. Segment raw text. Example: "there is no asbestos in our products now". [Figure: the sentence with its first-level chunks]
  • 28. Cascaded UPP: 2. Choose stand-ins for phrases. [Figure: each chunk of "there is no asbestos in our products now" replaced by one of its words as a stand-in]
  • 29. Cascaded UPP: 3. Segment text + phrasal stand-ins. Example: "there is in our now". [Figure: chunks over the sequence of words and stand-ins]
  • 30. Cascaded UPP: 4. Choose stand-ins and repeat steps 3–4. [Figure: second-level chunks of "there is in our now" replaced by stand-ins]
  • 31. Cascaded UPP: 5. Unwind to output tree. [Figure: output tree over "there is no asbestos in our products now"]
  • 32. Cascaded UPP: Review. Separate models learned at each cascade level. Models share hyper-parameters (smoothing etc). Choice of pseudowords as phrasal stand-ins. Pseudoword identification: corpus frequency.
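One cascade step (replace each chunk by a stand-in, keep its subtree for unwinding) can be sketched as follows. This is an illustration under stated assumptions: the stand-in is taken to be the chunk's most frequent word, which is one reading of "corpus frequency"; the chunk boundaries and frequency counts are invented; and `collapse` is a hypothetical helper, not a function from the authors' code.

```python
from collections import Counter


def collapse(items, chunks, freq):
    """Replace each chunk by a (stand_in, subtree) pair.

    items:  list of (surface_word, subtree) pairs for one sentence
    chunks: non-overlapping (start, end) spans over items, end exclusive
    freq:   corpus frequency of each word (assumed stand-in criterion)
    """
    ends = dict(chunks)
    out, i = [], 0
    while i < len(items):
        if i in ends:
            group = items[i:ends[i]]
            stand_in = max((w for w, _ in group), key=lambda w: freq[w])
            out.append((stand_in, [tree for _, tree in group]))
            i = ends[i]
        else:
            out.append(items[i])
            i += 1
    return out


# Invented frequencies and chunker output, for illustration only.
FREQ = Counter({"is": 50, "in": 40, "no": 20, "there": 15, "now": 12,
                "our": 10, "products": 3, "asbestos": 1})
sentence = "there is no asbestos in our products now".split()
level0 = [(w, w) for w in sentence]              # leaves are their own subtrees
level1 = collapse(level0, [(1, 4), (5, 7)], FREQ)
level2 = collapse(level1, [(2, 4)], FREQ)

print([w for w, _ in level1])
print([tree for _, tree in level2])              # unwound nested structure
```

Because each stand-in carries the subtree it replaced, unwinding the cascade is just reading off the nested lists at the final level.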
  • 33. Cascaded UPP: Evaluation. [Bar chart: all-constituent F-score (axis 0 to 60) on WSJ, Negra and CTB for CCL, Cascaded HMM and Cascaded PRLG] Cascade run to convergence.
  • 34. More example parses. [Figure: gold standard and Cascaded PRLG (Negra) parses, with correct and incorrect constituents marked] Words: tut (does), die csu (the CSU), das (this), doch (nevertheless), in bayern (in Bavaria), auch (also), sehr erfolgreich (very successfully). Translation: "Nevertheless, the CSU does this in Bavaria very successfully as well".
  • 35. More example parses. [Figure: gold standard and Cascaded PRLG (Negra) parses, with correct and incorrect constituents marked] Words: bei (with), den windsors (the Windsors), bleibt (stays), alles (everything), in (in), der familie (the family). Translation: "With the Windsors everything stays in the family."
  • 36. More example parses. [Figure: Cascaded PRLG (Negra) parse, with correct and incorrect constituents marked] Words: überaltern (over-age), anlagenteile (machine parts), immer (ever), mehr (more). Translation: "(with) more and more machine parts over-age".
  • 37. What we’ve learned. Unsupervised identification of base NPs and local constituents is possible. A cascade of chunking models for raw text parsing has state-of-the-art results.
  • 38. Future directions. Improvements to the sequence models. Better phrasal stand-in (pseudoword) construction. Learning joint models rather than a cascade.
  • 39. What’s in the paper. Comparison to Klein & Manning’s CCM. Discussion of phrasal punctuation: the chunkers still do well without punctuation. Analysis of chunking and parsing Chinese. Error analysis.
  • 40. Thanks! Contact: eponvert@utexas.edu. Code: elias.ponvert.net/upparse. This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.
  • 41. Appendices
  • 42. More example parses. [Figure: gold standard and Cascaded PRLG (WSJ) parses of "two share a house almost devoid of furniture", with correct and incorrect constituents marked]
  • 43. More example parses. [Figure: gold standard and Cascaded PRLG (WSJ) parses of "what is one to think of all this", with correct and incorrect constituents marked]
  • 44. Learning curves: Base NPs. [Plots: F-score against number of training sentences (up to 40K) and EM iterations (up to 100)] PRLG chunking model: WSJ.
  • 45. Learning curves: Base NPs. [Plots: F-score against number of training sentences (up to 15K) and EM iterations (up to 150)] PRLG chunking model: Negra.
  • 46. Learning curves: Base NPs. [Plots: F-score against number of training sentences (up to 15K) and EM iterations (up to 100)] PRLG chunking model: CTB.
  • 47. What are the models learning? HMM Emissions: WSJ (top ten words per tag, with emission probability as listed). P(w|B): the 21.0, a 8.7, to 6.5, ’s 2.8, in 1.9, mr. 1.8, its 1.6, of 1.4, an 1.4, and 1.4. P(w|I): % 1.8, million 1.6, be 1.3, company 0.9, year 0.8, market 0.7, billion 0.6, share 0.5, new 0.5, than 0.5. P(w|O): of 5.8, and 4.0, in 3.7, that 2.2, to 2.1, for 2.0, is 2.0, it 1.7, said 1.7, on 1.5.
  • 48. What are the models learning? HMM Emissions: Negra (top ten words per tag, with gloss and emission probability as listed). P(w|B): der (the) 13.0, die (the) 12.2, den (the) 4.4, und (and) 3.3, im (in) 3.2, das (the) 2.9, des (the) 2.7, dem (the) 2.4, eine (a) 2.1, ein (a) 2.0. P(w|I): uhr (o’clock) 0.8, juni (June) 0.6, jahren (years) 0.4, prozent (percent) 0.4, mark (currency) 0.3, stadt (city) 0.3, 000 0.3, millionen (millions) 0.3, jahre (year) 0.3, frankfurter (Frankfurt) 0.3. P(w|O): in (in) 3.4, und (and) 2.7, mit (with) 1.7, für (for) 1.6, auf (on) 1.5, zu (to) 1.4, von (of) 1.3, sich (such) 1.3, ist (is) 1.3, nicht (not) 1.2.
  • 49. What are the models learning? HMM Emissions: CTB (top ten words per tag, with gloss and emission probability as listed). P(w|B): 的 (de, of) 14.3, 一 (one) 3.1, 和 (and) 1.1, 两 (two) 0.9, 这 (this) 0.8, 有 (have) 0.8, 经济 (economy) 0.7, 各 (each) 0.7, 全 (all) 0.7, 不 (no) 0.6. P(w|I): 的 (de) 3.9, 了 (perf. asp.) 2.2, 个 (ge, measure) 1.5, 年 (year) 1.3, 说 (say) 1.0, 中 (middle) 0.9, 上 (on, above) 0.9, 人 (person) 0.7, 大 (big) 0.7, 国 (country) 0.6. P(w|O): 在 (at, in) 3.4, 是 (is) 2.4, 中国 (China) 1.4, 也 (also) 1.2, 不 (no) 1.2, 对 (pair) 1.1, 和 (and) 1.0, 的 (de) 1.0, 将 (fut. tns.) 1.0, 有 (have) 1.0.