The document describes the DutchSemCor (DSC) project, which aims to create a large-scale sense-tagged Dutch corpus that meets the criteria for word sense disambiguation (WSD) evaluation. The project uses a multi-step approach, beginning with a balanced-sense corpus manually annotated for specific words. This corpus is then expanded with additional contexts using WSD systems and active learning. Automatic tagging is then used to annotate an all-words corpus and to obtain sense distributions across 48 million tokens. The final DSC corpus contains over 1 million sense-tagged tokens from various sources, and is used to evaluate WSD systems and to provide sense-frequency information.
RANLP 2013: DutchSemCor in quest of the ideal corpus
1. DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen piek.vossen@vu.nl
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög a.gorog@vu.nl
2. Outline
Main goal of our project
WSD and annotated corpus
Our approach
Balanced-sense corpus and evaluation
Balanced-context corpus and evaluation
Sense distributions, all words corpus and evaluation
Numbers…
3. Main goal of DSC
Deliver a Dutch corpus enriched with semantic
information:
Senses of the most frequent and most polysemous words
Domains
Named Entities linked with Wikipedia
1 million sense tagged tokens:
250K tagged manually by 2 annotators
750K tagged by 1 annotator / automatically through Active
Learning
4. Current WSD
Insights on Word Sense Disambiguation
1. Evaluation tasks depend on the corpus / lexicon
It seems that the results depend more on the evaluation data than on WSD systems
Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat
Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results
4. Supervised systems beat unsupervised systems
Which are the best corpora for WSD?
What should the ideal corpus for WSD look like?
We:
Define criteria for the ideal sense-tagged corpus
Describe a novel approach for building a large-scale sense-tagged corpus that meets these
criteria (with as little manual effort as possible)
5. Criteria for a corpus
A good corpus for WSD should:
Be balanced for different senses
Equal number of examples for each meaning
Be balanced for different contexts
Different usages of the words
Provide information on sense frequencies (across
domains and genres)
How frequently a word occurs in each of its meanings, in representative data
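The balanced-sense criterion can be checked mechanically; the sketch below, with an invented annotation list and the 25-example target used later in the talk, flags senses that fall short:

```python
# Sketch of checking the balanced-sense criterion: count annotated
# examples per sense and flag senses below a minimum (DSC targets
# 25 examples per sense). The annotation list is illustrative.
from collections import Counter

def underrepresented(annotations, minimum=25):
    """annotations: iterable of sense labels; returns senses below minimum."""
    counts = Counter(annotations)
    return {sense for sense, n in counts.items() if n < minimum}

annotations = ["bank#finance"] * 30 + ["bank#river"] * 10
low = underrepresented(annotations)
# "bank#river" has only 10 examples, under the 25-example target.
```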
6. Annotating a corpus
Sequential tagging (all-words corpus):
Whole text; annotators reconsider meanings
Small numbers of texts, genres, domains and senses
Yields sense distributions
Example: SemCor
Targeted tagging (lexical-sample corpus):
KWIC; repeated contexts
Usually a large number of contexts and senses
Yields balanced sense / balanced context
Examples: Line-hard-serve, DSO
8. Our main approach
1. Annotated corpus that represents ALL the meanings of an existing
lexicon
Balanced sense
Manual
2. Train WSD systems using the annotated corpus
Will be trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of
contexts
Balanced-context
Manual + WSD
4. Annotate the full raw corpus
Sense distributions
WSD
5. Evaluation of the annotations for the 3 criteria
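The five steps above can be sketched as one function; all helper names (annotate_balanced_sense, train_wsd, extend_balanced_context, evaluate) are hypothetical stand-ins for the manual and automatic stages described on this slide:

```python
# High-level sketch of the DSC pipeline, with hypothetical helpers
# standing in for each manual or automatic stage.
def build_dsc(lexicon, raw_corpus,
              annotate_balanced_sense, train_wsd,
              extend_balanced_context, evaluate):
    # 1. Manual corpus covering all meanings of the lexicon (balanced sense).
    corpus = annotate_balanced_sense(lexicon, raw_corpus)
    # 2. Train WSD systems on all senses.
    wsd = train_wsd(corpus)
    # 3. Extend with a wider representation of contexts (manual + WSD).
    corpus = extend_balanced_context(corpus, wsd, raw_corpus)
    # 4. Annotate the full raw corpus to obtain sense distributions.
    distributions = train_wsd(corpus)(raw_corpus)
    # 5. Evaluate the annotations against the three criteria.
    return corpus, distributions, evaluate(corpus, distributions)
```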
9. Resources
Cornetto database
Lexical semantic database for Dutch
Structure and content of WN + FrameNet-like
data
SoNaR (500M tokens)
Dutch corpus with a wide range of genres and topics
34 categories (discussion lists, books, chats, autocues, …)
CGN (9M tokens)
Transcribed spontaneous Dutch adult speech
Internet
10. WSD systems
DSC-timbl
Memory-based learning classifier
Supervised k-nearest neighbour
DSC-SVM
Linear classifier / Support Vector Machines
Binary classifiers (one vs. all)
DSC-UKB
Knowledge-based system
Personalized PageRank algorithm
Synsets are nodes, relations are edges
Context words inject mass into word senses
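A toy sketch of the personalized-PageRank idea behind DSC-UKB: synsets are nodes, relations are edges, and the senses tied to the context words receive the restart mass. The graph and sense labels below are invented; the real system walks the Cornetto database:

```python
# Minimal personalized PageRank over a toy synset graph. Restart mass
# is injected at the seed nodes (here, synsets of the context words),
# so senses well connected to the context accumulate more rank.
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """graph: node -> list of neighbour nodes; seeds: restart nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * restart[n] + damping * incoming
        rank = new
    return rank

# Invented graph: two senses of "bank" plus related synsets.
graph = {
    "bank#finance": ["money", "loan"],
    "bank#river":   ["water", "shore"],
    "money": ["bank#finance", "loan"],
    "loan":  ["bank#finance", "money"],
    "water": ["bank#river", "shore"],
    "shore": ["bank#river", "water"],
}
# Context words "money"/"loan" inject mass; the financial sense should win.
rank = personalized_pagerank(graph, seeds={"money", "loan"})
best = max(["bank#finance", "bank#river"], key=rank.get)
```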
11. Balanced-sense corpus
2,870 most polysemous and frequent words (11,982 meanings; average polysemy 3)
Annotated by student assistants over 2 years
SAT tool and Web-snippets tool
Targets: 80% agreement, 25 examples per sense
282,503 tokens double-annotated
80% of senses have more than 25 examples
90% of lemmas have 25 examples for each sense
Source distribution: 67% SoNaR, 5% CGN, 28% web
14. WSD from balanced sense
5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
Optimized for annotating SoNaR
Specific features (word_id)
Overall accuracy for nouns: 82.76
Results used to further annotate weakly performing senses
Active Learning approach:
Select the 82 lemmas performing under 80%
3 rounds of annotation until reaching 81.62%
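The active-learning rounds can be sketched as a loop; train, evaluate and annotate_more below are hypothetical stand-ins for 5-fold cross-validation and manual annotation, and the toy accuracy curve is invented:

```python
# Sketch of the active-learning loop: retrain, find lemmas scoring
# under the threshold, add freshly annotated examples for those
# lemmas, and repeat for a fixed number of rounds.
def active_learning(corpus, lemmas, train, evaluate, annotate_more,
                    threshold=0.80, rounds=3):
    for _ in range(rounds):
        model = train(corpus)
        weak = [l for l in lemmas if evaluate(model, l) < threshold]
        if not weak:
            break
        # Annotators add fresh examples only for the weak lemmas.
        corpus = corpus + annotate_more(weak)
    return train(corpus)

# Toy demo: accuracy for "bank" improves as examples are added.
examples = ["s1", "s2"]                       # annotated sentences
def train(corpus):                            # "model" = corpus size here
    return len(corpus)
def evaluate(model, lemma):                   # invented accuracy curve
    base = {"bank": 0.55, "spring": 0.90}[lemma]
    return min(base + 0.05 * model, 1.0)
def annotate_more(weak):                      # pretend annotators add 2 each
    return ["new"] * (2 * len(weak))
model = active_learning(examples, ["bank", "spring"], train, evaluate, annotate_more)
```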
17. Balanced context
Goal: cover as many different contexts as the whole corpus contains, obtain a good WSD system, and improve problematic cases
Select all words performing under 80%
Annotate the whole corpus with the (optimized) Timbl WSD system
Select 50 new tokens for the senses of words under 80%, each from a different context
High confidence
Low / high distance to the nearest neighbour
Manually annotate these 50 tokens
Completely different from the first phase, where annotators could choose the examples
Annotators now face lemmatization errors, PoS errors, and figurative, idiomatic or unknown senses
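The distance-based candidate selection can be sketched for a memory-based (k-NN) setup; the feature vectors and the Euclidean metric below are illustrative stand-ins for Timbl's features and metric:

```python
# Sketch of distance-based candidate selection: for each unlabelled
# token, record the predicted sense and the distance to its nearest
# labelled neighbour, then pick the closest (low-distance,
# high-confidence) candidates per sense.
import math

def nearest(labelled, vec):
    """Return (distance, sense) of the closest labelled example."""
    return min((math.dist(v, vec), sense) for v, sense in labelled)

def select_candidates(labelled, unlabelled, per_sense=2):
    """Group unlabelled vectors by predicted sense, closest first."""
    by_sense = {}
    for vec in unlabelled:
        dist, sense = nearest(labelled, vec)
        by_sense.setdefault(sense, []).append((dist, vec))
    return {s: [v for _, v in sorted(cands)[:per_sense]]
            for s, cands in by_sense.items()}

# Invented 2-dimensional feature vectors for two senses of "bank".
labelled = [((0.0, 0.0), "bank#finance"), ((5.0, 5.0), "bank#river")]
unlabelled = [(0.5, 0.2), (4.8, 5.1), (2.4, 2.4), (0.1, 0.1)]
picked = select_candidates(labelled, unlabelled, per_sense=2)
```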
18. Evaluating the Balanced-sense
and new annotations
Type                               Accuracy   # examples
Balanced Sense (BS)                81.62      8,641
BS + LowD                          78.81      13,266
BS + LowD_agreed                   85.02      11,405
BS + HighD                         76.24      19,055
BS + HighD_agreed                  83.77      13,359
BS + LowD_agreed + HighD_agreed    85.33      16,123
• DSC-timbl, 5-FCV (folds incremented with the new data), 82 lemmas
• Better results when using the agreed data
• High/low distance does not make a big difference
19. Evaluation balanced-context
5-FCV using the agreed new instances
Majority voting performs best
System Nouns Verbs Adjs
DSC-timbl 83.97 83.44 78.64
DSC-svm 82.69 84.93 79.03
DSC-ukb 73.04 55.84 56.36
Voting 88.65 87.60 83.06
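The voting combination can be sketched as a simple majority over the three systems' outputs; tie-breaking by system order is an assumption, as the slide does not specify it:

```python
# Minimal majority-voting combiner for the three WSD systems: each
# system proposes a sense and the most common answer wins, with ties
# broken by system order. Sense labels are illustrative.
from collections import Counter

def vote(predictions):
    """predictions: list of sense labels, one per system, in system order."""
    counts = Counter(predictions)
    best = max(counts.values())
    # First prediction (in system order) reaching the top count wins ties.
    return next(p for p in predictions if counts[p] == best)

winner = vote(["bank#finance", "bank#finance", "bank#river"])
```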
20. Evaluating representativeness
Our manually annotated corpus is probably skewed towards the balanced-sense setting
We need to test the performance of our WSD systems on the rest of SoNaR
Random evaluation:
Accuracy ranges (90-100, 80-90, 70-80, 60-70)
5 nouns, 5 verbs and 3 adjectives per range: 52 lemmas in total
100 tokens for each lemma, automatically tagged and manually validated
21. Evaluating representativeness
Results are lower than in the previous evaluations
Reflects the difference between representing the lexicon (balanced sense) and representing the corpus
Results comparable to state-of-the-art English Senseval/SemEval results
System Nouns Verbs Adjs
DSC-timbl 54.25 48.25 46.50
DSC-svm 64.10 52.20 52.00
DSC-ukb 49.37 44.15 38.13
Voting 60.70 53.95 50.83
22. Obtaining sense distributions
Approach:
Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
Assume that the automatic annotation still reflects the real distribution
Evaluate this frequency distribution (Most Frequent Sense)
How can this MFS approach be evaluated?
Manual annotations: 25 examples per sense, so no sense distribution
Random evaluation corpus: only a small selection of words (52 lemmas)
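Deriving the MFS predictor from the automatic annotation amounts to counting senses per lemma; the tagged tokens below are invented stand-ins for the 48M automatically tagged tokens:

```python
# Sketch of a most-frequent-sense (MFS) predictor: count how often
# each sense was assigned per lemma in the automatic annotation,
# then always predict the top sense for that lemma.
from collections import Counter, defaultdict

def build_mfs(tagged_tokens):
    """tagged_tokens: iterable of (lemma, sense) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

tagged = [("bank", "bank#finance")] * 3 + [("bank", "bank#river")]
mfs = build_mfs(tagged)
# mfs["bank"] -> "bank#finance"
```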
23. Obtaining sense distributions
An all-words corpus was created
Completely independent texts from Lassy
Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
23,907 tokens, covering 1,527 of our set of lemmas (53%)
Evaluation of:
The 3 WSD systems
First-sense baseline according to Cornetto
Random-sense baseline
Most Frequent Sense (sense distributions obtained from the automatic annotation)
24. Obtaining sense distributions
MFS in Dutch performs similarly to MFS in English
MFS beats the first-sense and random-sense baselines
The automatically derived MFS is a good predictor
System Nouns Verbs Adjs
1st sense 53.17 32.84 52.17
Random sense 29.52 24.99 32.16
MFS 61.20 50.76 54.62
DSC-timbl 55.76 37.96 49.00
DSC-svm 64.58 45.81 55.70
DSC-ukb 56.81 31.37 35.93
Voting 66.09 45.68 52.24
25. Numbers of DSC
Balanced-sense annotated corpus
274,344 tokens
2,874 lemmas
Annotated by 2 annotators, 90% IAA
Balanced-context annotated corpus
132,666 tokens
1,133 lemmas
Manually annotated by 1 annotator,
agreeing with the WSD output in 44% of cases
Random evaluation corpus
5,200 tokens
52 lemmas
All words corpus
23,907 tokens
1,527 lemmas
3 WSD systems for Dutch
DSC-timbl
DSC-svm
DSC-ukb
Automatic annotations by the 3 WSD systems
Sense distributions
48 million tokens with confidence scores
… and more…
800,000 semantic relations between senses
extracted from manual annotations
28,080 sense groups
Improved version of Cornetto
SAT annotation tool
Web search tool
Statistics on figurative, idiomatic and
collocational usage of words
…