The document describes the DutchSemCor (DSC) project, which aims to create a large-scale sense-tagged Dutch corpus that meets the criteria for word sense disambiguation (WSD) evaluation. The project uses a multi-step approach, beginning with a balanced-sense corpus manually annotated for specific words. This corpus is then expanded with additional contexts using WSD systems and active learning. Automatic tagging is then used to annotate an all-words corpus and to obtain sense distributions across 48 million tokens. The final DSC corpus contains over 1 million sense-tagged tokens from various sources, and is used to evaluate WSD systems and to provide sense-frequency information.
RANLP 2013: DutchSemCor in quest of the ideal corpus
1. DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen piek.vossen@vu.nl
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög a.gorog@vu.nl
2. Outline
Main goal of our project
WSD and annotated corpus
Our approach
Balanced-sense corpus and evaluation
Balanced-context corpus and evaluation
Sense distributions, all words corpus and evaluation
Numbers…
3. Main goal of DSC
Deliver a Dutch corpus enriched with semantic
information:
Senses of the most frequent and most polysemous words
Domains
Named Entities linked with Wikipedia
1 million sense tagged tokens:
250K tagged manually by 2 annotators
750K tagged by 1 annotator / automatically through Active
Learning
4. Current WSD
Insights on Word Sense Disambiguation
1. Evaluation tasks depend on the corpus / lexicon
It seems that the results depend more on the evaluation data than on WSD systems
Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat
Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results
4. Supervised systems beat unsupervised systems
Which are the best corpora for WSD?
What should the ideal corpus for WSD look like?
We:
Define criteria for the ideal sense-tagged corpus
Describe a novel approach for building a large-scale sense-tagged corpus that meets these
criteria (with as little manual effort as possible)
5. Criteria for a corpus
A good corpus for WSD should:
Be balanced for different senses
Equal number of examples for each meaning
Be balanced for different contexts
Different usages of the words
Provide information on sense frequencies (across
domains and genres)
How frequently a word occurs in each of its meanings, in representative data
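The balanced-sense criterion can be checked mechanically; the sketch below, with an invented annotation list and the 25-example target used later in the talk, flags senses that fall short:

```python
# Sketch of checking the balanced-sense criterion: count annotated
# examples per sense and flag senses below a minimum (DSC targets
# 25 examples per sense). The annotation list is illustrative.
from collections import Counter

def underrepresented(annotations, minimum=25):
    """annotations: iterable of sense labels; returns senses below minimum."""
    counts = Counter(annotations)
    return {sense for sense, n in counts.items() if n < minimum}

annotations = ["bank#finance"] * 30 + ["bank#river"] * 10
low = underrepresented(annotations)
# "bank#river" has only 10 examples, under the 25-example target.
```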
6. Annotating a corpus
Sequential tagging (all-words corpus):
Whole text; annotators reconsider meanings
Small numbers of texts, genres, domains and senses
Yields sense distributions
Example: SemCor
Targeted tagging (lexical-sample corpus):
KWIC; repeated contexts
Usually a large number of contexts and senses
Yields balanced sense / balanced context
Examples: Line-hard-serve, DSO
8. Our main approach
1. Annotated corpus that represents ALL the meanings of an existing
lexicon
Balanced sense
Manual
2. Train WSD systems using the annotated corpus
Will be trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of
contexts
Balanced-context
Manual + WSD
4. Annotate the full raw corpus
Sense distributions
WSD
5. Evaluation of the annotations for the 3 criteria
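The five steps above can be sketched as one function; all helper names (annotate_balanced_sense, train_wsd, extend_balanced_context, evaluate) are hypothetical stand-ins for the manual and automatic stages described on this slide:

```python
# High-level sketch of the DSC pipeline, with hypothetical helpers
# standing in for each manual or automatic stage.
def build_dsc(lexicon, raw_corpus,
              annotate_balanced_sense, train_wsd,
              extend_balanced_context, evaluate):
    # 1. Manual corpus covering all meanings of the lexicon (balanced sense).
    corpus = annotate_balanced_sense(lexicon, raw_corpus)
    # 2. Train WSD systems on all senses.
    wsd = train_wsd(corpus)
    # 3. Extend with a wider representation of contexts (manual + WSD).
    corpus = extend_balanced_context(corpus, wsd, raw_corpus)
    # 4. Annotate the full raw corpus to obtain sense distributions.
    distributions = train_wsd(corpus)(raw_corpus)
    # 5. Evaluate the annotations against the three criteria.
    return corpus, distributions, evaluate(corpus, distributions)
```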
9. Resources
Cornetto database
Lexical semantic database for Dutch
Structure and content of WN + FrameNet-like
data
SoNaR (500M tokens)
Dutch corpus with a wide range of genres and topics
34 categories (discussion lists, books, chats, autocues, …)
CGN (9M tokens)
Transcribed spontaneous Dutch adult speech
Internet
10. WSD systems
DSC-timbl
Memory-based learning classifier
Supervised k-nearest neighbour
DSC-SVM
Linear classifier / Support Vector Machines
Binary classifiers (one vs. all)
DSC-UKB
Knowledge-based system
Personalized PageRank algorithm
Synsets are nodes, relations are edges
Context words inject mass into word senses
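A toy sketch of the personalized-PageRank idea behind DSC-UKB: synsets are nodes, relations are edges, and the senses tied to the context words receive the restart mass. The graph and sense labels below are invented; the real system walks the Cornetto database:

```python
# Minimal personalized PageRank over a toy synset graph. Restart mass
# is injected at the seed nodes (here, synsets of the context words),
# so senses well connected to the context accumulate more rank.
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """graph: node -> list of neighbour nodes; seeds: restart nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * restart[n] + damping * incoming
        rank = new
    return rank

# Invented graph: two senses of "bank" plus related synsets.
graph = {
    "bank#finance": ["money", "loan"],
    "bank#river":   ["water", "shore"],
    "money": ["bank#finance", "loan"],
    "loan":  ["bank#finance", "money"],
    "water": ["bank#river", "shore"],
    "shore": ["bank#river", "water"],
}
# Context words "money"/"loan" inject mass; the financial sense should win.
rank = personalized_pagerank(graph, seeds={"money", "loan"})
best = max(["bank#finance", "bank#river"], key=rank.get)
```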
11. Balanced-sense corpus
2,870 most polysemous and frequent words (11,982 meanings; average polysemy 3)
Annotated by student assistants over 2 years
SAT tool and Web-snippets tool
Targets: 80% agreement, 25 examples per sense
282,503 tokens double-annotated
80% of senses have more than 25 examples
90% of lemmas have 25 examples for each sense
Source distribution: 67% SoNaR, 5% CGN, 28% web
14. WSD from balanced sense
5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
Optimized for annotating SoNaR
Specific features (word_id)
Overall accuracy for nouns: 82.76
Results used to further annotate weakly performing senses
Active Learning approach:
Select the 82 lemmas performing under 80%
3 rounds of annotation until reaching 81.62%
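The active-learning rounds can be sketched as a loop; train, evaluate and annotate_more below are hypothetical stand-ins for 5-fold cross-validation and manual annotation, and the toy accuracy curve is invented:

```python
# Sketch of the active-learning loop: retrain, find lemmas scoring
# under the threshold, add freshly annotated examples for those
# lemmas, and repeat for a fixed number of rounds.
def active_learning(corpus, lemmas, train, evaluate, annotate_more,
                    threshold=0.80, rounds=3):
    for _ in range(rounds):
        model = train(corpus)
        weak = [l for l in lemmas if evaluate(model, l) < threshold]
        if not weak:
            break
        # Annotators add fresh examples only for the weak lemmas.
        corpus = corpus + annotate_more(weak)
    return train(corpus)

# Toy demo: accuracy for "bank" improves as examples are added.
examples = ["s1", "s2"]                       # annotated sentences
def train(corpus):                            # "model" = corpus size here
    return len(corpus)
def evaluate(model, lemma):                   # invented accuracy curve
    base = {"bank": 0.55, "spring": 0.90}[lemma]
    return min(base + 0.05 * model, 1.0)
def annotate_more(weak):                      # pretend annotators add 2 each
    return ["new"] * (2 * len(weak))
model = active_learning(examples, ["bank", "spring"], train, evaluate, annotate_more)
```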
17. Balanced context
Goal: cover as many different contexts as the whole corpus contains, obtain a good WSD system, and improve problematic cases
Select all words performing under 80%
Annotate the whole corpus with the (optimized) Timbl WSD system
Select 50 new tokens for the senses of words under 80%, each from a different context
High confidence
Low / high distance to the nearest neighbour
Manually annotate these 50 tokens
Completely different from the first phase, where annotators could choose the examples
Annotators now face lemmatization errors, PoS errors, and figurative, idiomatic or unknown senses
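The distance-based candidate selection can be sketched for a memory-based (k-NN) setup; the feature vectors and the Euclidean metric below are illustrative stand-ins for Timbl's features and metric:

```python
# Sketch of distance-based candidate selection: for each unlabelled
# token, record the predicted sense and the distance to its nearest
# labelled neighbour, then pick the closest (low-distance,
# high-confidence) candidates per sense.
import math

def nearest(labelled, vec):
    """Return (distance, sense) of the closest labelled example."""
    return min((math.dist(v, vec), sense) for v, sense in labelled)

def select_candidates(labelled, unlabelled, per_sense=2):
    """Group unlabelled vectors by predicted sense, closest first."""
    by_sense = {}
    for vec in unlabelled:
        dist, sense = nearest(labelled, vec)
        by_sense.setdefault(sense, []).append((dist, vec))
    return {s: [v for _, v in sorted(cands)[:per_sense]]
            for s, cands in by_sense.items()}

# Invented 2-dimensional feature vectors for two senses of "bank".
labelled = [((0.0, 0.0), "bank#finance"), ((5.0, 5.0), "bank#river")]
unlabelled = [(0.5, 0.2), (4.8, 5.1), (2.4, 2.4), (0.1, 0.1)]
picked = select_candidates(labelled, unlabelled, per_sense=2)
```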
18. Evaluating the Balanced-sense
and new annotations
Type                               Accuracy   # examples
Balanced Sense (BS)                81.62      8,641
BS + LowD                          78.81      13,266
BS + LowD_agreed                   85.02      11,405
BS + HighD                         76.24      19,055
BS + HighD_agreed                  83.77      13,359
BS + LowD_agreed + HighD_agreed    85.33      16,123
• DSC-timbl, 5-FCV (folds incremented with the new data), 82 lemmas
• Better results when using the agreed data
• High/low distance does not make a big difference
19. Evaluation balanced-context
5-FCV using the agreed new instances
Majority voting performs best
System Nouns Verbs Adjs
DSC-timbl 83.97 83.44 78.64
DSC-svm 82.69 84.93 79.03
DSC-ukb 73.04 55.84 56.36
Voting 88.65 87.60 83.06
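The voting combination can be sketched as a simple majority over the three systems' outputs; tie-breaking by system order is an assumption, as the slide does not specify it:

```python
# Minimal majority-voting combiner for the three WSD systems: each
# system proposes a sense and the most common answer wins, with ties
# broken by system order. Sense labels are illustrative.
from collections import Counter

def vote(predictions):
    """predictions: list of sense labels, one per system, in system order."""
    counts = Counter(predictions)
    best = max(counts.values())
    # First prediction (in system order) reaching the top count wins ties.
    return next(p for p in predictions if counts[p] == best)

winner = vote(["bank#finance", "bank#finance", "bank#river"])
```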
20. Evaluating representativeness
Our manually annotated corpus is probably skewed towards the balanced-sense setting
We need to test the performance of our WSD systems on the rest of SoNaR
Random evaluation:
Accuracy ranges (90-100, 80-90, 70-80, 60-70)
5 nouns, 5 verbs and 3 adjectives per range: 52 lemmas in total
100 tokens for each lemma, automatically tagged and manually validated
21. Evaluating representativeness
Results are lower than in the previous evaluations
Reflects the difference between representing the lexicon (balanced sense) and representing the corpus
Results comparable to state-of-the-art English Senseval/SemEval results
System Nouns Verbs Adjs
DSC-timbl 54.25 48.25 46.50
DSC-svm 64.10 52.20 52.00
DSC-ukb 49.37 44.15 38.13
Voting 60.70 53.95 50.83
22. Obtaining sense distributions
Approach:
Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
Assume that the automatic annotation still reflects the real distribution
Evaluate this frequency distribution (Most Frequent Sense)
How can this MFS approach be evaluated?
Manual annotations: 25 examples per sense, so no sense distribution
Random evaluation corpus: only a small selection of words (52 lemmas)
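Deriving the MFS predictor from the automatic annotation amounts to counting senses per lemma; the tagged tokens below are invented stand-ins for the 48M automatically tagged tokens:

```python
# Sketch of a most-frequent-sense (MFS) predictor: count how often
# each sense was assigned per lemma in the automatic annotation,
# then always predict the top sense for that lemma.
from collections import Counter, defaultdict

def build_mfs(tagged_tokens):
    """tagged_tokens: iterable of (lemma, sense) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

tagged = [("bank", "bank#finance")] * 3 + [("bank", "bank#river")]
mfs = build_mfs(tagged)
# mfs["bank"] -> "bank#finance"
```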
23. Obtaining sense distributions
An all-words corpus was created
Completely independent texts from Lassy
Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
23,907 tokens, covering 1,527 of our set of lemmas (53%)
Evaluation of:
The 3 WSD systems
First-sense baseline according to Cornetto
Random-sense baseline
Most Frequent Sense (sense distributions obtained from the automatic annotation)
24. Obtaining sense distributions
MFS in Dutch performs similarly to MFS in English
MFS beats the first-sense and random-sense baselines
The automatically derived MFS is a good predictor
System Nouns Verbs Adjs
1st sense 53.17 32.84 52.17
Random sense 29.52 24.99 32.16
MFS 61.20 50.76 54.62
DSC-timbl 55.76 37.96 49.00
DSC-svm 64.58 45.81 55.70
DSC-ukb 56.81 31.37 35.93
Voting 66.09 45.68 52.24
25. Numbers of DSC
Balanced-sense annotated corpus
274,344 tokens
2,874 lemmas
Annotated by 2 annotators, 90% IAA
Balanced-context annotated corpus
132,666 tokens
1,133 lemmas
Manually annotated by 1 annotator,
agreeing with the WSD output in 44% of cases
Random evaluation corpus
5,200 tokens
52 lemmas
All words corpus
23,907 tokens
1,527 lemmas
3 WSD systems for Dutch
DSC-timbl
DSC-svm
DSC-ukb
Automatic annotations by the 3 WSD systems
Sense distributions
48 million tokens with confidence scores
… and more…
800,000 semantic relations between senses
extracted from manual annotations
28,080 sense groups
Improved version of Cornetto
SAT annotation tool
Web search tool
Statistics on figurative, idiomatic and
collocational usage of words
…