DutchSemCor:
in Quest of the Ideal
Sense Tagged Corpus
Piek Vossen piek.vossen@vu.nl
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög a.gorog@vu.nl
Outline
 Main goal of our project
 WSD and annotated corpus
 Our approach
 Balanced-sense corpus and evaluation
 Balanced-context corpus and evaluation
 Sense distributions, all words corpus and evaluation
 Numbers…
Main goal of DSC
 Deliver a Dutch corpus enriched with semantic
information:
 Senses of the most frequent and most polysemous words
 Domains
 Named Entities linked with Wikipedia
 1 million sense tagged tokens:
 250K tagged manually by 2 annotators
 750K tagged by 1 annotator / automatically through Active
Learning
Current WSD
 Insights on Word Sense Disambiguation
1. Evaluation tasks depend on the corpus / lexicon
 Results seem to depend more on the evaluation data than on the WSD systems
 Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat
 Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results
4. Supervised systems beat unsupervised systems
 Which are the best corpora for WSD?
 What should the ideal corpus for WSD look like? (our question)
 Define criteria for the ideal sense-tagged corpus
 Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)
Criteria for a corpus
 A good corpus for WSD should:
 Be balanced for different senses
 Equal number of examples for each meaning
 Be balanced for different contexts
 Different usages of the words
 Provide information on sense frequencies (across domains and genres)
 Frequency of each word sense in a representative corpus
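The sense-coverage criterion can be checked mechanically on any sense-tagged corpus. A minimal sketch, assuming one (lemma, sense) pair per tagged token; the function name, sense labels, and 25-example threshold default are illustrative (the threshold itself comes from the project's own target):

```python
from collections import Counter

def under_annotated(annotations, min_examples=25):
    """Sense-coverage check: return the (lemma, sense) pairs that do
    not yet reach the minimum number of tagged examples.
    `annotations` is one (lemma, sense) pair per tagged token."""
    counts = Counter(annotations)
    return {pair for pair, n in counts.items() if n < min_examples}

# Toy corpus: only one sense of 'band' reaches the 25-example target.
toy = [("band", "band#1")] * 30 + [("band", "band#2")] * 10
todo = under_annotated(toy)
```

Running such a check after each annotation round tells the annotators which senses still need examples.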
Annotating a corpus
 Sequential tagging → All-Words corpus
 Whole text
 Reconsider meanings
 Small numbers of texts, genres, domains and senses
 Sense distributions
 e.g. SemCor
 Targeted tagging → Lexical Sample corpus (balanced sense / balanced context)
 KWIC
 Repeated contexts
 Usually large number of contexts and senses
 e.g. line-hard-serve, DSO
Annotating a corpus

                   Sense distribution   Sense coverage   Context diversity
All words                  ✔                  ✖                  ✖
Balanced-sense             ✖                  ✔                  ✖
Balanced-context           ✖                  ✖                  ✔
Our main approach
1. Annotated corpus that represents ALL the meanings of an existing
lexicon
 Balanced sense
 Manual
2. Train WSD systems using the annotated corpus
 Will be trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of
contexts
 Balanced-context
 Manual + WSD
4. Annotate the full raw corpus
 Sense distributions
 WSD
5. Evaluation of the annotations for the 3 criteria
Resources
 Cornetto database
 Lexical semantic database for Dutch
 Structure and content of WN + FrameNet-like
data
 SoNaR (500M tokens)
 Dutch corpus with a wide range of genres and topics
 34 categories: discussion lists, books, chats, autocues…
 CGN (9M tokens)
 Transcribed spontaneous Dutch adult speech
 Internet
WSD systems
 DSC-timbl
 Memory-based learning classifier
 Supervised k-nearest neighbour
 DSC-SVM
 Linear classifier / Support Vector Machines
 Binary classifiers, 1-vs-all
 DSC-UKB
 Knowledge-based system
 Personalized PageRank algorithm
 Synsets → nodes, relations → edges
 Context words inject mass into word senses
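The core idea behind DSC-UKB, synsets as nodes, semantic relations as edges, and context words injecting probability mass into their senses, can be sketched with a plain power-iteration personalized PageRank. This is a toy graph with invented sense labels, not the Cornetto graph or the actual UKB implementation:

```python
def personalized_pagerank(edges, seeds, damping=0.85, iters=50):
    """Personalized PageRank by power iteration on an undirected graph.
    `edges`: (u, v) pairs; `seeds`: nodes that receive the teleport
    mass (here, the senses of the context words)."""
    nodes = sorted({n for e in edges for n in e})
    nbrs = {n: [] for n in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    tele = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(tele)
    for _ in range(iters):
        rank = {n: (1 - damping) * tele[n]
                   + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
                for n in nodes}
    return rank

# Two candidate senses of one target word; the context senses pull
# mass towards the sense they are connected to.
edges = [("bank#river", "water#1"), ("bank#money", "loan#1"),
         ("water#1", "river#1")]
rank = personalized_pagerank(edges, seeds={"water#1", "river#1"})
best = max(["bank#river", "bank#money"], key=rank.get)
```

The candidate sense that ends up with the most mass is chosen, which is how context words disambiguate the target.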
Balanced-sense corpus
 2,870 most polysemous and frequent words (11,982 meanings, average polysemy 3)
 Annotated by student assistants over 2 years
 SAT tool and Web-snippets tool
 80% agreement, 25 examples per sense
 282,503 tokens double-annotated
 80% of senses with more than 25 examples
 90% of lemmas with 25 examples for each sense
 Distribution → 67% SoNaR, 5% CGN, 28% web
WSD from balanced sense
 5-fold cross-validation (5-FCV) at sense level, with focus on nouns
 Optimized for annotating SoNaR
 Specific features (word_id)
 Overall result for nouns → 82.76
 Results used to further annotate weakly performing senses
 Active Learning approach
 Select 82 lemmas performing under 80%
 3 rounds of annotation until reaching 81.62%
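The active-learning loop above starts from per-lemma cross-validation scores: lemmas under the 80% threshold are sent back for another annotation round. A minimal sketch with toy data; the function name and sense labels are illustrative:

```python
def weak_lemmas(results, threshold=0.80):
    """`results` maps lemma -> list of (gold, predicted) sense pairs
    from cross-validation; return the lemmas whose accuracy falls
    below the threshold, i.e. candidates for further annotation."""
    weak = []
    for lemma, pairs in results.items():
        acc = sum(g == p for g, p in pairs) / len(pairs)
        if acc < threshold:
            weak.append(lemma)
    return sorted(weak)

results = {
    "band": [("s1", "s1"), ("s2", "s2"), ("s1", "s1"), ("s2", "s2")],
    "blad": [("s1", "s2"), ("s2", "s2"), ("s1", "s1"), ("s1", "s2")],
}
to_annotate = weak_lemmas(results)
```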
Balanced context
 Goal: annotate the whole corpus → cover as many contexts as the whole corpus offers → obtain a good WSD system → improve the problematic cases
 Select all words performing under 80%
 Annotate the whole corpus with the (optimized) Timbl WSD system
 50 new tokens for senses of words under 80%, each from a different context
 High confidence
 Low / high distance to the nearest neighbour
 Manually annotate these 50 tokens
 Completely different from the first phase, where annotators could choose examples
 Lemmatization errors, PoS errors, figurative, idiomatic and unknown senses
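The selection step above combines two constraints: prefer confidently tagged tokens (for a memory-based learner, low distance to the nearest training neighbour) and enforce context diversity. A toy stand-in for the Timbl-based selection; the function name and candidate format are assumptions:

```python
def select_for_annotation(candidates, n=50):
    """Pick up to `n` automatically tagged tokens for manual checking.
    Each candidate is (context, distance); candidates are taken in
    order of distance to the nearest training neighbour (low distance
    ~ high confidence), skipping contexts already picked."""
    seen, picked = set(), []
    for ctx, dist in sorted(candidates, key=lambda c: c[1]):
        if ctx not in seen:          # enforce context diversity
            seen.add(ctx)
            picked.append((ctx, dist))
        if len(picked) == n:
            break
    return picked

cands = [("ctx-a", 0.9), ("ctx-a", 0.1), ("ctx-b", 0.4)]
picked = select_for_annotation(cands, n=2)
```

Sorting by high distance instead gives the complementary "hard case" selection the slide also mentions.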
Evaluating the Balanced-sense and new annotations

Type                              Accuracy   # examples
Balanced Sense (BS)                81.62        8,641
BS + LowD                          78.81       13,266
BS + LowD_agreed                   85.02       11,405
BS + HighD                         76.24       19,055
BS + HighD_agreed                  83.77       13,359
BS + LowD_agreed + HighD_agreed    85.33       16,123

• Timbl-DSC, 5-FCV (folds incremented with the new data), 82 lemmas
• Better results when using agreed data
• High vs. low distance does not make a big difference
Evaluation balanced-context
 5-FCV using the agreed new instances
 Best is majority voting

System      Nouns   Verbs   Adjs
DSC-timbl   83.97   83.44   78.64
DSC-svm     82.69   84.93   79.03
DSC-ukb     73.04   55.84   56.36
Voting      88.65   87.60   83.06
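Majority voting over the three systems' predictions is simple to sketch. A minimal version, assuming one tuple of system outputs per token; the tie-breaking convention (first system wins) is one possible choice, not necessarily the one used in DSC:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine sense predictions from several WSD systems, one tuple
    of system outputs per token, e.g. (timbl, svm, ukb). Ties are
    broken in favour of the first listed system."""
    labels = []
    for token_preds in predictions:
        counts = Counter(token_preds)
        top = counts.most_common(1)[0][1]
        winners = [p for p in token_preds if counts[p] == top]
        labels.append(winners[0])
    return labels

votes = [("s1", "s1", "s2"), ("s1", "s2", "s3")]
combined = majority_vote(votes)
```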
Evaluating representativeness
 Our manually annotated corpus is probably skewed towards the balanced-sense design
 We need to test the performance of our WSD systems on the rest of SoNaR
 Random evaluation
 Accuracy ranges (90-100, 80-90, 70-80, 60-70)
 5 nouns, 5 verbs and 3 adjectives per range → 52 lemmas
 100 tokens per lemma, automatically tagged and manually validated
Evaluating representativeness
 Results lower than in the previous evaluations
 Shows the difference between representing the lexicon (senses) and representing the corpus
 Results comparable to state-of-the-art English Sens/Sem-eval results

System      Nouns   Verbs   Adjs
DSC-timbl   54.25   48.25   46.50
DSC-svm     64.10   52.20   52.00
DSC-ukb     49.37   44.15   38.13
Voting      60.70   53.95   50.83
Obtaining sense distributions
 Approach
 Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
 Assume that the automatic annotation still reflects the real distribution
 Evaluate this frequency distribution (Most Frequent Sense, MFS)
 How can this MFS approach be evaluated?
 Manual annotations
 25 examples per sense, so no sense distribution
 Random evaluation corpus
 Only a small selection of words (52 lemmas)
Obtaining sense distributions
 An all-words corpus was created
 Completely independent texts taken from Lassy
 Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
 23,907 tokens, covering 1,527 of our set of lemmas (53%)
 Evaluation of
 The 3 WSD systems
 First-sense baseline according to Cornetto
 Random-sense baseline
 Most frequent sense
 Sense distributions obtained from the automatic annotation
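Deriving an MFS baseline from the automatic annotation and scoring it on a gold test set can be sketched as follows, with toy counts and invented sense labels:

```python
from collections import Counter, defaultdict

def mfs_table(auto_annotations):
    """Derive a most-frequent-sense table from automatically tagged
    tokens: (lemma, sense) pairs -> {lemma: most frequent sense}."""
    per_lemma = defaultdict(Counter)
    for lemma, sense in auto_annotations:
        per_lemma[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in per_lemma.items()}

def mfs_accuracy(mfs, gold):
    """Score the MFS baseline on gold (lemma, sense) test tokens."""
    hits = sum(mfs.get(lemma) == sense for lemma, sense in gold)
    return hits / len(gold)

auto = [("band", "s1")] * 8 + [("band", "s2")] * 2   # automatic tags
gold = [("band", "s1"), ("band", "s1"), ("band", "s2")]
acc = mfs_accuracy(mfs_table(auto), gold)
```

If the automatic annotation reflects the real distribution, this baseline should approach the gold MFS, which is what the next slide tests.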
Obtaining sense distributions
 MFS in Dutch performs similarly to English MFS
 MFS is better than the first-sense and random-sense baselines
 The automatically derived MFS is a good predictor

System        Nouns   Verbs   Adjs
1st sense     53.17   32.84   52.17
Random sense  29.52   24.99   32.16
MFS           61.20   50.76   54.62
DSC-timbl     55.76   37.96   49.00
DSC-svm       64.58   45.81   55.70
DSC-ukb       56.81   31.37   35.93
Voting        66.09   45.68   52.24
Numbers of DSC
 Balanced-sense annotated corpus
 274,344 tokens
 2,874 lemmas
 Annotated by 2 annotators, 90% IAA
 Balanced-context annotated corpus
 132,666 tokens
 1,133 lemmas
 Manually annotated by 1 annotator, agreeing with the WSD output in 44% of cases
 Random evaluation corpus
 5,200 tokens
 52 lemmas
 All words corpus
 23,907 tokens
 1,527 lemmas
 3 WSD systems for Dutch
 DSC-timbl
 DSC-svm
 DSC-ukb
 Automatic annotations by the 3 WSD systems
 Sense distributions
 48 million tokens, with confidence scores
 … and more…
 800,000 semantic relations between senses
extracted from manual annotations
 28.080 sense groups
 Improved version of Cornetto
 SAT annotation tool
 Web search tool
 Statistics on figurative, idiomatic and
collocational usage of words
 …
Piek Vossen piek.vossen@vu.nl
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög a.gorog@vu.nl
Thanks for your attention
RANLP 2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus

  • 1. DutchSemCor: in Quest of the Ideal Sense Tagged Corpus Piek Vossen piek.vossen@vu.nl Rubén Izquierdo ruben.izquierdobevia@vu.nl Attila Görög a.gorog@vu.nl
  • 2. Outline  Main goal of our project  WSD and annotated corpus  Our approach  Balanced-sense corpus and evaluation  Balanced-context corpus and evaluation  Sense distributions, all words corpus and evaluation  Numbers… 1
  • 3. Main goal of DSC  Deliver a Dutch corpus enriched with semantic information:  Senses of the most frequent and most polysemous words  Domains  Named Entities linked with Wikipedia  1 million sense tagged tokens:  250K tagged manually by 2 annotators  750K tagged by 1 annotator / automatically through Active Learning 2
  • 4. Current WSD  Insights on Word Sense Disambiguation 1. Evaluation tasks depend on the corpus / lexicon  It seems that the results depend more on the evaluation data than on WSD systems  Are the evaluation corpora diverse enough? 2. Most frequent sense from SemCor difficult to beat  Are evaluation tasks neglecting low frequent senses? 3. Predominant senses in specific domains give the best results 4. Supervised systems beat unsupervised systems  Which are the best corpora for WSD  How should be the ideal corpus for WSD? (we)  Define criteria for the ideal sense-tagged corpus  Describe a novel approach for building a large scale sense tagged corpus for meet criteria (with as little manual effort as possible) 3
  • 5. Criteria for a corpus  A good corpus for WSD should:  Be balanced for different senses  Equal number of examples for each meaning  Be balanced for different contexts  Different usages of the words  Provide information on sense frequencies (across domains and genres)  Frequency of the words in a representative meaning 4
  • 6. Annotating a corpus Sequential Tagging All Words corpus Targeted tagging Lexical Sample Corpus Balanced sense Balanced context  Whole text  Reconsider meanings  KWIC  Repeated contexts  Small numbers of texts, genres, domains and senses  Sense distributions  SemCor  Usually large number of contexts and senses  Line-hard-serve  DSO 5
  • 7. Annotating a corpus Sense distribution Sense coverage Context diversity All words ✔ ✖ ✖ Balanced-sense ✖ ✔ ✖ Balanced-context ✖ ✖ ✔ 6
  • 8. Our main approach 1. Annotated corpus that represents ALL the meanings of an existing lexicon  Balanced sense  Manual 2. Train WSD systems using the annotated corpus  Will be trained for all the senses 3. Extend this annotated corpus to acquire a wider representation of contexts  Balanced-context  Manual + WSD 4. Annotate the full raw corpus  Sense distributions  WSD 5. Evaluation of the annotations for the 3 criteria 7
  • 9. Resources  Cornetto database  Lexical semantic database for Dutch  Structure and content of WN + FrameNet-like data  SoNaR (500M tokens)  Dutch wide range of genres and topics  34 categories: discussion lists, books, chats, autocues…)  CGN (9M tokens)  Transcribed spontaneous Dutch adult speech  Internet 8
  • 10. WSD systems  DSC-timbl  Memory learning classifier  Supervised K-nearest neighbor  DSC-SVM  Linear classifier / Support Vector Machines  Binary classifiers 1 vs all  DSC-UKB  Knowledge based system  Personalized page rank algorithm  Synsets  nodes Relations  hedges  Context words inject mass into word senses 9
  • 11. Balanced-sense corpus  2,870 most polysemous and frequent words (11,982 meanings, avg. polysemy 3)  Student assistants, 2 years  SAT tool and Web-snippets tool  80% agreement, 25 examples per sense  282,503 tokens double annotated  80% of senses with more than 25 examples  90% of lemmas with 25 examples for each sense  Distribution: 67% SoNaR, 5% CGN, 28% web 10
  • 14. WSD from balanced sense  5-fold cross-validation (5-FCV) at sense level, focusing on nouns  Optimized for annotating SoNaR  Specific features (word_id)  Overall result for nouns: 82.76  Results used to further annotate weakly performing senses  Active Learning approach  Select 82 lemmas performing under 80%  3 rounds of annotation until reaching 81.62% 13
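The selection step for active learning (run 5-fold cross-validation per lemma, flag lemmas scoring under the 80% target) can be sketched as follows; `five_fold_accuracy`, the dummy scorer and the threshold handling are illustrative stand-ins for the real training pipeline:

```python
import random

def five_fold_accuracy(instances, score_fold, folds=5):
    """Average accuracy over 5 folds; score_fold(train, test) stands in for
    training and testing a real per-lemma WSD classifier."""
    instances = list(instances)
    random.Random(0).shuffle(instances)
    scores = [score_fold([x for i, x in enumerate(instances) if i % folds != f],
                         instances[f::folds]) for f in range(folds)]
    return sum(scores) / folds

def lemmas_for_active_learning(per_lemma_accuracy, threshold=0.80):
    """Flag lemmas whose cross-validated accuracy falls below the threshold."""
    return sorted(l for l, acc in per_lemma_accuracy.items() if acc < threshold)

# Invented per-lemma accuracies; only the low scorers go into the next round.
print(lemmas_for_active_learning({"band": 0.91, "blad": 0.74, "slag": 0.79}))
# ['blad', 'slag']
```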
  • 17. Balanced context  Try to annotate the whole corpus  as many contexts as the whole corpus  requires a good WSD  improve problematic cases  Select all words performing under 80%  Annotate the whole corpus with the Timbl-wsd system (optimized)  50 new tokens for senses of words under 80%, each in a different context  High confidence  Low / high distance to the nearest neighbour  Manually annotate these 50  Completely different from the first phase, where annotators could choose the examples  Lemmatization errors, PoS errors, figurative, idiomatic and unknown senses 16
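One way to sketch the context-diversity selection (keeping automatically tagged tokens whose context differs from the already annotated examples) is a distance to the nearest annotated neighbour; the Jaccard measure, thresholds and function names here are assumptions for illustration, not the project's actual criteria:

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over bag-of-words context sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def select_new_contexts(tagged, annotated, per_sense=50, min_dist=0.5):
    """tagged: (sense, feature_set) pairs from the automatic annotation;
    annotated: sense -> feature sets already in the manual corpus.
    Keep up to per_sense candidates whose nearest annotated example is far."""
    picked = {}
    for sense, feats in tagged:
        nearest = min((jaccard_distance(feats, a) for a in annotated.get(sense, [])),
                      default=1.0)
        if nearest >= min_dist and len(picked.setdefault(sense, [])) < per_sense:
            picked[sense].append(feats)
    return picked

already = {"bank/FIN": [{"money", "loan", "account"}]}
auto = [("bank/FIN", {"money", "loan"}),        # too close: skipped
        ("bank/FIN", {"hypotheek", "rente"})]   # novel context: kept
print(select_new_contexts(auto, already))
```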
  • 18. Evaluating the balanced-sense and new annotations
    Type                               Accuracy  # examples
    Balanced Sense (BS)                81.62     8,641
    BS + LowD                          78.81     13,266
    BS + LowD_agreed                   85.02     11,405
    BS + HighD                         76.24     19,055
    BS + HighD_agreed                  83.77     13,359
    BS + LowD_agreed + HighD_agreed    85.33     16,123
    • Timbl-DSC, 5-FCV (folds incremented with the new data), 82 lemmas • Better results when using the agreed data • High/low distance does not make a big difference 17
  • 19. Evaluation balanced-context  5-FCV using the agreed new instances  Best is majority voting
    System     Nouns  Verbs  Adjs
    DSC-timbl  83.97  83.44  78.64
    DSC-svm    82.69  84.93  79.03
    DSC-ukb    73.04  55.84  56.36
    Voting     88.65  87.60  83.06
    18
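The voting combination can be sketched as simple majority voting over the three systems' outputs; the tie-break (falling back to the first system's label) is an assumption, not necessarily the strategy used in the project:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one sense label per system for the same token.
    Ties fall back to the first system's output (an assumed tie-break)."""
    counts = Counter(predictions)
    top, n = counts.most_common(1)[0]
    tied = [s for s, c in counts.items() if c == n]
    return top if len(tied) == 1 else predictions[0]

print(majority_vote(["s1", "s2", "s1"]))  # s1
print(majority_vote(["s1", "s2", "s3"]))  # s1 (three-way tie)
```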
  • 20. Evaluating representativeness  Our manually annotated corpus is probably skewed towards the balanced-sense design  We therefore need to test the performance of our WSD on the rest of SoNaR  Random evaluation  Accuracy ranges (90-100, 80-90, 70-80, 60-70)  5 nouns, 5 verbs and 3 adjectives per range  52 lemmas  100 tokens for each lemma, automatically tagged and manually validated 19
  • 21. Evaluating representativeness  Results lower than in the previous evaluations  Difference between an approach representing the lexicon (senses) and one representing the corpus  Results comparable to state-of-the-art English Senseval/SemEval results
    System     Nouns  Verbs  Adjs
    DSC-timbl  54.25  48.25  46.50
    DSC-svm    64.10  52.20  52.00
    DSC-ukb    49.37  44.15  38.13
    Voting     60.70  53.95  50.83
    20
  • 22. Obtaining sense distributions  Approach  Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies  Assume that the automatic annotation still reflects the real distribution  Evaluate this frequency distribution (Most Frequent Sense)  How can this MFS approach be evaluated?  Manual annotations  25 examples per sense, no sense distribution  Random evaluation corpus  Only a small selection of words (52 lemmas) 21
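Deriving an MFS baseline from the automatic annotation amounts to counting sense tags per lemma over the tagged corpus and always predicting the top sense; the sketch below uses invented data:

```python
from collections import Counter, defaultdict

def sense_distributions(tagged_tokens):
    """tagged_tokens: iterable of (lemma, sense) pairs from the WSD output."""
    dist = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        dist[lemma][sense] += 1
    return dist

def mfs_baseline(dist, lemma):
    """Predict the most frequent automatically assigned sense of the lemma."""
    return dist[lemma].most_common(1)[0][0]

corpus = [("bank", "bank/FIN")] * 7 + [("bank", "bank/GEO")] * 3
print(mfs_baseline(sense_distributions(corpus), "bank"))  # bank/FIN
```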
  • 23. Obtaining sense distributions  An all-words corpus was created  Completely independent texts from Lassy  Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia  23,907 tokens, covering 1,527 of our set of lemmas (53%)  Evaluation of  the 3 WSD systems  First-sense baseline according to Cornetto  Random-sense baseline  Most frequent sense  Sense distributions obtained from the automatic annotation 22
  • 24. Obtaining sense distributions  MFS in Dutch behaves similarly to the English MFS  MFS is better than the first-sense and random-sense baselines  The automatically derived MFS is a good predictor
    System        Nouns  Verbs  Adjs
    1st sense     53.17  32.84  52.17
    Random sense  29.52  24.99  32.16
    MFS           61.20  50.76  54.62
    DSC-timbl     55.76  37.96  49.00
    DSC-svm       64.58  45.81  55.70
    DSC-ukb       56.81  31.37  35.93
    Voting        66.09  45.68  52.24
    23
  • 25. Numbers of DSC  Balanced-sense annotated corpus  274,344 tokens  2,874 lemmas  Annotated by 2 annotators, 90% IAA  Balanced-context annotated corpus  132,666 tokens  1,133 lemmas  Manually annotated by 1 annotator, agreeing with the WSD output in 44% of cases  Random evaluation corpus  5,200 tokens  52 lemmas  All-words corpus  23,907 tokens  1,527 lemmas  3 WSD systems for Dutch  DSC-timbl  DSC-svm  DSC-ukb  Automatic annotations by the 3 WSD systems  Sense distributions  48 million tokens with confidence scores  … and more…  800,000 semantic relations between senses extracted from the manual annotations  28,080 sense groups  Improved version of Cornetto  SAT annotation tool  Web search tool  Statistics on figurative, idiomatic and collocational usage of words  … 24
  • 26. Piek Vossen piek.vossen@vu.nl Rubén Izquierdo ruben.izquierdobevia@vu.nl Attila Görög a.gorog@vu.nl Thanks for your attention