Event detection is still a difficult task due to the complexity and ambiguity of such entities. On the one hand, we observe low inter-annotator agreement among experts when annotating events, despite the multitude of existing annotation guidelines and their numerous revisions. On the other hand, event extraction systems achieve lower measured performance in terms of F1-score than systems for other entity types, such as people or locations. In this paper we study the consistency and completeness of expert-annotated datasets for events and time expressions, and propose a data-agnostic methodology to validate such datasets along these two dimensions. Furthermore, we combine the power of crowds and machines to correct and extend expert-annotated datasets of events. We show the benefit of using crowd-annotated events to train and evaluate a state-of-the-art event extraction system. Our results show that the crowd-annotated events increase the performance of the system by at least 5.3%.
1. Oana Inel and Lora Aroyo
LDK 2019
Validation Methodology for Expert-Annotated
Datasets: Event Annotation Case Study
@oana_inel
https://oana-inel.github.io
crowdtruth.org
2. Event Annotation
• Traditionally performed by experts
• Difficult to define who the experts are when it comes to events
• Typical evaluation: inter-annotator agreement
However ...
3. Insurers could see claims totaling nearly $1 billion from the San Francisco earthquake, far less than the $4 billion from Hurricane Hugo.
Which are the events in this sentence?
4. Insurers could see claims totaling nearly $1 billion from the San Francisco earthquake, far less than the $4 billion from Hurricane Hugo.
Which are the right events here?
6. … let’s check some expert-annotated datasets
How well do they do in terms of consistency and completeness?
7. Experiments
TempEval-3 (SemEval 2013: Temporal Annotation Task)
TempEval-3 Gold:
● 256 documents, 3,953 sentences
● 11,129 events
● 1,822 time expressions
TempEval-3 Platinum:
● 20 documents, 273 sentences
● 746 events
● 138 time expressions
https://www.cs.york.ac.uk/semeval-2013/task1/
UzZaman, N., Llorens, H., Derczynski, L., Allen, J., Verhagen, M. and Pustejovsky, J., 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. SemEval 2013, Vol. 2, pp. 1-9.
8. Consistency
Tokens have different types across datasets
His appointment to that post, which has senior administrative, staff and policy responsibilities, followed a several-year tenure as Reuters's editor in chief. → "tenure" is an EVENT in TempEval-3 Gold
During his tenure, he has increasingly unleashed biting comedic barbs against his critics and political adversaries. → "tenure" is a TIME EXPRESSION in TempEval-3 Platinum
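Below is a minimal sketch of this kind of cross-dataset consistency check, assuming each dataset has already been loaded as a list of (sentence_id, token, entity_type) triples; the field names and the loading step are illustrative, not part of the original pipeline.

```python
from collections import defaultdict

def annotation_types(annotations):
    """Map each annotated token (lower-cased) to the set of entity types it receives."""
    types = defaultdict(set)
    for sentence_id, token, entity_type in annotations:
        types[token.lower()].add(entity_type)
    return types

def cross_dataset_conflicts(gold_annotations, platinum_annotations):
    """Return tokens that receive different entity types in the two datasets,
    e.g. a token marked as EVENT in TempEval-3 Gold but as a time expression in Platinum."""
    gold_types = annotation_types(gold_annotations)
    platinum_types = annotation_types(platinum_annotations)
    conflicts = {}
    for token in gold_types.keys() & platinum_types.keys():
        if gold_types[token] != platinum_types[token]:
            conflicts[token] = (gold_types[token], platinum_types[token])
    return conflicts

# Illustrative usage with hypothetical annotation tuples:
gold = [("wsj_01", "tenure", "EVENT")]
platinum = [("wsj_02", "tenure", "TIMEX3")]
print(cross_dataset_conflicts(gold, platinum))
# {'tenure': ({'EVENT'}, {'TIMEX3'})}
```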
9. Consistency
Annotation guidelines for events are not used consistently
Hungary's accession to NATO has brought about new changes to the political balance in central and eastern Europe, shattering the world configuration under the Yalta accord since World War II. → only single-token EVENTS in TempEval-3 Gold
The report estimated that the number of passenger cars in China was on track to hit 400 million by 2030, up from 90 million now. → also multi-token EVENTS in TempEval-3 Platinum; EVENTS composed of NUMERALS were removed
10. Completeness
Occurrences of the same previously annotated event/time token are not annotated by experts
Travelers Corp.'s third-quarter net income rose 11%, even though claims stemming from Hurricane Hugo reduced results $40 million. → EVENT in TempEval-3 Gold
Insurers could see claims totaling nearly $1 billion from the San Francisco earthquake, far less than the $4 billion from Hurricane Hugo. → NOT an EVENT in TempEval-3 Gold
11. Completeness
Occurrences of the same previously annotated event/time lemma are not annotated by experts
The Windows-ready Kinect sensor, is currently selling for $250 through Microsoft, more than twice what Microsoft charges for the gaming-only Xbox version. → EVENT in TempEval-3 Platinum
Starting next year, the law will block insurers from refusing to sell coverage or setting premiums based on people’s health histories. → NOT an EVENT in TempEval-3 Platinum
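A sketch of the corresponding completeness check at token and lemma level follows, under the assumption that the data is available as a dictionary of sentences plus a set of (sentence_id, token) expert annotations; spaCy is used here purely as an illustrative lemmatizer, not necessarily the preprocessing used in the paper.

```python
import spacy

# Illustrative lemmatizer; requires `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")

def completeness_gaps(sentences, expert_annotations, use_lemma=False):
    """sentences: {sentence_id: text}; expert_annotations: set of (sentence_id, token)
    pairs annotated by the experts. Returns occurrences of an already-annotated token
    (or, with use_lemma=True, lemma) that carry no annotation in some other sentence."""
    def key(tok):
        return tok.lemma_.lower() if use_lemma else tok.text.lower()

    docs = {sid: nlp(text) for sid, text in sentences.items()}

    annotated_keys = set()     # tokens/lemmas the experts annotated somewhere
    annotated_surface = set()  # (sentence_id, surface form) pairs that are annotated
    for sid, token in expert_annotations:
        annotated_surface.add((sid, token.lower()))
        for tok in docs[sid]:
            if tok.text.lower() == token.lower():
                annotated_keys.add(key(tok))

    gaps = []
    for sid, doc in docs.items():
        for tok in doc:
            if key(tok) in annotated_keys and (sid, tok.text.lower()) not in annotated_surface:
                gaps.append((sid, tok.text))
    return gaps

# Illustrative usage at lemma level: "selling" annotated, "sell" left unannotated
sentences = {"s1": "The Kinect sensor is currently selling for $250.",
             "s2": "The law will block insurers from refusing to sell coverage."}
print(completeness_gaps(sentences, {("s1", "selling")}, use_lemma=True))
# [('s2', 'sell')]
```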
12. … and by computing this in a systematic way we observe that expert-annotated datasets:
● are not always consistently annotated
● may not always contain all events
Can the crowd help us to improve these datasets?
13. Hypotheses
● Annotation Guidelines: providing explicit definitions increases the accuracy of the results
● Annotation Value: motivating the choice increases the accuracy of the results
● Number of Annotators: large numbers of crowd workers perform as well as experts
● Input Entity Values: the crowd provides consistent annotations when asked to validate and add missing events
14. Crowdsourcing Event Annotations
Pilot experiments (varying Annotation Guidelines, Annotation Value, Input Entity Values, Number of Annotators) to determine the optimal crowdsourcing setting → evaluate the pilot experiments → main experiment with the optimal crowdsourcing setting → annotate events in sentences
15. 16 PILOT EXPERIMENTS: which is the optimal setting?
Input data: 50 sentences from the TempEval-3 Platinum (P) dataset
Controlled variables:
● Entity Type: Time Expression, Event
● Dataset: Platinum
● Entity Values: Expert (P) & Tools; Expert (G+P) & Tools & Missing
● Annotators: 20 workers, Figure Eight, English speaking
Crowdsourcing template:
● Annotation Guidelines: Implicit Definition, Explicit Definition
● Annotation Value: Entities; Entities + Motivation (NONE); Entities + Motivation (ALL); Entities + Motivation (ALL) + Highlight
19. How many workers give accurate results?
● event annotation: 12 workers perform well when compared to the experts
● time expression annotation: 15 workers perform well when compared to the experts
Setting: Platinum dataset, Time and Event entities, Expert (G+P) & Tools & Missing, Explicit Definition, Entities + Motivation (ALL) + Highlight
20. MAIN EXPERIMENTS FOR EVENT ANNOTATION
Input data: 4,202 sentences from the TempEval-3 Gold (G) and TempEval-3 Platinum (P) datasets
● Platform: Figure Eight, 15 workers, English speaking, 4 cents per annotation
● Dataset: Gold & Platinum
● Entity Type: Event
● Entity Values: Expert (G+P) & Tools & Missing
● Annotation Guidelines: Explicit Definition
● Annotation Value: Entities + Motivation (ALL) + Highlight
21. Train and Evaluate ClearTK with Expert & Crowd Events
Train on Experts and Test on Experts: F1-score of the ClearTK tool is 0.788
Train on Experts and Test on Crowd: F1-score of the ClearTK tool is significantly better, around 0.83
Train on Crowd and Test on Experts: F1-score of the ClearTK tool is almost as good (0.77)
22. Training and Evaluating ClearTK with Crowd Events
Train on Crowd and Test on Crowd: F1-score of the ClearTK tool reaches a maximum of 0.83
ClearTK performs well when trained and evaluated at similar crowd event-sentence score thresholds
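The crowd event-sentence score mentioned above comes from CrowdTruth-style aggregation. The sketch below is a simplified, unweighted variant (the actual CrowdTruth metrics also weight workers and sentences by quality, which is omitted here), assuming each worker's judgment is the set of candidate event spans they highlighted in a sentence.

```python
from collections import Counter

def event_sentence_scores(worker_selections, num_workers):
    """worker_selections: list of sets, one per worker, each containing the event
    spans that worker highlighted in the sentence. Returns a score in [0, 1] per span:
    the fraction of workers who selected it (a simplified, unweighted CrowdTruth-style score)."""
    counts = Counter(span for selection in worker_selections for span in selection)
    return {span: count / num_workers for span, count in counts.items()}

def crowd_events(worker_selections, num_workers, threshold=0.5):
    """Keep only candidate events whose crowd score reaches the threshold; training
    and evaluation sets can then be built at matching thresholds."""
    scores = event_sentence_scores(worker_selections, num_workers)
    return {span for span, score in scores.items() if score >= threshold}

# Illustrative usage with three hypothetical workers on one sentence:
selections = [{"earthquake", "claims"}, {"earthquake"}, {"earthquake", "see"}]
print(crowd_events(selections, num_workers=3, threshold=0.5))
# {'earthquake'} -> only spans selected by at least half the workers are kept
```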
23. Contributions and Future Work
Contributions
● data-agnostic validation methodology of expert-annotated datasets in terms of
consistency and completeness
● 4,202 crowd-annotated English sentences from the TempEval-3 Gold and
TempEval-3 Platinum datasets with events
● 121 crowd-annotated sentences from the TempEval-3 Platinum dataset with time
expressions
● training and evaluating ClearTK with crowd-driven event annotations
Future Work
● replicate crowdsourcing experiment on time expressions
● investigate the role of sentence and event ambiguity in the training and
evaluation of event extraction systems