Ambiguity in interpreting signs is not a new idea, yet the vast majority of research on machine interpretation of signals such as speech, language, images, video, and audio tends to ignore ambiguity. This is evidenced by the fact that metrics for the quality of machine understanding rely on a ground truth, in which each instance (a sentence, a photo, a sound clip, etc.) is assigned a discrete label, or set of labels, and the machine’s prediction for that instance is compared to the label to determine if it is correct. This determination yields the familiar precision, recall, accuracy, and F-measure metrics, but clearly presupposes that such a determination can be made. CrowdTruth is a form of collective intelligence based on a vector representation that accommodates diverse interpretation perspectives and encourages human annotators to disagree with each other, in order to expose latent elements such as ambiguity and worker quality. In other words, CrowdTruth assumes that when annotators disagree on how to label an example, it is because the example is ambiguous, the worker is not doing the task properly, or the task itself is not clear. Previous work on CrowdTruth focused on how the disagreement signals from low-quality workers and from unclear tasks can be isolated. Recently, we observed that disagreement can also signal ambiguity. The basic hypothesis is that, if workers disagree on the correct label for an example, then it will be more difficult for a machine to classify that example. An elaborate data analysis to determine whether the source of the disagreement is ambiguity supports our intuition that low clarity signals ambiguity, while high-clarity sentences quite obviously express one or more of the target relations. In this talk I will share the experiences and lessons learned on the path to understanding diversity in human interpretation and the ways to capture it as ground truth to enable machines to deal with such diversity.
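The contrast between a discrete ground truth and a disagreement-aware representation can be sketched in a few lines of Python. The annotations and relation names below are invented for illustration; this is not the actual CrowdTruth data format.

```python
from collections import Counter

# Hypothetical annotations: 5 workers label one sentence with a relation.
# A discrete ground truth forces a single label; a vector keeps the spread.
annotations = ["treats", "treats", "causes", "treats", "causes"]

# Conventional approach: majority vote collapses disagreement to one label.
majority_label, _ = Counter(annotations).most_common(1)[0]

# Disagreement-aware approach (sketch): keep the full distribution as a vector.
counts = Counter(annotations)
total = sum(counts.values())
label_vector = {label: n / total for label, n in counts.items()}

print(majority_label)  # treats
print(label_vector)    # {'treats': 0.6, 'causes': 0.4}
```

Majority voting throws away the 40% minority reading; the vector keeps it, so downstream evaluation can treat the example as partially ambiguous rather than simply right or wrong.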
2. Web & Media Group
http://lora-aroyo.org @laroyo
Bulgaria
The Netherlands
Sofia
NYC
Personal
Semantics
3.
Riva del Garda, Italy, 2014
Semantic
Social Life
4.
To understand the value of
Semantic Web for e-learning
you have to understand people,
e.g. how they learn, interact &
consume information
6.
To understand the value of Semantic Web
for cultural heritage
you have to understand people, e.g.
how they interact & consume information
8.
To understand the value of Semantic Web
for digital humanities, you have to
understand people, e.g.
how they interact & consume information
9.
people are in the center of everything
people & their semantics, i.e. their real-world behavior,
online interactions, information needs, information
consumption habits, personal preferences ...
10.
CrowdTruth team
12.
A long time ago
in a galaxy far, far away …
’50s: AI more or less begins
......
’80s: expert systems
’90s: knowledge acquisition from experts
’00s: standards & interoperability
’10s: big data & large crowds
14.
’80s - empire of the experts
Advances in hardware and SDEs: PCs, workstations, Symbolics, Sun; new architectures like the Hypercube; LISP, Prolog, OPS. AI can now BUILD SYSTEMS.
Primary focus on experts and rules:
What is the knowledge of experts? What is the form of this knowledge? Graphs, logic, rules, frames
How do experts reason? Deduction, induction
Work on form & process (what happened inside the system, to make the reasoning inside the system proper and as good as possible) remained academic; industry forged ahead with ad-hoc & proprietary systems and actually tried to build expert systems.
Origins of uncertain KR: fuzzy, probabilistic
15.
Piero Bonissone and the DELTA/CATS expert system for locomotive repair, with David Smith, a locomotive repair expert.
Buchanan and Shortliffe’s MYCIN project at Stanford built a huge rule base for medical diagnosis, working with an extensive team of medical experts.
18.
90’s - knowledge acquisition from experts
The ’90s brought attention to knowledge acquisition. Knowing that expert systems could by then functionally work, the focus, in practice as well as in scientific research and technology development, shifted to the then-bigger challenge of how to acquire knowledge in real-world scenarios.
It seems natural that, after the look inside the systems, one needed to pay attention to how to actually get the knowledge from the world outside and frame it into the proper structured knowledge for inside the system.
Dream of the 90’s
27.
the semantic
comfort
zone
28.
One truth: knowledge acquisition for the semantic web
assumes one correct interpretation for every example
All examples are created equal: triples are triples, one is not
more important than another, they are all either true or false
Disagreement bad: when people disagree, they don’t
understand the problem
Experts rule: knowledge is captured from domain experts
One is enough: knowledge by a single expert is sufficient
Detailed explanations help: if examples cause disagreement
- add instructions
Once done, forever valid: knowledge is not updated; new
data not aligned with old
“Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
29.
Use Case:
video archive
enrichment
Search Behavior of Media Professionals at an Audiovisual Archive:
A Transaction Log Analysis (2009).
B. Huurnink, L. Hollink, W. van den Heuvel, M. de Rijke.
30.
Use Case:
video archive
enrichment
Goal:
make the
multimedia content of
Dutch National Video Archive
accessible to large audiences
Comfort Zone Solution:
media professionals watch & annotate videos. Of course!
31.
but ...
expensive & doesn’t scale
time-consuming: about 5 times the video duration
professional vocabulary: experts use a specific vocabulary that is unknown to general audiences
32.
… and
people search for fragments, but experts annotate full videos
not finding: 35% of search queries result in no matches
33.
Use Case:
real world QA
for Watson
“Crowdsourcing Ground Truth for Question Answering using CrowdTruth” (2015), B. Timmermans, L. Aroyo, C. Welty
34.
Goal:
gather questions
that real people ask
for training & evaluating Watson
Data:
30K Questions + Candidate Answers from Yahoo! Answers
Comfort Zone Solution:
ask people if the passage answers the question (Y/N). Simple!
Use Case:
real world QA
for Watson
35.
Contradicting evidence
Is Coral a plant?
• “Coral almost could be considered half-plant [..]”
• “[..] organism, such as a coral, resembling a stony plant.”
Unanswerable questions
• Can I take a pill if you don't have a child yet?
• Is the spelling for being drunk right?
• Is napster black?
Unclear answer type
• Is paper animal plant or man made?
Multiple right answers to a question
• What is the best university in NY? (subjective)
YES or NO?
36.
Use Case:
medical relation
extraction
for Watson
“Crowdsourcing Ground Truth for Medical Relation Extraction” (2017), A. Dumitrache, L. Aroyo, C. Welty
37.
Goal:
gather data to train
Watson to read
medical text & automatically
extract a medical relations KB
Comfort Zone Solution:
having medical experts read & annotate examples
Use Case:
medical relation
extraction
for Watson
38.
ANTIBIOTICS are the first line treatment for
indications of TYPHUS.
treats(ANTIBIOTICS, TYPHUS)? Expert: yes
Patients with TYPHUS who were given ANTIBIOTICS
exhibited side-effects.
treats(ANTIBIOTICS, TYPHUS)? Expert: yes
With ANTIBIOTICS in short supply, DDT was used
during WWII to control the insect vectors of
TYPHUS.
treats(ANTIBIOTICS, TYPHUS)? Expert: yes.
Are these three really all the same???
39.
Use Case:
map music to moods
40.
Use Case:
map music to moods
Goal:
annotate songs with emotional tags
Comfort Zone Solution:
people assign the prevalent mood of a song
41.
Which is the mood most appropriate for each song? Choose one:
Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable, good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense, anxious, intense, volatile, visceral
Other: does not fit into any of the 5 clusters
(Lee and Hu 2012)
Goal: 1 song - 1 mood???
42.
One truth: knowledge acquisition for the semantic web
assumes one correct interpretation for every example
All examples are created equal: triples are triples, one is not
more important than another, they are all either true or false
Disagreement bad: when people disagree, they don’t
understand the problem
Experts rule: knowledge is captured from domain experts
One is enough: knowledge by a single expert is sufficient
Detailed explanations help: if examples cause disagreement
- add instructions
Once done, forever valid: knowledge is not updated; new
data not aligned with old
“Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
43.
Semantic
Comfort Zone
44.
Semantic
Comfort Zone
disrupted
46.
interestingly …
47.
1785
Marquis de Condorcet
“wisdom of crowds”
• collective decisions of large groups of people
• a group of error-prone decision-makers can be surprisingly good at picking the best choice
• with a thumbs up or thumbs down vote, each voter’s chance of picking the right answer needs to be > 50%
• the odds that most of them pick the right answer are then greater than the odds that any one of them picks it alone
• performance gets better as group size grows
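Condorcet's observation can be checked directly with the binomial distribution. A minimal sketch (exact for odd group sizes; ties for even sizes are ignored here):

```python
from math import comb

def majority_correct(n, p):
    """Probability that a majority of n independent voters,
    each correct with probability p, picks the right answer."""
    k_needed = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_needed, n + 1))

# Each voter only slightly better than chance (p = 0.6),
# yet the majority grows ever more reliable as the group grows.
for n in (1, 11, 101):
    print(n, round(majority_correct(n, 0.6), 3))
```

With p = 0.6 a single voter is right 60% of the time, while a majority of 101 such voters is right well over 95% of the time, matching the claim that performance improves with group size.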
48.
1906
Sir Francis Galton
“wisdom of crowds”
• asked 787 people to guess the weight of an ox
• none got the right answer
• their collective guess was almost perfect
49.
WWII Math Rosies
1942: Ballistics calculations and flight trajectories
50.
NASA’s Computer Room
transcribe raw flight data from celluloid film & oscillograph paper
51.
can we harness it?
53.
CrowdTruth
Three basic causes of disagreement: workers,
examples, target semantics
Disagreement is signal, not noise.
It is indicative of the variation in human semantic
interpretation
It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality
“CrowdTruth: Machine-Human Computation Framework for Harnessing Disagreement in Gathering Annotated Data” (2014), O. Inel, A. Dumitrache, L. Aroyo, C. Welty
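As a rough sketch of the vector idea: each worker's annotation of a sentence becomes a binary vector over the candidate relations, the sentence vector is their element-wise sum, and a sentence-relation score is the cosine between the sentence vector and the relation's unit vector. The worker vectors and relation names below are invented for illustration, not the actual CrowdTruth data.

```python
import math

relations = ["treats", "causes", "none"]

# Hypothetical worker vectors for one sentence
# (1 = the worker selected that relation).
worker_vectors = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],  # this worker picked two relations
    [0, 1, 0],
    [1, 0, 0],
]

# Sentence vector: element-wise sum of the worker vectors.
sentence_vector = [sum(col) for col in zip(*worker_vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Sentence-relation score: cosine between the sentence vector
# and each relation's unit vector.
for i, rel in enumerate(relations):
    unit = [1 if j == i else 0 for j in range(len(relations))]
    print(rel, round(cosine(sentence_vector, unit), 2))
```

A score near 1 means the workers converge on that relation; intermediate scores expose exactly the kind of disagreement the slides treat as signal.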
54.
one truth: multiple truths
all examples are created equal:
each example is unique
disagreement bad: disagreement is good
experts rule: crowd rules
one is enough: the more the better
detailed explanations help:
keep it simple stupid
once done, forever valid:
maintenance is necessary
“Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
55.
changes needed
video archive
enrichment
improve support
for fragment search
time-based annotations
bridging vocabulary gap between
searcher & cataloguer
56.
crowdsourcing
video tagging
two
video tagging pilots
57.
@waisda
http://waisda.nl
engage crowds through continuous gaming
59.
time-based
bernhard
just “tags”
“On the Role of User-Generated Metadata in A/V Collections”, Riste Gligorov et al. KCAP2011
60.
objects (57%)
westminster abbey
abbey
priester (priest)
geestelijken (clergy)
hek (fence)
paarden (horses)
tocht (procession)
aankomst (arrival)
koets (carriage)
kroning (coronation)
mensenmassa (crowd)
parade
kroon (crown)
regen (rain)
“On the Role of User-Generated Metadata in A/V Collections”, Riste Gligorov et al. KCAP2011
61.
persons (31%)
bernhard
juliana
objects (57%)
“On the Role of User-Generated Metadata in A/V Collections”, Riste Gligorov et al. KCAP2011
62.
user vocabulary
8% in professional vocabulary
23% in Dutch lexicon
89% found on Google
locations (7%): engeland (England)
persons (31%)
objects (57%)
“On the Role of User-Generated Metadata in A/V Collections”, Riste Gligorov et al. KCAP2011
63.
describe mainly short segments
often not very specific
don’t describe programmes as a whole
“On the Role of User-Generated Metadata in A/V Collections”, Riste Gligorov et al. KCAP2011
64.
crowdsourcing
medical relation
extraction
diversity of opinions
independent perspectives
multitude of contexts
we exposed a richer set of possibilities
that help in identifying, processing
& understanding context
65.
Does this sentence express TREATS(Antibiotics, Typhus)?
• “ANTIBIOTICS are the first line treatment for indications of TYPHUS.” 95%
• “Patients with TYPHUS who were given ANTIBIOTICS exhibited several side-effects.” 75%
• “With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS.” 50%
The crowd results capture the natural ambiguity
66.
What is the relation between the highlighted terms?
He was the first physician to identify the relationship between HEMOPHILIA and HEMOPHILIC ARTHROPATHY.
experts: cause
crowd: no relation
Experts hallucinate; the crowd reads the text literally and provides better examples to the machine
67.
Unclear relationship between the two arguments reflected
in the disagreement
Medical Relation Extraction
68.
Clearly expressed relation between the two arguments reflected in
the agreement
Medical Relation Extraction
69.
Unclear relationship between the two arguments reflected
in the disagreement
Medical Relation Extraction
71.
Learning Curves
(crowd with pos./neg. threshold at 0.5)
above 400 sentences: crowd consistently above baseline & single annotator
above 600 sentences: crowd outperforms experts
72.
Learning Curves Extended
(crowd with pos./neg. threshold at 0.5)
crowd consistently performs better than baseline
74.
Training a Relation Extraction Classifier

Annotation source    F1     Cost per sentence
CrowdTruth           0.642  $0.66
Expert Annotator     0.638  $2.00
Single Annotator     0.492  $0.08

“wisdom of the crowd” provides training data that is at least as good as, if not better than, expert data, but only with a proper analytic framework for harnessing disagreement from the crowd
75.
map music to moods
Goal:
tag songs with emotional clusters
Comfort Zone Solution:
people assign the prevalent mood of a song
77.
Is this song ….
Passionate, Rousing, Confident, Boisterous, Rowdy,
Rollicking, Cheerful, Fun, Sweet, Amiable, Good-natured,
Literate, Poignant, Wistful, Bittersweet, Autumnal, Brooding,
Humorous, Silly, Campy, Whimsical, Witty, Wry,
Aggressive, Fiery, Tense, Anxious, Intense, Volatile?
80.
Disagreement as Signal

Worker   Mood-C1  Mood-C2  Mood-C3  Mood-C4  Mood-C5  Other
W10      1        1        1        1        1
Totals   3        5        6        5        2        8

can indicate alternative interpretations
can indicate ambiguity in the categorisation
can indicate low quality workers
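A small sketch of how such a worker-by-cluster matrix can be mined for both signals: the spread of the column totals (here via normalized entropy) hints at ambiguity in the categorisation, while a worker who ticks nearly every cluster is a low-quality candidate. The matrix below is invented, apart from the W10 pattern of selecting everything.

```python
import math

clusters = ["C1", "C2", "C3", "C4", "C5", "Other"]

# Hypothetical annotation matrix for one song
# (1 = the worker assigned that mood cluster).
matrix = {
    "W1":  [1, 0, 0, 0, 0, 0],
    "W2":  [0, 1, 0, 0, 0, 0],
    "W3":  [0, 1, 1, 0, 0, 0],
    "W4":  [0, 0, 1, 0, 1, 0],
    "W10": [1, 1, 1, 1, 1, 0],  # ticks almost everything: low-quality signal
}

# Column totals: how often each cluster was chosen.
totals = [sum(row[i] for row in matrix.values())
          for i in range(len(clusters))]

def normalized_entropy(counts):
    """0 = everyone agrees on one cluster, 1 = maximal spread."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(counts))

print(totals)
print(round(normalized_entropy(totals), 2))  # 0.86

# Workers who tick nearly every cluster at once may be low quality.
suspects = [w for w, row in matrix.items()
            if sum(row) >= len(clusters) - 1]
print(suspects)  # ['W10']
```

Real CrowdTruth worker metrics are more elaborate (they compare each worker against the rest of the crowd), but even this toy version separates "the song is genuinely ambiguous" from "this worker is guessing".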
83.
Take Home Message
People first, experts second
True and False are not enough; there is diversity in human interpretation
CrowdTruth introduces a spatial representation of meaning that harnesses disagreement
With CrowdTruth, untrained workers can be just as reliable as highly trained experts