How to create a taxonomy with a paid workforce provided by Amazon Mechanical Turk, and an evaluative comparison with an existing community of motivated students and domain experts.
Presentation held at JCDL 2010, Brisbane, Australia (http://www.jcdl2010.org).
Crowdsourcing the Assembly of Concept Hierarchies
1. Crowdsourcing the Assembly of Concept Hierarchies
Kai Eckert¹ Cameron Buckner²
Mathias Niepert¹ Colin Allen²
Christof Niemann¹ Heiner Stuckenschmidt¹
¹ University of Mannheim, Germany
² Indiana University, USA
Presentation: Kai Eckert
Wednesday, June 23, 2010
Joint Conference on Digital Libraries (JCDL), Brisbane, Australia, 2010
2. Motivation
● Various types of Concept Hierarchies:
● Thesauri
● Taxonomies
● Classifications
● Ontologies
● ...
● Manual creation is expensive.
● Automatic creation lacks quality.
3. Could the users do the work?
● Divide the work between a lot of users.
● Motivate them to be part of a community.
● Achieve quality control by means of redundancy.
● Can a concept hierarchy be created in the same way as, e.g., Wikipedia?
4. The Indiana Philosophy Ontology Project (InPhO)
● A browsable taxonomy of philosophical ideas.
● Ideas are extracted from the Stanford Encyclopedia of
Philosophy (SEP).
● Intuitive access to the SEP via the InPhO taxonomy.
● Entry point for other philosophical resources on the web.
5. From the SEP to InPhO
● Start with a hand-built formal ontology describing major topics and sub-topics.
● Extraction of new ideas and relationships.
● Gathering community feedback about ideas and relationships.
● Process feedback and infer positions in the classification tree.
10. Great stuff, but...
● What if you do not have a motivated community of expert users?
● Well, ...
● Like almost everything, you can buy it at Amazon...
● Amazon Mechanical Turk
11. Amazon Mechanical Turk (AMT)
● Platform for posting and taking on Human Intelligence Tasks (HITs).
● 100,000 – 400,000 HITs available.
● Number of workers: ??? (100,000 in 100 countries, according to the New York Times, 2007).
12. HIT Definition
● Time allotted per assignment: Maximum time a worker can work on a single task.
● Worker restrictions: Approval Rate, Location
● Reward per assignment: How much do you pay for each HIT?
● Number of assignments per HIT: How many unique workers do you want to work on each HIT?
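The four parameters above can be pictured as one small record. The following is only a sketch: the class name, field names, and example values are hypothetical, and it does not use the actual Amazon Mechanical Turk API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitDefinition:
    """Hypothetical container for the four HIT parameters on this slide."""

    time_allotted_seconds: int   # maximum time a worker may spend on one assignment
    min_approval_rate: float     # worker restriction: required approval rate in percent
    allowed_locations: tuple     # worker restriction: permitted countries (empty = any)
    reward_cents: int            # reward per assignment, in US cents
    assignments_per_hit: int     # number of unique workers per HIT

    def reward_total_cents(self, num_hits: int) -> int:
        """Total worker rewards for num_hits HITs (platform fees not included)."""
        return self.reward_cents * self.assignments_per_hit * num_hits


# Illustrative values only; these are not the settings used in the study.
example = HitDefinition(
    time_allotted_seconds=600,
    min_approval_rate=95.0,
    allowed_locations=(),
    reward_cents=25,
    assignments_per_hit=5,
)
print(example.reward_total_cents(num_hits=10))
```

Keeping the reward in integer cents avoids floating-point rounding when many assignments are summed.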
13. HIT Result
● Answer of each worker for each HIT
● Accept Time, Submit Time, Work Time In Seconds
● Worker ID
14. Our questions
● Can we replace the InPhO community by means of Amazon Mechanical Turk?
● How much does it cost and what is the resulting quality?
15. Experimental Setup
● We wanted some overlap within the experts:
Minimum overlap i    1      2      3    4    5
Number of pairs      3,237  1,154  370  187  92
We decided on the 1,154 pairs (minimum overlap 2).
● Each pair was evaluated by 5 different workers.
● Each worker evaluated at least 12 pairs (1 HIT).
● 87 distinct workers.
● The HITs were completed in 20 hours.
16. Measuring Agreement
● Calculation of the distance between two answers:
● Relatedness: Absolute value of the difference
● Relative Generality: Match: 0, otherwise: 1
● The evaluation deviation is the mean distance of a user
to the users in a reference group.
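The two distance measures and the evaluation deviation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the data layout, and the answer labels are hypothetical, and it assumes the mean is taken over all answers that the user shares with any member of the reference group.

```python
from statistics import mean


def relatedness_distance(a: int, b: int) -> int:
    """Relatedness: absolute value of the difference between two ratings."""
    return abs(a - b)


def generality_distance(a: str, b: str) -> int:
    """Relative generality: 0 if the two answers match, otherwise 1."""
    return 0 if a == b else 1


def evaluation_deviation(user_answers, reference_answers, distance):
    """Mean distance of one user to the users in a reference group.

    user_answers:      dict mapping a concept pair to the user's answer
    reference_answers: dict mapping each reference user to such a dict
    Only pairs answered by both sides are compared (assumption).
    Returns None if there is no overlap at all.
    """
    distances = [
        distance(answer, ref[pair])
        for ref in reference_answers.values()
        for pair, answer in user_answers.items()
        if pair in ref
    ]
    return mean(distances) if distances else None


# Hypothetical data: one worker compared with two experts on relatedness.
worker = {("A", "B"): 3, ("C", "D"): 1}
experts = {
    "expert1": {("A", "B"): 4, ("C", "D"): 1},
    "expert2": {("A", "B"): 1},
}
print(evaluation_deviation(worker, experts, relatedness_distance))
```

In this example the three shared answers have distances 1, 0, and 2, so the deviation is their mean.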
17. Comparison with Experts (Relative Generality)
[Figure: histogram of InPhO Users and AMT Users. x-axis: 0.1 to 1.0, from "Follow Experts" to "Own Opinion"; y-axis: fraction of users in %, 0 to 30.]
18. Comparison with Experts (Relative Generality)
[Figure: histogram of InPhO Users and AMT Users, with the annotation "Random Clicker". x-axis: 0.1 to 1.0, from "Follow Experts" to "Own Opinion"; y-axis: fraction of users in %, 0 to 30.]
20. Comparison with Experts (Relative Generality)
● InPhO Users are quite consistent.
[Figure: histogram of InPhO Users and AMT Users. x-axis: 0.1 to 1.0, from "Follow Experts" to "Own Opinion"; y-axis: fraction of users in %, 0 to 30.]
21. Comparison with Experts (Relative Generality)
● InPhO Users are quite consistent.
● AMT Users are not consistent.
→ Are there good ones?
[Figure: histogram of InPhO Users and AMT Users. x-axis: 0.1 to 1.0, from "Follow Experts" to "Own Opinion"; y-axis: fraction of users in %, 0 to 30.]
22. Comparison with Experts (Relative Generality)
● InPhO Users are quite consistent.
● AMT Users are not consistent.
→ Are there good ones? Yes, there are!
→ But which ones?
[Figure: histogram of InPhO Users and AMT Users. x-axis: 0.1 to 1.0, from "Follow Experts" to "Own Opinion"; y-axis: fraction of users in %, 0 to 30.]
25. Telling the good from the bad
● First approach: Filtering by working time
● Hypothesis 1: Workers who take some time to think before they answer give better answers.
● Hypothesis 2: There are probably workers who give quick random responses.
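A working-time filter of this kind can be sketched in a few lines. The function name and the input layout are hypothetical; the sketch only assumes that each HIT result carries a Worker ID and a work time in seconds, as listed on the "HIT Result" slide.

```python
from collections import defaultdict
from statistics import mean


def workers_above_threshold(assignments, min_avg_seconds):
    """Return the workers whose average working time per HIT exceeds a threshold.

    assignments: iterable of (worker_id, work_time_seconds), one per submitted HIT
    """
    times = defaultdict(list)
    for worker_id, seconds in assignments:
        times[worker_id].append(seconds)
    return {w for w, ts in times.items() if mean(ts) > min_avg_seconds}


# Hypothetical results: w1 averages 40 s, w2 200 s, w3 10 s per HIT.
results = [("w1", 30), ("w1", 50), ("w2", 200), ("w3", 10)]
print(sorted(workers_above_threshold(results, min_avg_seconds=20)))
```

Raising the threshold step by step and recomputing the deviation of the remaining workers gives the kind of curve used to test both hypotheses.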
26. Filtering by working time
[Figure: bar chart of the number of users (y-axis, 0 to 100) whose average working time for one HIT (12 pairs) exceeds a given threshold in seconds (x-axis). The bar labels decrease from 84 to 3 as the threshold rises.]
27. Filtering by working time
[Figure: the bar chart of the number of users (right y-axis, 0 to 100) per threshold of average working time for one HIT (12 pairs) (x-axis), overlaid with the deviation from experts of the remaining users (left y-axis, 0 to 1.5).]
28. Telling the good from the bad
● Second approach: Filtering by comparison with a hidden
gold standard.
● Test pairs:
● Social Epistemology – Epistemology (P1)
● Computer Ethics – Ethics (P2)
● Chinese Room Argument – Chinese Philosophy (P3)
● Dualism – Philosophy of Mind (P4)
29. Applying filters
● Test pairs:
● Social Epistemology – Epistemology (P1)
● Computer Ethics – Ethics (P2)
● Chinese Room Argument – Chinese Philosophy (P3)
● Dualism – Philosophy of Mind (P4)
● Filters:
1) P1 and P2 are correct (Common Sense)
2) Like 1), additionally P4 is correct (+Background)
3) Like 1), additionally P3 is correct (+Lexical)
4) All have to be correct (All)
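The four filters amount to subset checks on the test pairs a worker answered correctly. The sketch below assumes that the set of correctly answered test pairs has already been determined for each worker; the names `FILTERS`, `passes`, and `filter_workers`, as well as the example workers, are hypothetical.

```python
# Test pairs that must be answered correctly for each filter.
FILTERS = {
    "Common Sense (1)": {"P1", "P2"},
    "+Background (2)": {"P1", "P2", "P4"},
    "+Lexical (3)": {"P1", "P2", "P3"},
    "All (4)": {"P1", "P2", "P3", "P4"},
}


def passes(correct_pairs, filter_name):
    """True if all test pairs required by the filter were answered correctly."""
    return FILTERS[filter_name] <= set(correct_pairs)


def filter_workers(correct_by_worker, filter_name):
    """Return the IDs of the workers that pass the given filter."""
    return {w for w, correct in correct_by_worker.items() if passes(correct, filter_name)}


# Hypothetical workers and the test pairs each one got right.
correct_by_worker = {
    "w1": {"P1", "P2"},
    "w2": {"P1", "P2", "P4"},
    "w3": {"P1", "P2", "P3", "P4"},
    "w4": {"P1", "P3", "P4"},
}
for name in FILTERS:
    print(name, sorted(filter_workers(correct_by_worker, name)))
```

Because filters 2, 3, and 4 each contain the pairs required by filter 1, every worker who passes one of them also passes "Common Sense (1)".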
30. Filter results for relatedness
Filter             Users  Deviation  Max. Dev.
All (4)                7       0.60       1.00
+Lexical (3)          10       0.87       1.78
+Background (2)       23       0.84       1.41
Common Sense (1)      40       1.11       1.96
All AMT               87       1.39       2.96
All InPhO             25       0.77       1.75
Random               ---       1.8         ---
31. Filter results for relative generality
Filter             Users    Deviation  Max. Dev.
All (4)            7(5)          0.12       0.22
+Lexical (3)       10(8)         0.14       0.27
+Background (2)    23(20)        0.15       0.45
Common Sense (1)   40(35)        0.21       0.59
All AMT            87(78)        0.45       1.00
All InPhO          25            0.23       0.47
Random             ---           0.75        ---
32. Financial considerations
Filter             Pairs   Evaluations   Cost per Pair   Cost per Evaluation
---                1,138         5,690       US$ 0.111             US$ 0.022
Common Sense (1)   1,074         1,909       US$ 0.117             US$ 0.066
+Background (2)    1,018         1,558       US$ 0.124             US$ 0.081
+Lexical (3)         215           215       US$ 0.586             US$ 0.586
All (4)              183           183       US$ 0.689             US$ 0.689
● Overall payments: US$ 126
● Estimate for all pairs with filter "All (4)": US$ 784
● Estimate for all pairs with redundancy (5x): US$ 3,920.
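The figures on this slide are consistent with dividing the overall payments by the pairs or evaluations that survive each filter, and then scaling the per-pair cost to all 1,138 pairs. The sketch below reproduces that arithmetic under this assumption; the names `COVERAGE`, `unit_costs`, and `estimate_all_pairs` are hypothetical.

```python
TOTAL_PAID_USD = 126   # overall payments from the slide
ALL_PAIRS = 1138       # pairs covered without any filter

# (pairs covered, evaluations kept) per filter, copied from the table on this slide.
COVERAGE = {
    "no filter": (1138, 5690),
    "Common Sense (1)": (1074, 1909),
    "+Background (2)": (1018, 1558),
    "+Lexical (3)": (215, 215),
    "All (4)": (183, 183),
}


def unit_costs(filter_name):
    """Return (cost per pair, cost per evaluation) in US$, rounded to 3 decimals."""
    pairs, evaluations = COVERAGE[filter_name]
    return round(TOTAL_PAID_USD / pairs, 3), round(TOTAL_PAID_USD / evaluations, 3)


def estimate_all_pairs(filter_name, redundancy=1):
    """Estimated cost in whole US$ of covering all pairs at the filter's cost per pair."""
    per_pair, _ = unit_costs(filter_name)
    return round(per_pair * ALL_PAIRS) * redundancy


for name in COVERAGE:
    print(name, unit_costs(name))
print(estimate_all_pairs("All (4)"))
print(estimate_all_pairs("All (4)", redundancy=5))
```

The redundancy estimate multiplies the rounded single-coverage estimate by five, which matches the US$ 3,920 stated on the slide.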
33. Conclusion
● AMT answers are of varying quality. But this is true for many communities, too.
● With moderate filtering ("+Background"), we achieved a quality comparable to the InPhO community.
● With 5 evaluations per pair, we still covered 89% of all pairs with this filter.
● The resulting InPhO taxonomy is online:
http://inpho.cogs.indiana.edu/amt_taxonomy
● No need for existing data, gold standards or training data (besides the filter pairs).
● No need for a community?
34. Thank you
Questions?
Kai Eckert
kai@informatik.uni-mannheim.de
http://www.slideshare.net/kaiec
"Computer ethics doesn't exist. Blue is black and red is blood on the internet. Nobody cares, because they are lonely."
Anonymous Mechanical Turk Worker