Crowdsourcing the Assembly of Concept Hierarchies

Kai Eckert¹, Mathias Niepert¹, Christof Niemann¹,
Cameron Buckner², Colin Allen², Heiner Stuckenschmidt¹

¹ University of Mannheim, Germany
² Indiana University, USA

Presentation: Kai Eckert
Wednesday, June 23, 2010

Joint Conference on Digital Libraries (JCDL), Brisbane, Australia, 2010

Motivation
● Various types of Concept Hierarchies:

● Thesauri
● Taxonomies
● Classifications
● Ontologies
● ...
● Manual creation is expensive.

● Automatic creation lacks quality.

Could the users do the work?
● Divide the work between a lot of users.

● Motivate them to be part of a community.

● Achieve quality control by means of redundancy.

● Can a concept hierarchy be created the way Wikipedia is?

● The Indiana Philosophy Ontology Project.

● A browsable taxonomy of philosophical ideas.

● Ideas are extracted from the Stanford Encyclopedia of
Philosophy (SEP).

● Intuitive access to the SEP via the InPhO taxonomy.

● Entry point for other philosophical resources on the web.

From the SEP to InPhO
1) Start with a hand-built formal ontology describing major topics and sub-topics.
2) Extraction of new ideas and relationships.
3) Gathering community feedback about ideas and relationships.
4) Process feedback and infer positions in the classification tree.

Gathering community feedback

● Relatedness
● Relative Generality (e.g. "is more specific than")

Great stuff, but...
● What if you do not have a motivated community of expert users?

● Well,...

● Like almost everything,
you can buy it
at Amazon...

● Amazon Mechanical Turk

Amazon Mechanical Turk (AMT)

● Platform for posting and completing Human Intelligence Tasks (HITs).
● 100,000 – 400,000 HITs available.
● Number of workers: unknown (100,000 in 100 countries according to the New York Times, 2007).

HIT Definition
Time allotted per assignment: Maximum time
a worker can work on a single task.

Worker restrictions: Approval Rate, Location

Reward per assignment: How much do you pay for
each HIT?

Number of assignments per HIT: How many unique
workers do you want to work on each HIT?
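The four settings above can be captured in a small record. This is a minimal sketch and not the AMT API: the class `HitDefinition` and its field names are hypothetical, and the example values (time limit, approval rate, reward) are placeholders rather than the settings used in the experiment. Only the 5 assignments per HIT match the experimental setup.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class HitDefinition:
    """Hypothetical container for the settings of one HIT (not the AMT API)."""
    time_allotted_seconds: int      # maximum time a worker may spend on one assignment
    min_approval_rate: int          # worker restriction: minimum approval rate in percent
    allowed_locations: tuple        # worker restriction: country codes, empty = no restriction
    reward_per_assignment: Decimal  # payment in US$ for one completed assignment
    assignments_per_hit: int        # number of unique workers per HIT

    def reward_total(self) -> Decimal:
        """Reward paid if every assignment of this HIT is completed (fees excluded)."""
        return self.reward_per_assignment * self.assignments_per_hit


# Placeholder values for illustration only.
hit = HitDefinition(
    time_allotted_seconds=600,
    min_approval_rate=95,
    allowed_locations=(),
    reward_per_assignment=Decimal("0.10"),
    assignments_per_hit=5,
)
print(hit.reward_total())
```

The number of assignments per HIT multiplies the reward, so redundancy is the main cost driver of a HIT.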

HIT Result

Answer of each worker for each HIT

Accept Time, Submit Time, Work Time In
Seconds

Worker ID

Our questions
Can we replace the InPhO community with workers from
Amazon Mechanical Turk?

How much does it cost and what is the resulting
quality?

Experimental Setup
● We wanted some overlap among the experts:

Minimum overlap i    1       2       3     4     5
Number of pairs      3,237   1,154   370   187   92

● We chose the 1,154 pairs (minimum overlap i=2).

● Each pair was evaluated by 5 different workers.

● Each worker evaluated at least 12 pairs (1 HIT).

● 87 distinct workers.

● The HITs were completed in 20 hours.

Measuring Agreement
● Calculation of the distance between two answers:

● Relatedness: Absolute value of the difference
● Relative Generality: Match: 0, otherwise: 1
● The evaluation deviation is the mean distance of a user
to the users in a reference group.
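The two distance measures and the evaluation deviation can be sketched as follows. The data layout (one dictionary per user, mapping a concept pair to an answer), the function names, and the integer relatedness ratings are assumptions made for illustration. The sketch pools all comparisons with all reference users into one mean; averaging per reference user first would be an equally plausible reading of the definition.

```python
from statistics import mean


def relatedness_distance(a: int, b: int) -> int:
    """Distance between two relatedness ratings: absolute difference."""
    return abs(a - b)


def generality_distance(a: str, b: str) -> int:
    """Distance between two relative-generality answers: 0 on a match, 1 otherwise."""
    return 0 if a == b else 1


def evaluation_deviation(user, reference_group, distance):
    """Mean distance of `user` to the reference users on the pairs both evaluated.

    `user` and each member of `reference_group` map a concept pair to an answer.
    Returns None if the user shares no pair with the reference group.
    """
    distances = [
        distance(answer, ref[pair])
        for ref in reference_group
        for pair, answer in user.items()
        if pair in ref
    ]
    return mean(distances) if distances else None


# Made-up answers for illustration.
experts = [{("A", "B"): 4, ("C", "D"): 1}, {("A", "B"): 3}]
worker = {("A", "B"): 2, ("C", "D"): 1}
print(evaluation_deviation(worker, experts, relatedness_distance))

experts_g = [{("A", "B"): "more specific"}, {("A", "B"): "more general"}]
worker_g = {("A", "B"): "more specific"}
print(evaluation_deviation(worker_g, experts_g, generality_distance))
```

With the 0/1 distance, the deviation for relative generality is simply the fraction of comparisons in which the user disagrees with a reference user.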

Comparison with Experts
(Relative Generality)

[Figure: histogram of the fraction of users in % (y-axis, 0 to 30) over the deviation from the experts (x-axis, 0.1 to 1.0, from "Follow Experts" to "Own Opinion"), for InPhO users and AMT users, with a "Random Clicker" marker.]

● InPhO users are quite consistent.
● AMT users are not consistent.
→ Are there good ones? Yes, there are!
→ But which ones?

Mixed Results...

Can we just use the good ones?

Telling the good from the bad

● First approach: Filtering by working time

● Hypothesis 1: Workers who take some time to think before
they answer give better answers.

● Hypothesis 2: Some workers probably give quick,
random responses.
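A filter by working time could look like the sketch below. The record layout (worker ID plus working time in seconds per submitted assignment) follows the fields of a HIT result, but the function names, the threshold, and the sample records are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_work_time(assignments):
    """Average working time in seconds per worker.

    `assignments` is an iterable of (worker_id, work_time_in_seconds) tuples,
    one per submitted assignment.
    """
    times = defaultdict(list)
    for worker_id, seconds in assignments:
        times[worker_id].append(seconds)
    return {worker_id: mean(values) for worker_id, values in times.items()}


def workers_above(assignments, min_seconds):
    """Workers whose average working time per HIT exceeds `min_seconds`."""
    averages = average_work_time(assignments)
    return {worker_id for worker_id, avg in averages.items() if avg > min_seconds}


# Made-up records for illustration.
records = [("w1", 30), ("w1", 50), ("w2", 200), ("w3", 90), ("w3", 150)]
print(sorted(workers_above(records, 60)))  # ['w2', 'w3']
```

Raising the threshold keeps fewer workers, so any gain in quality has to be weighed against the evaluations that are lost.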

Filtering by working time

[Figure: bar chart of the number of users (right axis, 0 to 100) against the average working time for one HIT (12 pairs), with thresholds in seconds on the x-axis, overlaid with the deviation from the experts (left axis, 0 to 1.5). User counts range from 84 down to 3; the plotted deviation values lie between 0.64 and 1.48.]

Telling the good from the bad

● Second approach: Filtering by comparison with a hidden
gold standard.

● Test pairs:

● Social Epistemology – Epistemology (P1)
● Computer Ethics – Ethics (P2)
● Chinese Room Argument – Chinese Philosophy (P3)
● Dualism – Philosophy of Mind (P4)

Applying filters
● Test pairs:
● Social Epistemology – Epistemology (P1)
● Computer Ethics – Ethics (P2)
● Chinese Room Argument – Chinese Philosophy (P3)
● Dualism – Philosophy of Mind (P4)
● Filters:
1) P1 and P2 are correct (Common Sense)
2) Like 1), additionally P4 is correct (+Background)
3) Like 1), additionally P3 is correct (+Lexical)
4) All have to be correct (All)
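The four filters can be expressed as sets of required test pairs. In the sketch below, the gold answers in `GOLD` are assumptions, since the slides do not state which answers count as correct; the answer labels, the function names, and the rule that an unseen test pair counts as not passed are also hypothetical.

```python
# Hypothetical gold answers for the relative-generality question; the slides
# do not state the expected answers, so these values are assumptions.
GOLD = {
    "P1": "more specific",
    "P2": "more specific",
    "P3": "incomparable",
    "P4": "more specific",
}

# Test pairs each filter requires, as listed on the slide.
FILTERS = {
    "Common Sense (1)": {"P1", "P2"},
    "+Background (2)": {"P1", "P2", "P4"},
    "+Lexical (3)": {"P1", "P2", "P3"},
    "All (4)": {"P1", "P2", "P3", "P4"},
}


def passes(worker_answers, required, gold=GOLD):
    """True if the worker answered every required test pair like the gold standard.

    A test pair the worker never saw counts as not passed (an assumption).
    """
    return all(worker_answers.get(pair) == gold[pair] for pair in required)


def apply_filter(workers, filter_name):
    """IDs of the workers who pass the named filter."""
    required = FILTERS[filter_name]
    return {wid for wid, answers in workers.items() if passes(answers, required)}


# Made-up workers: w2 falls for the lexical trap in P3, w3 fails P1.
workers = {
    "w1": dict(GOLD),
    "w2": {**GOLD, "P3": "more specific"},
    "w3": {**GOLD, "P1": "more general"},
}
for name in FILTERS:
    print(name, sorted(apply_filter(workers, name)))
```

Because filters 2) and 3) both extend filter 1), every worker who passes a stricter filter also passes "Common Sense".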

Filter results for relatedness

Filter              Users   Deviation   Max. Dev.
All (4)                 7        0.60        1.00
+Lexical (3)           10        0.87        1.78
+Background (2)        23        0.84        1.41
Common Sense (1)       40        1.11        1.96
All AMT                87        1.39        2.96
All InPhO              25        0.77        1.75
Random                ---        1.8          ---

Filter results for relative generality

Filter              Users     Deviation   Max. Dev.
All (4)             7(5)           0.12        0.22
+Lexical (3)        10(8)          0.14        0.27
+Background (2)     23(20)         0.15        0.45
Common Sense (1)    40(35)         0.21        0.59
All AMT             87(78)         0.45        1.00
All InPhO           25             0.23        0.47
Random              ---            0.75         ---

Financial considerations
Filter              Pairs   Evaluations   Cost per Pair   Cost per Evaluation
---                 1,138         5,690      US$ 0.111             US$ 0.022
Common Sense (1)    1,074         1,909      US$ 0.117             US$ 0.066
+Background (2)     1,018         1,558      US$ 0.124             US$ 0.081
+Lexical (3)          215           215      US$ 0.586             US$ 0.586
All (4)               183           183      US$ 0.689             US$ 0.689

● Overall payments: 126 US$

● Estimation for all pairs with filter "All (4)": 784 US$

● Estimation for all pairs with redundancy (5x): 3,920 US$.
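The figures in the table are consistent with dividing the overall payment of 126 US$ by the number of pairs or evaluations that survive each filter. The sketch below reproduces that arithmetic; the variable and function names are hypothetical, and the way the two estimates are derived (rounded cost per pair times 1,138 pairs, then times 5) is inferred from the numbers rather than stated on the slide.

```python
TOTAL_PAYMENT = 126.0  # overall payments in US$
ALL_PAIRS = 1138       # pairs in the unfiltered row

# filter name -> (pairs still covered, evaluations remaining)
ROWS = {
    "---": (1138, 5690),
    "Common Sense (1)": (1074, 1909),
    "+Background (2)": (1018, 1558),
    "+Lexical (3)": (215, 215),
    "All (4)": (183, 183),
}


def cost_row(pairs, evaluations):
    """Cost per pair, cost per evaluation (US$, 3 decimals) and pair coverage."""
    return (
        round(TOTAL_PAYMENT / pairs, 3),
        round(TOTAL_PAYMENT / evaluations, 3),
        pairs / ALL_PAIRS,
    )


for name, (pairs, evaluations) in ROWS.items():
    per_pair, per_evaluation, coverage = cost_row(pairs, evaluations)
    print(f"{name:18} {per_pair:.3f} {per_evaluation:.3f} {coverage:.0%}")

# Extrapolation for filter "All (4)": one surviving evaluation for every pair,
# then five surviving evaluations for every pair.
estimate_once = round(cost_row(*ROWS["All (4)"])[0] * ALL_PAIRS)
estimate_five = estimate_once * 5
print(estimate_once, estimate_five)
```

The coverage column shows the trade-off: the "+Background (2)" filter still covers about 89% of the pairs, while "All (4)" leaves only 183 of 1,138.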

Conclusion
● AMT answers are of varying quality, but this is true for many communities, too.
● With moderate filtering ("+Background"), we achieved a quality comparable to the InPhO community.
● With 5 evaluations per pair, we still covered 89% of all pairs with this filter.
● The resulting InPhO taxonomy is online: http://inpho.cogs.indiana.edu/amt_taxonomy
● No need for existing data, gold standards or training data (besides the filter pairs).
● No need for a community?

Thank you

Questions?

Kai Eckert
kai@informatik.uni-mannheim.de
http://www.slideshare.net/kaiec

"Computer ethics doesn't exist. Blue is
black and red is blood on the internet.
Nobody cares, because they are lonely."

Anonymous Mechanical Turk Worker

Photo Credits
● Michal Zacharzewski (Title Crowd), http://www.sxc.hu/profile/mzacha
● Peter Suneson (Crowd silhouette), http://www.sxc.hu/profile/CMSeter
● Alaa Hamed (Egyptian Coins), http://www.sxc.hu/profile/alaasafei
● Piotr Lewandowski (Money), http://www.sxc.hu/profile/LeWy2005
● Asif Akbar (Clock), http://www.sxc.hu/profile/asifthebes
● Zern Liew (Traffic Cone), http://www.sxc.hu/profile/eidesign
● Peter Gustafson (Counting Fingers), http://www.sxc.hu/profile/liaj
● Kostya Kisleyko (Yes No), http://www.sxc.hu/profile/dlnny
● Sergio Roberto Bichara (Barcode), http://www.sxc.hu/profile/srbichara
● Maggie Molloy (Icons), http://www.sxc.hu/profile/agthabrown
● Sanja Gjenero (World with Crowd), http://www.sxc.hu/profile/lusi
● Wikimedia Commons (The Turk), http://en.wikipedia.org/wiki/File:Kempelen_chess1.jpg
