3. Dimensions of human computation
See also [Quinn & Bederson, 2012]
• What is outsourced
– Tasks that require human skills that cannot be easily replicated by
machines (visual recognition, language understanding, knowledge
acquisition, basic human communication, etc.)
– Sometimes only certain steps of a task are outsourced to humans,
while the rest is executed automatically
• How is the task being outsourced
– Tasks broken down into smaller units undertaken in parallel by
different people
– Coordination required to handle cases with more complex workflows
– Partial or independent answers consolidated and aggregated into
complete solution
4. Dimensions of human computation (2)
See also [Quinn & Bederson, 2012]
• How are the results validated
– Solution space: closed (choice of the correct
answer) vs open (collection of potential solutions)
– Performance objectively measured or through ratings/votes
– Statistical techniques employed to predict accurate solutions
• May take into account confidence values of algorithmically
generated solutions (see the aggregation sketch after this slide)
• How can the overall process be optimized
– Incentives and motivators (altruism, entertainment, intellectual challenge,
social status, competition, financial compensation)
– Assigning tasks to people based on their skills and performance (as
opposed to random assignments)
– Symbiotic combinations of human- and machine-driven computation,
including combinations of different forms of crowdsourcing
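As an illustration of this kind of statistical aggregation, here is a minimal sketch that consolidates human votes with one algorithmically generated candidate weighted by its confidence value. The function name, the weighting scheme, and the example values are assumptions made for illustration, not a method prescribed by [Quinn & Bederson, 2012]:

```python
from collections import defaultdict

def aggregate(crowd_votes, machine_answer=None, machine_confidence=0.0):
    # crowd_votes: list of answers given by human contributors
    # machine_answer / machine_confidence: optional algorithmic suggestion
    # with a confidence value in [0, 1]
    scores = defaultdict(float)
    for answer in crowd_votes:
        scores[answer] += 1.0                         # each human vote counts as one point
    if machine_answer is not None:
        scores[machine_answer] += machine_confidence  # fractional "vote" for the machine
    return max(scores, key=scores.get)                # predicted accurate solution

# Two humans say "class", one says "instance"; the classifier prefers
# "instance" with confidence 0.4 -> the human majority still wins.
print(aggregate(["class", "instance", "class"], "instance", 0.4))  # -> class
```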
5. Games with a purpose (GWAP)
See also [von Ahn & Dabbish, 2008]
• Human computation disguised as casual games
• Tasks are divided into parallelizable atomic units
(challenges) solved (consensually) by players
• Game models
– Single vs multi-player
– Selection-agreement vs input-agreement vs
inversion-problem games
6. Dimensions of GWAP design
• What tasks are amenable to 'GWAP-ification'
– Work is decomposable into simpler (nested) tasks
– Performance is measurable according to an obvious rewarding scheme
– Skills can be arranged in a smooth learning curve
– Player retention vs repetitive tasks
• Note: Not all domains are equally appealing
– Application domain needs to attract a large user base
– Knowledge corpus has to be large enough to avoid repetitions
– Quality of automatically computed input may hamper game
experience
• Attracting and retaining players
– You need a critical mass of players to validate the results
– Advertisement, building upon an existing user base
– Continuous development
7. Microtask crowdsourcing
• Similar types of tasks, but different incentives
model (monetary reward)
• Successfully applied to transcription,
classification, content generation, data
collection, image tagging, website feedback,
usability tests, …
8. Our experiment
• Goals
– Compare the two approaches for a given task
(ontology engineering)
– More general: description framework to compare
different human computation models and use
them in combination
• Set-up
– Re-build OntoPronto within Amazon’s Mechanical
Turk, based on existing OntoPronto data
9. OntoPronto
• Goal: extend Proton upper-
level ontology
• Multi-player (a single-player mode
uses pre-recorded rounds)
– Step 1: topic of Wikipedia
article classified as class or
instance
– Step 2: browsing the Proton
hierarchy from the root to
identify the most specific
class that matches the topic
of the article
• Consensual answers,
additional points for more
specific classes
10. Validation of players' inputs
• A topic is played at least six times
• Number of consensual answers to each
question at least four
• The number of consensual answers, weighted by
player reliability, must exceed half of the total
number of answers received
– Reliability measures the relation between the
consensual and the correct answers given by a player
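A minimal sketch of how this validation rule could be implemented. The data layout, the default reliability value, and the interpretation of the reliability term as a per-player weighting are assumptions for illustration, not the authors' actual code:

```python
from collections import Counter

MIN_PLAYS = 6       # a topic is played at least six times
MIN_CONSENSUS = 4   # at least four matching answers per question

def validate(answers, reliability):
    # answers: list of (player_id, answer) pairs for one question
    # reliability: dict mapping player_id -> reliability in [0, 1]
    if len(answers) < MIN_PLAYS:
        return None
    counts = Counter(answer for _, answer in answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count < MIN_CONSENSUS:
        return None
    # weight every consensual vote by the reliability of the player who gave it
    weighted = sum(reliability.get(player, 0.5)
                   for player, answer in answers if answer == top_answer)
    if weighted <= len(answers) / 2:
        return None
    return top_answer

# Example: six plays, four agree, reliable players -> the answer is accepted
print(validate([("p1", "class"), ("p2", "class"), ("p3", "instance"),
                ("p4", "class"), ("p5", "class"), ("p6", "instance")],
               {"p1": 0.9, "p2": 0.8, "p4": 0.9, "p5": 0.7}))  # -> class
```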
11. Evaluation and collected data
• 270 distinct players, 365 Wikipedia articles,
2905 game rounds
• Approach is effective
– 77% of challenges solved consensually
– When agreement was reached, most answers were correct (97%)
• …and efficient
– 122 classes and entities extending Proton (after
validation)
12. Implementation through MTurk
• Server-side component
– Generates new HITs
– Evaluates assignments of
existing HITs
• Two types of HITs
– Class or instance (1 cent)
– Proton class (5 cents)
• HITs generated using title,
first paragraph and first
image (if available)
• Qualification test with
five questions; only turkers
with at least 90% accepted
tasks were admitted
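A hedged sketch of how such HITs could be posted programmatically. The original experiment used the MTurk API of the time; the boto3 calls, the question XML file name, and the concrete title/reward values below are assumptions for a modern re-implementation, and the five-question qualification test would additionally require a custom qualification type:

```python
import boto3

# Sandbox endpoint; drop endpoint_url to post to the live marketplace.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion/ExternalQuestion XML built from the article's title,
# first paragraph and first image (file name is illustrative).
question_xml = open("class_or_instance_question.xml").read()

hit = mturk.create_hit(
    Title="Is this Wikipedia topic a class or an instance?",
    Description="Read the excerpt and decide whether the topic is a class or an individual.",
    Keywords="ontology, classification, wikipedia",
    Reward="0.01",                      # 1 cent for the class-or-instance HIT type
    MaxAssignments=7,                   # see the assignment formula on the next slide
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=[{
        # built-in qualification: share of previously approved assignments >= 90%
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [90],
    }],
)
print(hit["HIT"]["HITId"])
```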
13. Implementation through MTurk (2)
• Multiple assignments per HIT, four consensual
answers needed
– Maximum assignments per HIT: (number of answers needed
for consensus - 1) x (number of available answer options) + 1
• HITs with (four) consensual answers are
considered completed
• Assignments matching consensus accepted
• A HIT costs at most (number of answers needed
for consensus) x (reward per correct assignment)
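A quick worked example of the two formulas, using the four-answer consensus threshold and the rewards from the previous slide; the number of answer options for the Proton-class HIT is an assumption, since it depends on the branching of the hierarchy at each step:

```python
def max_assignments(consensus_needed, answer_options):
    # smallest number of assignments at which, by the pigeonhole principle,
    # some answer must have reached the consensus threshold
    return (consensus_needed - 1) * answer_options + 1

def max_hit_cost(consensus_needed, reward):
    # only assignments that match the consensus are accepted and paid
    return consensus_needed * reward

# class-or-instance HIT: 2 answer options, 1 cent reward
print(max_assignments(4, 2), max_hit_cost(4, 0.01))  # 7 assignments, at most $0.04

# Proton-class HIT: 5 cent reward; assume 5 subclasses to choose from at one step
print(max_assignments(4, 5), max_hit_cost(4, 0.05))  # 16 assignments, at most $0.20
```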
15. Development time and costs per
contribution
• OntoPronto: five months of development
• MTurk: one month
– Some additional effort required because of the
setting of the experiment
– Less effort overall, as HIT design and validation
mechanisms were adopted from OntoPronto
• Average cost for a correct answer on MTurk:
$0.74
16. Quality of contributions
• Both approaches resulted in high-quality data
• Diversity and biases (270 players vs 16 turkers)
– Additional functionality of MTurk
• Game-based approach is economical in the long
run if a player-retention strategy is available
• Microtask-based approach uses a 'predictable'
motivation framework
• MTurk crowd was markedly less diverse (16 turkers
vs 270 players)
17. Challenges and open questions
• Synchronous vs asynchronous modes of
interaction
– Consensual answers, ratings by other turkers?
• Executing inter-dependent tasks in MTurk
– Mapping game steps into HITs
– Grouping HITs
• Using game-like interfaces within microtask
crowdsourcing platforms
– Impact on incentives and turkers' behavior?
• Using MTurk to test GWAP design decisions
18. Challenges and open questions (2)
• Descriptive framework for classification of human
computation systems
– Types of tasks and their mode of execution
– Participants and their roles
– Interaction with system and among participants
– Validation of results
– Consolidation and aggregation of inputs into complete
solution
• Reusable collection of algorithms for quality assurance,
task assignment, workflow management, results
consolidation, etc.
• Schemas recording provenance of crowdsourced data
19. S. Thaler, E. Simperl, S. Wölger. An experiment in
comparing human computation techniques. IEEE
Internet Computing, 16(5): 52-58, 2012
For more information
email: e.simperl@soton.ac.uk
twitter: @esimperl
20. Theory and practice of social machines
http://sociam.org/www2013/
Deadline: 25.02.2013