Slides on how to use crowdsourcing and Amazon's Mechanical Turk for collecting online data, particularly for psychologists. Presented at the Online Data Collection Workshop at ICSTE in Lisbon, Portugal on Jan 9, 2012.
Conducting Behavioral Research with Crowdsourcing Platforms Like Mechanical Turk
1. Conducting Behavioral Research
with Crowdsourcing
(especially Amazon’s Mechanical Turk)
Winter Mason, Stevens Institute of Technology
Siddharth Suri, Yahoo! Research
2. Outline
Peer Production vs. Human Computation vs. Crowdsourcing
Peer Production & Citizen Science
Crowdsourcing
Mechanical Turk Basics
Internal HITs
Preference Elicitation
Surveys
External HITs
Random Assignment
Synchronous Experiments
Conclusion
3. Definitions
Peer Production
Creation through distributed contributions
Human Computation
Computation with “humans in the loop” (Law & von Ahn, ‘11)
Crowdsourcing
Jobs outsourced to a group through an open call (Howe ‘06)
4. Examples of Modern Peer Production
Peer production:
Open source software: Linux, Apache, Firefox
Mash-ups
Prediction markets: Iowa Electronic Markets, Hollywood Stock Exchange
Collaborative knowledge: Wikipedia, Intellipedia, Yahoo! Answers, Amazon, Yelp, Epinions
Social tagging communities: Flickr, Del.icio.us
Crowdsourcing:
ESP Game, Fold.it!, galaxyZoo, threadless, Tagasauris, Innocentive, TopCoder, oDesk, Mechanical Turk
5. ESP Game
A two-player online game
Players do not know who they are playing with
Players cannot communicate
Object of the game: type the same word given an image
7. Games With a Purpose
The output of the ESP Game is labeled images.
Google licensed the ESP Game (as Google Image Labeler) and has used it to improve image search.
The contributions of the crowd are completely free for Google.
8. Fold.it!
Fold.it is an online game in which players fold proteins into different configurations
Certain configurations earn more points than others
The configurations correspond to physical structures:
some amino acids must be near the center, and others outside
some pairs of amino acids must be close together, and others far apart
Players of the game recently unlocked the structure of an AIDS-related enzyme that had eluded the scientific community for a decade
9. galaxyZoo
"Citizen Science"
The number of images of galaxies taken by Hubble is immense
Computers can reliably identify whether something is a galaxy, but not what type of galaxy it is
By employing the crowd, galaxyZoo has classified over 50M galaxies
Astronomers used to assume that a galaxy that appears red is probably an elliptical galaxy
Galaxy Zoo has shown that up to a third of red galaxies are actually spirals
10. Tagasauris
Magnum Photos has a very large collection of misclassified or unclassified photos
To get a handle on it, they asked crowd workers to tag the photos
Through this process, in combination with a knowledge base, they discovered lost photos from the movie "American Graffiti": the actors were tagged individually in the photos, and the system linked the tags together and discovered they were all related to the film
11. Innocentive
A "Seeker" creates a "challenge", typically requiring serious skill and technical ability
Multiple "Solvers" submit detailed solutions to the challenge; if a solution is selected, its solver wins the (typically sizable) reward
For instance, by creating a durable, inexpensive solar flashlight that could double as a lamp, a retired engineer won $20,000 and brought lighting to many rural Africans
12. TopCoder
Programming jobs are offered as contests
Coders submit their work, and the winner earns the reward
Aside from the direct payoff, there are anecdotal reports of people being hired for permanent positions as a result of their contributions on TopCoder
13. oDesk
Skilled crowdsourcing: for any job that requires some skill but can be done entirely on a computer
Jobs are paid either as a flat, one-time reward, or on an hourly basis for longer contracts
Workers have extensive profiles and reputations, and wages are negotiated between employer and worker
Jobs cover a very large spectrum, and pay varies with skill
14. Amazon's Mechanical Turk
The original crowdsourcing platform
"The human inside the machine": built to programmatically incorporate human input
Jobs are meant to be doable by any human, and every worker is meant to be completely interchangeable
15. Generally Shared Features of Existing Systems
Contributions are highly modular
The minimal contribution is small: a single edit, a single line of code, a single tag
Low interdependence between separate contributions to the same document or function
The distribution of contributions is highly skewed
A small number of heavy contributors (Wikipedia, AMT, Digg)
A large number of "free riders", a very common feature of public goods
16. What is Mechanical Turk?
Crowdsourcing: jobs outsourced to a group through an open call (Howe '06)
Online labor market: a place for requesters to post jobs and workers to do them for pay
Participant recruitment and reimbursement
How can we use MTurk for behavioral research?
What kinds of behavioral research can we use MTurk for?
17. Why Mechanical Turk?
Subject pool size
A central place for > 100,000 workers (Pontin '07)
An always-available subject pool
Subject pool diversity
Open to anyone globally with a computer and internet connection
Low cost
Reservation wage: $1.38/hour (Chilton et al '10)
Effective wage: $4.80/hour (Ipeirotis '10)
Faster theory/experiment cycle
Hypothesis formulation
Testing & evaluation of hypothesis
New hypothesis tests
18. Validity of Worker Behavior
(Quality-controlled) worker output can be as good as that of experts, sometimes better
Labeling text with emotion (Snow et al, 2008)
Audio transcription (Marge et al, 2010)
Similarity judgments for music (Urbano et al, 2010)
Search relevance judgments (Alonso & Mizzaro, 2009)
Experiments with workers replicate studies conducted in laboratory or other online settings
Standard psychometric tests (Buhrmester et al, 2011)
Responses in judgment and decision-making tests (Paolacci et al, 2010)
Responses in public goods games (Suri & Watts, 2011)
19. Worker Demographics
Self-reported demographic information from 2,896 workers over 3 years (MW '09, MW '11, SW '10)
55% female, 45% male
Similar to other internet panels (e.g. Goldstein)
Age: mean 30 yrs, median 32 yrs
Mean income: $30,000/yr
Similar to Ipeirotis '10, Ross et al '10
20. Internal Consistency of Demographics
207 out of 2,896 workers did 2 of our studies
Only 1 inconsistency on gender, age, or income (0.4%)
31 workers did ≥ 3 of our studies
3 changed gender
1 changed age (by 6 years)
7 changed income bracket
Strong internal consistency
21. Why Do Work on Mechanical Turk?
"MTurk money is always necessary to make ends meet."
5% U.S., 13% India
"MTurk money is irrelevant."
12% U.S., 10% India
"MTurk is a fruitful way to spend free time and get some cash."
69% U.S., 59% India
(Ross et al '10, Ipeirotis '10)
22. Requesters
Companies crowdsourcing part of their business
Search companies: relevance
Online stores: similar products from different stores (identifying competition)
Online directories: accuracy and freshness of listings
Researchers
Intermediaries
CrowdFlower (formerly Dolores Labs)
Smartsheet.com
25. Soylent
Word processing with an embedded crowd (Bernstein et al, UIST 2010)
The crowd proofreads each paragraph
"Find-Fix-Verify" prevents a "lazy worker" from ruining the output
26. Find-Fix-Verify
Find: identify one area that can be shortened without changing the meaning of the paragraph
Fix: edit the highlighted section to shorten its length without changing the meaning of the paragraph
Verify: choose one rewrite that fixes style errors and one that changes the meaning
27. Iterative processes
By building on each other's work, the crowd can achieve remarkable outcomes
Some tasks benefit from iterative processes, others from parallel ones (Little et al, 2010)
28. TurkoMatic
The crowd creates workflows:
1. Ask workers to decompose the task into steps
2. Ask if a step can be completed in 10 minutes
If so, solve it
If not, decompose the sub-task further
3. Combine the outputs of the sub-tasks into the final output
(Kulkarni et al, CHI 2011)
29. Turker Community
Asymmetry in the reputation mechanism:
Workers' reputation is given by their approval rating
Requesters can reject work
Requesters can refuse workers with low approval ratings
Requesters' reputation is not built in to MTurk
Turkopticon: workers rate requesters on communicativity, generosity, fairness, and promptness
Turker Nation: an online forum for workers
Requesters should introduce themselves here
Reputation matters, so abusive studies will fail quickly
30. Anatomy of a HIT
HITs with the same title, description, pay rate, etc. are the same HIT type
HITs are broken up into Assignments
A worker cannot do more than 1 assignment of a HIT
31. Anatomy of a HIT
Requesters can set qualifications that determine who can work on the HIT
e.g., only US workers, or workers with an approval rating > 90%
33. HIT GROUP
[Diagram: one HIT group containing two HITs, each broken into assignments]
HIT 1: Which is the better translation for Táy? o Black o Night
HIT 2: Which is the better translation for Nedj? o Clean o White
Assignments of HIT 1: Alice submits "Black", Bob submits "Night", Charlie submits "Black"
34. HIT GROUP
[Diagram: the same HIT group]
HIT 1: Which is the better translation for Táy? o Black o Night
HIT 2: Which is the better translation for Nedj? o Clean o White
Assignments of HIT 2: Alice submits "White", Bob submits "White", David submits "White"
35. Requester and Worker
Requester: Build HIT → Test HIT → Post HIT → Reject or Approve HIT
Worker: Search for HITs → Accept HIT → Do work → Submit HIT
36. Lifecycle of a HIT
Requester builds a HIT
Internal HITs are hosted by Amazon
External HITs are hosted by the requester
HITs can be tested on {requester,worker}sandbox.mturk.com
Requester posts the HIT on mturk.com
Can post as many HITs as the account balance can cover
Workers do the HIT and submit work
Requester approves/rejects the work
Payment is rendered
Amazon charges requesters a 10% fee
The HIT completes when it expires or all assignments are completed
37. How Much to Pay?
Pay rate can affect the quantity of work
Pay rate does not have a big impact on quality (MW '09)
[Plots: Number of Tasks Completed vs. Pay per Task; Accuracy vs. Pay per Task]
38. Completion Time
Three 6-question multiple-choice surveys
Launched at the same time of day and day of week
Paying $0.01, $0.03, $0.05
Past a threshold, pay rate does not increase speed
Start with a low pay rate and work up
40. Internal HITs on AMT
Template tool
Variables
Preference Elicitation
Honesty study
41. AMT Templates
• Hosted by Amazon
• Set parameters for HIT
• Title
• Description
• Keywords
• Reward
• Assignments per HIT
• Qualifications
• Time per assignment
• HIT expiration
• Auto-approve time
• Design an HTML form
42. Variables in Templates
Example: preference elicitation
The template's input file supplies values for the variables ${movie1} and ${movie2} in each HIT:
HIT 1: img1.jpg, img2.jpg
HIT 2: img1.jpg, img3.jpg
HIT 3: img1.jpg, img4.jpg
HIT 4: img2.jpg, img3.jpg
HIT 5: img2.jpg, img4.jpg
HIT 6: img3.jpg, img4.jpg
The HIT asks: Which would you prefer to watch?
<img src=www.sid.com/${movie1}>
<img src=www.sid.com/${movie2}>
43. Variables in Templates
Example: preference elicitation, as rendered for workers
HIT 1: Which would you prefer to watch? [shows img1.jpg vs. img2.jpg]
HIT 6: Which would you prefer to watch? [shows img3.jpg vs. img4.jpg]
A sketch of the form behind this template follows.
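A minimal sketch of what the HTML form behind this template might look like. It is an illustration, not the deck's actual code: the host www.sid.com comes from the slide, while the field name "preference" and the radio-button layout are assumptions. MTurk substitutes each ${...} variable from the matching row of the input file.

<p>Which would you prefer to watch?</p>
<!-- MTurk fills in ${movie1} and ${movie2} from the input file -->
<img src="http://www.sid.com/${movie1}">
<img src="http://www.sid.com/${movie2}">
<!-- The checked radio button is returned as this assignment's answer -->
<input type="radio" name="preference" value="movie1"> The first
<input type="radio" name="preference" value="movie2"> The second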
45. Cross-Cultural Studies: 2 Methods
Self-reported: ask workers demographic questions, then run the experiment
Qualifications: restrict HITs to workers' country of origin using MTurk qualifications
Honesty experiment: ask workers to roll a die (or go to a website that simulates one) and pay $0.25 times the self-reported roll.
46. One die, $0.25 + $0.25/pip
The average reported roll was significantly higher than the 3.5 expected from a fair die: M = 3.91, p < 0.0005
Players under-reported ones and twos and over-reported fives
Replicates Fischbacher & Föllmi-Heusi
47. Dishonesty by Gender
Men are more likely to over-report sixes
Women are more likely to over-report fives
48. Dishonesty by Country
Indians are more likely to over-report sixes
Americans are more likely to over-report fives
This might be confounded with gender
51. External HITs on AMT
Flexible survey
Random Assignment
Synchronous Experiments
Security
52. Random Assignment
One HIT, multiple Assignments
Only post once, or delete repeat submissions
Keep the preview page neutral for all conditions
Once the HIT is accepted:
If the worker is new, record the WorkerId and AssignmentId and assign to a condition
If returning, retrieve the condition and "push" the worker to the last-seen state of the study
Wage conditions: pay the difference through bonuses
Intent to treat: keep track of attrition by condition
Example: noisy sites decrease reading comprehension, BUT you find no difference between conditions. Why? Most people in the noisy condition dropped out; the only people left were deaf!
53. Javascript on Internal HIT
<div id="page"></div>
<script type="text/javascript">
// Randomly assign this worker to one of two conditions
var condition = Math.floor(Math.random() * 2);
var pagetext;
switch (condition)
{
case 0:
pagetext = "Condition 1";
break;
case 1:
pagetext = "Condition 2";
break;
}
// The div must already exist when this runs, so it sits above the script
document.getElementById("page").innerHTML = pagetext;
</script>
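The drawn condition exists only in the page unless it is also written into the form that gets submitted with the assignment. A tiny addition along these lines would capture it; the hidden field name "condition" is an illustrative choice, not anything MTurk requires:

<input type="hidden" id="cond" name="condition" value="">
<script type="text/javascript">
// Record the assigned condition so it comes back with the assignment's results
document.getElementById("cond").value = condition;
</script>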
54. Privacy survey
External HIT
Random order of answers
Random order of questions
Pop-out questions based on answers
Changed the wording of a question from the Annenberg study:
Do you want the websites you visit to show you ads that are {tailored, relevant} to your interests?
55. Results
Replicated the original study
Found an effect of the difference in wording
[Chart: share answering Yes / No / Maybe for the Annenberg sample, MTurk, and MTurk with the "relevant" wording]
56. Results
Replicated the original study
Found an effect of the difference in wording
BUT
Not a representative sample
Results were not replicated in a subsequent phone survey
[Chart: share answering Yes / No / Maybe for the Annenberg sample, MTurk, and MTurk with the "relevant" wording]
57. Financial Incentives & the Performance of Crowds
Manipulated:
Task value: amount earned per image set ($0.01, $0.05, $0.10)
Difficulty: number of images per set (2, 3, 4), with no additional pay for larger image sets
Measured:
Quantity: number of image sets submitted
Quality: proportion of image sets correctly sorted, and rank correlation of image sets with the correct order
59. Results
Pay rate can affect the quantity of work
Pay rate does not have a big impact on quality (MW '09)
[Plots: Number of Tasks Completed vs. Pay per Task; Accuracy vs. Pay per Task]
60. Quality Assurance
Majority vote (Snow, O'Connor, Jurafsky, & Ng, 2008)
Machine learning with responses (Sheng, Provost, & Ipeirotis, 2008)
Iterative vs. parallel tasks (Little, Chilton, Goldman, & Miller, 2010)
Mutual information (Ipeirotis, Provost, & Wang, 2010)
Verifiable answers (Kittur, Chi, & Suh, 2008)
Time to completion
Honeypot tasks
Monitor discussion on forums. In MW '11, players followed guidelines about what not to talk about.
62. Synchronous Experiments
Example research questions:
Market behavior under a new mechanism
Network dynamics (e.g., contagion)
Multi-player games
Typical tasks on MTurk don't depend on each other, so they can be split up and done in parallel
How does one get many workers to do an experiment at the same time?
Panel
Waiting room
63. Social Dilemmas in Networks
A social dilemma occurs when the interest of the individual is at odds with the interest of the collective
On social networking sites, one's contributions are only seen by friends
E.g. photos on Flickr, status updates on Facebook
More contributions mean a more engaged group, which is better for everyone
So why contribute when one can free ride?
64. [Figure: the network topologies studied: Cycle, Cliques, Paired Cliques, Small World, Random Regular]
65. Effect of Seed Nodes
• 10 seeds: 13 trials; 0 seeds: 17 trials
• Only human contributions are included in the averages
• People are conditional cooperators (Fischbacher et al. '01)
66. Building the Panel
Do experiments requiring 4-8 fresh players
Waiting time is not too high
Fewer consequences if there is a bug
Ask if they would like to be notified of future studies
85% opt-in rate for SW '10
78% opt-in rate for MW '11
67. NotifyWorkers
An MTurk API call that sends an e-mail to workers
Notify them a day early
Experiments work well 11am-5pm EST
If n subjects are needed, notify 3n
We have run experiments with 45 players simultaneously
A sketch of the call follows.
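For illustration, here is how the notification might look with the current AWS SDK for JavaScript. The deck predates this SDK, so this is a modern-day assumption rather than the interface the slides describe; the WorkerIds are placeholders:

// Hypothetical sketch using the AWS SDK for JavaScript (v2)
var AWS = require('aws-sdk');
var mturk = new AWS.MTurk({ region: 'us-east-1' });

var panel = ['WORKER_ID_1', 'WORKER_ID_2']; // opted-in WorkerIds from the panel

mturk.notifyWorkers({
  Subject: 'Study tomorrow at 11am EST',
  MessageText: 'A new synchronous experiment starts tomorrow at 11am EST.',
  WorkerIds: panel // at most 100 per call, so batch larger panels
}, function (err) {
  if (err) console.error(err);
});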
68. Waiting Room
Workers need to start a synchronous experiment at the same time, but they show up at slightly different times
Have workers wait at a page until enough arrive
Show them how many players they are waiting for
After enough arrive, tell the rest the experiment is full
Funnel extra players into another instance of the experiment
A client-side sketch follows.
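A minimal client-side sketch of such a waiting room. Everything here is an assumption for illustration: the /waiting-room/status endpoint, its JSON shape, and the group size of 16 would all live on the requester's own server for an external HIT; none of it is MTurk API.

// Hypothetical waiting-room poll; assumes a <div id="waiting"></div> on the page
// and a server endpoint that reports how many players have arrived.
var NEEDED = 16;                 // players required to start one instance
var workerId = 'WORKER_ID';      // in practice, parsed from the HIT's URL parameters

function poll() {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/waiting-room/status?workerId=' + workerId, true);
  xhr.onload = function () {
    var status = JSON.parse(xhr.responseText);
    if (status.full) {
      // This instance filled up first: funnel the worker to another instance
      window.location = '/experiment/overflow';
    } else if (status.arrived >= NEEDED) {
      window.location = '/experiment/start';
    } else {
      document.getElementById('waiting').innerHTML =
        'Waiting for ' + (NEEDED - status.arrived) + ' more players...';
      setTimeout(poll, 2000);    // poll again in 2 seconds
    }
  };
  xhr.send();
}
poll();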
69. Attrition
In lab experiments subjects rarely walk out
On the web:
Browsers/computers crash
Internet connections go down
Bosses walk in
Need a timeout and a default action
Discard experiments with < 90% human actions
SW '10 discarded 21 of 94 experiments with 20-24 people
Discard experiments where one player acted < 50% of the time
MW '11 discarded 43 of 232 experiments with 16 people
70. Security of External HITs
Code security
Your code is exposed to the entire internet and is susceptible to attacks
SQL injection attacks: a malicious user inputs database code to damage or gain access to your database
Scrub input for DB commands (see the sketch below)
Cross-site scripting attacks (XSS): a malicious user injects code into an HTTP request or HTML form
Scrub input, including the _GET and _POST variables
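One way to scrub input for DB commands is to never splice worker input into SQL at all. The deck's _GET/_POST variables suggest a PHP stack (where PDO prepared statements serve the same purpose); purely as an illustration, here is the idea with Node.js and the mysql package, both of which are assumptions:

// Hypothetical sketch: bound parameters keep worker-supplied input
// out of the SQL string itself, defusing injection attempts.
var mysql = require('mysql');
var db = mysql.createConnection({ host: 'localhost', user: 'study', database: 'experiment' });

function saveAnswer(workerId, answer) {
  // BAD:  db.query("INSERT INTO answers VALUES ('" + workerId + "','" + answer + "')");
  // GOOD: the driver escapes each bound value
  db.query('INSERT INTO answers (worker_id, answer) VALUES (?, ?)',
           [workerId, answer],
           function (err) { if (err) console.error(err); });
}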
72. Security of External HITs
Protocol security
HITs vs. Assignments: a worker cannot do two assignments of one HIT, but nothing stops a worker from doing several HITs
If you want fresh players in different runs (HITs) of a synchronous experiment, you need to check WorkerIds yourself
Cautionary tale: we made a synchronous experiment with many HITs, one assignment each; one worker accepted most of the HITs, did the quiz, and got paid
A sketch of the WorkerId check follows.
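A minimal sketch of that check, with an in-memory object standing in for what would be a database table of previously seen WorkerIds:

// Hypothetical guard for fresh players across runs; seenWorkers would be
// a persistent table in a real deployment, not an in-memory object.
var seenWorkers = {};

function admitWorker(workerId) {
  if (seenWorkers[workerId]) {
    return false;              // played in an earlier run: politely turn them away
  }
  seenWorkers[workerId] = true;
  return true;
}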
73. Use Cases
Internal HITs:
Pilot surveys
Preference elicitation
Training data for machine learning algorithms
"Polling" for wisdom of crowds / general knowledge
External HITs:
Testing market mechanisms
Behavioral game theory experiments
User-generated content
Effects of incentives
ANY online study can be done on Turk
Can be used as a recruitment tool
74. Thank you!
Mason, W., & Suri, S. (2011). Conducting Behavioral Research on Amazon's Mechanical Turk. Behavior Research Methods.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1691163
75. Main API Functions
CreateHIT(Requirements, Pay rate, Description) – returns a HIT Id and a HIT Type Id
SubmitAssignment(AssignmentId) – notifies Amazon that this assignment has been completed
ApproveAssignment(AssignmentId) – the requester accepts the assignment and money is transferred; there is also RejectAssignment
GrantBonus(WorkerId, Amount, Message) – gives the worker the specified bonus and sends the message; should have a failsafe
NotifyWorkers(list of WorkerIds, Message) – e-mails the message to the workers
A short usage sketch follows.
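For concreteness, a sketch of the approve-and-bonus flow using the current AWS SDK for JavaScript. The deck predates this SDK (and in it GrantBonus is named SendBonus), so treat this as an illustration rather than the API the slides describe; the IDs are placeholders:

// Hypothetical sketch with the AWS SDK for JavaScript (v2)
var AWS = require('aws-sdk');
var mturk = new AWS.MTurk({ region: 'us-east-1' });

mturk.approveAssignment({ AssignmentId: 'ASSIGNMENT_ID' }, function (err) {
  if (err) return console.error(err);
  // Pay a wage-condition difference as a bonus, as slide 52 recommends
  mturk.sendBonus({
    WorkerId: 'WORKER_ID',
    AssignmentId: 'ASSIGNMENT_ID',
    BonusAmount: '0.50',         // dollars, passed as a string
    Reason: 'Bonus for the high-wage condition'
  }, function (err) { if (err) console.error(err); });
});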
76. Command-line Tools
Configuration files
mturk.properties – for interacting with the MTurk API
[task name].input – variable names & values by row
[task name].properties – HIT parameters
[task name].question – XML file
Shell scripts
run.sh – post the HIT to Mechanical Turk (creates the .success file)
getResults.sh – download results (using the .success file)
reviewResults.sh – approve or reject assignments
approveAndDeleteResults.sh – approve & delete all unreviewed HITs
Output files
[task name].success – created HIT IDs & Assignment IDs
[task name].results – tab-delimited output from workers
78. [task name].properties
title: Categorize Web Sites
description: Look at URLs, rate, and classify them. These websites have not been screened for adult content!
keywords: URL, categorize, web sites
reward: 0.01
assignments: 10
annotation:
# this Assignment Duration value is 30 * 60 = 0.5 hours
assignmentduration:1800
# this HIT Lifetime value is 60*60*24*3 = 3 days
hitlifetime:259200
# this Auto Approval period is 60*60*24*15 = 15 days
autoapprovaldelay:1296000
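For completeness, a sketch of the companion [task name].question file under MTurk's 2005-10-01 QuestionForm schema. The identifier and wording are illustrative, and ${url} assumes a matching column in [task name].input:

<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>category</QuestionIdentifier>
    <QuestionContent>
      <!-- ${url} is substituted from the matching row of [task name].input -->
      <Text>Please categorize this web site: ${url}</Text>
    </QuestionContent>
    <AnswerSpecification>
      <FreeTextAnswer/>
    </AnswerSpecification>
  </Question>
</QuestionForm>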