Slides on how to use crowdsourcing and Amazon's Mechanical Turk for collecting online data, particularly for psychologists. Presented at the Online Data Collection Workshop at ICSTE in Lisbon, Portugal on Jan 9, 2012.
Conducting Behavioral Research with Crowdsourcing Platforms Like Mechanical Turk
1. Conducting Behavioral Research
with Crowdsourcing
(especially Amazon’s Mechanical Turk)
Winter Mason, Stevens Institute of Technology
Siddharth Suri, Yahoo! Research
2. Outline
Peer Production vs. Human Computation vs. Crowdsourcing
Peer Production & Citizen Science
Crowdsourcing
Mechanical Turk Basics
Internal HITs
Preference Elicitation
Surveys
External HITs
Random Assignment
Synchronous Experiments
Conclusion
3. Definitions
Peer Production
Creation through distributed contributions
Human Computation
Computation with “humans in the loop” (Law & von Ahn, ‘11)
Crowdsourcing
Jobs outsourced to a group through an open call (Howe ‘06)
4. Examples of Modern Peer Production
Peer production:
Open source software: Linux, Apache, Firefox
Mash-ups
Prediction markets: Iowa Electronic Markets, Hollywood Stock Exchange
Collaborative knowledge: Wikipedia, Intellipedia, Yahoo! Answers, Amazon, Yelp, Epinions
Social tagging communities: Flickr, Del.icio.us
Crowdsourcing:
ESP Game, Fold.it!, galaxyZoo, threadless, Tagasauris, Innocentive, TopCoder, oDesk, Mechanical Turk
5. ESP Game
A two-player online game
Players do not know who they are playing with
Players cannot communicate
Object of the game: type the same word given an image
7. Games With a Purpose
The output of the ESP Game is labeled images.
Google licensed the ESP Game (as Google Image Labeler) and has used it to improve image search.
The contributions of the crowd are completely free for Google.
8. Fold.it!
Fold.it is an online game in which players fold proteins into different configurations
Certain configurations earn more points than others
The configurations correspond to physical structures:
some amino acids must be near the center, and others outside
some pairs of amino acids must be close together, and others far apart
Players of the game recently unlocked the structure of an AIDS-related enzyme that had eluded the scientific community for a decade
9. galaxyZoo
"Citizen Science"
The number of images of galaxies taken by Hubble is immense
Computers can reliably identify whether something is a galaxy, but not what type of galaxy it is
By employing the crowd, galaxyZoo has classified over 50M galaxies
Astronomers used to assume that a galaxy that appears red is probably an elliptical galaxy
Galaxy Zoo has shown that up to a third of red galaxies are actually spirals
10. Tagasauris
Magnum Photos has a very large collection of misclassified or unclassified photos
To get a handle on it, they asked crowd workers to tag the photos
Through this process, in combination with a knowledge base, they discovered lost photos from the movie "American Graffiti": the actors were tagged individually in the photos, and the system linked the tags together and discovered they were all related to the film
11. Innocentive
A "Seeker" creates a "challenge", typically requiring serious skill and technical ability
Multiple "Solvers" submit detailed solutions to the challenge; if a solution is selected, its solver wins the (typically sizable) reward
For instance, by creating a durable, inexpensive solar flashlight that could double as a lamp, a retired engineer won $20,000 and brought lighting to many rural Africans
12. TopCoder
Programming jobs are offered as contests
Coders submit their work, and the winner earns the reward
Aside from the direct payoff, there are anecdotal reports of people being hired for permanent positions as a result of their contributions on TopCoder
13. oDesk
Skilled crowdsourcing: for any job that requires some skill but can be done entirely on a computer
Jobs are paid either as a flat, one-time reward, or on an hourly basis for longer contracts
Workers have extensive profiles and reputations, and wages are negotiated between employer and worker
Jobs cover a very large spectrum, and pay varies with skill
14. Amazon's Mechanical Turk
The original crowdsourcing platform
"The human inside the machine": built to programmatically incorporate human input
Jobs are meant to be doable by any human, and every worker is meant to be completely interchangeable
15. Generally Shared Features of Existing Systems
Contributions are highly modular
The minimal contribution is small: a single edit, a single line of code, a single tag
Low interdependence between separate contributions to the same document or function
The distribution of contributions is highly skewed
A small number of heavy contributors (Wikipedia, AMT, Digg)
A large number of "free riders", a very common feature of public goods
16. What is Mechanical Turk?
Crowdsourcing: jobs outsourced to a group through an open call (Howe '06)
Online labor market: a place for requesters to post jobs and workers to do them for pay
Participant recruitment and reimbursement
How can we use MTurk for behavioral research?
What kinds of behavioral research can we use MTurk for?
17. Why Mechanical Turk?
Subject pool size
A central place for > 100,000 workers (Pontin '07)
An always-available subject pool
Subject pool diversity
Open to anyone globally with a computer and internet connection
Low cost
Reservation wage: $1.38/hour (Chilton et al '10)
Effective wage: $4.80/hour (Ipeirotis '10)
Faster theory/experiment cycle
Hypothesis formulation
Testing & evaluation of hypothesis
New hypothesis tests
18. Validity of Worker Behavior
(Quality-controlled) worker output can be as good as that of experts, sometimes better
Labeling text with emotion (Snow et al, 2008)
Audio transcription (Marge et al, 2010)
Similarity judgments for music (Urbano et al, 2010)
Search relevance judgments (Alonso & Mizzaro, 2009)
Experiments with workers replicate studies conducted in laboratory or other online settings
Standard psychometric tests (Buhrmester et al, 2011)
Responses in judgment and decision-making tests (Paolacci et al, 2010)
Responses in public goods games (Suri & Watts, 2011)
19. Worker Demographics
Self-reported demographic information from 2,896 workers over 3 years (MW '09, MW '11, SW '10)
55% female, 45% male
Similar to other internet panels (e.g. Goldstein)
Age: mean 30 yrs, median 32 yrs
Mean income: $30,000/yr
Similar to Ipeirotis '10, Ross et al '10
20. Internal Consistency of Demographics
207 out of 2,896 workers did 2 of our studies
Only 1 inconsistency on gender, age, or income (0.4%)
31 workers did ≥ 3 of our studies
3 changed gender
1 changed age (by 6 years)
7 changed income bracket
Strong internal consistency
21. Why Do Work on Mechanical Turk?
"MTurk money is always necessary to make ends meet."
5% U.S., 13% India
"MTurk money is irrelevant."
12% U.S., 10% India
"MTurk is a fruitful way to spend free time and get some cash."
69% U.S., 59% India
(Ross et al '10, Ipeirotis '10)
22. Requesters
Companies crowdsourcing part of their business
Search companies: relevance
Online stores: similar products from different stores (identifying competition)
Online directories: accuracy and freshness of listings
Researchers
Intermediaries
CrowdFlower (formerly Dolores Labs)
Smartsheet.com
25. Soylent
Word processing with an embedded crowd (Bernstein et al, UIST 2010)
The crowd proofreads each paragraph
"Find-Fix-Verify" prevents a "lazy worker" from ruining the output
26. Find-Fix-Verify
Find: identify one area that can be shortened without changing the meaning of the paragraph
Fix: edit the highlighted section to shorten its length without changing the meaning of the paragraph
Verify: choose one rewrite that fixes style errors and one that changes the meaning
27. Iterative processes
By building on each other's work, the crowd can achieve remarkable outcomes
Some tasks benefit from iterative processes, others from parallel ones (Little et al, 2010)
28. TurkoMatic
The crowd creates workflows:
1. Ask workers to decompose the task into steps
2. Ask if a step can be completed in 10 minutes
If so, solve it
If not, decompose the sub-task further
3. Combine the outputs of the sub-tasks into the final output
(Kulkarni et al, CHI 2011)
29. Turker Community
Asymmetry in the reputation mechanism:
Workers' reputation is given by their approval rating
Requesters can reject work
Requesters can refuse workers with low approval ratings
Requesters' reputation is not built in to MTurk
Turkopticon: workers rate requesters on communicativity, generosity, fairness, and promptness
Turker Nation: an online forum for workers
Requesters should introduce themselves here
Reputation matters, so abusive studies will fail quickly
30. Anatomy of a HIT
HITs with the same title, description, pay rate, etc. are the same HIT type
HITs are broken up into Assignments
A worker cannot do more than 1 assignment of a HIT
31. Anatomy of a HIT
Requesters can set qualifications that determine who can work on the HIT
e.g., only US workers, or workers with an approval rating > 90%
33. HIT GROUP
[Diagram: one HIT group containing two HITs, each broken into assignments]
HIT 1: Which is the better translation for Táy? o Black o Night
HIT 2: Which is the better translation for Nedj? o Clean o White
Assignments of HIT 1: Alice submits "Black", Bob submits "Night", Charlie submits "Black"
34. HIT GROUP
[Diagram: the same HIT group]
HIT 1: Which is the better translation for Táy? o Black o Night
HIT 2: Which is the better translation for Nedj? o Clean o White
Assignments of HIT 2: Alice submits "White", Bob submits "White", David submits "White"
35. Requester and Worker
Requester: Build HIT → Test HIT → Post HIT → Reject or Approve HIT
Worker: Search for HITs → Accept HIT → Do work → Submit HIT
36. Lifecycle of a HIT
Requester builds a HIT
Internal HITs are hosted by Amazon
External HITs are hosted by the requester
HITs can be tested on {requester,worker}sandbox.mturk.com
Requester posts the HIT on mturk.com
Can post as many HITs as the account balance can cover
Workers do the HIT and submit work
Requester approves/rejects the work
Payment is rendered
Amazon charges requesters a 10% fee
The HIT completes when it expires or all assignments are completed
37. How Much to Pay?
Pay rate can affect the quantity of work
Pay rate does not have a big impact on quality (MW '09)
[Plots: Number of Tasks Completed vs. Pay per Task; Accuracy vs. Pay per Task]
38. Completion Time
Three 6-question multiple-choice surveys
Launched at the same time of day and day of week
Paying $0.01, $0.03, $0.05
Past a threshold, pay rate does not increase speed
Start with a low pay rate and work up
40. Internal HITs on AMT
Template tool
Variables
Preference Elicitation
Honesty study
41. AMT Templates
• Hosted by Amazon
• Set parameters for HIT
• Title
• Description
• Keywords
• Reward
• Assignments per HIT
• Qualifications
• Time per assignment
• HIT expiration
• Auto-approve time
• Design an HTML form
42. Variables in Templates
Example: preference elicitation
The template's input file supplies values for the variables ${movie1} and ${movie2} in each HIT:
HIT 1: img1.jpg, img2.jpg
HIT 2: img1.jpg, img3.jpg
HIT 3: img1.jpg, img4.jpg
HIT 4: img2.jpg, img3.jpg
HIT 5: img2.jpg, img4.jpg
HIT 6: img3.jpg, img4.jpg
The HIT asks: Which would you prefer to watch?
<img src=www.sid.com/${movie1}>
<img src=www.sid.com/${movie2}>
43. Variables in Templates
Example: preference elicitation, as rendered for workers
HIT 1: Which would you prefer to watch? [shows img1.jpg vs. img2.jpg]
HIT 6: Which would you prefer to watch? [shows img3.jpg vs. img4.jpg]
A sketch of the form behind this template follows.
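A minimal sketch of what the HTML form behind this template might look like. It is an illustration, not the deck's actual code: the host www.sid.com comes from the slide, while the field name "preference" and the radio-button layout are assumptions. MTurk substitutes each ${...} variable from the matching row of the input file.

<p>Which would you prefer to watch?</p>
<!-- MTurk fills in ${movie1} and ${movie2} from the input file -->
<img src="http://www.sid.com/${movie1}">
<img src="http://www.sid.com/${movie2}">
<!-- The checked radio button is returned as this assignment's answer -->
<input type="radio" name="preference" value="movie1"> The first
<input type="radio" name="preference" value="movie2"> The second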
45. Cross-Cultural Studies: 2 Methods
Self-reported: ask workers demographic questions, then run the experiment
Qualifications: restrict HITs to workers' country of origin using MTurk qualifications
Honesty experiment: ask workers to roll a die (or go to a website that simulates one) and pay $0.25 times the self-reported roll.
46. One die, $0.25 + $0.25/pip
The average reported roll was significantly higher than the 3.5 expected from a fair die: M = 3.91, p < 0.0005
Players under-reported ones and twos and over-reported fives
Replicates Fischbacher & Föllmi-Heusi
47. Dishonesty by Gender
Men are more likely to over-report sixes
Women are more likely to over-report fives
48. Dishonesty by Country
Indians are more likely to over-report sixes
Americans are more likely to over-report fives
This might be confounded with gender
51. External HITs on AMT
Flexible survey
Random Assignment
Synchronous Experiments
Security
52. Random Assignment
One HIT, multiple Assignments
Only post once, or delete repeat submissions
Keep the preview page neutral for all conditions
Once the HIT is accepted:
If the worker is new, record the WorkerId and AssignmentId and assign to a condition
If returning, retrieve the condition and "push" the worker to the last-seen state of the study
Wage conditions: pay the difference through bonuses
Intent to treat: keep track of attrition by condition
Example: noisy sites decrease reading comprehension, BUT you find no difference between conditions. Why? Most people in the noisy condition dropped out; the only people left were deaf!
53. Javascript on Internal HIT
<div id="page"></div>
<script type="text/javascript">
// Randomly assign this worker to one of two conditions
var condition = Math.floor(Math.random() * 2);
var pagetext;
switch (condition)
{
case 0:
pagetext = "Condition 1";
break;
case 1:
pagetext = "Condition 2";
break;
}
// The div must already exist when this runs, so it sits above the script
document.getElementById("page").innerHTML = pagetext;
</script>
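The drawn condition exists only in the page unless it is also written into the form that gets submitted with the assignment. A tiny addition along these lines would capture it; the hidden field name "condition" is an illustrative choice, not anything MTurk requires:

<input type="hidden" id="cond" name="condition" value="">
<script type="text/javascript">
// Record the assigned condition so it comes back with the assignment's results
document.getElementById("cond").value = condition;
</script>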
54. Privacy survey
External HIT
Random order of answers
Random order of questions
Pop-out questions based on answers
Changed the wording of a question from the Annenberg study:
Do you want the websites you visit to show you ads that are {tailored, relevant} to your interests?
55. Results
Replicated the original study
Found an effect of the difference in wording
[Chart: share answering Yes / No / Maybe for the Annenberg sample, MTurk, and MTurk with the "relevant" wording]
56. Results
Replicated the original study
Found an effect of the difference in wording
BUT
Not a representative sample
Results were not replicated in a subsequent phone survey
[Chart: share answering Yes / No / Maybe for the Annenberg sample, MTurk, and MTurk with the "relevant" wording]
57. Financial Incentives & the Performance of Crowds
Manipulated:
Task value: amount earned per image set ($0.01, $0.05, $0.10)
Difficulty: number of images per set (2, 3, 4), with no additional pay for larger image sets
Measured:
Quantity: number of image sets submitted
Quality: proportion of image sets correctly sorted, and rank correlation of image sets with the correct order
59. Results
Pay rate can affect the quantity of work
Pay rate does not have a big impact on quality (MW '09)
[Plots: Number of Tasks Completed vs. Pay per Task; Accuracy vs. Pay per Task]
60. Quality Assurance
Majority vote (Snow, O'Connor, Jurafsky, & Ng, 2008)
Machine learning with responses (Sheng, Provost, & Ipeirotis, 2008)
Iterative vs. parallel tasks (Little, Chilton, Goldman, & Miller, 2010)
Mutual information (Ipeirotis, Provost, & Wang, 2010)
Verifiable answers (Kittur, Chi, & Suh, 2008)
Time to completion
Honeypot tasks
Monitor discussion on forums. In MW '11, players followed guidelines about what not to talk about.
62. Synchronous Experiments
Example research questions:
Market behavior under a new mechanism
Network dynamics (e.g., contagion)
Multi-player games
Typical tasks on MTurk don't depend on each other, so they can be split up and done in parallel
How does one get many workers to do an experiment at the same time?
Panel
Waiting room
63. Social Dilemmas in Networks
A social dilemma occurs when the interest of the individual is at odds with the interest of the collective
On social networking sites, one's contributions are only seen by friends
E.g. photos on Flickr, status updates on Facebook
More contributions mean a more engaged group, which is better for everyone
So why contribute when one can free ride?
64. [Figure: the network topologies studied: Cycle, Cliques, Paired Cliques, Small World, Random Regular]
65. Effect of Seed Nodes
• 10 seeds: 13 trials; 0 seeds: 17 trials
• Only human contributions are included in the averages
• People are conditional cooperators (Fischbacher et al. '01)
66. Building the Panel
Do experiments requiring 4-8 fresh players
Waiting time is not too high
Fewer consequences if there is a bug
Ask if they would like to be notified of future studies
85% opt-in rate for SW '10
78% opt-in rate for MW '11
67. NotifyWorkers
An MTurk API call that sends an e-mail to workers
Notify them a day early
Experiments work well 11am-5pm EST
If n subjects are needed, notify 3n
We have run experiments with 45 players simultaneously
A sketch of the call follows.
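For illustration, here is how the notification might look with the current AWS SDK for JavaScript. The deck predates this SDK, so this is a modern-day assumption rather than the interface the slides describe; the WorkerIds are placeholders:

// Hypothetical sketch using the AWS SDK for JavaScript (v2)
var AWS = require('aws-sdk');
var mturk = new AWS.MTurk({ region: 'us-east-1' });

var panel = ['WORKER_ID_1', 'WORKER_ID_2']; // opted-in WorkerIds from the panel

mturk.notifyWorkers({
  Subject: 'Study tomorrow at 11am EST',
  MessageText: 'A new synchronous experiment starts tomorrow at 11am EST.',
  WorkerIds: panel // at most 100 per call, so batch larger panels
}, function (err) {
  if (err) console.error(err);
});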
68. Waiting Room
Workers need to start a synchronous experiment at the same time, but they show up at slightly different times
Have workers wait at a page until enough arrive
Show them how many players they are waiting for
After enough arrive, tell the rest the experiment is full
Funnel extra players into another instance of the experiment
A client-side sketch follows.
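A minimal client-side sketch of such a waiting room. Everything here is an assumption for illustration: the /waiting-room/status endpoint, its JSON shape, and the group size of 16 would all live on the requester's own server for an external HIT; none of it is MTurk API.

// Hypothetical waiting-room poll; assumes a <div id="waiting"></div> on the page
// and a server endpoint that reports how many players have arrived.
var NEEDED = 16;                 // players required to start one instance
var workerId = 'WORKER_ID';      // in practice, parsed from the HIT's URL parameters

function poll() {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/waiting-room/status?workerId=' + workerId, true);
  xhr.onload = function () {
    var status = JSON.parse(xhr.responseText);
    if (status.full) {
      // This instance filled up first: funnel the worker to another instance
      window.location = '/experiment/overflow';
    } else if (status.arrived >= NEEDED) {
      window.location = '/experiment/start';
    } else {
      document.getElementById('waiting').innerHTML =
        'Waiting for ' + (NEEDED - status.arrived) + ' more players...';
      setTimeout(poll, 2000);    // poll again in 2 seconds
    }
  };
  xhr.send();
}
poll();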
69. Attrition
In lab experiments subjects rarely walk out
On the web:
Browsers/computers crash
Internet connections go down
Bosses walk in
Need a timeout and a default action
Discard experiments with < 90% human actions
SW '10 discarded 21 of 94 experiments with 20-24 people
Discard experiments where one player acted < 50% of the time
MW '11 discarded 43 of 232 experiments with 16 people
70. Security of External HITs
Code security
Your code is exposed to the entire internet and is susceptible to attacks
SQL injection attacks: a malicious user inputs database code to damage or gain access to your database
Scrub input for DB commands (see the sketch below)
Cross-site scripting attacks (XSS): a malicious user injects code into an HTTP request or HTML form
Scrub input, including the _GET and _POST variables
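One way to scrub input for DB commands is to never splice worker input into SQL at all. The deck's _GET/_POST variables suggest a PHP stack (where PDO prepared statements serve the same purpose); purely as an illustration, here is the idea with Node.js and the mysql package, both of which are assumptions:

// Hypothetical sketch: bound parameters keep worker-supplied input
// out of the SQL string itself, defusing injection attempts.
var mysql = require('mysql');
var db = mysql.createConnection({ host: 'localhost', user: 'study', database: 'experiment' });

function saveAnswer(workerId, answer) {
  // BAD:  db.query("INSERT INTO answers VALUES ('" + workerId + "','" + answer + "')");
  // GOOD: the driver escapes each bound value
  db.query('INSERT INTO answers (worker_id, answer) VALUES (?, ?)',
           [workerId, answer],
           function (err) { if (err) console.error(err); });
}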
72. Security of External HITs
Protocol security
HITs vs. Assignments: a worker cannot do two assignments of one HIT, but nothing stops a worker from doing several HITs
If you want fresh players in different runs (HITs) of a synchronous experiment, you need to check WorkerIds yourself
Cautionary tale: we made a synchronous experiment with many HITs, one assignment each; one worker accepted most of the HITs, did the quiz, and got paid
A sketch of the WorkerId check follows.
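A minimal sketch of that check, with an in-memory object standing in for what would be a database table of previously seen WorkerIds:

// Hypothetical guard for fresh players across runs; seenWorkers would be
// a persistent table in a real deployment, not an in-memory object.
var seenWorkers = {};

function admitWorker(workerId) {
  if (seenWorkers[workerId]) {
    return false;              // played in an earlier run: politely turn them away
  }
  seenWorkers[workerId] = true;
  return true;
}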
73. Use Cases
Internal HITs:
Pilot surveys
Preference elicitation
Training data for machine learning algorithms
"Polling" for wisdom of crowds / general knowledge
External HITs:
Testing market mechanisms
Behavioral game theory experiments
User-generated content
Effects of incentives
ANY online study can be done on Turk
Can be used as a recruitment tool
74. Thank you!
Mason, W., & Suri, S. (2011). Conducting Behavioral Research on Amazon's Mechanical Turk. Behavior Research Methods.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1691163
75. Main API Functions
CreateHIT(Requirements, Pay rate, Description) – returns a HIT Id and a HIT Type Id
SubmitAssignment(AssignmentId) – notifies Amazon that this assignment has been completed
ApproveAssignment(AssignmentId) – the requester accepts the assignment and money is transferred; there is also RejectAssignment
GrantBonus(WorkerId, Amount, Message) – gives the worker the specified bonus and sends the message; should have a failsafe
NotifyWorkers(list of WorkerIds, Message) – e-mails the message to the workers
A short usage sketch follows.
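For concreteness, a sketch of the approve-and-bonus flow using the current AWS SDK for JavaScript. The deck predates this SDK (and in it GrantBonus is named SendBonus), so treat this as an illustration rather than the API the slides describe; the IDs are placeholders:

// Hypothetical sketch with the AWS SDK for JavaScript (v2)
var AWS = require('aws-sdk');
var mturk = new AWS.MTurk({ region: 'us-east-1' });

mturk.approveAssignment({ AssignmentId: 'ASSIGNMENT_ID' }, function (err) {
  if (err) return console.error(err);
  // Pay a wage-condition difference as a bonus, as slide 52 recommends
  mturk.sendBonus({
    WorkerId: 'WORKER_ID',
    AssignmentId: 'ASSIGNMENT_ID',
    BonusAmount: '0.50',         // dollars, passed as a string
    Reason: 'Bonus for the high-wage condition'
  }, function (err) { if (err) console.error(err); });
});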
76. Command-line Tools
Configuration files
mturk.properties – for interacting with the MTurk API
[task name].input – variable names & values by row
[task name].properties – HIT parameters
[task name].question – XML file
Shell scripts
run.sh – post the HIT to Mechanical Turk (creates the .success file)
getResults.sh – download results (using the .success file)
reviewResults.sh – approve or reject assignments
approveAndDeleteResults.sh – approve & delete all unreviewed HITs
Output files
[task name].success – created HIT IDs & Assignment IDs
[task name].results – tab-delimited output from workers
78. [task name].properties
title: Categorize Web Sites
description: Look at URLs, rate, and classify them. These websites have not been screened for adult content!
keywords: URL, categorize, web sites
reward: 0.01
assignments: 10
annotation:
# this Assignment Duration value is 30 * 60 = 0.5 hours
assignmentduration:1800
# this HIT Lifetime value is 60*60*24*3 = 3 days
hitlifetime:259200
# this Auto Approval period is 60*60*24*15 = 15 days
autoapprovaldelay:1296000
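For completeness, a sketch of the companion [task name].question file under MTurk's 2005-10-01 QuestionForm schema. The identifier and wording are illustrative, and ${url} assumes a matching column in [task name].input:

<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>category</QuestionIdentifier>
    <QuestionContent>
      <!-- ${url} is substituted from the matching row of [task name].input -->
      <Text>Please categorize this web site: ${url}</Text>
    </QuestionContent>
    <AnswerSpecification>
      <FreeTextAnswer/>
    </AnswerSpecification>
  </Question>
</QuestionForm>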