2. Disclaimer
The views, opinions, positions, or strategies expressed in
this talk are mine and do not necessarily reflect the
official policy or position of Microsoft.
3. Introduction
• Crowdsourcing is hot
• Lots of interest in the research community
– Articles showing good results
– Special issues in journals (IR, IEEE Internet Computing, etc.)
– Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, CHI,
RecSys, VLDB, etc.)
– HCOMP
– CrowdConf
• Large companies leveraging crowdsourcing
• Big data
• Start-ups
• Venture capital investment
4. Crowdsourcing
• Crowdsourcing is the act of taking a
job traditionally performed by a
designated agent (usually an
employee) and outsourcing it to an
undefined, generally large group of
people in the form of an open call.
• The application of Open Source
principles to fields outside of
software.
• Most successful story: Wikipedia
8. Human computation
• Not a new idea
• Computers before computers
• You are a human computer
9. Some definitions
• Human computation is a computation
that is performed by a human
• Human computation system is a system
that organizes human efforts to carry
out computation
• Crowdsourcing is a tool that a human
computation system can use to
distribute tasks.
Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
10. More examples
• ESP game
• Captcha: 200M every day
• ReCaptcha: 750M to date
11. Data is king
• Massive free Web data
changed how we train
learning systems
• Crowds provide new access
to cheap & labeled big data
• But quality also matters
M. Banko and E. Brill. “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL 2001.
A. Halevy, P. Norvig, and F. Pereira. “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems 2009.
12. Traditional Data Collection
• Setup data collection software / harness
• Recruit participants / annotators / assessors
• Pay a flat fee for experiment or hourly wage
• Characteristics
– Slow
– Expensive
– Difficult and/or Tedious
– Sample Bias…
13. Natural Language Processing
• MTurk annotation for 5 NLP tasks
• 22K labels for US $26
• High agreement between consensus labels and
gold-standard labels
• Workers as good as experts
R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert
Annotations for Natural Language Tasks”. EMNLP-2008.
14. Machine Translation
• Manual evaluation
on translation quality
is slow and expensive
• High agreement
between non-experts
and experts
• $0.10 to translate a
sentence
C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality
Using Amazon’s Mechanical Turk”, EMNLP 2009.
15. Soylent
M. Bernstein et al. “Soylent: A Word Processor with a Crowd Inside”, UIST 2010
16. Mechanical Turk
• Amazon Mechanical Turk
(AMT, MTurk,
www.mturk.com)
• Crowdsourcing platform
• On-demand workforce
• “Artificial artificial
intelligence”: get humans to
do hard part
• Named after a faux automaton
of the 18th century
22. Flip a coin
• Please flip a coin and report the results
• Two questions
1. Coin type?
2. Heads or tails?
• Results

Outcome     Count
head        57
tail        43
Total       100

Coin type   Count
Dollar      56
Euro        11
Other       30
(blank)     3
Total       100
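The result tables above are simple tallies over worker responses. A minimal sketch: the per-coin outcome breakdown below is hypothetical, invented only so the totals match the slide (57 heads / 43 tails; 56 Dollar, 11 Euro, 30 Other, 3 blank).

```python
from collections import Counter

# Hypothetical raw responses from 100 workers; each reported a coin
# type and an outcome (a few left the coin type blank).
responses = (
    [("Dollar", "head")] * 32 + [("Dollar", "tail")] * 24 +
    [("Euro", "head")] * 6 + [("Euro", "tail")] * 5 +
    [("Other", "head")] * 17 + [("Other", "tail")] * 13 +
    [("", "head")] * 2 + [("", "tail")] * 1
)

# Tally outcomes and coin types separately, mapping the empty
# coin-type field to the "(blank)" row from the slide.
outcome_counts = Counter(outcome for _, outcome in responses)
coin_counts = Counter(coin or "(blank)" for coin, _ in responses)

print(outcome_counts)  # head/tail split
print(coin_counts)     # coin-type split
```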
23. Why is this interesting?
• Easy to prototype and test new experiments
• Cheap and fast
• No need to setup infrastructure
• Introduce experimentation early in the cycle
• For new ideas, this is very helpful
24. Caveats and clarifications
• Trust and reliability
• Wisdom of the crowd, revisited
• Adjust expectations
• Crowdsourcing is another data point for your
analysis
• Complementary to other experiments
25. Why now?
• The Web
• Use humans as processors in a distributed
system
• Address problems that computers aren’t good at
• Scale
• Reach
26. Who are the workers?
• A. Baio. “The Faces of Mechanical Turk”, November 2008.
• P. Ipeirotis. “The New Demographics of Mechanical Turk”, March 2010.
• J. Ross et al. “Who are the Crowdworkers?”, CHI 2010.
31. Careful with That Axe Data, Eugene
• In the area of big data and machine learning:
– labels -> features -> predictive model -> optimization
• Labeling/experimentation perceived as boring
• Don’t rush labeling
– Human and machine
• Label quality is very important
– Don’t outsource it
– Own it end to end
– Large scale
32. More on label quality
• Data gathering is not a free lunch
• Labels for the machine != labels for humans
• Emphasis on algorithms,
models/optimizations and mining from labels
• Not so much on algorithms for ensuring high-
quality labels
• Training sets
36. Motivating Example: Relevance Judging
• Relevance of search results is difficult to judge
– Highly subjective
– Expensive to measure
• Professional editors commonly used
• Potential benefits of crowdsourcing
– Scalability (time and cost)
– Diversity of judgments
39. Results for {idiot}
February 2011: 5/7 (R), 2/7 (NR)
Relevant
1. Most of the time those TV reality stars have absolutely no talent. They do whatever they
can to make a quick dollar. Most of the time the reality tv stars don not have a mind of
their own. R
2. Most are just celebrity wannabees. Many have little or no talent, they just want fame. R
3. Have you seen the knuckledraggers on reality television? They should be required to change
their names to idiot after appearing on the show. You could put numbers after the word
idiot so we can tell them apart. R
4. Although I have not followed too many of these shows, those that I have encountered have
for a great part a very common property. That property is that most of the participants
involved exhibit a shallow self-serving personality that borders on social pathological
behavior. To perform or act in such an abysmal way could only be an act of an idiot. R
5. I can see this one going both ways. A particular sort of reality star comes to mind,
though, one who was voted off Survivor because he chose not to use his immunity necklace.
Sometimes the label fits, but sometimes it might be unfair. R
Not Relevant
1. Just because someone else thinks they are an "idiot", doesn't mean that is what the word
means. I don't like to think that any one person's photo would be used to describe a
certain term. NR
2. While some reality-television stars are genuinely stupid (or cultivate an image of
stupidity), that does not mean they can or should be classified as "idiots." Some simply
act that way to increase their TV exposure and potential earnings. Other reality-television
stars are really intelligent people, and may be considered as idiots by people who don't
like them or agree with them. It is too subjective an issue to be a good result for a
search engine. NR
40. You have a new idea
• Novel IR technique
• Don’t have access to click data
• Can’t hire editors
• How to test new ideas?
41. Crowdsourcing and relevance evaluation
• Subject pool access: no need to come into the
lab
• Diversity
• Low cost
• Agile
42. Pedal to the metal
• You read the papers
• You tell your boss (or advisor) that
crowdsourcing is the way to go
• You now need to produce hundreds of
thousands of labels per month
• Easy, right?
43. Ask the right questions
• Instructions are key
• Workers are not IR experts, so don’t assume
they share your terminology
• Show examples
• Hire a technical writer
• Prepare to iterate
44. How not to do things
• A lot of work for a few cents
• Go here, go there, copy, enter, count …
45. UX design
• Time to apply all those usability concepts
• Need to grab attention
• Generic tips
– Experiment should be self-contained.
– Keep it short and simple.
– Be very clear with the task.
– Engage with the worker. Avoid boring stuff.
– Always ask for feedback (open-ended question) in an
input box.
• Localization
46. Payments
• How much is a HIT?
• Delicate balance
– Too little, no interest
– Too much, attract spammers
• Heuristics
– Start with something and wait to see if there is
interest or feedback (“I’ll do this for X amount”)
– Payment based on user effort. Example: $0.04 (2 cents
to answer a yes/no question, 2 cents if you provide
feedback that is not mandatory)
• Bonus
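The effort-based heuristic can be expressed as a tiny payment function, following the slide’s $0.04 example (2 cents for the yes/no answer, 2 more for optional feedback); the function name and defaults are hypothetical.

```python
def hit_payment(base_cents=2, feedback_cents=2, gave_feedback=False):
    """Effort-based payment: a base rate for the yes/no answer plus a
    small bonus if the worker leaves the (optional) free-text feedback."""
    return base_cents + (feedback_cents if gave_feedback else 0)

print(hit_payment(gave_feedback=True))   # → 4 cents, i.e. $0.04
print(hit_payment(gave_feedback=False))  # → 2 cents
```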
48. Quality control
• Extremely important part of the experiment
• Approach it as “overall” quality – not just for
workers
• Bi-directional channel
– You may think the worker is doing a bad job.
– The same worker may think you are a lousy
requester.
• Test with a gold standard
49. When to assess work quality?
• Beforehand (prior to main task activity)
– How: “qualification tests” or similar mechanism
– Purpose: screening, selection, recruiting, training
• During
– How: assess labels as worker produces them
– Like random checks on a manufacturing line
– Purpose: calibrate, reward/penalize, weight
• After
– How: compute accuracy metrics post-hoc
– Purpose: filter, calibrate, weight, retain
50. How do we measure work quality?
• Compare worker’s label vs.
– Known (correct, trusted) label
– Other workers’ labels
– Model predictions of workers and labels
• Verify worker’s label
– Yourself
– Tiered approach (e.g. Find-Fix-Verify)
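The comparisons above (worker vs. gold, worker vs. other workers) can be sketched in a few lines; `accuracy_vs_gold` and `majority_label` are hypothetical helper names, and the data is a toy example.

```python
from collections import Counter

def accuracy_vs_gold(worker_labels, gold):
    """Fraction of a worker's labels matching trusted gold labels."""
    scored = [(item, lab) for item, lab in worker_labels.items() if item in gold]
    if not scored:
        return None  # no gold overlap: cannot score this worker
    return sum(lab == gold[item] for item, lab in scored) / len(scored)

def majority_label(labels):
    """Consensus of several workers' labels; None on a tie (gray area)."""
    (top, n), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n:
        return None
    return top

gold = {"q1": "R", "q2": "NR"}
worker = {"q1": "R", "q2": "NR", "q3": "R"}   # q3 has no gold label
print(accuracy_vs_gold(worker, gold))          # → 1.0
print(majority_label(["R", "R", "NR"]))        # → 'R'
```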
51. Methods for measuring agreement
• Inter-agreement level
– Agreement between judges
– Agreement between judges and the gold set
• Some statistics
– Cohen’s kappa (2 raters)
– Fleiss’ kappa (any number of raters)
– Krippendorff’s alpha
• Gray areas
– 2 workers say “relevant” and 3 say “not relevant”
– 2-tier system
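For the two-rater case, Cohen’s kappa can be computed from scratch as observed agreement corrected for chance agreement; the rater vectors below are made up for illustration.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category
    # if each labeled independently at their own marginal rates.
    expected = sum(pa[c] * pb[c] for c in pa) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = ["R", "R", "NR", "R", "NR", "NR", "R", "R"]
rater2 = ["R", "NR", "NR", "R", "NR", "R", "R", "R"]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.467
```

For many raters or missing labels, Fleiss’ kappa or Krippendorff’s alpha are the appropriate generalizations.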
52. Content quality
• People like to work on things that they like
• Content and judgments should reflect the
present day
– TREC data set: airport-security docs are pre-9/11
• Document length
• Randomize content
• Avoid worker fatigue
– Judging 100 documents on the same subject can be
tiring, leading to decreasing quality
53. Was the task difficult?
Ask workers to rate difficulty of a search topic
50 topics; 5 workers, $0.01 per task
54. So far …
• One may say “this is all good but looks like a
ton of work”
• The original goal: data is king
• Data quality and experimental designs are
preconditions to make sure we get the right
stuff
• Don’t cut corners
55. Pause
• Crowdsourcing works
– Fast turnaround, easy to experiment, few dollars to test
– But: you have to design experiments carefully, quality,
platform limitations
• Crowdsourcing in production
– Large scale data sets
– Continuous execution
– Difficult to debug
• How do you know the experiment is working?
• Goal: a framework for ensuring the reliability
of crowdsourcing tasks
O. Alonso, C. Marshall, and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results”. http://research.microsoft.com/apps/pubs/default.aspx?id=219755
56. Labeling tweets – an example of a task
• Is this tweet interesting?
• Subjective activity
• Not focused on specific events
• Findings
– Difficult problem, low inter-rater agreement
– Tested many designs, number of workers, platforms
(MTurk and others)
• Multiple contingent factors
– Worker performance
– Work
– Task design
O. Alonso, C. Marshall, and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
57. Designs that include in-task CAPTCHA
• Borrowed idea from reCAPTCHA -> use of
control term
• HIDDEN
• Adapt your labeling task
• 2 more questions as control
– 1 algorithmic
– 1 semantic
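A minimal sketch of the control-question idea: drop submissions whose control answers are wrong. All field names, questions, and answers here are invented for illustration.

```python
# Hypothetical HIT results: each submission carries the main label plus
# two in-task control questions with known answers (one algorithmic,
# e.g. "3 + 4 = ?", and one semantic, e.g. "which word names an animal?").
CONTROL_ANSWERS = {"ctrl_algo": "7", "ctrl_sem": "cat"}

submissions = [
    {"worker": "w1", "label": "interesting", "ctrl_algo": "7", "ctrl_sem": "cat"},
    {"worker": "w2", "label": "not interesting", "ctrl_algo": "7", "ctrl_sem": "dog"},
    {"worker": "w3", "label": "interesting", "ctrl_algo": "9", "ctrl_sem": "cat"},
]

def passes_controls(sub):
    """Keep a submission only if both control answers are correct."""
    return all(sub[q] == a for q, a in CONTROL_ANSWERS.items())

trusted = [s for s in submissions if passes_controls(s)]
print([s["worker"] for s in trusted])  # → ['w1']
```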
58. Production example #1
Q1 (k = 0.91, alpha = 0.91)
Q2 (k = 0.771, alpha = 0.771)
Q3 (k = 0.033, alpha = 0.035)
Tweet de-branded
In-task captcha
The main question
59. Production example #2
Q1 (k = 0.907, alpha = 0.907)
Q2 (k = 0.728, alpha = 0.728)
• Q3 Worthless (alpha = 0.033)
• Q3 Trivial (alpha = 0.043)
• Q3 Funny (alpha = -0.016)
• Q3 Makes me curious (alpha = 0.026)
• Q3 Contains useful info (alpha = 0.048)
• Q3 Important news (alpha = 0.207)
Tweet de-branded
In-task captcha
Breakdown by categories to get better signal
60. Once we get here
• High quality labels
• Data will later be used for rankers, ML
models, evaluations, etc.
• Training sets
• Scalability and repeatability
62. Algorithms
• Bandit problems; explore-exploit
• Optimizing the amount of work done by workers
– Humans have limited throughput
– Harder to scale than machines
• Selecting the right crowds
• Stopping rule
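The bandit framing can be sketched as epsilon-greedy routing: mostly exploit the worker with the best observed accuracy on gold checks, but keep exploring the others. Worker names and statistics below are hypothetical.

```python
import random
from collections import Counter

def epsilon_greedy_pick(stats, eps=0.1):
    """With probability eps explore a random worker; otherwise exploit
    the worker with the best observed accuracy on gold checks."""
    if random.random() < eps:
        return random.choice(list(stats))
    # max(..., 1) avoids dividing by zero for a never-checked worker.
    return max(stats, key=lambda w: stats[w][0] / max(stats[w][1], 1))

# stats[worker] = (correct answers on gold checks, gold checks seen)
stats = {"w1": (9, 10), "w2": (5, 10), "w3": (2, 10)}
random.seed(0)
picks = Counter(epsilon_greedy_pick(stats) for _ in range(1000))
print(picks.most_common(1)[0][0])  # the most accurate worker wins most picks
```

A stopping rule then decides when enough labels have been collected per item, e.g. once the leading label’s margin is large enough.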
63. Humans in the loop
• Computation loops that mix humans and
machines
• A form of active learning
• Double goal:
– Human checking on the machine
– Machine checking on humans
• Example: classifiers for social data
64. Routing
• Expertise detection and routing
• Social load balancing
• When to switch between machines and
humans
• CrowdSTAR
B. Nushi, O. Alonso, M. Hentschel, V. Kandylas. “CrowdSTAR: A Social Task Routing
Framework for Online Communities”, 2014. http://arxiv.org/abs/1407.6714
65. Social Task Routing
[Figure: tasks routed across crowds (Crowd 1: Twitter; Crowd 2: Quora) and within each crowd, guided by crowd summaries (C1, C2)]
67. Conclusions
• Crowdsourcing at scale works but requires a solid
framework
• Fast turnaround, easy to experiment, few dollars to test
• But you have to design the experiments carefully
• Usability considerations
• Lots of opportunities to improve current platforms
• Three aspects that need attention: workers, work and
task design
• Labeling social data is hard
68. Conclusions – II
• Important to know your limitations and be
ready to collaborate
• Lots of different skills and expertise required
– Social/behavioral science
– Human factors
– Algorithms
– Economics
– Distributed systems
– Statistics