Toward Better Crowdsourcing Science
(& Predicting Annotator Performance)
Matt Lease
School of Information
University of Texas at Austin
ir.ischool.utexas.edu
@mattlease
ml@utexas.edu
Slides: www.slideshare.net/mattlease
“The place where people & technology meet”
~ Wobbrock et al., 2009
www.ischools.org
The Future of Crowd Work, CSCW’13
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
3
Matt Lease <ml@utexas.edu>
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
4
Matt Lease <ml@utexas.edu>
Roadmap
Hyun Joon Jung
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
5
Matt Lease <ml@utexas.edu>
Roadmap
A Popular Tale of Crowdsourcing Woe
• Heroic ML researcher asks the
crowd to perform a simple task
• Crowd (invariably) screws it up…
• “Aha!” cries the ML researcher, “Fortunately,
I know exactly how to solve this problem!”
Matt Lease <ml@utexas.edu>
6
Matt Lease <ml@utexas.edu>
7
But why can’t the workers just get it
right to begin with?
Matt Lease <ml@utexas.edu>
8
Is everyone just lazy, stupid, or deceitful?!?
Much of our literature
seems to suggest this:
• Cheaters
• Fraudsters
• “Lazy Turkers”
• Scammers
• Spammers
Another story (a parable)
“We had a great software interface, but we went
out of business because our customers were too
stupid to figure out how to use it.”
Moral
• Even if a user were stupid or lazy, we still lose
• By accepting our own responsibility, we create
another opportunity to fix the problem…
– Cynical view: idiot-proofing
Matt Lease <ml@utexas.edu>
9
What is our responsibility?
• Ill-defined/incomplete/ambiguous/subjective task?
• Confusing, difficult, or unusable interface?
• Incomplete or unclear instructions?
• Insufficient or unhelpful examples given?
• Gold standard with low or unknown inter-assessor
agreement (i.e. measurement error in assessing
response quality)?
• Task design matters! (garbage in = garbage out)
– Report it for review, completeness, & reproducibility
Matt Lease <ml@utexas.edu>
10
A Few Simple Suggestions (1 of 2)
1. Make task self-contained: everything the worker
needs to know should be visible in-task
2. Short, simple, & clear instructions with examples
3. Avoid domain-specific & advanced terminology;
write for typical people (e.g., your mom)
4. Engage worker / avoid boring stuff. If possible,
select interesting content for people to work on
5. Always ask for open-ended feedback
Matt Lease <ml@utexas.edu>
11
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
Suggested Sequencing (2 of 2)
1. Simulate first draft of task with your in-house personnel.
Assess, revise, & iterate (ARI)
2. Run task using relatively few workers & examples (ARI)
1. Do workers understand the instructions?
2. How long does it take? Is pay effective & ethical?
3. Replicate results on another dataset (generalization). (ARI)
4. [Optional] qualification test. (ARI)
5. Increase items. Look for boundary items & noisy gold (ARI)
6. Increase # of workers (ARI)
Matt Lease <ml@utexas.edu>
12
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
Toward Better Crowdsourcing Science
Goal: Strengthen individual studies and minimize
unwarranted spread of bias in our scientific literature
• Occam’s Razor: avoid making assumptions beyond
what the data actually tells us (avoid prejudice!)
• Enumerate hypotheses for possible causes of low data
quality, assess supporting evidence for each hypothesis,
and for any claims made, cite supporting evidence
• Recognize uncertainty of analyses and convey this via
hedge statements such as, “the data suggests that…”
• Avoid derogatory language use without very strong
supporting evidence. The crowd enables our work!!
– Acknowledge your workers!
Matt Lease <ml@utexas.edu>
13
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
14
Matt Lease <ml@utexas.edu>
Roadmap
Who are
the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
15
Matt Lease <ml@utexas.edu>
CACM August, 2013
16
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
Matt Lease <ml@utexas.edu>
• “Contribute to society and human well-being”
• “Avoid harm to others”
“As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with membership in the ACM”
17
Matt Lease <ml@utexas.edu>
“Which approaches are less expensive and is this sensible? With the advent of
outsourcing and off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning exploitation…
“…legal, social, professional and ethical [topics] should feature in all computing degrees.”
2008 ACM/IEEE Curriculum Update
• Mistakes are made in HIT rejection & worker blocking
– e.g., student error, bug, poor task design, noisy gold, etc.
• Workers have limited recourse for appeal
• Our errors impact real people’s lives
• What is the loss function to optimize?
• Should anyone hold researchers accountable? IRB?
• How do we balance the risk of human harm vs.
the potential benefit if our research succeeds?
Power Asymmetry on MTurk
18
Matt Lease <ml@utexas.edu>
ACM: “Contribute to society and human
well-being; avoid harm to others”
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
• Does it matter if work is performed by
– Political refugees? Children? Prisoners? Disabled?
• What (if any) moral obligation do crowdsourcing
researchers have to consider broader impacts of
our research (either good or bad) on the lives of
those we depend on to power our systems?
Matt Lease <ml@utexas.edu>
19
Who Are We Building a Better Future For?
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders
of innovative technologies, rather than employers
unconcerned with working conditions.”
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of the
people we ask to power our computing?”
20
Could Effective Human Computation
Sometimes Be a Bad Idea?
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
21
Matt Lease <ml@utexas.edu>
Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
22
Matt Lease <ml@utexas.edu>
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
23
Matt Lease <ml@utexas.edu>
Roadmap
Hyun Joon Jung
Quality Control in Crowdsourcing
7/10/2015 24
[Figure: requester → online marketplace → crowd workers]
Existing quality control methods: task design, workflow design, label aggregation, worker management
Who is more accurate? (worker performance estimation and prediction)
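Of the existing methods above, label aggregation is the easiest to make concrete. Below is a minimal sketch (not from the talk; the item IDs and labels are hypothetical) of majority-vote aggregation over redundant worker labels:

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Aggregate redundant worker labels per item by simple majority vote.

    labels_by_item: dict mapping item id -> list of labels from different workers.
    Returns a dict mapping item id -> consensus label (ties broken arbitrarily).
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Hypothetical example: three workers judge two documents for relevance (1/0).
print(majority_vote({"doc1": [1, 1, 0], "doc2": [0, 0, 1]}))
# -> {'doc1': 1, 'doc2': 0}
```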
Motivation
Matt Lease <ml@utexas.edu>
25
Equally Accurate Workers?
7/10/2015 26
Correctness of the ith task instance over time t (1 = correct, 0 = wrong):
Alice: 1 0 1 0 1 0 1 0 1 0
Bob:   0 0 0 0 1 0 1 1 1 1
Accuracy(Alice) = Accuracy(Bob) = 0.5
But should we expect equal work quality in the future?
What if examples are not i.i.d.? Bob seems to be improving over time.
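A tiny sketch of why a single overall accuracy number is misleading here: using the binary correctness sequences above, comparing an early window to a recent window surfaces Bob's improvement.

```python
def accuracy(labels):
    """Fraction of labels marked correct (1)."""
    return sum(labels) / len(labels)

alice = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
bob   = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

for name, history in [("Alice", alice), ("Bob", bob)]:
    early, recent = history[:5], history[5:]
    print(f"{name}: overall={accuracy(history):.2f}, "
          f"early={accuracy(early):.2f}, recent={accuracy(recent):.2f}")
# Alice: overall=0.50, early=0.60, recent=0.40
# Bob:   overall=0.50, early=0.20, recent=0.80
```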
1: Time-series model
27
[Figure: latent autoregressive model] A latent variable x_t captures the assessor's underlying work quality; its temporal correlation φ reflects how frequently y changes over time, and the sign of its offset c steers toward correct vs. incorrect labels. A noise model maps the latent state to the real observation y_t = f(x_t), the observed label correctness (1 or 0).
Parameters are estimated with an EM variant (LAMORE, Park et al. 2014).
Jung et al. Predicting Next Label
Quality: A Time-Series Model of
Crowdwork. AAAI HCOMP 2014.
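For intuition, here is a rough generative sketch of this model family; it is not the exact LAMORE noise model or its EM estimation (see the cited papers), and the logistic link plus parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_latent_ar(T=10, c=0.1, phi=0.7, sigma=0.3):
    """Simulate one assessor's label correctness from a latent AR(1) quality state.

    Latent state:  x_t = c + phi * x_{t-1} + Normal(0, sigma)
    Observation:   y_t ~ Bernoulli(sigmoid(x_t))   # y_t = f(x_t), 1 = correct label
    """
    x = np.zeros(T)
    y = np.zeros(T, dtype=int)
    for t in range(T):
        prev = x[t - 1] if t > 0 else 0.0
        x[t] = c + phi * prev + rng.normal(0.0, sigma)
        p_correct = 1.0 / (1.0 + np.exp(-x[t]))
        y[t] = int(rng.random() < p_correct)
    return x, y

x, y = simulate_latent_ar()
print(np.round(x, 2), y)
```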
7/10/2015 28
2: Modeling More Features
Single feature: predict an assessor's next label quality from one signal, e.g. Alice's temporal effect (0.6, 0.5, 0.4, 0.3) → label correctness (0, 1, 0, ?).
Multiple features: integrate multi-dimensional features of a crowd assessor. For Alice:
accuracy | time | temporal effect | topic familiarity | # of labels | correct?
0.70     | 10.3 | 0.6             | 0.8               | 20          | 0
0.60     |  8.5 | 0.5             | 0.2               | 21          | 1
0.65     |  7.5 | 0.4             | 0.4               | 22          | 0
0.63     | 11.5 | 0.3             | 0.5               | 23          | ?
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
Features
7/10/2015 29
[1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR ’10
[2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW’14
[3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP’14
How do we flexibly capture a wider range of assessor behaviors by
incorporating multi-dimensional features?
[Figure: feature groups drawn from prior work [1][2][3] — various accuracy measures, task features, and temporal features]
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
Model
7/10/2015 30
Input: X (features for crowd assessor model)
Learning framework [figure]
Output: Y (likelihood of getting correct label at t)
Generalizable feature-based Assessor Model (GAM)
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
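A minimal sketch of the idea, assuming logistic regression as the learning framework (the actual GAM learner and feature set are detailed in the ECIR 2015 paper); the feature rows mirror the toy Alice table from the earlier slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-time-step features for one assessor:
# [running accuracy, task time, temporal effect, topic familiarity, # labels so far]
X = np.array([
    [0.70, 10.3, 0.6, 0.8, 20],
    [0.60,  8.5, 0.5, 0.2, 21],
    [0.65,  7.5, 0.4, 0.4, 22],
])
y = np.array([0, 1, 0])  # was each past label correct?

model = LogisticRegression().fit(X, y)  # stand-in for the GAM learning framework

x_next = np.array([[0.63, 11.5, 0.3, 0.5, 23]])
print("P(next label correct) =", model.predict_proba(x_next)[0, 1])
```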
Which Features Matter?
7/10/2015 31
[Table caption: Prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision rejection options (δ = 0 to 0.25 in steps of 0.05). While the other methods show a significant decrease in coverage, GAM shows better coverage as well as prediction performance under all the given reject options.]
[Fig. 4: Relative feature importance across 54 individual prediction models — AA (49), BA_opt (43), BA_PES (39), C (28), NumLabels (27), CurrentLabelQuality (23), AccChangeDirection (22), SA (20), Phi (19), BA_uni (16), TaskTime (10), TopicChange (7), TopicEverSeen (5).]
A GAM with only the top 5 features shows good performance (7-10% less than the full-featured GAM).
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
3: Reducing Supervision
Matt Lease <ml@utexas.edu>
32
Jung & Lease. Modeling Temporal Crowd Work Quality with Limited Supervision. HCOMP 2015.
Soft Label Updating & Discounting
Matt Lease <ml@utexas.edu>
33
Soft Label Updating
Matt Lease <ml@utexas.edu>
34
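The two slides above are figure-only in the deck. As a loose sketch of the general idea — an assumption for illustration, not the HCOMP 2015 method itself — a worker's quality estimate can be updated from soft (probabilistic) correctness signals when gold labels are scarce, with older evidence discounted:

```python
def update_quality(prior_quality, soft_correct, discount=0.9):
    """Exponentially discounted running estimate of a worker's label quality.

    prior_quality: current quality estimate in [0, 1]
    soft_correct:  soft correctness signal for the newest label, e.g. the
                   probability that it agrees with the aggregated consensus
    discount:      weight on history; smaller values forget old behavior faster
    """
    return discount * prior_quality + (1 - discount) * soft_correct

q = 0.5  # uninformative prior
for soft in [0.9, 0.8, 0.3, 0.95]:  # hypothetical soft signals from consensus
    q = update_quality(q, soft)
print(round(q, 2))  # ≈ 0.58
```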
The Future of Crowd Work, CSCW’13
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
35
Matt Lease <ml@utexas.edu>
Thank You!
ir.ischool.utexas.edu
Slides: www.slideshare.net/mattlease