Human computation, crowdsourcing 
and social: An industrial perspective 
Omar Alonso 
Microsoft 
12 November 2014
Disclaimer 
The views, opinions, positions, or strategies expressed in 
this talk are mine and do not necessarily reflect the 
official policy or position of Microsoft.
Introduction 
• Crowdsourcing is hot 
• Lots of interest in the research community 
– Articles showing good results 
– Journal special issues (IR, IEEE Internet Computing, etc.) 
– Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, CHI, RecSys, VLDB, etc.) 
– HCOMP 
– CrowdConf 
• Large companies leveraging crowdsourcing 
• Big data 
• Start-ups 
• Venture capital investment
Crowdsourcing 
• Crowdsourcing is the act of taking a 
job traditionally performed by a 
designated agent (usually an 
employee) and outsourcing it to an 
undefined, generally large group of 
people in the form of an open call. 
• The application of Open Source 
principles to fields outside of 
software. 
• Most successful story: Wikipedia
HUMAN COMPUTATION
Human computation 
• Not a new idea 
• Computers before 
computers 
• You are a human 
computer
Some definitions 
• Human computation is a computation 
that is performed by a human 
• Human computation system is a system 
that organizes human efforts to carry 
out computation 
• Crowdsourcing is a tool that a human 
computation system can use to 
distribute tasks. 
Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
More examples 
• ESP game 
• CAPTCHA: 200M every day 
• reCAPTCHA: 750M to date
Data is king 
• Massive free Web data 
changed how we train 
learning systems 
• Crowds provide new access 
to cheap & labeled big data 
• But quality also matters 
M. Banko and E. Brill. “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL 2001. 
A. Halevy, P. Norvig, and F. Pereira. “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems 2009.
Traditional Data Collection 
• Setup data collection software / harness 
• Recruit participants / annotators / assessors 
• Pay a flat fee for experiment or hourly wage 
• Characteristics 
– Slow 
– Expensive 
– Difficult and/or Tedious 
– Sample Bias…
Natural Language Processing 
• MTurk annotation for 5 NLP tasks 
• 22K labels for US $26 
• High agreement between consensus labels and 
gold-standard labels 
• Workers as good as experts 
R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert 
Annotations for Natural Language Tasks”. EMNLP-2008.
Machine Translation 
• Manual evaluation 
on translation quality 
is slow and expensive 
• High agreement 
between non-experts 
and experts 
• $0.10 to translate a 
sentence 
C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality 
Using Amazon’s Mechanical Turk”, EMNLP 2009.
Soylent 
M. Bernstein et al. “Soylent: A Word Processor with a Crowd Inside”, UIST 2010
Mechanical Turk 
• Amazon Mechanical Turk 
(AMT, MTurk, 
www.mturk.com) 
• Crowdsourcing platform 
• On-demand workforce 
• “Artificial artificial 
intelligence”: get humans to 
do hard part 
• Named after the faux chess-playing 
automaton of the 18th century
• Multiple Channels 
• Gold-based tests 
• Only pay for 
“trusted” judgments
HIT example
HIT example
{where to go on vacation} 
• MTurk: 50 answers, 
$1.80 
• Quora: 2 answers 
• Y! Answers: 2 
answers 
• FB: 1 answer 
• Web search: tons of results; read 
title + snippet + URL; explore a 
few pages in detail
{where to go on vacation} 
[Charts: answers grouped by countries and by cities]
Flip a coin 
• Please flip a coin and report the results 
• Two questions 
1. Coin type? 
2. Heads or tails? 
• Results (100 responses; a quick statistical sanity check follows the tables) 

Heads or tails?  Count 
head             57 
tail             43 
Total            100 

Coin type  Count 
Dollar     56 
Euro       11 
Other      30 
(blank)    3 
Total      100
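Even a toy task like this can be checked statistically. Below is a minimal sketch (assuming SciPy is available; the counts come from the tables above) that asks whether the reported heads/tails split is consistent with workers actually flipping fair coins:

```python
# Two-sided binomial test: is 57 heads out of 100 consistent with
# a fair coin? Minimal sketch using the counts reported above.
from scipy.stats import binomtest

result = binomtest(k=57, n=100, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.3f}")  # ~0.19: consistent with fair flips
```

A high p-value doesn't prove workers flipped coins, but a strongly skewed split would have been a cheap red flag.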
Why is this interesting? 
• Easy to prototype and test new experiments 
• Cheap and fast 
• No need to set up infrastructure 
• Introduce experimentation early in the cycle 
• For new ideas, this is very helpful
Caveats and clarifications 
• Trust and reliability 
• Wisdom of the crowd, revisited 
• Adjust expectations 
• Crowdsourcing is another data point for your 
analysis 
• Complementary to other experiments
Why now? 
• The Web 
• Use humans as processors in a distributed 
system 
• Address problems that computers aren’t good at 
• Scale 
• Reach
Who are 
the workers? 
• A. Baio. “The Faces of Mechanical Turk”, November 2008. 
• P. Ipeirotis. “The New Demographics of Mechanical Turk”, March 2010. 
• J. Ross et al. “Who are the Crowdworkers?”, CHI 2010.
Issues
ASSESSMENTS AND LABELS
Relevance assessments 
Is this document relevant to the query?
Careful with That Axe Data, Eugene 
• In the area of big data and machine learning: 
– labels -> features -> predictive model -> optimization 
• Labeling/experimentation perceived as boring 
• Don’t rush labeling 
– Human and machine 
• Label quality is very important 
– Don’t outsource it 
– Own it end to end 
– Large scale
More on label quality 
• Data gathering is not a free lunch 
• Labels for the machine != labels for humans 
• Emphasis on algorithms, 
models/optimizations and mining from labels 
• Not so much on algorithms for ensuring high 
quality labels 
• Training sets
The importance of labels – IR context
INFORMATION RETRIEVAL AND 
CROWDSOURCING
Motivating Example: Relevance Judging 
• Relevance of search results is difficult to judge 
– Highly subjective 
– Expensive to measure 
• Professional editors commonly used 
• Potential benefits of crowdsourcing 
– Scalability (time and cost) 
– Diversity of judgments
Started with a joke …
Results for {idiot} 
February 2011: 5/7 (R), 2/7 (NR) 
Relevant 
1. Most of the time those TV reality stars have absolutely no talent. They do whatever they 
can to make a quick dollar. Most of the time the reality tv stars don not have a mind of 
their own. R 
2. Most are just celebrity wannabees. Many have little or no talent, they just want fame. R 
3. Have you seen the knuckledraggers on reality television? They should be required to change 
their names to idiot after appearing on the show. You could put numbers after the word 
idiot so we can tell them apart. R 
4. Although I have not followed too many of these shows, those that I have encountered have 
for a great part a very common property. That property is that most of the participants 
involved exhibit a shallow self-serving personality that borders on social pathological 
behavior. To perform or act in such an abysmal way could only be an act of an idiot. R 
5. I can see this one going both ways. A particular sort of reality star comes to mind, 
though, one who was voted off Survivor because he chose not to use his immunity necklace. 
Sometimes the label fits, but sometimes it might be unfair. R 
Not Relevant 
1. Just because someone else thinks they are an "idiot", doesn't mean that is what the word 
means. I don't like to think that any one person's photo would be used to describe a 
certain term. NR 
2. While some reality-television stars are genuinely stupid (or cultivate an image of 
stupidity), that does not mean they can or should be classified as "idiots." Some simply 
act that way to increase their TV exposure and potential earnings. Other reality-television 
stars are really intelligent people, and may be considered as idiots by people who don't 
like them or agree with them. It is too subjective an issue to be a good result for a 
search engine. NR
You have a new idea 
• Novel IR technique 
• Don’t have access to click data 
• Can’t hire editors 
• How to test new ideas?
Crowdsourcing and relevance evaluation 
• Subject pool access: no need to come into the 
lab 
• Diversity 
• Low cost 
• Agile
Pedal to the metal 
• You read the papers 
• You tell your boss (or advisor) that 
crowdsourcing is the way to go 
• You now need to produce hundreds of 
thousands of labels per month 
• Easy, right?
Ask the right questions 
• Instructions are key 
• Workers are not IR experts, so don’t assume 
they share your terminology 
• Show examples 
• Hire a technical writer 
• Prepare to iterate
How not to do things 
• A lot of work for a few cents 
• Go here, go there, copy, enter, count …
UX design 
• Time to apply all those usability concepts 
• Need to grab attention 
• Generic tips 
– Experiment should be self-contained. 
– Keep it short and simple. 
– Be very clear with the task. 
– Engage with the worker. Avoid boring stuff. 
– Always ask for feedback (open-ended question) in an 
input box. 
• Localization
Payments 
• How much is a HIT? 
• Delicate balance 
– Too little, no interest 
– Too much, attract spammers 
• Heuristics 
– Start with something and wait to see if there is 
interest or feedback (“I’ll do this for X amount”) 
– Payment based on worker effort. Example: $0.04 total 
(2 cents for a yes/no answer, plus 2 cents for optional 
feedback); see the cost sketch after this list 
• Bonus
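To make the balance concrete, here is a hedged sketch of estimating the budget for a labeling batch. The 20% platform fee, the feedback rate, and the function itself are illustrative assumptions, not MTurk's exact fee schedule or API:

```python
def batch_cost(n_items, workers_per_item, base_pay=0.02,
               feedback_bonus=0.02, feedback_rate=0.5, fee=0.20):
    """Estimated cost: base pay per judgment, plus a bonus for the
    fraction of workers who leave optional feedback, plus platform fee.
    All rates are illustrative assumptions."""
    per_judgment = base_pay + feedback_bonus * feedback_rate
    return n_items * workers_per_item * per_judgment * (1 + fee)

# 1,000 items judged by 5 workers each:
print(f"${batch_cost(1000, 5):,.2f}")  # $180.00
```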
Managing crowds
Quality control 
• Extremely important part of the experiment 
• Approach it as “overall” quality – not just for 
workers 
• Bi-directional channel 
– You may think the worker is doing a bad job. 
– The same worker may think you are a lousy 
requester. 
• Test with a gold standard
When to assess work quality? 
• Beforehand (prior to main task activity) 
– How: “qualification tests” or similar mechanism 
– Purpose: screening, selection, recruiting, training 
• During 
– How: assess labels as worker produces them 
– Like random checks on a manufacturing line 
– Purpose: calibrate, reward/penalize, weight 
• After 
– How: compute accuracy metrics post-hoc 
– Purpose: filter, calibrate, weight, retain
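For the "beforehand" stage, a minimal sketch of grading a qualification test against known answers; the questions, answer format, and the 0.8 cutoff are all assumptions:

```python
# Hypothetical qualification test: admit a worker only if they score
# above a threshold on questions with known answers.
QUAL_ANSWERS = {"q1": "relevant", "q2": "not relevant", "q3": "relevant"}

def passes_qualification(responses: dict, threshold: float = 0.8) -> bool:
    correct = sum(responses.get(q) == a for q, a in QUAL_ANSWERS.items())
    return correct / len(QUAL_ANSWERS) >= threshold

print(passes_qualification(
    {"q1": "relevant", "q2": "not relevant", "q3": "relevant"}))  # True
```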
How do we measure work quality? 
• Compare worker’s label vs. 
– Known (correct, trusted) label 
– Other workers’ labels 
– Model predictions of workers and labels 
• Verify worker’s label 
– Yourself 
– Tiered approach (e.g. Find-Fix-Verify)
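A sketch of the first two comparisons, assuming judgments arrive as (worker, item, label) triples and gold holds trusted labels for a subset of items; that data layout is an assumption about your pipeline:

```python
from collections import Counter, defaultdict

def worker_accuracy(judgments, gold):
    """Accuracy of each worker against known (trusted) labels."""
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    return {w: hits[w] / seen[w] for w in seen}

def majority_vote(judgments):
    """Consensus label per item, i.e., each worker's label is
    implicitly compared against other workers' labels."""
    votes = defaultdict(Counter)
    for _, item, label in judgments:
        votes[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}
```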
Methods for measuring agreement 
• Inter-agreement level 
– Agreement between judges 
– Agreement between judges and the gold set 
• Some statistics (computed in the sketch below) 
– Cohen’s kappa (2 raters) 
– Fleiss’ kappa (any number of raters) 
– Krippendorff’s alpha 
• Gray areas 
– 2 workers say “relevant” and 3 say “not relevant” 
– 2-tier system
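The statistics above are available off the shelf. A hedged example with toy labels, assuming scikit-learn and statsmodels are installed:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Cohen's kappa: agreement between 2 raters on the same items.
rater1 = ["R", "R", "NR", "R", "NR"]
rater2 = ["R", "NR", "NR", "R", "NR"]
print("Cohen's kappa:", cohen_kappa_score(rater1, rater2))

# Fleiss' kappa: any number of raters. Rows = items, columns = raters,
# values = category codes (0 = not relevant, 1 = relevant).
ratings = np.array([[1, 1, 0, 1, 1],
                    [0, 0, 0, 1, 0],
                    [1, 1, 1, 1, 0]])
table, _ = aggregate_raters(ratings)  # items x categories count matrix
print("Fleiss' kappa:", fleiss_kappa(table))
```

For the gray area in the last bullet (2 workers say "relevant", 3 say "not relevant"), a plain majority vote hides the disagreement; reporting the vote margin alongside the consensus label keeps that signal.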
Content quality 
• People like to work on things that they like 
• Keep content and judgments current 
– TREC data set: airport security docs are pre-9/11 
• Document length 
• Randomize content 
• Avoid worker fatigue 
– Judging 100 documents on the same subject can be 
tiring, leading to decreasing quality
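A minimal sketch of acting on the last two bullets: shuffle the item pool so each HIT mixes subjects rather than serving 100 documents on the same topic, then batch into HITs. The HIT size, seed, and per-worker cap are assumptions:

```python
import random

def build_hits(items, hit_size=10, seed=42):
    """Shuffle the pool so each HIT mixes subjects, then batch."""
    pool = list(items)
    random.Random(seed).shuffle(pool)  # randomize content order
    return [pool[i:i + hit_size] for i in range(0, len(pool), hit_size)]

# Enforce a per-worker cap at assignment time (an assumed policy)
# so no one worker judges enough items to fatigue:
MAX_HITS_PER_WORKER = 5
```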
Was the task difficult? 
Ask workers to rate the difficulty of a search topic 
50 topics; 5 workers, $0.01 per task
So far … 
• One may say “this is all good but looks like a 
ton of work” 
• The original goal: data is king 
• Data quality and experimental designs are 
preconditions to make sure we get the right 
stuff 
• Don’t cut corners
Pause 
• Crowdsourcing works 
– Fast turnaround, easy to experiment, few dollars to test 
– But: you must design experiments carefully, manage 
quality, and work around platform limitations 
• Crowdsourcing in production 
– Large scale data sets 
– Continuous execution 
– Difficult to debug 
• How do you know the experiment is working? 
• Goal: a framework for ensuring the reliability of 
crowdsourcing tasks 
O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable 
results” http://research.microsoft.com/apps/pubs/default.aspx?id=219755.
Labeling tweets – an example of a task 
• Is this tweet interesting? 
• Subjective activity 
• Not focused on specific events 
• Findings 
– Difficult problem, low inter-rater agreement 
– Tested many designs, number of workers, platforms 
(MTurk and others) 
• Multiple contingent factors 
– Worker performance 
– Work 
– Task design 
O. Alonso, C. Marshall and M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
Designs that include in-task CAPTCHA 
• Idea borrowed from reCAPTCHA: use a 
control term 
• HIDDEN 
• Adapt your labeling task 
• Two more questions as control (see the filtering sketch below) 
– 1 algorithmic 
– 1 semantic
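Downstream, the control answers gate the main label. A sketch of the filtering step, where the response field names are assumptions about how results are stored:

```python
def trusted(response: dict) -> bool:
    """Keep a judgment only if the in-task captcha (control term)
    and both control questions were answered correctly."""
    return (response["captcha"] == response["captcha_expected"]
            and response["algorithmic_control_correct"]
            and response["semantic_control_correct"])

def filter_judgments(responses):
    return [r for r in responses if trusted(r)]
```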
Production example #1 
Q1 (kappa = 0.91, alpha = 0.91) 
Q2 (kappa = 0.771, alpha = 0.771) 
Q3 (kappa = 0.033, alpha = 0.035) 
[HIT screenshot: de-branded tweet, in-task CAPTCHA, and the main question]
Production example #2 
Q1 (kappa = 0.907, alpha = 0.907) 
Q2 (kappa = 0.728, alpha = 0.728) 
• Q3 Worthless (alpha = 0.033) 
• Q3 Trivial (alpha = 0.043) 
• Q3 Funny (alpha = -0.016) 
• Q3 Makes me curious (alpha = 0.026) 
• Q3 Contains useful info (alpha = 0.048) 
• Q3 Important news (alpha = 0.207) 
[HIT screenshot: de-branded tweet, in-task CAPTCHA; Q3 broken down by categories to get a better signal]
Once we get here 
• High quality labels 
• Data will later be used for rankers, ML 
models, evaluations, etc. 
• Training sets 
• Scalability and repeatability
CURRENT TRENDS
Algorithms 
• Bandit problems; explore-exploit 
• Optimizing amount of work by workers 
– Humans have limited throughput 
– Harder to scale than machines 
• Selecting the right crowds 
• Stopping rule (sketched below)
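As one example, a simple stopping rule keeps requesting labels for an item until one label leads by a fixed margin or a budget cap is hit. A minimal sketch; the margin and cap are illustrative choices, not a published rule:

```python
from collections import Counter

def need_more_labels(labels, margin=2, max_labels=7):
    """True while no label leads by `margin` votes and budget remains."""
    if len(labels) >= max_labels:
        return False
    top_two = Counter(labels).most_common(2)
    if len(top_two) < 2:                      # unanimous (or empty) so far
        return len(labels) < margin
    return (top_two[0][1] - top_two[1][1]) < margin

print(need_more_labels(["R", "R", "NR"]))       # True: lead is only 1
print(need_more_labels(["R", "R", "R", "NR"]))  # False: lead reached 2
```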
Humans in the loop 
• Computation loops that mix humans and 
machines 
• A form of active learning 
• Double goal: 
– Human checking on the machine 
– Machine checking on humans 
• Example: classifiers for social data (see the loop sketch below)
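A minimal sketch of such a loop as uncertainty sampling: the machine flags the items it is least sure about, humans label them, and the model is retrained. Here model follows the scikit-learn interface, and ask_crowd is a hypothetical stand-in for a crowdsourcing pipeline:

```python
import numpy as np

def human_in_the_loop_round(model, X_train, y_train, X_pool, ask_crowd, k=100):
    """One round: humans check the machine on its least confident items,
    then the model is retrained on the crowd's labels."""
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)  # low top-class confidence
    picks = np.argsort(-uncertainty)[:k]   # k most uncertain items
    y_new = ask_crowd(X_pool[picks])       # hypothetical crowd call
    X_train = np.vstack([X_train, X_pool[picks]])
    y_train = np.concatenate([y_train, y_new])
    return model.fit(X_train, y_train), X_train, y_train
```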
Routing 
• Expertise detection and routing 
• Social load balancing 
• When to switch between machines and 
humans 
• CrowdSTAR 
B. Nushi, O. Alonso, M. Hentschel, V. Kandylas. “CrowdSTAR: A Social Task Routing 
Framework for Online Communities”, 2014. http://arxiv.org/abs/1407.6714
Social Task Routing 
[Diagram: tasks A and B routed across crowds (Crowd 1: Twitter; Crowd 2: Quora) and within each crowd, guided by crowd summaries C1 and C2. A toy routing sketch follows.]
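A toy sketch of routing across crowds; this is not the CrowdSTAR algorithm, and the summary fields and scoring are assumptions. Each crowd's summary is scored by topical affinity and responsiveness, and the task is posted to the best match:

```python
def route(task_topic, crowd_summaries):
    """Pick the crowd whose summary best matches the task's topic."""
    def score(summary):
        affinity = summary["topic_affinity"].get(task_topic, 0.0)
        return affinity * summary["response_rate"]
    return max(crowd_summaries, key=lambda name: score(crowd_summaries[name]))

crowds = {
    "Crowd 1 (Twitter)": {"topic_affinity": {"news": 0.9, "tech": 0.6},
                          "response_rate": 0.4},
    "Crowd 2 (Quora)":   {"topic_affinity": {"news": 0.3, "tech": 0.8},
                          "response_rate": 0.7},
}
print(route("tech", crowds))  # Crowd 2 (Quora)
```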
Question Posting – Twitter Examples
Conclusions 
• Crowdsourcing at scale works but requires a solid 
framework 
• Fast turnaround, easy to experiment, few dollars to test 
• But you have to design the experiments carefully 
• Usability considerations 
• Lots of opportunities to improve current platforms 
• Three aspects that need attention: workers, work and 
task design 
• Labeling social data is hard
Conclusions – II 
• Important to know your limitations and be 
ready to collaborate 
• Lots of different skills and expertise required 
– Social/behavioral science 
– Human factors 
– Algorithms 
– Economics 
– Distributed systems 
– Statistics
Thank you - @elunca
