Talk at the "Wisdom of the Crowd" AAAI 2012 Spring Symposium workshop (http://users.wpi.edu/~soniac/WisdomOfTheCrowd/WoCSchedule.htm) on the 2011 AAAI-HComp paper of the same title.
On Quality Control and Machine Learning in Crowdsourcing
1. On Quality Control and Machine Learning in Crowdsourcing
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
2. Quality Control
• Many factors matter
– guidelines, experimental design, human factors,
automation, …
• Only as strong as weakest link
– automation is not a silver bullet
• Errors are not just due to lazy/stupid workers
– Even in carefully designed and managed
annotation projects, uncertain cases encountered
3. Human Factors (HF)
• Questionnaire / Survey Design
• Interface / Interaction Design
• Incentives
• Human Relations (HR): recruitment & retention
• Long-term Commitment
– rapport with co-workers
– buy-in to organizational mission & value of work
– opportunities for advancement in organization
• Oversight / Management / Organization
• Communication
4. HF Challenges & Consequences
• Not part of typical CS curriculum or expertise
– crowdsourcing disrupts prior area boundaries
• NLP, IR, ML people traditionally don’t do HCI
– now many of us dealing with such issues
• Consequences
– Errors from poor HF
– Stumbling into known problems, recreating solutions
– May see problems through limited vantage point
– May over-rely on automation
• Great opportunities for HCI collaboration
5. Minority Voice & Diversity
• Opportunity: more diversity than “experts”
• Risk: false reinforcement of majority view
when minority is ignored, lost, or eliminated
• Questions
– How to recognize when majority is wrong?
– How to recognize alternative or better truths?
– Is QC systematically eliminating diversity?
– How diverse is the crowd really?
6. Automation
• Examples
– Task Routing / Worker Selection
– Adaptive Plurality, Decomposition
– Post-hoc: Calibration, Filtering & Aggregation
• Separation of concerns / middleware
– Users specify their task, and system handles QC
– Many do not have interest, time, skill, or risk tolerance
to manage low-level QC on their own
– Critical to widespread/enterprise adoption
– Accelerate field progress
• divide problem space for different groups to work on
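The plurality and aggregation steps above reduce, in the simplest case, to majority voting over redundant worker labels. A minimal sketch (the function name and data below are invented for illustration):

```python
from collections import Counter

def aggregate_majority(labels_by_item):
    """Pick each item's most common worker label (simple majority vote).

    labels_by_item: dict mapping item id -> list of worker labels.
    Returns item id -> (winning label, fraction of workers agreeing).
    """
    consensus = {}
    for item, labels in labels_by_item.items():
        label, votes = Counter(labels).most_common(1)[0]
        consensus[item] = (label, votes / len(labels))
    return consensus

# Three redundant workers label two items
labels = {"q1": ["spam", "spam", "ham"],
          "q2": ["ham", "ham", "ham"]}
print(aggregate_majority(labels))
# {'q1': ('spam', 0.666...), 'q2': ('ham', 1.0)}
```

The "confidence" here is only vote share; the worker-weighting and filtering mentioned above refine this baseline.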
7. Automation: Questions
• Who are the workers?
• What is the labor model?
• What are affordances of the platform?
• How does that drive subsequent setup?
• Appropriate inter-annotator agreement
measures for crowdwork?
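For the agreement question above, the standard pairwise starting point is Cohen's kappa, which corrects raw agreement for chance; a self-contained sketch (the annotator labels are invented):

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators.

    a, b: equal-length lists of categorical labels for the same items.
    """
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own marginal label distribution.
    cats = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (observed - expected) / (1 - expected)

ann1 = ["yes", "yes", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "yes", "no", "yes"]
print(cohens_kappa(ann1, ann2))  # 1/3: modest agreement beyond chance
```

Whether such pairwise measures transfer to crowdwork, where each worker labels only a sliver of the data, is exactly the open question the slide raises.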
8. Lessons from Traditional Annotation
• Need clear, detailed guidelines
• Cannot predict all cases in advance
• Guidelines evolve during annotation
• Humans not merely better visual, audio sensors
– e.g. imprecise directions & unforeseen examples
• Crowdsourcing Questions
– How to handle examples for which current guidelines
are ambiguous, unclear, or insufficient?
– What role do annotators play?
– How to facilitate interaction?
9. Worker Organization
• How might we organize workers for effective QC?
• Do workers participate in high level discussions
(telecommuters) or act like automata (HPU)?
• What organizational patterns might be used
– e.g. find-verify, fix-fix-verify, qualify-work
• How do different organizational patterns interact
with automation and other QC factors?
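The staged patterns above can be sketched as a pipeline in the spirit of Bernstein et al.'s Find-Fix-Verify: one crowd flags problems, another proposes fixes, a third votes on them. The crowd calls below are hypothetical placeholders (plain functions) so the structure can run locally:

```python
from collections import Counter

def find_fix_verify(text, find_crowd, fix_crowd, verify_crowd, min_votes=2):
    """Sketch of a Find-Fix-Verify pipeline; each *_crowd is a stand-in
    for dispatching a microtask to a set of workers."""
    # Find: keep regions flagged by enough independent workers
    flagged = Counter()
    for worker in find_crowd:
        for region in worker(text):
            flagged[region] += 1
    problems = [r for r, v in flagged.items() if v >= min_votes]

    # Fix: independent workers propose a patch for each flagged region
    patches = {r: [worker(text, r) for worker in fix_crowd] for r in problems}

    # Verify: a separate crowd votes among the candidate patches
    accepted = {}
    for region, candidates in patches.items():
        votes = Counter(worker(text, region, candidates)
                        for worker in verify_crowd)
        accepted[region] = votes.most_common(1)[0][0]
    return accepted

find_crowd = [lambda t: ["s1"], lambda t: ["s1", "s2"], lambda t: ["s1"]]
fix_crowd = [lambda t, r: "fix A", lambda t, r: "fix B"]
verify_crowd = [lambda t, r, c: "fix A"] * 2 + [lambda t, r, c: "fix B"]
print(find_fix_verify("draft", find_crowd, fix_crowd, verify_crowd))
# {'s1': 'fix A'}
```

Separating find, fix, and verify roles keeps any single worker from both introducing and approving an error, which is the QC point of the pattern.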
10. Impact on Machine Learning: More
• Labeled data
• Uncertain data
• Diverse data
• Specific data
• Ongoing data
• Rapid data
• Hybrid systems
• On-demand evaluation
• Datasets & Benchmarks
• Tasks
11. Open Questions
• How do cheap, plentiful, rapid labels alter how we utilize
supervised vs. semi-supervised vs. unsupervised methods?
– Revisit task-specific learning curves
• Mask uncertainty via QC or model, propagate, and expose?
• How do we handle noise in active learning?
• How to best utilize a 24/7 global crowd for lifetime,
continuous, never-ending learning systems?
– Sample size vs. adaptation
• Can we develop a more formal, computational
understanding of Wisdom of Crowds?
– diversity, independence, decentralization, and aggregation
• Can we better connect consensus algorithms with more
general feature-based and ensemble models?
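One concrete bridge between consensus algorithms and model-based learning is EM-style joint estimation of item labels and worker reliability. Below is a minimal "one-coin" sketch, a simplified stand-in for richer confusion-matrix models such as Dawid and Skene (1979); all data and names are invented:

```python
def em_consensus(labels, n_iter=20):
    """Jointly estimate binary item labels and per-worker accuracy.

    labels: dict item -> dict worker -> label in {0, 1}.
    Returns (item -> P(true label = 1), worker -> estimated accuracy).
    """
    workers = {w for votes in labels.values() for w in votes}
    acc = {w: 0.8 for w in workers}  # optimistic initial accuracy
    post = {}
    for _ in range(n_iter):
        # E-step: posterior over each item's true label given accuracies
        for item, votes in labels.items():
            p1 = p0 = 1.0
            for w, y in votes.items():
                p1 *= acc[w] if y == 1 else 1 - acc[w]
                p0 *= acc[w] if y == 0 else 1 - acc[w]
            post[item] = p1 / (p1 + p0)
        # M-step: re-estimate each worker's accuracy from the posteriors
        for w in workers:
            num = den = 0.0
            for item, votes in labels.items():
                if w in votes:
                    num += post[item] if votes[w] == 1 else 1 - post[item]
                    den += 1
            acc[w] = num / den
    return post, acc

labels = {"a": {"w1": 1, "w2": 1, "w3": 0},
          "b": {"w1": 0, "w2": 0, "w3": 1},
          "c": {"w1": 1, "w2": 1, "w3": 1}}
post, acc = em_consensus(labels)  # w3 disagrees with the majority twice
```

Because worker accuracy is just a parameter here, it is natural to ask, as the slide does, how such consensus models could absorb general features (item difficulty, worker history) the way feature-based and ensemble models do.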
12. Other Issues
• Hybrid systems match human-level competence
– Achievable now at certain time/cost tradeoff, which can be
navigated as function of context and need
• Diverse labeling particularly valuable when subjective
– Traditional in-house annotators not diverse & few
• A middle way between traditional annotation and
automated proxy metrics
– e.g. translation quality & BLEU
– More rapid than traditional annotation, more accurate
than automated metrics
• Less re-use has the risk of less comparable evaluation
– Enduring value of community evaluations like TREC
13. Thank You!
ir.ischool.utexas.edu/crowd
• Students
– Catherine Grady (iSchool)
– Hyunjoon Jung (ECE)
– Jorn Klinger (Linguistics)
– Adriana Kovashka (CS)
– Abhimanu Kumar (CS)
– Di Liu (iSchool)
– Hohyon Ryu (iSchool)
– William Tang (CS)
– Stephen Wolfson (iSchool)
• Omar Alonso, Microsoft Bing
• Support
– John P. Commons