Presented Nov. 16, 2011, at the National Institute of Standards and Technology (NIST) Text REtrieval Conference (TREC). Track organized with Gabriella Kazai, with assistance from Hyun Joon Jung.
3. What is Crowdsourcing?
• A collection of mechanisms and associated
methodologies for scaling and directing crowd
activities to achieve some goal(s)
• Enabled by Internet connectivity
• Many related concepts
– Collective intelligence
– Social computing
– People services
– Human computation
4. Why Crowdsourcing? Potential…
• Scalability (e.g. cost, time, effort)
– e.g. scale to greater pool sizes
• Quality (by getting more eyes on the data)
– More diverse judgments
– More accurate judgments (“wisdom of crowds”)
• And more!
– New datasets, new tasks, interaction, on-demand
evaluation, hybrid search systems
5. Track Goals (for Year 1)
• Promote IR community awareness
of, investigation of, and experience with
crowdsourcing mechanisms and methods
• Improve understanding of best practices
• Establish shared, reusable benchmarks
• Assess state-of-the-art of the field
• Attract experience from outside the IR community
6. Crowdsourcing in 2011
• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
• Crowdsourcing Technologies for Language and Cognition Studies (July 27)
• CHI-CHC: Crowdsourcing and Human Computation (May 8)
• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
• EC: Workshop on Social Computing and User Generated Content (June 5)
• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
• Interspeech: Crowdsourcing for speech processing (August)
• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
• TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
7. Two Questions, Two Tasks
• Task 1: Assessment (human factors)
– How can we obtain quality relevance judgments
from individual (crowd) participants?
• Task 2: Aggregation (statistics)
– How can we derive a quality relevance judgment
from multiple (crowd) judgments?
9. Task 2: Aggregation (statistics)
• “Wisdom of crowds” computing
• Typical assumption: noisy input labels
– But not always (cf. Yang et al., SIGIR’10)
• Many statistical methods have been proposed
– Common baseline: majority vote (see the sketch below)
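As a concrete reference point, a minimal sketch of the majority-vote baseline, assuming binary 0/1 worker labels keyed by topic-document pair (the data layout here is illustrative, not the track's submission format):

```python
from collections import defaultdict

def majority_vote(labels):
    """labels: iterable of (example_id, worker_label) pairs, worker_label in {0, 1}.
    Returns {example_id: aggregated label}, breaking ties toward non-relevant (0)."""
    votes = defaultdict(list)
    for example_id, label in labels:
        votes[example_id].append(label)
    return {ex: int(sum(v) > len(v) / 2) for ex, v in votes.items()}

# Example: three workers label the same topic-document pair
print(majority_vote([("t1-d1", 1), ("t1-d1", 1), ("t1-d1", 0)]))  # {'t1-d1': 1}
```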
10. Crowdsourcing, Noise & Uncertainty
Broadly two approaches
1. Alchemy: turn noisy data into gold
– Once we have gold, we can go on training and
evaluating as before (separation of concerns)
– Assume we can mostly clean it up and ignore any
remaining error (even gold is rarely 100% pure)
2. Model & propagate uncertainty
– Let it “spill over” into training and evaluation
11. Test Collection: ClueWeb09 subset
• Collection: 19K pages rendered by Waterloo
– Task 1: teams judge (a subset)
– Task 2: teams aggregate judgments we provide
• Topics: taken from past MQ and RF tracks
• Gold: Roughly 3K prior NIST judgments
– Remaining 16K pages have no “gold” judgments
12. What to Predict?
• Teams submit classification and/or ranking labels
– Classification supports traditional absolute relevance judging
– Rank labels support pair-wise preference or list-wise judging
• Classification labels in [0,1]
– Probability of relevance (assessor/system uncertainty)
– Simple generalization of binary relevance
– If probabilities are submitted but no ranking, rank labels are induced (see the sketch below)
• Ranking as [1..N]
– Task 1: rank 5 documents per set
• Same worker had to label all 5 examples in a given set (challenge)
– Task 2: rank all documents per topic
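One way to read the induced-ranking rule above, as a sketch (how the track breaks ties between equal probabilities is not stated on this slide; here ties fall back to sort order):

```python
def induce_ranking(probabilities):
    """probabilities: {doc_id: P(relevant) in [0, 1]} for one topic or one 5-document set.
    Returns {doc_id: rank}, with rank 1 for the most probably relevant document."""
    ordered = sorted(probabilities, key=probabilities.get, reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}

# Example: one 5-document Task 1 set
probs = {"d1": 0.9, "d2": 0.1, "d3": 0.6, "d4": 0.4, "d5": 0.75}
print(induce_ranking(probs))  # {'d1': 1, 'd5': 2, 'd3': 3, 'd4': 4, 'd2': 5}
```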
13. Metrics
• Classification
– Binary ground truth: P, R, Accuracy, Sensitivity, LogLoss
– Probabilistic ground truth: KL, RMSE
• Ranking
– Mean Average Precision (MAP)
– Normalized Discounted Cumulative Gain (NDCG)
• Ternary NIST judgments conflated to binary
• Could explore mapping [0,1] consensus to ternary categories
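For reference, a minimal sketch of average precision and NDCG on a single ranked list with binary gold labels, using standard textbook definitions (the track's exact implementation, e.g. log base or treatment of unjudged documents, may differ); MAP is then the mean of AP over topics:

```python
import math

def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 gold labels in ranked order for one topic."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(1, sum(ranked_relevance))

def ndcg(ranked_relevance):
    """Binary-gain NDCG with log2 discounting; 1.0 for a perfect ranking."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ranked_relevance, start=1))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
print(ndcg([1, 0, 1, 0]))               # ≈ 0.920
```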
15. Classification Metrics (cont’d)
• Classification – Binary ground truth (cont’d)
• Classification – Probabilistic ground truth
– Root Mean Squared Error (RMSE)
• Notes
– To avoid infinite loss from log(0), probabilities of 0 are replaced with 10^-15
– Revision: compute average per-example logloss and KL so error does not grow with sample size (particularly with varying team coverage); see the sketch below
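A sketch of the per-example classification metrics as described, with probabilities clipped to avoid infinite log loss and scores averaged per example so they stay comparable across teams with different coverage (clipping at the upper end as well as at zero is an added assumption, not stated on the slide):

```python
import math

EPS = 1e-15

def clip(p):
    # Slide rule: replace 0 with 1e-15; clipping 1 to 1 - 1e-15 is an added assumption
    return min(max(p, EPS), 1 - EPS)

def mean_log_loss(truths, preds):
    """truths: binary gold labels (0/1); preds: predicted P(relevant). Lower is better."""
    losses = [-(t * math.log(clip(p)) + (1 - t) * math.log(1 - clip(p)))
              for t, p in zip(truths, preds)]
    return sum(losses) / len(losses)

def mean_kl(truths, preds):
    """truths, preds: probabilistic labels in [0, 1]; per-example KL(truth || prediction)."""
    def kl(q, p):
        q, p = clip(q), clip(p)
        return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
    return sum(kl(t, p) for t, p in zip(truths, preds)) / len(truths)

def rmse(truths, preds):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truths, preds)) / len(truths))

print(mean_log_loss([1, 0], [0.9, 0.2]))  # ≈ 0.164
```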
16. Ground Truth: Three Versions
• Gold: NIST Judgments
– only available for a subset of the test collection
• Consensus: generated by aggregating team labels (automatic)
– full coverage
• Team-based (Task 2 only)
– use each team’s labels as truth to evaluate all other teams
– Inspect variance in team rankings over alternative ground truths
– Coverage varies
Three primary evaluation conditions
1. Over examples having gold labels (evaluate vs. gold labels)
2. Over examples having gold labels (evaluate vs. consensus labels)
3. Over all examples (evaluate vs. consensus labels)
17. Consensus
• Goal: Infer single consensus label from multiple input labels
• Methodological Goals: unbiased, transparent, simple
• Method: simple average, rounded when metrics require it (see the sketch below)
– Task 2: input = example labels from each team
– Task 1: input = per-example average of worker labels from each team
• Details
– Classification labels only; no rank fusion
– Using primary runs only
– Task 1: each team gets 1 vote regardless of worker count (prevent bias)
– Exclude any examples where
• only one team submitted a label (bias)
• consensus would yield a tie (binary metrics only)
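A sketch of the consensus procedure as described above: one (averaged) label per team per example, simple averaging across teams, single-team examples dropped, and exact ties dropped when a binary label is needed (the data structures are illustrative):

```python
def consensus(team_labels, binary=True):
    """team_labels: {example_id: {team_id: label in [0, 1]}}; for Task 1 each team's
    label is already the per-example average over that team's workers.
    Returns {example_id: consensus label}."""
    out = {}
    for example, by_team in team_labels.items():
        if len(by_team) < 2:          # only one team labeled it: excluded (bias)
            continue
        avg = sum(by_team.values()) / len(by_team)
        if binary:
            if avg == 0.5:            # exact tie: excluded for binary metrics
                continue
            out[example] = int(avg > 0.5)
        else:
            out[example] = avg
    return out

print(consensus({"t1-d1": {"A": 1, "B": 0, "C": 1}, "t1-d2": {"A": 1}}))  # {'t1-d1': 1}
```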
20. Task 1: Data
• Option 1: Use Waterloo rendered pages
– Available as images, PDFs, and plain text (+html)
– Many page images fetched from CMU server
– Protect workers from malicious scripting
• Option 2: Use some other format
– Any team creating some other format was asked
to provide that data or conversion tool to others
– Avoid comparison based on different rendering
21. Task 1: Data
• Topics: 270 (240 development, 30 test)
• Test Effort: ~2200 topic-document pairs for each team to judge
– Shared sets: judged by all teams
• Test: 1655 topic-document pairs (331 sets) over 20 topics
– Assigned sets: judged subset of teams
• Test: 1545 topic-document pairs (309 sets) over 15 topics in total
• ~ 500 assigned to each team (~ 30 rel, 20 non-rel, 450 unknown)
– Split intended to let organizers measure any worker-training effects
• Increased track complexity, decreased useful redundancy & gold …
• Gold: 395 topic-document pairs for test
– made available to teams for cross-validation (not blind)
22. Task 1: Cost & Sponsorship
• Paid crowd labor only one form of crowdsourcing
– Other models: directed gaming, citizen science, virtual pay
– Incentives: socialize with others, recognition, social good, learn, etc.
• Nonetheless, paid models continue to dominate
– e.g. Amazon Mechanical Turk (MTurk), CrowdFlower
• Risk: cost of crowd labor being barrier to track participation
• Risk Mitigation: sponsorship
– CrowdFlower: $100 free credit to interested teams
– Amazon: ~ $300 reimbursement to teams using MTurk (expected)
23. Task 1: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
– CrowdFlower qualification, MTurk judging
2. Delft University of Technology – Vuurens (TUD_DMIR): MTurk
3. Delft University of Technology & University of Iowa (GeAnn)
– Game, recruit via CrowdFlower
4. Glasgow – Terrier (uogTr): MTurk
5. Microsoft (MSRC): MTurk
6. RMIT University (RMIT): CrowdFlower
7. University Carlos III of Madrid (uc3m): MTurk
8. University of Waterloo (UWaterlooMDS): in-house judging
5 used MTurk, 3 used CrowdFlower, 1 in-house (BUPT used both CrowdFlower and MTurk)
24. Task 1: Evaluation method
• Average per-worker performance
– Average weighted by number of labels per worker (sketched below)
– Primary evaluation includes rejected work
• Additional metric: Coverage
– What % of examples were labeled by the team?
• Cost & time to be self-reported by teams
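A sketch of the label-weighted per-worker average described above, with hypothetical per-worker scores; weighting each worker's score by the number of labels they contributed means prolific workers dominate the team score (for accuracy this reduces to pooling all labels):

```python
def weighted_worker_average(worker_scores, worker_label_counts):
    """worker_scores: {worker_id: per-worker metric, e.g. accuracy};
    worker_label_counts: {worker_id: number of labels contributed}."""
    total = sum(worker_label_counts.values())
    return sum(worker_scores[w] * worker_label_counts[w] for w in worker_scores) / total

scores = {"w1": 0.90, "w2": 0.60}   # hypothetical per-worker accuracies
counts = {"w1": 300, "w2": 100}     # labels contributed by each worker
print(weighted_worker_average(scores, counts))  # 0.825
```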
25. ¼ most productive workers do ¾ of the work
Workers (by productivity)   # of labels   % of labels
Top 25%                     44917         76.77%
Top 50%                     53444         91.34%
Top 75%                     56558.5       96.66%
Total                       58510         100%
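The percentages in the table follow from sorting workers by label count and summing the most productive fraction; a minimal sketch of that computation with made-up counts:

```python
def share_of_labels(label_counts, fraction):
    """label_counts: labels contributed per worker.
    Returns the share of all labels produced by the top `fraction` of workers."""
    counts = sorted(label_counts, reverse=True)
    top = counts[: max(1, round(len(counts) * fraction))]
    return sum(top) / sum(counts)

counts = [500, 300, 120, 40, 20, 10, 5, 5]     # hypothetical per-worker label counts
print(f"{share_of_labels(counts, 0.25):.2%}")  # top 25% of workers -> 80.00% of labels
```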
26. Same worker, multiple teams
# of teams a worker belongs to   # of workers   avg. # of examples per worker
1                                947            56.21
2                                35             146.65
3                                2              72.25
[Chart: number of examples labeled per worker; x-axis: Anonymized Worker ID, y-axis: Number of Examples]
28. Task 2: Data
• Input: judgments provided by organizers
– 19,033 topic-document pairs
– 89,624 binary judgments from 762 workers
• Evaluation: average per-topic performance
• Gold: 3275 labels
– 2275 for training (1275 relevant, 1000 non-relevant)
• Excluded from evaluation
– 1000 for blind test (balanced 500/500)
29. Task 2: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
2. Delft University of Technology – Vuurens (TUD_DMIR)
3. Delft University of Technology & University of Iowa (GeAnn)
4. Glasgow – Terrier (uogTr)
5. Glasgow – Zuccon (qirdcsuog)
6. LingPipe
7. Microsoft (MSRC)
8. University Carlos III of Madrid (uc3m)
9. University of Texas at Austin (UTAustin)
10. University of Waterloo (UWaterlooMDS)
30. Discussion
• Consensus Labels as ground-truth
– Consensus Algorithm for Label Generation?
– Probabilistic or Rounded Binary Consensus Labels?
• Proper scoring rules
• Changes for 2012?
– Which document collection? Request NIST judging?
– Drop the two-task format? Pre-suppose crowdsourced solution?
– Broaden sponsorship? Narrow scope?
– Additional organizer?
– Details
• Focus on worker training effects
• Treatment of rejected work
31. Conclusion
• Interesting first year of track
– Some insights about what worked well and less well in track design
– Participants will tell us about methods developed
– More analysis still needed for evaluation
• Track will run again in 2012
– Help shape it with feedback (planning session, hallway, or email)
• Acknowledgments
– Hyun Joon Jung (UT Austin)
– Mark Smucker (U Waterloo)
– Ellen Voorhees & Ian Soboroff (NIST)
• Sponsors
– Amazon
– CrowdFlower