Overview of the TREC 2011 Crowdsourcing Track



                  Organizers:
Gabriella Kazai, Microsoft Research Cambridge
  Matt Lease, University of Texas at Austin
Nov. 16, 2011   TREC 2011 Crowdsourcing Track   2
What is Crowdsourcing?
• A collection of mechanisms and associated
  methodologies for scaling and directing crowd
  activities to achieve some goal(s)
• Enabled by internet connectivity
• Many related concepts
       – Collective intelligence
       – Social computing
       – People services
       – Human computation
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   3
Why Crowdsourcing? Potential…
• Scalability (e.g. cost, time, effort)
       – e.g. scale to greater pool sizes
• Quality (by getting more eyes on the data)
       – More diverse judgments
       – More accurate judgments (“wisdom of crowds”)
• And more!
       – New datasets, new tasks, interaction, on-demand
         evaluation, hybrid search systems

Nov. 16, 2011             TREC 2011 Crowdsourcing Track    4
Track Goals (for Year 1)
• Promote IR community awareness
  of, investigation of, and experience with
  crowdsourcing mechanisms and methods
• Improve understanding of best practices
• Establish shared, reusable benchmarks
• Assess state-of-the-art of the field
• Attract experience from outside IR community

Nov. 16, 2011          TREC 2011 Crowdsourcing Track   5
Crowdsourcing in 2011
•   AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
•   ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
•   Crowdsourcing Technologies for Language and Cognition Studies (July 27)
•   CHI-CHC: Crowdsourcing and Human Computation (May 8)
•   CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
•   CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
•   Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
•   EC: Workshop on Social Computing and User Generated Content (June 5)
•   ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
•   Interspeech: Crowdsourcing for speech processing (August)
•   NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
•   SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
•   TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
•   UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
•   WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
    Nov. 16, 2011                     TREC 2011 Crowdsourcing Track                           6
Two Questions, Two Tasks
• Task 1: Assessment (human factors)
       – How can we obtain quality relevance judgments
         from individual (crowd) participants?


• Task 2: Aggregation (statistics)
       – How can we derive a quality relevance judgment
         from multiple (crowd) judgments?




Nov. 16, 2011           TREC 2011 Crowdsourcing Track     7
Task 1: Assessment (human factors)
• Measurable outcomes & potential tradeoffs
       – Quality, time, cost, & effort
• Many possible factors
       – Incentive structures
       – Interface design
       – Instructions / guidance
       – Interaction / feedback
       – Recruitment & retention
       –…
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   8
Task 2: Aggregation (statistics)
• “Wisdom of crowds” computing
• Typical assumption: noisy input labels
       – But not always (cf. Yang et al., SIGIR’10)
• Many statistical methods have been proposed
       – Common baseline: majority vote
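
As a concrete reference point for the baseline named above, here is a minimal Python sketch of majority-vote aggregation; it is not the track's code, and the input format (worker, example, binary label triples) and tie handling are illustrative assumptions.

```python
from collections import defaultdict

def majority_vote(judgments):
    """Aggregate noisy binary labels by simple per-example majority.

    judgments: iterable of (worker_id, example_id, label) with label in {0, 1}.
    Returns a dict mapping example_id -> aggregated label in {0, 1};
    ties fall back to 0 (non-relevant) here purely for illustration.
    """
    votes = defaultdict(list)
    for _worker, example, label in judgments:
        votes[example].append(label)
    return {ex: int(sum(lbls) > len(lbls) / 2) for ex, lbls in votes.items()}

# Three workers label one topic-document pair; the majority says relevant.
print(majority_vote([("w1", "t1-d1", 1), ("w2", "t1-d1", 1), ("w3", "t1-d1", 0)]))
# {'t1-d1': 1}
```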




Nov. 16, 2011             TREC 2011 Crowdsourcing Track   9
Crowdsourcing, Noise & Uncertainty
Broadly two approaches
1. Alchemy: turn noisy data into gold
       – Once we have gold, we can go on training and
         evaluating as before (separation of concerns)
       – Assume we can mostly clean it up and ignore any
         remaining error (even gold is rarely 100% pure)
2. Model & propagate uncertainty
       – Let it “spill over” into training and evaluation

Nov. 16, 2011             TREC 2011 Crowdsourcing Track     10
Test Collection: ClueWeb09 subset
• Collection: 19K pages rendered by Waterloo
       – Task 1: teams judge (a subset)
       – Task 2: teams aggregate judgments we provide
• Topics: taken from past Million Query (MQ) and Relevance Feedback (RF) tracks
• Gold: Roughly 3K prior NIST judgments
       – Remaining 16K pages have no “gold” judgments




Nov. 16, 2011          TREC 2011 Crowdsourcing Track    11
What to Predict?
• Teams submit classification and/or ranking labels
       – Classification supports traditional absolute relevance judging
       – Rank labels support pair-wise preference or list-wise judging
• Classification labels in [0,1]
       – Probability of relevance (assessor/system uncertainty)
       – Simple generalization of binary relevance
        – If probabilities are submitted without a ranking, rank labels are induced from them (see sketch below)
• Ranking as [1..N]
       – Task 1: rank 5 documents per set
                • Same worker had to label all 5 examples in a given set (challenge)
       – Task 2: rank all documents per topic
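
For illustration, inducing rank labels from submitted probabilities (as noted above) amounts to sorting by probability of relevance; this minimal sketch uses hypothetical document IDs and scores, and its arbitrary tie-breaking may differ from the track's exact policy.

```python
def induce_ranks(probabilities):
    """Induce rank labels [1..N] from probability-of-relevance scores.

    probabilities: dict mapping doc_id -> probability in [0, 1].
    Returns a dict mapping doc_id -> rank, where rank 1 is the most likely relevant.
    """
    ordered = sorted(probabilities, key=probabilities.get, reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}

print(induce_ranks({"d1": 0.9, "d2": 0.4, "d3": 0.7}))
# {'d1': 1, 'd3': 2, 'd2': 3}
```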
Nov. 16, 2011                        TREC 2011 Crowdsourcing Track                     12
Metrics
• Classification
   – Binary ground truth: P, R, Accuracy, Specificity, LogLoss
   – Probabilistic ground truth: KL, RMSE

• Ranking
   – Mean Average Precision (MAP)
   – Normalized Discounted Cumulative Gain (NDCG)
                • Ternary NIST judgments conflated to binary
                • Could explore mapping [0,1] consensus to ternary categories



Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             13
Classification Metrics

                             Prediction
                         Rel          Non-rel
  Ground    True         TP           FN
  Truth     False        FP           TN

Nov. 16, 2011     TREC 2011 Crowdsourcing Track                              14
Classification Metrics (cont’d)
• Classification – Binary ground truth (cont’d)
• Classification – Probabilistic ground truth
       – KL divergence and Root Mean Squared Error (RMSE)
• Notes
       – To avoid log(0) = infinity, replace 0 with 10^-15
       – Revision: compute average per-example LogLoss and KL so error does not
         grow with sample size (particularly with varying team coverage)
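
The metric formulas on this slide were images and did not survive extraction; as a reference rather than the track's exact implementation, the standard per-example forms consistent with the notes above (probabilities clamped to avoid log(0), errors averaged over the N scored examples) are:

\[
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\,\Big],
\qquad \hat{p}_i \in [10^{-15},\, 1-10^{-15}]
\]
\[
\mathrm{KL} = \frac{1}{N}\sum_{i=1}^{N}\Big[\, p_i \log \frac{p_i}{\hat{p}_i} + (1-p_i)\log \frac{1-p_i}{1-\hat{p}_i}\,\Big]
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(p_i - \hat{p}_i\big)^2}
\]

where \(y_i\) is the binary gold label, \(p_i\) the probabilistic ground-truth label, and \(\hat{p}_i\) the submitted probability.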


Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                 15
Ground Truth: Three Versions

• Gold: NIST Judgments
       – only available for a subset of the test collection
• Consensus: generated by aggregating team labels (automatic)
       – full coverage
• Team-based (Task 2 only)
       – use each team’s labels as truth to evaluate all other teams
       – Inspect variance in team rankings over alternative ground truths
       – Coverage varies

Three primary evaluation conditions
1. Over examples having gold labels (evaluate vs. gold labels)
2. Over examples having gold labels (evaluate vs. consensus labels)
3. Over all examples (evaluate vs. consensus labels)

Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             16
Consensus
• Goal: Infer single consensus label from multiple input labels
• Methodological Goals: unbiased, transparent, simple
• Method: simple average, rounded to binary when a metric requires it (sketched below)
       – Task 2: input = example labels from each team
       – Task 1: input = per-example average of worker labels from each team
• Details
       –   Classification labels only; no rank fusion
       –   Using primary runs only
       –   Task 1: each team gets 1 vote regardless of worker count (prevent bias)
       –   Exclude any examples where
                • only one team submitted a label (bias)
                • consensus would yield a tie (binary metrics only)
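
A minimal sketch of the consensus construction described above, assuming the per-team classification labels in [0,1] for each example have already been gathered; the exclusion of single-team examples and of rounding ties follows the bullets, while the function and variable names are illustrative rather than the organizers' actual code.

```python
def build_consensus(team_labels, binary=False):
    """Build consensus labels by simple averaging of per-team labels.

    team_labels: dict mapping example_id -> {team_id: label in [0, 1]}.
    binary: if True, round to {0, 1} and drop exact ties, as done for binary metrics.
    Examples labeled by only one team are always excluded to avoid bias.
    """
    consensus = {}
    for example, labels in team_labels.items():
        if len(labels) < 2:          # only one team submitted a label -> exclude
            continue
        avg = sum(labels.values()) / len(labels)
        if binary:
            if avg == 0.5:           # consensus would yield a tie -> exclude
                continue
            avg = int(avg > 0.5)
        consensus[example] = avg
    return consensus

labels = {"t1-d1": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.0}, "t1-d2": {"A": 1.0}}
print(build_consensus(labels))               # {'t1-d1': 0.75}
print(build_consensus(labels, binary=True))  # {'t1-d1': 1}
```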

Nov. 16, 2011                         TREC 2011 Crowdsourcing Track             17
How good is consensus? Compare to gold.

Task 1: 395 gold topic-document pairs

  Labels                     ACC    PRE    REC    SPE    LL      KL     RMSE
  Probabilistic Consensus    0.69   0.74   0.79   0.57   0.71    0.23   0.38
  Rounded Binary Consensus   0.80   0.87   0.85   0.66   6.85    3.14   0.45

Task 2: 1000 gold topic-document pairs

  Labels                     ACC    PRE    REC    SPE    LL      KL     RMSE
  Probabilistic Consensus    0.62   0.73   0.60   0.50   0.65    0.19   0.47
  Rounded Binary Consensus   0.69   0.83   0.65   0.55   10.71   2.94   0.56



 Issue: need to consider proper scoring rules
Nov. 16, 2011                 TREC 2011 Crowdsourcing Track                     18
Task 1: Assessment (Judging)
Task 1: Data
• Option 1: Use Waterloo rendered pages
       – Available as images, PDFs, and plain text (+html)
       – Many page images fetched from CMU server
       – Protect workers from malicious scripting
• Option 2: Use some other format
       – Any team creating some other format was asked
         to provide that data or conversion tool to others
       – Avoid comparison based on different rendering

Nov. 16, 2011            TREC 2011 Crowdsourcing Track       20
Task 1: Data
• Topics: 270 (240 development, 30 test)
• Test Effort: ~2200 topic-document pairs for each team to judge
       – Shared sets: judged by all teams
                • Test: 1655 topic-document pairs (331 sets) over 20 topics
        – Assigned sets: judged by a subset of teams
                • Test: 1545 topic-document pairs (309 sets) over 15 topics in total
                • ~ 500 assigned to each team (~ 30 rel, 20 non-rel, 450 unknown)
       – Split intended to let organizers measure any worker-training effects
                • Increased track complexity, decreased useful redundancy & gold …

• Gold: 395 topic-document pairs for test
       – made available to teams for cross-validation (not blind)

Nov. 16, 2011                           TREC 2011 Crowdsourcing Track                  21
Task 1: Cost & Sponsorship
• Paid crowd labor only one form of crowdsourcing
       – Other models: directed gaming, citizen science, virtual pay
       – Incentives: socialize with others, recognition, social good, learn, etc.

• Nonetheless, paid models continue to dominate
       – e.g. Amazon Mechanical Turk (MTurk), CrowdFlower

• Risk: cost of crowd labor being barrier to track participation
• Risk Mitigation: sponsorship
       – CrowdFlower: $100 free credit to interested teams
       – Amazon: ~ $300 reimbursement to teams using MTurk (expected)
Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                      22
Task 1: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
      –         CrowdFlower qualification, MTurk judging
2. Delft University of Technology – Vuurens (TUD_DMIR): MTurk
3. Delft University of Technology & University of Iowa (GeAnn)
      –         Game, recruit via CrowdFlower
4.     Glasgow – Terrier (uogTr): MTurk
5.     Microsoft (MSRC): MTurk
6.     RMIT University (RMIT): CrowdFlower
7.     University Carlos III of Madrid (uc3m): MTurk
8.     University of Waterloo (UWaterlooMDS): in-house judging

5 used MTurk, 3 used CrowdFlower, 1 judged in-house (BUPT used both platforms)

Nov. 16, 2011                        TREC 2011 Crowdsourcing Track   23
Task 1: Evaluation method
• Average per-worker performance
       – Average weighted by number of labels per worker (see sketch below)
       – Primary evaluation includes rejected work


• Additional metric: Coverage
       – What % of examples were labeled by the team?


• Cost & time to be self-reported by teams
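
For concreteness, the label-count-weighted per-worker average described above can be computed as in this minimal sketch; it assumes a per-worker quality score (e.g. accuracy against gold) and label count have already been derived, and the names are illustrative rather than the official evaluation script.

```python
def weighted_per_worker_average(worker_stats):
    """Average a per-worker quality score, weighting each worker by label count.

    worker_stats: dict mapping worker_id -> (score, num_labels),
                  e.g. score = accuracy of that worker's labels against gold.
    Returns the label-weighted mean, so prolific workers count proportionally more.
    """
    total_labels = sum(n for _score, n in worker_stats.values())
    return sum(score * n for score, n in worker_stats.values()) / total_labels

stats = {"w1": (0.90, 300), "w2": (0.60, 100)}
print(weighted_per_worker_average(stats))  # 0.825
```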
Nov. 16, 2011           TREC 2011 Crowdsourcing Track   24
¼ most productive workers do ¾ of the work

  Workers (by productivity)    # of labels    % of labels
  Top 25%                      44917           76.77%
  Top 50%                      53444           91.34%
  Top 75%                      56558.5         96.66%
  Total                        58510          100%

Nov. 16, 2011              TREC 2011 Crowdsourcing Track                 25
Same worker, multiple teams

[Chart: number of examples labeled per anonymized worker ID]

  # of teams a worker belongs to    # of workers    avg. # of examples
  1                                  947             56.21
  2                                   35            146.65
  3                                    2             72.25
                     Nov. 16, 2011            TREC 2011 Crowdsourcing Track               26
Task 2: Aggregation
Task 2: Data
• Input: judgments provided by organizers
    – 19,033 topic-document pairs
    – 89,624 binary judgments from 762 workers
• Evaluation: average per-topic performance
• Gold: 3275 labels
    – 2275 for training (1275 relevant, 1000 non-relevant)
           • Excluded from evaluation
    – 1000 for blind test (balanced 500/500)

Nov. 16, 2011               TREC 2011 Crowdsourcing Track   28
Task 2: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
2. Delft University of Technology – Vuurens (TUD_DMIR)
3. Delft University of Technology & University of Iowa (GeAnn)
4. Glasgow – Terrier (uogTr)
5. Glasgow – Zuccon (qirdcsuog)
6. LingPipe
7. Microsoft (MSRC)
8. University Carlos III of Madrid (uc3m)
9. University of Texas at Austin (UTAustin)
10. University of Waterloo (UWaterlooMDS)

Nov. 16, 2011           TREC 2011 Crowdsourcing Track            29
Discussion
• Consensus labels as ground truth
       – Consensus Algorithm for Label Generation?
       – Probabilistic or Rounded Binary Consensus Labels?
• Proper scoring rules
• Changes for 2012?
       –   Which document collection? Request NIST judging?
       –   Drop the two-task format? Presuppose a crowdsourced solution?
       –   Broaden sponsorship? Narrow scope?
       –   Additional organizer?
       –   Details
                • Focus on worker training effects
                • Treatment of rejected work



Nov. 16, 2011                         TREC 2011 Crowdsourcing Track       30
Conclusion
• Interesting first year of track
       – Some insights about what worked well and less well in track design
       – Participants will tell us about methods developed
       – More analysis still needed for evaluation
• Track will run again in 2012
       – Help shape it with feedback (planning session, hallway, or email)
• Acknowledgments
       – Hyun Joon Jung (UT Austin)
       – Mark Smucker (U Waterloo)
       – Ellen Voorhees & Ian Soboroff (NIST)
• Sponsors
       – Amazon
       – CrowdFlower
Nov. 16, 2011                  TREC 2011 Crowdsourcing Track                  31
