5. News Queries change over time for query in task return Random(Yes,No) end loop 2 News-related? Sea Creature? Query: Octopus? Work Cup Predications? How can we overcome these difficulties to create a high quality dataset for news query classification?
19. Interfaces : LinkSupported Links to three major search engines . . . Triggers a search containing the query and its date Also get some additional feedback from workers 12
37. How is our Baseline?Validation is very important: 32% of judgments were rejected Basic Interface Accuracy : Combined Measure (assumes that the workers labelled non-news-related queries correctly) Recall : The % of all news-related queries that the workers labelled correctly As expected the baseline is fairly poor, i.e. Agreement between workers per query is low (25-50%) 20% of those were completed VERY quickly: Bots? Kfree : Kappa agreement assuming that workers would label randomly Precision : The % of queries labelled as news-related that agree with our ground truth . . . and new users Kfleiss : Kappa agreement assuming that workers will label according to class distribution Watch out for bursty judging
38.
39.
40. High recall and agreement indicate that the labels are of high qualityPrecision . Workers finding other queries to be news-related Recall Workers got all of the news-related queries right! Agreement . Workers maybe learning the task? Majority of work done by 3 users testset fullset
41.
42. We are confident that our dataset is reliable since agreement is highBest Practices Online worker validation is paramount, catch out bots and lazy workers to improve agreement Provide workers with additional information to help improve labelling quality Workers can learn, running large single jobs may allow workers to become better at the task Questions? 21
Notes de l'éditeur
Poisson sampling – literature says is representative
At beginningOverriding hypothesis
Too late$2
Precision : the number of
Worker overlap is only 1/3
Precision : the number of
More detail and link backwardsGraph missingHow confident in the test collection – looks reliable