Click Models
Kira Radinsky
Slides based on material from:
Filip Radlinski, Madhu Kurup, and Thorsten Joachims
Motivation
• How can we evaluate search engine quality?
Option 1: Ask experts to judge queries & result sets.
For a sample of queries, judges are paid to examine a
sample of documents and mark their relevance. This
standard process gives a reusable dataset.
Option 2: Watch how users act and hope it tells us
something about quality.
For all queries, record how users act and infer the quality
of the search results based on the logs of user actions.
• The key question: What is the relationship between user
behaviour and ranking quality?
Outline
• Describe a study of evaluating search with clicks
– Control ranking quality, and measure the effect on
user behaviour.
• Evaluation with Absolute Metrics
– Users were shown results from different functions.
– Measure statistics about user responses.
• Evaluation using Paired Comparisons
– Show a combination of results from 2 rankings.
– Infer relative preferences.
• Discuss limitations and open questions
Experiment Design
• Start with search ranking function f.
• Intentionally degrade performance in two
steps, making f1 and f2.
• Measure how user behaviour differs between
the ranking functions.
• Interleave results from two rankings and
measure responses.
Setup: f better than f1 better than f2
User Study on arXiv.org
– Real users and queries
– Users in natural context
– Degradation types:
ORIG > FLAT > RAND
• ORIG: hand-tuned function
• FLAT: ignore meta-data
• RAND: randomize top-10
ORIG > SWAP2 > SWAP4
• ORIG: hand-tuned function
• SWAP2: swap 2 pairs
• SWAP4: swap 4 pairs
– How does user behaviour change?
Experiment Setup
Phase 1: ORIG-FLAT-RAND
• Each user who comes to the search engine is assigned one of 6 experimental conditions:
– Results generated by ORIG
– Results generated by FLAT
– Results generated by RAND
– Results generated by interleaving ORIG & FLAT
– Results generated by interleaving ORIG & RAND
– Results generated by interleaving FLAT & RAND
Phase 2: ORIG-SWAP2-SWAP4
• Each user who comes to the search engine is assigned one of 6 experimental conditions:
– Results generated by ORIG
– Results generated by SWAP2
– Results generated by SWAP4
– Results generated by interleaving ORIG & SWAP2
– Results generated by interleaving ORIG & SWAP4
– Results generated by interleaving SWAP2 & SWAP4
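The slides do not say how users were mapped to conditions. One possible sketch, assuming each user has some stable identifier (the name `user_id` and the hashing scheme are illustrative assumptions, not the study's method), is to hash the identifier into one of the six buckets so that a returning user keeps the same condition:

```python
import hashlib

# Condition labels follow the Phase 1 list above; the strings are illustrative.
PHASE_1_CONDITIONS = [
    "ORIG",
    "FLAT",
    "RAND",
    "interleave(ORIG, FLAT)",
    "interleave(ORIG, RAND)",
    "interleave(FLAT, RAND)",
]


def assign_condition(user_id, conditions):
    """Map a user identifier to one condition, stably across visits."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[bucket]


condition = assign_condition("example-user", PHASE_1_CONDITIONS)
```

Hashing rather than drawing a fresh random number on every visit keeps one user's experience consistent, at the cost of needing a reasonably stable identifier.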
Experiment 1: Absolute Metrics
• Measured eight easily recorded statistics
• As the ranking quality decreases, we can
hypothesize:
Metric                   Expected change as ranking gets worse
Abandonment Rate         Increase (more bad result sets)
Reformulation Rate       Increase (more need to reformulate)
Queries per Session      Increase (more need to reformulate)
Clicks per Query         Decrease (fewer relevant results)
Max Reciprocal Rank*     Decrease (top results are worse)
Mean Reciprocal Rank*    Decrease (more need for many clicks)
Time to First Click*     Increase (good results are lower)
Time to Last Click*      Decrease (fewer relevant results)
(*) Only queries with at least one click count
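A minimal sketch of how three of these statistics might be computed from a click log. The record layout (`QueryLog` with a list of 1-based clicked ranks) is an assumption made for illustration, and Max Reciprocal Rank is taken here to mean the reciprocal of the highest-ranked click, averaged over queries with at least one click; the study's exact definitions may differ.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class QueryLog:
    """One issued query; clicked_ranks holds 1-based ranks of clicked results."""
    clicked_ranks: list = field(default_factory=list)


def absolute_metrics(queries):
    """Compute three absolute metrics over a list of QueryLog records."""
    clicked = [q for q in queries if q.clicked_ranks]
    return {
        # Fraction of queries that received no click at all.
        "abandonment_rate": 1 - len(clicked) / len(queries),
        # Average number of clicks over all queries, clicked or not.
        "clicks_per_query": mean(len(q.clicked_ranks) for q in queries),
        # Starred metric: only queries with at least one click count.
        "max_reciprocal_rank": mean(1 / min(q.clicked_ranks) for q in clicked),
    }


# Hypothetical log: two queries with clicks, two abandoned.
logs = [QueryLog([1]), QueryLog([2, 4]), QueryLog(), QueryLog()]
metrics = absolute_metrics(logs)
```

Comparing these numbers between two ranking functions is what the absolute-metric evaluation amounts to; each function needs its own set of users and logs.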
Experiment Statistics
• On average:
– About 700 queries a day
– About 300 distinct IPs
– About 600 clicks on results
• Each experiment phase was run for one month.
• Each experimental condition observed:
– About 3,000 queries
– About 1,000 queries with clicks
– About 600 distinct IPs.
Absolute Metrics: Results
[Figure: bar charts of the absolute metrics for ORIG, FLAT, RAND and for ORIG, SWAP2, SWAP4; vertical axis from 0 to 2.]
Absolute Metrics: Results
• Summarizing the results, out of 6 pairs:
Summary
• Statistical fluctuations after
one month of data make
conclusions hard to draw
• None of the absolute
metrics reliably identifies the
better ranking.
Experiment 2: Interleaved Metrics
• Paired comparisons in sensory analysis:
– Perceptual qualities are hard to test on absolute
scale (e.g. taste, sound).
– Subjects usually presented with 2+ alternatives.
– Asked to specify which they prefer.
• Can do the same thing with ranking functions:
– Present two rankings, ask which is better.
– But we’d also like evaluation to be transparent.
• So we can do an interleaving experiment.
Team Draft Interleaving
• Think of making high school sports teams
– We start with two captains.
– Each has a preference order over players.
– They take turns picking their next player.
• Interleaving Algorithm
– Flip a coin to see which ranking goes first.
– That ranking picks highest ranked available document.
Any clicks on it will be assigned to that ranking.
– The other team picks highest ranked available doc.
– Flip a coin again and continue.
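The steps above can be sketched as follows. The function names, the `coin` parameter and the shortened URLs are illustrative assumptions; the two input rankings are taken from the Napa Valley example in this deck, and the coin-flip sequence is chosen by hand so that the output matches that example's presented ranking.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, coin=None):
    """Interleave two rankings; return (presented, team) lists.

    team[i] is "A" or "B": the ranking credited with a click on presented[i].
    coin() returns True when ranking A picks first in the next round.
    """
    if coin is None:
        coin = lambda: random.random() < 0.5
    presented, team = [], []
    sources = {"A": ranking_a, "B": ranking_b}

    def pick(name):
        # Take this team's highest-ranked document not yet presented.
        for doc in sources[name]:
            if doc not in presented:
                presented.append(doc)
                team.append(name)
                return True
        return False

    while len(presented) < length:
        order = ("A", "B") if coin() else ("B", "A")
        progressed = False
        for name in order:
            if len(presented) < length and pick(name):
                progressed = True
        if not progressed:  # both rankings are exhausted
            break
    return presented, team


def credit_clicks(team, clicked_positions):
    """Count clicks per team; clicked_positions are 0-based indexes."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[team[pos]] += 1
    return wins


ranking_a = ["napavalley.com", "napavalley.com/wineries", "napavalley.edu",
             "ivebeenthere.co.uk", "napavintners.com", "wikipedia/Napa_Valley"]
ranking_b = ["wikipedia/Napa_Valley", "napavalley.com", "books.google.co.uk",
             "napalinks.com", "napavalley.org", "napavalleymarathon.org"]

# Assumed flips: A picks first in round 1, B picks first in rounds 2 to 4.
flips = iter([True, False, False, False])
presented, team = team_draft_interleave(
    ranking_a, ranking_b, length=7, coin=lambda: next(flips))
```

With one click on an A-team result and one on a B-team result, for example `credit_clicks(team, [0, 1])`, each ranking gets one click and the query counts as a tie.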
Team Draft Interleaving Phase
Phase 3: ORIG-FLAT-RAND and ORIG-SWAP2-SWAP4
• Each user who comes to the search engine is
assigned one of 6 experimental conditions:
– Results generated by team-draft: ORIG & FLAT
– Results generated by team-draft: ORIG & RAND
– Results generated by team-draft: FLAT & RAND
– Results generated by team-draft: ORIG & SWAP2
– Results generated by team-draft: ORIG & SWAP4
– Results generated by team-draft: SWAP2 & SWAP4
Team Draft Interleaving
Ranking A
1. Napa Valley – The authority for lodging...
www.napavalley.com
2. Napa Valley Wineries - Plan your wine...
www.napavalley.com/wineries
3. Napa Valley College
www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
www.napavintners.com
6. Napa Country, California – Wikipedia
en.wikipedia.org/wiki/Napa_Valley
Ranking B
1. Napa Country, California – Wikipedia
en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
www.napavalley.com
3. Napa: The Story of an American Eden...
books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
www.napalinks.com
5. NapaValley.org
www.napavalley.org
6. The Napa Valley Marathon
www.napavalleymarathon.org
Presented Ranking
1. Napa Valley – The authority for lodging...
www.napavalley.com
2. Napa Country, California – Wikipedia
en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden...
books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine...
www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast...
www.napalinks.com
6. Napa Valley College
www.napavalley.edu/homex.asp
7. NapaValley.org
www.napavalley.org
[The slide is then repeated with the outcome of this example marked: Tie!]
Interleaving Results
[Figure: bar chart comparing "better ranking wins" with "worse ranking wins" for the pairs ORIG > FLAT, FLAT > RAND, ORIG > RAND, ORIG > SWAP2, SWAP2 > SWAP4, ORIG > SWAP4; vertical axis from 0 to 60.]
Interleaving Results
• The conclusion is consistent and stronger:
(Absolute Metrics)
Summary
• Paired comparison tests
always correctly identified
the better ranking.
• Most of the differences are
statistically significant.
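The slides do not say which significance test was used. As one possible sketch, a two-sided binomial sign test over the queries where one ranking won (ties dropped) asks how likely a split this lopsided would be if both rankings were equally good. The function name and the win counts below are made-up illustration values, not results from the study.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test p-value under H0: each side wins with p = 0.5."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


p = sign_test_p_value(60, 40)  # hypothetical counts
```

A small p-value suggests the preference for one ranking is unlikely to be a statistical fluctuation; with few queries even a clear-looking split can fail to reach significance.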
Summary of Experiment
• Constructed two triplets of ranking functions.
• Tested on real users.
• Absolute metrics didn’t change as we expected.
– Changes weren’t always monotonic.
• Interleaving gave more significant results, and
was more reliable.
– But cannot be run “after the fact” from logs.
• But there are many caveats to think about...
Discussion: Users & Queries
• We were only able to explore a few aspects of
the problem:
– The users are not “typical” web users.
– The type of queries is not typical.
– Results could be different in other settings:
enterprise search, general web search, personalized
search, desktop search, mobile search...
– It would be interesting to conduct similar
experiments in some of these other settings.
Discussion: User Interactions
• All click evaluations rely on clicks being useful.
• Presentation should not bias toward either ranking
function.
– Naively interleaving two rankings with different snippet
engines could bias users.
– But what if, say, only the URL length differs?
• Answer may be in the snippet (“instant answers”).
– In that case there may be no click.
– Other effects (e.g. temporal, mouse movement, browser
buttons) may give more information, but harder to log.
Discussion: Click Metrics
• The metrics we used were fairly simple
– What if a “click followed by back within 5 seconds”
didn’t count?
– If we got much more data, absolute metrics could
also become reliable.
– More sophisticated absolute metrics may be more
powerful or reliable.
– More sophisticated interleaved metrics may also be.
Discussion: Log Reusability
• Say somebody else comes up with a new ranking
function. Are our logs useful to them?
– For absolute metrics
• Would provide baseline performance numbers.
• But temporal effects, etc., may affect evaluation.
– For paired comparison test:
• Hard to know what the user would have clicked given a
different input, so probably not.
Conclusions
• We’d like to evaluate rankings by observing
real users: reflects real needs, cheaper, faster.
• This can be done using absolute measures, or
designing a paired comparison experiment.
• In this particular setting, the paired
comparison was more reliable and sensitive.
• There are many open questions about when
paired comparison is indeed better.

Contenu connexe

Similaire à Tutorial 12 (click models)

Recommandation systems -
Recommandation systems - Recommandation systems -
Recommandation systems - Yousef Fadila
 
recommendation system techunique and issue
recommendation system techunique and issuerecommendation system techunique and issue
recommendation system techunique and issueNutanBhor
 
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...Michael Powers
 
Improve the UX of Your Content and Prove It
Improve the UX of Your Content and Prove ItImprove the UX of Your Content and Prove It
Improve the UX of Your Content and Prove ItPam Noreault
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsTomer Gabel
 
Ronny lempelyahooindiabigthinkerapril2013
Ronny lempelyahooindiabigthinkerapril2013Ronny lempelyahooindiabigthinkerapril2013
Ronny lempelyahooindiabigthinkerapril2013Muthusamy Chelliah
 
Recommender systems
Recommender systemsRecommender systems
Recommender systemsTamer Rezk
 
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity SearchFrom “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity SearchMounia Lalmas-Roelleke
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveAndrea Gazzarini
 
SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunk
 
Analyzing behavioral data for improving search experience
Analyzing behavioral data for improving search experienceAnalyzing behavioral data for improving search experience
Analyzing behavioral data for improving search experiencePavel Serdyukov
 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Editionkrisztianbalog
 
Personalized Search at Sandia National Labs
Personalized Search at Sandia National LabsPersonalized Search at Sandia National Labs
Personalized Search at Sandia National LabsLucidworks
 
Sistema de recomendações de Filmes do Netflix
Sistema de recomendações de Filmes do NetflixSistema de recomendações de Filmes do Netflix
Sistema de recomendações de Filmes do NetflixGabriel Peixe
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 
Information Retrieval Models for Recommender Systems - PhD slides
Information Retrieval Models for Recommender Systems - PhD slidesInformation Retrieval Models for Recommender Systems - PhD slides
Information Retrieval Models for Recommender Systems - PhD slidesDaniel Valcarce
 

Similaire à Tutorial 12 (click models) (20)

Recommandation systems -
Recommandation systems - Recommandation systems -
Recommandation systems -
 
recommendation system techunique and issue
recommendation system techunique and issuerecommendation system techunique and issue
recommendation system techunique and issue
 
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...
Fast, Cheap, and Actionable: Creating an Affordable User Research Program (Th...
 
Improve the UX of Your Content and Prove It
Improve the UX of Your Content and Prove ItImprove the UX of Your Content and Prove It
Improve the UX of Your Content and Prove It
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of Us
 
Ronny lempelyahooindiabigthinkerapril2013
Ronny lempelyahooindiabigthinkerapril2013Ronny lempelyahooindiabigthinkerapril2013
Ronny lempelyahooindiabigthinkerapril2013
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
Fashiondatasc
FashiondatascFashiondatasc
Fashiondatasc
 
Don't Fear the User
Don't Fear the UserDon't Fear the User
Don't Fear the User
 
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity SearchFrom “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search
 
Web testing
Web testingWeb testing
Web testing
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP Interactive
 
Analyzing behavioral data for improving search experience
Analyzing behavioral data for improving search experienceAnalyzing behavioral data for improving search experience
Analyzing behavioral data for improving search experience
 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Edition
 
Personalized Search at Sandia National Labs
Personalized Search at Sandia National LabsPersonalized Search at Sandia National Labs
Personalized Search at Sandia National Labs
 
Sistema de recomendações de Filmes do Netflix
Sistema de recomendações de Filmes do NetflixSistema de recomendações de Filmes do Netflix
Sistema de recomendações de Filmes do Netflix
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Information Retrieval Models for Recommender Systems - PhD slides
Information Retrieval Models for Recommender Systems - PhD slidesInformation Retrieval Models for Recommender Systems - PhD slides
Information Retrieval Models for Recommender Systems - PhD slides
 

Plus de Kira

Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Kira
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Kira
 

Plus de Kira (9)

Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 

Dernier

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Dernier (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Tutorial 12 (click models)

  • 1. Click Models Kira Radinsky Slides based on material from: Filip Radlinski, Madhu Kurup, and Thorsten Joachims
  • 2. Motivation • How can we evaluate search engine quality? Option 1: Ask experts to judge queries & result sets. For a sample of queries, judges are paid to examine a sample of documents and mark their relevance. This standard process gives a reusable dataset. Option 2: Watch how users act and hope it tells us something about quality. For all queries, record how users act and infer the quality of the search results based on the logs of user actions.
  • 3. Motivation • How can we evaluate search engine quality? Option 1: Ask experts to judge queries & result sets. For a sample of queries, judges are paid to examine a sample of documents and determine relevance. This standard process gives a reusable dataset. Option 2: Watch how users act and hope it tells us something about quality. For all queries, record how users act and infer the quality of the search results based on the logs of user actions. • The key question: What is the relationship between user behaviour and ranking quality?
  • 4. Outline • Describe a study of evaluation search with clicks – Control ranking quality, and measure the effect on user behaviour. • Evaluation with Absolute Metrics – Users were shown results from different functions. – Measure statistics about user responses. • Evaluation using Paired Comparisons – Show a combination of results from 2 rankings. – Infer relative preferences. • Discuss limitations and open questions
  • 5. Experiment Design • Start with search ranking function f. • Intentionally degrade performance in two steps, making f1 and f2. • Measure how user behaviour differs between the ranking functions. • Interleave results from two rankings and measure responses. Setup: f better than f1 better than f2
  • 6. User Study on arXiv.org – Real users and queries – Users in natural context – Degradation types: • ORIG > FLAT > RAND: ORIG is the hand-tuned function, FLAT ignores meta-data, RAND randomizes the top 10 • ORIG > SWAP2 > SWAP4: ORIG is the hand-tuned function, SWAP2 swaps 2 pairs, SWAP4 swaps 4 pairs – How does user behaviour change?
  • 7. Experiment Setup Phase 1: ORIG-FLAT-RAND • Each user who comes to the search engine is assigned one of 6 experimental conditions: – Results generated by ORIG – Results generated by FLAT – Results generated by RAND – Results generated by interleaving ORIG & FLAT – Results generated by interleaving ORIG & RAND – Results generated by interleaving FLAT & RAND Phase 2: ORIG-SWAP2-SWAP4 • Each user who comes to the search engine is assigned one of 6 experimental conditions: – Results generated by ORIG – Results generated by SWAP2 – Results generated by SWAP4 – Results generated by interleaving ORIG & SWAP2 – Results generated by interleaving ORIG & SWAP4 – Results generated by interleaving SWAP2 & SWAP4
  • 8. Experiment 1: Absolute Metrics • Measured eight easily recorded statistics • As the ranking quality decreases, we can hypothesize:
      Metric | Expected change as ranking gets worse
      Abandonment Rate | Increase (more bad result sets)
      Reformulation Rate | Increase (more need to reformulate)
      Queries per Session | Increase (more need to reformulate)
      Clicks per Query | Decrease (fewer relevant results)
      Max Reciprocal Rank* | Decrease (top results are worse)
      Mean Reciprocal Rank* | Decrease (more need for many clicks)
      Time to First Click* | Increase (good results are lower)
      Time to Last Click* | Decrease (fewer relevant results)
      (*) Only queries with at least one click count
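A minimal sketch of how a few of these statistics could be computed from a click log. The slide does not define the metrics, so the definitions below are assumptions: abandonment rate is the fraction of queries with no click, Max Reciprocal Rank uses the highest-ranked click of each query, and Mean Reciprocal Rank averages 1/rank over a query's clicks. The `QueryLog` record and its field name are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class QueryLog:
    """Hypothetical log record: 1-based ranks of the results clicked for one query."""
    click_ranks: list = field(default_factory=list)


def absolute_metrics(queries):
    """Compute a few absolute click metrics over a non-empty list of QueryLog records."""
    if not queries:
        raise ValueError("need at least one query")
    clicked = [q for q in queries if q.click_ranks]
    metrics = {
        # Assumed definition: share of queries that received no click at all.
        "abandonment_rate": 1 - len(clicked) / len(queries),
        "clicks_per_query": mean(len(q.click_ranks) for q in queries),
    }
    if clicked:
        # Starred metrics: only queries with at least one click count.
        metrics["max_reciprocal_rank"] = mean(1 / min(q.click_ranks) for q in clicked)
        metrics["mean_reciprocal_rank"] = mean(
            mean(1 / r for r in q.click_ranks) for q in clicked
        )
    return metrics


example = [QueryLog([1, 3]), QueryLog([]), QueryLog([2]), QueryLog([])]
print(absolute_metrics(example))
```

The time-based and session-based metrics would need timestamps and session boundaries, which this sketch leaves out.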
  • 9. Experiment Statistics • On average: – About 700 queries a day – About 300 distinct IPs – About 600 clicks on results • Each experiment phase was run for one month. • Each experimental condition observed: – About 3,000 queries – About 1,000 queries with clicks – About 600 distinct IPs.
  • 11. Absolute Metrics: Results • Summarizing the results, out of 6 pairs: [results table not preserved in this transcript] Summary • Statistical fluctuations after one month of data make conclusions hard to draw. • None of the absolute metrics reliably identifies the better ranking.
  • 12. Experiment 2: Interleaved Metrics • Paired comparisons in sensory analysis: – Perceptual qualities are hard to test on absolute scale (e.g. taste, sound). – Subjects usually presented with 2+ alternatives. – Asked to specify which they prefer. • Can do the same thing with ranking functions: – Present two rankings, ask which is better. – But we’d also like evaluation to be transparent. • So we can do an interleaving experiment.
  • 13. Team Draft Interleaving • Think of making high school sports teams – We start with two captains. – Each has a preference order over players. – They take turns picking their next player. • Interleaving Algorithm – Flip a coin to see which ranking goes first. – That ranking picks highest ranked available document. Any clicks on it will be assigned to that ranking. – The other team picks highest ranked available doc. – Flip a coin again and continue.
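The interleaving steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the `length` cutoff, the injectable `rng`, and the handling of a ranking that has run out of unseen documents are all assumptions.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length=10, rng=None):
    """Interleave two rankings; return (presented, teams).

    teams[i] is "A" or "B": the ranking that picked presented[i] and
    that receives credit for any click on it.
    """
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    presented, teams, seen = [], [], set()

    def pick(team):
        # Take the highest-ranked document of this team that is not shown yet.
        for doc in rankings[team]:
            if doc not in seen:
                seen.add(doc)
                presented.append(doc)
                teams.append(team)
                return True
        return False  # assumption: a team with nothing left simply skips its turn

    while len(presented) < length:
        first = rng.choice("AB")  # coin flip: who picks first this round
        second = "B" if first == "A" else "A"
        progressed = False
        for team in (first, second):
            if len(presented) < length:
                progressed = pick(team) or progressed
        if not progressed:
            break  # both rankings are exhausted
    return presented, teams


presented, teams = team_draft_interleave(["a", "b", "c"], ["b", "d", "a"], length=4)
print(list(zip(presented, teams)))
```

Passing a seeded `random.Random` makes a run reproducible, which helps when logging which ranking contributed each slot.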
  • 14. Team Draft Interleaving Phase Phase 3: ORIG-FLAT-RAND and ORIG-SWAP2-SWAP4 • Each user who comes to the search engine is assigned one of 6 experimental conditions: – Results generated by team-draft: ORIG & FLAT – Results generated by team-draft: ORIG & RAND – Results generated by team-draft: FLAT & RAND – Results generated by team-draft: ORIG & SWAP2 – Results generated by team-draft: ORIG & SWAP4 – Results generated by team-draft: SWAP2 & SWAP4
  • 15. Team Draft Interleaving
      Ranking A
        1. Napa Valley – The authority for lodging... www.napavalley.com
        2. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
        3. Napa Valley College www.napavalley.edu/homex.asp
        4. Been There | Tips | Napa Valley www.ivebeenthere.co.uk/tips/16681
        5. Napa Valley Wineries and Wine www.napavintners.com
        6. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
      Ranking B
        1. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
        2. Napa Valley – The authority for lodging... www.napavalley.com
        3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
        4. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
        5. NapaValley.org www.napavalley.org
        6. The Napa Valley Marathon www.napavalleymarathon.org
      Presented Ranking
        1. Napa Valley – The authority for lodging... www.napavalley.com
        2. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
        3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
        4. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
        5. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
        6. Napa Valley College www.napavalley.edu/homex.asp
        7. NapaValley.org www.napavalley.org
  • 16. Team Draft Interleaving (Ranking A, Ranking B and the Presented Ranking are the same as on slide 15.) Outcome: Tie!
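To make the credit assignment concrete, here is a small sketch that counts clicks per team and declares a winner or a tie. The team labels are derived by matching the presented ranking on slide 15 against Rankings A and B (A picked slots 1, 4 and 6; B picked slots 2, 3, 5 and 7). The clicked positions are hypothetical, since the transcript does not preserve which results were clicked.

```python
from collections import Counter


def interleaving_winner(teams, clicked_positions):
    """Return "A", "B" or "Tie" for one query.

    teams[i] is the ranking credited for presented rank i + 1;
    clicked_positions holds 1-based ranks of the clicked results.
    """
    credit = Counter(teams[pos - 1] for pos in clicked_positions)
    if credit["A"] > credit["B"]:
        return "A"
    if credit["B"] > credit["A"]:
        return "B"
    return "Tie"


# Team assignment for the Napa Valley example, derived from the slide.
napa_teams = ["A", "B", "B", "A", "B", "A", "B"]

# Hypothetical click patterns (not taken from the slides).
print(interleaving_winner(napa_teams, [1, 3, 4]))  # A has 2 clicks, B has 1 -> A
print(interleaving_winner(napa_teams, [1, 2]))     # one click each -> Tie
```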
  • 17. Interleaving Results [Bar chart, axis from 0 to 60: for each pair (ORIG > FLAT, FLAT > RAND, ORIG > RAND, ORIG > SWAP2, SWAP2 > SWAP4, ORIG > SWAP4), one bar for “Better ranking wins” and one for “Worse ranking wins”]
  • 18. Interleaving Results • The conclusion is consistent and stronger: (Absolute Metrics) Summary • Paired comparison tests always correctly identified the better ranking. • Most of the differences are statistically significant.
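The slides do not say which significance test was used. One common choice for paired comparisons is a two-sided sign test (an exact binomial test with p = 0.5) on the per-query winners, ignoring ties. The sketch below assumes that test, and the win counts in the example are made up.

```python
from math import comb


def sign_test_p_value(wins_better, wins_worse):
    """Two-sided exact sign test: could the split of wins be a fair coin?

    Ties are assumed to have been dropped before counting.
    """
    n = wins_better + wins_worse
    k = max(wins_better, wins_worse)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: better ranking won 9 queries, worse ranking won 1.
print(sign_test_p_value(9, 1))  # 0.021484375
```

A small p-value suggests the preference for one ranking is unlikely to be chance; with few queries even a clear majority may not reach significance.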
  • 19. Summary of Experiment • Constructed two triplets of ranking functions. • Tested on real users. • Absolute metrics didn’t change as we expected. – Changes weren’t always monotonic. • Interleaving gave more significant results, and was more reliable. – But cannot be run “after the fact” from logs. • But there are many caveats to think about...
  • 20. Discussion: Users & Queries • We were only able to explore a few aspects of the problem: – The users are not “typical” web users. – The type of queries is not typical. – Results could be different in other settings: enterprise search, general web search, personalized search, desktop search, mobile search... – It would be interesting to conduct similar experiments in some of these other settings.
  • 21. Discussion: User Interactions • All click evaluations rely on clicks being useful. • Presentation should not bias toward either ranking function. – If we naively interleave two rankings with different snippet engines, this could bias users. – But what if, say, only the URL length differs? • Answer may be in the snippet (“instant answers”). – In that case there may be no click. – Other effects (e.g. temporal, mouse movement, browser buttons) may give more information, but are harder to log.
  • 22. Discussion: Click Metrics • The metrics we used were fairly simple – What if “clicked followed by back within 5 seconds” didn’t count? – If we got much more data, absolute metrics could also become reliable. – More sophisticated absolute metrics may be more powerful or reliable. – More sophisticated interleaved metrics may also be.
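As an illustration of the first question, a click filter that ignores quick returns could look like the sketch below. The `Click` record, its `dwell_seconds` field and the treatment of clicks with no recorded return are assumptions; only the 5-second threshold comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Click:
    """Hypothetical click record."""
    rank: int                       # 1-based rank of the clicked result
    dwell_seconds: Optional[float]  # time until the user came back; None if they never did


def counted_clicks(clicks, min_dwell=5.0):
    """Keep clicks that were not followed by a return within min_dwell seconds."""
    return [
        c for c in clicks
        if c.dwell_seconds is None or c.dwell_seconds >= min_dwell
    ]


clicks = [Click(1, 2.0), Click(3, 40.0), Click(5, None)]
print([c.rank for c in counted_clicks(clicks)])  # [3, 5]
```

The filtered list could then feed either the absolute metrics or the interleaving credit counts.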
  • 23. Discussion: Log Reusability • Say somebody else comes up with a new ranking function. Are our logs useful to them? – For absolute metrics • Would provide baseline performance numbers. • But temporal effects, etc, may affect evaluation. – For paired comparison test: • Hard to know what the user would have clicked given a different input, so probably not
  • 24. Conclusions • We’d like to evaluate rankings by observing real users: reflects real needs, cheaper, faster. • This can be done using absolute measures, or designing a paired comparison experiment. • In this particular setting, the paired comparison was more reliable and sensitive. • There are many open questions about when paired comparison is indeed better.