Measuring effectiveness of
machine learning systems
AMIT SHARMA, MICROSOFT RESEARCH INDIA
www.amitsharma.in
@amt_shrma
Are these systems performing well?
How do these systems affect people’s
behavior?
How can we make these systems better?
Better for whom?
Evaluating systemsEstimating
causal effects
Three examples.
Two studies:
Question: How good is a recommender system?
Estimate what would have happened without the
recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a
different demographic, without changing anything else.
Easiest answer: Look at metrics
But metrics can be misleading
Performance metric can be high or low for a number of
reasons
--day of week, time of year
--selection effects
Different metrics may provide a different picture.
Example 1: Increasing activity
on Xbox
From data to prediction
Use these correlations to make a predictive model.
Future Activity ->
f(number of friends, logins in past month)

From data to “actionable insights”
Would increasing the number of friends increase
people’s activity on our system?
Maybe, maybe not (!)
Different explanations are possible
Example 2: Search Ads
Are search ads really that effective?
But search results point to the same
website
Without reasoning about causality, we may
overestimate the effectiveness of ads
Okay, search ads have an explicit intent.
Display ads should be fine?
Estimating the impact of ads
People buy more toys in December
anyway
Example 3: Did a system change
lead to better outcomes?
Comparing old versus new
system (System A vs. System B)
Old (A)          New (B)
50/1000 (5%)     54/1000 (5.4%)
New system is better?
Looking at change in CTR by
income
              Old System (A)    New System (B)
Low-income    10/400 (2.5%)     4/200 (2%)
High-income   40/600 (6.7%)     50/800 (6.25%)
Simpson’s paradox
Is Algorithm A better?
                                         Old system (A)    New system (B)
Conversion rate for low-income people    10/400 (2.5%)     4/200 (2%)
Conversion rate for high-income people   40/600 (6.7%)     50/800 (6.25%)
Total conversion rate                    50/1000 (5%)      54/1000 (5.4%)
Answer (as usual): Maybe, maybe not.
High-income people might have better means to know about the updated
information.
Time of year may have an effect.
There could be other hidden causal variations.
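The reversal in the table above can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts shown on the slide; the dictionary layout and function names are illustrative choices, not part of the original talk.

```python
from fractions import Fraction

# (successes, trials) per income group and system, copied from the table above.
counts = {
    "low-income": {"A": (10, 400), "B": (4, 200)},
    "high-income": {"A": (40, 600), "B": (50, 800)},
}


def rate(successes, trials):
    """Exact rate as a fraction, to avoid rounding artifacts."""
    return Fraction(successes, trials)


def pooled_rate(system):
    """Rate for one system after pooling both income groups."""
    successes = sum(counts[group][system][0] for group in counts)
    trials = sum(counts[group][system][1] for group in counts)
    return rate(successes, trials)


def stratum_winners():
    """Which system has the higher rate within each income group."""
    return {
        group: "A" if rate(*counts[group]["A"]) > rate(*counts[group]["B"]) else "B"
        for group in counts
    }


pooled_winner = "A" if pooled_rate("A") > pooled_rate("B") else "B"

print("pooled:", float(pooled_rate("A")), float(pooled_rate("B")), "->", pooled_winner)
print("per group:", stratum_winners())
```

System B wins on the pooled rate while System A wins inside both income groups, because B was shown to a larger share of high-income users (800 of 1000 versus 600 of 1000).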
Easy answer: Do A/B tests
Great for testing focused hypotheses.
But cannot help you find robust hypotheses.
Not possible in many scenarios for ethical or practical
reasons.
Harder answer: Approximate
A/B tests offline
Gives all the benefits of A/B tests
Can be run at scale.
But, without an experiment, we have no data on
what would happen if we change the system.
How do we estimate something that we have never
observed?
Causal inference to the rescue…
Evaluating systems: Estimating
causal effects
Two examples:
Question: How good is a recommender system?
Estimate what would have happened without the
recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a
different demographic, without changing anything else.
Study 1: Estimating the impact
of a recommender system
Sharma-Hofman-Watts (2015, 2016)
Example: Estimating the causal impact of
Amazon’s recommender system
How much activity comes from
the recommendation system?
Confounding: Observed click-throughs
may be due to correlated demand
[Diagram nodes: Demand for The Road; Visits to The Road; Rec. visits to No Country for Old Men; Demand for No Country for Old Men]
Observed activity is almost surely
an overestimate of the causal
effect
[Figure: all page visits; observed activity from the recommender, labeled "Causal" and "Convenience"; activity without the recommender marked "?"]
Finding auxiliary outcome: Split outcome into
recommender (primary) and direct visits (auxiliary)
[Diagram: all visits to a recommended product, split into recommender visits and direct visits (search visits, direct browsing)]
Auxiliary outcome: Proxy for
unobserved demand
1a. Search for any product
with a shock to page visits
1b. Filtering out invalid natural
experiments
The “split-door” criterion
Criterion: $X \perp\!\!\!\perp Y_D$
[Diagram nodes: Demand for focal product ($U_X$); Visits to focal product ($X$); Rec. visits ($Y_R$); Direct visits ($Y_D$); Demand for rec. product ($U_Y$)]
More formally, why does it work?
[Diagram nodes: Unobserved variables ($U_X$); Cause ($X$); Outcome ($Y_R$); Auxiliary outcome ($Y_D$); Unobserved variables ($U_Y$)]
Data from Amazon.com, using
Bing toolbar
Of the products in the data, 20K have at least 10 visits on any one day
Constructed sequence of visits
for each user
Implementing the split-door criterion
t = 15 days
Using the split-door criterion, we obtain 23,000 natural
experiments for over 12,000 products.
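The selection step can be sketched as follows: for one focal/recommended product pair and one 15-day window, check whether visits to the focal product (X) look independent of direct visits to the recommended product (Y_D), and only then read off an effect on recommender visits (Y_R). This is a rough sketch under loud assumptions: the correlation threshold is a crude stand-in for a proper independence test, the ratio estimator is one simple choice, and all data and names below are hypothetical rather than taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def passes_split_door(x, y_d, max_abs_corr=0.3):
    """Assumed stand-in for an independence test between X and Y_D.

    The threshold 0.3 is an arbitrary illustrative value.
    """
    return abs(pearson(x, y_d)) < max_abs_corr


def causal_ctr(x, y_r):
    """Assumed estimator: recommender visits per focal-product visit."""
    return sum(y_r) / sum(x)


# Hypothetical daily counts over a window of t = 15 days.
x = [10, 10, 10, 10, 10, 10, 10, 60, 55, 30, 12, 10, 10, 10, 10]  # shock on day 8
y_d_flat = [5, 6, 5, 4, 5, 6, 5, 5, 4, 6, 5, 5, 6, 4, 5]          # unaffected by shock
y_d_shocked = [5, 5, 5, 5, 5, 5, 5, 30, 28, 15, 6, 5, 5, 5, 5]    # moves with X
y_r = [1, 1, 1, 1, 1, 1, 1, 6, 5, 3, 1, 1, 1, 1, 1]

valid = passes_split_door(x, y_d_flat)
invalid = passes_split_door(x, y_d_shocked)
estimate = causal_ctr(x, y_r) if valid else None
```

When direct visits rise together with the focal product's visits, demand for the two products is probably correlated, so that window is discarded rather than used as a natural experiment.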
Observational click-through rate
overestimates causal effect
Generalization? The distribution of products
with a natural experiment is identical to
the overall distribution
Study 2: How does user
satisfaction for a search engine
vary across demographics?
Mehrotra-Anderson-Diaz-Sharma-Wallach (2017)
Tricky: straightforward
optimization can lead to
differential performance
Search engine uses a standard metric: time spent on clicked result page as an
indicator of satisfaction.
Goal: estimate difference in user satisfaction between these two demographic
groups.
Suppose older users issue more “retirement planning” queries
Age: >50 years
80% users 10% users
Age: <30 years
…
1. Overall metrics can hide
differential satisfaction
Average user satisfaction for “retirement planning” may be
high.
But,
Average satisfaction for younger users = 0.7
Average satisfaction for older users = 0.2
2. Query-level metrics can hide
differential satisfaction
[Figure: query streams for the two age groups, mixing other queries with “retirement planning”]
Same user satisfaction for
“retirement planning” for both older and
younger users = 0.7
What if average satisfaction for
<query>=0.9?
Older users still receiving more of lower-
quality results than younger users.
Younger users
Older users
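A small calculation makes the point concrete. It uses the two satisfaction values from the slide (0.7 for “retirement planning”, 0.9 for other queries); the query mixes for the two groups are hypothetical, chosen only so that older users issue “retirement planning” more often.

```python
def mean_satisfaction(query_counts, satisfaction):
    """Average satisfaction over a user group's query mix."""
    total = sum(query_counts.values())
    weighted = sum(n * satisfaction[query] for query, n in query_counts.items())
    return weighted / total


# Per-query satisfaction is the same for both groups (values from the slide).
satisfaction = {"retirement planning": 0.7, "<query>": 0.9}

# Hypothetical query mixes: older users issue "retirement planning" more often.
younger_mix = {"<query>": 6, "retirement planning": 1}
older_mix = {"<query>": 3, "retirement planning": 3}

younger_avg = mean_satisfaction(younger_mix, satisfaction)
older_avg = mean_satisfaction(older_mix, satisfaction)

print(round(younger_avg, 3), round(older_avg, 3))
```

Even though every individual query is served equally well to both groups, the group that asks the weaker query more often ends up with a lower average.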
3. More critically, even individual-
level metrics can hide
differential satisfaction
Reading time for the same webpage result for the
same user satisfaction
[Figure: time spent on a webpage, younger users vs. older users]
How do we know whether some
users are more satisfied than
others?
Data: Demographic characteristics
of search engine users
Internal logs from Bing.com for two weeks
4 M users | 32 M impressions | 17 M sessions
Demographics: Age & Gender
Age:
◦ Post-Millennial: <18
◦ Millennial: 18-34
◦ Generation X: 35-54
◦ Baby Boomer: 55-74
Overall metrics across Demographics
Four metrics:
◦ Graded Utility (GU)
◦ Reformulation Rate (RR)
◦ Successful Click Count (SCC)
◦ Page Click Count (PCC)
Pitfalls with Overall Metrics
Conflate two separate effects:
◦ Natural demographic variation caused by differing traits among
demographic groups, e.g.
  ◦ Different queries issued
  ◦ Different information need for the same query
  ◦ Even for the same satisfaction, demographic A tends to click more than demographic B
◦ Systemic difference in user satisfaction due to the search engine
Constructing a causal model
[Diagram: causal model over Information Need, Demographics, Query, Search Results, User satisfaction, and Metric]
I. Context Matching: selecting for
activity with near-identical context
[Diagram: the same causal model (Information Need, Demographics, Query, Search Results, User satisfaction, Metric), with a Context box added]
For any two users from different demographics,
1. Same Query
2. Same Information Need:
   a. Control for user intent: same final SAT click
   b. Only consider navigational queries
3. Identical top-8 Search Results
1.2 M impressions, 19K unique queries, 617K users
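The matching conditions above can be sketched as a grouping step: keep navigational impressions, key each one by query, final SAT click, and top-8 results, and retain only contexts seen by more than one demographic group. This is a minimal sketch; the record fields, group labels, and sample rows are hypothetical and do not reflect the actual log schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Impression(NamedTuple):
    """Hypothetical log record; field names are illustrative."""
    user_group: str
    query: str
    final_sat_click: str
    top_results: tuple
    is_navigational: bool
    metric: float


def context_key(imp):
    """Same query, same final SAT click, identical top-8 results."""
    return (imp.query, imp.final_sat_click, imp.top_results[:8])


def match_contexts(impressions):
    """Keep navigational contexts observed for at least two demographic groups."""
    by_context = defaultdict(list)
    for imp in impressions:
        if imp.is_navigational:
            by_context[context_key(imp)].append(imp)
    return {
        key: imps
        for key, imps in by_context.items()
        if len({imp.user_group for imp in imps}) > 1
    }


def group_means(matched):
    """Average metric per demographic group over matched impressions only."""
    totals = defaultdict(lambda: [0.0, 0])
    for imps in matched.values():
        for imp in imps:
            totals[imp.user_group][0] += imp.metric
            totals[imp.user_group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


results = tuple(f"url{i}" for i in range(8))
impressions = [
    Impression("millennial", "facebook", "facebook.com", results, True, 0.8),
    Impression("baby boomer", "facebook", "facebook.com", results, True, 0.6),
    Impression("millennial", "retirement planning", "example.org", results, False, 0.9),
    Impression("gen x", "bing maps", "bing.com/maps", results, True, 0.7),
]

matched = match_contexts(impressions)
means = group_means(matched)
```

Matching this strictly discards most of the log, which is consistent with the drop from 32 M impressions in the full data to 1.2 M here.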
Age-wise differences in metrics disappear
General auditing tool: robust
Very low coverage across queries: Did we control for too much?
II. Query-level pairwise model:
Estimating satisfaction directly by
considering pairs of users
[Diagram: causal model over Information Need, Demographics, Query, Search Results, User satisfaction, and Metric]
Estimating absolute satisfaction is non-trivial
• Instead, estimate relative satisfaction by considering pairs of users
for the same query
• Conservative proxy for pairwise satisfaction by only considering “big”
differences in observed metric for the same query
• Logistic regression model for estimating probability of impression i
being more satisfied than impression j:
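The conservative pairing step described above can be sketched as follows: within each query, form pairs of impressions and keep only those whose observed metric differs by a large margin, labeling which side is treated as more satisfied. This is a sketch under assumptions: the gap threshold, the tuple layout, and the rule that a higher metric means more satisfaction are all illustrative, and the logistic regression itself is not reproduced here.

```python
from itertools import combinations


def conservative_pairs(impressions, min_gap):
    """Build labeled pairs for the same query.

    impressions: list of (impression_id, query, metric) tuples (hypothetical layout).
    Returns (i, j, label) with label 1 if i is treated as more satisfied than j,
    0 otherwise. Pairs with a metric gap below min_gap are dropped as ambiguous.
    """
    by_query = {}
    for imp_id, query, metric in impressions:
        by_query.setdefault(query, []).append((imp_id, metric))

    pairs = []
    for items in by_query.values():
        for (i, metric_i), (j, metric_j) in combinations(items, 2):
            if abs(metric_i - metric_j) >= min_gap:
                pairs.append((i, j, 1 if metric_i > metric_j else 0))
    return pairs


# Hypothetical metric values, e.g. seconds spent on the clicked result page.
impressions = [
    ("a", "q1", 120.0),
    ("b", "q1", 20.0),
    ("c", "q1", 110.0),
    ("d", "q2", 50.0),
]

pairs = conservative_pairs(impressions, min_gap=60.0)
print(pairs)
```

Pairs labeled this way could serve as training targets for a pairwise model, with the dropped near-ties trading coverage for cleaner labels.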
Again, we see a small age-wise difference in
satisfaction
Conclusion I: ML systems need
Grey Box analysis
Conclusion II
Evaluation of ML systems requires careful analysis
of inputs and outputs.
Observational metrics are usually biased:
◦ A large fraction of recommendation click-throughs may be
due to convenience.
◦ Search engine metrics do not provide a clear
picture of user satisfaction.
Causal models are essential for developing robust
metrics.
Thank you
Amit Sharma
@amt_shrma
http://www.amitsharma.in

Contenu connexe

Tendances

The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017Big Data Spain
 
IRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET Journal
 
Causal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationCausal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationScientificRevenue
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Recsys2021_slides_sato
Recsys2021_slides_satoRecsys2021_slides_sato
Recsys2021_slides_satoMasahiro Sato
 
Strategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroStrategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroRobert Munro
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
Essential econometrics for data scientists
Essential econometrics for data scientistsEssential econometrics for data scientists
Essential econometrics for data scientistsBenjamin Skrainka
 
RecSys Challenge 2016
RecSys Challenge 2016RecSys Challenge 2016
RecSys Challenge 2016Fabian Abel
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Biomarker Strategies
Biomarker StrategiesBiomarker Strategies
Biomarker StrategiesTom Plasterer
 
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmFiltering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmVenkat Projects
 
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...David Zibriczky
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science processBenjamin Skrainka
 
Interleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationInterleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationXin QIAN
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesQuestionPro
 
How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?Ganes Kesari
 
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Vasily Leksin
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientificRevenue
 
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalriedlc
 

Tendances (20)

The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017
 
IRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product Marketing
 
Causal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationCausal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous Optimization
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Recsys2021_slides_sato
Recsys2021_slides_satoRecsys2021_slides_sato
Recsys2021_slides_sato
 
Strategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroStrategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert Munro
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Essential econometrics for data scientists
Essential econometrics for data scientistsEssential econometrics for data scientists
Essential econometrics for data scientists
 
RecSys Challenge 2016
RecSys Challenge 2016RecSys Challenge 2016
RecSys Challenge 2016
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Biomarker Strategies
Biomarker StrategiesBiomarker Strategies
Biomarker Strategies
 
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmFiltering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
 
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science process
 
Interleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationInterleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentation
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
 
How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?
 
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talk
 
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
 

Similaire à Measuring effectiveness of machine learning systems

Auditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAuditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAmit Sharma
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesJPINFOTECH JAYAPRAKASH
 
Evaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsEvaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsMegaVjohnson
 
Fuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemFuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemRSIS International
 
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...IEEEGLOBALSOFTTECHNOLOGIES
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesIEEEFINALYEARPROJECTS
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationnirvdrum
 
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Jin Young Kim
 
MAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comMAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comdonaldzs24
 
MAT 510 Exceptional Education - snaptutorial.com
MAT 510   Exceptional Education - snaptutorial.comMAT 510   Exceptional Education - snaptutorial.com
MAT 510 Exceptional Education - snaptutorial.comDavisMurphyB11
 
Behavioural Modelling Outcomes prediction using Casual Factors
Behavioural Modelling Outcomes prediction using Casual  FactorsBehavioural Modelling Outcomes prediction using Casual  Factors
Behavioural Modelling Outcomes prediction using Casual FactorsIJMER
 
Sweeny group think-ias2015
Sweeny group think-ias2015Sweeny group think-ias2015
Sweeny group think-ias2015Marianne Sweeny
 
Mat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comMat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comBaileya19
 
MAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.comMAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.combellflower184
 
Mat 510 Believe Possibilities / snaptutorial.com
Mat 510  Believe Possibilities / snaptutorial.comMat 510  Believe Possibilities / snaptutorial.com
Mat 510 Believe Possibilities / snaptutorial.comDavis29a
 
MAT 510 Inspiring Innovation/tutorialrank.com
 MAT 510 Inspiring Innovation/tutorialrank.com MAT 510 Inspiring Innovation/tutorialrank.com
MAT 510 Inspiring Innovation/tutorialrank.comjonhson139
 
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...Pieter Van Gorp
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
MAT 510 Effective Communication - tutorialrank.com
MAT 510  Effective Communication - tutorialrank.comMAT 510  Effective Communication - tutorialrank.com
MAT 510 Effective Communication - tutorialrank.comBartholomew46
 

Similaire à Measuring effectiveness of machine learning systems (20)

Auditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAuditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographics
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomes
 
Evaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsEvaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender Systems
 
Fuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemFuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender System
 
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomes
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
 
MAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comMAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.com
 
MAT 510 Exceptional Education - snaptutorial.com
MAT 510   Exceptional Education - snaptutorial.comMAT 510   Exceptional Education - snaptutorial.com
MAT 510 Exceptional Education - snaptutorial.com
 
Behavioural Modelling Outcomes prediction using Casual Factors
Behavioural Modelling Outcomes prediction using Casual  FactorsBehavioural Modelling Outcomes prediction using Casual  Factors
Behavioural Modelling Outcomes prediction using Casual Factors
 
Sweeny group think-ias2015
Sweeny group think-ias2015Sweeny group think-ias2015
Sweeny group think-ias2015
 
Mat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comMat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.com
 
MAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.comMAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.com
 
Mat 510 Believe Possibilities / snaptutorial.com
Mat 510  Believe Possibilities / snaptutorial.comMat 510  Believe Possibilities / snaptutorial.com
Mat 510 Believe Possibilities / snaptutorial.com
 
MAT 510 Inspiring Innovation/tutorialrank.com
 MAT 510 Inspiring Innovation/tutorialrank.com MAT 510 Inspiring Innovation/tutorialrank.com
MAT 510 Inspiring Innovation/tutorialrank.com
 
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
 
MAT 510 Effective Communication - tutorialrank.com
MAT 510  Effective Communication - tutorialrank.comMAT 510  Effective Communication - tutorialrank.com
MAT 510 Effective Communication - tutorialrank.com
 
MBA
MBAMBA
MBA
 

Plus de Amit Sharma

Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAmit Sharma
 
Artificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactArtificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactAmit Sharma
 
Causal inference in data science
Causal inference in data scienceCausal inference in data science
Causal inference in data scienceAmit Sharma
 
Causal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesCausal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesAmit Sharma
 
Equivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential OutcomesEquivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential OutcomesAmit Sharma
 
Estimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actionsEstimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actionsAmit Sharma
 
From prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systemsFrom prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systemsAmit Sharma
 
Causal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhereCausal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhereAmit Sharma
 
The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...Amit Sharma
 
The role of social connections in shaping our preferences
The role of social connections in shaping our preferencesThe role of social connections in shaping our preferences
The role of social connections in shaping our preferencesAmit Sharma
 
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...Amit Sharma
 
RSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendationRSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendationAmit Sharma
 

Plus de Amit Sharma (12)

Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal Models
 
Artificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactArtificial Intelligence for Societal Impact
Artificial Intelligence for Societal Impact
 
Causal inference in data science
Causal inference in data scienceCausal inference in data science
Causal inference in data science
 
Causal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesCausal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practices
 


Measuring effectiveness of machine learning systems

  • 1. Measuring effectiveness of machine learning systems AMIT SHARMA, MICROSOFT RESEARCH INDIA www.amitsharma.in @amt_shrma
  • 3. Are these systems performing well? How do these systems affect people’s behavior? How can we make these systems better? Better for whom?
  • 4. Evaluating systemsEstimating causal effects Three examples. Two studies: Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance changes if a person had a different demographic, without changing anything else.
  • 5. Easiest answer: Look at metrics
  • 6. But metrics can be misleading. A performance metric can be high or low for a number of reasons: day of week, time of year; selection effects. Different metrics may provide a different picture.
  • 7. Example 1: Increasing activity on Xbox
  • 8. From data to prediction. Use these correlations to make a predictive model. Future Activity -> f(number of friends, logins in past month)
  • 9. From data to “actionable insights”. Would increasing the number of friends increase people’s activity on our system? Maybe, maybe not (!)
  • 12. Are search ads really that effective?
  • 13. But search results point to the same website
  • 14. Without reasoning about causality, we may overestimate the effectiveness of ads
  • 15. Okay, search ads have an explicit intent. Display ads should be fine?
  • 17. People buy more toys in December anyway
  • 20. Example 3: Did a system change lead to better outcomes? System A vs. System B
  • 21. Comparing old versus new system. Old (A): 50/1000 (5%). New (B): 54/1000 (5.4%). New system is better?
  • 22. Looking at change in CTR by income. Low-income: Old System (A) 10/400 (2.5%), New System (B) 4/200 (2%). High-income: Old System (A) 40/600 (6.7%), New System (B) 50/800 (6.25%).
  • 23. The Simpson’s paradox: Is Algorithm A better? Conversion rate for low-income people: Old system (A) 10/400 (2.5%), New system (B) 4/200 (2%). Conversion rate for high-income people: Old system (A) 40/600 (6.7%), New system (B) 50/800 (6.25%). Total conversion rate: Old system (A) 50/1000 (5%), New system (B) 54/1000 (5.4%).
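The reversal in the table above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts from the slide; the names `counts`, `stratum_rates` and `overall_rate` are illustrative and not part of the original deck.

```python
from fractions import Fraction

# (conversions, visitors) per income group, copied from the slide's table.
counts = {
    "A": {"low": (10, 400), "high": (40, 600)},
    "B": {"low": (4, 200), "high": (50, 800)},
}

def stratum_rates(system):
    """Conversion rate within each income group for one system."""
    return {group: Fraction(conv, vis) for group, (conv, vis) in counts[system].items()}

def overall_rate(system):
    """Conversion rate after pooling both income groups."""
    conversions = sum(conv for conv, _ in counts[system].values())
    visitors = sum(vis for _, vis in counts[system].values())
    return Fraction(conversions, visitors)

for system in ("A", "B"):
    per_group = {g: round(float(r), 4) for g, r in stratum_rates(system).items()}
    print(system, per_group, "overall:", round(float(overall_rate(system)), 4))
```

System A has the higher rate within each income group, yet system B has the higher pooled rate, because B's traffic contains a larger share of high-income visitors (800 of 1000, against 600 of 1000 for A).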
  • 24. Answer (as usual): Maybe, maybe not. High-income people might have better means to know about the updated information. Time of year may have an effect. There could be other hidden causal variations.
  • 26. Easy answer: Do A/B tests. Great for testing focused hypotheses. But cannot help you find robust hypotheses. Not possible in many scenarios for ethical or practical reasons.
  • 27. Harder answer: Approximate A/B tests offline. Gives all the benefits of A/B tests. Can be run at scale. But, without an experiment, we have no data on what would happen if we change the system. How do we estimate something that we have never observed?
  • 28. Causal inference to the rescue…
  • 29. Evaluating systems: Estimating causal effects. Two examples. Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance changes if a person had a different demographic, without changing anything else.
  • 30. Study 1: Estimating the impact of a recommender system. Sharma-Hofman-Watts (2015, 2016)
  • 31. Example: Estimating the causal impact of Amazon’s recommender system
  • 32. How much activity comes from the recommendation system?
  • 33. Confounding: Observed click-throughs may be due to correlated demand. Diagram nodes: Demand for The Road; Visits to The Road; Rec. visits to No Country for Old Men; Demand for No Country for Old Men.
  • 34. Observed activity is almost surely an overestimate of the causal effect. Figure labels: all page visits; observed activity from recommender (causal, convenience); activity without recommender (?).
  • 35. Finding an auxiliary outcome: Split outcome into recommender (primary) and direct visits (auxiliary). All visits to a recommended product: recommender visits and direct visits (search visits, direct browsing). Auxiliary outcome: proxy for unobserved demand.
  • 36. 1a. Search for any product with a shock to page visits
  • 37. 1b. Filtering out invalid natural experiments
  • 38. The “split-door” criterion. Criterion: $X \perp\!\!\!\perp Y_D$. Diagram nodes: Demand for focal product ($U_X$); Visits to focal product ($X$); Rec. visits ($Y_R$); Direct visits ($Y_D$); Demand for rec. product ($U_Y$).
  • 39. More formally, why does it work? Diagram nodes: Unobserved variables ($U_X$); Cause ($X$); Outcome ($Y_R$); Auxiliary outcome ($Y_D$); Unobserved variables ($U_Y$).
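The filtering step (1b) and the criterion that visits to the focal product (X) be independent of direct visits to the recommended product (Y_D) can be illustrated roughly as follows. This is only a sketch under stated assumptions: the slides do not say which independence test is used, so a Pearson correlation with an arbitrary threshold stands in for it here, and the function names, the threshold and the example series are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length daily series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series shows no linear association
    return sxy / math.sqrt(sxx * syy)

def passes_split_door(x, y_d, threshold=0.1):
    """Keep a window only if X and Y_D look unrelated.

    Assumption: |correlation| below a hypothetical threshold is used as a
    crude stand-in for a proper statistical independence test.
    """
    return abs(pearson(x, y_d)) < threshold

# Hypothetical 15-day window: focal product visits with a shock on day 8.
x = [10] * 7 + [60] + [10] * 7

# Direct visits to the recommended product that ignore the shock.
y_d_unrelated = [4, 6, 4, 6, 4, 6, 4, 5, 6, 4, 6, 4, 6, 4, 6]

# Direct visits that spike on the same day: demand is likely correlated.
y_d_confounded = [5] * 7 + [30] + [5] * 7

print(passes_split_door(x, y_d_unrelated))
print(passes_split_door(x, y_d_confounded))
```

A window in which direct visits move together with the focal product's shock suggests shared demand, so that window is discarded rather than treated as a natural experiment. Detecting the shock itself (step 1a) is a separate check that this sketch does not perform.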
  • 40. Data from Amazon.com, using Bing toolbar. Of these, 20K products have at least 10 visits on any one day
  • 41. Constructed sequence of visits for each user
  • 42. Implementing the split-door criterion: t = 15 days
  • 43. Using the split-door criterion, obtain 23,000 natural experiments for over 12,000 products.
  • 46. Generalization? Distribution of products with a natural experiment is identical to the overall distribution
  • 47. Study 2: How does user satisfaction for a search engine vary across demographics? Mehrotra-Anderson-Diaz-Sharma-Wallach (2017)
  • 48. Tricky: straightforward optimization can lead to differential performance. Search engine uses a standard metric: time spent on clicked result page as an indicator of satisfaction. Goal: estimate difference in user satisfaction between these two demographic groups. Suppose older users issue more “retirement planning” queries (figure labels: Age >50 years, 80% users; Age <30 years, 10% users).
  • 49. 1. Overall metrics can hide differential satisfaction. Average user satisfaction for “retirement planning” may be high. But: average satisfaction for younger users = 0.7; average satisfaction for older users = 0.2.
  • 50. 2. Query-level metrics can hide differential satisfaction. Same user satisfaction for “retirement planning” for both older and younger users = 0.7. What if average satisfaction for <query> = 0.9? Older users are still receiving more lower-quality results than younger users. (Figure: query streams for younger and older users, mixing <query> and “retirement planning”.)
  • 51. 3. More critically, even individual-level metrics can hide differential satisfaction. Reading time for the same webpage result can differ for the same user satisfaction. (Figure: time spent on a webpage, younger users vs. older users.)
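The query-level point above can be made concrete with a short calculation. The satisfaction values (0.7 for “retirement planning”, 0.9 for other queries) come from the slide; the query shares per age group are hypothetical numbers chosen only for illustration, as are the variable names.

```python
# Per-query satisfaction, identical for both age groups (values from the slide).
satisfaction = {"retirement planning": 0.7, "other": 0.9}

# Hypothetical share of each query type in a group's traffic (assumption).
query_mix = {
    "younger": {"retirement planning": 0.1, "other": 0.9},
    "older": {"retirement planning": 0.5, "other": 0.5},
}

def average_satisfaction(group):
    """Traffic-weighted average satisfaction for one age group."""
    return sum(share * satisfaction[q] for q, share in query_mix[group].items())

for group in query_mix:
    print(group, round(average_satisfaction(group), 2))
```

Even though every query has the same satisfaction for both groups, the group that issues the lower-scoring query more often ends up with a lower average, so a per-query comparison alone would not reveal the gap.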
  • 52. How do we know whether some users are more satisfied than others?
  • 53. Data: Demographic characteristics of search engine users. Internal logs from Bing.com for two weeks. 4M users | 32M impressions | 17M sessions. Demographics: Age & Gender. Age: ◦ post-Millennial: <18 ◦ Millennial: 18-34 ◦ Generation X: 35-54 ◦ Baby Boomer: 55-74
  • 54. Overall metrics across demographics. Four metrics: Graded Utility (GU), Reformulation Rate (RR), Successful Click Count (SCC), Page Click Count (PCC).
  • 55. Pitfalls with overall metrics. They conflate two separate effects: (1) natural demographic variation caused by the differing traits among the different demographic groups, e.g. different queries issued; different information need for the same query; even for the same satisfaction, demographic A tends to click more than demographic B; (2) systemic difference in user satisfaction due to the search engine.
  • 56. Constructing a causal model. Diagram nodes: Information Need; Demographics; Metric; User satisfaction; Query; Search Results.
  • 57. I. Context Matching: selecting for activity with near-identical context. Diagram nodes: Information Need; Demographics; Metric; User satisfaction; Query; Search Results; Context.
  • 58. For any two users from different demographics: 1. Same Query. 2. Same Information Need: (a) control for user intent: same final SAT click; (b) only consider navigational queries. 3. Identical top-8 Search Results. Result: 1.2M impressions, 19K unique queries, 617K users.
  • 59. Age-wise differences in metrics disappear. General auditing tool: robust. Very low coverage across queries: Did we control for too much?
  • 60. II. Query-level pairwise model: Estimating satisfaction directly by considering pairs of users. Diagram nodes: Information Need; Demographics; Metric; User satisfaction; Query; Search Results.
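The three matching conditions above can be sketched as a grouping step over logged impressions. This is an illustrative sketch, not the study's pipeline: the record fields (`query`, `final_sat_click`, `top_results`, `is_navigational`, `age_group`, `metric`) and the example rows are hypothetical.

```python
from collections import defaultdict

def context_key(impression):
    """Context = same query, same final SAT click, identical top-8 results."""
    return (
        impression["query"],
        impression["final_sat_click"],
        tuple(impression["top_results"][:8]),
    )

def match_contexts(impressions):
    """Mean metric per age group, only for contexts shared across groups."""
    by_context = defaultdict(lambda: defaultdict(list))
    for imp in impressions:
        if not imp["is_navigational"]:
            continue  # only navigational queries are considered
        by_context[context_key(imp)][imp["age_group"]].append(imp["metric"])
    matched = {}
    for key, groups in by_context.items():
        if len(groups) >= 2:  # need at least two demographics to compare
            matched[key] = {g: sum(v) / len(v) for g, v in groups.items()}
    return matched

# Hypothetical impressions for illustration only.
fb = {"query": "facebook", "final_sat_click": "facebook.com",
      "top_results": ["facebook.com", "fb.com"], "is_navigational": True}
impressions = [
    {**fb, "age_group": "18-34", "metric": 1.0},
    {**fb, "age_group": "55-74", "metric": 0.5},
    {**fb, "age_group": "55-74", "metric": 1.5},
    {"query": "retirement planning", "final_sat_click": "example.org",
     "top_results": ["example.org"], "is_navigational": False,
     "age_group": "55-74", "metric": 0.2},
    {"query": "bing maps", "final_sat_click": "bing.com/maps",
     "top_results": ["bing.com/maps"], "is_navigational": True,
     "age_group": "18-34", "metric": 0.9},
]

matched = match_contexts(impressions)
print(matched)
```

Matching this strictly discards every context that is not shared by at least two groups, which is one way to see why coverage across queries becomes very low.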
  • 61. Estimating absolute satisfaction is non-trivial. • Instead, estimate relative satisfaction by considering pairs of users for the same query. • Conservative proxy for pairwise satisfaction: only consider “big” differences in the observed metric for the same query. • Logistic regression model for estimating the probability of impression i being more satisfied than impression j.
  • 62. Again, see a small age-wise difference in satisfaction
  • 63. Conclusion I: ML systems need Grey Box analysis
  • 64. Conclusion II: Evaluation of ML systems requires careful analysis of inputs and outputs. Observational metrics are usually biased: ◦ A big fraction of recommendation click-throughs may be convenience. ◦ Search engine metrics do not provide a clear picture of user satisfaction. Causal models are essential for developing robust metrics.