Evaluating the search experience:
from Retrieval Effectiveness to
User Engagement
Mounia Lal...
This talk
§ Evaluation in search
(offline evaluation)
(online evaluation)
§  Interpreting the signals
§ Introduction to...
The Message of this talk
What you want
to optimize for
each task,
session, query
M1
M2
M3
.
.
.
Mn
LTV1
LTV2
LTV3
.
.
.
LT...
Evaluation in
search
How to evaluate a search system
§ Coverage
§ Speed
§ Query language
§ User interface
§ User ...
Within an online
session
›  July 2012
›  2.5M users
›  785M page views
›  Categorization of the most
frequent accessed sit...
Measuring user happiness
Most common proxy: relevance of retrieved results
Sec. 8.1
Relevant
Retriev...
Measuring user happiness
Most common proxy: relevance of retrieved results
Sec. 8.1
Explicit signals...
Examples of implicit signals
§  Number of clicks
§  SAT click
§  Quick-back click
§  Click at given position
§  Time ...
What is a happy user in search
1.  The user information need is satisfied
2.  The user has learned about a topic and even
...
Interpreting the
signals
User variability
(Anderson & Krathwohl, 2001; Bailey et al, 2015)
T: number of documents users (judges) expected to read
Q:...
Explicit signal: MAP
(Turpin & Scholer, 2006)
Similar results obtained with P@2, P@3, P@4 and P@10
PRECISION-BASED SEARCH
Explicit signal: MAP (2)
(Turpin & Scholer, 2006)
RECALL-BASED SEARCH
top most popular tweets vs. top most popular tweets + geographically diverse
Being from a central or peripheral location makes a...
Implicit signal: Click-through rate
CTR
new ranking algorithm
new design of search result page
…
Multimedia search
activities often
driven by
entertainment
needs, not by
information needs
Relevance in multimedia search
...
(Miliaraki, Blanco & Lalmas, 2015)
Implicit signal: Clicks (II)
Explorative and serendipitous search
I just wanted the phone number … I am totally happy :-)
Implicit signal: No click
Information-rich snippet
Implicit signal: No click
Clickthrough rate:
% of clicks when URL
shown (per query)
Hover rate:
% hover over URL
(per query...
§  Abandonment is when there is no click on the search result page
›  User is dissatisfied (bad abandonment)
›  User foun...
“reading” cursor heatmap of relevant document vs “scanning” cursor heatmap
of non-relevant document (both dwell time of 30...
Implicit signal: Dwell time
“reading” a relevant long document vs “scanning” a long non-relevant
document
(Guo & Agichtein...
Implicit signal: Dwell time
DWELL TIME
used as a proxy of
user experience
Publisher
click on
an ad on
mobile
device
Dwell tim...
User engagement
What is user engagement?
“User engagement is a quality of the
user experience that emphasizes the
phenomena associated wit...
Characteristics of user engagement
Novelty
(Webster & Ho, 1997; O’Brien,
2008)
Richness and control
(Jacques et al, 1995; ...
Measuring user engagement
Measures
Attributes
Self-report Questionnaire, interview,
think-aloud and think after
pr...
Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subje...
User engagement metrics
User engagement metrics
0-1 1-0.5 0.5
Kendall’s tau with p-value < 0.05
('-' insignificant correlations)
High correlation
...
Online sites differ with respect to
their engagement pattern
Games
Users spend
much time per
visit
Search
Users come
frequ...
From intra- to
inter-session
evaluation
1.  Search
2.  Mobile advertising
happy users
come back
The Message: From intra- to
inter-session evaluation
What you want
to optimize for
each task,
session, query
M1
M2
M3
.
....
Search
Search experience
What you want
to optimize for
each task,
session, query
search
metrics
(signals)
absence
time
(revisit t...
intra-session search
metrics
•  Dwell time
•  Number of clicks
•  Time to 1st click
•  Skipping
•  Click through rate
•  Ab...
Dwell time (I)
§ Definition
The contiguous time spent on
a site or web page
§ Cons
Not clear that the user was
actually ...
Dwell time (II)
Dwell time varies by
site type:
•  leisure sites tend to have
longer dwell times than
news, e-commerce, et...
Search result page for “asparagus” (I)
Search result page for “asparagus” (II)
Absence time and survival analysis
[Figure: curves for story 1 to story 9; x-axis ticks 0 to 20]
Absence time applied to search
Ranking function on Yahoo Answer Japan
Two-weeks click data on Yahoo Answer Japan: search
O...
survival analysis: high hazard rate (die quickly) = short absence
5 clicks
control=noclick
Absence time and number of clic...
Using DCG versus absence to evaluate
five ranking functions
DCG@1
Ranking Alg 1
Ranking Alg 2
Ranking Alg 3
Ranking Alg 4
...
Absence time and search experience
§  Clicking lower in the ranking (2nd, 3rd) suggests more careful choice
from the user...
Absence time – search experience
From 21 experiments carried out through A/B testing, using absence time
agrees with 14 of...
Native advertising
The context — Post-click experience
on mobile advertising
What you want
to optimize for
each task,
session, query
dwell
ti...
Native Advertising
…
Mobile Desktop
Estimating the quality of the post-click
experience
Best experience is when conversion happens
Estimating the probability ...
Dwell time as a proxy of the post-click
experience
mobile
200K ad clicks
Ø  It needs less time to
get the same
probabilit...
Dwell time and
absence time
[Figure: ad click difference (0% to 600%), short ad clicks vs. long ad clicks]
Dwell time → ad click
Posi...
From intra- to
inter-session evaluation
Absence time
1.  Search
2.  Mobile advertising
happy users
come back
What’s next?
Large-scale online measurement
Decide the in-the-moment
metric(s)
Decide the long-term-value
metric(s)
System
Models
Fea...
O’Brien & Toms User
Engagement Scale
31 items and six sub-scales:
aesthetic appeal, novelty,
felt involvement,
focused at...
Towards User Engagement
happy users
come back
we need to
properly identify
that a user is
happy
Thank you

Evaluating the search experience: from Retrieval Effectiveness to User Engagement

These are my slides for my presentation at CLEF 2015, which is being held in Toulouse. I discuss evaluation in the context of search, and how to move towards looking at the long-term effect of the search experience. I do this through the concept of absence time. I present examples for search and also in the context of mobile advertising. My aim is to frame evaluation within user engagement.


  1. 1. Evaluating the search experience: from Retrieval Effectiveness to User Engagement. Mounia Lalmas, Yahoo Labs London, mounia@acm.org. CLEF 2015 – Toulouse
  2. 2. This talk § Evaluation in search (offline evaluation) (online evaluation) §  Interpreting the signals § Introduction to user engagement § From retrieval effectiveness to user engagement (from intra-session to inter-session evaluation)
  3. 3. The Message of this talk What you want to optimize for each task, session, query M1 M2 M3 . . . Mn LTV1 LTV2 LTV3 . . . LTVm Mi LTVj What you want to optimize long-term System Models Features
  4. 4. Evaluation in search
  5. 5. How to evaluate a search system § Coverage § Speed § Query language § User interface § User happiness: users find what they want and return to the search system § But let us remember: in carrying out a search task, search is a means, not an end Sec. 8.6 (Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)
  6. 6. Within an online session ›  July 2012 ›  2.5M users ›  785M page views ›  Categorization of the most frequently accessed sites •  11 categories (e.g. news), 33 subcategories (e.g. news finance, news society) •  760 sites from 70 countries/regions short sessions: average 3.01 distinct sites visited with revisitation rate 10% long sessions: average 9.62 distinct sites visited with revisitation rate 22% (Lehmann et al, 2013)
  7. 7. Measuring user happiness Most common proxy: relevance of retrieved results Sec. 8.1 Relevant Retrieved all items §  User information need translated into a query §  Relevance assessed relative to information need not the query §  Example: ›  Information need: I am looking for tennis holiday in a country with no rain ›  Query: tennis academy good weather Evaluation measures: •  precision, recall, R-precision; precision@n; average precision; F-measure; … •  bpref; cumulative gains, rank-biased precision, expected reciprocal rank, Q-measure, … precision recall
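A minimal sketch of three of the measures listed on this slide (precision, recall, average precision), assuming binary relevance judgments. The function names, the document ids and the judged set are all made up for illustration; they do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a ranked list and a set of relevant ids."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at(retrieved, relevant, n):
    """Precision over the top-n results (precision@n)."""
    relevant = set(relevant)
    return sum(1 for doc in retrieved[:n] if doc in relevant) / n


def average_precision(retrieved, relevant):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking and judgments, for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4"]
judged_relevant = {"d3", "d1", "d5"}

p, r = precision_recall(ranking, judged_relevant)
p_at_2 = precision_at(ranking, judged_relevant, 2)
ap = average_precision(ranking, judged_relevant)
```

Averaging `average_precision` over a set of queries gives MAP, the explicit signal discussed on later slides.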
  8. 8. Measuring user happiness Most common proxy: relevance of retrieved results Sec. 8.1 Explicit signals Test collection methodology (TREC, CLEF, …) Human labeled corpora Implicit signals User behavior in online settings (clicks, skips, …) Explicit and implicit signals can be used together
  9. 9. Examples of implicit signals §  Number of clicks §  SAT click §  Quick-back click §  Click at given position §  Time to first click §  Skipping §  Abandonment rate §  Number of query reformulations §  Dwell time §  Hover
  10. 10. What is a happy user in search 1.  The user information need is satisfied 2.  The user has learned about a topic and even about other topics 3.  The system was inviting and even fun to use In-the-moment engagement Users on a site Long-term engagement Users come back frequently USER ENGAGEMENT
  11. 11. Interpreting the signals
  12. 12. User variability (Anderson & Krathwohl, 2001; Bailey et al, 2015) T: number of documents users (judges) expected to read Q: number of queries users (judges) expected to issue Task complexity Task complexity
  13. 13. Explicit signal: MAP (Turpin & Scholer, 2006) Similar results obtained with P@2, P@3, P@4 and P@10 PRECISION-BASED SEARCH
  14. 14. Explicit signal: MAP (2) (Turpin & Scholer, 2006) RECALL-BASED SEARCH
  15. 15. top most popular tweets vs. top most popular tweets + geographically diverse Being from a central or peripheral location makes a difference. Peripheral users did not perceive the timeline as being diverse Explicit signal: “Diversity” It should never be just about the algorithm, but also how users respond to what the algorithm returns to them (Graells-Garrido, Lalmas & Baeza-Yates, Under Review)
  16. 16. Implicit signal: Click-through rate CTR new ranking algorithm new design of search result page …
  17. 17. Multimedia search activities often driven by entertainment needs, not by information needs Relevance in multimedia search (Slaney, 2011) Implicit signal: Clicks (I)
  18. 18. (Miliaraki, Blanco & Lalmas, 2015) Implicit signal: Clicks (II) Explorative and serendipitous search
  19. 19. I just wanted the phone number … I am totally happy :-) Implicit signal: No click Information-rich snippet
  20. 20. Implicit signal: No click Clickthrough rate: % of clicks when URL shown (per query) Hover rate: % hover over URL (per query) Unclicked hover: Median time user hovers over URL but no click (per query) Max hover time: Maximum time user hovers over a result (per SERP) (Huang et al, 2011)
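The clickthrough and hover rates defined on this slide can be sketched as simple ratios over an impression log. The log schema (`query`, `url`, `clicked`, `hovered`) is a hypothetical format chosen for this example, not one described in the talk or in Huang et al.

```python
from collections import defaultdict


def per_query_url_rates(impressions):
    """Clickthrough and hover rate for each (query, url) pair.

    Each impression is one showing of a URL for a query; the dict keys
    used here are an assumed log schema.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    hovered = defaultdict(int)
    for imp in impressions:
        key = (imp["query"], imp["url"])
        shown[key] += 1
        clicked[key] += 1 if imp["clicked"] else 0
        hovered[key] += 1 if imp["hovered"] else 0
    return {
        key: {"ctr": clicked[key] / shown[key],
              "hover_rate": hovered[key] / shown[key]}
        for key in shown
    }


# Hypothetical log: the same URL shown four times for one query.
log = [
    {"query": "asparagus", "url": "example.org/a", "clicked": True, "hovered": True},
    {"query": "asparagus", "url": "example.org/a", "clicked": False, "hovered": True},
    {"query": "asparagus", "url": "example.org/a", "clicked": False, "hovered": False},
    {"query": "asparagus", "url": "example.org/a", "clicked": False, "hovered": True},
]
rates = per_query_url_rates(log)
```

A hover rate well above the clickthrough rate, as in this toy log, is the kind of pattern that motivates looking at hovers when there is no click.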
  21. 21. §  Abandonment is when there is no click on the search result page ›  User is dissatisfied (bad abandonment) ›  User found result(s) on the search result page (good abandonment) §  858 queries (21% good vs. 79% bad abandonment, manually examined) §  Cursor trail length ›  Total distance (pixel) traveled by cursor on SERP ›  Shorter for good abandonment §  Movement time ›  Total time (second) cursor moved on SERP ›  Longer when answers in snippet (good abandonment) §  Cursor speed ›  Average cursor speed (pixel/second) ›  Slower when answers in snippet (good abandonment) (Huang et al, 2011) Implicit signal: Abandonment rate
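The three cursor features on this slide (trail length, movement time, speed) can be sketched from a list of timestamped cursor positions. The `(t_seconds, x, y)` sample format, and the choice to count an interval as "moving" only when the position changed, are assumptions made for this example.

```python
import math


def cursor_features(samples):
    """Trail length (px), movement time (s) and average speed (px/s).

    samples: list of (t_seconds, x, y) tuples sorted by time
    (a hypothetical logging format).
    """
    trail = 0.0
    moving = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        if step > 0:  # assumption: only intervals with displacement count as movement
            trail += step
            moving += t1 - t0
    speed = trail / moving if moving else 0.0
    return {"trail_px": trail, "movement_s": moving, "speed_px_s": speed}


# Hypothetical samples: move, pause for one second, move again.
features = cursor_features([(0, 0, 0), (1, 3, 4), (2, 3, 4), (4, 6, 8)])
```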
  22. 22. “reading” cursor heatmap of relevant document vs “scanning” cursor heatmap of non-relevant document (both dwell time of 30s) (Guo & Agichtein, 2012) Implicit signal: Dwell time
  23. 23. Implicit signal: Dwell time “reading” a relevant long document vs “scanning” a long non-relevant document (Guo & Agichtein, 2012)
  24. 24. Implicit signal: Dwell time DWELL TIME used as a proxy of user experience Publisher click on an ad on mobile device Dwell time on non-optimized landing pages comparable and even higher than on mobile-optimized ones … when mobile optimized, users realize quickly whether they “like” the ad or not? (Lalmas et al, 2015) non-mobile optimized mobile optimized
  25. 25. User engagement
  26. 26. What is user engagement? “User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al, 2011)
  27. 27. Characteristics of user engagement Novelty (Webster & Ho, 1997; O’Brien, 2008) Richness and control (Jacques et al, 1995; Webster & Ho, 1997) Aesthetics (Jacques et al, 1995; O’Brien, 2008) Endurability (Read, MacFarlane, & Casey, 2002; O’Brien, 2008) Focused attention (Webster & Ho, 1997; O’Brien, 2008) Reputation, trust and expectation (Attfield et al, 2011) Positive Affect (O’Brien & Toms, 2008) Motivation, interests, incentives, and benefits (Jacques et al., 1995; O’Brien & Toms, 2008) (O’Brien, Lalmas & Yom-Tov, 2014)
  28. 28. Measuring user engagement Measures   Attributes   Self-report Questionnaire, interview, think-aloud and think after protocols Subjective Short- and long-term Lab and field Small scale Physiology EEG, SCL, fMRI eye tracking mouse-tracking Objective Short-term Lab and field Small and large scale Analytics intra- and inter-session metrics data science Objective Short- and long-term Field Large scale
  29. 29. Attributes of user engagement § Scale (small versus large) § Setting (laboratory versus field) § Objective versus subjective § Temporality (in-the-moment versus long-term) What you want to optimize for each task, session, query What you want to optimize long-term Mi LTVj
  30. 30. User engagement metrics
  31. 31. User engagement metrics 0-1 1-0.5 0.5 Kendall’s tau with p-value < 0.05 ('-' insignificant correlations) High correlation between metrics in same group Low correlation between metrics in different groups. Columns: [POP] #Users, [POP] #Visits, [POP] #Clicks, [ACT] PageViewsV, [ACT] DwellTimeV, [LOY] ActiveDays, [LOY] ReturnRate. Rows: #Users [POP]: 0.82 0.75 - - 0.43 0.34; #Visits [POP]: 0.82 0.85 - - 0.60 0.52; #Clicks [POP]: 0.75 0.85 0.16 0.18 0.59 0.51; PageViewsV [ACT]: - - 0.16 0.33 - -; DwellTimeV [ACT]: - - 0.18 0.33 - -; ActiveDays [LOY]: 0.43 0.60 0.59 - - 0.79; ReturnRate [LOY]: 0.34 0.52 0.51 - - 0.79 0.69 (Lehmann et al, 2012) in-the-moment long-term
  32. 32. Online sites differ with respect to their engagement pattern Games Users spend much time per visit Search Users come frequently and do not stay long Social media Users come frequently and stay long Niche Users come on average once a week e.g. weekly post News Users come periodically, e.g. morning and evening Service Users visit site, when needed, e.g. to renew subscription (Lehmann et al, 2012) in-the-moment: at each visit long-term: visit frequency
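The correlations on this slide are Kendall's tau values between pairs of engagement metrics. A minimal sketch of the statistic follows, assuming the simple tau-a variant with no tie correction; the slide does not say which variant or implementation was used, and the per-site values are invented.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant; this is the
    simplest variant and an assumption, not the talk's stated method.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-site values of two metrics (e.g. ranks by visits and by clicks).
visits = [1, 2, 3, 4, 5]
clicks = [2, 1, 4, 3, 5]
tau = kendall_tau_a(visits, clicks)
```

Here 8 of the 10 site pairs are ordered the same way by both metrics and 2 are reversed, so tau is (8 - 2) / 10. A significance test, which the slide relies on (p-value < 0.05), would need an additional step not shown here.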
  33. 33. From intra- to inter-session evaluation
  34. 34. 1.  Search 2.  Mobile advertising happy users come back
  35. 35. The Message: From intra- to inter-session evaluation What you want to optimize for each task, session, query M1 M2 M3 . . . Mn LTV1 LTV2 LTV3 . . . LTVm Mi LTVj What you want to optimize long-term System Models Features
  36. 36. Search
  37. 37. Search experience What you want to optimize for each task, session, query search metrics (signals) absence time (revisit the site) Mi LTVj What you want to optimize long-term Search system Models Features
  38. 38. intra-session search metrics •  Dwell time •  Number of clicks •  Time to 1st click •  Skipping •  Click through rate •  Abandonment rate •  Number of query reformulations •  … Dwell time as a proxy of user interest Dwell time as a proxy of relevance Dwell time as a proxy of conversion Dwell time as a proxy of post-click ad quality … User engagement metrics for search (Proxy: relevance of search results) intra-session inter-session
  39. 39. Dwell time (I) § Definition The contiguous time spent on a site or web page § Cons Not clear that the user was actually looking at the site while there → blur/focus Distribution of dwell times on 50 websites (O’Brien, Lalmas & Yom-Tov, 2014)
  40. 40. Dwell time (II) Dwell time varies by site type: •  leisure sites tend to have longer dwell times than news, e-commerce, etc. Dwell time has a relatively large variance even for the same site Dwell time on 50 websites (tourists, active, VIP … users) (O’Brien, Lalmas & Yom-Tov, 2014)
  41. 41. Search result page for “asparagus” (I)
  42. 42. Search result page for “asparagus” (II)
  43. 43. Absence time and survival analysis [Figure: survival curves for story 1 to story 9; x-axis: hours (0 to 20); y-axis: 0.0 to 1.0] Users (%) who did come back Users (%) who read story 2 but did not come back after 10 hours SURVIVE DIE DIE = RETURN TO SITE → SHORT ABSENCE TIME
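One way to turn absence times into a survival curve like the one on this slide is the Kaplan-Meier estimator, sketched below. The slide does not name the estimator used, so this is an assumption; the durations are invented. As on the slide, "dying" means returning to the site, and users who had not returned by the end of observation are treated as censored.

```python
def kaplan_meier(durations, returned):
    """Kaplan-Meier survival curve for absence times.

    durations: hours until the user returned, or until observation ended.
    returned:  True if the user came back (the "death" event), False if
               the observation ended first (censored).
    Returns a list of (time, fraction of users still absent).
    """
    events = sorted(zip(durations, returned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = leaving = 0
        while i < len(events) and events[i][0] == t:
            deaths += 1 if events[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical absence times in hours; False marks users who never returned.
hours = [2, 5, 5, 8, 20, 20]
came_back = [True, True, False, True, False, False]
curve = kaplan_meier(hours, came_back)
```

A curve that drops quickly corresponds to a high hazard rate, that is, short absence times.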
  44. 44. Absence time applied to search Ranking function on Yahoo Answer Japan Two weeks of click data on Yahoo Answer Japan: search One million users Six ranking functions 30-minute session boundary
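The 30-minute session boundary mentioned here can be sketched as follows: a user's events are split into sessions wherever the gap exceeds 30 minutes, and absence time is the gap between consecutive sessions. Treating a gap of exactly 30 minutes as the same session is an assumption, as are the function names and the sample timestamps.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group one user's event timestamps into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # gap too long: start a new session
    return sessions


def absence_times(sessions):
    """Time between the end of each session and the start of the next."""
    return [nxt[0] - cur[-1] for cur, nxt in zip(sessions, sessions[1:])]


# Hypothetical events for one user on a single day.
day = datetime(2015, 9, 8)
events = [day.replace(hour=9, minute=0), day.replace(hour=9, minute=10),
          day.replace(hour=9, minute=50), day.replace(hour=14, minute=0)]
sessions = split_sessions(events)
absences = absence_times(sessions)
```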
  45. 45. survival analysis: high hazard rate (die quickly) = short absence 5 clicks control=noclick Absence time and number of clicks on search result page 3 clicks §  No click means a bad user experience §  Clicking between 3-5 results leads to same user experience §  Clicking on more than 5 results reflects poorer user experience; users cannot find what they are looking for (Dupret & Lalmas, 2013)
  46. 46. Using DCG versus absence to evaluate five ranking functions DCG@1 Ranking Alg 1 Ranking Alg 2 Ranking Alg 3 Ranking Alg 4 Ranking Alg 5 DCG@5 Ranking Alg 1 Ranking Alg 3 Ranking Alg 2 Ranking Alg 4 Ranking Alg 5 Absence time Ranking Alg 1 Ranking Alg 2 Ranking Alg 5 Ranking Alg 3 Ranking Alg 4 (Dupret & Lalmas, 2013)
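For reference, a minimal sketch of DCG@k using one common formulation, gain divided by log2(rank + 1). The slide does not state which DCG variant was used, and the graded judgments below are invented.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k results.

    gains: graded relevance of each result, in ranked order.
    Uses gain / log2(rank + 1), one common formulation (an assumption here).
    """
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains[:k], start=1))


# Hypothetical graded judgments for one ranked result list.
grades = [3, 2, 0, 1]
dcg1 = dcg_at_k(grades, 1)
dcg4 = dcg_at_k(grades, 4)
```

Ranking functions would be ordered by such a score averaged over queries, which can then be compared with their ordering by absence time, as on this slide.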
  47. 47. Absence time and search experience §  Clicking lower in the ranking (2nd, 3rd) suggests more careful choice from the user (compared to 1st) §  Clicking at bottom is a sign of low quality overall ranking §  Users finding their answers quickly (time to 1st click) return sooner to the search application §  Returning to the same search result page is a worse user experience than reformulating the query search session metrics → absence time (Dupret & Lalmas, 2013)
  48. 48. Absence time – search experience From 21 experiments carried out through A/B testing, using absence time agrees with 14 of them (which one is better) (Chakraborty et al, 2014) Positive signals •  One more query in session •  One more click in session •  SAT clicks •  Query reformulation Negative signals •  Abandoned session •  Quick-back clicks search session metrics → absence time
  49. 49. Native advertising
  50. 50. The context: Post-click experience on mobile advertising What you want to optimize for each task, session, query dwell time on landing page absence time (next ad click) Mi LTVj What you want to optimize long-term native ad serving Models Features
  51. 51. Native Advertising … Mobile Desktop
  52. 52. Estimating the quality of the post-click experience Best experience is when conversion happens Estimating the probability of conversion is hard! - Conversion data is not available for all advertisers - Conversion data is not missing at random Proxy metric of post-click quality: dwell time on the ad landing page - No conversion does not mean a bad experience t_{ad-click} t_{back-to-publisher} dwell time = t_{back-to-publisher} – t_{ad-click}
  53. 53. Dwell time as a proxy of the post-click experience mobile 200K ad clicks ›  It needs less time to get the same probability of a second click desktop (toolbar) 30K ad clicks ›  23.3% of users visit other websites than the ad landing page before returning to publisher ›  this goes down to 7.4% for dwell time up to 3 mins. Probability of a second click increases with dwell time
  54. 54. Dwell time and absence time [Figure: ad click difference (0% to 600%), short ad clicks vs. long ad clicks] Dwell time → ad click Positive post-click experience (“long” clicks) has an effect on users clicking on ads again (mobile) (Lalmas et al, 2015) Absence time: •  return to publisher •  click on an ad
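The dwell time definition on this slide is the difference between two timestamps: the moment the user returns to the publisher and the moment the ad was clicked. A minimal sketch, with a hypothetical function name and invented timestamps:

```python
from datetime import datetime


def dwell_time_seconds(t_ad_click, t_back_to_publisher):
    """Dwell time on the ad landing page, in seconds."""
    return (t_back_to_publisher - t_ad_click).total_seconds()


# Hypothetical click and return times for one ad click.
click = datetime(2015, 9, 8, 10, 0, 0)
back = datetime(2015, 9, 8, 10, 1, 30)
dwell = dwell_time_seconds(click, back)
```

As the following slide notes for desktop, some users visit other sites before returning to the publisher, so this difference can overestimate the time actually spent on the landing page.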
  55. 55. From intra- to inter-session evaluation Absence time 1.  Search 2.  Mobile advertising happy users come back
  56. 56. What’s next?
  57. 57. Large-scale online measurement Decide the in-the-moment metric(s) Decide the long-term-value metric(s) System Models Features Which in-the-moment metric(s) are good predictors of long-term value metric(s) Optimize for the identified in-the-moment metric(s) Lots of data required to remove noise What is a signal? What is a metric?
  58. 58. O’Brien & Toms User Engagement Scale 31 items and six sub-scales: aesthetic appeal, novelty, felt involvement, focused attention, perceived usability, endurability (O’Brien & Toms, 2010; Arguello et al, 2012; Bordino et al, Under Review) Small-scale measurement
  59. 59. Towards User Engagement happy users come back we need to properly identify that a user is happy
  60. 60. Thank you
