Evaluating the search experience: from Retrieval Effectiveness to User Engagement

Mounia Lalmas
Yahoo Labs London
mounia@acm.org
CLEF 2015 – Toulouse

This talk
§ Evaluation in search
(offline evaluation)
(online evaluation)
§  Interpreting the signals
§ Introduction to user engagement
§ From retrieval effectiveness to user engagement
(from intra-session to inter-session evaluation)

The Message of this talk
[Diagram] System (models, features)
What you want to optimize for each task, session, query: M1, M2, M3, ..., Mn
What you want to optimize long-term: LTV1, LTV2, LTV3, ..., LTVm
Mi → LTVj

How to evaluate a search system
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
Users find what they want and return to the search system
§ But let us remember:
In carrying out a search task, search is a means, not an end

Sec. 8.6
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)

Within an online session
›  July 2012
›  2.5M users
›  785M page views
›  Categorization of the most frequently accessed sites
•  11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
•  760 sites from 70 countries/regions
short sessions: average 3.01 distinct sites visited with revisitation rate 10%
long sessions: average 9.62 distinct sites visited with revisitation rate 22%
(Lehmann et al., 2013)

Measuring user happiness
Most common proxy: relevance of retrieved results
Sec. 8.1
[Figure: Venn diagram of Relevant and Retrieved items within all items, illustrating precision and recall]
§  User information need translated into a query
§  Relevance assessed relative to information need, not the query
§  Example:
›  Information need: I am looking for tennis holiday in a country with no rain
›  Query: tennis academy good weather
Evaluation measures:
•  precision, recall, R-precision; precision@n; average precision; F-measure; …
•  bpref; cumulative gains, rank-biased precision, expected reciprocal rank, Q-measure, …
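As an illustration of the first group of measures, here is a minimal Python sketch of precision, recall, precision@n and average precision for a single query. The document identifiers and the relevance judgments are made up for the example; they do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at(ranking, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def average_precision(ranking, relevant):
    """Sum of precision@k at each rank k holding a relevant item, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking and judgments (assumed for illustration only).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

p, r = precision_recall(ranking, relevant)
p_at_2 = precision_at(ranking, relevant, 2)
ap = average_precision(ranking, relevant)
print(p, r, p_at_2, ap)
```

Averaging `average_precision` over a set of queries gives MAP, the explicit signal discussed later in the deck.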

Measuring user happiness
Most common proxy: relevance of retrieval results

Sec. 8.1
Explicit signals
Test collection methodology (TREC, CLEF, …)
Human labeled corpora
Implicit signals
User behavior in online settings (clicks, skips, …)
Explicit and implicit signals can be used together

Examples of implicit signals
§  Number of clicks
§  SAT click
§  Quick-back click
§  Click at given position
§  Time to first click
§  Skipping
§  Abandonment rate
§  Number of query reformulations
§  Dwell time
§  Hover
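Several of these signals can be derived from a plain interaction log. The sketch below assumes a hypothetical log format (one record per search result page, with click times, positions and dwell times in seconds) and assumed thresholds for SAT and quick-back clicks; neither the format nor the thresholds come from the talk.

```python
from statistics import mean

SAT_DWELL_S = 30.0   # assumed threshold for a satisfied (SAT) click
QUICK_BACK_S = 5.0   # assumed threshold for a quick-back click

# Hypothetical log: one record per search result page shown to a user.
impressions = [
    {"shown_at": 0.0, "clicks": [{"at": 4.0, "position": 1, "dwell": 45.0}]},
    {"shown_at": 0.0, "clicks": []},
    {"shown_at": 0.0, "clicks": [{"at": 2.0, "position": 3, "dwell": 3.0},
                                  {"at": 9.0, "position": 1, "dwell": 60.0}]},
    {"shown_at": 0.0, "clicks": []},
]


def summarize(impressions):
    """Aggregate a few implicit signals; assumes at least one page had a click."""
    all_clicks = [c for imp in impressions for c in imp["clicks"]]
    clicked = [imp for imp in impressions if imp["clicks"]]
    return {
        "clicks_per_page": len(all_clicks) / len(impressions),
        "abandonment_rate": 1 - len(clicked) / len(impressions),
        "mean_time_to_first_click": mean(
            min(c["at"] for c in imp["clicks"]) - imp["shown_at"]
            for imp in clicked
        ),
        "sat_clicks": sum(c["dwell"] >= SAT_DWELL_S for c in all_clicks),
        "quick_back_clicks": sum(c["dwell"] < QUICK_BACK_S for c in all_clicks),
    }


signals = summarize(impressions)
print(signals)
```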

What is a happy user in search
1.  The user information need is satisfied
2.  The user has learned about a topic and even
about other topics
3.  The system was inviting and even fun to use
In-the-moment engagement
Users on a site
Long-term engagement
Users come back frequently
USER ENGAGEMENT

User variability
(Anderson & Krathwohl, 2001; Bailey et al., 2015)
T: number of documents users (judges) expected to read
Q: number of queries users (judges) expected to issue
[Figure: two panels, each with the axis label "Task complexity"]

Explicit signal: MAP
(Turpin & Scholer, 2006)
Similar results obtained with P@2, P@3, P@4 and P@10
PRECISION-BASED SEARCH

Explicit signal: MAP (2)
(Turpin & Scholer, 2006)
RECALL-BASED SEARCH

[Figure panels: top most popular tweets vs. top most popular tweets + geographically diverse]
Being from a central or peripheral location makes a difference.
Peripheral users did not perceive the timeline as being diverse
Explicit signal: “Diversity”
It should never be just about the algorithm, but also how users respond to what the
algorithm returns to them
(Graells-Garrido, Lalmas & Baeza-Yates, Under Review)

Implicit signal: Click-through rate
CTR
new ranking algorithm
new design of search result page
…

Multimedia search
activities often
driven by
entertainment
needs, not by
information needs
Relevance in multimedia search
(Slaney, 2011)
Implicit signal: Clicks (I)

(Miliaraki, Blanco & Lalmas, 2015)
Implicit signal: Clicks (II)
Explorative and serendipitous search

I just wanted the phone number … I am totally happy :)
Implicit signal: No click
Information-rich snippet

Implicit signal: No click
Clickthrough rate:
% of clicks when URL
shown (per query)
Hover rate:
% hover over URL
(per query)
Unclicked hover:
Median time user hovers over
URL but no click (per query)
Max hover time:
Maximum time user hovers
over a result (per SERP)
(Huang et al, 2011)
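The four measures above can be sketched as follows for the results shown on one search result page. The record layout (`hover_s`, `clicked`) is a hypothetical one chosen for the example, and the aggregation is a simplification of the per-query and per-SERP definitions on the slide.

```python
from statistics import median

# Hypothetical records: one per URL shown on a single search result page.
shown = [
    {"url": "a", "hover_s": 2.0, "clicked": True},
    {"url": "b", "hover_s": 0.0, "clicked": False},
    {"url": "c", "hover_s": 1.5, "clicked": False},
    {"url": "d", "hover_s": 4.5, "clicked": False},
]


def hover_metrics(shown):
    """Clickthrough rate, hover rate, median unclicked hover and max hover time."""
    n = len(shown)
    unclicked = [r["hover_s"] for r in shown
                 if r["hover_s"] > 0 and not r["clicked"]]
    return {
        "clickthrough_rate": sum(r["clicked"] for r in shown) / n,
        "hover_rate": sum(r["hover_s"] > 0 for r in shown) / n,
        "unclicked_hover_median_s": median(unclicked) if unclicked else 0.0,
        "max_hover_s": max(r["hover_s"] for r in shown),
    }


metrics = hover_metrics(shown)
print(metrics)
```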

§  Abandonment is when there is no click on the search result page
›  User is dissatisfied (bad abandonment)
›  User found result(s) on the search result page (good abandonment)
§  858 queries manually examined (21% good vs. 79% bad abandonment)
§  Cursor trail length
›  Total distance (pixels) traveled by cursor on SERP
›  Shorter for good abandonment
§  Movement time
›  Total time (seconds) cursor moved on SERP
›  Longer when answers in snippet (good abandonment)
§  Cursor speed
›  Average cursor speed (pixels/second)
›  Slower when answers in snippet (good abandonment)
(Huang et al, 2011)
Implicit signal: Abandonment rate
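The three cursor features can be computed from a sampled cursor trace. This is a minimal sketch, assuming a hypothetical trace of `(time in seconds, x, y)` samples and counting an interval as movement whenever the position changed; the original study may define movement differently.

```python
from math import hypot


def cursor_features(trace):
    """Return (trail length in px, movement time in s, average speed in px/s).

    trace: list of (t_seconds, x_px, y_px) samples ordered by time.
    """
    trail_px = 0.0
    moving_s = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(trace, trace[1:]):
        step = hypot(x1 - x0, y1 - y0)
        trail_px += step
        if step > 0:  # count only intervals in which the cursor moved
            moving_s += t1 - t0
    speed = trail_px / moving_s if moving_s else 0.0
    return trail_px, moving_s, speed


# Hypothetical trace: move, pause for one second, then move again.
trace = [(0.0, 0, 0), (1.0, 30, 40), (2.0, 30, 40), (4.0, 30, 100)]
trail, moving, speed = cursor_features(trace)
print(trail, moving, speed)
```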

“reading” cursor heatmap of relevant document vs “scanning” cursor heatmap
of non-relevant document (both dwell time of 30s)
(Guo & Agichtein, 2012)
Implicit signal: Dwell time

“reading” a relevant long document vs “scanning” a long non-relevant
document
(Guo & Agichtein, 2012)

DWELL TIME
used as a proxy of user experience
Publisher: click on an ad on mobile device
Dwell time on non-optimized landing pages comparable and even higher than on mobile-optimized ones
… when mobile optimized, users realize quickly whether they “like” the ad or not?
(Lalmas et al., 2015)
[Figure panels: non-mobile optimized vs. mobile optimized]

What is user engagement?
“User engagement is a quality of the
user experience that emphasizes the
phenomena associated with wanting to
use a technological resource longer and
frequently” (Attfield et al, 2011)

Characteristics of user engagement
Novelty (Webster & Ho, 1997; O’Brien, 2008)
Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
Aesthetics (Jacques et al, 1995; O’Brien, 2008)
Endurability (Read, MacFarlane, & Casey, 2002; O’Brien, 2008)
Focused attention (Webster & Ho, 1997; O’Brien, 2008)
Reputation, trust and expectation (Attfield et al, 2011)
Positive Affect (O’Brien & Toms, 2008)
Motivation, interests, incentives, and benefits (Jacques et al., 1995; O’Brien & Toms, 2008)
(O’Brien, Lalmas & Yom-Tov, 2014)

Measuring user engagement
Self-report
  Measures: questionnaire, interview, think-aloud and think-after protocols
  Attributes: subjective; short- and long-term; lab and field; small scale
Physiology
  Measures: EEG, SCL, fMRI, eye tracking, mouse-tracking
  Attributes: objective; short-term; lab and field; small and large scale
Analytics
  Measures: intra- and inter-session metrics, data science
  Attributes: objective; short- and long-term; field; large scale

Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (in-the-moment versus long-term)
What you want to optimize for each task, session, query: Mi
What you want to optimize long-term: LTVj

User engagement metrics
Kendall’s tau with p-value < 0.05 ('-' insignificant correlations)
High correlation between metrics in same group
Low correlation between metrics in different groups
[Color scale: -1 to 1]
Metric              #Users  #Visits  #Clicks  PageViewsV  DwellTimeV  ActiveDays  ReturnRate
                    [POP]   [POP]    [POP]    [ACT]       [ACT]       [LOY]       [LOY]
#Users [POP]                0.82     0.75     -           -           0.43        0.34
#Visits [POP]       0.82             0.85     -           -           0.60        0.52
#Clicks [POP]       0.75    0.85              0.16        0.18        0.59        0.51
PageViewsV [ACT]    -       -        0.16                 0.33        -           -
DwellTimeV [ACT]    -       -        0.18     0.33                    -           -
ActiveDays [LOY]    0.43    0.60     0.59     -           -                       0.79
ReturnRate [LOY]    0.34    0.52     0.51     -           -           0.79
0.69
in-the-moment
long-term
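To make the correlation analysis concrete, here is a minimal pure-Python sketch of Kendall's tau between two engagement metrics measured over the same set of sites. It implements the simple tau-a variant (no tie correction, no p-value), which may differ from the variant used in the talk, and the per-site numbers are invented. Library routines such as `scipy.stats.kendalltau` additionally report a p-value.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-site values (assumed for illustration only).
visits = [120, 300, 80, 500, 210]
active_days = [3, 5, 2, 4, 6]

tau = kendall_tau_a(visits, active_days)
print(tau)
```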

Online sites differ with respect to
their engagement pattern
Games: Users spend much time per visit
Search: Users come frequently and do not stay long
Social media: Users come frequently and stay long
Niche: Users come on average once a week, e.g. weekly post
News: Users come periodically, e.g. morning and evening
Service: Users visit site when needed, e.g. to renew subscription
in-the-moment: at each visit
long-term: visit frequency

From intra- to
inter-session
evaluation

1.  Search
2.  Mobile advertising
happy users
come back

The Message: From intra- to inter-session evaluation
[Diagram] System (models, features)
What you want to optimize for each task, session, query: M1, M2, M3, ..., Mn
What you want to optimize long-term: LTV1, LTV2, LTV3, ..., LTVm
Mi → LTVj

Search experience
[Diagram] Search system (models, features)
What you want to optimize for each task, session, query (Mi): search metrics (signals)
What you want to optimize long-term (LTVj): absence time (revisit the site)

intra-session search
metrics
•  Dwell time
•  Number of clicks
•  Time to 1st click
•  Skipping
•  Click through rate
•  Abandonment rate
•  Number of query
reformulations
•  …
Dwell time as a proxy of user interest
Dwell time as a proxy of relevance
Dwell time as a proxy of conversion
Dwell time as a proxy of post-click ad
quality
…
User engagement metrics for search
(Proxy: relevance of search results)
intra-session
inter-session

Dwell time (I)
§ Definition
The contiguous time spent on a site or web page
§ Cons
Not clear that the user was actually looking at the site while there → blur/focus
[Figure: Distribution of dwell times on 50 websites]
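One way to address the blur/focus concern is to count only the time during which the page had focus. The sketch below assumes a hypothetical event stream of `(time in seconds, "focus" | "blur")` pairs, such as a client-side logger might record; it is an illustration, not the instrumentation used in the talk.

```python
def focused_dwell(events, page_exit):
    """Sum the time spent between each focus event and the following blur.

    events: ordered list of (t_seconds, kind), kind in {"focus", "blur"}.
    page_exit: time at which the visit ended.
    """
    total = 0.0
    focused_since = None
    for t, kind in events:
        if kind == "focus" and focused_since is None:
            focused_since = t
        elif kind == "blur" and focused_since is not None:
            total += t - focused_since
            focused_since = None
    if focused_since is not None:  # still focused when the visit ended
        total += page_exit - focused_since
    return total


# Hypothetical visit: 100 s on the page, 60 s of which in another tab.
events = [(0.0, "focus"), (20.0, "blur"), (80.0, "focus")]
raw_dwell = 100.0 - events[0][0]
focused = focused_dwell(events, page_exit=100.0)
print(raw_dwell, focused)
```

In this example the raw dwell time is 100 seconds while the focused dwell time is 40 seconds, which shows how far the two can diverge.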

Dwell time (II)
Dwell time varies by
site type:
•  leisure sites tend to have
longer dwell times than
news, e-commerce, etc.
Dwell time has a
relatively large variance
even for the same site
Dwell time on 50 websites
(tourists, active, VIP …
users)

Search result page for “asparagus” (I)

Search result page for “asparagus” (II)

Absence time and survival analysis
[Figure: survival curves for story 1 to story 9; x-axis: hours (0 to 20); y-axis: 0.0 to 1.0]
Users (%) who did come back
Users (%) who read story 2 but did not come back after 10 hours
SURVIVE
DIE
DIE = RETURN TO SITE → SHORT ABSENCE TIME
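Survival curves of this kind can be estimated with the Kaplan-Meier estimator, where "dying" means returning to the site and users who have not returned by the end of the observation window are treated as censored. The following is a minimal pure-Python sketch with invented absence times; the talk does not state which estimator or tooling was used.

```python
def kaplan_meier(durations, returned):
    """Kaplan-Meier estimate of the probability of not having returned yet.

    durations: absence time observed for each user (e.g. in hours).
    returned:  True if the user came back at that time, False if the
               observation ended first (censored).
    Returns [(t, S(t))] at each time where at least one user returned.
    """
    data = sorted(zip(durations, returned))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_t = [r for d, r in data if d == t]
        events = sum(same_t)  # users who returned exactly at time t
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= len(same_t)
        i += len(same_t)
    return curve


# Hypothetical absence times in hours (assumed for illustration only).
durations = [2, 5, 5, 8, 10, 10]
returned = [True, True, False, True, False, False]

curve = kaplan_meier(durations, returned)
print(curve)
```

A curve that drops quickly corresponds to a high hazard rate, that is, users coming back soon and therefore a short absence time.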

Absence time applied to search
Ranking function on Yahoo Answer Japan
Two weeks of click data on Yahoo Answer Japan: search
One million users
Six ranking functions
30-minute session boundary
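The 30-minute session boundary can be applied to a user's event timestamps to split them into sessions and derive absence times. In this sketch absence time is taken as the gap between the last event of one session and the first event of the next, which is an assumption about the exact definition; the timestamps are invented.

```python
SESSION_GAP_S = 30 * 60  # 30-minute session boundary, as on the slide


def sessions(timestamps, gap=SESSION_GAP_S):
    """Group event timestamps (seconds) into sessions separated by > gap."""
    result = []
    for t in sorted(timestamps):
        if result and t - result[-1][-1] <= gap:
            result[-1].append(t)
        else:
            result.append([t])
    return result


def absence_times(timestamps, gap=SESSION_GAP_S):
    """Time between the end of one session and the start of the next."""
    s = sessions(timestamps, gap)
    return [nxt[0] - prev[-1] for prev, nxt in zip(s, s[1:])]


# Hypothetical click timestamps for one user, in seconds.
clicks = [0, 600, 1200, 10_000, 10_300, 90_000]
user_sessions = sessions(clicks)
absences = absence_times(clicks)
print(user_sessions, absences)
```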

Absence time and number of clicks on search result page
survival analysis: high hazard rate (die quickly) = short absence
[Figure labels: 5 clicks, 3 clicks, control = no click]
§  No click means a bad user experience
§  Clicking between 3-5 results leads to same user experience
§  Clicking on more than 5 results reflects poorer user experience; users cannot
find what they are looking for
(Dupret & Lalmas, 2013)

Using DCG versus absence to evaluate
five ranking functions
Ordering of the ranking functions under each measure:
DCG@1: Ranking Alg 1, Ranking Alg 2, Ranking Alg 3, Ranking Alg 4, Ranking Alg 5
DCG@5: Ranking Alg 1, Ranking Alg 3, Ranking Alg 2, Ranking Alg 4, Ranking Alg 5
Absence time: Ranking Alg 1, Ranking Alg 2, Ranking Alg 5, Ranking Alg 3, Ranking Alg 4
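For reference, DCG@k can be sketched as below. Several DCG variants exist; this one uses the gain divided by log2(rank + 1), which is an assumption since the talk does not say which formulation was used. The graded gains are invented and chosen to show that two rankings can swap order when the cutoff changes.

```python
from math import log2


def dcg_at(gains, k):
    """Discounted cumulative gain over the top-k results.

    gains: graded relevance of each result, in ranked order.
    Uses the gain / log2(rank + 1) formulation (one of several variants).
    """
    return sum(g / log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


# Hypothetical graded relevance of the results returned by two rankers.
gains_a = [3, 0, 0, 0, 0]
gains_b = [2, 3, 2, 1, 0]

for k in (1, 5):
    print(k, dcg_at(gains_a, k), dcg_at(gains_b, k))
```

Here the first ranker wins at cutoff 1 and loses at cutoff 5, which mirrors how the ordering of ranking functions on the slide depends on the measure.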

Absence time and search experience
§  Clicking lower in the ranking (2nd, 3rd) suggests more careful choice
from the user (compared to 1st)
§  Clicking at bottom is a sign of low quality overall ranking
§  Users finding their answers quickly (time to 1st click) return sooner to
the search application
§  Returning to the same search result page is a worse user experience
than reformulating the query
search session metrics → absence time

Absence time – search experience
In 21 experiments carried out through A/B testing, absence time agrees with the A/B test outcome (which one is better) in 14 of them
(Chakraborty et al., 2014)
Positive signals
•  One more query in session
•  One more click in session
•  SAT clicks
•  Query reformulation
Negative signals
•  Abandoned session
•  Quick-back clicks
search session metrics → absence time

The context — Post-click experience
on mobile advertising
[Diagram] Native ad serving (models, features)
What you want to optimize for each task, session, query (Mi): dwell time on landing page
What you want to optimize long-term (LTVj): absence time (next ad click)

Native Advertising
…
Mobile Desktop

Estimating the quality of the post-click
experience
Best experience is when conversion happens
Estimating the probability of conversion is hard!
- Conversion data is not available for all advertisers
- Conversion data is not missing at random
Proxy metric of post-click quality:
dwell time on the ad landing page
- No conversion does not mean a bad experience
[Timeline: t_{ad-click}, then t_{back-to-publisher}]
dwell time = t_{back-to-publisher} - t_{ad-click}
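The dwell time definition translates directly into code. This is a small sketch with hypothetical timestamps; returning `None` when no return to the publisher is observed is a design choice made here, not something specified in the talk.

```python
from datetime import datetime


def dwell_seconds(t_ad_click, t_back_to_publisher):
    """Dwell time on the ad landing page, in seconds.

    Returns None when the user was never observed coming back to the
    publisher, in which case the dwell time is unknown.
    """
    if t_back_to_publisher is None:
        return None
    return (t_back_to_publisher - t_ad_click).total_seconds()


# Hypothetical timestamps (assumed for illustration only).
clicked = datetime(2015, 9, 8, 10, 0, 5)
came_back = datetime(2015, 9, 8, 10, 1, 20)

dwell = dwell_seconds(clicked, came_back)
print(dwell)
```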

Dwell time as a proxy of the post-click
experience
mobile: 200K ad clicks
›  It needs less time to get the same probability of a second click
desktop (toolbar): 30K ad clicks
›  23.3% of users visit other websites than the ad landing page before returning to publisher
›  this goes down to 7.4% for dwell time up to 3 mins.
Probability of a second click increases with dwell time

Dwell time and
absence time
[Figure: bar chart of ad click difference (y-axis: 0% to 600%) for short ad clicks vs. long ad clicks]
Dwell time → ad click
Positive post-click experience (“long” clicks) has an effect on users clicking on ads again (mobile)
(Lalmas et al., 2015)
Absence time:
•  return to publisher
•  click on an ad

From intra- to inter-
session evaluation
Absence time
1.  Search
2.  Mobile advertising
happy users
come back

Large-scale online measurement
System (models, features)
Decide the in-the-moment metric(s)
Decide the long-term-value metric(s)
Which in-the-moment metric(s) are good predictors of long-term value metric(s)?
Optimize for the identified in-the-moment metric(s)
Lots of data required to remove noise
What is a signal?
What is a metric?

O’Brien & Toms User Engagement Scale
31 items and six sub-scales:
aesthetic appeal, novelty, felt involvement, focused attention, perceived usability, endurability
(O’Brien & Toms, 2010; Arguello et al., 2012; Bordino et al., Under Review)
Small-scale measurement

Towards User Engagement
happy users
come back
we need to
properly identify
that a user is
happy
