ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems

5th ACM International Conference on
Recommender Systems – RecSys 2011

Rank and Relevance
in Novelty and Diversity Metrics
for Recommender Systems

Saúl Vargas and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es

IRG
Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems
5th ACM International Conference on Recommender Systems (RecSys 2011)
IR Group @ UAM
Chicago, IL, 23-27 October 2011

Beyond accuracy: novelty and diversity
You bought (or browsed): Revolver

So you are recommended…
Rubber Soul, With The Beatles, Let It Be, Help!, Beatles for Sale, A Hard Day’s Night, Sgt. Pepper’s Lonely Hearts Club Band, Yellow Submarine, Magical Mystery Tour, The White Album, Abbey Road, Please Please Me, 1967-1970 (Blue), 1962-1966 (Red), Past Masters, Past Masters Vol 2
… More Beatles’ albums

The recommended items are…
• Very similar to each other
• Very similar to what the user has already seen
• Very widely known

Dark Side of the Moon, Some Girls, Bob Dylan

Novelty and diversity in Recommender Systems

Algorithms to enhance novelty and diversity
• Greedy optimization of objective functions (accuracy + diversity), promotion of long-tail items, etc. (Ziegler 2005, Zhang 2008, Celma 2008)

Metrics and methodologies to measure and evaluate novelty and diversity
• Inverse popularity – mean self-information (Zhou 2010) → recommend in the long tail [Novelty]

  $\mathrm{MSI} = -\frac{1}{|R|} \sum_{i \in R} \log_2 p(i)$

• Intra-list diversity – average pairwise distance (Ziegler 2005, Zhang 2008) [Diversity]

  $\mathrm{ILD} = \frac{2}{|R|\,(|R|-1)} \sum_{i_k, i_l \in R,\; k < l} d(i_k, i_l)$

• Other: temporal diversity (Lathia 2010), diversity relative to other users & to other systems (Bellogín 2010), aggregate diversity (Adomavicius 2011), unexpectedness (Adamopoulos 2011), etc.
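
As a rough illustration of the two baseline metrics above, here is a minimal Python sketch. The toy popularity values, the genre sets and the helper names are hypothetical, and the Jaccard-based distance is only one possible choice of d.

```python
import math
from itertools import combinations

def mean_self_information(recommended, popularity):
    """MSI = -(1/|R|) * sum over i in R of log2 p(i)."""
    return -sum(math.log2(popularity[i]) for i in recommended) / len(recommended)

def intra_list_diversity(recommended, distance):
    """ILD = 2 / (|R| (|R| - 1)) * sum over pairs k < l of d(i_k, i_l)."""
    n = len(recommended)
    total = sum(distance(a, b) for a, b in combinations(recommended, 2))
    return 2.0 * total / (n * (n - 1))

# Hypothetical toy data: p(i) is the share of users who know item i.
popularity = {"a": 0.5, "b": 0.25, "c": 0.125}
genres = {"a": {"rock"}, "b": {"rock", "pop"}, "c": {"jazz"}}

def jaccard_distance(i, j):
    """Complement of the Jaccard similarity between two genre sets."""
    union = genres[i] | genres[j]
    return 1.0 - len(genres[i] & genres[j]) / len(union)

msi = mean_self_information(["a", "b", "c"], popularity)
ild = intra_list_diversity(["a", "b", "c"], jaccard_distance)
ild_reversed = intra_list_diversity(["c", "b", "a"], jaccard_distance)
```

Both functions only look at the set of recommended items, so reordering the list leaves the values unchanged.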

Some limitations

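Metrics are insensitive to the order of recommended items

Same item sets → same measured diversity/novelty

[Figure: two rankings, R1 and R2, of the same items. R1 places the diverse items at the top and the non-diverse items below; R2 places them in the opposite order.]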

Some limitations

Accuracy and diversity/novelty measured independently

[Figure: two plots of Accuracy and Diversity comparing Method A and Method B. First panel: "Method A is better than B". Second panel: "Which one is better?"]

Our research goals

1. Further formalize recommendation novelty and diversity
metrics based on a few basic fundamental principles

2. Build a unified metric framework where:
– As many state of the art novelty and diversity metrics as possible
are related and generalized

– New metrics can be defined

3. Enhance the novelty and diversity metrics with rank
sensitivity and relevance awareness

Basic fundamental principles to build metrics upon

• Our approach: define and formalize novelty and diversity metrics based on models of how users interact with items

• Three basic fundamental principles in user-item interaction
– Discovery – an item is seen by a user
– Relevance – an item would be liked by (or useful for, etc.) a user
– Choice – an item is actually accepted (bought, consumed, etc.) by a user

• Formalized as binary random variables
– seen, rel, choose taking values in {true, false}

• Simplifying assumptions:
– seen and rel are mutually independent
– If a user sees an item that is relevant for her, she chooses it

  $p(\text{choose}) \sim p(\text{seen})\, p(\text{rel})$

[Diagram relating the variables seen, choose and rel]

Proposed metric framework

Expected effective novelty of items when a user interacts with a ranked list R of recommended items in a context θ

  $m(R \mid \theta) = C \sum_{i \in R} p(\text{choose} \mid i, u, R)\; \mathrm{nov}(i \mid \theta)$

Novelty is relative: nov(i | θ) is the novelty of item i in context θ, that is, relative to (what we know about) what someone has seen sometime somewhere
• Someone ≡ the target user, a set of users, all users…
• Sometime ≡ a specific past time period, an ongoing session, “ever”…
• Somewhere ≡ past recommendations, the current recommendation R, recommendations by other systems, “anywhere”…
• “What we know about that” ≡ context of observation: available observations
…

Metric framework components

  $m(R \mid \theta) = C \sum_{i \in R} p(\text{choose} \mid i, u, R)\; \mathrm{nov}(i \mid \theta)$

• Item novelty model: nov(i | θ)

• Choice model: p(choose | i, u, R)

Item novelty models

Item novelty model nov(i|θ)
• Discovery-based (negative popularity)

– Popularity complement (forced discovery): $\mathrm{nov}(i \mid \theta) = 1 - p(\text{seen} \mid i, \theta)$

– Self-information (surprisal; free discovery): $\mathrm{nov}(i \mid \theta) = -\log_2 p(i \mid \text{seen}, \theta)$

• Distance-based (θ here represents a set of items)

– Expected item distance: $\mathrm{nov}(i \mid \theta) = \sum_{j \in \theta} p(j \mid \text{choose}, i, \theta)\; d(i, j)$

– Minimum item distance: $\mathrm{nov}(i \mid \theta) = \min_{j \in \theta} d(i, j)$
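
The four item novelty models listed above can be sketched as small Python functions. This is only an illustration: the function names, the argument layout and the numeric toy distance are assumptions, and the probabilities are taken as already estimated.

```python
import math

def popularity_complement(p_seen_i):
    """nov(i|theta) = 1 - p(seen|i,theta)."""
    return 1.0 - p_seen_i

def self_information(p_i_given_seen):
    """nov(i|theta) = -log2 p(i|seen,theta)."""
    return -math.log2(p_i_given_seen)

def expected_item_distance(item, context, p_j, distance):
    """nov(i|theta) = sum over j in theta of p(j|choose,i,theta) * d(i,j).

    p_j maps each context item j to its (pre-estimated) probability.
    """
    return sum(p_j[j] * distance(item, j) for j in context)

def minimum_item_distance(item, context, distance):
    """nov(i|theta) = min over j in theta of d(i,j)."""
    return min(distance(item, j) for j in context)

# Hypothetical toy setting: items are numbers, distance is |a - b|.
toy_distance = lambda a, b: abs(a - b)
context = [1, 4]
weights = {1: 0.75, 4: 0.25}

pc = popularity_complement(0.25)
si = self_information(0.125)
eid = expected_item_distance(2, context, weights, toy_distance)
mid = minimum_item_distance(2, context, toy_distance)
```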

Choice model

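Choice model p(choose|i,u,R)

  $p(\text{choose}) \sim p(\text{seen})\, p(\text{rel})$

  $p(\text{choose} \mid i, u, R) \sim p(\text{seen} \mid i, u, R)\; p(\text{rel} \mid i, u)$

– p(seen | i, u, R): browsing model
– p(rel | i, u): relevance model, independent from R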

Browsing model

Browsing model where p(seen|i_k,u,R) should decrease with k
• Can be formalized as different probabilistic discount functions (see e.g. Carterette 2011)
• In general, p(seen|i_k,u,R) = disc(k)

disc(k):
– $p^{k-1}$: exponential, as in RBP (Moffat 2008)
– $1/\log(k+1)$: as in nDCG
– $1/k$: Zipfian, as in MRR, MAP, etc.
– $1$: no discount
– ... many others...

[Figure: ranked list R with positions 1 to 9 and beyond; does the user reach the item at k = 6?]
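
The discount functions listed above are easy to write down directly. In this sketch the function names are hypothetical, the default p = 0.85 is borrowed from the 0.85^(k-1) setting used in the experiments, and the logarithm is taken in base 2 as an assumption (the slide does not state the base).

```python
import math

def disc_exponential(k, p=0.85):
    """p^(k-1): exponential discount, as in RBP."""
    return p ** (k - 1)

def disc_log(k):
    """1 / log(k+1): logarithmic discount, as in nDCG (base 2 assumed here)."""
    return 1.0 / math.log2(k + 1)

def disc_zipf(k):
    """1 / k: Zipfian discount, as in MRR, MAP, etc."""
    return 1.0 / k

def disc_none(k):
    """Constant 1: no rank discount."""
    return 1.0

# Each function maps a 1-based rank position k to p(seen | i_k, u, R).
first_ten = {
    "exponential": [disc_exponential(k) for k in range(1, 11)],
    "log": [disc_log(k) for k in range(1, 11)],
    "zipf": [disc_zipf(k) for k in range(1, 11)],
    "none": [disc_none(k) for k in range(1, 11)],
}
```

All four return 1 at k = 1, and all but the constant one decrease as k grows, matching the requirement that p(seen|i_k,u,R) should decrease with k.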

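Wrapping up: resulting metric scheme

  $m(R \mid \theta) = C \sum_{i_k \in R} \mathrm{disc}(k)\; p(\text{rel} \mid i_k, u)\; \mathrm{nov}(i_k \mid \theta)$

– disc(k): rank discount
– p(rel | i_k, u): item relevance
– nov(i_k | θ): item novelty

Normalization – to get the novelty ratio by expected number of browsed items

  $C = \frac{1}{\sum_{i_k \in R} \mathrm{disc}(k)}$

where the denominator is the expected browsing depth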
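
A small Python sketch of the resulting scheme, including the normalization by expected browsing depth. The function name, the dictionary-based inputs and the toy values are assumptions for illustration only.

```python
def rank_relevance_metric(ranking, disc, p_rel, nov):
    """m(R|theta) = C * sum_k disc(k) * p(rel|i_k,u) * nov(i_k|theta),
    with C = 1 / sum_k disc(k)."""
    weights = [disc(k) for k in range(1, len(ranking) + 1)]
    c = 1.0 / sum(weights)  # inverse of the expected browsing depth
    return c * sum(
        w * p_rel[item] * nov[item] for w, item in zip(weights, ranking)
    )

# Hypothetical toy data: one novel item and one popular item, both relevant.
nov = {"novel": 0.9, "popular": 0.1}
p_rel = {"novel": 1.0, "popular": 1.0}
zipf = lambda k: 1.0 / k
flat = lambda k: 1.0

novel_first = rank_relevance_metric(["novel", "popular"], zipf, p_rel, nov)
novel_last = rank_relevance_metric(["popular", "novel"], zipf, p_rel, nov)
flat_first = rank_relevance_metric(["novel", "popular"], flat, p_rel, nov)
flat_last = rank_relevance_metric(["popular", "novel"], flat, p_rel, nov)
```

With the Zipfian discount the list that puts the novel item first scores higher; with no discount both orderings score the same, which is the order insensitivity the rank discount is meant to remove.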

Implementation

Ground model estimates

• θ ≡ observed interaction between all users and items in the system

• Discovery distributions can be estimated from rating data or access records
– Forced discovery: p(seen|i,θ) ~ IUF (ratio of users who have interacted with i)
– Free discovery: p(i|seen,θ) ~ ICF (ratio of interactions involving i)

• Relevance distribution p(rel|i,u) is estimated by a mapping from ratings to relevance (see definition of ERR in Chapelle 2009)
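
The estimates above might be computed along these lines. This is a sketch under assumptions: interactions are given as (user, item) pairs, the function names are hypothetical, and the rating is used directly as the relevance grade in the ERR-style gain mapping (2^g - 1) / 2^g_max, which may differ from the exact mapping used by the authors.

```python
from collections import Counter

def estimate_discovery(interactions):
    """Estimate both discovery distributions from (user, item) pairs.

    Returns (p_seen, p_item):
      p_seen[i] -- ratio of users who have interacted with i
      p_item[i] -- ratio of interactions involving i
    """
    pairs = set(interactions)  # count each user-item pair once
    users = {user for user, _ in pairs}
    per_item = Counter(item for _, item in pairs)
    p_seen = {i: n / len(users) for i, n in per_item.items()}
    p_item = {i: n / len(pairs) for i, n in per_item.items()}
    return p_seen, p_item

def relevance_from_rating(rating, max_rating=5):
    """ERR-style mapping from a grade to a relevance probability.

    Assumption: the rating itself is used as the grade g.
    """
    return (2 ** rating - 1) / 2 ** max_rating

# Hypothetical toy log: three users, two items.
log = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "a")]
p_seen, p_item = estimate_discovery(log)
```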

Novelty and diversity metrics

Putting all together
Some metric framework instantiations

Putting all together: metric framework instantiations

Discovery-based metrics

• θ ≡ observed interaction between all users and items in the system

• Expected popularity complement [Novelty]

  $\mathrm{EPC}(R) = C \sum_{i_k \in R} \mathrm{disc}(k)\; p(\text{rel} \mid i_k, u)\; \bigl(1 - p(\text{seen} \mid i_k)\bigr)$

• Expected free discovery [Novelty]

  $\mathrm{EFD}(R) = -C \sum_{i_k \in R} \mathrm{disc}(k)\; p(\text{rel} \mid i_k, u)\; \log_2 p(i_k \mid \text{seen})$

  Without rank and relevance, EFD reduces to $\mathrm{MSI}(R) = -\frac{1}{|R|} \sum_{i \in R} \log_2 p(i \mid \text{seen})$

Putting all together: metric framework instantiations

Distance-based metrics

• θ ≡ the observed interaction of the target user only
• Expected profile distance [Unexpectedness (user-specific)]

  $\mathrm{EPD}(R) = C_u \sum_{i_k \in R,\; j \in u} \mathrm{disc}(k)\; p(\text{rel} \mid i_k, u)\; p(\text{rel} \mid j, u)\; d(i_k, j)$

• θ ≡ the recommended items the target user can see in R
• Expected intra-list diversity [Diversity]

  $\mathrm{EILD}(R) = \sum_{i_k, i_l \in R,\; k \neq l} C_k\; \mathrm{disc}(k)\; \mathrm{disc}(l \mid k)\; p(\text{rel} \mid i_k, u)\; p(\text{rel} \mid i_l, u)\; d(i_k, i_l)$

  Without rank and relevance, EILD reduces to $\mathrm{ILD}(R) = \frac{2}{|R|\,(|R|-1)} \sum_{i_k, i_l \in R,\; k < l} d(i_k, i_l)$

Novelty and diversity metrics

Some experiments

Experiments

• Datasets
– MovieLens 1M
– Last.fm data by Òscar Celma

• Experiment design
– Run baseline recommenders
– Rerank top 500 recommended items by diversification algorithms
– Measure metrics on top 50 items

• Metrics
– EPC@50: Novelty (popularity complement)
– EPD@50: Unexpectedness (profile distance)
– EILD@50: Intra-list diversity
Distance function: complement of Jaccard (MovieLens genres) and Pearson (Last.fm)

• Recommender algorithms
– CB: Content-based (ML only)
– UB: User-based kNN
– MF: Matrix factorization
– AVG: Average rating
– RND: Random

• Diversification algorithms
– MMR: Greedy optimization of relevance + diversity (Zhang 2008)
– IA-Select: Adaptation of IR diversity algorithm (Agrawal 2008)
– NGD: Greedy optimization of relevance + novelty
– Random
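
To make the reranking step concrete, here is a hedged Python sketch of a greedy relevance + diversity reranker using the complement of Jaccard over genre sets as distance. The objective, the trade-off parameter `lam` and all names are assumptions for illustration; the slides do not give the exact formulation used in the experiments.

```python
def jaccard_distance(a, b):
    """Complement of the Jaccard similarity between two sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def greedy_rerank(candidates, scores, features, top_n, lam=0.5):
    """Greedily pick items maximizing (1 - lam) * score + lam * diversity,
    where diversity is the average distance to the items already picked."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def objective(item):
            if not selected:
                return scores[item]  # first pick: relevance score only
            diversity = sum(
                jaccard_distance(features[item], features[s]) for s in selected
            ) / len(selected)
            return (1.0 - lam) * scores[item] + lam * diversity
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy candidates with baseline scores and genre sets.
scores = {"a": 0.9, "b": 0.8, "c": 0.7}
genres = {"a": {"rock"}, "b": {"rock"}, "c": {"jazz"}}

diversified = greedy_rerank(["a", "b", "c"], scores, genres, top_n=2, lam=0.5)
baseline = greedy_rerank(["a", "b", "c"], scores, genres, top_n=2, lam=0.0)
```

With `lam = 0` the reranker returns the baseline order; raising `lam` trades score for distance to the items already selected.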

Experimental results on baseline recommenders (no rank discount)

Without relevance
[Figure: bar charts of EPC@50, EPD@50 and EILD@50 on MovieLens 1M and Last.fm for the CB, MF, UB, AVG and RND recommenders, without relevance.]
• CB is good for long-tail, not so good at unexpectedness and diversity
• AVG rating and RND stand out, especially on Last.fm

With relevance
[Figure: the same bar charts with relevance-aware metrics.]
• MF stands out on MovieLens
• UB stands out on Last.fm
• AVG rating and RND drop drastically

Experimental results with diversification algorithms

MovieLens 1M
              EPC@50                EPD@50                EILD@50
disc(k)       1       0.85^(k-1)    1       0.85^(k-1)    1       0.85^(k-1)
No relevance
  MF          0.9124  0.8876        0.7632  0.7466        0.7164  0.6191
  IA-Select   0.9045  0.8886        0.8080  0.7577        0.8289  0.7483
  MMR         0.9063  0.8769        0.7605  0.7428        0.7191  0.6247
  NGD         0.9851  0.9795        0.7725  0.7551        0.6563  0.5430
  Random      0.9525  0.9527        0.7699  0.7699        0.7283  0.6719
Relevance
  MF          0.0671  0.1043        0.0580  0.0944        0.0471  0.0551
  IA-Select   0.0705  0.1161        0.0639  0.1032        0.0537  0.0648
  MMR         0.0719  0.1131        0.0620  0.1020        0.0510  0.0610
  NGD         0.0155  0.0223        0.0128  0.0200        0.0067  0.0017
  Random      0.0222  0.0218        0.0182  0.0179        0.0117  0.0058

Last.fm
              EPC@50                EPD@50                EILD@50
disc(k)       1       0.85^(k-1)    1       0.85^(k-1)    1       0.85^(k-1)
No relevance
  MF          0.8754  0.8481        0.8949  0.8895        0.8862  0.7954
  IA-Select   0.8840  0.9089        0.8912  0.8909        0.8878  0.8274
  MMR         0.9068  0.8903        0.9133  0.9107        0.9166  0.8398
  NGD         0.9722  0.9571        0.9423  0.9398        0.9485  0.8784
  Random      0.9359  0.9357        0.9278  0.9279        0.9318  0.8619
Relevance
  MF          0.2501  0.2115        0.2671  0.2587        0.2518  0.1900
  IA-Select   0.3343  0.4752        0.3462  0.3994        0.3343  0.4154
  MMR         0.2351  0.1936        0.2439  0.2340        0.2360  0.1759
  NGD         0.2286  0.3077        0.2212  0.2593        0.2165  0.2656
  Random      0.1362  0.1368        0.1407  0.1405        0.1342  0.1113

[Legend for cell highlighting in the slide (Wilcoxon, p < 0.001): best; > random; < baseline]

• Improvement w.r.t. random reranking is clearer with relevance
• Rank sensitivity uncovers further improvements by diversification algorithms
• Different metrics appreciate different diversification algorithms consistently

Experimental results

• The metrics behave consistently
– E.g. content-based recommender scores high on novelty (long-tail) but low on unexpectedness and diversity
– Diversified recommendations score higher than baselines
– Different diversification strategies meet their specific targets

• Relevance makes a large difference
– Probe recommenders such as random and average rating score high without relevance and rank discount, and they drop with relevance
– Same effect for random diversification

• Rank sensitivity uncovers further improvements by diversification algorithms which otherwise go unnoticed

Conclusion

• General metric framework for recommendation novelty and diversity evaluation

• Flexible and configurable, supports a fair range of variants and configurations
– Key configuration components: item novelty models, context θ, rank and relevance

• Unifies and generalizes state of the art metrics
– Further metrics can be unified taking alternative θ: temporal novelty/diversity, inter-system diversity, inter-user diversity

• Provides for rank sensitivity and relevance awareness (as an option)

• Provides for single metric assessing accuracy and diversity/novelty

• Further ongoing empirical testing, wide space for further exploration!
