1. Toward a New Protocol to Evaluate Recommender Systems
Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier
Franck.meyer@orange.com
University Joseph Fourier & Orange
RecSys 2012 – Workshop on Recommendation Utility Evaluation
2012 – v1.18
2. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
3. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
4. Recommender systems
For industrial applications
Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM),...
as well as for well-known academic realizations
Fab, More, Twittomender,...
recommendation is multi-faceted
pushing items, sorting items, linking items...
and cannot be reduced to predicting a rating, i.e. a score of interest of a user u for an item i.
What is a good recommender system?
Just a system accurate at rating prediction for the top N blockbusters and the top M heavy users?
... or something else?
5. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
6. Industrial point of view
Main goals of automatic recommendation:
to increase sales
to increase the audience (click rates...)
to increase customers’ satisfaction and loyalty
Main needs (analysis at Orange: TV, Video on Demand, shows, web radios,...)
1. Helping all the users: heavy users and light users
2. Recommending all the items: frequently purchased/viewed items and rarely purchased/viewed items
3. Helping users with different identified problems:
1. should I take this item?
2. should I take this item or that one?
3. what should interest me in this catalog?
4. what is similar to this item?
7. We propose 4 key functions
Help to Decide
Example: given a user u and an item i, give a predictive score of interest of u for i (a rating).
Help to Compare
Example: given a user u and a list of items i1,…,in, sort the items in decreasing order of the score of interest for u.
Help to Discover
Example: given a user u, give N interesting items for u.
Help to Explore (navigate)
Example: given an item i used as a context, give N items similar to i.
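Read as an interface contract, the four functions map naturally onto four methods. A minimal sketch, with class and method names of our own choosing (not from the paper):

```python
from abc import ABC, abstractmethod

class Recommender(ABC):
    """Hypothetical interface for the 4 key functions above."""

    @abstractmethod
    def decide(self, user, item):
        """Help to Decide: predictive score of interest of user for item."""

    def compare(self, user, items):
        """Help to Compare: items sorted by decreasing score of interest."""
        return sorted(items, key=lambda i: self.decide(user, i), reverse=True)

    @abstractmethod
    def discover(self, user, n):
        """Help to Discover: N interesting items for the user."""

    @abstractmethod
    def explore(self, item, n):
        """Help to Explore: N items similar to the context item."""
```

Note that Help to Compare can be derived from Help to Decide, which is one reason rating prediction alone is often (wrongly) taken as the whole problem.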
8. Decide / Compare / Discover / Explore
Decide: the rating prediction must be precise; extreme errors must be penalized because they may more often lead to a wrong decision. Existing measure: RMSE.
Compare: the ranking prediction must be good for any couple of items of the catalog (not only for a Top N). Existing measure: NDPM (or number of compatible orders).
Discover: the recommendation must be useful. Problem: if one recommends only well-known blockbusters (e.g. Star Wars, Titanic...) one will be precise but not useful! Existing measure: Precision; we introduce the Impact Measure.
Explore: the semantic relevance cannot be evaluated without user feedback. We introduce a validation method for the similarity measure.
9. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
10. Known vs. Unknown, Risky vs. Safe
Recommending an item to a user, along two axes: the probability that the user already knows the item and the probability that the user likes it.

                             User dislikes the item      User likes the item
Item generally known by      Bad recommendation          Trivial recommendation
name by the user                                         (correct but not often useful)
Item unknown to the user     Very bad recommendation     Very good recommendation
                             (if he trusts the system,   (Help to Discover)
                             he will be misled)
11. Measuring the Help to Discover
Recommendation impact, combining the probability that the user already knows the item and the probability that the user likes it:

                              Impact if the user      Impact if the user
                              dislikes the item       likes the item
Recommending a popular item   slightly negative       slightly positive
Recommending a rare,          strongly negative       strongly positive
unknown item

Average Measure of Impact: for each item of the list Z of recommended items that appears in the list H of logs (u, i, r) of the Test Set, the impact is the rarity of the item (normalized over the catalog) * the relative rating of the user u (according to her mean of ratings).
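Putting these ingredients together, a plausible formalization of the measure; the slide does not spell out the exact rarity normalization, so this reconstruction is indicative only:

```latex
\mathrm{AMI} \;=\; \frac{1}{|Z \cap H|}
  \sum_{(u,i,r)\,\in\, Z \cap H} \mathrm{rarity}(i)\cdot\left(r - \bar{r}_u\right)
```

where rarity(i) is normalized over the catalog and \bar{r}_u is the mean rating of user u.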
12. Principle of the protocol
Datasets used: MovieLens 1M and Netflix.
The logs are triples (userID, itemID, rating), split into a Learn set and a Test set. No long-tail distribution was detected in the Netflix or MovieLens datasets, so we use the simplest segmentation, based on the mean number of ratings: light/heavy users and popular/unpopular items (simple mean-based item/user segmentation; see the sketch after this slide).
The model is learned on the Learn set, then evaluated on the Test set:
RMSE: for each (userID, itemID) in Test, generate a rating prediction and compare it with the true rating.
%COMP (% compatible): for each list of itemIDs of each userID in Test, sort the list according to the predicted ratings and compare the strict orders of the true ratings with the order given by the model.
AMI: for each userID in Test, generate a list of recommended items; for each of these items actually rated by the userID in Test, evaluate the relevance.
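As a concrete reading of the protocol, a minimal Python sketch of the mean-based segmentation and of the per-segment RMSE loop; the predict function and the (user, item, rating) log format are our assumptions:

```python
import math
from collections import Counter, defaultdict

def mean_based_segments(learn_logs):
    """Heavy/light users and popular/unpopular items, split at the
    mean number of ratings (the mean-based segmentation above)."""
    u_cnt, i_cnt = Counter(), Counter()
    for u, i, _ in learn_logs:
        u_cnt[u] += 1
        i_cnt[i] += 1
    u_mean = sum(u_cnt.values()) / len(u_cnt)
    i_mean = sum(i_cnt.values()) / len(i_cnt)
    heavy = {u for u, c in u_cnt.items() if c >= u_mean}
    popular = {i for i, c in i_cnt.items() if c >= i_mean}
    return heavy, popular

def rmse_per_segment(predict, test_logs, heavy, popular):
    """RMSE on each of the 4 (user segment, item segment) couples."""
    sq_errs = defaultdict(list)
    for u, i, r in test_logs:
        seg = ("Huser" if u in heavy else "Luser",
               "Pitem" if i in popular else "Uitem")
        sq_errs[seg].append((predict(u, i) - r) ** 2)
    return {seg: math.sqrt(sum(e) / len(e)) for seg, e in sq_errs.items()}
```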
13. We will use 4 algorithms to validate the protocol
Uniform Random Predictor
Returns a rating between 1 and 5 (min and max) with a uniform random distribution.
Default Predictor: (mean of item + mean of user) / 2
Robust mean of the item: requires at least 10 ratings on the item, otherwise uses only the user’s mean (sketched after this slide).
K-Nearest-Neighbors item method
Uses K nearest neighbors per item, a scoring method detailed below, and a similarity measure called Weighted Pearson. Falls back on the Default Predictor when an item cannot be predicted.
• Ref: Candillier, L., Meyer, F., Fessant, F. (2008). Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
Fast factorization method
Fast factorization algorithm with F factors, known as Gravity (“BRISMF” implementation).
• Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656.
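The Default Predictor is simple enough to sketch directly from its description above; a minimal version, assuming (user, item, rating) learn logs:

```python
from collections import defaultdict

class DefaultPredictor:
    """Baseline: (item mean + user mean) / 2, with a robust item mean."""
    MIN_ITEM_RATINGS = 10  # threshold quoted on the slide

    def fit(self, learn_logs):
        user_r, item_r = defaultdict(list), defaultdict(list)
        for u, i, r in learn_logs:
            user_r[u].append(r)
            item_r[i].append(r)
        self.user_mean = {u: sum(rs) / len(rs) for u, rs in user_r.items()}
        # Keep the item mean only when the item is rated often enough.
        self.item_mean = {i: sum(rs) / len(rs) for i, rs in item_r.items()
                          if len(rs) >= self.MIN_ITEM_RATINGS}
        self.global_mean = (sum(r for _, _, r in learn_logs)
                            / len(learn_logs))
        return self

    def predict(self, u, i):
        um = self.user_mean.get(u, self.global_mean)
        if i in self.item_mean:
            return (self.item_mean[i] + um) / 2
        return um  # otherwise use only the user's mean
```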
14. What about “Help to Explore”?
How to compare the “semantic quality” of the link between 2 items?
Principle
Define a similarity measure that can be extracted from the model.
Use the similarity measure to build an item-item similarity matrix.
Use the similarity matrix as the model of a recommender system using a KNN item-item model.
If this system obtains good performances for RMSE, %COMP and AMI, then the semantic quality of the similarity measure must be good.
Application
For a KNN-item model this is immediate (there is an intrinsic similarity).
For a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items’ factors (see the sketch after this slide).
For a random rating predictor, this is not applicable...
For a mean-based rating predictor, this is not applicable...
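For the matrix factorization case, a minimal sketch of the Pearson-on-item-factors construction (function and variable names are ours; the dense item-item matrix is affordable at MovieLens/Netflix item scale):

```python
import numpy as np

def item_similarity_from_factors(item_factors, k=100):
    """Top-K item-item similarity model built from factor vectors.

    item_factors: (n_items, n_factors) array, e.g. the item matrix
    learned by a factorization method such as Gravity/BRISMF.
    Uses Pearson correlation between factor vectors.
    """
    # Center then normalize each item's factor vector: the dot product
    # of centered, unit-norm vectors is the Pearson correlation.
    centered = item_factors - item_factors.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    normed = centered / np.where(norms == 0.0, 1.0, norms)
    sim = normed @ normed.T                # (n_items, n_items) Pearson matrix
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    knn = np.argsort(-sim, axis=1)[:, :k]  # K most similar items first
    return sim, knn
```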
15. Evaluating “Help to Explore” for Gravity
Starting from the items x users matrix of ratings (rows of items, columns of users):
Gravity (fast matrix factorization) produces a matrix of items’ factors and a matrix of users’ factors (the latter is not used here).
Items’ similarity computations and a K-nearest-neighbors search, using the matrix of items’ factors, yield a similarity matrix (KNN) of the items.
This similarity matrix is the model of a KNN-based recommender system, making possible an evaluation of the quality of the similarity matrix via RMSE, %COMP, AMI... (a scoring sketch follows this slide).
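To close the loop of this pipeline, a sketch of how the similarity model can serve as a rating predictor; the weighted-average scoring used here is a common KNN item-item choice, not necessarily the exact scoring method of the paper:

```python
def knn_predict(user_ratings, sim, knn, item, user_mean):
    """Rate `item` from the user's ratings on its nearest neighbors.

    user_ratings: dict item -> rating given by this user
    sim, knn: similarity matrix and neighbor lists, e.g. from
              item_similarity_from_factors() above
    Falls back on the user's mean when no neighbor was rated.
    """
    num = den = 0.0
    for j in knn[item]:
        if j in user_ratings and sim[item, j] > 0:
            num += sim[item, j] * user_ratings[j]
            den += sim[item, j]
    return num / den if den > 0 else user_mean
```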
16. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
17. Finding 1: different performances according to the segments
We have a decrease in performance of more than 25% between the heavy user / popular item segment and the light user / unpopular item segment.
[Figures: RMSE for Gravity on Netflix (x-axis: number of factors) and RMSE for KNN on Netflix (x-axis: number of KNN), showing the 4 analyzed segments: Light users Unpopular items (Luser Uitem), Light users Popular items (Luser Pitem), Heavy users Unpopular items (Huser Uitem), Heavy users Popular items (Huser Pitem), plus the global RMSE and the Default Predictor baseline.]
18. Finding 2: RMSE not strictly linked to the other performances
Example on 2 segments: the light user / popular item segment is easier to optimize than the light user / unpopular item segment for RMSE, but is as difficult to optimize as the light user / unpopular item segment for ranking.
[Figures: RMSE for Gravity on Netflix and ranking compatibility (%Compatible) for Gravity on Netflix, both as a function of the number of factors, per segment with the Default Predictor baseline.]
19. Finding 2 (continued): RMSE not strictly linked to the other performances
Globally, Gravity is better than KNN for RMSE, but is worse than KNN for the Average Measure of Impact.
[Figures: RMSE for Gravity on Netflix (x-axis: number of factors) and RMSE for KNN on Netflix (x-axis: number of KNN), per segment with the Default Predictor baseline; bar chart of the Average Measure of Impact on Netflix for Random Pred, Default Pred, KNN (K=100) and Gravity (F=32).]
20. Global results
Help to Decide / Compare / Discover:
Gravity dominates for the RMSE measure.
KNN dominates on the heavy user segments.
The Default Predictor is very useful for unpopular (i.e. infrequent) item segments.
[Figure: table of global results per algorithm, function and segment.]
21. Comparing native similarities with Gravity-based similarities
Similarities are measured by applying a Pearson similarity on the items’ factors given by Gravity (16 factors):
1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the “Help to Explore” function.

                                   Native KNN      KNN computed on Gravity’s item
                                   (K=100)         factors (K=100, 16 factors)
RMSE                               0.8440          0.8691
Ranking: % compatible              77.03%          75.67%
Precision                          91.90%          86.39%
AMI                                2.043           2.025
Global time of the modeling task   5290 seconds    3758 seconds
22. Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
23. Conclusion: contributions
As industrial recommendation is multi-faceted:
we proposed to list the key functions of the recommendation
• Help to Decide, Help to Compare, Help to Discover, Help to Explore
• Note on Help to Explore: the similarity feature is mandatory for a recommender system
we proposed to define a dual segmentation of items and users
• just being very accurate on heavy users and blockbuster items is not very useful
For a new offline protocol to evaluate recommender systems:
we proposed to use the recommender’s key functions with the dual segmentation
• mapping the key functions to measures
• adding the measure of Impact to evaluate the “Help to Discover” function
• adding a method to evaluate the “Help to Explore” function
we demonstrated its utility
• RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system only with RMSE (no guarantee on the other measures!)
• the mapping of the best algorithm for each couple (function, segment) could be exploited to improve the global performances
• we also saw empirically that the KNN approach can be virtualized, computing the similarities between items on a factorized space built for instance by Gravity
24. Future work: 3 main axes
1. Evaluation of the quality of the 4 core functions using an online A/B testing protocol
2. Hybrid switch system: the best algorithm for the adapted task according to the user-item segment
3. KNN virtualization via matrix factorization
26. About this work...
Frank Meyer: Recommender Systems in Industrial Contexts. CoRR abs/1203.4487 (2012).
Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012, Dublin.
Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505, Lyon.
27. Classic mathematical representation of the recommendation problem
A sparse matrix of thousands of items (rows) by thousands of users (columns). Known cells hold ratings of interest (1 to 5); the cells marked “?” are the ratings of interest to predict.
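Rendered compactly (cell values are illustrative, as on the original slide):

```latex
R \;=\;
\begin{array}{c|cccc}
       & u_1 & u_2 & \cdots & u_n \\ \hline
i_1    & 4   & 2   & \cdots & ?   \\
i_2    & 4   & 5   & \cdots & 5   \\
\vdots &     &     & \ddots &     \\
i_m    & 5   & ?   & \cdots & 4   \\
\end{array}
```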
28. Well-known industrial example: item-to-items recommendation (Amazon™)
[Figure: screenshot of Amazon’s item-to-items recommendations.]
29. Multi-faceted analysis: measures
RMSE: root mean squared error between the predicted ratings and the real ratings, over the number of logs in the Test Set T:
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i,r) \in T} (\hat{r}_{u,i} - r)^2}
NDPM / % compatible: based on the number of contradictory orders and the number of compatible orders among the strict orders given by the user, on a same dataset and a same user; directly usable:
\%\mathrm{compatible} = \frac{\text{nb of compatible orders}}{\text{nb of strict orders given by the user}}
Precision: computed on the number of recommended items actually evaluable in the Test Set.
AMI: Average Measure of Impact (see slide 11).
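A direct sketch of the %compatible measure, assuming test logs as (user, item, rating) triples; ties in the true ratings are skipped since only strict orders count:

```python
from collections import defaultdict
from itertools import combinations

def percent_compatible(predict, test_logs):
    """Fraction of item pairs whose predicted order agrees with the
    strict order given by the user's true ratings."""
    by_user = defaultdict(list)
    for u, i, r in test_logs:
        by_user[u].append((i, r))
    compatible = strict = 0
    for u, rated in by_user.items():
        for (i1, r1), (i2, r2) in combinations(rated, 2):
            if r1 == r2:   # not a strict order given by the user
                continue
            strict += 1
            # Same sign <=> the model orders the pair like the user.
            if (r1 - r2) * (predict(u, i1) - predict(u, i2)) > 0:
                compatible += 1
    return compatible / strict if strict else float("nan")
```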
30. Comparing native similarities with Gravity-based similarities
Similarities are measured by applying a Pearson similarity on the items’ factors given by Gravity (16 factors):
Gravity can be used for the “Help to Explore” function.
KNN item-item can be performed on a factorized matrix with little performance loss!
31. Reperio C-V5
Centralized mode, example of a movie recommender
32. Reperio E-V2
Embedded Mode, example of a TV program recommender