Contrasting Offline and Online Results when Evaluating Recommendation Algorithms

Most evaluations of novel algorithmic contributions assess their accuracy in predicting what was withheld in an offline evaluation scenario. However, doubts have been raised about whether standard offline evaluation practices are appropriate for selecting the best algorithm for field deployment. The goal of this work is therefore to compare the offline and the online evaluation methodology with the same study participants, i.e. a within-users experimental design. This paper presents empirical evidence that the ranking of algorithms based on offline accuracy measurements clearly contradicts the results from the online study with the same set of users. Thus the external validity of the most commonly applied evaluation methodology is not guaranteed.

  1. Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
     Marco Rossetti, Trainline Ltd., London (previously University of Milano-Bicocca)
     Fabio Stella, Department of Informatics, Systems and Communication, University of Milano-Bicocca
     Markus Zanker, Faculty of Computer Science, Free University of Bozen-Bolzano
  2. Research Goal
     • Given the dominance of offline evaluation, reflecting on its validity becomes important.
     • Said and Bellogin (RecSys 2014) identified serious problems with internal validity: results are not reproducible across different open-source frameworks.
     • Diverging results from offline and online evaluations have also been reported, putting question marks on external validity (e.g. Cremonesi et al. 2012, Beel et al. 2013, Garcin et al. 2014, Ekstrand et al. 2014, Maksai et al. 2015).
     • Proposition:
       • Compare the performance of an offline experiment with an online evaluation.
       • Use a within-users experimental design, where differences can be tested on paired samples (see the sketch after this list).
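A within-users design means every participant is exposed to every algorithm, so per-user metric values can be compared as paired samples. The snippet below is a minimal sketch of such a paired comparison; the per-user precision arrays are illustrative placeholders, and the choice of the Wilcoxon signed-rank test is an assumption for illustration, not the paper's stated procedure.

```python
# Minimal sketch of a paired (within-users) comparison of two algorithms.
# The per-user precision values are illustrative placeholders, and the
# Wilcoxon signed-rank test is an assumed choice, not necessarily the one
# used in the paper.
import numpy as np
from scipy.stats import wilcoxon

# precision@5 per user for two algorithms, aligned by user (paired samples)
prec_mf80 = np.array([0.6, 0.4, 0.8, 0.2, 0.6, 0.4, 0.6, 0.8, 0.4, 0.2])
prec_mf400 = np.array([0.8, 0.6, 0.8, 0.6, 0.8, 0.4, 0.8, 1.0, 0.6, 0.4])

stat, p_value = wilcoxon(prec_mf80, prec_mf400)
print(f"mean MF80={prec_mf80.mean():.3f}, mean MF400={prec_mf400.mean():.3f}, p={p_value:.3f}")
```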
  3. Research Questions
     1. Does the relative ranking of algorithms based on offline accuracy measurements predict the relative ranking according to an accuracy measurement in a user-centric evaluation?
     2. Does the relative ranking of algorithms based on offline measurements of predictive accuracy for long-tail items produce comparable results to a user-centric evaluation?
     3. Do offline accuracy measurements allow predicting the utility of recommendations in a user-centric evaluation?
  4. Study Design
     • Phase 1: collected likes on MovieLens (ML) movies from 241 users, on average 137 ratings per user.
     • Phase 2: the same users evaluated 4 algorithms with 5 recommendations each (on average 17.4 + 2 recommendations per user); 122 users returned, 100 remained after cleaning.
  5. Offline and Online Evaluations
     • Algorithms: Popularity (POP), MF80 (matrix factorization with 80 factors), MF400 (matrix factorization with 400 factors), and I2I (item-to-item k-nearest neighbours).
     • Training data: ML1M; offline evaluation with all-but-1 validation (see the sketch below); online evaluation based on the users' answers.
     • Metrics: precision on all items and precision on long-tail items (computed both offline and online), and useful recommendations (online only).
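As a rough illustration of the offline side, the sketch below runs an all-but-1 (leave-one-out) check: for each user one liked item is withheld, the recommender produces a top-5 list from the remaining data, and a hit is counted when the withheld item appears in that list. The `recommend_top_k` function and the tiny data set are placeholders standing in for whichever implementation of POP, MF80, MF400 or I2I is being evaluated; this is not the authors' code, and the hit-rate defined here is not necessarily the exact precision definition used in the paper.

```python
# Sketch of an all-but-1 (leave-one-out) offline evaluation.
# `recommend_top_k` is a placeholder recommender (here: plain popularity);
# the tiny data set below is purely illustrative.
import random
from typing import Dict, Set, List

def recommend_top_k(train: Dict[int, Set[int]], user: int, k: int = 5) -> List[int]:
    """Placeholder recommender: most popular items the user has not liked yet."""
    counts: Dict[int, int] = {}
    for likes in train.values():
        for item in likes:
            counts[item] = counts.get(item, 0) + 1
    candidates = [i for i in sorted(counts, key=counts.get, reverse=True)
                  if i not in train.get(user, set())]
    return candidates[:k]

def all_but_one_hit_rate(likes: Dict[int, Set[int]], k: int = 5, seed: int = 0) -> float:
    """Withhold one liked item per user and count how often the top-k recovers it."""
    rng = random.Random(seed)
    hits = 0
    for user, items in likes.items():
        held_out = rng.choice(sorted(items))
        train = {u: (s - {held_out} if u == user else s) for u, s in likes.items()}
        hits += held_out in recommend_top_k(train, user, k)
    return hits / len(likes)

likes = {1: {10, 20, 30}, 2: {10, 30, 40}, 3: {20, 30, 40}, 4: {10, 20, 40}}
print(f"all-but-1 hit rate: {all_but_one_hit_rate(likes):.2f}")
```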
  6. Precision on All Items
       Algorithm   Offline   Online
       I2I         0.438     0.546
       MF80        0.504     0.598
       MF400       0.454     0.604
       POP         0.340     0.516
     • Offline, MF80 scores highest; online, MF400 scores highest. [The slide shows pairwise significance diagrams for the offline and online rankings at p = 0.05, with one online comparison at p = 0.1.]
  7. Precision on Long-Tail Items
       Algorithm   Offline   Online
       I2I         0.280     0.356
       MF80        0.018     0.054
       MF400       0.360     0.628
       POP         0.000     0.000
     • Offline and online rankings agree on long-tail items: MF400 first, then I2I, then MF80, with POP recommending no long-tail items at all (pairwise differences significant at p = 0.05; see the long-tail precision sketch below).
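Long-tail precision restricts the hit counting to items outside the most popular head of the catalogue, which is why an algorithm that only recommends blockbusters (such as the popularity baseline) scores zero. Below is a minimal sketch of that restriction; the "top 20% most popular items form the head" cut-off is an assumption for illustration, not the threshold used in the paper.

```python
# Sketch of precision restricted to long-tail items.
# The head/tail split (top 20% most popular items = head) is an assumed
# cut-off for illustration; the paper may use a different threshold.
from collections import Counter
from typing import Dict, Set, List

def long_tail_items(likes: Dict[int, Set[int]], head_fraction: float = 0.2) -> Set[int]:
    """Items outside the most popular `head_fraction` of the catalogue."""
    popularity = Counter(item for items in likes.values() for item in items)
    ranked = [item for item, _ in popularity.most_common()]
    head_size = max(1, int(len(ranked) * head_fraction))
    return set(ranked[head_size:])

def long_tail_precision(recommended: List[int], relevant: Set[int], tail: Set[int]) -> float:
    """Fraction of recommended long-tail items that the user actually liked."""
    tail_recs = [item for item in recommended if item in tail]
    if not tail_recs:
        return 0.0
    return sum(item in relevant for item in tail_recs) / len(tail_recs)

likes = {1: {10, 20, 30}, 2: {10, 30, 40}, 3: {20, 30, 50}, 4: {10, 20, 60}}
tail = long_tail_items(likes)
print(long_tail_precision(recommended=[10, 40, 50, 60, 70], relevant={40, 50}, tail=tail))
```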
  8. Useful Recommendations
       Algorithm   Online
       I2I         0.126
       MF80        0.082
       MF400       0.116
       POP         0.026
     • Measured online only: I2I and MF400 yield the largest share of useful recommendations, MF80 clearly fewer, and POP the fewest (pairwise significance at p = 0.05; see the sketch below).
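The conclusions define a useful recommendation as one that is both relevant and novel to the user, which can only be judged online. A minimal sketch of that measure is given below; the per-item relevant/novel flags stand in for answers collected from study participants and are illustrative, not the paper's data.

```python
# Sketch of the "useful recommendations" measure: a recommendation counts as
# useful only if the user judged it both relevant and novel (previously unknown).
# The answers below are illustrative placeholders, not data from the study.
from typing import List, Tuple

def useful_fraction(answers: List[Tuple[bool, bool]]) -> float:
    """answers: one (relevant, novel) pair per recommended item."""
    if not answers:
        return 0.0
    return sum(relevant and novel for relevant, novel in answers) / len(answers)

# five recommendations judged by one user: (relevant?, novel?)
answers = [(True, True), (True, False), (False, True), (True, True), (False, False)]
print(f"useful fraction: {useful_fraction(answers):.2f}")  # 2 of 5 -> 0.40
```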
  9. Conclusions
     • Different algorithms were compared online and offline using a within-users experimental design.
     • The algorithm performing best according to the traditional offline accuracy measurement was significantly worse when it comes to useful (i.e. relevant and novel) recommendations measured online.
     • Academia and industry should keep investigating this topic in order to find the best possible way to validate offline evaluations.
  10. Thank you!
      Marco Rossetti, Trainline Ltd., London (@ross85)
