Toward a New Protocol to Evaluate
     Recommender Systems
Frank Meyer, Françoise Fessant, Fabrice Clerot, Eric Gaussier
Franck.meyer@orange.com

University Joseph Fourier & Orange
RecSys 2012 – Workshop on Recommendation Utility Evaluation
2012 – v1.18




Summary
    Introduction

1. Industrial tasks for recommender systems

2. Industrial (off line) protocol

3. Main results

    Conclusion and future works




p2                              Orange R&D    Orange FT-group
Recommender systems
- For industrial applications
  - Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM), ...
- as well as for well-known academic systems
  - Fab, More, Twittomender, ...
- recommendation is multi-faceted
  - pushing items, sorting items, linking items...
- and cannot be reduced to predicting a rating, i.e. a score of interest of a user u for an item i.

What is a good recommender system?
- just a system with accurate rating predictions for the top N blockbusters and the top M heavy users?
- ... or something else?
Summary
    Introduction

1. Industrial tasks for recommender systems
2. Industrial (off line) protocol

3. Main results

    Conclusion and future works




Industrial point of view
- Main goals of automatic recommendation:
  - to increase sales
  - to increase the audience (click rates...)
  - to increase customers' satisfaction and loyalty

- Main needs (analysis at Orange: TV, Video on Demand, shows, web radios, ...)
  1. helping all the users: heavy users and light users
  2. recommending all the items: frequently purchased/viewed items, rarely purchased/viewed items
  3. helping users with different identified problems:
     1. should I take this item?
     2. should I take this item or that one?
     3. what could interest me in this catalog?
     4. what is similar to this item?
We propose 4 key functions

- Help to Explore (navigate)
  - Given an item i used as a context, give N items similar to i.

- Help to Decide
  - Given a user u and an item i, give a predictive score of interest of u for i (a rating).

- Help to Compare
  - Given a user u and a list of items i1, ..., in, sort the items in decreasing order of the score of interest for u.

- Help to Discover
  - Given a user u, give N interesting items for u.
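As an illustration, the four functions can be sketched as a minimal interface (a hypothetical sketch; the class and method names are ours, not from the paper, and the constructor arguments stand for whatever underlying model is used):

```python
class Recommender:
    """Illustrative wrapper around the four key functions (names are ours)."""

    def __init__(self, predict, similar, top_n):
        self.predict = predict   # (user, item) -> predicted rating
        self.similar = similar   # (item, n) -> n items similar to item
        self.top_n = top_n       # (user, n) -> n recommended items

    def decide(self, user, item):
        """Help to Decide: predictive score of interest of user for item."""
        return self.predict(user, item)

    def compare(self, user, items):
        """Help to Compare: sort items by decreasing predicted interest."""
        return sorted(items, key=lambda i: self.predict(user, i), reverse=True)

    def explore(self, item, n):
        """Help to Explore: n items similar to the context item."""
        return self.similar(item, n)

    def discover(self, user, n):
        """Help to Discover: n interesting items for the user."""
        return self.top_n(user, n)
```

Note that Help to Compare can be derived from Help to Decide (as above), but the converse mappings are not automatic, which is why the four functions are evaluated separately.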
Decide / Compare / Discover / Explore

Decide
  Quality criterion: the rating prediction must be precise; extreme errors must be penalized because they more often lead to a wrong decision.
  Measure: RMSE (existing).

Compare
  Quality criterion: the ranking prediction must be good for any pair of items of the catalog (not only for a top N).
  Measure: NDPM (existing), or the number of compatible orders.

Discover
  Quality criterion: the recommendation must be useful. Problem: if one recommends only well-known blockbusters (e.g. Star Wars, Titanic...), one will be precise but not useful!
  Measure: Precision (existing); we introduce the Impact measure.

Explore
  Problem: the semantic relevance cannot be evaluated without user feedback; we introduce a validation method for a similarity measure.
Summary
    Introduction

1. Industrial tasks for recommender systems


2. Industrial (off line) protocol
3. Main results

    Conclusion and future works




Known vs. Unknown, Risky vs. Safe

Recommending an item for a user can be placed along two axes: the probability that the user already knows the item, and the probability that the user likes the item.

- Probably known, probably disliked: bad recommendation, but the item is generally known by name by the user.
- Probably known, probably liked: trivial recommendation; correct but not often useful.
- Probably unknown, probably disliked: very bad recommendation; the user does not know the item, and if he trusts the system he will be misled.
- Probably unknown, probably liked: very good recommendation; this is Help to Discover.
Measuring the Help to Discover

The Average Measure of Impact (AMI) is computed from a list Z of recommended items and the list H of logs (u, i, r) in the test set. The impact of a recommendation is the rarity of the item multiplied by the relative rating of the user u (relative to her mean rating), normalized by the size of the catalog.

Recommendation impact:
- Recommending a popular item: slightly negative if the user dislikes the item, slightly positive if the user likes it.
- Recommending a rare, unknown item: strongly negative if the user dislikes the item, strongly positive if the user likes it.
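One plausible reading of this measure as code (a sketch only: the exact rarity weighting and normalization used in the paper may differ; the log2 rarity weight below is our assumption, chosen so that blockbusters get a weight near zero and rare items a large one):

```python
import math

def average_measure_of_impact(recommended, test_logs, user_means, item_counts, n_items):
    """Sketch of the Average Measure of Impact (AMI).

    recommended: list of (user, item) pairs, the list Z of recommendations.
    test_logs:   dict (user, item) -> rating, the list H of logs in the test set.
    The impact of one recommendation is the rarity of the item (assumed here
    to be log2(catalog size / item rating count)) times the user's rating
    relative to her mean; only recommendations actually rated in the test
    set contribute.
    """
    impacts = []
    for user, item in recommended:
        if (user, item) in test_logs:
            rarity = math.log2(n_items / item_counts[item])          # assumed form
            relative_rating = test_logs[(user, item)] - user_means[user]
            impacts.append(rarity * relative_rating)
    return sum(impacts) / len(impacts) if impacts else 0.0
```

With this weighting, a liked rare item yields a strongly positive impact while a liked blockbuster yields an impact close to zero, matching the table above.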
Principle of the protocol

Datasets used: MovieLens 1M and Netflix. No long-tail distribution was detected in the Netflix or MovieLens datasets, so we use the simplest segmentation, based on the mean number of ratings: light/heavy users, popular/unpopular items.

The logs (userID, itemID, rating) are split into a learning set and a test set; a simple mean-based item/user segmentation is applied, and a model is learned from the learning set. The model is then evaluated on the test set:

- For each (userID, itemID) in Test: generate a rating prediction and compare it with the true rating. Measure: RMSE.
- For each list of itemIDs for each userID in Test: sort the list according to the ratings and compare the strict orders of the ratings with the order given by the model. Measure: %COMP (% compatible orders).
- For each userID in Test: generate a list of recommended items; for each of these items actually rated by userID in Test, evaluate the relevance. Measure: AMI.
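The mean-based segmentation step can be sketched as follows (illustrative code; function and variable names are ours):

```python
from collections import Counter

def mean_based_segmentation(logs):
    """Split users into light/heavy and items into unpopular/popular,
    using the mean number of ratings as the threshold, as in the protocol."""
    user_counts = Counter(user for user, _item, _rating in logs)
    item_counts = Counter(item for _user, item, _rating in logs)
    user_mean = sum(user_counts.values()) / len(user_counts)
    item_mean = sum(item_counts.values()) / len(item_counts)
    heavy_users = {u for u, c in user_counts.items() if c > user_mean}
    popular_items = {i for i, c in item_counts.items() if c > item_mean}
    return heavy_users, popular_items
```

Crossing the two splits yields the four user/item segments evaluated in the results.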
We will use 4 algorithms to validate the protocol
- Uniform Random Predictor
  - Returns a rating between 1 and 5 (min and max) drawn from a uniform distribution.
- Default Predictor: (mean of item + mean of user) / 2
  - Robust mean of the items: requires at least 10 ratings on the item; otherwise only the user's mean is used.
- K-Nearest-Neighbor item method
  - Uses K nearest neighbors per item, the scoring method detailed below, and a similarity measure called Weighted Pearson. Falls back to the Default Predictor when an item cannot be predicted.
    • Ref: Candillier, L., Meyer, F., Fessant, F. (2008). Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
- Fast factorization method
  - Fast factorization algorithm with F factors, known as Gravity ("BRISMF" implementation).
    • Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656.
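The Default Predictor can be sketched as follows (illustrative; the 10-rating robustness threshold is from the slide, the rest of the interface is ours):

```python
def default_predict(user, item, user_means, item_means, item_counts, min_ratings=10):
    """Default Predictor: (item mean + user mean) / 2, falling back to the
    user's mean alone when the item has fewer than min_ratings ratings."""
    if item_counts.get(item, 0) >= min_ratings:
        return (item_means[item] + user_means[user]) / 2
    return user_means[user]
```

Despite its simplicity, this baseline matters in the results: it performs well on the unpopular-item segments, where neighborhood and factorization models have little data.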
What about “Help to Explore”?

- How can we compare the “semantic quality” of the link between 2 items?

- Principle
  - define a similarity measure that can be extracted from the model
  - use the similarity measure to build an item-item similarity matrix
  - use the similarity matrix as the model of a KNN item-item recommender system
  - if this system performs well for RMSE, %COMP and AMI, then the semantic quality of the similarity measure must be good

- Application
  - for a KNN-item model this is immediate (there is an intrinsic similarity)
  - for a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items' factors
  - for a random rating predictor, this is not applicable...
  - for a mean-based rating predictor, this is not applicable...
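For the matrix-factorization case, the idea can be sketched as follows (illustrative code: it uses a plain Pearson correlation on the items' factor vectors, whereas the KNN model in the paper uses a Weighted Pearson similarity on ratings):

```python
import math

def pearson(x, y):
    """Pearson correlation between two items' factor vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def nearest_neighbors(item_factors, item, k):
    """K most similar items to `item` in the factor space (Help to Explore)."""
    sims = [(other, pearson(item_factors[item], f))
            for other, f in item_factors.items() if other != item]
    return sorted(sims, key=lambda s: s[1], reverse=True)[:k]
```

The resulting item-item similarity matrix can then be plugged into a KNN recommender and scored with RMSE, %COMP and AMI, as the principle above describes.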
Evaluating “Help to Explore” for Gravity

[Diagram] The items × users matrix of ratings is factorized by Gravity (fast matrix factorization) into a matrix of items' factors and a matrix of users' factors (the latter is not used here). Items' similarity computations and a K-nearest-neighbor search on the matrix of items' factors produce an item-item similarity matrix, which is used as the model of a KNN-based recommender system. The quality of this similarity matrix can then be evaluated via RMSE, %COMP, AMI...
Summary
     Introduction

1. Industrial tasks for recommender systems

2. Industrial (off line) protocol


3. Main results
     Conclusion and future works




Finding 1: different performances according to the segments

We observe a decrease in performance of more than 25% between the heavy-user/popular-item segment and the light-user/unpopular-item segment.

[Figures: RMSE for Gravity on Netflix (x-axis: number of factors) and RMSE for KNN on Netflix (x-axis: number of KNN), plotted for the 4 segments analyzed: light users/unpopular items (Luser Uitem), light users/popular items (Luser Pitem), heavy users/unpopular items (Huser Uitem) and heavy users/popular items (Huser Pitem), plus the global RMSE and the Default Predictor's RMSE.]
Finding 2: RMSE not strictly linked to the other performances

Example on 2 segments: the light-user/popular-item segment is easier to optimize than the light-user/unpopular-item segment for RMSE, but is as difficult to optimize as the light-user/unpopular-item segment for ranking.

[Figures: RMSE for Gravity on Netflix and ranking compatibility (%Compatible) for Gravity on Netflix, both as a function of the number of factors, shown per segment, plus the global values and the Default Predictor.]
Finding 2 (continued): RMSE not strictly linked to the other performances

Globally, Gravity is better than KNN for RMSE, but worse than KNN for the Average Measure of Impact.

[Figures: RMSE for Gravity on Netflix (x-axis: number of factors), RMSE for KNN on Netflix (x-axis: number of KNN), and a bar chart of the Average Measure of Impact on Netflix for the Random Predictor, the Default Predictor, KNN (K=100) and Gravity (F=32).]
Global results
Help to Decide / Compare / Discover

[Results table omitted.] Gravity dominates for the RMSE measure. KNN dominates on the heavy-user segments. The Default Predictor is very useful for the unpopular (i.e. infrequent) item segments.
Comparing native similarities with Gravity-based similarities

Similarities are measured by applying a Pearson similarity to the items' factors given by Gravity (16 factors):
1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the “Help to Explore” function.

                               Native KNN      KNN computed on Gravity's item factors
                               (K=100)         (K=100, number of factors=16)
RMSE                           0.8440          0.8691
Ranking: % compatible          77.03%          75.67%
Precision                      91.90%          86.39%
AMI                            2.043           2.025
Global time of modeling task   5290 seconds    3758 seconds
Summary
     Introduction

1. Industrial tasks for recommender systems

2. Industrial (off line) protocol

3. Main results

    Conclusion and future work
Conclusion: contributions
- As industrial recommendation is multi-faceted
  - we proposed to list the key functions of recommendation
    • Help to Decide, Help to Compare, Help to Discover, Help to Explore
    • note on Help to Explore: the similarity feature is mandatory for a recommender system
  - we proposed a dual segmentation of items and users
    • just being very accurate on heavy users and blockbuster items is not very useful

- For a new offline protocol to evaluate recommender systems
  - we proposed to use the recommender's key functions with the dual segmentation
    • mapping key functions to measures
    • adding the Impact measure to evaluate the “Help to Discover” function
    • adding a method to evaluate the “Help to Explore” function
  - we demonstrated its utility
    • RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system with RMSE alone (no guarantee on the other measures!)
    • the mapping of the best algorithm for each (function, segment) pair could be exploited to improve global performance
    • we also saw empirically that the KNN approach can be virtualized, computing the similarities between items in a factorized space built, for instance, by Gravity
Future work: 3 main axes

1. Evaluation of the quality of the 4 core functions using an online A/B testing protocol

2. Hybrid switch system: the best algorithm for each task according to the user-item segment

3. KNN virtualization via matrix factorization
Annexes



About this work...

- Frank Meyer: Recommender systems in industrial contexts. CoRR abs/1203.4487 (2012).

- Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012, Dublin.

- Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505, Lyon.
Classic mathematical representation
                 of the recommendation problem

[Figure: a sparse ratings matrix with thousands of users (columns u1, u2, ..., un)
and thousands of items (rows i1, ..., im); the filled cells are the known ratings
of interest (1 to 5), the "?" cells are the ratings of interest to predict.]

        p27                                  Orange R&D                     Orange FT-group
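The sparse matrix above can be sketched in code. This is a minimal, hypothetical representation (user and item names are made up): only the known ratings of interest are stored, and the entries absent from the dictionary are the ratings to predict.

```python
# Hypothetical sketch: the user-item rating matrix is very sparse, so it is
# usually stored as a dictionary of known ratings rather than a dense
# m x n array. The task is to predict the missing entries.
known_ratings = {
    ("u1", "i1"): 4, ("u1", "i2"): 4,
    ("u2", "i2"): 5, ("un", "i1"): 1,
}

def is_known(user, item):
    """True if the rating of interest for (user, item) is already observed."""
    return (user, item) in known_ratings

# Entries absent from the dictionary are the "ratings of interest to predict".
print(is_known("u1", "i1"))  # True: a known rating of interest
print(is_known("u2", "i1"))  # False: this rating must be predicted
```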
Well-known industrial example:
      Item-to-Items recommendation (Amazon™)




                       Orange R&D        Orange FT-group
p28
Multi-facetted analysis: measures

   RMSE: root mean squared error between predicted ratings and real ratings,
    over the logs of the Test Set.

   NDPM: a ranking measure based on the number of contradictory orders, the
    number of compatible orders, and the number of strict orders given by the
    user; on a same dataset and for a same user, the % of compatible orders is
    directly usable.

   Precision: based on the number of recommended items actually evaluable in
    the Test Set.

   AMI: Average Measure of Impact.
                                          Orange R&D                            Orange FT-group
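Two of these measures can be sketched on made-up toy data (the exact AMI normalization and NDPM tie handling from the paper are not reproduced here): RMSE over (predicted, real) rating pairs, and the share of the user's strict orders that the model's ranking reproduces, i.e. the %-compatible measure.

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, real) rating pairs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def pct_compatible(pred_scores, true_ratings):
    """Share of strictly ordered item pairs (per the user's true ratings)
    whose order the model's predicted scores reproduce (%COMP-style)."""
    compatible = strict = 0
    items = list(true_ratings)
    for a in items:
        for b in items:
            if true_ratings[a] > true_ratings[b]:    # strict order given by the user
                strict += 1
                if pred_scores[a] > pred_scores[b]:  # model agrees with that order
                    compatible += 1
    return compatible / strict

print(rmse([(3.5, 4), (2.0, 2), (5.0, 4)]))  # ~0.6455
print(pct_compatible({"i1": 4.1, "i2": 3.0}, {"i1": 5, "i2": 2}))  # 1.0
```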
Comparing native similarities with Gravity-
           based similarities
    Similarities measured applying a Pearson similarity on items’ factors given by Gravity (16 factors):
   Gravity can be used for the “Help to Explore” function
   KNN item-item can be performed on a factorized matrix with little performance loss!




    p30                                            Orange R&D                                   Orange FT-group
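The Gravity-based similarity above can be sketched as follows, with made-up factor vectors (the item names and values are hypothetical): a Pearson correlation on each item's factors, then a nearest-neighbor search on that similarity, illustrating the "KNN virtualization" idea.

```python
import math

# Hypothetical item factor vectors, standing in for the columns of the
# items' factor matrix produced by a factorization such as Gravity.
item_factors = {
    "i1": [0.9, 0.1, -0.3],
    "i2": [0.8, 0.2, -0.2],
    "i3": [-0.7, 0.9, 0.4],
}

def pearson(x, y):
    """Pearson correlation between two factor vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def k_nearest(item, k=2):
    """K most similar items to `item`, ranked on the factor space."""
    others = [o for o in item_factors if o != item]
    key = lambda o: pearson(item_factors[item], item_factors[o])
    return sorted(others, key=key, reverse=True)[:k]

print(k_nearest("i1", k=1))  # ['i2']: i2's factors correlate best with i1's
```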
Reperio C-V5
   Centralized mode, example of a movie recommender




    p31                     Orange R&D                 Orange FT-group
Reperio E-V2
   Embedded Mode, example of a TV program recommender




    p32                       Orange R&D                 Orange FT-group

