Recommendation Engines:
A key personalization feature of modern web applications




           Haralambos (Babis) Marmanis

                       NEJUG
                    June 11, 2009
Introduction      Basic Concepts   Collaborative Filtering   Content based   Netflix Prize   Summary




Presentation Outline
       1       Introduction
                  Recommendations in Action
                  “It’s the Economy ...”
                  Java source code
       2       Basic Concepts
                  The Online Music Store Example
                  Similarity
                  Distance (formulas)
                  Similarity (formulas)
                  The “best” Similarity formula
       3       Collaborative Filtering
                  User based
                  Rating Counting Matrix
                  Item based
       4       Content based


Recommendations in Action


Online store recommendations

      Amazon.com
      Provide recommendations for purchasing more items

      Netflix.com
      Provide recommendations for viewing more movies

Content recommendations

      Any news portal or other content aggregator
      Recommendations for articles, books, news stories


“It’s the Economy ...”


The Long Tail

       Goodbye Pareto Principle, Hello Long Tail
               Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith
               used a log-linear curve to describe the relationship
               between Amazon.com sales and sales ranking.
               They found that a large proportion of Amazon.com’s book
               sales come from obscure books that were not available in
               brick-and-mortar stores.
               They also found that the consumer benefit from access to
               increased product variety in online bookstores is ten times
               larger than the benefit from access to lower prices online!


Java source code


Yooreeka!

      Open Source, Machine Learning library
      Search, recommendations, clustering, classification, and
      combination of classifiers!
      URL: http://code.google.com/p/yooreeka/


The Online Music Store Example



      Frank’s music ratings
      Constantine’s music ratings
      Catherine’s music ratings


Similarity



       The notion of Similarity
               Often based on the notion of distance
               The smaller the distance, the greater the similarity
               Similarity values are typically constrained to [0,∞) or [0,1]
               It is not necessary to define similarity formulas. E.g., if
               d < ε for some chosen threshold ε, then similar; otherwise not.
               Similarity could also be empirical or probabilistic


Distance (formulas)


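      Let $X$ and $Y$ be two vectors in $\mathbb{R}^N$ with components $X_i$ and $Y_i$
      Minkowski or p-norm distance

          $$ d = \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} \tag{1} $$

      Manhattan distance (the case p = 1)

          $$ d = \sum_{i=1}^{N} |X_i - Y_i| \tag{2} $$

      Chebyshev or L∞ distance

          $$ d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} = \max_i |X_i - Y_i| \tag{3} $$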
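
The distance formulas on this slide can be sketched in a few lines of Java. This is only an illustration, not code from Yooreeka: the class name `Distances` and its method names are hypothetical, and it assumes both vectors have the same length and that p >= 1.

```java
/** Illustrative sketch of p-norm distances; names are hypothetical, not a library API. */
final class Distances {

    /** Minkowski (p-norm) distance between x and y, assuming p >= 1. */
    static double minkowski(double[] x, double[] y, double p) {
        checkSameLength(x, y);
        if (p < 1) {
            throw new IllegalArgumentException("p must be >= 1");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Manhattan distance: the Minkowski distance with p = 1. */
    static double manhattan(double[] x, double[] y) {
        return minkowski(x, y, 1.0);
    }

    /** Euclidean distance: the Minkowski distance with p = 2. */
    static double euclidean(double[] x, double[] y) {
        return minkowski(x, y, 2.0);
    }

    /** Chebyshev (L-infinity) distance: the largest coordinate difference. */
    static double chebyshev(double[] x, double[] y) {
        checkSameLength(x, y);
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    private static void checkSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {4, 6, 3};
        System.out.println("Manhattan: " + manhattan(x, y));
        System.out.println("Euclidean: " + euclidean(x, y));
        System.out.println("Chebyshev: " + chebyshev(x, y));
        // A large p already comes close to the Chebyshev value.
        System.out.println("p = 64:    " + minkowski(x, y, 64));
    }
}
```

Computing the Chebyshev distance as a plain maximum avoids the overflow that very large exponents would cause in the p-norm form.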


Similarity (formulas)



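       Naïve Similarity

           $$ \mathrm{sim}_{Naive} = \frac{\beta}{\beta + d} \tag{4} $$

       where d is the Euclidean distance.

       Similarity I

           $$ \mathrm{sim}_{I} = 1 - \tanh(\sigma) \tag{5} $$

       where σ is the biased estimator of sample variance

       Similarity II

           $$ \mathrm{sim}_{II} = \mathrm{sim}_{I} \times \frac{\mathrm{common}}{\mathrm{maximum}} \tag{6} $$

       There is more . . . Jaccard, Tanimoto, and so on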
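
As a rough sketch, formulas (4) to (6) translate into Java as follows. The class `Similarities` is hypothetical and is not taken from Yooreeka. The slide does not say how σ, "common" and "maximum" are obtained, so they are passed in as plain arguments here. The only added assumption is that β > 0.

```java
/** Illustrative sketch of the similarity formulas; names are hypothetical. */
final class Similarities {

    /** Formula (4): beta / (beta + d), with d the Euclidean distance between x and y. */
    static double naive(double[] x, double[] y, double beta) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        if (beta <= 0) {
            throw new IllegalArgumentException("beta must be positive");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumOfSquares += diff * diff;
        }
        double d = Math.sqrt(sumOfSquares);
        return beta / (beta + d);
    }

    /** Formula (5): 1 - tanh(sigma); sigma is supplied by the caller. */
    static double simI(double sigma) {
        return 1.0 - Math.tanh(sigma);
    }

    /** Formula (6): simI scaled by the ratio common / maximum. */
    static double simII(double sigma, int common, int maximum) {
        if (maximum <= 0 || common < 0 || common > maximum) {
            throw new IllegalArgumentException("need 0 <= common <= maximum and maximum > 0");
        }
        return simI(sigma) * ((double) common / maximum);
    }

    public static void main(String[] args) {
        double[] frank = {5, 3, 4};
        double[] catherine = {4, 3, 2};
        System.out.println("naive: " + naive(frank, catherine, 1.0));
        System.out.println("simI:  " + simI(0.5));
        System.out.println("simII: " + simII(0.5, 2, 4));
    }
}
```

Identical vectors give a naive similarity of 1, and the value decays toward 0 as the distance grows, which matches the idea that a smaller distance means greater similarity.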


The “best” Similarity formula



      Which is the best similarity formula?
               There is no such thing! It depends on the problem, the
               data, the definition of ... “best”
               Spertus, Sahami, and Büyükkökten (2005)
               Evaluating similarity measures: a large-scale study in the
               Orkut social network. Proceedings of the eleventh ACM
               SIGKDD international conference on Knowledge discovery
               in data mining
               The simple L2 based (cosine) similarity showed the best
               empirical results among seven similarity metrics.




      Tapestry
               Experimental mail system by Goldberg et al. (circa 1992)
               at Xerox PARC


User based



      User Similarity Matrix
                  U1     U2     U3     U4     U5     ..
      U1     [    S11    S12    S13    S14    S15    ... ]
      U2     [    S21    S22    S23    S24    S25    ... ]
      U3     [    S31    S32    S33    S34    S35    ... ]
      U4     [    S41    S42    S43    S44    S45    ... ]
      U5     [    S51    S52    S53    S54    S55    ... ]
      ..     [    ...    ...    ...    ...    ...    ... ]

      User Similarity Matrix (cont.)
          U1   U2     U3     U4     U5     ..
      U1 [1.0, 0.333, 0.385, 0.333, 0.364, ... ]
      U2 [0.0, 1.000, 0.545, 0.385, 0.615, ... ]
      U3 [0.0, 0.000, 1.000, 0.364, 0.636, ... ]
      U4 [0.0, 0.000, 0.000, 1.000, 0.231, ... ]
      U5 [0.0, 0.000, 0.000, 0.000, 1.000, ... ]
      .. [0.0, 0.000, 0.000, 0.000, 0.000, ... ]
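
The zeros below the diagonal in the matrix above suggest that only the upper triangle is filled in, since similarity is symmetric. That reading is an assumption, as is everything else in the sketch below: the class `UserSimilarityMatrix` is hypothetical, every user is assumed to have rated every item, and the similarity measure is passed in as a function so that any formula can be plugged in.

```java
import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch of an upper-triangular user similarity matrix; names are hypothetical. */
final class UserSimilarityMatrix {

    /**
     * ratings[u] holds the ratings of user u, one entry per item.
     * Only the entries with column index >= row index are computed; the rest stay 0.
     */
    static double[][] build(double[][] ratings,
                            ToDoubleBiFunction<double[], double[]> similarity) {
        int n = ratings.length;
        double[][] s = new double[n][n];
        for (int i = 0; i < n; i++) {
            s[i][i] = 1.0; // a user is fully similar to itself
            for (int j = i + 1; j < n; j++) {
                s[i][j] = similarity.applyAsDouble(ratings[i], ratings[j]);
            }
        }
        return s;
    }

    /** One possible measure: 1 / (1 + Euclidean distance). */
    static double naive(double[] a, double[] b) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 3, 4},
            {4, 3, 2},
            {1, 5, 5},
        };
        for (double[] row : build(ratings, UserSimilarityMatrix::naive)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For N users this computes N(N-1)/2 similarities, which is one reason large-scale implementations need to pay attention to speed and data size.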


Rating Counting Matrix



      Rating Counting Matrix
                  R1     R2     R3     R4     R5
      R1     [    X11    X12    X13    X14    X15    ]
      R2     [    X21    X22    X23    X24    X25    ]
      R3     [    X31    X32    X33    X34    X35    ]
      R4     [    X41    X42    X43    X44    X45    ]
      R5     [    X51    X52    X53    X54    X55    ]
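
The slide does not define the entries X_ij. One plausible reading, assumed here, is that for a pair of users X_ij counts the items that the first user rated i and the second user rated j, so that the diagonal holds the agreements. The sketch below follows that assumption. The class `RatingCountSketch` and the agreement ratio are hypothetical illustrations, not Yooreeka's `RatingCountMatrix`.

```java
import java.util.Map;

/** Illustrative sketch of a rating count matrix for two users; names are hypothetical. */
final class RatingCountSketch {

    /**
     * Counts co-occurring ratings over the items both users rated.
     * Ratings are assumed to be integers in 1..numRatingValues.
     */
    static int[][] countMatrix(Map<String, Integer> a, Map<String, Integer> b,
                               int numRatingValues) {
        int[][] x = new int[numRatingValues][numRatingValues];
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer ratingB = b.get(entry.getKey());
            if (ratingB == null) {
                continue; // item not rated by both users
            }
            int ratingA = entry.getValue();
            if (ratingA < 1 || ratingA > numRatingValues
                    || ratingB < 1 || ratingB > numRatingValues) {
                throw new IllegalArgumentException("rating out of range");
            }
            x[ratingA - 1][ratingB - 1]++;
        }
        return x;
    }

    /** Fraction of commonly rated items on which both users gave the same rating. */
    static double agreement(int[][] x) {
        int total = 0;
        int onDiagonal = 0;
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x[i].length; j++) {
                total += x[i][j];
                if (i == j) {
                    onDiagonal += x[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) onDiagonal / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> bob = Map.of("La Bamba", 5, "Song B", 3, "Song C", 4);
        Map<String, Integer> ann = Map.of("La Bamba", 5, "Song B", 2, "Song D", 1);
        int[][] x = countMatrix(bob, ann, 5);
        System.out.println("agreement = " + agreement(x));
    }
}
```

"Song B", "Song C" and the user name "ann" are made-up sample data.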

      BeanShell script (Users)
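      BaseDataset ds = MusicData.createDataset();

      // Build a user-based recommender over the music dataset
      Delphi delphi = new Delphi(ds, RecommendationType.USER_BASED);

      MusicUser mu1 = ds.pickUser("Bob");

      // Users most similar to Bob
      delphi.findSimilarUsers(mu1);

      // Items recommended for Bob
      delphi.recommend(mu1);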


Item based



      Item Similarity Matrix
          I1   I2     I3     I4     I5     ...
      I1 [1.0, 0.333, 0.385, 0.333, 0.364, ... ]
      I2 [0.0, 1.000, 0.545, 0.385, 0.615, ... ]
      I3 [0.0, 0.000, 1.000, 0.364, 0.636, ... ]
      I4 [0.0, 0.000, 0.000, 1.000, 0.231, ... ]
      I5 [0.0, 0.000, 0.000, 0.000, 1.000, ... ]
      .. [0.0, 0.000, 0.000, 0.000, 0.000, ... ]

      BeanShell script (Items)
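          // Assumes ds was created as in the user-based script
          Delphi delphi = new Delphi(ds, RecommendationType.ITEM_BASED);

          MusicUser mu1 = ds.pickUser("Bob");

          // Items recommended for Bob
          delphi.recommend(mu1);

          MusicItem mi = ds.pickItem("La Bamba");

          // Items most similar to "La Bamba"
          delphi.findSimilarItems(mi);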

      Peruse the code
          Delphi
               UserBasedSimilarity
               ItemBasedSimilarity
               BaseSimilarityMatrix
               RatingCountMatrix


Text Parsing & Analysis



      No more ratings, what do we do?
               Now we deal with documents
               So, we need to define similarity based on the content of
               the documents
               Use Lucene’s StandardAnalyzer
               Build your own! (see CustomAnalyzer)
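
To make content-based similarity concrete, here is a minimal sketch in plain Java. It does not use Lucene. The lowercase-and-split tokenizer is a deliberately crude stand-in for an analyzer such as StandardAnalyzer, and the class `ContentSimilarity` is hypothetical. Documents are reduced to term-frequency maps and compared with cosine similarity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch of content-based similarity; names are hypothetical. */
final class ContentSimilarity {

    /** Crude tokenizer: lowercase, split on anything that is not a-z or 0-9, count terms. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity of two term-frequency vectors; 0 if either document is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("Recommendation engines personalize web applications");
        Map<String, Integer> d2 = termFrequencies("Web applications need recommendation engines");
        Map<String, Integer> d3 = termFrequencies("The weather is sunny today");
        System.out.println("d1 vs d2: " + cosine(d1, d2));
        System.out.println("d1 vs d3: " + cosine(d1, d3));
    }
}
```

A real analyzer would typically also handle stop words, stemming and non-ASCII text, which is where the NLP challenges of content-based recommendations begin.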


Document representation



      No more ratings!


Netflix Prize Description



      Netflix prize
               More than 100 million ratings
               480 thousand randomly-chosen, anonymous customers
               18 thousand movie titles


Lessons learned



      Important considerations
               Data normalization
               Neighbor selection
                    How many neighbors?
                    Who are the “best” neighbors?

               Neighbor weights
               “Our experience is that most efforts should be
               concentrated in deriving substantially different approaches,
               rather than refining a single technique.”
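
The three considerations above (normalization, neighbor selection, neighbor weights) can be seen together in a small neighborhood predictor. The sketch below is one textbook-style variant, not the method of any Netflix Prize team. It normalizes by subtracting each neighbor's mean rating, picks the k most similar neighbors, and weights them by similarity. The class `NeighborPrediction` and all of its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of a mean-centered, similarity-weighted k-NN prediction. */
final class NeighborPrediction {

    /** What we know about one neighbor with respect to the target user and item. */
    static final class Neighbor {
        final double similarity; // similarity to the target user
        final double meanRating; // neighbor's average rating over all items
        final double itemRating; // neighbor's rating of the item being predicted

        Neighbor(double similarity, double meanRating, double itemRating) {
            this.similarity = similarity;
            this.meanRating = meanRating;
            this.itemRating = itemRating;
        }
    }

    /** Predicts the target user's rating from the k most similar neighbors. */
    static double predict(double targetMean, List<Neighbor> neighbors, int k) {
        List<Neighbor> sorted = new ArrayList<>(neighbors);
        // Neighbor selection: most similar first, keep at most k
        sorted.sort(Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
        double weightedDeviation = 0.0;
        double totalWeight = 0.0;
        for (Neighbor n : sorted.subList(0, Math.min(Math.max(k, 0), sorted.size()))) {
            // Data normalization: use the deviation from the neighbor's own mean
            weightedDeviation += n.similarity * (n.itemRating - n.meanRating);
            // Neighbor weights: the similarity itself
            totalWeight += Math.abs(n.similarity);
        }
        return totalWeight == 0.0 ? targetMean : targetMean + weightedDeviation / totalWeight;
    }

    public static void main(String[] args) {
        List<Neighbor> neighbors = Arrays.asList(
            new Neighbor(0.9, 4.0, 5.0),
            new Neighbor(0.1, 3.0, 1.0));
        System.out.println("k = 1: " + predict(3.0, neighbors, 1));
        System.out.println("k = 2: " + predict(3.0, neighbors, 2));
    }
}
```

Changing k or the weighting scheme changes the prediction, which is why the choice of neighbors and weights deserves attention.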




      Important considerations
               Business value validation - “Long Tail”, “niches to riches”,
               etc.
               Similarity metrics - Many to choose from, do not be afraid
               to explore!
               Collaborative Filtering: “Show me your friend ...”
                   User based
                   Item based

               Content based recommendations - NLP challenges
               Large scale implementations - Speed, data size, quality

Recommendation Engines

  • 1. Recommendation Engines: A key personalization feature of modern web applications Haralambos (Babis) Marmanis NEJUG June 11, 2009
  • 2. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Presentation Outline 1 Introduction Recommendations in Action “It’s the Economy ...” Java source code 2 Basic Concepts The Online Music Store Example Similarity Distance (formulas) Similarity (formulas) The ”best” Similarity formula 3 Collaborative Filtering User based Rating Counting Matrix Item based 4 Content based
  • 3. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Recommendations in Action Online store recommendations Amazon.com Provide recommendations for purchasing more items
  • 4. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Recommendations in Action Online store recommendations Netflix.com Provide recommendations for viewing more movies
  • 5. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Recommendations in Action Content recommendations Any news portal or other content aggregator Recommendations for articles, books, news stories
  • 6. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary “It’s the Economy ...” The Long Tail Goodbye Pareto Principle, Hello Long Tail Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith, used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking. They found that a large proportion of Amazon.com’s book sales come from obscure books that were not available in brick-and-mortar stores. They also found that consumer benefit from access to increased product variety in online book stores is ten times larger than their benefit from access to lower prices online!
  • 7. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary “It’s the Economy ...” The Long Tail Goodbye Pareto Principle, Hello Long Tail Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith, used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking. They found that a large proportion of Amazon.com’s book sales come from obscure books that were not available in brick-and-mortar stores. They also found that consumer benefit from access to increased product variety in online book stores is ten times larger than their benefit from access to lower prices online!
  • 8. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary “It’s the Economy ...” The Long Tail Goodbye Pareto Principle, Hello Long Tail Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith, used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking. They found that a large proportion of Amazon.com’s book sales come from obscure books that were not available in brick-and-mortar stores. They also found that consumer benefit from access to increased product variety in online book stores is ten times larger than their benefit from access to lower prices online!
  • 9. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Java source code Yooreeka! Open Source, Machine Learning library Search, recommendations, clustering, classification, and combination of classifiers! URL: http://code.google.com/p/yooreeka/
  • 10. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Presentation Outline 1 Introduction Recommendations in Action “It’s the Economy ...” Java source code 2 Basic Concepts The Online Music Store Example Similarity Distance (formulas) Similarity (formulas) The ”best” Similarity formula 3 Collaborative Filtering User based Rating Counting Matrix Item based 4 Content based
  • 11. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The Online Music Store Example Frank’s music ratings
  • 12. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The Online Music Store Example Constantine’s music ratings
  • 13. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The Online Music Store Example Catherine’s music ratings
  • 14. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity The notion of Similarity Often based on the notion of distance The smaller the distance, the greater the similarity Similarity values, typically, constrained in [0,∞) or [0,1] It is not necessary to define similarity formulas. E.g. if d < then similar, otherwise not. Similarity could also be empirical or probabilistic
  • 15. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity The notion of Similarity Often based on the notion of distance The smaller the distance, the greater the similarity Similarity values, typically, constrained in [0,∞) or [0,1] It is not necessary to define similarity formulas. E.g. if d < then similar, otherwise not. Similarity could also be empirical or probabilistic
  • 16. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity The notion of Similarity Often based on the notion of distance The smaller the distance, the greater the similarity Similarity values, typically, constrained in [0,∞) or [0,1] It is not necessary to define similarity formulas. E.g. if d < then similar, otherwise not. Similarity could also be empirical or probabilistic
  • 17. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity The notion of Similarity Often based on the notion of distance The smaller the distance, the greater the similarity Similarity values, typically, constrained in [0,∞) or [0,1] It is not necessary to define similarity formulas. E.g. if d < then similar, otherwise not. Similarity could also be empirical or probabilistic
  • 18. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity The notion of Similarity Often based on the notion of distance The smaller the distance, the greater the similarity Similarity values, typically, constrained in [0,∞) or [0,1] It is not necessary to define similarity formulas. E.g. if d < then similar, otherwise not. Similarity could also be empirical or probabilistic
  • 19. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Distance (formulas) Let Xi and Yi be two vectors in RN Minkowski or p-norm distance 1 N p d = |Xi − Yi |p (1) i=1 Manhattan distance d = max |Xi − Yi | (2) i Chebychev or L∞ distance 1 N p d = lim |Xi − Yi |p (3) p→∞ i=1
  • 20. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Distance (formulas) Let Xi and Yi be two vectors in RN Minkowski or p-norm distance 1 N p d = |Xi − Yi |p (1) i=1 Manhattan distance d = max |Xi − Yi | (2) i Chebychev or L∞ distance 1 N p d = lim |Xi − Yi |p (3) p→∞ i=1
  • 21. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Distance (formulas) Let Xi and Yi be two vectors in RN Minkowski or p-norm distance 1 N p d = |Xi − Yi |p (1) i=1 Manhattan distance d = max |Xi − Yi | (2) i Chebychev or L∞ distance 1 N p d = lim |Xi − Yi |p (3) p→∞ i=1
  • 22. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Distance (formulas) Let Xi and Yi be two vectors in RN Minkowski or p-norm distance 1 N p d = |Xi − Yi |p (1) i=1 Manhattan distance d = max |Xi − Yi | (2) i Chebychev or L∞ distance 1 N p d = lim |Xi − Yi |p (3) p→∞ i=1
  • 23. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity (formulas) Na¨ve Similarity ı β simNaive = (4) β+d where d is the Euclidean distance. Similarity I simI = 1 − tanh(σ) (5) where σ is the biased estimator of sample variance Similarity II common simII = simI × (6) maximum There is more . . . Jaccard, Tanimoto, and so on
  • 24. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity (formulas) Na¨ve Similarity ı β simNaive = (4) β+d where d is the Euclidean distance. Similarity I simI = 1 − tanh(σ) (5) where σ is the biased estimator of sample variance Similarity II common simII = simI × (6) maximum There is more . . . Jaccard, Tanimoto, and so on
  • 25. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Similarity (formulas) Na¨ve Similarity ı β simNaive = (4) β+d where d is the Euclidean distance. Similarity I simI = 1 − tanh(σ) (5) where σ is the biased estimator of sample variance Similarity II common simII = simI × (6) maximum There is more . . . Jaccard, Tanimoto, and so on
  • 26. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The ”best” Similarity formula Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... ”best” ¨ ¨ ¨ Spertus,Sahami, and Buyukkokten (2005) Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining The simple L2 based (cosine) similarity showed the best empirical results among seven similarity metrics.
  • 27. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The ”best” Similarity formula Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... ”best” ¨ ¨ ¨ Spertus,Sahami, and Buyukkokten (2005) Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining The simple L2 based (cosine) similarity showed the best empirical results among seven similarity metrics.
  • 28. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The ”best” Similarity formula Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... ”best” ¨ ¨ ¨ Spertus,Sahami, and Buyukkokten (2005) Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining The simple L2 based (cosine) similarity showed the best empirical results among seven similarity metrics.
  • 29. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary The ”best” Similarity formula Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... ”best” ¨ ¨ ¨ Spertus,Sahami, and Buyukkokten (2005) Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining The simple L2 based (cosine) similarity showed the best empirical results among seven similarity metrics.
  • 30. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Presentation Outline 1 Introduction Recommendations in Action “It’s the Economy ...” Java source code 2 Basic Concepts The Online Music Store Example Similarity Distance (formulas) Similarity (formulas) The ”best” Similarity formula 3 Collaborative Filtering User based Rating Counting Matrix Item based 4 Content based
  • 31. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Tapestry Experimental mail system by Goldberg et al. (circa 1992) in Xerox PARC
  • 32. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Tapestry Experimental mail system by Goldberg et al. (circa 1992) in Xerox PARC
  • 33. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Tapestry Experimental mail system by Goldberg et al. (circa 1992) in Xerox PARC
  • 34. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Tapestry Experimental mail system by Goldberg et al. (circa 1992) in Xerox PARC
  • 35. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Tapestry Experimental mail system by Goldberg et al. (circa 1992) in Xerox PARC
  • 36. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary User based User Similarity Matrix U1 U2 U3 U4 U5 .. U1 [ S11 S12 S13 S14 S15 ... ] U2 [ S21 S22 S23 S24 S25 ... ] U3 [ S31 S32 S33 S34 S35 ... ] U4 [ S41 S42 S43 S44 S45 ... ] U5 [ S51 S52 S53 S54 S55 ... ] .. [ ... ... ... ... ... ... ]
  • 37. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary User based User Similarity Matrix (cont.) U1 U2 U3 U4 U5 .. U1 [1.0, 0.333, 0.385, 0.333, 0.364, ... ] U2 [0.0, 1.000, 0.545, 0.385, 0.615, ... ] U3 [0.0, 0.000, 1.000, 0.364, 0.636, ... ] U4 [0.0, 0.000, 0.000, 1.000, 0.231, ... ] U5 [0.0, 0.000, 0.000, 0.000, 1.000, ... ] .. [0.0, 0.000, 0.000, 0.000, 0.000, ... ]
  • 38. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Rating Counting Matrix Rating Counting Matrix R1 R2 R3 R4 R5 R1 [ X11 X12 X13 X14 X15 ] R2 [ X21 X22 X23 X24 X25 ] R3 [ X31 X32 X33 X34 X35 ] R4 [ X41 X42 X43 X44 X45 ] R5 [ X51 X52 X53 X54 X55 ]
  • 39. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Rating Counting Matrix BeanShell script (Users) BaseDataset ds = MusicData.createDataset(); Delphi delphi = new Delphi(ds,RecommendationType.USER_BASED); MusicUser mu1 = ds.pickUser("Bob"); delphi.findSimilarUsers(mu1); delphi.recommend(mu1);
  • 40. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Item Similarity Matrix I1 I2 I3 I4 I5 ... I1 [1.0, 0.333, 0.385, 0.333, 0.364, ... ] I2 [0.0, 1.000, 0.545, 0.385, 0.615, ... ] I3 [0.0, 0.000, 1.000, 0.364, 0.636, ... ] I4 [0.0, 0.000, 0.000, 1.000, 0.231, ... ] I5 [0.0, 0.000, 0.000, 0.000, 1.000, ... ] .. [0.0, 0.000, 0.000, 0.000, 0.000, ... ]
  • 41. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based BeanShell script (Items) Delphi delphi = new Delphi(ds,RecommendationType.ITEM_BASED); MusicUser mu1 = ds.pickUser("Bob"); delphi.recommend(mu1); MusicItem mi = ds.pickItem("La Bamba"); delphi.findSimilarItems(mi);
  • 42. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Peruse the code Delphi UserBasedSimilarity ItemBasedSimilarity BaseSimilarityMatrix RatingCountMatrix
  • 43. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Peruse the code Delphi UserBasedSimilarity ItemBasedSimilarity BaseSimilarityMatrix RatingCountMatrix
  • 44. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Peruse the code Delphi UserBasedSimilarity ItemBasedSimilarity BaseSimilarityMatrix RatingCountMatrix
  • 45. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Peruse the code Delphi UserBasedSimilarity ItemBasedSimilarity BaseSimilarityMatrix RatingCountMatrix
  • 46. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Item based Peruse the code Delphi UserBasedSimilarity ItemBasedSimilarity BaseSimilarityMatrix RatingCountMatrix
  • 47. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Presentation Outline 1 Introduction Recommendations in Action “It’s the Economy ...” Java source code 2 Basic Concepts The Online Music Store Example Similarity Distance (formulas) Similarity (formulas) The ”best” Similarity formula 3 Collaborative Filtering User based Rating Counting Matrix Item based 4 Content based
  • 48. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary Text Parsing & Analysis No more ratings, what do we do? Now we deal with documents So, we need to define similarity based on the content of the documents Use Lucene’s StandardAnalyzer Build your own! (see CustomAnalyzer)
  • 52. Document representation: No more ratings!
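One common way to define similarity on document content is to turn each document into a vector of term counts and compare the vectors. The sketch below assumes that approach. A crude lowercase-and-split tokenizer stands in for a real analyzer such as Lucene’s StandardAnalyzer or the CustomAnalyzer mentioned above, and no Lucene API is used. The class name `ContentSimilaritySketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of content-based similarity; not the presentation's code. */
class ContentSimilaritySketch {

    /** Lowercases the text, splits on anything but a-z, and counts each term. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double denominator = length(a) * length(b);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    /** Euclidean length of a term-count vector. */
    private static double length(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Calling `cosine(termFrequencies(docA), termFrequencies(docB))` yields a score between 0 and 1 that can play the role ratings-based similarity played in collaborative filtering. Raw counts are the simplest possible document representation; term weighting and better analysis are where the NLP challenges begin.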
  • 56. Netflix Prize: Description: More than 100 million ratings; 480 thousand randomly chosen, anonymous customers; 18 thousand movie titles
  • 60. Lessons learned: Important considerations: Data normalization; Neighbor selection (How many neighbors? Who are the “best” neighbors?); Neighbor weights. “Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.”
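To make the three considerations concrete, here is a minimal sketch of a neighborhood-based rating prediction. It is not the method of any Netflix Prize team. It assumes normalization by subtracting each neighbor's mean rating, selection of the k most similar neighbors, and similarity values used as weights. The names `NeighborPredictionSketch`, `Neighbor` and `predict` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a k-nearest-neighbor rating prediction. */
class NeighborPredictionSketch {

    /** A candidate neighbor of the target user, for one target item. */
    static final class Neighbor {
        final double similarity; // similarity to the target user
        final double meanRating; // the neighbor's average rating over all items
        final double itemRating; // the neighbor's rating of the target item

        Neighbor(double similarity, double meanRating, double itemRating) {
            this.similarity = similarity;
            this.meanRating = meanRating;
            this.itemRating = itemRating;
        }
    }

    /**
     * Predicts the target user's rating as the user's own mean plus the
     * similarity-weighted average deviation of the k most similar neighbors.
     * Falls back to the user's mean when no neighbor carries any weight.
     */
    static double predict(double userMean, List<Neighbor> candidates, int k) {
        // Neighbor selection: keep the k candidates with the highest similarity.
        List<Neighbor> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
        int limit = Math.max(0, Math.min(k, sorted.size()));

        double weightedDeviation = 0.0;
        double totalWeight = 0.0;
        for (Neighbor n : sorted.subList(0, limit)) {
            // Data normalization: use the deviation from the neighbor's own mean.
            weightedDeviation += n.similarity * (n.itemRating - n.meanRating);
            // Neighbor weights: here simply the similarity itself.
            totalWeight += Math.abs(n.similarity);
        }
        return totalWeight == 0.0 ? userMean : userMean + weightedDeviation / totalWeight;
    }
}
```

Each of the slide's questions maps to one knob in this sketch: `k` answers “how many neighbors?”, the sort key answers “who are the best neighbors?”, and the weight term can be replaced by any other weighting scheme.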
  • 65. Summary: Important considerations: Business value validation (“Long Tail”, “niches to riches”, etc.). Similarity metrics: many to choose from, do not be afraid to explore! Collaborative Filtering (“Show me your friend ...”): user based, item based. Content-based recommendations: NLP challenges. Large-scale implementations: speed, data size, quality.