Offline Components:
Collaborative Filtering in Cold-start Situations
Problem

  Algorithm selects item j with item features x_j (keywords, content categories, ...)
  User i visits with user features x_i (demographics, browse history, search history, ...)
  Each (i, j) pair yields a response y_ij (explicit rating, or implicit click/no-click)

  Goal: predict the unobserved entries based on features and the observed entries

 Deepak Agarwal & Bee-Chung Chen @ ICML’11                                                2
Model Choices

• Feature-based (or content-based) approach
   – Use features to predict response
       • (regression, Bayes Net, mixture models, …)
   – Limitation: need predictive features
       • Bias often high, does not capture signals at granular levels


• Collaborative filtering (CF aka Memory based)
   – Make recommendation based on past user-item interaction
      • User-user, item-item, matrix factorization, …
      • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08
           Tutorial], etc.
   – Better performance for old users and old items
   – Does not naturally handle new users and new items (cold-start)


 Deepak Agarwal & Bee-Chung Chen @ ICML’11                              3
Collaborative Filtering (Memory based methods)

    • User-user similarity
    • Item-item similarities
    • Estimating similarities
        – Pearson’s correlation
        – Optimization based (Koren et al)

    [Figure: user-user and item-item similarity formulas]




  Deepak Agarwal & Bee-Chung Chen @ ICML’11             4
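A minimal sketch of the memory-based idea above: item-item similarity via Pearson correlation over the users who rated both items. The rating dictionary and function name are illustrative, not from the tutorial.

```python
import numpy as np

def pearson_item_similarity(ratings, item_a, item_b):
    """Pearson correlation between two items over users who rated both.

    ratings: dict mapping user -> {item: rating}.
    Returns 0.0 when fewer than two co-rating users exist.
    """
    common = [u for u in ratings if item_a in ratings[u] and item_b in ratings[u]]
    if len(common) < 2:
        return 0.0
    a = np.array([ratings[u][item_a] for u in common], dtype=float)
    b = np.array([ratings[u][item_b] for u in common], dtype=float)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# Toy usage
ratings = {
    "u1": {"i1": 5, "i2": 4, "i3": 1},
    "u2": {"i1": 4, "i2": 5},
    "u3": {"i1": 1, "i2": 2, "i3": 5},
}
print(pearson_item_similarity(ratings, "i1", "i2"))
```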
How to Deal with the Cold-Start Problem

•   Heuristic-based approaches
      – Linear combination of regression and CF models

      – Filterbot
          • Add user features as pseudo users and do collaborative filtering
      – Hybrid approaches
          • Use content-based predictions to fill in missing entries, then apply CF


• Matrix Factorization
      – Good performance on Netflix (Koren, 2009)
•   Model-based approaches
      – Bilinear random-effects model (probabilistic matrix factorization)
         • Good on Netflix data [Ruslan et al ICML, 2009]
      – Add feature-based regression to matrix factorization
         • (Agarwal and Chen, 2009)
      – Add topic discovery (from textual items) to matrix factorization
         • (Agarwal and Chen, 2009; Chun and Blei, 2011)
    Deepak Agarwal & Bee-Chung Chen @ ICML’11                                  5
Per-item regression models

• When tracking users by cookies, the distribution of visit patterns
  can be extremely skewed
    – Majority of cookies have 1-2 visits


• Per-item models (regression) based on user covariates are
  attractive in such cases




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                  6
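As a sketch of the per-item idea, here is one independent logistic regression per item on user covariates. The data, dimensions, and scikit-learn usage are illustrative; the tutorial's actual models (next slide) share structure across items rather than fitting them fully independently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users, n_items, n_features = 1000, 5, 8

X_user = rng.normal(size=(n_users, n_features))                   # user covariates
true_w = rng.normal(size=(n_items, n_features))                   # hidden per-item weights
clicks = rng.binomial(1, 1 / (1 + np.exp(-X_user @ true_w.T)))    # simulated responses

# Fit one regularized logistic regression per item; a brand-new user still gets
# a prediction because only user covariates are needed.
per_item_models = []
for j in range(n_items):
    model = LogisticRegression(C=1.0).fit(X_user, clicks[:, j])
    per_item_models.append(model)

new_user = rng.normal(size=(1, n_features))
ctr_estimates = [m.predict_proba(new_user)[0, 1] for m in per_item_models]
print(np.round(ctr_estimates, 3))
```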
Several per-item regressions: Multi-task learning

 • Agarwal, Chen and Elango, KDD, 2010

   [Figure: the per-item regressions share a low-dimensional (5-10) projection B,
    estimated from retrospective data, mapping user covariates to an affinity to
    old items]




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                   7
Per-user, per-item models
via bilinear random-effects model
Motivation

• Data measuring k-way interactions pervasive
    – Consider k = 2 for all our discussions
       • E.g. User-Movie, User-content, User-Publisher-Ads,….
    – Power law on both user and item degrees


• Classical Techniques
    – Approximate matrix through a singular value decomposition (SVD)
       • After adjusting for marginal effects (user pop, movie pop,..)
    – Does not work
       • Matrix highly incomplete, severe over-fitting
    – Key issue
       • Regularization of eigenvectors (factors) to avoid overfitting



  Deepak Agarwal & Bee-Chung Chen @ ICML’11                         9
Early work on complete matrices


• Tukey’s 1-df model (1956)
    – Rank 1 approximation of small nearly complete matrix


• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor model only;
  small data sets; 1960s)
• Modern day recommender problems
    – Highly incomplete, large, noisy.




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                  10
Latent Factor Models

   [Figure: users and items as points in a latent space with axes “sporty” and “newsy”;
    user factor u and item factor v give Affinity = u’v.
    In the simplex variant, user topic-affinity s and item topic vector z give Affinity = s’z]
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                       11
Factorization – Brief Overview

 • Latent user factors:   (\alpha_i,\ u_i = (u_{i1}, \ldots, u_{in}))
 • Latent movie factors:  (\beta_j,\ v_j = (v_{j1}, \ldots, v_{jn}))

 • Interaction:   E(y_{ij}) = \mu + \alpha_i + \beta_j + u_i' B v_j

 • (Nn + Mm) parameters: will overfit for moderate values of n, m

 • Key technical issue: regularization
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                12
Latent Factor Models: Different Aspects




• Matrix Factorization
    – Factors in Euclidean space
    – Factors on the simplex


• Incorporating features and ratings simultaneously
• Online updates




  Deepak Agarwal & Bee-Chung Chen @ ICML’11           13
Maximum Margin Matrix Factorization (MMMF)


• Complete the matrix by minimizing a loss (hinge, squared error)
  on observed entries subject to constraints on the trace norm
    – Srebro, Rennie, Jaakkola (NIPS 2004)
       • Convex, semi-definite programming (expensive, not scalable)


• Fast MMMF (Rennie & Srebro, ICML, 2005)
    – Constrain the Frobenius norm of the left and right eigenvector
      matrices; not convex, but becomes scalable


• Other variation: Ensemble MMMF (DeCoste, ICML 2005)
    – Ensembles of partially trained MMMF (some improvements)




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                         14
Matrix Factorization for Netflix prize data

• Minimize the objective function


          \sum_{(i,j) \in \text{obs}} (r_{ij} - u_i^T v_j)^2 + \lambda \Big( \sum_i \|u_i\|^2 + \sum_j \|v_j\|^2 \Big)




• Simon Funk: Stochastic Gradient Descent


• Koren et al (KDD 2007): Alternating Least Squares
    – They moved to SGD later in the competition




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                   15
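A minimal sketch of the SGD approach to this objective on synthetic data; the rank, learning rate, and regularization values are illustrative, not the settings used in the competition.

```python
import numpy as np

def sgd_matrix_factorization(ratings, n_users, n_items, rank=10,
                             lam=0.05, lr=0.01, n_epochs=20, seed=0):
    """Minimize sum (r_ij - u_i'v_j)^2 + lam(||u_i||^2 + ||v_j||^2) over observed entries.

    ratings: list of (i, j, r) triples for observed entries.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, rank))
    V = 0.1 * rng.normal(size=(n_items, rank))
    for _ in range(n_epochs):
        rng.shuffle(ratings)
        for i, j, r in ratings:
            err = r - U[i] @ V[j]
            U[i] += lr * (err * V[j] - lam * U[i])   # gradient step on u_i
            V[j] += lr * (err * U[i] - lam * V[j])   # gradient step on v_j
    return U, V

# Toy usage: 3 users, 4 items, a few observed ratings
obs = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 2.0), (2, 3, 5.0)]
U, V = sgd_matrix_factorization(obs, n_users=3, n_items=4, rank=2)
print(round(U[0] @ V[0], 2))   # reconstructed estimate of r_00
```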
Probabilistic Matrix Factorization
                            (Ruslan & Mnih, 2008, NIPS)

   [Graphical model: hyper-parameters a_u, a_v → user factors u_i and item factors v_j → rating r_ij, with noise variance σ²]

   • Optimization is through Gradient Descent (Iterated Conditional Modes)

   • Other variations: constraining the mean through a sigmoid, using “who-rated-whom”

   • Combining with Boltzmann Machines also improved performance

         r_{ij} \sim N(u_i^T v_j,\ \sigma^2)
         u_i \sim MVN(0,\ a_u I)
         v_j \sim MVN(0,\ a_v I)
        Deepak Agarwal & Bee-Chung Chen @ ICML’11                                      16
Bayesian Probabilistic Matrix Factorization
(Ruslan and Mnih, ICML 2008)

• Fully Bayesian treatment using an MCMC approach
• Significant improvement


• Interpretation as a fully Bayesian hierarchical model
  shows why that is the case
    – Failing to incorporate uncertainty leads to bias in estimates
    – Multi-modal posterior, MCMC helps in converging to a better one




                     MCEM also more resistant to over-fitting
   [Figure: estimates of the variance component a_u]

  Deepak Agarwal & Bee-Chung Chen @ ICML’11                             17
Non-parametric Bayesian matrix completion
(Zhou et al, SAM, 2010)

• Specify rank probabilistically (automatic rank selection)
           y_{ij} \sim N\Big(\sum_{k=1}^{r} z_k u_{ik} v_{jk},\ \sigma^2\Big)

           z_k \sim \text{Ber}(\pi_k)
           \pi_k \sim \text{Beta}(a/r,\ b(r-1)/r)
           \text{Marginally, } z_k \sim \text{Ber}\big(a/(a + b(r-1))\big)
           E(\#\text{Factors}) = ra/(a + b(r-1))

  Deepak Agarwal & Bee-Chung Chen @ ICML’11                   18
How to incorporate features:
Deal with both warm start and cold-start

• Models to predict ratings for new pairs
    – Warm-start: (user, movie) present in the training data with
      large sample size
    – Cold-start: At least one of (user, movie) new or has small
      sample size
        • Rough definition, warm-start/cold-start is a continuum.


• Challenges
    – Highly incomplete (user, movie) matrix
    – Heavy tailed degree distributions for users/movies
         • Large fraction of ratings from small fraction of users/movies
    – Handling both warm-start and cold-start effectively in the
      presence of predictive features
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                19
Possible approaches

• Large scale regression based on covariates
    – Does not provide good estimates for heavy users/movies
    – Large number of predictors to estimate interactions


• Collaborative filtering
    – Neighborhood based
    – Factorization
       • Good for warm-start; cold-start dealt with separately


• Single model that handles cold-start and warm-start
    – Heavy users/movies → User/movie specific model
    – Light users/movies → fallback on regression model
    – Smooth fallback mechanism for good performance




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                      20
Add Feature-based Regression into
                Matrix Factorization

RLFM: Regression-based Latent Factor Model
Regression-based Factorization Model (RLFM)

• Main idea: Flexible prior, predict factors through
  regressions
• Seamlessly handles cold-start and warm-start
• Modified state equation to incorporate covariates




  Deepak Agarwal & Bee-Chung Chen @ ICML’11            22
RLFM: Model

 • Rating:       y_{ij} \sim N(\mu_{ij}, \sigma^2)              Gaussian model
   user i        y_{ij} \sim \text{Bernoulli}(\mu_{ij})         Logistic model (for binary ratings)
   gives         y_{ij} \sim \text{Poisson}(\mu_{ij} N_{ij})    Poisson model (for counts)
   item j
                       t(\mu_{ij}) = x_{ij}' b + \alpha_i + \beta_j + u_i' v_j


     • Bias of user i:         \alpha_i = g_0' x_i + \epsilon_i^\alpha,   \epsilon_i^\alpha \sim N(0, \sigma_\alpha^2)

     • Popularity of item j:   \beta_j = d_0' x_j + \epsilon_j^\beta,     \epsilon_j^\beta \sim N(0, \sigma_\beta^2)

     • Factors of user i:      u_i = G x_i + \epsilon_i^u,                \epsilon_i^u \sim N(0, \sigma_u^2 I)

     • Factors of item j:      v_j = D x_j + \epsilon_j^v,                \epsilon_j^v \sim N(0, \sigma_v^2 I)


       Could use other classes of regression models

 Deepak Agarwal & Bee-Chung Chen @ ICML’11                                           23
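A small sketch of how the RLFM state equations combine into a score for a (user, item) pair under the Gaussian model. The regression matrices and feature vectors here are random placeholders, not fitted values; in practice they come out of the fitting procedure described later.

```python
import numpy as np

rng = np.random.default_rng(0)
p_user, p_item, p_pair, rank = 6, 5, 3, 4

# Illustrative regression parameters (learned in practice, e.g. by MCEM)
b  = rng.normal(size=p_pair)          # weights on pair features x_ij
g0 = rng.normal(size=p_user)          # user-bias regression
d0 = rng.normal(size=p_item)          # item-popularity regression
G  = rng.normal(size=(rank, p_user))  # user-factor regression
D  = rng.normal(size=(rank, p_item))  # item-factor regression

def rlfm_score(x_i, x_j, x_ij, warm_alpha=None, warm_u=None):
    """t(mu_ij) = x_ij'b + alpha_i + beta_j + u_i'v_j.

    For a cold-start user the random effects fall back to their regression
    means (g0'x_i, G x_i); for a warm user, posterior estimates can be passed in.
    """
    alpha_i = warm_alpha if warm_alpha is not None else g0 @ x_i
    beta_j  = d0 @ x_j
    u_i     = warm_u if warm_u is not None else G @ x_i
    v_j     = D @ x_j
    return x_ij @ b + alpha_i + beta_j + u_i @ v_j

x_i, x_j, x_ij = rng.normal(size=p_user), rng.normal(size=p_item), rng.normal(size=p_pair)
print(round(rlfm_score(x_i, x_j, x_ij), 3))                       # cold-start prediction
print(round(rlfm_score(x_i, x_j, x_ij, warm_alpha=0.7,
                       warm_u=rng.normal(size=rank)), 3))         # warm-start prediction
```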
Graphical representation of the model




  Deepak Agarwal & Bee-Chung Chen @ ICML’11   24
Advantages of RLFM

• Better regularization of factors
   – Factor estimates “shrink” towards a covariate-based centroid

• Cold-start: Fallback regression model (FeatureOnly)




  Deepak Agarwal & Bee-Chung Chen @ ICML’11             25
RLFM: Illustration of Shrinkage


Plot the first factor
value for each user
(fitted using Yahoo! FP
data)




   Deepak Agarwal & Bee-Chung Chen @ ICML’11   26
Induced correlations among observations

                                              Hierarchical random-effects model

                                              Marginal distribution obtained by
                                              integrating out random effects




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                   27
Closer look at induced marginal correlations




  Deepak Agarwal & Bee-Chung Chen @ ICML’11    28
Model fitting: EM for our class of models




  Deepak Agarwal & Bee-Chung Chen @ ICML’11   29
The parameters for RLFM

• Latent parameters


            Δ = ({αi }, {β j }, {ui }, {v j })

• Hyper-parameters



           Θ = (b, G, D, A u = a u I, A v = a v I)

 Deepak Agarwal & Bee-Chung Chen @ ICML’11       30
Computing the mode




                            [Equation: the objective (negative log-posterior) that is minimized to compute the mode]




 Deepak Agarwal & Bee-Chung Chen @ ICML’11   31
The EM algorithm




 Deepak Agarwal & Bee-Chung Chen @ ICML’11   32
Computing the E-step

• Often hard to compute in closed form
• Stochastic EM (Markov Chain EM; MCEM)
    – Compute the expectation by drawing samples from the posterior of the latent factors



    – Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM)
    – Faster but biased hyper-parameter estimates




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                     33
Monte Carlo E-step

• Through a vanilla Gibbs sampler (conditionals closed form)




• Other conditionals also Gaussian and closed form
• Conditionals of users (movies) sampled simultaneously
• Small number of samples in early iterations, large numbers
  in later iterations
  Deepak Agarwal & Bee-Chung Chen @ ICML’11               34
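A sketch of one such closed-form conditional: with Gaussian ratings, the full conditional of a user factor u_i given the item factors, its regression mean G x_i, and the residuals is Gaussian. The dimensions and variable names below are illustrative, assuming the RLFM setup on the previous slides.

```python
import numpy as np

def sample_user_factor(rng, V_rated, resid, Gx_i, sigma2, sigma_u2):
    """Draw u_i from its Gaussian full conditional (Gaussian-rating RLFM).

    V_rated : (n_i, r) item factors of the items user i rated
    resid   : (n_i,)   y_ij minus all terms other than u_i'v_j (x_ij'b + alpha_i + beta_j)
    Gx_i    : (r,)     prior regression mean of u_i
    """
    r = len(Gx_i)
    prec = np.eye(r) / sigma_u2 + V_rated.T @ V_rated / sigma2      # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (Gx_i / sigma_u2 + V_rated.T @ resid / sigma2)     # posterior mean
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(1)
u_draw = sample_user_factor(rng,
                            V_rated=rng.normal(size=(7, 3)),        # 7 rated items, rank 3
                            resid=rng.normal(size=7),
                            Gx_i=np.zeros(3),
                            sigma2=1.0, sigma_u2=0.5)
print(np.round(u_draw, 3))
```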
M-step (Why MCEM is better than ICM)

• Update G: optimize the expected complete-data log-likelihood
  (a regression of the sampled factors on covariates)



• Update Au = au I




         Ignored by ICM, underestimates factor variability
         Factors over-shrunk, posterior not explored well


  Deepak Agarwal & Bee-Chung Chen @ ICML’11                  35
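A sketch of these two updates, assuming Gibbs samples of the user factors are available: G is refit by least squares of the posterior-mean factors on covariates, and a_u is updated from the average squared deviation of the samples around G x_i, which retains the within-posterior spread that ICM drops. Variable names and shapes are illustrative.

```python
import numpy as np

def m_step_user_prior(U_samples, X_user):
    """Update G and a_u from Gibbs samples of user factors.

    U_samples : (S, N, r) Monte Carlo samples of all user factors
    X_user    : (N, p)    user covariates
    """
    S, N, r = U_samples.shape
    U_bar = U_samples.mean(axis=0)                       # posterior means E[u_i]
    # Least squares: G = argmin sum_i ||E[u_i] - G x_i||^2
    G, *_ = np.linalg.lstsq(X_user, U_bar, rcond=None)   # shape (p, r)
    G = G.T                                              # shape (r, p)
    # a_u averages E||u_i - G x_i||^2 over the samples, not just the posterior
    # means; the extra within-sample spread is exactly what ICM ignores.
    resid = U_samples - (X_user @ G.T)[None, :, :]
    a_u = (resid ** 2).sum() / (S * N * r)
    return G, a_u

rng = np.random.default_rng(2)
G, a_u = m_step_user_prior(U_samples=rng.normal(size=(20, 50, 3)),
                           X_user=rng.normal(size=(50, 4)))
print(G.shape, round(a_u, 3))
```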
Experiment 1: Better regularization

• MovieLens-100K, avg RMSE using pre-specified splits
• ZeroMean, RLFM and FeatureOnly (no cold-start issues)
• Covariates:
    – Users : age, gender, zipcode (1st digit only)
    – Movies: genres




  Deepak Agarwal & Bee-Chung Chen @ ICML’11             36
Experiment 2: Better handling of Cold-start

• MovieLens-1M; EachMovie
• Training-test split based on timestamp
• Same covariates as in Experiment 1.




  Deepak Agarwal & Bee-Chung Chen @ ICML’11   37
Experiment 4: Predicting click-rate on articles

• Goal: Predict the click-rate on articles for a user in the F1 position (Yahoo! Front Page)
• Article lifetimes short, dynamic updates important


• User covariates:
    – Age, Gender, Geo, Browse behavior


• Article covariates
    – Content Category, keywords


• 2M ratings, 30K users, 4.5 K articles



  Deepak Agarwal & Bee-Chung Chen @ ICML’11                   38
Results on Y! FP data




  Deepak Agarwal & Bee-Chung Chen @ ICML’11   39
Some other related approaches

• Stern, Herbrich and Graepel, WWW, 2009
    – Similar to RLFM, different parametrization and expectation
      propagation used to fit the models


• Porteus, Asuncion and Welling, AAAI, 2011
    – Non-parametric approach using a Dirichlet process


• Agarwal, Zhang and Mazumdar, Annals of Applied
  Statistics, 2011
    – Regression + random effects per user regularized through a
      Graphical Lasso




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                        40
Add Topic Discovery into
                                  Matrix Factorization

fLDA: Matrix Factorization through Latent Dirichlet Allocation
fLDA: Introduction

• Model the rating yij that user i gives to item j as the user’s
  affinity to the topics that the item has
        y_{ij} = \ldots + \sum_k s_{ik} z_{jk}

        s_{ik}: user i’s affinity to topic k
        z_{jk}: Pr(item j has topic k), estimated by averaging the LDA topics of the words in item j

  Old items: z_jk’s are item latent factors learnt from data with the LDA prior
  New items: z_jk’s are predicted based on the bag of words in the items

    – Unlike regular unsupervised LDA topic modeling, here the LDA
      topics are learnt in a supervised manner based on past rating data
    – fLDA can be thought of as a “multi-task learning” version of the
      supervised LDA model [Blei’07] for cold-start recommendation
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                   42
LDA Topic Modeling (1)

• LDA is effective for unsupervised topic discovery [Blei’03]
   – It models the generating process of a corpus of items (articles)
   – For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η)
   – For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
   – For each word, say the nth word, in item j,
      • Draw a topic zjn for that word from θj = [θj1, …, θjK]
         • Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn

     [Diagram: item j has a topic distribution [θ_j1, …, θ_jK] and per-word topics z_j1, …, z_jn, …;
      each observed word w_jn is drawn from the word distribution Φ_k = [Φ_k1, …, Φ_kW] of its topic k = z_jn;
      topics 1, …, K each have their own word distribution]
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                       43
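A direct sketch of that generative process with toy vocabulary sizes and symmetric Dirichlet hyper-parameters; the values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
K, W, n_items, words_per_item = 3, 20, 5, 12   # topics, vocab size, items, item length
eta, lam = 0.1, 0.5                            # Dirichlet hyper-parameters

phi = rng.dirichlet(np.full(W, eta), size=K)   # word distribution Phi_k per topic k
corpus = []
for _ in range(n_items):
    theta_j = rng.dirichlet(np.full(K, lam))                # topic distribution of item j
    z = rng.choice(K, size=words_per_item, p=theta_j)       # per-word topics z_jn
    words = [rng.choice(W, p=phi[k]) for k in z]            # words w_jn ~ Phi_{z_jn}
    corpus.append((z, words))

print(corpus[0])
```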
LDA Topic Modeling (2)

  • Model training:
       – Estimate the prior parameters and the posterior topic×word
         distribution Φ based on a training corpus of items
       – EM + Gibbs sampling is a popular method
  • Inference for new items
       – Compute the item topic distribution based on the prior parameters
         and Φ estimated in the training phase
  • Supervised LDA [Blei’07]
       – Predict a target value for each item based on supervised LDA
         topics
           Supervised LDA:   y_j = \sum_k s_k z_{jk}
              (s_k: regression weight for topic k;  y_j: target value of item j)
           vs.
           fLDA:             y_{ij} = \ldots + \sum_k s_{ik} z_{jk}
              (one regression per user; the same set of topics is shared across regressions)

           z_{jk}: Pr(item j has topic k), estimated by averaging the topics of the words in item j
     Deepak Agarwal & Bee-Chung Chen @ ICML’11                                            44
fLDA: Model

  • Rating:       y_{ij} \sim N(\mu_{ij}, \sigma^2)              Gaussian model
     user i       y_{ij} \sim \text{Bernoulli}(\mu_{ij})         Logistic model (for binary ratings)
     gives        y_{ij} \sim \text{Poisson}(\mu_{ij} N_{ij})    Poisson model (for counts)
     item j
                  t(\mu_{ij}) = x_{ij}' b + \alpha_i + \beta_j + \sum_k s_{ik} z_{jk}


  • Bias of user i:             \alpha_i = g_0' x_i + \epsilon_i^\alpha,   \epsilon_i^\alpha \sim N(0, \sigma_\alpha^2)

  • Popularity of item j:       \beta_j = d_0' x_j + \epsilon_j^\beta,     \epsilon_j^\beta \sim N(0, \sigma_\beta^2)

  • Topic affinity of user i:   s_i = H x_i + \epsilon_i^s,                \epsilon_i^s \sim N(0, \sigma_s^2 I)

  • Pr(item j has topic k):     z_{jk} = \sum_n 1(z_{jn} = k) / (\#\text{words in item } j)
                                  (z_{jn} is the LDA topic of the nth word in item j)

  • Observed words:             w_{jn} \sim \text{LDA}(\lambda, \eta, z_{jn})
                                  (w_{jn} is the nth word in item j)
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                               45
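A small sketch of the rating mean under the Gaussian fLDA model: z_jk is the empirical topic mix of the item's words, and the score adds the user's topic affinities s_i weighted by that mix. All parameters below are illustrative placeholders rather than fitted values.

```python
import numpy as np

rng = np.random.default_rng(4)
K, p_user, p_item, p_pair = 3, 5, 4, 2

b, g0, d0 = rng.normal(size=p_pair), rng.normal(size=p_user), rng.normal(size=p_item)
H = rng.normal(size=(K, p_user))                 # topic-affinity regression

def flda_score(x_i, x_j, x_ij, word_topics):
    """t(mu_ij) = x_ij'b + alpha_i + beta_j + sum_k s_ik z_jk (random effects at their means)."""
    alpha_i = g0 @ x_i
    beta_j = d0 @ x_j
    s_i = H @ x_i                                                    # user i's topic affinities
    z_j = np.bincount(word_topics, minlength=K) / len(word_topics)   # z_jk = topic fractions
    return x_ij @ b + alpha_i + beta_j + s_i @ z_j

score = flda_score(x_i=rng.normal(size=p_user),
                   x_j=rng.normal(size=p_item),
                   x_ij=rng.normal(size=p_pair),
                   word_topics=np.array([0, 2, 2, 1, 0, 2]))   # LDA topics of item j's words
print(round(score, 3))
```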
Model Fitting

• Given:
    – Features X = {xi, xj, xij}
    – Observed ratings y = {yij} and words w = {wjn}

• Estimate:
    – Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η]
       • Regression weights and prior parameters
    – Latent factors: ∆ = {αi, βj, si} and z = {zjn}
       • User factors, item factors and per-word topic assignment
• Empirical Bayes approach:
    – Maximum likelihood estimate of the parameters:
           \hat{\Theta} = \arg\max_\Theta \Pr[y, w \mid \Theta] = \arg\max_\Theta \int \Pr[y, w, \Delta, z \mid \Theta] \, d\Delta \, dz

     – The posterior distribution of the factors:

            \Pr[\Delta, z \mid y, \hat{\Theta}]
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                              46
The EM Algorithm

• Iterate through the E and M steps until convergence
     – Let \hat{\Theta}^{(n)} be the current estimate
     – E-step: Compute
           f(\Theta) = E_{(\Delta, z \mid y, w, \hat{\Theta}^{(n)})}\big[\log \Pr(y, w, \Delta, z \mid \Theta)\big]
          • The expectation is not in closed form
          • We draw Gibbs samples and compute the Monte Carlo mean
     – M-step: Find
           \hat{\Theta}^{(n+1)} = \arg\max_\Theta f(\Theta)
          • It consists of solving a number of regression and optimization
            problems




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                                 47
Supervised Topic Assignment


  z_{jn}: the topic of the nth word in item j

        \Pr(z_{jn} = k \mid \text{Rest}) \;\propto\; \frac{Z_{kl}^{\neg jn} + \eta}{Z_{k}^{\neg jn} + W\eta}\, \big(Z_{jk}^{\neg jn} + \lambda\big) \prod_{i \text{ rated } j} f(y_{ij} \mid z_{jn} = k)

        The first two factors are the same as in unsupervised (collapsed Gibbs) LDA
        The last factor is the probability of observing the ratings y_ij given the model:
        the likelihood of the ratings by users who rated item j when z_jn is set to topic k




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                         48
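A sketch of how the sampling probabilities could be assembled for one word, assuming a Gaussian rating likelihood; the count arrays, names, and toy numbers are illustrative of the collapsed-Gibbs bookkeeping, not the tutorial's implementation, and the bias terms of the rating mean are omitted for brevity.

```python
import numpy as np

def supervised_topic_probs(Z_kl, Z_k, Z_jk, ratings_j, s_users, z_bar_wo,
                           n_words_j, eta, lam, sigma2, W):
    """Normalized Pr(z_jn = k | Rest) for every topic k.

    Z_kl     : (K,) counts of word l assigned to topic k (excluding word n)
    Z_k, Z_jk: (K,) topic totals and item-j topic counts (excluding word n)
    ratings_j: (R,) observed ratings y_ij of users who rated item j
    s_users  : (R, K) those users' topic affinities s_i
    z_bar_wo : (K,) topic fractions of item j's other words (scaled by 1/n_words_j)
    """
    K = len(Z_k)
    probs = np.empty(K)
    for k in range(K):
        lda_part = (Z_kl[k] + eta) / (Z_k[k] + W * eta) * (Z_jk[k] + lam)
        z_j = z_bar_wo.copy()
        z_j[k] += 1.0 / n_words_j                 # set word n's topic to k
        mean = s_users @ z_j                      # per-user rating means (bias terms omitted)
        lik = np.exp(-0.5 * np.sum((ratings_j - mean) ** 2) / sigma2)
        probs[k] = lda_part * lik
    return probs / probs.sum()

# Toy usage: 3 topics, vocabulary of 50 words, 2 users rated item j
p = supervised_topic_probs(Z_kl=np.array([2., 0., 1.]), Z_k=np.array([40., 35., 25.]),
                           Z_jk=np.array([3., 1., 2.]), ratings_j=np.array([4.0, 2.5]),
                           s_users=np.array([[1.0, 0.2, -0.3], [0.1, 0.8, 0.4]]),
                           z_bar_wo=np.array([0.4, 0.2, 0.3]), n_words_j=10,
                           eta=0.1, lam=0.5, sigma2=1.0, W=50)
print(np.round(p, 3))
```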
fLDA: Experimental Results (Movie)

• Task: Predict the rating that a user would give a movie
• Training/test split:
     – Sort observations by time
     – First 75% → Training data
     – Last 25% → Test data
• Item warm-start scenario
     – Only 2% new items in test data

       Model          Test RMSE
       RLFM             0.9363
       fLDA             0.9381
       Factor-Only      0.9422
       FilterBot        0.9517
       unsup-LDA        0.9520
       MostPopular      0.9726
       Feature-Only     1.0906
       Constant         1.1190

fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios


  Deepak Agarwal & Bee-Chung Chen @ ICML’11                               49
fLDA: Experimental Results (Yahoo! Buzz)

• Task: Predict whether a user would buzz-up an article
• Severe item cold-start
    – All items are new in test data



      fLDA significantly outperforms the other models

      Data statistics: 1.2M observations, 4K users, 10K articles


  Deepak Agarwal & Bee-Chung Chen @ ICML’11               50
Experimental Results: Buzzing Topics

      3/4 of the topics are interpretable; 1/2 are similar to unsupervised topics

      Topic (top terms after stemming):
      – CIA interrogation: bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
      – Swine flu: mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
      – NFL games: NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
      – Gay marriage: court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
      – Sarah Palin: palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
      – American idol: idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
      – Recession: economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
      – North Korea issues: north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia



  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                           51
fLDA Summary

• fLDA is a useful model for cold-start item recommendation
• It also provides interpretable recommendations for users
    – User’s preference for interpretable LDA topics
• Future directions:
    – Investigate Gibbs sampling chains and the convergence properties of the
      EM algorithm
    – Apply fLDA to other multi-task prediction problems
         • fLDA can be used as a tool to generate supervised features
           (topics) from text data




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                 52
Summary

• Regularizing factors through covariates is effective


• A regression-based factor model that regularizes better and
  handles both cold-start and warm-start seamlessly in a single
  framework is attractive


• Fitting method scalable; Gibbs sampling for users and
  movies can be done in parallel. Regressions in M-step can
  be done with any off-the-shelf scalable linear regression
  routine
• Distributed computing on Hadoop: Multiple models and
  average across partitions
    Deepak Agarwal & Bee-Chung Chen @ ICML’11               53
Hierarchical smoothing
Advertising application, Agarwal et al. KDD 10


 • Product of states for each node pair

   [Diagram: publisher hierarchy (pubclass → Pub-id) crossed with advertiser hierarchy
    (Advertiser → campaign → Ad-id → Conv-id); each node pair z has state (S_z, E_z, λ_z)]

 • Spike and Slab prior



     – Known to encourage parsimonious solutions
        • Several cell states have no corrections
     – Not used before for multi-hierarchy models, only in regression
     – We choose P = .5 (and choose “a” by cross-validation)
        • a – pseudo number of successes


   Deepak Agarwal & Bee-Chung Chen @ ICML’11                                         54
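To illustrate the parsimony effect of a spike-and-slab prior with P = .5, here is a draw of per-node corrections where roughly half the cells sit exactly at the spike (no correction) and the rest come from the slab. The Gaussian slab and parameter names are illustrative only, not the KDD'10 specification.

```python
import numpy as np

rng = np.random.default_rng(5)
n_nodes, p_slab, slab_sd = 12, 0.5, 1.0

is_slab = rng.random(n_nodes) < p_slab                               # which nodes get a correction
corrections = np.where(is_slab, rng.normal(0.0, slab_sd, n_nodes), 0.0)
print(np.round(corrections, 2))     # roughly half the cells carry no correction
```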


Editor’s notes

  1. During time interval t: users visit the webpage, and some item inventory is at our disposal. Decision problem: for each visit, select an item to display; each display generates a response. Goal: choose items to maximize expected clicks.