Latent factor models for Collaborative Filtering

AIM3 – Scalable Data Analysis and Data Mining

11 – Latent factor models for Collaborative Filtering
Sebastian Schelter, Christoph Boden, Volker Markl

Fachgebiet Datenbanksysteme und Informationsmanagement
Technische Universität Berlin

20.06.2012
http://www.dima.tu-berlin.de/
DIMA – TU Berlin 1

Recap: Item-Based Collaborative Filtering

Item-based Collaborative Filtering

• compute pairwise similarities of the columns of
the rating matrix using some similarity measure
• store top 20 to 50 most similar items per item
in the item-similarity matrix
• prediction: use a weighted sum over all items
similar to the unknown item that have been
rated by the current user

$$p_{ui} = \frac{\sum_{j \in S(i,u)} s_{ij}\, r_{uj}}{\sum_{j \in S(i,u)} |s_{ij}|}$$
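The weighted-sum prediction can be sketched in a few lines of Python. This is only an illustration: the function name `predict_item_based`, the dictionary layout, and the toy numbers are assumptions, and the similarities are taken as already computed.

```python
def predict_item_based(user_ratings, similarities, k=20):
    """Predict a rating as the similarity-weighted average of the user's ratings.

    user_ratings: dict item -> rating given by the current user
    similarities: dict item -> similarity s_ij to the target item i
    """
    # S(i, u): items similar to i that the user has rated, top-k by similarity
    neighbours = sorted(
        (j for j in similarities if j in user_ratings),
        key=lambda j: abs(similarities[j]),
        reverse=True,
    )[:k]
    numerator = sum(similarities[j] * user_ratings[j] for j in neighbours)
    denominator = sum(abs(similarities[j]) for j in neighbours)
    if denominator == 0:
        return None  # no usable neighbours, so no prediction
    return numerator / denominator


# Toy data (hypothetical): ratings of one user, similarities to target item "A"
user_ratings = {"B": 4.0, "C": 2.0}
similarities = {"B": 0.5, "C": 0.25, "D": 0.9}
prediction = predict_item_based(user_ratings, similarities)
```

Item "D" is the most similar one, but it does not contribute because the user has not rated it.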


Drawbacks of similarity-based neighborhood
methods

• the assumption that a rating is defined by all the
user's ratings for commonly co-rated items is
hard to justify in general

• lack of bias correction

• every co-rated item is looked at in isolation:
say a movie is similar to "Lord of the Rings"; do
we want each part of the trilogy to count as
a similar item of its own?

• the best choice of similarity measure is found by
experimentation, not derived from mathematical reasoning


Latent factor models

■ Idea

• ratings are deeply influenced by a set of factors that are
very specific to the domain (e.g. amount of action in movies,
complexity of characters)

• these factors are in general not obvious; we might be able to
think of some of them, but it is hard to estimate their impact on
the ratings

• the goal is to infer those so-called latent factors from the
rating data by using mathematical techniques



■ Approach

• users and items are characterized by latent
factors; each user and item is mapped onto
a latent feature space: $u_i, m_j \in \mathbb{R}^{n_f}$

• each rating is approximated by the dot
product of the user feature vector
and the item feature vector: $r_{ij} \approx m_j^T u_i$

• prediction of unknown ratings also uses
this dot product

• squared error as a measure of loss: $(r_{ij} - m_j^T u_i)^2$
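As a small illustration of the approach, the prediction and its squared error can be written directly as a dot product. The function names and the two-dimensional toy vectors are assumptions made for this sketch.

```python
def predict(u_i, m_j):
    """Approximate a rating by the dot product of user and item feature vectors."""
    return sum(u * m for u, m in zip(u_i, m_j))


def squared_error(r_ij, u_i, m_j):
    """Squared difference between a known rating and its approximation."""
    return (r_ij - predict(u_i, m_j)) ** 2


# Toy vectors with n_f = 2 latent factors (hypothetical values)
u_i = [1.0, 0.5]
m_j = [2.0, 4.0]
estimate = predict(u_i, m_j)
loss = squared_error(5.0, u_i, m_j)
```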



■ Approach

• decomposition of the rating matrix into the product of a user
feature and an item feature matrix
• row in U: vector of a user's affinity to the features
• row in M: vector of an item's relation to the features

• closely related to Singular Value Decomposition, which
produces an optimal low-rank approximation of a matrix

$$R \approx U M^T$$



■ Properties of the decomposition
• automatically ranks features by their „impact“ on the ratings
• features might not necessarily be intuitively understandable



■ Problematic situation with explicit feedback data

• the rating matrix is not only sparse but only partially defined:
missing entries cannot be interpreted as 0, they are simply
unknown
• standard decomposition algorithms, such as the Lanczos method for
SVD, are therefore not applicable

■ Solution

• decomposition has to be done using the known ratings only
• find the set of user and item feature vectors that minimizes the
squared error to the known ratings

$$\min_{U,M} \sum_{(i,j)\ \text{known}} (r_{ij} - m_j^T u_i)^2$$
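One way to make "known ratings only" concrete is to store the ratings sparsely, so that unknown entries never enter the sum. The dictionary layout keyed by `(i, j)` and the toy matrices below are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def known_ratings_loss(ratings, U, M):
    """Sum of squared errors over the known ratings only.

    ratings: dict (user i, item j) -> r_ij; missing keys are unknown, not zero
    U, M: lists of user and item feature vectors
    """
    return sum((r - dot(M[j], U[i])) ** 2 for (i, j), r in ratings.items())


# Toy data (hypothetical): the rating of user 1 for item 0 is unknown
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 1.0}
U = [[1.0, 1.0], [0.0, 1.0]]
M = [[2.0, 3.0], [1.0, 1.0]]
loss = known_ratings_loss(ratings, U, M)
```

The unknown entry `(1, 0)` contributes nothing to the loss, whereas treating it as a rating of 0 would penalize the model for predicting anything else.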



■ the quality of the decomposition is not measured by the
reconstruction error on the original data, but by how well it
generalizes to unseen data
■ regularization is necessary to avoid overfitting

■ the model has hyperparameters (regularization, learning rate)
that need to be chosen

■ process: split the data into training, validation and test sets
□ train the model on the training set
□ choose hyperparameters according to performance on the validation set
□ evaluate generalization on the test set
□ rotate the splits so that each data point is used in each role once
(cross-validation)
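The rotation of splits can be sketched with a simple k-fold scheme. This is a minimal sketch, assuming ratings are stored as `(user, item, rating)` tuples; the function name `k_fold_splits` and the toy data are hypothetical.

```python
import random


def k_fold_splits(ratings, k=5, seed=42):
    """Yield (train, held_out) pairs; every rating is held out exactly once."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]  # k nearly equal-sized folds
    for f in range(k):
        held_out = folds[f]
        train = [x for g, fold in enumerate(folds) if g != f for x in fold]
        yield train, held_out


# Toy data (hypothetical): 4 users x 5 items, ratings between 1 and 5
ratings = [(i, j, float((i + j) % 5 + 1)) for i in range(4) for j in range(5)]
splits = list(k_fold_splits(ratings, k=5))
```

Each held-out fold is used only to score a model trained on the remaining folds, so no rating influences its own evaluation.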


Stochastic Gradient Descent

• add a regularization term

$$\min_{U,M} \sum_{(i,j)\ \text{known}} \Big[ (r_{ij} - m_j^T u_i)^2 + \lambda \big( \|u_i\|^2 + \|m_j\|^2 \big) \Big]$$

• loop through all ratings in the training set, compute the
associated prediction error

$$e_{ij} = r_{ij} - m_j^T u_i$$

• modify parameters in the opposite direction of the gradient
(with learning rate $\gamma$)

$$u_i \leftarrow u_i + \gamma\,(e_{ij}\, m_j - \lambda\, u_i)$$

$$m_j \leftarrow m_j + \gamma\,(e_{ij}\, u_i - \lambda\, m_j)$$

• problem: approach is inherently sequential (although recent
research might have unveiled a parallelization technique)
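The update loop can be sketched as follows. This is a minimal single-threaded sketch, not a tuned implementation: the function names, the initialization range, the hyperparameter values (`gamma`, `lam`, `epochs`) and the toy ratings are all assumptions.

```python
import random


def sgd_factorize(ratings, n_users, n_items, n_factors=2,
                  gamma=0.01, lam=0.05, epochs=200, seed=0):
    """Factorize known ratings with stochastic gradient descent.

    ratings: dict (user i, item j) -> r_ij
    gamma: learning rate, lam: regularization weight
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    M = [[rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    samples = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the ratings in random order
        for (i, j), r in samples:
            # prediction error e_ij for this single rating
            e = r - sum(m * u for m, u in zip(M[j], U[i]))
            for f in range(n_factors):
                u_old, m_old = U[i][f], M[j][f]
                # step against the gradient of the regularized squared error
                U[i][f] += gamma * (e * m_old - lam * u_old)
                M[j][f] += gamma * (e * u_old - lam * m_old)
    return U, M


def squared_error_sum(ratings, U, M):
    return sum((r - sum(m * u for m, u in zip(M[j], U[i]))) ** 2
               for (i, j), r in ratings.items())


# Toy data (hypothetical): 3 users, 3 items, 6 known ratings
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
U0, M0 = sgd_factorize(ratings, 3, 3, epochs=0)  # initial vectors only
U, M = sgd_factorize(ratings, 3, 3)
error_before = squared_error_sum(ratings, U0, M0)
error_after = squared_error_sum(ratings, U, M)
```

Each update reads and writes both `U[i]` and `M[j]`, so two ratings that share a user or an item cannot be processed at the same time without coordination; this is the sequential dependency mentioned above.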


Alternating Least Squares with Weighted λ-Regularization
■ Model

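• feature matrices are modeled directly by using only
the observed ratings
• add a regularization term to avoid overfitting
• minimize the regularized error:

$$f(U,M) = \sum_{(i,j)\ \text{known}} (r_{ij} - m_j^T u_i)^2 + \lambda \Big( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \Big)$$

where $n_{u_i}$ is the number of ratings of user $i$ and $n_{m_j}$ the number of ratings of item $j$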

■ Solving technique

• fix one of the two unknowns (U or M); the objective then becomes
a simple quadratic problem that can be solved exactly by least squares
• alternate between fixing U and fixing M until convergence
("Alternating Least Squares")
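With M fixed, setting the gradient of the objective to zero gives one small linear system per user, $(\sum_j m_j m_j^T + \lambda n_{u_i} I)\, u_i = \sum_j r_{ij} m_j$, where the sums run over the items rated by user $i$; the item update is symmetric. The sketch below follows this scheme in plain Python. The helper names, the hand-written Gaussian elimination, the hyperparameters and the toy ratings are assumptions for illustration.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def update_vectors(rated, fixed, n_factors, lam):
    """Recompute one side's vectors while the other side (`fixed`) is held constant.

    rated[k] is a list of (index into `fixed`, rating) pairs for entity k.
    """
    updated = []
    for pairs in rated:
        if not pairs:  # entity without ratings: nothing to fit
            updated.append([0.0] * n_factors)
            continue
        n = len(pairs)
        # weighted lambda-regularization: lam * n on the diagonal
        A = [[lam * n if a == b else 0.0 for b in range(n_factors)]
             for a in range(n_factors)]
        rhs = [0.0] * n_factors
        for other, r in pairs:
            v = fixed[other]
            for a in range(n_factors):
                rhs[a] += r * v[a]
                for b in range(n_factors):
                    A[a][b] += v[a] * v[b]
        updated.append(solve(A, rhs))
    return updated


def objective(ratings, U, M, lam):
    err = sum((r - sum(m * u for m, u in zip(M[j], U[i]))) ** 2
              for (i, j), r in ratings.items())
    n_u, n_m = [0] * len(U), [0] * len(M)
    for i, j in ratings:
        n_u[i] += 1
        n_m[j] += 1
    reg = sum(n * sum(x * x for x in v) for n, v in zip(n_u, U))
    reg += sum(n * sum(x * x for x in v) for n, v in zip(n_m, M))
    return err + lam * reg


def als_wr(ratings, n_users, n_items, n_factors=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    M = [[rng.uniform(0.1, 1.0) for _ in range(n_factors)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (i, j), r in ratings.items():
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    history = []
    for _ in range(iterations):
        U = update_vectors(by_user, M, n_factors, lam)  # fix M, solve for U
        M = update_vectors(by_item, U, n_factors, lam)  # fix U, solve for M
        history.append(objective(ratings, U, M, lam))
    return U, M, history


# Toy data (hypothetical): 3 users, 3 items, 6 known ratings
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
U, M, history = als_wr(ratings, 3, 3)
```

Because each half-step solves its subproblem exactly, the recorded objective values in `history` should never increase from one iteration to the next.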


ALS-WR is scalable

■ Which properties make this approach scalable?

• all feature vectors updated in one step (all users, or all items)
can be computed independently of each other
• only a small portion of the data is necessary to compute
a single feature vector

■ Parallelization with Map/Reduce

• Computing user feature vectors: the mappers need to send
each user's rating vector and the feature vectors of his/her
rated items to the same reducer

• Computing item feature vectors: the mappers need to send
each item's rating vector and the feature vectors of users who
rated it to the same reducer
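The data flow for the user step can be imitated in memory to show which records have to meet at the same reducer. This is not Hadoop code: the function names, the in-memory `shuffle`, and passing the item feature vectors to the reducer as a lookup table are simplifying assumptions.

```python
from collections import defaultdict


def map_rating(user, item, rating):
    """Mapper: key each rating by its user so one reducer sees the whole rating vector."""
    yield user, (item, rating)


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_user(user, item_ratings, item_features):
    """Reducer: collect everything needed to recompute this user's feature vector."""
    vectors = [item_features[item] for item, _ in item_ratings]
    values = [rating for _, rating in item_ratings]
    return user, vectors, values  # a least-squares solve on these would follow


# Toy data (hypothetical)
ratings = [("alice", "m1", 5.0), ("bob", "m1", 3.0), ("alice", "m2", 1.0)]
item_features = {"m1": [0.9, 0.1], "m2": [0.2, 0.8]}

mapped = [pair for u, i, r in ratings for pair in map_rating(u, i, r)]
reducer_inputs = {
    user: reduce_user(user, values, item_features)
    for user, values in shuffle(mapped).items()
}
```

The item step is the mirror image: key the ratings by item and join them with the user feature vectors.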


Incorporating biases

■ Problem: explicit feedback data is highly biased
□ some users tend to rate more extremely than others
□ some items tend to get higher ratings than others

■ Solution: explicitly model biases
□ the bias of a rating is modeled as a combination of the overall average
rating $\mu$, the user bias $b_i$ and the item bias $b_j$

$$b_{ij} = \mu + b_i + b_j$$

□ the rating bias can be incorporated into the prediction

$$\hat{r}_{ij} = \mu + b_i + b_j + m_j^T u_i$$
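A minimal sketch of the biased prediction follows. The slides do not say how the biases are obtained; estimating them as average deviations from the mean, as done here, is a simple heuristic chosen for illustration (they can also be learned together with the feature vectors). All names and toy values are assumptions.

```python
def estimate_biases(ratings):
    """Estimate mu, user biases and item biases as average deviations.

    ratings: dict (user i, item j) -> r_ij
    """
    mu = sum(ratings.values()) / len(ratings)
    item_dev = {}
    for (i, j), r in ratings.items():
        item_dev.setdefault(j, []).append(r - mu)
    b_item = {j: sum(d) / len(d) for j, d in item_dev.items()}
    user_dev = {}
    for (i, j), r in ratings.items():
        # what remains after removing the mean and the item bias
        user_dev.setdefault(i, []).append(r - mu - b_item[j])
    b_user = {i: sum(d) / len(d) for i, d in user_dev.items()}
    return mu, b_user, b_item


def predict_with_bias(mu, b_user, b_item, u_i, m_j, i, j):
    """Baseline mu + b_i + b_j plus the latent factor interaction."""
    interaction = sum(m * u for m, u in zip(m_j, u_i))
    return mu + b_user.get(i, 0.0) + b_item.get(j, 0.0) + interaction


# Toy data (hypothetical): user 0 rates high, item 0 is rated high
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}
mu, b_user, b_item = estimate_biases(ratings)
baseline = predict_with_bias(mu, b_user, b_item, [0.0, 0.0], [0.0, 0.0], 0, 0)
```

In this toy example the biases alone already reproduce the ratings, so the feature vectors only need to explain what the biases cannot.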



■ implicit feedback data is very different from explicit data!

□ e.g. use the number of clicks on a product page of an online shop

□ the whole matrix is defined!
□ no negative feedback
□ interactions that did not happen produce zero values
□ however we should have only little confidence in these (maybe the user
never had the chance to interact with these items)

□ using standard decomposition techniques like SVD would give us a
decomposition that is biased towards the zero entries, again not
applicable



■ Solution for working with implicit data:
weighted matrix factorization

■ create a binary preference matrix P

$$p_{ij} = \begin{cases} 1 & r_{ij} > 0 \\ 0 & r_{ij} = 0 \end{cases}$$

■ each entry in this matrix can be weighted
by a confidence function: $c(i,j) = 1 + \alpha\, r_{ij}$
□ zero values should get low confidence
□ values that are based on a lot of interactions
should get high confidence

■ confidence is incorporated into the model
□ the factorization will 'prefer' more confident values

$$f(U,M) = \sum_{i,j} c(i,j)\,(p_{ij} - m_j^T u_i)^2 + \lambda \Big( \sum_i \|u_i\|^2 + \sum_j \|m_j\|^2 \Big)$$
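The weighted objective can be evaluated with a short sketch. Unlike the explicit case, the sum runs over every cell of the matrix, including the zeros. The function names, the default `alpha`, and the toy interaction counts are assumptions for illustration.

```python
def preference(r_ij):
    """Binary preference: 1 if any interaction was observed, else 0."""
    return 1.0 if r_ij > 0 else 0.0


def confidence(r_ij, alpha=40.0):
    """Confidence grows linearly with the number of interactions."""
    return 1.0 + alpha * r_ij


def weighted_loss(R, U, M, lam=0.1, alpha=40.0):
    """Confidence-weighted squared error over ALL cells, plus regularization.

    R: dense matrix of interaction counts, R[i][j] for user i and item j
    """
    loss = 0.0
    for i, row in enumerate(R):
        for j, r_ij in enumerate(row):
            prediction = sum(m * u for m, u in zip(M[j], U[i]))
            loss += confidence(r_ij, alpha) * (preference(r_ij) - prediction) ** 2
    reg = sum(x * x for vec in U for x in vec) + sum(x * x for vec in M for x in vec)
    return loss + lam * reg


# Toy data (hypothetical): click counts of 2 users on 2 items
R = [[3, 0], [0, 1]]
U = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 0.5]]
loss = weighted_loss(R, U, M, lam=0.1, alpha=2.0)
```

Here only the cell `(1, 1)` is mispredicted (0.5 instead of 1), and its error is weighted by a confidence of 3 because one interaction was observed.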


Sources

• Sarwar et al.: „Item-Based Collaborative Filtering
Recommendation Algorithms“, 2001
• Koren et al.: „Matrix Factorization Techniques for Recommender
Systems“, 2009
• Funk: „Netflix Update: Try This at Home“,
http://sifter.org/~simon/journal/20061211.html, 2006
• Zhou et al.: „Large-scale Parallel Collaborative Filtering for the
Netflix Prize“, 2008
• Hu et al.: „Collaborative Filtering for Implicit Feedback
Datasets“, 2008

