2. WHERE’S BIG LEARNING?
Next: the Application Layer
  [Stack diagram: Storage → Database → Processing → Analytics / Machine Learning → Applications]
Like Apache Mahout
  Common Big Data app today
  Clustering, recommenders, classifiers on Hadoop
  Free, open source; not mature
Where’s commercialized Big Learning?
3. A RECOMMENDER SHOULD …
Answer in Real-time
  Ingest new data, now
  Modify recommendations based on newest data
  No “cold start” for new data
Accept Diverse Input
  Not just people and products
  Not just explicit ratings: clicks, views, buys
  Side information
Scale Horizontally
  For queries per second
  For size of data set
Be “Pretty Accurate”
4. NEED: 2-TIER ARCHITECTURE
Real-time Serving Layer
  Quick results based on a precomputed model
  Incremental update
  Partitionable for scale
Batch Computation Layer
  Builds the model
  Scales out (on Hadoop?)
  Asynchronous, occasional, long-lived runs
5. A PRACTICAL ALGORITHM
MATRIX FACTORIZATION
  Factor the user-item matrix into a user-feature matrix times a feature-item matrix
  Well understood in ML, as: Principal Component Analysis, Latent Semantic Indexing
  Several algorithms, like: Singular Value Decomposition, Alternating Least Squares
BENEFITS
  Models intuition
  Factorization is batch parallelizable
  Reconstruction (recs) in low dimension is fast
  Allows projection of new data
    Cold start solution
    Approximate update solution
6. A PRACTICAL IMPLEMENTATION
ALTERNATING LEAST SQUARES
  Simple factorization P ≈ X Yᵀ
  Approximate: X, Y are very “skinny” (low-rank)
  Faster than the SVD
  Trivially parallel, iterative
  Dumber than the SVD: no singular values, no orthonormal basis
BENEFITS
  Parallelizable by row -- Hadoop-friendly
  Iterative: OK answer fast, refine as long as desired
  Lends itself to a “binary” input model, with ratings as regularization instead
  Sparseness / 0s no longer a problem
7. ALS ALGORITHM 1
Input: (user, item, strength) tuples
  Anything you can quantify is input
  Strength is positive
  Many tuples per user-item
R is the sparse user-item interaction matrix
  rᵢⱼ = total strength of interaction between user i and item j

Example R (· = no interaction):
  1 4 3 · ·
  · · 3 · ·
  · 4 · 3 2
  5 · 2 · 3
  · · · 5 ·
  2 4 · · ·
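A minimal sketch of this input step (plain NumPy/SciPy, not Myrrix’s API; the tuples and 0-based ids are hypothetical, chosen to match the example matrix above):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (user, item, strength) tuples; note (0, 1, *) appears
# twice -- many tuples per user-item pair are allowed and accumulate.
tuples = [(0, 0, 1.0), (0, 1, 3.0), (0, 1, 1.0), (0, 2, 3.0),
          (1, 2, 3.0), (5, 0, 2.0), (5, 1, 4.0)]

users, items, strengths = zip(*tuples)
# COO duplicates are summed on conversion, giving
# r_ij = total strength of interaction between user i and item j.
R = coo_matrix((strengths, (users, items)), shape=(6, 5)).tocsr()
print(R[0, 1])  # 4.0 = 3.0 + 1.0
```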
8. ALS ALGORITHM 2
Follow “Collaborative Filtering for Implicit Feedback Datasets”
  www2.research.att.com/~yifanhu/PUB/cf.pdf
Construct “binary” matrix P
  1 where R > 0
  0 where R = 0
Factor P, not R
  R returns in regularization
  Still sparse; implicit 0s fine

Example P:
  1 1 1 0 0
  0 0 1 0 0
  0 1 0 1 1
  1 0 1 0 1
  0 0 0 1 0
  1 1 0 0 0
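Continuing the sketch, constructing P from R is a single comparison:

```python
# "Binary" matrix P: 1 where R > 0, 0 where R = 0.
# The comparison keeps P sparse; the 0s stay implicit.
P = (R > 0).astype(float)
```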
9. ALS ALGORITHM 3
P is m x n
Choose k << m, n
Factor P as Q = X Yᵀ, with Q ≈ P
  X is m x k; Yᵀ is k x n
Find the best approximation Q
  Minimize the L2 norm of the difference: ‖P − Q‖₂
  Minimal squared error: “Least Squares”
Recommendations are the largest values in Q
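A sketch of this serving step, assuming factors X (m x k) and Y (n x k) from the batch layer (computed on the following slides); masking out items the user already has is an assumption about typical recommender behavior, not something the slide specifies:

```python
import numpy as np

def recommend(u, X, Y, R, top_n=3):
    # Row u of Q = X Yt: predicted association of user u with every item.
    q_u = X[u] @ Y.T
    # Assumed behavior: don't re-recommend already-seen items.
    q_u[R[u].toarray().ravel() > 0] = -np.inf
    return np.argsort(q_u)[::-1][:top_n]  # item indices, best first
```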
10. ALS ALGORITHM 4
Optimizing X and Y simultaneously is non-convex: hard
If X or Y is fixed, it becomes a system of linear equations: convex, easy
Initialize Y with random values
Solve for X
Fix X, solve for Y
Repeat (“Alternating”)
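A minimal sketch of the alternation, continuing the example; for clarity it uses the plain unweighted objective (the confidence weights cᵤᵢ arrive on the next slide), and k, λ, and the iteration count are illustrative, not Myrrix defaults:

```python
rng = np.random.default_rng(0)
m, n = P.shape
k, lam, iterations = 2, 0.1, 10
Pd = P.toarray()                 # dense only because the example is tiny
Y = rng.standard_normal((n, k))  # initialize Y with random values
for _ in range(iterations):
    # Fix Y, solve for X: a convex ridge-regression solve, row by row
    X = np.linalg.solve(Y.T @ Y + lam * np.eye(k), Y.T @ Pd.T).T
    # Fix X, solve for Y symmetrically; repeat ("alternating")
    Y = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Pd).T
```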
11. ALS ALGORITHM 5
Define regularization weights cᵤᵢ = 1 + α rᵤᵢ
Minimize:
  Σ cᵤᵢ(pᵤᵢ − xᵤᵀyᵢ)² + λ(Σ‖xᵤ‖² + Σ‖yᵢ‖²)
A simple least-squares regression objective, plus:
  Squared-error terms weighted by strength: the penalty for not reconstructing 1 is higher at a “strong” association
  A standard L2 regularization term
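The same objective spelled out numerically, continuing the sketch (α = 40 is the value the paper reports working well on its data; λ is arbitrary here):

```python
alpha, lam = 40.0, 0.1
C = 1.0 + alpha * R.toarray()   # c_ui = 1 + alpha * r_ui
err = Pd - X @ Y.T              # p_ui - x_u . y_i, for all u, i
loss = (C * err**2).sum() + lam * ((X**2).sum() + (Y**2).sum())
```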
12. ALS ALGORITHM 6
With fixed Y, compute the optimal X
Each row xᵤ is independent
Define Cᵤ as the diagonal matrix of cᵤ (user strength weights)
  xᵤ = (YᵀCᵤY + λI)⁻¹ YᵀCᵤpᵤ
Compare to the simple least-squares regression solution (YᵀY)⁻¹Yᵀpᵤ
  Adds the Tikhonov / ridge regression regularization term λI
  Attaches the cᵤ weights to Yᵀ
See the paper for how YᵀCᵤY is computed efficiently; skipping the engineering!
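A direct, unoptimized sketch of that per-row solve; it too skips the engineering (the paper computes YᵀCᵤY as YᵀY + Yᵀ(Cᵤ − I)Y so that only nonzero entries cost anything):

```python
def solve_user_row(u, Y, R, P, alpha=40.0, lam=0.1):
    c_u = 1.0 + alpha * R[u].toarray().ravel()  # diagonal of Cu
    p_u = P[u].toarray().ravel()
    k = Y.shape[1]
    # x_u = (Yt Cu Y + lambda I)^-1  Yt Cu p_u
    YtCuY = Y.T @ (c_u[:, None] * Y)
    return np.linalg.solve(YtCuY + lam * np.eye(k), Y.T @ (c_u * p_u))
```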
14. FOLD-IN
Need immediate, if approximate, updates for new data
New user u needs a new row Xᵤ, so that Qᵤ = Xᵤ Yᵀ
Compute Xᵤ via a right inverse: X Yᵀ(Yᵀ)⁻¹ = Q(Yᵀ)⁻¹, so: X = Q(Yᵀ)⁻¹
But what is (Yᵀ)⁻¹?
  Note (YᵀY)(YᵀY)⁻¹ = I
  Gives Yᵀ’s right inverse: Yᵀ(Y(YᵀY)⁻¹) = I
So Xᵤ = Qᵤ Y(YᵀY)⁻¹; we have Pᵤ ≈ Qᵤ, so Xᵤ ≈ Pᵤ Y(YᵀY)⁻¹
Recommend as usual: Qᵤ = Xᵤ Yᵀ
For an existing user, instead add to the existing row Xᵤ
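As a sketch, fold-in is then two small matrix products; the k x k inverse is cheap, and Y(YᵀY)⁻¹ can be precomputed once per model:

```python
def fold_in(p_u, Y):
    # Right inverse of Yt is Y (YtY)^-1, so x_u ≈ p_u Y (YtY)^-1
    right_inverse = Y @ np.linalg.inv(Y.T @ Y)  # n x k
    return p_u @ right_inverse                  # new user-feature row x_u
```

A new user’s preference row pᵤ thus gets recommendations via Qᵤ = Xᵤ Yᵀ without waiting for the next batch run.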
15. THIS IS MYRRIX
Soft-launched
  Serving Layer available as open source download
  Computation Layer available as beta
  Ready on Amazon EC2 / EMR
Full launch Q4 2012
srowen@myrrix.com
myrrix.com
17. EXAMPLES
STACKOVERFLOW TAGS
  Recommend tags to questions
  Tag questions automatically, improve tag coverage
  3.5M questions x 30K tags
  4.3 hours x 5 machines on Amazon EMR
  $3.03 ≈ $0.08 per 100,000 recs
WIKIPEDIA LINKS
  Recommend new linked articles from existing links
  Propose missing, related links
  2.5M articles x 1.8M articles
  28 hours x 2 PCs on Apache Hadoop 1.0.3