Hadoop France Meetup, Feb 2016: Recommendations with Spark

Hi! I’m Koby
▣ Data Scientist at Equancy
□ Previously: Kpler, Engie
▣ Python Dev
□ scikit-learn / pandas / Jupyter
□ Sometimes I use R
▣ I used Hadoop before for data pipelines
▣ My first project doing distributed ML!

Hello, my name is Hervé!
▣ Equancy Partner & Chief Scientist
▣ In charge of Data Technologies
□ Data Engineering
□ Data Science
□ Innovating with data
▣ PhD in Machine Learning many years ago

Recommenders: What for?
▣ Only one occasion to interact with customers
□ Which marketing message to choose?
▣ Personalized User Experience
□ Improved Experience!
▣ No information overload
□ ~230,000 Products

Why does personalization matter?
Because no personalization is ugly...

Three different recommendation systems
Placements: Homepage, Product Page, Cart
▣ Collaborative Filtering (Unsupervised Learning)
▣ Frequently Bought-Together Prediction (Supervised Learning)
▣ Content-Based Filtering (Correlation Maximization)

Business Inputs
▣ The score should be based on three factors:
□ Interaction type - a purchase is more important than a product view
□ Time (decay) - a product purchased recently will have more impact than a product purchased in the distant past
□ Season - a product purchased during another season will have less impact
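
The three factors above could be folded into a single interaction score. The sketch below is only one possible scheme: the weights, the 90-day half-life and the off-season discount are hypothetical values chosen for illustration, not the ones used in the project.

```python
# Hypothetical weights: a purchase counts more than a product view.
INTERACTION_WEIGHTS = {"view": 1.0, "purchase": 5.0}
HALF_LIFE_DAYS = 90.0      # assumed: an interaction loses half its weight every 90 days
OFF_SEASON_FACTOR = 0.5    # assumed discount for interactions from another season


def interaction_score(kind, age_days, same_season):
    """Combine interaction type, time decay and season into one score."""
    base = INTERACTION_WEIGHTS[kind]
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)   # exponential time decay
    season = 1.0 if same_season else OFF_SEASON_FACTOR
    return base * decay * season


recent_purchase = interaction_score("purchase", age_days=0, same_season=True)
old_view = interaction_score("view", age_days=180, same_season=False)
```

With these assumed values a fresh in-season purchase keeps its full weight, while an old off-season view contributes only a small fraction of a point.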

Business Rules
▣ The following items should be filtered out:
□ Purchased recently or very similar
□ Not in current season
□ Not user’s gender
□ Not in stock
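
These rules can be applied as a post-filter on the candidate list. The sketch below assumes products and users are plain dictionaries with hypothetical field names (`id`, `season`, `gender`, `stock`), and that a "unisex" product suits everyone; the "very similar" check is left out because it needs a similarity measure the slides do not describe.

```python
def passes_business_rules(product, user, recently_purchased, current_season):
    """Return True if a candidate product may be recommended to the user."""
    if product["id"] in recently_purchased:
        return False    # purchased recently
    if product["season"] != current_season:
        return False    # not in current season
    if product["gender"] not in (user["gender"], "unisex"):
        return False    # not the user's gender
    if product["stock"] <= 0:
        return False    # not in stock
    return True


user = {"id": 1, "gender": "f"}
candidates = [
    {"id": 10, "season": "winter", "gender": "f", "stock": 3},
    {"id": 11, "season": "summer", "gender": "f", "stock": 5},
    {"id": 12, "season": "winter", "gender": "m", "stock": 2},
    {"id": 13, "season": "winter", "gender": "unisex", "stock": 0},
]
kept = [p["id"] for p in candidates
        if passes_business_rules(p, user, recently_purchased={99}, current_season="winter")]
```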

[Figure: sparse user × product matrix; observed interaction scores take the values 1, 3 and 5]
▣ Map users to products in a matrix

? ? ? 1 ? ?
5 1 3 ? ? ?
? ? ? 1 ? ?
? ? ? ? ? ?
1 1 ? ? ? ?
? ? 3 ? ? 1
? ? 1 ? 5 ?
? 5 ? 3 ? ?
[Legend: interaction scores 1, 3 and 5]
▣ Predict missing interactions

Matrix Factorization
[Figure: training approximates the sparse Users × Items score matrix as the product of a Users × Latent Factors matrix and a Latent Factors × Items matrix; the unknown factor entries (?) are learned from the observed scores]
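
In a matrix factorization, the predicted score for a (user, item) pair is the dot product of that user's latent vector and that item's latent vector. A minimal sketch, with made-up factor values and three latent factors:

```python
def predict_score(user_factors, item_factors):
    """Predicted interaction score = dot product of the two latent vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


# Hypothetical latent vectors (three latent factors each).
user_vector = [1.0, 2.0, 0.5]
item_vector = [2.0, 0.5, 4.0]
score = predict_score(user_vector, item_vector)
```

Because every user and every item has a vector, a score can be estimated for any pair, including those that were missing from the original matrix.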

Training
▣ Input:
□ Sparse representation of the matrix as (user, product, score) tuples
□ Each tuple represents an interaction score between a user and a product
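
As an illustration of the sparse input format, the snippet below turns a small dense matrix (where `None` marks a missing interaction) into (user, product, score) tuples. The two rows are taken from the example matrix shown earlier; using 0-based row and column positions as IDs is an assumption made for the sketch.

```python
# Dense view of a tiny user x product matrix; None marks a missing interaction.
dense = [
    [None, None, None, 1, None, None],
    [5, 1, 3, None, None, None],
]

# Sparse representation: one (user, product, score) tuple per observed cell.
triples = [
    (user, product, float(score))
    for user, row in enumerate(dense)
    for product, score in enumerate(row)
    if score is not None
]
```

Only four tuples are needed for these twelve cells, which is what makes the format practical for a catalogue of ~230,000 products.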

Training
▣ Output:
□ User Features: mapping users to latent features
□ Product Features: mapping products to latent features
□ Estimation of interaction scores

Alternating Least Squares (ALS)
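
ALS alternates between two steps: hold the item factors fixed and solve a regularized least-squares problem for each user, then hold the user factors fixed and solve for each item. The toy sketch below uses a single latent factor so that each solve is a one-line formula; it fits observed cells only and is not Spark's implementation. The function name and defaults are hypothetical.

```python
def als_rank1(triples, n_users, n_items, reg=0.01, iterations=10):
    """Toy ALS with one latent factor, fitted on observed (user, item, score) cells."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Step 1: item factors fixed, closed-form update of each user factor.
        for user in range(n_users):
            seen = [(item, r) for (usr, item, r) in triples if usr == user]
            num = sum(r * v[item] for item, r in seen)
            den = reg + sum(v[item] ** 2 for item, r in seen)
            u[user] = num / den
        # Step 2: user factors fixed, closed-form update of each item factor.
        for item in range(n_items):
            seen = [(usr, r) for (usr, itm, r) in triples if itm == item]
            num = sum(r * u[usr] for usr, r in seen)
            den = reg + sum(u[usr] ** 2 for usr, r in seen)
            v[item] = num / den
    return u, v


# Three observed cells of a 2 x 2 matrix; cell (1, 1) is missing.
observed = [(0, 0, 2.0), (0, 1, 4.0), (1, 0, 1.0)]
u, v = als_rank1(observed, n_users=2, n_items=2, reg=1e-6, iterations=20)
missing_estimate = u[1] * v[1]
```

With more than one latent factor each update becomes a small linear system per user or item, and those per-user and per-item solves are independent of each other, which is why the method parallelizes well.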

Implicit Collaborative Filtering

▣ Difficulties:
□ How to interpret missing relations between users and products?
If a user didn’t click on an item, does it mean that the user doesn’t like it?
Maybe they just haven’t seen it yet?
□ What values should we use for missing relations?
Should we replace them with 0?
Should we replace them with the mean/median?
▣ Methods designed for explicit feedback (i.e. product ratings) can’t be applied to our case!

▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren, Volinsky)
▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application
(Google the title to read it for free on the author’s page)
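
In the implicit-feedback formulation of Hu, Koren and Volinsky, each raw score r is split into a binary preference (did the user interact at all?) and a confidence that grows linearly with r, scaled by alpha. A minimal sketch of that mapping; the alpha value used here is purely illustrative and would need tuning:

```python
def preference_and_confidence(r, alpha=40.0):
    """Map a raw implicit score r to (preference, confidence)."""
    preference = 1.0 if r > 0 else 0.0    # 1 if any interaction was observed
    confidence = 1.0 + alpha * r          # missing feedback keeps a minimal confidence of 1
    return preference, confidence


no_feedback = preference_and_confidence(0)
strong_signal = preference_and_confidence(3, alpha=2.0)
```

A missing interaction is therefore treated as a weak "no" rather than a certain one, and alpha controls how much more an observed interaction is trusted.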

Training
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel
Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.
Parameters:
ratings - RDD of (userID, productID, rating) pairs
rank - number of features to use
iterations - number of iterations of ALS (recommended: 10-20)
lambda - regularization factor (recommended: 0.01)
blocks - level of parallelism to split computation into
alpha - confidence parameter
seed - random seed

ALS for Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# sc is an existing SparkContext (e.g. the one provided by the pyspark shell)
# Load and parse the data: one "user,product,rating" record per line
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

ALS for Scala
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

Validation and Parameter Tuning

Measuring prediction Performance
▣ In order to select good parameters for our model we designed a validation benchmark
▣ We based it on a relatively small dataset so that we could run a significant number of tests
▣ We chose to measure and minimize the RMSE*:
□ used by default in ALS
□ punishes big errors
□ error is on the same scale as the rating unit
□ common metric for CF
* RMSE - Root Mean Square Error
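
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted scores. A small plain-Python sketch (the function names are illustrative) that also shows how one large error weighs more under RMSE than under a mean absolute error:

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error, shown only for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [1.0, 1.0, 1.0, 1.0]
predicted = [1.0, 1.0, 1.0, 5.0]    # three perfect predictions, one big miss
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

Here the single miss of 4 points gives an RMSE twice as large as the MAE, which is the "punishes big errors" property listed above.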

Measuring prediction Performance
import math
from pyspark.mllib.recommendation import ALS

# small_ratings_data: RDD of (user, product, rating) tuples prepared beforehand
# split the dataset randomly into a training set and a validation set
training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))

ranks = [4, 8, 12]
errors = []
min_error = float('inf')
best_rank = -1

# measure the RMSE on the validation set for each candidate rank
for rank in ranks:
    # the keyword is lambda_ because lambda is a reserved word in Python
    model = ALS.train(training_RDD, rank, iterations=10, lambda_=0.1, seed=0)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)

Output:
For rank 4 the RMSE is 0.963681878574
The best model was trained with rank 8

Deployment
▣ Training a model is actually pretty fast
▣ Deploying is slow
□ We decided that all users will get a top-n recommendation
□ This recommendation is stored in a DB
▣ We need to make a fresh recommendation for every user - there are 4 million users. In Python:
def recommendProducts(self, user, num):
    """
    Recommends the top "num" number of products for a given user and returns a list
    of Rating objects sorted by the predicted rating in descending order.
    """
    return list(self.call("recommendProducts", user, num))
▣ This call took around 20 ms - pretty quick
□ Calling this function 4M times ≈ 1 day
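
The "1 day" figure follows from simple arithmetic on the numbers quoted above, assuming the calls run one after another:

```python
users = 4_000_000           # number of users, from the slide
seconds_per_call = 0.020    # about 20 ms per recommendProducts call

total_seconds = users * seconds_per_call
total_hours = total_seconds / 3600    # roughly 22 hours, i.e. close to a full day
```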

Deployment
▣ I wasn’t the only one that needed this feature ...

Deployment
▣ Solution: extract the User / Product features, then apply matrix multiplication and sorting directly on the RDD data, batch by batch:
users_rdd = model.userFeatures()
products_rdd = model.productFeatures()
…
from joblib import Parallel, delayed
Parallel(n_jobs=cores, verbose=1000)(
    delayed(prepare_recommendation)(user_features_batch, gender)
    for user_features_batch in nested_user_features)
...
user_features_batch.dot(product_features_T)
This was about 10 times faster than calling recommendProducts
▣ Starting from Spark 1.6, recommendProductsForUsers is implemented for Python
□ This is where Scala has an advantage over Python!
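
The batch idea can be illustrated without Spark or joblib: score every product for each user in a batch with a dot product, then keep only the n best. The sketch below uses plain Python lists and `heapq` instead of the matrix multiplication used in the talk; the function name and the sample vectors are hypothetical.

```python
import heapq


def top_n_for_batch(user_batch, product_features, n):
    """Score every product for each user in the batch and keep the n best product IDs."""
    recommendations = {}
    for user_id, user_vector in user_batch:
        scores = (
            (sum(a * b for a, b in zip(user_vector, product_vector)), product_id)
            for product_id, product_vector in product_features
        )
        best = heapq.nlargest(n, scores)    # sorted by score, highest first
        recommendations[user_id] = [product_id for _, product_id in best]
    return recommendations


# Hypothetical (id, latent vector) pairs, as userFeatures() / productFeatures() would provide.
user_batch = [(1, [1.0, 0.0]), (2, [0.0, 1.0])]
product_features = [(10, [0.9, 0.1]), (11, [0.2, 0.8]), (12, [0.5, 0.5])]
top2 = top_n_for_batch(user_batch, product_features, n=2)
```

Each batch is independent of the others, so batches can be handed to separate workers, which is the role joblib plays in the snippet above.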

Discussing Collaborative Filtering

Domain-specific discussion
▣ Pros
□ Helps us to find non-obvious relations between users and products
□ High diversity and coverage of item catalogue
□ Using an unsupervised method we project to a low-dimensional space:
Latent Factor 1 = 20% red boots + 30% green sneakers + …
Latent Factor 2 = 15% adidas sneakers + 35% comfy boots + ...
➔ Embodies “deep” preferences (fashion, style, ...)
▣ Cons
□ Unpredictable results:
e.g. user never shopped for red boots - why is it recommended?
□ Can be interpreted as intrusion to the users’ privacy through (a machine

Machine Learning / Big Data discussion
▣ Pros
□ Training the model is quick thanks to the low dimensionality of the latent features
□ Linear model with a closed-form solution (“easy!”)
□ No cold-start problem (vs. User-based CF)
□ Training is parallelizable: Hadoop Friendly
▣ Cons
□ Computationally heavy compared with Content-Based approaches
□ Unable to fit non-linear relations (polynomial tricks can’t be applied)

Guess what?
We hire!
Data Engineers warmly welcomed

Thank You!
www.equancy.com
47 rue de Chaillot - 75116 Paris
Koby Karp - kkarp@equancy.com
Hervé Mignot - herve.mignot@equancy.com
