Recommendations with Spark
Hi! I’m Koby
▣ Data Scientist at Equancy
□ Previously: Kpler, Engie
▣ Python Dev
□ scikit-learn / pandas / Jupyter
□ Sometimes I use R
▣ I used Hadoop before for data pipelines
▣ My first project doing distributed ML!
Hello, my name is Hervé!
▣ Equancy Partner & Chief Scientist
▣ In charge of Data Technologies
□ Data Engineering
□ Data Science
□ Innovating with data
▣ PhD in Machine Learning many years ago
Recommender Systems
Recommenders: What for?
▣ Only one occasion to interact with customers
□ Which marketing message to choose?
▣ Personalized User Experience
□ Improved Experience!
▣ No information overload
□ ~230,000 Products
Why does personalization matter?
Because no personalization is ugly...
Recommendation algorithms
Three different recommendation systems
▣ Pages: Homepage, Product Page, Cart
▣ Collaborative Filtering (Unsupervised Learning)
▣ Frequently Bought-Together Prediction (Supervised Learning)
▣ Content-Based Filtering (Correlation Maximization)
Business Rules
Business Inputs
▣ Score should be based on three factors:
□ Interaction type - a purchase is more important than a product view
□ Time (decay) - a product purchased recently will have more impact than a product purchased in the distant past
□ Season - a product purchased during another season will have less impact
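One way the three factors above could be combined into a single interaction score is sketched below. The weights, the half-life, the off-season factor, and the name `interaction_score` are illustrative assumptions, not values from the project.

```python
# Hypothetical weights: a purchase counts more than a product view.
INTERACTION_WEIGHTS = {"view": 1.0, "purchase": 5.0}
HALF_LIFE_DAYS = 90.0      # assumed: the weight halves every 90 days
OFF_SEASON_FACTOR = 0.5    # assumed penalty for an interaction in another season


def interaction_score(kind, age_days, same_season):
    """Combine interaction type, time decay, and season into one score."""
    base = INTERACTION_WEIGHTS[kind]
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)   # exponential time decay
    season = 1.0 if same_season else OFF_SEASON_FACTOR
    return base * decay * season


recent_purchase = interaction_score("purchase", age_days=0, same_season=True)
old_purchase = interaction_score("purchase", age_days=90, same_season=True)
off_season_view = interaction_score("view", age_days=0, same_season=False)
```

Multiplying the factors keeps each business input independently tunable; an additive or learned combination would be an equally reasonable choice.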
Business Rules
▣ The following items should be filtered out:
□ Purchased recently, or very similar to a recent purchase
□ Not in current season
□ Not user’s gender
□ Not in stock
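The filtering rules can be expressed as one predicate applied to each candidate product. This is only a sketch: the `Product` fields and the `keep` function are hypothetical, and the "very similar" rule is left out because the slides do not say how similarity is measured.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical product record; the field names are illustrative.
    product_id: int
    season: str
    gender: str
    in_stock: bool


def keep(product, user_gender, current_season, recently_purchased_ids):
    """Return True if the product passes every business rule."""
    return (
        product.product_id not in recently_purchased_ids
        and product.season == current_season
        and product.gender == user_gender
        and product.in_stock
    )


catalogue = [
    Product(1, "winter", "F", True),
    Product(2, "summer", "F", True),    # not in current season
    Product(3, "winter", "M", True),    # not the user's gender
    Product(4, "winter", "F", False),   # not in stock
    Product(5, "winter", "F", True),    # purchased recently
]
allowed = [p.product_id for p in catalogue
           if keep(p, user_gender="F", current_season="winter",
                   recently_purchased_ids={5})]
```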
Collaborative Filtering
[Figure: sparse user × product matrix of observed interaction scores (legend: 1, 3, 5)]
▣ Map users to products in a matrix
? ? ? 1 ? ?
5 1 3 ? ? ?
? ? ? 1 ? ?
? ? ? ? ? ?
1 1 ? ? ? ?
? ? 3 ? ? 1
? ? 1 ? 5 ?
? 5 ? 3 ? ?
[Legend: interaction scores 1, 3, 5]
▣ Predict missing interactions
Training
[Figure: the Users × Items interaction matrix is written as the product (X) of a Users × Latent Factors matrix and a Latent Factors × Items matrix; the entries of both factor matrices are unknown (?) before training]
Matrix Factorization
Training
Matrix Factorization
▣ Input:
□ Sparse representation of matrix (tuples)
□ Representation of an interaction score
between user and product
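As an illustration of the tuple input, the 8 × 6 example matrix shown earlier can be written as (user, product, score) tuples. The zero-based ids are an assumption made for this sketch.

```python
# Each tuple is (user_id, product_id, score); only observed interactions are stored.
interactions = [
    (0, 3, 1.0),
    (1, 0, 5.0), (1, 1, 1.0), (1, 2, 3.0),
    (2, 3, 1.0),
    (4, 0, 1.0), (4, 1, 1.0),
    (5, 2, 3.0), (5, 5, 1.0),
    (6, 2, 1.0), (6, 4, 5.0),
    (7, 1, 5.0), (7, 3, 3.0),
]

n_users, n_products = 8, 6

# Expand to a dense matrix, with None standing for "no interaction observed".
dense = [[None] * n_products for _ in range(n_users)]
for user, product, score in interactions:
    dense[user][product] = score

# Fraction of cells that actually hold a score.
density = len(interactions) / (n_users * n_products)
```

Only 13 of the 48 cells are filled here; on a real catalogue the fraction is far smaller, which is why the tuple form is the practical input.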
Training
Matrix Factorization
▣ Output:
□ User Features: mapping users to latent features
□ Product Features: mapping products to latent features
□ Estimation of interaction scores
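To make the three outputs concrete, here is a small NumPy sketch of alternating least squares on the 8 × 6 example matrix. It is not Spark's implementation: the rank, the regularization value, and the plain ridge penalty are assumptions chosen for readability.

```python
import numpy as np


def als(R, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor R ~ U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Fix V and solve a small ridge regression for each user.
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix U and solve the same kind of problem for each item.
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V


# The example matrix; 0 marks a missing interaction.
R = np.array([
    [0, 0, 0, 1, 0, 0],
    [5, 1, 3, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 1],
    [0, 0, 1, 0, 5, 0],
    [0, 5, 0, 3, 0, 0],
], dtype=float)
mask = R > 0

U, V = als(R, mask)        # user features, product features
estimates = U @ V.T        # estimated score for every (user, product) pair
observed_rmse = np.sqrt(np.mean((estimates[mask] - R[mask]) ** 2))
```

`U` and `V` play the role of the user and product features, and `estimates` fills in every missing cell. A user with no interactions at all (the fourth row) gets an all-zero feature vector in this sketch.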
[Figure: Users × Items matrix = (Users × Latent Factors) X (Latent Factors × Items), with the factor entries still unknown (?)]
Alternating Least Squares (ALS)
Implicit Collaborative Filtering
Implicit Collaborative Filtering
▣ Difficulties:
□ How to interpret missing relations between users and products?
If a user didn’t click on the item, does it mean that the user doesn’t like it?
Maybe the user just hasn’t seen it yet?
□ What values should we use for missing relations?
Should we replace them with 0?
Should we replace them with the mean/median?
▣ Methods for explicit feedback (i.e. product rating) can’t be applied to our case!
Implicit Collaborative Filtering
▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren, Volinsky)
(Google the title to read it for free on the author’s page)
▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application
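A short sketch of the confidence idea, following the usual implicit-feedback formulation: the raw score r becomes a binary preference plus a confidence that grows with r. The value of `alpha` and the helper names are illustrative assumptions.

```python
def preference(r):
    """Binary preference: 1 if any interaction was observed, else 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha):
    """Confidence in that preference grows linearly with the raw score r."""
    return 1.0 + alpha * r


alpha = 40.0  # illustrative value, to be tuned on validation data
table = [(r, preference(r), confidence(r, alpha)) for r in (0.0, 1.0, 5.0)]
```

A missing interaction is therefore treated as a "dislike" with minimal confidence (1.0), not as a hard zero rating, and `alpha` controls how much more weight observed interactions receive.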
Implementation in Spark
Training
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel
Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.
□ ratings: RDD of (userID, productID, rating) pairs
□ rank: number of features to use
□ iterations: number of iterations of ALS (recommended: 10-20)
□ lambda: regularization factor (recommended: 0.01)
□ blocks: level of parallelism to split computation into
□ alpha: confidence parameter
□ seed: random seed
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = (data.map(lambda l: l.split(','))
               .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
# Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
ALS for Python
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)
// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
ALS for Scala
Validation and Parameter Tuning
Measuring prediction Performance
▣ To select good parameters for our model, we designed a validation benchmark
▣ We based it on a relatively small dataset so that we could run a significant number of tests
▣ We chose to measure and minimize the RMSE*:
□ used by default in ALS
□ punishes big errors
□ error is in the scale of the rating unit
□ common metric for CF
* RMSE - Root Mean Square Error
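A plain-Python sketch of the metric, with made-up numbers, to show why RMSE punishes big errors: both prediction sets below have the same total absolute error (4), but the one with a single large miss scores worse.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [5.0, 1.0, 3.0, 1.0]
small_errors = [4.0, 2.0, 2.0, 2.0]    # every prediction is off by 1
one_big_error = [5.0, 1.0, 3.0, 5.0]   # a single prediction is off by 4

rmse_small = rmse(actual, small_errors)
rmse_big = rmse(actual, one_big_error)
```

The result is expressed in the same unit as the scores themselves, which makes it easy to read against the 1 to 5 scale.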
Measuring prediction Performance
import math
from pyspark.mllib.recommendation import ALS

# Split the dataset randomly into a training set and a validation set
# (small_ratings_data is assumed to be an RDD of (user, product, rating) tuples)
training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)
# Keep only the (user, product) pairs for prediction
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
ranks = [4, 8, 12]
errors = []
# Measure the error on the validation set for each rank
min_error = float('inf')
best_rank = -1
for rank in ranks:
    # The regularization argument is spelled lambda_ because lambda is a Python keyword
    model = ALS.train(training_RDD, rank, seed=0, iterations=10, lambda_=0.1)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank
print('The best model was trained with rank %s' % best_rank)
For rank 4 the RMSE is 0.963681878574
For rank 8 the RMSE is 0.96250475933
For rank 12 the RMSE is 0.971647563632
The best model was trained with rank 8
Deployment
Deployment
▣ Training a model is actually pretty fast
▣ Deploying is slow
□ We decided that all users will get a top-n
recommendation
□ This recommendation is stored in a DB
▣ We need to make a fresh recommendation for every
user - there are 4 million users. In Python:
def recommendProducts(self, user, num):
    """
    Recommends the top "num" number of products for a given user and returns a list
    of Rating objects sorted by the predicted rating in descending order.
    """
    return list(self.call("recommendProducts", user, num))
▣ This call took around 20 ms - pretty quick
□ Calling this function 4M times takes about 1 day (4,000,000 × 20 ms = 80,000 s ≈ 22 hours)
Deployment
▣ I wasn’t the only one that needed this feature ...
Deployment
▣ Solution: extract the User / Product features, then apply matrix multiplication and sorting directly, batch by batch:
users_rdd = model.userFeatures()
products_rdd = model.productFeatures()
…
from joblib import Parallel, delayed
Parallel(n_jobs=cores, verbose=1000)(delayed(prepare_recommendation)(user_features_batch, gender)
for user_features_batch in nested_user_features)
...
user_features_batch.dot(product_features_T)
This was about 10 times faster than calling recommendProducts
▣ Starting from Spark 1.6 recommendProductsForUsers is implemented for Python
□ This is where Scala has an advantage over Python!
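The batch approach can be sketched in NumPy as follows. The slides do not show the body of `prepare_recommendation`, so `top_n_for_batch`, the toy feature values, and the product ids are assumptions made for illustration.

```python
import numpy as np


def top_n_for_batch(user_features_batch, product_features, product_ids, n):
    """Score every product for a batch of users and return the best n ids per user."""
    # (batch_size x rank) . (rank x n_products) -> (batch_size x n_products)
    scores = user_features_batch.dot(product_features.T)
    # argpartition finds the n highest scores per row without a full sort.
    top = np.argpartition(-scores, n - 1, axis=1)[:, :n]
    # Sort only those n candidates by descending score.
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    ranked = np.take_along_axis(top, order, axis=1)
    return product_ids[ranked]


# Toy latent features: 2 users, 4 products, rank 2.
user_features_batch = np.array([[1.0, 0.0], [0.0, 1.0]])
product_features = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.3]])
product_ids = np.array([101, 102, 103, 104])

recommended = top_n_for_batch(user_features_batch, product_features, product_ids, n=2)
```

One matrix product scores a whole batch of users at once, which is where the speed-up over per-user calls comes from; the batch size then becomes a trade-off between memory and throughput.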
Discussing Collaborative Filtering
Domain-specific discussion
▣ Pros
□ Helps us to find non-obvious relations between users and products
□ High diversity and coverage of item catalogue
□ Using an unsupervised method we project to a low-dimensional space:
Latent Factor 1 = 20% red boots + 30% green sneakers + …
Latent Factor 2 = 15% Adidas sneakers + 35% comfy boots + ...
➔ Embodies “deep” preferences (fashion, style, ...)
▣ Cons
□ Unpredictable results:
e.g. user never shopped for red boots - why is it recommended?
□ Can be interpreted as intrusion to the users’ privacy through (a machine
Machine Learning / Big Data discussion
▣ Pros
□ Training of the model is quick thanks to the low dimensionality of the latent features
□ Linear model with a closed-form solution (“easy!”)
□ No cold-start problem (vs. User-based CF)
□ Training is parallelizable: Hadoop Friendly
▣ Cons
□ Computationally heavy compared with Content-Based approaches
□ Unable to fit non-linear relations (polynomial tricks can’t be applied)
Guess what?
We hire!
Data Engineers warmly welcomed
QUESTIONS & ANSWERS
Thank You!
www.equancy.com
47 rue de Chaillot - 75116 Paris
Koby Karp
Hervé Mignot
kkarp@equancy.com
herve.mignot@equancy.com

Hadoop France meetup Feb2016: Recommendations with Spark

 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 

Hadoop France meetup Feb2016 : recommendations with spark

  • 2. Hi! I’m Koby ▣ Data Scientist at Equancy □ Previously: Kpler, Engie ▣ Python Dev □ scikit-learn / pandas / Jupyter □ Sometimes I use R ▣ I used Hadoop before for data pipelines ▣ My first project doing distributed ML!
  • 3. Hello, my name is Hervé! ▣ Equancy Partner & Chief Scientist ▣ In charge of Data Technologies □ Data Engineering □ Data Science □ Innovating with data ▣ PhD in Machine Learning many years ago
  • 6. Recommenders: What for? ▣ Only one occasion to interact with customers □ Which marketing message to choose? ▣ Personalized User Experience □ Improved Experience! ▣ No information overload □ ~230,000 Products
  • 7. Why does personalization matter? Because no personalization is ugly...
  • 9. Three different recommendation systems ▣ Pages: Homepage / Product Page / Cart ▣ Methods: Collaborative Filtering (Unsupervised Learning) / Frequently Bought-Together Prediction (Supervised Learning) / Content-Based Filtering (Correlation Maximization)
  • 12. Business Inputs ▣ Score should be based on three factors: □ Interaction type - a purchase is more important than a product view □ Time (decay) - a product purchased recently will have more impact than a product purchased in the distant past □ Season - a product purchased during another season will have less impact
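The three business inputs on this slide can be folded into a single interaction score. The sketch below shows one possible way to do it and is not the formula used in the project: the weights, the 90-day half-life, the off-season penalty and the name `interaction_score` are all illustrative assumptions.

```python
from datetime import date

# Hypothetical weights: a purchase counts more than a product view.
INTERACTION_WEIGHTS = {"view": 1.0, "purchase": 5.0}
HALF_LIFE_DAYS = 90.0      # assumed: an interaction loses half its weight every 90 days
OFF_SEASON_FACTOR = 0.5    # assumed penalty for an interaction from another season


def interaction_score(kind, event_date, today, same_season):
    """Combine interaction type, time decay and season into one score."""
    age_days = (today - event_date).days
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    season = 1.0 if same_season else OFF_SEASON_FACTOR
    return INTERACTION_WEIGHTS[kind] * decay * season


today = date(2016, 2, 1)
recent_purchase = interaction_score("purchase", date(2016, 1, 25), today, True)
old_purchase = interaction_score("purchase", date(2015, 2, 1), today, True)
recent_view = interaction_score("view", date(2016, 1, 25), today, True)
off_season = interaction_score("purchase", date(2016, 1, 25), today, False)
```

With these assumed constants, a recent purchase outranks both an old purchase and a recent view, and an off-season purchase counts for half of an in-season one.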
  • 13. Business Rules ▣ The following items should be filtered out: □ Purchased recently, or very similar to a recent purchase □ Not in the current season □ Not for the user’s gender □ Not in stock
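The filtering rules can be applied as a post-processing step on the recommender's candidate list. This is a minimal sketch under assumed data shapes: the product fields (`id`, `season`, `gender`, `stock`), the `"unisex"` value and the function name are hypothetical, and the "very similar" check is left out because the slides do not say how similarity is measured.

```python
def apply_business_rules(candidates, user, recently_purchased, current_season):
    """Drop candidate products that violate any of the business rules."""
    kept = []
    for product in candidates:
        if product["id"] in recently_purchased:
            continue  # purchased recently
        if product["season"] != current_season:
            continue  # not in the current season
        if product["gender"] not in (user["gender"], "unisex"):
            continue  # not for the user's gender
        if product["stock"] <= 0:
            continue  # not in stock
        kept.append(product)
    return kept


products = [
    {"id": "boots", "season": "winter", "gender": "women", "stock": 3},
    {"id": "sandals", "season": "summer", "gender": "women", "stock": 8},
    {"id": "scarf", "season": "winter", "gender": "unisex", "stock": 0},
    {"id": "coat", "season": "winter", "gender": "men", "stock": 5},
    {"id": "gloves", "season": "winter", "gender": "unisex", "stock": 2},
    {"id": "hat", "season": "winter", "gender": "women", "stock": 4},
]
user = {"gender": "women"}
kept = apply_business_rules(products, user, recently_purchased={"hat"}, current_season="winter")
kept_ids = [p["id"] for p in kept]
```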
  • 15. ▣ Map users to products in a matrix [Figure: sparse user × product matrix of interaction scores; legend: ► 1 ► 3 ► 5]
  • 16. ▣ Predict missing interactions [Figure: the same matrix with every missing interaction marked "?"; legend: ► 1 ► 3 ► 5]
  • 17. Matrix Factorization [Figure: Training: the sparse Users × Items matrix of known scores = (Users × Latent Factors matrix) X (Latent Factors × Items matrix); the entries of both factor matrices are unknown ("?")]
  • 18. Training Matrix Factorization ▣ Input: □ Sparse representation of the matrix (tuples) □ Each tuple represents an interaction score between a user and a product
  • 19. Training Matrix Factorization ▣ Output: □ User features, mapping users to latent features □ Product features, mapping products to latent features □ Estimation of interaction scores [Figure: Users × Items matrix as the product X of the two latent-factor matrices]
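To make the output concrete: once users and products are mapped to latent features, the estimated interaction score for a (user, product) pair is the dot product of their two feature vectors. The toy example below uses made-up rank-2 vectors; the ids and values are purely illustrative and do not come from the talk.

```python
# Hypothetical latent features (rank 2) for two users and three products.
user_features = {"u1": [1.0, 0.5], "u2": [0.0, 2.0]}
product_features = {"p1": [2.0, 0.0], "p2": [1.0, 1.0], "p3": [0.0, 1.5]}


def predict(user, product):
    """Estimated interaction score = dot product of the two latent vectors."""
    return sum(u * p for u, p in zip(user_features[user], product_features[product]))


# Fill in the full users x products matrix, including pairs never observed.
scores = {(u, p): predict(u, p) for u in user_features for p in product_features}
```

Every cell of the matrix gets a score this way, which is how the "?" entries of the sparse input end up with estimates.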
  • 22. Implicit Collaborative Filtering ▣ Difficulties: □ How to interpret missing relations between users and products? If a user didn’t click on an item, does it mean that the user doesn’t like it? Maybe he just didn’t see it yet? □ What values should we use for missing relations? Should we replace them with 0? Should we replace them with the mean/median? ▣ Methods for explicit feedback (i.e. product ratings) can’t be applied to our case!
  • 23. Implicit Collaborative Filtering ▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” by Hu, Koren and Volinsky (Google the title to read it for free on the author’s page) ▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application
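The confidence idea can be illustrated with the formulation from Hu, Koren and Volinsky, which the Spark MLlib documentation cites for its implicit-feedback ALS: a raw interaction value r becomes a binary preference p (1 if r > 0, else 0) and a confidence c = 1 + alpha * r. The sketch below only illustrates those two formulas; the value `ALPHA = 40.0` is an assumption for the example, and Spark's internal implementation details may differ.

```python
ALPHA = 40.0  # assumed value; corresponds to the `alpha` argument of trainImplicit


def preference(r):
    """Binary preference: 1 if the user interacted with the item at all."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha=ALPHA):
    """Confidence grows linearly with the amount of observed interaction."""
    return 1.0 + alpha * r


# A missing interaction is a weak "0" (confidence 1), not a hard dislike,
# while a strong interaction is a "1" held with high confidence.
missing = (preference(0.0), confidence(0.0))
strong = (preference(5.0), confidence(5.0))
```

Raising alpha widens the gap between observed and missing interactions, which is the tuning knob the slide refers to.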
  • 25. Training
    def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel
    Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.
    Parameters:
      ratings     RDD of (userID, productID, rating) pairs
      rank        number of features to use
      iterations  number of iterations of ALS (recommended: 10-20)
      lambda      regularization factor (recommended: 0.01)
      blocks      level of parallelism to split computation into
      alpha       confidence parameter
      seed        random seed
  • 26. ALS for Python
    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

    # Load and parse the data (`sc` is the SparkContext provided by the shell)
    data = sc.textFile("data/mllib/als/test.data")
    ratings = (data.map(lambda l: l.split(','))
                   .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))

    # Save and load model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
  • 27. ALS for Scala
    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
    import org.apache.spark.mllib.recommendation.Rating

    // Load and parse the data
    val data = sc.textFile("data/mllib/als/test.data")
    val ratings = data.map(_.split(',') match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 10
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Evaluate the model on rating data
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    val predictions = model.predict(usersProducts).map {
      case Rating(user, product, rate) => ((user, product), rate)
    }
    val ratesAndPreds = ratings.map {
      case Rating(user, product, rate) => ((user, product), rate)
    }.join(predictions)
    val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
      val err = (r1 - r2)
      err * err
    }.mean()
    println("Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
  • 29. Measuring prediction Performance ▣ In order to select good parameters for our model we designed a validation benchmark ▣ We based it on a relatively small dataset to be able to run a large number of tests ▣ We chose to measure and minimize the RMSE*: □ used by default in ALS □ punishes big errors □ error is in the scale of the rating unit □ common metric for CF * RMSE - Root Mean Square Error
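A small worked example shows why RMSE "punishes big errors". Both toy datasets below have the same total absolute error (4 rating units), but concentrating it in one prediction doubles the RMSE. The numbers are invented for illustration only.

```python
import math


def rmse(pairs):
    """Root Mean Square Error over (actual, predicted) pairs."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))


# Four predictions, each off by 1 rating unit.
small_errors = [(3.0, 2.0), (3.0, 4.0), (1.0, 2.0), (5.0, 4.0)]
# Three perfect predictions and one that is off by 4 rating units.
one_big_error = [(3.0, 3.0), (3.0, 3.0), (1.0, 1.0), (5.0, 1.0)]

rmse_small = rmse(small_errors)   # sqrt(4 / 4)  = 1.0
rmse_big = rmse(one_big_error)    # sqrt(16 / 4) = 2.0
```

The result stays in rating units, which is the other property listed on the slide.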
  • 30. Measuring prediction Performance
    import math
    from pyspark.mllib.recommendation import ALS

    # Split the dataset randomly into a training set and a validation set
    training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)
    validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))

    ranks = [4, 8, 12]
    errors = []

    # Measure the error on the validation set for each rank
    min_error = float('inf')
    best_rank = -1
    for rank in ranks:
        # The keyword is `lambda_` because `lambda` is reserved in Python
        model = ALS.train(training_RDD, rank, seed=0, iterations=10, lambda_=0.1)
        predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
        rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
        error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
        errors.append(error)
        print('For rank %s the RMSE is %s' % (rank, error))
        if error < min_error:
            min_error = error
            best_rank = rank

    print('The best model was trained with rank %s' % best_rank)

    Output:
    For rank 4 the RMSE is 0.963681878574
    For rank 8 the RMSE is 0.96250475933
    For rank 12 the RMSE is 0.971647563632
    The best model was trained with rank 8
  • 32. Deployment ▣ Training a model is actually pretty fast ▣ Deploying is slow □ We decided that all users will get a top-n recommendation □ This recommendation is stored in a DB ▣ We need to make a fresh recommendation for every user - there are 4 million users. In Python:
    def recommendProducts(self, user, num):
        """
        Recommends the top "num" number of products for a given user and returns a list
        of Rating objects sorted by the predicted rating in descending order.
        """
        return list(self.call("recommendProducts", user, num))
    ▣ This call was around 20 ms - pretty quick □ calling this function 4M times = 1 day
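The "1 day" figure follows directly from the two numbers on the slide (4 million users, about 20 ms per call), assuming the calls run one after another:

```python
users = 4_000_000      # number of users, from the slide
ms_per_call = 20       # approximate latency of one recommendProducts call

total_seconds = users * ms_per_call / 1000   # 80,000 seconds
total_hours = total_seconds / 3600           # a little over 22 hours
```

That is close to a full day of wall-clock time for a single sequential pass over all users.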
  • 33. Deployment ▣ I wasn’t the only one that needed this feature ...
  • 34. Deployment ▣ Solution: extract the user / product features, then apply the matrix multiplication and the sorting directly, in batches:
    users_rdd = model.userFeatures()
    products_rdd = model.productFeatures()
    …
    from joblib import Parallel, delayed
    Parallel(n_jobs=cores, verbose=1000)(delayed(prepare_recommendation)(user_features_batch, gender) for user_features_batch in nested_user_features)
    ...
    user_features_batch.dot(product_features_T)
    This was about 10 times faster than calling recommendProducts ▣ Starting from Spark 1.6 recommendProductsForUsers is implemented for Python □ This is where Scala has an advantage over Python!
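The batching idea can be sketched without Spark or NumPy: score every product for each user in a batch with a dot product, then keep the n best. This is a pure-Python stand-in, not the authors' `prepare_recommendation` function (which the slide does not show); the helper names, ids and feature values are hypothetical, and the (id, vector) pairs only mimic the shape returned by `userFeatures()` and `productFeatures()`.

```python
import heapq


def top_n_for_batch(user_batch, product_features, n):
    """Score every product for every user in the batch and keep the best n."""
    recommendations = {}
    for user_id, u_vec in user_batch:
        scored = (
            (sum(u * p for u, p in zip(u_vec, p_vec)), product_id)
            for product_id, p_vec in product_features
        )
        best = heapq.nlargest(n, scored)  # highest scores first
        recommendations[user_id] = [product_id for _, product_id in best]
    return recommendations


def batches(items, size):
    """Yield consecutive slices of `items` of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical (id, latent vector) pairs.
product_features = [(10, [2.0, 0.0]), (11, [1.0, 1.0]), (12, [0.0, 1.5])]
user_features = [(1, [1.0, 0.5]), (2, [0.0, 2.0]), (3, [1.0, 0.0])]

all_recs = {}
for batch in batches(user_features, size=2):
    all_recs.update(top_n_for_batch(batch, product_features, n=2))
```

Because each batch is independent, batches can be handed to separate workers, which is the role joblib's `Parallel` plays in the snippet on the slide.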
  • 36. Domain-specific discussion ▣ Pros □ Helps us find non-obvious relations between users and products □ High diversity and coverage of the item catalogue □ Using an unsupervised method we project to a low-dimensional space: Latent Factor 1 = 20% red boots + 30% green sneakers + … Latent Factor 2 = 15% Adidas sneakers + 35% comfy boots + ... ➔ Embodies “deep” preferences (fashion, style, ...) ▣ Cons □ Unpredictable results: e.g. the user never shopped for red boots - why are they recommended? □ Can be interpreted as an intrusion into the users’ privacy through (a machine
  • 37. Machine Learning / Big Data discussion ▣ Pros □ Training of the model is quick thanks to the latent feature low dimensionality □ Linear model with a closed-form solution (“easy!”) □ No cold-start problem (vs. User-based CF) □ Training is parallelizable: Hadoop Friendly ▣ Cons □ Heavy in computation in comparison to Content-Based approaches □ Unable to fit non-linear relations (polynomial tricks can’t be applied)
  • 38. Guess what? We’re hiring! Data Engineers warmly welcomed
  • 40. Thank You! www.equancy.com 47 rue de Chaillot - 75116 Paris Koby Karp Hervé Mignot kkarp@equancy.com herve.mignot@equancy.com