At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Nick Pentreath of the Spark Technology Center teamed up with Jean-François Puget of IBM Analytics to deliver a talk called "Creating an end-to-end Recommender System with Apache Spark and Elasticsearch".
Jean-François and Nick started with a look at the workflow for recommender systems and machine learning, then moved on to data modeling and using Spark ML for collaborative filtering. They closed with a discussion of deploying and scoring the recommender models, including a demo.
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch - Nick Pentreath & Jean-François Puget
1. Spark Technology Center
October 27, 2016
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch
Jean-François Puget
Nick Pentreath
2. Spark Technology Center
About
§ @JFPuget
§ Distinguished Engineer, IBM Machine Learning & Optimization
§ @MLnick
§ Principal Engineer, IBM Spark Technology Center
§ Apache Spark PMC
3. Spark Technology Center
Agenda
§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark & Elasticsearch?
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models
§ Demo
7. Spark Technology Center
The Machine Learning Workflow: Reality
Data
• Historical
• Streaming
Ingest
Data Processing
• Feature transformation & engineering
Model Training
• Model selection & evaluation
Deploy
• Pipelines, not just models
• Versioning
Live System
• Predict given new data
• Monitoring & live evaluation
Feedback Loop
Tools: Spark DataFrames, Spark ML, Various ???, Stream (Kafka)
Missing piece!
8. Spark Technology Center
The Machine Learning Workflow: Recommender Version
Data Ingest
Data Processing
• Aggregation
• Handle implicit data
Model Training
• ALS
• Ranking-style evaluation
Deploy
• Model size & complexity
Live System
• User & item recommendations
• Monitoring, filters
Feedback => another Event Type
Tools: Spark DataFrames, Spark ML, Elasticsearch, Stream (Kafka)
Elasticsearch
• User & Item Metadata
• Events
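The "Aggregation" and "Handle implicit data" steps above can be pictured as collapsing raw events into one preference value per user and item pair. The following is a minimal sketch in plain Python, not the pipeline from the talk: the event tuples, the `EVENT_WEIGHTS` table, and the `aggregate_events` helper are all hypothetical, and the talk does not state any particular weighting.

```python
from collections import defaultdict

# Hypothetical weights per event type; the talk does not specify any values.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(events, weights=EVENT_WEIGHTS):
    """Sum event weights into one implicit preference value per (user, item)."""
    totals = defaultdict(float)
    for user, item, event_type in events:
        # Unknown event types contribute nothing in this sketch.
        totals[(user, item)] += weights.get(event_type, 0.0)
    return dict(totals)


# Made-up events in the form (user id, item id, event type).
events = [
    ("u1", "i1", "view"),
    ("u1", "i1", "view"),
    ("u1", "i1", "purchase"),
    ("u2", "i1", "cart"),
]
ratings = aggregate_events(events)
print(ratings)
```

Each resulting (user, item, value) triple is the kind of input a collaborative filtering model such as ALS would be trained on. In a real deployment this aggregation would more likely run as a Spark DataFrame group-by than as a Python loop.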
12. Spark Technology Center
Anatomy of a User Event: User Interactions
Implicit preference data
• Page view
• eCommerce - cart, purchase
• Media – preview, watch, listen
Intent data
• Search query
Explicit preference data
• Rating
• Review
Social network interactions
• Like
• Share
• Follow
25. Spark Technology Center
Recommendation: Can we use the same machinery?
Stages: Analysis, Term vectors, Scoring, Ranking
User (or item) vector: 1.2 ⋯ −0.2 0.3
Term vectors:
0 0 ⋯ 0 1 ⋯
0 1 ⋯ 1 1 ⋯
1 1 ⋯ 0 0 ⋯
1 0 ⋯ 0 1 ⋯
Similarity: Dot product & cosine similarity … the same as we need for recommendations!
Ranking: Sort results
26. Spark Technology Center
Elasticsearch Term Vectors
Delimited Payload Filter
Raw vector: 1.2 ⋯ −0.2 0.3
Term vector with payloads: 0|1.2 ⋯ 3|-0.2 4|0.3
Custom analyzer
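The raw-vector to term-vector step on this slide amounts to writing each vector component as an `index|value` token, which a delimited payload filter can then split into a term (the index) and a payload (the value). A minimal sketch of that encoding in plain Python follows. The helper names `to_payload_string` and `from_payload_string` are hypothetical, and the code only produces and parses the string; it does not call Elasticsearch.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as whitespace-separated `index|value` tokens."""
    return " ".join(f"{i}{delimiter}{v}" for i, v in enumerate(vector))


def from_payload_string(text, delimiter="|"):
    """Decode `index|value` tokens back into a dense list of floats."""
    pairs = [token.split(delimiter) for token in text.split()]
    vector = [0.0] * len(pairs)
    for index, value in pairs:
        vector[int(index)] = float(value)
    return vector


# Made-up 4-dimensional factor vector for illustration.
encoded = to_payload_string([1.2, 0.5, -0.2, 0.3])
print(encoded)  # 0|1.2 1|0.5 2|-0.2 3|0.3
```

The `|` delimiter matches the example on the slide. The decoder assumes every index from 0 up to the vector length appears exactly once.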
27. Spark Technology Center
Elasticsearch Scoring
Custom scoring function
• Native script (Java), compiled for speed
• Scoring function computes dot product by:
§ For each document vector index (“term”), retrieve payload
§ score += payload * query(i)
• Normalize with query vector norm and document vector norm for cosine similarity (“similar items”)
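The scoring logic described above can be illustrated outside Elasticsearch. The sketch below is plain Python rather than the native Java script from the talk. It assumes the document vector is stored as `index|value` tokens, and the function names (`parse_payloads`, `dot_product_score`, `cosine_score`) are hypothetical.

```python
import math


def parse_payloads(term_vector, delimiter="|"):
    """Turn 'index|value' tokens into a {index: payload} mapping."""
    payloads = {}
    for token in term_vector.split():
        index, value = token.split(delimiter)
        payloads[int(index)] = float(value)
    return payloads


def dot_product_score(term_vector, query):
    """For each document term, retrieve its payload and add payload * query[i]."""
    score = 0.0
    for i, payload in parse_payloads(term_vector).items():
        score += payload * query[i]
    return score


def cosine_score(term_vector, query):
    """Dot product normalized by the document and query vector norms."""
    payloads = parse_payloads(term_vector)
    doc_norm = math.sqrt(sum(v * v for v in payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # Convention chosen for this sketch when a vector is all zeros.
    return dot_product_score(term_vector, query) / (doc_norm * query_norm)


# Made-up item and user vectors for illustration.
item = "0|1.0 1|2.0 2|2.0"
user = [2.0, 0.0, 1.0]
print(dot_product_score(item, user))  # 4.0
print(cosine_score(item, user))
```

The dot product corresponds to scoring items for a user, while the normalized version corresponds to the "similar items" case, where an item vector is used as the query.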
28. Spark Technology Center
Recommendation: Can we use the same machinery?
Stages: Analysis, Term vectors, Scoring, Ranking
User (or item) vector: 1.2 ⋯ −0.2 0.3
Analysis: Delimited payload filter
Term vectors:
−1.1 1.3 ⋯ 0.4
1.2 −0.2 ⋯ 0.3
0.5 0.7 ⋯ −1.3
0.9 1.4 ⋯ −0.8
Scoring: Custom scoring function
Ranking: Sort results