How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a packed 2-day event with thought-provoking keynotes, knowledge-filled sessions, intense workshops, insightful panels, and real-world case studies, engaging the analytics community with the latest methodologies and trends. The conference has the largest speaker-to-attendee ratio, for unmatched networking and learning opportunities.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
1. #AnalyticsStreet @joe_Caserta
Building a Recommendation Engine on Spark
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
2. About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
3. Why Big Data?
Figure. Architecture diagram with these labeled components:
• Sources: Enrollments, Claims, Finance
• ETL into the Traditional EDW: Ad-Hoc Query, Canned Reporting, Traditional BI
• ETL into the Big Data Cluster: Horizontally Scalable Environment - Optimized for Analytics
• Cluster stack: Spark, MapReduce, Pig/Hive, Others… on the Hadoop Distributed File System (HDFS), nodes N1 N2 N3 N4 N5
• NoSQL Databases
• Outputs: Big Data Analytics, Data Science, Ad-Hoc/Canned Reporting
4. What is Spark
• Spark is a fast, general-purpose cluster computing framework.
• Sits on top of Hadoop
• Up to 100 times faster than MapReduce
• In-memory cluster computing – well suited for machine learning
• Provides high-level APIs in Java, Scala and Python. Tools include:
• Spark SQL
• MLlib
• GraphX
• Spark Streaming
Data Science Training: https://exploredatascience.com/
5. Project Objective
• Create a functional recommendation engine to surface
relevant product recommendations to customers.
• Improve Customer Experience
• Increase Customer Retention
• Increase Customer Purchase Activity
• Establish Hadoop with Spark as a high performance, scalable solution
for computing and storage
• Accurately suggest relevant products to customers based on their peers'
behavior
• Integrate existing EDW data with Hadoop natively using an
enterprise-class ETL tool
• Implement an enterprise class business intelligence tool sourcing
directly from Hadoop
6. Hadoop Environment
• Lab Setup
• 10 node cluster - Cloudera
• 1 TB under management with inexpensive commodity hardware
• ETL – Talend
• Load data from Enterprise Data Warehouse into Hadoop
• Efficacy Reporting - Datameer
• Recommendation Engine Built and Tested
• Recommendations are as good as or better than anticipated
• More relevant than possible without Big Data solution
• Algorithms can easily be fine-tuned by adjusting:
• The number of recommendations in the results
• The weighting of the relevancy of the Product
7. The Math Behind Relevance
• Finding ‘Similar’ Objects
Cosine Similarity
Figure. Vectors A & B
• Value of cos θ varies between:
• -1 [θ = 180°, Absolutely dissimilar – Opposite ended vectors/relationship]
• 0 [θ = 90°, Dissimilar, perpendicular vectors/relationship]
• +1 [θ = 0°, Absolutely similar – Overlapping vectors/relationship]
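The cosine of the angle between two vectors A and B is the dot product divided by the product of their lengths: cos θ = (A · B) / (‖A‖ ‖B‖). A minimal Python sketch of that formula follows. The function name and the sample vectors are illustrative assumptions and are not taken from the project.

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (assumed) covering the three cases listed above.
same = cosine_similarity([1, 2], [2, 4])            # same direction
perpendicular = cosine_similarity([1, 0], [0, 1])   # 90 degrees apart
opposite = cosine_similarity([1, 2], [-1, -2])      # 180 degrees apart
```

Only the direction of the vectors matters here, not their length, which is why [1, 2] and [2, 4] count as overlapping.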
8. Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not have
thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
Figure. A 23” LED TV is SOLD!! Candidate recommendations shown: 23”, 24”, and 25” LED TVs versus Blu-Ray, Home Theater, and HDMI Cables.
9. Where do we use recommendations?
• Applications can be found in a wide variety of industries:
• Travel
• Financial Services
• Music/Online radio
• TV and Video
• Online Publications
• Retail
…and countless others
Our Example: Movies
10. Our Goal
• Create a powerful, scalable recommendation engine with minimal
development
• Make recommendations to users as they are browsing movie titles -
instantaneously
• Recommendations must be relevant to the movie the user is currently
viewing.
OOPS! – too much surprise!
11. How do we do it?
Hadoop – distributed file system and processing platform
Spark – low-latency computing
MLlib – Library of Machine Learning Algorithms
We leverage two algorithms:
• Content-Based Filtering – how similar is this particular movie to other
movies, based on usage?
• Collaborative Filtering – predict an individual's preference based on their
peers' ratings. Spark MLlib implements a collaborative filtering algorithm
called Alternating Least Squares (ALS)
• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”
12. Content-Based Filtering
“People who liked this movie liked these as well”
• Content-based filtering builds a matrix of items to other items and
calculates similarity (based on user ratings)
• The most similar items are then output as a list:
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)
7 100 0.690951001800917
7 50 0.653299445638532
7 117 0.643701303640083
At the moment, content-based filtering is not available in
Spark MLlib. On our project, we used Mahout.
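To make the item-to-item matrix concrete, here is a small pure-Python sketch that turns (User ID, Item ID, Rating) triples into the Item ID, Similar Item ID, Similarity Score list described above. It is not the Mahout job used on the project: it assumes cosine similarity over each item's vector of user ratings, and the ratings below are made up for illustration.

```python
import math
from collections import defaultdict

def item_similarities(ratings):
    """Map each item to a list of (similar_item, score), best first.

    ratings: iterable of (user_id, item_id, rating) triples.
    Each item is treated as a vector of user ratings (missing = 0).
    """
    vectors = defaultdict(dict)          # item_id -> {user_id: rating}
    for user, item, rating in ratings:
        vectors[item][user] = rating

    def cosine(u, v):
        dot = sum(r * v[user] for user, r in u.items() if user in v)
        norm_u = math.sqrt(sum(r * r for r in u.values()))
        norm_v = math.sqrt(sum(r * r for r in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    result = {}
    for item, vec in vectors.items():
        scores = [(other, cosine(vec, vectors[other]))
                  for other in vectors if other != item]
        result[item] = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return result

# Made-up ratings (assumption): (User ID, Item ID, Rating)
ratings = [
    (1, 7, 5), (1, 100, 5), (1, 50, 1),
    (2, 7, 4), (2, 100, 4),
    (3, 7, 1), (3, 50, 5),
]
similar = item_similarities(ratings)
for other, score in similar[7]:
    print(7, other, round(score, 3))     # Item ID, Similar Item ID, Score
```

With these sample ratings, item 100 ranks above item 50 for item 7 because the same users rated 7 and 100 highly.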
13. Collaborative Filtering
“People with similar taste to you liked these movies”
• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are
“Seven” and “Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
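The "best movie critics to follow" idea can be sketched without Spark. The code below is not ALS and not the MLlib API. It is a simpler user-based neighborhood method, swapped in for illustration, that weights each peer's ratings by how similar that peer's taste is to the target user's. The user IDs, item IDs, and ratings are made up.

```python
import math
from collections import defaultdict

def recommend(ratings, user, top_n=3):
    """Score items `user` has not rated by similarity-weighted peer ratings."""
    by_user = defaultdict(dict)          # user_id -> {item_id: rating}
    for u, item, rating in ratings:
        by_user[u][item] = rating

    def cosine(a, b):
        dot = sum(r * b[i] for i, r in a.items() if i in b)
        na = math.sqrt(sum(r * r for r in a.values()))
        nb = math.sqrt(sum(r * r for r in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    mine = by_user[user]
    weighted = defaultdict(float)        # item_id -> sum(weight * rating)
    weights = defaultdict(float)         # item_id -> sum(weight)
    for peer, theirs in by_user.items():
        if peer == user:
            continue
        w = cosine(mine, theirs)         # how much to trust this "critic"
        if w <= 0:
            continue
        for item, rating in theirs.items():
            if item not in mine:
                weighted[item] += w * rating
                weights[item] += w

    scores = [(item, weighted[item] / weights[item]) for item in weighted]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up data (assumption): u2 rates like u1, u3 mostly does not.
ratings = [
    ("u1", "A", 5), ("u1", "B", 4),
    ("u2", "A", 5), ("u2", "B", 5), ("u2", "C", 5), ("u2", "D", 2),
    ("u3", "A", 1), ("u3", "B", 2), ("u3", "C", 1), ("u3", "D", 5),
]
recs = recommend(ratings, "u1")          # [(Item ID, Score), ...], best first
```

Item C ends up ahead of item D for u1 because u2, the more similar peer, rated C highly and D poorly.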
14. Recommendation Store
• Serving recommendations needs to be instantaneous
• The core of this solution is two reference tables:
Rec_Item_Similarity (Item_ID, Similar_Item, Similarity_Score)
Rec_User_Item_Base (User_ID, Item_ID, Recommendation_Score)
• When called to make recommendations, we query our store:
• Rec_Item_Similarity based on the Item_ID they are viewing
• Rec_User_Item_Base based on their User_ID
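The two lookups can be sketched as plain key-based reads. The slides do not say which datastore backs the recommendation store, so the in-memory dictionaries and the function name below are assumptions. The sample rows reuse the scores shown earlier for item 7 and user 572.

```python
# In-memory stand-ins for the two reference tables (an assumption; the
# slides do not name the datastore behind the recommendation store).
rec_item_similarity = {
    # Item_ID -> [(Similar_Item, Similarity_Score), ...], best first
    7: [(100, 0.690951001800917),
        (50, 0.653299445638532),
        (117, 0.643701303640083)],
}
rec_user_item_base = {
    # User_ID -> [(Item_ID, Recommendation_Score), ...], best first
    572: [(11, 5.0), (293, 4.70718), (8, 4.688335)],
}

def lookup_recommendations(user_id, item_id, limit=10):
    """Return (item-based rows, user-based rows); unknown keys give []."""
    by_item = rec_item_similarity.get(item_id, [])[:limit]
    by_user = rec_user_item_base.get(user_id, [])[:limit]
    return by_item, by_user

by_item, by_user = lookup_recommendations(user_id=572, item_id=7)
```

Because both tables are precomputed and keyed by a single ID, serving a request is two lookups rather than a model evaluation, which is what keeps delivery fast.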
15. Delivering Recommendations
So if Johny is viewing “12 Monkeys” we query our recommendation store
and present the results
Item-Based: Peers like these Movies
Item Similarity             Raw Score  Score
Fargo                       0.691      1.000
Star Wars                   0.653      0.946
Rock, The                   0.644      0.932
Pulp Fiction                0.628      0.909
Return of the Jedi          0.627      0.908
Independence Day            0.618      0.894
Willy Wonka                 0.603      0.872
Mission: Impossible         0.597      0.864
Silence of the Lambs, The   0.596      0.863
Star Trek: First Contact    0.594      0.859
Raiders of the Lost Ark     0.584      0.845
Terminator, The             0.574      0.831
Blade Runner                0.571      0.826
Usual Suspects, The         0.569      0.823
Seven (Se7en)               0.569      0.823
Item-Base (Peer)            Raw Score  Score
Seven                       5.000      1.000
Donnie Brasco               4.707      0.941
Babe                        4.688      0.938
Heat                        4.688      0.938
To Kill a Mockingbird       4.686      0.937
Jaws                        4.683      0.937
Monty Python, Holy Grail    4.670      0.934
Blade Runner                4.670      0.934
Get Shorty                  4.655      0.931
Best Recommendations: Top 10 Recommendations
Seven (Se7en)               1.823
Blade Runner                1.760
Fargo                       1.000
Star Wars                   0.946
Donnie Brasco               0.941
Babe                        0.938
Heat                        0.938
To Kill a Mockingbird       0.937
Jaws                        0.937
Monty Python, Holy Grail    0.934
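The numbers above are consistent with a simple scheme: divide each raw score by the highest raw score in its list, then add the normalized scores for titles that appear in both lists. The sketch below assumes that scheme, which the slides do not state explicitly. It uses a subset of the rounded raw scores printed above, so results can differ from the slide in the third decimal place.

```python
def normalize(raw_scores):
    """Scale scores so the best one becomes 1.0 (divide by the maximum)."""
    top = max(raw_scores.values())
    return {title: score / top for title, score in raw_scores.items()}

def combine(*score_sets):
    """Sum normalized scores per title; rank the highest total first."""
    totals = {}
    for scores in score_sets:
        for title, score in normalize(scores).items():
            totals[title] = totals.get(title, 0.0) + score
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

# Rounded raw scores copied from the two tables above (subset of rows).
item_similarity = {"Fargo": 0.691, "Star Wars": 0.653,
                   "Blade Runner": 0.571, "Seven (Se7en)": 0.569}
peer_scores = {"Seven (Se7en)": 5.000, "Donnie Brasco": 4.707,
               "Blade Runner": 4.670}

ranked = combine(item_similarity, peer_scores)
for title, score in ranked:
    print(f"{title:<16}{score:.3f}")
```

Titles recommended by both lists, here "Seven (Se7en)" and "Blade Runner", rise to the top because their two normalized scores are added together.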
16. From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means
OOPS!
17. Additional Algorithm – K-Means
“These movies are similar based on their attributes”
• Treats items as coordinates
• Places a number of random
“centroids” and assigns the nearest
items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
We would use the major attributes of the Movie to create coordinate points.
• Categories
• Actors
• Director
• Synopsis Text
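The loop described above (assign items to the nearest centroid, move each centroid to the average location, repeat until assignments stop changing) can be sketched in pure Python. This is an illustration only, not the implementation used on the project. Turning categories, actors, directors, and synopsis text into coordinates is out of scope here, so the 2-D points are made up, and the `initial_centroids` parameter is a hypothetical convenience for reproducible runs.

```python
import random

def k_means(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Cluster equal-length numeric tuples; return (centroids, assignments)."""
    if initial_centroids is None:
        # Random start: pick k of the items themselves as first centroids.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid (squared distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break                        # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Made-up 2-D coordinates (assumption): two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = k_means(
    points, k=2, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```

Items that land in the same cluster as the movie being viewed would then be candidates for "similar based on their attributes".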
18. Delivery Scoring and Filters
Apply assumptions to control the results of collaborative filtering
• One or more categories must match
• Only children's movies will be recommended for children's movies.
Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
Babe                   0       0          1           1       0      1      0          0       0        0       0
Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
Star Wars              1       1          0           0       0      0      0          0       1        1       0
Blade Runner           0       0          0           0       0      0      1          0       0        1       0
Fargo                  0       0          0           0       1      1      0          0       0        0       1
Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
Monty Python           0       0          0           1       0      0      0          0       0        0       0
Jaws                   1       0          0           0       0      0      0          1       0        0       0
Heat                   1       0          0           0       1      0      0          0       0        0       1
Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0
Similar logic could be applied to promote more favorable options
• New Releases
• Retail Case: Items that are on-sale, overstock
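As a sketch of how such delivery filters might be applied, the code below checks each candidate against the movie being viewed using the genre flags from the table above. The function name is hypothetical. It also assumes the children's rule works in both directions, so a children's movie is never recommended alongside a non-children's movie; the slides do not spell that out.

```python
GENRES = ["Action", "Adventure", "Children's", "Comedy", "Crime", "Drama",
          "Film-Noir", "Horror", "Romance", "Sci-Fi", "Thriller"]

# Genre flags copied from the table above (subset of rows).
MOVIES = {
    "Twelve Monkeys": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    "Babe":           [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "Seven (Se7en)":  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "Blade Runner":   [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "Jaws":           [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
}
CHILDRENS = GENRES.index("Children's")

def passes_filters(viewing, candidate):
    """Apply both delivery rules to one candidate recommendation."""
    a, b = MOVIES[viewing], MOVIES[candidate]
    # Rule 1: one or more categories must match.
    shares_category = any(x and y for x, y in zip(a, b))
    # Rule 2 (assumed symmetric): children's movies are only paired
    # with other children's movies, and vice versa.
    same_audience = a[CHILDRENS] == b[CHILDRENS]
    return shares_category and same_audience

candidates = ["Seven (Se7en)", "Blade Runner", "Babe", "Jaws"]
kept = [m for m in candidates if passes_filters("Twelve Monkeys", m)]
```

Under these assumptions "Babe" is dropped for a "Twelve Monkeys" viewer even though both are flagged Drama, and "Jaws" is dropped because it shares no category.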
19. Integrating K-Means into the process
Movies recommended by more than 1 algorithm are the most highly rated
Figure. Collaborative Filter, Content Filter, and K-Means (Similar) results combine to give the Best Recommendations.
20. Sophisticated Recommendation Model
Figure. Inputs to the Recommendation:
• Customer Behavior: What items have you bought in the past?
• Peer Based: What are people with similar characteristics buying?
• Item Clustering: What did people who ordered these items also order?
• Corporate Deals/Offers: What items are we promoting at time of sale?
• Market/Store: What items are being promoted by the Store or Market?
The solution allows balancing of algorithms to attain the most effective recommendation.
21. Summary
• Hadoop and Spark can provide a relatively low cost and extremely
scalable platform for recommendations
• Spark with MLlib offers a great library of established Machine
Learning algorithms, reducing development effort
• A good recommendation system combines Collaborative and Content
filtering algorithms and custom business rules
• While Spark matures, Mahout or roll-your-own algorithms may still be
needed.
22. Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
Editor's notes
Cloudera , Talend , Datameer
- Need to talk through the vectors A & B and what we want to express