2. Dylan Valerio
Software engineer > data scientist > Kaggler > academic
BS CS, ADMU; MS CS, UP Diliman
IT Security, e-commerce, internet technologies
Natural language processing, computer vision, deep learning, recommendation systems
3. (Yes, this is my to-read list)
I like collecting this art for my tabletop games. Ask and answer away on Quora.
“Recommendation is invaluable for companies with content and users of all sizes. It boosts engagement and loyalty to the brand.”
Mendeley is a site for researchers and their references
The Spotify Mix automatically crafts recommendations from your favorite music.
Productivity-buster!
4. There is a deluge of content for users:
• Amazon has more than 500M products in the US and an estimated 65M Amazon Prime users.
• Netflix has 130M subscribers and 8,000 movies and TV shows.
• Spotify has 180M users and 30M songs.
• Pinterest has 70M active users, 50B pins, and 1B boards.
• Quora has some 11M questions and 30M answers.
That’s a lot of content. Recommendation is an absolute must for the user to even begin consuming content.
5. Different Paradigms of Recommendation
Content Filtering: a watched item leads to recommendations with similar tags (e.g., crime, Robert De Niro, dark, mob); a tag-overlap sketch follows the bullets below.
• Pros: Readily explainable; fast
• Cons: Stale and unchanging
Collaborative Filtering: items watched by users similar to me are recommended.
• Pros: More interesting for users
• Cons: Items with no usage (cold-start)
• No free lunch
• It’s a quickly growing field with vast literature and domain-specific nuances
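As a minimal sketch of the content-filtering side, tag overlap (Jaccard similarity) can rank items against one the user watched. All titles and tags below are made-up illustrations, not from any real catalog.

# Minimal content-filtering sketch: rank items by tag overlap (Jaccard).
# Titles and tags are illustrative stand-ins.
tags = {
    "Goodfellas":   {"crime", "robert de niro", "dark", "mob"},
    "Casino":       {"crime", "robert de niro", "mob", "vegas"},
    "Finding Nemo": {"animation", "family", "ocean"},
}

def jaccard(a, b):
    # Size of the intersection over size of the union of two tag sets.
    return len(a & b) / len(a | b)

watched = "Goodfellas"
scores = {title: jaccard(tags[watched], t)
          for title, t in tags.items() if title != watched}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # Casino ranks first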
18. Memory-Based Models
User-based K-Nearest Neighbor recommendation (similarity between users)
Intuition: find the items most enjoyed by the users closest to me, in terms of what they watch.
Item-based K-Nearest Neighbor recommendation (similarity between items)
Intuition: find the items closest to the items I enjoyed, in terms of the users that enjoyed both.
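A common choice for both similarities is cosine similarity over the ratings matrix: compare rows for user-user, columns for item-item. A minimal numpy sketch with a made-up matrix:

import numpy as np

# Toy ratings matrix: rows are users, columns are items (0 = unrated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]], dtype=float)

def cosine(a, b):
    # Cosine similarity between two rating vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(R[0], R[1]))        # user-user: compare rows
print(cosine(R[:, 0], R[:, 3]))  # item-item: compare columns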
19. K-Nearest Neighbor Recommenders
User-Based
For each user u:
    neighbors <- get closest users to u
    new_items <- items rated by neighbors that u has not rated before
    For each new_item i and neighbor v:
        accumulate weighted_scores[i] <- similarity(u, v) * rating(v, i)
    Normalize and sort

Item-Based
For each user u:
    my_items <- u’s rated items
    close_items <- items close to my_items
    For each close_item i and my_item j:
        accumulate weighted_scores[i] <- similarity(i, j) * rating(u, j)
    Normalize and sort
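A runnable rendering of the user-based procedure, assuming cosine similarity and a toy ratings matrix; k and the data are illustrative choices, not the talk’s exact setup.

import numpy as np

def user_based_recommend(R, u, k=2):
    # Score items u hasn't rated using the k closest users (cosine similarity).
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[u] / (norms * norms[u] + 1e-9)  # similarity of every user to u
    sims[u] = -np.inf                            # exclude u itself
    neighbors = np.argsort(sims)[-k:]            # k closest users
    scores = np.zeros(R.shape[1])
    weights = np.zeros(R.shape[1])
    for v in neighbors:                          # accumulate weighted scores
        rated = R[v] > 0
        scores[rated] += sims[v] * R[v, rated]
        weights[rated] += sims[v]
    scores = np.divide(scores, weights,          # normalize
                       out=np.zeros_like(scores), where=weights > 0)
    scores[R[u] > 0] = -np.inf                   # keep only unrated items
    return np.argsort(scores)[::-1]              # sort & serve

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 5, 5],
              [0, 1, 5, 4]], dtype=float)
print(user_based_recommend(R, u=0))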
20. MovieLens 20M
[Screenshots: My Top Ratings and My Recommendations]
20M ratings, 138k users, 26k movies, 99.46% zero entries
21. My Own User-Based Recommender
• Get closest users
• Get items I haven’t rated
• For each neighbor and new item, compute the weighted score
• Normalize
• Sort & serve
23. Goals of Recommendation
Machine Learning Metrics
• Minimize the difference of ratings
• Rank the recommendation list
Business Metrics
• Click-through rate
• Customer conversion rates
24. Evaluating a Good Recommender
We take out a fraction of watches from each user, then compare our predicted recommendations against the items actually watched. (A precision/recall sketch follows the metric lists below.)
Error-Based Metrics
• RMSE, MAE
Ranking Metrics
• Precision
• Recall
• Normalized Discounted Cumulative Gain
Other metrics
• Diversity
• Novelty
• Serendipity
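As a sketch of the ranking metrics, precision@k and recall@k can be computed against a user’s held-out items. The recommendation and held-out lists below are made up.

def precision_recall_at_k(recommended, held_out, k=5):
    # Precision: fraction of top-k recs that were held-out watches.
    # Recall: fraction of held-out watches recovered in the top k.
    hits = len(set(recommended[:k]) & set(held_out))
    return hits / k, hits / len(held_out)

recs = ["m1", "m7", "m3", "m9", "m4"]  # predicted ranking
held_out = ["m3", "m4", "m8"]          # watches hidden for evaluation
print(precision_recall_at_k(recs, held_out, k=5))  # (0.4, 0.667)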
25. Surprise Library
Cross-validation: train on 2/3 of the data, test on the remaining 1/3, repeat 3 times.
Root Mean Squared Error (RMSE): the square root of the mean of the squared differences between predicted and actual ratings.
Surprise stands for Simple Python RecommendatIon System Engine.
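A minimal Surprise sketch matching that setup: 3-fold cross-validation of a user-based KNN model, reporting RMSE and MAE. It uses the library’s built-in ML-100k download as a stand-in for the 20M set.

from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# Built-in MovieLens 100k (downloaded on first use); stands in for 20M.
data = Dataset.load_builtin("ml-100k")

# User-based KNN with cosine similarity, mirroring the recommender above.
algo = KNNBasic(sim_options={"name": "cosine", "user_based": True})

# 3-fold CV: train on ~2/3, test on ~1/3, three times; report error metrics.
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=3, verbose=True)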
26. Extension to User-Based Recommender
• Get closest users
• Get items I haven’t rated
• For each neighbor and new item, compute the weighted score, taking into account the mean of how I and others rate (sketched below)
• Normalize
• Sort & serve
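A sketch of that mean-centered score: predict u’s rating as u’s own mean plus the similarity-weighted deviations of neighbors from their means. The toy matrix and similarity values are illustrative.

import numpy as np

def predict_mean_centered(R, u, i, neighbors, sims):
    # rhat(u,i) = mean_u + sum_v sim(u,v)*(r(v,i) - mean_v) / sum_v |sim(u,v)|
    mean_u = R[u, R[u] > 0].mean()
    num = den = 0.0
    for v in neighbors:
        if R[v, i] > 0:                     # neighbor v actually rated item i
            mean_v = R[v, R[v] > 0].mean()  # v's own rating mean
            num += sims[v] * (R[v, i] - mean_v)
            den += abs(sims[v])
    return mean_u + num / den if den > 0 else mean_u

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 5, 5]], dtype=float)
sims = {1: 0.8, 2: 0.3}  # illustrative precomputed similarities to user 0
print(predict_mean_centered(R, u=0, i=2, neighbors=[1, 2], sims=sims))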
29. Collaborative Filtering : Matrix Factorization
• Latent factors describe the structure of the data beyond the noise
• There are two latent-factor matrices, one for the users and one for the items; together they approximate the ratings matrix (a toy factorization sketch follows this slide).
• Can “recover” the missing values in the ratings matrix
[Diagram: ratings matrix ≈ user latent matrix × item latent matrix]
• Surprise covers SVD, which uses explicit ratings
• Implicit covers Weighted Matrix Factorization (WMF), which uses implicit ratings
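To illustrate the “recovery” of missing entries, a toy matrix-factorization sketch: fit two small latent matrices by stochastic gradient descent on the observed ratings only. This is plain SGD for illustration, not the exact algorithms inside Surprise or Implicit.

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = R > 0              # fit only where a rating exists

k, lr, reg = 2, 0.01, 0.1     # latent factors, learning rate, regularization
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user latent matrix
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item latent matrix

for _ in range(5000):         # SGD over observed entries only
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])

print(np.round(U @ V.T, 1))   # zero entries are now filled-in predictions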
30. Steam Games
200k ratings, 12k users, 5k games, 99.68% zero entries
You Might Like… Why? https://itstherealdyl.wordpress.com/2017/07/30/you-might-like-why/
31. Implicit Library
Data preparation: convert the ratings matrix to sparse format. Sparse format can accommodate BIG datasets.
Model: 100-dimension latent matrix, 0.1 regularization, 20 iterations, 4 threads (sketched below).
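A hedged sketch of that setup with the implicit library, assuming the >= 0.5 API where fit() takes a user-item CSR matrix. The random interactions stand in for the real Steam data.

import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Stand-in for the Steam data: random sparse user-item interactions
# (12k users x 5k games, ~0.32% nonzero, matching the slide's shape).
user_items = sp.random(12_000, 5_000, density=0.0032,
                       format="csr", random_state=42)

# Weighted matrix factorization via ALS with the slide's settings.
model = AlternatingLeastSquares(factors=100, regularization=0.1,
                                iterations=20, num_threads=4)
model.fit(user_items)

# Top-10 recommendations for user 0, excluding items they already have.
ids, scores = model.recommend(0, user_items[0], N=10)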
32. Implicit Library
Bookkeeping: in sparse format the original IDs are lost, so keep mappings between matrix indices and the original user and item IDs.
Model explainability: the similarity of item i to j, weighted by how much the user enjoyed “i”.
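Two fragments continuing the sketch above: explicit index-to-ID mappings for the bookkeeping, and implicit’s explain() for the weighted-similarity breakdown (assumed to follow the >= 0.5 API; the IDs are made up).

# Bookkeeping: sparse matrices index by position, so keep mappings
# between matrix indices and the original IDs (made-up examples here).
user_ids = ["u_1001", "u_1002"]               # row index -> user ID
game_ids = ["dota_2", "portal_2", "stardew"]  # column index -> game ID
user_to_idx = {uid: i for i, uid in enumerate(user_ids)}
game_to_idx = {gid: i for i, gid in enumerate(game_ids)}

# Explainability: decompose user 0's score for item 3 into contributions
# from items they already interacted with, weighted by how much they
# enjoyed each one.
score, contributions, _ = model.explain(0, user_items, itemid=3)
for item_idx, contribution in contributions:
    print(item_idx, contribution)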