Distribution Problems in Recommender Systems
Traditional machine learning and collaborative filtering pay little attention to the sources of the data they use. The distribution backing the learning data, the distribution backing the algorithm output, and the distribution backing the ground truth are often completely different from one another, and almost unrelated to the target distribution: true ratings across all items for every user.

Published in: Technology
  1. Differences in Distributions and Their Effect on Recommendation System Performance: Why Collaborative Filtering Doesn't Scale (portions reference Prismatic's Silicon Valley talk)
  2. History of Recommendation
  3. Overfitting (diagram of four distributions)
     • Distribution of all items across users
     • Distribution of all items across all users in the future
     • Concrete set of past items across users
     • Concrete set of future items across users
  4. Recommender Systems Dilemma (diagram of nested sets)
     • Set of all items possible
     • Set of items known to users in the future
     • Set of items known to users in the past
     • Set of items recommended by recommenders
     • Items viewed or liked in the future
     • Items users viewed or rated in the past
     • Items seen in ground truth without changes in item access
  5. Collaborative Filtering in Music
     • Construct correlations between items from the set of past known items
     • Generate an estimated distribution for past users across all items
     • Hope that the 'errors' relate to items future users will like
     • The gap between distributions escalates with the scale of the data
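The construction on slide 5 can be sketched minimally. This is not the deck's actual system: the 4-user, 4-item rating matrix is invented, and cosine similarity stands in for whatever correlation measure was used.

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, cols: items); 0 = unrated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def cosine_sim(M):
    """Item-item cosine similarity over the columns of M."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    norms[norms == 0] = 1.0           # avoid division by zero
    return (M.T @ M) / (norms.T @ norms)

S = cosine_sim(R)                     # item-item "correlations" from past data
user = R[0]                           # user 0's past ratings
scores = S @ user                     # estimated affinity for every item
scores[user > 0] = -np.inf            # only score items the user has not seen
print(int(np.argmax(scores)))         # → 2, the best unseen item for user 0
```

Note that the scores are driven entirely by correlations in the past known set, which is exactly the gap the slide warns about: items outside that set get no signal at all.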
  6. Resulting Biases
     • Huge item counts with tiny exposure: 50%+ of users only ever saw 20 songs a month out of 3 million
     • Massive gap between the all-items distribution and the known-items distribution
     • Cross-validation ground truth assumes those 50%+ of users only ever saw the new top 20 songs in the new set
     • Results are supposed to reflect how users would rate if they knew all sets
     • Continuous user testing assumes an 'all items seen' distribution, but the set of recommended items is the only source of newly seen items
     • The user data itself is a biased subset of the whole
  7. First Generation Problems
     • Everyone likes The Beatles or Norah Jones: such items are extremely frequent in biased data sets
     • Since everyone listened to them before, everyone gets them recommended
     • Recommendations usually repeat the top 40 of the data collection
     • Users might like novel recommendations, but those never appear in the cross-validation evaluation set, because users never saw them
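The "everyone gets The Beatles" failure mode can be reproduced with a toy co-occurrence recommender. The histories below are invented for illustration: item 0 is the ubiquitous artist, appearing in nearly every history.

```python
import numpy as np

# Hypothetical listening histories; item 0 ("The Beatles") is in almost all of them.
seen = np.array([
    [1, 1, 0, 0, 1, 0],   # user 0
    [1, 0, 1, 0, 0, 1],   # user 1
    [1, 1, 1, 0, 0, 0],   # user 2
    [1, 0, 0, 1, 1, 0],   # user 3
    [0, 0, 1, 1, 0, 0],   # user 4: has never heard item 0
], dtype=float)

C = seen.T @ seen          # item-item co-occurrence counts
np.fill_diagonal(C, 0)     # an item should not recommend itself

def top_rec(u):
    scores = C @ seen[u]              # sum co-occurrence with u's known items
    scores[seen[u] > 0] = -np.inf     # cannot recommend already-known items
    return int(np.argmax(scores))

print(top_rec(4))          # → 0: the ubiquitous item wins for user 4
```

Because item 0 co-occurs with everything, it dominates the score for any user who lacks it, so the recommender keeps re-circulating the collection's top items instead of surfacing novel ones.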
  8. Problems Over Time
     • The ground truth is heavily biased by recommendations controlling the set of known items
     • Machine learning, including collaborative filtering, learns the algorithm's distribution more than users' preferences
     • Performance bias: future ground truth comes from those who stayed in the system
       • They liked the system; it doesn't represent those who were unhappy and left
       • This biases the data toward keeping existing users happy without regard to ex-users
       • In extreme cases, even new users are discarded
  9. Best Solution So Far (diagram)
     • Past data => idealized future distribution
     • Idealized function: feature value => rating
  10. Best Solution So Far
      • Requires all items to be categorized and quantized
      • Requires accuracy and general agreement on these values (socially defined versus absolute)
      • At least all features are present in all sets
      • Transforms recommendation into optimization and personalization: the set of items with the highest score for a user
      • Ability to predict poorly performing product or agent solutions
      • Better able to incorporate additional data
      • Prediction is usually linear time over the number of items
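The feature-based approach on slide 10 can be sketched as follows. The feature names and weights are hypothetical; the point is that prediction is a single matrix-vector product, linear in the number of items, and it covers items no one has rated yet.

```python
import numpy as np

# Hypothetical quantized item features: [energy, acousticness, vocals].
items = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.9, 0.3],
    [0.8, 0.2, 0.1],
    [0.1, 0.8, 0.9],
])

# A per-user weight vector mapping feature value => rating,
# presumed learned from that user's past ratings.
user_weights = np.array([0.2, 0.7, 0.1])

scores = items @ user_weights        # one pass over all items: O(n)
print(int(np.argmax(scores)))        # → 1, the highest-scoring item for this user
```

Because every item has a feature vector regardless of whether anyone has seen it, new or tail items can be scored immediately, which sidesteps the known-items distribution gap from the earlier slides.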
  11. Evaluation Adjustments
      • No replacement for real-world A/B testing
      • Machine learning for the evaluation, not just the question
      • Hidden dependencies and 'cheating' among the learned algorithm model, training, the evaluation model, the business objective, and the ground truth
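One concrete evaluation adjustment, sketched with invented data: before trusting an offline hit-rate, score a pure-popularity baseline on the same held-out interactions. Because the ground truth is drawn from the same biased exposure, an unpersonalized popularity recommender can look deceptively strong offline.

```python
import numpy as np

# Hypothetical past interactions (rows: users, cols: items).
train = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])
# Hypothetical held-out "future" item per user.
test = {0: 2, 1: 1, 2: 3}

popularity = train.sum(axis=0)            # item popularity from training data

def pop_rec(u):
    scores = popularity.astype(float)     # same ranking for every user
    scores[train[u] > 0] = -np.inf        # exclude already-seen items
    return int(np.argmax(scores))

hits = sum(pop_rec(u) == test[u] for u in test)
print(round(hits / len(test), 2))         # → 0.67 hit-rate, with zero personalization
```

A personalized model should be judged against this baseline, not against zero; if it cannot beat popularity on biased ground truth, the offline metric is mostly measuring the exposure distribution.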
