2. Who am I?
• Architect @ Flytxt (Big Data Analytics & Automation)
• Passionate about data, distributed computing, machine learning
• Previously
• Virtualization & Cloud Lifecycle Management (BMC)
• Designed and implemented the Cloud Lifecycle Management interface at BMC
• Large-Scale Data Centre Automation (AOL)
• Implemented a centralized data center management framework for AOL
• Workflow Systems & Automation (Accenture)
• Implemented a service management suite for various customers
3. Session Agenda!
• Recommendation Engines – What's the big deal?
• Conceptual Overview
• Collaborative Filtering
• Engineering Challenges
• Apache Mahout
• Getting your recommender to production
• Q&A
6. Big deal?
[Diagram: an ad network sits between advertisers (ads), content publishers (content), and users, and recommends the best ads using ML algorithms, user behavior modelling, and maximization criteria.]
7. BTW, What was the challenge?
User base: 2 billion+ users worldwide
Content base: 12.51 billion+ indexed pages
Advertiser base: millions of active advertisers
Real-time nature: responses in < 200 ms
Multi-objective optimization problem
Noisy Data
8. Recommendation Engines: Overview
A specific type of information filtering technique that
attempts to recommend information items or social
elements that are likely to be of interest to the user.
Technologies that can help us sift through all the
available information to predict products or services
that could be interesting to us.
Applying knowledge discovery techniques to the
problem of making personalized recommendations for
information, products or services, usually during a live
interaction.
9. Do we need a crystal ball to predict?
We all have opinions/tastes which we express as our likes or dislikes.
Our tastes follow some patterns.
We tend to like things that are similar to things we already
like (e.g. songs).
We tend to like things that are liked by people who are similar to
us (e.g. movies).
From fancy research to mainstream
10. Collaborative Filtering
Problem: The system has U users and I items. A user Uk needs to be
recommended a set of m items that they have not yet picked but are
likely to be interested in.
Solution :
Maintain a database of users’ ratings of a variety of items.
For a given user, find other similar users whose ratings strongly
correlate with the current user - User Neighborhood
Recommend items rated highly by these similar users, but not rated by
the current user.
E.g. Amazon, Flipkart, etc.
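The solution steps above can be sketched in a few lines of Python. This is a minimal illustration, not the approach of any particular product: the `ratings` data, the user names, and the `pearson` and `recommend` helpers are all made up for the example, and Pearson correlation is assumed as the similarity measure.

```python
from math import sqrt

# Hypothetical ratings database: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 5},
    "carol": {"i1": 1, "i2": 5, "i3": 2, "i4": 1},
    "dave":  {"i1": 5, "i2": 3, "i4": 4, "i5": 4},
}

def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def recommend(user, ratings, k=2, m=3):
    """Recommend up to m unrated items using the k most similar users."""
    target = ratings[user]
    sims = [(pearson(target, other_ratings), other)
            for other, other_ratings in ratings.items() if other != user]
    # User neighborhood: the k users with the highest positive correlation.
    neighbors = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    scores, weights = {}, {}
    for sim, other in neighbors:
        for item, rating in ratings[other].items():
            if item not in target:  # only items the user has not rated
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Rank candidates by similarity-weighted average rating.
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:m]]

print(recommend("alice", ratings))  # ['i4', 'i5']
```

Here carol correlates negatively with alice and is left out of the neighborhood, so only bob's and dave's ratings drive the result.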
11. Utility Matrix
Matrix of values representing each user’s level of affinity to each item.
Sparse matrix
Recommendation engine needs to predict the values for the empty cells
based on available cell values
The denser the matrix, the better the quality of recommendations
User | i1   i2   i3   i4   i5
u1   | -    r12  -    r14  r15
u2   | r21  r22  -    -    r25
u3   | -    r32  -    r34  -
u4   | -    -    r43  -    r45
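One possible in-memory representation of such a matrix is a dictionary of dictionaries that stores only the observed cells. In this sketch the filled positions follow the table above, while the rating values and the `density` and `empty_cells` helpers are invented for illustration.

```python
# Sparse utility matrix: only observed (user, item) cells are stored.
# Rating values are made up; the filled positions follow the slide's table.
utility = {
    "u1": {"i2": 4, "i4": 5, "i5": 1},
    "u2": {"i1": 3, "i2": 4, "i5": 2},
    "u3": {"i2": 5, "i4": 4},
    "u4": {"i3": 2, "i5": 5},
}
items = ["i1", "i2", "i3", "i4", "i5"]

def density(utility, items):
    """Fraction of user-item cells that hold an observed rating."""
    filled = sum(len(row) for row in utility.values())
    return filled / (len(utility) * len(items))

def empty_cells(utility, items):
    """The (user, item) cells the recommender would have to predict."""
    return [(user, item)
            for user, row in utility.items()
            for item in items if item not in row]

print(density(utility, items))           # 0.5
print(len(empty_cells(utility, items)))  # 10
```

Real utility matrices are usually far sparser than this toy one, which is why storing only the observed cells matters.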
12. Engineering Challenges
Massive Data Volume : how do I deal with TBs of raw data to build my
recommendations?
Hadoop and MapReduce shine!
How can I make it work in ‘real time’?
Batch pre-computing and storing the results in HBase could help!
Will my solution scale? My user base is going to double soon!
Sure, you can make it scale!
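The batch pre-compute idea can be illustrated as follows. This is only a sketch under loud assumptions: a plain Python dict stands in for the HBase table, a simple popularity ranking stands in for the real recommender, and `batch_precompute` and `serve` are hypothetical names.

```python
from collections import Counter

def batch_precompute(ratings, top_n=2):
    """Offline job: build a user -> top-N list table plus a default list.

    The returned dict is a stand-in for a key-value store such as HBase,
    and popularity ranking is a placeholder for the real recommender.
    """
    popularity = Counter(item for row in ratings.values() for item in row)
    # Most-rated items first; ties broken by item id for determinism.
    ranked = sorted(popularity, key=lambda item: (-popularity[item], item))
    table = {user: [item for item in ranked if item not in row][:top_n]
             for user, row in ratings.items()}
    return table, ranked[:top_n]

def serve(user, table, fallback):
    """Online path: a single key lookup, with no model computation."""
    return table.get(user, fallback)

ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 5},
    "u3": {"i1": 2, "i2": 4, "i3": 1},
}
table, fallback = batch_precompute(ratings)
print(serve("u1", table, fallback))        # ['i3']
print(serve("newcomer", table, fallback))  # ['i1', 'i2']
```

The expensive work happens in the batch job; the request path only reads a precomputed row, which is what makes tight latency budgets feasible.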
13. Engineering Challenges
Do I need a cloud based infrastructure?
Depends!
Hadoop compatible Machine Learning library?
Mahout would help!
How can I represent/transform my input data appropriately?
Pig/Hive might help! If not, MapReduce is always there!
14. Apache Mahout Overview
Scalable machine learning library
Core algorithms for clustering, classification and batch-based
collaborative filtering implemented over Hadoop
A few popular algorithms: K-Means, fuzzy K-Means, Canopy clustering,
LDA, etc.
Vibrant community support.
Used by: Adobe, Yahoo!, Amazon, AOL, Flytxt… (the list goes on)
mahout-dev-subscribe@apache.org
15. Taking Recommendation Engines to production
Analyzing the input data: what kind of information can I collect from users?
Selecting the appropriate recommender (e.g. user-based, item-based)
Strategy to recommend to anonymous users (or first-time users)
Strategy for distributed computing: modeling the problem as MapReduce
Choosing the deployment model
Monitoring the system
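As an illustration of modeling the problem as MapReduce, one common building block for item-based recommendation is counting how often two items are picked by the same user. The sketch below simulates the map, shuffle, and reduce phases in plain Python; it is not Hadoop or Mahout code, and all function names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(user, items):
    """Mapper: emit ((item_a, item_b), 1) for each item pair one user picked."""
    for a, b in combinations(sorted(items), 2):
        yield (a, b), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one item pair."""
    return key, sum(values)

def cooccurrence(user_items):
    emitted = (kv for user, items in user_items.items()
               for kv in map_phase(user, items))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

user_items = {
    "u1": ["i1", "i2", "i3"],
    "u2": ["i1", "i2"],
    "u3": ["i2", "i3"],
}
print(cooccurrence(user_items))
# {('i1', 'i2'): 2, ('i1', 'i3'): 1, ('i2', 'i3'): 2}
```

Because each mapper sees only one user's items and each reducer sees only one item pair, the work partitions naturally across machines.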
16. Conclusion
Very popular field of research and implementation
More and more products and services are leveraging the concept
From fancy research to live production systems at scale
Making people's lives easier by assisting in making decisions
17. Some more concepts…
Concept of similarity: distance measures, etc.
Pearson Correlation
User neighborhood computation
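For reference, the Pearson correlation between two users and a top-k user neighborhood are commonly written as below. The notation is an assumption made for this sketch: $r_{ui}$ is user $u$'s rating of item $i$ (as in the utility matrix), $I_{ab}$ is the set of items rated by both users $a$ and $b$, and $\bar{r}_a$, $\bar{r}_b$ are each user's mean rating over $I_{ab}$.

```latex
w_{ab} =
  \frac{\sum_{i \in I_{ab}} (r_{ai} - \bar{r}_a)(r_{bi} - \bar{r}_b)}
       {\sqrt{\sum_{i \in I_{ab}} (r_{ai} - \bar{r}_a)^2}\,
        \sqrt{\sum_{i \in I_{ab}} (r_{bi} - \bar{r}_b)^2}},
\qquad -1 \le w_{ab} \le 1

% Neighborhood of user a: the k other users with the highest correlation.
N_k(a) = \operatorname*{arg\,top\text{-}k}_{b \ne a} \; w_{ab}
```

A value of $w_{ab}$ close to 1 means the two users rate their shared items in a similar pattern; values near 0 or below suggest the user is a poor neighbor.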