I gave this presentation as part of the Data Science meetup in London on 23rd May, 2012.
The talk describes how I have been using Mahout's item-based collaborative filtering recommender to produce personalised scientific article recommendations for researchers. I discuss how well Mahout performs out of the box and how I reduced processing time by 95% by tuning it to our data set.
2. Use Case
➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need
➔ Help researchers by recommending relevant research
3. 1.5 million+ users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
50m research articles
4. We need a recommender that scales up, coping with our data and future growth
(Shown over the user base list from slide 3: 1.5 million+ users, 50m research articles.)
6. Questions
➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?
8. Generating recommendations through matrix multiplication
These are item-based recommendations, as similarity is based on items, not users.
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
9. Input (all user preferences)
[Figure: preference matrix with research articles as rows (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) and researchers as columns (Turing, Babbage, Einstein, Newton).]
10. Input (all user preferences)
[Figure: the same matrix annotated with the size of our data set: 50M research articles x 1.5M researchers, holding 300M prefs.]
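To make the scale on slide 10 concrete, here is a minimal Java sketch of the input held as a sparse structure, plus the density of the full matrix. It assumes the plain-text input convention of one userID,itemID,preference triple per line that Mahout's Hadoop recommender jobs read; the class name, method names, and sample IDs are hypothetical and not taken from the talk.

```java
import java.util.HashMap;
import java.util.Map;

class PreferenceInputSketch {
    // Parses "userID,itemID[,preference]" lines into item -> (user -> preference).
    // A missing preference value is treated as 1.0 (an assumption of this sketch).
    static Map<Long, Map<Long, Float>> parse(String[] lines) {
        Map<Long, Map<Long, Float>> byItem = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split(",");
            long user = Long.parseLong(fields[0]);
            long item = Long.parseLong(fields[1]);
            float pref = fields.length > 2 ? Float.parseFloat(fields[2]) : 1.0f;
            byItem.computeIfAbsent(item, k -> new HashMap<>()).put(user, pref);
        }
        return byItem;
    }

    // Fraction of cells in the item x user matrix that actually hold a preference.
    static double density(double prefs, double users, double items) {
        return prefs / (users * items);
    }

    public static void main(String[] args) {
        String[] sample = {"1,101,1.0", "2,101,1.0", "1,102,1.0"}; // made-up IDs
        Map<Long, Map<Long, Float>> byItem = parse(sample);
        System.out.println("users with item 101: " + byItem.get(101L).keySet());
        // Figures from the slide: 300M prefs, 1.5M researchers, 50M articles.
        System.out.printf("density: %.7f%n", density(300e6, 1.5e6, 50e6));
    }
}
```

With the figures from the slide, only about 4 in every million cells of the matrix hold a preference, which is why a sparse representation matters at this scale.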
11. item.RecommenderJob
1. Prepare preference matrix (1-3)
2. Generate similarity matrix (4-6)
3. Multiply matrices (7-10)
[Figure: All User Preferences (item x user), with research articles as rows and researchers as columns.]
12. item.RecommenderJob
[Figure, continued: one column, Turing's, is picked out of All User Preferences as A User's Preferences (item x user).]
13. item.RecommenderJob
[Figure, continued: the Item Similarity matrix (item x item), research articles x research articles, is added:]
2 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2
14. item.RecommenderJob
[Figure, continued: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing.]
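The multiplication on these slides can be sketched in plain Java, without Mahout or Hadoop. The preference values below are an assumption (the slides do not preserve them); they are chosen so that simple co-occurrence counting reproduces the item similarity values that appear on the slide (2 1 0 0 / 1 1 0 0 / 0 0 2 2 / 0 0 2 2). Co-occurrence is used here purely for illustration, and all class and method names are hypothetical.

```java
import java.util.Arrays;

class ItemBasedSketch {
    // Item similarity by co-occurrence: S = P * P^T, where P is item x user.
    static int[][] cooccurrence(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                for (int u = 0; u < prefs[i].length; u++)
                    sim[i][j] += prefs[i][u] * prefs[j][u];
        return sim;
    }

    // Extracts one user's preferences (a column of the item x user matrix).
    static int[] column(int[][] matrix, int col) {
        int[] out = new int[matrix.length];
        for (int i = 0; i < matrix.length; i++) out[i] = matrix[i][col];
        return out;
    }

    // Recommendation scores: item similarity (item x item) times a user's preferences.
    static int[] scores(int[][] sim, int[] userPrefs) {
        int[] out = new int[sim.length];
        for (int i = 0; i < sim.length; i++)
            for (int j = 0; j < userPrefs.length; j++)
                out[i] += sim[i][j] * userPrefs[j];
        return out;
    }

    public static void main(String[] args) {
        // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton. Values are assumed.
        int[][] prefs = {
            {1, 1, 0, 0},
            {1, 0, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        int[][] sim = cooccurrence(prefs);
        int[] babbage = column(prefs, 1);
        System.out.println("similarity: " + Arrays.deepToString(sim));
        System.out.println("scores for Babbage: " + Arrays.toString(scores(sim, babbage)));
    }
}
```

Under these assumed preferences Babbage has only Comp Sci 1, so the highest-scoring article he does not already have is Comp Sci 2. A real recommender would also filter out items the user already has before ranking.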
26. 1. Reduce processing time
➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...
27. Task Allocation
37 hours to complete
1 reducer allocated, despite having 48 available...
28. Task Allocation
Allocating more reducers on a per-job basis:
    // job is the Hadoop Job being configured
    job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
Allocating more mappers on a per-job basis, by capping the input split size (in bytes):
    job.getConfiguration().setLong("mapred.max.split.size", splitSize);
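The slide does not say how splitSize was chosen. One plausible approach, sketched below in plain Java, is to divide the input size by the number of map tasks you would like. The helper name and the input size are hypothetical, and the number of mappers Hadoop actually launches also depends on the input format, file boundaries, and block size, so treat the result as a starting point only.

```java
class SplitSizeSketch {
    // Hypothetical helper: the largest split size (in bytes) that still yields
    // at least `desiredMappers` splits for a single input of `inputBytes` bytes.
    static long splitSizeFor(long inputBytes, int desiredMappers) {
        if (inputBytes <= 0 || desiredMappers <= 0) {
            throw new IllegalArgumentException("inputBytes and desiredMappers must be positive");
        }
        // Ceiling division, so the final partial split is accounted for.
        return (inputBytes + desiredMappers - 1) / desiredMappers;
    }

    public static void main(String[] args) {
        long inputBytes = 10L * 1024 * 1024 * 1024; // 10 GiB, an illustrative figure
        long splitSize = splitSizeFor(inputBytes, 40);
        System.out.println("mapred.max.split.size = " + splitSize);
    }
}
```

The value returned here is what would be passed as splitSize in the configuration call on the slide.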
29. Task Allocation
37 hours → 14 hours to complete
From 1 → 40 reducers
33. Partitioners
14 hours → 2 hours to complete
Data evenly distributed across reducers
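Uneven reducer load often comes from how keys are mapped to partitions. The slides do not show the partitioner that was used, so the following is only an illustration in plain Java, with no Hadoop dependency. hashPartition mirrors the formula used by Hadoop's default HashPartitioner; mixedPartition is a hypothetical alternative that scrambles the hash bits first; the key pattern (multiples of the reducer count) is contrived to make the skew obvious.

```java
class PartitionSkewSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.
    static int hashPartition(int keyHash, int numPartitions) {
        return (keyHash & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical alternative: mix the bits before taking the modulus.
    static int mixedPartition(int keyHash, int numPartitions) {
        int h = keyHash * 0x9E3779B9; // multiplicative scramble (wraps on overflow)
        h ^= (h >>> 16);              // fold the high bits into the low bits
        return (h & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts how many keys land on each partition.
    static int[] load(int[] keys, int numPartitions, boolean mix) {
        int[] counts = new int[numPartitions];
        for (int k : keys) {
            int p = mix ? mixedPartition(k, numPartitions) : hashPartition(k, numPartitions);
            counts[p]++;
        }
        return counts;
    }

    static int max(int[] values) {
        int m = values[0];
        for (int v : values) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        int reducers = 40;
        int[] keys = new int[40_000];
        for (int i = 0; i < keys.length; i++) keys[i] = i * reducers; // contrived keys
        System.out.println("busiest reducer, plain hash: " + max(load(keys, reducers, false)));
        System.out.println("busiest reducer, mixed hash: " + max(load(keys, reducers, true)));
    }
}
```

With the plain hash every one of these keys lands on reducer 0, so one reducer does all the work while the other 39 sit idle. In a real job the equivalent change would live in a custom Partitioner class set on the job.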
34. Mahout's Performance
[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]
Orig. item-based: 6.5K, 1.5
35. Mahout's Performance
[Chart: same axes and quadrants.]
Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
36. Mahout's Performance
[Chart: same axes and quadrants.]
Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
Difference: -4.1K (63%)
37. 2. Improve quality
➔ Mahout provides item-based CF
➔ We have many more items than users
➔ Typically, user-based is more appropriate
➔ So let's make one!
38. item.RecommenderJob
1. Prepare preference matrix (1-3)
2. Generate similarity matrix (4-6)
3. Multiply matrices (7-10)
[Figure, repeated from slide 14: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing.]
39. item.RecommenderJob, with "item" replaced by "user"
[Figure: the same pipeline, but the Item Similarity matrix (item x item, research articles x research articles) is replaced by a User Similarity matrix (user x user, researchers x researchers).]
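The user-based variant can be sketched the same way in plain Java. Here similarity is computed between researchers (columns of the preference matrix) instead of between articles, and a user's scores are the preference matrix multiplied by that user's column of the user similarity matrix. The preference values and all names are assumptions for illustration; this is not the code from MAHOUT-1004.

```java
import java.util.Arrays;

class UserBasedSketch {
    // User similarity by co-occurrence: U = P^T * P, where P is item x user.
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int a = 0; a < users; a++)
            for (int b = 0; b < users; b++)
                for (int[] itemRow : prefs)
                    sim[a][b] += itemRow[a] * itemRow[b];
        return sim;
    }

    // Scores for one user: P (item x user) times that user's similarity column.
    static int[] scores(int[][] prefs, int[][] userSim, int user) {
        int[] out = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++)
            for (int other = 0; other < userSim.length; other++)
                out[i] += prefs[i][other] * userSim[other][user];
        return out;
    }

    public static void main(String[] args) {
        // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton. Values are assumed.
        int[][] prefs = {
            {1, 1, 0, 0},
            {1, 0, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        int[][] userSim = userSimilarity(prefs);
        System.out.println("user similarity: " + Arrays.deepToString(userSim));
        System.out.println("scores for Babbage: " + Arrays.toString(scores(prefs, userSim, 1)));
    }
}
```

On this toy data both variants give the same scores. The practical difference is the size of the similarity matrix: with 1.5M researchers and 50M articles, a user x user matrix has far fewer cells than an item x item one.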
40. Mahout's Performance
[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]
Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
41. Mahout's Performance
[Chart: same axes and quadrants.]
Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
➔ Orig. user-based: 1K, 2.5
47. Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large scale data set
    ➔ Good quality recommendations
➔ Tuning helps
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
    ➔ We save 95% of resources
➔ Use an appropriate algorithm
    ➔ Item- vs user-based (MAHOUT-1004)
    ➔ We increase precision by 66.6%
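The headline percentages can be checked against numbers shown earlier in the deck. The sketch below assumes (the slides do not say so explicitly) that the 95% saving corresponds to the run time dropping from 37 hours to 2 hours, the 63% to the cost dropping from 6.5K to 2.4K normalised Amazon hours, and the precision gain to good recommendations rising from 1.5 to 2.5 out of 10. The class and method names are hypothetical.

```java
class HeadlineNumbersSketch {
    // Relative change from `before` to `after`, as a percentage of `before`.
    static double percentChange(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        System.out.printf("run time, 37h to 2h:       %.1f%%%n", percentChange(37, 2));
        System.out.printf("cost, 6.5K to 2.4K hours:  %.1f%%%n", percentChange(6.5, 2.4));
        System.out.printf("precision, 1.5 to 2.5 /10: %.1f%%%n", percentChange(1.5, 2.5));
    }
}
```

Under those assumptions the changes come out at roughly -95%, -63%, and +67%, in line with the figures quoted on the slides.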