I gave this presentation as part of the Big Data Week Conferences in London, 25th April, 2012.
Mendeley Suggest is a research article recommendation system powered by Mahout. This presentation explores how Mahout's distributed recommender works and how well it performs when applied to the problem of recommending research to Mendeley users. Based on experimentation, some tips are provided on how to speed Mahout up by tuning it to the characteristics of the training data set. A new recommendation algorithm is also presented that implements user-based collaborative filtering, which complements Mahout's existing item-based collaborative filtering algorithm. The user-based implementation will soon be contributed back to the Mahout community.
4. ➔
Mendeley is a data platform for researchers
➔
We're bringing together researchers and the research that they produce from all over the world
➔
We're structuring this data in a machine-readable format
➔
We're opening this data up for you to build applications on top of it using our API
➔
These applications help researchers to do even better research and become more productive
➔
How are we building our community?
5. Mendeley provides tools to help users...
...organise their research
➔
Reference management
➔
Cite-as-you-write
➔
Full-text article search
➔
Digitalised annotations
6. Mendeley provides tools to help users...
...collaborate with one another
...organise their research
➔
Research network
➔
Professional research groups
7. Mendeley provides tools to help users...
...collaborate with one another
...organise their research
...discover new research
➔
Mendeley Suggest
➔
Personalised article recommendations
➔
Weekly batch of 10 recommended articles
➔
Collaborative Filtering
➔
The more data, the better
8. 1.5 million+ users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
50m research articles
9. Mendeley provides tools to help users...
...collaborate with one another
...organise their research
...discover new research
We need a recommender that scales up, coping with our data and future growth
13. Mahout use cases:
➔
Retrieve related items in large collections
http://www.slideshare.net/kryton/the-data-layer
14. Mahout use cases:
➔
Retrieve related items in large collections
➔
Discover relevant items that you may have overlooked
http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
15. Mahout use cases:
➔
Retrieve related items in large collections
➔
Discover relevant items that you may have overlooked
➔
Find love!
➔
Mahout implements collaborative filtering, a surprisingly powerful algorithm
http://www.speeddate.com/apps/site/views/mp/technology.php
16. Mahout use cases:
➔
Retrieve related items in large collections
➔
Discover relevant items that you may have overlooked
➔
Find love!
➔
Mahout implements collaborative filtering, a surprisingly powerful algorithm
➔
Mendeley Suggest
➔
Discover new research
➔
Fill in gaps in your library
➔
Your personal advisor
http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html
18. Generating recommendations through matrix multiplication
This is item-based recommendation, as similarity is computed between items, not users
Not convinced? Try reading these...
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA.
http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2
http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html
19. Input (all user preferences)
[Figure: preference matrix. Columns (Researchers): Turing, Babbage, Einstein, Newton. Rows (Research Articles): Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.]
20. Input (all user preferences)
[Figure: the same preference matrix at Mendeley's scale: 1.5M researchers, 50M research articles, 300M prefs.]
21. item.RecommenderJob
1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)
[Figure: All User Preferences (item x user). Rows: Research Articles. Columns: Researchers.]
22. item.RecommenderJob
[Figure: as slide 21, plus A User's Preferences (item x user): Turing's column of the matrix.]
23. item.RecommenderJob
[Figure: as slide 22, plus Item Similarity (item x item), Research Articles by Research Articles:]
2 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2
25. item.RecommenderJob
[Figure: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing.]
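The multiplication on these slides can be sketched in plain Java. This is a minimal illustration, not Mahout's distributed implementation: it assumes similarity is a simple co-occurrence count, and the class name, method names and toy preference data are invented for the example. With the assumed data, the co-occurrence counts come out the same as the item similarity values shown on the slide.

```java
import java.util.Arrays;

public class ItemBasedSketch {

    // Toy data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    // ASSUMPTION: who holds which article is invented for illustration.
    static final int[][] PREFS = {
        {1, 1, 0, 0},
        {0, 1, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };

    // Item similarity as co-occurrence counts: S = P * P^T, with P being item x user.
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    // Recommendation scores for one user: S times that user's column of P.
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int items = prefs.length;
        int[] result = new int[items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                result[i] += sim[i][j] * prefs[j][user];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] sim = itemSimilarity(PREFS);
        System.out.println(Arrays.deepToString(sim));
        System.out.println(Arrays.toString(scores(sim, PREFS, 0))); // Turing is column 0
    }
}
```

Under these assumptions Turing's highest score is for Comp Sci 1, which he already holds, followed by Comp Sci 2. In practice, articles already in a user's library would be filtered out before anything is recommended.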
30-33. Mahout's Performance
[Figure: quadrant chart. y-axis: Normalised Amazon Hours (0 to 7K). x-axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]
34. Mahout's Performance
Orig. item-based: 6.5K hours, 1.5 good recommendations/10
35. Mahout's Performance
Orig. item-based: 6.5K hours, 1.5 good recommendations/10
Cust. item-based: 2.4K hours, 1.5 good recommendations/10
36. Mahout's Performance
Orig. item-based: 6.5K hours, 1.5 good recommendations/10
Cust. item-based: 2.4K hours, 1.5 good recommendations/10
Orig. item-based → Cust. item-based: -4.1K hours (63%)
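The 63% figure follows from the two chart readings for the item-based recommender, taking Normalised Amazon Hours as the cost measure:

```latex
\[
\frac{6.5\text{K} - 2.4\text{K}}{6.5\text{K}} = \frac{4.1\text{K}}{6.5\text{K}} \approx 0.63
\]
```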
37. Reducing processing time and cost
➔
Mahout's recommender is already efficient
➔
but your data may have unusual properties
➔
We got improvements by:
➔
tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob
➔
using an appropriate partitioner
38. Task Allocation
37 hours to complete
1 reducer allocated, despite having 48 available...
39. Task Allocation
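Allocating more reducers on a per job basis
// set the number of reduce tasks for this job only
job.getConfiguration().setInt(
    "mapred.reduce.tasks",
    numReducers);
Allocating more mappers on a per job basis
// a smaller maximum split size (in bytes) gives more input splits, and so more mappers
job.getConfiguration().set(
    "mapred.max.split.size",
    String.valueOf(splitSize));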
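To see why a smaller maximum split size yields more mappers, here is a simplified sketch of the split arithmetic. It assumes one map task per input split and a single large input file, and it ignores the small slack Hadoop allows on the final split. The class name, method names and the file and block sizes are invented for illustration.

```java
public class SplitSizeSketch {

    static final long MIB = 1024L * 1024L;

    // Effective split size: the block size, clamped between the minimum and maximum split sizes.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (and so map tasks) for one file, rounding up.
    static long estimateSplits(long fileBytes, long splitSize) {
        return (fileBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileBytes = 10 * 1024 * MIB; // hypothetical 10 GiB input
        long blockSize = 64 * MIB;        // hypothetical HDFS block size

        // No maximum set: splits follow the block size.
        long byBlock = splitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println(estimateSplits(fileBytes, byBlock) + " mappers");

        // Maximum split size lowered to 16 MiB: four times as many splits.
        long capped = splitSize(blockSize, 1, 16 * MIB);
        System.out.println(estimateSplits(fileBytes, capped) + " mappers");
    }
}
```

Reducers are different: their number is not derived from the input size, which is why it has to be set explicitly on jobs that would otherwise run with a single reducer.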
40. Task Allocation
From 1 → 40 reducers
37 hours to complete → 14 hours
44. Partitioners
Evenly distributed
14 hours to complete → 2 hours
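The slides do not say which partitioner was used, so the following is only a hypothetical illustration of how a partitioner can leave reducers idle and how a different one can even out the load. `hashPartition` uses the same formula as Hadoop's default `HashPartitioner` (hash code, sign bit cleared, modulo the number of reducers). `mixedPartition`, the skewed key set and all names are assumptions made up for the example.

```java
public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner, applied to an int key.
    static int hashPartition(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical alternative: scramble the key's bits first
    // (the 32-bit finaliser from MurmurHash3), then take the modulo.
    static int mixedPartition(int key, int numReducers) {
        int h = key;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return (h & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many keys land on each reducer.
    static int[] load(int[] keys, int numReducers, boolean mixed) {
        int[] counts = new int[numReducers];
        for (int key : keys) {
            int p = mixed ? mixedPartition(key, numReducers) : hashPartition(key, numReducers);
            counts[p]++;
        }
        return counts;
    }

    static int max(int[] values) {
        int best = values[0];
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Hypothetical skewed key set: 1,000 IDs that are all multiples of 40.
    static int[] skewedKeys() {
        int[] keys = new int[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i * 40;
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = skewedKeys();
        System.out.println("busiest reducer, plain hash: " + max(load(keys, 40, false)));
        System.out.println("busiest reducer, mixed hash: " + max(load(keys, 40, true)));
    }
}
```

With 40 reducers, the plain hash sends every one of these keys to the same reducer, so the job runs only as fast as that one task. Scrambling the bits first spreads the keys across many reducers.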
45. Mahout's Performance
[Figure: quadrant chart of Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3).]
Orig. item-based: 6.5K hours, 1.5 good recommendations/10
Cust. item-based: 2.4K hours, 1.5 good recommendations/10
Orig. item-based → Cust. item-based: -4.1K hours (63%)
46. item.RecommenderJob
1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)
[Figure: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing.]
47. user.RecommenderJob
1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)
[Figure: as slide 46, with "user" overlaid on "item" in the job name and User Similarity (user x user), Researchers by Researchers, overlaid on Item Similarity (item x item).]
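A user-based variant can be sketched in the same way, swapping the item-by-item similarity for a user-by-user one. This is not the implementation presented in the talk. It assumes co-occurrence counts as the similarity measure and one plausible multiplication order (preferences times the target user's similarity column). The class name, method names and toy data are invented for illustration.

```java
import java.util.Arrays;

public class UserBasedSketch {

    // Toy data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    // ASSUMPTION: who holds which article is invented for illustration.
    static final int[][] PREFS = {
        {1, 1, 0, 0},
        {0, 1, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };

    // User similarity as co-occurrence counts: U = P^T * P, with P being item x user.
    static int[][] userSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                for (int i = 0; i < items; i++) {
                    sim[u][v] += prefs[i][u] * prefs[i][v];
                }
            }
        }
        return sim;
    }

    // Scores for one user: P times that user's column of U, so each article
    // is weighted by how similar its holders are to the target user.
    static int[] scores(int[][] prefs, int[][] sim, int user) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[] result = new int[items];
        for (int i = 0; i < items; i++) {
            for (int v = 0; v < users; v++) {
                result[i] += prefs[i][v] * sim[v][user];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] sim = userSimilarity(PREFS);
        System.out.println(Arrays.deepToString(sim));
        System.out.println(Arrays.toString(scores(PREFS, sim, 0))); // Turing is column 0
    }
}
```

On this toy data the user-based scores for Turing match the item-based ones. On real data the two approaches generally rank articles differently.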
48. Mahout's Performance
[Figure: quadrant chart of Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3).]
Orig. item-based: 6.5K hours, 1.5 good recommendations/10
Cust. item-based: 2.4K hours, 1.5 good recommendations/10
Orig. user-based: 1K hours, 2.5 good recommendations/10
54. Conclusions
➔
Mahout is doing a great job of powering Mendeley Suggest
➔
Large scale data set
➔
Excellent for batch processing requirements
➔
We'll soon be feeding our user-based implementation into Mahout
➔
User-based can outperform item-based
➔
Makes Mahout's offering more rounded
➔
Save resources and money by understanding your data
➔
Help Hadoop with task allocation if necessary
➔
Partition your data appropriately
55. We're Hiring!
➔
Hadoop Data Architect
➔
design a coherent data model across the company
➔
take ownership of our data
➔
hands on Hadoop administration
➔
Marie Curie Senior Research Fellow
➔
ensure that Mendeley’s research catalogue is of high quality
➔
research and development opportunity
➔
£500 Finder's Fee if you find someone who we hire
➔
http://www.mendeley.com/careers/