Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Kris Jack, PhD
Senior Data Mining Engineer

Overview

➔ What's Mendeley?
➔ Applications of Mahout's Recommender
➔ Under Mahout's Bonnet
➔ Mahout's Research Career so Far
➔ Conclusions

➔ Mendeley is a data platform for researchers
➔ We're bringing together researchers and the research that they produce from all over the world
➔ We're structuring this data in a machine readable format
➔ We're opening this data up for you to build applications on top of it using our API
➔ These applications help researchers to do even better research and become more productive
➔ How are we building our community?

Mendeley provides tools to help users...

...organise their research
➔ Reference management
➔ Cite-as-you-write
➔ Full-text article search
➔ Digitalised annotations

...collaborate with one another
➔ Research network
➔ Professional research groups

...discover new research
➔ Mendeley Suggest
➔ Personalised article recommendations
➔ Weekly batch of 10 recommended articles
➔ Collaborative Filtering
➔ The more data, the better

1.5 million+ users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

We need a recommender that scales up, coping with our data and future growth

Applications of Mahout's Recommender

Mahout use cases:
➔ Retrieve related items in large collections
http://www.slideshare.net/kryton/the-data-layer
➔ Discover relevant items that you may have overlooked
http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
➔ Find love!
➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm
http://www.speeddate.com/apps/site/views/mp/technology.php
➔ Mendeley Suggest
➔ Discover new research
➔ Fill in gaps in your library
➔ Your personal advisor
http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is computed between items, not users

Not convinced? Try reading these...
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA.

http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2
http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html

[Figure: Input (all user preferences). A matrix of Research Articles (rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (columns: Turing, Babbage, Einstein, Newton). At Mendeley's scale: 50M research articles, 1.5M researchers, 300M prefs.]

item.RecommenderJob
1. Prepare preference matrix (steps 1-3)
2. Generate similarity matrix (steps 4-6)
3. Multiply matrices (steps 7-10)

[Figure: All User Preferences (item x user), a Research Articles by Researchers matrix.]

[Figure: item.RecommenderJob. Item Similarity (item x item) X A User's Preferences (item x user; Turing's column is highlighted) = Recommendations (item x user).]

Item Similarity (item x item):

             Comp Sci 1   Comp Sci 2   Physics 1   Physics 2
Comp Sci 1       2            1            0           0
Comp Sci 2       1            1            0           0
Physics 1        0            0            2           2
Physics 2        0            0            2           2
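
The multiplication on this slide can be sketched in plain Java. This is only an illustration, not Mahout's implementation: the preference values in `PREFS` are an assumption (the slide's input values did not survive as text), chosen so that a simple co-occurrence count reproduces the Item Similarity matrix shown above. The class and method names are hypothetical.

```java
import java.util.Arrays;

class ItemBasedSketch {
    // ASSUMED binary preferences (item x user); columns are
    // Turing, Babbage, Einstein, Newton. Picked so the co-occurrence
    // matrix equals the Item Similarity matrix on the slide.
    static final int[][] PREFS = {
        {1, 1, 0, 0}, // Comp Sci 1
        {1, 0, 0, 0}, // Comp Sci 2
        {0, 0, 1, 1}, // Physics 1
        {0, 0, 1, 1}, // Physics 2
    };

    // Item similarity as co-occurrence counts: S = A x A^T (item x item).
    static int[][] coOccurrence(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    // Extract one user's preferences (a single column of the item x user matrix).
    static int[] column(int[][] m, int col) {
        int[] out = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            out[i] = m[i][col];
        }
        return out;
    }

    // Recommendation scores for one user: S x (that user's preference column).
    static int[] scores(int[][] sim, int[] userPrefs) {
        int[] out = new int[sim.length];
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < sim.length; j++) {
                out[i] += sim[i][j] * userPrefs[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] sim = coOccurrence(PREFS);
        // prints [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
        System.out.println(Arrays.deepToString(sim));
        // Babbage is column 1; prints [2, 1, 0, 0]
        System.out.println(Arrays.toString(scores(sim, column(PREFS, 1))));
    }
}
```

Under these assumed preferences, Babbage has only Comp Sci 1 in his library, so the non-zero score for Comp Sci 2 is the candidate recommendation. A real job would typically drop items the user already has before ranking.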

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Mahout's Research Career so Far

[Chart: Mahout's Performance. Vertical axis: Normalised Amazon Hours (0 to 7K). Horizontal axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]

➔ Orig. item-based: 6.5K hours, 1.5 good recommendations/10
➔ Cust. item-based: 2.4K hours, 1.5 good recommendations/10
➔ Orig. item-based → Cust. item-based: -4.1K hours (63%)

Reducing processing time and cost

➔ Mahout's recommender is already efficient
➔ but your data may have unusual properties
➔ We got improvements by:
  ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob
  ➔ using an appropriate partitioner

Task Allocation: 37 hours to complete

1 reducer allocated, despite having 48 available...

Task Allocation

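Allocating more reducers on a per-job basis

// numReducers: the number of reduce tasks to request for this job
job.getConfiguration().setInt(
    "mapred.reduce.tasks",
    numReducers);

Allocating more mappers on a per-job basis

// A smaller maximum split size (in bytes) gives more input splits, so more mappers
job.getConfiguration().set(
    "mapred.max.split.size",
    String.valueOf(splitSize));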
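
To see why lowering the maximum split size yields more mappers, here is a rough sketch of the arithmetic. The `splitSize` rule mirrors the one Hadoop's `FileInputFormat` uses (`max(minSize, min(maxSize, blockSize))`). The mapper estimate is a simplification for a single splittable file, and the file and block sizes are hypothetical values chosen for illustration.

```java
class SplitSizeSketch {
    // Same rule as Hadoop's FileInputFormat: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of map tasks for one splittable file
    // (ignores Hadoop's small tolerance on the final split).
    static long estimatedMappers(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb; // hypothetical 1 GB input file
        long blockSize = 64 * mb;  // hypothetical HDFS block size

        // No maximum set: one split per block; prints 16
        System.out.println(estimatedMappers(fileSize, splitSize(blockSize, 1, Long.MAX_VALUE)));
        // Maximum split size of 16 MB; prints 64
        System.out.println(estimatedMappers(fileSize, splitSize(blockSize, 1, 16 * mb)));
    }
}
```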

Task Allocation: from 37 hours to 14 hours to complete

From 1 → 40 reducers

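Partitioners: 14 hours to complete

[Figure: uneven partition sizes, labelled ~50KB and ~500MB]

// Sample the input keys, write the resulting split points to a partition file,
// then route keys to reducers by range (conf is presumably a JobConf)
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

Partitioners: 2 hours to complete

[Figure: partitions evenly distributed]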
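
The idea behind sampling plus range partitioning can be sketched without Hadoop. The code below is an illustration of the general technique only, not the internals of `InputSampler` or `TotalOrderPartitioner`. The class name, method names and sample keys are all hypothetical.

```java
import java.util.Arrays;

class RangePartitionSketch {
    // Choose (numPartitions - 1) split points at evenly spaced ranks of a sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Keys below the first split point go to partition 0; a key equal to a
    // split point goes to the partition that starts at that point.
    static int partition(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Count how many keys land in each partition.
    static int[] counts(int[] keys, int[] points) {
        int[] out = new int[points.length + 1];
        for (int key : keys) {
            out[partition(key, points)]++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily skewed keys: most are small, a few are huge.
        int[] keys = {1, 2, 3, 4, 5, 6, 7, 8, 100, 1000, 10000, 100000};
        int[] points = splitPoints(keys, 4);
        System.out.println(Arrays.toString(points));               // prints [4, 7, 1000]
        System.out.println(Arrays.toString(counts(keys, points))); // prints [3, 3, 3, 3]
    }
}
```

Because the split points follow the sampled key distribution rather than fixed-width ranges, each partition receives a similar number of keys even when the keys are skewed.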

[Figure: item.RecommenderJob diagram, Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), extended with User Similarity (user x user) over Researchers.]

[Chart: Mahout's Performance. Vertical axis: Normalised Amazon Hours (0 to 7K). Horizontal axis: No. Good Recommendations/10 (0 to 3).]

Implementation      Normalised Amazon Hours   No. Good Recommendations/10
Orig. item-based    6.5K                      1.5
Cust. item-based    2.4K                      1.5
Orig. user-based    1K                        2.5
Cust. user-based    0.3K                      2.5

➔ Orig. item-based → Cust. item-based: -4.1K hours (63%)
➔ Cust. item-based → Orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
➔ Orig. user-based → Cust. user-based: -0.7K hours (70%)
➔ Orig. item-based → Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)
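
The percentages on the chart follow from the plotted values. A minimal sketch of that arithmetic, using only the figures shown on the slides (the class and method names are hypothetical):

```java
class SavingsSketch {
    // Percentage change from 'before' to 'after', rounded to the nearest whole percent.
    static long percentChange(double before, double after) {
        return Math.round(100.0 * (after - before) / before);
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, in thousands
        System.out.println(percentChange(6.5, 2.4)); // orig. item → cust. item; prints -63
        System.out.println(percentChange(2.4, 1.0)); // cust. item → orig. user; prints -58
        System.out.println(percentChange(1.0, 0.3)); // orig. user → cust. user; prints -70
        System.out.println(percentChange(6.5, 0.3)); // orig. item → cust. user; prints -95
        // Good recommendations out of 10
        System.out.println(percentChange(1.5, 2.5)); // item-based → user-based; prints 67
    }
}
```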

Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
  ➔ Large scale data set
  ➔ Excellent for batch processing requirements
➔ We'll soon be feeding our user-based implementation into Mahout
  ➔ User-based can outperform item-based
  ➔ Makes Mahout's offering more rounded
➔ Save resources and money by understanding your data
  ➔ Help Hadoop with task allocation if necessary
  ➔ Partition your data appropriately

We're Hiring!
➔ Hadoop Data Architect
  ➔ design a coherent data model across the company
  ➔ take ownership of our data
  ➔ hands on Hadoop administration
➔ Marie Curie Senior Research Fellow
  ➔ ensure that Mendeley’s research catalogue is of high quality
  ➔ research and development opportunity
➔ £500 Finder's Fee if you find someone who we hire
➔ http://www.mendeley.com/careers/
