Scientific Article Recommendation with Mahout

Kris Jack, PhD
Senior Data Mining Engineer

Use Case
➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need

➔ Help researchers by recommending relevant research

1.5 million+ users; among the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

We need a recommender that scales up, coping with our data and future growth

Questions
➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is based on items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

[Figure: input preference matrix (all user preferences). Columns are researchers (Turing, Babbage, Einstein, Newton); rows are research articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2). At full scale: 1.5M researchers, 50M research articles, 300M prefs.]

item.RecommenderJob
1. Prepare the preference matrix (1-3)
2. Generate the similarity matrix (4-6)
3. Multiply the matrices (7-10)

[Figure: all user preferences as an item x user matrix, with research articles as rows and researchers as columns.]

[Figure: item.RecommenderJob takes one user's column (Turing) from the preference matrix (item x user) and multiplies the item similarity matrix by it:

Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)

Item similarity matrix shown on the slide:
2 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2]
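The multiplication in the diagram can be sketched in a few lines of plain Java. This is only an illustration, not Mahout's implementation: it uses the item similarity values from the diagram (reading the second row as 1 1 0 0), and Turing's preference column is an assumption, since the slide does not give it. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class ItemBasedSketch {
    // Item similarity (item x item) as read from the slide.
    // Row/column order: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    static final int[][] ITEM_SIMILARITY = {
        {2, 1, 0, 0},
        {1, 1, 0, 0},
        {0, 0, 2, 2},
        {0, 0, 2, 2},
    };

    // Multiplies the similarity matrix by one user's preference column.
    static int[] score(int[][] similarity, int[] userPrefs) {
        int[] scores = new int[similarity.length];
        for (int i = 0; i < similarity.length; i++) {
            for (int j = 0; j < userPrefs.length; j++) {
                scores[i] += similarity[i][j] * userPrefs[j];
            }
        }
        return scores;
    }

    // Returns the index of the best-scoring item the user does not
    // already have, or -1 if no unseen item has a positive score.
    static int recommend(int[] scores, int[] userPrefs) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            boolean unseen = userPrefs[i] == 0;
            if (unseen && scores[i] > 0 && (best < 0 || scores[i] > scores[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // ASSUMPTION: Turing has expressed a preference for Comp Sci 1 only.
        int[] turing = {1, 0, 0, 0};
        int[] scores = score(ITEM_SIMILARITY, turing);
        System.out.println("scores: " + Arrays.toString(scores));
        System.out.println("recommended item index: " + recommend(scores, turing));
    }
}
```

Under that assumed preference column, the only unseen article with a positive score is Comp Sci 2, which matches the intuition that articles co-occurring with what a researcher already has are the ones suggested.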

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours (0 to 7K); x-axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).
  Orig. item-based: 6.5K hours, 1.5]

1. Reduce processing time

2. Improve quality

1. Reduce processing time
➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...

Task Allocation
37 hours to complete

1 reducer allocated, despite having 48 available...

Task Allocation

Allocating more reducers on a per-job basis

job.getConfiguration().setInt(
"mapred.reduce.tasks",
numReducers);

Allocating more mappers on a per-job basis

job.getConfiguration().set(
"mapred.max.split.size",
String.valueOf(splitSize));
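Why a smaller maximum split size yields more mappers can be seen with a little arithmetic. The sketch below is plain Java with no Hadoop dependency. It mirrors, as far as I understand it, the split-size rule in Hadoop's new-API FileInputFormat (max of the minimum size and the smaller of maximum size and block size) and ignores the slack Hadoop allows on the last split. The class name, method names, and the 64 MB / 1 GB figures are assumptions for illustration only.

```java
public class SplitEstimate {
    // Split size rule: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of map tasks for a single input file (ceiling division).
    static long estimateMappers(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;   // ASSUMPTION: 64 MB HDFS blocks
        long fileSize = 1024 * mb;  // ASSUMPTION: a 1 GB input file

        // No maximum configured: splits follow the block size.
        long defaultSplit = splitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println("default mappers: " + estimateMappers(fileSize, defaultSplit));

        // A 16 MB cap on the split size: four times as many mappers.
        long cappedSplit = splitSize(blockSize, 1, 16 * mb);
        System.out.println("capped mappers:  " + estimateMappers(fileSize, cappedSplit));
    }
}
```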

Task Allocation
From 1 → 40 reducers: 37 hours → 14 hours to complete

Partitioners
14 hours to complete

[Figure: unevenly sized partitions, from ~50KB to ~500MB.]

InputSampler.Sampler<IntWritable, Text> sampler =
new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
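The idea behind sampling the input and partitioning by key range can be sketched without Hadoop. The code below is an illustration only, not the InputSampler or TotalOrderPartitioner source: it samples keys, sorts the sample, picks evenly spaced split points, and assigns each key to a partition by binary search. The class name, the skewed key distribution, the sample size, and the partition count are all assumptions.

```java
import java.util.Arrays;
import java.util.Random;

public class RangePartitionSketch {
    // Picks numPartitions - 1 split points from a sorted copy of the sample.
    static int[] splitPoints(int[] sampledKeys, int numPartitions) {
        int[] sorted = sampledKeys.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Assigns a key to the partition whose key range contains it.
    // Keys equal to a split point go to the partition on its right.
    static int partition(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);

        // ASSUMPTION: heavily skewed keys, most of them small.
        int[] keys = new int[100_000];
        for (int i = 0; i < keys.length; i++) {
            double u = rng.nextDouble();
            keys[i] = (int) (u * u * u * 1_000_000);
        }

        // Random sample of the keys, standing in for a sampled input.
        int[] sample = new int[1_000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[rng.nextInt(keys.length)];
        }

        int[] points = splitPoints(sample, 4);
        int[] counts = new int[4];
        for (int key : keys) {
            counts[partition(key, points)]++;
        }
        System.out.println("split points: " + Arrays.toString(points));
        System.out.println("keys per partition: " + Arrays.toString(counts));
    }
}
```

Because the split points follow the sampled key distribution rather than fixed-width ranges, each partition should receive a roughly similar number of keys even when the keys themselves are skewed.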

2 hours to complete once the data is evenly distributed

[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours; x-axis: No. Good Recommendations/10.
  Orig. item-based: 6.5K hours, 1.5
  Cust. item-based: 2.4K hours, 1.5
  Orig. item-based → Cust. item-based: -4.1K hours (63%)]

2. Improve quality
➔ Mahout provides item-based CF
➔ We have many more items than users
➔ In that case, user-based CF is typically more appropriate
➔ So let's make one!

[Figure: the item.RecommenderJob diagram adapted for user-based recommendation. The Item Similarity matrix (item x item) is replaced by a User Similarity matrix (user x user); A User's Preferences (item x user) and Recommendations (item x user) remain.]
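A user-based variant can be sketched the same way, again in plain Java and without claiming to reproduce the custom job from the talk. Here similarity is computed between users (the number of articles two researchers share), and an article is scored for a researcher by summing the similarities of the researchers who have it. The preference matrix is an assumption, chosen to be consistent with the item similarity values on the earlier slide. Class and method names are hypothetical.

```java
import java.util.Arrays;

public class UserBasedSketch {
    // ASSUMED preferences (item x user); 1 means the researcher has the article.
    // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    static final int[][] PREFS = {
        {1, 1, 0, 0},
        {1, 0, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };

    // User similarity (user x user): count of articles two users share.
    // The diagonal holds each user's own article count.
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int[] itemRow : prefs) {
            for (int u = 0; u < users; u++) {
                for (int v = 0; v < users; v++) {
                    sim[u][v] += itemRow[u] * itemRow[v];
                }
            }
        }
        return sim;
    }

    // Scores every item for one user: preferences (item x user)
    // multiplied by that user's column of the similarity matrix.
    static int[] score(int[][] prefs, int[][] sim, int user) {
        int[] scores = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            for (int v = 0; v < sim.length; v++) {
                scores[i] += prefs[i][v] * sim[v][user];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] sim = userSimilarity(PREFS);
        int babbage = 1;
        System.out.println("user similarity: " + Arrays.deepToString(sim));
        System.out.println("scores for Babbage: " + Arrays.toString(score(PREFS, sim, babbage)));
    }
}
```

With these assumed preferences, Babbage's only unseen article with a positive score is Comp Sci 2, which he inherits from Turing, the one researcher he overlaps with. With far fewer users than items, the user x user similarity matrix is also much smaller than the item x item one, which is one plausible reason the user-based variant is cheaper here.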

[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours; x-axis: No. Good Recommendations/10.
  Orig. item-based: 6.5K hours, 1.5
  Cust. item-based: 2.4K hours, 1.5
  Orig. user-based: 1K hours, 2.5
  Cust. user-based: 0.3K hours, 2.5
Changes shown:
  Orig. item-based → Cust. item-based: -4.1K hours (63%)
  Cust. item-based → Orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
  Orig. user-based → Cust. user-based: -0.7K hours (70%)
  Orig. item-based → Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)]

Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large-scale data set
    ➔ Good-quality recommendations
➔ Tuning helps
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
    ➔ We save 95% of resources
➔ Use an appropriate algorithm
    ➔ Item- vs user-based (MAHOUT-1004)
    ➔ We increase precision by 67%
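For reference, the two headline figures follow from the chart values for the original item-based job (6.5K normalised hours, 1.5 good recommendations out of 10) and the customised user-based job (0.3K hours, 2.5 good recommendations out of 10):

```latex
\[
\text{resources saved} = \frac{6.5\text{K} - 0.3\text{K}}{6.5\text{K}} = \frac{6.2}{6.5} \approx 95\%,
\qquad
\text{precision gain} = \frac{2.5 - 1.5}{1.5} = \frac{1}{1.5} \approx 67\%
\]
```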


http://www.mendeley.com/profiles/kris-jack/
