Slope one recommender on hadoop

Slope One Recommender
on Hadoop
YONG ZHENG
Center for Web Intelligence
DePaul University
Nov 15, 2012

Overview
• Introduction

• Recommender Systems & Slope One Recommender

• Distributed Slope One on Mahout and Hadoop

• Experimental Setup and Analyses

• Drive Mahout on Hadoop

• Interesting Communities


Introduction
• About Me: a recommendation guy

• My Research: data mining and recommender systems

• Typical Experimental Research

1) Design or improve an algorithm;
2) Run the algorithm and the baseline algorithms on datasets;
3) Compare experimental results;
4) Try different parameters, find the reasons, and even re-design
and improve the algorithm itself;
5) Run the algorithm and the baseline algorithms on datasets again;
6) Compare experimental results;
7) Try different parameters, find the reasons, and even re-design
and improve the algorithm itself;
8) And so on, until the results approach what is expected.

Introduction
• Sometimes, data is large-scale.
e.g. one algorithm may take days to complete. What if the
experimental results are not as expected? Then we improve
the algorithm and run it for days again, and again.

What could we do previously? (for tasks that are not too complicated)
1). Parallelize by hand, but with complicated synchronization and
limited resources, such as CPU, memory, etc.;
2). Take advantage of PC labs: let’s do it with 10 PCs

• Nearly all research will ultimately face large-scale
problems, especially in the domain of data mining.

• But, we have Map-Reduce NOW!

Introduction

• No need to distribute data and tasks manually.
Instead, we simply write the configuration.
• No need to care about low-level details, e.g. how data is
distributed, which machine a specific task runs on and when,
or how tasks are carried out one by one.
• Instead, we pre-define the workflow. We can take
advantage of the functionality provided by mappers
and reducers.
• More benefits: replication, load balancing, robustness, etc.

Recommender Systems

• Collaborative Filtering

• Slope One and Simple Weighted Slope One

• Slope One in Mahout

• Distributed Slope One in Mahout

• Mappers and Reducers


Collaborative Filtering (CF)
One of the most popular families of recommendation algorithms.
• User-based: User-CF
• Item-based: Item-CF, Slope One

[Figure] Example: User-based Collaborative Filtering (predicting the unknown rating of User 5 from the ratings given by other users)

Slope One Recommender
Reference: Daniel Lemire and Anna Maclachlan, "Slope One Predictors for
Online Rating-Based Collaborative Filtering," in SIAM Data Mining
(SDM'05), April 21-23, 2005. http://lemire.me/fr/abstracts/SDM2005.html

User Batman Spiderman
U1 3 4
U2 2 4
U3 2 ?

1). How differently were the two movies rated?
U1 rated Spiderman higher by (4-3) = 1
U2 rated Spiderman higher by (4-2) = 2
On average, Spiderman is rated (1+2)/2 = 1.5 higher

2). The rating difference gives the prediction
Since U3 gave Batman 2 stars, he will probably rate
Spiderman (2+1.5) = 3.5 stars
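
The two steps above can be written as a minimal Python sketch. It only illustrates the arithmetic on this slide; the function name average_diff is hypothetical and the data is the three-user table shown above.

```python
# Ratings from the slide; U3 has not rated Spiderman yet.
ratings = {
    "U1": {"Batman": 3, "Spiderman": 4},
    "U2": {"Batman": 2, "Spiderman": 4},
    "U3": {"Batman": 2},
}

def average_diff(ratings, target, source):
    """Average of (target - source) over users who rated both items."""
    diffs = [r[target] - r[source]
             for r in ratings.values()
             if target in r and source in r]
    return sum(diffs) / len(diffs)

# Step 1: how much higher is Spiderman rated than Batman on average?
avg = average_diff(ratings, "Spiderman", "Batman")

# Step 2: shift U3's Batman rating by that average difference.
prediction = ratings["U3"]["Batman"] + avg
print(avg, prediction)  # 1.5 3.5
```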

Simple Weighted Slope One
Usually, users rate multiple items
User HarryPotter Batman Spiderman
U1 5 3 4
U2 ? 2 4
U3 4 2 ?

1). How differently were the two movies rated?
Diff(Batman, Spiderman) = [(4-3)+(4-2)]/2 = 1.5
Diff(HarryPotter, Spiderman) = (4-5)/1 = -1
The denominators “2” and “1” are the numbers of users who rated both
movies; we call this the “count”.

2). The weighted rating differences give the prediction
We use a simple weighted approach
Using Batman only, rating = 2+1.5 = 3.5
Using HarryPotter only, rating = 4-1 = 3
Using both, predicted rating = (3.5*2 + 3*1) / (2+1) = 3.33
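
As a rough check of the weighted formula, here is a small Python sketch. The helper names diff_and_count and predict are made up for illustration; each per-item estimate is weighted by its count, as on the slide.

```python
# The three-user, three-movie table from the slide (missing ratings omitted).
ratings = {
    "U1": {"HarryPotter": 5, "Batman": 3, "Spiderman": 4},
    "U2": {"Batman": 2, "Spiderman": 4},
    "U3": {"HarryPotter": 4, "Batman": 2},
}

def diff_and_count(ratings, target, source):
    """Average (target - source) and the number of users who rated both."""
    diffs = [r[target] - r[source]
             for r in ratings.values()
             if target in r and source in r]
    if not diffs:
        return None, 0
    return sum(diffs) / len(diffs), len(diffs)

def predict(ratings, user, target):
    """Count-weighted average of (user's rating + average diff)."""
    numerator, denominator = 0.0, 0
    for source, value in ratings[user].items():
        avg, count = diff_and_count(ratings, target, source)
        if count == 0:
            continue  # nobody co-rated this pair, so it carries no information
        numerator += (value + avg) * count
        denominator += count
    return numerator / denominator if denominator else None

pred = predict(ratings, "U3", "Spiderman")
print(round(pred, 2))  # 3.33
```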

Simple Weighted Slope One
u1 5 3 4
u2 ? 2 4
u3 4 2 ?
Question: Online or Offline?
To compute the predicted ratings, we need 2 matrices:
1). Difference Matrix
         Movie1  Movie2  Movie3  Movie4
Movie1
Movie2   -1.5
Movie3    2       1
Movie4   -1       0.5    -2

2). Count Matrix
For each pair of items, the number of users who rated both

Slope One in Mahout
Mahout, an open-source machine learning library.

1). Recommendation algorithms
User-based CF, Item-based CF, Slope One, etc

2). Clustering
KMeans, Fuzzy KMeans, etc

3). Classification
Decision Trees, Naive Bayes, SVM, etc

4). Latent Factor Models
LDA, SVD, Matrix Factorization, etc

Slope One in Mahout
org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender
Pre-Processing Stage: (class MemoryDiffStorage with Map)
for every item i
  for every other item j
    for every user u expressing preference for both i and j
      add the difference in u’s preference for i and j to an average

Recommendation Stage:
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u’s preference value for j
    add this to a running average
return the top items, ranked by these averages

Simple weighting: as introduced previously (weight by count)
StdDev weighting: item-item rating diffs with a lower standard
deviation should be weighted more highly
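
The two stages can be sketched in plain Python. This only illustrates the pseudocode above and is not Mahout's Java implementation: build_diffs and recommend are hypothetical names, the data is the toy table from the earlier slides, and the plain (unweighted) running average is used.

```python
from collections import defaultdict
from itertools import permutations

ratings = {
    "U1": {"HarryPotter": 5, "Batman": 3, "Spiderman": 4},
    "U2": {"Batman": 2, "Spiderman": 4},
    "U3": {"HarryPotter": 4, "Batman": 2},
}

def build_diffs(ratings):
    """Pre-processing: average (rating_i - rating_j) per ordered item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        # Every ordered pair of items this user has rated.
        for i, j in permutations(prefs, 2):
            sums[(i, j)] += prefs[i] - prefs[j]
            counts[(i, j)] += 1
    averages = {pair: sums[pair] / counts[pair] for pair in sums}
    return averages, dict(counts)

def recommend(ratings, diffs, user, how_many=10):
    """Recommendation: estimate each unrated item, return the top ones."""
    prefs = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    estimates = {}
    for i in all_items - set(prefs):
        guesses = [prefs[j] + diffs[(i, j)] for j in prefs if (i, j) in diffs]
        if guesses:
            estimates[i] = sum(guesses) / len(guesses)  # unweighted average
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:how_many]

diffs, counts = build_diffs(ratings)
top = recommend(ratings, diffs, "U3")
print(top)  # [('Spiderman', 3.25)]
```

Without count weighting the estimate for U3 is (3 + 3.5)/2 = 3.25 rather than 3.33, which shows what the simple weighting changes.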

Distributed Slope One in Mahout
Similar to our previous practice (e.g. the matrix factorization
process), what we need here is the Difference Matrix.

Suppose M users rated N items; the difference matrix then
requires N(N-1)/2 cells. Density is another aspect, i.e. how
many items each user rated. If there are many items and the
rating matrix is dense, the computational cost increases
accordingly.
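
To get a feel for the N(N-1)/2 growth, a quick calculation helps. The function name diff_matrix_cells is made up; 4 is the size of the small example matrix shown earlier and 3,900 is the movie count quoted for MovieLens-1M in the experiment setup.

```python
def diff_matrix_cells(n_items):
    """Number of unordered item pairs, i.e. cells below the diagonal."""
    return n_items * (n_items - 1) // 2

print(diff_matrix_cells(4))     # 6
print(diff_matrix_cells(3900))  # 7603050
```

This is an upper bound: a pair only needs a cell if at least one user rated both items.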

Question again: Online or Offline?
Depends on tasks & data.

Large-scale data. Let’s do it offline!

Distributed Slope One in Mahout
package org.apache.mahout.cf.taste.hadoop.slopeone;
class SlopeOneAverageDiffsJob
class SlopeOnePrefsToDiffsReducer
class SlopeOneDiffsToAveragesReducer

package org.apache.mahout.cf.taste.hadoop;
class ToItemPrefsMapper
org.apache.hadoop.mapreduce.Mapper

Two Mapper-Reducer Stages:
1). Create DiffMatrix for each user
2). Collect AvgDiff info, counts, StdDev

Let’s see how it works…

Mapper and Reducer - 1
U1 5 3 4
U2 ? 2 4
U3 4 2 ?

Mapper1 (ToItemPrefsMapper)
→ <UserID, Pair<ItemID, Rating>>
Reducer1 (PrefsToDiffsReducer)
→ <Pair<Item1,Item2>, Diff> (for all three users)

<U1>     Potter  Bat   Spider
Potter
Bat      -2
Spider   -1      1

<U2>     Potter  Bat   Spider
Potter
Bat      NULL
Spider   NULL    2

Mapper and Reducer - 2
<U1>     Potter  Bat   Spider
Potter
Bat      -2
Spider   -1      1

<U2>     Potter  Bat   Spider
Potter
Bat      NULL
Spider   NULL    2

Mapper2 (org.apache.hadoop.mapreduce.Mapper)
Reducer2 (DiffsToAveragesReducer)
Average Diffs, Count, StdDev

<Aggregate>  Potter   Bat      Spider
Potter
Bat          -2, 1
Spider       -1, 1    1.5, 2

Each <a,b> pair denotes a = average diff, b = count.
Note: in practice there are three matrices (average diff, count, StdDev); only two are shown here.

Predictions
U1 5 3 4
U2 ? 2 4
U3 4 2 ?

<Aggregate>  Potter   Bat      Spider
Potter
Bat          -2, 1
Spider       -1, 1    1.5, 2

Each <a,b> pair denotes a = average diff, b = count.
Note: in practice there are three matrices (average diff, count, StdDev); only two are shown here.

Prediction(U3, Spiderman) = [(4-1)*1 + (2+1.5)*2] / (1+2)
= 10/3 ≈ 3.33
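
The two Mapper-Reducer stages and the final prediction can be imitated in memory with a short Python sketch. Everything here is an assumption for illustration: run_stage is a hypothetical stand-in for one Hadoop pass, the function names only echo the Mahout class names, the diff for a pair (a, b) is taken as rating(b) - rating(a) with items sorted by name, and the population standard deviation is used. Because all three users are fed in, the (Batman, HarryPotter) pair gets count 2 (U1 and U3), whereas the aggregate table above is built from the two per-user matrices shown.

```python
from collections import defaultdict
from itertools import combinations
from statistics import pstdev

def run_stage(records, mapper, reducer):
    """Tiny in-memory stand-in for one map -> shuffle -> reduce pass."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Stage 1: group ratings by user, then emit one diff per item pair.
def to_item_prefs(record):
    user, item, rating = record
    yield user, (item, rating)

def prefs_to_diffs(user, item_prefs):
    for (a, ra), (b, rb) in combinations(sorted(item_prefs), 2):
        yield (a, b), rb - ra

# Stage 2: identity map, then average the diffs of each item pair.
def identity(record):
    yield record

def diffs_to_averages(pair, diffs):
    yield pair, (sum(diffs) / len(diffs), len(diffs), pstdev(diffs))

def predict(user_prefs, target, averages):
    """Count-weighted Slope One estimate from the aggregated diffs."""
    numerator = denominator = 0.0
    for source, rating in user_prefs.items():
        if (source, target) in averages:
            avg, count, _ = averages[(source, target)]
        elif (target, source) in averages:
            avg, count, _ = averages[(target, source)]
            avg = -avg  # stored in the opposite direction
        else:
            continue
        numerator += (rating + avg) * count
        denominator += count
    return numerator / denominator if denominator else None

records = [
    ("U1", "HarryPotter", 5), ("U1", "Batman", 3), ("U1", "Spiderman", 4),
    ("U2", "Batman", 2), ("U2", "Spiderman", 4),
    ("U3", "HarryPotter", 4), ("U3", "Batman", 2),
]
diffs = run_stage(records, to_item_prefs, prefs_to_diffs)
averages = dict(run_stage(diffs, identity, diffs_to_averages))
prediction = predict({"HarryPotter": 4, "Batman": 2}, "Spiderman", averages)
print(averages[("Batman", "Spiderman")])  # (1.5, 2, 0.5)
print(round(prediction, 2))               # 3.33
```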

Experiments

• Data

• Hadoop Setup

• Running Performances


Experiment Setup
Data: MovieLens-1M ratings
# of users: 6,040
# of movies: 3,900
# of ratings: 1,000,209

Density of the ratings:
each user has at least 20 ratings
obviously, some users have many more ratings

Rating format: UserID, ItemID, Rating (scale 1-5)

Data Split: 80% training, 20% testing

Experiment Setup
Hadoop Cluster Setup
• IBM SmartCloud
• 1 master node, 7 slave nodes
• Each node runs SUSE Linux Enterprise Server v11 SP1
• Server Configuration:
64 bit (vCPU: 2, RAM: 4 GiB, Disk: 60 GiB)
• Hadoop v.0.20.205.0
• Mahout distribution-0.6

The environment setup follows the typical workflow described at:
http://irecsys.blogspot.com/2012/11/configurate-map-reduce-environment-on.html

Thanks Scott Young, neat writeup!!

Experimental Analyses
Stage-1: SlopeOneAverageDiffsJob by Map-Reduce
Goal: Build DiffStorage
Output: DiffStorage txt file, 1.45GB
Running Time:
• real 13m 34.228s
• user 0m 5.136s
• sys 0m 1.028s
Item1  Item2  Diff   Count  StdDev
221    223    -1.02  197    0.5

Stage-2: Java evaluator to measure MAE on testing set
Running Time:
• Load Testing Set (21K records), 299 ms
• Load Training Set (79K records), 1,771 ms
• Load DiffStorage, 176,352 ms ≈ 2.9 min
• Prediction (21K records), 18,182 ms ≈ 0.3 min
• MAE = 0.71330756
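
For reference, MAE (mean absolute error) is the average of |predicted - actual| over the test ratings. Below is a minimal Python sketch of the metric only; the function name and the four (predicted, actual) pairs are made up and are not taken from the experiment or from the Java evaluator.

```python
def mean_absolute_error(pairs):
    """pairs: iterable of (predicted, actual) ratings."""
    errors = [abs(predicted - actual) for predicted, actual in pairs]
    return sum(errors) / len(errors)

# Hypothetical toy predictions, for illustration only.
toy = [(3.5, 4), (3.0, 3), (4.5, 5), (2.0, 1)]
mae = mean_absolute_error(toy)
print(mae)  # 0.5
```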

Experimental Experiences
1. Why not MovieLens 10M data?
Map-Reduce on the 10M data may take several hours;
Running time depends on the cluster and its configuration;
Also, the DiffStorage file would be too large.
2. Java Evaluator
Loading the full DiffStorage file is time-consuming.
It also triggers Java heap space and "GC overhead limit exceeded" errors;
Those errors could not be fixed by -Xmx or other settings;
Two solutions:
1). Use simple weighting only and discard StdDev weighting.
2). Write a simple Mapper and Reducer and run the evaluation on the cluster.

For MovieLens 1M, this is not that efficient compared with
the live SlopeOne recommender; the 10M data may benefit
more, and I will try MovieLens-10M later. Slope One is
simple but memory-expensive.

More …

• Drive Mahout on Hadoop

• Interesting Communities


Mahout + Hadoop
How do we run more Mahout algorithms on Hadoop?
1. Pre-set Commands in Mahout
Run bin/mahout --help; it provides a list of
available programs such as svd, fkmeans, etc.

Some are basic utilities, such as splitDataset
Some can be executed as Hadoop jobs

e.g. Run and evaluate Matrix Factorization on rating dataset

bin/mahout parallelALS --input inputSource --output outputSource \
    --tempDir tmpFolder --numFeatures 20 --numIterations 10

bin/mahout evaluateFactorization --input inputSource --output outputSource \
    --userFeatures als/out/U/ --itemFeatures als/out/M/ \
    --tempDir tmpFolder

Mahout + Hadoop
2. More Algorithms on Hadoop
Mahout provides a generic way to run more of its algorithms on Hadoop:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<version>.jar \
    <Job Class> --recommenderClassName Class <OPTIONS>

Which kinds of jobs are supported? Mahout implements several.

Some popular ones:
1).org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob
--recommenderClassName ClassName
2).org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
3).org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
4).org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob

Interesting Communities
Beyond Hadoop and Mahout official sites

1. Data Mining
KDnuggets, http://www.kdnuggets.com
Popular community for Data Mining & Analytics. Lots of useful
information, such as news, materials, datasets, jobs, etc.

2. Big Data
SmartData Collective, http://smartdatacollective.com/
Smarter Computing, http://www.smartercomputingblog.com/
Big Data Meetup, http://big-data.meetup.com/

3. Recommender Systems
ACM Official Site, http://recsys.acm.org/
RecSys Wiki, http://recsyswiki.com/

Thank You!

