LA HUG Dec 2011 - Recommendation Talk

Dec 2011 – LA HUG – Santa Monica, CA
Mahout, CDH3, and Recommendation
Josh Patterson | Sr Solution Architect

Who is Josh Patterson?
• josh@cloudera.com
– Twitter: @jpatanooga
• Master’s Thesis: self-organizing mesh networks
– Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for
openPDC project at Tennessee Valley Authority (TVA)
– Led team which designed classification techniques for time series and
Map Reduce
• Open source work at
– http://openpdc.codeplex.com
– https://github.com/jpatanooga
• Today
– Sr. Solutions Architect at Cloudera

Outline

• Intro to Recommendation
• Recommendation with Mahout and CDH3

“I know I've made some very poor decisions recently, but I
can give you my complete assurance that my work will be
back to normal. I've still got the greatest enthusiasm and
confidence in the mission. And I want to help you.”
--- HAL from “2001: A Space Odyssey”

Recommendation

Information Explosion

• Amount of data, articles, shows exploding
– Hard to know what to pay attention to
– It would be nice if it were personalized to my own
tastes
• Issues at scale
– Heap size limits become an issue with a large
number of preferences
• > 1 billion preferences
– “Real time” recommenders have issues with scale
as well

Copyright 2010 Cloudera Inc. All rights reserved

User-based recommendations

• Look for users who share the same rating
patterns as the active user
– Similarity between users is based on the
preferences/actions/ratings of those users
• So we can recommend the same things to
similar users
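
A minimal sketch of the user-similarity idea, assuming a Euclidean-distance similarity of 1 / (1 + distance) computed over the items both users rated. The function names are illustrative, not Mahout's API; the ratings are the ones from the "Simple Recommender Input" slide.

```python
import math

# Ratings keyed by user, then item (values from the "Simple Recommender Input" slide).
ratings = {
    10: {1000: 5.0, 1001: 3.0, 1004: 2.5},
    13: {1001: 3.5, 1002: 4.5, 1003: 1.0, 1004: 3.5},
    15: {1000: 4.5, 1001: 3.5, 1002: 2.5},
}

def user_similarity(a, b):
    """Return 1 / (1 + Euclidean distance) over shared items, or 0.0 if none are shared."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        return 0.0
    dist = math.sqrt(sum((ratings[a][i] - ratings[b][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

def most_similar_user(active):
    """Pick the other user with the highest similarity to the active user."""
    others = [u for u in ratings if u != active]
    return max(others, key=lambda u: user_similarity(active, u))

print(most_similar_user(10))
```

With this particular similarity, user 15 comes out closest to user 10, so items user 15 rated that user 10 has not seen would be the candidates to recommend.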

Item-based recommendations

• Item-based recommenders are derived from how
similar items are to other items
– Users who bought X also bought Y
• Compute similarity matrix between items

Item vs User Based

• Algorithms are similar
– But not entirely symmetric
• Item based
– Running time goes up as the number of items increases
• If the number of items is relatively low compared to the
number of users, performance could be better
– Items tend to change less than users
• User based
– Running time goes up as the number of users
increases

Recommendation in Mahout

• Not a single recommender engine
– Assortment of components
• Components can be plugged together and
customized
– We target a specific domain with a custom
built recommender
– Need experimentation to get good results

Co-Occurrence Matrix
• Example:
– If we have 10 users, and all of them express a preference
for items A and B
• A and B are said to co-occur 10 times
• Can be thought of much like similarity
– The more we see two items occur together
– The greater the chance the two items are related
somehow
• Producing a co-occurrence matrix ends up being a
simple exercise in counting
– For each user, we count every pair of items that occurs
together in that user's preferences
– Works well distributed

Simple Recommender Input
UserID, ItemID, Rating

10, 1000, 5.0
10, 1001, 3.0
10, 1004, 2.5

13, 1001, 3.5
13, 1002, 4.5
13, 1003, 1.0
13, 1004, 3.5

15, 1000, 4.5
15, 1001, 3.5
15, 1002, 2.5

Simple Co-Occurrence Matrix
      1000  1001  1002  1003  1004
1000     2     2     1     0     1
1001     2     3     2     1     2
1002     1     2     2     1     1
1003     0     1     1     1     1
1004     1     2     1     1     2
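
The matrix above can be reproduced by counting, as in this minimal single-machine sketch. The function name `cooccurrence` is illustrative, and this is not how Mahout's distributed job is implemented; it only shows the counting idea on the slide's input rows.

```python
from collections import defaultdict
from itertools import product

# (userID, itemID, rating) rows from the "Simple Recommender Input" slide.
prefs = [
    (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5),
    (13, 1001, 3.5), (13, 1002, 4.5), (13, 1003, 1.0), (13, 1004, 3.5),
    (15, 1000, 4.5), (15, 1001, 3.5), (15, 1002, 2.5),
]

def cooccurrence(prefs):
    """Count, per user, every ordered pair of items that user expressed a preference for."""
    items_by_user = defaultdict(set)
    for user, item, _rating in prefs:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for items in items_by_user.values():
        # Includes (i, i), so the diagonal holds the number of users per item.
        for a, b in product(items, repeat=2):
            counts[(a, b)] += 1
    return counts

matrix = cooccurrence(prefs)
items = sorted({item for _, item, _ in prefs})
for row in items:
    print(row, [matrix[(row, col)] for col in items])
```

Rating values are ignored here: only the fact that a user expressed a preference for both items counts toward the pair.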

User’s Preferences as a Vector

• In other recommendation algorithms we look at
users as points in space
– Euclidean distance as a measure of similarity
• In a data model with n items, user
preferences are like a vector over n
dimensions
– With 1 dimension for each item
– Creates a sparse vector (most entries are 0)
• Example
– User 10: { 5.0, 3.0, 0.0, 0.0, 2.5 }
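
A small sketch of how the example vector is formed, assuming the five items from the sample input in a fixed order. The helper name `to_dense_vector` is illustrative; the dictionary is the sparse form and the list is the dense form shown on the slide.

```python
# Fixed item order defines the dimensions of the vector.
item_ids = [1000, 1001, 1002, 1003, 1004]

# Sparse representation of user 10: only the items the user rated.
user_prefs = {1000: 5.0, 1001: 3.0, 1004: 2.5}

def to_dense_vector(prefs, item_ids):
    """Expand sparse preferences to one value per item, using 0.0 for unrated items."""
    return [prefs.get(item, 0.0) for item in item_ids]

print(to_dense_vector(user_prefs, item_ids))
```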

Computing Recommendations

• Multiply the co-occurrence matrix by the user
vector (as a column vector)
– Each item row vector x the user column vector
• Result: a vector whose dimension is equal to
the number of items
– The highest values in the result vector R
are the “best recommendations”

Calculating R: Example

      1000  1001  1002  1003  1004      User 10       R
1000     2     2     1     0     1        5.0        18.5
1001     2     3     2     1     2        3.0        24
1002     1     2     2     1     1   x    0.0   =    13.5
1003     0     1     1     1     1        0.0        5.5
1004     1     2     1     1     2        2.5        16

R value for item 1002:

1 ( 5.0 ) + 2 ( 3.0 ) + 2 ( 0.0 ) + 1 ( 0.0 ) + 1 ( 2.5 ) = 13.5
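
The same arithmetic for every row can be checked with a plain matrix-vector product. This is a minimal in-memory sketch using the slide's numbers; the helper name `multiply` is illustrative and has nothing to do with how Mahout distributes the computation.

```python
items = [1000, 1001, 1002, 1003, 1004]

# Co-occurrence matrix from the earlier slide, rows and columns in item order.
cooccurrence = [
    [2, 2, 1, 0, 1],
    [2, 3, 2, 1, 2],
    [1, 2, 2, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 2, 1, 1, 2],
]

# User 10's preferences as a vector over the same item order.
user_vector = [5.0, 3.0, 0.0, 0.0, 2.5]

def multiply(matrix, vector):
    """Dot each matrix row with the vector to get one R value per item."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]

R = multiply(cooccurrence, user_vector)
print(dict(zip(items, R)))
```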

Recommendations

• If a user has already indicated a preference for an item, we don’t
want to recommend it
– User 10 has already rated:
10, 1000, 5.0
10, 1001, 3.0
10, 1004, 2.5
• We take the remaining items ranked by their R value
– R: 18.5, 24, 13.5, 5.5, 16 (items 1000 through 1004)
– Here it would be 1002 at 13.5
• Followed by 1003 at 5.5
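
The filter-and-rank step can be sketched as follows, using the R values for user 10. The function name `recommend` and the `top_n` parameter are illustrative assumptions.

```python
# R values per item for user 10, from the worked example.
R = {1000: 18.5, 1001: 24.0, 1002: 13.5, 1003: 5.5, 1004: 16.0}

# Items user 10 has already expressed a preference for.
already_rated = {1000, 1001, 1004}

def recommend(scores, seen, top_n=10):
    """Drop items the user already rated, then rank the rest by descending score."""
    candidates = [(item, score) for item, score in scores.items() if item not in seen]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

print(recommend(R, already_rated))
```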

“Dave Bowman: I don't know; I think so. You know of course
though he's right about the 9000 series having a perfect
operational record. They do.
Dr. Frank Poole: Unfortunately that sounds a little like
famous last words.”
--- “2001: A Space Odyssey”

Recommendations with Mahout and CDH3u2

Step 1: Install CDH3u2

• Setup CDH3u2
– https://ccp.cloudera.com/display/CDHDOC/CDH3+Quick+Start+Guide
– Setup in Pseudo-distributed mode for this
demo if you don’t have a cluster

Step 2: Install Mahout

• Setup Apache Mahout with CDH3
– https://ccp.cloudera.com/display/CDHDOC/Mahout+Installation
– Make sure $JAVA_HOME is set or Mahout will
complain

Step 3: Get Grouplens Data
• Download
– http://www.grouplens.org/system/files/ml-1m.zip
• Format
– UserID::MovieID::Rating::Timestamp
• where
– UserIDs are integers
– MovieIDs are integers
– Ratings are 1 through 5 “stars” (integers)
– Timestamp is seconds since the epoch
• Each user has at least 20 ratings

Step 4: Prep Data

• This file isn’t in exactly the format Mahout
prefers, but this is an easy fix
– Mahout is looking for a CSV file with lines of
the form:
• userID, itemID, value
• From bash run
– tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv
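
For anyone who would rather script the conversion, here is a rough Python equivalent of the tr/cut pipeline. It assumes every non-empty line has exactly the four `::`-separated fields described on the previous slide; the function names are illustrative.

```python
def convert_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,MovieID,Rating'."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return f"{user_id},{movie_id},{rating}"

def convert_file(src="ratings.dat", dst="ratings.csv"):
    """Convert the whole ratings file, skipping blank lines."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.strip():
                fout.write(convert_line(line) + "\n")

# A sample line in the documented format (the values are illustrative).
print(convert_line("1::1193::5::978300760"))
```

Calling `convert_file()` in the directory holding `ratings.dat` writes `ratings.csv` with the timestamp column dropped, which is what the shell one-liner does.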

Step 5: Generate Recommendations

• Input to this job is going to be the
“ratings.csv” file we generated of the format:
– userID, itemID, value
• We also want to give it a list of userIDs to
generate recommendations for
• Output of the recommendation job will be
another file with lines of the form:
– userID [itemID:score, ...]
– Each line gives a userID with its recommended
itemIDs and their preference scores

Step 5: Command Line

• Put ratings file in HDFS
– hadoop fs -put ratings.csv [input-hdfs-path]
• Put user file in HDFS
– Let’s put “6040” on a single line in a file and put
that in HDFS
• hadoop fs -put [my_local_file] [user_file_location_in_hdfs]
• Now we can run the recommender job
– mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]
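
If the job is launched from a script, the argument list can be assembled as in this sketch. It only uses the flags shown above; the HDFS paths are hypothetical placeholders, and the function name `recommender_command` is illustrative.

```python
import shlex

def recommender_command(input_path, output_path, temp_dir, users_file):
    """Build the argument list for the recommenditembased job shown on the slide."""
    return [
        "mahout", "recommenditembased",
        "--input", input_path,
        "--output", output_path,
        "--tempDir", temp_dir,
        "--usersFile", users_file,
    ]

# Hypothetical HDFS locations; substitute your own.
cmd = recommender_command(
    "ratings/ratings.csv", "recs/out", "recs/tmp", "ratings/users.txt"
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where `mahout` is on the PATH.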

Take a Look at the Results

• Cat output of job
– hadoop fs -cat [output-hdfs-path]/part-r-00000
• Which should look like:
– 6040 [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]
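
A line of this shape can be parsed with a few string operations. This is a sketch assuming the userID is separated from the bracketed list by whitespace and that each entry is `itemID:score`, as in the sample line; the function name is illustrative.

```python
def parse_recommendations(line):
    """Split 'userID [item:score,item:score,...]' into (userID, [(item, score), ...])."""
    user_part, items_part = line.split(None, 1)
    body = items_part.strip().strip("[]")
    recs = []
    for pair in (p for p in body.split(",") if p):
        item_id, score = pair.split(":")
        recs.append((int(item_id), float(score)))
    return int(user_part), recs

# Sample output line from the slide.
line = "6040 [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]"
user, recs = parse_recommendations(line)
print(user, recs[:3])
```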

Questions? (Thank You!)

• Recommendation Tutorial based on:
– http://www.cloudera.com/blog/2011/11/recommendation-with-apache-mahout-in-cdh3/
• Cloudera’s Distribution including Apache
Hadoop (CDH):
– http://www.cloudera.com
• Apache Mahout
– http://mahout.apache.org

More?
• Look at www.cloudera.com/training to learn more about
Hadoop
• Read www.cloudera.com/blog
– Lots of great use cases.
• Check out the downloads page at
www.cloudera.com/downloads
– Get your own copy of Cloudera Distribution for Apache Hadoop
(CDH)
– Grab Demo VMs, Connectors, other useful tools.

• Contact Josh with any questions at
josh@cloudera.com

References

• S. Owen, R. Anil, T. Dunning, E. Friedman:
Mahout in Action
• Sarwar et al.: Item-Based Collaborative
Filtering Recommendation Algorithms
• Apache Mahout Wiki:
– http://mahout.apache.org/

Workflow
• Job 1
– Preprocess data if needed
• Job 2
– Create User Vectors
• Job 3
– Count Users
• Job 4
– Prune and Transpose
• Job 5
– RowSimilarityJob
• Weights
• pairwiseSimilarity
• asMatrix
• Job 6
– Pre Partial Multiply 1
• Job 7
– Pre Partial Multiply 2
• Job 8
– Partial Multiply
• Job 9

Temp Files Generated
• countUsers
• itemIDIndex
• itemUserMatrix
• pairwiseSimilarity
• partialMultiply
• partialMultiply1
• partialMultiply2
• similarityMatrix
• userVectors
• weights
