2. Mahout: Brief History
• Started in 2008 as a subproject of Apache Lucene:
– Text mining, clustering, some classification.
• Sean Owen started Taste in 2005:
– A recommender engine for businesses that never took off;
– The Mahout community asked him to merge in the Taste code.
• Became a top-level Apache project in April 2010.
• This lineage resulted in a fragmented framework.
3. Mahout: What is it?
• A collection of machine learning algorithm implementations:
– Many (but not all) implemented on Hadoop MapReduce;
– A Java library with a handy command-line interface for running common tasks.
• Currently serves 3 key areas:
– Recommendation engines
– Clustering
– Classification
• The focus of today’s talk is functionality accessible from the command-line interface:
– Most accessible for Hadoop beginners.
4. Recommenders
• Supports user-based and item-based collaborative filtering:
– User-based: similarity between users;
– Item-based: similarity between items.
[Figure: users 1–6 connected to items A–F; user 3 likes item E, so the similar user 1 may also like item E.]
5. Implementations
• Non-distributed (no Hadoop requirement):
– The ‘Taste’ code; supports item-based and user-based filtering (see the sketch below);
– Good for up to 100 million user–item associations;
– Faster than the distributed version.
• Distributed (Hadoop MapReduce):
– Item-based, using a configurable similarity measure between items;
– Latent-factor based:
• Estimates ‘genres’ of items from user preferences;
• Similar to the approach that won the Netflix Prize.
– Both have command-line interfaces.
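A minimal sketch of the non-distributed (‘Taste’) user-based recommender, assuming a hypothetical ratings.csv in user,item,rating format; class names are from Mahout’s org.apache.mahout.cf.taste packages:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv is hypothetical: one user,item,rating triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Neighbourhood of the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}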
6. Distributed Item Recommender
[Figure: an m × n user–item ratings matrix (users 1…m in rows, items 1…n in columns, with R marking known ratings) feeds a similarity calculation, producing an n × n item–item similarity matrix, e.g. similarity(1, 2) = 0.2, similarity(1, n) = 0.8.]
7. Distributed Item Recommendation
Input csv file: user, item, rating …

mahout itemsimilarity -i <input_file> -o <output_path> …

Output csv file: item, item, similarity …
[Figure: in the ratings matrix, user 3 has rated items 2 and 5 but not item 3; the missing rating is predicted from those ratings, weighted by item similarity:]

R_3 = (s_{2,3} R_2 + s_{3,5} R_5) / (s_{2,3} + s_{3,5})
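To make the weighted average concrete, a worked example with invented numbers: if the user rated R_2 = 4 and R_5 = 2, and the item similarities are s_{2,3} = 0.8 and s_{3,5} = 0.4, then R_3 = (0.8 × 4 + 0.4 × 2) / (0.8 + 0.4) = 4.0 / 1.2 ≈ 3.3.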
8. Distributed Item Recommendation
• Can perform item similarity and recommendation generation in a single call.

Input csv file (tab separated): user, item, rating …

mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations …

– -u: file (tab separated) listing the users to generate recommendations for, one user per line;
– --numRecommendations: the number of recommendations to return per user;
– Output csv file (tab separated): user, item,score, item,score, …
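A sketch of a concrete invocation, reusing only the flags shown above (the file names are hypothetical):

mahout recommenditembased -i ratings.csv -o recommendations -u users.txt --numRecommendations 10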
9. Clustering
[Figure: the two-step pipeline: Vectorization, then Clustering.]
10. Clustering
• Don’t know the structure of data, want to sensibly group
things together.
• A number of distributed algorithms are supported:
– Canopy Clustering (MAHOUT-3 – integrated)
– K-Means Clustering (MAHOUT-5 – integrated)
– Fuzzy K-Means (MAHOUT-74 – integrated)
– Expectation Maximization (EM) (MAHOUT-28)
– Mean Shift Clustering (MAHOUT-15 – integrated)
– Hierarchical Clustering (MAHOUT-19)
– Dirichlet Process Clustering (MAHOUT-30 – integrated)
– Latent Dirichlet Allocation (MAHOUT-123 – integrated)
– Spectral Clustering (MAHOUT-363 – integrated)
– Minhash Clustering (MAHOUT-344 - integrated)
• Some have command-line interface support (see the example below).
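For example, a sketch of a k-means invocation (the paths are placeholders, and these flags are from memory of the 0.x drivers, so they may differ between Mahout versions):

mahout kmeans -i <vector_path> -c <initial_centroid_path> -o <output_path> -k 20 -x 10

Here -k asks the driver to seed 20 random initial centroids into the -c path, and -x caps the number of iterations.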
11. Vectorization
• Vectorization is data specific:
– In the majority of cases you need to write a MapReduce job to generate the vectorized input;
– Input formats are still not uniform across Mahout;
– Most clustering implementations expect:
• SequenceFile(WritableComparable, VectorWritable) (see the sketch below);
• Note: the key is ignored.
• Mahout has some support for clustering text documents:
– Can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents;
– Enables text documents to be clustered from the command-line interface.
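A minimal sketch of hand-rolled vectorization into that format, assuming a local run with the Hadoop 0.20-era SequenceFile.createWriter API (paths and data are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000"); // hypothetical output path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
    try {
      double[][] points = {{1.0, 2.0}, {3.0, 4.0}}; // toy data
      for (int i = 0; i < points.length; i++) {
        // The key is ignored by the clustering jobs; the value carries the vector
        writer.append(new Text("point-" + i),
            new VectorWritable(new DenseVector(points[i])));
      }
    } finally {
      writer.close();
    }
  }
}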
12. Text Document Vectorization
Term frequency (per document):
The | conduct | as | run | doctor | with | a  | Patel
47  | 3       | 7  | 5   | 8      | 12   | 54 | 6

Document frequency (per corpus):
The  | conduct | as  | run | doctor | with | a   | Patel
1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

TFIDF_i = TF_i × log(N / DF_i)

This increases the weight of less common words/n-grams within the corpus.

Unigram: (crude)
Bi-gram: (crude, oil)
Tri-gram: (crude, oil, prices)
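A worked example from the tables above, assuming a corpus of N = 1000 documents and a natural logarithm: for ‘Patel’, TFIDF = 6 × log(1000/3) ≈ 34.9, while for ‘The’, TFIDF = 47 × log(1000/1000) = 0; the rare name is weighted up and the ubiquitous word drops out entirely.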
13. Text Document Vectorization
A directory of plain text documents is converted to a sequence file of <name, text body> pairs:

mahout seqdirectory -i <input_path> -o <seq_output_path>

From the sequence file, seq2sparse builds the dictionary file, performs n-gram generation, computes term frequencies and inverse document frequencies, and generates the TF-IDF vectors:

mahout seq2sparse -i <seq_input_path> -o <output_path>
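Together, a sketch of a concrete run over a hypothetical directory of documents:

mahout seqdirectory -i docs/ -o docs-seq
mahout seq2sparse -i docs-seq -o docs-vectors

(The TF-IDF vectors typically land in a tfidf-vectors subdirectory of the output path, though this may vary by version.)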
15. Classification
• Train the machine to provide discrete answers to a specific question.
• Mahout supports the following algorithms:
– Logistic Regression
– Naïve Bayes
– Random Forests
– Others in development
[Figure: data with known answers (bit strings labelled A or B, e.g. 100100: A, 010110: B) feeds a model training algorithm; the resulting trained model assigns estimated answers to data without answers.]
16. Classification Workflow
[Figure: the workflow: labelled samples are vectorized and split into a ~90% training set and a ~10% test set; (1) the training set drives model training, (2) the test set drives model testing, and (3) the trained model labels newly vectorized input, producing label approximations.]
17. Feature Extraction
• Good feature extraction is critical to trained-model performance:
– You need domain understanding to ‘measure’ the right things;
– If you measure the wrong things, even the best model will perform badly;
– Caution is needed to avoid ‘label leaks’.
• Typically requires hand-written MapReduce code:
– If text based, you can use the text mining tools in Hive or Mahout.
18. Naïve Bayes Classifier
Given a feature vector (f_1, f_2, …, f_n) and a label l:

classify(f_1, f_2, …, f_n) = argmax_l p(L = l) · ∏_{i=1..n} p(F_i = f_i | L = l)

p(F_i = f_i | L = l) is the probability of feature i having value f_i given l; e.g. assume a Gaussian pdf:

P(f = v | l) = (1 / sqrt(2πσ_l^2)) · exp(−(v − µ_l)^2 / (2σ_l^2))
Note: model training boils down to estimating the per-label (conditional) mean and variance of each feature vector element. This can be trivially parallelized and implemented in MapReduce; the estimated parameters then plug into the formula above, as the sketch below illustrates.
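To make the formula concrete, a minimal single-machine sketch in plain Java (not Mahout’s implementation; the priors, means and variances are invented and would normally come from training):

public class GaussianNB {
  // Log joint likelihood of one label: log p(L = l) + sum_i log p(F_i = f_i | L = l)
  static double logLikelihood(double[] f, double prior, double[] mu, double[] var) {
    double ll = Math.log(prior);
    for (int i = 0; i < f.length; i++) {
      double d = f[i] - mu[i];
      // Log of the Gaussian pdf from the slide above
      ll += -0.5 * Math.log(2.0 * Math.PI * var[i]) - d * d / (2.0 * var[i]);
    }
    return ll;
  }

  public static void main(String[] args) {
    double[] sample = {1.1, 0.2};
    // Invented parameters; in practice they come out of training
    double a = logLikelihood(sample, 0.6, new double[] {1.0, 0.0}, new double[] {0.5, 0.5});
    double b = logLikelihood(sample, 0.4, new double[] {0.0, 1.0}, new double[] {0.5, 0.5});
    // argmax over the two labels
    System.out.println(a > b ? "label A" : "label B");
  }
}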
19. Naïve Bayes in Mahout
The command-line interface is specific to text classification (e.g. spam detection, document classification, etc.).
Input: plain text file, one document per line, in the format:
label<tab>word word word …
Output: the generated model, a set of files in sequence file format (variances).

mahout trainclassifier -i <input_path> -o <output_path>
--gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

– --gramSize: the n-gram size, default = 1;
– -minDf: discard n-grams that occur in fewer than this number of documents;
– -minSupport: discard n-grams that occur fewer than this number of times in a document.
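A sketch of a concrete invocation, reusing only the flags shown above (paths and parameter values are hypothetical):

mahout trainclassifier -i training-docs -o bayes-model --gramSize 1 -minDf 2 -minSupport 2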
20. Naïve Bayes in Mahout
• You need to write your own classifier for this to be practical:
[Figure: a document is passed to a classifier wrapping the trained model, which returns a label.]
Look at the class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
• Classifies a document;
• Returns the top n predicted labels;
• Returns the classification certainty;
• …
21. Classification vs. Recommendation
• Can use a classifier to recommend:
– Is the user interested in the item, or not?
• A classifier is based on features of the specific item and the customer.
• A recommender is based on the past behavior of customers.
• Classification: single decisions.
• Recommendation: ranking.