2. Mahout: Brief History
• Started in 2008 as a subproject of Apache Lucene:
– Text mining, clustering, some classification.
• Sean Owen started Taste in 2005:
– A recommender engine for businesses that never took off;
– The Mahout community asked him to merge in the Taste code.
• Became a top-level Apache project in April 2010.
• This lineage resulted in a fragmented framework.
3. Mahout: What is it?
• A collection of machine learning algorithm implementations:
– Many (but not all) implemented on Hadoop MapReduce;
– A Java library with a handy command-line interface for running common tasks.
• Currently serves 3 key areas:
– Recommendation engines
– Clustering
– Classification
• The focus of today’s talk is functionality accessible from the command-line interface:
– Most accessible for Hadoop beginners.
4. Recommenders
• Supports user-based and item-based collaborative filtering:
– User-based: similarity between users;
– Item-based: similarity between items.
[Figure: users 1–6 connected to items A–F; user 3 likes item E, so the similar user 1 may also like item E.]
5. Implementations
• Non-distributed (no Hadoop requirement):
– The ‘Taste’ code; supports item-based and user-based filtering (see the sketch below);
– Good for up to 100 million user–item associations;
– Faster than the distributed version.
• Distributed (Hadoop MapReduce):
– Item-based, using a configurable similarity measure between items;
– Latent-factor based:
• Estimates ‘genres’ of items from user preferences;
• Similar to the approach that won the Netflix Prize.
– Both have command-line interfaces.
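A minimal sketch of the non-distributed (‘Taste’) user-based recommender, assuming a hypothetical ratings.csv in user,item,rating format; class names are from Mahout’s org.apache.mahout.cf.taste packages:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv is hypothetical: one user,item,rating triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Neighbourhood of the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}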
6. Distributed Item Recommender
[Figure: an m × n user–item ratings matrix (users 1…m in rows, items 1…n in columns, with R marking known ratings) feeds a similarity calculation, producing an n × n item–item similarity matrix, e.g. similarity(1, 2) = 0.2, similarity(1, n) = 0.8.]
7. Distributed Item Recommendation
Input csv file: user, item, rating …

mahout itemsimilarity -i <input_file> -o <output_path> …

Output csv file: item, item, similarity …
[Figure: in the ratings matrix, user 3 has rated items 2 and 5 but not item 3; the missing rating is predicted from those ratings, weighted by item similarity:]

R_3 = (s_{2,3} R_2 + s_{3,5} R_5) / (s_{2,3} + s_{3,5})
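To make the weighted average concrete, a worked example with invented numbers: if the user rated R_2 = 4 and R_5 = 2, and the item similarities are s_{2,3} = 0.8 and s_{3,5} = 0.4, then R_3 = (0.8 × 4 + 0.4 × 2) / (0.8 + 0.4) = 4.0 / 1.2 ≈ 3.3.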
8. Distributed Item Recommendation
• Can perform item similarity and recommendation generation in a single call.

Input csv file (tab separated): user, item, rating …

mahout recommenditembased -i <input_path> -o <output_path> -u <users_file> --numRecommendations …

– -u: file (tab separated) listing the users to generate recommendations for, one user per line;
– --numRecommendations: the number of recommendations to return per user;
– Output csv file (tab separated): user, item,score, item,score, …
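A sketch of a concrete invocation, reusing only the flags shown above (the file names are hypothetical):

mahout recommenditembased -i ratings.csv -o recommendations -u users.txt --numRecommendations 10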
9. Clustering
[Figure: the two-step pipeline: Vectorization, then Clustering.]
10. Clustering
• Don’t know the structure of data, want to sensibly group
things together.
• A number of distributed algorithms are supported:
– Canopy Clustering (MAHOUT-3 – integrated)
– K-Means Clustering (MAHOUT-5 – integrated)
– Fuzzy K-Means (MAHOUT-74 – integrated)
– Expectation Maximization (EM) (MAHOUT-28)
– Mean Shift Clustering (MAHOUT-15 – integrated)
– Hierarchical Clustering (MAHOUT-19)
– Dirichlet Process Clustering (MAHOUT-30 – integrated)
– Latent Dirichlet Allocation (MAHOUT-123 – integrated)
– Spectral Clustering (MAHOUT-363 – integrated)
– Minhash Clustering (MAHOUT-344 - integrated)
• Some have command-line interface support (see the example below).
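For example, a sketch of a k-means invocation (the paths are placeholders, and these flags are from memory of the 0.x drivers, so they may differ between Mahout versions):

mahout kmeans -i <vector_path> -c <initial_centroid_path> -o <output_path> -k 20 -x 10

Here -k asks the driver to seed 20 random initial centroids into the -c path, and -x caps the number of iterations.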
11. Vectorization
• Vectorization is data specific:
– In the majority of cases you need to write a MapReduce job to generate the vectorized input;
– Input formats are still not uniform across Mahout;
– Most clustering implementations expect:
• SequenceFile(WritableComparable, VectorWritable) (see the sketch below);
• Note: the key is ignored.
• Mahout has some support for clustering text documents:
– Can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents;
– Enables text documents to be clustered from the command-line interface.
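A minimal sketch of hand-rolled vectorization into that format, assuming a local run with the Hadoop 0.20-era SequenceFile.createWriter API (paths and data are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000"); // hypothetical output path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
    try {
      double[][] points = {{1.0, 2.0}, {3.0, 4.0}}; // toy data
      for (int i = 0; i < points.length; i++) {
        // The key is ignored by the clustering jobs; the value carries the vector
        writer.append(new Text("point-" + i),
            new VectorWritable(new DenseVector(points[i])));
      }
    } finally {
      writer.close();
    }
  }
}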
12. Text Document Vectorization
Term frequency (per document):
The | conduct | as | run | doctor | with | a  | Patel
47  | 3       | 7  | 5   | 8      | 12   | 54 | 6

Document frequency (per corpus):
The  | conduct | as  | run | doctor | with | a   | Patel
1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

TFIDF_i = TF_i × log(N / DF_i)

This increases the weight of less common words/n-grams within the corpus.

Unigram: (crude)
Bi-gram: (crude, oil)
Tri-gram: (crude, oil, prices)
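A worked example from the tables above, assuming a corpus of N = 1000 documents and a natural logarithm: for ‘Patel’, TFIDF = 6 × log(1000/3) ≈ 34.9, while for ‘The’, TFIDF = 47 × log(1000/1000) = 0; the rare name is weighted up and the ubiquitous word drops out entirely.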
13. Text Document Vectorization
A directory of plain text documents is converted to a sequence file of <name, text body> pairs:

mahout seqdirectory -i <input_path> -o <seq_output_path>

From the sequence file, seq2sparse builds the dictionary file, performs n-gram generation, computes term frequencies and inverse document frequencies, and generates the TF-IDF vectors:

mahout seq2sparse -i <seq_input_path> -o <output_path>
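Together, a sketch of a concrete run over a hypothetical directory of documents:

mahout seqdirectory -i docs/ -o docs-seq
mahout seq2sparse -i docs-seq -o docs-vectors

(The TF-IDF vectors typically land in a tfidf-vectors subdirectory of the output path, though this may vary by version.)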
15. Classification
• Train the machine to provide discrete answers to a specific question.
• Mahout supports the following algorithms:
– Logistic Regression
– Naïve Bayes
– Random Forests
– Others in development
[Figure: data with known answers (bit strings labelled A or B, e.g. 100100: A, 010110: B) feeds a model training algorithm; the resulting trained model assigns estimated answers to data without answers.]
16. Classification Workflow
[Figure: the workflow: labelled samples are vectorized and split into a ~90% training set and a ~10% test set; (1) the training set drives model training, (2) the test set drives model testing, and (3) the trained model labels newly vectorized input, producing label approximations.]
17. Feature Extraction
• Good feature extraction is critical to trained-model performance:
– You need domain understanding to ‘measure’ the right things;
– If you measure the wrong things, even the best model will perform badly;
– Caution is needed to avoid ‘label leaks’.
• Typically requires hand-written MapReduce code:
– If text based, you can use the text mining tools in Hive or Mahout.
18. Naïve Bayes Classifier
Given a feature vector (f_1, f_2, …, f_n) and a label l:

classify(f_1, f_2, …, f_n) = argmax_l p(L = l) · ∏_{i=1..n} p(F_i = f_i | L = l)

p(F_i = f_i | L = l) is the probability of feature i having value f_i given l; e.g. assume a Gaussian pdf:

P(f = v | l) = (1 / sqrt(2πσ_l^2)) · exp(−(v − µ_l)^2 / (2σ_l^2))
Note: model training boils down to estimating the per-label (conditional) mean and variance of each feature vector element. This can be trivially parallelized and implemented in MapReduce; the estimated parameters then plug into the formula above, as the sketch below illustrates.
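To make the formula concrete, a minimal single-machine sketch in plain Java (not Mahout’s implementation; the priors, means and variances are invented and would normally come from training):

public class GaussianNB {
  // Log joint likelihood of one label: log p(L = l) + sum_i log p(F_i = f_i | L = l)
  static double logLikelihood(double[] f, double prior, double[] mu, double[] var) {
    double ll = Math.log(prior);
    for (int i = 0; i < f.length; i++) {
      double d = f[i] - mu[i];
      // Log of the Gaussian pdf from the slide above
      ll += -0.5 * Math.log(2.0 * Math.PI * var[i]) - d * d / (2.0 * var[i]);
    }
    return ll;
  }

  public static void main(String[] args) {
    double[] sample = {1.1, 0.2};
    // Invented parameters; in practice they come out of training
    double a = logLikelihood(sample, 0.6, new double[] {1.0, 0.0}, new double[] {0.5, 0.5});
    double b = logLikelihood(sample, 0.4, new double[] {0.0, 1.0}, new double[] {0.5, 0.5});
    // argmax over the two labels
    System.out.println(a > b ? "label A" : "label B");
  }
}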
19. Naïve Bayes in Mahout
The command-line interface is specific to text classification (e.g. spam detection, document classification, etc.).
Input: plain text file, one document per line, in the format:
label<tab>word word word …
Output: the generated model, a set of files in sequence file format (variances).

mahout trainclassifier -i <input_path> -o <output_path>
--gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

– --gramSize: the n-gram size, default = 1;
– -minDf: discard n-grams that occur in fewer than this number of documents;
– -minSupport: discard n-grams that occur fewer than this number of times in a document.
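A sketch of a concrete invocation, reusing only the flags shown above (paths and parameter values are hypothetical):

mahout trainclassifier -i training-docs -o bayes-model --gramSize 1 -minDf 2 -minSupport 2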
20. Naïve Bayes in Mahout
• You need to write your own classifier for this to be practical:
[Figure: a document is passed to a classifier wrapping the trained model, which returns a label.]
Look at the class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
• Classifies a document;
• Returns the top n predicted labels;
• Returns the classification certainty;
• …
21. Classification vs. Recommendation
• Can use a classifier to recommend:
– Is the user interested in the item, or not?
• A classifier is based on features of the specific item and the customer.
• A recommender is based on the past behavior of customers.
• Classification: single decisions.
• Recommendation: ranking.