This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
3. { “Introduction” : “History and Etymology” }
• A Scalable Machine Learning Library built on Hadoop, written in Java.
• Driven by Chu et al.’s paper “Map-Reduce for Machine Learning on Multicore” (co-authored by Andrew Ng).
• Started as a Lucene sub-project. Became Apache TLP in April 2010.
• Latest version out – 0.6 (released on 6th Feb 2012).
• Mahout – a keeper/driver of elephants; fitting, since many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.
• Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
• The Taste Recommendation Framework was added later by Sean Owen.
Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.
Much of Mahout’s work has been to not only implement these algorithms conventionally, in an efficient and scalable way, but also to convert some of these algorithms to work at scale on top of Hadoop. Hadoop’s mascot is an elephant, which at last explains the project name!

Mahout incubates a number of techniques and algorithms, many still in development or in an experimental phase. At this early stage in the project's life, three core themes are evident: collaborative filtering / recommender engines, clustering, and classification. This is by no means all that exists within Mahout, but these are the most prominent and mature themes at the time of writing, and they form the scope of this book.

Chances are that if you are reading this, you are already aware of the interesting potential of these three families of techniques. But just in case, read on.
5. { “Machine Learning” : “Introduction” }
“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
• Branch of Artificial Intelligence
• Design and Development of Algorithms
• Computers Evolve Behavior based on Empirical Data.
• Supervised Learning
• Using labeled training data to create a classifier that can predict outputs for unseen inputs.
• Unsupervised Learning
• Using unlabeled training data to discover hidden structure (e.g. clusters) in the inputs; there is no target output to predict.
• Semi-Supervised Learning
• Combining a small amount of labeled data with a large amount of unlabeled data.
6. { “Machine Learning” : “Applications” }
• Recommend Friends, Dates, Products to end-user.
• Classify content into pre-defined groups.
• Find Similar content based on Object Properties.
• Identify key topics in large Collections of Text.
• Detect Anomalies within given data.
• Ranking Search Results with User Feedback Learning.
• Classifying DNA sequences.
• Sentiment Analysis / Opinion Mining.
• Computer Vision.
• Natural Language Processing.
• BioInformatics.
• Speech and Handwriting Recognition.
• Others ...
7. {“Machine Learning”: “Challenges”}
• Big Data
• Yesterday’s processing on next-generation data.
• Time for Processing
• Large and Cheap Storage
Size            | Classification                             | Tools
----------------|--------------------------------------------|------------------------------------------
Lines           | Sample Data Analysis and Visualization     | Whiteboard, bash, ...
KBs - low MBs   | Prototype Data Analysis and Visualization  | Matlab, Octave, R, Processing, bash, ...
MBs - low GBs   | Online Data Storage                        | MySQL (DBs), ...
MBs - low GBs   | Online Data Analysis                       | NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs   | Online Data Visualization                  | Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs | Big Data Storage                           | HDFS, HBase, Cassandra, ...
GBs - TBs - PBs | Big Data Analysis                          | Hive, Mahout, Hama, Giraph, ...
8. { “Machine Learning” : “Mahout for Big Data”}
• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
• Some Algorithms won’t scale to massive machine clusters.
• Others fit naturally on a MapReduce framework like Apache Hadoop.
• Most Mahout implementations are MapReduce-enabled.
• Focus: “Scalability with Hadoop’s MapReduce processing framework over Big Data stored in Hadoop’s HDFS”.
• The only Machine Learning library built on a MapReduce framework: other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library.
• The only scalable Machine Learning framework with MapReduce and Hadoop support listed on www.mloss.org (Machine Learning Open-Source Software).
11. { “Internals” : “Features” }
• Scalable
• Dual-Mode (Sequential and MapReduce-enabled)
• Support for easy extension.
• Large number of data sources supported, including the newer NoSQL variants.
• It is a Java library – a framework of tools intended to be used and adapted by developers.
• Advanced implementations of Java’s Collections Framework for better performance (see the sketch after this list).
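As a small illustration of the last point, Taste ships primitive-friendly collections such as FastIDSet and FastByIDMap, which store long IDs directly instead of boxing them the way java.util.HashMap<Long, V> would. A minimal sketch (the class name is ours; the collection APIs are Mahout's):

    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.common.FastIDSet;

    public class CollectionsSketch {
      public static void main(String[] args) {
        // A set of user IDs held as primitive longs - no Long boxing.
        FastIDSet userIDs = new FastIDSet();
        userIDs.add(12345L);

        // A map keyed by primitive long IDs.
        FastByIDMap<String> names = new FastByIDMap<String>();
        names.put(12345L, "Alice");

        System.out.println(userIDs.contains(12345L) + " -> " + names.get(12345L));
      }
    }

Avoiding boxed keys matters because recommenders hold millions of user and item IDs in memory at once.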
13. { “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
• Help users find items they might like based on historical behavior and preferences.
• Top-level packages define the Mahout interfaces to these key abstractions, wired together in the sketch after this list:
• DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
• UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• UserNeighborhood – Nearest-N User Neighborhood, Threshold User Neighborhood
• Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
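A minimal sketch of how these abstractions snap together in the Taste API. It uses the generic user-based recommender for brevity rather than the implementations named above; ratings.csv is a placeholder for a userID,itemID,preference file, and the neighborhood size of 10 is arbitrary:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
        // DataModel: preferences loaded from a userID,itemID,preference CSV.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // UserSimilarity: Pearson correlation between users' rating vectors.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // UserNeighborhood: the 10 most similar users.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        // Recommender: classic user-based collaborative filtering.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

Swapping in any of the DataModel, similarity, neighborhood, or recommender implementations listed above changes only the corresponding constructor line.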
14. { “Algorithms” : “Recommender Systems”, “id” : “Example”}
Customer-by-product matrix of binary purchase values (1 = bought, 0 = didn’t buy); the recommendation task is to predict what each customer might buy next. Product labels A–D are assumed here; the original slide shows only the grid:

Customer | A | B | C | D
Alice    | 0 | 1 | 1 | 1
Bob      | 1 | 0 | 1 | 1
John     | 0 | 1 | 0 | 0
Jane     | 1 | 0 | 1 | 1
Bill     | 1 | 1 | 1 | 1
Steve    | 1 | 0 | 1 | 1
Larry    | 1 | 0 | 0 | 0
Don      | 1 | 1 | 1 | 0
Jack     | 1 | 1 | 0 | 1
15. { “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
Tanimoto Coefficient: T(A,B) = Nc / (Na + Nb - Nc), where
Na – number of customers who bought Product A
Nb – number of customers who bought Product B
Nc – number of customers who bought both Product A and Product B

Item-item Tanimoto similarities for the example:

  | A           | B           | C           | D
A | 1           | 1/3 ≈ 0.33  | 5/8 = 0.625 | 5/8 = 0.625
B | 1/3 ≈ 0.33  | 1           | 3/8 = 0.375 | 3/8 = 0.375
C | 5/8 = 0.625 | 3/8 = 0.375 | 1           | 5/7 ≈ 0.714
D | 5/8 = 0.625 | 3/8 = 0.375 | 5/7 ≈ 0.714 | 1
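Mahout provides this coefficient off the shelf as TanimotoCoefficientSimilarity, which works both user-user and item-item. A hedged sketch (purchases.csv is a hypothetical boolean-preference file with one userID,itemID pair per line; item IDs 1 and 2 are placeholders):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class TanimotoSketch {
      public static void main(String[] args) throws Exception {
        // Boolean preferences: each line is just "userID,itemID" (a purchase).
        DataModel model = new FileDataModel(new File("purchases.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        // Tanimoto = Nc / (Na + Nb - Nc); e.g. for products A and B above:
        // 3 / (7 + 5 - 3) = 1/3 ~ 0.33.
        System.out.println(similarity.itemSimilarity(1L, 2L));
      }
    }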
16. { “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
Cosine Coefficient over the same binary data: cos(A,B) = Nc / √(Na × Nb), with Na, Nb, and Nc defined as on the previous slide.

  | A             | B             | C             | D
A | 1             | 3/√35 ≈ 0.507 | 5/√42 ≈ 0.772 | 5/√42 ≈ 0.772
B | 3/√35 ≈ 0.507 | 1             | 3/√30 ≈ 0.548 | 3/√30 ≈ 0.548
C | 5/√42 ≈ 0.772 | 3/√30 ≈ 0.548 | 1             | 5/6 ≈ 0.833
D | 5/√42 ≈ 0.772 | 3/√30 ≈ 0.548 | 5/6 ≈ 0.833   | 1
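A worked instance of the formula, using counts read off the example purchase matrix (the counts are derived from that reconstruction rather than quoted from the slide): 7 customers bought A, 5 bought B, and 3 bought both, so

\[
\cos(A,B) \;=\; \frac{N_C}{\sqrt{N_A \cdot N_B}} \;=\; \frac{3}{\sqrt{7 \times 5}} \;\approx\; 0.507
\]

Unlike Tanimoto, the denominator grows with the geometric mean of the two popularity counts, so co-purchases with very popular items are discounted more gently.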
17. { “Algorithms” : “Classification” , “id” : “Introduction”}
• Assigning data to discrete categories.
• Train a model on labeled data.
• Run the model on new, unlabeled data.
• Classifier: an algorithm that implements classification, especially in a concrete implementation.
• Classification Algorithms
• Maximum entropy classifier
• Naïve Bayes classifier
• Decision trees, decision lists
• Support vector machines
• Kernel estimation and k-nearest-neighbor algorithms
• Perceptrons
• Neural networks (multi-layer perceptrons)
(Figure: an incoming message classified as Spam, Not Spam, or “?”.)
18. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train as “Not Spam”: President Obama’s Nobel Prize speech.
20. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Run on: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
21. { “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training:
• Read the Features
• Calculate per-Document Statistics
• Normalize across Categories
• Calculate the normalizing factor of each Label
• Testing:
• Classification (a fifth job, explicitly invoked)
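For reference, the decision rule whose terms those jobs estimate is the standard Naïve Bayes rule (textbook formulation, not anything Mahout-specific):

\[
\hat{c} \;=\; \arg\max_{c} \Big( \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big)
\]

where c ranges over the labels (spam / not spam) and w_1, ..., w_n are the features (words) of the document; the log form is used in practice to avoid floating-point underflow when multiplying many small probabilities.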
Choosing the algorithm through which the system will learn, and the variables used as input, are key steps in building the classification system. The basic steps in building a classification system are illustrated in figure 13.2.

Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.

The figure shows two phases of the classification process, with the upper path representing training of the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions.
22. { “Algorithms” : “Clustering” , “id” : “Introduction”}
• Grouping unstructured data without any training data.
• Self-learning from experience.
• Small intra-cluster distance – the algorithm seeks a local (ideally global) minimum of this objective, formalized below.
• Large inter-cluster distance.
• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
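For concreteness, “small intra-cluster distance” is usually formalized as the within-cluster sum of squared distances that k-means minimizes (standard formulation, not Mahout-specific):

\[
J \;=\; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2}
\]

where \(\mu_k\) is the centroid of cluster \(C_k\). Iterative algorithms converge only to a local minimum of J, which is why seeding with good initial centroids (e.g. from Canopy) matters.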
34. { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
• Assume the number of clusters is far smaller than the number of points; therefore |Clusters| << |Points|.
• Hadoop’s DistributedCache is used to give each Mapper access to all the current cluster centroids.
(Figure: mappers M0–M3 emit <clusterID, observation> pairs, which are grouped and reduced by reducers R0–R1.)
• Important arguments (see the sample invocation below):
--maxIter
--convergenceDelta
--method
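A sample invocation sketching where these arguments fit. The paths are placeholders and the flag names follow the 0.6-era k-means driver, so verify against bin/mahout kmeans --help:

    bin/mahout kmeans \
      --input vectors/ \
      --clusters initial-centroids/ \
      --output kmeans-output/ \
      --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure \
      --maxIter 10 \
      --convergenceDelta 0.5 \
      --method mapreduce \
      --clustering

Each pass reads the current centroids from the DistributedCache, assigns every point to its nearest centroid in the mappers, and recomputes centroids in the reducers, stopping after --maxIter passes or once centroid movement falls below --convergenceDelta.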
36. { “Algorithms” : “Other Algorithms” }
• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests
• Clustering
‣ Latent Dirichlet Allocation
- Topic models
‣ Fuzzy K-Means
- Points can be assigned to multiple clusters
‣ Canopy clustering
- Fast approximations of clusters
‣ Spectral clustering
- Treats the data points as a graph
• Evolutionary Algorithms - Integration with Watchmaker for Genetic Programming Fitness Functions
• Dimensionality Reduction
• Regression
37. { “Algorithms” : “Future” }
• Classification
‣ Decision Trees such as J48 and ID3
• Clustering
‣ DBSCAN and COBWEB clustering techniques
• Evolutionary Algorithms
‣ Classical Genetic Algorithms
• Association Rules
‣ Apriori (Mahout already has an alternative frequent-itemset implementation, Parallel FP-Growth).
42. { “Summary”: “Apache Mahout” }
• Scalable Library
• Three Primary Areas of Focus: Recommenders, Clustering, Classification
• Other Algorithms
• All in your friendly neighborhood MapReduce