This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
3. { “Introduction” : “History and Etymology” }
• A Scalable Machine Learning Library built on Hadoop, written in Java.
• Driven by Chu et al.’s paper “Map-Reduce for Machine Learning on Multicore” (co-authored by Andrew Ng).
• Started as a Lucene sub-project. Became Apache TLP in April 2010.
• Latest version out – 0.6 (released on 6th Feb 2012).
• Mahout – a keeper/driver of elephants; fitting, since many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.
• Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
• The Taste Recommendation Framework was added later by Sean Owen.
Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.
Much of Mahout’s work has been to not only implement these algorithms conventionally, in an efficient and scalable way, but also to convert some of these algorithms to work at scale on top of Hadoop. Hadoop’s mascot is an elephant, which at last explains the project name!

Mahout incubates a number of techniques and algorithms, many still in development or in an experimental phase. At this early stage in the project's life, three core themes are evident: collaborative filtering / recommender engines, clustering, and classification. This is by no means all that exists within Mahout, but these are the most prominent and mature themes at the time of writing, and they form the scope of this book.

Chances are that if you are reading this, you are already aware of the interesting potential of these three families of techniques. But just in case, read on.
5. { “Machine Learning” : “Introduction” }
“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
• Branch of Artificial Intelligence
• Design and Development of Algorithms
• Computers Evolve Behavior based on Empirical Data.
• Supervised Learning
• Using labeled training data to create a classifier that can predict outputs for unseen inputs.
• Unsupervised Learning
• Using unlabeled training data to discover hidden structure (e.g. clusters) in the inputs; there is no target output to predict.
• Semi-Supervised Learning
• Combining a small amount of labeled data with a large amount of unlabeled data.
6. { “Machine Learning” : “Applications” }
• Recommend Friends, Dates, Products to end-user.
• Classify content into pre-defined groups.
• Find Similar content based on Object Properties.
• Identify key topics in large Collections of Text.
• Detect Anomalies within given data.
• Ranking Search Results with User Feedback Learning.
• Classifying DNA sequences.
• Sentiment Analysis / Opinion Mining.
• Computer Vision.
• Natural Language Processing.
• BioInformatics.
• Speech and Handwriting Recognition.
• Others ...
7. {“Machine Learning”: “Challenges”}
• Big Data
• Yesterday’s processing on next-generation data.
• Time for Processing
• Large and Cheap Storage
Size            | Classification                             | Tools
----------------|--------------------------------------------|------------------------------------------
Lines           | Sample Data Analysis and Visualization     | Whiteboard, bash, ...
KBs - low MBs   | Prototype Data Analysis and Visualization  | Matlab, Octave, R, Processing, bash, ...
MBs - low GBs   | Online Data Storage                        | MySQL (DBs), ...
MBs - low GBs   | Online Data Analysis                       | NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs   | Online Data Visualization                  | Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs | Big Data Storage                           | HDFS, HBase, Cassandra, ...
GBs - TBs - PBs | Big Data Analysis                          | Hive, Mahout, Hama, Giraph, ...
8. { “Machine Learning” : “Mahout for Big Data”}
• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
• Some Algorithms won’t scale to massive machine clusters.
• Others fit naturally on a MapReduce framework like Apache Hadoop.
• Most Mahout implementations are MapReduce-enabled.
• Focus: “Scalability with Hadoop’s MapReduce processing framework over Big Data stored in Hadoop’s HDFS”.
• The only Machine Learning library built on a MapReduce framework: other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library.
• The only scalable Machine Learning framework with MapReduce and Hadoop support listed on www.mloss.org (Machine Learning Open-Source Software).
11. { “Internals” : “Features” }
• Scalable
• Dual-Mode (Sequential and MapReduce-enabled)
• Support for easy extension.
• Large number of data sources supported, including the newer NoSQL variants.
• It is a Java library – a framework of tools intended to be used and adapted by developers.
• Advanced implementations of Java’s Collections Framework for better performance (see the sketch after this list).
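As a small illustration of the last point, Taste ships primitive-friendly collections such as FastIDSet and FastByIDMap, which store long IDs directly instead of boxing them the way java.util.HashMap<Long, V> would. A minimal sketch (the class name is ours; the collection APIs are Mahout's):

    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.common.FastIDSet;

    public class CollectionsSketch {
      public static void main(String[] args) {
        // A set of user IDs held as primitive longs - no Long boxing.
        FastIDSet userIDs = new FastIDSet();
        userIDs.add(12345L);

        // A map keyed by primitive long IDs.
        FastByIDMap<String> names = new FastByIDMap<String>();
        names.put(12345L, "Alice");

        System.out.println(userIDs.contains(12345L) + " -> " + names.get(12345L));
      }
    }

Avoiding boxed keys matters because recommenders hold millions of user and item IDs in memory at once.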
13. { “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
• Help users find items they might like based on historical behavior and preferences.
• Top-level packages define the Mahout interfaces to these key abstractions, wired together in the sketch after this list:
• DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
• UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• UserNeighborhood – Nearest-N User Neighborhood, Threshold User Neighborhood
• Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
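A minimal sketch of how these abstractions snap together in the Taste API. It uses the generic user-based recommender for brevity rather than the implementations named above; ratings.csv is a placeholder for a userID,itemID,preference file, and the neighborhood size of 10 is arbitrary:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
        // DataModel: preferences loaded from a userID,itemID,preference CSV.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // UserSimilarity: Pearson correlation between users' rating vectors.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // UserNeighborhood: the 10 most similar users.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        // Recommender: classic user-based collaborative filtering.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

Swapping in any of the DataModel, similarity, neighborhood, or recommender implementations listed above changes only the corresponding constructor line.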
14. { “Algorithms” : “Recommender Systems”, “id” : “Example”}
Customer-by-product matrix of binary purchase values (1 = bought, 0 = didn’t buy); the recommendation task is to predict what each customer might buy next. Product labels A–D are assumed here; the original slide shows only the grid:

Customer | A | B | C | D
Alice    | 0 | 1 | 1 | 1
Bob      | 1 | 0 | 1 | 1
John     | 0 | 1 | 0 | 0
Jane     | 1 | 0 | 1 | 1
Bill     | 1 | 1 | 1 | 1
Steve    | 1 | 0 | 1 | 1
Larry    | 1 | 0 | 0 | 0
Don      | 1 | 1 | 1 | 0
Jack     | 1 | 1 | 0 | 1
15. { “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
Tanimoto Coefficient: T(A,B) = Nc / (Na + Nb - Nc), where
Na – number of customers who bought Product A
Nb – number of customers who bought Product B
Nc – number of customers who bought both Product A and Product B

Item-item Tanimoto similarities for the example:

  | A           | B           | C           | D
A | 1           | 1/3 ≈ 0.33  | 5/8 = 0.625 | 5/8 = 0.625
B | 1/3 ≈ 0.33  | 1           | 3/8 = 0.375 | 3/8 = 0.375
C | 5/8 = 0.625 | 3/8 = 0.375 | 1           | 5/7 ≈ 0.714
D | 5/8 = 0.625 | 3/8 = 0.375 | 5/7 ≈ 0.714 | 1
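Mahout provides this coefficient off the shelf as TanimotoCoefficientSimilarity, which works both user-user and item-item. A hedged sketch (purchases.csv is a hypothetical boolean-preference file with one userID,itemID pair per line; item IDs 1 and 2 are placeholders):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class TanimotoSketch {
      public static void main(String[] args) throws Exception {
        // Boolean preferences: each line is just "userID,itemID" (a purchase).
        DataModel model = new FileDataModel(new File("purchases.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        // Tanimoto = Nc / (Na + Nb - Nc); e.g. for products A and B above:
        // 3 / (7 + 5 - 3) = 1/3 ~ 0.33.
        System.out.println(similarity.itemSimilarity(1L, 2L));
      }
    }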
16. { “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
Cosine Coefficient over the same binary data: cos(A,B) = Nc / √(Na × Nb), with Na, Nb, and Nc defined as on the previous slide.

  | A             | B             | C             | D
A | 1             | 3/√35 ≈ 0.507 | 5/√42 ≈ 0.772 | 5/√42 ≈ 0.772
B | 3/√35 ≈ 0.507 | 1             | 3/√30 ≈ 0.548 | 3/√30 ≈ 0.548
C | 5/√42 ≈ 0.772 | 3/√30 ≈ 0.548 | 1             | 5/6 ≈ 0.833
D | 5/√42 ≈ 0.772 | 3/√30 ≈ 0.548 | 5/6 ≈ 0.833   | 1
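A worked instance of the formula, using counts read off the example purchase matrix (the counts are derived from that reconstruction rather than quoted from the slide): 7 customers bought A, 5 bought B, and 3 bought both, so

\[
\cos(A,B) \;=\; \frac{N_C}{\sqrt{N_A \cdot N_B}} \;=\; \frac{3}{\sqrt{7 \times 5}} \;\approx\; 0.507
\]

Unlike Tanimoto, the denominator grows with the geometric mean of the two popularity counts, so co-purchases with very popular items are discounted more gently.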
17. { “Algorithms” : “Classification” , “id” : “Introduction”}
• Assigning data to discrete categories.
• Train a model on labeled data.
• Run the model on new, unlabeled data.
• Classifier: an algorithm that implements classification, especially in a concrete implementation.
• Classification Algorithms
• Maximum entropy classifier
• Naïve Bayes classifier
• Decision trees, decision lists
• Support vector machines
• Kernel estimation and k-nearest-neighbor algorithms
• Perceptrons
• Neural networks (multi-layer perceptrons)
(Figure: an incoming message classified as Spam, Not Spam, or “?”.)
18. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train as “Not Spam”: President Obama’s Nobel Prize speech.
20. { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Run on: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
21. { “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training:
• Read the Features
• Calculate per-Document Statistics
• Normalize across Categories
• Calculate the normalizing factor of each Label
• Testing:
• Classification (a fifth job, explicitly invoked)
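For reference, the decision rule whose terms those jobs estimate is the standard Naïve Bayes rule (textbook formulation, not anything Mahout-specific):

\[
\hat{c} \;=\; \arg\max_{c} \Big( \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big)
\]

where c ranges over the labels (spam / not spam) and w_1, ..., w_n are the features (words) of the document; the log form is used in practice to avoid floating-point underflow when multiplying many small probabilities.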
Choosing the algorithm through which the system will learn, and the variables used as input, are key steps in building the classification system. The basic steps in building a classification system are illustrated in figure 13.2.

Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.

The figure shows two phases of the classification process, with the upper path representing training of the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions.
22. { “Algorithms” : “Clustering” , “id” : “Introduction”}
• Grouping unstructured data without any training data.
• Self-learning from experience.
• Small intra-cluster distance – the algorithm seeks a local (ideally global) minimum of this objective, formalized below.
• Large inter-cluster distance.
• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
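For concreteness, “small intra-cluster distance” is usually formalized as the within-cluster sum of squared distances that k-means minimizes (standard formulation, not Mahout-specific):

\[
J \;=\; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2}
\]

where \(\mu_k\) is the centroid of cluster \(C_k\). Iterative algorithms converge only to a local minimum of J, which is why seeding with good initial centroids (e.g. from Canopy) matters.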
34. { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
• Assume the number of clusters is far smaller than the number of points; therefore |Clusters| << |Points|.
• Hadoop’s DistributedCache is used to give each Mapper access to all the current cluster centroids.
(Figure: mappers M0–M3 emit <clusterID, observation> pairs, which are grouped and reduced by reducers R0–R1.)
• Important arguments (see the sample invocation below):
--maxIter
--convergenceDelta
--method
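A sample invocation sketching where these arguments fit. The paths are placeholders and the flag names follow the 0.6-era k-means driver, so verify against bin/mahout kmeans --help:

    bin/mahout kmeans \
      --input vectors/ \
      --clusters initial-centroids/ \
      --output kmeans-output/ \
      --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure \
      --maxIter 10 \
      --convergenceDelta 0.5 \
      --method mapreduce \
      --clustering

Each pass reads the current centroids from the DistributedCache, assigns every point to its nearest centroid in the mappers, and recomputes centroids in the reducers, stopping after --maxIter passes or once centroid movement falls below --convergenceDelta.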
36. { “Algorithms” : “Other Algorithms” }
• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests
• Clustering
‣ Latent Dirichlet Allocation
- Topic models
‣ Fuzzy K-Means
- Points can be assigned to multiple clusters
‣ Canopy clustering
- Fast approximations of clusters
‣ Spectral clustering
- Treats the data points as a graph
• Evolutionary Algorithms - Integration with Watchmaker for Genetic Programming Fitness Functions
• Dimensionality Reduction
• Regression
37. { “Algorithms” : “Future” }
• Classification
‣ Decision Trees such as J48 and ID3
• Clustering
‣ DBSCAN and COBWEB clustering techniques
• Evolutionary Algorithms
‣ Classical Genetic Algorithms
• Association Rules
‣ Apriori (Mahout already has an alternative frequent-itemset implementation, Parallel FP-Growth).
42. { “Summary”: “Apache Mahout” }
• Scalable Library
• Three Primary Areas of Focus: Recommenders, Clustering, Classification
• Other Algorithms
• All in your friendly neighborhood MapReduce