Mahout
Scalable Data Mining for Everybody

Wednesday, March 16, 2011

What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)

Classification in Detail
• Naive Bayes Family
  • Hadoop based training
• Decision Forests
  • Hadoop based training
• Logistic Regression (aka SGD)
  • fast on-line (sequential) training

So What?
Online training has low overhead for small and moderate size data-sets.
[Figure: chart not recovered; its annotation reads "big starts here".]

An Example
[Figure: slide image not recovered]

And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

Mahout’s SGD
• Learns on-line per example
  • O(1) memory
  • O(1) time per training example
• Sequential implementation
  • fast, but not parallel
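The per-example learning described above can be sketched in a few lines of Python. This is an illustrative toy, not Mahout's Java implementation: the class name, the fixed learning rate of 0.1, and the two-feature data stream are all assumptions made for the example. The point it shows is that the model keeps only a fixed-size weight vector and makes one update per example, so memory does not grow with the number of examples seen.

```python
import math


class OnlineLogisticRegressionSketch:
    """Binary logistic regression trained one example at a time.

    Hypothetical class for illustration; it is not Mahout's API.
    """

    def __init__(self, num_features, learning_rate=0.1):
        # Fixed-size state: memory does not grow with the number of examples.
        self.weights = [0.0] * num_features
        self.learning_rate = learning_rate

    def classify_scalar(self, features):
        """Return the estimated probability of the positive class."""
        z = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, target, features):
        """Take one gradient step on a single (target, features) example."""
        error = target - self.classify_scalar(features)
        for i, x in enumerate(features):
            self.weights[i] += self.learning_rate * error * x


model = OnlineLogisticRegressionSketch(num_features=2)
# Toy stream (assumed data): feature 0 is a bias term, feature 1 decides the label.
stream = [(1, [1.0, 2.0]), (0, [1.0, -2.0])] * 200
for target, features in stream:
    model.train(target, features)
```

Each call to train touches every feature once, so the cost per example depends on the feature count but not on how many examples came before.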

Special Features
• Hashed feature encoding
• Per-term annealing
  • learn the boring stuff once
• Auto-magical learning knob turning
  • learns correct learning rate, learns correct learning rate for learning learning rate, ...
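Hashed feature encoding, the first item above, can be sketched as follows. This is a Python illustration rather than Mahout's encoder classes: the function name, the CRC32 hash, and the use of two probes per token are assumptions chosen for the example. Each token is hashed straight to a slot in a fixed-length vector, so no dictionary of known terms is needed, and two different tokens may land in the same slot (a collision).

```python
import zlib


def hashed_encode(tokens, num_buckets, probes=2):
    """Encode tokens into a fixed-length vector by hashing each token.

    Hypothetical helper for illustration. Using several probes per token
    is an assumption here: it spreads each token over more than one slot
    so that a single collision does not hide the token entirely.
    """
    vector = [0.0] * num_buckets
    for token in tokens:
        for probe in range(probes):
            # Salt the token with the probe number so each probe picks its own slot.
            key = f"{probe}:{token}".encode("utf-8")
            index = zlib.crc32(key) % num_buckets
            vector[index] += 1.0
    return vector


vector = hashed_encode(["hadoop", "lunch", "friday", "lunch"], num_buckets=16)
```

The vector length is set up front and never changes, whatever vocabulary shows up later. A smaller num_buckets saves memory at the price of more collisions.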


Feature Encoding
[Figure: slide image not recovered]

Hashed Encoding
[Figure: slide image not recovered]

Feature Collisions
[Figure: slide image not recovered]

Learning Rate Annealing
[Figure: learning rate plotted against # training examples seen]

Per-term Annealing
[Figure: learning rate plotted against # training examples seen, with labels "Common Feature" and "Rare Feature"]

General Structure
• OnlineLogisticRegression
  • Traditional logistic regression
  • Stochastic Gradient Descent
  • Per term annealing
  • Too fast (for the disk + encoder)
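Per-term annealing can be sketched like this. The slides do not give the schedule, so the inverse-square-root decay, the class name, and the toy stream are all assumptions for illustration, and none of this is Mahout's code. The idea shown is that each term keeps its own update count, so a common feature cools down quickly while a rare feature still takes large steps when it finally appears.

```python
import math


def per_term_rate(base_rate, times_updated, decay=0.5):
    """Learning rate for one term after `times_updated` updates (assumed schedule)."""
    return base_rate / (1.0 + times_updated) ** decay


class PerTermAnnealedSGD:
    """Online logistic regression with one learning-rate schedule per term.

    Hypothetical class for illustration; features are a sparse dict of
    {index: value} so only the terms present in an example are updated.
    """

    def __init__(self, num_features, base_rate=0.5):
        self.weights = [0.0] * num_features
        self.update_counts = [0] * num_features
        self.base_rate = base_rate

    def classify_scalar(self, features):
        z = sum(self.weights[i] * x for i, x in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, target, features):
        error = target - self.classify_scalar(features)
        for i, x in features.items():
            rate = per_term_rate(self.base_rate, self.update_counts[i])
            self.weights[i] += rate * error * x
            self.update_counts[i] += 1


model = PerTermAnnealedSGD(num_features=2)
for n in range(100):
    features = {0: 1.0}      # term 0 appears in every example (common)
    if n == 50:
        features[1] = 1.0    # term 1 appears once (rare)
    model.train(n % 2, features)

common_rate = per_term_rate(model.base_rate, model.update_counts[0])
rare_rate = per_term_rate(model.base_rate, model.update_counts[1])
```

After this run the common term has been updated 100 times and the rare term once, so rare_rate is still several times larger than common_rate.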

Next Level
• CrossFoldLearner
  • contains multiple primitive learners
  • online cross validation
  • 5x more work
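Online cross validation with several primitive learners can be sketched as below. This is a Python toy under stated assumptions, not Mahout's CrossFoldLearner: the class names, the round-robin fold assignment, and the accuracy metric are choices made for the example. Each incoming example is held out from exactly one learner, which scores it before it is discarded, while the remaining learners train on it.

```python
import math


class TinyLearner:
    """Minimal online logistic regression, present only to make the sketch runnable."""

    def __init__(self, num_features, rate=0.1):
        self.weights = [0.0] * num_features
        self.rate = rate

    def classify_scalar(self, features):
        z = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, target, features):
        error = target - self.classify_scalar(features)
        for i, x in enumerate(features):
            self.weights[i] += self.rate * error * x


class CrossFoldSketch:
    """Hypothetical wrapper: k learners, each tested on the examples it never trains on."""

    def __init__(self, num_features, folds=5):
        self.learners = [TinyLearner(num_features) for _ in range(folds)]
        self.seen = 0
        self.held_out_total = 0
        self.held_out_correct = 0

    def train(self, target, features):
        held_out = self.seen % len(self.learners)  # round-robin fold (assumed)
        for k, learner in enumerate(self.learners):
            if k == held_out:
                # This learner never trains on this example, so the score is a fair test.
                guess = 1 if learner.classify_scalar(features) >= 0.5 else 0
                self.held_out_correct += int(guess == target)
                self.held_out_total += 1
            else:
                learner.train(target, features)
        self.seen += 1

    def held_out_accuracy(self):
        return self.held_out_correct / self.held_out_total


cross_fold = CrossFoldSketch(num_features=2, folds=5)
for target, features in [(1, [1.0, 2.0]), (0, [1.0, -2.0])] * 250:
    cross_fold.train(target, features)
```

With five folds, every example costs four training updates plus one scoring pass, which is roughly where a "5x more work" figure would come from. The held-out accuracy is available at any point in the stream without a separate validation pass.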

And again
• AdaptiveLogisticRegression
  • 20 x CrossFoldLearner
  • evolves good learning and regularization rates
  • 100 x more work than basic learner
  • still faster than disk + encoding
A comparison
• Traditional view
  • 400 x (read + OLR)
• Revised Mahout view
  • 1 x (read + mu x 100 x OLR) x eta
  • mu = efficiency from killing losers early
  • eta = efficiency from stopping early
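The two cost formulas above can be worked through with concrete numbers. The values below are hypothetical and chosen only to make the arithmetic visible: a read cost of 10 units and an OLR cost of 1 unit per example (reading assumed to dominate), and mu and eta both set to 0.5. Treating mu and eta as fractions between 0 and 1 is also an assumption.

```python
def traditional_cost(read, olr, passes=400):
    """Traditional view: every trial re-reads the data and trains one learner."""
    return passes * (read + olr)


def revised_cost(read, olr, mu, eta, learners=100):
    """Revised view: one read feeds all learners, scaled by mu and eta."""
    return 1 * (read + mu * learners * olr) * eta


# Hypothetical unit costs per example, not measurements.
read, olr = 10.0, 1.0
mu, eta = 0.5, 0.5

print(traditional_cost(read, olr))        # 4400.0
print(revised_cost(read, olr, mu, eta))   # 30.0
```

Under these assumed costs the data is read once instead of 400 times, and that single read is what drives most of the difference.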
Deployment
• Training
  • ModelSerializer.writeBinary(..., model)
• Deployment
  • m = ModelSerializer.readBinary(...)
  • r = m.classifyScalar(featureVector)

The Upshot
• One machine can go fast
  • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  • simple sample server farm
