1©MapR Technologies 2013- Confidential
Apache Mahout
How it's good, how it's awesome, and where it falls short
What is Mahout?
• “Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
• Components
– math library
– clustering
– classification
– decompositions
– recommendations
What is Right and Wrong with Mahout?
• Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
• All the stuff that isn’t there
Mahout Math
• Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data
• But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
• At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
• Highly composable API
• Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
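
The view, assign and aggregate idea can be illustrated without the library. The following is a minimal sketch in plain Java, assuming a dense `double[][]` as the backing store; the class `DiagonalViewSketch` and its nested `DiagonalView` are hypothetical names invented for illustration and are not Mahout classes. The point it tries to show is that a view holds no data of its own, so assigning through it changes the underlying matrix.

```java
import java.util.function.DoubleUnaryOperator;

/** Toy illustration of view + assign + aggregate. Not Mahout's implementation. */
final class DiagonalViewSketch {

    /** A write-through view of the main diagonal of a dense 2-D array. */
    static final class DiagonalView {
        private final double[][] backing;
        private final int length;

        DiagonalView(double[][] backing) {
            this.backing = backing;
            int cols = backing.length == 0 ? 0 : backing[0].length;
            this.length = Math.min(backing.length, cols);
        }

        /** Overwrites the diagonal with the given values; returns this view for chaining. */
        DiagonalView assign(double[] values) {
            if (values.length != length) {
                throw new IllegalArgumentException("size mismatch");
            }
            for (int i = 0; i < length; i++) {
                backing[i][i] = values[i];
            }
            return this;
        }

        /** Applies f to every diagonal element in place; returns this view for chaining. */
        DiagonalView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < length; i++) {
                backing[i][i] = f.applyAsDouble(backing[i][i]);
            }
            return this;
        }

        /** Aggregate: the sum of the diagonal elements. */
        double zSum() {
            double sum = 0;
            for (int i = 0; i < length; i++) {
                sum += backing[i][i];
            }
            return sum;
        }
    }

    static DiagonalView viewDiagonal(double[][] m) {
        return new DiagonalView(m);
    }

    public static void main(String[] args) {
        double[][] m = new double[3][3];
        viewDiagonal(m).assign(new double[] {1, 2, 3});
        System.out.println(m[1][1]); // the write went through the view into m
        System.out.println(viewDiagonal(m).assign(x -> x * 10).zSum());
    }
}
```

Because every `assign` returns the view, calls compose into one-liners in the same spirit as the expression on the slide.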
Assign
• Matrices
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
Views
• Matrices
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
• Vectors
Vector viewPart(int offset, int length);
Examples
• The trace of a matrix
m.viewDiagonal().zSum()
• Random projection
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
• Low rank random matrix
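
To make the trace and random projection one-liners concrete, here is a rough plain-Java analogue on `double[][]` arrays. It is a sketch under stated assumptions, not Mahout code: the class `MatrixExamplesSketch` and its helpers `trace`, `gaussian` and `times` are hypothetical names, and `java.util.Random.nextGaussian()` stands in for the `Normal` sampler used on the slide.

```java
import java.util.Random;

/** Plain-array analogues of the trace and random projection one-liners. */
final class MatrixExamplesSketch {

    /** Trace: the sum of the diagonal entries. */
    static double trace(double[][] m) {
        double sum = 0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    /** A rows x cols matrix of independent standard normal samples. */
    static double[][] gaussian(int rows, int cols, Random rng) {
        double[][] r = new double[rows][cols];
        for (double[] row : r) {
            for (int j = 0; j < cols; j++) {
                row[j] = rng.nextGaussian();
            }
        }
        return r;
    }

    /** Ordinary matrix product a * b. */
    static double[][] times(double[][] a, double[][] b) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("shape mismatch");
        }
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Project 5 rows with 1000 columns each down to 3 columns.
        double[][] m = gaussian(5, 1000, rng);
        double[][] projected = times(m, gaussian(1000, 3, rng));
        System.out.println(projected.length + " x " + projected[0].length);
        System.out.println(trace(new double[][] {{1, 2}, {3, 4}}));
    }
}
```

The projection keeps the number of rows and replaces the 1000 original columns with 3 random linear combinations of them.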
Recommenders
Examples of Recommendations
• Customers buying books (Linden et al.)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s (Veoh)
• Visibility in a map UI (new Google Maps)
Recommendation Basics
• History:
User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1
Recommendation Basics
• History as matrix:
     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1
• (t1, t3) cooccur 2 times,
• (t1, t4) once,
• (t2, t4) once,
• (t3, t4) once
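
The step from the history list to the matrix and then to the cooccurrence counts can be sketched in a few lines of plain Java. This is an illustration only, assuming 1-based user and thing ids as in the history table; `CooccurrenceSketch`, `historyMatrix` and `cooccurrence` are hypothetical names, not Mahout APIs. The cooccurrence matrix is computed as the history matrix transposed times itself.

```java
/** Builds the user-by-thing history matrix and its cooccurrence counts. */
final class CooccurrenceSketch {

    /** Binary user-by-item matrix from 1-based (user, thing) pairs. */
    static int[][] historyMatrix(int[][] pairs, int users, int items) {
        int[][] a = new int[users][items];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    /** Cooccurrence counts: c[i][j] is the number of users who did both item i and item j. */
    static int[][] cooccurrence(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] userRow : a) {
            for (int i = 0; i < items; i++) {
                for (int j = 0; j < items; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] history = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(history, 3, 4));
        System.out.println("(t1, t3): " + c[0][2]);
        System.out.println("(t1, t4): " + c[0][3]);
        System.out.println("(t2, t4): " + c[1][3]);
        System.out.println("(t3, t4): " + c[2][3]);
    }
}
```

The diagonal of the result simply counts how many users did each thing, so only the off-diagonal entries carry cooccurrence information.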
A Quick Simplification
• Users who do h: $A h$
• Also do r: $A^T (A h)$
• User-centric recommendations: $r = A^T (A h)$
• Item-centric recommendations: $r = (A^T A) h$
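
The two groupings of the same product give the same vector, which a small check can confirm. The sketch below is plain Java with hypothetical helper names (`RecommendationOrderSketch`, `userCentric`, `itemCentric`); it assumes A is the 3 x 4 user-by-thing history matrix from the earlier slide and h is a 0/1 vector over things. The user-centric form first scores users against h and then sums their histories; the item-centric form first builds the cooccurrence matrix and then applies it to h.

```java
import java.util.Arrays;

/** Checks that the user-centric and item-centric groupings agree. */
final class RecommendationOrderSketch {

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    /** Matrix-vector product m * v. */
    static double[] times(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                r[i] += m[i][j] * v[j];
            }
        }
        return r;
    }

    /** Matrix-matrix product a * b. */
    static double[][] times(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    /** First find users who did h, then sum what those users did. */
    static double[] userCentric(double[][] a, double[] h) {
        return times(transpose(a), times(a, h));
    }

    /** First build the item-item cooccurrence matrix, then apply it to h. */
    static double[] itemCentric(double[][] a, double[] h) {
        return times(times(transpose(a), a), h);
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};
        double[] h = {1, 0, 0, 0}; // a user who has only done t1
        System.out.println(Arrays.toString(userCentric(a, h)));
        System.out.println(Arrays.toString(itemCentric(a, h)));
    }
}
```

The item-centric grouping is attractive in practice because the cooccurrence matrix can be computed ahead of time and does not depend on the user being served.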
Clustering
An Example
Diagonalized Cluster Proximity
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, for the threaded version, the non-threaded version, and perfect scaling]
Lots of Clusters Are Fine
Decompositions
Low Rank Matrix
• Or should we see it differently?
• Are these scaled up versions of all the same column?
 1   2    5
 2   4   10
10  20   50
20  40  100
Low Rank Matrix
• Matrix multiplication is designed to make this easy
• We can see weighted column patterns, or weighted row patterns
• All the same mathematically
$$
\underbrace{\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}}_{\text{column pattern (or weights)}}
\times
\underbrace{\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}}_{\text{weights (or row pattern)}}
$$
Low Rank Matrix
• What about here?
• This is like before, but there is one exceptional value
 1    2    5
 2    4   10
10  100   50
20   40  100
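Low Rank Matrix
• OK … add in a simple fixer upper
$$
\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}
\times
\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}
+
\underbrace{\begin{bmatrix} 0 \\ 0 \\ 10 \\ 0 \end{bmatrix}}_{\text{which row}}
\times
\underbrace{\begin{bmatrix} 0 & 8 & 0 \end{bmatrix}}_{\text{exception pattern}}
$$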
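
The arithmetic of the fixer upper can be checked directly: the sum of two outer products reproduces the matrix with the exceptional value. This is a small plain-Java sketch with hypothetical helper names (`LowRankSketch`, `outer`, `plus`), not part of any library.

```java
import java.util.Arrays;

/** Rebuilds the example matrix as a rank-1 pattern plus a rank-1 correction. */
final class LowRankSketch {

    /** Outer product: result[i][j] = column[i] * row[j]. */
    static double[][] outer(double[] column, double[] row) {
        double[][] m = new double[column.length][row.length];
        for (int i = 0; i < column.length; i++) {
            for (int j = 0; j < row.length; j++) {
                m[i][j] = column[i] * row[j];
            }
        }
        return m;
    }

    /** Element-wise sum of two matrices of the same shape. */
    static double[][] plus(double[][] a, double[][] b) {
        double[][] c = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // Main pattern: every row is a multiple of (1, 2, 5).
        double[][] pattern = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        // Correction: only the third row is touched, and only its second column.
        double[][] fix = outer(new double[] {0, 0, 10, 0}, new double[] {0, 8, 0});
        for (double[] row : plus(pattern, fix)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The only entry the correction changes is row 3, column 2, where 20 + 10 × 8 = 100.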
Random Projection
SVD Projection
Classifiers
Mahout Classifiers
• Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
• SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
• Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
The stuff that isn’t there
What Mahout Isn’t
• Mahout isn’t R, isn’t SAS
• It doesn’t aim to do everything
• It aims to scale a few problems of practical interest
• The stuff that isn’t there is a feature, not a defect
• Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org
• Slides and such
http://www.slideshare.net/tdunning
• Hash tags: #mapr #apachemahout
