SlideShare a Scribd company logo
1 of 18
Super-Fast Clustering
                Report from MapR workshop
©MapR Technologies - Confidential   1
     Contact:
       –   tdunning@maprtech.com
       –   @ted_dunning


     Twitter for this talk
       –   #mapr_uk


     Slides and such:
       –   http://info.mapr.com/ted-uk-05-2012



©MapR Technologies - Confidential          2
Company Background
      MapR provides the industry’s best Hadoop Distribution
       –    Combines the best of the Hadoop community
            contributions with significant internally
            financed infrastructure development
      Background of Team
       – Deep management bench with extensive analytic,
         storage, virtualization, and open source experience
       – Google, EMC, Cisco, VMWare, Network Appliance, IBM,
         Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
      Proven
       –    MapR used across industries (Financial Services, Media,
            Telcom, Health Care, Internet Services, Government)
       –    Strategic OEM relationship with EMC and Cisco
       –    Over 1,000 installs

    ©MapR Technologies - Confidential           3
We Also Do …

     Open source development
       –   Zookeeper
       –   Hadoop
       –   Mahout
       –   Stuff
     Partner workshops
       –   Machine learning
       –   Information architecture
       –   Cluster design




©MapR Technologies - Confidential     4
We Also Do …

     Open source development
       –   Zookeeper
       –   Hadoop
       –   Mahout
       –   Stuff
     Partner workshops
       –   Machine learning
       –   Information architecture
       –   Cluster design




©MapR Technologies - Confidential     5
The Problem

     A certain bank
       –   had lots of customers
       –   had lots of prospective customers
       –   had a non-trivial number of fraudulent customers
       –   had a non-trivial number of fraudulent merchants


     They also
       –   collected data
       –   built models
       –   collected more data
       –   built more models


©MapR Technologies - Confidential           6
But …

     These models were arduous to build


     And hard to test


     So people suggested something simpler


     Like k-nearest neighbor




©MapR Technologies - Confidential   7
What’s that?

     Find the k nearest training examples
     Use the average value of the target variable from them


     This is easy … but hard
       –   easy because it is so conceptually simple and you don’t have knobs to turn
           or models to build
       –   hard because of the stunning amount of math
       –   also hard because we need top 50,000 results


     Initial prototype was massively too slow
       –   3K queries x 200K examples takes hours
       –   needed 20M x 25M in the same time

©MapR Technologies - Confidential            8
What We Did

     Mechanism for extending Mahout Vectors
       –   DelegatingVector, WeightedVector, Centroid


     Searcher interface
       –   ProjectionSearch, KmeansSearch, LshSearch, Brute


     Super-fast clustering
       –   Kmeans, StreamingKmeans




©MapR Technologies - Confidential           9
Projection Search




©MapR Technologies - Confidential   10
K-means Search




©MapR Technologies - Confidential   11
But These Require k-means!

     Need a new k-means algorithm to get speed


     Streaming k-means is
       –   One pass (through the original data)
       –   Very fast (20 us per data point with threads)
       –   Very parallelizable




©MapR Technologies - Confidential             12
How It Works

     For each point
       –   Find approximately nearest centroid (distance = d)
       –   If d > threshold, new centroid
       –   Else possibly new cluster
       –   Else add to nearest centroid
     If centroids > K ~ C log N
       –   Recursively cluster centroids with higher threshold


     Result is large set of centroids
       –   these provide approximation of original distribution
       –   we can cluster centroids to get a close approximation of clustering original
       –   or we can just use the result directly

©MapR Technologies - Confidential             13
Parallel Speedup?

                                        200


                                                                                     Non- threaded




                                                                  ✓
                                        100
                                                  2
                 Tim e per point (μs)




                                                                                      Threaded version
                                                          3

                                        50
                                                                    4
                                        40                                              6
                                                                             5

                                                                                              8
                                        30
                                                                                                  10        14
                                                                                                       12
                                        20                    Perfect Scaling                                    16




                                        10
                                              1       2       3         4        5                                    20


                                                                  Threads
©MapR Technologies - Confidential                                       14
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!




©MapR Technologies - Confidential       15
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!


                       (Let’s not use k-means searcher, though)




©MapR Technologies - Confidential                16
     Contact:
       –   tdunning@maprtech.com
       –   @ted_dunning


     Slides and such:
       –   http://info.mapr.com/ted-uk-05-2012




©MapR Technologies - Confidential          17
Thank You




©MapR Technologies - Confidential   18

More Related Content

Viewers also liked

Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxData Science London
 
"Telling Stories about People and Their Influence" Ferenc Huszár @ds_ldn
"Telling Stories about People and Their Influence"  Ferenc Huszár @ds_ldn"Telling Stories about People and Their Influence"  Ferenc Huszár @ds_ldn
"Telling Stories about People and Their Influence" Ferenc Huszár @ds_ldnData Science London
 
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJE
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJELA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJE
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJEsuperveromena
 
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds..."Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...Data Science London
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Data Science London
 
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.Energy for One World
 
Utilización y selección
Utilización y selecciónUtilización y selección
Utilización y selecciónweticsblog
 
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...Jasper Moelker
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
PetersonSierra_Interface_SustainabilityPoster
PetersonSierra_Interface_SustainabilityPosterPetersonSierra_Interface_SustainabilityPoster
PetersonSierra_Interface_SustainabilityPosterSierra Peterson
 
Lines and angles ( Class 6-7 )
Lines and angles ( Class 6-7 )Lines and angles ( Class 6-7 )
Lines and angles ( Class 6-7 )romilkharia
 

Viewers also liked (20)

M2 actividad2 10
M2 actividad2 10M2 actividad2 10
M2 actividad2 10
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
 
ortodoncia
ortodonciaortodoncia
ortodoncia
 
Fico success story
Fico success storyFico success story
Fico success story
 
"Telling Stories about People and Their Influence" Ferenc Huszár @ds_ldn
"Telling Stories about People and Their Influence"  Ferenc Huszár @ds_ldn"Telling Stories about People and Their Influence"  Ferenc Huszár @ds_ldn
"Telling Stories about People and Their Influence" Ferenc Huszár @ds_ldn
 
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJE
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJELA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJE
LA TECNOLOGIA COMO APOYO EN EL PROCESO ENSEÑANZA-APRENDIZAJE
 
Research at last.fm
Research at last.fmResearch at last.fm
Research at last.fm
 
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds..."Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...
"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds...
 
Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?Autonomous Discovery: The New Interface?
Autonomous Discovery: The New Interface?
 
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.
EFOW/ LERCPA: Leaders of Energy without Borders. On our way to 100% renewables.
 
Utilización y selección
Utilización y selecciónUtilización y selección
Utilización y selección
 
Gradle_ToursJUG
Gradle_ToursJUGGradle_ToursJUG
Gradle_ToursJUG
 
Mais cultura
Mais culturaMais cultura
Mais cultura
 
Sarwat Jahan_cv
Sarwat Jahan_cvSarwat Jahan_cv
Sarwat Jahan_cv
 
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...
(Inter)national Facades: International Facade Master: WHY? by Arie Bergsma (2...
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
PetersonSierra_Interface_SustainabilityPoster
PetersonSierra_Interface_SustainabilityPosterPetersonSierra_Interface_SustainabilityPoster
PetersonSierra_Interface_SustainabilityPoster
 
Lines and angles ( Class 6-7 )
Lines and angles ( Class 6-7 )Lines and angles ( Class 6-7 )
Lines and angles ( Class 6-7 )
 
aparatologia ortodontica
aparatologia ortodontica aparatologia ortodontica
aparatologia ortodontica
 
JENKINS_BreizhJUG_20111003
JENKINS_BreizhJUG_20111003JENKINS_BreizhJUG_20111003
JENKINS_BreizhJUG_20111003
 

Similar to Super-Fast Clustering Report in MapR

London data science
London data scienceLondon data science
London data scienceTed Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012Ted Dunning
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportMapR Technologies
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on SparkMathieu Dumoulin
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Ed Kohlwey
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningJohn Mulhall
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 

Similar to Super-Fast Clustering Report in MapR (20)

London data science
London data scienceLondon data science
London data science
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21
 
Hcj 2013-01-21
Hcj 2013-01-21Hcj 2013-01-21
Hcj 2013-01-21
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 

More from Data Science London

Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingData Science London
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Data Science London
 
Big Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresBig Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresData Science London
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysisData Science London
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayData Science London
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignData Science London
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureData Science London
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryData Science London
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutData Science London
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutData Science London
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersData Science London
 
Understanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer BehaviourUnderstanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer BehaviourData Science London
 
How to develop a data scientist – What business has requested v02
How to develop a data scientist – What business has requested v02How to develop a data scientist – What business has requested v02
How to develop a data scientist – What business has requested v02Data Science London
 
A Data Scientist in the Music Industry
A Data Scientist in the Music IndustryA Data Scientist in the Music Industry
A Data Scientist in the Music IndustryData Science London
 

More from Data Science London (20)

Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Nowcasting Business Performance
Nowcasting Business PerformanceNowcasting Business Performance
Nowcasting Business Performance
 
Numpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunchingNumpy, the Python foundation for number crunching
Numpy, the Python foundation for number crunching
 
Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)Python pandas workshop iPython notebook (163 pages)
Python pandas workshop iPython notebook (163 pages)
 
Big Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresBig Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least Squares
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
 
Survival Analysis of Web Users
Survival Analysis of Web UsersSurvival Analysis of Web Users
Survival Analysis of Web Users
 
ACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, TodayACM RecSys 2012: Recommender Systems, Today
ACM RecSys 2012: Recommender Systems, Today
 
Beyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems DesignBeyond Accuracy: Goal-Driven Recommender Systems Design
Beyond Accuracy: Goal-Driven Recommender Systems Design
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Data Science for Live Music
Data Science for Live MusicData Science for Live Music
Data Science for Live Music
 
Music and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music IndustryMusic and Data: Adding Up the UK Music Industry
Music and Data: Adding Up the UK Music Industry
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 
Understanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer BehaviourUnderstanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer Behaviour
 
Bootstrapping Data Science
Bootstrapping Data ScienceBootstrapping Data Science
Bootstrapping Data Science
 
How to develop a data scientist – What business has requested v02
How to develop a data scientist – What business has requested v02How to develop a data scientist – What business has requested v02
How to develop a data scientist – What business has requested v02
 
A Data Scientist in the Music Industry
A Data Scientist in the Music IndustryA Data Scientist in the Music Industry
A Data Scientist in the Music Industry
 

Recently uploaded

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Super-Fast Clustering Report in MapR

  • 1. Super-Fast Clustering Report from MapR workshop ©MapR Technologies - Confidential 1
  • 2. Contact: – tdunning@maprtech.com – @ted_dunning  Twitter for this talk – #mapr_uk  Slides and such: – http://info.mapr.com/ted-uk-05-2012 ©MapR Technologies - Confidential 2
  • 3. Company Background  MapR provides the industry’s best Hadoop Distribution – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development  Background of Team – Deep management bench with extensive analytic, storage, virtualization, and open source experience – Google, EMC, Cisco, VMWare, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel  Proven – MapR used across industries (Financial Services, Media, Telcom, Health Care, Internet Services, Government) – Strategic OEM relationship with EMC and Cisco – Over 1,000 installs ©MapR Technologies - Confidential 3
  • 4. We Also Do …  Open source development – Zookeeper – Hadoop – Mahout – Stuff  Partner workshops – Machine learning – Information architecture – Cluster design ©MapR Technologies - Confidential 4
  • 5. We Also Do …  Open source development – Zookeeper – Hadoop – Mahout – Stuff  Partner workshops – Machine learning – Information architecture – Cluster design ©MapR Technologies - Confidential 5
  • 6. The Problem  A certain bank – had lots of customers – had lots of prospective customers – had a non-trivial number of fraudulent customers – had a non-trivial number of fraudulent merchants  They also – collected data – built models – collected more data – built more models ©MapR Technologies - Confidential 6
  • 7. But …  These models were arduous to build  And hard to test  So people suggested something simpler  Like k-nearest neighbor ©MapR Technologies - Confidential 7
  • 8. What’s that?  Find the k nearest training examples  Use the average value of the target variable from them  This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results  Initial prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time ©MapR Technologies - Confidential 8
  • 9. What We Did  Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid  Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute  Super-fast clustering – Kmeans, StreamingKmeans ©MapR Technologies - Confidential 9
  • 12. But These Require k-means!  Need a new k-means algorithm to get speed  Streaming k-means is – One pass (through the original data) – Very fast (20 us per data point with threads) – Very parallelizable ©MapR Technologies - Confidential 12
  • 13. How It Works  For each point – Find approximately nearest centroid (distance = d) – If d > threshold, new centroid – Else possibly new cluster – Else add to nearest centroid  If centroids > K ~ C log N – Recursively cluster centroids with higher threshold  Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly ©MapR Technologies - Confidential 13
  • 14. Parallel Speedup? 200 Non- threaded ✓ 100 2 Tim e per point (μs) Threaded version 3 50 4 40 6 5 8 30 10 14 12 20 Perfect Scaling 16 10 1 2 3 4 5 20 Threads ©MapR Technologies - Confidential 14
  • 15. Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! ©MapR Technologies - Confidential 15
  • 16. Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though) ©MapR Technologies - Confidential 16
  • 17. Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such: – http://info.mapr.com/ted-uk-05-2012 ©MapR Technologies - Confidential 17
  • 18. Thank You ©MapR Technologies - Confidential 18

Editor's Notes

  1. MapR combines the best of the open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop.MapR’s team has a deep bench of enterprise software experience with proven success across storage, networking, virtualization, analytics, and open source technologies.Our CEO has driven multiple companies to successful outcomes in the analytic, storage, and virtualization spaces.Our CTO and co-founder M.C. Srivas was most recently at Google in BigTable. He understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks which came out of stealth with the fastest NAS storage on the market and was acquired quickly by NetAppThe team includes experience with enterprise storage at Cisco, VmWare, IBM and EMC. Our VP of Engineering led emerging technologies and a 600 person for EMC’s NAS engineering team. We also have experience in Business Intelligence and Analytic companies and open source committers in Hadoop, Zookeeper and Mahout including PMC members.MapR is proven technology with installs by leading Hadoop installations across industries and OEM by EMC and Cisco.