SlideShare a Scribd company logo
1 of 17
Social @tsunamiide tsunami.io Earthquake Enterprises
K-Means Clustering
Social @tsunamiide tsunami.io Earthquake Enterprises
 Two parts
 Simple Clustering Algorithm
 Using ML with Large Datasets
Social @tsunamiide tsunami.io Earthquake Enterprises
 Very elegant
 Scales to large datasets
 It is simple and easy to learn
 Works with unsupervised data
Social @tsunamiide tsunami.io Earthquake Enterprises
 Competitive Analysis
 Compare products from Company A with
Company B by clustering them into groups
 Semi-Structured Search Engine
 Show different results to different users
depending on how they are classified
▪ What Google thinks about you:
https://www.google.com/settings/ads/onweb/
Social @tsunamiide tsunami.io Earthquake Enterprises
 Multivariate data set
 (i.e. each row is a float[])
 Classification is
labeled
 Not linearly
separable
 Popular for testing
ML Algorithms
Social @tsunamiide tsunami.io Earthquake Enterprises
 Iris data in (n-1)! charts
Social @tsunamiide tsunami.io Earthquake Enterprises
 E.g. Classifying text documents
 Charting no longer makes sense
 Need to rely derived metrics
Social @tsunamiide tsunami.io Earthquake Enterprises
 Euclidian
 Manhattan Distance
 Angle between
 Correlation
Social @tsunamiide tsunami.io Earthquake Enterprises
 Many ML algorithms rely on the features
to be in the range of [-1,1] or [0,1]
 K-means will work with any range but for
many distance functions larger ranges will
crowed out smaller ones
 We can use this to emphasize some
factors over others
Social @tsunamiide tsunami.io Earthquake Enterprises
 select the number of clusters (K)
 select a seed for each cluster (centroid)
 Do {
 assign each item in the training set to the
closest centroid
 update each centroid to the mean of the
assigned items }
 while (any of the centroids have moved)
Social @tsunamiide tsunami.io Earthquake Enterprises
 Number of clusters are known (3)
 Pick seed by randomly selecting 3 rows
from dataset
 We intentionally pick 3 close together for
demonstration
Social @tsunamiide tsunami.io Earthquake Enterprises
 Number of clusters
 Distance functions
 Feature scaling
 Datasets
 E.g. included abalone and breast cancer
datasets
Social @tsunamiide tsunami.io Earthquake Enterprises
Social @tsunamiide tsunami.io Earthquake Enterprises
 Faster algorithms
with more data will
often beat slower
algorithms with less
data.
Social @tsunamiide tsunami.io Earthquake Enterprises
 Some algorithms do not scale well
 e.g. Layered NN
 can take many days (not suited to tutorials)
 ML algorithms need to be run repeatedly
 Tuning hyper-parameters
 K-fold cross validation
 Feature discovery
Social @tsunamiide tsunami.io Earthquake Enterprises
 Random Forest
 Built in, popular and effective
 Leave one out
 My preferred
Social @tsunamiide tsunami.io Earthquake Enterprises
 Use a fast algorithm for factor discovery
 Use a slow algorithm for final solution
 Many competitions are won on starting the
slow algorithm as soon as possible

More Related Content

What's hot

Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresYoav chernobroda
 
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based ModelProbabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based ModelNick Malleson
 
Partial Binomial Distribution method for Generation capacity outage using Spr...
Partial Binomial Distribution method for Generation capacity outage using Spr...Partial Binomial Distribution method for Generation capacity outage using Spr...
Partial Binomial Distribution method for Generation capacity outage using Spr...vivatechijri
 
Integrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsIntegrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsPooyan Jamshidi
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsAlbert Bifet
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsATMOSPHERE .
 
Meteoio Introduction given by Mathias Bavey in Bozen
Meteoio Introduction given by Mathias Bavey in BozenMeteoio Introduction given by Mathias Bavey in Bozen
Meteoio Introduction given by Mathias Bavey in BozenRiccardo Rigon
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
"Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ..."Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ...Dataconomy Media
 

What's hot (15)

Application of web ontology to harvest estimation of rice in thailand
Application of web ontology to harvest estimation of rice in thailandApplication of web ontology to harvest estimation of rice in thailand
Application of web ontology to harvest estimation of rice in thailand
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Topology in ArcGIS
Topology in ArcGISTopology in ArcGIS
Topology in ArcGIS
 
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based ModelProbabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
 
Partial Binomial Distribution method for Generation capacity outage using Spr...
Partial Binomial Distribution method for Generation capacity outage using Spr...Partial Binomial Distribution method for Generation capacity outage using Spr...
Partial Binomial Distribution method for Generation capacity outage using Spr...
 
Integrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsIntegrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of Robots
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
 
Gray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark ApplicationsGray-Box Models for Performance Assessment of Spark Applications
Gray-Box Models for Performance Assessment of Spark Applications
 
Meteoio Introduction given by Mathias Bavey in Bozen
Meteoio Introduction given by Mathias Bavey in BozenMeteoio Introduction given by Mathias Bavey in Bozen
Meteoio Introduction given by Mathias Bavey in Bozen
 
Terra Populus Overview Poster
Terra Populus Overview PosterTerra Populus Overview Poster
Terra Populus Overview Poster
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Internship
InternshipInternship
Internship
 
"Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ..."Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ...
 
Chap02 01
Chap02 01Chap02 01
Chap02 01
 

Similar to K-Means Clustering: A Simple Unsupervised ML Algorithm for Large Datasets

Data science technology overview
Data science technology overviewData science technology overview
Data science technology overviewSoojung Hong
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisBetacowork
 
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation EnginesAdam Rogers
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisTeradata Aster
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345AkhilSinghal21
 
XL-MINER:Introduction To Xl Miner
XL-MINER:Introduction To Xl MinerXL-MINER:Introduction To Xl Miner
XL-MINER:Introduction To Xl Minerxlminer content
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...Alex Liu
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarinn5712036
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningDatamining Tools
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6Rod Soto
 
Data Mining with SQL Server 2005
Data Mining with SQL Server 2005Data Mining with SQL Server 2005
Data Mining with SQL Server 2005Dean Willson
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Datawaheed751
 

Similar to K-Means Clustering: A Simple Unsupervised ML Algorithm for Large Datasets (20)

Data science technology overview
Data science technology overviewData science technology overview
Data science technology overview
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation Engines
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and Analysis
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345
 
kdd2015
kdd2015kdd2015
kdd2015
 
Introduction To XL-Miner
Introduction To XL-MinerIntroduction To XL-Miner
Introduction To XL-Miner
 
XL-MINER:Introduction To Xl Miner
XL-MINER:Introduction To Xl MinerXL-MINER:Introduction To Xl Miner
XL-MINER:Introduction To Xl Miner
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
Poster
PosterPoster
Poster
 
Data Mining with SQL Server 2005
Data Mining with SQL Server 2005Data Mining with SQL Server 2005
Data Mining with SQL Server 2005
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Data
 

More from Phillip Trelford

How to be a rock star developer
How to be a rock star developerHow to be a rock star developer
How to be a rock star developerPhillip Trelford
 
FSharp eye for the Haskell guy - London 2015
FSharp eye for the Haskell guy - London 2015FSharp eye for the Haskell guy - London 2015
FSharp eye for the Haskell guy - London 2015Phillip Trelford
 
Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015Phillip Trelford
 
F# for C# devs - Copenhagen .Net 2015
F# for C# devs - Copenhagen .Net 2015F# for C# devs - Copenhagen .Net 2015
F# for C# devs - Copenhagen .Net 2015Phillip Trelford
 
Generative Art - Functional Vilnius 2015
Generative Art - Functional Vilnius 2015Generative Art - Functional Vilnius 2015
Generative Art - Functional Vilnius 2015Phillip Trelford
 
24 hours later - FSharp Gotham 2015
24 hours later - FSharp Gotham  201524 hours later - FSharp Gotham  2015
24 hours later - FSharp Gotham 2015Phillip Trelford
 
Building cross platform games with Xamarin - Birmingham 2015
Building cross platform games with Xamarin - Birmingham 2015Building cross platform games with Xamarin - Birmingham 2015
Building cross platform games with Xamarin - Birmingham 2015Phillip Trelford
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Phillip Trelford
 
FSharp On The Desktop - Birmingham FP 2015
FSharp On The Desktop - Birmingham FP 2015FSharp On The Desktop - Birmingham FP 2015
FSharp On The Desktop - Birmingham FP 2015Phillip Trelford
 
Ready, steady, cross platform games - ProgNet 2015
Ready, steady, cross platform games - ProgNet 2015Ready, steady, cross platform games - ProgNet 2015
Ready, steady, cross platform games - ProgNet 2015Phillip Trelford
 
F# for C# devs - NDC Oslo 2015
F# for C# devs - NDC Oslo 2015F# for C# devs - NDC Oslo 2015
F# for C# devs - NDC Oslo 2015Phillip Trelford
 
F# for C# devs - Leeds Sharp 2015
F# for C# devs -  Leeds Sharp 2015F# for C# devs -  Leeds Sharp 2015
F# for C# devs - Leeds Sharp 2015Phillip Trelford
 
Build a compiler in 2hrs - NCrafts Paris 2015
Build a compiler in 2hrs -  NCrafts Paris 2015Build a compiler in 2hrs -  NCrafts Paris 2015
Build a compiler in 2hrs - NCrafts Paris 2015Phillip Trelford
 
24 Hours Later - NCrafts Paris 2015
24 Hours Later - NCrafts Paris 201524 Hours Later - NCrafts Paris 2015
24 Hours Later - NCrafts Paris 2015Phillip Trelford
 
Machine learning from disaster - GL.Net 2015
Machine learning from disaster  - GL.Net 2015Machine learning from disaster  - GL.Net 2015
Machine learning from disaster - GL.Net 2015Phillip Trelford
 
F# for Trading - QuantLabs 2014
F# for Trading -  QuantLabs 2014F# for Trading -  QuantLabs 2014
F# for Trading - QuantLabs 2014Phillip Trelford
 

More from Phillip Trelford (20)

How to be a rock star developer
How to be a rock star developerHow to be a rock star developer
How to be a rock star developer
 
Mobile F#un
Mobile F#unMobile F#un
Mobile F#un
 
F# eXchange Keynote 2016
F# eXchange Keynote 2016F# eXchange Keynote 2016
F# eXchange Keynote 2016
 
FSharp eye for the Haskell guy - London 2015
FSharp eye for the Haskell guy - London 2015FSharp eye for the Haskell guy - London 2015
FSharp eye for the Haskell guy - London 2015
 
Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015
 
F# for C# devs - Copenhagen .Net 2015
F# for C# devs - Copenhagen .Net 2015F# for C# devs - Copenhagen .Net 2015
F# for C# devs - Copenhagen .Net 2015
 
Generative Art - Functional Vilnius 2015
Generative Art - Functional Vilnius 2015Generative Art - Functional Vilnius 2015
Generative Art - Functional Vilnius 2015
 
24 hours later - FSharp Gotham 2015
24 hours later - FSharp Gotham  201524 hours later - FSharp Gotham  2015
24 hours later - FSharp Gotham 2015
 
Building cross platform games with Xamarin - Birmingham 2015
Building cross platform games with Xamarin - Birmingham 2015Building cross platform games with Xamarin - Birmingham 2015
Building cross platform games with Xamarin - Birmingham 2015
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015
 
FSharp On The Desktop - Birmingham FP 2015
FSharp On The Desktop - Birmingham FP 2015FSharp On The Desktop - Birmingham FP 2015
FSharp On The Desktop - Birmingham FP 2015
 
Ready, steady, cross platform games - ProgNet 2015
Ready, steady, cross platform games - ProgNet 2015Ready, steady, cross platform games - ProgNet 2015
Ready, steady, cross platform games - ProgNet 2015
 
F# for C# devs - NDC Oslo 2015
F# for C# devs - NDC Oslo 2015F# for C# devs - NDC Oslo 2015
F# for C# devs - NDC Oslo 2015
 
F# for C# devs - Leeds Sharp 2015
F# for C# devs -  Leeds Sharp 2015F# for C# devs -  Leeds Sharp 2015
F# for C# devs - Leeds Sharp 2015
 
Build a compiler in 2hrs - NCrafts Paris 2015
Build a compiler in 2hrs -  NCrafts Paris 2015Build a compiler in 2hrs -  NCrafts Paris 2015
Build a compiler in 2hrs - NCrafts Paris 2015
 
24 Hours Later - NCrafts Paris 2015
24 Hours Later - NCrafts Paris 201524 Hours Later - NCrafts Paris 2015
24 Hours Later - NCrafts Paris 2015
 
Real World F# - SDD 2015
Real World F# -  SDD 2015Real World F# -  SDD 2015
Real World F# - SDD 2015
 
F# for C# devs - SDD 2015
F# for C# devs - SDD 2015F# for C# devs - SDD 2015
F# for C# devs - SDD 2015
 
Machine learning from disaster - GL.Net 2015
Machine learning from disaster  - GL.Net 2015Machine learning from disaster  - GL.Net 2015
Machine learning from disaster - GL.Net 2015
 
F# for Trading - QuantLabs 2014
F# for Trading -  QuantLabs 2014F# for Trading -  QuantLabs 2014
F# for Trading - QuantLabs 2014
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

K-Means Clustering: A Simple Unsupervised ML Algorithm for Large Datasets

  • 1. Social @tsunamiide tsunami.io Earthquake Enterprises K-Means Clustering
  • 2. Social @tsunamiide tsunami.io Earthquake Enterprises  Two parts  Simple Clustering Algorithm  Using ML with Large Datasets
  • 3. Social @tsunamiide tsunami.io Earthquake Enterprises  Very elegant  Scales to large datasets  It is simple and easy to learn  Works with unsupervised data
  • 4. Social @tsunamiide tsunami.io Earthquake Enterprises  Competitive Analysis  Compare products from Company A with Company B by clustering them into groups  Semi-Structured Search Engine  Show different results to different users depending on how they are classified ▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/
  • 5. Social @tsunamiide tsunami.io Earthquake Enterprises  Multivariate data set  (i.e. each row is a float[])  Classification is labeled  Not linearly separable  Popular for testing ML Algorithms
  • 6. Social @tsunamiide tsunami.io Earthquake Enterprises  Iris data in (n-1)! charts
  • 7. Social @tsunamiide tsunami.io Earthquake Enterprises  E.g. Classifying text documents  Charting no longer makes sense  Need to rely derived metrics
  • 8. Social @tsunamiide tsunami.io Earthquake Enterprises  Euclidian  Manhattan Distance  Angle between  Correlation
  • 9. Social @tsunamiide tsunami.io Earthquake Enterprises  Many ML algorithms rely on the features to be in the range of [-1,1] or [0,1]  K-means will work with any range but for many distance functions larger ranges will crowed out smaller ones  We can use this to emphasize some factors over others
  • 10. Social @tsunamiide tsunami.io Earthquake Enterprises  select the number of clusters (K)  select a seed for each cluster (centroid)  Do {  assign each item in the training set to the closest centroid  update each centroid to the mean of the assigned items }  while (any of the centroids have moved)
  • 11. Social @tsunamiide tsunami.io Earthquake Enterprises  Number of clusters are known (3)  Pick seed by randomly selecting 3 rows from dataset  We intentionally pick 3 close together for demonstration
  • 12. Social @tsunamiide tsunami.io Earthquake Enterprises  Number of clusters  Distance functions  Feature scaling  Datasets  E.g. included abalone and breast cancer datasets
  • 13. Social @tsunamiide tsunami.io Earthquake Enterprises
  • 14. Social @tsunamiide tsunami.io Earthquake Enterprises  Faster algorithms with more data will often beat slower algorithms with less data.
  • 15. Social @tsunamiide tsunami.io Earthquake Enterprises  Some algorithms do not scale well  e.g. Layered NN  can take many days (not suited to tutorials)  ML algorithms need to be run repeatedly  Tuning hyper-parameters  K-fold cross validation  Feature discovery
  • 16. Social @tsunamiide tsunami.io Earthquake Enterprises  Random Forest  Built in, popular and effective  Leave one out  My preferred
  • 17. Social @tsunamiide tsunami.io Earthquake Enterprises  Use a fast algorithm for factor discovery  Use a slow algorithm for final solution  Many competitions are won on starting the slow algorithm as soon as possible