K-Means Clustering: A Simple Unsupervised ML Algorithm for Large Datasets

Phillip Trelford

Introduction to machine learning covering k-means clustering and support vector machines.

Social @tsunamiide tsunami.io Earthquake Enterprises
K-Means Clustering

• Two parts
  • Simple Clustering Algorithm
  • Using ML with Large Datasets

• Very elegant
• Scales to large datasets
• Simple and easy to learn
• Works with unlabeled data (unsupervised learning)

• Competitive Analysis
  • Compare products from Company A with Company B by clustering them into groups
• Semi-Structured Search Engine
  • Show different results to different users depending on how they are classified
    ▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/

• Multivariate data set
  • (i.e. each row is a float[])
• Each row is labeled with its class
• Not linearly separable
• Popular for testing ML algorithms

• Iris data in n(n-1)/2 pairwise charts (6 charts for its 4 features)
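
As a rough illustration of where the chart count comes from, the sketch below lists every unordered pair of features, one scatter plot per pair. The feature names are an assumption (the usual column names of the classic Iris dataset); the slides do not spell them out.

```python
from itertools import combinations

# Assumed feature names: the four measurements of the classic Iris dataset.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Each unordered pair of features corresponds to one 2-D scatter plot.
pairs = list(combinations(features, 2))

for x_axis, y_axis in pairs:
    print(f"{x_axis} vs {y_axis}")

# Number of charts: n * (n - 1) / 2 for n features.
print(len(pairs))
```

The count grows quadratically with the number of features, which is why charting stops being practical for high-dimensional data.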

• E.g. classifying text documents
• Charting no longer makes sense
• Need to rely on derived metrics

• Euclidean distance
• Manhattan distance
• Angle between vectors
• Correlation
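
The four measures listed above could be sketched in plain Python as follows. The function names are illustrative, not taken from the talk, and the correlation measure is written here as one minus the Pearson correlation, which is one common convention and an assumption on my part.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-feature differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def angle_between(a, b):
    # Angle in radians between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cosine)))

def correlation_distance(a, b):
    # 1 - Pearson correlation; assumes neither vector is constant.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(ca, cb))
    denominator = math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(y * y for y in cb))
    return 1.0 - numerator / denominator
```

Euclidean and Manhattan distance depend on magnitude, while the angle and correlation measures only compare direction or shape, so the choice changes which items end up looking "close".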

• Many ML algorithms rely on the features being in the range [-1, 1] or [0, 1]
• K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ranges
• We can use this to emphasize some factors over others
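
A minimal sketch of both ideas, assuming min-max scaling to [0, 1] followed by hand-picked per-feature weights. The helper names, the sample rows, and the weights are hypothetical and only serve the illustration.

```python
def min_max_scale(rows):
    # Rescale every column to [0, 1]; a constant column maps to 0.0.
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def apply_weights(rows, weights):
    # Stretch selected features so they count for more in the distance.
    return [[v * w for v, w in zip(row, weights)] for row in rows]

# Hypothetical data: the second feature has a far larger range than the first.
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]

scaled = min_max_scale(rows)                   # both features now span [0, 1]
weighted = apply_weights(scaled, [2.0, 1.0])   # first feature counts double
```

Without scaling, the second feature would dominate a Euclidean distance; after scaling and weighting, the first feature deliberately carries more influence.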

• Select the number of clusters (K)
• Select a seed for each cluster (centroid)
• do {
    assign each item in the training set to the closest centroid
    update each centroid to the mean of the assigned items
  } while (any of the centroids have moved)
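
The loop above can be sketched in plain Python as shown below. This is an illustration under assumptions, not the code from the talk: it uses Euclidean distance, keeps a centroid in place if no items are assigned to it, and adds a `max_iterations` safety cap that the slide does not mention.

```python
import math

def closest(point, centroids):
    # Index of the centroid nearest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(items, seeds, max_iterations=100):
    centroids = [list(seed) for seed in seeds]
    assignments = []
    for _ in range(max_iterations):
        # Assign each item to the closest centroid.
        assignments = [closest(item, centroids) for item in items]
        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [item for item, a in zip(items, assignments) if a == k]
            new_centroids.append(mean(members) if members else old)
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical toy data with two obvious groups.
items = [[1, 1], [1, 2], [9, 9], [10, 9]]
centroids, assignments = k_means(items, seeds=[[1, 1], [9, 9]])
```

The number of clusters is implied by how many seeds are passed in, which mirrors the first two steps on the slide.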

• Number of clusters is known (3)
• Pick seeds by randomly selecting 3 rows from the dataset
• We intentionally pick 3 close together for demonstration
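
Picking seeds by sampling rows might look like the sketch below. The rows are hypothetical two-feature measurements, and the fixed random seed is only there to make the run repeatable; neither comes from the talk.

```python
import random

def pick_seeds(rows, k, rng):
    # Choose k distinct rows at random and copy them as starting centroids.
    return [list(row) for row in rng.sample(rows, k)]

# Hypothetical rows standing in for the dataset.
rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0], [6.7, 3.1]]

rng = random.Random(42)          # fixed seed so the choice is repeatable
seeds = pick_seeds(rows, 3, rng)
print(seeds)
```

Because the outcome depends on the starting rows, running k-means several times with different random seeds and comparing the results is a common precaution.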

• Number of clusters
• Distance functions
• Feature scaling
• Datasets
  • E.g. included abalone and breast cancer datasets

• Faster algorithms with more data will often beat slower algorithms with less data.

• Some algorithms do not scale well
  • E.g. layered neural networks
  • Can take many days (not suited to tutorials)
• ML algorithms need to be run repeatedly
  • Tuning hyper-parameters
  • K-fold cross validation
  • Feature discovery
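
To make the "run repeatedly" point concrete, here is a minimal sketch of how K-fold cross validation splits a dataset: the model is trained and evaluated once per fold. The function name and the round-robin fold assignment are assumptions for illustration; in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_items, k):
    # Deal item indices into k folds round-robin.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # One train/evaluate run happens per yielded split.
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

With k folds the algorithm runs k times per hyper-parameter setting, so a slow algorithm multiplies quickly. Setting k equal to the number of items gives leave-one-out validation.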

• Random Forest
  • Built in, popular and effective
• Leave one out
  • My preferred

• Use a fast algorithm for factor discovery
• Use a slow algorithm for the final solution
• Many competitions are won by starting the slow algorithm as soon as possible
