Clustering:
A Scikit-Learn Tutorial
Damian Mingle
About Me
• Chief Data Scientist, WPC Healthcare
• Speaker
• Researcher
• Writer
Outline
• What is k-means clustering?
• How does it work?
• When is it appropriate to use it?
• K-means clustering in scikit-learn
• Basic
• Basic with adjustments
Clustering
• It is unsupervised learning: it infers structure that is not obvious
from unlabeled data
• Groups data objects
• Measures distance between data points
• Helps in examining the data
K-means Clustering
• Formally: a method of vector quantization
• Informally: a mapping of a large set of inputs to a smaller, countable set
• Separates data into groups of equal variance
• Makes use of the Euclidean distance metric
K-means Clustering
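Iterative refinement
Three basic steps:
• Step 1: Choose k (how many groups)
• Repeat over:
• Step 2: Assignment (label each data point as part of the nearest group)
• Step 3: Update (move each group's center to the mean of its members)
This process repeats until the assignments stop changing or the iteration limit is reached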
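The three steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, the random-row initialization, and the stopping rule based on how far the centers move are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: k is chosen by the caller; start from k distinct rows of X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2 (assignment): label each point with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): move each center to the mean of its points.
        # An empty group keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers have essentially stopped moving.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```

Because the starting centers are random rows, a different `seed` can lead to a different (possibly worse) local optimum on harder data.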
K-means Clustering
• Assignment
• Update
K-means Clustering
• Advantages
• Large data accepted
• Fast
• Will always converge to a solution
• Disadvantages
• The number of groups must be chosen up front, and a wrong choice gives poor clusters
• You may reach a local optimum, not the global one
K-means Clustering
• When to use
• Normally distributed data
• Large number of samples
• Not too many clusters
• Distance can be measured in a linear fashion
Scikit-Learn
• Python
• Open-source machine learning library
• Very well documented
Scikit-Learn
• model = EstimatorObject()
• Unsupervised:
• model.fit(dataset.data)
• dataset.data is the data itself (samples only, no labels)
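As a concrete instance of the estimator pattern above, here is a small sketch using `KMeans`. The six-point array `X` is a made-up stand-in for `dataset.data`, and the parameter values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for dataset.data: six 2-D points in two obvious groups.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])

# model = EstimatorObject()
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Unsupervised: fit takes only the data, no labels.
model.fit(X)

print(model.labels_)           # cluster index assigned to each row of X
print(model.cluster_centers_)  # one center per cluster
print(model.predict([[0.0, 0.0], [12.0, 3.0]]))  # assign new points
```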
K-means in Scikit-Learn
• Very fast
• Data scientist: picks the number of clusters
• scikit-learn KMeans: finds the initial centroids of the groups
Dataset
Name: Household Power Consumption by Individuals
Number of attributes: 9
Number of instances: 2,075,259
Missing values: Yes
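A hedged sketch of preparing such a dataset for k-means, which cannot handle missing values. The column names, the `;` separator, and the `?` missing-value marker are assumptions based on the UCI "Individual household electric power consumption" file, and the three rows below are made up so the example runs without downloading anything.

```python
import io
import pandas as pd

# Three made-up rows in the assumed file layout; '?' marks a missing value.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000\n"
)

# Treat '?' as NaN so the numeric columns are parsed as floats.
df = pd.read_csv(sample, sep=";", na_values="?")

# Keep the numeric attributes and drop rows with any missing value.
features = df.drop(columns=["Date", "Time"]).dropna()
X = features.to_numpy(dtype=float)
print(df.shape, X.shape)
```

Dropping incomplete rows is only one option; imputing the gaps is another, and which is better depends on the data.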
K-means in Scikit-Learn
K-means in Scikit-Learn
• Results
K-means Parameters
• n_clusters
• Number of clusters to form
• max_iter
• Maximum number of iterations of the algorithm in a single run
• n_init
• Number of times the k-means algorithm is run with different initial centroids; the best run is kept
• init
• Initialization method
• precompute_distances
• True, False, or 'auto' (let the library decide); removed in recent scikit-learn releases
• tol
• Tolerance on the change in cluster centers used to declare convergence
• n_jobs
• Number of CPUs to engage when running the algorithm; removed from KMeans in recent scikit-learn releases
• random_state
• Seed for the random number generator used to initialize the centroids
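A short sketch of passing these parameters to `KMeans`. The values are arbitrary and the data comes from `make_blobs` purely for illustration. `precompute_distances` and `n_jobs` are left out because recent scikit-learn releases no longer accept them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 4 centers (illustration only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

model = KMeans(
    n_clusters=4,       # number of clusters to form
    init="k-means++",   # initialization method
    n_init=10,          # number of runs with different initial centroids
    max_iter=300,       # iteration cap for a single run
    tol=1e-4,           # convergence tolerance
    random_state=42,    # seed, so the result is reproducible
)
model.fit(X)

# n_iter_ is the number of iterations of the best run;
# inertia_ is its within-cluster sum of squares.
print(model.n_iter_, model.inertia_)
```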
n_clusters: choosing k
• View the variance
• cdist is the distance between sets of observations
• pdist is the pairwise distances between observations in
the same set
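A small sketch of the two SciPy functions named above, on made-up points chosen so the distances are easy to check by hand (3-4-5 triangles).

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centers = np.array([[0.0, 0.0], [6.0, 8.0]])

# cdist: distances between two sets, one row per point, one column per center.
d_points_to_centers = cdist(points, centers, metric="euclidean")
print(d_points_to_centers)
# [[ 0. 10.]
#  [ 5.  5.]
#  [10.  0.]]

# pdist: pairwise distances within one set, in condensed form,
# ordered as the pairs (0,1), (0,2), (1,2).
d_within = pdist(points, metric="euclidean")
print(d_within)
# [ 5. 10.  5.]
```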
n_clusters: choosing k
Step 1: Determine your k range
Step 2: Fit the k-means model for each n_clusters = k
Step 3: Pull out the cluster centers for each model
n_clusters: choosing k
Step 4: Calculate Euclidean distance from each point to each cluster center
Step 5: Compute the total within-cluster sum of squares
Step 6: Compute the total sum of squares
Step 7: Take the difference to get the between-cluster sum of squares
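Steps 1 to 7 can be sketched as follows. This is one possible reading of the procedure, not necessarily the code behind the slides: the synthetic data, the k range of 1 to 9, and the variable names are assumptions. The total sum of squares uses the identity that the sum of squared pairwise distances divided by n equals the sum of squared distances to the overall mean.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real dataset.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Step 1: determine the k range.
k_range = range(1, 10)
# Step 2: fit one k-means model per value of k.
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in k_range]
# Step 3: pull out the cluster centers of each model.
centers = [m.cluster_centers_ for m in models]

# Step 4: Euclidean distance from each point to each cluster center.
dists = [cdist(X, c, "euclidean") for c in centers]
# Step 5: within-cluster sum of squares (squared distance to nearest center).
wcss = np.array([np.sum(d.min(axis=1) ** 2) for d in dists])
# Step 6: total sum of squares.
tss = np.sum(pdist(X) ** 2) / X.shape[0]
# Step 7: between-cluster sum of squares is the difference.
bss = tss - wcss

# Fraction of variance explained for each k; plot this against k
# and look for the "elbow" where the curve flattens.
explained = bss / tss
for k, frac in zip(k_range, explained):
    print(k, round(float(frac), 3))
```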
n_clusters: choosing k
• Graphing the variance
n_clusters: choosing k
n_clusters = 4 n_clusters = 7
n_clusters: choosing k
• n_clusters = 8 (default)
init
Methods and their meaning:
• k-means++
• Selects initial cluster centers in a way that speeds up
convergence
• random
• Chooses k rows at random as the initial centroids
• ndarray
• Array giving the initial centers, with shape (n_clusters, n_features)
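The three `init` options can be compared with a sketch like the one below. The data is synthetic, `n_init=1` is used so each model reflects a single initialization, and taking the first three rows as `start` is a hypothetical stand-in for hand-picked centers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

# init="k-means++": spread-out starting centers.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X)

# init="random": k rows picked at random.
km_rand = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)

# init=ndarray: explicit centers with shape (n_clusters, n_features).
start = X[:3]  # hypothetical hand-picked starting centers
km_arr = KMeans(n_clusters=3, init=start, n_init=1).fit(X)

# Compare iterations used and final within-cluster sum of squares.
for name, km in [("k-means++", km_pp), ("random", km_rand), ("ndarray", km_arr)]:
    print(name, km.n_iter_, round(km.inertia_, 2))
```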
K-means (8)
n_clusters = 8, init = k-means++ | n_clusters = 8, init = random
K-means (7)
n_clusters = 7, init = k-means++ | n_clusters = 7, init = random
Comparing Results: Silhouette Score
• Silhouette coefficient
• Not black and white, lots of gray
• a: average distance between an observation and the other observations
in its own cluster
• b: average distance between an observation and all points
in the nearest neighboring cluster
• Coefficient for one observation: (b - a) / max(a, b)
• Silhouette score in scikit-learn
• Average silhouette coefficient over all data observations
• The closer to 1, the better the fit
• Computation time grows quickly with larger datasets
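A minimal sketch of comparing values of k by silhouette score. The data is synthetic and the range of k is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):  # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all observations, in [-1, 1].
    scores[k] = silhouette_score(X, labels, metric="euclidean")

for k, s in scores.items():
    print(k, round(float(s), 3))
```

On a dataset with millions of rows, the `sample_size` argument of `silhouette_score` scores a random subset instead of every observation, which keeps the pairwise-distance cost manageable.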
Result Comparison: Silhouette Score
What Do the Results Say?
• Data patterns may in fact exist
• Similar observations can be grouped
• We need additional discovery
A Few Hacks
• Clustering is a great way to explore your data and
develop intuition
• Too many features make the clusters hard to understand
• Use dimensionality reduction
• Use clustering with other methods
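One way to combine dimensionality reduction with clustering is a scikit-learn pipeline. The sketch below is an assumption about how the two could be chained (standardize, project to two principal components, then cluster); the synthetic 10-feature data and the parameter values are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 10 features and 3 underlying groups.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),                 # put features on a common scale
    PCA(n_components=2),              # reduce 10 features to 2
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# The 2-D projection that k-means actually saw, handy for plotting.
X_2d = pipeline[:-1].transform(X)
print(X_2d.shape, sorted(set(labels.tolist())))
```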
Let’s Connect
• Twitter: @DamianMingle
• LinkedIn: DamianRMingle
• Sign up for Data Science Hacks
