SlideShare a Scribd company logo
1 of 22
Clustering on Database Systems 
Vahid Mirjalili 
Michigan State University
Clustering 
• Partitioning data into groups 
Items in the same group should have higher similarity to each 
other than items from different groups 
• A similarity/dissimilarity measure 
• Examples: 
 Clustering patients in a hospital 
 Genomic clustering 
 Hand-written character recognition 
A. Jain, “Data Clustering: 50 years beyond K-means”
Clustering vs. Classification 
Reinforcement 
learning 
Predictive Modeling Tasks 
Unsupervised 
Learning 
• Classification is supervised 
Supervised 
Learning 
– class labels are provided; 
– learn a classifier to predict class labels of novel/unseen 
data 
• Clustering is unsupervised or semi-supervised; 
– No class label is give 
– Understand the structure underlying your data
Clustering Approaches 
 Probability-based 
– Assuming statistical independence among features 
– Inefficient updating and storing clusters 
 Distance-based 
– Assuming direct access to all data points 
– Hierarchical clustering: O(N2), not giving the best clustering
Distance-Based Clustering Algorithms 
• kmeans and its variants (kmedoids, kernel 
kmeans, fuzzy c-means, …) 
• Density based methods (DBSCAN) 
• Hierarchical methods
Challenges 
• Unknown number of clusters (from 1 to N) 
Input data K=2 K=6 
You always get some 
output as clusters 
Are they really distinct 
clusters? 
A. Jain, “Data Clustering: 50 years beyond K-means”
Challenges 
• Clusters with different shapes, sizes and 
densities 
Shapes: globular shape, linear vs. non-linear 
shapes 
A. Jain, “Data Clustering: 50 years beyond K-means”
Standard K-Means Algorithm 
• Find initial Cluster centroids randomly 
• An iterative algorithm 
1. Assignment step: assign each data point the 
cluster whose mean is closest (smallest distance) 
2. Update step: update the mean (centroid) of each 
cluster 
Distance: squared Euclidean distance 
( , )  
dist x   x  
 
j j  1  
 
Centroid: mean of feature vectors  
 
 
i C 
 
i 
C 
X 
N 
2 
  
1 
d 
j
Standard K-Means Algorithm
Problem in Database-oriented 
Clustering 
• Low memory available compared to size of 
dataset  data doesn’t fit in main memory 
• High I/O 
• Necessary to avoid too many iterations
RKM: An Efficient Disk-based KMeans 
Method 
• Find the initial centroids by 
• Only 3 iterations: 
r d c all      / 
– Assign every L points to nearest centroids; 
– Update the cluster centroids 
• Minor efficiency tricks: 
N L  
– Keep track of LS, SS and Nc for each cluster during 
assignment  update step: 
c c   LS / N
Implementation of RKM: 
storing data matrices 
• D  input dataset 
• Pj 
 cluster j (for j in [1..k]) 
• Mj, Qj, Nj 
 Linear Sum, Squared Sum, cluster 
size 
• Cj, Rj, Wj 
 Centroids, Variances, Weights 
(accessed during update step) 
C  
M / 
N 
j j j 
R  Q / N  
M M / 
N 
 
 
 
j j l 
l k 
j 
t 
j j j j j 
W N N 
1.. 
2 
/
RKM avoids local minima: 
split large clusters 
• Only performed if size of a cluster is less than 
a user-defined threshold 
1. Remove the centroid of the small cluster 
2. Find the largest cluster (largest Wj) 
3. Randomly choose two centroids for the largest 
cluster (using Cj, and Rj) 
4. Reassign the items of small and large clusters
RKM vs. Standard K-means: 
Random Dataset
RKM vs. Standard K-means: 
Initial Cluster Centroids 
K = 3
Cluster assignment: 
Results after one pass over all the data 
Many iterations needed 2 more iterations
RKM: Database design 
• Relational schema for sparse data 
representation: D(pid, inx, value) 
• For other matrices: doing 1 I/O per matrix row 
to minimize I/O 
Matrix access 
E step (assignment step) 
M step (update step) 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Performance Comparison 
• RKM (disk-based) 
• Memory based: 
– Standard K-means 
– Scalable K-means 
  
  
C dist x C 
Quan.error( )  
( , ) 
j k i P 
i j 
j 
1..
Time Complexity of RKM 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Time Complexity 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Conclusion 
• RKM resolve some of the limitations of K-means 
• RKM limits disk access (I/O) 
• Final clustering is achieved with 3 iterations 
• On large datasets RKM outperforms standard K-means 
• Other limitations of K-means clustering still 
remain
Read more … 
General implementation in IPython notebook: 
http://goo.gl/YZScH9 
http://www.vahidmirjalili.com

More Related Content

What's hot

Operating system paging and segmentation
Operating system paging and segmentationOperating system paging and segmentation
Operating system paging and segmentationhamza haseeb
 
Ecg analysis in the cloud
Ecg analysis in the cloudEcg analysis in the cloud
Ecg analysis in the cloudgaurav jain
 
Mac protocols for ad hoc wireless networks
Mac protocols for ad hoc wireless networks Mac protocols for ad hoc wireless networks
Mac protocols for ad hoc wireless networks Divya Tiwari
 
Introduction to Virtualization
Introduction to VirtualizationIntroduction to Virtualization
Introduction to VirtualizationRahul Hada
 
Little o and little omega
Little o and little omegaLittle o and little omega
Little o and little omegaRajesh K Shukla
 
Allocation of Frames & Thrashing
Allocation of Frames & ThrashingAllocation of Frames & Thrashing
Allocation of Frames & Thrashingarifmollick8578
 
Analytical learning
Analytical learningAnalytical learning
Analytical learningswapnac12
 
Introduction to Google App Engine
Introduction to Google App EngineIntroduction to Google App Engine
Introduction to Google App Enginerajdeep
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page tableduvvuru madhuri
 
cloud virtualization technology
 cloud virtualization technology  cloud virtualization technology
cloud virtualization technology Ravindra Dastikop
 
ECG ANALYSIS IN CLOUD COMPUTING
ECG ANALYSIS IN CLOUD COMPUTINGECG ANALYSIS IN CLOUD COMPUTING
ECG ANALYSIS IN CLOUD COMPUTINGJago Corruption Se
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization Hafiz faiz
 

What's hot (20)

Heap Management
Heap ManagementHeap Management
Heap Management
 
Operating system paging and segmentation
Operating system paging and segmentationOperating system paging and segmentation
Operating system paging and segmentation
 
Ecg analysis in the cloud
Ecg analysis in the cloudEcg analysis in the cloud
Ecg analysis in the cloud
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Mac protocols for ad hoc wireless networks
Mac protocols for ad hoc wireless networks Mac protocols for ad hoc wireless networks
Mac protocols for ad hoc wireless networks
 
Introduction to Virtualization
Introduction to VirtualizationIntroduction to Virtualization
Introduction to Virtualization
 
Little o and little omega
Little o and little omegaLittle o and little omega
Little o and little omega
 
Google App Engine ppt
Google App Engine  pptGoogle App Engine  ppt
Google App Engine ppt
 
Recognition-of-tokens
Recognition-of-tokensRecognition-of-tokens
Recognition-of-tokens
 
Allocation of Frames & Thrashing
Allocation of Frames & ThrashingAllocation of Frames & Thrashing
Allocation of Frames & Thrashing
 
Analytical learning
Analytical learningAnalytical learning
Analytical learning
 
Introduction to Google App Engine
Introduction to Google App EngineIntroduction to Google App Engine
Introduction to Google App Engine
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page table
 
cloud virtualization technology
 cloud virtualization technology  cloud virtualization technology
cloud virtualization technology
 
ECG ANALYSIS IN CLOUD COMPUTING
ECG ANALYSIS IN CLOUD COMPUTINGECG ANALYSIS IN CLOUD COMPUTING
ECG ANALYSIS IN CLOUD COMPUTING
 
Regular Grammar
Regular GrammarRegular Grammar
Regular Grammar
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization
 
Lec 7 query processing
Lec 7 query processingLec 7 query processing
Lec 7 query processing
 

Viewers also liked

Absolute and Relative Clustering
Absolute and Relative ClusteringAbsolute and Relative Clustering
Absolute and Relative ClusteringToshihiro Kamishima
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptbutest
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
Malware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveMalware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveChong-Kuan Chen
 
Statistical classification: A review on some techniques
Statistical classification: A review on some techniquesStatistical classification: A review on some techniques
Statistical classification: A review on some techniquesGiorgos Bamparopoulos
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Longhow Lam
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101Renato Jovic
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture modelsVu Pham
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarAndrew Morgan
 
Machine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksMachine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksAnna Förster
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine LearningAmazon Web Services
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognitionHưng Đặng
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 

Viewers also liked (15)

Absolute and Relative Clustering
Absolute and Relative ClusteringAbsolute and Relative Clustering
Absolute and Relative Clustering
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.ppt
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
Malware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveMalware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning Perspective
 
Statistical classification: A review on some techniques
Statistical classification: A review on some techniquesStatistical classification: A review on some techniques
Statistical classification: A review on some techniques
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture models
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinar
 
Machine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksMachine Learning for Body Sensor Networks
Machine Learning for Body Sensor Networks
 
Introduction to pattern recognition
Introduction to pattern recognitionIntroduction to pattern recognition
Introduction to pattern recognition
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognition
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to Clustering on database systems rkm

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptxNANDHINIS900805
 

Similar to Clustering on database systems rkm (20)

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
 

Recently uploaded

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Clustering on database systems rkm

  • 1. Clustering on Database Systems Vahid Mirjalili Michigan State University
  • 2. Clustering • Partitioning data into groups Items in the same group should have higher similarity to each other than items from different groups • A similarity/dissimilarity measure • Examples:  Clustering patients in a hospital  Genomic clustering  Hand-written character recognition A. Jain, “Data Clustering: 50 years beyond K-means”
  • 3. Clustering vs. Classification Reinforcement learning Predictive Modeling Tasks Unsupervised Learning • Classification is supervised Supervised Learning – class labels are provided; – learn a classifier to predict class labels of novel/unseen data • Clustering is unsupervised or semi-supervised; – No class label is give – Understand the structure underlying your data
  • 4. Clustering Approaches  Probability-based – Assuming statistical independence among features – Inefficient updating and storing clusters  Distance-based – Assuming direct access to all data points – Hierarchical clustering: O(N2), not giving the best clustering
  • 5. Distance-Based Clustering Algorithms • kmeans and its variants (kmedoids, kernel kmeans, fuzzy c-means, …) • Density based methods (DBSCAN) • Hierarchical methods
  • 6. Challenges • Unknown number of clusters (from 1 to N) Input data K=2 K=6 You always get some output as clusters Are they really distinct clusters? A. Jain, “Data Clustering: 50 years beyond K-means”
  • 7. Challenges • Clusters with different shapes, sizes and densities Shapes: globular shape, linear vs. non-linear shapes A. Jain, “Data Clustering: 50 years beyond K-means”
  • 8. Standard K-Means Algorithm • Find initial Cluster centroids randomly • An iterative algorithm 1. Assignment step: assign each data point the cluster whose mean is closest (smallest distance) 2. Update step: update the mean (centroid) of each cluster Distance: squared Euclidean distance ( , )  dist x   x   j j  1   Centroid: mean of feature vectors    i C  i C X N 2   1 d j
  • 10. Problem in Database-oriented Clustering • Low memory available compared to size of dataset  data doesn’t fit in main memory • High I/O • Necessary to avoid too many iterations
  • 11. RKM: An Efficient Disk-based KMeans Method • Find the initial centroids by • Only 3 iterations: r d c all      / – Assign every L points to nearest centroids; – Update the cluster centroids • Minor efficiency tricks: N L  – Keep track of LS, SS and Nc for each cluster during assignment  update step: c c   LS / N
  • 12. Implementation of RKM: storing data matrices • D  input dataset • Pj  cluster j (for j in [1..k]) • Mj, Qj, Nj  Linear Sum, Squared Sum, cluster size • Cj, Rj, Wj  Centroids, Variances, Weights (accessed during update step) C  M / N j j j R  Q / N  M M / N    j j l l k j t j j j j j W N N 1.. 2 /
  • 13. RKM avoids local minima: split large clusters • Only performed if size of a cluster is less than a user-defined threshold 1. Remove the centroid of the small cluster 2. Find the largest cluster (largest Wj) 3. Randomly choose two centroids for the largest cluster (using Cj, and Rj) 4. Reassign the items of small and large clusters
  • 14. RKM vs. Standard K-means: Random Dataset
  • 15. RKM vs. Standard K-means: Initial Cluster Centroids K = 3
  • 16. Cluster assignment: Results after one pass over all the data Many iterations needed 2 more iterations
  • 17. RKM: Database design • Relational schema for sparse data representation: D(pid, inx, value) • For other matrices: doing 1 I/O per matrix row to minimize I/O Matrix access E step (assignment step) M step (update step) Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 18. Performance Comparison • RKM (disk-based) • Memory based: – Standard K-means – Scalable K-means     C dist x C Quan.error( )  ( , ) j k i P i j j 1..
  • 19. Time Complexity of RKM Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 20. Time Complexity Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 21. Conclusion • RKM resolve some of the limitations of K-means • RKM limits disk access (I/O) • Final clustering is achieved with 3 iterations • On large datasets RKM outperforms standard K-means • Other limitations of K-means clustering still remain
  • 22. Read more … General implementation in IPython notebook: http://goo.gl/YZScH9 http://www.vahidmirjalili.com