1. Presented by
Manasi C. Kadam
Sharmishtha P. Alwekar
Ganesh H. Satpute
Deepak D. Ambegaonkar
Rajesh V. Dulhani
Under the guidance of
Prof. G. A. Patil
Mr. Varad Meru
3. Introduction
Maintaining large volumes of data is a tedious task
Data falls into two types
1. Structured
2. Unstructured
4. Introduction to Data Analysis
Data analysis is the process of extracting information from data
Two types
1. Exploratory or descriptive
2. Confirmatory or inferential
5. Clustering
(a.k.a. unsupervised learning)
The goal is to discover the natural grouping(s) among objects
Given n objects, find K groups based on a measure of "similarity"
Organizes data into clusters such that there is
• high intra-cluster similarity
• low inter-cluster similarity
An ideal cluster is a set of points that is compact and isolated
Ex. the K-means algorithm, K-medoids, etc. (a minimal sketch follows)
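A minimal K-means sketch in Python (not from the original slides; the data and parameter names are illustrative):

import random

def kmeans(points, k, iters=100):
    # Initialize centroids by sampling k distinct data points.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster's mean.
        new_centroids = []
        for j in range(k):
            if clusters[j]:
                dims = zip(*clusters[j])
                new_centroids.append(tuple(sum(d) / len(clusters[j]) for d in dims))
            else:
                new_centroids.append(centroids[j])  # keep centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Usage:
# points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
# centroids, clusters = kmeans(points, k=2)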
13. Parameters for K-means
The most critical choice is K
Typically the algorithm is run for various values of K and the most appropriate output is selected (see the sketch below)
Different initializations can lead to different outputs
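A sketch of that selection loop, assuming the kmeans function and points list from the earlier sketch; the "elbow" where the within-cluster error stops dropping sharply is a common pick for K:

def sse(centroids, clusters):
    # Within-cluster sum of squared distances to the assigned centroid.
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cluster in zip(centroids, clusters)
               for p in cluster)

# Because different initializations give different outputs, each K
# could also be run several times and the best restart kept.
for k in range(1, 6):
    centroids, clusters = kmeans(points, k)
    print(k, sse(centroids, clusters))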
14. Canopy Clustering
Traditional clustering algorithms work well when a dataset has any one of these properties:
• a large number of clusters
• high feature dimensionality
• a large number of data points
When a dataset has all three properties at once, computation becomes expensive
This necessitates a new technique: canopy clustering
15. Canopy Clustering (contd.)
Performs clustering in two stages
1. A rough and quick stage
2. A rigorous stage
16. Canopy Clustering (contd.)
Rough and quick stage
• uses an extremely inexpensive distance metric
• divides the data into overlapping subsets called "canopies"
Rigorous stage
• uses a rigorous and expensive distance metric
• clustering is applied only within each canopy
(a sketch of the rough stage follows)
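A minimal sketch of the rough stage in Python (illustrative; assumes loose/tight thresholds T1 > T2, with plain squared Euclidean distance standing in for the cheap metric):

def cheap_distance(p, q):
    # Stand-in for an extremely inexpensive metric (e.g. one computed
    # on only a few features); squared Euclidean here for illustration.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def canopies(points, t1, t2):
    # Requires t1 > t2. Points within t2 of a center stop being
    # candidate centers; points within t1 join the canopy, so the
    # resulting subsets overlap.
    assert t1 > t2
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # pick an arbitrary remaining point
        canopy = [center]
        still_candidates = []
        for p in remaining:
            d = cheap_distance(center, p)
            if d < t1:
                canopy.append(p)
            if d >= t2:
                still_candidates.append(p)
        remaining = still_candidates
        result.append(canopy)
    return result

In the rigorous stage, the expensive distance metric is then computed only between points and cluster centers that share a canopy.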
25. Complexity
The complexity of K-means is O(nk), where n is the number of objects and k is the number of centroids
Canopy-based K-means changes this to O(nkf²/c), where
c is the number of canopies
f is the average number of canopies that each data point falls into
Since f is very small and c is comparatively big, the complexity is reduced (worked example below)
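For illustration (numbers not from the slides): with c = 100 canopies and f = 2 canopies per point on average, the factor f²/c = 4/100 leaves roughly 1/25 of the original nk distance computations.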
26. Conclusion
Implemented the K-means algorithm
Verified the results in Mathematica and R
Implemented canopy clustering
Verified the results in Excel
27. Future Enhancements
Learning Hadoop and MapReduce
Parallelizing K-means with MapReduce and comparing the implementations
Running all the K-means variants on a standard dataset
28. References
Anil K. Jain, "Data Clustering: 50 Years Beyond K-Means"
Andrew McCallum et al., "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching"