2. Introduction
Definition: Clustering techniques apply when, rather than predicting the class, we just want the instances to be divided into natural groups.
Problem statement: Given the desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function.
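As a concrete (hypothetical) instance of this problem statement, the sketch below brute-forces the partition of a tiny dataset that minimizes the within-cluster sum of squared distances, one common choice of distance-based measurement function; the dataset and K are made up for illustration.

```python
from itertools import product

# Hypothetical tiny dataset and desired number of clusters K.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
K = 2

def cost(assignment):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for k in range(K):
        members = [p for p, a in zip(points, assignment) if a == k]
        if not members:
            continue
        centroid = [sum(p[j] for p in members) / len(members) for j in range(2)]
        total += sum(sum((pj - cj) ** 2 for pj, cj in zip(p, centroid))
                     for p in members)
    return total

# Brute force over all K^N assignments, feasible only for tiny N,
# which is exactly why scalable algorithms such as BIRCH exist.
best = min(product(range(K), repeat=len(points)), key=cost)
```

Here the optimal partition groups the two points near the origin together and the two points near (5, 5) together.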
3. Algorithm: BIRCH
The data is clustered using the BIRCH algorithm. BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources.
4. Why BIRCH?
It takes into consideration the fact that it might not be possible to hold the whole dataset in memory, as many other algorithms require.
It minimizes the time required for I/O by producing results on the basis of a single scan of the data.
It also gives an option to handle outliers.
5. Background
Given N d-dimensional data points $X_i$, for $i = 1, \ldots, N$:
Centroid: $X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$
Radius: $R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$
Diameter: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
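These three quantities can be checked numerically. The sketch below computes the centroid, radius, and diameter of a small hypothetical dataset (the four corners of a square) directly from the definitions above:

```python
import math

# Hypothetical 2-D dataset: the corners of a square with side 2.
X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
N = len(X)
d = len(X[0])

# Centroid X0 = sum(Xi) / N, computed per dimension.
X0 = tuple(sum(x[k] for x in X) / N for k in range(d))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Radius R: root mean squared distance from points to the centroid.
R = math.sqrt(sum(sq_dist(x, X0) for x in X) / N)

# Diameter D: root of the average pairwise squared distance over all
# ordered pairs i != j (the i == j terms contribute zero, so summing
# over all pairs is harmless).
D = math.sqrt(sum(sq_dist(a, b) for a in X for b in X) / (N * (N - 1)))
```

For this square, the centroid is (1, 1), the radius is sqrt(2), and the diameter is sqrt(16/3).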
6. Background
We also define five distance metrics between clusters:
D0: Euclidean distance (between centroids)
D1: Manhattan distance (between centroids)
D2: average inter-cluster distance
D3: average intra-cluster distance
D4: variance increase distance
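To make the first three concrete, the sketch below evaluates D0, D1, and D2 for two small hypothetical clusters; D0 and D1 are taken between the cluster centroids, and D2 averages squared distances over all cross-cluster pairs.

```python
import math

# Two hypothetical disjoint clusters, given as point lists.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(4.0, 3.0), (6.0, 3.0)]

def centroid(pts):
    """Per-dimension mean of a list of points."""
    return tuple(sum(p[k] for p in pts) / len(pts) for k in range(len(pts[0])))

cA, cB = centroid(A), centroid(B)

# D0: Euclidean distance between the two centroids.
D0 = math.sqrt(sum((a - b) ** 2 for a, b in zip(cA, cB)))

# D1: Manhattan distance between the two centroids.
D1 = sum(abs(a - b) for a, b in zip(cA, cB))

# D2: average inter-cluster distance, the root of the mean squared
# distance over all cross-cluster point pairs.
D2 = math.sqrt(sum(sum((ai - bi) ** 2 for ai, bi in zip(a, b))
                   for a in A for b in B) / (len(A) * len(B)))
```

With these clusters the centroids are (1, 0) and (5, 3), so D0 = 5 and D1 = 7.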
7. Clustering Feature
A Clustering Feature is a triple summarizing the information we maintain about a cluster: CF = (N, LS, SS), where
N is the number of data points in the cluster,
LS is the linear sum of the data points,
SS is the square sum of the data points.
Additivity theorem: if CF1 and CF2 are the clustering features of two disjoint clusters, then the merged cluster is represented by CF1 + CF2 (componentwise addition).
We can calculate D0, D1, ..., D4 using clustering features alone.
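A minimal sketch of a clustering feature with the additivity property (the class and method names are my own, chosen for illustration). Note how the centroid and radius of a cluster fall out of (N, LS, SS) without revisiting the raw points:

```python
import math

class CF:
    """Clustering Feature (N, LS, SS) for a set of d-dimensional points."""

    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_point(cls, p):
        # A single point is a cluster with N=1, LS=p, SS=||p||^2.
        return cls(1, list(p), sum(x * x for x in p))

    def __add__(self, other):
        # Additivity theorem: the CF of two merged disjoint clusters
        # is the componentwise sum of their CFs.
        return CF(self.N + other.N,
                  [a + b for a, b in zip(self.LS, other.LS)],
                  self.SS + other.SS)

    def centroid(self):
        return [x / self.N for x in self.LS]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, obtained by expanding the definition
        # of R from the Background slide.
        c = self.centroid()
        return math.sqrt(max(self.SS / self.N - sum(x * x for x in c), 0.0))

# Build the CF of a small cluster incrementally, one point at a time.
cf = CF.from_point((0.0, 0.0))
for p in [(2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf = cf + CF.from_point(p)
```

This incremental, constant-size summary is what lets BIRCH cluster in a single scan without keeping the raw points in memory.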
8. CF Trees
A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T.
Each non-leaf node contains at most B entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its ith child node and $CF_i$ is the clustering feature of the subcluster represented by that child.
A leaf node contains at most L entries, each of the form $[CF_i]$. It also has two pointers, prev and next, which are used to chain all leaf nodes together.
The tree size is a function of T: the larger T is, the smaller the tree.
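The threshold T governs when a new point is absorbed into an existing leaf entry versus starting a new one. Below is a deliberately simplified, single-level sketch of that leaf logic: a flat list of leaf entries rather than a height-balanced tree, so B, node splits, and the prev/next chain are omitted.

```python
import math

T = 1.5  # threshold: an entry may absorb a point only if its radius stays <= T

entries = []  # each leaf entry is a CF triple [N, LS, SS]

def radius(N, LS, SS):
    """Radius from a CF triple: R^2 = SS/N - ||LS/N||^2."""
    c = [x / N for x in LS]
    return math.sqrt(max(SS / N - sum(x * x for x in c), 0.0))

def insert(p):
    ss = sum(x * x for x in p)
    # Find the entry whose centroid is closest to p.
    best, best_d = None, float("inf")
    for e in entries:
        c = [x / e[0] for x in e[1]]
        dist = math.sqrt(sum((ci - pi) ** 2 for ci, pi in zip(c, p)))
        if dist < best_d:
            best, best_d = e, dist
    if best is not None:
        # Tentatively merge p into the closest entry (additivity),
        # keeping the merge only if the radius stays within T.
        N = best[0] + 1
        LS = [a + b for a, b in zip(best[1], p)]
        SS = best[2] + ss
        if radius(N, LS, SS) <= T:
            best[0], best[1], best[2] = N, LS, SS
            return
    # Otherwise the point starts a new leaf entry of its own.
    entries.append([1, list(p), ss])

for p in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]:
    insert(p)
```

With these four points, the two near the origin collapse into one entry and the two near (10, 10) into another; a smaller T would instead keep all four as separate entries, illustrating why larger T yields a smaller tree.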