Department Of Computer Engineering, SIESGST
SIES GRADUATE SCHOOL OF TECHNOLOGY
NERUL, NAVI MUMBAI
DEPARTMENT OF COMPUTER ENGG
SEM: VI    BRANCH: CE
DATA WAREHOUSING & MINING
LIST OF PROGRAMS:
1. Build & edit Cube
2. Design Storage and Process the Cube
3. K-Nearest Neighbors (KNN) Algorithm
4. K-Means Algorithm
5. Naïve Bayesian Classifier
6. Decision Tree
7. Nearest Neighbors Clustering Algorithm
8. Agglomerative Clustering Algorithm
9. DBSCAN Clustering Algorithm
10. Apriori Algorithm
PROGRAM NO. 1: Build & Edit Cube
Aim: To build and edit Cube
Theory:
Build a Cube
A cube is a multidimensional structure of data. Cubes are defined by a set of dimensions and
measures.
Modeling data multidimensionally facilitates online business analysis and query performance.
Analysis Manager allows you to turn data stored in relational databases into meaningful, easy-to-
navigate business information by creating a data cube.
The most common way of managing relational data for multidimensional use is with a star
schema. A star schema consists of a single fact table and multiple dimension tables linked to the
fact table.
Scenario:
You are a database administrator working for the FoodMart Corporation. FoodMart is a large
grocery store chain with sales in the United States, Mexico, and Canada. The marketing
department wants to analyze all of the sales by products and customers that were made during
the 1998 calendar year. Using data that is stored in the company's data warehouse, you will build
a multidimensional data structure (a cube) to enable fast response times when the marketing
analysts query the database.
We will build a cube that will be used for sales analysis.
How to open the Cube Wizard
In the Analysis Manager tree pane, under the Tutorial database, right-click the Cubes
folder, point to New Cube, and then click Wizard.
How to add measures to the cube
Measures are the quantitative values in the database that you want to analyze. Commonly-used
measures are sales, cost, and budget data. Measures are analyzed against the different dimension
categories of a cube.
1. In the Welcome step of the Cube Wizard, click Next.
2. In the Select a fact table from a data source step, expand the Tutorial data source, and
then click sales_fact_1998.
3. You can view the data in the sales_fact_1998 table by clicking Browse data. After you
finish browsing data, close the Browse data window, and then click Next.
4. To define the measures for your cube, under Fact table numeric columns, double-click
store_sales. Repeat this procedure for the store_cost and unit_sales columns, and then
click Next.
How to build your Time dimension
1. In the Select the dimensions for your cube step of the wizard, click New Dimension.
This calls the Dimension Wizard.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click time_by_day. You can view the data
contained in the time_by_day table by clicking Browse Data. When you are finished
viewing the time_by_day table, click Next.
5. In the Select the dimension type step, select Time dimension, and then click Next.
6. Next, you will define the levels for your dimension. In the Create the time dimension
levels step, click Select time levels, click Year, Quarter, Month, and then click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Time for the name of your new dimension.
9. Click Finish to return to the Cube Wizard.
10. In the Cube Wizard, you should now see the Time dimension in the Cube dimensions list.
How to build your Product dimension
1. Click New Dimension again. In the Welcome to the Dimension Wizard step, click
Next.
2. In the Choose how you want to create the dimension step, select Snowflake Schema:
Multiple, related dimension tables, and then click Next.
3. In the Select the dimension tables step, double-click product and product_class to add
them to Selected tables. Click Next.
4. The two tables you selected in the previous step and the existing join between them are
displayed in the Create and edit joins step of the Dimension Wizard. Click Next.
5. To define the levels for your dimension, under Available columns, double-click the
product_category, product_subcategory, and brand_name columns, in that order.
After you double-click each column, its name appears under Dimension levels. Click
Next after you have selected all three columns.
6. In the Specify the member key columns step, click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Product in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
9. You should see the Product dimension in the Cube dimensions list.
How to build your Customer dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click Customer, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the
Country, State_Province, City, and lname columns, in that order. After you double-
click each column, its name appears under Dimension levels. After you have selected all
four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Customer in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Customer dimension in the Cube dimensions
list.
How to build your Store dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click Store, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the
store_country, store_state, store_city, and store_name columns, in that order. After
you double-click each column, its name will appear under Dimension levels. After you
have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Store in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Store dimension in the Cube dimensions list.
How to finish building your cube
1. In the Cube Wizard, click Next.
2. Click Yes when prompted by the Fact Table Row Count message.
3. In the last step of the Cube Wizard, name your cube Sales, and then click Finish.
4. The wizard closes and then launches Cube Editor, which contains the cube you just
created. By clicking on the blue or yellow title bars, arrange the tables so that they match
the following illustration.
Edit a Cube
You can make changes to an existing cube by using Cube Editor.
You may want to browse a cube's data and examine or edit its structure. In addition, Cube
Editor allows you to perform other procedures (these are described in SQL Server Books
Online).
Scenario:
You realize that you need to add another level of information to the cube, so that you can analyze
customers based on their demographic information.
How to edit your cube in Cube Editor
You can use two methods to get to Cube Editor:
In the Analysis Manager tree pane, right-click an existing cube, and then click Edit.
-or-
Create a new cube using Cube Editor directly. This method is not recommended unless
you are an advanced user.
If you are continuing from the previous section, you should already be in Cube Editor.
In the schema pane of Cube Editor, you can see the fact table (with yellow title bar) and the
joined dimension tables (blue title bars). In the Cube Editor tree pane, you can preview the
structure of your cube in a hierarchical tree. You can edit the properties of the cube by clicking
the Properties button at the bottom of the left pane.
How to add a dimension to an existing cube
At this point, you decide you need a new dimension to provide data on product promotions. You
can easily build this dimension in Cube Editor.
1. In Cube Editor, on the Insert menu, click Tables.
2. In the Select table dialog box, click the promotion table, click Add, and then
click Close.
3. To define the new dimension, double-click the promotion_name column in the
promotion table.
4. In the Map the Column dialog box, select Dimension, and then click OK.
5. Select the Promotion Name dimension in the tree view.
6. On the Edit menu, click Rename.
7. Type Promotion, and then press ENTER.
8. Save your changes.
9. Close Cube Editor. When prompted to design the storage, click No. You will
design storage in a later section.
Conclusion: Thus, the cube was successfully built and edited.
PROGRAM NO. 2: Design Storage and Process the Cube
Aim: To design storage and process the cube
Theory:
You can design storage options for the data and aggregations in your cube. Before you can use or
browse the data in your cubes, you must process them.
You can choose from three storage modes: multidimensional OLAP (MOLAP), relational
OLAP (ROLAP), and hybrid OLAP (HOLAP).
Microsoft® SQL Server™ 2000 Analysis Services allows you to set up aggregations.
Aggregations are precalculated summaries of data that greatly improve the efficiency and
response time of queries.
When you process a cube, the aggregations designed for the cube are calculated and the cube is
loaded with the calculated aggregations and data.
For more information, see SQL Server Books Online.
Scenario:
Now that you have designed the structure of the Sales cube, you need to choose the storage mode
it will use and designate the amount of precalculated values to store. After this is done, the cube
needs to be populated with data.
In this section you will select MOLAP for your storage mode, create the aggregation design for
the Sales cube, and then process the cube. Processing the Sales cube loads data from the ODBC
source and calculates the summary values as defined in the aggregation design.
How to design storage by using the Storage Design Wizard
1. In the Analysis Manager tree pane, expand the Cubes folder, right-click the Sales cube,
and then click Design Storage.
2. In the Welcome step, click Next.
3. Select MOLAP as your data storage type, and then click Next.
4. Under Set Aggregation Options, click Performance gain reaches. In the box, enter 40
to indicate the percentage.
You are instructing Analysis Services to give a performance boost of up to 40 percent,
regardless of how much disk space this requires. Administrators can use this tuning
ability to balance the need for query performance against the disk space required to store
aggregation data.
5. Click Start.
6. You can watch the Performance vs. Size graph in the right side of the wizard while
Analysis Services designs the aggregations. Here you can see how increasing
performance gain requires additional disk space utilization. When the process of
designing aggregations is complete, click Next.
7. Under What do you want to do?, select Process now, and then click Finish.
Note: Processing the aggregations may take some time.
8. In the window that appears, you can watch your cube while it is being processed. When
processing is complete, a message appears confirming that the processing was completed
successfully.
9. Click Close to return to the Analysis Manager tree pane.
Browse Cube Data
Using Cube Browser, you can look at data in different ways: You can filter the amount of
dimension data that is visible, you can drill down to see greater detail, and you can drill up to see
less detail.
Scenario:
Now that the Sales cube is processed, data is available for analysis.
In this section, you will use Cube Browser to slice and dice through the sales data.
How to view cube data using Cube Browser
1. In the Analysis Manager tree pane, right-click the Sales cube, and then click Browse
Data.
2. Cube Browser appears, displaying a grid made up of one dimension and the measures of
your cube. The additional four dimensions appear at the top of the browser.
How to replace a dimension in the grid
1. To replace one dimension in the grid with another, drag the dimension from the top box
and drop it directly on top of the column you want to exchange it with. Make sure the
pointer appears with a double-ended arrow during this process.
2. Using this drag and drop technique, select the Product dimension button and drag it to
the grid, dropping it directly on top of Measures. The Product and Measures dimensions
will switch positions in Cube Browser.
How to filter your data by time
1. Click the arrow next to the Time dimension.
2. Expand All Time and 1998, and then click Quarter 1. The data in the grid is filtered to
reflect figures for only that one quarter.
How to drill down
1. Switch the Product and Customer dimensions using the drag and drop technique. Click
Product and drag it on top of Country.
2. Double-click the cell in your grid that contains Baking Goods. The cube expands to
include the subcategory column.
Use the above techniques to move dimensions to and from the grid. This will help you
understand how Analysis Manager puts information about complex data relationships at
your fingertips.
3. When you are finished, click Close to close Cube Browser.
Conclusion: Thus, we have successfully designed storage and processed the cube.
PROGRAM NO. 3: K-Nearest Neighbors (KNN) Algorithm
Aim: To implement KNN algorithm in Java
Theory:
KNN is a non-parametric method. In pattern recognition, the k-nearest neighbor
algorithm (KNN) is a method for classifying objects based on closest training examples in the
feature space. KNN is a type of instance-based learning, or lazy learning where the function is
only approximated locally and all computation is deferred until classification. The k-nearest
neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is
classified by a majority vote of its neighbors, with the object being assigned to the class most
common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then
the object is simply assigned to the class of its nearest neighbor. In the classification phase, k is a
user-defined constant. Usually Euclidean distance is used as the distance metric.
Consider a two-class problem where each sample consists of two measurements (x, y).
For a given query point q, assign the class of the nearest neighbour. Compute the k nearest
neighbors and assign the class by majority vote.
(Figure: classification of the query point for K=1 and for K=3.)
For classification, compute the confidence for each class as Ci /K,
(where Ci is the number of patterns among the K nearest patterns belonging to class i.)
The classification for the input pattern is the class with the highest confidence.
Advantages: No training is required, and a confidence level can be obtained.
Disadvantages: Classification accuracy is low if a complex decision-region boundary exists, and large
storage is required.
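The classification phase described above can be sketched in Java. This is a minimal illustration, not the lab's actual program: the tiny two-class dataset, the class name KNNDemo, and the character labels are all assumptions made for the example.

```java
import java.util.*;

public class KNNDemo {
    // Euclidean distance between two 2-D samples
    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Classify query q by majority vote among the k nearest training samples.
    static char classify(double[][] pts, char[] labels, double[] q, int k) {
        Integer[] idx = new Integer[pts.length];
        for (int i = 0; i < pts.length; i++) idx[i] = i;
        // Sort sample indices by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(pts[i], q)));
        Map<Character, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);   // Ci for each class
        char best = labels[idx[0]];
        for (Map.Entry<Character, Integer> e : votes.entrySet())
            if (e.getValue() > votes.get(best)) best = e.getKey();
        return best;   // class with the highest confidence Ci/K
    }

    public static void main(String[] args) {
        // Hypothetical training set: class A near the origin, class B near (5, 5)
        double[][] pts = {{0,0},{1,0},{0,1},{5,5},{6,5},{5,6}};
        char[] labels = {'A','A','A','B','B','B'};
        System.out.println(classify(pts, labels, new double[]{1,1}, 3)); // prints A
    }
}
```

For the query (1, 1) with K=3, all three nearest samples belong to class A, so the confidence is C_A/K = 3/3 and the query is labeled A.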
Conclusion: Thus, KNN is successfully implemented in Java & tested for a training database.
PROGRAM NO. 4: K-means Algorithm
Aim: To implement K means Algorithm in Java
Theory:
Clustering allows for unsupervised learning. That is, the machine/software will learn on its
own, using the data (learning set), and will classify the objects into particular classes.
K-means is a partition clustering approach.
Each cluster is associated with a centroid (center point). Each point is assigned to the cluster with
the closest centroid. The number of clusters, K, must be specified.
Algorithm:
1. Select K points as the initial centroids
2. repeat
3. Form K clusters by assigning all points to the closest centroid
4. Recompute the centroid of each cluster
5. until the centroids don’t change
Initial centroids are often chosen randomly. The centroid is (typically) the mean of the points in
the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge (the centroids eventually stop moving between iterations).
K-means Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three
clusters A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). Initial
cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two
points a=(x1, y1) and b=(x2, y2) is defined as: ρ(a, b) = |x2 – x1| + |y2 – y1| .
Use k-means algorithm to find the three cluster centers after the second iteration.
Solution: First we list all points in the first column of the table below. The initial cluster
centers (means) are (2, 10), (5, 8) and (1, 2), as given. Next, we calculate the
distance from the first point (2, 10) to each of the three means, using the distance function:
point = (2, 10), mean1 = (2, 10)
ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
Iteration 1 (means: (2, 10), (5, 8), (1, 2))

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)        0             5             9           1
A2 (2, 5)         5             6             4           3
A3 (8, 4)        12             7             9           2
A4 (5, 8)         5             0            10           2
A5 (7, 5)        10             5             9           2
A6 (6, 4)        10             5             7           2
A7 (1, 2)         9            10             0           3
A8 (4, 9)         3             2            10           2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)
Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of
all points in each cluster.
For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center
remains the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
That was Iteration 1. Next, we go to Iteration 2, Iteration 3, and so on until the means do not
change anymore. In Iteration 2, we repeat the process from Iteration 1, this time using the new
means we computed.
After the 2nd iteration, the results would be
1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7}
with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
After the 3rd iteration, the results would be
1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7}
with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5).
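The iterations above can be checked with a short Java sketch. This is a minimal implementation assuming the Manhattan distance function and the fixed initial centers from the problem statement; the class and method names are illustrative.

```java
import java.util.Arrays;

public class KMeansManhattan {
    // Manhattan distance, as defined in the problem statement
    static double dist(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // Run a fixed number of k-means iterations and return the centroids.
    static double[][] kmeans(double[][] pts, double[][] centroids, int iterations) {
        int k = centroids.length;
        for (int it = 0; it < iterations; it++) {
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (double[] p : pts) {                    // assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
                sum[best][0] += p[0]; sum[best][1] += p[1]; count[best]++;
            }
            for (int c = 0; c < k; c++)                 // update step
                if (count[c] > 0)
                    centroids[c] = new double[]{sum[c][0] / count[c], sum[c][1] / count[c]};
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts  = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        double[][] init = {{2,10},{5,8},{1,2}};
        // After the second iteration: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
        System.out.println(Arrays.deepToString(kmeans(pts, init, 2)));
    }
}
```

Running two iterations reproduces the centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5) computed by hand above.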
Conclusion: Thus, we have successfully implemented K-means in Java & tested it for a variety of
training databases.
PROGRAM NO. 5: Naïve Bayesian Classifier
Aim: To implement Naïve Bayesian Classifier
Theory:
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is
particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive
Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes classification, consider the following example. The
objects can be classified as either GREEN (light color) or RED (dark color). Our task is to
classify new cases as they arrive, i.e., to decide which class label they belong to, based on the
currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as likely to have membership GREEN rather than
RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities
are based on previous experience, in this case the percentage of GREEN and RED objects, and
often used to predict outcomes before they actually happen.
Thus, we can write:

Prior probability of GREEN = number of GREEN objects / total number of objects
Prior probability of RED = number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities
for class membership are:

Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
Having formulated our prior probability, we are now ready to classify a new object (WHITE
circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X, the more likely that the new cases belong to that particular
color. To measure this likelihood, we draw a circle around X which encompasses a number (to
be chosen a priori) of points irrespective of their class labels. Then we calculate the number of
points in the circle belonging to each class label. From this we calculate the likelihood:

Likelihood of X given GREEN = number of GREEN in the vicinity of X / total number of GREEN
Likelihood of X given RED = number of RED in the vicinity of X / total number of RED

From the illustration, it is clear that the likelihood of X given GREEN is smaller than the
likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones.
Thus:

Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20
Although the prior probabilities indicate that X may belong to GREEN (given that there are
twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class
membership of X is RED (given that there are more RED objects in the vicinity of X than
GREEN). In the Bayesian analysis, the final classification is produced by combining both
sources of information, i.e., the prior and the likelihood, to form a posterior probability using the
so-called Bayes' rule:

Posterior probability of GREEN = prior of GREEN × likelihood of X given GREEN = 4/6 × 1/40 = 1/60
Posterior probability of RED = prior of RED × likelihood of X given RED = 2/6 × 3/20 = 1/20

Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
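The arithmetic above can be verified with a minimal Java sketch. The counts are taken from the example itself (60 objects: 40 GREEN and 20 RED; a circle around X containing 1 GREEN and 3 RED neighbors); the class name is illustrative.

```java
public class NaiveBayesExample {
    // Posterior is proportional to prior times likelihood (Bayes' rule,
    // ignoring the common normalizing constant)
    static double posterior(double prior, double likelihood) {
        return prior * likelihood;
    }

    public static void main(String[] args) {
        double postGreen = posterior(40.0 / 60.0, 1.0 / 40.0); // = 1/60
        double postRed   = posterior(20.0 / 60.0, 3.0 / 20.0); // = 1/20
        // RED wins because 1/20 > 1/60
        System.out.println(postRed > postGreen ? "RED" : "GREEN"); // prints RED
    }
}
```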
Conclusion: Thus, we have successfully implemented the Naïve Bayesian Classifier in Java &
tested it for a variety of training databases.
PROGRAM NO. 6: Decision Tree
Aim: To implement Decision Tree using ID3 algorithm in Java
Theory:
Decision Tree
Decision trees are among the most useful, powerful and popular tools for classification and
prediction, due to their simplicity, accuracy, ease of use and understanding, and speed.
The decision tree approach divides the search space into rectangular regions.
A decision tree represents rules.
Rules can be easily expressed and understood by humans. They can also be used directly in the
database access language SQL, so that records falling into a particular category may be
retrieved.
A decision tree is a tree in which each branch node represents a choice between a number
of alternatives, and each leaf node represents a classification or decision.
For example:
ID3
ID3 stands for Iterative Dichotomiser 3
Invented by J. Ross Quinlan in 1979.
Builds the tree from the top down, with no backtracking.
Information Gain is used to select the most useful attribute for classification.
ID3 is a precursor to the C4.5 Algorithm.
Main aim is to minimize expected number of comparisons.
The basic idea of ID3 algorithm is to construct the decision tree by employing a top down,
greedy search through the given sets to test each attribute at every tree node. In order to select
the attribute that is most useful for classifying a given sets, we use a metric --information gain.
The main ideas behind the ID3 algorithm are:
Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a
possible value of that attribute. A leaf node corresponds to the expected value of the
output attribute when the path from the root node to that leaf node describes the input
attributes.
In a “good” decision tree, each non-leaf node should correspond to the input attribute
which is the most informative (lowest entropy) about the output attribute amongst all the
input attributes not yet considered in the path from the root node to that node.
Entropy is used to determine how informative a particular input attribute is about the
output attribute for a subset of the training data.
ID3 Process
• Take all unused attributes and calculate their entropies.
• Choose the attribute that has the lowest entropy, i.e., where information gain is maximum.
• Make a node containing that attribute.
Entropy: Concept used to quantify information is called Entropy. Entropy measures the
randomness in data.
For example:
A completely homogeneous sample has an entropy of 0: if all values are the same, entropy is zero as
there is no randomness.
An equally divided sample has an entropy of 1: if the values differ, entropy is present as there is
randomness.
Formula of Entropy:

Entropy(S) = − Σ p_i log2(p_i)

where p_i is the proportion of samples in S belonging to class i.
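As a sketch, the entropy of a class distribution can be computed directly in Java (log base 2, with the usual convention that 0·log 0 = 0; the class name is illustrative):

```java
public class EntropyDemo {
    // Entropy(S) = -sum p_i * log2(p_i), with 0*log(0) treated as 0
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs)
            if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new double[]{1.0}));      // homogeneous sample: 0.0
        System.out.println(entropy(new double[]{0.5, 0.5})); // equally divided sample: 1.0
    }
}
```

This matches the two boundary cases stated above: a completely homogeneous sample has entropy 0, and an equally divided two-class sample has entropy 1.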
Conclusion: Thus, Decision Tree using ID3 is successfully implemented in Java & tested for a
training database.
PROGRAM NO. 7: Nearest Neighbor Clustering Algorithm
Aim: To implement Nearest Neighbor Clustering Algorithm in Java
Theory:
Basic Idea:
A new instance
o forms a new cluster
o or is merged into an existing one,
depending on how close it is to the existing clusters.
A threshold T is used to determine whether to merge or to create a new cluster.
The number of clusters k is not required as an input.
Complexity depends on the number of items: in each loop, each item must be compared to
each item already in a cluster (n in the worst case).
Time Complexity: O(n²) & Space Complexity: O(n²)
Example:
Given 5 items with the distance between them
Task: Cluster them using nearest neighbor algorithm: threshold t=1.5
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Item A is put into cluster K1 = {A}.
For item B, dist(A, B) = 1, which is less than the threshold, so B is included in cluster K1:
K1 = {A, B}.
For item C, dist(A, C) = 2 and dist(B, C) = 2, both more than the threshold.
The threshold is not satisfied, so a new cluster is created: K2 = {C}.
For item D, dist(A, D) = 2 and dist(B, D) = 4 are both more than the threshold, but
dist(C, D) = 1 is less than the threshold, so D is included in cluster K2:
K1 = {A, B}, K2 = {C, D}.
For item E, dist(A, E) = 3, dist(B, E) = 3, dist(C, E) = 5 and dist(D, E) = 3 are all more than
the threshold.
The threshold is not satisfied, so a new cluster is created: K3 = {E}.
Final Clustering Output:
K1= {A, B}, K2={C, D}, K3= {E}
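The trace above can be reproduced with a minimal Java sketch: each new item joins the cluster of its nearest already-clustered item when that distance is within the threshold, otherwise it starts a new cluster. The class name is illustrative.

```java
import java.util.*;

public class NearestNeighborClustering {
    // d is the full item-to-item distance matrix; t is the merge threshold.
    // Returns the cluster index assigned to each item, in input order.
    static int[] cluster(double[][] d, double t) {
        int n = d.length;
        int[] label = new int[n];
        int clusters = 0;
        for (int i = 0; i < n; i++) {
            int nearest = -1;                       // nearest already-clustered item
            for (int j = 0; j < i; j++)
                if (nearest == -1 || d[i][j] < d[i][nearest]) nearest = j;
            if (nearest != -1 && d[i][nearest] <= t)
                label[i] = label[nearest];          // merge into existing cluster
            else
                label[i] = clusters++;              // create a new cluster
        }
        return label;
    }

    public static void main(String[] args) {
        // Distance matrix for items A..E from the example above
        double[][] d = {
            {0, 1, 2, 2, 3},
            {1, 0, 2, 4, 3},
            {2, 2, 0, 1, 5},
            {2, 4, 1, 0, 3},
            {3, 3, 5, 3, 0}};
        // Cluster indices per item: A and B share one, C and D another, E its own
        System.out.println(Arrays.toString(cluster(d, 1.5)));
    }
}
```

With t = 1.5 this yields three clusters, matching the hand trace: K1 = {A, B}, K2 = {C, D}, K3 = {E}.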
Conclusion: Thus, we have successfully implemented Nearest Neighbor Clustering in Java & tested
it for a variety of training databases.
PROGRAM NO. 8: Agglomerative Clustering Algorithm
Aim: To implement Agglomerative Clustering Algorithm
Theory:
Agglomerative hierarchical clustering
Data objects are grouped in a bottom-up fashion.
Initially each data object is in its own cluster.
Then merge these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied.
The user can specify a termination condition, such as the desired number of clusters.
The output is a dendrogram, which can be represented as a set of ordered triples <d, k, K>, where d
is the threshold distance, k is the number of clusters, and K is the set of clusters.
Dendrogram:
It is a tree data structure, which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
o Leaf – individual clusters
o Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic
process of hierarchical clustering is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N
clusters, each containing just one item. Let the distances (similarities) between the
clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so
that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-
linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider
the distance between one cluster and another cluster to be equal to the shortest distance from any
member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the
distance between one cluster and another cluster to be equal to the greatest distance from any
member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster
to be equal to the average distance from any member of one cluster to any member of the other
cluster.
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively.
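The merge loop above can be sketched in Java with single linkage. As an assumption for illustration only, the 5-item distance matrix from Program 7 is reused as input; the class name is hypothetical.

```java
import java.util.*;

public class SingleLinkageDemo {
    // Agglomerative clustering with single linkage: repeatedly merge the two
    // closest clusters and record each merge distance (one dendrogram level each).
    static List<Double> mergeDistances(double[][] d) {
        int n = d.length;
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) clusters.add(new HashSet<>(Set.of(i)));
        List<Double> merges = new ArrayList<>();
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double bd = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double min = Double.MAX_VALUE;    // single linkage: shortest
                    for (int a : clusters.get(i))     // item-to-item distance
                        for (int b : clusters.get(j))
                            min = Math.min(min, d[a][b]);
                    if (min < bd) { bd = min; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));   // merge closest pair
            merges.add(bd);
        }
        return merges;
    }

    public static void main(String[] args) {
        // 5-item distance matrix (borrowed from Program 7 for illustration)
        double[][] d = {
            {0, 1, 2, 2, 3},
            {1, 0, 2, 4, 3},
            {2, 2, 0, 1, 5},
            {2, 4, 1, 0, 3},
            {3, 3, 5, 3, 0}};
        System.out.println(mergeDistances(d)); // merge distances: [1.0, 1.0, 2.0, 3.0]
    }
}
```

Swapping the inner `min` for a maximum or an average would give complete-linkage or average-linkage clustering, respectively, without changing the surrounding loop.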
Complexity for Hierarchical Clustering:
Space complexity for hierarchical algorithms is O(n²), because this is the space required for
the adjacency matrix. The space required for the dendrogram is O(kn), which is much less
than O(n²).
Time complexity for hierarchical algorithms is O(kn²), because there is one iteration for
each level in the dendrogram.
Conclusion: Thus, we have successfully implemented the Agglomerative Clustering Algorithm in
Java & tested it for a variety of training databases.
PROGRAM NO. 9: DBSCAN Clustering Algorithm
Aim: To implement Density Based Spatial Clustering of Application with Noise Algorithm
Theory:
Major features
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Used to create clusters of minimum size and density.
Density is defined as minimum no. of points within a certain distance of each other.
Two global parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
Core Object: an object with at least MinPts objects within its Eps-neighbourhood
Border Object: an object that lies on the border of a cluster
Basic Concepts: ε-neighborhood & core objects
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects then
the object is called a core object
Example: ε = 1 cm, MinPts=3
m and p are core objects because their ε-neighborhoods contain at least 3 points
Directly density-Reachable Objects
An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and
q is a core object
Example:
q is directly density-reachable from m
m is directly density-reachable from p
and vice versa
Density-Reachable Objects
An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of
objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
with respect to ε and MinPts.
Example:
q is density-reachable from p because q is directly density-reachable from m and m is directly
density-reachable from p.
p is not density-reachable from q because q is not a core object.
Density-Connectivity
An object p is density-connected to object q with respect to ε and MinPts if there is an object O
such that both p and q are density-reachable from O with respect to ε and MinPts.
Example:
p, q and m are all density-connected.
DBSCAN Algorithm Steps
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the
next point of the database.
Continue the process until all of the points have been processed.
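The steps above can be sketched in Java as follows. This is a minimal illustration rather than the full lab program; the class and method names (DBSCANSketch, cluster, regionQuery) are our own choices, and a point's Eps-neighborhood here includes the point itself.

```java
import java.util.*;

// A minimal DBSCAN sketch over 2-D points (names are illustrative, not from the manual).
public class DBSCANSketch {
    static final int NOISE = -1, UNVISITED = 0;

    // Returns one label per point: -1 for noise, 1..k for cluster ids.
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length];              // 0 = unvisited
        int clusterId = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(pts, p, eps);
            if (neighbors.size() < minPts) { label[p] = NOISE; continue; }
            clusterId++;                                // p is a core point: start a new cluster
            label[p] = clusterId;
            Deque<Integer> seeds = new ArrayDeque<>(neighbors);
            while (!seeds.isEmpty()) {
                int q = seeds.pop();
                if (label[q] == NOISE) label[q] = clusterId;  // noise becomes a border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(pts, q, eps);
                if (qNeighbors.size() >= minPts) seeds.addAll(qNeighbors); // q is core: expand
            }
        }
        return label;
    }

    // All point indices within eps of point p (including p itself).
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            double dx = pts[i][0] - pts[p][0], dy = pts[i][1] - pts[p][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        // prints [-1, -1, 1, 2, 1, 1, -1, 2]: A3,A5,A6 form one cluster,
        // A4,A8 another; A1, A2, A7 are noise
        System.out.println(Arrays.toString(cluster(pts, 2.0, 2)));
    }
}
```

Because a point counts toward its own neighborhood here, MinPts = 2 means "at least one other point within Eps", which matches the worked example that follows.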
Example:
If Eps is 2 and MinPts is 2, what are the clusters that DBSCAN would discover in the following
example?
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Eps = ε = 2
MinPts = 2
Pairwise distances (0 on the diagonal; "≤2" if the pair lies within Eps of each other, ">2" otherwise):
          A1(2,10)  A2(2,5)  A3(8,4)  A4(5,8)  A5(7,5)  A6(6,4)  A7(1,2)  A8(4,9)
A1(2,10)     0        >2       >2       >2       >2       >2       >2       >2
A2(2,5)     >2         0       >2       >2       >2       >2       >2       >2
A3(8,4)     >2        >2        0       >2       ≤2       ≤2       >2       >2
A4(5,8)     >2        >2       >2        0       >2       >2       >2       ≤2
A5(7,5)     >2        >2       ≤2       >2        0       ≤2       >2       >2
A6(6,4)     >2        >2       ≤2       >2       ≤2        0       >2       >2
A7(1,2)     >2        >2       >2       >2       >2       >2        0       >2
A8(4,9)     >2        >2       >2       ≤2       >2       >2       >2        0
N2(A1)={} N2(A2)={} N2(A3)={A5,A6} N2(A4)={A8}
N2(A5)={A3,A6} N2(A6)={A3,A5} N2(A7)={} N2(A8)={A4}
So A1, A2, and A7 are outliers, while we have two clusters
C1= {A4, A8} and C2={A3, A5, A6}
If Eps is √10, then the neighborhoods of some points will increase:
A1 would join cluster C1, and A2 would join with A7 to form cluster C3 = {A2, A7}.
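The Eps-neighborhoods used in this example can be checked with a short Java routine that compares squared distances, avoiding floating-point square roots (class and method names here are our own):

```java
import java.util.*;

// Computes the Eps-neighborhood (excluding the point itself) for each point
// of the worked example, using squared distances: d^2 <= Eps^2.
public class EpsNeighborhoods {
    public static List<String> neighborhood(double[][] pts, String[] names,
                                            int p, double epsSq) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            if (i == p) continue;                      // exclude the point itself
            double dx = pts[i][0] - pts[p][0], dy = pts[i][1] - pts[p][1];
            if (dx * dx + dy * dy <= epsSq) out.add(names[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        String[] names = {"A1","A2","A3","A4","A5","A6","A7","A8"};
        for (int p = 0; p < pts.length; p++)           // Eps = 2, so Eps^2 = 4
            System.out.println("N2(" + names[p] + ") = "
                    + neighborhood(pts, names, p, 4.0));
    }
}
```

Running it reproduces N2(A3) = {A5, A6}, N2(A4) = {A8}, and empty neighborhoods for A1, A2, A7; with epsSq = 10 (Eps = √10), A1 gains the neighbor A8 and A2 gains A7, as described above.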
Complexity: with a spatial index, the time complexity is O(n log n) (O(n²) without one); the space
complexity is O(n).
Conclusion: Thus, we have successfully implemented the DBSCAN Clustering Algorithm in Java and
tested it on a variety of training databases.
PROGRAM NO. 10: Apriori Association Algorithm
Aim: To implement the Apriori association rule mining algorithm in the Java programming language.
Theory:
Basics: The Apriori algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules.
Key Concepts:
Frequent itemset: a set of items that has minimum support (the set of frequent k-itemsets is
denoted Lk).
Apriori property: any subset of a frequent itemset must itself be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Find the frequent itemsets: the sets of items that have minimum support.
o A subset of a frequent itemset must also be a frequent itemset,
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent
itemsets.
o Iteratively find the frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
Apriori Algorithm: Pseudo code
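Since the pseudocode figure is not reproduced here, the level-wise candidate generation it describes can be sketched in Java as follows. This is a minimal illustration (class and method names are our own), assuming transactions are represented as sets of item labels; the sample database in main is chosen to be consistent with the support counts used in the worked example below.

```java
import java.util.*;

// Level-wise Apriori sketch: join L(k-1) with itself, prune candidates that
// have an infrequent (k-1)-subset, then count support against the database.
public class AprioriSketch {

    // Returns the list [L1, L2, ...] of frequent itemset levels.
    public static List<Set<Set<String>>> apriori(List<Set<String>> db, int minSup) {
        List<Set<Set<String>>> levels = new ArrayList<>();
        Set<Set<String>> current = new HashSet<>();
        Set<String> items = new TreeSet<>();
        for (Set<String> t : db) items.addAll(t);
        for (String i : items)                                 // build L1
            if (support(db, Set.of(i)) >= minSup) current.add(Set.of(i));
        while (!current.isEmpty()) {
            levels.add(current);
            current = nextLevel(db, current, minSup);
        }
        return levels;
    }

    // Join step + prune step + support counting to go from L(k-1) to Lk.
    static Set<Set<String>> nextLevel(List<Set<String>> db,
                                      Set<Set<String>> prev, int minSup) {
        Set<Set<String>> next = new HashSet<>();
        int k = prev.iterator().next().size() + 1;
        for (Set<String> a : prev)
            for (Set<String> b : prev) {
                Set<String> cand = new TreeSet<>(a);
                cand.addAll(b);
                if (cand.size() != k || !allSubsetsFrequent(cand, prev)) continue;
                if (support(db, cand) >= minSup) next.add(cand);
            }
        return next;
    }

    // Apriori property: every (k-1)-subset of a candidate must be in L(k-1).
    static boolean allSubsetsFrequent(Set<String> cand, Set<Set<String>> prev) {
        for (String item : cand) {
            Set<String> sub = new TreeSet<>(cand);
            sub.remove(item);
            if (!prev.contains(sub)) return false;
        }
        return true;
    }

    // Number of transactions containing the itemset.
    static int support(List<Set<String>> db, Set<String> itemset) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(itemset)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("I1","I2","I5"), Set.of("I2","I4"), Set.of("I2","I3"),
            Set.of("I1","I2","I4"), Set.of("I1","I3"), Set.of("I2","I3"),
            Set.of("I1","I3"), Set.of("I1","I2","I3","I5"), Set.of("I1","I2","I3"));
        List<Set<Set<String>>> levels = apriori(db, 2);
        for (int k = 0; k < levels.size(); k++)
            System.out.println("L" + (k + 1) + " = " + levels.get(k));
    }
}
```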
The Apriori Algorithm: Example
Consider a database, D, consisting of 9 transactions:
TID: Items
T100: I1, I2, I5
T200: I2, I4
T300: I2, I3
T400: I1, I2, I4
T500: I1, I3
T600: I2, I3
T700: I1, I3
T800: I1, I2, I3, I5
T900: I1, I2, I3
Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
Let the minimum confidence required be 70%.
We first find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using min. support and min. confidence.
Step 1: Generating 1-itemset Frequent Pattern
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support.
Step 2: Generating 2-itemset Frequent Pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset
in C2 is accumulated (as shown in the middle table).
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
Note: We haven’t used the Apriori property yet.
Step 3: Generating 3-itemset Frequent Pattern
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori
Property.
In order to find C3, we compute L2 Join L2.
C3= L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4,
I5}}.
Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The
Prune step helps to avoid heavy computation due to a large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2,
I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3}
in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its
2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
BUT {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori
property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the
Join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the
join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not
frequent.
Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This
completes our Apriori algorithm.
These frequent itemsets will be used to generate strong association rules (strong
association rules satisfy both minimum support and minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets
Procedure:
o For each frequent itemset l, generate all nonempty subsets of l.
o For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if
support_count(l) / support_count(s) ≥ min_conf,
where min_conf is the minimum confidence threshold.
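The procedure above can be sketched in Java as follows. This is a minimal illustration with our own class and method names; the support counts sc are assumed to be precomputed (as they are in the worked example) and passed in as a map.

```java
import java.util.*;

// Rule generation from one frequent itemset l: emit "s => (l - s)" for every
// nonempty proper subset s whose confidence sc(l)/sc(s) meets min_conf.
public class RuleGenSketch {

    public static List<String> rules(Set<String> l,
                                     Map<Set<String>, Integer> sc, double minConf) {
        List<String> out = new ArrayList<>();
        List<String> items = new ArrayList<>(new TreeSet<>(l));
        int n = items.size();
        for (int mask = 1; mask < (1 << n) - 1; mask++) {   // nonempty proper subsets
            Set<String> s = new TreeSet<>(), t = new TreeSet<>();
            for (int i = 0; i < n; i++)
                (((mask >> i) & 1) == 1 ? s : t).add(items.get(i));
            double conf = (double) sc.get(l) / sc.get(s);   // confidence of s => t
            if (conf >= minConf)
                out.add(s + " => " + t + " (" + Math.round(conf * 100) + "%)");
        }
        return out;
    }

    public static void main(String[] args) {
        // Support counts from the worked example (min_conf = 70%).
        Map<Set<String>, Integer> sc = Map.of(
            Set.of("I1"), 6, Set.of("I2"), 7, Set.of("I5"), 2,
            Set.of("I1","I2"), 4, Set.of("I1","I5"), 2, Set.of("I2","I5"), 2,
            Set.of("I1","I2","I5"), 2);
        // Prints the three strong rules with I5, {I1,I5}, {I2,I5} as antecedents.
        for (String r : rules(Set.of("I1", "I2", "I5"), sc, 0.7))
            System.out.println(r);
    }
}
```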
Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3},
{I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let's take l = {I1, I2, I5}.
All of its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
R1: I1 ∧ I2 ⇒ I5
Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ∧ I5 ⇒ I2
Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ∧ I5 ⇒ I1
Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: I1 ⇒ I2 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 ⇒ I1 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 ⇒ I1 ∧ I2
Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.
Conclusion: Thus, we have successfully implemented the Apriori association rule mining algorithm
in Java and tested it on a variety of training databases.