SIES GRADUATE SCHOOL OF TECHNOLOGY
NERUL, NAVI MUMBAI
DEPARTMENT OF COMPUTER ENGG
SEM: - VI BRANCH: - CE
DATA WAREHOUSING & MINING
LIST OF PROGRAMS:
1. Build & edit Cube
2. Design Storage and Process the Cube
3. K-Nearest Neighbors (KNN) Algorithm
4. K-Means Algorithm
5. Naïve Bayesian Classifier
6. Decision Tree
7. Nearest Neighbors Clustering Algorithm
8. Agglomerative Clustering Algorithm
9. DBSCAN Clustering Algorithm
10. Apriori Algorithm
PROGRAM NO. 1: Build & Edit Cube
Aim: To build and edit Cube
Theory:
Build a Cube
A cube is a multidimensional structure of data. Cubes are defined by a set of dimensions and
measures.
Modeling data multidimensionally facilitates online business analysis and query performance.
Analysis Manager allows you to turn data stored in relational databases into meaningful, easy-to-
navigate business information by creating a data cube.
The most common way of managing relational data for multidimensional use is with a star
schema. A star schema consists of a single fact table and multiple dimension tables linked to the
fact table.
Scenario:
You are a database administrator working for the FoodMart Corporation. FoodMart is a large
grocery store chain with sales in the United States, Mexico, and Canada. The marketing
department wants to analyze all of the sales by products and customers that were made during
the 1998 calendar year. Using data that is stored in the company's data warehouse, you will build
a multidimensional data structure (a cube) to enable fast response times when the marketing
analysts query the database.
We will build a cube that will be used for sales analysis.
How to open the Cube Wizard
In the Analysis Manager tree pane, under the Tutorial database, right-click the Cubes
folder, point to New Cube, and then click Wizard.
How to add measures to the cube
Measures are the quantitative values in the database that you want to analyze. Commonly-used
measures are sales, cost, and budget data. Measures are analyzed against the different dimension
categories of a cube.
1. In the Welcome step of the Cube Wizard, click Next.
2. In the Select a fact table from a data source step, expand the Tutorial data source, and
then click sales_fact_1998.
3. You can view the data in the sales_fact_1998 table by clicking Browse data. After you
finish browsing data, close the Browse data window, and then click Next.
4. To define the measures for your cube, under Fact table numeric columns, double-click
store_sales. Repeat this procedure for the store_cost and unit_sales columns, and then
click Next.
How to build your Time dimension
1. In the Select the dimensions for your cube step of the wizard, click New Dimension.
This calls the Dimension Wizard.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click time_by_day. You can view the data
contained in the time_by_day table by clicking Browse Data. When you are finished
viewing the time_by_day table, click Next.
5. In the Select the dimension type step, select Time dimension, and then click Next.
6. Next, you will define the levels for your dimension. In the Create the time dimension
levels step, click Select time levels, click Year, Quarter, Month, and then click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Time for the name of your new dimension.
9. Click Finish to return to the Cube Wizard.
10. In the Cube Wizard, you should now see the Time dimension in the Cube dimensions
list.
How to build your Product dimension
1. Click New Dimension again. In the Welcome to the Dimension Wizard step, click
Next.
2. In the Choose how you want to create the dimension step, select Snowflake Schema:
Multiple, related dimension tables, and then click Next.
3. In the Select the dimension tables step, double-click product and product_class to add
them to Selected tables. Click Next.
4. The two tables you selected in the previous step and the existing join between them are
displayed in the Create and edit joins step of the Dimension Wizard. Click Next.
5. To define the levels for your dimension, under Available columns, double-click the
product_category, product_subcategory, and brand_name columns, in that order.
After you double-click each column, its name appears under Dimension levels. Click
Next after you have selected all three columns.
6. In the Specify the member key columns step, click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Product in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
9. You should see the Product dimension in the Cube dimensions list.
How to build your Customer dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click Customer, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the
Country, State_Province, City, and lname columns, in that order. After you double-
click each column, its name appears under Dimension levels. After you have selected all
four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Customer in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Customer dimension in the Cube dimensions
list.
How to build your Store dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A
single dimension table, and then click Next.
4. In the Select the dimension table step, click Store, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the
store_country, store_state, store_city, and store_name columns, in that order. After
you double-click each column, its name will appear under Dimension levels. After you
have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Store in the Dimension name box, and leave the
Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Store dimension in the Cube dimensions list.
How to finish building your cube
1. In the Cube Wizard, click Next.
2. Click Yes when prompted by the Fact Table Row Count message.
3. In the last step of the Cube Wizard, name your cube Sales, and then click Finish.
4. The wizard closes and then launches Cube Editor, which contains the cube you just
created. By clicking the blue or yellow title bars, you can drag the fact and dimension
tables to arrange them conveniently in the schema pane.
Edit a Cube
You can make changes to an existing cube by using Cube Editor.
You may want to browse a cube's data and examine or edit its structure. In addition, Cube
Editor allows you to perform other procedures (these are described in SQL Server Books
Online).
Scenario:
You realize that you need to add another level of information to the cube, so that you can analyze
customers based on their demographic information.
How to edit your cube in Cube Editor
You can use two methods to get to Cube Editor:
In the Analysis Manager tree pane, right-click an existing cube, and then click Edit.
-or-
Create a new cube using Cube Editor directly. This method is not recommended unless
you are an advanced user.
If you are continuing from the previous section, you should already be in Cube Editor.
In the schema pane of Cube Editor, you can see the fact table (with yellow title bar) and the
joined dimension tables (blue title bars). In the Cube Editor tree pane, you can preview the
structure of your cube in a hierarchical tree. You can edit the properties of the cube by clicking
the Properties button at the bottom of the left pane.
How to add a dimension to an existing cube
At this point, you decide you need a new dimension to provide data on product promotions. You
can easily build this dimension in Cube Editor.
1. In Cube Editor, on the Insert menu, click Tables.
2. In the Select table dialog box, click the promotion table, click Add, and then
click Close.
3. To define the new dimension, double-click the promotion_name column in the
promotion table.
4. In the Map the Column dialog box, select Dimension, and then click OK.
5. Select the Promotion Name dimension in the tree view.
6. On the Edit menu, click Rename.
7. Type Promotion, and then press ENTER.
8. Save your changes.
9. Close Cube Editor. When prompted to design the storage, click No. You will
design storage in a later section.
Conclusion: Thus, the cube was successfully built and edited.
PROGRAM NO. 2: Design Storage and Process the Cube
Aim: To design storage and process the cube
Theory:
You can design storage options for the data and aggregations in your cube. Before you can use or
browse the data in your cubes, you must process them.
You can choose from three storage modes: multidimensional OLAP (MOLAP), relational
OLAP (ROLAP), and hybrid OLAP (HOLAP).
Microsoft® SQL Server™ 2000 Analysis Services allows you to set up aggregations.
Aggregations are precalculated summaries of data that greatly improve the efficiency and
response time of queries.
When you process a cube, the aggregations designed for the cube are calculated and the cube is
loaded with the calculated aggregations and data.
For more information, see SQL Server Books Online.
Scenario:
Now that you have designed the structure of the Sales cube, you need to choose the storage mode
it will use and designate the amount of precalculated values to store. After this is done, the cube
needs to be populated with data.
In this section you will select MOLAP for your storage mode, create the aggregation design for
the Sales cube, and then process the cube. Processing the Sales cube loads data from the ODBC
source and calculates the summary values as defined in the aggregation design.
How to design storage by using the Storage Design Wizard
1. In the Analysis Manager tree pane, expand the Cubes folder, right-click the Sales cube,
and then click Design Storage.
2. In the Welcome step, click Next.
3. Select MOLAP as your data storage type, and then click Next.
4. Under Set Aggregation Options, click Performance gain reaches. In the box, enter 40
to indicate the percentage.
You are instructing Analysis Services to give a performance boost of up to 40 percent,
regardless of how much disk space this requires. Administrators can use this tuning
ability to balance the need for query performance against the disk space required to store
aggregation data.
5. Click Start.
6. You can watch the Performance vs. Size graph on the right side of the wizard while
Analysis Services designs the aggregations. Here you can see how increasing
performance gain requires additional disk space utilization. When the process of
designing aggregations is complete, click Next.
7. Under What do you want to do?, select Process now, and then click Finish.
Note: Processing the aggregations may take some time.
8. In the window that appears, you can watch your cube while it is being processed. When
processing is complete, a message appears confirming that the processing was completed
successfully.
9. Click Close to return to the Analysis Manager tree pane.
Browse Cube Data
Using Cube Browser, you can look at data in different ways: You can filter the amount of
dimension data that is visible, you can drill down to see greater detail, and you can drill up to see
less detail.
Scenario:
Now that the Sales cube is processed, data is available for analysis.
In this section, you will use Cube Browser to slice and dice through the sales data.
How to view cube data using Cube Browser
1. In the Analysis Manager tree pane, right-click the Sales cube, and then click Browse
Data.
2. Cube Browser appears, displaying a grid made up of one dimension and the measures of
your cube. The additional four dimensions appear at the top of the browser.
How to replace a dimension in the grid
1. To replace one dimension in the grid with another, drag the dimension from the top box
and drop it directly on top of the column you want to exchange it with. Make sure the
pointer appears with a double-ended arrow during this process.
2. Using this drag and drop technique, select the Product dimension button and drag it to
the grid, dropping it directly on top of Measures. The Product and Measures dimensions
will switch positions in Cube Browser.
How to filter your data by time
1. Click the arrow next to the Time dimension.
2. Expand All Time and 1998, and then click Quarter 1. The data in the grid is filtered to
reflect figures for only that one quarter.
How to drill down
1. Switch the Product and Customer dimensions using the drag and drop technique. Click
Product and drag it on top of Country.
2. Double-click the cell in your grid that contains Baking Goods. The cube expands to
include the subcategory column.
Use the above techniques to move dimensions to and from the grid. This will help you
understand how Analysis Manager puts information about complex data relationships at
your fingertips.
3. When you are finished, click Close to close Cube Browser.
Conclusion: Thus, we have successfully designed storage and processed the cube.
PROGRAM NO. 3: K-Nearest Neighbors (KNN) Algorithm
Aim: To implement KNN algorithm in Java
Theory:
It is a non-parametric pattern classification method. In pattern recognition, the k-nearest neighbor
algorithm (KNN) is a method for classifying objects based on closest training examples in the
feature space. KNN is a type of instance-based learning, or lazy learning where the function is
only approximated locally and all computation is deferred until classification. The k-nearest
neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is
classified by a majority vote of its neighbors, with the object being assigned to the class most
common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then
the object is simply assigned to the class of its nearest neighbor. In the classification phase, k is a
user-defined constant. Usually Euclidean distance is used as the distance metric.
Consider a two-class problem where each sample consists of two measurements (x, y).
For a given query point q, assign the class of the nearest neighbour. Compute the k nearest
neighbors and assign the class by majority vote.
(Illustration: classification of the query point q for K = 1 and for K = 3.)
For classification, compute the confidence for each class as Ci/K, where Ci is the number of
patterns among the K nearest patterns belonging to class i.
The classification for the input pattern is the class with the highest confidence.
Advantages: no training is required, and a confidence level can be obtained.
Disadvantages: classification accuracy is low if a complex decision-region boundary exists, and
large storage is required.
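A minimal Java sketch of this classification step follows. It is a sketch under assumptions: the Sample record, the two-class vote array, and the training points are illustrative, not the original lab listing; the distance metric is Euclidean, as stated above.

import java.util.Arrays;
import java.util.Comparator;

public class Knn {
    // One labeled training sample with two measurements (x, y).
    record Sample(double x, double y, int label) {}

    static double euclidean(Sample s, double qx, double qy) {
        double dx = s.x() - qx, dy = s.y() - qy;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Classify query point (qx, qy) by majority vote among its k nearest samples.
    static int classify(Sample[] train, double qx, double qy, int k) {
        Sample[] sorted = train.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(s -> euclidean(s, qx, qy)));
        int[] votes = new int[2];                 // two classes (0 and 1) in this toy data
        for (int i = 0; i < k; i++) votes[sorted[i].label()]++;
        int best = votes[1] > votes[0] ? 1 : 0;
        // Confidence of the winning class = Ci / K, as described above.
        System.out.println("confidence = " + (double) votes[best] / k);
        return best;
    }

    public static void main(String[] args) {
        Sample[] train = { new Sample(1, 1, 0), new Sample(1, 2, 0),
                           new Sample(5, 5, 1), new Sample(6, 5, 1) };
        System.out.println("class = " + classify(train, 5.5, 4.8, 3));
    }
}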
Conclusion: Thus, KNN is successfully implemented in Java & tested on a training database.
PROGRAM NO. 4: K-means Algorithm
Aim: To implement K means Algorithm in Java
Theory:
Clustering allows for unsupervised learning. That is, the machine/software will learn on its
own, using the data (learning set), and will group the objects into particular classes.
K-means is a partitional clustering approach.
Each cluster is associated with a centroid (center point). Each point is assigned to the cluster
with the closest centroid. The number of clusters, K, must be specified.
Algorithm:
1. Select K points as the initial centroids
2. repeat
3. Form K clusters by assigning all points to the closest centroid
4. Recompute the centroid of each cluster
5. until the centroids don’t change
Initial centroids are often chosen randomly. The centroid is (typically) the mean of the points in
the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge, i.e., the centroids eventually stop moving between iterations.
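A minimal Java sketch of this loop follows, using the Manhattan distance ρ(a, b) = |x2 – x1| + |y2 – y1| from the worked example below. The points and initial centers are those of that example; the method names are illustrative.

import java.util.Arrays;

public class KMeans {
    // Manhattan distance ρ(a, b) = |x2 - x1| + |y2 - y1|, as in the example below.
    static double dist(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // points: n x 2 data set; centers: K x 2 initial centroids (updated in place).
    static int[] kMeans(double[][] points, double[][] centers) {
        int n = points.length, k = centers.length;
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {                          // repeat ...
            changed = false;
            for (int i = 0; i < n; i++) {          // assign each point to the closest centroid
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centers[c]) < dist(points[i], centers[best]))
                        best = c;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {          // recompute the centroid of each cluster
                double sx = 0, sy = 0;
                int count = 0;
                for (int i = 0; i < n; i++)
                    if (assign[i] == c) { sx += points[i][0]; sy += points[i][1]; count++; }
                if (count > 0) { centers[c][0] = sx / count; centers[c][1] = sy / count; }
            }
        }                                          // ... until the assignments stop changing
        return assign;
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        double[][] ctr = {{2,10},{5,8},{1,2}};     // A1, A4, A7, as in the example below
        int[] a = kMeans(pts, ctr);
        for (int i = 0; i < a.length; i++)
            System.out.println("A" + (i + 1) + " -> cluster " + (a[i] + 1));
    }
}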
K-means Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three
clusters A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). Initial
cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two
points a=(x1, y1) and b=(x2, y2) is defined as: ρ(a, b) = |x2 – x1| + |y2 – y1| .
Use k-means algorithm to find the three cluster centers after the second iteration.
Solution: First we list all points in the first column of the table below. The initial cluster
centers (means) are (2, 10), (5, 8) and (1, 2), as given. Next, we will calculate the
distance from the first point (2, 10) to each of the three means, by using the distance function:
For point (2, 10) and mean1 (2, 10):
ρ(point, mean1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0 + 0 = 0
Iteration 1
(2, 10) (5, 8) (1, 2)
Point Dist Mean 1 Dist Mean 2 Dist Mean 3 Cluster
A1 (2, 10) 0 5 9 1
A2 (2, 5) 5 6 4 3
A3 (8, 4) 12 7 9 2
A4 (5, 8) 5 0 10 2
A5 (7, 5) 10 5 9 2
A6 (6, 4) 10 5 7 2
A7 (1, 2) 9 10 0 3
A8 (4, 9) 3 2 10 2
Cluster 1: {(2, 10)}
Cluster 2: {(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)}
Cluster 3: {(2, 5), (1, 2)}
Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of
all points in each cluster.
For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center
remains the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
That was Iteration 1. Next, we go to Iteration 2, Iteration 3, and so on until the means do not
change anymore. In Iteration 2, we repeat the process from Iteration 1, this time using the new
means we computed.
After the 2nd iteration, the results would be
Cluster 1: {A1, A8}, Cluster 2: {A3, A4, A5, A6}, Cluster 3: {A2, A7}
with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
After the 3rd iteration, the results would be
Cluster 1: {A1, A4, A8}, Cluster 2: {A3, A5, A6}, Cluster 3: {A2, A7}
with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5).
Conclusion: Thus, we have successfully implemented K-means in Java & tested it on a variety
of training databases.
PROGRAM NO. 5: Naïve Bayesian Classifier
Aim: To implement Naïve Bayesian Classifier
Theory:
The Naive Bayes Classifier technique is based on Bayes' theorem and is
particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive
Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes classification, consider the following example.
The objects can be classified as either GREEN (light color) or
RED (dark color). Our task is to classify new cases as they arrive, i.e., decide to which class
label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as likely to have membership GREEN rather than
RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities
are based on previous experience, in this case the percentage of GREEN and RED objects, and
often used to predict outcomes before they actually happen.
Thus, we can write:
Prior probability of GREEN = number of GREEN objects / total number of objects
Prior probability of RED = number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities
for class membership are P(GREEN) = 40/60 and P(RED) = 20/60.
Having formulated our prior probability, we are now ready to classify a new object (WHITE
circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X, the more likely that the new cases belong to that particular
color. To measure this likelihood, we draw a circle around X which encompasses a number (to
be chosen a priori) of points irrespective of their class labels. Then we calculate the number of
points in the circle belonging to each class label. From this we calculate the likelihoods:
Likelihood of X given GREEN = 1/40 and Likelihood of X given RED = 3/20.
It is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED,
since the circle encompasses 1 of the 40 GREEN objects but 3 of the 20 RED ones. Thus,
1/40 < 3/20.
Although the prior probabilities indicate that X may belong to GREEN (given that there are
twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class
membership of X is RED (given that there are more RED objects in the vicinity of X than
GREEN). In the Bayesian analysis, the final classification is produced by combining both
sources of information, i.e., the prior and the likelihood, to form a posterior probability using the
so-called Bayes' rule:
Posterior of GREEN ∝ P(GREEN) × Likelihood of X given GREEN = 4/6 × 1/40 = 1/60
Posterior of RED ∝ P(RED) × Likelihood of X given RED = 2/6 × 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
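A minimal Java sketch of this computation for the GREEN/RED example follows; the counts come from the text above, and the class and variable names are illustrative.

public class NaiveBayesToy {
    public static void main(String[] args) {
        // Counts from the text: 60 objects, 40 GREEN and 20 RED;
        // the circle drawn around X contains 1 GREEN and 3 RED objects.
        double total = 60, green = 40, red = 20;
        double greenInCircle = 1, redInCircle = 3;

        double priorGreen = green / total;           // 40/60
        double priorRed = red / total;               // 20/60
        double likGreen = greenInCircle / green;     // 1/40
        double likRed = redInCircle / red;           // 3/20

        // Bayes' rule, ignoring the common normalizing constant:
        double postGreen = priorGreen * likGreen;    // = 1/60
        double postRed = priorRed * likRed;          // = 1/20

        System.out.println("posterior GREEN ~ " + postGreen);
        System.out.println("posterior RED   ~ " + postRed);
        System.out.println("X is classified as " + (postRed > postGreen ? "RED" : "GREEN"));
    }
}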
Conclusion: Thus, we have successfully implemented the Naïve Bayesian Classifier in Java &
tested it on a variety of training databases.
PROGRAM NO. 6: Decision Tree
Aim: To implement Decision Tree using ID3 algorithm in Java
Theory:
Decision Tree
Decision trees are among the most useful, powerful and popular tools for classification and
prediction, due to their simplicity, accuracy, ease of use and understanding, and speed.
The decision tree approach divides the search space into rectangular regions.
A decision tree represents rules.
Rules can be easily expressed and understood by humans. They can also be used directly in the
database access language SQL, so that records falling into a particular category may be
retrieved.
A decision tree is a tree in which each branch node represents a choice between a number
of alternatives, and each leaf node represents a classification or decision.
For example:
ID3
ID3 stands for Iterative Dichotomiser 3
Invented by J. Ross Quinlan in 1979.
Builds the tree from the top down, with no backtracking.
Information Gain is used to select the most useful attribute for classification.
ID3 is a precursor to the C4.5 Algorithm.
The main aim is to minimize the expected number of comparisons.
The basic idea of ID3 algorithm is to construct the decision tree by employing a top down,
greedy search through the given sets to test each attribute at every tree node. In order to select
the attribute that is most useful for classifying a given sets, we use a metric --information gain.
The main ideas behind the ID3 algorithm are:
Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a
possible value of that attribute. A leaf node corresponds to the expected value of the
output attribute when the path from the root node to that leaf node describes the input
attributes.
In a “good” decision tree, each non-leaf node should correspond to the input attribute
which is the most informative (i.e., has the lowest entropy) about the output attribute amongst
all the input attributes not yet considered in the path from the root node to that node.
Entropy is used to determine how informative a particular input attribute is about the
output attribute for a subset of the training data.
ID3 Process
• Take all unused attributes and calculate their entropies.
• Choose the attribute that has the lowest entropy, i.e., for which the information gain is
maximum.
• Make a node containing that attribute.
Entropy: Concept used to quantify information is called Entropy. Entropy measures the
randomness in data.
For example:
A completely homogeneous sample has an entropy of 0: if all values are the same, entropy is
zero, as there is no randomness.
An equally divided sample has an entropy of 1: if the values are equally mixed, entropy is
maximal, as randomness is maximal.
Formula of Entropy: Entropy(S) = −Σ pi log2(pi), where pi is the proportion of examples in S
that belong to class i.
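A minimal Java sketch of this entropy computation follows; the class counts in main are illustrative.

public class Entropy {
    // Entropy(S) = -Σ p_i * log2(p_i), computed from per-class counts.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double e = 0;
        for (int c : classCounts) {
            if (c == 0) continue;                 // 0 * log(0) is taken as 0
            double p = (double) c / total;
            e -= p * (Math.log(p) / Math.log(2)); // logarithm base 2
        }
        return e;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new int[]{10, 0})); // homogeneous sample -> 0.0
        System.out.println(entropy(new int[]{5, 5}));  // equally divided    -> 1.0
        // Information gain of an attribute = entropy(parent)
        //   - Σ (|subset| / |parent|) * entropy(subset), over the attribute's values.
    }
}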
Conclusion: Thus, a Decision Tree using ID3 is successfully implemented in Java & tested on a
training database.
PROGRAM NO. 7: Nearest Neighbor Clustering Algorithm
Aim: To implement Nearest Neighbor Clustering Algorithm in Java
Theory:
Basic Idea:
A new instance
o forms a new cluster, or
o is merged into an existing one,
depending on how close it is to the existing clusters.
A threshold T is used to determine whether to merge or to create a new cluster.
The number of clusters k is not required as an input.
Complexity depends on the number of items: for each item, it must be compared to each item
already in a cluster (n comparisons in the worst case).
Time complexity: O(n²) & space complexity: O(n²).
Example:
Given 5 items with the distance between them
Task: Cluster them using nearest neighbor algorithm: threshold t=1.5
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Item A is put into cluster K1 = {A}.
For item B, dist(A, B) = 1, which is less than the threshold, so include it in cluster K1:
K1 = {A, B}.
For item C, dist(A, C) = 2, which is more than the threshold,
dist(B, C) = 2, which is more than the threshold.
Not satisfied, so a new cluster is created: K2 = {C}.
For item D, dist(A, D) = 2, which is more than the threshold,
dist(B, D) = 4, which is more than the threshold,
dist(C, D) = 1, which is less than the threshold, so include it in cluster K2:
K1 = {A, B}, K2 = {C, D}.
For item E, dist(A, E) = 3, which is more than the threshold,
dist(B, E) = 3, which is more than the threshold,
dist(C, E) = 5, which is more than the threshold,
dist(D, E) = 3, which is more than the threshold.
Not satisfied, so a new cluster is created: K3 = {E}.
Final Clustering Output:
K1 = {A, B}, K2 = {C, D}, K3 = {E}
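A minimal Java sketch of the algorithm on this example follows. The item names and distance matrix are from the table above; the "join the nearest clustered item's cluster if within the threshold" rule follows the trace just shown.

public class NearestNeighborClustering {
    public static void main(String[] args) {
        String[] items = {"A", "B", "C", "D", "E"};
        double[][] dist = {
            {0, 1, 2, 2, 3},
            {1, 0, 2, 4, 3},
            {2, 2, 0, 1, 5},
            {2, 4, 1, 0, 3},
            {3, 3, 5, 3, 0}
        };
        double t = 1.5;                        // threshold
        int n = items.length;
        int[] clusterOf = new int[n];          // cluster id for each item
        int numClusters = 1;
        clusterOf[0] = 1;                      // the first item starts cluster K1
        for (int i = 1; i < n; i++) {
            int nearest = 0;                   // nearest already-clustered item
            for (int j = 1; j < i; j++)
                if (dist[i][j] < dist[i][nearest]) nearest = j;
            if (dist[i][nearest] <= t)         // close enough: merge into its cluster
                clusterOf[i] = clusterOf[nearest];
            else                               // otherwise create a new cluster
                clusterOf[i] = ++numClusters;
        }
        for (int k = 1; k <= numClusters; k++) {
            StringBuilder sb = new StringBuilder("K" + k + " = {");
            for (int i = 0; i < n; i++)
                if (clusterOf[i] == k) sb.append(items[i]).append(' ');
            System.out.println(sb.toString().trim() + "}");
        }
    }
}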
Conclusion: Thus, we have successfully implemented Nearest Neighbor clustering in Java &
tested it on a variety of training databases.
PROGRAM NO. 8: Agglomerative Clustering Algorithm
Aim: To implement Agglomerative Clustering Algorithm
Theory:
Agglomerative hierarchical clustering
 Data objects are grouped in a bottom-up fashion.
 Initially each data object is in its own cluster.
 Then merge these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied.
 The user can specify termination condition, as the desired number of clusters.
 The output is a dendrogram, which can be represented as a set of ordered triples <d, k, K>, where d
is the threshold distance, k is the number of clusters, and K is the set of clusters.
Dendrogram:
It is a tree data structure, which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
o Leaf – individual clusters
o Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic
process of hierarchical clustering is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N
clusters, each containing just one item. Let the distances (similarities) between the
clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so
that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-
linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider
the distance between one cluster and another cluster to be equal to the shortest distance from any
member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the
distance between one cluster and another cluster to be equal to the greatest distance from any
member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster
to be equal to the average distance from any member of one cluster to any member of the other
cluster.
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively.
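A minimal Java sketch of single-linkage agglomerative clustering over a distance matrix follows; the example matrix is illustrative, and each printed merge corresponds to one level of the dendrogram.

import java.util.ArrayList;
import java.util.List;

public class SingleLinkage {
    public static void main(String[] args) {
        double[][] d = {                       // N x N distance matrix (example values)
            {0, 1, 2, 2, 3},
            {1, 0, 2, 4, 3},
            {2, 2, 0, 1, 5},
            {2, 4, 1, 0, 3},
            {3, 3, 5, 3, 0}
        };
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < d.length; i++) {   // step 1: each item is its own cluster
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > 1) {          // steps 2-4: repeatedly merge the closest pair
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double link = linkage(d, clusters.get(i), clusters.get(j));
                    if (link < best) { best = link; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            System.out.println("merged at distance " + best + ": " + clusters);
        }
    }

    // Single linkage: shortest distance from any member of one cluster to the other.
    static double linkage(double[][] d, List<Integer> a, List<Integer> b) {
        double min = Double.MAX_VALUE;
        for (int x : a) for (int y : b) min = Math.min(min, d[x][y]);
        return min;
    }
}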
Complexity for Hierarchical Clustering:
Space complexity for hierarchical algorithms is O(n²), because this is the space required for
the adjacency matrix. The space required for the dendrogram is O(kn), which is much less
than O(n²).
Time complexity for hierarchical algorithms is O(kn²), because there is one iteration for
each level in the dendrogram.
Conclusion: Thus, we have successfully implemented the Agglomerative Clustering Algorithm in
Java & tested it on a variety of training databases.
PROGRAM NO. 9: DBSCAN Clustering Algorithm
Aim: To implement Density Based Spatial Clustering of Application with Noise Algorithm
Theory:
Major features
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Used to create clusters of minimum size and density.
Density is defined as minimum no. of points within a certain distance of each other.
Two global parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
Core Object: an object with at least MinPts objects within its Eps-neighborhood
Border Object: an object that is on the border of a cluster
Basic Concepts: ε-neighborhood & core objects
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects then
the object is called a core object
Example: ε = 1 cm, MinPts=3
m and p are core objects because their ε-neighborhoods contain at least 3 points
Directly Density-Reachable Objects
An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and
q is a core object
Example:
q is directly density-reachable from m
m is directly density-reachable from p
and vice versa
Density-Reachable Objects
An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of
objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
with respect to ε and MinPts.
Example:
q is density-reachable from p because q is directly density-reachable from m and m is directly
density-reachable from p
p is not density-reachable from q because q is not a core object
Density-Connectivity
An object p is density-connected to object q with respect to ε and MinPts if there is an object O
such that both p and q are density-reachable from O with respect to ε and MinPts
Example:
p, q and m are all density connected
DBSCAN Algorithm Steps
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database.
Continue the process until all of the points have been processed.
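A minimal Java sketch of these steps follows, using Euclidean distance and counting a point inside its own ε-neighborhood (consistent with the example below); the points in main are those of the example, and cluster numbering depends on scan order.

import java.util.ArrayList;
import java.util.List;

public class Dbscan {
    static final int NOISE = -1, UNVISITED = 0;

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);      // Euclidean distance
    }

    // The ε-neighborhood of point p (including p itself).
    static List<Integer> neighbors(double[][] pts, int p, double eps) {
        List<Integer> n = new ArrayList<>();
        for (int i = 0; i < pts.length; i++)
            if (dist(pts[p], pts[i]) <= eps) n.add(i);
        return n;
    }

    // labels[i]: 0 = unvisited, -1 = noise, >0 = cluster id.
    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] labels = new int[pts.length];
        int cluster = 0;
        for (int p = 0; p < pts.length; p++) {
            if (labels[p] != UNVISITED) continue;         // select an unprocessed point p
            List<Integer> seeds = neighbors(pts, p, eps);
            if (seeds.size() < minPts) { labels[p] = NOISE; continue; } // not a core point
            labels[p] = ++cluster;                        // p is a core point: form a cluster
            for (int i = 0; i < seeds.size(); i++) {      // expand via density-reachability
                int q = seeds.get(i);
                if (labels[q] == NOISE) labels[q] = cluster;  // border point joins the cluster
                if (labels[q] != UNVISITED) continue;
                labels[q] = cluster;
                List<Integer> qn = neighbors(pts, q, eps);
                if (qn.size() >= minPts) seeds.addAll(qn);    // q is also core: keep growing
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        int[] labels = dbscan(pts, 2.0, 2);
        for (int i = 0; i < labels.length; i++)
            System.out.println("A" + (i + 1) + " -> "
                    + (labels[i] == NOISE ? "noise" : "cluster " + labels[i]));
    }
}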
Example:
If Epsilon is 2 and MinPts is 2, what are the clusters that DBSCAN would discover with the
following examples:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Epsilon =ε=2
MinPts=2
A1 (2, 10) A2 (2, 5) A3 (8, 4) A4 (5, 8) A5 (7, 5) A6 (6, 4) A7 (1, 2) A8 (4, 9)
A1 (2, 10) 0 >2 >2 >2 >2 >2 >2 >2
A2 (2, 5) >2 0 >2 >2 >2 >2 >2 >2
A3 (8, 4) >2 >2 0 >2 2 2 >2 >2
A4 (5, 8) >2 >2 >2 0 >2 >2 >2 2
A5 (7, 5) >2 >2 2 >2 0 2 >2 >2
A6 (6, 4) >2 >2 2 >2 2 0 >2 >2
A7 (1, 2) >2 >2 >2 >2 >2 >2 0 >2
A8 (4, 9) >2 >2 >2 2 >2 >2 >2 0
N2(A1)={} N2(A2)={} N2(A3)={A5,A6} N2(A4)={A8}
N2(A5)={A3,A6} N2(A6)={A3,A5} N2(A7)={} N2(A8)={A4}
So A1, A2, and A7 are outliers, while we have two clusters
C1= {A4, A8} and C2={A3, A5, A6}
If Epsilon is √10, then the neighborhood of some points will increase:
A1 would join cluster C1, and A2 would join with A7 to form cluster C3 = {A2, A7}.
Complexity: Space complexity O(log n); time complexity O(n log n).
Conclusion: Thus, we have successfully implemented the DBSCAN Clustering Algorithm in Java &
tested it on a variety of training databases.
PROGRAM NO. 10: Apriori Association Algorithm
Aim: To implement Apriori Association Algorithm in Java programming language.
Theory:
Basics: The Apriori Algorithm is an influential algorithm for mining frequent itemsets for
boolean association rules.
Key Concepts:
Frequent Itemsets: the sets of items which have minimum support (denoted by Li for the i-th
itemset).
Apriori Property: Any subset of frequent itemset must be frequent.
Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Find the frequent itemsets: the sets of items that have minimum support
o A subset of a frequent itemset must also be a frequent itemset
 i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent
itemsets
o Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
Apriori Algorithm: Pseudo code
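The original slide shows the pseudocode as a figure; a standard sketch of it (Ck: the set of candidate k-itemsets, Lk: the set of frequent k-itemsets) is:

L1 = {frequent 1-itemsets};
for (k = 2; L(k-1) is not empty; k++) {
    Ck = candidates generated by joining L(k-1) with itself,
         then pruning every candidate that has an infrequent (k-1)-subset;
    for each transaction t in D:
        increment the count of every candidate in Ck contained in t;
    Lk = candidates in Ck whose count >= min_sup;
}
return the union of all Lk;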
The Apriori Algorithm: Example
Consider a database, D, consisting of 9 transactions. The transactions (the standard
textbook example, consistent with the support counts used below) are: T100: {I1, I2, I5};
T200: {I2, I4}; T300: {I2, I3}; T400: {I1, I2, I4}; T500: {I1, I3}; T600: {I2, I3};
T700: {I1, I3}; T800: {I1, I2, I3, I5}; T900: {I1, I2, I3}.
Suppose the min. support count required is 2 (i.e. min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We have to first find out the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using min. support & min. confidence.
Step 1: Generating 1-itemset Frequent Pattern
The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
Step 2: Generating 2-itemset Frequent Pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset
in C2 is accumulated (as shown in the middle table).
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
Note: We haven’t used Apriori Property yet.
Step 3: Generating 3-itemset Frequent Pattern
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori
Property.
In order to find C3, we compute L2 Join L2.
C3= L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4,
I5}}.
Now, Join step is complete and Prune step will be used to reduce the size of C3. Prune
step helps to avoid heavy computation due to large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example , lets take {I1, I2, I3}.The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2,
I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3}
in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its
2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
BUT {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori
property. Thus, we will have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the
Join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those
candidates 3-itemsets in C3 having minimum support.
Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the
join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not
frequent.
Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This
completes our Apriori algorithm.
These frequent itemsets will be used to generate strong association rules (where strong
association rules satisfy both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets
Procedure:
o For each frequent itemset l, generate all nonempty subsets of l.
o For every nonempty subset s of l, output the rule “s → (l − s)” if
support_count(l) / support_count(s) >= min_conf,
where min_conf is the minimum confidence threshold.
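A minimal Java sketch of this procedure for the itemset l = {I1, I2, I5} used in the example below; the support counts are those from the worked example, stored in a map purely for illustration.

import java.util.*;

public class RuleGen {
    public static void main(String[] args) {
        // Support counts taken from the worked example below.
        Map<Set<String>, Integer> sc = new HashMap<>();
        sc.put(Set.of("I1"), 6);
        sc.put(Set.of("I2"), 7);
        sc.put(Set.of("I5"), 2);
        sc.put(Set.of("I1", "I2"), 4);
        sc.put(Set.of("I1", "I5"), 2);
        sc.put(Set.of("I2", "I5"), 2);
        sc.put(Set.of("I1", "I2", "I5"), 2);

        List<String> l = List.of("I1", "I2", "I5");  // frequent itemset l
        Set<String> whole = Set.copyOf(l);
        double minConf = 0.7;                        // minimum confidence threshold

        // Enumerate every nonempty proper subset s of l via bit masks.
        for (int mask = 1; mask < (1 << l.size()) - 1; mask++) {
            Set<String> s = new HashSet<>(), rest = new HashSet<>();
            for (int i = 0; i < l.size(); i++)
                (((mask >> i) & 1) == 1 ? s : rest).add(l.get(i));
            double conf = (double) sc.get(whole) / sc.get(s);
            System.out.printf("%s -> %s  confidence = %.0f%%  %s%n",
                    s, rest, conf * 100, conf >= minConf ? "selected" : "rejected");
        }
    }
}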
Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3},
{I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let's take l = {I1, I2, I5}.
All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
– R1: I1 ^ I2 → I5
Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%. R1 is rejected.
– R2: I1 ^ I5 → I2
Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%. R2 is selected.
– R3: I2 ^ I5 → I1
Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%. R3 is selected.
– R4: I1 → I2 ^ I5
Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%. R4 is rejected.
– R5: I2 → I1 ^ I5
Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%. R5 is rejected.
– R6: I5 → I1 ^ I2
Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.
Conclusion: Thus, we have successfully implemented the Apriori Association Algorithm in Java &
tested it on a variety of training databases.

Contenu connexe

Tendances

Advanced tools and techniques in week4
Advanced tools and techniques in week4Advanced tools and techniques in week4
Advanced tools and techniques in week4
Brian Magan
 
Excel 2007 create a chart
Excel 2007    create a chartExcel 2007    create a chart
Excel 2007 create a chart
rezaulslide
 

Tendances (16)

Ig proe-wf-5.0
Ig proe-wf-5.0Ig proe-wf-5.0
Ig proe-wf-5.0
 
dr_3
dr_3dr_3
dr_3
 
Advanced tools and techniques in week4
Advanced tools and techniques in week4Advanced tools and techniques in week4
Advanced tools and techniques in week4
 
InDesign CS5 Tutorial
InDesign CS5 TutorialInDesign CS5 Tutorial
InDesign CS5 Tutorial
 
Photoshop Tutorial Clipping path service, Photoshop clipping path. photo clip...
Photoshop Tutorial Clipping path service, Photoshop clipping path. photo clip...Photoshop Tutorial Clipping path service, Photoshop clipping path. photo clip...
Photoshop Tutorial Clipping path service, Photoshop clipping path. photo clip...
 
Qlikview Quick Start
Qlikview Quick StartQlikview Quick Start
Qlikview Quick Start
 
6 - Panorama Necto 14 dimension selector - visualization & data discovery sol...
6 - Panorama Necto 14 dimension selector - visualization & data discovery sol...6 - Panorama Necto 14 dimension selector - visualization & data discovery sol...
6 - Panorama Necto 14 dimension selector - visualization & data discovery sol...
 
How to insert a chart using selected data
How to insert a chart using selected dataHow to insert a chart using selected data
How to insert a chart using selected data
 
Excel 2007 create a chart
Excel 2007    create a chartExcel 2007    create a chart
Excel 2007 create a chart
 
Excel 2007 create a chart
Excel 2007    create a chartExcel 2007    create a chart
Excel 2007 create a chart
 
Adobe Illustrator CS6 Primer
Adobe Illustrator CS6 PrimerAdobe Illustrator CS6 Primer
Adobe Illustrator CS6 Primer
 
Step by Step design cube using SSAS
Step by Step design cube using SSASStep by Step design cube using SSAS
Step by Step design cube using SSAS
 
Adobe Illustrator CC 2018
Adobe Illustrator CC 2018 Adobe Illustrator CC 2018
Adobe Illustrator CC 2018
 
dr_3
dr_3dr_3
dr_3
 
Corel Draw Tutorial: Florist Flyer
Corel Draw Tutorial: Florist FlyerCorel Draw Tutorial: Florist Flyer
Corel Draw Tutorial: Florist Flyer
 
Civil 3d workflow
Civil 3d workflowCivil 3d workflow
Civil 3d workflow
 

En vedette

Article 52 swayam ko samajhana part1
Article 52 swayam ko samajhana   part1Article 52 swayam ko samajhana   part1
Article 52 swayam ko samajhana part1
Rakesh Roshan
 
Chapter 13: UK Renewable Energy Policy since Privatization
Chapter 13: UK Renewable Energy Policy since PrivatizationChapter 13: UK Renewable Energy Policy since Privatization
Chapter 13: UK Renewable Energy Policy since Privatization
Electricidad Verde
 
Projek bmm3103 2012
Projek bmm3103 2012Projek bmm3103 2012
Projek bmm3103 2012
Amy Azuha
 
Offers Partners & Diligence Process
Offers Partners & Diligence ProcessOffers Partners & Diligence Process
Offers Partners & Diligence Process
Karthik Ethirajan
 

En vedette (20)

Article 52 swayam ko samajhana part1
Article 52 swayam ko samajhana   part1Article 52 swayam ko samajhana   part1
Article 52 swayam ko samajhana part1
 
Module 1 8086
Module 1 8086Module 1 8086
Module 1 8086
 
Digital 1
Digital 1Digital 1
Digital 1
 
Combinational and sequential logic
Combinational and sequential logicCombinational and sequential logic
Combinational and sequential logic
 
CO By Rakesh Roshan
CO By Rakesh RoshanCO By Rakesh Roshan
CO By Rakesh Roshan
 
Binary parallel adder
Binary parallel adderBinary parallel adder
Binary parallel adder
 
Carry look ahead adder
Carry look ahead adderCarry look ahead adder
Carry look ahead adder
 
VLSI Lab manual PDF
VLSI Lab manual PDFVLSI Lab manual PDF
VLSI Lab manual PDF
 
Adder Presentation
Adder PresentationAdder Presentation
Adder Presentation
 
Adder ppt
Adder pptAdder ppt
Adder ppt
 
The Things I Carry by Mars Dorian
The Things I Carry by Mars DorianThe Things I Carry by Mars Dorian
The Things I Carry by Mars Dorian
 
Slide shere
Slide shereSlide shere
Slide shere
 
2004 04 27_ocpd_casestudies
2004 04 27_ocpd_casestudies2004 04 27_ocpd_casestudies
2004 04 27_ocpd_casestudies
 
Science.ppt [autosaved]
Science.ppt [autosaved]Science.ppt [autosaved]
Science.ppt [autosaved]
 
Chapter 13: UK Renewable Energy Policy since Privatization
Chapter 13: UK Renewable Energy Policy since PrivatizationChapter 13: UK Renewable Energy Policy since Privatization
Chapter 13: UK Renewable Energy Policy since Privatization
 
The Science Behind Climate Change
The Science Behind Climate ChangeThe Science Behind Climate Change
The Science Behind Climate Change
 
Projek bmm3103 2012
Projek bmm3103 2012Projek bmm3103 2012
Projek bmm3103 2012
 
Pes Product Life Cycle Storyboard
Pes Product Life Cycle StoryboardPes Product Life Cycle Storyboard
Pes Product Life Cycle Storyboard
 
Offers Partners & Diligence Process
Offers Partners & Diligence ProcessOffers Partners & Diligence Process
Offers Partners & Diligence Process
 
Louise Cohen | PROJECTS
Louise Cohen | PROJECTSLouise Cohen | PROJECTS
Louise Cohen | PROJECTS
 

Similaire à Dwm l ab_manual_final

Geo prompt dashboard
Geo prompt dashboardGeo prompt dashboard
Geo prompt dashboard
Amit Sharma
 
A. Lab # BSBA BIS245A-7B. Lab 7 of 7 Database Navigation.docx
A. Lab #  BSBA BIS245A-7B. Lab 7 of 7  Database Navigation.docxA. Lab #  BSBA BIS245A-7B. Lab 7 of 7  Database Navigation.docx
A. Lab # BSBA BIS245A-7B. Lab 7 of 7 Database Navigation.docx
ransayo
 
Create a basic performance point dashboard epc
Create a basic performance point dashboard   epcCreate a basic performance point dashboard   epc
Create a basic performance point dashboard epc
EPC Group
 
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docxFluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
bryanwest16882
 
Access advanced tutorial
Access advanced tutorialAccess advanced tutorial
Access advanced tutorial
catacata1976
 
A Skills Approach Excel 2016 Chapter 8 Exploring Advanced D.docx
A Skills Approach Excel 2016  Chapter 8 Exploring Advanced D.docxA Skills Approach Excel 2016  Chapter 8 Exploring Advanced D.docx
A Skills Approach Excel 2016 Chapter 8 Exploring Advanced D.docx
daniahendric
 
Developing a ssrs report using a ssas data source
Developing a ssrs report using a ssas data sourceDeveloping a ssrs report using a ssas data source
Developing a ssrs report using a ssas data source
relekarsushant
 

Similaire à Dwm l ab_manual_final (20)

Geo prompt dashboard
Geo prompt dashboardGeo prompt dashboard
Geo prompt dashboard
 
A. Lab # BSBA BIS245A-7B. Lab 7 of 7 Database Navigation.docx
A. Lab #  BSBA BIS245A-7B. Lab 7 of 7  Database Navigation.docxA. Lab #  BSBA BIS245A-7B. Lab 7 of 7  Database Navigation.docx
A. Lab # BSBA BIS245A-7B. Lab 7 of 7 Database Navigation.docx
 
Create a basic performance point dashboard epc
Create a basic performance point dashboard   epcCreate a basic performance point dashboard   epc
Create a basic performance point dashboard epc
 
Obiee training
Obiee trainingObiee training
Obiee training
 
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docxFluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
Fluid Mechanics Project Assignment (Total 15)  Due Dates  .docx
 
Access advanced tutorial
Access advanced tutorialAccess advanced tutorial
Access advanced tutorial
 
A Skills Approach Excel 2016 Chapter 8 Exploring Advanced D.docx
A Skills Approach Excel 2016  Chapter 8 Exploring Advanced D.docxA Skills Approach Excel 2016  Chapter 8 Exploring Advanced D.docx
A Skills Approach Excel 2016 Chapter 8 Exploring Advanced D.docx
 
Oracle Forms
Oracle FormsOracle Forms
Oracle Forms
 
oracle-forms
oracle-formsoracle-forms
oracle-forms
 
Cube remodelling
Cube remodellingCube remodelling
Cube remodelling
 
Watson Analytic
Watson AnalyticWatson Analytic
Watson Analytic
 
Developing a ssrs report using a ssas data source
Developing a ssrs report using a ssas data sourceDeveloping a ssrs report using a ssas data source
Developing a ssrs report using a ssas data source
 
Ms access
Ms accessMs access
Ms access
 
How to design a report with fine report reporting tool
How to design a report with  fine report reporting toolHow to design a report with  fine report reporting tool
How to design a report with fine report reporting tool
 
Knowledgeware
KnowledgewareKnowledgeware
Knowledgeware
 
BBA 2 Semester DBMS All assignment
BBA 2 Semester DBMS All assignmentBBA 2 Semester DBMS All assignment
BBA 2 Semester DBMS All assignment
 
Xmastcamcribboard
XmastcamcribboardXmastcamcribboard
Xmastcamcribboard
 
Libre Office Calc Lesson 5: Working with Data
Libre Office Calc Lesson 5: Working with DataLibre Office Calc Lesson 5: Working with Data
Libre Office Calc Lesson 5: Working with Data
 
Workshop10 creep-jop
Workshop10 creep-jopWorkshop10 creep-jop
Workshop10 creep-jop
 
Bar chart Creation
Bar chart CreationBar chart Creation
Bar chart Creation
 

Dernier

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Time for the name of your new dimension.
9. Click Finish to return to the Cube Wizard.
10. In the Cube Wizard, you should now see the Time dimension in the Cube dimensions list.
How to build your Product dimension
1. Click New Dimension again. In the Welcome to the Dimension Wizard step, click Next.
2. In the Choose how you want to create the dimension step, select Snowflake Schema: Multiple, related dimension tables, and then click Next.
3. In the Select the dimension tables step, double-click product and product_class to add them to Selected tables. Click Next.
4. The two tables you selected in the previous step, and the existing join between them, are displayed in the Create and edit joins step of the Dimension Wizard. Click Next.
5. To define the levels for your dimension, under Available columns, double-click the product_category, product_subcategory, and brand_name columns, in that order. After you double-click each column, its name appears under Dimension levels. Click Next after you have selected all three columns.
6. In the Specify the member key columns step, click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Product in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.
9. You should see the Product dimension in the Cube dimensions list.

How to build your Customer dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A single dimension table, and then click Next.
4. In the Select the dimension table step, click Customer, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the Country, State_Province, City, and lname columns, in that order. After you double-click each column, its name appears under Dimension levels. After you have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Customer in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Customer dimension in the Cube dimensions list.

How to build your Store dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A single dimension table, and then click Next.
4. In the Select the dimension table step, click Store, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the store_country, store_state, store_city, and store_name columns, in that order. After you double-click each column, its name appears under Dimension levels. After you have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Store in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Store dimension in the Cube dimensions list.

How to finish building your cube
1. In the Cube Wizard, click Next.
2. Click Yes when prompted by the Fact Table Row Count message.
3. In the last step of the Cube Wizard, name your cube Sales, and then click Finish.
4. The wizard closes and then launches Cube Editor, which contains the cube you just created. By clicking the blue or yellow title bars, you can arrange the tables in the schema pane.
Edit a Cube
We can make changes to an existing cube by using Cube Editor. You may want to browse a cube's data and examine or edit its structure. In addition, Cube Editor allows you to perform other procedures (these are described in SQL Server Books Online).

Scenario:
You realize that you need to add another level of information to the cube, so that you can analyze customers based on their demographic information.

How to edit your cube in Cube Editor
You can use two methods to get to Cube Editor:
In the Analysis Manager tree pane, right-click an existing cube, and then click Edit.
-or-
Create a new cube using Cube Editor directly. This method is not recommended unless you are an advanced user.
If you are continuing from the previous section, you should already be in Cube Editor. In the schema pane of Cube Editor, you can see the fact table (with a yellow title bar) and the joined dimension tables (with blue title bars). In the Cube Editor tree pane, you can preview the structure of your cube in a hierarchical tree. You can edit the properties of the cube by clicking the Properties button at the bottom of the left pane.

How to add a dimension to an existing cube
At this point, you decide you need a new dimension to provide data on product promotions. You can easily build this dimension in Cube Editor.
1. In Cube Editor, on the Insert menu, click Tables.
2. In the Select table dialog box, click the promotion table, click Add, and then click Close.
3. To define the new dimension, double-click the promotion_name column in the promotion table.
4. In the Map the Column dialog box, select Dimension, and then click OK.
5. Select the Promotion Name dimension in the tree view.
6. On the Edit menu, click Rename.
7. Type Promotion, and then press ENTER.
8. Save your changes.
9. Close Cube Editor. When prompted to design the storage, click No. You will design storage in a later section.

Conclusion: Thus, the cube was successfully built and edited.
PROGRAM NO. 2: Design Storage and Process the Cube

Aim: To design storage and process the cube

Theory:
We can design storage options for the data and aggregations in a cube. Before you can use or browse the data in your cubes, you must process them.
You can choose from three storage modes: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP).
Microsoft® SQL Server™ 2000 Analysis Services allows you to set up aggregations. Aggregations are precalculated summaries of data that greatly improve the efficiency and response time of queries.
When you process a cube, the aggregations designed for the cube are calculated and the cube is loaded with the calculated aggregations and data. For more information, see SQL Server Books Online.

Scenario:
Now that you have designed the structure of the Sales cube, you need to choose the storage mode it will use and designate the amount of precalculated values to store. After this is done, the cube needs to be populated with data. In this section you will select MOLAP for your storage mode, create the aggregation design for the Sales cube, and then process the cube. Processing the Sales cube loads data from the ODBC source and calculates the summary values as defined in the aggregation design.

How to design storage by using the Storage Design Wizard
1. In the Analysis Manager tree pane, expand the Cubes folder, right-click the Sales cube, and then click Design Storage.
2. In the Welcome step, click Next.
3. Select MOLAP as your data storage type, and then click Next.
4. Under Set Aggregation Options, click Performance gain reaches. In the box, enter 40 to indicate the percentage. You are instructing Analysis Services to give a performance boost of up to 40 percent, regardless of how much disk space this requires. Administrators can use this tuning ability to balance the need for query performance against the disk space required to store aggregation data.
5. Click Start.
6. You can watch the Performance vs. Size graph in the right side of the wizard while Analysis Services designs the aggregations. Here you can see how increasing the performance gain requires additional disk space. When the process of designing aggregations is complete, click Next.
7. Under What do you want to do?, select Process now, and then click Finish. Note: Processing the aggregations may take some time.
8. In the window that appears, you can watch your cube while it is being processed. When processing is complete, a message appears confirming that the processing was completed successfully.
9. Click Close to return to the Analysis Manager tree pane.
Browse Cube Data
Using Cube Browser, you can look at data in different ways: you can filter the amount of dimension data that is visible, you can drill down to see greater detail, and you can drill up to see less detail.

Scenario:
Now that the Sales cube is processed, data is available for analysis. In this section, you will use Cube Browser to slice and dice through the sales data.

How to view cube data using Cube Browser
1. In the Analysis Manager tree pane, right-click the Sales cube, and then click Browse Data.
2. Cube Browser appears, displaying a grid made up of one dimension and the measures of your cube. The additional four dimensions appear at the top of the browser.

How to replace a dimension in the grid
1. To replace one dimension in the grid with another, drag the dimension from the top box and drop it directly on top of the column you want to exchange it with. Make sure the pointer appears with a double-ended arrow during this process.
2. Using this drag and drop technique, select the Product dimension button and drag it to the grid, dropping it directly on top of Measures. The Product and Measures dimensions will switch positions in Cube Browser.

How to filter your data by time
1. Click the arrow next to the Time dimension.
2. Expand All Time and 1998, and then click Quarter 1. The data in the grid is filtered to reflect figures for only that one quarter.
How to drill down
1. Switch the Product and Customer dimensions using the drag and drop technique. Click Product and drag it on top of Country.
2. Double-click the cell in your grid that contains Baking Goods. The cube expands to include the subcategory column. Use the above techniques to move dimensions to and from the grid. This will help you understand how Analysis Manager puts information about complex data relationships at your fingertips.
3. When you are finished, click Close to close Cube Browser.

Conclusion: Thus, we have successfully designed storage for the cube and processed it.
PROGRAM NO. 3: K-Nearest Neighbors (KNN) Algorithm

Aim: To implement the KNN algorithm in Java

Theory:
KNN is a non-parametric method for pattern classification. In pattern recognition, the k-nearest neighbor algorithm (KNN) classifies objects based on the closest training examples in the feature space. KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbor.
In the classification phase, k is a user-defined constant. Usually Euclidean distance is used as the distance metric.
Consider a two-class problem where each sample consists of two measurements (x, y). For a given query point q, compute the k nearest neighbors and assign the class by majority vote; with k = 1, this reduces to taking the class of the single nearest neighbour.
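As an illustration, here is a minimal Java sketch of this classification step. The class name, labels, and training points are made up for the example; this is a sketch of the idea, not the lab's reference solution.

import java.util.*;

public class KNNDemo {
    // Classify the query point (qx, qy) by majority vote among its k nearest samples.
    static String classify(double[][] pts, String[] labels,
                           double qx, double qy, int k) {
        // Order the sample indices by Euclidean distance to the query point.
        Integer[] order = new Integer[pts.length];
        for (int i = 0; i < pts.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> Math.hypot(pts[i][0] - qx, pts[i][1] - qy)));
        // Count the class votes among the k nearest neighbors; confidence = Ci / k.
        Map<String, Integer> votes = new HashMap<>();
        for (int j = 0; j < k; j++) votes.merge(labels[order[j]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Illustrative two-class training set.
        double[][] pts = {{1, 1}, {2, 1}, {1, 2}, {6, 6}, {7, 6}, {6, 7}};
        String[] labels = {"A", "A", "A", "B", "B", "B"};
        System.out.println(classify(pts, labels, 2, 2, 3)); // prints A
        System.out.println(classify(pts, labels, 6, 5, 3)); // prints B
    }
}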
For classification, compute the confidence for each class as Ci / k, where Ci is the number of patterns among the k nearest patterns belonging to class i. The classification for the input pattern is the class with the highest confidence.
Advantage: No training is required, and a confidence level can be obtained.
Disadvantage: Classification accuracy is low if a complex decision-region boundary exists, and large storage is required for the training set.

Conclusion: Thus, KNN is successfully implemented in Java and tested for the training database.
PROGRAM NO. 4: K-Means Algorithm

Aim: To implement the K-means algorithm in Java

Theory:
Clustering allows for unsupervised learning. That is, the machine/software will learn on its own, using the data (learning set), and will classify the objects into particular classes.
K-means is a partition clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified.

Algorithm:
1. Select K points as the initial centroids.
2. repeat
3. Form K clusters by assigning all points to the closest centroid.
4. Recompute the centroid of each cluster.
5. until the centroids don't change
Initial centroids are often chosen randomly. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge: the centroids move less with each iteration until they stop changing.

K-means Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 - x1| + |y2 - y1|. Use the k-means algorithm to find the three cluster centers after the second iteration.
Solution: First we list all points in the first column of the table below. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2), as given. Next, we calculate the distance from the first point (2, 10) to each of the three means, by using the distance function:
point: (2, 10), mean1: (2, 10)
ρ(point, mean1) = |x2 - x1| + |y2 - y1| = |2 - 2| + |10 - 10| = 0 + 0 = 0

Iteration 1, with means (2, 10), (5, 8) and (1, 2):

Point       Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)       0             5             9           1
A2 (2, 5)        5             6             4           3
A3 (8, 4)       12             7             9           2
A4 (5, 8)        5             0            10           2
A5 (7, 5)       10             5             9           2
A6 (6, 4)       10             5             7           2
A7 (1, 2)        9            10             0           3
A8 (4, 9)        3             2            10           2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)

Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster. For Cluster 1, we have only one point, A1(2, 10), which was the old mean, so the cluster center remains the same.
For Cluster 2, we have ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6).
For Cluster 3, we have ((2+1)/2, (5+2)/2) = (1.5, 3.5).
That was Iteration 1. Next, we go to Iteration 2, Iteration 3, and so on until the means do not change anymore. In Iteration 2, we repeat the process from Iteration 1, this time using the new means we computed.
After the 2nd iteration, the results would be 1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7}, with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
After the 3rd iteration, the results would be 1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7}, with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5). A compact Java sketch of this loop is given below.

Conclusion: Thus, we have successfully implemented K-means in Java and tested it for a variety of training databases.
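The following is a minimal, illustrative Java sketch of the loop traced above, using the same eight points, the same initial centroids, and the Manhattan distance from the example; the class and variable names are our own.

import java.util.*;

public class KMeansDemo {
    // Manhattan distance, as used in the worked example above.
    static double dist(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        // Initial centroids A1, A4, A7, as chosen in the example.
        double[][] c = {{2,10},{5,8},{1,2}};
        int k = 3, n = pts.length;
        int[] assign = new int[n];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: attach every point to its closest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(pts[i], c[j]) < dist(pts[i], c[best])) best = j;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Update step: recompute each centroid as the mean of its points.
            for (int j = 0; j < k; j++) {
                double sx = 0, sy = 0; int cnt = 0;
                for (int i = 0; i < n; i++)
                    if (assign[i] == j) { sx += pts[i][0]; sy += pts[i][1]; cnt++; }
                if (cnt > 0) { c[j][0] = sx / cnt; c[j][1] = sy / cnt; }
            }
        }
        for (int j = 0; j < k; j++)
            System.out.println("Cluster " + (j + 1) + " center: " + Arrays.toString(c[j]));
    }
}

Run as-is, it iterates until the assignments stop changing and reproduces the final clustering {A1, A4, A8}, {A3, A5, A6}, {A2, A7} from the example.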
PROGRAM NO. 5: Naïve Bayesian Classifier

Aim: To implement the Naïve Bayesian Classifier

Theory:
The Naive Bayes Classifier technique is based on the Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes classification, consider a set of objects that can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen. Thus, we can write:
Prior probability of GREEN = number of GREEN objects / total number of objects
Prior probability of RED = number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle in the original illustration). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points, irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:
Likelihood of X given GREEN = number of GREEN objects in the vicinity of X / total number of GREEN objects
Likelihood of X given RED = number of RED objects in the vicinity of X / total number of RED objects
It is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN as RED), the likelihood indicates otherwise: the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule:
Posterior probability of X being GREEN = prior probability of GREEN × likelihood of X given GREEN = 2/3 × 1/40 = 1/60
Posterior probability of X being RED = prior probability of RED × likelihood of X given RED = 1/3 × 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior probability. This calculation is sketched in Java below.

Conclusion: Thus, we have successfully implemented the Naïve Bayesian Classifier in Java and tested it for a variety of training databases.
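A minimal Java sketch of the arithmetic above; the counts come from the worked example, while the class name is our own.

public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Counts from the worked example: 60 objects, 40 GREEN and 20 RED,
        // and a neighborhood of X containing 1 GREEN and 3 RED objects.
        double total = 60, green = 40, red = 20;
        double nearGreen = 1, nearRed = 3;

        double priorGreen = green / total;         // 2/3
        double priorRed   = red / total;           // 1/3
        double likeGreen  = nearGreen / green;     // 1/40
        double likeRed    = nearRed / red;         // 3/20

        // Posterior is proportional to prior * likelihood (Bayes' rule).
        double postGreen = priorGreen * likeGreen; // 1/60
        double postRed   = priorRed * likeRed;     // 1/20

        System.out.println("unnormalized P(GREEN|X) = " + postGreen);
        System.out.println("unnormalized P(RED|X)   = " + postRed);
        System.out.println("X is classified as " + (postRed > postGreen ? "RED" : "GREEN"));
    }
}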
PROGRAM NO. 6: Decision Tree

Aim: To implement a Decision Tree using the ID3 algorithm in Java

Theory:
Decision Tree
Decision trees are among the most useful, powerful and popular tools for classification and prediction, due to their simplicity, accuracy, ease of use and understanding, and speed. The decision tree approach divides the search space into rectangular regions. A decision tree represents rules. Rules can be easily expressed and understood by humans, and they can be used directly in the database access language SQL, so that records falling into a particular category may be retrieved.
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision. For example, an internal node may test one attribute, with a branch for each of its possible values, and each leaf then assigns a class label.
ID3
ID3 stands for Iterative Dichotomiser 3. It was invented by J. Ross Quinlan in 1979. It builds the tree from the top down, with no backtracking, and uses information gain to select the most useful attribute for classification. ID3 is a precursor to the C4.5 algorithm. Its main aim is to minimize the expected number of comparisons.
The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we use a metric: information gain.
The main ideas behind the ID3 algorithm are:
Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the path from the root node to that leaf node describes the input attributes.
In a "good" decision tree, each non-leaf node should correspond to the input attribute which is the most informative (lowest entropy) about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. Entropy is used to determine how informative a particular input attribute is about the output attribute for a subset of the training data.
ID3 Process
• Take all unused attributes and calculate their entropies.
• Choose the attribute that has the lowest entropy, i.e., for which the information gain is maximum.
• Make a node containing that attribute.
Entropy:
The concept used to quantify information is called entropy. Entropy measures the randomness in data. For example:
A completely homogeneous sample has an entropy of 0: if all values are the same, entropy is zero, as there is no randomness.
An equally divided sample has an entropy of 1: if the values vary, entropy is present, as there is randomness.
Formula of Entropy:
Entropy(S) = - Σ pi log2(pi)
where the sum runs over the classes i and pi is the proportion of samples in S that belong to class i.
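To make the formula concrete, here is a minimal Java sketch that computes entropy and the information gain of a split. The split at the end is a hypothetical example, not taken from a dataset in this manual.

import java.util.*;

public class ID3Entropy {
    // Entropy(S) = -sum over classes of p_i * log2(p_i).
    static double entropy(int[] classCounts) {
        double total = Arrays.stream(classCounts).sum(), h = 0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv).
    static double gain(int[] parent, int[][] partitions) {
        double total = Arrays.stream(parent).sum(), remainder = 0;
        for (int[] part : partitions) {
            double size = Arrays.stream(part).sum();
            remainder += (size / total) * entropy(part);
        }
        return entropy(parent) - remainder;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new int[]{5, 5}));   // 1.0: equally divided sample
        System.out.println(entropy(new int[]{10, 0}));  // 0.0: homogeneous sample
        // Gain of a hypothetical attribute splitting [9+, 5-] into [6+, 2-] and [3+, 3-].
        System.out.println(gain(new int[]{9, 5}, new int[][]{{6, 2}, {3, 3}}));
    }
}

ID3 then builds the tree by picking, at each node, the attribute with the maximum gain over the remaining attributes.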
Conclusion: Thus, a Decision Tree using ID3 is successfully implemented in Java and tested for the training database.
PROGRAM NO. 7: Nearest Neighbor Clustering Algorithm

Aim: To implement the Nearest Neighbor Clustering Algorithm in Java

Theory:
Basic idea: a new instance either forms a new cluster or is merged into an existing one, depending on how close it is to the existing clusters. A threshold t is used to determine whether to merge or to create a new cluster. The number of clusters k is not required as an input.
Complexity depends on the number of items: for each loop, each item must be compared to every item already in a cluster (n comparisons in the worst case). Time complexity: O(n²). Space complexity: O(n²).

Example: Given 5 items with the distances between them, cluster them using the nearest neighbor algorithm with threshold t = 1.5.

Item   A   B   C   D   E
A      0   1   2   2   3
B      1   0   2   4   3
C      2   2   0   1   5
D      2   4   1   0   3
E      3   3   5   3   0
Item A is put into cluster K1 = {A}.
For item B: dist(A, B) = 1, which is less than the threshold, so B is included in cluster K1. K1 = {A, B}.
For item C: dist(A, C) = 2 and dist(B, C) = 2, both more than the threshold. No cluster is close enough, so a new cluster is created: K2 = {C}.
For item D: dist(A, D) = 2 and dist(B, D) = 4, both more than the threshold, but dist(C, D) = 1, which is less than the threshold, so D is included in cluster K2. K1 = {A, B}, K2 = {C, D}.
For item E: dist(A, E) = 3, dist(B, E) = 3, dist(C, E) = 5 and dist(D, E) = 3 are all more than the threshold. No cluster is close enough, so a new cluster is created: K3 = {E}.
Final clustering output: K1 = {A, B}, K2 = {C, D}, K3 = {E}. A short Java sketch of this procedure follows.

Conclusion: Thus, we have successfully implemented Nearest Neighbor clustering in Java and tested it for a variety of training databases.
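A minimal Java sketch of this procedure, run on the distance matrix from the example; the class name and output format are our own. It prints {A B}, {C D} and {E}.

import java.util.*;

public class NearestNeighborClustering {
    public static void main(String[] args) {
        String[] items = {"A", "B", "C", "D", "E"};
        // Distance matrix from the example above.
        double[][] d = {
                {0, 1, 2, 2, 3},
                {1, 0, 2, 4, 3},
                {2, 2, 0, 1, 5},
                {2, 4, 1, 0, 3},
                {3, 3, 5, 3, 0}};
        double threshold = 1.5;
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < items.length; i++) {
            // Find the cluster containing the nearest already-clustered item.
            List<Integer> best = null;
            double bestDist = Double.MAX_VALUE;
            for (List<Integer> cluster : clusters)
                for (int j : cluster)
                    if (d[i][j] < bestDist) { bestDist = d[i][j]; best = cluster; }
            // Merge if within the threshold, otherwise start a new cluster.
            if (best != null && bestDist <= threshold) best.add(i);
            else { List<Integer> c = new ArrayList<>(); c.add(i); clusters.add(c); }
        }
        for (List<Integer> cluster : clusters) {
            StringBuilder sb = new StringBuilder("{");
            for (int j : cluster) sb.append(items[j]).append(" ");
            System.out.println(sb.toString().trim() + "}");
        }
    }
}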
PROGRAM NO. 8: Agglomerative Clustering Algorithm

Aim: To implement the Agglomerative Clustering Algorithm

Theory:
Agglomerative hierarchical clustering
Data objects are grouped in a bottom-up fashion. Initially each data object is in its own cluster. We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. The user can specify a termination condition, such as the desired number of clusters.
The output is a dendrogram, which can be represented as a set of ordered triples <d, k, K>, where d is the threshold distance, k is the number of clusters, and K is the set of clusters.

Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. Each level shows the clusters for that level:
o Leaf - individual clusters
o Root - one cluster
A cluster at level i is the union of its children clusters at level i+1.
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively.

Complexity for hierarchical clustering:
Space complexity for the hierarchical algorithm is O(n²), because this is the space required for the adjacency matrix. Space required for the dendrogram is O(kn), which is much less than O(n²).
Time complexity for hierarchical algorithms is O(kn²), because there is one iteration for each level in the dendrogram. A single-linkage sketch in Java is given below.

Conclusion: Thus, we have successfully implemented the Agglomerative Clustering Algorithm in Java and tested it for a variety of training databases.
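For illustration, a minimal Java sketch of agglomerative clustering with single linkage, reusing the 5-item distance matrix from Program 7; the class name and output format are our own. Each merge it prints corresponds to one level of the dendrogram.

import java.util.*;

public class SingleLinkage {
    // Single-linkage distance: shortest distance between any pair of members.
    static double linkage(List<Integer> a, List<Integer> b, double[][] d) {
        double min = Double.MAX_VALUE;
        for (int i : a) for (int j : b) min = Math.min(min, d[i][j]);
        return min;
    }

    static String describe(List<List<Integer>> clusters, String[] items) {
        StringBuilder sb = new StringBuilder();
        for (List<Integer> c : clusters) {
            sb.append("{");
            for (int i : c) sb.append(items[i]);
            sb.append("} ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] items = {"A", "B", "C", "D", "E"};
        double[][] d = {
                {0, 1, 2, 2, 3},
                {1, 0, 2, 4, 3},
                {2, 2, 0, 1, 5},
                {2, 4, 1, 0, 3},
                {3, 3, 5, 3, 0}};
        // Start with one singleton cluster per item.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < items.length; i++)
            clusters.add(new ArrayList<>(List.of(i)));
        // Repeatedly merge the two closest clusters; print each dendrogram level.
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double dist = linkage(clusters.get(i), clusters.get(j), d);
                    if (dist < best) { best = dist; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            System.out.println("merge at d=" + best + " -> " + describe(clusters, items));
        }
    }
}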
PROGRAM NO. 9: DBSCAN Clustering Algorithm

Aim: To implement the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm

Theory:
Major features: DBSCAN discovers clusters of arbitrary shape, handles noise, needs only one scan of the data, and needs density parameters as a termination condition. It is used to create clusters of minimum size and density, where density is defined as a minimum number of points within a certain distance of each other.
Two global parameters:
Eps (ε): the maximum radius of the neighbourhood
MinPts: the minimum number of points in an Eps-neighbourhood of a point
Core object: an object with at least MinPts objects within a radius Eps (its ε-neighborhood).
Border object: an object that lies on the border of a cluster.

Basic concepts: ε-neighborhood and core objects
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Example: with ε = 1 cm and MinPts = 3, m and p are core objects because their ε-neighborhoods contain at least 3 points.
Directly density-reachable objects
An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object.
Example: q is directly density-reachable from m; m is directly density-reachable from p, and vice versa.

Density-reachable objects
An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts.
Example: q is density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p; p is not density-reachable from q because q is not a core object.

Density-connectivity
An object p is density-connected to object q with respect to ε and MinPts if there is an object O such that both p and q are density-reachable from O with respect to ε and MinPts.
Example: p, q and m are all density-connected.

DBSCAN Algorithm Steps
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
Example: If Epsilon is 2 and MinPts is 2, what are the clusters that DBSCAN would discover in the following data? A1 = (2, 10), A2 = (2, 5), A3 = (8, 4), A4 = (5, 8), A5 = (7, 5), A6 = (6, 4), A7 = (1, 2), A8 = (4, 9).
With ε = 2 and MinPts = 2, the pairwise Euclidean distances compare to ε as follows:

           A1    A2    A3    A4    A5    A6    A7    A8
A1 (2,10)   0    >2    >2    >2    >2    >2    >2    >2
A2 (2,5)   >2     0    >2    >2    >2    >2    >2    >2
A3 (8,4)   >2    >2     0    >2    ≤2    ≤2    >2    >2
A4 (5,8)   >2    >2    >2     0    >2    >2    >2    ≤2
A5 (7,5)   >2    >2    ≤2    >2     0    ≤2    >2    >2
A6 (6,4)   >2    >2    ≤2    >2    ≤2     0    >2    >2
A7 (1,2)   >2    >2    >2    >2    >2    >2     0    >2
A8 (4,9)   >2    >2    >2    ≤2    >2    >2    >2     0

N2(A1) = {}    N2(A2) = {}    N2(A3) = {A5, A6}    N2(A4) = {A8}
N2(A5) = {A3, A6}    N2(A6) = {A3, A5}    N2(A7) = {}    N2(A8) = {A4}
So A1, A2 and A7 are outliers, while we have two clusters: C1 = {A4, A8} and C2 = {A3, A5, A6}.
If Epsilon is square root(10), then the neighborhoods of some points grow: A1 would join cluster C1, and A2 would join with A7 to form cluster C3 = {A2, A7}.
A runnable sketch of the algorithm on this data is given below.

Complexity: Time complexity is O(n log n) when a spatial index is used; without an index, each neighborhood query costs O(n), giving O(n²) overall.

Conclusion: Thus, we have successfully implemented the DBSCAN Clustering Algorithm in Java and tested it for a variety of training databases.
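Below is a minimal Java sketch of DBSCAN on the eight points above, counting a point inside its own ε-neighborhood so that MinPts = 2 reproduces the example's result; the class and variable names are our own.

import java.util.*;

public class DBSCANDemo {
    static double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
    static double eps = 2.0;
    static int minPts = 2;

    // Euclidean eps-neighborhood of point i (includes i itself).
    static List<Integer> neighbors(int i) {
        List<Integer> n = new ArrayList<>();
        for (int j = 0; j < pts.length; j++)
            if (Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]) <= eps)
                n.add(j);
        return n;
    }

    public static void main(String[] args) {
        int[] label = new int[pts.length]; // 0 = unvisited, -1 = noise, >0 = cluster id
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != 0) continue;
            List<Integer> n = neighbors(i);
            if (n.size() < minPts) { label[i] = -1; continue; } // not a core point
            label[i] = ++cluster;
            // Expand the cluster by following density-reachable points.
            Deque<Integer> seeds = new ArrayDeque<>(n);
            while (!seeds.isEmpty()) {
                int q = seeds.pop();
                if (label[q] == -1) label[q] = cluster;    // noise becomes a border point
                if (label[q] != 0) continue;               // already processed
                label[q] = cluster;
                List<Integer> nq = neighbors(q);
                if (nq.size() >= minPts) seeds.addAll(nq); // q is also a core point
            }
        }
        for (int i = 0; i < pts.length; i++)
            System.out.println("A" + (i + 1) + " -> "
                    + (label[i] == -1 ? "noise" : "cluster " + label[i]));
    }
}

It labels A1, A2 and A7 as noise and finds the two clusters {A3, A5, A6} and {A4, A8}; the cluster numbering may differ from the text.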
PROGRAM NO. 10: Apriori Association Algorithm

Aim: To implement the Apriori Association Algorithm in the Java programming language

Theory:
Basics: The Apriori Algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts:
Frequent itemsets: the sets of items which have minimum support (denoted by Li for the frequent i-itemsets).
Apriori property: any subset of a frequent itemset must be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
First, find the frequent itemsets, i.e. the sets of items that have minimum support:
o A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
o Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Then use the frequent itemsets to generate association rules.
Apriori Algorithm: Pseudo code
L1 = frequent 1-itemsets
for (k = 2; Lk-1 is not empty; k++):
    Ck = candidates generated by joining Lk-1 with itself, pruned using the Apriori property
    for each transaction t in the database, increment the count of every candidate in Ck contained in t
    Lk = candidates in Ck with at least minimum support
return the union of all Lk

The Apriori Algorithm: Example
Consider a database D consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%), and let the minimum confidence required be 70%. We have to first find the frequent itemsets using the Apriori algorithm. Then, association rules will be generated using minimum support and minimum confidence.
Step 1: Generating the 1-itemset frequent pattern
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, and its support count is accumulated by scanning D. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every item qualifies, with support counts sc{I1} = 6, sc{I2} = 7, sc{I3} = 6, sc{I4} = 2 and sc{I5} = 2.

Step 2: Generating the 2-itemset frequent pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support: {I1, I2} (4), {I1, I3} (4), {I1, I5} (2), {I2, I3} (4), {I2, I4} (2) and {I2, I5} (2).
Note: we haven't used the Apriori property yet.
Step 3: Generating the 3-itemset frequent pattern
The generation of the set of candidate 3-itemsets, C3, involves the use of the Apriori property. In order to find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating the 4-itemset frequent pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm. These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating association rules from frequent itemsets
Procedure:
o For each frequent itemset l, generate all nonempty subsets of l.
o For every nonempty subset s of l, output the rule "s => (l - s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
Example: We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.
Let's take l = {I1, I2, I5}. Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.
R1: I1 ^ I2 => I5
Confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ^ I5 => I2
Confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%. R2 is selected.
R3: I2 ^ I5 => I1
Confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%. R3 is selected.
R4: I1 => I2 ^ I5
Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 => I1 ^ I5
Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 => I1 ^ I2
Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules. A compact Java sketch of the frequent-itemset phase is given below.

Conclusion: Thus, we have successfully implemented the Apriori Association Algorithm in Java and tested it for a variety of training databases.
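For reference, a minimal Java sketch of the frequent-itemset phase. The nine transactions listed here are an assumption: they are reconstructed to match the support counts in the example above (the classic textbook database this example follows); the class name and data-structure choices are our own.

import java.util.*;

public class AprioriDemo {
    public static void main(String[] args) {
        // Transaction database assumed from the example's support counts; min support count = 2.
        List<Set<String>> db = List.of(
                Set.of("I1","I2","I5"), Set.of("I2","I4"), Set.of("I2","I3"),
                Set.of("I1","I2","I4"), Set.of("I1","I3"), Set.of("I2","I3"),
                Set.of("I1","I3"), Set.of("I1","I2","I3","I5"), Set.of("I1","I2","I3"));
        int minSup = 2;

        // L1: frequent 1-itemsets.
        Map<Set<String>, Integer> freq = new LinkedHashMap<>();
        List<Set<String>> level = new ArrayList<>();
        Set<String> items = new TreeSet<>();
        db.forEach(items::addAll);
        for (String it : items) {
            Set<String> c = Set.of(it);
            int sup = support(c, db);
            if (sup >= minSup) { freq.put(c, sup); level.add(c); }
        }
        // Lk: join L(k-1) with itself, prune by the Apriori property, count support.
        while (!level.isEmpty()) {
            List<Set<String>> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i++)
                for (int j = i + 1; j < level.size(); j++) {
                    Set<String> cand = new TreeSet<>(level.get(i));
                    cand.addAll(level.get(j));
                    if (cand.size() != level.get(i).size() + 1) continue;
                    if (next.contains(cand) || !allSubsetsFrequent(cand, freq)) continue;
                    int sup = support(cand, db);
                    if (sup >= minSup) { freq.put(cand, sup); next.add(cand); }
                }
            level = next;
        }
        freq.forEach((s, c) -> System.out.println(s + " support=" + c));
    }

    // Support count: number of transactions containing the itemset.
    static int support(Set<String> itemset, List<Set<String>> db) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(itemset)) n++;
        return n;
    }

    // Apriori property: every (k-1)-subset of a candidate must be frequent.
    static boolean allSubsetsFrequent(Set<String> cand, Map<Set<String>, Integer> freq) {
        for (String it : cand) {
            Set<String> sub = new TreeSet<>(cand);
            sub.remove(it);
            if (!sub.isEmpty() && !freq.containsKey(sub)) return false;
        }
        return true;
    }
}

Running it prints every frequent itemset with its support count, ending with {I1, I2, I3} and {I1, I2, I5} at support 2; rule generation then only needs these counts, as in Step 5 above (confidence = sc(l) / sc(s)).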