2011 International Conference on Recent Trends in Information Systems


     Online Mining of data to Generate Association
            Rule Mining in Large Databases
Archana Singh, Ph.D. Scholar, Amity University, NOIDA (U.P.), 91-9958255675, archana.elina@gmail.com
Megha Chaudhary, M.Tech (CS & Engg.), Amity University, NOIDA (U.P.), nicemegha@gmail.com
Dr. (Prof.) Ajay Rana, Ph.D. (Comp. Science & Engg.), Amity University, NOIDA (U.P.), +91-981811756, ajay_rana@amity.edu
Gaurav Dubey, Ph.D. Scholar, Amity University, NOIDA (U.P.), +919958759459, gdubey1977@gmail.com



ABSTRACT - Data mining is a technology for exploring data, analyzing it, and finally discovering patterns in large data repositories. In this paper, the problem of online mining of association rules in large databases is discussed. Online association rule mining helps to remove redundant rules and gives a compact representation of the rules for the user. A new, more optimized algorithm is proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm, while it still generates all the essential rules, with no rule missing. The use of non-redundant association rules significantly reduces irrelevant noise in the data mining process. The graph-theoretic structure used, called the adjacency lattice, is crucial for online mining of data. The adjacency lattice can be stored either in main memory or in secondary memory; the idea is to pre-store a number of large itemsets in a special format which reduces the disk I/O required to perform a query.

Index Keywords:
Adjacency lattice, Association Rule Mining, Data Mining

                     I INTRODUCTION
Data mining is the process of analyzing data and summarizing it into useful information. More technically, data mining is the process of finding patterns among dozens of fields in large relational databases. Data mining software is one of a number of analytical tools for analyzing data: it allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

A. Overview of the Work Done

Association rule mining, as suggested by R. Agrawal, describes relationships between items in data sets. It helps in finding the items that would be selected given that a certain set of items has already been selected. An improved algorithm for fast rule generation was discussed by Agrawal et al. (1994); two algorithms for generating association rules are given in "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Srikant (1994).

Online mining of data is performed by pre-processing the data effectively in order to make it suitable for repeated online queries. An online association rule mining technique discussed by Charu C. Agrawal et al. (2001) takes a graph-theoretic approach in which the pre-processed data is stored in such a way that online processing can be done by applying a graph search algorithm. That work introduced the concept of the adjacency lattice of itemsets. The adjacency lattice is crucial to performing effective online data mining, and it can be stored either in main memory or on secondary memory. The idea is to pre-store a number of itemsets at a given level of support; these itemsets are stored in a special format (the adjacency lattice) which reduces the disk I/O required to perform a query.

Online generation of rules means finding the association rules online while changing the minimum confidence value. The problem with the existing algorithm is that the lattice has to be constructed again over all large itemsets in order to generate the rules, which is very time consuming for online rule generation. The generated lattice also has many edges, since each frequent itemset has an edge to every one of its supersets in the subsequent level.

This paper develops a new algorithm for online rule generation. A weighted directed graph is constructed, and depth-first search is used for rule generation. In the proposed algorithm, rules can be generated online by building the adjacency matrix once for some confidence value and then generating rules for any confidence threshold higher than the one used for generating the adjacency matrix.
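The two measures used throughout the paper can be made concrete with a small sketch: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X => Y is support(X ∪ Y) / support(X). The transactions below are invented for illustration and are not the paper's dataset.

```python
# Sketch: computing support and confidence over a toy transaction list.
# These transactions are illustrative only, not the paper's dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))        # 0.6
```

Here confidence({bread} => {milk}) = 0.6 / 0.8 ≈ 0.75: the same ratio of supports that the proposed algorithm stores as edge weights.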




          978-1-4577-0792-6/11/$26.00 ©2011 IEEE
A new algorithm has been developed to overcome these difficulties. In this algorithm the number of edges in the generated graph is smaller than in the adjacency lattice, and the algorithm is still capable of finding all the essential rules.

The rest of this paper is organized as follows. Section 2 describes the work done by Charu C. Agrawal (2001). Section 3 describes the new proposed algorithm. Section 4 illustrates the existing and proposed algorithms on an example. Finally, the two algorithms are compared in terms of their complexity.

        II EXISTING ALGORITHM FOR ONLINE RULE GENERATION

The aim of association rule mining (Rakesh et al., 1994) is to detect relationships or patterns between specific values of categorical variables in large data sets. A graph-theoretic approach is used. The main idea in the existing algorithm is to partition the attribute values into transaction patterns; this technique enables analysts and researchers to uncover hidden patterns in large data sets. The pre-processed data is stored in such a way that online rule generation can be done with a complexity proportional to the size of the output. The existing algorithm introduces the concept of an adjacency lattice of itemsets, which is crucial to performing effective online data mining. The adjacency lattice can be stored either in main memory or on secondary memory; the idea is to pre-store as many large itemsets as the available memory permits, at some level of support. These itemsets are stored in a special format (the adjacency lattice) which reduces the disk I/O required to perform a query. In fact, if enough main memory is available for the entire adjacency lattice, then no I/O may need to be performed at all.

A. Adjacency lattice

An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. Specifically, an itemset X is said to be a parent of the itemset Y if Y can be obtained from X by adding a single item to the set X. Clearly, an itemset may have more than one parent and more than one child; in fact, the number of parents of an itemset X is exactly equal to the cardinality of the set X. This observation follows from the fact that for each element i_r in an itemset X, X - i_r is a parent of X. If a directed path exists in the adjacency lattice from the vertex corresponding to Z to the vertex corresponding to X, then X ⊃ Z. In such a case, X is said to be a descendant of Z and Z is said to be an ancestor of X.

B. The Existing Algorithm

There are three steps in the existing algorithm (Charu C. Agrawal et al., 2001).

STEP 1: Generation of the adjacency lattice. The adjacency lattice is created from the frequent itemsets generated by any standard algorithm for some defined minimum support. This support value is called the primary threshold. The itemsets obtained in this way are referred to as prestored itemsets and can be stored in main memory or on secondary memory. The benefit is that the dataset need not be scanned again and again for each value of minimum support and confidence given by the user.

The adjacency lattice L is a directed acyclic graph, constructed as follows. Create a vertex v(I) for each primary itemset I; each vertex is labelled with the value of its support, denoted S(I). For any pair of vertices corresponding to itemsets X and Y, a directed edge exists from v(X) to v(Y) if and only if X is a parent of Y. Note that it is not possible to perform online mining of association rules at support levels below the primary threshold.

STEP 2: Online generation of itemsets. Once the adjacency lattice is stored in main memory, the user can retrieve specific large itemsets as desired. Suppose the user wants to find all large itemsets which contain a set of items I and satisfy a minimum support s; this amounts to the following search in the adjacency lattice: for the given itemset I, find all itemsets J such that v(J) is reachable from v(I) by a directed path in the lattice L and S(J) ≥ s.

STEP 3: Rule generation. Rules are generated from the prestored itemsets for a user-defined minimum support and minimum confidence.

                   III PROPOSED ALGORITHM

The algorithm by Charu et al. (2001) was discussed in the previous section; the proposed algorithm is discussed in detail in the current section. The proposed algorithm also takes a graph-theoretic approach, but the generated graph is a directed graph with weights associated with the edges, and the number of edges is smaller than in the algorithm suggested by Charu et al.

A. Algorithm

The algorithm has two steps. The first step, explained in Section 3(A), is the construction of the graph; the second step, explained in Section 3(B), is rule generation.

Construction of the adjacency lattice



The large itemsets obtained by applying some traditional algorithm for finding frequent itemsets (such as Apriori) are stored in one file, and the corresponding support values are stored in another file. Using these two files, the items and their corresponding supports are stored in a structure, say S. Now an array of structures s(i, j) is created, having two fields, itemsets and support. This array of structures stores large itemsets of different lengths in different dimensions: in the field itemsets of structure s(i, j), 1-itemsets are stored in s(1, j), 2-itemsets in s(2, j), 3-itemsets in s(3, j), and so on. A function named Initialize() has been written for this purpose. The pseudocode for Initialize() is:

Algorithm Initialize(S)
Begin
   for each large itemset ∈ S do
      item1 = s(i).itemset;
      item2 = s(i+1).itemset;
      m1 = length(item1); m2 = length(item2);
      s(j,k).itemsets = item1;
      s(j,k).support = s(i).support;
      increment k;
      if (m2 - m1 != 0)
         put itemsets in the next row of s;
   return s;
End;

Now, to calculate the weight of the edge between itemset X and itemset Y, where (X - Y) is a 1-itemset, calculate the value support(X)/support(Y). If this value is >= the minimum confidence, then there is an edge between the itemset X and the itemset Y, and this edge has weight = support(X)/support(Y). A function is now required to generate the adjacency matrix using the structures S and s. This function takes one large itemset from s(i, j) and compares it with all the itemsets in s(i+1, j). If any superset of this itemset is present in s(i+1, j), it must be determined whether there is a link between them and, if so, what the weight of the link is.

Let an itemset X from structure s(i, j) be taken and searched in S. When the index of itemset X in the structure S, say index1, is obtained, the support of X is easily read. Next, all supersets of X are searched in s(i+1, j). For each itemset Y which is present in s(i+1, j) and is a superset of the itemset X, its index in S, index2, is obtained by searching for it in S. Now the weight = S(index2).support/S(index1).support is calculated; if it is greater than or equal to the minimum confidence, then the entry a[index1, index2] of the adjacency matrix, say a, is assigned a value equal to the weight. The pseudocode for gen_adj_lattice() is given in the following:

Algorithm gen_adj_lattice(S, s)
Begin
   for each row of s do
      item1 = s(i,j).itemsets;
      index1 = find_index(item1, S);
      // find all supersets of item1 in s(i+1, j)
      for each itemset in s(i+1) do
         item2 = s(i+1,k).itemsets;
         if (item2 is a superset of item1)
            index2 = find_index(item2, S);
            confidence = S(index2).support / S(index1).support;
            if (confidence >= minconf)
               adj_lat(index1, index2) = confidence;
   return adj_lat;
End;

In the gen_adj_lattice() function above there is a sub-function to search for an itemset in the structure S, which returns the index of that itemset in the structure. Using this index, the support of the corresponding large itemset can be read. Let an itemset X be searched in S: first find the length of X, then traverse the structure S; only when the length of the current itemset equals the length of X are the two itemsets compared, and if all their items match, the index is returned. The pseudocode for find_index() is given in the following:

Algorithm find_index(item1, S)
Begin
   n1 = length(item1);
   for each itemset in S do
      item2 = S(r).itemsets;
      n2 = length(item2);
      if (n1 == n2)
         if (each item matches)
            index = r;
   return index;
End;

The generated graph is a directed graph in which the largest itemsets are at the first level and the 1-itemsets are at the lowest level. The edges are directed from the (n-1)th level to the nth level, and the weight of an edge equals the support of the itemset at the (n-1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules

Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth-first search in the directed graph. A rule is generated from a visited node and the starting node if and only if it satisfies all the conditions required to generate an essential rule.

Conditions:
   1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.
   2. To reduce simple redundancy: the set of all children of the visited node is generated and compared with the nodes that have already been used by the same starting node for rule generation. If any of the child nodes is found there, no rule can be generated from this visited node, since such a rule would be redundant.

The pseudocode for find_allChild() is given in the following.
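The gen_adj_lattice() and find_index() routines can be sketched compactly in Python. The sketch below assumes, for simplicity, that itemsets and supports live in a single dictionary rather than in the structures S and s, so the names and layout are illustrative rather than the paper's implementation.

```python
# Sketch: building the confidence-weighted adjacency structure.
# An edge X -> Y (Y = X plus one item) gets weight support(Y)/support(X),
# i.e. the confidence of the rule X => (Y - X), and is kept only if it
# reaches the minimum confidence used to build the graph.
def gen_adj_lattice(supports, minconf):
    """supports: dict mapping frozenset itemset -> support value."""
    adj = {}                                    # (X, Y) -> confidence weight
    for x, sx in supports.items():
        for y, sy in supports.items():
            if len(y) == len(x) + 1 and x < y:  # y is a superset, one item larger
                conf = sy / sx
                if conf >= minconf:
                    adj[(x, y)] = conf
    return adj

supports = {                                    # example values
    frozenset("A"): 0.8, frozenset("B"): 0.8, frozenset("D"): 0.8,
    frozenset("AB"): 0.6, frozenset("AD"): 0.6, frozenset("BD"): 0.6,
    frozenset("ABD"): 0.4,
}
adj = gen_adj_lattice(supports, minconf=0.5)
# weight of A -> AB is support(AB)/support(A) = 0.6/0.8, about 0.75
```

Dictionary lookup replaces find_index(): indexing by the frozenset itself retrieves a support in one step instead of a linear scan over S.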




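Condition 1 (the product of edge confidences along the path from the starting node to the visited node must reach the minimum confidence) can be sketched as a depth-first search over the weighted graph. The simple- and strict-redundancy checks are omitted here for brevity, and all names are illustrative, not the paper's implementation.

```python
# Sketch: DFS over the confidence-weighted graph, multiplying edge
# weights along the path; a rule Y => (X - Y) is emitted when the
# accumulated product still meets the minimum confidence.
def generate_rules(adj, start, minconf):
    """adj: dict (subset, superset) -> confidence; start: the large itemset X."""
    # Invert the edges so we can walk from `start` down to its subsets.
    parents = {}
    for (sub, sup), w in adj.items():
        parents.setdefault(sup, []).append((sub, w))

    rules = []
    def dfs(node, product):
        for sub, w in parents.get(node, []):
            p = product * w
            if p >= minconf:
                rules.append((sub, start - sub, p))   # rule: sub => (X - sub)
            dfs(sub, p)                               # weights <= 1, so p only shrinks
    dfs(start, 1.0)
    return rules

adj = {                                               # edge weights, example values
    (frozenset("A"), frozenset("AB")): 0.75, (frozenset("B"), frozenset("AB")): 0.75,
    (frozenset("A"), frozenset("AD")): 0.75, (frozenset("D"), frozenset("AD")): 0.75,
    (frozenset("B"), frozenset("BD")): 0.75, (frozenset("D"), frozenset("BD")): 0.75,
    (frozenset("AB"), frozenset("ABD")): 0.4 / 0.6,
    (frozenset("AD"), frozenset("ABD")): 0.4 / 0.6,
    (frozenset("BD"), frozenset("ABD")): 0.4 / 0.6,
}
rules = generate_rules(adj, frozenset("ABD"), minconf=0.66)
```

Because each edge weight is a ratio of supports, the product along any path from a subset Y up to X telescopes to support(X)/support(Y), which is exactly the confidence of the rule Y => (X - Y).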
Algorithm find_allChild(adj_lat, i)
Begin
   C1 = C = NULL;
   C1 = C = child(adj_lat, i);
   while C1 != NULL do
      for each c ∈ C1 do
         C1 = child(adj_lat, c);
         C = C ∪ C1;
   return C;
End;

A structure, say G, stores the nodes that have already been used for generating rules. They are stored in such a way that the required nodes can be obtained just by reaching the corresponding index. The pseudocode for this is given in the following:

Algorithm node_gen_rule(nodeset: S, G)
Begin
   generatedSet = NULL;
   for each node S(i) ∈ S do
      generatedSet = generatedSet ∪ G(S(i));
   return generatedSet;
End;

To reduce strict redundancy:
   A) The set of all parents of the starting node is generated, and for all these parent nodes, all the nodes which have been used by them for rule generation are found. This set of nodes is compared with the visited node. If the visited node is found there, no rule can be generated from it, because such a rule would be strictly redundant. The pseudocode for find_allParents() is given below.
   B) The set of all children of the visited node and the set of all parents of the starting node are generated, and for all these parent nodes, all the nodes which have been used by them for rule generation are found. This set of nodes is compared with the set of all children. If any child of the visited node is found there, no rule can be generated from this visited node, because such a rule would be strictly redundant.

Algorithm find_allParents(adj_lat, i)
Begin
   P1 = P = NULL;
   P1 = P = parents(adj_lat, i);
   while P1 != NULL do
      for each p ∈ P1 do
         P1 = parents(adj_lat, p);
         P = P ∪ P1;
   return P;
End;

Algorithm GenerateRule(starting node: X, visited node: Y, min conf: c, G)
Begin
   RuleSet = NULL;
   c1 = weighted product of the path(X, Y);
   if (c1 >= c)
      if (~compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
         if (~compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
            if (~compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(X), G)))
               RuleSet = RuleSet ∪ (Y -> (X - Y));
   return RuleSet;
End;

         IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

Both algorithms are now illustrated on an example. The market basket data set used has five transactions and five items. Let the minimum support be 0.4 and the minimum confidence 0.67. The large itemsets with support of at least 0.4, along with their support values, are shown in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets
   ITEMS      SUPPORT
   A=Bread    0.8
   B=Milk     0.8
   C=Beer     0.6
   D=Diaper   0.8
   F=Coke     0.4

Table 4.2: 2-large itemsets
   AB   0.6
   AC   0.4
   AD   0.6
   BC   0.4
   BD   0.6
   BF   0.4
   CD   0.6
   DF   0.4
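The supports in Tables 4.1 and 4.2 determine the edge weights used later in the illustration (Table 4.4): the weight of the edge from a k-itemset to a (k+1)-itemset is the support of the larger itemset divided by the support of the smaller. A quick check in Python, with the support values taken from the tables:

```python
# Edge weights derived from the supports in Tables 4.1 and 4.2.
support = {"A": 0.8, "B": 0.8, "C": 0.6, "D": 0.8, "F": 0.4,
           "AB": 0.6, "AC": 0.4, "AD": 0.6, "BC": 0.4,
           "BD": 0.6, "BF": 0.4, "CD": 0.6, "DF": 0.4}

def weight(sub, sup):
    """Weight of the edge sub -> sup, i.e. confidence of sub => (sup - sub)."""
    return support[sup] / support[sub]

print(round(weight("A", "AB"), 2))   # 0.75
print(round(weight("C", "CD"), 2))   # 1.0
print(round(weight("D", "DF"), 2))   # 0.5
```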




Table 4.3: 3-large itemsets
   ABD   0.4
   ACD   0.4
   BCD   0.4
   BDF   0.4

A. Rule Generation from the proposed algorithm

The weights of the edges between frequent 1-itemsets and frequent 2-itemsets, and between frequent 2-itemsets and frequent 3-itemsets, are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y a (k+1)-itemset; then the weight of the edge from X to Y is equal to the confidence of the rule X => (Y - X).

Table 4.4: Weights of the edges between itemsets
   Edges       Weights
   A - AB      0.75
   A - AC      0.5
   A - AD      0.75
   B - AB      0.75
   B - BC      0.5
   B - BD      0.75
   B - BF      0.5
   C - AC      0.67
   C - BC      0.67
   C - CD      1.0
   D - DF      0.5
   D - AD      0.75
   D - BD      0.75
   D - CD      0.75
   AB - ABD    0.67
   AC - ACD    0.67
   AD - ABD    0.67
   AD - ACD    0.67
   BC - BCD    1.0
   BD - BCD    0.67
   BF - BDF    1.0
   CD - ACD    0.67
   CD - BCD    0.67

The lattice generated for the above example is shown in Figure 4.1.

Figure 4.1: Lattice structure

The resultant graph is shown in Figure 4.2.

Figure 4.2: Graph generated for the rule generation

It can be seen that the lattice generated for the same example has more edges than this graph; the extra edges are shown dotted.

Figure 4.3: Generating the rules for the large itemset ABD

Applying depth-first search starting from the node ABD, node A is the first visited node, but the weighted product (0.67*0.75) of the path from A to ABD is less than the minimum confidence, so node A does not participate in rule generation. Node B is the second visited node, and it also does not participate in rule generation, for the same reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which equals the minimum confidence. The children of AB do not generate any rule, and AB has not been used by any of the parent nodes of ABD; thus all three conditions for rule generation are satisfied. So we will generate the rule




from AB: AB => D. The next visited node is D, but the weighted product of the path from D to ABD is less than the minimum confidence, so no rule is generated, and we move on to the next visited node, AD, which satisfies all three conditions, giving the rule AD => B. The next visited node is BD, which also satisfies all three conditions, giving the rule BD => A.

Similarly, generating the rules for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF yields the rules shown in Table 4.5 below.

Table 4.5: The rules generated
   1    AB => D
   2    AD => B
   3    BD => A
   4    C => AD
   5    AD => C
   6    C => BD
   7    BD => C
   8    F => BD
   9    BD => F
   10   A => B
   11   B => A
   12   A => D
   13   D => A
   14   B => D
   15   D => B
   16   D => C

B. Rules Generated from the Existing algorithm

Generating the rules for the large itemset ABD: choose all ancestors of ABD which have support less than or equal to the value = support(ABD)/c = 0.4/0.67 = 0.6. AB, AD and BD are selected, giving the lattice shown in the figure; AB, AD and BD are clearly the maximal ancestors in the directed graph. Hence we have the rules: AB => D, AD => B, BD => A.

Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset.

Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output in this case is N(I, s) · h(I, s), and the complexity of the existing algorithm is proportional to N(I, s) · h(I, s). In the proposed algorithm, some edges are left which are not visited from their parents; let the number of these be denoted L(I, s). The size of the output in this case is N(I, s) · h(I, s) - L(I, s), and the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) - L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, were discussed; the issues related to association rule mining were identified, and online mining of association rules was introduced to resolve them. Online association mining helps to remove redundant rules and gives a compact representation of the rules for the user. A new algorithm has been proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm, and it still generates all the essential rules, with no rule missing.

Future work will implement both the existing and the proposed algorithms and test them on large datasets such as the Zoo dataset, the Mushroom dataset and synthetic datasets.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases," SIGMOD 1993, pp. 207-214.
[2] Charu C. Agrawal and Philip S. Yu, "A New Approach to Online Generation of Association Rules," IEEE, vol. 13, no. 4, pp. 327-340, 2001.
[3] Dao-I Lin, Zvi M. Kedem, "Pincer search: An efficient algorithm to find maximal frequent itemsets," IEEE Trans. on Knowledge and Data Engineering, vol. 14, no. 3, pp. 333-344, May/June 2002.
[4] Ming Liu, W. Hsu, and Y. Ma, "Mining association rules
                                                                            with multiple minimum supports.’’ In Proceeding of
                                                                            fifth ACM SIGKDD International Conference on
                                                                            Knowledge Discovery and Data Mining, pages 337-341,
                                                                            N.Y., 1999. ACM Press.
                                                                      [5] R. Agrawal, T. Lmielinksi, and A. Swami
                                                                          “Mining association between sets of items in
                                                                            Large databases’’ Conf. Management of
                                                                            Data, Washington, DC, May 1993.
                                                                      [6] Ramakrishna Srikanth and Quoc Vu and
   Figure 4.4 : Directed Graph in Adjacency                                Rakesh Agrawal, ‘’Mining association rules
                                                                           with itemsets constraints.’’ In Proc. Of the 3rd
Total number of 16 rules generated in both algorithms. It was              International Conference on KDD and Data
                                                                           Mining (KDD 97), Newport Beach, California,
found that no essential rules are missing in proposed                      August 1997.
algorithm and also there is no redundancy in the rules                [7] Rakesh Agrawal and Ramakrishna Srikanth, “Fast
                                                                            Algorithm for Mining Association Rules’’ In Proc.
generated.                                                                  20 Int Conf. Very Large Data Base, VLDB, 1994.
                                                                      [8] Data Mining: Concepts and Techniques By “Jiawei
C. Comparison of Algorithms:                                                Han Micheline Kamber’’. Academic press 2001.
Complexity of graph search algorithm is proportional to the
size of output




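The ancestor-selection step of Section B can be sketched in code. This is a minimal sketch, not the paper's implementation: only support(ABD) = 0.4 and the minimum confidence c = 0.67 come from the worked example above; the remaining support values are illustrative assumptions chosen to be consistent with AB, AD and BD being selected.

```python
from itertools import combinations

# Hypothetical supports; only support(ABD) = 0.4 is taken from the paper's
# example, the rest are assumed for illustration.
support = {
    frozenset('A'): 0.7, frozenset('B'): 0.7, frozenset('D'): 0.7,
    frozenset('AB'): 0.5, frozenset('AD'): 0.5, frozenset('BD'): 0.5,
    frozenset('ABD'): 0.4,
}

def rules_for(itemset, minconf):
    """Emit X => (itemset - X) for every maximal ancestor X of `itemset`
    whose support does not exceed support(itemset) / minconf."""
    target = frozenset(itemset)
    threshold = support[target] / minconf
    # Ancestors = proper non-empty subsets whose support is within threshold,
    # i.e. rules X => (target - X) that meet the confidence requirement.
    selected = [frozenset(c)
                for r in range(1, len(target))
                for c in combinations(sorted(target), r)
                if support[frozenset(c)] <= threshold]
    # Keep only maximal selected ancestors (not a proper subset of another
    # selected ancestor) to avoid emitting redundant rules.
    maximal = [x for x in selected if not any(x < y for y in selected)]
    return {(''.join(sorted(x)), ''.join(sorted(target - x))) for x in maximal}

print(sorted(rules_for('ABD', 0.67)))
# -> [('AB', 'D'), ('AD', 'B'), ('BD', 'A')]
```

With these assumed supports the sketch reproduces the three rules of the worked example: the 1-itemset ancestors A, B and D have support above support(ABD)/c ≈ 0.6 and are rejected, while AB, AD and BD qualify and are maximal.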
                                                                131

Contenu connexe

Tendances

K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Dataidescitation
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
 
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach IJECEIAES
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
 
IRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET Journal
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERYMR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERYIJDKP
 
A new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemA new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemIJNSA Journal
 
A cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageMade Artha
 
Introduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIntroduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIJSRD
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010girivaishali
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Miningsnoreen
 
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...IRJET Journal
 
AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019Neha gupta
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...IRJET Journal
 

Tendances (20)

Az36311316
Az36311316Az36311316
Az36311316
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
 
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
IRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and Python
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERYMR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
 
A new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemA new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender system
 
A cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storage
 
Introduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIntroduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering Ensemble
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 

Similaire à Online Mining of Association Rules

Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...BRNSSPublicationHubI
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopBRNSSPublicationHubI
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)DheerajPachauri
 
A study on rough set theory based
A study on rough set theory basedA study on rough set theory based
A study on rough set theory basedijaia
 
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAMINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAcscpconf
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data csandit
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
 
Review on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent ItemsReview on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent Itemsvivatechijri
 
A Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesA Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesIOSR Journals
 
Hybrid approach for generating non overlapped substring using genetic algorithm
Hybrid approach for generating non overlapped substring using genetic algorithmHybrid approach for generating non overlapped substring using genetic algorithm
Hybrid approach for generating non overlapped substring using genetic algorithmeSAT Publishing House
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsDrjabez
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...Nicolle Dammann
 
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERSN ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERScsandit
 
Integrating compression technique for data mining
Integrating compression technique for data  miningIntegrating compression technique for data  mining
Integrating compression technique for data miningDr.Manmohan Singh
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 

Similaire à Online Mining of Association Rules (20)

Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
 
Ap26261267
Ap26261267Ap26261267
Ap26261267
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
D0352630
D0352630D0352630
D0352630
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)
 
395 404
395 404395 404
395 404
 
A study on rough set theory based
A study on rough set theory basedA study on rough set theory based
A study on rough set theory based
 
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAMINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming data
 
Review on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent ItemsReview on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent Items
 
A Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining MethodologiesA Survey on Fuzzy Association Rule Mining Methodologies
A Survey on Fuzzy Association Rule Mining Methodologies
 
Hybrid approach for generating non overlapped substring using genetic algorithm
Hybrid approach for generating non overlapped substring using genetic algorithmHybrid approach for generating non overlapped substring using genetic algorithm
Hybrid approach for generating non overlapped substring using genetic algorithm
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...
 
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERSN ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
 
Integrating compression technique for data mining
Integrating compression technique for data  miningIntegrating compression technique for data  mining
Integrating compression technique for data mining
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 

Online Mining of Association Rules

  • 1. 2011 International Conference on Recent Trends in Information Systems Online Mining of data to Generate Association Rule Mining in Large Databases Archana Singh Megha Chaudhary Dr (Prof.) Ajay Rana Gaurav Dubey Ph.D Scholar, M.tech(CS&Engg) Ph.d(Comp Science&Engg) Ph.d Scholar Amity University Amity University Amity University Amity Univeristy NOIDA (U.P) NOIDA (U.P) NOIDA (U.P) NOIDA(U.P) 91-9958255675 +91-981811756 +919958759459 archana.elina@gmail.com nicemegha@gmail.com ajay_rana@amity.edu gdubey1977@gmail.com ABSTRACT - Data Mining is a Technology to explore data, Association rule mining, as suggested by R. Agrawal, basically describes relationships between items in data sets. It analyze the data and finally discovering patterns from large helps in finding out the items, which would be selected data repository. In this paper, the problem of online mining of provided certain set of items have already been selected. An association rules in large databases is discussed. Online improved algorithm for fast rule generation has been association rule mining can be applied which helps to remove discussed Agrawal et. al (1994). Two algorithms for redundant rules and helps in compact representation of rules generating association rules have been discussed in ‘Fast for user. Algorithms for Mining Association Rules’ by Rakesh In this paper, a new and more optimized algorithm has been proposed for online rule generation. The advantage of this Agrawal and Srikant (1994). algorithm is that the graph generated in our algorithm has The online mining of data is performed by pre-processing the less edge as compared to the lattice used in the existing data effectively in order to make it suitable for repeated algorithm. The Proposed algorithm generates all the essential online queries. An online association rule mining technique rules also and no rule is missing. 
The use of non redundant discussed by Charu C Agrawal at al(2001) suggests a graph association rules help significantly in the reduction of theoretic approach, in which the pre -processed data is stored irrelevant noise in the data mining process. This graph in such a way that online processing may be done by applying theoretic approach, called adjacency lattice is crucial for a graph theoretic search algorithm. In this paper concept of online mining of data. The adjacency lattice could be stored adjacency lattice of itemsets has been introduced. either in main memory or secondary memory. The idea of This adjacency lattice is crucial in performing effective online adjacency lattice is to pre store a number of large item sets in data mining. The adjacency lattice could be stored either in a special format which reduces disc I/O required in performing main memory or on secondary memory. The idea of the query. adjacency is to pre-store a number of item sets at a level of support. These items are stored in a special format (called Index Keywords: adjacency lattice) which reduces the disk I/O required in Adjacency lattice, Association Rule Mining, Data Mining order to perform the query. Online generation of the rules deals with the finding the I INTRODUCTION association rules online by changing the value of the Data Mining is a process of analysis the data and minimum confidence value. Problems with the existing summarizing it into useful information. In other words, algorithm is that the lattice has to be constructed again for all technically, data mining is the process of finding pattern large itemsets, to generate the rules, which is very time among dozens of fields in large relational databases. Data consuming for online generation of rule. The number of edges mining software is one of a number of analytical tools for would be more in the generated lattice as we have edges for a frequent itemset to all its supersets in the subsequent levels. analyzing data. 
It allows users to analyze data from many This paper aims to develop a new algorithm for online rule different dimensions or angles, categorize it, and summarize generation. A weighted directed graph has been constructed the relationships identified. and depth first search has been used for rule generation. In the A. Overview of the Work done proposed algorithm, online rules can be generated by generating adjacency matrix for some confidence value and the generating rules for confidence measure higher than that used for generating the adjacency matrix. 978-1-4577-0792-6/11/$26.00 ©2011 IEEE 126
  • 2. A new algorithm has been developed to overcome these threshold value. The itemsets obtained above are referred as difficulties. In this algorithm the number of edges graph prestored itemsets, and can be stored in main memory or generated is less than the adjacency lattice and it is also secondary memory. This is beneficial in the sense that we capable of finding all the essential rules. need not to refer dataset again and again from different value This paper is divided further into sections as : Section 2 of the min. support and confidence given by the user. describes the work done by Charu C Agarwal(2001). Section The adjacency lattice L is a directed acyclic graph. An itemset 3 describes the new proposed algorithm. Section 4 discusses X is said to be adjacent to an itemset Y if one of them can be the illustration of Existing and proposed algorithm. In the last obtained from the other by adding a single item. The para, the comparison between two algorithms with their adjacency lattice L is constructed as follows: complexity is found. Construct a graph with a vertex v(I) for each primary itemset I. Each vertex I has a label corresponding to the value of its II EXISTING ALGORITHM FOR ONLINE support. This label is denoted by S(I). For any pair of vertices RULE GENERATION corresponding to itemsets X and Y, a directed edge exists The aim of Association Rule Mining (Rakesh et. al, 1994) is from v(X) to v(Y) if and only if X is a parent of Y .Note that it to detect relationships or patterns between specific values of is not possible to perform online mining of association rules at categorical variables in large data sets. Rakesh suggests a levels less than the primary threshold. graph theoretic approach. The main idea of association rule mining in the existing algorithm is to partition the attribute STEP 2: Online Generation of Itemsets: values into Transaction patterns. Basically, this technique Once we have stored adjacency lattice in RAM. 
Now user can enables analysts and researchers to uncover hidden patterns in get some specific large itemsets as he desired. Suppose user large data sets. Here the pre-processed data is stored in such a want to find all large itemsets which contain a set of items I way that online rule generation may be done with a and satisfy a level of minimum support s, then there is need to complexity proportional to the size of the output. In the solve the following search in the adjacency lattice. For a existing algorithm, the concept of an adjacency lattice of given itemset I, find all itemsets J such that v(J) is reachable itemsets has been introduced. This adjacency lattice is crucial from v(I) by a directed path in the lattice L, and satisfies S(J) to performing effective online data mining. The adjacency ≥ s. lattice could be stored either in main memory or on secondary STEP 3: Rule Generation : memory. The idea of the adjacency lattice is to prestore a Rules are generated by using these prestored itemsets for number of large itemsets at a level of support possible given some user defined minimum support and minimum the available memory. These itemsets are stored in a special confidence. format (called the adjacency lattice) which reduces the disk I/O required in order to perform the query. In fact, if enough III PROPOSED ALGORITHM main memory is available for the entire adjacency lattice, The algorithm by Charu et al. (2001) is discussed in previous then no I/O may need to be performed at all. section. Detailed discussion of the proposed algorithm has been done in the current section. Graph theoretic approach A Adjacency lattice has been used in the proposed algorithm. The graph generated An itemset X is said to be adjacent to an itemset Y if one of is a directed graph with weights associated on the edges. Also them can be obtained from the other by adding a single item. 
the number of edges is less compared to that in the algorithm Specifically, an itemset X is said to be a parent of the itemset suggested by Charu et. al. Y, if Y can be obtained from X by adding a single item to the set X. It is clear that an itemset may possibly have more than A. Algorithm one parent and more than one child. In fact, the number of The algorithm has two steps explained below. The first step is parents of an itemset X is exactly equal to the cardinality of explained in the section 3(A) in which we will explain that the set X. This observation follows from the fact that for each how are we going to construct the graph. The second step is Element ir in an itemset X, X -ir is a parent of X. In the lattice explained in section 3(B) in which rule generation is if a directed path exists from the vertex corresponding to Z to explained. the vertex corresponding to X in the adjacency lattice, then Construction of adjacency lattice X Z. In such a case, X is said to be a descendant of Z and Z is said to be an ancestor of X. B. The Existing Algorithm There are three steps in the Existing algorithm explained by (Agarwal et al. 1994) STEP 1: Generation of adjacency lattice: The Adjacency lattice is created using the frequent itemsets generated using any standard algorithm by defining some minimum support. This support value is called primary 127
The large itemsets obtained by applying some traditional algorithm for finding frequent itemsets (such as Apriori) are stored in one file, and the corresponding support values are stored in another file. Using these two files, the items and their corresponding supports are stored in a structure, say S. Now create an array of structures s(i, j) having two fields, itemsets and support. This array of structures stores the large itemsets of different lengths in different dimensions: in the itemsets field of s(i, j), the 1-itemsets are stored in s(1, j), the 2-itemsets in s(2, j), the 3-itemsets in s(3, j), and so on. A function named Initialize() has been written for this purpose. The pseudo code for Initialize() is given in the following:

Algorithm Initialize(S)
Begin
  for each large itemset ∈ S do
    item1 = s(i).itemset;
    item2 = s(i+1).itemset;
    m1 = length(item1);
    m2 = length(item2);
    s(j,k).itemsets = item1;
    s(j,k).support = s(i).support;
    increment k;
    if (difference of lengths of consecutive itemsets != 0)
      put itemsets in the next row of s;
  return s;
End;

Now, to calculate the weight of the edge between an itemset X and an itemset Y, where X - Y is a 1-itemset, calculate the value support(X)/support(Y); if this value is greater than or equal to the minimum confidence, then there is an edge between X and Y, and this edge has weight support(X)/support(Y). A function is required to generate the adjacency matrix using the structures S and s. This function takes one large itemset from s(i, j) and compares it with all the itemsets in s(i+1, j); if an itemset in s(i+1, j) extends the itemset from s(i, j) by one item, it determines whether there is a link between them and, if so, the weight of that link.

Let an itemset X from structure s(i, j) be taken and searched in S. When the index of X in S, say index1, is obtained, the support of X is easily read off. Now search all one-item extensions of X in s(i+1, j): the support is needed for each such itemset Y in s(i+1, j). The index of Y, index2, is obtained by searching it in S. Now weight = S(index2).support/S(index1).support is calculated; if it is greater than or equal to the minimum confidence, then in the adjacency matrix, say a, a[index1, index2] is assigned a value equal to the weight. The pseudo code for gen_adj_lattice() is given in the following:

Algorithm gen_adj_lattice(S, s)
Begin
  for each row of s do
    item1 = s(i,j).itemsets;
    index1 = find_index(item1, S);
    // finding all one-item extensions of item1 in s(i+1, j)
    for each itemset in s(i+1) do
      item2 = s(i+1,k).itemsets;
      if (item2 is a superset of item1)
        index2 = find_index(item2, S);
        confidence = S(index2).support / S(index1).support;
        if (confidence >= minconf)
          adj_lat(index1, index2) = confidence;
  return adj_lat;
End;

In the above gen_adj_lattice() function there is a sub-function, find_index(), to search for an itemset in the structure S; it returns the index of that itemset in the structure, and using this index the support of the corresponding large itemset is obtained. Let an itemset X be searched in S: first find the length of X, then traverse the structure S and, only when the length of the current itemset equals the length of the itemset being searched, compare the two itemsets; if all the items of both itemsets match, return the index. The pseudo code for find_index() is given in the following:

Algorithm find_index(item, S)
Begin
  n1 = length(item);
  for each itemset in S do
    item2 = S(r).itemsets;
    n2 = length(item2);
    if (n1 == n2)
      if (each item matched)
        index = r;
        return index;
End;

The graph generated is a directed graph in which the largest itemsets are at the first level and the 1-itemsets are at the lowest level. The direction of the edges is from the (n-1)th level to the nth level, and the weight of an edge equals the support of the itemset at the (n-1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules
Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth-first search in the directed graph, generating rules from the visited node and the starting node if and only if all the conditions required to generate an essential rule are satisfied.
Conditions:
1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.
2. To reduce simple redundancy: the set of all children of the visited node is generated and compared with the nodes that have already been used by the same starting node for rule generation. If any one of the child nodes is found there, no rule can be generated from this visited node, since such a rule would be redundant.
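The edge construction described above amounts to connecting each frequent itemset to its one-item extensions whose support ratio clears the minimum confidence. A rough Python equivalent (hypothetical names; a plain dictionary stands in for the structures S and s):

```python
def gen_adj_lattice(frequent, minconf):
    """Return weighted edges (x, y) -> confidence for every pair of
    frequent itemsets where y extends x by exactly one item and
    support(y)/support(x) >= minconf."""
    edges = {}
    for x, sx in frequent.items():
        for y, sy in frequent.items():
            if x < y and len(y) == len(x) + 1:   # y = x plus one item
                conf = sy / sx                   # confidence of x -> (y - x)
                if conf >= minconf:
                    edges[(x, y)] = conf
    return edges

freq = {frozenset("A"): 0.8, frozenset("B"): 0.8, frozenset("AB"): 0.6}
edges = gen_adj_lattice(freq, minconf=0.67)
# two edges survive, A -> AB and B -> AB, each with weight 0.6/0.8 (0.75)
```

The double loop is quadratic and only a sketch; the paper's structure s avoids this by comparing the row of k-itemsets against the row of (k+1)-itemsets directly.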
The pseudo code for find_allChild() is given in the following:

Algorithm find_allChild(adj_lat, i)
Begin
  C1 = C = NULL;
  C1 = C = child(adj_lat, i);
  while C1 != NULL do
    for each c ∈ C1 do
      C1 = child(adj_lat, c);
      C = C ∪ C1;
  return C;
End;

There is a structure, say G, which stores the nodes that have already been used for generating rules. They are stored in such a way that the required nodes can be obtained directly by reaching the corresponding index. The pseudo code for the same is given in the following:

Algorithm node_gen_rule(nodeset: S, G)
Begin
  generatedSet = NULL;
  for each node S(i) ∈ S do
    generatedSet = generatedSet ∪ G(S(i));
  return generatedSet;
End;

3. To reduce strict redundancy:
A) The set of all parents of the starting node is generated, and for all these parent nodes we find all the nodes that have been used by them for rule generation. This set of nodes is compared with the visited node. If the visited node is found there, no rule can be generated from it, because such a rule would be strictly redundant.
B) The set of all children of the visited node and the set of all parents of the starting node are generated, and for all these parent nodes we find all the nodes that have been used by them for rule generation. This set of nodes is compared with the set of all children. If any child of the visited node is found there, no rule can be generated from the visited node, because such a rule would be strictly redundant.

The pseudo code for find_allParents() is given in the following:

Algorithm find_allParents(adj_lat, i)
Begin
  P1 = P = NULL;
  P1 = P = parents(adj_lat, i);
  while P1 != NULL do
    for each p ∈ P1 do
      P1 = parents(adj_lat, p);
      P = P ∪ P1;
  return P;
End;

Algorithm GenerateRule(starting node: X, visited node: Y, min conf: c, G)
Begin
  RuleSet = NULL;
  c1 = weighted product of the path(X, Y);
  if (c1 >= c)
    if (~compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
      if (~compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
        if (~compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(adj_lat, X), G)))
          RuleSet = RuleSet ∪ (Y -> (X - Y));
  return RuleSet;
End;

IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

Both algorithms are now illustrated with an example. The market-basket data set taken is shown below in Table 4.1. This data set has five transactions and five items. Let the minimum support be 0.4 and the minimum confidence be 0.67. The various large itemsets having support value greater than or equal to 0.4, along with their support values, are shown in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets
  ITEMS       SUPPORT
  A = Bread    0.8
  B = Milk     0.8
  C = Beer     0.6
  D = Diaper   0.8
  F = Coke     0.4

Table 4.2: 2-large itemsets
  AB  0.6
  AC  0.4
  AD  0.6
  BC  0.4
  BD  0.6
  BF  0.4
  CD  0.6
  DF  0.4
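The rule-generation conditions above can be sketched in Python for a single starting itemset. This is an approximation, not the paper's pseudo code: the DFS over the graph is collapsed into visiting the proper subsets of the starting node largest-first, and only condition 1 plus the simple-redundancy check are modelled; `essential_rules` is a hypothetical name.

```python
from fractions import Fraction
from itertools import combinations

def essential_rules(start, freq, minconf):
    """Emit rules Y -> (start - Y) for the frequent itemset `start`
    when the path confidence support(start)/support(Y) reaches minconf
    (condition 1) and no child (superset) of Y has already been used
    for the same starting node (the simple-redundancy check)."""
    used, rules = [], []
    for n in range(len(start) - 1, 0, -1):
        for y in (frozenset(c) for c in combinations(sorted(start), n)):
            if y not in freq:
                continue
            conf = freq[start] / freq[y]
            if conf < minconf:
                continue                 # condition 1 fails
            if any(u > y for u in used):
                continue                 # a child of y was already used
            rules.append((set(y), set(start - y)))
            used.append(y)
    return rules

# supports of the example data set, kept exact as counts out of 5
counts = {"A": 4, "B": 4, "C": 3, "D": 4, "F": 2, "AB": 3, "AC": 2,
          "AD": 3, "BC": 2, "BD": 3, "BF": 2, "CD": 3, "DF": 2, "ABD": 2}
freq = {frozenset(k): Fraction(v, 5) for k, v in counts.items()}
rules = essential_rules(frozenset("ABD"), freq, Fraction(2, 3))
# for ABD this yields the three rules of the example:
# AB => D, AD => B and BD => A
```

Fractions are used so that the boundary case support(ABD)/support(AB) = 2/3 compares exactly equal to the minimum confidence (0.67 in the paper), avoiding floating-point surprises.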
Table 4.3: 3-large itemsets
  ABD  0.4
  ACD  0.4
  BCD  0.4
  BDF  0.4

A. Rule Generation by the Proposed Algorithm

The weights of the edges from the frequent 1-itemsets to the frequent 2-itemsets, and from the frequent 2-itemsets to the frequent 3-itemsets, are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y be a (k+1)-itemset; then the weight of the edge from X to Y is equal to the confidence of the rule X -> (Y - X).

Table 4.4: Weights of the edges
  Edges      Weights
  A - AB      0.75
  A - AC      0.5
  A - AD      0.75
  B - AB      0.75
  B - BC      0.5
  B - BD      0.75
  B - BF      0.5
  C - AC      0.67
  C - BC      0.67
  C - CD      1.0
  D - DF      0.5
  D - AD      0.75
  D - BD      0.75
  D - CD      0.75
  AB - ABD    0.67
  AC - ACD    0.67
  AD - ABD    0.67
  AD - ACD    0.67
  BC - BCD    1.0
  BD - BCD    0.67
  BF - BDF    1.0
  CD - ACD    0.67
  CD - BCD    0.67

The lattice generated for the above example is shown in Figure 4.1, and the resultant graph is shown in Figure 4.2.

Figure 4.1: Lattice Structure

Figure 4.2: Graph generated for the rule generation

It can be seen that the lattice contains more edges than the graph generated for the same example; these extra edges are shown dotted.

Figure 4.3: Generating the rules for the large itemset ABD

Applying depth-first search starting from the node ABD, the node A is the first visited node, but the weighted product (0.67 × 0.75) of the path obtained from A to ABD is less than the minimum confidence, so the node A does not participate in rule generation. Node B is the second visited node, but it also does not participate in rule generation, for a similar reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which is equal to the minimum confidence. The child nodes of AB have not generated any rule, and AB has not been used by any of the parent nodes of ABD. Thus all three conditions are satisfied for rule generation, so we will generate the rule
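The pruning by weighted path product can be checked numerically against the rounded weights of Table 4.4 (a small sketch with hypothetical variable names):

```python
# edge weights from Table 4.4, rounded as in the paper
w = {("A", "AB"): 0.75, ("AB", "ABD"): 0.67}
minconf = 0.67

path_A = w[("A", "AB")] * w[("AB", "ABD")]  # 0.75 * 0.67 = 0.5025
reaches_A = path_A >= minconf    # False: node A generates no rule
reaches_AB = w[("AB", "ABD")] >= minconf  # True: AB => D is generated
```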
from AB: AB => D. The next visited node is D, but the weighted product of the path from D to ABD is less than the minimum confidence, hence no rule is generated there, and we move to the next visited node, AD, which satisfies all three conditions, giving the rule AD => B. The next visited node is BD, and this node also satisfies all three conditions; thus we have the rule BD => A.

Similarly, generating the rules for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF, we obtain the rules shown in Table 4.5 below.

Table 4.5: The rules generated
  1   AB => D
  2   AD => B
  3   BD => A
  4   C => AD
  5   AD => C
  6   C => BD
  7   BD => C
  8   F => BD
  9   BD => F
  10  A => B
  11  B => A
  12  A => D
  13  D => A
  14  B => D
  15  D => B
  16  D => C

B. Rules Generated by the Existing Algorithm

Generating the rules for the large itemset ABD: choose all the ancestors of ABD which have support less than or equal to the value support(ABD)/c = 0.4/0.67 ≈ 0.6. AB, AD and BD will be selected, so we have the lattice shown in Figure 4.4.

Figure 4.4: Directed graph in the adjacency lattice

It is easy to see that AB, AD and BD are the maximal ancestors of the directed graph shown in the figure. Hence we have three rules: AB => D, AD => B, BD => A.

A total of 16 rules is generated by both algorithms. It was found that no essential rules are missing in the proposed algorithm, and there is also no redundancy in the rules generated.

C. Comparison of the Algorithms

The complexity of the graph search algorithm is proportional to the size of the output.

Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset.

Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output in the existing algorithm is N(I, s) · h(I, s), so the complexity of the existing algorithm is proportional to N(I, s) · h(I, s). In the proposed algorithm some edges are left which are not visited from their parents; let these be denoted by L(I, s). The size of the output in this case is N(I, s) · h(I, s) - L(I, s), so the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) - L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, have been discussed. The issues related to association rule mining were introduced, and online mining of association rules was presented to resolve these issues. Online association mining helps to remove redundant rules and gives the user a compact representation of the rules. A new algorithm has been proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm, while it still generates all the essential rules, with no rule missing.

Future work will implement both the existing and the proposed algorithms and test them on large data sets such as the Zoo, Mushroom and synthetic data sets.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases," SIGMOD 1993, pp. 207-214.
[2] Charu C. Aggarwal and Philip S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 4, pp. 327-340, 2001.
[3] Dao-I Lin and Zvi M. Kedem, "Pincer Search: An Efficient Algorithm to Find the Maximum Frequent Set," IEEE Transactions on Knowledge and Data Engineering, no. 3, pp. 333-344, May/June 2002.
[4] Bing Liu, W. Hsu, and Y. Ma, "Mining association rules with multiple minimum supports," in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337-341, N.Y., 1999, ACM Press.
[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Conf. Management of Data, Washington, DC, May 1993.
[6] Ramakrishnan Srikant, Quoc Vu and Rakesh Agrawal, "Mining association rules with item constraints," in Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 1997.
[7] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules," in Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 1994.
[8] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Academic Press, 2001.