Finding Symmetric Association Rules to Support Medical Qualitative
                                Research

                        Razan Paul, Abu Sayed Md. Latiful Hoque
                    Department of Computer Science and Engineering,
       Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
                  razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd


Abstract

In medical qualitative research, medical researchers analyze historical patient data to verify known relationships and to discover unknown relationships among medical attributes. All existing algorithms for this problem use asymmetric measures, so only one direction of a rule (P -> Q or Q -> P) is taken into account. However, medical researchers are interested in finding both asymmetric and symmetric relationships among medical attributes. We have developed pruning strategies and devised an efficient algorithm for the symmetric relationship problem. We propose measuring the interestingness of known and unknown symmetric relationships via the correlation between antecedent items and consequent items. We have demonstrated the effectiveness of the approach by testing it on a real dataset.

1. Introduction

In medical qualitative research, medical researchers are interested in finding association rules to see the relationships among specified items and to see how one group of items is related to a different group of items. For instance, a medical researcher can discover the relationship between the age and the HbA1c% of a patient. Medical researchers are interested in finding relationships among various diseases, lab tests, symptoms, etc. Due to the high dimensionality of medical data, conventional association mining algorithms [1-8] discover a very large number of rules with many attributes, which are tedious and redundant for medical researchers and do not involve their desired set of attributes. Medical researchers may need to find relationships between rare and highly frequent medical items, but conventional mining processes for association rules explore interesting relationships between data items that occur frequently together [1-7].

The rare item problem is presented in [9]. According to this problem, if the minimum support is set too high, the algorithms produce rules that eliminate all infrequent itemsets. On the other hand, if the minimum support is set too low, the algorithms produce far too many meaningless rules. To deal with this problem, many algorithms have been proposed to mine rare associations [10-13]. However, these algorithms do not find relationships between rare and highly frequent medical items. In [7], the authors propose a few algorithms that allow a user to specify Boolean expressions over the presence or absence of items in an association rule, or to specify a certain hierarchy [6] of items in an association rule. These approaches are not sufficient to mine the rules that medical researchers desire.

All existing algorithms [14-18] for discovering interesting association rules in medical data find only asymmetric patterns, whereas medical researchers are interested in finding both asymmetric and symmetric relationships among medical attributes. For these reasons, we propose an association mining algorithm that finds rules among the attributes of interest to the researcher, so that it can support the researchers' decision making. The problem in discovering relationships is to avoid redundant relationships and to control their quality. The algorithm allows researchers to define the following constraints: group information of attributes; minimum confidence and support for each group; and which items may appear in the antecedent, which in the consequent, and which in both. One attribute can belong to several groups.

2. Mapping complex medical data to mineable items

For knowledge discovery, medical data have to be transformed into a suitable transaction format. We have addressed the problem of mapping complex medical data to items using a domain dictionary and a rule base, as shown in Figure 1. Medical data include categorical, continuous numerical, Boolean, interval, percentage, fraction and ratio types. Medical domain




978-1-4244-7571-1/10/$26.00 ©2010 IEEE                    81
experts know how to map ranges of numerical data for each attribute to a series of items. For example, there are certain conventions for considering a person young, adult, or elderly with respect to age. A set of rules is created for each continuous numerical attribute using the knowledge of medical domain experts. A rule engine is used to map continuous numerical data to items using these rules.

We have used a domain dictionary approach to transform data for which medical domain expert knowledge is not applicable into numerical form. As the cardinality of attributes other than continuous numeric data is not high in the medical domain, these attribute values are mapped to integer values using medical domain dictionaries. The mapping process is therefore divided into two phases. Phase 1: a rule base is constructed from the knowledge of medical domain experts, and dictionaries are constructed for attributes where domain expert knowledge is not applicable. Phase 2: attribute values are mapped to integer values using the corresponding rule base and dictionaries.
Actual patient data:

    Patient ID   Age   Smoke   Diagnosis
    1020D        33    Yes     Headache
    1021D        63    No      Fever

Dictionaries generated for each categorical attribute:

    Dictionary of Diagnosis attribute        Dictionary of Smoke attribute
    Original value   Mapped value            Original value   Mapped value
    Headache         1                       Yes              1
    Fever            2                       No               2

Rule base built from medical domain knowledge:

    If age <= 12 then 1
    If 13 <= age <= 60 then 2
    If 60 <= age then 3
    If smoke = y then 1
    If smoke = n then 2
    If Sex = M then 1
    If Sex = F then 2

Data suitable for knowledge discovery (mapped to integer items using the rule base and dictionaries):

    Patient ID   Age   Smoke   Diagnosis
    1020D        2     1       1
    1021D        3     2       2

Figure 1. Data transformation of medical data
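The two-phase mapping of Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports a C# implementation), and every name in it (`age_rule`, `RULE_BASE`, `build_dictionary`, `map_records`, `PATIENTS`) is hypothetical. It assumes that the age ranges of Figure 1 are the rule base for the Age attribute, that age 60 falls in the second range (the figure lists 60 in both the second and third rule), and that dictionary codes are assigned in first-seen order.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def age_rule(age: float) -> int:
    """Rule for the continuous Age attribute, following the ranges in Figure 1.

    Assumption: age 60 is mapped to item 2, since the figure's ranges overlap at 60.
    """
    if age <= 12:
        return 1
    if age <= 60:
        return 2
    return 3


# Hypothetical rule base: one mapping function per continuous attribute.
RULE_BASE: Dict[str, Callable[[float], int]] = {"Age": age_rule}


def build_dictionary(values: List[object]) -> Dict[object, int]:
    """Phase 1: give each distinct categorical value an integer code (first-seen order)."""
    mapping: Dict[object, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def map_records(
    records: List[Record],
    rule_base: Dict[str, Callable[[float], int]],
    id_field: str = "ID",
) -> Tuple[List[Record], Dict[str, Dict[object, int]]]:
    """Map raw patient records to integer items using the rule base and dictionaries."""
    # Phase 1: build a dictionary for every attribute that has no expert rule.
    categorical = [a for a in records[0] if a != id_field and a not in rule_base]
    dictionaries = {a: build_dictionary([r[a] for r in records]) for a in categorical}

    # Phase 2: replace every attribute value by its integer item.
    mapped: List[Record] = []
    for record in records:
        row: Record = {id_field: record[id_field]}
        for attribute, value in record.items():
            if attribute == id_field:
                continue
            if attribute in rule_base:
                row[attribute] = rule_base[attribute](value)
            else:
                row[attribute] = dictionaries[attribute][value]
        mapped.append(row)
    return mapped, dictionaries


# The two example patients of Figure 1.
PATIENTS: List[Record] = [
    {"ID": "1020D", "Age": 33, "Smoke": "Yes", "Diagnosis": "Headache"},
    {"ID": "1021D", "Age": 63, "Smoke": "No", "Diagnosis": "Fever"},
]

if __name__ == "__main__":
    mapped_rows, _ = map_records(PATIENTS, RULE_BASE)
    for row in mapped_rows:
        print(row)
```

In this sketch the Smoke attribute is handled by a dictionary, whereas Figure 1 also lists rules for it; either route yields the same integer items for the two example patients.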
3. The proposed algorithm

The algorithm is based on two observations. First, interesting relationships among medical attributes are concealed in subsets of the attributes and do not emerge when all attributes are taken together. Second, interesting relationships among medical attributes do not all have the same support and confidence. The algorithm constructs candidate itemsets based on group constraints and uses the corresponding support of each group in the candidate selection process to discover all desired itemsets of that group. The goals of the algorithm are to find the rules desired by medical researchers and to run fast. The features of the proposed algorithm are as follows:

        • It allows grouping of attributes to find relationships among medical attributes. This provides control over the search process.
        • Minimum confidence and support can vary from one group to another.
        • One item can belong to several groups.
        • Attributes are constrained to appear on the antecedent side, the consequent side, or both sides of a rule.
        • It does not generate subsets of the full desired itemset, but only of the items that can appear in both the consequent and the antecedent.
        • Uninteresting relationships among medical attributes are avoided in the candidate generation phase, which reduces the number of rules, finds only interesting relationships and makes the algorithm fast.

Confidence is not an ideal measure for ranking symmetric medical relationships because it does not account for the frequency of the consequent together with the antecedent. For ranking medical relationships, a direct measure of association between variables is a more suitable scheme. For a medical relationship $s \Rightarrow t$, $s$ is a group of medical items where each item is constrained to appear in the antecedent or in both, and $t$ is a group of medical items where each item is constrained to appear in the consequent or in both. Moreover, $s \cap t = \emptyset$. For this relationship, the support is defined as $\text{support} = P(s, t)$ and the confidence is defined as $\text{confidence} = P(s, t) / P(s)$, where $P$ is the probability. The correlation coefficient (also known as the $\phi$-coefficient) measures the degree of relationship between two random variables by measuring the degree of linear interdependency. It is defined as the covariance between the two variables divided by the product of their standard deviations:

$$\rho_{st} = \frac{\mathrm{Cov}(s, t)}{\sigma_s \, \sigma_t}$$

Here $\mathrm{Cov}(s, t)$ represents the covariance of the two variables, and $\sigma_s$ and $\sigma_t$ stand for the standard
deviations. The covariance measures how two variables change together:

$$\mathrm{Cov}(s, t) = P(s, t) - P(s)\,P(t)$$

The standard deviation is the square root of the variance, and the variance is a special case of the covariance when the two variables are identical:

$$\sigma_s = \sqrt{\mathrm{Var}(s)} = \sqrt{\mathrm{Cov}(s, s)} = \sqrt{P(s, s) - P(s)\,P(s)} = \sqrt{P(s) - P(s)^2}$$

Similarly, $\sigma_t = \sqrt{P(t) - P(t)^2}$, so that

$$\rho_{st} = \frac{P(s, t) - P(s)\,P(t)}{\sqrt{P(s) - P(s)^2}\,\sqrt{P(t) - P(t)^2}}$$

Here $P(s, t)$ is the support of the itemset consisting of both $s$ and $t$; let this support be $S_{st}$. $P(s)$ and $P(t)$ are the supports of the antecedent $s$ and the consequent $t$ respectively; let them be $S_s$ and $S_t$. The values of $S_{st}$, $S_s$ and $S_t$ are computed during the desired itemset generation of our proposed algorithm. Using these values, we can calculate the correlation of every medical relationship rule from one group of medical items to another group of medical items. The correlation value indicates to medical researchers how strong a medical relationship is in the light of historical data.

$$\rho_{st} = \frac{S_{st} - S_s S_t}{\sqrt{S_s - S_s^2}\,\sqrt{S_t - S_t^2}}$$

By substituting the values of $S_{st}$, $S_s$ and $S_t$ in the association rule generation phase, we obtain a single metric, the correlation coefficient, that represents how strongly the antecedent and the consequent are medically related. For each medical relationship or rule, this metric is used to indicate the strength of the relationship between one group of items and another group of items, to support medical qualitative research. The value of $\rho_{st}$ ranges between -1 and +1. If two variables are independent then $\rho_{st}$ equals 0. When $\rho_{st}$ equals +1 the variables are considered perfectly positively correlated. A positive correlation is evidence of a general tendency that when a group of attribute values $s$ occurs for a patient, another group of attribute values $t$ occurs for the same patient. A more positive value means a stronger relationship. When $\rho_{st}$ equals -1 the variables are considered perfectly negatively correlated.

Figure 2 shows the association mining algorithm to support medical research. Like Apriori, our algorithm is based on level-wise search. The major difference from Apriori is the candidate generation process. Each item consists of an attribute name and its value. Having retrieved the information of a 1-itemset, we create a new 1-itemset if it has not been created already; otherwise we update its support. A 1-itemset can belong to zero or more groups. A 1-itemset is selected if its support is greater than or equal to the support of one of its corresponding groups. As a medical attribute value contains patient information that is multidimensional, the algorithm performs the count operation by comparing the values of attributes, instead of determining the presence or absence of attribute values, to calculate support.

3.1. Candidate Generation and Selection

The intuition behind candidate generation in all level-wise algorithms like Apriori is based on the following simple fact: every subset of a frequent itemset is frequent, so the number of itemsets that have to be checked can be reduced. In the proposed algorithm, however, the idea behind candidate generation is that every item in an itemset has to be in the same group. This idea produces new candidates that consist of items in the same group and keeps itemsets that consist of both rare items and highly frequent items. If all the items in a new candidate are in the same group, it is selected as a valid candidate; otherwise it is not added to the valid candidate itemsets. Each group has its own support and confidence. Each candidate itemset belongs to a particular group. After finding the group id of a candidate itemset, the algorithm uses the corresponding support for candidate selection, whereas Apriori uses a single support threshold for all candidate itemsets. In this way, the itemsets desired by medical researchers are explored.

3.2. Generating association rules

Let AC(item) be the function that returns one of three values: 1 if the item is constrained to be in the antecedent of a rule, 2 if it is constrained to be in the consequent, and 0 if it can be in either. Using this function, an itemset is partitioned into an antecedent set, a consequent set and a both set. Moreover, unlike conventional association mining algorithms, the algorithm does not apply subset generation to whole itemsets to form rules; it applies subset generation only to the both set. Each subset of the both set is added to the antecedent part in one rule and to the consequent part in another rule. Each itemset belongs to a particular group. In addition, there is a different confidence for each group, whereas Apriori uses a single confidence for all itemsets. After finding the group id of an itemset, the algorithm uses the corresponding confidence to form rules. In this way, the rules desired by medical researchers are explored.
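The rule generation of Section 3.2 and the correlation ranking of Section 3 can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the authors' code: the names `make_support`, `correlation`, `subsets` and `generate_rules` are hypothetical, supports are fractions of records, a rule with an empty antecedent or consequent is skipped, and a zero denominator in the correlation is reported as 0.0. Iterating over every subset Y of the both set already covers the complementary assignment (B - Y to the antecedent), so each distinct rule is produced once.

```python
from itertools import combinations
from math import sqrt


def make_support(transactions):
    """Return a function giving the fraction of records that contain all given items."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    return support


def correlation(s_st, s_s, s_t):
    """Correlation of antecedent and consequent computed from the three supports."""
    denom = sqrt(s_s - s_s ** 2) * sqrt(s_t - s_t ** 2)
    if denom == 0:  # assumption: undefined correlation is reported as 0.0
        return 0.0
    return (s_st - s_s * s_t) / denom


def subsets(items):
    """Yield every subset of items, including the empty set and the full set."""
    ordered = sorted(items)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def generate_rules(itemset, ac, support, min_confidence):
    """Form rules from one desired itemset.

    ac(item) returns 1 (antecedent only), 2 (consequent only) or 0 (either side).
    Returns {(antecedent, consequent): (confidence, correlation)}.
    """
    both = frozenset(i for i in itemset if ac(i) == 0)
    antecedent = frozenset(i for i in itemset if ac(i) == 1)
    consequent = frozenset(i for i in itemset if ac(i) == 2)

    rules = {}
    for y in subsets(both):
        a = antecedent | y           # Y joins the antecedent
        c = consequent | (both - y)  # the rest of the both set joins the consequent
        if not a or not c:
            continue                 # assumption: both sides must be non-empty
        confidence = support(itemset) / support(a)
        if confidence >= min_confidence:
            rho = correlation(support(itemset), support(a), support(c))
            rules[(a, c)] = (confidence, rho)
    return rules


if __name__ == "__main__":
    records = [
        {"age=3", "smoke=1", "dx=2"},
        {"age=3", "smoke=1", "dx=2"},
        {"age=2", "smoke=2", "dx=1"},
        {"age=3", "smoke=2", "dx=1"},
    ]
    roles = {"age=3": 1, "smoke=1": 0, "dx=2": 2}
    found = generate_rules(frozenset(roles), roles.get, make_support(records), 0.6)
    for (a, c), (conf, rho) in found.items():
        print(sorted(a), "=>", sorted(c), round(conf, 3), round(rho, 3))
```

Ranking the returned rules by the second element of each value gives the correlation-based ordering described above, while the per-group confidence threshold is passed in as `min_confidence`.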
Algorithm: Find itemsets which have high support and are in the same group.
  Input: Data and metadata files.
  Output: Itemsets which are desired by medical researchers.
  1. K = 1;
  2. Read the metadata about which attributes can only appear in the antecedent of a rule, can only appear in the consequent, and can appear in either.
  3. Read the group information, along with each group's support and confidence, from the configuration file and build a dictionary in which the key is the attribute number and the value is the list of group numbers to which the corresponding attribute belongs.
  4. IK = Select 1-itemsets that have support greater than or equal to one of their corresponding group supports.
  5. While (IK ≠ Ø)
       5.1 K++;
       5.2 CK = Candidate_generation(IK-1)
       5.3 CalculateCandidatesSupport(CK)
       5.4 IK = SelectDesiredItemSetFromCandidates(CK, GroupSupports);
       5.5 I = I U IK
  6. return I

  procedure Candidate_generation(IK-1: frequent (K-1)-itemsets)
  1. for each itemset i1 ∈ IK-1
     1.1 for each itemset i2 ∈ IK-1
          1.1.1 new candidate NC = Union(i1, i2);
          1.1.2 if size of NC is K
             1.1.2.1 isInSameGroup = TestWhetherAllTheItemsInSameGroup(NC)
             1.1.2.2 if (isInSameGroup == true)
                   1.1.2.2.1 add NC to CK; otherwise discard it.
  2. return CK;

  procedure SelectDesiredItemSetFromCandidates(CK, GroupSupports)
  1. for each candidate c ∈ CK
       1.1 j = FindGroupNoWhichHasMinimumSupportIfMultipleGroupsExist(c)
       1.2 if c.support >= GroupSupports[j]
       1.3    add c to I
  2. return I

Algorithm: Find association rules for decision support of medical researchers.
  Input: I: itemsets, GroupConfidences
  Output: R: set of rules
  1. R = Ø
  2. For each X ∈ I
       2.1 j = FindGroupNoWhichHasMinimumConfidenceIfMultipleGroupsExist(X)
       2.2 Both set B = (b1, b2, …, bn) where bi ∈ X and AC(bi) = 0
       2.3 Antecedent set AS = (as1, as2, …, asn) where asi ∈ X and AC(asi) = 1
       2.4 Consequent set CS = (cs1, cs2, …, csn) where csi ∈ X and AC(csi) = 2
       2.5 For each subset Y of B
            2.5.1 Y1 = B - Y;
            2.5.2 AS1 = AS U Y
            2.5.3 CS1 = CS U Y1
            2.5.4 if (Support(AS1 U CS1) / Support(AS1)) >= GroupConfidences[j]
                    2.5.4.1 AS1 ⇒ CS1 is a valid rule.
                    2.5.4.2 R = R U (AS1 ⇒ CS1)
            2.5.5 AS2 = AS U Y1
            2.5.6 CS2 = CS U Y
            2.5.7 if (Support(AS2 U CS2) / Support(AS2)) >= GroupConfidences[j]
                    2.5.7.1 AS2 ⇒ CS2 is a valid rule.
                    2.5.7.2 R = R U (AS2 ⇒ CS2)
Figure 2: Association mining algorithm to support medical research
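The itemset-mining half of Figure 2 can be approximated by the following Python sketch. It is a simplified, in-memory illustration rather than the authors' implementation: the function names (`groups_of`, `candidate_generation`, `select_desired`, `mine_itemsets`) and the data layout (transactions as sets of items, `item_groups` mapping an item to its group ids, `group_supports` mapping a group id to its minimum support) are assumptions made for the example. When a candidate belongs to several groups, the smallest group support is used as the threshold, mirroring the "minimum support if multiple groups exist" step.

```python
from itertools import combinations


def groups_of(itemset, item_groups):
    """Return the group ids shared by every item of the itemset (empty set if none)."""
    shared = None
    for item in itemset:
        groups = set(item_groups.get(item, ()))
        shared = groups if shared is None else shared & groups
    return shared or set()


def candidate_generation(prev_itemsets, k, item_groups):
    """Join (k-1)-itemsets and keep only k-itemsets whose items share a group."""
    candidates = set()
    for i1, i2 in combinations(prev_itemsets, 2):
        nc = i1 | i2
        if len(nc) == k and groups_of(nc, item_groups):
            candidates.add(nc)
    return candidates


def select_desired(candidates, support, item_groups, group_supports):
    """Keep candidates whose support reaches the smallest support of their groups."""
    selected = set()
    for c in candidates:
        threshold = min(group_supports[g] for g in groups_of(c, item_groups))
        if support(c) >= threshold:
            selected.add(c)
    return selected


def mine_itemsets(transactions, item_groups, group_supports):
    """Level-wise search with per-group support thresholds."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    all_items = {i for t in transactions for i in t}
    # 1-itemsets: support must reach the threshold of at least one of the item's groups.
    level = {
        frozenset([i])
        for i in all_items
        if any(support(frozenset([i])) >= group_supports[g]
               for g in item_groups.get(i, ()))
    }
    desired, k = set(level), 1
    while level:
        k += 1
        candidates = candidate_generation(level, k, item_groups)
        level = select_desired(candidates, support, item_groups, group_supports)
        desired |= level
    return desired


if __name__ == "__main__":
    records = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "c"}, {"a", "r", "c"}]
    membership = {"a": [1], "b": [1], "r": [1], "c": [2]}
    thresholds = {1: 0.25, 2: 0.9}
    for itemset in sorted(mine_itemsets(records, membership, thresholds), key=sorted):
        print(sorted(itemset))
```

In the toy data, the rare item `r` survives together with the frequent item `a` because group 1 has a low threshold, while `{a, c}` is never generated because the two items share no group.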
3.2.1. Lemma 1. The number of rules is equal to $\sum_{i=1}^{k} 2^{L(D2_i)}$, where $k$ is the number of desired itemsets, $L$ is the function that determines the number of items in an itemset, and $D2$ is the both set. The number of discarded rules is $m^p - \sum_{i=1}^{k} 2^{L(D2_i)}$.

Proof: Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items. Let $G = \{g_1, g_2, g_3, \ldots, g_q\}$ be the set of groups. Let $R = \{r_1, r_2, r_3, \ldots, r_s\}$ be the set of restrictions. GS is the function that finds the group with the smallest confidence. If not all items are in the same group, GS returns NULL. A 1-itemset is selected if S(1-itemset) >= S(GS(1-itemset)), where S is the function that returns the support. Let $C = \{c_1, c_2, c_3, \ldots, c_x\}$ be the set of candidate itemsets. A new candidate NC is added to C if all of its items are in the same group. A candidate $c_i$ is selected for rule generation if $S(c_i) \ge S(GS(c_i))$. A desired itemset D is partitioned into three parts, D = {D0, D1, D2}: D0 is mapped to antecedent items, D1 is mapped to consequent items, and D2 is mapped to both. Each subset d of D2 is added to both the antecedent and the consequent. When d is added to the antecedent, D2 - d is added to the consequent. On the other hand, when d is added to the consequent, D2 - d is added to the antecedent. L is the function that determines the number of items in an itemset. The number of rules from D is $2^{L(D2)}$, so the total number of rules is $\sum_{i=1}^{k} 2^{L(D2_i)}$, where $k$ is the number of desired itemsets. Let $m$ be the average number of distinct values that each multidimensional attribute holds, and let $p$ be the number of attributes. The number of possible different rules is $m^p$. The number of discarded rules is therefore $m^p - \sum_{i=1}^{k} 2^{L(D2_i)}$.

4. Results and discussion

The experiments were run on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and 3 GB of main memory. The operating system was Microsoft Windows Vista and the implementation language was C#. We used one dataset to verify our method: a patient dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances and 514 attributes (150 discrete and 364 numerical attributes). It contains all categories of healthcare data: ratio, interval, decimal, integer, percentage, etc. All these data are converted into mineable items (integer representation) using the domain dictionary and rule base. We have taken an
average value from 10 trials for each test result.

Table 1. Test result for patient dataset

    Parameter                                                  Run 1                 Run 2
    Number of groups                                           4                     8
    Support for each group                                     .55, .64, .76, .45    .47, .84, .66, .55, .85, .94, .86, .35
    Correlation for each group                                 .71, .41, .51, .61    .63, .85, .82, .76, .91, .73, .82, .71
    Number of items constrained to the antecedent, per group   4, 4, 4, 4            5, 4, 5, 6, 4, 5, 5, 7
    Number of items constrained to the consequent, per group   1, 2, 2, 1            1, 2, 2, 1, 1, 2, 2, 1
    Number of items constrained to both, per group             0, 0, 0, 0            1, 1, 1, 1, 1, 1, 1, 0
    Total number of desired itemsets                           125                   311
    Total number of desired rules                              21                    28
    Time (seconds)                                             173.09                556.11

Table 1 shows the test results for the patient dataset after running the proposed algorithm with different parameters. The second column of the table presents the result where we used 4 groups, minimum supports of 45%-76% and correlations of .41-.71 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 6. In total, 125 desired itemsets were generated and 21 rules were discovered. It took about 3461 seconds to find these rules. The third column of the table presents the result where we used 8 groups, minimum supports of 35%-94% and correlations of .63-.91 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 8. In total, 311 desired itemsets were generated and 28 rules were discovered. It took about 11122 seconds to find these rules.

[Figure 3 plot: time in seconds (0 to 2000) against number of groups (4, 8, 12), with one series each for group sizes 4, 10 and 18]
Figure 3: Time comparison of the proposed algorithms for the patient dataset based on number of groups

Figure 3 shows how time varies with the number of groups for the medical research algorithm. We measured the performance of the algorithm in terms of the number of groups, keeping the group size, the support and confidence of each group, and the antecedent and consequent constraints on attributes constant. Time does not vary significantly because the number of groups does not reduce disk access: it affects neither the number of candidate generation phases nor the number of support calculation phases. The number of groups affects only the number of valid candidates generated, which can save some CPU time.

[Figure 4 plot: time in seconds (0 to 2000) against group size, with one series each for 4, 8 and 12 groups]
Figure 4: Time comparison of the proposed algorithms for the patient dataset based on group size

Figure 4 shows how time varies with group size for the medical research algorithm. Here we measured the performance of the algorithm in terms of group size, keeping the number of groups, the support and confidence of each group, and the antecedent and consequent constraints on attributes constant. Time varies significantly because group size affects disk access: it determines the number of candidate generation phases and the number of support calculation phases.

[Figure 5 plot: accuracy (0 to 1) against correlation (0.5, 0.7, 0.85), with one series each for group sizes 4, 10 and 18]
Figure 5: Accuracy of test result for the patient dataset based on correlation

Figure 5 illustrates the accuracy results for our proposed algorithm. The correlation value for each presented result is also indicated. To measure accuracy, we intentionally discovered relationships among attributes for which trends are known. We calculated accuracy as the ratio of the number of correctly discovered relationships to the total number of discovered relationships. A discovered relationship is correct if it is one of the known trends of the medical domain. The figure shows that an average accuracy of 55% is achieved with correlation 0.5. With correlation 0.7 the proposed algorithm achieves an average accuracy of 85.66%, and with correlation 0.85 it achieves an average accuracy of 94.66%. As
accuracy refers to the rate of correct values in the                   Large Databases," in Proceedings of the 1993 ACM
data, the figure represents the success of our                         SIGMOD international conference on Management of
proposed data mining algorithm.                                        data, Washington, D.C., 1993, pp. 207-216.
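The accuracy measure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `relationship` and `rule_accuracy`, the modelling of a symmetric relationship as an unordered pair of item groups, and the sample items are all hypothetical and are not taken from the paper's implementation or patient dataset.

```python
from typing import FrozenSet, Iterable, Set

# Assumption of this sketch: a symmetric relationship is an unordered pair of
# item groups, so the pair (s, t) compares equal to the pair (t, s).
Relationship = FrozenSet[FrozenSet[str]]


def relationship(s: Iterable[str], t: Iterable[str]) -> Relationship:
    """Build an unordered pair from two groups of items."""
    return frozenset({frozenset(s), frozenset(t)})


def rule_accuracy(discovered: Set[Relationship], known: Set[Relationship]) -> float:
    """Ratio of correctly discovered relationships to all discovered relationships."""
    if not discovered:
        raise ValueError("no relationships were discovered")
    correct = discovered & known  # discovered relationships that are known trends
    return len(correct) / len(discovered)


# Hypothetical data, for illustration only.
known_trends = {
    relationship(["age=elder"], ["hba1c=high"]),
    relationship(["smoke=yes"], ["diagnosis=headache"]),
}
found = {
    relationship(["hba1c=high"], ["age=elder"]),   # same pair, other direction
    relationship(["smoke=yes"], ["diagnosis=headache"]),
    relationship(["sex=m"], ["diagnosis=fever"]),  # not a known trend
    relationship(["age=young"], ["smoke=no"]),     # not a known trend
}
print(rule_accuracy(found, known_trends))  # 2 correct out of 4 discovered -> 0.5
```

Using unordered pairs is a design choice of the sketch: it keeps a relationship and its reversed form from being counted as two different discoveries, which matches the symmetric reading of a rule.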
5. Conclusion

    Medical researchers are interested in finding relationships among various diseases, lab tests, symptoms, etc. Because medical data are high-dimensional, conventional association mining algorithms discover a very large number of rules with many attributes, which are tedious and redundant for medical researchers and do not fall within their desired set of attributes. In this paper, we have proposed an association rule mining algorithm for finding symmetric association rules to support medical qualitative research. The main theme of this algorithm is based on the following two statements: interesting relationships among various medical attributes are concealed in subsets of the attributes but do not emerge when all attributes are taken together, and not all interesting relationships among medical attributes have the same support and correlation. The algorithm constructs candidate item sets based on the group constraint and uses the corresponding support of each group in the candidate selection process to discover all possible desired item sets of that group. We propose measuring the interestingness of known and unknown symmetric relationships via the correlation measure of antecedent items and consequent items. The proposed algorithm has been applied to a real-world medical data set, and we have shown significant accuracy in its output. Although we have used level-wise search for finding symmetric association rules, each step of our algorithm is different from any level-wise search algorithm. Rule generation from the desired item sets is also different from conventional association mining algorithms.

6. References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994, pp. 487-499.
[2] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, United States, 1997, pp. 255-264.
[3] J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash based Algorithm for mining association rules," in Proc. ACM SIGMOD Conf. Management of Data, New York, NY, USA, 1995, pp. 175-186.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Very Large Databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993, pp. 207-216.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo, "Efficient Algorithms for Discovering Association Rules," in AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 181-192.
[6] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," in Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, 1995.
[7] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with item constraints," in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 67-73.
[8] A. Savasere, E. Omiecinski, and S. B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," in Proceedings of the 21st International Conference on Very Large Data Bases, 1995, pp. 432-444.
[9] H. Mannila, "Database methods for data mining," in The Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[10] B. Liu, W. Hsu, and Y. Ma, "Mining Association Rules with Multiple Minimum Supports," in SIGKDD Explorations, 1999, pp. 337-341.
[11] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining association rules on significant rare data using relative support," Journal of Systems and Software, vol. 67, no. 3, pp. 181-191, 2003.
[12] M. Hahsler, "A Model-Based Frequency Constraint for Mining Associations from Transaction Data," Data Mining and Knowledge Discovery, vol. 13, no. 2, pp. 137-166, 2006.
[13] L. Zhou and S. Yau, "Association rule and quantitative association rule mining among infrequent items," in International Conference on Knowledge Discovery and Data Mining, San Jose, California, 2007, pp. 156-167.
[14] C. Ordonez, C. Santana, and L. d. Braal, "Discovering Interesting Association Rules in Medical Data," in Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2000, pp. 78-85.
[15] L. J. Sheela and V. Shanthi, "DIMAR - Discovering interesting medical association rules form MRI scans," in 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2009, pp. 654-658.
[16] C. Ordonez, N. Ezquerra, and C. A. Santana, "Constraining and summarizing association rules in medical data," Knowledge and Information Systems, vol. 9, no. 3, pp. 259-283, September 2005.
[17] H. Pan, J. Li, and Z. Wei, "Mining Interesting Association Rules in Medical Images," Lecture Notes in Computer Science, vol. 3584, pp. 598-609, 2005.
[18] S. Doddi, A. Marathe, S. S. Ravi, and D. C. Torney, "Discovery of association rules in medical data," Medical Informatics and the Internet in Medicine, vol. 26, no. 1, pp. 25-33, January 2001.

Finding Symmetric Association Rules to Support Medical Qualitative Research

  • 1. Finding Symmetric Association Rules to Support Medical Qualitative Research Razan Paul, Abu Sayed Md. Latiful Hoque Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd Abstract the algorithms produce rules eliminating all infrequent item sets. On the other hand, if we set In medical qualitative research, medical minimum support too low, the algorithms produce researchers analyze historical patient data to verify far too many rules that are meaningless. In order to known relationships and to discover unknown deal with this problem many algorithms have been relationships among medical attributes. All the proposed to mine rare associations [10-13]. existing algorithms to solve this problem use However, these algorithms do not find the measures which are asymmetric measure, so only relationship between rare and high frequent medical one direction of the rule( P -> Q or Q->P) is taken items. In [7], the authors propose few algorithms that into account. However, medical researchers are allow a user to specify Boolean expressions over the interested to find both asymmetric and symmetric presence or absence of items in association rule or to relationship among medical attributes. We have specify a certain hierarchy [6] of items in association developed pruning strategies and devised an efficient rule. These approaches are not enough to mine algorithm for the symmetric relationship problem. desired rules for medical researchers. We propose measuring interestingness of known All the existing algorithms [14-18] to discover symmetric relationships and unknown symmetric interesting association rules in medical data only find relationships via the correlation measure of asymmetric pattern, whereas medical researchers are antecedent items and consequent items. 
We have interested to find both asymmetric and symmetric demonstrated its effectiveness by testing it on real relationship among medical attributes. For these dataset. reasons, we have proposed an association-mining algorithm, which will find rules among the attributes 1. Introduction of the researcher interest, so that it can help in decision making of the researchers. The problem in In medical qualitative research, medical discovering relationships is to avoid redundant researchers are interested in finding association rules relationships and control the quality of them. This to see relationship among specified items and to see algorithm allows the researchers to define the how a group of items is related with a different group following constraints: group information of of items. For instance, a medical researcher can attributes, minimum confidence and support for each discover relationship between the age and the group, which item will appear in antecedent and HbA1c% of a patient. Medical researchers are which item will appear in consequent and which interested to find relationship among various attributes will appear in both. One attribute can diseases, lab tests, symptoms, etc. Due to high belong to several groups. dimensionality of medical data, conventional association mining algorithms [1-8] discover a very 2. Mapping complex medical data to high number of rules with many attributes, which are mineable items tedious, redundant to medical researchers and not among their desired set of attributes. Medical For knowledge discovery, the medical data have researchers may need to find the relationship to be transformed into a suitable transaction format between rare and high frequent medical items, but to discover knowledge. 
We have addressed the conventional mining processes for association rules problem of mapping complex medical data to items explore interesting relationships between data items using domain dictionary and rule base as shown in that occur frequently together [1-7]. figure 1. The medical data are types of categorical, Rare item problem is presented in [9]. According continuous numerical data, boolean, interval, to this problem if minimum support is set too high, percentage, fraction and ratio. Medical domain 978-1-4244-7571-1/10/$26.00 ©2010 IEEE 81
  • 2. experts have the knowledge of how to map ranges of cardinality of attributes except continuous numeric numerical data for each attribute to a series of items. data are not high in medical domain, these attribute For example, there are certain conventions to values are mapped to integer values using medical consider a person is young, adult, or elder with domain dictionaries. Therefore, the mapping process respect to age. A set of rules is created for each is divided in two phases. Phase 1: a rule base is continuous numerical attribute using the knowledge constructed based on the knowledge of medical of medical domain experts. A rule engine is used to domain experts and dictionaries are constructed for map continuous numerical data to items using these attributes where domain expert knowledge is not developed rules. applicable, Phase 2: attribute values are mapped to We have used domain dictionary approach to integer values using the corresponding rule base and transform the data, for which medical domain expert the dictionaries. knowledge is not applicable, to numerical form. As Original Mapped Original Mapped Generate dictionary for value value value value each categorical attribute Headache 1 Yes 1 Fever 2 No 2 PatientActual Data Age Smoke Diagnosis Dictionary of Dictionary of ID Diagnosis attribute Smoke attribute 1020D 33 Yes Headache 1021D 63 No Fever Map to integer items using rule base and dictionaries Actual data If age <= 12 then 1 Medical If 13<=age<=60 then 2 domain If 60 <=age then 3 Patient Age Smoke Diagnosis knowledge If smoke = y then 1 ID If smoke = n then 2 1020D 2 1 1 If Sex = M then 1 1021D 3 2 2 If Sex = F then 2 Rule Base Data suitable for Knowledge Discovery Figure 1. Data transformation of medical data 3. The proposed algorithm Uninteresting relationships among medical attributes are avoided in the candidate The main theme of this algorithm is based on the generation phase which reduces number of following two statements. 
Interesting relationships rules, finds out only interesting relationships among various medical attributes are concealed in and makes the algorithm fast. subsets of the attributes, but do not come out on all Confidence is not the perfect method to rank attributes taken together. All interesting relationships symmetric medical relationships because it does not among various medical attributes have not same account for the consequent frequency with the support and confidence. The algorithm constructs a antecedent. For the ranking of medical relationship, a candidate itemsets based on groups constraint and direct measure of association rule between variables use the corresponding support of each group in is a perfect scheme. For a medical relationship s t candidate selection process to discover all possible , s is a group of medical items where each item is desired itemsets of that group. The goals of this constrained to be appear in antecedent or both and t algorithm are the following: finding desired rules of is a group of medical attributes where each item is medical researcher and running fast. The features of appear to be in consequent or both. Moreover, this proposed algorithm are as follows: s t = Ø. For this relationship, the support is It allows grouping of attributes to find defined as support = P s, t and the confidence is relationship among medical attributes. This defined as = P s, t /P t where P is the probability. provides control on the search process. The correlation coefficient (also known as the - Minimum confidence and support can vary coefficient) measures the degree of relationship from one group to another group. between two random variables by measuring the One item can belong to several groups degree of linear interdependency. It is defined by the Attributes are constrained to appear on either covariance between the two variables divided by antecedent or consequent or both side of the their standard deviations: rule. 
Cov(s, t) st = It does not generate subsets on full desired s t itemset, but generates subsets for items that Here Cov(s, t) represents the covariance of the can appear in both consequent and two variables and X and Y are stand for standard antecedent. 82
deviation. The covariance measures how two variables change together:

$$\mathrm{Cov}(s,t) = P(s,t) - P(s)\,P(t)$$

The standard deviation is the square root of the variance, and the variance is the special case of the covariance in which the two variables are identical:

$$\sigma_s = \sqrt{\mathrm{Var}(s)} = \sqrt{\mathrm{Cov}(s,s)} = \sqrt{P(s,s) - P(s)\,P(s)} = \sqrt{P(s) - P(s)^2}$$

Similarly, $\sigma_t = \sqrt{P(t) - P(t)^2}$. Therefore

$$\rho_{st} = \frac{P(s,t) - P(s)\,P(t)}{\sqrt{P(s) - P(s)^2}\,\sqrt{P(t) - P(t)^2}}$$

Here $P(s,t)$ is the support of the itemset that consists of both s and t; let this support be $S_{st}$. $P(s)$ and $P(t)$ are the supports of antecedent s and consequent t respectively; let these supports be $S_s$ and $S_t$. The values of $S_{st}$, $S_s$ and $S_t$ are computed during the desired itemset generation of our proposed algorithm. Using these values, we can calculate the correlation of every medical relationship rule between one group of medical items and another group of medical items. The correlation value indicates to medical researchers how strong a medical relationship is from the perspective of historical data.

$$\rho_{st} = \frac{S_{st} - S_s S_t}{\sqrt{S_s - S_s^2}\,\sqrt{S_t - S_t^2}}$$

So, by putting the values of $S_{st}$, $S_s$ and $S_t$ into the association rule generation phase, we obtain a single metric, the correlation coefficient, that represents how strongly antecedent and consequent are medically related with each other. For each medical relationship or rule, this metric is used to indicate the strength of the relationship between one group of items and another group of items to support medical qualitative research. The value of $\rho_{st}$ ranges between -1 and +1. If the two variables are independent, then $\rho_{st}$ equals 0. When $\rho_{st}$ equals +1, the variables are considered perfectly positively correlated. A positive correlation is evidence of a general tendency that when a group of attribute values s occurs for a patient, another group of attribute values t occurs for the same patient. A more positive value means a stronger relationship. When $\rho_{st}$ equals -1, the variables are considered perfectly negatively correlated.

Figure 2 shows the association-mining algorithm to support medical research. Like Apriori, our algorithm is based on level-wise search. The major difference between our proposed algorithm and Apriori is the candidate generation process. Each item consists of an attribute name and its value. Having retrieved the information of a 1-itemset, we create a new 1-itemset if it has not been created already; otherwise we update its support. A 1-itemset can belong to zero or more groups. A 1-itemset is selected if its support is greater than or equal to the support of one of its corresponding groups. As a medical attribute value contains patient information that is multidimensional, the algorithm performs the count operation by comparing the values of attributes, instead of determining the presence or absence of attribute values, to calculate support.

3.1. Candidate Generation and Selection

The intuition behind candidate generation in all level-wise algorithms like Apriori is based on the following simple fact: every subset of a frequent itemset is frequent, so the number of itemsets that have to be checked can be reduced. However, the idea behind candidate generation in the proposed algorithm is that every item in an itemset has to be in the same group. This idea produces new candidates that consist of items in the same group and keeps itemsets that consist of both rare items and highly frequent items. If all the items in a new candidate set are in the same group, it is selected as a valid candidate; otherwise the new candidate is not added to the valid candidate itemsets. Each group has its own support and confidence, and each candidate itemset belongs to a particular group. After finding the group id of a candidate itemset, the algorithm uses the corresponding support for candidate selection, whereas Apriori uses a single support threshold for all candidate itemsets. In this way, the itemsets that medical researchers desire are explored.

3.2. Generating association rules

Let AC(item) be the function that returns one of three values: 1 if the item is constrained to be in the antecedent of a rule, 2 if it is constrained to be in the consequent, and 0 if it can be in either. Using this function, an itemset is partitioned into an antecedent set, a consequent set and a both set. Moreover, the algorithm does not apply subset generation to whole itemsets to form rules, as conventional association mining algorithms do; it applies subset generation only to the both set. Each subset of the both set is added to the antecedent part in one rule and to the consequent part in another rule. Each itemset belongs to a particular group. In addition, there is a different confidence for each group, whereas Apriori uses a single confidence for all itemsets. After finding the group id of an itemset, the algorithm uses the corresponding confidence to form rules. In this way, the rules that medical researchers desire are explored.
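As an illustration of the rule generation of Section 3.2 together with the correlation metric, the following is a minimal Python sketch, not the authors' C# implementation. The names `correlation`, `generate_rules`, `support`, `ac` and the toy transactions are hypothetical and introduced only for this example. Two choices are assumptions rather than part of the paper: a degenerate support of 0 or 1 is treated as zero correlation, and rules with an empty antecedent or consequent are skipped.

```python
from itertools import combinations
from math import sqrt


def correlation(s_st, s_s, s_t):
    """Correlation of antecedent s and consequent t computed from supports."""
    denom = sqrt(s_s - s_s ** 2) * sqrt(s_t - s_t ** 2)
    if denom == 0:
        return 0.0  # assumption: support of 0 or 1 is treated as uncorrelated
    return (s_st - s_s * s_t) / denom


def generate_rules(itemset, ac, support, min_conf):
    """Form rules from one desired itemset.

    ac maps item -> 0 (either side), 1 (antecedent only), 2 (consequent only).
    support is a hypothetical lookup: frozenset of items -> fraction of records.
    Returns a list of (antecedent, consequent, confidence, correlation).
    """
    both = [i for i in itemset if ac[i] == 0]
    ante = frozenset(i for i in itemset if ac[i] == 1)
    cons = frozenset(i for i in itemset if ac[i] == 2)
    rules = []
    # Enumerating every subset Y of the both set also enumerates its
    # complement, so this single loop covers both rule forms of Figure 2.
    for r in range(len(both) + 1):
        for y in combinations(both, r):
            a = ante | frozenset(y)
            c = cons | (frozenset(both) - frozenset(y))
            if not a or not c:
                continue  # assumption: a rule needs items on both sides
            conf = support(a | c) / support(a)
            if conf >= min_conf:
                rho = correlation(support(a | c), support(a), support(c))
                rules.append((a, c, conf, rho))
    return rules


# Toy data: four patient records over three items A, B and C.
transactions = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]


def support(items):
    """Fraction of toy records that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)


ac = {"A": 1, "B": 0, "C": 2}  # A: antecedent only, C: consequent only, B: either
rules = generate_rules(frozenset("ABC"), ac, support, min_conf=0.6)
for a, c, conf, rho in rules:
    print(sorted(a), "->", sorted(c), round(conf, 3), round(rho, 3))
```

In this toy run both rules reach the same confidence, but their correlations have opposite signs, which is the kind of distinction the correlation metric is meant to surface. In the paper the threshold is chosen per group; here a single `min_conf` stands in for the confidence of the group to which the itemset belongs.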
Algorithm: Find itemsets which have high support and are in the same group.
Input: Data and metadata files.
Output: Itemsets which are desired by medical researchers.
1. K = 1;
2. Read the metadata about which attributes can only appear in the antecedent of a rule, can only appear in the consequent, and can appear in either.
3. Read the group information, along with each group's support and confidence, from the configuration file and make a dictionary; here the key is the attribute number and the value is a list of the group numbers to which the corresponding attribute belongs.
4. Ik = Select 1-itemsets that have support greater than or equal to one of their corresponding group supports.
5. While (Ik ≠ Ø)
   5.1 K++;
   5.2 Ck = Candidate_generation(Ik-1)
   5.3 CalculateCandidatesSupport(Ck)
   5.4 Ik = SelectDesiredItemSetFromCandidates(Ck, GroupSupports);
   5.5 I = I U Ik
6. return I

procedure Candidate_generation(Ik-1: frequent (k-1)-itemsets)
1. for each itemset i1 ∈ Ik-1
   1.1 for each itemset i2 ∈ Ik-1
       1.1.1 new candidate, NC = Union(i1, i2);
       1.1.2 if size of NC is k
             1.1.2.1 isInSameGroup = TestWhetherAllTheItemsInSameGroup(NC)
             1.1.2.2 if (isInSameGroup == true)
                     1.1.2.2.1 add NC to Ck, otherwise remove it.
2. return Ck;

procedure SelectDesiredItemSetFromCandidates(Ck, GroupSupports)
1. For each itemset c ∈ Ck
   1.1 j = FindGroupNoWhichHasMinimumSupportIfMultipleGroupsExist(c)
   1.2 If c.support >= GroupSupports[j]
   1.3 Add it to I
2. return I

Algorithm: Find association rules for decision supportability of medical researchers.
Input: I: Itemsets, GroupConfidences
Output: R: Set of rules
1. R = Ø
2. For each X ∈ I
   2.1 j = FindGroupNoWhichHasMinimumConfidenceIfMultipleGroupsExist(X)
   2.2 Both set B = (b1, b2, ..., bn) {where bi ∈ X and AC(bi) = 0}
   2.3 Antecedent set AS = (as1, as2, ..., asn) {where asi ∈ X and AC(asi) = 1}
   2.4 Consequent set CS = (cs1, cs2, ..., csn) {where csi ∈ X and AC(csi) = 2}
   2.5 For each subset Y of B
       2.5.1 Y1 = B - Y;
       2.5.2 AS1 = AS U Y
       2.5.3 CS1 = CS U Y1
       2.5.4 if (Support(AS1 → CS1) / Support(AS1)) >= GroupConfidences[j]
             2.5.4.1 AS1 → CS1 is a valid rule.
             2.5.4.2 R = R U (AS1 → CS1)
       2.5.5 AS2 = AS U Y1
       2.5.6 CS2 = CS U Y
       2.5.7 if (Support(AS2 → CS2) / Support(AS2)) >= GroupConfidences[j]
             2.5.7.1 AS2 → CS2 is a valid rule.
             2.5.7.2 R = R U (AS2 → CS2)

Figure 2: Association mining algorithm to support medical research

3.2.1. Lemma 1. The number of rules is equal to $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where k is the number of desired itemsets, L is the function that determines the number of items in an itemset, and $D_{2i}$ is the both set of the i-th desired itemset. The number of discarded rules is $m^{p} - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

Proof: Let I = {i1, i2, ..., in} be the set of items. Let G = {g1, g2, g3, ..., gq} be the set of groups. Let R = {r1, r2, r3, ..., rs} be the set of restrictions. GS is the function that finds the group with the smallest confidence; if not all items are in the same group, GS returns NULL. A 1-itemset is selected if S(1-itemset) >= S(GS(1-itemset)), where S is the function that returns the support for an itemset. Let C = {c1, c2, c3, ..., cx} be the set of candidate itemsets. A new candidate NC is added to C only if all of its items are in the same group, that is, if GS(NC) is not NULL. A candidate ci is selected for rule generation if S(ci) >= S(GS(ci)). A desired itemset D is partitioned into three parts, D = {D0, D1, D2}: D0 is mapped to antecedent items, D1 is mapped to consequent items, and D2 is mapped to both. Each subset d of D2 is added to both antecedent and consequent. When d is added to the antecedent, D2 - d is added to the consequent. On the other hand, when d is added to the consequent, D2 - d is added to the antecedent. L is the function that determines the number of items in an itemset. The number of rules from D is $2^{L(D_2)}$, so the total number of rules is $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where k is the number of desired itemsets. Let m be the average number of distinct values that each multidimensional attribute holds, and let p be the number of attributes. The number of possible different rules is $m^{p}$. The number of discarded rules is therefore $m^{p} - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

4. Results and discussion

The experiments were done on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and 3 GB of main memory. The operating system was Microsoft Vista and the implementation language was C#. We used one dataset to verify our method. The dataset of interest is a patient dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances and 514 attributes (including 150 discrete and 364 numerical attributes). It contains all categories of healthcare data: ratio, interval, decimal, integer, percentage, etc. All these data are converted into mineable items (integer representation) using a domain dictionary and rule base. We have taken an
average value from 10 trials for each test result.

Table 1. Test result for patient dataset

Parameter | 4 groups | 8 groups
Support for each group | .55, .64, .76, .45 | .47, .84, .66, .55, .85, .94, .86, .35
Correlation for each group | .71, .41, .51, .61 | .63, .85, .82, .76, .91, .73, .82, .71
Number of items to be constrained in antecedent for each group | 4, 4, 4, 4 | 5, 4, 5, 6, 4, 5, 5, 7
Number of items to be constrained in consequent for each group | 1, 2, 2, 1 | 1, 2, 2, 1, 1, 2, 2, 1
Number of items to be constrained in both for each group | 0, 0, 0, 0 | 1, 1, 1, 1, 1, 1, 1, 0
Total number of desired itemsets | 125 | 311
Total number of desired rules | 21 | 28
Time (seconds) | 173.09 | 556.11

Table 1 shows the test results for the patient dataset after running the program of the proposed algorithm with different parameters. The second column of the table presents the test result where we used 4 groups, minimum support of 45%-76% and correlation of .41-.71 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 6. In total, 125 desired itemsets were generated and 21 rules were discovered. It took about 3461 seconds to find these rules. The third column of the table presents the test result where we used 8 groups, minimum support of 35%-94% and correlation of .63-.91 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 8. In total, 311 desired itemsets were generated and 28 rules were discovered. It took about 11122 seconds to find these rules.

[Plot: Time (seconds) against Number of Groups (4, 8, 12); series: Group Size 4, Group Size 10, Group Size 18]
Figure 3: Time comparison of the proposed algorithms for the patient dataset based on number of groups

Figure 3 shows how time varies with the number of groups for the medical research algorithm. We measured the performance of the algorithm in terms of the number of groups while keeping the group size, the support and confidence of each group, and the antecedent and consequent constraints on attributes constant. Time does not vary significantly because the number of groups does not reduce disk access: it has no effect on the number of candidate generation phases or on the number of support calculation phases. The number of groups affects only the number of valid candidates generated, which can save some CPU time.

[Plot: Time (seconds) against Group Size (4, 8, 12); series: 4 Groups, 8 Groups, 12 Groups]
Figure 4: Time comparison of the proposed algorithms for the patient dataset based on Group Size

Figure 4 shows how time varies with group size for the medical research algorithm. Here we measured the performance of the algorithm in terms of group size while keeping the number of groups, the support and confidence of each group, and the antecedent and consequent constraints on attributes constant. Time varies significantly because group size affects disk access: it determines the number of candidate generation phases and the number of support calculation phases.

[Plot: Accuracy (0 to 1) against Correlation (0.5, 0.7, 0.85); series: Group Size 4, Group Size 10, Group Size 18]
Figure 5: Accuracy of test result for the patient dataset based on correlation

Figure 5 illustrates the accuracy results for our proposed algorithm. The value of correlation for each presented result is also indicated. For accuracy measurement, we intentionally discovered relationships among attributes for which trends are known. We calculated accuracy as the ratio between the number of correctly discovered relationships and the total number of discovered relationships. A discovered relationship is correct if it is one of the known trends of the medical domain. The figure shows that an average accuracy of 55% is achieved with correlation 0.5. The proposed algorithm with correlation 0.7 achieves an average accuracy of 85.66%, and with correlation 0.85 it achieves an average accuracy of 94.66%. As
accuracy refers to the rate of correct values in the data, the figure represents the success of our proposed data mining algorithm.

5. Conclusion

Medical researchers are interested in finding relationships among various diseases, lab tests, symptoms, etc. Due to the high dimensionality of medical data, conventional association mining algorithms discover a very high number of rules with many attributes, which are tedious and redundant to medical researchers and not among their desired set of attributes. In this paper, we have proposed an association rule mining algorithm for finding symmetric association rules to support medical qualitative research. The main theme of this algorithm is based on the following two statements: interesting relationships among various medical attributes are concealed in subsets of the attributes and do not come out on all attributes taken together; and not all interesting relationships among medical attributes have the same support and correlation. The algorithm constructs candidate itemsets based on group constraints and uses the corresponding support of each group in the candidate selection process to discover all possible desired itemsets of that group. We propose measuring the interestingness of known and unknown symmetric relationships via the correlation measure of antecedent items and consequent items. The proposed algorithm has been applied to a real-world medical dataset. We have shown significant accuracy in the output of the proposed algorithm. Although we have used level-wise search for finding symmetric association rules, each step of our algorithm is different from any level-wise search algorithm. Rule generation from desired itemsets is also different from conventional association mining algorithms.

6. References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994, pp. 487-499.
[2] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, United States, 1997, pp. 255-264.
[3] J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," in Proc. ACM SIGMOD Conf. Management of Data, New York, NY, USA, 1995, pp. 175-186.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Very Large Databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993, pp. 207-216.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo, "Efficient Algorithms for Discovering Association Rules," in AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 181-192.
[6] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," in Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, 1995.
[7] R. Srikant, Q. Vu, and R. Agrawal, "Mining Association Rules with Item Constraints," in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 67-73.
[8] A. Savasere, E. Omiecinski, and S. B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," in Proceedings of the 21st International Conference on Very Large Data Bases, 1995, pp. 432-444.
[9] H. Mannila, "Database Methods for Data Mining," in The Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[10] B. Liu, W. Hsu, and Y. Ma, "Mining Association Rules with Multiple Minimum Supports," in SIGKDD Explorations, 1999, pp. 337-341.
[11] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining Association Rules on Significant Rare Data Using Relative Support," Journal of Systems and Software, vol. 67, no. 3, pp. 181-191, 2003.
[12] M. Hahsler, "A Model-Based Frequency Constraint for Mining Associations from Transaction Data," Data Mining and Knowledge Discovery, vol. 13, no. 2, pp. 137-166, 2006.
[13] L. Zhou and S. Yau, "Association Rule and Quantitative Association Rule Mining among Infrequent Items," in International Conference on Knowledge Discovery and Data Mining, San Jose, California, 2007, pp. 156-167.
[14] C. Ordonez, C. Santana, and L. d. Braal, "Discovering Interesting Association Rules in Medical Data," in Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2000, pp. 78-85.
[15] L. J. Sheela and V. Shanthi, "DIMAR - Discovering Interesting Medical Association Rules from MRI Scans," in 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2009, pp. 654-658.
[16] C. Ordonez, N. Ezquerra, and C. A. Santana, "Constraining and Summarizing Association Rules in Medical Data," Knowledge and Information Systems, vol. 9, no. 3, pp. 259-283, September 2005.
[17] H. Pan, J. Li, and Z. Wei, "Mining Interesting Association Rules in Medical Images," Lecture Notes in Computer Science, vol. 3584, pp. 598-609, 2005.
[18] S. Doddi, A. Marathe, S. S. Ravi, and D. C. Torney, "Discovery of Association Rules in Medical Data," Medical Informatics and the Internet in Medicine, vol. 26, no. 1, pp. 25-33, January 2001.