experts have the knowledge of how to map ranges of numerical data for each attribute to a
series of items. For example, there are established conventions for considering a person
young, adult, or elderly with respect to age. A set of rules is created for each continuous
numerical attribute using the knowledge of medical domain experts. A rule engine is used to
map continuous numerical data to items using these rules.

We have used a domain dictionary approach to transform data for which medical domain expert
knowledge is not applicable into numerical form. As the cardinality of attributes other than
continuous numeric data is not high in the medical domain, these attribute values are mapped
to integer values using medical domain dictionaries. The mapping process is therefore divided
into two phases. Phase 1: a rule base is constructed from the knowledge of medical domain
experts, and dictionaries are constructed for attributes where domain expert knowledge is not
applicable. Phase 2: attribute values are mapped to integer values using the corresponding
rule base and dictionaries.
Actual data:

  Patient ID   Age   Smoke   Diagnosis
  1020D        33    Yes     Headache
  1021D        63    No      Fever

Dictionary of Diagnosis attribute (a dictionary is generated for each categorical attribute):

  Original value   Mapped value
  Headache         1
  Fever            2

Dictionary of Smoke attribute:

  Original value   Mapped value
  Yes              1
  No               2

Rule base (from medical domain knowledge):

  If age <= 12 then 1
  If 13 <= age <= 60 then 2
  If 60 <= age then 3
  If smoke = y then 1
  If smoke = n then 2
  If Sex = M then 1
  If Sex = F then 2

Data suitable for knowledge discovery (mapped to integer items using rule base and
dictionaries):

  Patient ID   Age   Smoke   Diagnosis
  1020D        2     1       1
  1021D        3     2       2

Figure 1. Data transformation of medical data
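
The two-phase mapping of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C# implementation: the names `RULE_BASE`, `apply_rules`, `build_dictionary` and `transform` are hypothetical, only the age rules are encoded, Smoke and Diagnosis are handled by dictionaries numbered in order of first appearance, and the overlap of the age rules at 60 is resolved by taking the first matching rule.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[Callable[[float], bool], int]

# Rule base for continuous attributes (thresholds taken from Figure 1).
# Rules are tried in order and the first match wins, so age 60 maps to
# item 2; that tie-break is an assumption of this sketch.
RULE_BASE: Dict[str, List[Rule]] = {
    "Age": [
        (lambda age: age <= 12, 1),
        (lambda age: 13 <= age <= 60, 2),
        (lambda age: age >= 60, 3),
    ],
}


def apply_rules(rules: List[Rule], value: float) -> int:
    """Return the item of the first rule whose predicate accepts value."""
    for predicate, item in rules:
        if predicate(value):
            return item
    raise ValueError(f"no rule covers value {value!r}")


def build_dictionary(values) -> Dict[str, int]:
    """Phase 1: number distinct values 1, 2, ... in order of first appearance."""
    mapping: Dict[str, int] = {}
    for value in values:
        mapping.setdefault(value, len(mapping) + 1)
    return mapping


def transform(records, rule_base, id_field="PatientID"):
    """Phase 2: map every attribute value to an integer item."""
    # Attributes without expert rules get a dictionary (built in Phase 1).
    dict_attrs = [a for a in records[0] if a != id_field and a not in rule_base]
    dictionaries = {a: build_dictionary(r[a] for r in records) for a in dict_attrs}
    mapped = []
    for record in records:
        row = {id_field: record[id_field]}
        for attr, value in record.items():
            if attr == id_field:
                continue
            if attr in rule_base:
                row[attr] = apply_rules(rule_base[attr], value)
            else:
                row[attr] = dictionaries[attr][value]
        mapped.append(row)
    return mapped, dictionaries


# The two patients shown in Figure 1.
records = [
    {"PatientID": "1020D", "Age": 33, "Smoke": "Yes", "Diagnosis": "Headache"},
    {"PatientID": "1021D", "Age": 63, "Smoke": "No", "Diagnosis": "Fever"},
]
mapped, dictionaries = transform(records, RULE_BASE)
```

Keeping the rule base as data rather than hard-coded branches mirrors the separation between expert rules and generated dictionaries described above.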
3. The proposed algorithm

The main theme of this algorithm rests on the following two statements. First, interesting
relationships among medical attributes are concealed in subsets of the attributes and do not
emerge when all attributes are taken together. Second, interesting relationships among
medical attributes do not all have the same support and confidence. The algorithm constructs
candidate itemsets based on group constraints and uses the corresponding support of each
group in the candidate selection process to discover all desired itemsets of that group. The
goals of the algorithm are to find the rules desired by medical researchers and to run fast.
The features of the proposed algorithm are as follows:

  • It allows grouping of attributes to find relationships among medical attributes. This
    provides control over the search process.
  • Minimum confidence and support can vary from one group to another.
  • One item can belong to several groups.
  • Attributes are constrained to appear in the antecedent, in the consequent, or on either
    side of the rule.
  • It does not generate subsets of the full desired itemset, but only of the items that can
    appear in both consequent and antecedent.
  • Uninteresting relationships among medical attributes are avoided in the candidate
    generation phase, which reduces the number of rules, finds only interesting
    relationships and makes the algorithm fast.

Confidence is not a perfect measure for ranking symmetric medical relationships because it
does not account for the frequency of the consequent together with the antecedent. For
ranking medical relationships, a direct measure of association between variables is a more
suitable scheme. For a medical relationship $s \rightarrow t$, $s$ is a group of medical
items where each item is constrained to appear in the antecedent or in both, and $t$ is a
group of medical items where each item is constrained to appear in the consequent or in
both. Moreover, $s \cap t = \emptyset$. For this relationship, the support is defined as
$\mathrm{support} = P(s, t)$ and the confidence is defined as
$\mathrm{confidence} = P(s, t)/P(s)$, where $P$ is the probability.

The correlation coefficient (also known as the $\phi$-coefficient) measures the degree of
relationship between two random variables by measuring their degree of linear
interdependency. It is defined as the covariance between the two variables divided by the
product of their standard deviations:

\[ \rho_{st} = \frac{\mathrm{Cov}(s, t)}{\sigma_s \sigma_t} \]

Here $\mathrm{Cov}(s, t)$ represents the covariance of the two variables, and $\sigma_s$ and
$\sigma_t$ stand for their standard
deviations. The covariance measures how two variables change together:

\[ \mathrm{Cov}(s, t) = P(s, t) - P(s)P(t) \]

The standard deviation is the square root of the variance, and the variance is the special
case of the covariance in which the two variables are identical:

\[ \sigma_s = \sqrt{\mathrm{Var}(s)} = \sqrt{\mathrm{Cov}(s, s)} = \sqrt{P(s, s) - P(s)P(s)} = \sqrt{P(s) - P(s)^2} \]

Similarly, $\sigma_t = \sqrt{P(t) - P(t)^2}$. Therefore

\[ \rho_{st} = \frac{P(s, t) - P(s)P(t)}{\sqrt{P(s) - P(s)^2}\,\sqrt{P(t) - P(t)^2}} \]

Here $P(s, t)$ is the support of the itemset consisting of both $s$ and $t$; let this support
be $S_{st}$. $P(s)$ and $P(t)$ are the supports of the antecedent $s$ and the consequent $t$
respectively; let them be $S_s$ and $S_t$. The values of $S_{st}$, $S_s$ and $S_t$ are
computed during the desired itemset generation of the proposed algorithm. Using these values,
we can calculate the correlation of every medical relationship rule from one group of medical
items to another. The correlation value indicates to medical researchers how strong a medical
relationship is from the perspective of historical data.

\[ \rho_{st} = \frac{S_{st} - S_s S_t}{\sqrt{S_s - S_s^2}\,\sqrt{S_t - S_t^2}} \]

So, by substituting the values of $S_{st}$, $S_s$ and $S_t$ in the association rule generation
phase, we obtain a single metric, the correlation coefficient, that represents how strongly
the antecedent and consequent are medically related. For each medical relationship or rule,
this metric is used to indicate the strength of the relationship between one group of items
and another, in support of medical qualitative research. The value of $\rho_{st}$ ranges
between -1 and +1. If the two variables are independent then $\rho_{st}$ equals 0. When
$\rho_{st}$ equals +1 the variables are considered perfectly positively correlated. A positive
correlation is evidence of a general tendency that when a group of attribute values $s$
occurs for a patient, another group of attribute values $t$ occurs for the same patient. A
more positive value means a stronger relationship. When $\rho_{st}$ equals -1 the variables
are considered perfectly negatively correlated.

Figure 2 shows the association mining algorithm to support medical research. Like Apriori,
our algorithm is based on level-wise search. The major difference from Apriori is the
candidate generation process. Each item consists of an attribute name and its value. Having
retrieved the information for a 1-itemset, we create a new 1-itemset if it has not been
created already; otherwise we update its support. A 1-itemset can belong to zero or more
groups. A 1-itemset is selected if its support is greater than or equal to the support of one
of its corresponding groups. As medical attribute values contain multidimensional patient
information, the algorithm performs the count operation by comparing the values of
attributes, instead of determining the presence or absence of attribute values, to calculate
support.

3.1. Candidate Generation and Selection

The intuition behind candidate generation in all level-wise algorithms like Apriori is based
on the following simple fact: every subset of a frequent itemset is frequent, which reduces
the number of itemsets that have to be checked. The idea behind candidate generation in the
proposed algorithm, however, is that every item in an itemset has to be in the same group.
New candidates therefore consist of items in the same group, and itemsets consisting of both
rare items and highly frequent items are kept. If all the items in a new candidate set are in
the same group, it is selected as a valid candidate; otherwise it is not added to the valid
candidate itemsets. Each group has its own support and confidence, and each candidate itemset
belongs to a particular group. After finding the group id of a candidate itemset, the
algorithm uses the corresponding support for candidate selection, whereas Apriori uses a
single support threshold for all candidate itemsets. In this way, the itemsets desired by
medical researchers are explored.

3.2. Generating association rules

Let AC(item) be the function that returns one of three values: 1 if the item is constrained
to be in the antecedent of a rule, 2 if it is constrained to be in the consequent, and 0 if it
can be in either. Using this function, an itemset is partitioned into an antecedent set, a
consequent set and a both set. The algorithm does not apply subset generation to whole
itemsets to form rules, as conventional association mining algorithms do; it applies subset
generation only to the both set. Each subset of the both set is added to the antecedent part
in one rule and to the consequent part in another rule. Each itemset belongs to a particular
group. In addition, there is a different confidence for each group, whereas Apriori uses a
single confidence for all itemsets. After finding the group id of an itemset, the algorithm
uses the corresponding confidence to form rules. In this way, the rules desired by medical
researchers are explored.
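
The partition-and-subset step of Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_rules`, the `ac` mapping, the toy item names and the toy supports are all hypothetical. Because the complement of a subset of the both set is itself a subset, one pass over all subsets yields both of the rules that each subset contributes. Skipping rules with an empty side is an assumption of this sketch.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items, from the empty set to the full set."""
    ordered = sorted(items)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def generate_rules(itemset, ac, support, min_confidence):
    """Form rules from one desired itemset.

    ac maps an item to 1 (antecedent only), 2 (consequent only) or 0 (either);
    support is a callable returning the support of a frozenset of items;
    min_confidence is the confidence threshold of the itemset's group.
    """
    antecedent = frozenset(i for i in itemset if ac[i] == 1)
    consequent = frozenset(i for i in itemset if ac[i] == 2)
    both = frozenset(i for i in itemset if ac[i] == 0)
    rules = set()
    for chosen in subsets(both):
        lhs = antecedent | chosen            # subset joins the antecedent
        rhs = consequent | (both - chosen)   # remainder joins the consequent
        if not lhs or not rhs:
            continue  # assumption: a rule needs items on both sides
        if support(lhs | rhs) / support(lhs) >= min_confidence:
            rules.add((lhs, rhs))
    return rules


# Toy example with made-up supports.
ac = {"age=3": 1, "smoke=1": 0, "diagnosis=2": 2}
supports = {
    frozenset({"age=3"}): 0.5,
    frozenset({"age=3", "smoke=1"}): 0.4,
    frozenset({"age=3", "smoke=1", "diagnosis=2"}): 0.3,
}
rules = generate_rules(frozenset(ac), ac, lambda s: supports[s], 0.7)
```

With these toy numbers only the rule {age=3, smoke=1} -> {diagnosis=2} (confidence 0.75) passes the 0.7 threshold; the rule {age=3} -> {smoke=1, diagnosis=2} has confidence 0.6 and is dropped.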
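
The correlation coefficient used in Section 3 to rank rules needs only three supports, so it is cheap to evaluate once the desired itemsets are known. A minimal sketch, assuming the supports are given as fractions in [0, 1]; the function name `correlation` is hypothetical, and raising an error for supports of exactly 0 or 1 (where the denominator vanishes) is a choice of this sketch rather than something the paper specifies.

```python
import math


def correlation(s_st: float, s_s: float, s_t: float) -> float:
    """Correlation of a rule s -> t from the supports S_st, S_s and S_t."""
    denominator = math.sqrt(s_s - s_s ** 2) * math.sqrt(s_t - s_t ** 2)
    if denominator == 0:
        # A support of 0 or 1 has zero variance, so the coefficient is undefined.
        raise ValueError("correlation is undefined when a support is 0 or 1")
    return (s_st - s_s * s_t) / denominator


# Independent groups: P(s, t) = P(s) P(t), so the correlation is 0.
independent = correlation(0.2, 0.5, 0.4)
# s and t always occur together: perfectly positive correlation.
always_together = correlation(0.5, 0.5, 0.5)
# s and t never occur together but cover all records: perfectly negative.
never_together = correlation(0.0, 0.5, 0.5)
```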
Algorithm: Find itemsets which have high support and are in the same group.
Input: Data and metadata files.
Output: Itemsets which are desired by medical researchers.
1. K = 1;
2. Read the metadata about which attributes can only appear in the antecedent of a rule,
   can only appear in the consequent, and can appear in either.
3. Read group information, along with each group's support and confidence, from the
   configuration file and make a dictionary; here the key is the attribute number and the
   value is a list of the group numbers to which the corresponding attribute belongs.
4. Ik = Select 1-itemsets that have support greater than or equal to one of their
   corresponding group supports.
5. While (Ik ≠ Ø)
   5.1 K++;
   5.2 Ck = Candidate_generation(Ik-1)
   5.3 CalculateCandidatesSupport(Ck)
   5.4 Ik = SelectDesiredItemSetFromCandidates(Ck, GroupSupports);
   5.5 I = I U Ik
6. return I

procedure Candidate_generation(Ik-1: frequent (k-1)-itemsets)
1. for each itemset i1 ∈ Ik-1
   1.1 for each itemset i2 ∈ Ik-1
       1.1.1 new candidate NC = Union(i1, i2);
       1.1.2 if size of NC is k
             1.1.2.1 isInSameGroup = TestWhetherAllTheItemsInSameGroup(NC)
             1.1.2.2 if (isInSameGroup == true)
                     1.1.2.2.1 add NC to Ck, otherwise remove it.
2. return Ck;

procedure SelectDesiredItemSetFromCandidates(Ck, GroupSupports)
1. for each c ∈ Ck
   1.1 j = FindGroupNoWhichHasMinimumSupportIfMultipleGroupsExist(c)
   1.2 if c.support >= GroupSupports[j]
       1.3 add c to I
2. return I

Algorithm: Find association rules for decision supportability of medical researchers.
Input: I: itemsets, GroupConfidences
Output: R: set of rules
1. R = Ø
2. For each X ∈ I
   2.1 j = FindGroupNoWhichHasMinimumConfidenceIfMultipleGroupsExist(X)
   2.2 Both set B = (b1, b2, ..., bn) {where bi ∈ X and AC(bi) = 0}
   2.3 Antecedent set AS = (as1, as2, ..., asn) {where asi ∈ X and AC(asi) = 1}
   2.4 Consequent set CS = (cs1, cs2, ..., csn) {where csi ∈ X and AC(csi) = 2}
   2.5 For each subset Y of B
       2.5.1 Y1 = B - Y;
       2.5.2 AS1 = AS U Y
       2.5.3 CS1 = CS U Y1
       2.5.4 if (Support(AS1 U CS1) / Support(AS1)) >= GroupConfidences[j]
             2.5.4.1 AS1 → CS1 is a valid rule.
             2.5.4.2 R = R U (AS1 → CS1)
       2.5.5 AS2 = AS U Y1
       2.5.6 CS2 = CS U Y
       2.5.7 if (Support(AS2 U CS2) / Support(AS2)) >= GroupConfidences[j]
             2.5.7.1 AS2 → CS2 is a valid rule.
             2.5.7.2 R = R U (AS2 → CS2)

Figure 2: Association mining algorithm to support medical research
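
The itemset-mining half of Figure 2 can be sketched in Python as below. It is an illustrative reading of the pseudocode under stated assumptions, not the authors' C# code: all function names, the record layout (one dict of attribute to integer item per patient) and the toy data are hypothetical. The sketch additionally requires the attributes within a candidate to be distinct, which the pseudocode does not state.

```python
def support(itemset, records):
    """Fraction of records matching every (attribute, value) item of itemset."""
    hits = sum(all(r.get(attr) == val for attr, val in itemset) for r in records)
    return hits / len(records)


def shared_groups(itemset, groups_of):
    """Groups that contain every attribute of itemset (empty set if none)."""
    memberships = [set(groups_of.get(attr, ())) for attr, _ in itemset]
    return set.intersection(*memberships)


def threshold(itemset, groups_of, group_support):
    """Smallest support among the shared groups, or None without a shared group."""
    groups = shared_groups(itemset, groups_of)
    return min(group_support[g] for g in groups) if groups else None


def mine_desired_itemsets(records, groups_of, group_support):
    """Level-wise search that keeps only itemsets lying inside one group."""

    def select(candidates):
        # Candidate selection with the group-specific support threshold.
        chosen = {}
        for candidate in candidates:
            limit = threshold(candidate, groups_of, group_support)
            value = support(candidate, records)
            if limit is not None and value >= limit:
                chosen[candidate] = value
        return chosen

    singles = {frozenset([(a, v)]) for r in records for a, v in r.items()}
    level = select(singles)
    desired = {}
    k = 1
    while level:
        desired.update(level)
        k += 1
        candidates = set()
        for i1 in level:
            for i2 in level:
                nc = i1 | i2
                # Assumption: one value per attribute inside an itemset.
                distinct_attrs = len({attr for attr, _ in nc}) == k
                if len(nc) == k and distinct_attrs and shared_groups(nc, groups_of):
                    candidates.add(nc)
        level = select(candidates)
    return desired


# Toy data: Age and Diagnosis share group 1, Smoke sits alone in group 2.
records = [
    {"Age": 2, "Smoke": 1, "Diagnosis": 1},
    {"Age": 3, "Smoke": 2, "Diagnosis": 2},
    {"Age": 3, "Smoke": 1, "Diagnosis": 2},
    {"Age": 3, "Smoke": 2, "Diagnosis": 2},
]
groups_of = {"Age": [1], "Diagnosis": [1], "Smoke": [2]}
group_support = {1: 0.5, 2: 0.5}
desired = mine_desired_itemsets(records, groups_of, group_support)
```

In this toy run the pair {Age=3, Diagnosis=2} is kept because both attributes share group 1, while no itemset combining Age with Smoke is ever generated, which illustrates how the group constraint prunes the search before any support is counted.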
3.2.1. Lemma 1. The number of rules is equal to $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where $k$ is
the number of desired itemsets and $L$ is a function which determines the number of items in
an itemset. $D_2$ is the both set. The number of discarded rules is
$m^p - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

Proof: Let I = {i1, i2, ..., in} be the set of items. Let G = {g1, g2, g3, ..., gq} be the set
of groups. Let R = {r1, r2, r3, ..., rs} be the set of restrictions. GS is the function which
finds the group with the smallest confidence. If not all items are in the same group, GS
returns NULL. A 1-itemset is selected if S(1-itemset) >= S(GS(1-itemset)), where S is the
function which returns the support for an itemset. Let C = {c1, c2, c3, ..., cx} be the set of
candidate itemsets. A new candidate NC is added to C if GS(NC) is not NULL. A candidate ci is
selected for rule generation if S(ci) >= S(GS(ci)). A desired itemset D is partitioned into
three parts, D = {D0, D1, D2}. D0 is mapped to antecedent items, D1 is mapped to consequent
items, and D2 is mapped to both. Each subset d of D2 is added to both antecedent and
consequent. When d is added to the antecedent, D2 - d is added to the consequent. On the
other hand, when d is added to the consequent, D2 - d is added to the antecedent. L is a
function which determines the number of items in an itemset. The number of rules from D is
$2^{L(D_2)}$. So the total number of rules is $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where $k$ is the
number of desired itemsets. Let $m$ be the average number of distinct values that each
multidimensional attribute holds, and let $p$ be the number of attributes. The number of
possible different rules is $m^p$. The number of discarded rules is
$m^p - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

4. Results and discussion

The experiments were run on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and
3 GB of main memory. The operating system was Microsoft Vista and the implementation language
was C#. We used one dataset to verify our method. The dataset of interest is a patient
dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances and
514 attributes (150 discrete and 364 numerical attributes). It contains all categories of
healthcare data: ratio, interval, decimal, integer, percentage, etc. All these data are
converted into mineable items (integer representation) using the domain dictionary and rule
base. We have taken an
average value from 10 trials for each test result.

Table 1. Test result for patient dataset

  Number of groups                                           4                    8
  Support for each group                                     .55, .64, .76, .45   .47, .84, .66, .55, .85, .94, .86, .35
  Correlation for each group                                 .71, .41, .51, .61   .63, .85, .82, .76, .91, .73, .82, .71
  Number of items constrained in antecedent for each group   4, 4, 4, 4           5, 4, 5, 6, 4, 5, 5, 7
  Number of items constrained in consequent for each group   1, 2, 2, 1           1, 2, 2, 1, 1, 2, 2, 1
  Number of items constrained in both for each group         0, 0, 0, 0           1, 1, 1, 1, 1, 1, 1, 0
  Total number of desired itemsets                           125                  311
  Total number of desired rules                              21                   28
  Time (seconds)                                             173.09               556.11

Table 1 shows the test results for the patient dataset after running the proposed algorithm
with different parameters. The second column of the table presents the result where we used
4 groups, minimum support of 45%-76% and correlation of .41-.71 to mine symmetric association
rules for medical researchers. The maximum number of items in a rule was 6. In total, 125
desired itemsets were generated and 21 rules were discovered. It took about 3461 seconds to
find these rules. The third column of the table presents the result where we used 8 groups,
minimum support of 35%-94% and correlation of .63-.91 to mine symmetric association rules for
medical researchers. The maximum number of items in a rule was 8. In total, 311 desired
itemsets were generated and 28 rules were discovered. It took about 11122 seconds to find
these rules.

[Plot: Time (seconds) against Number of Groups (4, 8, 12), with one series each for Group
Size 4, Group Size 10 and Group Size 18.]
Figure 3: Time comparison of the proposed algorithms for the patient dataset based on number
of groups

Figure 3 shows how time varies with the number of groups for the medical research algorithm.
We measured the performance of the algorithm in terms of the number of groups while keeping
the group size, the support and confidence of each group, and the antecedent and consequent
constraints on attributes constant. Time does not vary significantly because the number of
groups does not reduce disk access: it has no effect on the number of candidate generation
phases or on the number of support calculation phases. The number of groups affects only the
number of valid candidates generated, which can save some CPU time.

[Plot: Time (seconds) against Group Size (4, 8, 12), with one series each for 4 Groups,
8 Groups and 12 Groups.]
Figure 4: Time comparison of the proposed algorithms for the patient dataset based on Group
Size

Figure 4 shows how time varies with group size for the medical research algorithm. Here we
measured the performance of the algorithm in terms of group size while keeping the number of
groups, the support and confidence of each group, and the antecedent and consequent
constraints on attributes constant. Time varies significantly because group size affects disk
access: it determines the number of candidate generation phases and the number of support
calculation phases.

[Plot: Accuracy (0 to 1) against Correlation (0.5, 0.7, 0.85), with one series each for Group
Size 4, Group Size 10 and Group Size 18.]
Figure 5: Accuracy of test result for the patient dataset based on correlation

Figure 5 illustrates the accuracy results for our proposed algorithm. The correlation value
for each presented result is also indicated. For accuracy measurement, we intentionally
discovered relationships among attributes for which trends are known. We calculated accuracy
as the ratio between the number of correctly discovered relationships and the total number of
discovered relationships. A discovered relationship is correct if it is one of the known
trends of the medical domain. An average accuracy of 55% is achieved with correlation 0.5.
With correlation 0.7 the proposed algorithm achieves an average accuracy of 85.66%, and with
correlation 0.85 it achieves an average accuracy of 94.66%. As
accuracy refers to the rate of correct values in the data, the figure represents the success
of our proposed data mining algorithm.

5. Conclusion

Medical researchers are interested in finding relationships among various diseases, lab
tests, symptoms, etc. Due to the high dimensionality of medical data, conventional
association mining algorithms discover a very large number of rules with many attributes,
which are tedious and redundant for medical researchers and fall outside their desired set of
attributes. In this paper, we have proposed an association rule mining algorithm for finding
symmetric association rules to support medical qualitative research. The main theme of this
algorithm rests on the following two statements: interesting relationships among medical
attributes are concealed in subsets of the attributes and do not emerge when all attributes
are taken together, and interesting relationships among medical attributes do not all have
the same support and correlation. The algorithm constructs candidate itemsets based on group
constraints and uses the corresponding support of each group in the candidate selection
process to discover all desired itemsets of that group. We propose measuring the
interestingness of known and unknown symmetric relationships via the correlation measure of
antecedent items and consequent items. The proposed algorithm has been applied to a
real-world medical dataset. We have shown significant accuracy in the output of the proposed
algorithm. Although we have used level-wise search for finding symmetric association rules,
each step of our algorithm differs from other level-wise search algorithms. Rule generation
from desired itemsets also differs from conventional association mining algorithms.

6. References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large
    Databases," in Proceedings of the 20th International Conference on Very Large Data
    Bases, San Francisco, CA, USA, 1994, pp. 487-499.
[2] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication
    Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD International
    Conference on Management of Data, Tucson, Arizona, United States, 1997, pp. 255-264.
[3] J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash based Algorithm for mining
    association rules," in Proc. ACM SIGMOD Conf. Management of Data, New York, NY, USA,
    1995, pp. 175-186.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items
    in Very Large Databases," in Proceedings of the 1993 ACM SIGMOD International Conference
    on Management of Data, Washington, D.C., 1993, pp. 207-216.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo, "Efficient Algorithms for Discovering
    Association Rules," in AAAI Workshop on Knowledge Discovery in Databases, 1994,
    pp. 181-192.
[6] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," in Proc. of the 21st
    Int'l Conference on Very Large Databases, Zurich, Switzerland, 1995.
[7] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with item constraints," in
    Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 67-73.
[8] A. Savasere, E. Omiecinski, and S. B. Navathe, "An Efficient Algorithm for Mining
    Association Rules in Large Databases," in Proceedings of the 21st International
    Conference on Very Large Data Bases, 1995, pp. 432-444.
[9] H. Mannila, "Database methods for data mining," in The Fourth International Conference on
    Knowledge Discovery and Data Mining, 1998.
[10] B. Liu, W. Hsu, and Y. Ma, "Mining Association Rules with Multiple Minimum Supports," in
    SIGKDD Explorations, 1999, pp. 337-341.
[11] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining association rules on significant rare
    data using relative support," Journal of Systems and Software, vol. 67, no. 3,
    pp. 181-191, 2003.
[12] M. Hahsler, "A Model-Based Frequency Constraint for Mining Associations from Transaction
    Data," Data Mining and Knowledge Discovery, vol. 13, no. 2, pp. 137-166, 2006.
[13] L. Zhou and S. Yau, "Association rule and quantitative association rule mining among
    infrequent items," in International Conference on Knowledge Discovery and Data Mining,
    San Jose, California, 2007, pp. 156-167.
[14] C. Ordonez, C. Santana, and L. d. Braal, "Discovering Interesting Association Rules in
    Medical Data," in Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining
    and Knowledge Discovery, 2000, pp. 78-85.
[15] L. J. Sheela and V. Shanthi, "DIMAR - Discovering interesting medical association rules
    from MRI scans," in 6th International Conference on Electrical Engineering/Electronics,
    Computer, Telecommunications and Information Technology, 2009, pp. 654-658.
[16] C. Ordonez, N. Ezquerra, and C. A. Santana, "Constraining and summarizing association
    rules in medical data," Knowledge and Information Systems, vol. 9, no. 3, pp. 259-283,
    September 2005.
[17] H. Pan, J. Li, and Z. Wei, "Mining Interesting Association Rules in Medical Images,"
    Lecture Notes in Computer Science, vol. 3584, pp. 598-609, 2005.
[18] S. Doddi, A. Marathe, S. S. Ravi, and D. C. Torney, "Discovery of association rules in
    medical data," Medical Informatics and the Internet in Medicine, vol. 26, no. 1,
    pp. 25-33, January 2001.