SlideShare a Scribd company logo
1 of 9
Download to read offline
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
DOI : 10.5121/ijdkp.2013.3409 127
PATTERN GENERATION FOR COMPLEX DATA
USING HYBRID MINING
Manish Kumar1
, Sumit Kumar2
, Sweety1
1
IIIT-Allahabad, India, 2
IVY Comptech Pvt. Ltd.-Hyderabad
[manish.singh76,sumit78kumar90,sweety.bhagat89]@gmail.com
ABSTRACT
Combined mining is a hybrid mining approach for mining informative patterns from single or multiple
data-sources, multiple-features extraction and applying multiple-methods as per the requirements. Data
mining applications often involve complex data like multiple heterogeneous data sources, different user
preference and create decision-making actions. The complete useful information may not be obtained by
using single data mining method in the form of informative patterns as that would consume more time and
space. This paper implements hybrid or combined mining approach that applies Lossy-counting algorithm
on each data-source to get the frequent data item-sets and then generates the combined association rules.
Applying multi-feature approach, we generate incremental pair patterns and incremental cluster patterns.
In multi-method combined mining approach, FP-growth and Bayesian Belief Network are combined to
generate classifier to get more informative knowledge. This paper uses two different data-sets to get more
useful knowledge and compare the results.
KEYWORDS
Association Rule Mining, Lossy-Counting Algorithm, Incremental Pair-Patterns, Incremental Cluster-
Patterns, Bayesian Belief Network.
1. INTRODUCTION
In distributed data mining algorithms [Kargupta and Park (2002)], data sampling is generally not
accepted because it may miss useful data that may be filtered out during sampling. Distributed
data sets need to join into one large data set but process may be more time and space consuming.
More often such approach of handling multiple data sources can only be developed for specific
cases and cannot be applied for all domain problems. Combined mining is a two-to-multistep data
mining approach: In first step, it involves mining the atomic patterns from each individual data
source and then next step combines atomic patterns into combined-patterns, which is more
suitable for a particular problem. In multi-source combined mining approach, it generates
informative patterns from individual data source and then generates the combined patterns. In
multi-feature combined mining approach, we consider features from multiple data sets while
generating the informative patterns, where it is necessary in order to make the patterns more
actionable. In case of cluster patterns, we made the cluster of patterns with same prefix but the
remaining data items in the pattern reflect results to be different. The advantage of our approach,
it does not apply any pruning method or any clustering method separately to get the more
informative patterns. In Lossy-counting algorithm’s implementation, data sources have already
been pruned at the boundary that directs most similar data items in the same bucket itself.
2. RELATED WORKS
Kargupta and Park (2002) provide an overview of distributed data mining algorithms, systems
and applications. The paper pointed out a mismatch between the architecture of most off-the-shelf
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
128
data mining systems and the needs of mining systems for distributed applications. It also claims
that such a mismatch may cause a fundamental bottleneck in many distributed applications.
Kargupta et al (1999) presented a framework of collective data mining to conduct distributed data
mining from heterogeneous sites. Authors observed that in a heterogeneous environment, naïve
approaches to distributed data analysis may lead to incorrect data-model. Chattratichat et al
(1999) designed Kensington software architecture for distributed enterprise data mining, which
addresses the problem of data mining on logical and physical distribution of data and
heterogeneous computational resources. Karypis and Wang (2005) present a new classifier,
HARMONY, which is an example of direct mining for informative patterns as it directly mines
the resultant set of rules required for classification. G. Dong and J. Li (1999) introduce a new type
of patterns i.e. emerging patterns (EPs), for discovering knowledge from databases. Authors
define EPs as data item-sets whose support increases more significantly from one to other data-
set. They have used EPs to build very powerful classifiers. W. Fan et al (2008) builds a model
based search tree, which partitions the data onto different nodes and at each node, it directly find
out a discriminative pattern, which further divide its examples into more purer subsets. A novel
technique was proposed by B. Liu et al.(1999), which first prunes the discovered association-rules
to remove the insignificant association-rules from the entire set of association-rules, and then
finds a subset of the un-pruned association-rules by which a summary of the discovered
association-rules can be formed. The paper refers it as subset of association-rules as the direction
setting (DS) rules because they can be used to set the directions, which are followed by the rest of
the association-rules. By the help of the summary, the user can have more focus on the important
aspects of the particular domain and also can view the relevant details. They suggest that their
approach is effective as their experimental result shows that the set of DS rules is quite very
small. Lent et al. (1997) proposed a method for clustering two-dimensional associations in large
data-bases. This paper presents a geometric-based algorithm called BitOp, for clustering,
embedded within ARCS (Association Rule Clustering System). It also measures the quality of the
segmentation generated by ARCS. J. Han et al (2006) proposed a new approach called
CrossMine, which mainly includes a set of novel and powerful methods for multi-relational
classification including 1) tuple ID propagation, 2) new definitions for predicates and decision-
tree nodes and 3) a selective sampling method. This paper also proposed two accurate and
scalable methods for multi-relational classification i.e. CrossMine-Rule and CrossMine-Tree. C.
Zhang et al (2008) proposed a novel approach of combined patterns to extract important,
actionable and impact oriented information from a large amount of association rules. Authors also
proposed definitions of combined patterns and also designed novel matrices to measure their
interestingness and analyzed the redundancy in combined patterns. Combined mining as a general
approach is proposed by C. Zhang et al (2011) to mine the informative patterns. This paper
summarizes general framework, paradigms and basic processes for various types of combined
mining. Authors also generate novel types of combined patterns from their proposed frameworks.
H. Yu et al. (2003) proposed a new method called as Clustering-Based SVM (CB-SVM), in
which, the whole data set can be scanned only once to have an SVM with samples that carry the
statistical information of the data by applying a hierarchical micro-clustering algorithm. Authors
also show that CB-SVM is also highly scalable for very large data sets and also generating very
high classification accuracy. Longbing Cao (2012) proposed combined mining as an approach
from the perspective of object and pattern relation analysis. This paper also discusses the
fundamental aspects of combined pattern mining like feature interaction, pattern dynamics,
pattern interaction, pattern impact, pattern structure etc. K. Kavitha et al. (2012) proposed a
generic framework that uses utility in decision making to drive the data mining process. Their
study proposes a novel approach to discover actionable combined patterns with composite items.
3. PROBLEM DEFINITION
Complex data may contain incredible information, which may not be mined directly by using a
single method and it is also tough to deal with such information using different perspective such
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
129
as client’s perspective, business analyst’s perspective and decision-makers perspective etc. Any
service provider wants to predict the client’s behavior to design the services according to client’s
perspective and also to reduce the traffic load. In our approach, we try to get patterns to retrieve
useful information from complex data. This information may be used in different places, for
example in e-commerce, stock market, market campaigns, measuring the success of marketing
efforts and client-company behavior etc.
4. PROPOSED SOLUTION
The effectiveness and quality of the patterns which have to be discovered highly depends on the
effectiveness and type of the data used during the pattern discovery process. We considered two
data-sets from UCI-Machine Learning Repository named as “Adult” and “Pen-Digits”. “Adult”
data-set have 14 attributes, some of them are discrete attributes, while others are continuous
attributes while “Pen-Digits” data-set have 17 attributes and all attributes are of type continuous
attributes. Our proposed solution consists of following steps:
4.1 Preprocessing & Data mining approach
Data-sets contains unknown value (present as ‘?’ in our data-set) for any of the attribute, then that
tuple should be removed first, as such tuples are the source for noise and errors. Removing all
such tuples from data-sets first, non-overlapping partitions of “Adult” dataset are generated so
that each of such partition can behave as a sub-dataset. The main idea behind generating sub-
datasets is that each of the sub-dataset can be used as a source of data for multi-source combined
mining approach. We have generated 210 sub-datasets for “Adult” dataset as try to have
maximum number of distinct attributes from all the sub-dataset on the basis of information gain
value and gathered maximum 8 distinct attributes (from ‫݁ݎݑ݃݅ܨ‬ 1), when we generated 210
partitions of “Adult” dataset. This paper considered another data set “Pen-Based Recognition of
Handwritten Digits” as un-partitioned for analysis. Figure 1 shows the partitions of “Adult” data-
set.
Figure 1. Number of partitions vs. Number of distinct attributes for Adult Data-Set
As stated earlier that input “Adult” dataset contains discrete as well as continuous attributes.
Information gain for discrete attributes in a partition ܲ have been computed as given below:
Expected information needed for the classification of a tuple in a partition ܲ, if the class label
attribute has n distinct values, each of which defines a distinct class [2], as follows:
IሺPሻ = − ෍ p୧logଶp୧
௡
௜ୀଵ
(1)
Where, ‫݌‬௜ is the probability that a particular tuple in partition ܲ belongs to class ‫ܥ‬௜ and we can
compute ‫݌‬௜ as ‫݌‬௜ = |‫ܥ‬௜,௣| / |ܲ| and where ‫ܥ‬௜,௣ defines the set of tuples of class ‫ܥ‬௜ in ܲ while
|‫ܥ‬௜,௣| and |ܲ| denotes the number of tuples in ‫ܥ‬௜,௣ and ܲ respectively.
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
130
Then we have to classify the tuples in ܲ on some attribute ‫ܣ‬ having m distinct values, as observed
from the training dataset. The amount of information would be required for an exact classification
[2] is measured by:
I୅ሺܲሻ = ෍ሺ|P୨|/|P|ሻ × IሺP୨ሻ
௠
௝ୀଵ
(2)
Then the information Gain ‫ܩ‬ሺ‫ܣ‬ሻ for attribute ‫ܣ‬ is measured as:
‫ܩ‬ሺ‫ܣ‬ሻ = ‫ܫ‬ሺܲሻ– ‫ܫ‬஺ሺܲሻ (3)
For computing the information gain for continuous attributes ‫ܤ‬ in a partition ܲ, compute the
information gain for every possible split-point for ‫ܤ‬ and then choosing the best split-point. We
considered split-point as a threshold on ‫.ܤ‬ First, we sorted the values of ‫ܤ‬ in increasing order and
then typically the mid-point between each pair of adjacent values considered as a possible split-
point. If there are ‫ݕ‬ values of ‫,ܤ‬ then ሺ‫ݕ‬ − 1ሻ possible splits has to be computed. The reason for
sorting the values of ‫ܤ‬ is that if the values are already sorted then for determining the best split
for ‫ܤ‬ requires only one pass through the values. Then, for each possible split-point for ‫,ܤ‬ by
using ݁‫݊݋݅ݐܽݑݍ‬ ሺ2ሻ, we measured ‫ܫ‬஺ሺܲሻ, where the number of value of ݉ = 2. The point with the
minimum expected information requirement for ‫ܤ‬ will be selected as the best split-point for ‫.ܤ‬
After finding out the information gain for each attribute, an attribute with maximum information
gain from each partition ܲ௜ are selected and then by concatenating the attribute values from all
partitions, a data-stream is formed, which serves as an input for lossy-counting algorithm. As
state earlier “Pen-Digits” data-set is having continuous attributes only, the information gain for all
continuous attributes are generated as stated above.
A. Lossy-counting Algorithm
Lossy-counting is a deterministic algorithm [2], which computes frequency counts over a stream
of data-items. It approximates the frequency of items or item-sets within a user-specified error
bound ߝ. If ܰ is the current length of the data-stream then this algorithm takes 1/ߝ݈‫݃݋‬ሺߝܰሻ space
in worst-case for computing the frequency counts of a single data-item. The steps are as follows:
Input: Support ‫,ݏ‬ error bound ߝ and input data-stream
Output: Set of data-items with frequency counts at least equals to ሺ‫ݏ‬ − ߝሻܰ
Step 1: The input data-stream logically divided into the buckets of width ‫ݓ‬ = ݈ܿ݁݅ሺ1/ ߝሻ and
each bucket is labeled with bucket id, initially starting from 1 for the first bucket, and the current
bucket id is denoted by ‫ܤ‬௖௨௥௥௘௡௧, which is equal to ݈ܿ݁݅ሺܰ/ ߝሻ.
Step 2: Then maintain a data-structure ‫,ܵܦ‬ which is a set of values of the form ሺ‫,ܧ‬ ‫ܨ‬ா, ߜሻ, where,
‫ܧ‬ is an element from the input data-stream and ‫ܨ‬ா is the true frequency of the element ‫ܧ‬ and ߜ
denotes the maximum number of times ‫ܧ‬ could have occurred in first ‫ܤ‬௖௨௥௥௘௡௧ − 1 buckets.
Initially ‫ܵܦ‬ will be empty.
Step 3: For an element from the data-stream, if ‫ܧ‬ already exists in ‫ܵܦ‬ then increase its ‫ܨ‬ா by 1
else we have to create a new entry in ‫ܵܦ‬ such as ሺ‫,ܧ‬ 1, ‫ܤ‬௖௨௥௥௘௡௧ − 1ሻ.
Step 4: If it is the bucket boundary then we have to prune ‫ܵܦ‬ as follows:
if ‫ܨ‬ா + ߜ ≤ ‫ܤ‬௖௨௥௥௘௡௧, then the entry ሺ‫,ܧ‬ ‫ܨ‬ா, ߜሻ has to be deleted from ‫.ܵܦ‬
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
131
Step 5: When a user wants a final list of frequent data-items with support ‫,ݏ‬ then output all those
entries in ‫ܵܦ‬ with ‫ܨ‬ா ≥ ሺ‫ݏ‬ − ߝሻܰ.
Then, we generate the combined association rules [C. Zhang et al, 2011] of the frequent data-
items computed by Lossy-counting algorithm.
4.2 Multi-feature mining approach
Generating combined association rules, we consider the heterogeneous features of different data
types as well as of different data categories. If combined association rule is of the form
“‫ܨܫ‬ ܺ ܶ‫ܰܧܪ‬ ܻ”, where ܺ is the antecedent and ܻ is the consequent part of the rule, then we have
some traditional definitions for support, confidence and lift of the rule as given below in the Table
1.
Table 1. Support, Confidence and Lift for the Rule ܺ → ܻ
ܷܱܴܵܲܲܶ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ
‫ܧܥܰܧܦܫܨܱܰܥ‬ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ / ܲ‫ܾ݋ݎ‬ሺܺሻ
‫ܶܨܫܮ‬ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ / ሺܲ‫ܾ݋ݎ‬ሺܺሻ × ܲ‫ܾ݋ݎ‬ሺܻሻሻ
On the basis of these traditional definitions of support, confidence and lift, we can compute the
Contribution and Interestingness [1] ‫ܫ‬ோ௎௅ாof the rule ܺ௣ܷܺோ → ܻ as follows:
‫݊݋݅ݐݑܾ݅ݎݐ݊݋ܥ‬ ሺܺ௉ܷܺோ → ܻሻ = ‫ܶܨܫܮ‬ ሺܺ௉ ܷ ܺோ → ܻሻ / ‫ܶܨܫܮ‬ ሺܺ௉ → ܻሻ
= ‫ܧܥܰܧܦܫܨܱܰܥ‬ ሺܺ௉ ܷ ܺோ → ܻሻ/‫ܧܥܰܧܦܫܨܱܰܥ‬ ሺܺ௉ → ܻሻ (4)
‫ܫ‬ோ௎௅ா ሺܺ௉ܷ ܺோ → ܻሻ = ‫݊݋݅ݐݑܾ݅ݎݐ݊݋ܥ‬ ሺܺ௉ ܷ ܺோ → ܻሻ / ‫ܶܨܫܮ‬ ሺܺோ → ܻሻ (5)
Where, ‫ܫ‬ோ௎௅ா indicates whether the Contribution of ܺ௉ (or ܺோ) to the occurrence of ܻ increases,
while considering ܺோ (or ܺ௉) as a precondition to the rule. To get more information, we also
generate pair patterns, cluster pattern, incremental pair-pattern and incremental cluster-patterns
[C. Zhang et al, 2011], and their respective contribution and interestingness matrices. In case of
pair pattern, two atomic rules are taken to form a pair-pattern if and only if the two atomic rules
have at least one common data-item in their antecedent parts and after removing those common
data-item/data-items from the atomic rules the antecedent parts of none of the atomic rules should
be null. In case of incremental pair-pattern, removing the common data-item/data-items from the
pair-patterns and considered the common data-items as a pre-condition. In case of cluster pattern
formation, we have included as much as rules in a single cluster on the basis of the common data-
item/data-items in their antecedent parts and also have taken care of the fact that after removing
the common data-item/data-items from the antecedent parts of the respective rules, the antecedent
part of the rules should not be empty or null. In case of incremental cluster-patterns, we removed
the common data-item/data-items from the respective rules in a cluster and considered the
common data-item/data-items as a precondition for that particular cluster of rules.
4.3 Multi-method mining approach
In this approach, we first generate the association rules by using FP-growth algorithm [Han et al.
(2006)] and then create the Bayesian belief network [J. Cheng et al. (1997)]. The testing data-set
are classified by Bayesian belief network during testing phase.
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
132
5. RESULT AND ANLAYSIS
Figure 2. Support vs. Run-Time Figure 3. Support vs. Correct Classification %
Figure 4. Confidence vs. Run-Time Figure 5. Confidence vs. Correct Classification %
Figure 6. Support vs. Frequent Data Figure 7. Support vs. No. of Pair Patterns
Items, No. of Association Rules & No. of Incremental Pair Patterns for
& No. of Clusters for “Adult” data-set “Adult” data-set
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
Figure 8. Support vs. Frequent Data
Items, No. of Association Rules & No. of
Clusters for “Pen-Digits” Data
“Adult” data-set with 32,561 entries
get selected for mining the useful information.
entries and after preprocessing step,
any noisy data or missing value for any attribute
information by including other perspective like client’s perspective etc. In
shown the variation of run-time (in seconds) and correct classification percentage by
belief network vs. support by keeping
and in ݂݅݃. 4 & 5, we have shown the variation o
percentage by Bayesian belief network vs. confidence by keeping
FP-growth implementation for “Adult” data
minimum support vs. the number
number of association rules and number
multi-feature mining approach
0.01% for Lossy Counting Algorithm
variation of the minimum support vs. the number of pair patterns along with the variation of
number of incremental pair patterns
and ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01%. In
vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of
association rules and number of total clusters obtained by the implementation of the multi
mining approach by keeping the
Counting Algorithm for “Pen-D
minimum support vs. the number of pair patterns along with the variation of number of
incremental pair patterns for “Pen
݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01%. From
frequent data items more will the number of association rules and more will be the number of
pair-patterns. The obtained numbers of clusters also have the dependency on the number of
association rules obtained and the similarity between the association rules in
common data-items in the antecedent part of the association rules.
6. CONCLUSION AND FUTURE
The support has impact on run-time as well as correct classification percen
network. Our approach identified combined patterns, which are more informative, actionable and
impact-oriented as compared to any single patterns identified by traditional methods
be such frameworks, which are flexible an
data, for which data sampling and table joining may not be
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
Figure 8. Support vs. Frequent Data Figure 9. Support vs. No. of Pair Patterns
Items, No. of Association Rules & No. of & No. of Incremental Pair Patterns
Digits” Data-set “Pen-Digits” Data-set
entries is considered and after preprocessing step, 30
for mining the useful information. Then considering “Pen-Digits” data-set with
entries and after preprocessing step, all entries get selected as “Pen-Digits” data-set doesn’t have
or missing value for any attribute. We performed mining task and gen
information by including other perspective like client’s perspective etc. In ݂݅݃. 2 &
time (in seconds) and correct classification percentage by
belief network vs. support by keeping ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 90% fixed for FP-growth implementation
, we have shown the variation of run-time (in seconds) and correct classification
belief network vs. confidence by keeping ‫ݐݎ݋݌݌ݑݏ‬ = 90
for “Adult” data-set. In ݂݅݃. 6, we have shown the variation of the
number of frequent data items obtained by Lossy-counting
rules and number of total clusters obtained by the implementation of the
by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95% and ݁‫ݎ݋ݎݎ‬
for Lossy Counting Algorithm for “Adult” data-set. In ݂݅݃. 7, we have shown the
variation of the minimum support vs. the number of pair patterns along with the variation of
number of incremental pair patterns for “Adult” data-set by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬
In ݂݅݃. 8, we have shown the variation of the minimum support
vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of
and number of total clusters obtained by the implementation of the multi
mining approach by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95% and ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01
Digits” data-set. In ݂݅݃. 7, we have shown the variation of the
minimum support vs. the number of pair patterns along with the variation of number of
Pen-Digits” data-set by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬
From ݂݅݃. 6, 7,8 & 9, we can conclude that more the number of
frequent data items more will the number of association rules and more will be the number of
patterns. The obtained numbers of clusters also have the dependency on the number of
ained and the similarity between the association rules in-
items in the antecedent part of the association rules.
UTURE WORK
time as well as correct classification percentage by Bayesian belief
identified combined patterns, which are more informative, actionable and
oriented as compared to any single patterns identified by traditional methods
such frameworks, which are flexible and customizable for handling a large amount of complex
data, for which data sampling and table joining may not be required. We further
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
133
Figure 9. Support vs. No. of Pair Patterns
for
30,162 entries
set with 7,494
set doesn’t have
ing task and generated the
& 3, we have
time (in seconds) and correct classification percentage by Bayesian
growth implementation
time (in seconds) and correct classification
90 % fixed for
we have shown the variation of the
ounting algorithm,
clusters obtained by the implementation of the
݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ =
, we have shown the
variation of the minimum support vs. the number of pair patterns along with the variation of
ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95%
, we have shown the variation of the minimum support
vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of
and number of total clusters obtained by the implementation of the multi-feature
01% for Lossy
shown the variation of the
minimum support vs. the number of pair patterns along with the variation of number of
= 95% and
, we can conclude that more the number of
frequent data items more will the number of association rules and more will be the number of
patterns. The obtained numbers of clusters also have the dependency on the number of
-terms of the
tage by Bayesian belief
identified combined patterns, which are more informative, actionable and
oriented as compared to any single patterns identified by traditional methods. There may
d customizable for handling a large amount of complex
. We further may develop
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
134
some effective paradigms, for handling large and multiple sources of data available in industry,
insurance, stock market, e-commerce and banking etc. in real-time.
REFERENCES
[1] C. Zhang, D. Luo, H. Zhang, L. Cao and Y. Zhao (2011), ‘Combined Mining: Discovering
Informative Knowledge in Complex Data’, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND
CYBERNETICS—PART B: CYBERNETICS, VOL. 41, NO. 3, pp. 699-712.
[2] Han and Kamber (2006), ‘Data Mining Concepts and Techniques’, 2nd
ed., United State of America.
[3] C. Zhang, F. Figueiredo, H. Zhang, L. Cao and Y. Zhao (2007), ‘Mining for combined association
rules on multiple datasets’, in Proc. DDDM, pp. 18–23.
[4] C. Zhang, H. Bohlscheid, H. Zhang, L. Cao and Y. Zhao (2008), ‘Combined pattern mining: From
learned rules to actionable knowledge’, in Proc. AI, pp. 393–403.
[5] B. Park and H. Kargupta (2002), ‘Distributed Data Mining: Algorithms, Systems, and Applications’.
Data Mining Handbook, N. Ye, Ed 2002.
[6] Jaturon Chattratichat, John Darlington, et al (1999). ‘An Architecture for Distributed Enterprise Data
Mining’. Proceedings of the 7th International Conference on High-Performance Computing and
Networking.
[7] B. Park, D. Hershbereger, E. Johnson and H. Kargupta (1999), ‘Collective data mining: A new
perspective toward distributed data mining’. Accepted in the Advances in Distributed Data Mining,
Eds: Hillol Kargupta and Philip Chan, AAAI/MIT Press (1999).
[8] G. Dong and J. Li (1999), ‘Efficient mining of emerging patterns: Discovering trends and
differences,’ in Proc. KDD, pp. 43–52.
[9] H. Cheng, J. Han, J. Gao, K. Zhang, O. Verscheure, P. Yu, W. Fan and X. Yan (2008), ‘Direct mining
of discriminative and essential graphical and item-set features via model-based search tree,’ in Proc.
KDD, pp. 230–238.
[10] B. Liu, W. Hsu and Y. Ma (1999), ‘Pruning and summarizing the discovered associations,’ in Proc.
KDD, pp. 125–134.
[11] A. N. Swami, B. Lent and J. Widom (1997), ‘Clustering association rules,’ in Proc. ICDE, pp. 220–
231.
[12] J. Han, J. Yang, P. S. Yu and X. Yin (2006), ‘Efficient classification across multiple database
relations: A CrossMine approach,’ IEEE Trans. Knowl. Data Eng., vol. 18, no. 6, pp. 770–783.
[13] C. Zhang, H. Bohlscheid, H. Zhang, L. Cao and Y. Zhao (2008), ‘Combined pattern mining: From
learned rules to actionable knowledge,’ in Proc. AI, pp. 393–403.
[14] H. Yu, J. Han and J. Yang (2003), ‘Classifying large data sets using SVM with hierarchical clusters,’
in Proc. KDD, pp. 306–315.
[15] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.
[16] David A. Bell, Jie Cheng and Weiru Liu (1997), ‘An Algorithm for Bayesian Belief Network
Construction from Data,’ In Proceedings of AI & STAT'97, pp. 83-90.
[17] Dr. E. kumar and A. Solanki (2010), ‘A Combined Mining Approach and Application in Tax
Administration,’ International Journal of Engineering and Technology Vol.2(2), 2010, 38-44.
[18] Longbing Cao (2012), ‘Combined Mining: Analyzing Object and Pattern Relations for Discovering
Actionable Complex Patterns,’ sponsored by Australian Research Council Discovery Grants
(DP1096218 and DP130102691) and an ARC Linkage Grant(LP100200774).
[19] Kavitha, K. and Ramaraj, E. (2012), ‘Mining Actionable Patterns using Combined Association
Rules,’ International Journal of Current Research, Vol. 4, Issue, 03, pp.117-120, March, 2012.
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
Biographical Notes:
Sumit Kumar received his B.Tech
Technology, Allahabad, India. He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His
research interest areas are distributed systems, data mining, databases and computer
networks.
Manish Kumar received his PhD from Indian Institute of
Allahabad, India in Data Management in Wireless Sensor Networks. He is an
Assistant Professor at Indian Institute of Information Technology, Allahabad, India.
His research interest areas are databases, data management in sensor networks, and
data mining.
Sweety received her B.Tech (Information Technology) from Indian Institute of
Information Technology, Allahabad, India. She is a software developer at K
India Pvt. Ltd., Hyderabad, India. Her research interest areas are
databases and operating systems.
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
Sumit Kumar received his B.Tech (Information Technology) from Indian Institute of Information
He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His
research interest areas are distributed systems, data mining, databases and computer
Manish Kumar received his PhD from Indian Institute of Information Technology,
ad, India in Data Management in Wireless Sensor Networks. He is an
Assistant Professor at Indian Institute of Information Technology, Allahabad, India.
His research interest areas are databases, data management in sensor networks, and
B.Tech (Information Technology) from Indian Institute of
Information Technology, Allahabad, India. She is a software developer at Kony
India Pvt. Ltd., Hyderabad, India. Her research interest areas are data mining,
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013
135
from Indian Institute of Information
He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His

More Related Content

What's hot

Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaIJDKP
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd Iaetsd
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurIAEME Publication
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION cscpconf
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysisinventy
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXINGIJDKP
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
 
Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...IJCSIS Research Publications
 
A genetic based research framework 3
A genetic based research framework 3A genetic based research framework 3
A genetic based research framework 3prj_publication
 
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...cscpconf
 

What's hot (20)

Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging area
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
 
I1802055259
I1802055259I1802055259
I1802055259
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysis
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXING
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 
K355662
K355662K355662
K355662
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
 
A genetic based research framework 3
A genetic based research framework 3A genetic based research framework 3
A genetic based research framework 3
 
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
 
G045033841
G045033841G045033841
G045033841
 

Similar to PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING

MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
REVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesREVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesEditor IJMTER
 
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...iosrjce
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MININGcsandit
 
Classifier Model using Artificial Neural Network
Classifier Model using Artificial Neural NetworkClassifier Model using Artificial Neural Network
Classifier Model using Artificial Neural NetworkAI Publications
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESSIJDKP
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESSIJDKP
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
 
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...IAEME Publication
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
 
Cross Domain Data Fusion
Cross Domain Data FusionCross Domain Data Fusion
Cross Domain Data FusionIRJET Journal
 
The International Journal of Engineering and Science
The International Journal of Engineering and ScienceThe International Journal of Engineering and Science
The International Journal of Engineering and Sciencetheijes
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsIJERA Editor
 
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEMA NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEMIJNSA Journal
 
A new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemA new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemIJNSA Journal
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...IJSRD
 

Similar to PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING (20)

MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
REVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesREVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining Techniques
 
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
 
H017625354
H017625354H017625354
H017625354
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
 
Classifier Model using Artificial Neural Network
Classifier Model using Artificial Neural NetworkClassifier Model using Artificial Neural Network
Classifier Model using Artificial Neural Network
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
Cross Domain Data Fusion
Cross Domain Data FusionCross Domain Data Fusion
Cross Domain Data Fusion
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
The International Journal of Engineering and Science
The International Journal of Engineering and ScienceThe International Journal of Engineering and Science
The International Journal of Engineering and Science
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data Streams
 
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEMA NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM
A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM
 
A new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender systemA new hybrid algorithm for business intelligence recommender system
A new hybrid algorithm for business intelligence recommender system
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...
A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix...
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING

  • 1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 DOI : 10.5121/ijdkp.2013.3409 127 PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING Manish Kumar1 , Sumit Kumar2 , Sweety1 1 IIIT-Allahabad, India, 2 IVY Comptech Pvt. Ltd.-Hyderabad [manish.singh76,sumit78kumar90,sweety.bhagat89]@gmail.com ABSTRACT Combined mining is a hybrid mining approach for mining informative patterns from single or multiple data-sources, multiple-features extraction and applying multiple-methods as per the requirements. Data mining applications often involve complex data like multiple heterogeneous data sources, different user preference and create decision-making actions. The complete useful information may not be obtained by using single data mining method in the form of informative patterns as that would consume more time and space. This paper implements hybrid or combined mining approach that applies Lossy-counting algorithm on each data-source to get the frequent data item-sets and then generates the combined association rules. Applying multi-feature approach, we generate incremental pair patterns and incremental cluster patterns. In multi-method combined mining approach, FP-growth and Bayesian Belief Network are combined to generate classifier to get more informative knowledge. This paper uses two different data-sets to get more useful knowledge and compare the results. KEYWORDS Association Rule Mining, Lossy-Counting Algorithm, Incremental Pair-Patterns, Incremental Cluster- Patterns, Bayesian Belief Network. 1. INTRODUCTION In distributed data mining algorithms [Kargupta and Park (2002)], data sampling is generally not accepted because it may miss useful data that may be filtered out during sampling. Distributed data sets need to join into one large data set but process may be more time and space consuming. More often such approach of handling multiple data sources can only be developed for specific cases and cannot be applied for all domain problems. Combined mining is a two-to-multistep data mining approach: In first step, it involves mining the atomic patterns from each individual data source and then next step combines atomic patterns into combined-patterns, which is more suitable for a particular problem. In multi-source combined mining approach, it generates informative patterns from individual data source and then generates the combined patterns. In multi-feature combined mining approach, we consider features from multiple data sets while generating the informative patterns, where it is necessary in order to make the patterns more actionable. In case of cluster patterns, we made the cluster of patterns with same prefix but the remaining data items in the pattern reflect results to be different. The advantage of our approach, it does not apply any pruning method or any clustering method separately to get the more informative patterns. In Lossy-counting algorithm’s implementation, data sources have already been pruned at the boundary that directs most similar data items in the same bucket itself. 2. RELATED WORKS Kargupta and Park (2002) provide an overview of distributed data mining algorithms, systems and applications. The paper pointed out a mismatch between the architecture of most off-the-shelf
  • 2. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 128 data mining systems and the needs of mining systems for distributed applications. It also claims that such a mismatch may cause a fundamental bottleneck in many distributed applications. Kargupta et al (1999) presented a framework of collective data mining to conduct distributed data mining from heterogeneous sites. Authors observed that in a heterogeneous environment, naïve approaches to distributed data analysis may lead to incorrect data-model. Chattratichat et al (1999) designed Kensington software architecture for distributed enterprise data mining, which addresses the problem of data mining on logical and physical distribution of data and heterogeneous computational resources. Karypis and Wang (2005) present a new classifier, HARMONY, which is an example of direct mining for informative patterns as it directly mines the resultant set of rules required for classification. G. Dong and J. Li (1999) introduce a new type of patterns i.e. emerging patterns (EPs), for discovering knowledge from databases. Authors define EPs as data item-sets whose support increases more significantly from one to other data- set. They have used EPs to build very powerful classifiers. W. Fan et al (2008) builds a model based search tree, which partitions the data onto different nodes and at each node, it directly find out a discriminative pattern, which further divide its examples into more purer subsets. A novel technique was proposed by B. Liu et al.(1999), which first prunes the discovered association-rules to remove the insignificant association-rules from the entire set of association-rules, and then finds a subset of the un-pruned association-rules by which a summary of the discovered association-rules can be formed. The paper refers it as subset of association-rules as the direction setting (DS) rules because they can be used to set the directions, which are followed by the rest of the association-rules. By the help of the summary, the user can have more focus on the important aspects of the particular domain and also can view the relevant details. They suggest that their approach is effective as their experimental result shows that the set of DS rules is quite very small. Lent et al. (1997) proposed a method for clustering two-dimensional associations in large data-bases. This paper presents a geometric-based algorithm called BitOp, for clustering, embedded within ARCS (Association Rule Clustering System). It also measures the quality of the segmentation generated by ARCS. J. Han et al (2006) proposed a new approach called CrossMine, which mainly includes a set of novel and powerful methods for multi-relational classification including 1) tuple ID propagation, 2) new definitions for predicates and decision- tree nodes and 3) a selective sampling method. This paper also proposed two accurate and scalable methods for multi-relational classification i.e. CrossMine-Rule and CrossMine-Tree. C. Zhang et al (2008) proposed a novel approach of combined patterns to extract important, actionable and impact oriented information from a large amount of association rules. Authors also proposed definitions of combined patterns and also designed novel matrices to measure their interestingness and analyzed the redundancy in combined patterns. Combined mining as a general approach is proposed by C. Zhang et al (2011) to mine the informative patterns. This paper summarizes general framework, paradigms and basic processes for various types of combined mining. Authors also generate novel types of combined patterns from their proposed frameworks. H. Yu et al. (2003) proposed a new method called as Clustering-Based SVM (CB-SVM), in which, the whole data set can be scanned only once to have an SVM with samples that carry the statistical information of the data by applying a hierarchical micro-clustering algorithm. Authors also show that CB-SVM is also highly scalable for very large data sets and also generating very high classification accuracy. Longbing Cao (2012) proposed combined mining as an approach from the perspective of object and pattern relation analysis. This paper also discusses the fundamental aspects of combined pattern mining like feature interaction, pattern dynamics, pattern interaction, pattern impact, pattern structure etc. K. Kavitha et al. (2012) proposed a generic framework that uses utility in decision making to drive the data mining process. Their study proposes a novel approach to discover actionable combined patterns with composite items. 3. PROBLEM DEFINITION Complex data may contain incredible information, which may not be mined directly by using a single method and it is also tough to deal with such information using different perspective such
  • 3. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 129 as client’s perspective, business analyst’s perspective and decision-makers perspective etc. Any service provider wants to predict the client’s behavior to design the services according to client’s perspective and also to reduce the traffic load. In our approach, we try to get patterns to retrieve useful information from complex data. This information may be used in different places, for example in e-commerce, stock market, market campaigns, measuring the success of marketing efforts and client-company behavior etc. 4. PROPOSED SOLUTION The effectiveness and quality of the patterns which have to be discovered highly depends on the effectiveness and type of the data used during the pattern discovery process. We considered two data-sets from UCI-Machine Learning Repository named as “Adult” and “Pen-Digits”. “Adult” data-set have 14 attributes, some of them are discrete attributes, while others are continuous attributes while “Pen-Digits” data-set have 17 attributes and all attributes are of type continuous attributes. Our proposed solution consists of following steps: 4.1 Preprocessing & Data mining approach Data-sets contains unknown value (present as ‘?’ in our data-set) for any of the attribute, then that tuple should be removed first, as such tuples are the source for noise and errors. Removing all such tuples from data-sets first, non-overlapping partitions of “Adult” dataset are generated so that each of such partition can behave as a sub-dataset. The main idea behind generating sub- datasets is that each of the sub-dataset can be used as a source of data for multi-source combined mining approach. We have generated 210 sub-datasets for “Adult” dataset as try to have maximum number of distinct attributes from all the sub-dataset on the basis of information gain value and gathered maximum 8 distinct attributes (from ‫݁ݎݑ݃݅ܨ‬ 1), when we generated 210 partitions of “Adult” dataset. This paper considered another data set “Pen-Based Recognition of Handwritten Digits” as un-partitioned for analysis. Figure 1 shows the partitions of “Adult” data- set. Figure 1. Number of partitions vs. Number of distinct attributes for Adult Data-Set As stated earlier that input “Adult” dataset contains discrete as well as continuous attributes. Information gain for discrete attributes in a partition ܲ have been computed as given below: Expected information needed for the classification of a tuple in a partition ܲ, if the class label attribute has n distinct values, each of which defines a distinct class [2], as follows: IሺPሻ = − ෍ p୧logଶp୧ ௡ ௜ୀଵ (1) Where, ‫݌‬௜ is the probability that a particular tuple in partition ܲ belongs to class ‫ܥ‬௜ and we can compute ‫݌‬௜ as ‫݌‬௜ = |‫ܥ‬௜,௣| / |ܲ| and where ‫ܥ‬௜,௣ defines the set of tuples of class ‫ܥ‬௜ in ܲ while |‫ܥ‬௜,௣| and |ܲ| denotes the number of tuples in ‫ܥ‬௜,௣ and ܲ respectively.
  • 4. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 130 Then we have to classify the tuples in ܲ on some attribute ‫ܣ‬ having m distinct values, as observed from the training dataset. The amount of information would be required for an exact classification [2] is measured by: I୅ሺܲሻ = ෍ሺ|P୨|/|P|ሻ × IሺP୨ሻ ௠ ௝ୀଵ (2) Then the information Gain ‫ܩ‬ሺ‫ܣ‬ሻ for attribute ‫ܣ‬ is measured as: ‫ܩ‬ሺ‫ܣ‬ሻ = ‫ܫ‬ሺܲሻ– ‫ܫ‬஺ሺܲሻ (3) For computing the information gain for continuous attributes ‫ܤ‬ in a partition ܲ, compute the information gain for every possible split-point for ‫ܤ‬ and then choosing the best split-point. We considered split-point as a threshold on ‫.ܤ‬ First, we sorted the values of ‫ܤ‬ in increasing order and then typically the mid-point between each pair of adjacent values considered as a possible split- point. If there are ‫ݕ‬ values of ‫,ܤ‬ then ሺ‫ݕ‬ − 1ሻ possible splits has to be computed. The reason for sorting the values of ‫ܤ‬ is that if the values are already sorted then for determining the best split for ‫ܤ‬ requires only one pass through the values. Then, for each possible split-point for ‫,ܤ‬ by using ݁‫݊݋݅ݐܽݑݍ‬ ሺ2ሻ, we measured ‫ܫ‬஺ሺܲሻ, where the number of value of ݉ = 2. The point with the minimum expected information requirement for ‫ܤ‬ will be selected as the best split-point for ‫.ܤ‬ After finding out the information gain for each attribute, an attribute with maximum information gain from each partition ܲ௜ are selected and then by concatenating the attribute values from all partitions, a data-stream is formed, which serves as an input for lossy-counting algorithm. As state earlier “Pen-Digits” data-set is having continuous attributes only, the information gain for all continuous attributes are generated as stated above. A. Lossy-counting Algorithm Lossy-counting is a deterministic algorithm [2], which computes frequency counts over a stream of data-items. It approximates the frequency of items or item-sets within a user-specified error bound ߝ. If ܰ is the current length of the data-stream then this algorithm takes 1/ߝ݈‫݃݋‬ሺߝܰሻ space in worst-case for computing the frequency counts of a single data-item. The steps are as follows: Input: Support ‫,ݏ‬ error bound ߝ and input data-stream Output: Set of data-items with frequency counts at least equals to ሺ‫ݏ‬ − ߝሻܰ Step 1: The input data-stream logically divided into the buckets of width ‫ݓ‬ = ݈ܿ݁݅ሺ1/ ߝሻ and each bucket is labeled with bucket id, initially starting from 1 for the first bucket, and the current bucket id is denoted by ‫ܤ‬௖௨௥௥௘௡௧, which is equal to ݈ܿ݁݅ሺܰ/ ߝሻ. Step 2: Then maintain a data-structure ‫,ܵܦ‬ which is a set of values of the form ሺ‫,ܧ‬ ‫ܨ‬ா, ߜሻ, where, ‫ܧ‬ is an element from the input data-stream and ‫ܨ‬ா is the true frequency of the element ‫ܧ‬ and ߜ denotes the maximum number of times ‫ܧ‬ could have occurred in first ‫ܤ‬௖௨௥௥௘௡௧ − 1 buckets. Initially ‫ܵܦ‬ will be empty. Step 3: For an element from the data-stream, if ‫ܧ‬ already exists in ‫ܵܦ‬ then increase its ‫ܨ‬ா by 1 else we have to create a new entry in ‫ܵܦ‬ such as ሺ‫,ܧ‬ 1, ‫ܤ‬௖௨௥௥௘௡௧ − 1ሻ. Step 4: If it is the bucket boundary then we have to prune ‫ܵܦ‬ as follows: if ‫ܨ‬ா + ߜ ≤ ‫ܤ‬௖௨௥௥௘௡௧, then the entry ሺ‫,ܧ‬ ‫ܨ‬ா, ߜሻ has to be deleted from ‫.ܵܦ‬
  • 5. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 131 Step 5: When a user wants a final list of frequent data-items with support ‫,ݏ‬ then output all those entries in ‫ܵܦ‬ with ‫ܨ‬ா ≥ ሺ‫ݏ‬ − ߝሻܰ. Then, we generate the combined association rules [C. Zhang et al, 2011] of the frequent data- items computed by Lossy-counting algorithm. 4.2 Multi-feature mining approach Generating combined association rules, we consider the heterogeneous features of different data types as well as of different data categories. If combined association rule is of the form “‫ܨܫ‬ ܺ ܶ‫ܰܧܪ‬ ܻ”, where ܺ is the antecedent and ܻ is the consequent part of the rule, then we have some traditional definitions for support, confidence and lift of the rule as given below in the Table 1. Table 1. Support, Confidence and Lift for the Rule ܺ → ܻ ܷܱܴܵܲܲܶ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ ‫ܧܥܰܧܦܫܨܱܰܥ‬ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ / ܲ‫ܾ݋ݎ‬ሺܺሻ ‫ܶܨܫܮ‬ ܲ‫ܾ݋ݎ‬ሺܺ ܷ ܻሻ / ሺܲ‫ܾ݋ݎ‬ሺܺሻ × ܲ‫ܾ݋ݎ‬ሺܻሻሻ On the basis of these traditional definitions of support, confidence and lift, we can compute the Contribution and Interestingness [1] ‫ܫ‬ோ௎௅ாof the rule ܺ௣ܷܺோ → ܻ as follows: ‫݊݋݅ݐݑܾ݅ݎݐ݊݋ܥ‬ ሺܺ௉ܷܺோ → ܻሻ = ‫ܶܨܫܮ‬ ሺܺ௉ ܷ ܺோ → ܻሻ / ‫ܶܨܫܮ‬ ሺܺ௉ → ܻሻ = ‫ܧܥܰܧܦܫܨܱܰܥ‬ ሺܺ௉ ܷ ܺோ → ܻሻ/‫ܧܥܰܧܦܫܨܱܰܥ‬ ሺܺ௉ → ܻሻ (4) ‫ܫ‬ோ௎௅ா ሺܺ௉ܷ ܺோ → ܻሻ = ‫݊݋݅ݐݑܾ݅ݎݐ݊݋ܥ‬ ሺܺ௉ ܷ ܺோ → ܻሻ / ‫ܶܨܫܮ‬ ሺܺோ → ܻሻ (5) Where, ‫ܫ‬ோ௎௅ா indicates whether the Contribution of ܺ௉ (or ܺோ) to the occurrence of ܻ increases, while considering ܺோ (or ܺ௉) as a precondition to the rule. To get more information, we also generate pair patterns, cluster pattern, incremental pair-pattern and incremental cluster-patterns [C. Zhang et al, 2011], and their respective contribution and interestingness matrices. In case of pair pattern, two atomic rules are taken to form a pair-pattern if and only if the two atomic rules have at least one common data-item in their antecedent parts and after removing those common data-item/data-items from the atomic rules the antecedent parts of none of the atomic rules should be null. In case of incremental pair-pattern, removing the common data-item/data-items from the pair-patterns and considered the common data-items as a pre-condition. In case of cluster pattern formation, we have included as much as rules in a single cluster on the basis of the common data- item/data-items in their antecedent parts and also have taken care of the fact that after removing the common data-item/data-items from the antecedent parts of the respective rules, the antecedent part of the rules should not be empty or null. In case of incremental cluster-patterns, we removed the common data-item/data-items from the respective rules in a cluster and considered the common data-item/data-items as a precondition for that particular cluster of rules. 4.3 Multi-method mining approach In this approach, we first generate the association rules by using FP-growth algorithm [Han et al. (2006)] and then create the Bayesian belief network [J. Cheng et al. (1997)]. The testing data-set are classified by Bayesian belief network during testing phase.
  • 6. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 132 5. RESULT AND ANLAYSIS Figure 2. Support vs. Run-Time Figure 3. Support vs. Correct Classification % Figure 4. Confidence vs. Run-Time Figure 5. Confidence vs. Correct Classification % Figure 6. Support vs. Frequent Data Figure 7. Support vs. No. of Pair Patterns Items, No. of Association Rules & No. of Incremental Pair Patterns for & No. of Clusters for “Adult” data-set “Adult” data-set
  • 7. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 Figure 8. Support vs. Frequent Data Items, No. of Association Rules & No. of Clusters for “Pen-Digits” Data “Adult” data-set with 32,561 entries get selected for mining the useful information. entries and after preprocessing step, any noisy data or missing value for any attribute information by including other perspective like client’s perspective etc. In shown the variation of run-time (in seconds) and correct classification percentage by belief network vs. support by keeping and in ݂݅݃. 4 & 5, we have shown the variation o percentage by Bayesian belief network vs. confidence by keeping FP-growth implementation for “Adult” data minimum support vs. the number number of association rules and number multi-feature mining approach 0.01% for Lossy Counting Algorithm variation of the minimum support vs. the number of pair patterns along with the variation of number of incremental pair patterns and ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01%. In vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of association rules and number of total clusters obtained by the implementation of the multi mining approach by keeping the Counting Algorithm for “Pen-D minimum support vs. the number of pair patterns along with the variation of number of incremental pair patterns for “Pen ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01%. From frequent data items more will the number of association rules and more will be the number of pair-patterns. The obtained numbers of clusters also have the dependency on the number of association rules obtained and the similarity between the association rules in common data-items in the antecedent part of the association rules. 6. CONCLUSION AND FUTURE The support has impact on run-time as well as correct classification percen network. Our approach identified combined patterns, which are more informative, actionable and impact-oriented as compared to any single patterns identified by traditional methods be such frameworks, which are flexible an data, for which data sampling and table joining may not be International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 Figure 8. Support vs. Frequent Data Figure 9. Support vs. No. of Pair Patterns Items, No. of Association Rules & No. of & No. of Incremental Pair Patterns Digits” Data-set “Pen-Digits” Data-set entries is considered and after preprocessing step, 30 for mining the useful information. Then considering “Pen-Digits” data-set with entries and after preprocessing step, all entries get selected as “Pen-Digits” data-set doesn’t have or missing value for any attribute. We performed mining task and gen information by including other perspective like client’s perspective etc. In ݂݅݃. 2 & time (in seconds) and correct classification percentage by belief network vs. support by keeping ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 90% fixed for FP-growth implementation , we have shown the variation of run-time (in seconds) and correct classification belief network vs. confidence by keeping ‫ݐݎ݋݌݌ݑݏ‬ = 90 for “Adult” data-set. In ݂݅݃. 6, we have shown the variation of the number of frequent data items obtained by Lossy-counting rules and number of total clusters obtained by the implementation of the by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95% and ݁‫ݎ݋ݎݎ‬ for Lossy Counting Algorithm for “Adult” data-set. In ݂݅݃. 7, we have shown the variation of the minimum support vs. the number of pair patterns along with the variation of number of incremental pair patterns for “Adult” data-set by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ In ݂݅݃. 8, we have shown the variation of the minimum support vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of and number of total clusters obtained by the implementation of the multi mining approach by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95% and ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = 0.01 Digits” data-set. In ݂݅݃. 7, we have shown the variation of the minimum support vs. the number of pair patterns along with the variation of number of Pen-Digits” data-set by keeping the ܿ‫݂݁ܿ݊݁݀݅݊݋‬ From ݂݅݃. 6, 7,8 & 9, we can conclude that more the number of frequent data items more will the number of association rules and more will be the number of patterns. The obtained numbers of clusters also have the dependency on the number of ained and the similarity between the association rules in- items in the antecedent part of the association rules. UTURE WORK time as well as correct classification percentage by Bayesian belief identified combined patterns, which are more informative, actionable and oriented as compared to any single patterns identified by traditional methods such frameworks, which are flexible and customizable for handling a large amount of complex data, for which data sampling and table joining may not be required. We further International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 133 Figure 9. Support vs. No. of Pair Patterns for 30,162 entries set with 7,494 set doesn’t have ing task and generated the & 3, we have time (in seconds) and correct classification percentage by Bayesian growth implementation time (in seconds) and correct classification 90 % fixed for we have shown the variation of the ounting algorithm, clusters obtained by the implementation of the ݁‫ݎ݋ݎݎ‬ ܾ‫ݏ݀݊ݑ݋‬ = , we have shown the variation of the minimum support vs. the number of pair patterns along with the variation of ܿ‫݂݁ܿ݊݁݀݅݊݋‬ = 95% , we have shown the variation of the minimum support vs. the number of frequent data items obtained by Lossy Counting Algorithm, number of and number of total clusters obtained by the implementation of the multi-feature 01% for Lossy shown the variation of the minimum support vs. the number of pair patterns along with the variation of number of = 95% and , we can conclude that more the number of frequent data items more will the number of association rules and more will be the number of patterns. The obtained numbers of clusters also have the dependency on the number of -terms of the tage by Bayesian belief identified combined patterns, which are more informative, actionable and oriented as compared to any single patterns identified by traditional methods. There may d customizable for handling a large amount of complex . We further may develop
  • 8. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 134 some effective paradigms, for handling large and multiple sources of data available in industry, insurance, stock market, e-commerce and banking etc. in real-time. REFERENCES [1] C. Zhang, D. Luo, H. Zhang, L. Cao and Y. Zhao (2011), ‘Combined Mining: Discovering Informative Knowledge in Complex Data’, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 41, NO. 3, pp. 699-712. [2] Han and Kamber (2006), ‘Data Mining Concepts and Techniques’, 2nd ed., United State of America. [3] C. Zhang, F. Figueiredo, H. Zhang, L. Cao and Y. Zhao (2007), ‘Mining for combined association rules on multiple datasets’, in Proc. DDDM, pp. 18–23. [4] C. Zhang, H. Bohlscheid, H. Zhang, L. Cao and Y. Zhao (2008), ‘Combined pattern mining: From learned rules to actionable knowledge’, in Proc. AI, pp. 393–403. [5] B. Park and H. Kargupta (2002), ‘Distributed Data Mining: Algorithms, Systems, and Applications’. Data Mining Handbook, N. Ye, Ed 2002. [6] Jaturon Chattratichat, John Darlington, et al (1999). ‘An Architecture for Distributed Enterprise Data Mining’. Proceedings of the 7th International Conference on High-Performance Computing and Networking. [7] B. Park, D. Hershbereger, E. Johnson and H. Kargupta (1999), ‘Collective data mining: A new perspective toward distributed data mining’. Accepted in the Advances in Distributed Data Mining, Eds: Hillol Kargupta and Philip Chan, AAAI/MIT Press (1999). [8] G. Dong and J. Li (1999), ‘Efficient mining of emerging patterns: Discovering trends and differences,’ in Proc. KDD, pp. 43–52. [9] H. Cheng, J. Han, J. Gao, K. Zhang, O. Verscheure, P. Yu, W. Fan and X. Yan (2008), ‘Direct mining of discriminative and essential graphical and item-set features via model-based search tree,’ in Proc. KDD, pp. 230–238. [10] B. Liu, W. Hsu and Y. Ma (1999), ‘Pruning and summarizing the discovered associations,’ in Proc. KDD, pp. 125–134. [11] A. N. Swami, B. Lent and J. Widom (1997), ‘Clustering association rules,’ in Proc. ICDE, pp. 220– 231. [12] J. Han, J. Yang, P. S. Yu and X. Yin (2006), ‘Efficient classification across multiple database relations: A CrossMine approach,’ IEEE Trans. Knowl. Data Eng., vol. 18, no. 6, pp. 770–783. [13] C. Zhang, H. Bohlscheid, H. Zhang, L. Cao and Y. Zhao (2008), ‘Combined pattern mining: From learned rules to actionable knowledge,’ in Proc. AI, pp. 393–403. [14] H. Yu, J. Han and J. Yang (2003), ‘Classifying large data sets using SVM with hierarchical clusters,’ in Proc. KDD, pp. 306–315. [15] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. [16] David A. Bell, Jie Cheng and Weiru Liu (1997), ‘An Algorithm for Bayesian Belief Network Construction from Data,’ In Proceedings of AI & STAT'97, pp. 83-90. [17] Dr. E. kumar and A. Solanki (2010), ‘A Combined Mining Approach and Application in Tax Administration,’ International Journal of Engineering and Technology Vol.2(2), 2010, 38-44. [18] Longbing Cao (2012), ‘Combined Mining: Analyzing Object and Pattern Relations for Discovering Actionable Complex Patterns,’ sponsored by Australian Research Council Discovery Grants (DP1096218 and DP130102691) and an ARC Linkage Grant(LP100200774). [19] Kavitha, K. and Ramaraj, E. (2012), ‘Mining Actionable Patterns using Combined Association Rules,’ International Journal of Current Research, Vol. 4, Issue, 03, pp.117-120, March, 2012.
  • 9. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 Biographical Notes: Sumit Kumar received his B.Tech Technology, Allahabad, India. He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His research interest areas are distributed systems, data mining, databases and computer networks. Manish Kumar received his PhD from Indian Institute of Allahabad, India in Data Management in Wireless Sensor Networks. He is an Assistant Professor at Indian Institute of Information Technology, Allahabad, India. His research interest areas are databases, data management in sensor networks, and data mining. Sweety received her B.Tech (Information Technology) from Indian Institute of Information Technology, Allahabad, India. She is a software developer at K India Pvt. Ltd., Hyderabad, India. Her research interest areas are databases and operating systems. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 Sumit Kumar received his B.Tech (Information Technology) from Indian Institute of Information He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His research interest areas are distributed systems, data mining, databases and computer Manish Kumar received his PhD from Indian Institute of Information Technology, ad, India in Data Management in Wireless Sensor Networks. He is an Assistant Professor at Indian Institute of Information Technology, Allahabad, India. His research interest areas are databases, data management in sensor networks, and B.Tech (Information Technology) from Indian Institute of Information Technology, Allahabad, India. She is a software developer at Kony India Pvt. Ltd., Hyderabad, India. Her research interest areas are data mining, International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.4, July 2013 135 from Indian Institute of Information He is Software Engineer at IVY Comptech Pvt. Ltd., Hyderabad, India. His