2. Table of Contents
Ø Introduction
Ø Motivation
Ø Data Mining
Ø Classification
Ø Association
Ø Heart Disease Database
Ø Literature Survey
Ø Problem Formulation
Ø Objectives
Ø Present Work
Ø Result and Discussion
Ø Conclusion
Ø Future Scope
Ø References
3. Motivation
Ø Accumulation of huge data-sets in the field of
Engineering and Biomedical Science.
Ø Ability to extract hidden and useful knowledge from
large databases.
Ø Need to develop intelligent and cost-effective
decision support systems.
Ø How to teach people to ignore irrelevant
data.
Ø The greatest problem today is to get an optimal
outcome from data that is largely irrelevant.
4. Data Mining
Ø Data mining is the computational process of finding
patterns in large data sets, using methods at the
intersection of machine learning, artificial
intelligence, statistics and database systems.
Ø The main focus of the data mining process is to obtain
information from the data and convert it into an
understandable structure for further
use.
6. Classification
Classification is the problem of identifying to which of
a set of categories a new observation belongs, on the
basis of a training set of data containing observations
(or instances) whose category membership is known.
7. Association
Association rule learning is a method for discovering interesting
relations between variables in large databases. It is
intended to identify strong rules discovered in
databases using different measures of interestingness.
For example, the rule:
{onions, potatoes} => {burger}.
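Two common measures of interestingness are support and confidence. The sketch below computes both for the rule above. The transactions are hypothetical and invented only for illustration, and the helper names `support` and `confidence` are not from the original slides.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"onions", "potatoes", "burger"}),
    frozenset({"onions", "potatoes", "burger", "milk"}),
    frozenset({"onions", "potatoes"}),
    frozenset({"potatoes", "burger"}),
    frozenset({"milk", "bread"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"onions", "potatoes", "burger"}, TRANSACTIONS)
rule_confidence = confidence({"onions", "potatoes"}, {"burger"}, TRANSACTIONS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

On this toy data, the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain both onions and potatoes.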
8. Example : Heart diseases Dataset
ID  age  Gender  Chest pain   Blood pressure  diagnosis
1   63   male    typ_angina   High            No
2   67   male    asympt       very_high       Yes
3   67   male    asympt       high            Yes
4   37   male    non_anginal  high            No
5   41   female  atyp_angina  high            No
6   56   male    atyp_angina  high            No
7   62   female  asympt       high            Yes
8   57   female  asympt       high            No
9   63   male    asympt       high            Yes
10  53   male    asympt       high            Yes
11  57   male    asympt       high            No
12  56   female  atyp_angina  high            No
13  56   male    non_anginal  high            Yes
14  44   male    atyp_angina  high            No
10. Result new prediction ?
age  gender  Chest pain   Blood pressure  diagnosis
52   male    non_anginal  very_high       ?
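One way to fill in the missing diagnosis is a Naive Bayes classifier trained on the 14 example rows. The sketch below is only an illustration, not the method used in the present work. It assumes the three categorical attributes only (age is left out), lower-cases the "High" entry of row 1, and uses add-one (Laplace) smoothing. The names `train` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ("gender", "chest_pain", "blood_pressure")

# Rows copied from the example table (age omitted, "High" lower-cased).
ROWS = [
    ("male", "typ_angina", "high", "No"),
    ("male", "asympt", "very_high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "non_anginal", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "atyp_angina", "high", "No"),
    ("female", "asympt", "high", "Yes"),
    ("female", "asympt", "high", "No"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "non_anginal", "high", "Yes"),
    ("male", "atyp_angina", "high", "No"),
]


def train(rows):
    """Count classes, per-class attribute values, and each attribute's domain."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = [set() for _ in ATTRIBUTES]
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, instance, alpha=1.0):
    """Return (most probable class, normalised posterior) with add-alpha smoothing."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for i, value in enumerate(instance):
            score *= (value_counts[(i, label)][value] + alpha) / (n + alpha * len(domains[i]))
        scores[label] = score
    norm = sum(scores.values())
    posterior = {label: s / norm for label, s in scores.items()}
    return max(posterior, key=posterior.get), posterior


model = train(ROWS)
prediction, posterior = predict(model, ("male", "non_anginal", "very_high"))
print(prediction, posterior)
```

Under these assumptions the sketch predicts "Yes" for the new patient. A different classifier, or a different choice of smoothing, may give a different answer on such a small table.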
11. Classifiers
Ø ZeroR: Ignores all predictors and always predicts the majority class. It has no
predictive power, but it is useful for determining a baseline performance as a
benchmark for other classification methods.
Ø OneR: Generates one rule for each predictor in the data and then selects the rule
with the smallest total error, so classification is based on the value of a single predictor.
Ø NaiveBayes: Applies Bayes' rule, assuming the predictors are independent given the
class, which makes the probability model easy to evaluate. It can handle some missing
entries in the data.
Ø J48: Weka's implementation of the C4.5 algorithm. With this technique, a decision tree
is constructed to model the classification process.
Ø IBk (k nearest neighbour): The nearest-neighbour algorithm classifies a given
instance using a set of already classified training instances, by measuring the
distance to the closest instances.
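The two simplest classifiers can be sketched in a few lines. The code below is a minimal illustration of the ZeroR and OneR ideas, not Weka's implementation. It reuses the 14 rows of the example heart disease table with age omitted and "High" lower-cased. The function names `zero_r` and `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ("gender", "chest_pain", "blood_pressure")

# Rows copied from the example table (age omitted, "High" lower-cased).
ROWS = [
    ("male", "typ_angina", "high", "No"),
    ("male", "asympt", "very_high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "non_anginal", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "atyp_angina", "high", "No"),
    ("female", "asympt", "high", "Yes"),
    ("female", "asympt", "high", "No"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "Yes"),
    ("male", "asympt", "high", "No"),
    ("female", "atyp_angina", "high", "No"),
    ("male", "non_anginal", "high", "Yes"),
    ("male", "atyp_angina", "high", "No"),
]


def zero_r(rows):
    """ZeroR: always predict the most frequent class."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]


def one_r(rows, attributes):
    """OneR: build one rule per attribute, keep the one with the fewest training errors."""
    best = None
    for i, name in enumerate(attributes):
        by_value = defaultdict(Counter)  # attribute value -> class counts
        for row in rows:
            by_value[row[i]][row[-1]] += 1
        # Each value predicts its majority class; the rest are errors.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


baseline = zero_r(ROWS)
best_attribute, best_rule, best_errors = one_r(ROWS, ATTRIBUTES)
print(baseline, best_attribute, best_errors)
```

On this toy table ZeroR predicts "No" for every patient (8 of 14 rows), while OneR selects chest pain as its single predictor and misclassifies 3 of the 14 training rows.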
12. Association Methods
Ø Apriori Algorithm: Finds rules that predict the
occurrence of an item based on the occurrences of
other items in the transaction.
Ø FP-Growth Algorithm: Allows frequent itemset discovery
without candidate itemset generation. Extracts
frequent itemsets from the FP-tree. Follows a divide-
and-conquer approach.
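The level-wise search behind Apriori can be sketched as follows. This is a simplified illustration of frequent itemset generation with a join step and a prune step, not Weka's implementation. The transactions and the `min_support` value are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using a level-wise candidate search."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([item]) for t in transactions for item in t})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        pruned = {c for c in joined
                  if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(pruned)
    return frequent


# Hypothetical toy transactions, invented for illustration only.
toy = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"potatoes", "burger"},
    {"milk", "bread"},
]
frequent_itemsets = apriori(toy, min_support=0.4)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

FP-Growth reaches the same frequent itemsets without the candidate generation shown in the join and prune steps. It builds a compact FP-tree and mines it recursively instead.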
13. Heart Disease Database
Sr. No.  Attributes  Description              Values
1        age         Age in years             Continuous
2        gender      Male or female           1 = male, 0 = female
3        cp          Chest pain type          1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
4        trestbps    Resting blood pressure   Continuous value in mm Hg
5        chol        Serum cholesterol        Continuous value in mg/dl
6        thalach     Maximum heart rate       Continuous value
                     achieved
7        fbs         Fasting blood sugar      1 = > 120 mg/dl, 0 = <= 120 mg/dl
14. Continue…
Sr. No.  Attributes  Description                                          Values
8        restecg     Resting electrocardiographic results                 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy
9        exang       Exercise induced angina                              0 = no, 1 = yes
10       oldpeak     ST depression induced by exercise relative to rest   Continuous value
11       slope       Slope of the peak exercise ST segment                1 = upsloping, 2 = flat, 3 = downsloping
12       ca          Number of major vessels colored by fluoroscopy       0 - 3 value
13       thal        Defect type                                          3 = normal, 6 = fixed defect, 7 = reversible defect
14       Diagnosis   Heart disease prediction                             Value 1: no heart disease, Value 0: has heart disease
15. Literature Survey
Ø Liao et al. [3] report on data mining techniques and applications and their
development through a survey of the literature from 2000 to 2011. The paper surveys
three areas of data mining research: knowledge types, analysis types, and
architecture types. The discussion deals with future progress in social science and
engineering methodologies that implement data mining techniques, and with the
development of problem-oriented applications.
Ø Liu et al. [4] presented associative classification, which integrates classification rule
mining and association rule mining. The integration is done by focusing on mining a special
subset of association rules whose consequent parts are restricted to the classification
class labels, called Class Association Rules (CARs). The algorithm first generates all
the association rules and then selects a small set of rules to form the classifier.
When predicting the class label of a new sample, the best matching rule is chosen.
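The prediction step of associative classification can be sketched as below. This is a simplified illustration of the idea, not the algorithm of [4]. It assumes rules are ranked by confidence and then by support, and that a default class is returned when no rule matches. The rule values, the `attribute=value` item encoding and all names are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class ClassAssociationRule(NamedTuple):
    antecedent: FrozenSet[str]  # items such as "chest_pain=asympt"
    label: str                  # consequent, restricted to a class label
    confidence: float
    support: float


def classify(rules: Iterable[ClassAssociationRule], instance: Set[str], default: str) -> str:
    """Return the label of the best-ranked rule whose antecedent matches the instance."""
    ranked = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    for rule in ranked:
        if rule.antecedent <= instance:
            return rule.label
    return default


# Hypothetical rules with illustrative confidence and support values.
RULES = [
    ClassAssociationRule(frozenset({"chest_pain=asympt"}), "Yes", 0.71, 0.36),
    ClassAssociationRule(frozenset({"chest_pain=atyp_angina"}), "No", 1.00, 0.29),
    ClassAssociationRule(frozenset({"blood_pressure=very_high"}), "Yes", 1.00, 0.07),
]

patient = {"gender=male", "chest_pain=non_anginal", "blood_pressure=very_high"}
label = classify(RULES, patient, default="No")
print(label)
```

Restricting consequents to class labels is what lets an ordinary association rule miner double as a rule-based classifier.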
16. Continue…
Ø One of the earliest association rule mining algorithms was the Apriori algorithm [5],
developed by Agrawal and Srikant. The Apriori algorithm generates the candidate itemsets
in one pass using only the itemsets found to have large support in the previous pass,
without considering the transactions in the database.
Ø Palaniappan and Awang [6] developed a prototype Intelligent Heart Disease
Prediction System (IHDPS) using data mining techniques, namely Decision Trees,
Naive Bayes and Neural Networks. Results show that each technique has its unique
strength in realizing the objectives of the defined mining goals. IHDPS can answer
complex what-if queries which traditional decision support systems cannot. Using
medical profiles such as age, gender, blood pressure and blood sugar, it can predict
the likelihood of patients getting heart disease. IHDPS is Web-based, user-
friendly, scalable, reliable and expandable. It is implemented on the .NET platform.
17. Continue…
Ø Srinivas et al. [7] presented an application of data mining techniques in healthcare and the
prediction of heart attacks. They examined the potential use of classification-based data
mining techniques, such as rule-based, decision tree, Naive Bayes and artificial neural
network methods, on the massive volume of healthcare data. The Tanagra data mining tool was
used for exploratory data analysis, machine learning and statistical learning algorithms.
The training data set consists of 3000 instances with 14 different attributes.
Ø Shouman et al. [8] proposed k-means clustering with the decision tree method to predict
heart disease. In their work they suggested several centroid selection methods for k-means
clustering to increase efficiency. The 13 input attributes were collected from the Cleveland Clinic
Foundation heart disease data set. For the random attribute and random row methods, ten
runs were executed and the average and best for each method were calculated. In addition,
integrating k-means clustering and decision tree could achieve higher accuracy than the
bagging algorithm in the diagnosis of heart disease patients. The accuracy achieved was
83.9% by the inlier method with two clusters.
The algorithm used  Accuracy  Time taken
Naive Bayes         52.33%    609ms
Decision list       52%       719ms
K-NN                45.67%    1000ms
18. Summary and Gaps Identified
Ø Different methods, such as Naive Bayes, decision tree, k-nearest neighbour
and artificial neural network, have been implemented on the heart disease
dataset.
Ø The performance of the classifiers is evaluated and their results are
analysed.
Ø The maximum accuracy achieved according to the survey is 83.9%, using k-
means clustering with a decision tree.
Ø The classification methods alone do not provide better accuracy in the
experimental results.
Ø Associative classification has not yet been implemented on the heart
disease data set.
19. Problem Formulation
Ø Accuracy on heart disease data is calculated only on the basis of classification
methods.
Ø The percentage of correctly classified instances is too low to predict heart disease reliably.
Ø Associative classification suffers from inefficiency because it
often generates a very large number of insignificant rules.
Ø Most associative classification algorithms adopt an exhaustive search
method to discover the rules and require multiple passes over the
database.
Ø They find frequent items in one phase and generate the rules in a separate
phase, consuming more resources such as storage and processing time.
20. Objectives
Ø To propose a technique that can generate
Class Association Rules (CARs) efficiently for
heart disease prediction.
Ø To evaluate the proposed approach.
Ø To compare the proposed method with
other state-of-the-art techniques.
21. Present Work
The present work has been implemented using the data mining tool Weka.
The implementation steps are listed below:
1. Review the classification and association rule generation methods.
2. Understand the existing classification algorithms.
3. Study the existing methods of classification and association used to predict heart
disease.
4. Understand the heart disease data set attributes used in prediction.
5. Study the ARFF file format, the standard for representing datasets.
6. Prepare the data set for implementation of the association algorithms.
23. Continue…
7. Implement association algorithms such as Apriori and FP-Growth on the prepared
data set.
8. Select the best 10 rules for each association algorithm.
9. Make classes and extract training data sets based on the different rules.
10. Implement classification algorithms on the extracted training data sets.
11. Compare the performance and the percentage of correctly classified instances
of the classification methods.
12. Construct a system based on the classification method with the highest performance
and accuracy.
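Steps 5 and 6 rely on Weka's ARFF format, which lists a relation name, the attribute declarations, and then the data rows. The sketch below writes a small ARFF text from the first rows of the example heart disease table. The relation name `heart_toy` and the helper `to_arff` are hypothetical, and the actual file prepared in the present work is not shown in the slides.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: @relation, one @attribute line per column, then @data rows."""
    lines = [f"@relation {relation}", ""]
    for name, domain in attributes:
        if domain == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(domain)}}}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


ATTRIBUTES = [
    ("age", "numeric"),
    ("gender", ["male", "female"]),
    ("chest_pain", ["typ_angina", "asympt", "non_anginal", "atyp_angina"]),
    ("blood_pressure", ["high", "very_high"]),
    ("diagnosis", ["Yes", "No"]),
]

# First three rows of the example table ("High" lower-cased to match the nominal values).
ROWS = [
    (63, "male", "typ_angina", "high", "No"),
    (67, "male", "asympt", "very_high", "Yes"),
    (67, "male", "asympt", "high", "Yes"),
]

arff_text = to_arff("heart_toy", ATTRIBUTES, ROWS)
print(arff_text)
```

Association algorithms work on nominal attributes, so a numeric column such as age would typically be discretised before step 7. The slides do not say how this was done.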
26. Sample Data form of Heart Disease Prediction
Available online: http://gndec.ac.in/~jagdeepmalhi/ihdps/
27. Sample Data of Heart Disease Prediction for Risk Level: No
28. Sample Data of Heart Disease Prediction for Risk Level: Low
29. Sample Data of Heart Disease Prediction for Risk Level: High
30. Results and Discussion
The evaluation of the results is done on the basis of two
comparisons.
Ø Compare parameters such as time taken,
correctly/incorrectly classified instances, Kappa statistic
value, mean absolute error and root mean squared
error of the different classifiers with the Apriori and FP-
Growth association algorithms.
Ø Compare the accuracy reported by different authors
on the heart disease dataset.
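Accuracy and the Kappa statistic can both be derived from a confusion matrix. The sketch below uses the standard definition of Cohen's kappa. The confusion matrices are hypothetical and are not taken from the experiments reported here.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical matrices: a majority-class predictor and a perfect predictor.
majority_only = [[8, 0],
                 [6, 0]]
perfect = [[8, 0],
           [0, 6]]

baseline_accuracy, baseline_kappa = accuracy_and_kappa(majority_only)
perfect_accuracy, perfect_kappa = accuracy_and_kappa(perfect)
print(baseline_accuracy, baseline_kappa, perfect_accuracy, perfect_kappa)
```

A classifier that always predicts the majority class agrees with the true labels no more often than chance, so its kappa is 0 even when its accuracy looks respectable.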
31. Continue…
Comparison of different classifiers using the Apriori association
algorithm on the heart disease dataset.
Classifiers Time taken  Correctly      Incorrectly    Kappa      Mean      Root mean
            (seconds)   classified     classified     statistic  absolute  squared
                        instances (%)  instances (%)             error     error
ZeroR       0.001       67.2           32.79          0          0.441     0.470
OneR        0.01        97.31          2.6            0.94       0.027     0.164
J48         0.04        97.85          2.15           0.951      0.031     0.143
IBk         0.003       99.19          0.81           0.982      0.010     0.090
NaiveBayes  0.01        97.58          2.42           0.946      0.023     0.137
32. Continue…
Comparison of different classifiers using the FP-Growth
association algorithm on the heart disease dataset.
Classifiers Time taken  Correctly      Incorrectly    Kappa      Mean      Root mean
            (seconds)   classified     classified     statistic  absolute  squared
                        instances (%)  instances (%)             error     error
ZeroR       0.001       85.67          14.33          0          0.247     0.350
OneR        0.005       92.55          7.45           0.649      0.075     0.273
J48         0.01        96.56          3.44           0.859      0.056     0.185
IBk         0.001       94.84          5.16           0.779      0.053     0.227
NaiveBayes  0.003       97.55          7.45           0.711      0.088     0.265
33. Continue…
Comparison of the Apriori and FP-Growth association
algorithms on the heart disease dataset.
Association  ZeroR         OneR          J48           IBk           NaiveBayes
algorithm    accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)  accuracy (%)
Apriori      67.2          97.31         97.85         99.19         97.58
FP-Growth    85.67         92.55         96.56         94.84         97.55
34. Continue…
Comparison of results reported by different authors
on the heart disease dataset.
Author / Year                        Technique                   Accuracy (%)
Cheung 2001 [11]                     NaiveBayes                  81.48
Polat and Sahan et al. 2007 [12]     K-Nearest Neighbor          87.00
Shouman and Turner et al. 2012 [13]  Decision tree               84.10
Das and Turkoglu et al. 2009 [14]    K-Nearest Neighbor          97.40
Tu and Shin et al. 2009 [15]         J4.8 Decision Tree          78.90
Proposed Method 2014                 IBk with Apriori Algorithm  99.19
35. Conclusion
Ø A hybrid technique implementing associative classification
was developed on the heart disease
dataset to give more accurate predictions.
Ø The dataset was processed in the Weka environment, and the
performance of different classifiers was compared after applying
the association algorithms.
Ø Results show that IBk (k Nearest Neighbor) with the Apriori
association algorithm gives better results than the others.
Ø The results of the different classifiers were compared with the proposed
implementation method.
Ø Finally, an Intelligent Heart Disease Prediction System
(IHDPS) was developed for end users to check their risk of heart disease.
36. Future Scope
Ø Future work plans to reduce the number of attributes
and to determine which attributes contribute most
towards the diagnosis of heart disease.
Ø Additional data mining techniques can be
incorporated to provide better results.
Ø There is a need to build a system where every
person can check the risk of heart disease using
minimum resources and parameters.
Ø Parameters such as processing time, resources and
memory used can be further improved.
37. References
1) U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in
databases,” AI Magazine, American Association for Artificial Intelligence, vol. 17, no. 3, pp. 37–54, 1996.
2) D. Aha. (1988, July) Heart disease databases. [Online]. Available:
http://repository.seasr.org/Datasets/UCI/arff/heart-c.arff
3) S. H. Liao, P. H. Chu, and P. Y. Hsiao, “Data mining techniques and applications - a decade
review from 2000 to 2011,” Elsevier Expert Systems with Applications, vol. 39, no. 1,
pp. 11303–11311, 2012.
4) B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in
Knowledge Discovery and Data Mining, New York, vol. 2, pp. 80–86, 1998.
5) R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in VLDB,
Santiago, Chile, September 1994, pp. 487–499.
6) S. Palaniappan and R. Awang, “Intelligent heart disease prediction system using data mining
techniques,” in IEEE/ACS International Conference, Doha, 2008, pp. 108–115.
7) K. Srinivas, B. K. Rani, and D. A. Govrdhan, “Application of data mining techniques in
healthcare and prediction of heart attacks,” International Journal on Computer Science and
Engineering, vol. 2, no. 2, pp. 250–255, 2011.
38. Continue …
8) M. Shouman, T. Turner, and R. Stocker, “Integrating decision tree and k-means clustering with different
initial centroid selection methods in the diagnosis of heart disease patients,” in Proceedings of the
International Conference on Data Mining, 2012.
10) J. Singh, H. Singh, and A. Kamra, “Recent trends in data mining: A review,” in Proceeding of 3rd
International Conference on Biomedical Engineering and Assistive Technologies, Chandigarh, India, 2014,
pp. 138–144.
11) N. Cheung, “Machine learning techniques for medical analysis,” B.Sc. Thesis, School of Information
Technology and Electrical Engineering, University of Queensland, 2001.
12) K. Polat, S. Sahan, and S. Gunes, “Automatic detection of heart disease using an artificial immune
recognition system (airs) with fuzzy resource allocation mechanism and k-nn (nearest neighbor) based
weighting preprocessing,” Expert Systems with Applications, pp. 625–663, 2007.
13) M. Shouman, T. Turner, and R. Stocker, “Applying k-nearest neighbor in diagnosing heart disease
patients,” International Journal of Information and Education Technology, vol. 2, no. 3, pp. 220–223, June
2012.
14) R. Das, I. Turkoglu, and A. Sengur, “Effective diagnosis of heart disease through neural networks
ensembles,” Expert Systems with Applications, Elsevier, pp. 7675–7680, 2009.
15) M. C. Tu, D. Shin, and D. Shin, “Effective diagnosis of heart disease through bagging approach,” in
Proceeding of 2nd International Conference on Biomedical Engineering and Informatics. Seoul, South Korea:
IEEE, October 2009, pp. 1–4.