Effect of Data Size on Feature Set Using Classification in Health Domain

IDL - International Digital Library Of Technology & Research, Volume 1, Issue 5, May 2017
Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Copyright@IDL-2017

Uttham H1*, Gowramma2
1 PG Student, 2 Associate Professor, Dept. of Computer Science & Engineering, D.B.I.T, Bangalore, Karnataka, India.
1* utthamhmanju@gmail.com, 2 gowramma@gmail.com

ABSTRACT: In the health domain, a critical issue is the prediction of disease at an early stage. Prediction of disease has mainly depended on the experience of the physician, so many machine learning approaches have been applied to it. Existing approaches concentrate on either prediction or feature selection. The aim of this paper is to present the effect of data size and feature set on the prediction of disease in the health domain using Naïve Bayes. It shows how each attribute, or combination of attributes, behaves on datasets of different sizes.

Keywords: Machine Learning, Classification, Naïve Bayes, Feature Selection.

1. INTRODUCTION

In the health domain, diagnosis of disease is a very challenging task. An early prediction can be made on the basis of lab tests. Using the lab test report, the physician decides whether the patient has the disease or not, but this prediction depends mainly on the physician's experience. A more experienced physician may predict well, while a less experienced physician may predict wrongly. To overcome this problem, machine learning offers many approaches, such as KNN, SVM and ANN, to predict correctly.

Machine learning is a branch of science that allows a machine to make decisions. To make decisions, the machine has to learn by itself or from experience. There are three types of learning: supervised learning, unsupervised learning and reinforcement learning. The aim of this study is to find the effect of different feature sets on performance, using WEKA, on different sizes of the Pima Indian Diabetes Dataset.

A critical challenge in medical science is to reach the correct diagnosis. For a correct diagnosis, many tests are generally carried out, and all of these test procedures are said to be necessary in order to reach the ultimate diagnosis. On the other hand, too many tests can complicate the main diagnosis process and make the results difficult to obtain. This difficulty can be resolved with the aid of machine learning, which can be used to obtain the result directly by means of several classification techniques.

Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioral tendency by experience". Zoologists and psychologists study learning in animals and humans [1]. Extracting important information and correlations from a large pile of data is often the advantage of using machine learning. Humans are constantly discovering new knowledge about tasks.
There is a constant stream of new events in the world, and continually redesigning artificial intelligence systems to conform to new knowledge is impractical, but machine learning methods might be able to track much of it [1]. A substantial amount of research has been done with machine learning algorithms such as Bayes networks, the Multilayer Perceptron, decision trees with pruning such as J48graft and C4.5, single conjunctive rule learners such as FLR and JRip, the Fuzzy Inference System and the Adaptive Neuro-Fuzzy Inference System.

2. RELATED WORK

A good number of studies on the diagnosis of different diseases have been reported in the literature. Sapna and Tamilarasi [2] proposed a technique for diabetic neuropathy, a nerve disorder caused by diabetes mellitus. Long-term diabetic patients can develop diabetic neuropathies very easily. There is a fifty percent (50%) probability of developing such diseases, which affect many nerve systems of the body. For example, the body wall and limbs (served by the somatic nerves) can be affected, as can internal organs such as the heart and stomach (served by the autonomic nerves). In that paper, the risk factors and symptoms of diabetic neuropathy are used to build a fuzzy relation equation. The fuzzy relation equation is linked with the notion of composition of binary relations; that is, they used a Multilayer Perceptron neural network with a Fuzzy Inference System.

Leonarda and Antonio [6] proposed automatic detection of diabetic symptoms in retinal images using a multilayer perceptron neural network. The network is trained using algorithms for evaluating the optimal global threshold, which can minimize pixel classification errors. System performance is evaluated by means of an adequate index that provides a percentage measure of the detection of suspect eye regions, based on a neuro-fuzzy subsystem.

Radha and Rajagopalan [4] introduced an application of fuzzy logic to the diagnosis of diabetes. It describes the fuzzy sets and linguistic variables that contribute to the diagnosis of disease, particularly diabetes. Fuzzy logic is a computational paradigm that provides a mathematical tool for dealing with uncertainty. The paper also presents a computer-based fuzzy logic with maximum and minimum relationships and membership values consisting of the components specifying the fuzzy set framework.
Data from forty patients were collected to strengthen this relationship.

Faezeh, Hossien and Ebrahim [7] proposed a fuzzy clustering technique (FACT) that determines the appropriate number of clusters based on the pattern essence. Different experiments for evaluating the algorithm showed better performance than the widely used K-means clustering algorithm. Data was taken from the UCI Machine Learning Repository [3].

3. DATA SET DESCRIPTION

The characteristics of the data set used in this research are summarized in Table 1. Detailed descriptions of the data set, which contains 768 instances, are available at the UCI repository [3].

Table 1. Characteristics of the data set

Dataset                      Pima Indian diabetes
Number of examples           768
Input attributes             8
Output classes               2
Total number of attributes   9
Missing attributes           No
Noisy attributes             No

Sl. number   Attribute
0            Number of times pregnant
1            Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2            Diastolic blood pressure (mm Hg)
3            Triceps skin fold thickness (mm)
4            2-hour serum insulin (mu U/ml)
5            Body mass index (weight in kg/(height in m)^2)
6            Diabetes pedigree function
7            Age (years)
8            Class variable (0 or 1)

4. METHODOLOGY

In this paper, we use a machine learning technique, the Naïve Bayes classifier, to classify the diabetes data.
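The records to be classified follow the nine-column layout of Table 1. As a minimal sketch of that layout, assuming the data arrives as comma-separated text with one record per line (the paper does not state the file format), a record could be split into its eight inputs and its class as shown here. The short column names and the `parse_row` helper are hypothetical, and the sample record is made up for illustration.

```python
from typing import List, Tuple

# Column order follows the attribute numbering 0-8 in Table 1.
# The short names are hypothetical; the paper gives only the long descriptions.
COLUMNS = [
    "pregnancies", "glucose", "diastolic_bp", "triceps_skinfold",
    "insulin", "bmi", "pedigree", "age", "label",
]


def parse_row(line: str) -> Tuple[List[float], int]:
    """Split one comma-separated record into 8 input values and a 0/1 class."""
    fields = line.strip().split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    features = [float(value) for value in fields[:8]]
    label = int(fields[8])
    if label not in (0, 1):
        raise ValueError("class variable must be 0 or 1")
    return features, label


# Illustrative record only; these values are invented for the example.
features, label = parse_row("2,120,70,30,80,28.5,0.45,33,1")
```

Keeping the class variable separate from the eight inputs makes it straightforward to select a subset of input columns later without touching the label.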
4.1. Naïve Bayes

The Naïve Bayes [5] classifier provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is termed naïve because it relies on two important simplifying assumptions: it assumes that the predictive attributes are conditionally independent given the class, and it assumes that no hidden or latent attributes influence the prediction process. The Naïve Bayes classifier is a simple supervised probabilistic classifier based on Bayes' theorem:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1}$$

$$P(c \mid x) \propto P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_n \mid c)\,P(c) \tag{2}$$

Here $x = (x_1, \dots, x_n)$ are the $n$ predictors, $P(c \mid x)$ is the posterior probability of the class (here, diabetic or non-diabetic) given the predictors, $P(c)$ is the prior probability of the class, $P(x \mid c)$ is the likelihood, i.e. the probability of the predictors given the class, and $P(x)$ is the prior probability of the predictors. Equation (2) follows from (1) by applying the conditional independence assumption to $P(x \mid c)$ and dropping the denominator $P(x)$, which is the same for every class.

5. PERFORMANCE METRICS

We measure the performance of the classifier with respect to precision, recall and F-measure. Each metric is calculated with respect to a particular class.

Precision (p) measures correctness:

$$p = \frac{\text{correctly classified positives}}{\text{total predicted as positive}}$$

Recall (r) measures completeness:

$$r = \frac{\text{correctly classified positives}}{\text{total positives}}$$

F-measure (F) is the harmonic mean of precision and recall:

$$F = \frac{2\,r\,p}{r + p}$$

6. EXPERIMENTAL WORK
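As groundwork for the experiments, the classifier of Section 4.1 and the metrics of Section 5 can be sketched in a few lines. This is a minimal sketch, not the WEKA implementation used in the paper: it assumes numeric attributes with a Gaussian likelihood per class (the paper does not say which estimator was used), and the function names `fit_gaussian_nb`, `predict` and `precision_recall_f` are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate P(c) and a per-class (mean, variance) for every attribute."""
    by_class = defaultdict(list)
    for row, c in zip(rows, labels):
        by_class[c].append(row)
    model = {}
    for c, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[c] = (prior, stats)
    return model


def predict(model, row):
    """Pick the class maximising log P(c) + sum_i log P(x_i | c), as in Eq. (2)."""
    best_class, best_score = None, -math.inf
    for c, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Log of the normal density, assumed here as the likelihood P(x_i | c).
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


def precision_recall_f(actual, predicted, positive=1):
    """Precision, recall and F-measure with respect to the `positive` class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    p = tp / predicted_pos if predicted_pos else 0.0
    r = tp / actual_pos if actual_pos else 0.0
    f = 2 * r * p / (r + p) if (r + p) else 0.0
    return p, r, f
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied, and the zero checks in `precision_recall_f` cover folds in which no instance is predicted positive.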
This experiment was done with open-source tools in a Windows environment using the Eclipse software. We used Java code and the libraries available in WEKA. We divide our data set into training and testing sets to apply supervised learning. We use the Naïve Bayes classifier to explore our data set, primarily because previous work has shown that this algorithm presents a good trade-off between simplicity and accuracy. Patients are classified into one of two classes: (i) 'diabetic' or (ii) 'non-diabetic'. We use 10-fold cross-validation in training and then apply the model to our testing set.

The experiment follows this procedure. Start with 100% of the data, i.e. all instances. For each possible subset of features (for example, 8 attributes give 2^8 possible subsets, counting the empty one), apply 10-fold cross-validation to build the model and note down the precision, recall and F-score. Then repeat the experiment for 90%, 80%, 70%, 60% and 50% of the data. By conducting this experiment we learn how each feature, or combination of features, acts on different sizes of data.

7. RESULT ANALYSIS AND DISCUSSION

In this paper, we examine the effect of data size on feature set using the Naïve Bayes classifier. With 8 attributes there are 2^8 - 1 = 256 - 1 = 255 possible non-empty subsets. A graph is generated for each subset, showing the performance of that subset. The following figure shows the effect of features (0, 1, 2, 4) on different data sizes.
The graph below shows the effect of attribute subset (2, 4, 5, 6) on different data sizes.
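The experimental grid described in Section 6 (every non-empty feature subset, at each data size, under 10-fold cross-validation) can be sketched as follows. This is a sketch under stated assumptions rather than the authors' WEKA code: the paper does not say how the smaller data sets were drawn, so the first n rows are taken here; folds are not stratified; and `evaluate` is a hypothetical callback that trains on one fold split and returns a single score such as the F-measure.

```python
import random
from itertools import combinations


def feature_subsets(n_features):
    """Yield every non-empty subset of attribute indices 0..n_features-1."""
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            yield subset


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices, deal them into k folds, yield (train, test) pairs."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def project(rows, row_ids, subset):
    """Keep only the chosen rows and the chosen attribute columns."""
    return [[rows[i][j] for j in subset] for i in row_ids]


def run_grid(rows, labels, evaluate, fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Average evaluate(...) over 10 folds for each (fraction, subset) pair."""
    results = {}
    for fraction in fractions:
        n = int(round(len(rows) * fraction))  # assumption: use the first n rows
        for subset in feature_subsets(len(rows[0])):
            scores = []
            for train, test in k_fold_indices(n):
                scores.append(evaluate(
                    project(rows, train, subset), [labels[i] for i in train],
                    project(rows, test, subset), [labels[i] for i in test],
                ))
            results[(fraction, subset)] = sum(scores) / len(scores)
    return results
```

For the eight input attributes of the Pima data this enumerates 255 subsets at six data sizes, i.e. 1,530 cross-validated runs, each of which yields one point on one of the per-subset graphs.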
8. CONCLUSION AND FUTURE WORK

The objective of this study is to evaluate the effect of data size on feature set and to investigate performance using the Naïve Bayes algorithm based on WEKA. The experiment shows how each attribute, or combination of attributes, affects prediction performance at different data sizes, i.e. for each possible subset of attributes. As future work, the same experiment can be conducted on different data sets, for example a heart attack dataset and a diabetes dataset; from these experiments, the effect on prediction of the attributes they have in common can be combined. We can also work with different classification algorithms.

9. REFERENCES

[1] N. J. Nilsson, "Introduction to Machine Learning," 2010. http://ai.stanford.edu/~nilsson/mlbook.html
[2] M. S. Sapna and D. A. Tamilarasi, "Fuzzy Relational Equation in Preventing Neuropathy Diabetic," International Journal of Recent Trends in Engineering, Vol. 2, No. 4, 2009, p. 126.
[3] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/MLRepository.html
[4] R. Radha and S. P. Rajagopalan, "Fuzzy Logic Approach for Diagnosis of Diabetes," Information Technology Journal, Vol. 6, No. 1, 2007, pp. 96-102. doi:10.3923/itj.2007.96.102
[5] G. H. John and P. Langley, "Estimating Continuous Distributions in Bayesian Classifiers," Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Francisco, 1995, pp. 338-345.
[6] L. Carnimeo and A. Giaquinto, "An Intelligent System for Improving Detection of Diabetic Symptoms in Retinal Images," IEEE International Conference on Information Technology in Biomedicine, Ioannina, 26-28 October 2006.
[7] F. Ensan, M. H. Yaghmaee and E. Bagheri, "Fact: A New Fuzzy Adaptive Clustering Technique," The 11th IEEE Symposium on Computers and Communications, Sardinia, 26-29 June 2006, pp. 442-447. doi:10.1109/ISCC.2006.73