INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 3, Issue 3, October - December (2012), pp. 155-167
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI)
www.jifactor.com




     IMPROVING THE PERFORMANCE OF K-NEAREST NEIGHBOR
   ALGORITHM FOR THE CLASSIFICATION OF DIABETES DATASET
                   WITH MISSING VALUES
                              Y. Angeline Christobel¹, P. Sivaprakasam²
               ¹ Research Scholar, Department of Computer Science, Karpagam University,
                                          Coimbatore, India
             ² Professor, Department of Computer Science, Sri Vasavi College, Erode, India

 ABSTRACT
 In today's world, people are affected by many diseases that cannot be completely cured.
 Diabetes is one such disease and has become a major, growing health problem. It increases the
 risk of heart attack, kidney failure and renal disease. Data mining techniques have been widely
 applied to extract knowledge from medical databases. In this paper, we evaluated the performance
 of the k-nearest neighbor (kNN) algorithm for the classification of diabetes data. We considered
 data imputation, scaling and normalization techniques to improve the accuracy of the classifier
 when using diabetes data, which may contain a lot of missing values. We chose to explore kNN
 because it is very simple and faster than most complex classification algorithms. To measure
 performance, we used Accuracy and Error rate as the metrics. We found that data imputation alone
 does not lead to higher accuracy; rather, it yields the correct accuracy in the presence of
 missing values, and imputation combined with a suitable data preprocessing method increases the
 accuracy.


 Keywords: Data Mining, Classification, kNN, Data Imputation Methods, Data
 Normalization and Scaling

 1. INTRODUCTION
 Hospitals and medical institutions find it difficult to extract useful information for decision
support due to the growth of medical data, so methods for efficient computer-based analysis are
essential. It has been shown that introducing machine learning into medical analysis increases
diagnostic accuracy, reduces costs and reduces the demand on human resources. Earlier detection
of disease aids in increased exposure to required patient care and improved cure rates [17].
Diabetes is one of the most deadly diseases observed in many nations. According to the World
Health Organization (WHO) [18], millions of
people in this world are suffering from diabetes. Data classification is an important problem in
engineering and scientific disciplines such as biology, psychology, medicine, marketing,
computer vision, and artificial intelligence. The goal of data classification is to classify
objects into a number of categories or classes.
        A lot of research work has been done on the Pima Indian diabetes dataset. In [8],
Jayalakshmi and Santhakumaran used the ANN method for diagnosing diabetes, using the Pima
Indian diabetes dataset without missing data, and obtained 68.56% classification accuracy.
        Manaswini Pradhan et al. [12] suggested an Artificial Neural Network (ANN) based
classification model for classifying diabetic patients, with a best accuracy of 72% and an
average accuracy of 72.2%.
        Umar Sathic Ali and C. Jothi Ventakeswaran [13] proposed an improved evidence-theoretic
kNN algorithm which combines the Dempster-Shafer theory of evidence and the k-nearest neighbour
rule with a distance-metric-based neighborhood, with an average accuracy of 73.57%.
         Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang, and Albert Y. Zomaya [14] proposed a
particle swarm based hybrid system for remedying the class imbalance problem in medical and
biological data mining. Multiple classification algorithms are used to guide the sampling
process. The classification algorithms used in the hybrid system include Decision Tree (J48),
k-Nearest Neighbor (kNN), Naive Bayes (NB), Random Forest (RF) and Logistic Regression (LOG).
Four medical datasets, namely blood, survival, breast cancer and Pima Indian Diabetes from the
UCI Machine Learning Repository, and a genome wide association study (GWAS) dataset from the
genotyping of Single Nucleotide Polymorphisms (SNPs) of Age-related Macular Degeneration (AMD),
are used. The methods compared in the study include PSO (Particle Swarm based Hybrid System),
RO (Random Oversampling), RU (Random Undersampling) and Clustering. The metrics used are AUC
(Area Under the ROC Curve), F-measure and Geometric mean. The Particle Swarm based Hybrid
System (PSO) gives an average accuracy of 71.6%, RU 67.3%, RO 65.8% and Clustering 64.9% with
the kNN classifier for Pima Indian Diabetes. With the other classifiers and datasets as well,
the study shows that PSO gives better accuracy.
         In [15], Biswadip Ghosh applies FCP (Fuzzy Composite Programming) to build a diabetes
classifier using the Pima diabetes dataset. He evaluated the performance of the fuzzy classifier
using Receiver Operating Characteristic (ROC) curves and compared it against a logistic
regression classifier. The results show that the FCP classifier is better when the calculated
AUC values are compared: the classifier based on logistic regression has an AUC of 64.8%, while
the classifier based on FCP has an AUC of 72.2%. Quinlan applied C4.5 and it was 71.1% accurate
(Quinlan 1993).
         In [3], Wahba's group at the University of Wisconsin applied penalized log likelihood
smoothing spline analysis of variance (PSA). They eliminated patients with glucose and BMI
values of zero, leaving n=752. They used 500 records for the training set and the remaining 252
as the evaluation set, which showed an accuracy of 72% for the PSA model and 74% for a GLIM
model (Wahba, Gu et al. 1992).
        Michie et al. used 22 algorithms with 12-fold cross-validation and reported the
following accuracy rates on the test set: Discrim 77.5%, Quaddisc 73.8%, Logdisc 77.7%,
SMART 76.8%, ALLOC80 69.9%, k-NN 67.6%, CASTLE 74.2%, CART 74.5%, IndCART 72.9%, NewID 71.1%,
AC2 72.4%, Baytree 72.9%, NaiveBay 73.8%, CN2 71.1%, C4.5 73%, Itrule 75.5%, Cal5 75%,
Kohonen 72.7%, DIPOL92 77.6%, Backprop 75.2%, RBF 75.7%, and LVQ 72.8% (Michie, Spiegelhalter
et al. 1994, p. 158).


        J. Oates (1994) used the Multi-Stream Dependency Detection (MSDD) algorithm, training on
two-thirds of the dataset. Accuracy on the remaining one-third used for evaluation was 71.33%.
        Although the cited articles use different techniques, the average accuracy on the Pima
Indian Diabetes dataset is 70.64%.
    The representation of a large number of missing values by zeros itself forms a characteristic
of the data and hence affects the accuracy of the classifier. The main objective of this paper is
to explore ways to improve the classification accuracy of a classification algorithm using
suitable techniques such as data imputation, scaling and normalization for the classification of
diabetes data, which may contain a lot of missing values.

2. DATA PREPROCESSING
Data in the real world is incomplete, noisy, inconsistent and redundant. The knowledge
discovery during the training phase is difficult if there is much irrelevant and redundant
information. Data preprocessing is important to produce quality mining results. The main tasks
in data preprocessing are data cleaning, data integration, data transformation, data reduction and
data discretization. The final dataset used will be the product of data preprocessing.

2.1. Missing Values and Imputation of Missing Values
Many existing industrial and research datasets contain missing values. They are introduced for
reasons such as manual data entry procedures, equipment errors and incorrect measurements. The
most important task in data cleaning is the processing of missing values. Many methods have been
proposed in the literature to treat missing data [21, 22]. Missing data should be carefully
handled; otherwise bias might be introduced into the induced knowledge, affecting the quality of
the supervised learning process or the performance of classification algorithms [3, 4]. Missing
value imputation is a challenging issue in machine learning and data mining [2]. Due to the
difficulty with missing values, most learning algorithms are not well adapted to some application
domains (for example, Web applications) because existing algorithms are designed under the
assumption that there are no missing values in the dataset. This shows the necessity of dealing
with missing values. The most satisfactory solution to missing data is good database design, but
good analysis can help to reduce the problems [1]. Many techniques can be used to deal with
missing data; depending on the problem domain and the goal of the data mining process, the right
technique should be selected.

In the literature, the different approaches to handling missing values in a dataset are given as:
1. Ignore the tuple
2. Fill in the missing values manually
3. Use a global constant to fill in the missing values
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean of all samples belonging to the same class as the given tuple

2.2. Mean substitution
A popular imputation method for filling in missing data values is to use a variable's mean or
median. This method is suitable only if the data are MCAR (Missing Completely At Random). It
creates a spiked distribution at the mean in frequency distributions, lowers the correlations
between the imputed variables and the other variables, and underestimates variance.




The Simple Data Scaling Algorithm
The following algorithm explains the data scaling method.
Let
        D = { A1, A2, A3, ….. An }
Where
       D is the set of unnormalized data
       Ai is the ith attribute column of values of D
       m is the number of rows (records)
       n is the number of attributes.
Function Normalize(D)
Begin
       For i=1 to n {
                Maxi ← max(Ai)
                Mini ← min(Ai)
                For r=1 to m {
                         Air ← (Air - Mini) / (Maxi - Mini)
                         Where
                         Air is the element of Ai at row r
                }
       }
       Finally we will have the scaled data set.
End
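
As an illustration, the following is a minimal Python sketch of this scaling step. It is our own
rendering, not part of the original paper: the function name min_max_scale and the use of NumPy
are assumptions.

import numpy as np

def min_max_scale(data):
    """Scale each attribute column of `data` to the range [0, 1],
    following the simple data scaling algorithm above."""
    data = np.asarray(data, dtype=float)
    scaled = np.empty_like(data)
    for i in range(data.shape[1]):           # for each attribute column Ai
        col_min = data[:, i].min()           # Mini
        col_max = data[:, i].max()           # Maxi
        span = col_max - col_min
        # Air <- (Air - Mini) / (Maxi - Mini); guard against constant columns
        scaled[:, i] = (data[:, i] - col_min) / span if span > 0 else 0.0
    return scaled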
The Mean Substitution Algorithm
The following algorithm explains the most commonly used form of the mean substitution method [9].
Let
        D = { A1, A2, A3, ….. An }
Where
       D is the set of data with missing values
       Ai is the ith attribute column of values of D, with missing values in some or all columns
       n is the number of attributes.
Function MeanSubstitution(D)
Begin
       For i=1 to n {
                ai ← Ai \ mi
                where
                     ai is the column of attribute values without the missing values
                     mi is the set of missing values in Ai (missing values denoted by a symbol)
                Let µi be the mean of ai
                Replace all the missing elements of Ai with µi
       }
       Finally we will have the imputed data set.
End
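
A hedged Python sketch of this mean substitution step is given below. It assumes missing values
are already encoded as NaN (for the Pima data, zeros in the affected attributes would first be
recoded as missing, as discussed in section 3.1); the function name mean_substitution is ours.

import numpy as np

def mean_substitution(data):
    """Replace missing values (NaN) in each column with that column's
    mean, computed over the non-missing entries only."""
    data = np.asarray(data, dtype=float).copy()
    for i in range(data.shape[1]):        # for each attribute column Ai
        col = data[:, i]
        missing = np.isnan(col)           # mi: the missing entries of Ai
        if missing.any():
            mu = col[~missing].mean()     # mean of ai, the observed values
            col[missing] = mu             # replace missing elements with the mean
    return data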



2.3. k-Nearest Neighbor (kNN) Classification Algorithm
kNN classifies instances based on their similarity. It is one of the most popular algorithms for
pattern recognition. It is a type of lazy learning, where the function is only approximated
locally and all computation is deferred until classification. An object is classified by a
majority vote of its neighbors; k is always a positive integer. The neighbors are selected from a
set of objects for which the correct classification is known.

The kNN algorithm is as follows:
   1. Determine k, i.e., the number of nearest neighbors.
   2. Using the distance measure, calculate the distance between the query instance and all the
      training samples.
   3. Sort the distances of all the training samples and determine the nearest neighbors based
      on the k minimum distances.
   4. Since kNN is supervised learning, collect the class labels of the k nearest training
      samples.
   5. Predict the class of the query instance by the majority class among its nearest neighbors.

The Pseudo Code of kNN Algorithm
Function kNN(train_patterns, train_targets, test_patterns)
       Uc := the set of unique labels of train_targets
       N := size of test_patterns
       for i = 1:N
                dist := EuclideanDistance(train_patterns, test_patterns(i))
                idxs := sort(dist)
                topkClasses := train_targets(idxs(1:k))
                c := DominatingClass(topkClasses)
                test_targets(i) := c
       end
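
For concreteness, a minimal runnable Python version of this pseudocode follows. It is a sketch
under the same Euclidean-distance and majority-vote assumptions; the name knn_classify and the
NumPy-based implementation are our own.

import numpy as np
from collections import Counter

def knn_classify(train_patterns, train_targets, test_patterns, k=5):
    """Classify each test pattern by a majority vote of its k nearest
    training patterns under the Euclidean distance."""
    train_patterns = np.asarray(train_patterns, dtype=float)
    train_targets = np.asarray(train_targets)
    predictions = []
    for x in np.asarray(test_patterns, dtype=float):
        # distance between the query instance and all training samples
        dist = np.sqrt(((train_patterns - x) ** 2).sum(axis=1))
        idxs = np.argsort(dist)            # sort by distance
        top_k = train_targets[idxs[:k]]    # labels of the k nearest neighbors
        predictions.append(Counter(top_k).most_common(1)[0][0])
    return np.array(predictions)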


2.4. Validating the Performance of the Classification Algorithm
Classifier performance depends on the characteristics of the data to be classified. The
performance of the selected algorithm is measured in terms of Accuracy and Error rate. Various
empirical tests can be performed to compare classifiers, such as holdout, random sub-sampling,
k-fold cross-validation and the bootstrap method. In this study, we selected k-fold
cross-validation for evaluating the classifiers. In k-fold cross-validation, the initial data are
randomly partitioned into k mutually exclusive subsets or folds d1, d2, …, dk, each of
approximately equal size. Training and testing are performed k times. In the first iteration,
subsets d2, …, dk collectively serve as the training set to obtain a first model, which is tested
on d1; the second iteration is trained on subsets d1, d3, …, dk and tested on d2; and so on [19].
The accuracy of the classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data [19].

The Accuracy and Error rate can be defined as follows:
Accuracy = (TP+TN) / (TP + FP + TN + FN)
Error rate = (FP+FN) / (TP + FP + TN + FN)
Where


TP is the number of True Positives
TN is the number of True Negatives
FP is the number of False Positives
FN is the number of False Negatives
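
As a sketch of how these metrics can be obtained under k-fold cross-validation, the following
Python fragment uses scikit-learn's StratifiedKFold for the fold splitting and its kNN
classifier; this particular pairing is an assumption on our part, as the original experiments
may have used a different implementation.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def kfold_accuracy_error(X, y, k_folds=10, k_neighbors=5):
    """Estimate Accuracy = (TP+TN)/(TP+FP+TN+FN) and Error rate = 1 - Accuracy
    for kNN using k-fold cross-validation. X and y are NumPy arrays."""
    skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=0)
    fold_accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=k_neighbors)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        # (TP + TN) is simply the number of correct predictions
        fold_accuracies.append((pred == y[test_idx]).mean())
    accuracy = float(np.mean(fold_accuracies))
    return accuracy, 1.0 - accuracy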

3. EXPERIMENTAL RESULTS
3.1. Pima Indians Diabetes Dataset
The Pima (or Akimel O'odham) are a group of American Indians living in southern Arizona. The
name "Akimel O'odham" means "river people". The short name "Pima" is believed to have come from
the phrase pi 'añi mac or pi mac, meaning "I don't know", used repeatedly in their initial
meetings with Europeans [25]. According to the World Health Organization (WHO) [18], a
population of women of Pima Indian heritage who were at least 21 years old was tested for
diabetes. The data were collected by the US National Institute of Diabetes and Digestive and
Kidney Diseases.
Number of Instances: 768
Number of Attributes: 8 (Attributes) plus 1 (class label)
All the attributes are numeric-valued

                      Sl No   Attribute   Explanation
                      1       Pregnant    Number of times pregnant
                      2       Glucose     Plasma glucose concentration (glucose tolerance test)
                      3       Pressure    Diastolic blood pressure (mm Hg)
                      4       Triceps     Triceps skin fold thickness (mm)
                      5       Insulin     2-Hour serum insulin (mu U/ml)
                      6       Mass        Body mass index (weight in kg/(height in m)^2)
                      7       Pedigree    Diabetes pedigree function
                      8       Age         Age (years)
                      9       Diabetes    Class variable (test for diabetes)

Class Distribution: Class value 1 is interpreted as "tested positive for diabetes"
Class Value : 0 - Number of instances - 500
Class Value : 1 - Number of instances – 268

While the UCI repository index claims that there are no missing values, closer inspection of the
data shows several physical impossibilities, e.g., a blood pressure or body mass index of 0.
These should be treated as missing values to achieve better classification accuracy.
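
A minimal pandas sketch of this recoding step is shown below; the column names follow the
attribute table above, and the file name pima-indians-diabetes.csv is illustrative, standing in
for a local copy of the UCI dataset.

import numpy as np
import pandas as pd

cols = ["Pregnant", "Glucose", "Pressure", "Triceps",
        "Insulin", "Mass", "Pedigree", "Age", "Diabetes"]
df = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)

# A value of 0 is physically impossible for these attributes,
# so recode it as missing before imputation.
impossible_zero = ["Glucose", "Pressure", "Triceps", "Insulin", "Mass"]
df[impossible_zero] = df[impossible_zero].replace(0, np.nan)
print(df.isna().sum())   # number of recoded missing values per attribute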

The following 3D plot clearly shows the complex distribution of the positive and negative
records in the original Pima Indians Diabetes Database which will complicate the classification
task.





                 Fig 1. The 3D plot of original Pima Indians Diabetes Database

3.2. Results of the 10-Fold Validation
The following table shows the results in terms of accuracy, error and run time in the case of
10-fold validation.

                               Table 1. 10-Fold Validation Results

                      Metric      Actual     After         After Imputation
                                  Data       Imputation    and Scaling
                      Accuracy    71.71      71.32         71.84
                      Error       28.29      28.68         28.16
                      Time        0.2        0.28          0.23

As shown in the following graphs, performance improved considerably when using imputed and
scaled data. The performance after imputation alone is almost equal to, or slightly lower than,
that obtained with the original data. The reason is that the original data contains a lot of
missing values, and the missing values, represented by 0, are treated as a significant feature
by the classifier. Even though the performance on the original data is higher than that on the
imputed data, we should not rely on it, because it is not accurate. If we train the system with
original data that has a lot of missing values and use it for classifying real-world data that
may not contain any missing values, then the system will not classify such ideal data correctly.






[Figure: bar chart of accuracy (%) for Actual Data, After Imputation, and After Imputation and
Scaling]

                       Fig 2. Comparison of Accuracy - 10-Fold Validation

The accuracy of classification has been significantly improved after imputation and scaling.
[Figure: bar chart of error (%) for Actual Data, After Imputation, and After Imputation and
Scaling]

                         Fig 3. Comparison of Error - 10-Fold Validation

The classification error has been significantly reduced after imputation and scaling.

3.3. Results of the Average of 2 to 10 Fold Validation
The following table shows the average of the results in terms of accuracy, error and run time
for 2- to 10-fold validation.

                      Table 2. Average of 2 to 10 Fold Validation Results

                      Metric      Actual     After         After Imputation
                                  Data       Imputation    and Scaling
                      Accuracy    71.94      71.26         73.38
                      Error       28.06      28.74         26.62
                      Time        0.19       0.19          0.19


[Figure: bar chart of accuracy (%) for Actual Data, After Imputation, and After Imputation and
Scaling]

                  Fig 4. Comparison of Accuracy - Avg. of 2-10 Fold Validations

The accuracy of classification has been significantly improved after imputation and scaling. The
average of 2 to 10 fold validations clearly shows the difference in accuracy.

[Figure: bar chart of error (%) for Actual Data, After Imputation, and After Imputation and
Scaling]

                    Fig 5. Comparison of Error - Avg. of 2-10 Fold Validations

The classification error has been significantly reduced after imputation and scaling. The average
of the 2- to 10-fold validations clearly shows the difference in classification error.

The graphs above show that imputation alone does not give higher accuracy, but imputation
combined with a data preprocessing technique can improve the accuracy on a dataset that has
missing values.

We also compared our improved kNN with Weka's implementation of kNN. The results show that the
improved kNN's accuracy with the imputation and scaling method is 3.2% higher than that of the
standard Weka implementation of kNN.


4. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS
The results show that the characteristics and distribution of the data and the quality of the
input data have a significant impact on the performance of the classifier. The performance of the
kNN classifier was measured in terms of Accuracy, Error rate and run time using the 10-fold
validation method and the average of 2- to 10-fold validations. The application of the imputation
method and data normalization had an obvious
impact on classification performance and significantly improved the performance of kNN. The
performance after imputation alone is almost equal to, or slightly lower than, that obtained with
the original data. The reason is that the original data contains a lot of missing values, and the
missing values, represented by 0, are treated as a significant feature by the classifier. So data
imputation methods do not always lead to so-called "good" results; they only give "correct"
results. This means imputation will not always lead to higher accuracy; however, imputation along
with a suitable data preprocessing method increases the accuracy. Our improved kNN's accuracy
with the imputation and scaling method is 3.2% higher than that of the standard Weka
implementation of kNN. To improve the overall accuracy, we have to improve the performance of the
classifier or use a good feature selection method. Future work may address hybrid classification
models using kNN with other techniques. Even though the achieved accuracy of the kNN-based method
is slightly lower than that of some of the previous methods, there is still scope for improving
the kNN-based method. For example, we may use another distance metric instead of the standard
Euclidean distance in the distance calculation part of kNN to improve accuracy. We may even use
an evolutionary computing technique such as GA or PSO for feature selection or feature weight
selection to attain the maximum possible classification accuracy with kNN. Future work may
therefore address ways to incorporate these ideas into the standard kNN.

ANNEXURE - RESULTS OF K-FOLD VALIDATION
The following table shows the k-fold validation results where k is varied from 2 to 10. We used
the average of these results, as well as the results corresponding to k=10, for preparing the
graphs in section 3.
                   Table 3. Results With Actual Pima Indian Diabetes Dataset

                           k        Accuracy        Error      Time
                           2        72.53           27.47      0.13
                           3        74.35           25.65      0.20
                           4        71.61           28.39      0.16
                           5        70.98           29.02      0.17
                           6        72.40           27.60      0.19
                           7        70.90           29.10      0.25
                           8        72.14           27.86      0.20
                           9        70.85           29.15      0.19
                           10       71.71           28.29      0.20
                           avg      71.94           28.06      0.19





                                 Table 4. Results With Imputed Data

                                k        Accuracy      Error      Time
                                2        71.22         28.78      0.16
                                3        70.44         29.56      0.19
                                4        70.83         29.17      0.16
                                5        71.24         28.76      0.14
                                6        71.88         28.13      0.19
                                7        71.56         28.44      0.20
                                8        71.61         28.39      0.17
                                9        71.24         28.76      0.19
                                10       71.32         28.68      0.28
                                Avg      71.26         28.74      0.19

                     Table 5. Results With Imputed and Scaled/Normalized Data

                              k         Accuracy       Error       Time
                              2         74.74          25.26       0.22
                              3         74.09          25.91       0.13
                              4         71.88          28.13       0.16
                              5         73.59          26.41       0.17
                              6         72.92          27.08       0.17
                              7         74.31          25.69       0.16
                              8         74.09          25.91       0.22
                              9         72.94          27.06       0.22
                              10        71.84          28.16       0.23
                              Avg       73.38          26.62       0.19


5. REFERENCES
[1] Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning
      databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California,
      Department of Information and Computer Science.
[2]   Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press,
      Cambridge.
[3] Grace Whaba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), Soft Classification a.k.a.
      Risk Estimation via Penalized Log Likelihood and Smoothing Spline Analysis of Variance, in D. H.
      Wolpert (1995), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.
[4] S. Sahan, K. Polat, H. Kodaz, and S. Gunes, "The medical applications of attribute weighted artificial
      immune system (AWAIS): Diagnosis of heart and diabetes diseases," in ICARIS, 2005, pp. 456-468.
[5] M. Salami, A. Shafie, and A. Aibinu, "Application of modeling techniques to diabetes diagnosis,"
    in IEEE EMBS Conference on Biomedical Engineering & Sciences, 2010.
[6] T. Exarchos, D. Fotiadis, and M. Tsipouras, "Automated creation of transparent fuzzy models based
    on decision trees application to diabetes diagnosis," in 30th Annual International IEEE EMBS
    Conference, 2008.
[7] M. Ganji and M. Abadeh, "Using fuzzy ant colony optimization for diagnosis of diabetes disease,"
    IEEE, 2010.
[8] T. Jayalakshmi and A. Santhakumaran, "A novel classification method for diagnosis of diabetes
    mellitus using artificial neural networks," in International Conference on Data Storage and Data
    Engineering, 2010.
[9] R.S. Somasundaram and R. Nedunchezhian, "Evaluation of Three Simple Imputation Methods for
    Enhancing Preprocessing of Data with Missing Values", International Journal of Computer
    Applications, ISSN 0975-8887, 2011.
[10] Peterson, H. B., Thacker, S. B., Corso, P. S., Marchbanks, P. A. & Koplan, J. P. (2004). Hormone
    therapy: making decisions in the face of uncertainty. Archives of Internal Medicine, 164(21), 2308-12.
[11] Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm", Proceedings of the Fifth
    International Conference on Information, Communication and Signal Processing, 2005.
[12] Manaswini Pradhan and Ranjit Kumar Sahu, "Predict the onset of diabetes disease using Artificial
    Neural Network (ANN)", International Journal of Computer Science & Emerging Technologies
    (E-ISSN: 2044-6004), Volume 2, Issue 2, April 2011.
[13] P. Umar Sathic Ali and C. Jothi Ventakeswaran, "Improved Evidence Theoretic kNN Classifier based
    on Theory of Evidence", International Journal of Computer Applications (0975-8887), Volume 15,
    No. 5, February 2011.
[14] Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang and Albert Y. Zomaya, "A particle swarm based
    hybrid system for imbalanced medical data sampling", BMC Genomics, 2009; 10(Suppl 3): S34.
[15] Biswadip Ghosh, "Using Fuzzy Classification for Chronic Disease", Indian Journal of
    Economics & Business, March 2012.
[16] Angeline Christobel and SivaPrakasam, "An Empirical Comparison of Data Mining Classification
    Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[17] Roshawnna Scales, Mark Embrechts, “Computational intelligence techniques for medical
    diagnostics”.
[18] World Health Organization. Available: http://www.who.int
[19] J. Han and M. Kamber,”Data Mining Concepts and Techniques”, Morgan Kauffman Publishers, USA,
    2006.
[20] Agrawal, R., Imielinski, T., Swami, A., “Database Mining:A Performance Perspective”, IEEE
    Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[21] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy",
    In: D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data
    Mining Applications, Springer-Verlag Berlin-Heidelberg, 2004.
[22] G.E.A.P.A. Batista and M.C. Monard, "An analysis of four missing data treatment methods for
    supervised learning", Applied Artificial Intelligence 17 (2003).
[23] Chen, M., Han, J., Yu P.S., “Data Mining: An Overview from Database Perspective”, IEEE
    Transactions on Knowledge and Data Engineering, Vol. 8 No.6, December 1996.


[24] Angeline Christobel, Usha Rani Sridhar, Kalaichelvi Chandrahasan, “Performance of Inductive
    Learning Algorithms on Medical Datasets”, International Journal of Advances in Science and
    Technology, Vol. 3, No.4, 2011
[25] Awawtam. "Pima Stories of the Beginning of the World." The Norton Anthology of American
    Literature. 7th ed. Vol. A. New York: W. W. Norton &, 2007. 22-31
[26] Quinlan, J. R. C4.5: “Programs for Machine Learning”. Morgan Kaufmann Publishers, 1993.
[27] S.B. Kotsiantis, Supervised Machine Learning: “A Review of Classification Techniques”, Informatica
    31(2007) 249-268, 2007
[28] J. R. Quinlan. “Improved use of continuous attributes in c4.5”, Journal of Artificial Intelligence
    Research, 4:77-90, 1996.
[29] J.R. Quinlan, “Induction of decision trees,” In Jude W.Shavlik, Thomas G. Dietterich, (Eds.),
    Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning,
    vol. 1, 1986, pp 81–106.
[30] Salvatore Ruggieri, "Efficient C4.5", IEEE Transactions on Knowledge and Data Engineering,
    Vol. 14, No. 2, pp. 438-444, 2002.
[31] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one
    loss", Machine Learning 29(2-3) (1997) 103-130.
[32] Vapnik, V.N., “The Nature of Statistical Learning Theory”,      1st ed., Springer-Verlag,New York,
    1995.
[33] Michael J. Sorich, John O. Miners,Ross A. McKinnon,David A. Winkler, Frank R. Burden, and Paul
    A. Smith, “Comparison of linear and nonlinear classification algorithms for the prediction of drug and
    chemical metabolism, by human UDP- Glucuronosyltransferase Isoforms”
[34] UCI Machine Learning Repository,
    http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[35] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda,
    Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach,
    David J. Hand and Dan Steinberg, "Top 10 algorithms in data mining", Springer, 2007.
[36] Niculescu-Mizil, A., & Caruana, R. (2005). “Predicting good probabilities with supervised learning”,
    Proc.22nd International Conference on Machine Learning (ICML'05).
[37] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the
    International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009,
    Hong Kong.
[38] Arbach, L., Reinhardt, J.M., Bennett, D.L. and Fallouh, G., "Mammographic masses classification:
    comparison between backpropagation neural network (BNN), K nearest neighbors (KNN), and human
    readers", 2003 IEEE CCECE.
[39] D. Xhemali, C.J. Hinde and R.G. Stone, "Naïve Bayes vs. Decision Trees vs. Neural Networks in the
    Classification of Training Web Pages", International Journal of Computer Science, 4(1):16-23, 2009.





 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
 

Improving the performance of k nearest neighbor algorithm for the classification of diabetes dataset with missing values

  • 1. INTERNATIONAL Issue Engineering and Technology © IAEME ISSN 0976 – 6367(Print), ISSN International Journal of Computer 0976 – 6375(Online) Volume 3, (IJCET), JOURNAL OF COMPUTER ENGINEERING & 3, October-December (2012), TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 3, Issue 3, October - December (2012), pp. 155-167 IJCET © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2012): 3.9580 (Calculated by GISI) ©IAEME www.jifactor.com IMPROVING THE PERFORMANCE OF K-NEAREST NEIGHBOR ALGORITHM FOR THE CLASSIFICATION OF DIABETES DATASET WITH MISSING VALUES Y. Angeline Christobel1, P. Sivaprakasam2, 1 Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, India 2 Professor, Department of Computer Science, Sri Vasavi College, Erode, India ABSTRACT In today’s world, people get affected by many diseases which cannot be completely cured. Diabetes is one such disease and is now a big growing health problem. It leads to the risk of heart attack, kidney failure and renal disease. The techniques of data mining have been widely applied to extract knowledge from medical databases. In this paper, we evaluated the performance of k- nearest neighbor(kNN) algorithm for classification of Diabetes data. We considered the data imputation, scaling and normalization techniques to improve the accuracy of the classifier while using diabetes data, which may contain lot of missing values. We selected to explore KNN because, it is very simple and faster than most of the complex classification algorithms. To measure the performance, we used Accuracy and Error rate as the metrics. We found that data imputation method will not lead to higher accuracy; instead it will give correct accuracy for missing values and imputation along with a suitable data preprocessing method increases the accuracy. Keywords: Data Mining, Classification, kNN, Data Imputation Methods, Data Normalization and Scaling 1. INTRODUCTION Hospitals and medical institutions are finding difficult to extract useful information for decision support due to the increase of medical data. So the methods for efficient computer based analysis are essential. It has been proven that the benefits of introducing machine learning into medical analysis are to increase diagnostic accuracy, to reduce costs and to reduce human resources. Earlier detection of disease will aid in increased exposure to required patient care and improved cure rates [17]. Diabetes is one of the most deadly diseases observed in many of the nations. According to the World Health Organization (WHO) [18] there are approximately millions of 155
computer vision, and artificial intelligence. The goal of data classification is to assign objects to one of a number of categories or classes.

A considerable amount of research has been carried out on the Pima Indian diabetes dataset. In [8], Jayalakshmi and Santhakumaran used an ANN method for diagnosing diabetes on the Pima Indian diabetes dataset without missing data and obtained 68.56% classification accuracy. Manaswini Pradhan et al. [12] proposed an Artificial Neural Network (ANN) based classification model for classifying diabetic patients, with a best accuracy of 72% and an average accuracy of 72.2%. Umar Sathic Ali and Jothi Ventakeswaran [13] proposed an improved evidence-theoretic kNN algorithm, which combines the Dempster-Shafer theory of evidence and the k-nearest neighbour rule with a distance-metric-based neighborhood, achieving an average accuracy of 73.57%.

Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang and Albert Y. Zomaya [14] proposed a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. Multiple classification algorithms are used to guide the sampling process; the hybrid system composition includes Decision Tree (J48), k-Nearest Neighbor (kNN), Naive Bayes (NB), Random Forest (RF) and Logistic Regression (LOG). Four medical datasets (blood, survival, breast cancer and Pima Indian Diabetes) from the UCI Machine Learning Repository were used, together with a genome-wide association study (GWAS) dataset from the genotyping of Single Nucleotide Polymorphisms (SNPs) of Age-related Macular Degeneration (AMD). The methods compared in the study include PSO (Particle Swarm based Hybrid System), RO (Random OverSampling), RU (Random UnderSampling) and Clustering; the metrics used are AUC (Area Under the ROC Curve), F-measure and Geometric mean. On Pima Indian Diabetes with a kNN classifier, PSO gives an average accuracy of 71.6%, RU 67.3%, RO 65.8% and clustering 64.9%. With the other classifiers and datasets as well, the study shows that PSO gives better accuracy.

In [15], Biswadip Ghosh applied FCP (Fuzzy Composite Programming) to build a diabetes classifier using the Pima diabetes dataset. He evaluated the performance of the Fuzzy classifier using Receiver Operating Characteristic (ROC) curves and compared it against a Logistic Regression classifier. The comparison of the calculated AUC values shows that the FCP classifier is better: the classifier based on logistic regression has an AUC of 64.8%, while the classifier based on FCP has an AUC of 72.2%. The performance of the fuzzy classifier was thus found to be better than that of a logistic regression classifier.

Quinlan applied C4.5 and obtained 71.1% accuracy (Quinlan 1993). In [3], Wahba's group at the University of Wisconsin applied penalized log likelihood smoothing spline analysis of variance (PSA). They eliminated patients with glucose and BMI values of zero, leaving n = 752; they used 500 records for the training set and the remaining 252 as the evaluation set, which showed an accuracy of 72% for the PSA model and 74% for a GLIM model (Wahba, Gu et al. 1992).
Michie et al. used 22 algorithms with 12-fold cross validation and reported the following accuracy rates on the test set: Discrim 77.5%, Quaddisc 73.8%, Logdisc 77.7%, SMART 76.8%, ALLOC80 69.9%, k-NN 67.6%, CASTLE 74.2%, CART 74.5%, IndCART 72.9%, NewID 71.1%, AC2 72.4%, Baytree 72.9%, NaiveBay 73.8%, CN2 71.1%, C4.5 73%, Itrule 75.5%, Cal5 75%, Kohonen 72.7%, DIPOL92 77.6%, Backprop 75.2%, RBF 75.7%, and LVQ 72.8% (Michie, Spiegelhalter et al. 1994, p. 158).
J. Oates (1994) used the Multi-Stream Dependency Detection (MSDD) algorithm, training on two-thirds of the dataset; accuracy on the remaining one-third used for evaluation was 71.33%. Although the cited articles use different techniques, the average accuracy is 70.64% for the Pima Indian Diabetes dataset.

The representation of the many missing values by zeros itself forms a spurious characteristic of the data and hence affects the accuracy of the classifier. The main objective of this paper is to explore ways to improve the classification accuracy of a classification algorithm using suitable techniques such as data imputation, scaling and normalization for the classification of diabetes data, which may contain a lot of missing values.

2. DATA PREPROCESSING

Data in the real world is incomplete, noisy, inconsistent and redundant. Knowledge discovery during the training phase is difficult if there is much irrelevant and redundant information, so data preprocessing is important to produce quality mining results. The main tasks in data preprocessing are data cleaning, data integration, data transformation, data reduction and data discretization. The final dataset used is the product of data preprocessing.

2.1. Missing Values and Imputation of Missing Values

Many existing industrial and research datasets contain missing values. They are introduced for reasons such as manual data entry procedures, equipment errors and incorrect measurements. The most important task in data cleaning is processing missing values, and many methods have been proposed in the literature to treat missing data [21, 22]. Missing data should be handled carefully; otherwise bias might be introduced into the knowledge induced, affecting the quality of the supervised learning process or the performance of classification algorithms [3, 4].

Imputation of missing values is a challenging issue in machine learning and data mining [2]. Because of the difficulty with missing values, most learning algorithms are not well adapted to some application domains (for example, Web applications), since existing algorithms are designed under the assumption that there are no missing values in the dataset. This shows the necessity of dealing with missing values. The most satisfactory solution to missing data is good database design, but good analysis can help to reduce the problems [1]. Many techniques can be used to deal with missing data; the right technique should be selected depending on the problem domain and the goal of the data mining process. In the literature, the different approaches to handling missing values in a dataset are given as:

1. Ignore the tuple
2. Fill in the missing values manually
3. Use a global constant to fill in the missing values
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean of all samples belonging to the same class as the given tuple

2.2. Mean Substitution

A popular imputation method to fill in the missing data values is to use a variable's mean or median. This method is suitable only if the data are MCAR (Missing Completely At Random). It creates a spiked distribution at the mean in frequency distributions, lowers the correlations between the imputed variables and the other variables, and underestimates variance. A brief code illustration of the options listed above is given below.
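To make these options concrete, the following is a minimal pandas sketch; it is ours, not part of the original study, and the tiny example DataFrame, its column names and the helper variables are assumptions made purely for illustration.

import pandas as pd
import numpy as np

# A toy fragment of a diabetes-like table; NaN marks a missing value.
df = pd.DataFrame({
    "Glucose":  [148.0, np.nan, 183.0, 89.0],
    "Pressure": [72.0, 66.0, np.nan, 66.0],
    "Diabetes": [1, 0, 1, 0],
})
cols = ["Glucose", "Pressure"]

df_drop  = df.dropna()                 # option 1: ignore the tuple
df_const = df.fillna(0)                # option 3: fill with a global constant
df_mean  = df.fillna(df[cols].mean())  # option 4: fill with the attribute mean

# Option 5: fill with the attribute mean of samples in the same class.
df_class = df.copy()
df_class[cols] = df.groupby("Diabetes")[cols].transform(lambda c: c.fillna(c.mean()))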
The Simple Data Scaling Algorithm

The following algorithm explains the data scaling method.

Let D = { A1, A2, A3, ..., An }
where
  D  is the set of unnormalized data
  Ai is the ith attribute column of values of D
  m  is the number of rows (records)
  n  is the number of attributes

Function Normalize(D)
Begin
  For i = 1 to n
  {
    Maxi ← max(Ai)
    Mini ← min(Ai)
    For r = 1 to m
    {
      Air ← Air - Mini
      Air ← Air / Maxi
      where Air is the element of Ai at row r
    }
  }
  Finally we will have the scaled data set.
End

The Mean Substitution Algorithm

The following algorithm explains the most commonly used form of the mean substitution method [9].

Let D = { A1, A2, A3, ..., An }
where
  D  is the set of data with missing values
  Ai is the ith attribute column of values of D, with missing values in some or all columns
  n  is the number of attributes

Function MeanSubstitution(D)
Begin
  For i = 1 to n
  {
    ai ← Ai \ mi
      where ai is the column of attribute values without missing values and
            mi is the set of missing values in Ai (missing values denoted by a symbol)
    Let µi be the mean of ai
    Replace all the missing elements of Ai with µi
  }
  Finally we will have the imputed data set.
End
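A minimal NumPy rendering of the two listings above is given below. This is our sketch, with missing values assumed to be encoded as NaN and function names of our own choosing. Note that the scaling listing divides the shifted values by the column maximum; conventional min-max scaling would divide by (max - min) instead.

import numpy as np

def mean_substitution(X):
    # Replace each NaN with the mean of the non-missing values
    # in its column, as in the MeanSubstitution listing.
    X = X.copy()
    for i in range(X.shape[1]):
        col = X[:, i]
        col[np.isnan(col)] = np.nanmean(col)
    return X

def normalize(X):
    # Column-wise scaling as in the Normalize listing: subtract the
    # column minimum, then divide by the (original) column maximum.
    # Assumes each column maximum is non-zero.
    X = X.copy()
    for i in range(X.shape[1]):
        mx, mn = X[:, i].max(), X[:, i].min()
        X[:, i] = (X[:, i] - mn) / mx
    return X

Applying mean_substitution followed by normalize corresponds to the "imputation and scaling" configuration evaluated in Section 3.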
2.3. k-Nearest Neighbor (kNN) Classification Algorithm

kNN classification classifies instances based on their similarity to the training instances. It is one of the most popular algorithms for pattern recognition and is a type of lazy learning, where the function is only approximated locally and all computation is deferred until classification. An object is classified by the majority vote of its neighbors, where k is always a positive integer and the neighbors are selected from a set of objects for which the correct classification is known. The kNN algorithm is as follows:

1. Determine k, i.e., the number of nearest neighbors.
2. Using the distance measure, calculate the distance between the query instance and all the training samples.
3. Sort the distances of all the training samples and determine the nearest neighbors based on the k minimum distances.
4. Since kNN is supervised learning, get the categories of the training data for the sorted values which fall under k.
5. The prediction value is obtained by the majority vote of the nearest neighbors.

The Pseudo Code of the kNN Algorithm

Function kNN(train_patterns, train_targets, test_patterns)
  Uc - a set of unique labels of train_targets
  N  - size of test_patterns
  for i = 1:N,
    dist := EuclideanDistance(train_patterns, test_patterns(i))
    idxs := sort(dist)
    topkClasses := train_targets(idxs(1:Knn))
    c := DominatingClass(topkClasses)
    test_targets(i) := c
  end

2.4. Validating the Performance of the Classification Algorithm

Classifier performance depends on the characteristics of the data to be classified. The performance of the selected algorithm is measured in terms of Accuracy and Error rate. Various empirical tests can be performed to compare classifiers, such as holdout, random sub-sampling, k-fold cross validation and the bootstrap method. In this study, we have selected k-fold cross validation for evaluating the classifiers. In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds d1, d2, ..., dk, each approximately equal in size. Training and testing are performed k times. In the first iteration, subsets d2, ..., dk collectively serve as the training set in order to obtain a first model, which is tested on d1; the second iteration is trained on subsets d1, d3, ..., dk and tested on d2; and so on [19].

The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data [19]. The Accuracy and Error rate can be defined as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error rate = (FP + FN) / (TP + FP + TN + FN)

where TP is the number of True Positives, TN the number of True Negatives, FP the number of False Positives and FN the number of False Negatives.
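A minimal Python sketch of this procedure, together with the k-fold evaluation and the Accuracy metric defined above, is given below. This is our illustration, not code from the paper: the helper names (knn_classify, accuracy, k_fold_accuracy) are ours, patterns and labels are assumed to be stored in NumPy arrays, and Euclidean distance is used as in the pseudocode (a different metric could be substituted at the marked line).

import numpy as np
from collections import Counter

def knn_classify(train_patterns, train_targets, test_patterns, k):
    # For each test pattern, take a majority vote among the labels
    # of its k nearest training patterns.
    test_targets = []
    for x in test_patterns:
        # Euclidean distance to every training pattern
        # (another metric could be substituted here).
        dist = np.sqrt(((train_patterns - x) ** 2).sum(axis=1))
        idxs = np.argsort(dist)[:k]
        top_k = train_targets[idxs]
        test_targets.append(Counter(top_k).most_common(1)[0][0])
    return np.array(test_targets)

def accuracy(y_true, y_pred):
    # Accuracy = (TP + TN) / (TP + FP + TN + FN), i.e. the
    # fraction of correctly classified instances.
    return float(np.mean(y_true == y_pred))

def k_fold_accuracy(X, y, n_folds, k):
    # Randomly partition the data into n_folds folds; train on all
    # but one fold, test on the held-out fold, and average.
    idx = np.random.permutation(len(y))
    scores = []
    for fold in np.array_split(idx, n_folds):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        preds = knn_classify(X[mask], y[mask], X[~mask], k)
        scores.append(accuracy(y[~mask], preds))
    return float(np.mean(scores))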
3. EXPERIMENTAL RESULTS

3.1. Pima Indians Diabetes Dataset

The Pima (or Akimel O'odham) are a group of American Indians living in southern Arizona. The name "Akimel O'odham" means "river people". The short name "Pima" is believed to have come from the phrase pi 'añi mac or pi mac, meaning "I don't know", used repeatedly in their initial meeting with Europeans [25]. According to the World Health Organization (WHO) [18], a population of women of Pima Indian heritage who were at least 21 years old was tested for diabetes. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

Number of instances: 768
Number of attributes: 8, plus 1 class label; all attributes are numeric-valued.

Sl. No   Attribute   Explanation
1        Pregnant    Number of times pregnant
2        Glucose     Plasma glucose concentration (glucose tolerance test)
3        Pressure    Diastolic blood pressure (mm Hg)
4        Triceps     Triceps skin fold thickness (mm)
5        Insulin     2-hour serum insulin (mu U/ml)
6        Mass        Body mass index (weight in kg/(height in m)^2)
7        Pedigree    Diabetes pedigree function
8        Age         Age (years)
9        Diabetes    Class variable (test for diabetes)

Class distribution: class value 1 is interpreted as "tested positive for diabetes".
Class value 0: 500 instances
Class value 1: 268 instances

While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index values of 0. They should be treated as missing values to achieve better classification accuracy.
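As an illustration of how these physically impossible zeros can be recoded as missing values before imputation, one might proceed as follows in Python (our sketch; the file name, the absence of a header row, and the use of NaN as the missing-value marker are assumptions, and the column indices follow the attribute table above):

import numpy as np

# Load the raw Pima data; assumes a headerless CSV with the
# nine columns listed in the table above.
data = np.genfromtxt("pima-indians-diabetes.csv", delimiter=",")

# Zero is physically impossible for Glucose, Pressure, Triceps,
# Insulin and Mass (columns 1 to 5, counting from 0), so recode
# those zeros as NaN; the mean substitution step then treats them
# as missing rather than as genuine measurements.
for col in [1, 2, 3, 4, 5]:
    data[data[:, col] == 0, col] = np.nan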
The following 3D plot clearly shows the complex distribution of the positive and negative records in the original Pima Indians Diabetes Database, which complicates the classification task.

Fig 1. The 3D plot of the original Pima Indians Diabetes Database

3.2. Results of the 10-Fold Validation

The following table shows the results in terms of accuracy, error and run time in the case of 10-fold validation.

Table 1. 10-Fold Validation Results

Metric     Actual Data   After Imputation   After Imputation and Scaling
Accuracy   71.71         71.32              71.84
Error      28.29         28.68              28.16
Time       0.2           0.28               0.23

As shown in the following graphs, the performance was considerably improved when using imputed and scaled data. The performance after imputation alone is almost equal to, or slightly lower than, that with the original data. The reason is that the original data contains a lot of missing values, and the missing values represented by 0 are treated as a significant feature by the classifier. Even though the performance on the original data is higher than that on the imputed data, it should not be relied upon, because it is not accurate: if we train the system on the original data, with its many missing values, and use it to classify real-world data that may not contain any missing values, the system will not classify such ideal data correctly.
Fig 2. Comparison of Accuracy (10-fold validation)

The accuracy of classification has been significantly improved after imputation and scaling.

Fig 3. Comparison of Error (10-fold validation)

The classification error has been significantly reduced after imputation and scaling.

3.3. Results of the Average of 2- to 10-Fold Validation

The following table shows the average of the results in terms of accuracy, error and run time in the case of 2- to 10-fold validation.

Table 2. Average of 2- to 10-Fold Validation Results

Metric     Actual Data   After Imputation   After Imputation and Scaling
Accuracy   71.94         71.26              73.38
Error      28.06         28.74              26.62
Time       0.19          0.19               0.19
Fig 4. Comparison of Accuracy (average of 2- to 10-fold validations)

The accuracy of classification has been significantly improved after imputation and scaling; the average of the 2- to 10-fold validations clearly shows the difference in accuracy.

Fig 5. Comparison of Error (average of 2- to 10-fold validations)

The classification error has been significantly reduced after imputation and scaling; the average of the 2- to 10-fold validations clearly shows the difference in classification error.

The graphs above show that imputation alone does not give higher accuracy, but imputation combined with a suitable data preprocessing technique can improve the accuracy on a dataset which has missing values. We also compared our improved kNN with Weka's implementation of kNN. The results show that the improved kNN's accuracy with the imputation and scaling method is 3.2% higher than that of the standard Weka implementation of kNN.

4. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS

The results show that the characteristics and distribution of the data, and the quality of the input data, have a significant impact on the performance of the classifier. The performance of the kNN classifier was measured in terms of Accuracy, Error rate and run time using the 10-fold validation method and the average of 2- to 10-fold validations. The application of the imputation method and data normalization had an obvious
impact on classification performance and significantly improved the performance of kNN. The performance after imputation alone is almost equal to, or slightly lower than, that with the original data, because the original data contains a lot of missing values and the missing values represented by 0 are treated as a significant feature by the classifier. So data imputation methods do not always lead to so-called "good" results; they give only "correct" results. That is, imputation will not always lead to higher accuracy for imputed missing values, but imputation along with a suitable data preprocessing method does increase the accuracy. Our improved kNN's accuracy with the imputation and scaling method is 3.2% higher than that of the standard Weka implementation of kNN.

To improve the overall accuracy further, we have to improve the performance of the classifier or use a good feature selection method. Future work may address hybrid classification models using kNN with other techniques. Even though the accuracy achieved by the kNN based method is slightly lower than that of some of the previous methods, there is still scope for improving it. For example, another distance metric could be used instead of the standard Euclidean distance in the distance calculation part of kNN, or an evolutionary computing technique such as GA or PSO could be used for feature selection or feature weight selection to attain the maximum possible classification accuracy with kNN. Future work may address ways to incorporate these ideas along with the standard kNN.

ANNEXURE – RESULTS OF K-FOLD VALIDATION

The following table shows the k-fold validation results where k is changed from 2 to 10. We used the average of these results, as well as the results corresponding to k = 10, for preparing the graphs in Section 3 above.

Table 3. Results with the Actual Pima Indian Diabetes Dataset

k     Accuracy   Error   Time
2     72.53      27.47   0.13
3     74.35      25.65   0.20
4     71.61      28.39   0.16
5     70.98      29.02   0.17
6     72.40      27.60   0.19
7     70.90      29.10   0.25
8     72.14      27.86   0.20
9     70.85      29.15   0.19
10    71.71      28.29   0.20
avg   71.94      28.06   0.19
Table 4. Results with Imputed Data

k     Accuracy   Error   Time
2     71.22      28.78   0.16
3     70.44      29.56   0.19
4     70.83      29.17   0.16
5     71.24      28.76   0.14
6     71.88      28.13   0.19
7     71.56      28.44   0.20
8     71.61      28.39   0.17
9     71.24      28.76   0.19
10    71.32      28.68   0.28
avg   71.26      28.74   0.19

Table 5. Results with Imputed and Scaled/Normalized Data

k     Accuracy   Error   Time
2     74.74      25.26   0.22
3     74.09      25.91   0.13
4     71.88      28.13   0.16
5     73.59      26.41   0.17
6     72.92      27.08   0.17
7     74.31      25.69   0.16
8     74.09      25.91   0.22
9     72.94      27.06   0.22
10    71.84      28.16   0.23
avg   73.38      26.62   0.19

5. REFERENCES

[1] Newman, D.J., Hettich, S., Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Brian D. Ripley (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
[3] Grace Wahba, Chong Gu, Yuedong Wang and Richard Chappell (1995). Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance. In D. H. Wolpert (Ed.), The Mathematics of Generalization, 331-359. Addison-Wesley, Reading, MA.
[4] S. Sahan, K. Polat, H. Kodaz and S. Gunes, "The medical applications of attribute weighted artificial immune system (AWAIS): diagnosis of heart and diabetes diseases," in ICARIS, 2005, pp. 456-468.
[5] M. Salami, A. Shafie and A. Aibinu, "Application of modeling techniques to diabetes diagnosis,"
in IEEE EMBS Conference on Biomedical Engineering & Sciences, 2010.
[6] T. Exarchos, D. Fotiadis and M. Tsipouras, "Automated creation of transparent fuzzy models based on decision trees: application to diabetes diagnosis," in 30th Annual International IEEE EMBS Conference, 2008.
[7] M. Ganji and M. Abadeh, "Using fuzzy ant colony optimization for diagnosis of diabetes disease," IEEE, 2010.
[8] T. Jayalakshmi and A. Santhakumaran, "A novel classification method for diagnosis of diabetes mellitus using artificial neural networks," in International Conference on Data Storage and Data Engineering, 2010.
[9] R.S. Somasundaram and R. Nedunchezhian, "Evaluation of Three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values," International Journal of Computer Applications, ISSN 0975-8887, 2011.
[10] Peterson, H. B., Thacker, S. B., Corso, P. S., Marchbanks, P. A. and Koplan, J. P. (2004). Hormone therapy: making decisions in the face of uncertainty. Archives of Internal Medicine, 164(21), 2308-2312.
[11] Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm," in Proceedings of the Fifth International Conference on Information, Communication and Signal Processing, 2005.
[12] Manaswini Pradhan and Ranjit Kumar Sahu, "Predict the onset of diabetes disease using Artificial Neural Network (ANN)," International Journal of Computer Science & Emerging Technologies (E-ISSN: 2044-6004), Volume 2, Issue 2, April 2011.
[13] P. Umar Sathic Ali and C. Jothi Ventakeswaran, "Improved Evidence Theoretic kNN Classifier based on Theory of Evidence," International Journal of Computer Applications (0975-8887), Volume 15, No. 5, February 2011.
[14] Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang and Albert Y. Zomaya, "A particle swarm based hybrid system for imbalanced medical data sampling," BMC Genomics, 2009; 10(Suppl 3): S34.
[15] Biswadip Ghosh, "Using Fuzzy Classification for Chronic Disease," Indian Journal of Economics & Business, March 2012.
[16] Angeline Christobel and Sivaprakasam, "An Empirical Comparison of Data Mining Classification Methods," International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[17] Roshawnna Scales and Mark Embrechts, "Computational intelligence techniques for medical diagnostics."
[18] World Health Organization. Available: http://www.who.int
[19] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, USA, 2006.
[20] Agrawal, R., Imielinski, T. and Swami, A., "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[21] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on the classifier accuracy," in D. Banks, L. House, F.R. McMorris, P. Arabie and W. Gaul (Eds.), Classification, Clustering and Data Mining Applications, Springer-Verlag, Berlin-Heidelberg, 2004.
[22] G.E.A.P.A. Batista and M.C. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, 17 (2003).
[23] Chen, M., Han, J. and Yu, P.S., "Data Mining: An Overview from Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[24] Angeline Christobel, Usha Rani Sridhar and Kalaichelvi Chandrahasan, "Performance of Inductive Learning Algorithms on Medical Datasets," International Journal of Advances in Science and Technology, Vol. 3, No. 4, 2011.
[25] Awawtam, "Pima Stories of the Beginning of the World," The Norton Anthology of American Literature, 7th ed., Vol. A, New York: W. W. Norton, 2007, 22-31.
[26] Quinlan, J. R., "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, 1993.
[27] S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica, 31 (2007), 249-268.
[28] J. R. Quinlan, "Improved use of continuous attributes in C4.5," Journal of Artificial Intelligence Research, 4:77-90, 1996.
[29] J.R. Quinlan, "Induction of decision trees," in Jude W. Shavlik and Thomas G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, Vol. 1, 1986, pp. 81-106.
[30] Salvatore Ruggieri, "Efficient C4.5," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
[31] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Machine Learning, 29(2-3) (1997), 103-130.
[32] Vapnik, V.N., "The Nature of Statistical Learning Theory," 1st ed., Springer-Verlag, New York, 1995.
[33] Michael J. Sorich, John O. Miners, Ross A. McKinnon, David A. Winkler, Frank R. Burden and Paul A. Smith, "Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-Glucuronosyltransferase isoforms."
[34] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[35] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 algorithms in data mining," Springer, 2007.
[36] Niculescu-Mizil, A. and Caruana, R. (2005), "Predicting good probabilities with supervised learning," in Proc. 22nd International Conference on Machine Learning (ICML'05).
[37] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining," in Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. I, IMECS 2009, Hong Kong, 2009.
[38] Arbach, L., Reinhardt, J.M., Bennett, D.L. and Fallouh, G. (Iowa Univ., Iowa City, IA, USA), "Mammographic masses classification: comparison between backpropagation neural network (BNN), k nearest neighbors (kNN), and human readers," 2003 IEEE CCECE.
[39] D. Xhemali, C.J. Hinde and R.G. Stone, "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages," International Journal of Computer Science, 4(1):16-23, 2009.