Investigating the Performance of Distance-Based Weighted-Voting Approaches in kNN Classifiers
Dario Panada - 9804175
Abstract
kNN classifiers are well-established non-parametric machine learning techniques. Previous research has focused on improving their performance by means such as distance weighting, attribute weighting and feature selection, among others. In this investigation we reviewed different implementations of Distance-Based Weighted-Voting algorithms. We compared the results obtained by implementing seven weighted-voting algorithms and using them to classify eleven datasets. We made use of cross-validation to obtain an accurate estimate of performance, using the average area under the ROC curve as the performance measure.
Finally, we evaluated our results using the Friedman test, which suggests that the difference in performance between the different implementations is not statistically significant.
Introduction
k-Nearest Neighbour (kNN) algorithms are popular machine learning techniques often employed to classify unknown samples of data.
kNN classifiers are trained by storing the feature vector and associated class label of each training sample. Subsequently, the distance between the query object (i.e. the unknown sample) and each training point is computed, with the former being assigned the class most frequent among its k closest neighbours.
In their simplest form, kNN classifiers present
several weaknesses such as the difficulty in
determining the optimal value for k, growing
resource costs as the number of training
samples increases, sensitivity to noise or
irrelevant attributes and to imbalanced sets
where a sizeable majority of training samples
all belong to the same class.
kNN classifiers do, however, offer several advantages: they have a very simple implementation, relatively few parameters to tune (namely k and the distance function) and can successfully learn problems which are not linearly separable.
In addition, different solutions have previously been researched and proven successful in addressing the aforementioned weaknesses. Class Confidence Weighted kNN is, as suggested by Liu et al. (1), successful at handling imbalanced datasets, while the issue of growing resource costs as data dimensionality increases is addressed by Yu et al. (2), who propose a method to combine similar points. This shows how several weaknesses of kNN can be addressed to improve the model's performance and resilience.
One area in particular which we decided to investigate was that of Distance-Based Weighted-Voting kNN (WkNN). In traditional kNN classifiers, a sample is classified as belonging to the most common class among its k nearest neighbours. That is, each neighbour casts a vote with weight one. However, this does not take into account that some points, although belonging to the group of k nearest, might still be considerably farther from the query object than other k nearest neighbours. As farther points are intuitively less similar to the object being classified, it seems sensible that they should have less of an impact on the final classification than others which might be closer and more similar. We therefore decided to investigate the performance of different WkNN implementations and compare results to establish whether these allow for enhanced classification accuracy.
The rest of this paper is structured as follows:
In the first section, we introduce the context of
our investigation and provide an overview of
previously researched WkNN. In the second,
we present our experimental method. In the
third section we report our experimental
results and, finally, in the last section of this
paper we conclude with an analysis and
discussion of results.
Background
Overview of kNN Classifiers
Given a series of labelled points in an n-
dimensional feature space, the simplest
implementation of the kNN classifier (also
known as NN Classifier) is that with k = 1. That
is, each new point is classified as belonging to
the same class as the training sample to which
it is closest.
It is therefore necessary to specify a notion of distance which we can then use to compare how close two points are. For the purposes of this investigation we will be considering Euclidean distance, although alternative functions (e.g. Manhattan, Hamming, …) can also be used.
It is also worth mentioning that kNN classifiers (and by extension WkNN classifiers) cannot directly process nominal features using Euclidean distance. A standard approach is, for a given nominal feature that can take m values, to convert it into m binary features, each representing one of the possible values. For a given sample, the binary feature corresponding to the sample's value is then set to 1 and all others to 0.
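As a concrete illustration, here is a minimal Python sketch of this conversion (our own example, independent of the Weka filter used later; the feature values are hypothetical):

```python
def nominal_to_binary(values, categories):
    """Convert a list of nominal values into m binary (one-hot) features."""
    encoded = []
    for v in values:
        # 1 for the binary feature matching the sample's value, 0 for all others
        encoded.append([1 if v == c else 0 for c in categories])
    return encoded

# A nominal 'colour' feature that can take m = 3 values
categories = ["red", "green", "blue"]
print(nominal_to_binary(["green", "blue", "red"], categories))
# [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```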
The natural extension of the NN classifier is then to increase the value of k, turning it into a kNN classifier. For each point we wish to classify, we sort all known points by distance, select the subset of the k nearest and assign the most frequent class in that subset.
Varying k will impact the classifier's performance, with smaller values of k generally tending to overfit the training set and larger values tending to underfit it.
More formally, we can set the probability that a sample point x belongs to class Y as:

$P(Y|x) = \frac{K_Y}{K}$ (Eq. 1)

where $K_Y$ is the number of the k nearest neighbours with class label Y.
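For reference, a minimal NumPy sketch of this unweighted rule (a simplified stand-in for the Weka classifier actually used in our experiments; the function names are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    """Assign x_query the most frequent class among its k nearest neighbours (cf. Eq. 1)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest training samples
    votes = Counter(np.asarray(y_train)[nearest])      # each neighbour casts a vote of weight 1
    return votes.most_common(1)[0][0]

# Under Eq. 1, P(Y|x) is simply votes[Y] / k for any class label Y.
```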
Principles of the WkNN Variant
The general principle behind WkNN Classifiers
is that, among the set of k nearest neighbours,
those which are closest to the query object
should have a greater weight in the
classification process than those farther away.
Different implementations of WkNN present several advantages with respect to kNN, as suggested by Gou et al. (3), such as being less sensitive to the choice of the parameter k, less perturbed by outliers and more resilient to uneven class distributions in the training set.
We can redefine Eq. 1 to present a more formal definition of class probability for WkNN:

$P(Y|x) = \frac{\sum_{i=1}^{|K_Y|} w_i K_{Y_i}}{K}$ (Eq. 2)
That is, for a given WkNN, the probability that
a sample point x belongs to class Y is equal to
the weighted sum of all k-nearest neighbours
with class label Y divided by K.
It is important to mention at this stage that, for all of the weighting algorithms considered in this investigation, the resulting weight values lie between 0 and 1. This guarantees that no class probability will ever exceed 1.
As can be seen from Eq. 2, by making the
weight component inversely proportional to
the distance from the target object, samples
which are farther from it will not contribute to
the class probability as much as those which
are closer.
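As a small sketch of Eq. 2 (our own illustration, not the internals of the Weka classifiers), the class probability can be computed from the neighbours' labels and weights as follows:

```python
import numpy as np

def weighted_class_probability(neighbour_labels, weights, target_class, k):
    """P(Y|x) as in Eq. 2: the summed weights of the neighbours labelled Y, divided by k."""
    labels = np.asarray(neighbour_labels)
    weights = np.asarray(weights, dtype=float)
    return weights[labels == target_class].sum() / k

# Example: 5 neighbours, three of which belong to class "A"
print(weighted_class_probability(["A", "B", "A", "A", "B"],
                                 [0.9, 0.7, 0.5, 0.2, 0.1], "A", k=5))  # 0.32
```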
Description of weight computation
We now describe how our different WkNN implementations compute the weight value for each of the k neighbours. As a control, we also recorded the performance of the non-weighted kNN implementation across datasets.
The following definitions hold with regard to the equations described here:
$d_k$ is the distance between the query point and the kth (i.e. farthest) of the k nearest neighbours;
$d_1$ is the distance between the query point and its closest neighbour;
$d_i$ is the distance between the query point and the current (ith) neighbour;
$i$ is the index ($1 \le i \le k$) of the neighbour being considered.
With regard to the use of the following equations, we assume that, once the vector containing the distances of all neighbours from the query object has been computed, it is normalized such that all distances lie between 0 and 1. We also assume k > 1.
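A short sketch of this normalization step, under our own reading that min-max scaling is applied to the neighbour-distance vector (the epsilon guarding against identical distances is our addition; dividing by the maximum distance alone would be an alternative reading):

```python
import numpy as np

def normalize_distances(dists, eps=1e-12):
    """Rescale a vector of neighbour distances so that all values lie between 0 and 1."""
    dists = np.asarray(dists, dtype=float)
    d_min, d_max = dists.min(), dists.max()
    return (dists - d_min) / (d_max - d_min + eps)  # smallest distance maps to 0, largest to ~1
```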
1 – Non-Weighted (Control)
$w = 1$ (Eq. 3)
2 – Inverse-Distance Weighting (IDW)
This weighting scheme comes pre-implemented in the current release of the Weka Java API (WJA). (4)

$w = \frac{1}{d_i}$ (Eq. 4)
3 – Distance-Based Weighting (DW)
Also pre-implemented in the WJA. (5)

$w = 1 - d_i$ (Eq. 5)
4 – Extended Distance-Based Weighting
(ExtDW)
As proposed by Gou et al. (6)
$w = \frac{d_k - d_i}{d_k - d_1}$ (Eq. 6)
5 – Index-Based Weighting (IW)
Gou et al. (7)
$w = \frac{1}{i}$ (Eq. 7)
6 – Hybrid Index-Distance Based Weighting
(HIDW)
Gou et al. (8)
$w = \frac{d_k - d_i}{d_k - d_1} \times \frac{1}{i}$ (Eq. 8)
7 – Dual-Weighting (DuW)
As proposed by Gou et al. (9)

$w = \frac{d_k - d_i}{d_k - d_1} \times \frac{d_k + d_1}{d_k + d_i}$ (Eq. 9)
8 – Exponential Distance-Based Weighting
(EDW)
Our own implementation, inspired by the work of Wu et al. (10) We maintain the principle of squaring the distance from the query object, but remove attribute weighting from the equation.

$w = \frac{1}{d_i^2}$ (Eq. 10)
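For concreteness, the weighting schemes of Eqs. 3-10 can be written as a handful of one-line functions of the normalized distance and neighbour index. The sketch below is our own rendering of the formulas, not the code used in the experiments, and the guard against $d_k = d_1$ is an assumption of ours:

```python
def neighbour_weights(d_i, i, d_1, d_k):
    """Weights for the ith neighbour (1-based) at normalized distance d_i,
    given the distances d_1 and d_k of the closest and kth nearest neighbours."""
    ext = (d_k - d_i) / (d_k - d_1) if d_k != d_1 else 1.0  # Eq. 6 term, guarded against d_k == d_1
    return {
        "Control": 1.0,                              # Eq. 3
        "IDW":     1.0 / d_i,                        # Eq. 4 (assumes d_i > 0)
        "DW":      1.0 - d_i,                        # Eq. 5
        "ExtDW":   ext,                              # Eq. 6
        "IW":      1.0 / i,                          # Eq. 7
        "HIDW":    ext * (1.0 / i),                  # Eq. 8
        "DuW":     ext * (d_k + d_1) / (d_k + d_i),  # Eq. 9
        "EDW":     1.0 / d_i ** 2,                   # Eq. 10 (assumes d_i > 0)
    }
```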
Method
We compared the performance of kNN classifiers implementing the equations for distance-weighted voting described in the previous section.
Prior to classifiers being trained, all data was
pre-processed using the Weka Filters “Replace
Missing Values”, “Nominal to Binary” and
“Normalize”.
The “Replace Missing Values” filter “replaces all missing values for nominal and numeric attributes in a dataset with the modes and means from the training data” (11), the “Nominal to Binary” filter replaces each nominal attribute with a number of binary attributes equal to the number of values it can take, and the “Normalize” filter normalizes each feature vector.
We made use of 10-fold cross validation to
calculate the performance of each kNN
implementation on each dataset. It is
important to note that, for each dataset, the
same folds were used in training/testing each
classifier.
During each training phase, classifiers were instantiated with $k = \sqrt{n}$, as suggested by Jonsson et al. (12), where n is the number of samples in the training set being used.
Table 1 – Datasets used in this investigation.
*Classes have been aggregated into Window glass and non-Window glass.
**Only instances with class labels 1, 5 and 18 (i.e. the three classes with the highest number of instances) have been used.
Then, during each testing phase for a given dataset, we recorded the area under the ROC curve (AUC) for each classifier. AUC offers several advantages over simply measuring accuracy, such as increased sensitivity in statistical tests and independence from the choice of decision threshold. (13) Finally, all AUC values for each classifier obtained by repeated testing during cross-validation are averaged to obtain a single AUC value per classifier per dataset.
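The evaluation loop can be summarized by the following sketch. It uses scikit-learn conventions rather than the Weka API actually employed, and `make_classifier` is a hypothetical factory standing in for whichever weighting scheme is being evaluated; for multi-class datasets a multi-class AUC would be needed instead of the binary score shown here:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, make_classifier, n_folds=10, seed=0):
    """Average AUC over 10 folds; a fixed seed keeps the folds identical for every classifier."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in folds.split(X, y):
        k = int(np.sqrt(len(train_idx)))               # k = sqrt(n), n = size of the training fold
        clf = make_classifier(k).fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]  # probability of class 1 (binary case)
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs), np.std(aucs)
```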
Name | Label | Num Instances | Num Attributes | Class Distribution (Instances)
Breast Cancer Data (14) | Breast | 286 | 9 | Class 0: 201; Class 1: 85
Cylinder Bands (15) | Cylinder | 512 | 40 | Class 0: 312; Class 1: 200
Pima Indians Diabetes Database (16) | Diabetes | 768 | 7 | Class 0: 500; Class 1: 268
Glass Identification Database* (17) | Glass | 214 | 10 | Class 0: 163; Class 1: 51
Heart Disease Dataset: Cleveland (18) | Heart | 303 | 75 | Class 0: 164; Class 1: 55; Class 2: 36; Class 3: 35; Class 4: 13
Hepatitis Domain (19) | Hepatitis | 155 | 19 | Class 0: 32; Class 1: 123
Johns Hopkins University Ionosphere Database (20) | Ionosphere | 351 | 33 | Class 0: 126; Class 1: 225
BUPA Liver Disorder (21) | Liver | 345 | 6 | Class 0: 144; Class 1: 201
Primary Tumour Domain** (22) | Tumour | 152 | 17 | Class 0: 84; Class 1: 39; Class 2: 24
Sick (23) | Sick | 3772 | 30 | Class 0: 3543; Class 1: 229
Sonar, Mines vs. Rocks (24) | Sonar | 208 | 60 | Class 0: 97; Class 1: 111
Results
Dataset | Control | IDW | DW | ExtDW | IW | HIDW | DuW | EDW
Breast | 0.668±0.104 | 0.663±0.099 | 0.670±0.099 | 0.648±0.085 | 0.665±0.098 | 0.649±0.072 | 0.645±0.081 | 0.665±0.100
Diabetes | 0.793±0.047 | 0.799±0.048 | 0.798±0.049 | 0.799±0.047 | 0.792±0.042 | 0.783±0.043 | 0.799±0.048 | 0.799±0.047
Heart | 0.884±0.059 | 0.884±0.059 | 0.886±0.065 | 0.876±0.069 | 0.878±0.066 | 0.852±0.073 | 0.871±0.071 | 0.871±0.065
Hepatitis | 0.807±0.169 | 0.838±0.150 | 0.842±0.143 | 0.813±0.181 | 0.816±0.170 | 0.810±0.184 | 0.809±0.180 | 0.829±0.158
Liver | 0.623±0.051 | 0.654±0.055 | 0.640±0.047 | 0.653±0.059 | 0.665±0.084 | 0.663±0.092 | 0.657±0.061 | 0.649±0.053
Tumour | 0.823±0.066 | 0.832±0.074 | 0.831±0.066 | 0.832±0.090 | 0.838±0.062 | 0.823±0.090 | 0.831±0.090 | 0.836±0.076
Sick | 0.942±0.021 | 0.956±0.014 | 0.952±0.016 | 0.950±0.020 | 0.951±0.017 | 0.954±0.019 | 0.950±0.020 | 0.951±0.015
Ionosphere | 0.910±0.056 | 0.964±0.029 | 0.963±0.030 | 0.945±0.038 | 0.915±0.054 | 0.939±0.041 | 0.934±0.046 | 0.964±0.030
Glass | 0.866±0.094 | 0.882±0.087 | 0.876±0.092 | 0.885±0.083 | 0.889±0.077 | 0.892±0.074 | 0.884±0.084 | 0.879±0.089
Cylinder | 0.739±0.057 | 0.767±0.055 | 0.765±0.056 | 0.800±0.054 | 0.805±0.043 | 0.827±0.042 | 0.805±0.054 | 0.777±0.056
Sonar | 0.830±0.066 | 0.891±0.058 | 0.879±0.055 | 0.917±0.063 | 0.921±0.056 | 0.927±0.064 | 0.924±0.057 | 0.912±0.047
Table 2 – Experimental results showing
average AUC and standard deviation for each
classifier and dataset combination.
Analysis
Results show that all kNN implementations
successfully learned all datasets, although with
different degrees of accuracy.
In order to test whether the difference in
performance between classifiers is significant,
we made use of the Friedman test. This has
been found to be more appropriate when
comparing multiple classifiers (25) than
alternatives such as paired t-tests.
Running the test on the above results with an alpha value of 0.05 returned a p-value of 0.1123, leading us to conclude that, for this particular investigation, the difference in performance between classifiers is not significant.
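For reference, a minimal sketch of how this test can be run with SciPy. Only the first three rows of Table 2 are shown to keep the example short, so the resulting p-value will not match the 0.1123 obtained on the full table:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# One row per dataset, one column per classifier
# (Control, IDW, DW, ExtDW, IW, HIDW, DuW, EDW); average AUC values from Table 2.
auc = np.array([
    [0.668, 0.663, 0.670, 0.648, 0.665, 0.649, 0.645, 0.665],  # Breast
    [0.793, 0.799, 0.798, 0.799, 0.792, 0.783, 0.799, 0.799],  # Diabetes
    [0.884, 0.884, 0.886, 0.876, 0.878, 0.852, 0.871, 0.871],  # Heart
])

# friedmanchisquare expects one sequence of measurements per classifier
stat, p_value = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```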
An intuitive appreciation of this can be obtained by looking at the performance of the classifiers across datasets when set up with different k values, where no clear difference is noticeable.
Figure 1 – Performance of classifiers on dataset
Tumour for different values of k, error bars
omitted.
Figure 2 – Performance of classifiers on
dataset Liver for different values of k, error
bars omitted.
When errors on performance, comparable to those in Table 2, are taken into account, it becomes evident that changing the value of k does not lead to any significant change in performance.
As our results differ from those reported in the publications previously discussed, we would like to offer some possible explanations for this.
In the first instance, the way in which classifiers handle nominal features can have a significant impact on performance. In our implementation, we relied on Weka's filter to transform each nominal feature into a set of binary ones. This, however, has the disadvantage that, for each such feature, two samples will either occupy exactly the same position or be as far apart as possible. Many of our datasets included a considerable number of nominal features, and we believe this to have had an impact on our results.
Secondly, several of our datasets have an imbalanced class distribution. This would lead to models predicting the dominant class most of the time. Because we use cross-validation, if a certain class is most common in the training set it is likely to be most common in the testing set too, leading to apparently good performance.
However, while class imbalance is undeniable in datasets such as “Sick”, it is debatable whether datasets such as “Cylinder”, “Liver” or indeed most of the sets are actually to be considered imbalanced. Because of this, we are reluctant to attribute our results entirely to imbalanced datasets, and maintain that most of them are sufficiently balanced.
Hence, we maintain that, while improvements to our experimental methodology could indeed be applied, our results offer a sufficiently reliable estimate of performance to suggest that, on this occasion, we were not able to reproduce the results from the literature. In conclusion, Distance-Based Weighted Voting did not offer any significant performance improvement.
Bibliography
(1) Liu, W. and Chawla, S. (2011). Class Confidence Weighted
kNN Algorithms for Imbalanced Data Sets. Advances in
Knowledge Discovery and Data Mining, pp.345-356.
(2) Yu, C., Cui, B., Wang, S. and Su, J. (2007). Efficient index-based
KNN join processing for high-dimensional data. Information and
Software Technology, 49(4), pp.332-344.
(3) Gou, J., Du, L., Zhang, Y. and Xiong, T. (2012). A New Distance-weighted k-nearest Neighbor Classifier. Journal of Information & Computational Science, 9, pp. 1429-1436.
(4) "Use WEKA in Your Java Code." Weka -. N.p., n.d. Web. 20
Nov. 2015.
(5) Ibid.
(6) Gou, Jianping, Taisong Xiong, and Yin Kuang. "A Novel
Weighted Voting for K-Nearest Neighbor Rule." JCP Journal of
Computers 6.5 (2011): n. pag. Web.
(7) (8) Ibid.
(9) Gou, J., Du, L., Zhang, Y. and Xiong, T. (2012).
(10) Wu, Jia, Zhi-hua Kai, and Shuang Ao. "Hybrid Dynamic K-
nearest-neighbour and Distance and Attribute Weighted
Method for Classification." Int. J. Computer Applications in
Technology 43.4 (2012): 378-84. Web.
(11) "ReplaceMissingValues." - Pentaho Data Mining - Pentaho
Wiki. N.p., n.d. Web. 20 Nov. 2015.
(12) Jonsson, P., and C. Wohlin. "An Evaluation of K-nearest
Neighbour Imputation Using Likert Data." 10th International
Symposium on Software Metrics, 2004. Proceedings. (2004): n.
pag. Web.
(13) Bradley, Andrew P. "The Use of the Area under the ROC
Curve in the Evaluation of Machine Learning Algorithms."
Pattern Recognition 30.7 (1997): 1145-159. Web.
(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24) http://repository.seasr.org/Datasets/UCI/arff/ Web. 26 Nov. 2015.
(25) Demsar, Janez. "Statistical Comparisons of Classifiers over
Multiple Data Sets." Journal of Machine Learning Research 7
(2006): n. pag. Web.