Investigating the Performance of Distance-Based Weighted-Voting Approaches in kNN Classifiers
Dario Panada - 9804175
Abstract
kNN classifiers are well-established non-parametric machine learning techniques. Previous research has focused on improving their performance by means such as distance weighting, attribute weighting and feature selection, among others. In this investigation we reviewed different implementations of Distance-Based Weighted-Voting algorithms. We compared the results obtained by implementing seven weighted-voting algorithms and using them to classify eleven datasets. We made use of cross-validation to obtain an accurate estimate of performance, using the average area under the ROC curve as the performance measure.
Finally, we evaluated our results using the Friedman test, which suggests that the difference in performance between the different implementations is not statistically significant.
Introduction
k-Nearest Neighbour (kNN) algorithms are popular machine learning techniques often employed to classify unknown samples of data.
kNN classifiers are trained by storing the feature vector and associated class label of each training sample. Subsequently, the distance between the query object (i.e. the unknown sample) and each training point is computed, with the former being assigned the class most frequent among its k closest neighbours.
In their simplest form, kNN classifiers present
several weaknesses such as the difficulty in
determining the optimal value for k, growing
resource costs as the number of training
samples increases, sensitivity to noise or
irrelevant attributes and to imbalanced sets
where a sizeable majority of training samples
all belong to the same class.
kNN classifiers do, however, offer several advantages: they have a very simple implementation, relatively few parameters to tune (namely k and the distance function) and can successfully learn problems which are not linearly separable.
In addition, different solutions have previously been researched and proven successful in addressing the aforementioned weaknesses. Class Confidence Weighted kNN is, as suggested by Liu et al. (1), successful at handling imbalanced datasets, while the issue of growing resource costs as data dimensionality increases is addressed by Yu et al. (2), who propose a method to combine similar points. This shows how several weaknesses of kNN can be addressed to improve the model's performance and resilience.
One area in particular which we decided to investigate was that of Distance-Based Weighted-Voting kNN (WkNN). In traditional kNN classifiers, a sample is classified as belonging to the most common class among its k nearest neighbours. That is, each neighbour casts a vote with weight one. However, this does not take into account that some points, although belonging to the group of k nearest, might still be considerably farther from the query object than other k nearest neighbours. As farther points are intuitively less similar to the object being classified, it seems sensible that they should have less of an impact on the final classification than others which might be closer and more similar. We therefore decided to investigate the performance of different WkNN implementations and compare results to establish whether these allow for enhanced classification accuracy.
The rest of this paper is structured as follows:
In the first section, we introduce the context of
our investigation and provide an overview of
previously researched WkNN. In the second,
we present our experimental method. In the
third section we report our experimental
results and, finally, in the last section of this
paper we conclude with an analysis and
discussion of results.
Background
Overview of kNN Classifiers
Given a series of labelled points in an n-
dimensional feature space, the simplest
implementation of the kNN classifier (also
known as NN Classifier) is that with k = 1. That
is, each new point is classified as belonging to
the same class as the training sample to which
it is closest.
It is therefore necessary to specify a notion of distance which we can then use to compare how close two points are. For the purposes of this investigation we will be considering Euclidean distance, although alternative functions (e.g. Manhattan, Hamming, …) can also be used.
It is also worth mentioning that kNN classifiers (and by extension WkNN classifiers) cannot directly process nominal features using Euclidean distance. A standard approach is, for a given nominal feature that can take m values, to convert it into m binary features, each representing one of the possible values. For a given sample, the binary feature corresponding to the sample's value is then set to 1 and all others to 0.
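As a concrete illustration, here is a minimal Python sketch of this conversion (our own example, independent of the Weka filter used later; the feature values are hypothetical):

```python
def nominal_to_binary(values, categories):
    """Convert a list of nominal values into m binary (one-hot) features."""
    encoded = []
    for v in values:
        # 1 for the binary feature matching the sample's value, 0 for all others
        encoded.append([1 if v == c else 0 for c in categories])
    return encoded

# A nominal 'colour' feature that can take m = 3 values
categories = ["red", "green", "blue"]
print(nominal_to_binary(["green", "blue", "red"], categories))
# [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```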
The natural extension of the NN classifier is then to increase the value of k, turning it into a kNN classifier. For each point we wish to classify, we sort all known points by distance, select the subset of the k nearest and assign the most frequent class in that subset.
Varying k will impact the classifier's performance, with smaller values of k generally tending to overfit the training set and larger values tending to underfit it.
More formally, we can set the probability that a sample point x belongs to class Y as:

$P(Y|x) = \frac{K_Y}{K}$ (Eq. 1)

where $K_Y$ is the number of the k nearest neighbours with class label Y.
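For reference, a minimal NumPy sketch of this unweighted rule (a simplified stand-in for the Weka classifier actually used in our experiments; the function names are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    """Assign x_query the most frequent class among its k nearest neighbours (cf. Eq. 1)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest training samples
    votes = Counter(np.asarray(y_train)[nearest])      # each neighbour casts a vote of weight 1
    return votes.most_common(1)[0][0]

# Under Eq. 1, P(Y|x) is simply votes[Y] / k for any class label Y.
```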
Principles of the WkNN Variant
The general principle behind WkNN Classifiers
is that, among the set of k nearest neighbours,
those which are closest to the query object
should have a greater weight in the
classification process than those farther away.
Different implementations of WkNN present several advantages with respect to kNN, as suggested by Gou et al. (3), such as being less sensitive to the choice of the parameter k, less perturbed by outliers and more resilient to uneven class distributions in the training set.
We can redefine Eq. 1 to present a more formal definition of class probability for WkNN:

$P(Y|x) = \frac{\sum_{i=1}^{|K_Y|} w_i K_{Y_i}}{K}$ (Eq. 2)
That is, for a given WkNN, the probability that
a sample point x belongs to class Y is equal to
the weighted sum of all k-nearest neighbours
with class label Y divided by K.
It is important to mention at this stage that, for all of the weighting algorithms considered in this investigation, the resulting weight values lie between 0 and 1. This guarantees that no class probability will ever exceed 1.
As can be seen from Eq. 2, by making the
weight component inversely proportional to
the distance from the target object, samples
which are farther from it will not contribute to
the class probability as much as those which
are closer.
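As a small sketch of Eq. 2 (our own illustration, not the internals of the Weka classifiers), the class probability can be computed from the neighbours' labels and weights as follows:

```python
import numpy as np

def weighted_class_probability(neighbour_labels, weights, target_class, k):
    """P(Y|x) as in Eq. 2: the summed weights of the neighbours labelled Y, divided by k."""
    labels = np.asarray(neighbour_labels)
    weights = np.asarray(weights, dtype=float)
    return weights[labels == target_class].sum() / k

# Example: 5 neighbours, three of which belong to class "A"
print(weighted_class_probability(["A", "B", "A", "A", "B"],
                                 [0.9, 0.7, 0.5, 0.2, 0.1], "A", k=5))  # 0.32
```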
Description of weight computation
We now describe how our different WkNN implementations compute the weight value for each of the k neighbours. As a control, we also recorded the performance of the non-weighted kNN implementation across datasets.
The following definitions hold with regard to the equations described here:
$d_k$ is the distance between the query point and the kth (i.e. farthest) of the k nearest neighbours;
$d_1$ is the distance between the query point and its closest neighbour;
$d_i$ is the distance between the query point and the current (ith) neighbour;
$i$ is the index ($1 \le i \le k$) of the neighbour being considered.
With regard to the use of the following equations, we assume that, once the vector containing the distances of all neighbours from the query object has been computed, it is normalized such that all distances lie between 0 and 1. We also assume k > 1.
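A short sketch of this normalization step, under our own reading that min-max scaling is applied to the neighbour-distance vector (the epsilon guarding against identical distances is our addition; dividing by the maximum distance alone would be an alternative reading):

```python
import numpy as np

def normalize_distances(dists, eps=1e-12):
    """Rescale a vector of neighbour distances so that all values lie between 0 and 1."""
    dists = np.asarray(dists, dtype=float)
    d_min, d_max = dists.min(), dists.max()
    return (dists - d_min) / (d_max - d_min + eps)  # smallest distance maps to 0, largest to ~1
```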
1 – Non-Weighted (Control)
$w = 1$ (Eq. 3)
2 – Inverse-Distance Weighting (IDW)
This weighting scheme comes pre-implemented in the current release of the Weka Java API (WJA). (4)

$w = \frac{1}{d_i}$ (Eq. 4)
3 – Distance-Based Weighting (DW)
Also pre-implemented in the WJA. (5)

$w = 1 - d_i$ (Eq. 5)
4 – Extended Distance-Based Weighting
(ExtDW)
As proposed by Gou et al. (6)
$w = \frac{d_k - d_i}{d_k - d_1}$ (Eq. 6)
5 – Index-Based Weighting (IW)
Gou et al. (7)
$w = \frac{1}{i}$ (Eq. 7)
6 – Hybrid Index-Distance Based Weighting
(HIDW)
Gou et al. (8)
$w = \frac{d_k - d_i}{d_k - d_1} \times \frac{1}{i}$ (Eq. 8)
7 – Dual-Weighting (DuW)
As proposed by Gou et al. (9)

$w = \frac{d_k - d_i}{d_k - d_1} \times \frac{d_k + d_1}{d_k + d_i}$ (Eq. 9)
8 – Exponential Distance-Based Weighting
(EDW)
Our own implementation, inspired by the work of Wu et al. (10) We maintain the principle of squaring the distance from the query object, but remove attribute weighting from the equation.

$w = \frac{1}{d_i^2}$ (Eq. 10)
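For concreteness, the weighting schemes of Eqs. 3-10 can be written as a handful of one-line functions of the normalized distance and neighbour index. The sketch below is our own rendering of the formulas, not the code used in the experiments, and the guard against $d_k = d_1$ is an assumption of ours:

```python
def neighbour_weights(d_i, i, d_1, d_k):
    """Weights for the ith neighbour (1-based) at normalized distance d_i,
    given the distances d_1 and d_k of the closest and kth nearest neighbours."""
    ext = (d_k - d_i) / (d_k - d_1) if d_k != d_1 else 1.0  # Eq. 6 term, guarded against d_k == d_1
    return {
        "Control": 1.0,                              # Eq. 3
        "IDW":     1.0 / d_i,                        # Eq. 4 (assumes d_i > 0)
        "DW":      1.0 - d_i,                        # Eq. 5
        "ExtDW":   ext,                              # Eq. 6
        "IW":      1.0 / i,                          # Eq. 7
        "HIDW":    ext * (1.0 / i),                  # Eq. 8
        "DuW":     ext * (d_k + d_1) / (d_k + d_i),  # Eq. 9
        "EDW":     1.0 / d_i ** 2,                   # Eq. 10 (assumes d_i > 0)
    }
```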
Method
We compared the performance of kNN classifiers implementing the equations for distance-weighted voting described in the previous section.
Prior to classifiers being trained, all data was
pre-processed using the Weka Filters “Replace
Missing Values”, “Nominal to Binary” and
“Normalize”.
The “Replace Missing Values” filter “replaces all missing values for nominal and numeric attributes in a dataset with the modes and means from the training data” (11), the “Nominal to Binary” filter replaces each nominal attribute with a number of binary attributes equal to the number of values it can take, and the “Normalize” filter normalizes each feature vector.
We made use of 10-fold cross validation to
calculate the performance of each kNN
implementation on each dataset. It is
important to note that, for each dataset, the
same folds were used in training/testing each
classifier.
During each training phase, classifiers were instantiated with $k = \sqrt{n}$, as suggested by Jonsson et al. (12), where n is the number of samples in the training set being used.
Table 1 – Datasets used in this investigation.
*Classes have been aggregated into Window glass and non-Window glass.
**Only instances with class labels 1, 5 and 18 (i.e. the three classes with the highest number of instances) have been used.
Then, during each testing phase for a given dataset, we recorded the area under the ROC curve (AUC) for each classifier. AUC offers several advantages over simply measuring accuracy, such as increased sensitivity in statistical tests and independence from the choice of decision threshold. (13) Finally, all AUC values for each classifier obtained by repeated testing during cross-validation are averaged to obtain a single AUC value per classifier per dataset.
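The evaluation loop can be summarized by the following sketch. It uses scikit-learn conventions rather than the Weka API actually employed, and `make_classifier` is a hypothetical factory standing in for whichever weighting scheme is being evaluated; for multi-class datasets a multi-class AUC would be needed instead of the binary score shown here:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, make_classifier, n_folds=10, seed=0):
    """Average AUC over 10 folds; a fixed seed keeps the folds identical for every classifier."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in folds.split(X, y):
        k = int(np.sqrt(len(train_idx)))               # k = sqrt(n), n = size of the training fold
        clf = make_classifier(k).fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]  # probability of class 1 (binary case)
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs), np.std(aucs)
```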
Name | Label | Num Instances | Num Attributes | Class Distribution (Instances)
Breast Cancer Data (14) | Breast | 286 | 9 | Class 0: 201; Class 1: 85
Cylinder Bands (15) | Cylinder | 512 | 40 | Class 0: 312; Class 1: 200
Pima Indians Diabetes Database (16) | Diabetes | 768 | 7 | Class 0: 500; Class 1: 268
Glass Identification Database* (17) | Glass | 214 | 10 | Class 0: 163; Class 1: 51
Heart Disease Dataset: Cleveland (18) | Heart | 303 | 75 | Class 0: 164; Class 1: 55; Class 2: 36; Class 3: 35; Class 4: 13
Hepatitis Domain (19) | Hepatitis | 155 | 19 | Class 0: 32; Class 1: 123
Johns Hopkins University Ionosphere Database (20) | Ionosphere | 351 | 33 | Class 0: 126; Class 1: 225
BUPA Liver Disorder (21) | Liver | 345 | 6 | Class 0: 144; Class 1: 201
Primary Tumour Domain** (22) | Tumour | 152 | 17 | Class 0: 84; Class 1: 39; Class 2: 24
Sick (23) | Sick | 3772 | 30 | Class 0: 3543; Class 1: 229
Sonar, Mines vs. Rocks (24) | Sonar | 208 | 60 | Class 0: 97; Class 1: 111
Results
Dataset | Control | IDW | DW | ExtDW | IW | HIDW | DuW | EDW
Breast | 0.668±0.104 | 0.663±0.099 | 0.670±0.099 | 0.648±0.085 | 0.665±0.098 | 0.649±0.072 | 0.645±0.081 | 0.665±0.100
Diabetes | 0.793±0.047 | 0.799±0.048 | 0.798±0.049 | 0.799±0.047 | 0.792±0.042 | 0.783±0.043 | 0.799±0.048 | 0.799±0.047
Heart | 0.884±0.059 | 0.884±0.059 | 0.886±0.065 | 0.876±0.069 | 0.878±0.066 | 0.852±0.073 | 0.871±0.071 | 0.871±0.065
Hepatitis | 0.807±0.169 | 0.838±0.150 | 0.842±0.143 | 0.813±0.181 | 0.816±0.170 | 0.810±0.184 | 0.809±0.180 | 0.829±0.158
Liver | 0.623±0.051 | 0.654±0.055 | 0.640±0.047 | 0.653±0.059 | 0.665±0.084 | 0.663±0.092 | 0.657±0.061 | 0.649±0.053
Tumour | 0.823±0.066 | 0.832±0.074 | 0.831±0.066 | 0.832±0.090 | 0.838±0.062 | 0.823±0.090 | 0.831±0.090 | 0.836±0.076
Sick | 0.942±0.021 | 0.956±0.014 | 0.952±0.016 | 0.950±0.020 | 0.951±0.017 | 0.954±0.019 | 0.950±0.020 | 0.951±0.015
Ionosphere | 0.910±0.056 | 0.964±0.029 | 0.963±0.030 | 0.945±0.038 | 0.915±0.054 | 0.939±0.041 | 0.934±0.046 | 0.964±0.030
Glass | 0.866±0.094 | 0.882±0.087 | 0.876±0.092 | 0.885±0.083 | 0.889±0.077 | 0.892±0.074 | 0.884±0.084 | 0.879±0.089
Cylinder | 0.739±0.057 | 0.767±0.055 | 0.765±0.056 | 0.800±0.054 | 0.805±0.043 | 0.827±0.042 | 0.805±0.054 | 0.777±0.056
Sonar | 0.830±0.066 | 0.891±0.058 | 0.879±0.055 | 0.917±0.063 | 0.921±0.056 | 0.927±0.064 | 0.924±0.057 | 0.912±0.047
Table 2 – Experimental results showing
average AUC and standard deviation for each
classifier and dataset combination.
Analysis
Results show that all kNN implementations
successfully learned all datasets, although with
different degrees of accuracy.
In order to test whether the difference in
performance between classifiers is significant,
we made use of the Friedman test. This has
been found to be more appropriate when
comparing multiple classifiers (25) than
alternatives such as paired t-tests.
Running the test on the above results with an alpha value of 0.05 returned a p-value of 0.1123, leading us to conclude that, for this particular investigation, the difference in performance between classifiers is not significant.
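For reference, a minimal sketch of how this test can be run with SciPy. Only the first three rows of Table 2 are shown to keep the example short, so the resulting p-value will not match the 0.1123 obtained on the full table:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# One row per dataset, one column per classifier
# (Control, IDW, DW, ExtDW, IW, HIDW, DuW, EDW); average AUC values from Table 2.
auc = np.array([
    [0.668, 0.663, 0.670, 0.648, 0.665, 0.649, 0.645, 0.665],  # Breast
    [0.793, 0.799, 0.798, 0.799, 0.792, 0.783, 0.799, 0.799],  # Diabetes
    [0.884, 0.884, 0.886, 0.876, 0.878, 0.852, 0.871, 0.871],  # Heart
])

# friedmanchisquare expects one sequence of measurements per classifier
stat, p_value = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```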
An intuitive appreciation of this can be obtained by looking at the performance of the classifiers across datasets when set up with different k values, where no clear difference is noticeable.
Figure 1 – Performance of classifiers on dataset
Tumour for different values of k, error bars
omitted.
Figure 2 – Performance of classifiers on
dataset Liver for different values of k, error
bars omitted.
When errors on performance, comparable to those in Table 2, are taken into account, it becomes evident that changing the value of k does not lead to any significant change in performance.
As our results differ from those reported in the publications previously discussed, we would like to offer some possible explanations for this.
In the first instance, the way in which classifiers handle nominal features can have a significant impact on performance. In our implementation, we relied on Weka's filter to transform each nominal feature into a set of binary ones. This, however, has the disadvantage that, for each such feature, two samples will either occupy exactly the same position or be as far apart as possible. Many of our datasets included a considerable number of nominal features, and we believe this to have had an impact on our results.
Secondly, several of our datasets have an imbalanced class distribution. This would lead to models predicting the dominant class most of the time. Because we use cross-validation, if a certain class is most common in the training set it is likely to be most common in the testing set too, leading to apparently good performance.
However, while class imbalance is undeniable in datasets such as “Sick”, it is debatable whether datasets such as “Cylinder”, “Liver” or indeed most of the sets are actually to be considered imbalanced. Because of this, we are reluctant to attribute our results entirely to imbalanced datasets, and maintain that most of them are sufficiently balanced.
Hence, we maintain that, while improvements to our experimental methodology could indeed be applied, our results offer a sufficiently reliable estimate of performance to suggest that, on this occasion, we were not able to reproduce the results from the literature. In conclusion, Distance-Based Weighted Voting did not offer any significant performance improvement.
Bibliography
(1) Liu, W. and Chawla, S. (2011). Class Confidence Weighted
kNN Algorithms for Imbalanced Data Sets. Advances in
Knowledge Discovery and Data Mining, pp.345-356.
(2) Yu, C., Cui, B., Wang, S. and Su, J. (2007). Efficient index-based
KNN join processing for high-dimensional data. Information and
Software Technology, 49(4), pp.332-344.
(3) Gou, J., Du, L., Zhang, Y. and Xiong, T. (2012). A New Distance-weighted k-nearest Neighbor Classifier. Journal of Information & Computational Science, 9, pp. 1429-1436.
(4) "Use WEKA in Your Java Code." Weka -. N.p., n.d. Web. 20
Nov. 2015.
(5) Ibid.
(6) Gou, Jianping, Taisong Xiong, and Yin Kuang. "A Novel
Weighted Voting for K-Nearest Neighbor Rule." JCP Journal of
Computers 6.5 (2011): n. pag. Web.
(7) (8) Ibid.
(9) Gou, J., Du, L., Zhang, Y. and Xiong, T. (2012).
(10) Wu, Jia, Zhi-hua Kai, and Shuang Ao. "Hybrid Dynamic K-
nearest-neighbour and Distance and Attribute Weighted
Method for Classification." Int. J. Computer Applications in
Technology 43.4 (2012): 378-84. Web.
(11) "ReplaceMissingValues." - Pentaho Data Mining - Pentaho
Wiki. N.p., n.d. Web. 20 Nov. 2015.
(12) Jonsson, P., and C. Wohlin. "An Evaluation of K-nearest
Neighbour Imputation Using Likert Data." 10th International
Symposium on Software Metrics, 2004. Proceedings. (2004): n.
pag. Web.
(13) Bradley, Andrew P. "The Use of the Area under the ROC
Curve in the Evaluation of Machine Learning Algorithms."
Pattern Recognition 30.7 (1997): 1145-159. Web.
(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24) http://repository.seasr.org/Datasets/UCI/arff/ Web. 26 Nov. 2015.
(25) Demsar, Janez. "Statistical Comparisons of Classifiers over
Multiple Data Sets." Journal of Machine Learning Research 7
(2006): n. pag. Web.