This talk was given by Nenad Tomasev at the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2011) in New York, NY, USA, on August 30th, 2011.
Publication: http://bit.ly/yQQsOq
Abstract:
High-dimensional data are by their very nature often difficult to handle by conventional machine-learning algorithms, which is usually characterized as an aspect of the curse of dimensionality. However, it was shown that some of the arising high-dimensional phenomena can be exploited to increase algorithm accuracy. One such phenomenon is hubness, which refers to the emergence of hubs in high-dimensional spaces, where hubs are influential points included in many k-neighbor sets of other points in the data. This phenomenon was previously used to devise a crisp weighted voting scheme for the k-nearest neighbor classifier. In this paper we go a step further by embracing the soft approach, and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express fuzziness of elements appearing in k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method, as well as the standard kNN classifier.
2. Presentation outline
• The phenomenon of hubness
• Why it matters: a motivating example
• Types of hubness
• Exploiting hubness information in kNN: hubness-based fuzzy measures
• Anti-hubs: a problem?
• Approximate approaches
• Experimental evaluation
• Conclusions and future work
3. Hubness
One consequence of the well-known curse of dimensionality
Influential points emerge in nearest-neighbor methods: HUBS
These hubs appear in many k-neighbor sets
4. What was that song again?
Hubness: first noticed in music collection mining
Some songs were being retrieved (as nearest neighbors) much more often than other songs
This did not, however, reflect the perceived similarity between the songs
5. Hubness: definitions
Hubs: points which appear often as neighbors
Influential points
Rare among the data
Anti-hubs: points which almost never appear as neighbors
Possible outliers
Common among the data
k-occurrence: a single appearance of a point in the k-neighbor set of another point
Nk(x): the number of k-occurrences of point x
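As a rough illustration (not from the talk), a minimal sketch of computing the k-occurrence counts Nk(x), assuming scikit-learn's NearestNeighbors:

```python
# Compute N_k(x) for every point: how often each point appears
# in the k-neighbor sets of the other points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k=5):
    # Ask for k + 1 neighbors, because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]                               # drop the self-neighbor
    counts = np.bincount(neighbors.ravel(), minlength=len(X))
    return counts                                        # counts[i] == N_k(x_i)

# Points with large N_k are hubs; points with N_k near zero are anti-hubs.
```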
8. What causes hubness
At first it was thought that some data distributions or some metrics might be the underlying cause
The truth is much simpler than that: hubness is present in almost any inherently high-dimensional data
9. High dimensional data
Image data, video, audio, measurement streams, medical records, text, …
Modern machine learning challenges are all of an inherently high-dimensional nature
10. The curse of dimensionality
Everything is sparse
The requirements for proper density estimates rise exponentially with dimensionality
The notions of structure and 'shape' of clusters are much less meaningful, since there is not enough data to capture these higher-order dependencies
Concentration of distances
The relative contrast decreases
The expected distance increases, but the variance remains constant
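The concentration effect is easy to reproduce; a small simulation (illustrative, not from the talk) of how the relative contrast between the nearest and farthest point shrinks as dimensionality grows:

```python
# Distance concentration: for i.i.d. uniform data, the relative contrast
# (d_max - d_min) / d_min around a random query shrinks with dimensionality.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 30, 300, 3000):
    X = rng.random((1000, d))        # 1000 random points in [0, 1]^d
    q = rng.random(d)                # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```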
11. Related work
Work by Miloš Radovanović et al.
The general properties of the hubness phenomenon
Hubness-weighted kNN
Hubness-based outlier detection
Hubs and anti-hubs in SVMs
Time series classification
…
Work by Krisztian Buza et al.
Instance selection / data reduction based on hubness
Time series classification using hubness
13. Hubness-weighted kNN: the weights
A simple yet effective instance-specific weighting scheme
This was the second baseline in the experiments (the first was kNN)
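The slide omits the formula; a sketch of the weighting as described in the related work by Radovanović et al., assuming each training point x gets weight w(x) = exp(-hb(x)), where hb(x) is its standardized bad hubness (the number of its k-occurrences with a mismatched label):

```python
# Hubness-based instance weights: points whose k-occurrences often carry
# the "wrong" label (bad hubs) get their votes down-weighted.
import numpy as np

def hubness_weights(neighbor_idx, y):
    """neighbor_idx: (n, k) array of k-neighbor indices; y: label array."""
    n = len(y)
    bad = np.zeros(n)
    for i in range(n):
        for j in neighbor_idx[i]:
            if y[j] != y[i]:
                bad[j] += 1                       # a bad k-occurrence of point j
    h_b = (bad - bad.mean()) / (bad.std() + 1e-12)  # standardized bad hubness
    return np.exp(-h_b)                           # bad hubs receive small weights
```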
16. Hubness-based fuzzy measures
The idea: explore the structure of bad hubness
Introducing the concept of class hubness
In other words: there is nothing inherently good or bad about a k-occurrence; as a random event, it carries some information about the label of the point of interest
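A minimal sketch of estimating class hubness, assuming it is taken as the fraction of a point's k-occurrences that happen in neighborhoods of points of each class c, i.e. uc(x) ≈ Nk,c(x) / Nk(x):

```python
# Class hubness: N_kc[j, c] counts how often point j appears in the
# k-neighbor sets of points whose class is c.
import numpy as np

def class_hubness(neighbor_idx, y, n_classes):
    """neighbor_idx: (n, k) k-neighbor indices; y: integer labels."""
    n = len(y)
    N_kc = np.zeros((n, n_classes))
    for i in range(n):
        for j in neighbor_idx[i]:
            N_kc[j, y[i]] += 1        # j occurred in a class-y[i] neighborhood
    N_k = N_kc.sum(axis=1, keepdims=True)
    return N_kc / np.maximum(N_k, 1)  # rows of anti-hubs stay all-zero
```

The all-zero rows for points that never occur as neighbors are exactly the anti-hub problem discussed below.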
17. The fuzzy k-nearest neighbor framework
Each neighbor distributes its vote across all the categories
Some distance-based weighting can also be applied
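For reference, the standard fuzzy kNN vote of Keller et al. (1985), on which this framework builds; m > 1 is the fuzzifier controlling the strength of the distance weighting:

```latex
% Fuzzy kNN vote: the class memberships u_c(x_i) of the k neighbors,
% weighted by inverse distance to the query point x.
u_c(x) = \frac{\sum_{i=1}^{k} u_c(x_i)\,\lVert x - x_i \rVert^{-2/(m-1)}}
              {\sum_{i=1}^{k} \lVert x - x_i \rVert^{-2/(m-1)}}
```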
19. Anti-hubs: a problem?
There exist points which never appear in k-neighbor sets on the training data
However, there are even more points which simply appear rarely
So we have to be careful
On the other hand, these points will most likely occur rarely on the test data as well
22. Approximate approaches
Use the point's label
Use a global class-to-class hubness estimate
Captures the average class hubness among data points from the same category
Use a local estimate
We tested two different ways of fuzzifying the local estimates
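A sketch of how such a backoff might look, with a hypothetical occurrence threshold theta and smoothing term eps (both illustrative, not the paper's exact scheme):

```python
# Fuzzy memberships with an anti-hub fallback: points with too few
# k-occurrences get their fuzziness from the crisp label instead of
# their own (unreliable) class-hubness estimate.
import numpy as np

def fuzzy_memberships(N_kc, y, theta=2, eps=0.05):
    """N_kc: (n, n_classes) class-hubness counts; y: integer labels."""
    N_k = N_kc.sum(axis=1)
    u = np.full(N_kc.shape, eps)              # small base membership
    for i in range(len(y)):
        if N_k[i] >= theta:
            u[i] += N_kc[i] / N_k[i]          # local class-hubness estimate
        else:
            u[i, y[i]] += 1.0                 # anti-hub: back off to the label
    return u / u.sum(axis=1, keepdims=True)   # normalize rows to sum to 1
```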
23. Experimental evaluation
UCI data
low-to-medium hubness data
many binary classification problems
ImageNet data
5 multiclass classification problems
SIFT codebook representation
color histograms
24. The distance weighting
An optional part of the algorithm, so we decided to see whether it makes a difference
It turns out that it does lead to slightly better results
Notation:
h-FNN – the non-weighted version
dwh-FNN – the distance-weighted version
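A sketch of the difference between the two versions, with an illustrative inverse-distance weighting (the paper's exact weighting form may differ):

```python
# h-FNN sums the neighbors' fuzzy memberships directly; dwh-FNN
# additionally weights each neighbor's vote by inverse distance.
import numpy as np

def vote(u_neighbors, dist=None):
    """u_neighbors: (k, n_classes) memberships of the k neighbors."""
    if dist is None:                              # h-FNN: unweighted vote
        scores = u_neighbors.sum(axis=0)
    else:                                         # dwh-FNN: distance-weighted vote
        w = 1.0 / (dist + 1e-12)
        scores = (w[:, None] * u_neighbors).sum(axis=0)
    return int(np.argmax(scores))                 # predicted class index
```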
30. Conclusions
Class hubness can be successfully exploited in a fuzzy voting scheme for k-nearest neighbor classification
Anti-hubs can be treated as a separate case, in any of the proposed ways, without compromising the accuracy
31. Conclusions
The phenomenon of hubness, even though inherently detrimental, can be turned to our advantage by building hubness-aware classification algorithms
There is certainly a lot of room for follow-ups and potential improvements
32. Acknowledgements
This work was supported by the bilateral project between Slovenia and Serbia "Correlating Images and Words: Enhancing Image Analysis Through Machine Learning and Semantic Technologies," the Slovenian Research Agency, the Serbian Ministry of Education and Science through project no. OI174023, "Intelligent techniques and their integration into wide-spectrum decision support," and the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641).