This talk was given by Nenad Tomasev at the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2011) in New York, NY, USA, on August 30th, 2011.
Publication: http://bit.ly/yQQsOq
Abstract:
High-dimensional data are by their very nature often difficult to handle by conventional machine-learning algorithms, which is usually characterized as an aspect of the curse of dimensionality. However, it was shown that some of the arising high-dimensional phenomena can be exploited to increase algorithm accuracy. One such phenomenon is hubness, which refers to the emergence of hubs in high-dimensional spaces, where hubs are influential points included in many k-neighbor sets of other points in the data. This phenomenon was previously used to devise a crisp weighted voting scheme for the k-nearest neighbor classifier. In this paper we go a step further by embracing the soft approach, and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express fuzziness of elements appearing in k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method, as well as the standard kNN classifier.
2. Presentation outline
• The phenomenon of hubness
• Why it matters: a motivating example
• Types of hubness
• Exploiting hubness information in kNN: hubness-based fuzzy measures
• Anti-hubs: a problem?
• Approximate approaches
• Experimental evaluation
• Conclusions and future work
3. Hubness
One consequence of the well-known curse of dimensionality
Influential points emerge in nearest-neighbor methods: HUBS
These hubs appear in many k-neighbor sets
4. What was that song again?
Hubness: first noticed in music collection mining
Some songs were being retrieved (as nearest neighbors) much more often than other songs
This did not, however, reflect the perceived similarity between the songs
5. Hubness: definitions
Hubs: points which appear often as neighbors
Influential points
Rare among the data
Anti-hubs: points which almost never appear as neighbors
Possible outliers
Common among the data
k-occurrence: a single appearance of a point in the k-neighbor set of another point
Nk(x): the number of k-occurrences of point x
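As a rough illustration (not from the talk), a minimal sketch of computing the k-occurrence counts Nk(x), assuming scikit-learn's NearestNeighbors:

```python
# Compute N_k(x) for every point: how often each point appears
# in the k-neighbor sets of the other points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k=5):
    # Ask for k + 1 neighbors, because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]                               # drop the self-neighbor
    counts = np.bincount(neighbors.ravel(), minlength=len(X))
    return counts                                        # counts[i] == N_k(x_i)

# Points with large N_k are hubs; points with N_k near zero are anti-hubs.
```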
8. What causes hubness
At first it was thought that some data distributions or some metrics might be the underlying cause
The truth is much simpler than that: hubness is present in almost any inherently high-dimensional data
9. High dimensional data
Image data, video, audio, measurement streams, medical records, text, …
Modern machine learning challenges are all of an inherently high-dimensional nature
10. The curse of dimensionality
Everything is sparse
The requirements for proper density estimates rise exponentially with dimensionality
The notions of structure and 'shape' of clusters are much less meaningful, since there is not enough data to capture these higher-order dependencies
Concentration of distances
The relative contrast decreases
The expected distance increases, but the variance remains constant
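The concentration effect is easy to reproduce; a small simulation (illustrative, not from the talk) of how the relative contrast between the nearest and farthest point shrinks as dimensionality grows:

```python
# Distance concentration: for i.i.d. uniform data, the relative contrast
# (d_max - d_min) / d_min around a random query shrinks with dimensionality.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 30, 300, 3000):
    X = rng.random((1000, d))        # 1000 random points in [0, 1]^d
    q = rng.random(d)                # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```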
11. Related work
Work by Miloš Radovanović et al.
The general properties of the hubness phenomenon
Hubness-weighted kNN
Hubness-based outlier detection
Hubs and anti-hubs in SVMs
Time series classification
…
Work by Krisztian Buza et al.
Instance selection / data reduction based on hubness
Time series classification using hubness
13. Hubness-weighted kNN: the weights
A simple yet effective instance-specific weighting scheme
This was the second baseline in the experiments (the first was kNN)
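The slide omits the formula; a sketch of the weighting as described in the related work by Radovanović et al., assuming each training point x gets weight w(x) = exp(-hb(x)), where hb(x) is its standardized bad hubness (the number of its k-occurrences with a mismatched label):

```python
# Hubness-based instance weights: points whose k-occurrences often carry
# the "wrong" label (bad hubs) get their votes down-weighted.
import numpy as np

def hubness_weights(neighbor_idx, y):
    """neighbor_idx: (n, k) array of k-neighbor indices; y: label array."""
    n = len(y)
    bad = np.zeros(n)
    for i in range(n):
        for j in neighbor_idx[i]:
            if y[j] != y[i]:
                bad[j] += 1                       # a bad k-occurrence of point j
    h_b = (bad - bad.mean()) / (bad.std() + 1e-12)  # standardized bad hubness
    return np.exp(-h_b)                           # bad hubs receive small weights
```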
16. Hubness-based fuzzy measures
The idea: explore the structure of bad hubness
Introducing the concept of class hubness
In other words: there is nothing inherently good or bad about a k-occurrence; as a random event, it carries some information about the label of the point of interest
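A minimal sketch of estimating class hubness, assuming it is taken as the fraction of a point's k-occurrences that happen in neighborhoods of points of each class c, i.e. uc(x) ≈ Nk,c(x) / Nk(x):

```python
# Class hubness: N_kc[j, c] counts how often point j appears in the
# k-neighbor sets of points whose class is c.
import numpy as np

def class_hubness(neighbor_idx, y, n_classes):
    """neighbor_idx: (n, k) k-neighbor indices; y: integer labels."""
    n = len(y)
    N_kc = np.zeros((n, n_classes))
    for i in range(n):
        for j in neighbor_idx[i]:
            N_kc[j, y[i]] += 1        # j occurred in a class-y[i] neighborhood
    N_k = N_kc.sum(axis=1, keepdims=True)
    return N_kc / np.maximum(N_k, 1)  # rows of anti-hubs stay all-zero
```

The all-zero rows for points that never occur as neighbors are exactly the anti-hub problem discussed below.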
17. The fuzzy k-nearest neighbor framework
Each neighbor distributes its vote across all the categories
Some distance-based weighting can also be applied
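For reference, the standard fuzzy kNN vote of Keller et al. (1985), on which this framework builds; m > 1 is the fuzzifier controlling the strength of the distance weighting:

```latex
% Fuzzy kNN vote: the class memberships u_c(x_i) of the k neighbors,
% weighted by inverse distance to the query point x.
u_c(x) = \frac{\sum_{i=1}^{k} u_c(x_i)\,\lVert x - x_i \rVert^{-2/(m-1)}}
              {\sum_{i=1}^{k} \lVert x - x_i \rVert^{-2/(m-1)}}
```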
19. Anti-hubs: a problem?
There exist points which never appear in k-neighbor sets on the training data
However, there are even more points which simply appear rarely
So we have to be careful
On the other hand, these points will most likely occur rarely on the test data as well
22. Approximate approaches
Use the point's label
Use a global class-to-class hubness estimate
Captures the average class hubness among data points from the same category
Use a local estimate
We tested two different ways of fuzzifying the local estimates
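A sketch of how such a backoff might look, with a hypothetical occurrence threshold theta and smoothing term eps (both illustrative, not the paper's exact scheme):

```python
# Fuzzy memberships with an anti-hub fallback: points with too few
# k-occurrences get their fuzziness from the crisp label instead of
# their own (unreliable) class-hubness estimate.
import numpy as np

def fuzzy_memberships(N_kc, y, theta=2, eps=0.05):
    """N_kc: (n, n_classes) class-hubness counts; y: integer labels."""
    N_k = N_kc.sum(axis=1)
    u = np.full(N_kc.shape, eps)              # small base membership
    for i in range(len(y)):
        if N_k[i] >= theta:
            u[i] += N_kc[i] / N_k[i]          # local class-hubness estimate
        else:
            u[i, y[i]] += 1.0                 # anti-hub: back off to the label
    return u / u.sum(axis=1, keepdims=True)   # normalize rows to sum to 1
```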
23. Experimental evaluation
UCI data
low-to-medium hubness data
many binary classification problems
ImageNet data
5 multiclass classification problems
SIFT codebook representation
color histograms
24. The distance weighting
An optional part of the algorithm, so we decided to see whether it makes a difference
It turns out that it does lead to slightly better results
Notation:
h-FNN – the non-weighted version
dwh-FNN – the distance-weighted version
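A sketch of the difference between the two versions, with an illustrative inverse-distance weighting (the paper's exact weighting form may differ):

```python
# h-FNN sums the neighbors' fuzzy memberships directly; dwh-FNN
# additionally weights each neighbor's vote by inverse distance.
import numpy as np

def vote(u_neighbors, dist=None):
    """u_neighbors: (k, n_classes) memberships of the k neighbors."""
    if dist is None:                              # h-FNN: unweighted vote
        scores = u_neighbors.sum(axis=0)
    else:                                         # dwh-FNN: distance-weighted vote
        w = 1.0 / (dist + 1e-12)
        scores = (w[:, None] * u_neighbors).sum(axis=0)
    return int(np.argmax(scores))                 # predicted class index
```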
30. Conclusions
Class hubness can be successfully exploited in a fuzzy voting scheme for k-nearest neighbor classification
Anti-hubs can be treated as a separate case, in any of the proposed ways, without compromising the accuracy
31. Conclusions
The phenomenon of hubness, even though inherently detrimental, can be turned to our advantage by building hubness-aware classification algorithms
There is certainly a lot of room for follow-ups and potential improvements
32. Acknowledgements
This work was supported by the bilateral project between Slovenia and Serbia "Correlating Images and Words: Enhancing Image Analysis Through Machine Learning and Semantic Technologies," the Slovenian Research Agency, the Serbian Ministry of Education and Science through project no. OI174023, "Intelligent techniques and their integration into wide-spectrum decision support," and the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641).